An outline proposal

Tue Oct 19 15:19:03 CEST 2010

On 19 October 2010 12:43, Dave Pawson <dave.pawson at gmail.com> wrote:
> On 19 October 2010 11:38, Bob Jolliffe <bobjolliffe at gmail.com> wrote:
>
>>>> 1. Provide a compressed archive format for general use.
>>>>
>>>
>>> OK.
>>
>> I think we probably need to define what we mean by 'archive' format -
>> my understanding being:
>>
>> Provide a compressed format for representing a collection of files (or
>> streams) within a single stream.  I am in two minds about the use of
>> the term 'file' as it has some platform connotations, for example
>> expectations around preservation of permissions and other file
>> attributes.
>
> How abstract do we want to be?
> IMHO we should keep out of the file system arena?

Agree

> Leave it at 'collection' is a bit woolly?

What is a package in the context we are discussing, other than a
collection (ordered collection? it's serialized) of related streams
which together provide the means to represent fully a document?  In a
zip file (from the appnote) each stream also has a local file header
which contains, amongst other things, a filename.  These filenames
allow the package to be also represented exploded as files in a
filesystem which is useful.  That's a useful quality of zip.  But IMHO
the streams in the zip file are not inherently files.  But it's useful
to name them as if they were in order to have this dual
representation.
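
A minimal sketch of that dual nature, assuming Python's standard zipfile
module.  The entry names and payloads are hypothetical.  The streams are
named as if they were files, but nothing is read from or written to a
filesystem at any point:

```python
import io
import zipfile

# Hypothetical entry names and payloads, held purely in memory.
streams = {
    "content.xml": b"<doc>hello</doc>",
    "images/logo.bin": bytes(range(16)),
}

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
    for name, payload in streams.items():
        # Each stream gets a name in its local header; no disk file is involved.
        package.writestr(name, payload)

# Read the package back, again without touching the filesystem.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as package:
    names = package.namelist()
    recovered = {name: package.read(name) for name in names}
```

The same bytes could equally be exploded into a directory tree using those
names, which is the convenience the filenames buy us.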

>
>
>>  Existing zip tools seem to behave differently in this
>> respect.  Are we simply interested in the lowest common denominator of
>> the binary stream plus name?   I know zip files are often composed
>> from files in a filesystem, but they are also frequently generated
>> directly from streams without touching the filesystem.
>
> What do others call such a collection... without calling it a file?
> Pragmatically it is likely to end up as a file on disk?

I really don't agree unless we are maybe misunderstanding each other.
The zip file is certainly most likely to end up on a disk and I don't
really have any gripe about talking about zip as a file format.  But
the things which are stored inside the zip are not necessarily likely
to either begin or end life as files on a disk.  Unless we are simply
referring to the conventional, typical, general purpose use of zip.
But if we are then the appnote maybe suffices.  Perhaps my
misunderstanding is that I'm seeing the scope, at least partly, as
defining a minimal zip profile (don't jump - I know we should define
profile :-) which is sufficient for the packaging of documents.

>
>
>>
>>>
>>>
>>>> 1.1. A compression algorithm shall be provided which is usable without
>>>> infringing any existent patent.
>>>>
>>>
>>> The goal should not be that the algorithm is free of IPR but that the
>>> users of the standard may practice the algorithm without payment of
>>> royalties.  For example, the owner of the patent could pledge not to
>>> assert their patents for implementors of the standard.  Royalty-free and
>>> IPR-free are not the same thing, though they are often confused.
>>
>> Agreed.  But I wouldn't like to throw away the desirability of using
>> algorithms which are patent-free.  Pledges, covenants, promises etc
>> are all a bit second-best.  Perhaps the statement 1.2 below (modified
>> slightly) is sufficient
>
> How about 'unencumbered'?
>
> I'd be happy with the later one inserted here. I think the aim is
> clear, no problem with wordsmithing it.
>
>
>
>
>>>> 2. The packaged entity shall hold one or more files.
>>>>
>>>
>>> OK.  This needs further specification:  A file has a name, metadata (date,
>>> permissions, archive bit?) as well as contents.  Detailed requirements?
>>
>> yes.  Minor point, but I would have thought the packaged entities are
>> the things which are packaged within the package entity.  And it's the
>> nature of these "packaged entities" - files, streams or what have you
>> - which we should detail.
>
> 1. Are we into metadata? Is this going too far into implementation?
> Again, I'd prefer to stay above a file system and attributes?
> I'm -1 here (if we can do without it). That's the implementation layer,
> rather than the specification *(what)* level.

I think we agree.

>
>
>
>>>> 2.2 Any file hierarchy present when the package is created shall be
>>>> duplicated on extraction if requested.
>>>>
>>>
>>> So this leads to the requirement that you can store a file hierarchy.
>>
>> Again I would not straight away assume we are talking of a file
>> hierarchy.  The contents of the package may have started off as files.
>>  And may even be extracted to files.  But is this necessarily so?  I
>> would prefer to think of the zip as simply a container for streams.
>
> Stealing your earlier words, does entity/entities work here?
> I read 'streams' and think of the java file / stream hierarchy?
> Agreed this may all be happening in memory (and then optionally
> written to disk), but I'm short on words that reflect this
> abstraction?
>
> Anyone?
>
>
>>
>>>
>>>> 2.3 The package shall hold any combination of binary and/or text
>>>> files.
>>>>
>>>
>>> Not sure I agree that text files must be distinguishable from binary
>>> files.  Once you have text files you end up dealing with DOS/Unix CRLF
>>> conversions.  Better to just store the file as-is, directly, at which
>>> point there is no difference between text and binary files.
>>>
>> Agree.
>
> I'm saying abstract them from the archive, not process them.
> That's in the application layer operating on these things.
>
> Anyone see a problem omitting this differentiation? If not
> then I'm OK to remove it.
>
>
>
>>
>>>
>>>> 2.3.1 There shall be no difference between a file prior to being
>>>> archived and the corresponding file when extracted from the archive.
>>>>
>>>
>>> OK.  And per above, if you are not changing a text file on different OS's
>>> then there is no difference between text and binary.
>
> Does this address the binary/text differentiation sufficiently?
>
>
>>>
>>>> 2.3.2 No change shall be made to any character encoding by compressing
>>>> and decompressing a file. I.e. an input file after decompression must
>>>> match its character encoding prior to compression.
>>>>
>>>
>>> Again, then why distinguish text from binary?
>
> Noted.
>
>
>>>
>>>> 3. A means of verification of an archive shall be provided.
>>>>
>>>
>>> Not sure what is intended here.  Do you mean you want a specification of a
>>> verification procedure? (A validator?) Or that a conforming
>>> ZIP-consumer/ZIP-producer must include a means of verification?
>
> Needs expansion.
> *I* meant, tell me what files are contained?
> 'unzip -l archive.zip' gives me the file sizes and names. I think that's
> all that's needed, though a full verification, i.e. re-compress and compare...
> something (md5sum?) with archived content is perhaps the 200% check.
> Is this needed? If it fails then the earlier 'no change' fails, so IMHO
> a simple 'tell me what's inside' is sufficient.
>
> Anyone care to wordsmith that?
>
>
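
The levels of checking discussed above can be sketched as follows, assuming
Python's standard zipfile module.  The entry names and contents are made up,
and SHA-256 stands in for the md5sum suggested in the thread:

```python
import hashlib
import io
import zipfile

# Hypothetical inputs: one text-like payload with mixed line endings, one binary.
original = {"a.txt": b"line one\r\nline two\n", "b.bin": b"\x00\x01\x02\xff"}

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    for name, payload in original.items():
        archive.writestr(name, payload)

with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    # Level 1: "tell me what's inside" (names and uncompressed sizes).
    listing = [(info.filename, info.file_size) for info in archive.infolist()]

    # Level 2: decompress every entry and check it against its stored CRC-32.
    # testzip() returns the name of the first bad entry, or None if all pass.
    first_bad = archive.testzip()

    # Level 3: compare a digest of each extracted entry with the input.
    digests_match = all(
        hashlib.sha256(archive.read(name)).digest()
        == hashlib.sha256(payload).digest()
        for name, payload in original.items()
    )
```

Level 3 needs the original inputs to hand, which a consumer normally does not
have; that is one reason a plain listing plus the stored CRC check may be as
far as a requirement can reasonably go.
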
>>>
>>>
>>>> 4. A means of listing the contents of an archive without extraction
>>>> shall be provided.
>>>>
>>>
>>> Again, not clear who is providing this, the specification or a
>>> consumer/producer?
>
> We are requiring it of an implementation which declares itself
> compliant to our requirement.
>
>
>>
>> This is a really interesting issue.  Currently in zip appnote we have
>> a central directory record (which is supposed to, but does not
>> necessarily, reflect the collection of entries).  A problem with this
>> is that it appears at the end of the zip.  This can be quite
>> inconvenient when consuming a large incoming zip stream.  One of the
>> first requirements for a consumer is typically to determine exactly
>> what kind of package it is dealing with.  Many formats (including odf,
>> jar etc) also have a requirement for some form of manifest.  OPC has
>> .rels which addresses the same problem slightly differently.  It would
>> certainly be desirable to have this "listing" (I know it's more than
>> listing) as the first entity in the packaged collection.
>
>
> Why first Bob? So long as it is available? Simply for speed of access?

Partly. But it's also about not having to unzip to a tmp folder or even
persist the zip in order for a consumer to decide what to do next -
because the contents may be very large.  I'll give an example from a
piece of code I'm working on right now.  A web application imports
files in various formats.  Some of those are zipped formats (eg. xlsx
and soon odf calc but also some others).  And as I say, some of these
'files' can be very large.  The first challenge is to find out what
kind of file format I'm dealing with so I know whether I can parse it
or not - the first few bytes of the binary stream indicate it's a zip.
With binary format files one would typically discover the file 'type'
by looking at these signature bytes up front.  With a zipped format we
only know at this point that we are dealing with a zipped format.
From my perspective that's quite a big disadvantage.  Of course once
I've seen the manifest or .rels I know what I'm dealing with.  Hence
I'd like to recommend that producers put these up front where
possible.

It's only a recommendation.  But if non-naive zip producers used it,
it would give back to consumers some of what is lost by moving from
binary representations to random collections of zipped xml documents.
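
A rough sketch of what an up-front manifest buys a consumer, assuming
Python's struct and zipfile modules.  The function name first_entry_name is
hypothetical and the entry names are only illustrative.  It reads the 30
fixed bytes of the first local file header plus the filename that follows,
and never seeks to the central directory at the end:

```python
import io
import struct
import zipfile

# Fixed part of a local file header as laid out in the appnote (30 bytes).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
LOCAL_SIGNATURE = b"PK\x03\x04"


def first_entry_name(stream):
    """Return the first entry's name, or None if no local header is found."""
    fixed = stream.read(LOCAL_HEADER.size)
    if len(fixed) < LOCAL_HEADER.size:
        return None
    fields = LOCAL_HEADER.unpack(fixed)
    if fields[0] != LOCAL_SIGNATURE:
        return None
    flags, name_length = fields[2], fields[9]
    raw_name = stream.read(name_length)
    # Bit 11 of the general purpose flags marks a UTF-8 encoded name.
    return raw_name.decode("utf-8" if flags & 0x800 else "cp437")


# A producer that deliberately writes the manifest before anything else.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
    package.writestr("META-INF/manifest.xml", b"<manifest/>")
    package.writestr("content.xml", b"<doc/>" * 1000)

detected = first_entry_name(io.BytesIO(buffer.getvalue()))
```

The consumer only needs a forward-reading stream here, so a large incoming
package does not have to be persisted before deciding what to do with it.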

>
>>  The zip
>> appnote doesn't say anything about ordering - most probably because
>> the original rationale for archiving was quite different to our
>> rationale for packaging.
>
> Or it's a how, not a what?
>
>
>
>>
>> Final thought - given that most of the formats we refer to make use of
>> some form of manifest, how important is it to concern ourselves too
>> much with the central directory at all?
>
> Define central directory please? Just a list of items contained in the archive?
> I required that, but as an ordinary file (thingy) within the archive.
>
>>  Sure to be compatible with
>> existing general purpose zip implementations, it must be there.  But
>> when profiling a package specification on top of zip (and possibly
>> even other low level mechanisms like tar) it might be more important
>> that we focus on the manifest.
>
> Define profiling please?

I doubt I'm the best person to do this.  But my interpretation of
profiling in this context would be an additional set of constraints on
top of the general zip structure as described in the appnote which
provide a minimal set of characteristics to meet our requirements.
Which I know begs the next question .. what are these requirements?
If they are in fact as general as your point 1 above then you can
discount most of what I have said.  But my understanding of sc34
interest is that it is primarily interested in describing the features
of zip required to package documents.  And I do see this as a subtle
re-purposing of the generic zip requirement to archive stuff from the
filesystem.
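
To make "profile" a little more concrete, here is a sketch of a checker,
assuming Python's standard zipfile module.  The three constraints (stored or
deflate only, no encrypted entries, manifest physically first) are invented
purely for illustration and are not agreed requirements; the names
profile_violations and build are hypothetical too:

```python
import io
import zipfile

ALLOWED_METHODS = {zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED}  # assumed rule
REQUIRED_FIRST_ENTRY = "META-INF/manifest.xml"  # assumed rule


def profile_violations(data):
    """List violations of the hypothetical profile for a zip held in bytes."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        # Order by local header offset, i.e. physical position in the stream,
        # rather than trusting the order of the central directory.
        entries = sorted(package.infolist(), key=lambda info: info.header_offset)
        if not entries or entries[0].filename != REQUIRED_FIRST_ENTRY:
            problems.append("manifest is not the first entry")
        for info in entries:
            if info.compress_type not in ALLOWED_METHODS:
                problems.append(f"{info.filename}: compression method not allowed")
            if info.flag_bits & 0x1:  # bit 0 set means the entry is encrypted
                problems.append(f"{info.filename}: encrypted entry")
    return problems


def build(names):
    """Build an in-memory zip whose entries are written in the given order."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name in names:
            package.writestr(name, b"<x/>")
    return buffer.getvalue()


good = profile_violations(build(["META-INF/manifest.xml", "content.xml"]))
bad = profile_violations(build(["content.xml", "META-INF/manifest.xml"]))
```

Both packages are perfectly valid zips as far as the appnote goes; only the
first also satisfies the extra constraints, which is all I mean by a profile.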

Regards
Bob

>
>
>
> regards
>
> --
> Dave Pawson
> XSLT XSL-FO FAQ.
> Docbook FAQ.
> http://www.dpawson.co.uk
> _______________________________________________
> sc34wg1study mailing list
> sc34wg1study at vse.cz
> http://mailman.vse.cz/mailman/listinfo/sc34wg1study
>