An outline proposal

Tue Oct 19 12:38:26 CEST 2010

Thanks Dave for putting up a concrete proposal.  Comments below:

On 19 October 2010 01:49,  <robert_weir at us.ibm.com> wrote:
> sc34wg1study-bounces at vse.cz wrote on 10/13/2010 12:20:22 PM:
>
>>
>> My view,
>> A bare outline, skipping the 'version'/type of zip, until we get legal
> input.
>>
>
> This WG has a deadline, so we should not wait or skip anything in the vain
> hope that legal assistance will be coming.  It won't come.
>
>> Comments please?
>>
>> 2010-10-09T21:29:51Z
>> Outline requirements for a zip specification.
>> rev 1.0, Dave Pawson
>>
>> 1. Provide a compressed archive format for general use.
>>
>
> OK.

I think we probably need to define what we mean by 'archive' format -
my understanding being:

Provide a compressed format for representing a collection of files (or
streams) within a single stream.  I am in two minds about the use of
the term 'file' as it has some platform connotations, for example
expectations around preservation of permissions and other file
attributes.  Existing zip tools seem to behave differently in this
respect.  Are we simply interested in the lowest common denominator of
the binary stream plus name?   I know zip files are often composed
from files in a filesystem, but they are also frequently generated
directly from streams without touching the filesystem.
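
To illustrate the "streams, not files" case, here is a rough sketch using
Python's standard zipfile module.  The entry names and the pack_streams
helper are made up for illustration; nothing is read from or written to a
filesystem, and each entry carries only a name plus its bytes.

```python
import io
import zipfile


def pack_streams(entries):
    """Pack (name, bytes) pairs into a zip held entirely in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries:
            # writestr takes the entry name and raw bytes; no file on disk is involved
            zf.writestr(name, data)
    return buf.getvalue()


# Hypothetical entry names, used only for this example
package = pack_streams([
    ("content.xml", b"<doc/>"),
    ("images/logo.bin", b"\x00\x01\x02"),
])

# Read the names back, again without touching the filesystem
with zipfile.ZipFile(io.BytesIO(package)) as zf:
    names = zf.namelist()
```

Note that this tool still records a timestamp and some attributes for each
entry, which is part of the question of what the lowest common denominator
really is.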

>
>
>> 1.1. A compression algorithm shall be provided which is usable without
>> infringing any existent patent.
>>
>
> The goal should not be that the algorithm is free of IPR but that the
> users of the standard may practice the algorithm without payment of
> royalties.  For example, the owner of the patent could pledge not to
> assert their patents for implementors of the standard.  Royalty-free and
> IPR-free are not the same thing, though they are often confused.

Agreed.  But I wouldn't like to throw away the desirability of using
algorithms which are patent-free.  Pledges, covenants, promises etc
are all a bit second-best.  Perhaps the statement 1.2 below (modified
slightly) is sufficient.

>
>
>> 1.2. A compression algorithm shall be provided which may be used

unconditionally and in perpetuity

>> without payment of any sort.
>>
>
> OK.
>
>>
>> 2. The packaged entity shall hold one or more files.
>>
>
> OK.  This needs further specification:  A file has a name, metadata (date,
> permissions, archive bit?) as well as contents.  Detailed requirements?

yes.  Minor point, but I would have thought the packaged entities are
the things which are packaged within the package entity.  And it's the
nature of these "packaged entities" - files, streams or what have you
- which we should detail.

>
>> 2.1 It will be possible to extract one or more individual files from
>> the package.
>>
>
> OK.
>
>> 2.2 Any file hierarchy present when the package is created shall be
>> duplicated on extraction if requested.
>>
>
> So this leads to the requirement that you can store a file hierarchy.

Again I would not straight away assume we are talking of a file
hierarchy.  The contents of the package may have started off as files.
 And may even be extracted to files.  But is this necessarily so?  I
would prefer to think of the zip as simply a container for streams.

>
>> 2.3 The package shall hold any combination of binary and/or text
>> files.
>>
>
> Not sure I agree that text files must be distinguishable from binary
> files.  Once you have text files you end up dealing with DOS/Unix CRLF
> conversions.  Better to just store the file as-is, directly, at which
> point there is no difference between text and binary files.
>
Agree.
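
A small sketch of what "store the file as-is" means in practice, assuming
Python's standard zipfile module (the entry name and sample bytes are
invented for the example).  The payload deliberately mixes CRLF, LF and a
bare CR, and the check is simply that the bytes coming out equal the bytes
going in.

```python
import io
import zipfile

# Sample payload with mixed line endings; treated purely as bytes
original = b"line one\r\nline two\nline three\r"

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes.txt", original)

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    extracted = zf.read("notes.txt")
    # testzip() re-reads every entry and checks its CRC;
    # it returns None when no entry is found to be corrupt
    first_bad_entry = zf.testzip()

round_trip_ok = extracted == original
```

If nothing is converted on the way in or out, there is no text/binary
distinction left to specify.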

>
>> 2.3.1 There shall be no difference between a file prior to being
>> archived and the corresponding file when extracted from the archive.
>>
>
> OK.  And per above, if you are not changing a text file on different OS's
> then there is no difference between text and binary.
>
>> 2.3.2 No change shall be made to any character encoding by compressing
>> and decompressing a file. I.e. an input file after decompression must
>> match its character encoding prior to compression.
>>
>
> Again, then why distinguish text from binary?
>
>> 3. A means of verification of an archive shall be provided.
>>
>
> Not sure what is intended here.  Do you mean you want a specification of a
> verification procedure? (A validator?)  Or that a conforming
> ZIP-consumer/ZIP-producer must include a means of verification?
>
>
>> 4. A means of listing the contents of an archive without extraction
>> shall be provided.
>>
>
> Again, not clear who is providing this, the specification or a
> consumer/producer?

This is a really interesting issue.  Currently in zip appnote we have
a central directory record (which is supposed to, but does not
necessarily, reflect the collection of entries).  A problem with this
is that it appears at the end of the zip.  This can be quite
inconvenient when consuming a large incoming zip stream.  One of the
first requirements for a consumer is typically to determine exactly
what kind of package it is dealing with.  Many formats (including odf,
jar etc) also have a requirement for some form of manifest.  OPC has
.rels which addresses the same problem slightly differently.  It would
certainly be desirable to have this "listing" (I know it's more than
listing) as the first entity in the packaged collection.  The zip
appnote doesn't say anything about ordering - most probably because
the original rationale for archiving was quite different to our
rationale for packaging.  And ordering would be determined by some
sort of glob on the filesystem rather than anything more deliberate.
Perhaps there is also some historical justification around the ease of
appending the directory information to the end rather than inserting
at the beginning  (interesting that tar doesn't have any such central
directory).  I do recall OPC says something about recommending that
ordering be sensible.  For our purposes I think it would be useful to
specify that producers *should* place such manifest information
upfront.
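
As a rough sketch of the layout being discussed, again assuming Python's
standard zipfile module: the producer below writes a manifest as the first
entry, and the offsets show that the central directory still ends up at the
tail of the stream.  The name "manifest.txt", its contents and the
build_package helper are purely illustrative and not taken from any of the
formats mentioned.

```python
import io
import zipfile

CDIR_SIGNATURE = b"PK\x01\x02"  # central directory file header
EOCD_SIGNATURE = b"PK\x05\x06"  # end of central directory record


def build_package(manifest, parts):
    """Write the manifest first, then the remaining (name, bytes) parts."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("manifest.txt", manifest)  # hypothetical manifest name
        for name, data in parts:
            zf.writestr(name, data)
    return buf.getvalue()


package = build_package(b"content.xml\n", [("content.xml", b"<doc/>")])

with zipfile.ZipFile(io.BytesIO(package)) as zf:
    first = zf.infolist()[0]
    first_entry = first.filename
    first_offset = first.header_offset  # physical offset of the local header

# Locate the directory structures by their signatures
cdir_offset = package.find(CDIR_SIGNATURE)
eocd_offset = package.rfind(EOCD_SIGNATURE)
```

A streaming consumer would meet the manifest's local header at offset 0,
well before the central directory, which only arrives once every entry has
gone past.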

Final thought - given that most of the formats we refer to make use of
some form of manifest, how important is it to concern ourselves too
much with the central directory at all?  Sure, to be compatible with
existing general purpose zip implementations, it must be there.  But
when profiling a package specification on top of zip (and possibly
even other low level mechanisms like tar) it might be more important
that we focus on the manifest.

regards
Bob

>
>> 5. A package listing shall be created as a plain text file within
>> the archive which lists all files within the archive excepting itself.
>>
>
> I don't see this as a requirement.  In fact, if you look at the most
> common operations on an archive, including the incremental addition and/or
> replacement of files, you'll see that the most efficient encoding of
> directory information is unlikely to be a plain text file package listing.
>
>>
>> 5. A means of extracting the contents of an archive shall be provided
>> which meets the requirement of 2.3.1
>>
>
> Same response as to 2.3.1
>
>> 5.1. A decompression algorithm shall be provided which is usable without
>> infringing any existent patent.
>>
>
> Same response as to 1.1.
>
>> 5.2. A decompression algorithm shall be provided which may be used
>> without payment of any sort.
>>
>
> Same response as to 1.2.
>
>
> Regards,
>
> -Rob
>
>
>>
>>
>>
>> --
>> Dave Pawson
>> XSLT XSL-FO FAQ.
>> Docbook FAQ.
>> http://www.dpawson.co.uk
>> _______________________________________________
>> sc34wg1study mailing list
>> sc34wg1study at vse.cz
>> http://mailman.vse.cz/mailman/listinfo/sc34wg1study
>