An outline proposal

Tue Oct 19 13:43:56 CEST 2010

On 19 October 2010 11:38, Bob Jolliffe <bobjolliffe at gmail.com> wrote:

>>> 1. Provide a compressed archive format for general use.
>>>
>>
>> OK.
>
> I think we probably need to define what we mean by 'archive' format -
> my understanding being:
>
> Provide a compressed format for representing a collection of files (or
> streams) within a single stream.  I am in two minds about the use of
> the term 'file' as it has some platform connotations, for example
> expectations around preservation of permissions and other file
> attributes.

How abstract do we want to be?
IMHO we should keep out of the file system arena?
Leaving it at 'collection' is a bit woolly?

  Existing zip tools seem to behave differently in this
> respect.  Are we simply interested in the lowest common denominator of
> the binary stream plus name?   I know zip files are often composed
> from files in a filesystem, but they are also frequently generated
> directly from streams without touching the filesystem.

What do others call such a collection... without calling it a file?
Pragmatically it is likely to end up as a file on disk?
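
To make the 'streams without touching the filesystem' case concrete, here is a minimal sketch, assuming Python's standard zipfile and io modules. The entry names are made up for illustration; each entry is nothing more than a name plus a sequence of bytes.

```python
import io
import zipfile

# Build an archive entirely in memory: no file system is touched.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    # Each entry is a name plus a byte stream (names are hypothetical).
    archive.writestr("content.xml", b"<doc/>")
    archive.writestr("images/logo.bin", bytes(range(16)))

# Read it back from an in-memory copy of the same bytes.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    names = archive.namelist()
    first = archive.read("content.xml")
```

Whether the resulting bytes are later written to disk is a separate decision made by the application.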

>
>>
>>
>>> 1.1. A compression algorithm shall be provided which is usable without
>>> infringing any existent patent.
>>>
>>
>> The goal should not be that the algorithm is free of IPR but that the
>> users of the standard may practice the algorithm without payment of
>> royalties.  For example, the owner of the patent could pledge not to
>> assert their patents for implementors of the standard.  Royalty-free and
>> IPR-free are not the same thing, though they are often confused.
>
> Agreed.  But I wouldn't like to throw away the desirability of using
> algorithms which are patent-free.  Pledges, covenants, promises etc
> are all a bit second-best.  Perhaps the statement 1.2 below (modified
> slightly) is sufficient

How about 'unencumbered'?

I'd be happy with the later one inserted here. I think the aim is
clear, no problem with wordsmithing it.

>>> 2. The packaged entity shall hold one or more files.
>>>
>>
>> OK.  This needs further specification:  A file has a name, metadata (date,
>> permissions, archive bit?) as well as contents.  Detailed requirements?
>
> yes.  Minor point, but I would have thought the packaged entities are
> the things which are packaged within the package entity.  And it's the
> nature of these "packaged entities" - files, streams or what have you
> - which we should detail.

1. Are we into metadata? Is this going too far into implementation?
Again, I'd prefer to stay above a file system and attributes?
I'm -1 here (if we can do without it). That's the implementation layer,
rather than the specification *(what)* level.

>>> 2.2 Any file hierarchy present when the package is created shall be
>>> duplicated on extraction if requested.
>>>
>>
>> So this leads to the requirement that you can store a file hierarchy.
>
> Again I would not straight away assume we are talking of a file
> hierarchy.  The contents of the package may have started off as files.
>  And may even be extracted to files.  But is this necessarily so?  I
> would prefer to think of the zip as simply a container for streams.

Stealing your earlier words, does entity/entities work here?
I read 'streams' and think of the Java file / stream hierarchy?
Agreed this may all be happening in memory (and then optionally
written to disk), but I'm short on words that reflect this
abstraction?

Anyone?

>
>>
>>> 2.3 The package shall hold any combination of binary and/or text
>>> files.
>>>
>>
>> Not sure I agree that text files must be distinguishable from binary
>> files.  Once you have text files you end up dealing with DOS/Unix CRLF
>> conversions.  Better to just store the file as-is, directly, at which
>> point there is no difference between text and binary files.
>>
> Agree.

I'm saying abstract them from the archive, not process them.
That's in the application layer operating on these things.

Anyone see a problem omitting this differentiation? If not
then I'm OK to remove it.

>
>>
>>> 2.3.1 There shall be no difference between a file prior to being
>>> archived and the corresponding file when extracted from the archive.
>>>
>>
>> OK.  And per above, if you are not changing a text file on different OS's
>> then there is no difference between text and binary.

Does this address the binary/text differentiation sufficiently?
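
A minimal sketch of what 2.3.1 implies in practice, assuming Python's standard zipfile module: if entries are stored as-is, text with DOS or Unix line endings, or in any character encoding, comes back byte for byte. The helper name round_trip and the sample payloads are hypothetical.

```python
import io
import zipfile

def round_trip(payload: bytes) -> bytes:
    """Compress a payload into an in-memory archive and extract it again."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("entry", payload)
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        return archive.read("entry")

# Sample payloads: the archive layer never looks inside them.
dos_text = b"line one\r\nline two\r\n"
unix_text = b"line one\nline two\n"
latin1_text = "caf\u00e9".encode("latin-1")

unchanged = all(round_trip(p) == p for p in (dos_text, unix_text, latin1_text))
```

Under that assumption the archive needs no notion of 'text' at all; any line-ending or encoding conversion would belong to the application.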

>>
>>> 2.3.2 No change shall be made to any character encoding by compressing
>>> and decompressing a file. I.e. an input file after decompression must
>>> match its character encoding prior to compression.
>>>
>>
>> Again, then why distinguish text from binary?

Noted.

>>
>>> 3. A means of verification of an archive shall be provided.
>>>
>>
>> Not sure what is intended here.  Do you mean you want a specification of a
>> verification procedure? (A validator?) Or that a conforming
>> ZIP-consumer/ZIP-producer must include a means of verification?

Needs expansion.
*I* meant, tell me what files are contained?
'unzip -l archive.zip' gives me the file sizes and names. I think that's
all that's needed, though a full verification, i.e. re-compress and compare...
something (md5sum?) with archived content is perhaps the 200% check.
Is this needed? If it fails then the earlier 'no change' fails, so IMHO
a simple 'tell me what's inside' is sufficient.
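
The two levels of checking can be sketched as follows, assuming Python's standard zipfile module. list_entries is the simple 'tell me what's inside'; verify is the heavier check, which here relies on the CRC-32 values zip already stores rather than on md5sum. Both function names and the sample entries are hypothetical.

```python
import io
import zipfile

def list_entries(archive_bytes: bytes):
    """Return (name, uncompressed size) pairs without extracting anything."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]

def verify(archive_bytes: bytes) -> bool:
    """Read every entry and check its stored CRC-32; True if all match."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        # testzip() returns the name of the first bad entry, or None.
        return archive.testzip() is None

# Build a small sample archive in memory (entry names are made up).
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w", zipfile.ZIP_DEFLATED) as _archive:
    _archive.writestr("a.txt", b"hello")
    _archive.writestr("b.bin", b"\x00\x01\x02")
sample = _buffer.getvalue()

listing = list_entries(sample)
intact = verify(sample)
```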

Anyone care to wordsmith that?

>>
>>
>>> 4. A means of listing the contents of an archive without extraction
>>> shall be provided.
>>>
>>
>> Again, not clear who is providing this, the specification or a
>> consumer/producer?

We are requiring it of an implementation which declares itself
compliant to our requirement.

>
> This is a really interesting issue.  Currently in zip appnote we have
> a central directory record (which is supposed to, but does not
> necessarily, reflect the collection of entries).  A problem with this
> is that it appears at the end of the zip.  This can be quite
> inconvenient when consuming a large incoming zip stream.  One of the
> first requirements for a consumer is typically to determine exactly
> what kind of package it is dealing with.  Many formats (including odf,
> jar etc) also have a requirement for some form of manifest.  OPC has
> .rels which addresses the same problem slightly differently.  It would
> certainly be desirable to have this "listing" (I know it's more than
> listing) as the first entity in the packaged collection.

Why first, Bob? So long as it is available? Simply for speed of access?

 The zip
> appnote doesn't say anything about ordering - most probably because
> the original rationale for archiving was quite different to our
> rationale for packaging.

Or it's a how, not a what?

>
> Final thought - given that most of the formats we refer to make use of
> some form of manifest, how important is it to concern ourselves too
> much with the central directory at all?

Define central directory please? Just a list of items contained in the archive?
I required that, but as an ordinary file (thingy) within the archive.
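
For reference, a minimal sketch of where the central directory sits, assuming the record layout from the zip appnote and Python's standard struct and zipfile modules. The archive ends with an 'end of central directory' record that points back at the directory, which is why a consumer has to reach the end of the stream to find it. The function name is hypothetical, and the sketch ignores Zip64 and does not guard against the signature appearing inside an archive comment.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"   # end of central directory record
EOCD_FORMAT = "<4s4H2LH"         # 22 bytes, little-endian

def locate_central_directory(data: bytes):
    """Return (entry count, directory size, directory offset)."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("no end of central directory record found")
    fields = struct.unpack_from(EOCD_FORMAT, data, pos)
    # fields: signature, disk no., disk with directory, entries on this disk,
    #         total entries, directory size, directory offset, comment length
    return fields[4], fields[5], fields[6]

# Build a two-entry sample archive in memory (entry names are made up).
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w", zipfile.ZIP_STORED) as _archive:
    _archive.writestr("one", b"first")
    _archive.writestr("two", b"second")
data = _buffer.getvalue()

entries, size, offset = locate_central_directory(data)
```

So in appnote terms it is a list of the items contained, but one kept in the container's own trailer and not as an ordinary entry within the archive.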

 Sure, to be compatible with
> existing general purpose zip implementations, it must be there.  But
> when profiling a package specification on top of zip (and possibly
> even other low level mechanisms like tar) it might be more important
> that we focus on the manifest.

Define profiling please?

regards

-- 
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk