An outline proposal
Dave Pawson
dave.pawson at gmail.com
Tue Oct 19 13:43:56 CEST 2010
On 19 October 2010 11:38, Bob Jolliffe <bobjolliffe at gmail.com> wrote:
>>> 1. Provide a compressed archive format for general use.
>>>
>>
>> OK.
>
> I think we probably need to define what we mean by 'archive' format -
> my understanding being:
>
> Provide a compressed format for representing a collection of files (or
> streams) within a single stream. I am in two minds about the use of
> the term 'file' as it has some platform connotations, for example
> expectations around preservation of permissions and other file
> attributes.
How abstract do we want to be?
IMHO we should keep out of the file system arena?
Though leaving it at 'collection' is a bit woolly?
> Existing zip tools seem to behave differently in this
> respect. Are we simply interested in the lowest common denominator of
> the binary stream plus name? I know zip files are often composed
> from files in a filesystem, but they are also frequently generated
> directly from streams without touching the filesystem.
What do others call such a collection... without calling it a file?
Pragmatically it is likely to end up as a file on disk?
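
To illustrate Bob's point that a zip need not start life on a filesystem, here is a minimal sketch using Python's standard zipfile module. The helper name pack_streams and the entry names are made up for illustration; the archive is built and read back entirely in memory.

```python
import io
import zipfile


def pack_streams(entries):
    """Pack a mapping of name -> bytes into a zip held entirely in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in entries.items():
            # writestr takes the payload directly; no file on disk is involved.
            archive.writestr(name, payload)
    return buffer.getvalue()


# Hypothetical entries: each one is just a name plus a byte string.
packed = pack_streams({"a/b.txt": b"hello\r\n", "raw.bin": bytes(range(256))})

with zipfile.ZipFile(io.BytesIO(packed)) as archive:
    names = archive.namelist()
```

Each entry here is no more than "binary stream plus name", which is the lowest common denominator mentioned above. Whether the result ever lands on disk is left to the caller.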
>
>>
>>
>>> 1.1. A compression algorithm shall be provided which is usable without
>>> infringing any existent patent.
>>>
>>
>> The goal should not be that the algorithm is free of IPR but that the
>> users of the standard may practice the algorithm without payment of
>> royalties. For example, the owner of the patent could pledge not to
>> assert their patents for implementors of the standard. Royalty-free and
>> IPR-free are not the same thing, though they are often confused.
>
> Agreed. But I wouldn't like to throw away the desirability of using
> algorithms which are patent-free. Pledges, covenants, promises etc
> are all a bit second-best. Perhaps the statement 1.2 below (modified
> slightly) is sufficient
How about 'unencumbered'?
I'd be happy with the later one inserted here. I think the aim is
clear, no problem with wordsmithing it.
>>> 2. The packaged entity shall hold one or more files.
>>>
>>
>> OK. This needs further specification: A file has a name, metadata (date,
>> permissions, archive bit?) as well as contents. Detailed requirements?
>
> yes. Minor point, but I would have thought the packaged entities are
> the things which are packaged within the package entity. And it's the
> nature of these "packaged entities" - files, streams or what have you
> - which we should detail.
1. Are we into metadata? Is this going too far into implementation?
Again, I'd prefer to stay above a file system and attributes?
I'm -1 here (if we can do without it). That's the implementation layer,
rather than the specification *(what)* level.
>>> 2.2 Any file hierarchy present when the package is created shall be
>>> duplicated on extraction if requested.
>>>
>>
>> So this leads to the requirement that you can store a file hierarchy.
>
> Again I would not straight away assume we are talking of a file
> hierarchy. The contents of the package may have started off as files.
> And may even be extracted to files. But is this necessarily so? I
> would prefer to think of the zip as simply a container for streams.
Stealing your earlier words, does entity/entities work here?
I read 'streams' and think of the Java file/stream hierarchy?
Agreed this may all be happening in memory (and then optionally
written to disk), but I'm short on words that reflect this
abstraction?
Anyone?
>
>>
>>> 2.3 The package shall hold any combination of binary and/or text
>>> files.
>>>
>>
>> Not sure I agree that text files must be distinguishable from binary
>> files. Once you have text files you end up dealing with DOS/Unix CRLF
>> conversions. Better to just store the file as-is, directly, at which
>> point there is no difference between text and binary files.
>>
> Agree.
I'm saying abstract them from the archive, not process them.
That's in the application layer operating on these things.
Anyone see a problem omitting this differentiation? If not
then I'm OK to remove it.
>
>>
>>> 2.3.1 There shall be no difference between a file prior to being
>>> archived and the corresponding file when extracted from the archive.
>>>
>>
>> OK. And per above, if you are not changing a text file on different OS's
>> then there is no difference between text and binary.
Does this address the binary/text differentiation sufficiently?
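
The 'no difference' requirement in 2.3.1 can be phrased as a byte-for-byte round trip. A minimal sketch, assuming Python's standard zipfile module; the helper round_trip and the sample payloads are illustrative only.

```python
import io
import zipfile


def round_trip(payload: bytes) -> bytes:
    """Compress one entry into an in-memory zip, then read it back."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("entry", payload)
    with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
        return archive.read("entry")


# Mixed line endings, a non-UTF-8 text encoding, and arbitrary binary data.
samples = [
    b"dos line\r\nunix line\n",
    "caf\u00e9".encode("utf-16"),
    bytes(range(256)),
]
unchanged = all(round_trip(sample) == sample for sample in samples)
```

If entries are treated as opaque bytes like this, line endings and character encodings survive untouched, and the archive never needs to know which entries are text.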
>>
>>> 2.3.2 No change shall be made to any character encoding by compressing
>>> and decompressing a file. I.e. an input file after decompression must
>>> match its character encoding prior to compression.
>>>
>>
>> Again, then why distinguish text from binary?
Noted.
>>
>>> 3. A means of verification of an archive shall be provided.
>>>
>>
>> Not sure what is intended here. Do you mean you want a specification of a
>> verification procedure? (A validator?) Or that a conforming
>> ZIP-consumer/ZIP-producer must include a means of verification?
Needs expansion.
*I* meant, tell me what files are contained?
Running 'unzip -l archive.zip' gives me the file sizes and names. I think that's
all that's needed, though a full verification, i.e. re-compress and compare
something (md5sum?) with the archived content, is perhaps the 200% check.
Is this needed? If it fails then the earlier 'no change' fails, so IMHO
a simple 'tell me what's inside' is sufficient.
Anyone care to wordsmith that?
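
As a rough sketch of the two levels discussed here, assuming Python's standard zipfile module: infolist() gives the 'tell me what's inside' listing without extracting anything, and testzip() re-reads every entry and checks it against the stored CRC, which is closer to the fuller check. The helper name list_and_check and the sample entry are made up.

```python
import io
import zipfile


def list_and_check(archive_bytes):
    """Return (name, size) pairs plus the first entry failing its CRC, if any."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        # Listing reads only the directory information, not the entry data.
        listing = [(info.filename, info.file_size) for info in archive.infolist()]
        # testzip() reads every entry; it returns None when all CRCs match.
        first_bad = archive.testzip()
    return listing, first_bad


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notes.txt", b"hello")

listing, first_bad = list_and_check(buffer.getvalue())
```

The listing is cheap. The CRC pass costs a full read of the archive, which is one reason to keep the two as separate requirements.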
>>
>>
>>> 4. A means of listing the contents of an archive without extraction
>>> shall be provided.
>>>
>>
>> Again, not clear who is providing this, the specification or a
>> consumer/producer?
We are requiring it of an implementation which declares itself
compliant to our requirement.
>
> This is a really interesting issue. Currently in zip appnote we have
> a central directory record (which is supposed to, but does not
> necessarily, reflect the collection of entries). A problem with this
> is that it appears at the end of the zip. This can be quite
> inconvenient when consuming a large incoming zip stream. One of the
> first requirements for a consumer is typically to determine exactly
> what kind of package it is dealing with. Many formats (including odf,
> jar etc) also have a requirement for some form of manifest. OPC has
> .rels which addresses the same problem slightly differently. It would
> certainly be desirable to have this "listing" (I know it's more than
> listing) as the first entity in the packaged collection.
Why first Bob? So long as it is available? Simply for speed of access?
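
One possible answer to 'why first', as a sketch only: if the listing is the first entry and is stored uncompressed, a consumer can read it from the head of an incoming stream using forward reads alone, without waiting for the tail. The entry name manifest.txt and the helper read_first_entry are hypothetical; the 30-byte local file header layout follows the zip appnote, and the sketch assumes the sizes are present in that header (no data descriptor) and an ASCII or UTF-8 entry name.

```python
import io
import struct
import zipfile

# Local file header: signature, version, flags, method, time, date,
# crc32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = "<4sHHHHHIIIHH"
HEADER_SIZE = struct.calcsize(LOCAL_HEADER)  # 30 bytes


def read_first_entry(stream):
    """Read the name and payload of the first entry using forward reads only."""
    fields = struct.unpack(LOCAL_HEADER, stream.read(HEADER_SIZE))
    signature, method = fields[0], fields[3]
    compressed_size, name_length, extra_length = fields[7], fields[9], fields[10]
    if signature != b"PK\x03\x04":
        raise ValueError("stream does not start with a local file header")
    if method != zipfile.ZIP_STORED:
        raise ValueError("this sketch only handles a stored first entry")
    name = stream.read(name_length).decode("utf-8")
    stream.read(extra_length)  # skip any extra field
    return name, stream.read(compressed_size)


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    # Hypothetical manifest, written first and left uncompressed.
    archive.writestr("manifest.txt", b"content.xml\nstyles.xml\n",
                     compress_type=zipfile.ZIP_STORED)
    archive.writestr("content.xml", b"<doc/>")
    archive.writestr("styles.xml", b"<styles/>")

first_name, first_payload = read_first_entry(io.BytesIO(buffer.getvalue()))
```

So the gain is not only speed: a consumer of a large incoming stream can decide what kind of package it has before the rest arrives.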
> The zip
> appnote doesn't say anything about ordering - most probably because
> the original rationale for archiving was quite different to our
> rationale for packaging.
Or it's a how, not a what?
>
> Final thought - given that most of the formats we refer to make use of
> some form of manifest, how important is it to concern ourselves too
> much with the central directory at all?
Define central directory please? Just a list of items contained in the archive?
I required that, but as an ordinary file (thingy) within the archive.
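
For reference, a sketch of where the central directory sits in a zip as the appnote describes it: a trailing end-of-central-directory record (signature PK\x05\x06, 22 bytes when there is no comment) points back to a directory of entries stored after all the entry data. The helper locate_central_directory is made up, and the naive backwards search assumes no archive comment containing the signature and no zip64 records.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
# signature, disk number, directory start disk, entries on this disk,
# total entries, directory size, directory offset, comment length.
EOCD_FORMAT = "<4sHHHHIIH"


def locate_central_directory(data: bytes):
    """Return (total_entries, offset, size) of the central directory."""
    position = data.rfind(EOCD_SIGNATURE)
    if position < 0:
        raise ValueError("no end-of-central-directory record found")
    fields = struct.unpack_from(EOCD_FORMAT, data, position)
    total_entries, size, offset = fields[4], fields[5], fields[6]
    return total_entries, offset, size


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("first.txt", b"one")
    archive.writestr("second.txt", b"two")

data = buffer.getvalue()
total_entries, cd_offset, cd_size = locate_central_directory(data)
```

The directory can only be found by starting from the end of the data, which is the inconvenience Bob describes for consumers of a stream.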
> Sure to be compatible with
> existing general purpose zip implementations, it must be there. But
> when profiling a package specification on top of zip (and possibly
> even other low level mechanisms like tar) it might be more important
> that we focus on the manifest.
Define profiling please?
regards
--
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk