Logical archive model

Tue Nov 2 23:17:34 CET 2010

So what is needed, I think, is something like this:

---------------------------------------------------------

An archive is a hierarchical structure containing items.  Items may be 
directories or files.  Directories may contain other items.  Files are 
terminals and do not contain other items.  Directories may be empty.

Items are ordered in the archive, though the order of the items bears no 
necessary relationship to the hierarchical structure, e.g., there is no 
requirement that a "parent" item appear before a "child" item.

Items are identified by an IRI path, which conforms to the "ipath-absolute" 
production in RFC 3987.

Items may have associated attributes.  Attributes defined by this standard 
include:

Creation Date (ISO 8601)
Modified Date (ISO 8601)
Size (long integer)

Additional attributes, including implementation-defined attributes, are 
also permitted.

An archive is stored in an archive encoding, e.g., ZIP, GZIP, TAR, XML, 
etc.

---------------------------------------------------------
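
To make the model above concrete, here is a minimal sketch in Python.  The 
names (Item, Archive, add, children) are mine and purely illustrative, and 
the uniqueness check on paths is my reading of "identified by", not 
something the text above states outright.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Item:
    """One entry in the logical archive (hypothetical class name)."""
    path: str                            # IRI path, e.g. "/Pictures/logo.png"
    is_directory: bool = False           # directories may contain other items
    created: Optional[datetime] = None   # Creation Date
    modified: Optional[datetime] = None  # Modified Date
    size: Optional[int] = None           # Size
    extra: Dict[str, str] = field(default_factory=dict)  # additional attributes


@dataclass
class Archive:
    """An ordered list of items; order is independent of the hierarchy."""
    items: List[Item] = field(default_factory=list)

    def add(self, item: Item) -> None:
        if not item.path.startswith("/"):
            raise ValueError("path must be absolute: " + item.path)
        # Assumption: a path identifies exactly one item.
        if any(existing.path == item.path for existing in self.items):
            raise ValueError("duplicate path: " + item.path)
        self.items.append(item)

    def children(self, directory_path: str) -> List[Item]:
        """Direct children of a directory, returned in archive order."""
        prefix = directory_path.rstrip("/") + "/"
        result = []
        for item in self.items:
            if item.path.startswith(prefix):
                rest = item.path[len(prefix):]
                if rest and "/" not in rest:
                    result.append(item)
        return result


# A "child" may appear before its "parent" in archive order.
archive = Archive()
archive.add(Item("/Pictures/logo.png", size=2048))
archive.add(Item("/Pictures", is_directory=True))
archive.add(Item("/mimetype", size=39))
```

The hierarchy is derived from the paths alone, so the archive order is 
free to carry whatever meaning a particular format wants to give it.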

We don't need a whole file system.  For example, we don't need to deal 
with locking, symbolic links, permissions or anything like that.

So stopping here, can anyone think of any aspect of ODF, OOXML, EPUB 
packaging, or whatever that cannot be expressed in this model?

For example, one of the ODF requirements is that the mimetype file be the 
first in the ZIP and that it be uncompressed.  We can clearly express 
that.  Every such requirement can be stated just by referring to items 
via their IRI paths.
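
As a rough illustration of that ODF rule against the ZIP encoding, the 
following sketch uses Python's standard zipfile module.  The function 
names are made up for this example, it treats the order reported by 
infolist() as the archive order, and it assumes the model path 
"/mimetype" maps to the ZIP entry name "mimetype".

```python
import io
import zipfile


def mimetype_is_first_and_stored(zip_bytes: bytes) -> bool:
    """Check that the first entry is 'mimetype' and is not compressed."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        entries = zf.infolist()
        if not entries:
            return False
        first = entries[0]
        return (first.filename == "mimetype"
                and first.compress_type == zipfile.ZIP_STORED)


def build_sample(mimetype_first: bool) -> bytes:
    """Build a tiny ODF-like ZIP in memory, for demonstration only."""
    entries = [
        ("mimetype", b"application/vnd.oasis.opendocument.text",
         zipfile.ZIP_STORED),
        ("content.xml", b"<doc/>", zipfile.ZIP_DEFLATED),
    ]
    if not mimetype_first:
        entries.reverse()
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data, method in entries:
            zf.writestr(name, data, compress_type=method)
    return buf.getvalue()


good = mimetype_is_first_and_stored(build_sample(mimetype_first=True))
bad = mimetype_is_first_and_stored(build_sample(mimetype_first=False))
```

Here good should come out True and bad False; the rule itself only 
mentions an item path, its position, and one attribute.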

I'm putting compression aside for a second, since I don't think it is 
an essential aspect of packaging.  It is, however, an important aspect of 
particular encodings, where it would fit in as additional attributes, 
e.g.:

Compression Method (enum/string)
Original Size (long integer)

Compression per se does not really carry semantic value at the 
application/document level, at least not among formats like ODF, OOXML, 
EPUB, etc.  A particular software application may nevertheless be very 
interested in setting this attribute on a per-item basis, to optimize 
storage based on underlying content types, e.g., don't compress already 
compressed images, but do compress XML.
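
A sketch of that per-item choice, again with Python's zipfile module.  
The extension list and the helper names are assumptions made for the 
example; a real application would more likely decide from the item's 
media type.

```python
import io
import posixpath
import zipfile

# Assumption: these extensions mark content that is already compressed.
ALREADY_COMPRESSED = {".png", ".jpg", ".jpeg", ".gif"}


def choose_method(name: str) -> int:
    """Pick a ZIP compression method for one item, based on its extension."""
    suffix = posixpath.splitext(name)[1].lower()
    if suffix in ALREADY_COMPRESSED:
        return zipfile.ZIP_STORED
    return zipfile.ZIP_DEFLATED


def write_items(items) -> bytes:
    """Write (name, data) pairs to an in-memory ZIP, one method per item."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in items:
            zf.writestr(name, data, compress_type=choose_method(name))
    return buf.getvalue()


def read_attributes(zip_bytes: bytes) -> dict:
    """Read back the two encoding-level attributes for each item."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: {
                "compression_method": (
                    "stored" if info.compress_type == zipfile.ZIP_STORED
                    else "deflated"),
                "original_size": info.file_size,
            }
            for info in zf.infolist()
        }


data = write_items([
    ("content.xml", b"<p>text</p>" * 50),
    ("Pictures/logo.png", b"\x89PNG" + bytes(64)),
])
attrs = read_attributes(data)
```

The logical content of each item is the same either way; only the 
encoding-level attributes differ.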

So this isn't rocket science, but if we had this logical archive model, as 
well as at least one encoding of it, in ZIP, then I think it would be 
possible to cleanly express what we need in document format uses.  And by 
separating the logical model from the encoding, we also future-proof 
this technology and allow other approaches to encoding to be used in the 
future, e.g., ones that are more streaming-friendly.

-Rob