UTF-8 in ZIP

Tue Nov 2 21:51:38 CET 2010

One problem is that there are systems, including Zip, where A and A/B can
each be resources.  That is, there is nothing that says "A" has to be a
directory and/or can't have content.  HTTP, WebDAV, etc., have all gone
through this already.
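
A minimal sketch of that point, using Python's standard zipfile module (the helper name build_flat_archive and the payloads are made up for illustration): an archive can hold an entry named "A" with content alongside an entry named "A/B".

```python
import io
import zipfile

def build_flat_archive(entries):
    """Write each name -> payload pair as an ordinary Zip entry, in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in entries.items():
            zf.writestr(name, payload)
    buf.seek(0)
    return buf

# "A" carries content of its own and is also a prefix of "A/B".
archive = build_flat_archive({"A": b"content of A", "A/B": b"content of A/B"})

with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
    a_payload = zf.read("A")
    ab_payload = zf.read("A/B")
```

Whether a consumer that unpacks to a real file system can cope with both entries is a separate question; the archive format itself stores them as two independent names.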

I think that, in addition to layers of abstraction, we need to think about
protocol stacks (sort of the OSI model).  At some level, one needs a
protocol stack that gets us down to what is written in the local/central
directory entries of a Zip archive.  So the producer has some lexical form
of an abstracted identifier and it pushes down the stack until something is
encoded in a Zip directory entry.  Then the consumer has a counterpart of
the stack that works its way back up so the same lexical form is presented
at the same level above the Zip archive itself.  That protocol agreement
need not rely on everything that could go into the Zip directory name at
that level, but it must be something that is admissible at that level.
Any constraint to less than what is admissible (perhaps with allowance for
Postel's Law) is part of the agreement among users of the upper layers and
is not known at the bottom layer.
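
As a rough sketch of such a producer/consumer pair, assuming the upper-layer agreement is "%-encode the UTF-8 bytes of any non-ASCII character" (one possible mapping, not something the App Note defines), the round trip could look like this in Python.  The names push_down, pull_up and round_trip are hypothetical.

```python
import io
import zipfile
from urllib.parse import quote, unquote

def push_down(name: str) -> str:
    """Producer side: upper-layer name -> ASCII-only Zip entry name.

    Assumed mapping: percent-encode UTF-8 bytes, keep "/" as separator.
    """
    return quote(name, safe="/")

def pull_up(entry_name: str) -> str:
    """Consumer side: Zip entry name -> the same upper-layer name."""
    return unquote(entry_name)

def round_trip(name: str, payload: bytes):
    """Write one entry, read it back, and report what each layer saw."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(push_down(name), payload)
    with zipfile.ZipFile(buf) as zf:
        stored = zf.namelist()[0]          # what the bottom layer holds
        return stored, pull_up(stored), zf.read(stored)

stored, restored, data = round_trip("Bilder/übersicht.png", b"x")
```

The bottom layer only ever sees the ASCII form; the restriction to that form is known to the two upper-layer functions and nowhere else.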

Since it is the lower layer objects that are interchanged, in the critical
interoperability use case, we need to know what the constrained use is with
regard to the higher layers of the protocol stack for, say, electronic
documents.

It is in this sense that I think the coding of file names at the Zip archive
layer has to be dealt with in a conservative but precise manner.

I agree that we should be able to define a couple of layers up as something
that is the common foundation for document-packaging usage, allowing
reference by IRIs, internal cross-references among the internal components,
etc.

I don't think this is necessarily an abstraction of a file system, because
it doesn't support the kinds of updating and manipulations that a file
system has as part of its functions and how it performs.  Similar maybe, but
I wouldn't want to take it too far into corresponding to a file system.  A
closer historical match is with a partitioned data set or a linker library.

 - Dennis

-----Original Message-----
From: robert_weir at us.ibm.com [mailto:robert_weir at us.ibm.com] 
Sent: Tuesday, November 02, 2010 12:12
To: dennis.hamilton at acm.org
Cc: dennis.hamilton at acm.org; 'MURATA Makoto (FAMILY Given)'; 'ISO Zip';
sc34wg1study-bounces at vse.cz
Subject: RE: UTF-8 in ZIP

"Dennis E. Hamilton" <dennis.hamilton at acm.org> wrote on 11/02/2010 02:39:29 PM:

> 
> +1 with one wrinkle.
> 
> The names for Zipped parts were inspired by file-systems, although not
> tied to them (e.g., it is a flat namespace, not a hierarchical one,
> although hierarchical conventions can be carried in it, sort of like
> pushing HTML over SMTP mail).
> 

I was thinking, for example, how there is more than one way to specify 
contents in a directory.

So a ZIP with the following entries:

A
A/B

can represent the same logical file system as one that has only the single 
entry:

A/B

In other words, having an entry for a directory does not seem to be 
necessary, unless it is an empty directory.

I think if you described a virtual file system you would have it be a 
single-rooted hierarchy.  But there is more than one way to encode a 
logical A/B into a ZIP archive.
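
A small sketch of that equivalence in Python, assuming the common convention that a directory entry is written with a trailing slash ("A/"); the helpers make_zip and logical_tree are hypothetical names.  Both encodings reduce to the same set of directories and files.

```python
import io
import zipfile

def make_zip(names):
    """Build an in-memory archive; names ending in "/" are directory entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"" if name.endswith("/") else b"data")
    buf.seek(0)
    return buf

def logical_tree(archive):
    """Return (directories, files) implied by the entry names."""
    dirs, files = set(), set()
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            parts = [p for p in name.split("/") if p]
            # Every proper prefix of a path is an implied directory.
            for i in range(1, len(parts)):
                dirs.add("/".join(parts[:i]))
            if name.endswith("/"):
                dirs.add("/".join(parts))   # explicit directory entry
            else:
                files.add("/".join(parts))
    return dirs, files

with_dir_entry = logical_tree(make_zip(["A/", "A/B"]))
without_dir_entry = logical_tree(make_zip(["A/B"]))
with_empty_dir = logical_tree(make_zip(["A/B", "C/"]))
```

Only the empty directory "C" needs an entry of its own; "A" is implied by "A/B" either way.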

As you note, it gets more complicated with IRIs.  But that is something 
that the virtual (or "abstract" or "logical" if we prefer) file system can 
represent quite cleanly.

> The new wrinkle is that we also need something that maps from IRI/URIs,
> and not just a file: scheme.  (That is why I added the %-encoding notion,
> because it supports that cross-referencing among parts internal to a
> package via URI resolution, which is also tied to the ASCII-related code
> points.)
>

Certainly internal linking, but I think also links from outside the
package.  For example, I'd like to be able to express directly a link to an
image or an RDF/XML file inside an ODF document.

> At some point you have to deal with the low-level in order to achieve
> heterogeneous interoperability, assuming it is the raw package that gets
> shipped around.  If the abstraction is mapped differently by different
> producers, we have not done much for interop (unless we can share
> profiles that map the same way, which still comes down to what the bits
> are under the abstracted cases).
>

I'm arguing that we can compartmentalize the "dealing with the low-level" 
into one place.  So not every application-level package format needs to 
deal with it.

Or put another way, if I could snap my fingers and produce a logical file
system spec that also defined a canonical encoding in terms of ZIP, and it
supported UTF-8 names, entries larger than 2 GB, etc., can you think of
any reason why this would not be the preferred layer on which to specify
packaging for ODF, OOXML and EPUB?  Are there any "weird" ZIP dependencies
that would not be among the features ordinarily considered logical file
system features?  Maybe a per-item choice of compression algorithm?  But
that is trivial to model.
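
To illustrate how little is needed, here is a sketch in Python where the compression method is simply one more per-item attribute.  The function name write_items and the two entry names are chosen for illustration only.

```python
import io
import zipfile

def write_items(items):
    """items maps entry name -> (payload, compression method)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, (payload, method) in items.items():
            # The method is chosen per entry, not per archive.
            zf.writestr(name, payload, compress_type=method)
    buf.seek(0)
    return buf

archive = write_items({
    "mimetype": (b"application/example", zipfile.ZIP_STORED),
    "content.xml": (b"<doc>" + b"x" * 1000 + b"</doc>", zipfile.ZIP_DEFLATED),
})

with zipfile.ZipFile(archive) as zf:
    methods = {info.filename: info.compress_type for info in zf.infolist()}
```

In a logical file system model this would be a metadata field on each file, alongside its name and size.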

-Rob

>  - Dennis
> 
> PS: OPC is abstracted above Zip, with Zip just one binding.  That is how
> they handle such things as streaming single parts from a server, using
> multiple, parallel streams, etc., as a way of feeding a publishing system
> or an editor that does not need to have the package local in order to
> work on it.  It will be interesting to see what there is to learn from
> that, depending on where XPS is handled that way.  There are implications
> for collaborative work as well.
> 
> -----Original Message-----
> From: robert_weir at us.ibm.com [mailto:robert_weir at us.ibm.com] 
> Sent: Tuesday, November 02, 2010 10:15
> To: dennis.hamilton at acm.org
> Cc: 'MURATA Makoto (FAMILY Given)'; 'ISO Zip';
> sc34wg1study-bounces at vse.cz
> Subject: RE: UTF-8 in ZIP
> 
> sc34wg1study-bounces at vse.cz wrote on 11/02/2010 12:21:14 PM:
> > 
> > sc34wg1study-bounces at vse.cz
> > 
> > The main difficulty is that the default situation in Zip is single-byte
> > encoding and a presumed single-byte code page in the filename entry.
> > This clashes with use of UTF-8 for any Unicode code points that do not
> > map to 7-bit ASCII (bit 8 = 0), where the UTF-8 is essentially
> > single-byte ASCII.
> >
> > There is an Appendix about this in versions of the App Note more recent
> > than 6.2.0.
> >
> > Of course, if we introduced %-encoding of other UTF-8 sequences (say,
> > using the IRI collapse to URI mapping), it would fit that practice and
> > we would be within the sweet spot that Zip has traditionally supported
> > cross-platform.
> > 
> >  - Dennis
> > 
> 
> I think you are exactly right.  But if we wanted, in a profile standard,
> we could require that a Processor permit UTF-8 on input, convert it
> canonically when encoded for ZIP, and that it return UTF-8 on output.
>
> This is why I suggested we need an abstraction of a file system and then
> to map ZIP constructs on to that.  That mapping is where we deal with
> file names, how to deal with empty folders, etc.  I don't think we want
> to write a document packaging spec directly with respect to low level ZIP
> operations.  Better to write to the file system abstraction and have it
> define how this is canonically mapped to ZIP artifacts.
> 
> [ ... ]
> 
> In other words, I'm proposing a multi-level approach that looks like
> this:
>
> 1) Encoding = ZIP or whatever.  The bits.
>
> 2) Abstract file system describes files, directories and related
> metadata, can be canonically encoded in a particular encoding, e.g., ZIP.
> The UTF-8 issue is resolved here.
>
> 3) Package format, built on abstract file system with specific defined
> content, perhaps related to manifest, relationships, encryption, digital
> signatures, etc.
>
> 4) Document Format = ODF, OOXML, EPUB
>
> I think we use 1) via an RER.  For 2, we need to define that ourselves.
> For 3, we'll need to discuss what is possible, probably based on the
> commonality (to the extent there is any) among the document formats
> today.  And 4 is already done, but in revision could take advantage of
> the work we have in the lower levels.
> 
> But the meat of it is in levels 2 and 3.  There are a lot of benefits
> that would come from standardizing those, especially to the stated
> stakeholders of this effort, e.g., OASIS, Ecma, IDPF, etc.  That's why I
> think we squander our opportunity to the extent we dwell on the likely
> unachievable standardization of the low level ZIP encoding.
> 
> 
> -Rob
>