UTF-8 in ZIP

Dennis E. Hamilton dennis.hamilton at acm.org
Tue Nov 2 19:39:29 CET 2010

+1 with one wrinkle.

The names for Zipped parts were inspired by file-systems, although not tied
to them (e.g., it is a flat namespace, not a hierarchical one, although
hierarchical conventions can be carried in it, sort of like pushing HTML
over SMTP mail).

The new wrinkle is that we also need something that maps from IRI/URIs, and
not just a file: scheme.  (That is why I added the %-encoding notion,
because it supports that cross-referencing among parts internal to a package
via URI resolution, which is also tied to the ASCII-related code points.)
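
As a rough illustration of the %-encoding notion (a sketch only, not text from
the App Note or any draft): the helper names below are hypothetical, and
keeping "/" unescaped is an assumption.  Note that urllib's quote() also
escapes some ASCII characters such as spaces, which goes beyond a strict
IRI-to-URI mapping.

```python
from urllib.parse import quote, unquote


def part_name_to_ascii(name: str) -> str:
    """Hypothetical helper: %-encode the UTF-8 bytes of non-ASCII characters."""
    # safe="/" leaves the path-like separator readable (an assumption).
    return quote(name, safe="/", encoding="utf-8")


def ascii_to_part_name(stored: str) -> str:
    """Hypothetical inverse: decode %-escapes back to the Unicode name."""
    return unquote(stored, encoding="utf-8")


stored = part_name_to_ascii("content/r\u00e9sum\u00e9.xml")
print(stored)                      # pure 7-bit ASCII, safe for a legacy Zip name
print(ascii_to_part_name(stored))  # the original Unicode name
```

The stored form uses only ASCII code points, so it does not depend on which
single-byte code page a Zip reader presumes.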

At some point you have to deal with the low-level in order to achieve
heterogeneous interoperability, assuming it is the raw package that gets
shipped around.  If the abstraction is mapped differently by different
producers, we have not done much for interop (unless we can share profiles
that map the same way, which still comes down to what the bits are under the
abstracted cases).

 - Dennis

PS: OPC is abstracted above Zip, with Zip just one binding.  That is how
they handle such things as streaming single parts from a server, using
multiple, parallel streams, etc., as a way of feeding a publishing system or
an editor that does not need to have the package local in order to work on
it.  It will be interesting to see what there is to learn from that,
depending on where XPS is handled that way.  There are implications for
collaborative work as well.

-----Original Message-----
From: robert_weir at us.ibm.com [mailto:robert_weir at us.ibm.com] 
Sent: Tuesday, November 02, 2010 10:15
To: dennis.hamilton at acm.org
Cc: 'MURATA Makoto (FAMILY Given)'; 'ISO Zip'; sc34wg1study-bounces at vse.cz
Subject: RE: UTF-8 in ZIP

sc34wg1study-bounces at vse.cz wrote on 11/02/2010 12:21:14 PM:
> The main difficulty is that the default situation in Zip is single-byte
> encoding and a presumed single-byte code page in the filename entry.  That
> clashes with use of UTF-8 for any Unicode code points that do not map to
> 7-bit ASCII (bit 8 = 0), where the UTF-8 is essentially single-byte.
> There is an Appendix about this in versions of the App Note more recent
> than 6.2.0.
> Of course, if we introduced %-encoding of other UTF-8 sequences (say,
> the IRI collapse to URI mapping), it would fit that practice and we would be
> within the sweet spot that Zip has traditionally supported.
>  - Dennis

I think you are exactly right.  But if we wanted, in a profile standard, 
we could require that a Processor permit UTF-8 on input, convert it 
canonically when encoded for ZIP, and that it return UTF-8 on output. 
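
To make that concrete, here is a minimal sketch of such a Processor, assuming
(purely for illustration) that the canonical conversion is NFC normalization
followed by %-encoding of the UTF-8 bytes.  The function names are
hypothetical; only the zipfile, unicodedata and urllib calls are real library
APIs.

```python
import io
import unicodedata
import zipfile
from urllib.parse import quote, unquote


def write_part(zf: zipfile.ZipFile, name: str, data: bytes) -> str:
    """Accept a Unicode name, store it under an ASCII-only canonical form."""
    # Assumed canonical form: NFC, then %-encode everything outside ASCII.
    stored = quote(unicodedata.normalize("NFC", name), safe="/")
    zf.writestr(stored, data)
    return stored


def read_parts(zf: zipfile.ZipFile) -> dict:
    """Return the parts keyed by their decoded Unicode names."""
    return {unquote(info.filename): zf.read(info) for info in zf.infolist()}


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # The input spells each accented "e" as "e" + combining acute (decomposed).
    stored_name = write_part(zf, "docs/re\u0301sume\u0301.xml", b"<r/>")

with zipfile.ZipFile(buf) as zf:
    print(stored_name)
    print(read_parts(zf))
```

Because the stored name is pure ASCII, the writer has no need to set the
UTF-8 filename flag (general purpose bit 11), so older readers see the same
bytes.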

This is why I suggested we need an abstraction of a file system and then 
to map ZIP constructs on to that.  That mapping is where we deal with file 
names, how to deal with empty folders, etc.  I don't think we want to 
write a document packaging spec directly with respect to low level ZIP 
operations.  Better to write to the file system abstraction and have it 
define how this is canonically mapped to ZIP artifacts. 

[ ... ]

In other words, I'm proposing a multi-level approach that looks like this:

1) Encoding = ZIP or whatever.  The bits.

2) Abstract file system: describes files, directories and related metadata, 
and can be canonically encoded in a particular encoding, e.g., ZIP.  The 
UTF-8 issue is resolved here.

3) Package format, built on abstract file system with specific defined 
content, perhaps related to manifest, relationships, encryption, digital 
signatures, etc.

4) Document Format = ODF, OOXML, EPUB
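
A rough sketch of how levels 1 and 2 might relate, under stated assumptions:
the AbstractFS class and the encode_zip/decode_zip functions are hypothetical
names, and "canonical" here means only a sorted entry order and a fixed
timestamp, which is one possible choice rather than a proposal.

```python
import io
import zipfile
from dataclasses import dataclass, field

FIXED_TIME = (1980, 1, 1, 0, 0, 0)  # assumed constant timestamp for every entry


@dataclass
class AbstractFS:
    """Level 2: files and (possibly empty) directories, keyed by Unicode path."""
    files: dict = field(default_factory=dict)  # path -> bytes
    dirs: set = field(default_factory=set)     # explicitly recorded directories


def encode_zip(fs: AbstractFS) -> bytes:
    """Level 1 binding: write the abstract file system as ZIP bits."""
    entries = [(d.rstrip("/") + "/", b"") for d in fs.dirs]
    entries += list(fs.files.items())
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in sorted(entries):  # fixed order, independent of input
            zf.writestr(zipfile.ZipInfo(name, date_time=FIXED_TIME), data)
    return buf.getvalue()


def decode_zip(raw: bytes) -> AbstractFS:
    """Map ZIP entries back onto the abstraction (trailing "/" = directory)."""
    fs = AbstractFS()
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        for info in zf.infolist():
            if info.filename.endswith("/"):
                fs.dirs.add(info.filename.rstrip("/"))
            else:
                fs.files[info.filename] = zf.read(info)
    return fs
```

Levels 3 and 4 would then be written purely against AbstractFS, with no
reference to ZIP entries at all.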

I think we use 1) via an RER.  For 2, we need to define that ourselves. 
For 3, we'll need to discuss what is possible, probably based on the 
commonality (to the extent there is any) among the document formats today. 
And 4 is already done, but in revision could take advantage of the work 
we have in the lower levels.

But the meat of it is in levels 2 and 3.  There are a lot of benefits that 
would come from standardizing those, especially to the stated stakeholders 
of this effort, e.g., OASIS, Ecma, IDPF, etc.  That's why I think we 
squander our opportunity to the extent we dwell on the likely unachievable 
standardization of the low level ZIP encoding.

