UTF-8 in ZIP

robert_weir at us.ibm.com
Tue Nov 2 20:12:08 CET 2010


"Dennis E. Hamilton" <dennis.hamilton at acm.org> wrote on 11/02/2010 
02:39:29 PM:

> 
> +1 with one wrinkle.
> 
> The names for Zipped parts were inspired by file-systems, although not tied
> to them (e.g., it is a flat namespace, not a hierarchical one, although
> hierarchical conventions can be carried in it, sort of like pushing HTML
> over SMTP mail).
> 

I was thinking, for example, how there is more than one way to specify 
contents in a directory.

So a ZIP with the following entries:

A
A/B

can represent the same logical file system as one that has only the single 
entry:

A/B

In other words, having an entry for a directory does not seem to be 
necessary, unless it is an empty directory.

I think if you described a virtual file system you would have it be a 
single-rooted hierarchy.  But there is more than one way to encode a 
logical A/B into a ZIP archive.
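The equivalence of the two encodings can be sketched in a few lines of Python. This is only an illustration, not a proposed canonical mapping: the helper names `build_zip` and `logical_tree` are hypothetical, and it assumes the common convention that a directory entry is written with a trailing slash ("A/").

```python
import io
import zipfile


def build_zip(names):
    """Build an in-memory ZIP whose entries have the given names (empty payloads)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()


def logical_tree(zip_bytes):
    """Return (directories, files) implied by the entry names in the archive."""
    dirs, files = set(), set()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            is_dir = name.endswith("/")  # assumed directory-entry convention
            parts = [p for p in name.split("/") if p]
            # Every proper prefix of a path is an implied directory.
            for i in range(1, len(parts)):
                dirs.add("/".join(parts[:i]))
            if parts:
                (dirs if is_dir else files).add("/".join(parts))
    return dirs, files


with_dir_entry = logical_tree(build_zip(["A/", "A/B"]))
without_dir_entry = logical_tree(build_zip(["A/B"]))
print(with_dir_entry == without_dir_entry)
```

Under these assumptions both archives reduce to the same logical tree, and an explicit directory entry only adds information when the directory is empty.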

As you note, it gets more complicated with IRIs.  But that is something 
that the virtual (or "abstract" or "logical" if we prefer) file system can 
represent quite cleanly.

> The new wrinkle is that we also need something that maps from IRI/URIs, and
> not just a file: scheme.  (That is why I added the %-encoding notion,
> because it supports that cross-referencing among parts internal to a package
> via URI resolution, which is also tied to the ASCII-related code points.)
>

Certainly internal linking, but also I think links from externally.  For 
example, I'd like to be able to express directly a link to an image or an 
RDF XML inside an ODF document. 
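
As a rough illustration of the %-encoding notion quoted above, a non-ASCII part name can be reduced to an ASCII-only form and recovered again. This sketch uses Python's `urllib.parse`; the function names and the sample part name are hypothetical. It is also a simplification: a "%" already present in the name is itself escaped here, which a real IRI-to-URI mapping may treat differently.

```python
from urllib.parse import quote, unquote


def part_name_to_ascii(name):
    """Percent-encode the UTF-8 bytes of every non-ASCII or reserved character, keeping '/'."""
    return quote(name, safe="/")


def ascii_to_part_name(encoded):
    """Reverse the mapping, decoding the percent-escapes as UTF-8."""
    return unquote(encoded)


original = "Pictures/café.png"  # hypothetical part name
encoded = part_name_to_ascii(original)
print(encoded)
print(ascii_to_part_name(encoded) == original)
```

The encoded form stays within 7-bit ASCII, so it can serve both as a ZIP entry name and as the path component of a link into the package.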

 
> At some point you have to deal with the low-level in order to achieve
> heterogeneous interoperability, assuming it is the raw package that gets
> shipped around.  If the abstraction is mapped differently by different
> producers, we have not done much for interop (unless we can share profiles
> that map the same way, which still comes down to what the bits are under
> the abstracted cases).
> 

I'm arguing that we can compartmentalize the "dealing with the low-level" 
into one place.  So not every application-level package format needs to 
deal with it.

Or put another way, if I could snap my fingers and produce a logical file 
system spec that also defined a canonical encoding in terms of ZIP, and it 
supported UTF-8 names and greater than 2GB entries, etc., can you think of 
any reason why this would not be the preferred layer to specify packaging 
for ODF, OOXML, EPUB on top of?  Are there any "weird" ZIP dependencies 
that would not be among the features that would ordinarily be considered 
logical file system features? 
Maybe a per-item choice of compression algorithm?  But that is trivial to 
model.
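
To give a sense of how little of ZIP leaks through, here is a sketch using Python's `zipfile` module in which the logical model carries only a name, the content and a per-item compression choice. The helper names and the sample entries are hypothetical; the one ZIP-level detail it inspects is general purpose bit 11 (0x800), which marks an entry name as UTF-8.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: entry name is UTF-8


def write_package(entries):
    """Write (name, data, compression method) triples into an in-memory ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data, method in entries:
            zf.writestr(name, data, compress_type=method)
    return buf.getvalue()


def describe(zip_bytes):
    """Map each entry name to (compression method, UTF-8 name flag set?)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: (info.compress_type,
                            bool(info.flag_bits & UTF8_NAME_FLAG))
            for info in zf.infolist()
        }


# Hypothetical package contents, for illustration only.
package = write_package([
    ("mimetype", b"application/example", zipfile.ZIP_STORED),
    ("content.xml", b"<doc/>" * 100, zipfile.ZIP_DEFLATED),
    ("Pictures/café.png", b"\x89PNG", zipfile.ZIP_STORED),
])
for name, (method, utf8_name) in describe(package).items():
    print(name, method, utf8_name)
```

The caller never touches headers or flags directly; the library sets the UTF-8 flag only for the entry whose name needs it, which is the kind of detail a canonical mapping would pin down once for every package format.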

-Rob

>  - Dennis
> 
> PS: OPC is abstracted above Zip, with Zip just one binding.  That is how
> they handle such things as streaming single parts from a server, using
> multiple, parallel streams, etc., as a way of feeding a publishing system or
> an editor that does not need to have the package local in order to work on
> it.  It will be interesting to see what there is to learn from that,
> depending on where XPS is handled that way.  There are implications for
> collaborative work as well.
> 
> -----Original Message-----
> From: robert_weir at us.ibm.com [mailto:robert_weir at us.ibm.com] 
> Sent: Tuesday, November 02, 2010 10:15
> To: dennis.hamilton at acm.org
> Cc: 'MURATA Makoto (FAMILY Given)'; 'ISO Zip'; sc34wg1study-bounces at vse.cz
> Subject: RE: UTF-8 in ZIP
> 
> sc34wg1study-bounces at vse.cz wrote on 11/02/2010 12:21:14 PM:
> > 
> > sc34wg1study-bounces at vse.cz
> > 
> > The main difficulty is that the default situation in Zip is single-byte
> > encoding and a presumed single-byte code page in the filename entry.  This
> > clashes with use of UTF-8 for any Unicode code points that do not map to
> > 7-bit ASCII (bit 8 = 0), where the UTF-8 is essentially single-byte ASCII.
> > 
> > There is an Appendix about this in versions of the App Note more recent than
> > 6.2.0.
> > 
> > Of course, if we introduced %-encoding of other UTF-8 sequences (say, using
> > the IRI collapse to URI mapping), it would fit that practice and we would be
> > within the sweet spot that Zip has traditionally supported cross-platform.
> > 
> >  - Dennis
> > 
> 
> I think you are exactly right.  But if we wanted, in a profile standard,
> we could require that a Processor permit UTF-8 on input, convert it
> canonically when encoded for ZIP, and that it return UTF-8 on output.
> 
> This is why I suggested we need an abstraction of a file system and then
> to map ZIP constructs on to that.  That mapping is where we deal with file
> names, how to deal with empty folders, etc.  I don't think we want to
> write a document packaging spec directly with respect to low level ZIP
> operations.  Better to write to the file system abstraction and have it
> define how this is canonically mapped to ZIP artifacts.
> 
> [ ... ]
> 
> In other words, I'm proposing a multi-level approach that looks like this:
> 
> 1) Encoding = ZIP or whatever.  The bits.
> 
> 2) Abstract file system describes files, directories and related metadata,
> can be canonically encoded in a particular encoding, e.g., ZIP.  The UTF-8
> issue is resolved here.
> 
> 3) Package format, built on abstract file system with specific defined
> content, perhaps related to manifest, relationships, encryption, digital
> signatures, etc.
> 
> 4) Document Format = ODF, OOXML, EPUB
> 
> I think we use 1) via an RER.  For 2, we need to define that ourselves.
> For 3, we'll need to discuss what is possible, probably based on the
> commonality (to the extent there is any) among the document formats today.
> And 4 is already done, but in revision could take advantage of the work
> we have in the lower levels.
> 
> But the meat of it is in levels 2 and 3.  There are a lot of benefits that
> would come from standardizing those, especially to the stated stakeholders
> of this effort, e.g., OASIS, Ecma, IDPF, etc.  That's why I think we
> squander our opportunity to the extent we dwell on the likely unachievable
> standardization of the low level ZIP encoding.
> 
> 
> -Rob
> 


