UTF-8 in ZIP

robert_weir at us.ibm.com robert_weir at us.ibm.com
Tue Nov 2 18:15:20 CET 2010


sc34wg1study-bounces at vse.cz wrote on 11/02/2010 12:21:14 PM:
> 
> The main difficulty is that the default situation in Zip is single-byte
> encoding and a presumed single-byte code page in the filename entry. 
> This clashes with use of UTF-8 for any Unicode code points that do not
> map to 7-bit ASCII (bit 8 = 0), where the UTF-8 is essentially
> single-byte ASCII. 
> 
> There is an Appendix about this in versions of the App Note more recent
> than 6.2.0.
> 
> Of course, if we introduced %-encoding of other UTF-8 sequences (say,
> using the IRI collapse to URI mapping), it would fit that practice and
> we would be within the sweet spot that Zip has traditionally supported
> cross-platform.
> 
>  - Dennis
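[Dennis's %-encoding idea can be sketched with the standard IRI-to-URI mapping: percent-encode the UTF-8 bytes of any character outside the unreserved ASCII set, so the stored name is pure 7-bit ASCII. A minimal illustration using Python's urllib.parse; the filename is hypothetical:

```python
from urllib.parse import quote, unquote

name = "résumé.odt"          # hypothetical non-ASCII entry name
encoded = quote(name)        # percent-encodes the UTF-8 bytes of "é"
# encoded is "r%C3%A9sum%C3%A9.odt"
assert all(ord(c) < 128 for c in encoded)  # safe for a single-byte name field
assert unquote(encoded) == name            # lossless round trip
```

This stays inside the 7-bit ASCII sweet spot ZIP has always handled, at the cost of names that are only human-readable after decoding. -R]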

I think you are exactly right.  But if we wanted, in a profile standard, 
we could require that a Processor permit UTF-8 on input, convert it 
canonically when encoded for ZIP, and that it return UTF-8 on output. 
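As a concrete sketch of that requirement, here is how one existing processor behaves (Python's zipfile module, which implements the language encoding flag, general purpose bit 11, from the App Note appendix): a UTF-8 name is accepted on input, the entry is marked as UTF-8 when encoded, and the same name comes back on output.

```python
import io
import zipfile

# General purpose bit 11: filename and comment are encoded in UTF-8
# (the "language encoding flag" described in the App Note appendix).
UTF8_FLAG = 0x800

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # Non-ASCII name on input; zipfile stores it as UTF-8 and sets bit 11.
    zf.writestr("résumé.odt", b"...")

with zipfile.ZipFile(buf) as zf:
    info = zf.infolist()[0]
    assert info.flag_bits & UTF8_FLAG      # marked as UTF-8 in the entry
    assert info.filename == "résumé.odt"   # UTF-8 round-trips on output
```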

This is why I suggested we need an abstraction of a file system and then 
to map ZIP constructs onto that.  That mapping is where we handle file 
names, empty folders, etc.  I don't think we want to write a document 
packaging spec directly in terms of low-level ZIP operations.  Better to 
write to the file system abstraction and have it define how that is 
canonically mapped to ZIP artifacts. 
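One way to picture that split (a hypothetical sketch, not a proposed API: the class and function names are my own invention): the abstraction owns the naming policy, and a separate canonical encoding step produces the ZIP artifacts.

```python
import io
import posixpath
import unicodedata
import zipfile
from typing import Dict

class AbstractFileSystem:
    """Hypothetical abstraction: names are Unicode, paths use '/'."""
    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        # Canonicalization policy lives here, not in the ZIP layer:
        # normalize path separators and the Unicode form so one logical
        # name has exactly one representation.
        canonical = unicodedata.normalize("NFC", posixpath.normpath(path))
        self.files[canonical] = data

def encode_to_zip(fs: AbstractFileSystem) -> bytes:
    """Canonical mapping from the abstraction onto ZIP artifacts."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for path in sorted(fs.files):   # deterministic entry order
            zf.writestr(path, fs.files[path])
    return buf.getvalue()

fs = AbstractFileSystem()
fs.write("Pictures/naïve.png", b"...")
archive = encode_to_zip(fs)
assert zipfile.ZipFile(io.BytesIO(archive)).namelist() == ["Pictures/naïve.png"]
```

A spec written against AbstractFileSystem never mentions local headers or code pages; only encode_to_zip has to.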

This also permits alternative package encodings.  Remember, ZIP was chosen 
for ODF quite a long time ago, based on estimates of the performance 
tradeoffs for common package operations: replacing, appending, deleting, 
etc.  ZIP was the best choice for the use cases circa 2000, which were 
mainly disk-oriented, random-access patterns.  However, it isn't 
particularly well suited to the types of streaming operations common on 
the web today.  For example, when downloading a 50MB presentation 
document, I'd like to get to the thumbnail image, or the metadata XML, 
ASAP, and not wait for the entire package to download.  If we're going to 
invest in a modeling effort around document packages, I'd like to do it 
in a flexible, modular way that does not presume that we're only 
optimizing for one thing.  So let's end the ZIP fixation.  Today maybe 
ODF is "ISO virtual package 1.0 with ZIP archive encoding" but maybe 
tomorrow we would use (hypothetically) "ISO virtual package 1.0 with 
OpenStream encoding".
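The streaming point follows directly from ZIP's layout: the central directory, which a reader needs to enumerate entries reliably, sits at the end of the archive, so a consumer generally cannot open the package until the tail has arrived. A small in-memory demonstration:

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Thumbnails/thumbnail.png", b"x" * 1000)
    zf.writestr("meta.xml", b"<metadata/>")
data = buf.getvalue()

# The End of Central Directory record (signature PK\x05\x06) is at the
# tail, so finding the entry list requires the end of the stream.
eocd_offset = data.rfind(b"PK\x05\x06")
assert eocd_offset > len(data) // 2

# A download truncated by even a few bytes is unreadable as a ZIP,
# although the first entry's bytes are fully present at the front.
try:
    zipfile.ZipFile(io.BytesIO(data[: len(data) - 10]))
    assert False, "expected failure"
except zipfile.BadZipFile:
    pass
```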

In other words, I'm proposing a multi-level approach that looks like this:

1) Encoding = ZIP or whatever.  The bits.

2) Abstract file system = describes files, directories, and related 
metadata; can be canonically encoded in a particular encoding, e.g., ZIP. 
The UTF-8 issue is resolved here.

3) Package format, built on abstract file system with specific defined 
content, perhaps related to manifest, relationships, encryption, digital 
signatures, etc.

4) Document Format = ODF, OOXML, EPUB

I think we use 1) via an RER.  For 2, we need to define that ourselves. 
For 3, we'll need to discuss what is possible, probably based on the 
commonality (to the extent there is any) among the document formats today. 
And 4 is already done, but in revision could take advantage of the work 
we have in the lower levels.

But the meat of it is in levels 2 and 3.  There are a lot of benefits that 
would come from standardizing those, especially for the stated 
stakeholders of this effort, e.g., OASIS, Ecma, IDPF, etc.  That's why I 
think we squander our opportunity to the extent we dwell on the likely 
unachievable standardization of the low-level ZIP encoding.


-Rob
