UTF-8 in ZIP
robert_weir at us.ibm.com
Tue Nov 2 18:15:20 CET 2010
sc34wg1study-bounces at vse.cz wrote on 11/02/2010 12:21:14 PM:
> The main difficulty is that the default situation in Zip is single-byte
> encoding and a presumed single-byte code page in the filename entry. That
> clashes with the use of UTF-8 for any Unicode code points that do not map
> to 7-bit ASCII (bit 8 = 0), where the UTF-8 is essentially single-byte.
> There is an Appendix about this in versions of the App Note more recent
> Of course, if we introduced %-encoding of other UTF-8 sequences (say,
> the IRI collapse-to-URI mapping), it would fit that practice and we would
> stay within the sweet spot that Zip has traditionally supported
> - Dennis
I think you are exactly right. But if we wanted, in a profile standard,
we could require that a Processor permit UTF-8 on input, convert it
canonically when encoded for ZIP, and that it return UTF-8 on output.
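To make the %-encoding idea concrete: restricting stored names to 7-bit ASCII, along the lines Dennis suggests, can reuse the same escaping machinery as the IRI-to-URI mapping. A minimal sketch in Python (the function names are mine, not anything from the App Note or a proposed spec):

```python
from urllib.parse import quote, unquote

def encode_zip_name(name: str) -> str:
    """Percent-encode a Unicode file name so that only 7-bit ASCII
    bytes appear in the ZIP filename field (IRI-to-URI style).
    '/' is kept literal so the path structure survives."""
    return quote(name, safe="/")

def decode_zip_name(stored: str) -> str:
    """Recover the original UTF-8 name on output."""
    return unquote(stored)

stored = encode_zip_name("Pictures/größe.png")
assert stored.isascii()                         # safe for a legacy ZIP reader
assert decode_zip_name(stored) == "Pictures/größe.png"
```

Note that `quote` also escapes `%` itself (to `%25`), so the round trip is unambiguous even for names that happen to contain percent signs.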
This is why I suggested we need an abstraction of a file system and then
to map ZIP constructs onto that. That mapping is where we deal with file
names, empty folders, etc. I don't think we want to write a document
packaging spec directly in terms of low-level ZIP operations. Better to
write to the file system abstraction and have it define how that
abstraction is canonically mapped to ZIP artifacts.
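As a sketch of what that layering might look like (all names here are hypothetical, not a proposed API): the document layer talks only to put/get, and only the encoding step knows ZIP exists. Name canonicalization, here assumed to be Unicode NFC with '/' separators, happens once, at this layer.

```python
import io
import unicodedata
import zipfile

class PackageFileSystem:
    """Hypothetical abstract-file-system layer. Document formats talk
    to put/get; only encode_to_zip knows anything about ZIP."""

    def __init__(self):
        self._entries = {}  # canonical name -> bytes

    def _canonical(self, name: str) -> str:
        # Settle the name question once, here: Unicode NFC, '/' separators.
        # This is where "the UTF-8 issue is resolved".
        return unicodedata.normalize("NFC", name).replace("\\", "/")

    def put(self, name: str, data: bytes) -> None:
        self._entries[self._canonical(name)] = data

    def get(self, name: str) -> bytes:
        return self._entries[self._canonical(name)]

    def encode_to_zip(self, target) -> None:
        # One canonical ZIP encoding of the abstraction. (CPython's
        # zipfile stores non-ASCII names as UTF-8 with the language
        # encoding flag, bit 11, set.)
        with zipfile.ZipFile(target, "w") as zf:
            for name, data in self._entries.items():
                zf.writestr(name, data)

fs = PackageFileSystem()
fs.put("Bilder/größe.png", b"\x89PNG")
buf = io.BytesIO()
fs.encode_to_zip(buf)
```

A different `encode_to_*` method for another container would leave the document layer untouched, which is the point of the abstraction.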
This also permits alternative package encodings. Remember, ZIP was chosen
for ODF quite a long time ago, based on estimates of the performance
tradeoffs for common package operations: replacing, appending, deleting,
etc. ZIP was the best choice for the use cases circa 2000, which were
mainly disk-oriented, random-access use patterns.
However, it is not well suited to the streaming operations common on the
web today. For example, when downloading a 50 MB presentation document,
I'd like to get to the thumbnail image or the metadata XML as soon as
possible, not wait for the entire package to download. If we're going to
invest in a modeling effort around document packages, I'd like to do it in
a flexible, modular way that does not presume we're optimizing for only
one thing. So let's end the ZIP fixation. Today
maybe ODF is "ISO virtual package 1.0 with ZIP archive encoding" but maybe
tomorrow we would use (hypothetically) "ISO virtual package 1.0 with
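The tension is easy to see in code. On a local, seekable file, pulling a single member out of a ZIP is cheap, which is exactly the access pattern ZIP was chosen for; but a reader must first locate the central directory at the end of the archive, so a streaming consumer cannot get at the thumbnail early without either the whole file or byte-range access. A toy illustration (the member names are made up):

```python
import io
import zipfile

# Build a toy "presentation" package in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Thumbnails/thumbnail.png", b"tiny preview")
    zf.writestr("content.xml", b"<office:document/>" * 1000)

# With the whole (seekable) archive in hand, one member is cheap to read:
# zipfile seeks to the central directory at the end, then jumps straight
# to the member's local header. A byte-at-a-time HTTP stream cannot do
# that seek, which is the streaming complaint above.
with zipfile.ZipFile(buf) as zf:
    thumb = zf.read("Thumbnails/thumbnail.png")
assert thumb == b"tiny preview"
```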
In other words, I'm proposing a multi-level approach that looks like this:
1) Encoding = ZIP or whatever. The bits.
2) Abstract file system: describes files, directories, and related
metadata, and can be canonically encoded in a particular encoding, e.g.,
ZIP. The UTF-8 issue is resolved here.
3) Package format: built on the abstract file system with specific defined
content, perhaps related to manifests, relationships, encryption, digital
signatures.
4) Document Format = ODF, OOXML, EPUB
I think we use 1) via an RER. For 2, we need to define that ourselves.
For 3, we'll need to discuss what is possible, probably based on the
commonality (to the extent there is any) among the document formats today.
And 4 is already done, but a revision could take advantage of the work
at the lower levels.
But the meat of it is in levels 2 and 3. There are many benefits that
would come from standardizing those, especially for the stated
stakeholders of this effort, e.g., OASIS, Ecma, IDPF, etc. That's why I
think we squander our opportunity to the extent we dwell on the likely
unachievable standardization of the low-level ZIP encoding.