UTF-8 in ZIP
Dennis E. Hamilton
dennis.hamilton at acm.org
Tue Nov 2 19:39:29 CET 2010
+1 with one wrinkle.
The names for Zipped parts were inspired by file-systems, although not tied
to them (e.g., it is a flat namespace, not a hierarchical one, although
hierarchical conventions can be carried in it, sort of like pushing HTML
over SMTP mail).
The new wrinkle is that we also need something that maps from IRI/URIs, and
not just a file: scheme. (That is why I added the %-encoding notion,
because it supports that cross-referencing among parts internal to a package
via URI resolution, which is also tied to the ASCII-related code points.)
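To make the %-encoding notion concrete, here is a minimal sketch in Python. It assumes the standard-library urllib.parse functions; the helper names and the part name are made up for illustration and come from no spec. Note that quote() also escapes some ASCII characters (a space, for example), so it is somewhat stricter than the IRI-to-URI mapping proper.

```python
from urllib.parse import quote, unquote


def iri_path_to_uri_path(name: str) -> str:
    """Hypothetical helper: %-encode the UTF-8 bytes of each non-ASCII
    character, keeping '/' so hierarchical conventions survive."""
    return quote(name, safe="/", encoding="utf-8")


def uri_path_to_iri_path(name: str) -> str:
    """Hypothetical inverse: decode %-escapes back to Unicode."""
    return unquote(name, encoding="utf-8")


part = "Bilder/größe.png"  # made-up part name
encoded = iri_path_to_uri_path(part)
print(encoded)                        # Bilder/gr%C3%B6%C3%9Fe.png
print(uri_path_to_iri_path(encoded))  # Bilder/größe.png
```

The encoded form is pure 7-bit ASCII, so it reads the same under UTF-8 and under any ASCII-compatible single-byte code page.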
At some point you have to deal with the low-level in order to achieve
heterogeneous interoperability, assuming it is the raw package that gets
shipped around. If the abstraction is mapped differently by different
producers, we have not done much for interop (unless we can share profiles
that map the same way, which still comes down to what the bits are under the
abstracted cases).
- Dennis
PS: OPC is abstracted above Zip, with Zip just one binding. That is how
they handle such things as streaming single parts from a server, using
multiple, parallel streams, etc., as a way of feeding a publishing system or
an editor that does not need to have the package local in order to work on
it. It will be interesting to see what there is to learn from that,
depending on where XPS is handled that way. There are implications for
collaborative work as well.
-----Original Message-----
From: robert_weir at us.ibm.com [mailto:robert_weir at us.ibm.com]
Sent: Tuesday, November 02, 2010 10:15
To: dennis.hamilton at acm.org
Cc: 'MURATA Makoto (FAMILY Given)'; 'ISO Zip'; sc34wg1study-bounces at vse.cz
Subject: RE: UTF-8 in ZIP
sc34wg1study-bounces at vse.cz wrote on 11/02/2010 12:21:14 PM:
> The main difficulty is that the default situation in Zip is single-byte
> encoding and a presumed single-byte code page in the filename entry. This
> clashes with use of UTF-8 for any Unicode code points that do not map to
> 7-bit ASCII (bit 8 = 0), where the UTF-8 is essentially single-byte ASCII.
>
> There is an Appendix about this in versions of the App Note more recent than
> 6.2.0.
>
> Of course, if we introduced %-encoding of other UTF-8 sequences (say, using
> the IRI collapse to URI mapping), it would fit that practice and we would be
> within the sweet spot that Zip has traditionally supported cross-platform.
>
> - Dennis
>
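The clash described in the quoted text can be seen with a small Python sketch. It assumes Python's standard zipfile module, which (as far as I know) marks non-ASCII names by setting general purpose bit 11, the UTF-8 flag covered in the later App Note appendix; the file names and the helper are made up for illustration.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: entry name is UTF-8

name = "résumé.txt"            # made-up file name
raw = name.encode("utf-8")     # bytes a UTF-8 producer stores
misread = raw.decode("cp437")  # what a code-page-437 consumer displays

# Pure ASCII names are byte-identical either way, so they never clash.
assert "resume.txt".encode("utf-8") == "resume.txt".encode("cp437")


def name_flags(entry_name: str) -> int:
    """Hypothetical helper: write one empty entry, read its flags back."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(entry_name, b"")
    with zipfile.ZipFile(buf) as zf:
        return zf.infolist()[0].flag_bits


print(misread != name)                            # True: the name is garbled
print(bool(name_flags("resume.txt") & UTF8_FLAG))  # False
print(bool(name_flags("résumé.txt") & UTF8_FLAG))  # True
```

A consumer that ignores the flag and assumes a code page sees the garbled form, which is the interop problem in a nutshell.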
I think you are exactly right. But if we wanted, in a profile standard,
we could require that a Processor permit UTF-8 on input, convert it
canonically when encoded for ZIP, and that it return UTF-8 on output.
This is why I suggested we need an abstraction of a file system and then
to map ZIP constructs on to that. That mapping is where we deal with file
names, how to deal with empty folders, etc. I don't think we want to
write a document packaging spec directly with respect to low level ZIP
operations. Better to write to the file system abstraction and have it
define how this is canonically mapped to ZIP artifacts.
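As a rough sketch of what such a mapping could look like, the Python below treats the abstraction as a flat mapping from Unicode names to bytes and picks %-encoding as the canonical form on the ZIP side. That choice, and the pack/unpack names, are assumptions for illustration only; a real profile would have to decide these points (and empty folders, metadata, etc.) itself.

```python
import io
import zipfile
from urllib.parse import quote, unquote


def pack(parts: dict[str, bytes]) -> bytes:
    """Hypothetical: Unicode names in, ASCII-only entry names in the ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in sorted(parts):  # fixed order, so output is repeatable
            zf.writestr(quote(name, safe="/"), parts[name])
    return buf.getvalue()


def unpack(blob: bytes) -> dict[str, bytes]:
    """Hypothetical: read the ZIP and hand Unicode names back out."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {unquote(info.filename): zf.read(info) for info in zf.infolist()}


parts = {"content.xml": b"<doc/>", "Bilder/größe.png": b"\x89PNG"}
blob = pack(parts)
print(unpack(blob) == parts)  # True: names round-trip unchanged
```

The caller only ever sees the Unicode names; what the bytes look like inside the ZIP is the business of the mapping layer alone.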
[ ... ]
In other words, I'm proposing a multi-level approach that looks like this:
1) Encoding = ZIP or whatever. The bits.
2) Abstract file system describes files, directories and related metadata, and
can be canonically encoded in a particular encoding, e.g., ZIP. The UTF-8
issue is resolved here.
3) Package format, built on abstract file system with specific defined
content, perhaps related to manifest, relationships, encryption, digital
signatures, etc.
4) Document Format = ODF, OOXML, EPUB
I think we use 1) via an RER. For 2, we need to define that ourselves.
For 3, we'll need to discuss what is possible, probably based on the
commonality (to the extent there is any) among the document formats today.
And 4 is already done, but in revision could take advantage of the work
we have in the lower levels.
But the meat of it is in levels 2 and 3. There are a lot of benefits that
would come from standardizing those, especially to the stated stakeholders
of this effort, e.g., OASIS, Ecma, IDPF, etc. That's why I think we
squander our opportunity to the extent we dwell on the likely unachievable
standardization of the low level ZIP encoding.
-Rob