Full validation of ODF/OOXML/EPUB, etc. with DFDL and NVDL?

robert_weir at us.ibm.com
Wed Nov 17 14:50:21 CET 2010


I recently ran across Data Format Description Language (DFDL), pronounced 
"Daffodil", a draft standard in the Open Grid Forum.  It is a validation 
language for non-markup data, including text and binary files.  It is 
oriented toward record-based formats, both modern and legacy, of the kind 
commonly used in scientific and industrial applications.  DFDL is 
expressed as annotations on an XML Schema, using XML Schema (and 
Datatypes) to express the logical format of the data, and using 
annotations on the XSD to express physical aspects like byte ordering, 
etc.
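
To make the logical/physical split concrete, here is a rough sketch in 
Python.  It is not DFDL syntax.  The RECORD dictionary, its field names and 
the parse_record helper are all hypothetical, made up for illustration: the 
"fields" list plays the role of the logical schema, and "byte_order" plays 
the role of a physical annotation.

```python
import struct

# Hypothetical record description (not DFDL): the "fields" list is the
# logical format, "byte_order" is a physical property layered on top.
RECORD = {
    "byte_order": "big",
    "fields": [("id", "H"), ("temperature", "i")],  # uint16, int32
}


def parse_record(data: bytes, desc: dict) -> dict:
    """Decode one fixed-size record according to the description."""
    # ">" and "<" select byte order and disable native padding in struct.
    prefix = ">" if desc["byte_order"] == "big" else "<"
    fmt = prefix + "".join(code for _, code in desc["fields"])
    if len(data) != struct.calcsize(fmt):
        raise ValueError("record has the wrong length for this description")
    values = struct.unpack(fmt, data)
    names = [name for name, _ in desc["fields"]]
    return dict(zip(names, values))
```

Changing only "byte_order" re-reads the same logical record from a 
different physical layout, which is roughly the separation DFDL aims for.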

I'm doing a little hand waving, but consider this:

The formats of interest for this study group are binary on the outside and 
XML and binary (images, etc.) on the inside.  And the XML on the inside is 
in a variety of languages, generally expressed in multiple schema 
definition languages.

So NVDL has a role to play.  But to date that would only work for the XML 
pieces.  What about the binary?

What if we brought DFDL into SC34/WG1 as a new part of DSDL?  (This is 
within the realm of possibility, based on my conversations with a 
colleague of mine who chairs the DFDL WG).

Could we then express ZIP formally using DFDL?  And by doing so in DFDL, 
enable the kind of modularization we're also seeking?

What if we then enhanced NVDL to allow an outside-in validation of such 
ZIP+XML+binaries?  So we can express validation not only of the ZIP, but 
also the contents of the ZIP, both markup as well as binaries.  Imagine a 
DFDL description of PNG, for example.
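
As a hand-waving sketch of what "outside-in" could mean, the Python below 
checks the ZIP container first and then hands each entry to a checker 
chosen by its type.  Everything here is an assumption for illustration: the 
function names are hypothetical, dispatch is by file extension (NVDL itself 
dispatches on XML namespaces), the XML check is well-formedness only, and 
the PNG check looks only at the 8-byte signature.  A real design would put 
schema validation and a DFDL description in those two slots.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG file signature


def check_xml(data: bytes) -> bool:
    """Stand-in for schema validation: well-formedness only."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


def check_png(data: bytes) -> bool:
    """Stand-in for a binary format description: signature only."""
    return data.startswith(PNG_SIGNATURE)


# Hypothetical dispatch table, keyed by file extension.
CHECKS = {".xml": check_xml, ".png": check_png}


def validate_package(raw: bytes) -> dict:
    """Validate the container, then each recognized entry inside it."""
    if not zipfile.is_zipfile(io.BytesIO(raw)):
        return {"<container>": False}
    results = {}
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        # testzip() returns None when every entry's CRC checks out.
        results["<container>"] = zf.testzip() is None
        for name in zf.namelist():
            for suffix, check in CHECKS.items():
                if name.lower().endswith(suffix):
                    results[name] = check(zf.read(name))
    return results
```

Entries with no registered checker are simply skipped here.  Whether they 
should be rejected or ignored is exactly the kind of policy NVDL would 
have to express.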

Obviously this is not the shortest path to getting a normative ZIP 
reference for ODF 1.2.   The shortest path is to do an RER.  And given the 
timetable that is what the ODF TC will likely end up doing.   But I think 
there is great value in tackling the general problem here, which is how 
binary and general text data and markup relate together in complex 
scenarios.  Whether we're talking about scientific data collection, legacy 
formats or even modern web formats like JSON, it is clear that a "pure" 
XML world exists only in the imagination.

Regards,

-Rob


More information about the sc34wg1study mailing list