Part 2 schema regex for ST_ContentType

John Haug johnhaug at exchange.microsoft.com
Tue Mar 17 00:11:28 CET 2015


Annex D (and E) have a very lengthy regular expression restricting the allowed values for ST_ContentType.  We looked at this in depth during the meetings in Bellevue, resulting from Murata-san’s December 2014 e-mail “Which RFC(s) for media type should we refer to?” and February 2015 e-mail “My proposals: content type and media types”.  There were some corrections to make and no clear answer as to what the regex is intended for.

After the meetings, I took a similar look at the regex, compared it to RFC 2616 (the regex is intended to match the RFC), compared RFCs 2616 and 7231 and documented everything in the attached.  (This includes a fix to the error Murata-san noted.)  The e-mail discussion is included below.

1. The questions I initially posed are still open:

·  Whether we are OK with the differences introduced by RFC 7231

·  Whether we should rewrite the XSD regex pattern by translating the RFC 7231 media-type definition (as Part 2 originally did with RFC 2616)

·  Whether we should then simplify the regex in some partial or extreme way, within the limits of what XSD and RNG allow

2. Francis’ question about \s+ in what I call Y still needs examination.

For further discussion among the larger group!
John

From: eb2mmrt at gmail.com [mailto:eb2mmrt at gmail.com] On Behalf Of MURATA Makoto
Sent: Thursday, March 12, 2015 10:15 PM
To: John Haug
Cc: Francis Cave; Arms, Caroline; Rex Jaeschke; Chris Rae; Gareth Horton; Alex Brown; Rich McLain; MURATA Makoto (FAMILY Given)
Subject: Re: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting of SC 34/WG4

John,

First, let us correctly understand what we have right now.  I do not
think this can be done without using some macro (or XML entities).

Then, compare our (faithfully rewritten) regexp with RFC 2616, and
then compare it with RFC 7231.

John, please incorporate my change (as part of the first step) and
post your document to the WG4 mailing list.

Regards,
Makoto



2015-03-11 8:51 GMT+09:00 John Haug <johnhaug at exchange.microsoft.com>:
(I’ve combined the split replies from Francis and Murata-san.)


Re: Francis:
My intent here wasn’t to propose a simplification to include in Part 2.  I just wanted to reverse engineer what Part 2 currently has and compare it to the RFCs.  I therefore did no reduction of the content.  Leaving it as is showed that Part 2 contained a quite literal translation of the RFC’s BNF to XSD regex syntax.  Your comment about \s in X is true, but I left it that way intentionally as part of the reverse engineering.

Regarding \s+ in Y, I think they again literally translated qdtext, but perhaps not quite right?  Maybe \s+ should have been appended after the last item in the first bracketed expression in Y?
Y: ([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"]]|\s+)
qdtext         = <any TEXT except <">>
TEXT           = <any OCTET except CTLs, but including LWS>
OCTET          = <any 8-bit sequence of data>
CTL            = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
LWS            = [CRLF] 1*( SP | HT )   ; linear white space
CRLF           = CR LF
OCTET limits us to 0-255.  CTLs removes 0-31 and 127.  LWS adds back in 9, 10, 13 (32 already allowed).  qdtext removes 34.
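
The arithmetic above can be checked mechanically. What follows is only a sketch in Python, not part of the attached document. It assumes that XSD's \s stands for exactly space, tab, LF and CR, and that \p{Cc} corresponds to the Unicode general category Cc as reported by Python's unicodedata module. The set names are made up for illustration.

```python
import unicodedata

# qdtext per RFC 2616, following the arithmetic above.
OCTET = set(range(0, 256))
CTL = set(range(0, 32)) | {127}
LWS_CHARS = {9, 10, 13, 32}            # HT, LF, CR, SP
TEXT = (OCTET - CTL) | LWS_CHARS       # any OCTET except CTLs, but including LWS
QDTEXT = TEXT - {34}                   # any TEXT except the double quote

# Characters that Y can match.
# Bracket part: Basic Latin + Latin-1 Supplement, minus Cc, minus the double quote.
BRACKET = {
    cp for cp in range(256)
    if unicodedata.category(chr(cp)) != "Cc" and cp != 34
}
XSD_S = {0x20, 0x09, 0x0A, 0x0D}       # assumption: XSD \s is SP, HT, LF, CR
Y_CHARS = BRACKET | XSD_S

ONLY_IN_QDTEXT = sorted(QDTEXT - Y_CHARS)
ONLY_IN_Y = sorted(Y_CHARS - QDTEXT)

print(len(QDTEXT), len(Y_CHARS))
print([hex(cp) for cp in ONLY_IN_QDTEXT])
print(ONLY_IN_Y)
```

Under those assumptions, every character Y accepts is also in qdtext, and the only code points in qdtext that Y rejects are 0x80 through 0x9F, which \p{Cc} removes but RFC 2616's CTL does not.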

> Note that U+00A0 is in the Latin-1 supplement. I'm not sure whether this character should be explicitly excluded from X. It can presumably be included in Y, as this is in a quoted string.
I might be missing something.  The content represented by X doesn’t include anything above 127 (0x7F).


Re: Murata-san:
Ah, I think you’re right.  I believe my omitting that one set of parentheses makes the * apply only to the [\p{IsBasicLatin}] and not also to the \\.


Given that I’m not trying to propose a simplification of the regex, or even that we should do so, assuming I’m correct above, shall I make Murata-san’s edit to the doc and send it to the WG 4 list purely as investigation into what Part 2 currently says?

John

________________________________
From: eb2mmrt at gmail.com [mailto:eb2mmrt at gmail.com] On Behalf Of MURATA Makoto
Sent: Saturday, February 28, 2015 11:51 PM
To: John Haug
Cc: Arms, Caroline; Rex Jaeschke; Chris Rae; Francis Cave; Gareth Horton; Alex Brown; Rich McLain
Subject: Re: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting of SC 34/WG4

John,

Nice work!

I think that

("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"]]|\s+)|(\\[\p{IsBasicLatin}])*)&quot;)

cannot be simplified to

("(Y | \\[\p{IsBasicLatin}]*)&quot;)

Rather, it should become

("(Y|(\\[\p{IsBasicLatin}]))*&quot;)

Regards,
Makoto
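
The difference between the two groupings in the message above can be seen with a small experiment. This is a rough sketch using Python's re module, which has no \p{IsBasicLatin} or character-class subtraction, so Y and the Basic Latin class are replaced by hand-written stand-ins (an assumption, not the Part 2 text). The names SIMPLIFIED and CORRECTED are made up for illustration.

```python
import re

# Stand-in for Y: Latin-1 characters that are neither controls nor the
# double quote, or a run of space/tab/LF/CR.
Y = r'(?:[\x20\x21\x23-\x7e\xa0-\xff]|[ \t\n\r]+)'
BASIC_LATIN = r'[\x00-\x7f]'

# ("(Y | \\[BasicLatin]*)")  : the * binds only to the Basic Latin class.
SIMPLIFIED = re.compile('"(?:' + Y + r'|\\' + BASIC_LATIN + '*)"')

# ("(Y|(\\[BasicLatin]))*")  : the * repeats the whole alternation.
CORRECTED = re.compile('"(?:' + Y + r'|\\' + BASIC_LATIN + ')*"')

samples = ['"ab"', r'"a\"b"', '""', r'"\abc"']
for s in samples:
    print(
        s,
        SIMPLIFIED.fullmatch(s) is not None,
        CORRECTED.fullmatch(s) is not None,
    )
```

With these stand-ins, the first form accepts a quoted string only if it holds a single Y item or starts with a backslash, so an ordinary value such as "ab" is rejected. The second form accepts any sequence of Y items and backslash-escaped characters.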

From: Francis Cave [mailto:francis at franciscave.com]
Sent: Saturday, February 28, 2015 4:37 AM
To: John Haug; Arms, Caroline; 'Rex Jaeschke'; Makoto Murata; Chris Rae; Gareth Horton; Alex Brown; Rich McLain
Subject: Re: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting of SC 34/WG4

John

Good job! I just have one niggle, which is with your use of \s in the definitions of both X and Y. The problem is that \s includes \t \n and \r, all of which are in \p{Cc}. The meaning of \s is not SPACE, but 'white space', where this includes all Unicode space characters (i.e. including U+0020 and U+00A0, but also presumably some other space characters), and also includes control characters TAB, CR and LF.

In X this isn't so critical, because you're excluding \s, but that means that there is redundancy in the expression, because \t \n and \r are already excluded by excluding \p{Cc}.

In Y the problem is more serious, because you are including \s+ as an alternative choice to the rest of the expression. Effectively this allows \t \n and \r in Y expressions that are white space only.

I suspect that the only white space character that should be allowed in Y, is U+0020, i.e. the regular SPACE character. This would be the same as SP in the ABNF in RFC 7231.

Note that U+00A0 is in the Latin-1 supplement. I'm not sure whether this character should be explicitly excluded from X. It can presumably be included in Y, as this is in a quoted string.

Here is a list of Unicode spaces: https://www.cs.tut.fi/~jkorpela/chars/spaces.html. As this isn't an official list, I cannot be certain that this is accurate. My assumption is that \s includes all these.

I am assuming that \p{Cc} includes control characters in the Unicode range U+0080 to U+009F, as well as the control characters in the basic ASCII range.
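
The category assumptions in this message can be checked against the Unicode database. A minimal sketch in Python follows, assuming that \p{Cc} corresponds to the general category Cc reported by the unicodedata module. As far as I understand the XSD datatypes specification, its \s covers only space, tab, LF and CR rather than every Unicode space, so the list below is limited to those four. The variable names are illustrative.

```python
import unicodedata

# The four characters that XSD's \s is taken to mean here (an assumption).
XSD_S = [" ", "\t", "\n", "\r"]

# Which of them are control characters (general category Cc)?
s_chars_in_cc = [c for c in XSD_S if unicodedata.category(c) == "Cc"]

# Are all C1 controls U+0080..U+009F in Cc as well?
c1_all_cc = all(
    unicodedata.category(chr(cp)) == "Cc" for cp in range(0x80, 0xA0)
)

# NO-BREAK SPACE is a space separator (Zs), not a control character.
nbsp_category = unicodedata.category("\u00a0")

print(s_chars_in_cc, c1_all_cc, nbsp_category)
```

So tab, LF and CR are indeed in Cc, the C1 range is in Cc, and U+00A0 survives the subtraction of \p{Cc} from the Latin-1 Supplement block.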

Kind regards,

Francis




On 28/02/2015 00:31, John Haug wrote:
I’ve attached what I came up with, which should fully explain the issue and be step-by-step enough to make it easier to ensure there are no typos/errors.  Please review this before it goes out to all of WG 4.  Once we’re sure there are no typos/errors, we can either include it in the minutes of the Bellevue meeting or I can just attach it to a reply to Murata-san’s last e-mail on the subject to the WG 4 list.

Pursuant to the revision, we will need to decide:

•  Whether we are OK with the differences introduced by RFC 7231

•  Whether we should rewrite the XSD regex pattern by translating the RFC 7231 media-type definition (as Part 2 originally did with RFC 2616)

•  Whether we should then simplify the regex in some partial or extreme way, within the limits of what XSD and RNG allow

Thanks,
John

From: John Haug
Sent: Thursday, February 26, 2015 3:08 PM
To: 'Arms, Caroline'; 'Rex Jaeschke'; Makoto Murata; Chris Rae; Francis Cave; Gareth Horton; Alex Brown; Rich McLain
Subject: RE: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting of SC 34/WG4

Re: the ST_ContentType regex: I am nearly done with a lengthy explanatory break-it-down tutorial document that shows the derivation step by step!  Thanks much to Murata-san for all the initial investigation and to him and Francis for talking through it on the screen yesterday at (painful) length.

The short version is that the huge 6-line regex in Part 2 is a literal translation of RFC 2616’s definition of media-type into an XSD pattern.  I have that part done and am working on the differences between RFC 2616 and RFC 7231 (and friends).  I think it’s reasonable to change the Part 2 normative reference to 7231 (and friends by reference from 7231) since it has obsoleted 2616.  But we ought to understand and discuss the differences before making a concrete decision on that.
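
To give a feel for what a literal translation of the RFC 2616 grammar looks like, here is a rough sketch in Python. It is not the Part 2 pattern. It follows the RFC 2616 productions media-type, parameter, token and quoted-string, ignores the implied linear white space that HTTP allows around ";", and uses hand-written ranges where XSD would use \p{...} classes. The helper name is_media_type is made up for illustration.

```python
import re

# token = 1*<any CHAR except CTLs or separators>
SEPARATORS = '()<>@,;:\\"/[]?={} \t'
TOKEN_CHARS = "".join(
    chr(cp) for cp in range(33, 127) if chr(cp) not in SEPARATORS
)
TOKEN = "[" + re.escape(TOKEN_CHARS) + "]+"

# quoted-string = ( <"> *(qdtext | quoted-pair) <"> )
QDTEXT = r"[\t\n\r\x20\x21\x23-\x7e\x80-\xff]"
QUOTED_PAIR = r"\\[\x00-\x7f]"
QUOTED_STRING = '"(?:' + QDTEXT + "|" + QUOTED_PAIR + ')*"'

# media-type = type "/" subtype *( ";" parameter )
# parameter  = attribute "=" value ; value = token | quoted-string
MEDIA_TYPE = re.compile(
    TOKEN + "/" + TOKEN
    + "(?:;" + TOKEN + "=(?:" + TOKEN + "|" + QUOTED_STRING + "))*"
)


def is_media_type(value: str) -> bool:
    """Return True if value matches this simplified media-type sketch."""
    return MEDIA_TYPE.fullmatch(value) is not None


print(is_media_type("text/xml;charset=utf-8"))
print(is_media_type('text/xml;charset="utf-8"'))
print(is_media_type("text/pl ain"))
```

Because white space around ";" is left out, a value such as "text/xml; charset=utf-8" is rejected by this sketch. Whether and where to allow such white space is one of the points that differ between the RFCs and would need a decision.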

John

From: Arms, Caroline [mailto:caar at loc.gov]
Sent: Thursday, February 26, 2015 2:56 PM
To: 'Rex Jaeschke'; Makoto Murata; John Haug; Chris Rae; Francis Cave; Gareth Horton; Alex Brown; Rich McLain
Subject: RE: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting of SC 34/WG4

Rex,

I think you should get a sentence or two from Murata-san or someone else about what needs to be done to get the regular expression for the ContentTypes schema fixed – the third point in Murata-san’s message.   I wasn’t able to hear that discussion clearly enough.  I believe it was decided that some testing might be needed – but maybe that was only discussed but not decided.

    Caroline

From: Rex Jaeschke [mailto:rex at RexJaeschke.com]
Sent: Thursday, February 26, 2015 2:51 PM
To: Makoto Murata; Arms, Caroline; John Haug; Chris Rae; Francis Cave; Gareth Horton; Alex Brown; Rich McLain
Subject: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting of SC 34/WG4

Attached are the draft minutes as at the end of the meeting.

Once I get some words on XAdES, I’ll send out the final draft to WG4 and TC45.

I’ll update the DR log to reflect the minutes, in the next hour.

Rex






--

Praying for the victims of the Japan Tohoku earthquake

Makoto
-------------- next part --------------
A non-text attachment was scrubbed...
Name: content type regex.docx
Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Size: 27046 bytes
Desc: content type regex.docx
URL: <http://mailman.vse.cz/pipermail/sc34wg4/attachments/20150316/758e7d21/attachment-0001.docx>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: content type regex.pdf
Type: application/pdf
Size: 531477 bytes
Desc: content type regex.pdf
URL: <http://mailman.vse.cz/pipermail/sc34wg4/attachments/20150316/758e7d21/attachment-0001.pdf>

