Part 2 schema regex for ST_ContentType

Thu Apr 9 10:49:45 CEST 2015

John,

A few comments on the section "Differences between RFC 2616 and
RFC 7231" of your comparison note.

> token
> No differences

Exactly.

> qdtext
>
> RFC 2616 includes LF (octet 10 / %x0A), CR (octet 13 / %x0D),
> \ (octet 92 / %x5C).  RFC 7231 disallows these characters.

%x5C should certainly be disallowed, since it starts a
quoted pair.
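
To illustrate why a bare backslash cannot double as ordinary qdtext, here is a minimal Python sketch that unescapes the inside of a quoted-string. The function name `unquote` is hypothetical, and the sketch deliberately skips any character-range validation; it only shows that a backslash always consumes the character that follows it.

```python
def unquote(quoted: str) -> str:
    """Strip the surrounding DQUOTEs and resolve quoted-pairs (backslash + one char)."""
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("not a quoted-string")
    body = quoted[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            # A backslash starts a quoted-pair, so it consumes the next character.
            if i + 1 >= len(body):
                raise ValueError("dangling backslash")
            out.append(body[i + 1])
            i += 2
        elif ch == '"':
            raise ValueError("unescaped DQUOTE inside quoted-string")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

If qdtext also admitted a literal backslash, the branch for `"\\"` would be ambiguous: the same octet could be data or the start of an escape.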

> quoted-pair
>
> RFC 2616 allows any character in the standard ASCII range
> (octets 0-127).  RFC 7231 disallows the range octets 0-31
> except for octet 9 (HTAB).

Furthermore, RFC 7231 disallows 7F but allows 80-FF.
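
The quoted-pair comparison can be checked mechanically with sets of octets. This is a sketch under one assumption worth stating: the newer ABNF used here, `quoted-pair = "\" ( HTAB / SP / VCHAR / obs-text )`, is the one in RFC 7230 (section 3.2.6), which RFC 7231 uses by reference. The helper `as_ranges` is a hypothetical name introduced only for readable output.

```python
# RFC 2616: quoted-pair = "\" CHAR, where CHAR is any octet 0-127.
rfc2616_qp = set(range(0x00, 0x80))

# RFC 7230: quoted-pair = "\" ( HTAB / SP / VCHAR / obs-text )
#   HTAB = %x09, SP = %x20, VCHAR = %x21-7E, obs-text = %x80-FF
rfc7230_qp = {0x09, 0x20} | set(range(0x21, 0x7F)) | set(range(0x80, 0x100))


def as_ranges(octets):
    """Collapse a set of ints into sorted (first, last) ranges."""
    ranges = []
    for o in sorted(octets):
        if ranges and o == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], o)
        else:
            ranges.append((o, o))
    return ranges


no_longer_allowed = as_ranges(rfc2616_qp - rfc7230_qp)  # [(0, 8), (10, 31), (127, 127)]
newly_allowed = as_ranges(rfc7230_qp - rfc2616_qp)      # [(128, 255)]
print(no_longer_allowed, newly_allowed)
```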

Regards,
Makoto

2015-04-09 16:07 GMT+09:00 MURATA Makoto <eb2m-mrt at asahi-net.or.jp>:

> John,
>
> Thank you very much for this summary.
>
> *Re: Whether we are OK with the differences introduced by RFC 7231*
>
> If there are no strong reasons not to use RFC 7231, we should use
> it.  Moreover, our regular expression already adopts some new features
> of RFC 7231.
>
> *Re: Whether we should rewrite the XSD regex pattern by translating *
> *the RFC 7231 media-type definition (as Part 2 originally did with RFC
> 2616)*
>
> One possibility is to drop the regexp and to introduce prose that
> references RFC 7231 normatively.  Our life will be easier, since we
> are not obliged to ensure equivalence of our regexp and RFC 7231.
> People are unlikely to validate the value of @ContentType, and
> might not notice errors.  But the value of this attribute is almost
> always a token followed by "/" followed by a token.
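
As a rough illustration of the common case described above (a token, "/", a token, with no parameters), here is a Python sketch. It assumes the `tchar` set from RFC 7230; the names `TCHAR`, `SIMPLE_MEDIA_TYPE` and `is_simple_media_type` are hypothetical, and this is not a proposal for the Part 2 pattern.

```python
import re

# tchar per RFC 7230: "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
#                     "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
TCHAR = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]"
SIMPLE_MEDIA_TYPE = re.compile(TCHAR + "+/" + TCHAR + "+")


def is_simple_media_type(value: str) -> bool:
    """True if value is exactly token "/" token, with no parameters."""
    return SIMPLE_MEDIA_TYPE.fullmatch(value) is not None
```

A value such as `text/plain; charset=utf-8` is rejected by this check because it carries a parameter; handling parameters is where quoted-string, and most of the complexity, comes in.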
>
> Another possibility is to create a regexp that faithfully captures
> RFC 7231.  Our life will be harder and the regular expression
> will not be easy to read (even if it is better than the present
> regexp).  But syntactical errors will be captured by XML
> validators.
>
> *Re: Whether we should then simplify the regex in some partial or *
> *extreme way, within the limits of what XSD and RNG allow*
>
> Unnecessary complexity is harmful, but should we create a loose
> regexp only for readability?  I do not like this option, but
> it might be a practical solution.
>
> Regards,
> Makoto
>
> 2015-03-17 8:11 GMT+09:00 John Haug <johnhaug at exchange.microsoft.com>:
>
>>  Annex D (and E) have a very lengthy regular expression restricting the
>> allowed values for ST_ContentType.  We looked at this in depth during the
>> meetings in Bellevue, resulting from Murata-san’s December 2014 e-mail
>> “Which RFC(s) for media type should we refer to?” and February 2015 e-mail
>> “My proposals: content type and media types”.  There were some corrections
>> to make and no clear answer as to what the regex is intended for.
>>
>>
>>
>> After the meetings, I took a similar look at the regex, compared it to
>> RFC 2616 (the regex is intended to match the RFC), compared RFCs 2616 and
>> 7231 and documented everything in the attached.  (This includes a fix to
>> the error Murata-san noted.)  The e-mail discussion is included below.
>>
>>
>>
>> 1. The questions I initially posed are still open:
>>
>> ·         Whether we are OK with the differences introduced by RFC 7231
>>
>> ·         Whether we should rewrite the XSD regex pattern by translating
>> the RFC 7231 media-type definition (as Part 2 originally did with RFC 2616)
>>
>> ·         Whether we should then simplify the regex in some partial or
>> extreme way, within the limits of what XSD and RNG allow
>>
>> 2. Francis’ question about \s+ in what I call Y still needs examination.
>>
>>
>>
>> For further discussion among the larger group!
>>
>> John
>>
>>
>>
>> *From:* eb2mmrt at gmail.com [mailto:eb2mmrt at gmail.com] *On Behalf Of *MURATA
>> Makoto
>> *Sent:* Thursday, March 12, 2015 10:15 PM
>> *To:* John Haug
>> *Cc:* Francis Cave; Arms, Caroline; Rex Jaeschke; Chris Rae; Gareth
>> Horton; Alex Brown; Rich McLain; MURATA Makoto (FAMILY Given)
>> *Subject:* Re: PLEASE PROOF: Day 3 Draft Minutes from the Seattle
>> Meeting of SC 34/WG4
>>
>>
>>
>> John,
>>
>>
>>
>> First, let us correctly understand what we have right now.  I do not
>> think this can be done without using some macro (or XML entities).
>>
>>
>>
>> Then, compare our (faithfully rewritten) regexp with RFC 2616, and
>> then compare it with RFC 7231.
>>
>>
>>
>> John, please incorporate my change (as part of the first step) and
>> post your document to the WG4 mailing list.
>>
>>
>>
>> Regards,
>>
>> Makoto
>>
>>
>>
>>
>>
>>
>>
>> 2015-03-11 8:51 GMT+09:00 John Haug <johnhaug at exchange.microsoft.com>:
>>
>>  (I’ve combined the split replies from Francis and Murata-san.)
>>
>>
>>
>>
>>
>> Re: Francis:
>>
>> My intent here wasn’t to propose a simplification to include in Part 2.
>> I just wanted to reverse engineer what Part 2 currently has and compare it
>> to the RFCs.  I therefore did no reduction of the content.  Leaving it as
>> is showed that Part 2 contained a quite literal translation of the RFC’s
>> BNF to XSD regex syntax.  Your comment about \s in X is true, but I left it
>> that way intentionally as part of the reverse engineering.
>>
>>
>>
>> Regarding \s+ in Y, I think they again literally translated qdtext, but
>> perhaps not quite right?  Maybe \s+ should have been appended after the
>> last item in the first bracketed expression in Y?
>>
>> Y: ([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"]]|\s+)
>>
>> qdtext         = <any TEXT except <">>
>> TEXT           = <any OCTET except CTLs, but including LWS>
>> OCTET          = <any 8-bit sequence of data>
>> CTL            = <any US-ASCII control character (octets 0 - 31) and DEL
>> (127)>
>> LWS            = [CRLF] 1*( SP | HT )   ; linear white space
>> CRLF           = CR LF
>>
>> OCTET limits us to 0-255.  CTLs removes 0-31 and 127.  LWS adds back in
>> 9, 10, 13 (32 already allowed).  qdtext removes 34.
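
The octet arithmetic above can be reproduced with Python sets. This sketch follows the same simplification as the text, treating the octets that may occur inside LWS (HT, LF, CR, SP) as individually allowed, although the RFC 2616 grammar only admits CR and LF as part of a CRLF followed by SP or HT. The RFC 7230 `qdtext` rule is added for comparison and is an assumption about which newer definition is meant.

```python
octet = set(range(256))          # OCTET: 0-255
ctl = set(range(32)) | {127}     # CTL: 0-31 and DEL (127)
lws = {9, 10, 13, 32}            # octets that can appear in LWS: HT, LF, CR, SP

text = (octet - ctl) | lws       # TEXT: any OCTET except CTLs, but including LWS
qdtext_2616 = text - {34}        # qdtext: any TEXT except DQUOTE (34)

# RFC 7230: qdtext = HTAB / SP / %x21 / %x23-5B / %x5D-7E / obs-text
qdtext_7230 = (
    {0x09, 0x20, 0x21}
    | set(range(0x23, 0x5C))
    | set(range(0x5D, 0x7F))
    | set(range(0x80, 0x100))
)

only_in_2616 = sorted(qdtext_2616 - qdtext_7230)   # [10, 13, 92]
only_in_7230 = sorted(qdtext_7230 - qdtext_2616)   # []
print(len(qdtext_2616), only_in_2616, only_in_7230)
```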
>>
>>
>>
>> > Note that U+00A0 is in the Latin-1 supplement. I'm not sure whether
>> > this character should be explicitly excluded from X. It can presumably be
>> > included in Y, as this is in a quoted string.
>>
>> I might be missing something.  The content represented by X doesn’t
>> include anything above 127 (0x7F).
>>
>>
>>
>>
>>
>> Re: Murata-san:
>>
>> Ah, I think you’re right.  I believe my omitting that one set of
>> parentheses makes the * apply only to the [\p{IsBasicLatin}] and not also
>> to the \\.
>>
>>
>>
>>
>>
>> Given that I’m not trying to propose a simplification of the regex, or
>> even that we should do so, assuming I’m correct above, shall I make
>> Murata-san’s edit to the doc and send it to the WG 4 list purely as
>> investigation into what Part 2 currently says?
>>
>>
>>
>> John
>>
>>
>>  ------------------------------
>>
>> *From:* eb2mmrt at gmail.com [mailto:eb2mmrt at gmail.com] *On Behalf Of *MURATA
>> Makoto
>> *Sent:* Saturday, February 28, 2015 11:51 PM
>> *To:* John Haug
>> *Cc:* Arms, Caroline; Rex Jaeschke; Chris Rae; Francis Cave; Gareth
>> Horton; Alex Brown; Rich McLain
>> *Subject:* Re: PLEASE PROOF: Day 3 Draft Minutes from the Seattle
>> Meeting of SC 34/WG4
>>
>>
>>
>> John,
>>
>>
>>
>> Nice work!
>>
>>
>>
>> I think that
>>
>>
>>
>>
>> ("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"]]|\s+)|(\\[\p{IsBasicLatin}])*)")
>>
>>
>>
>> cannot be simplified to
>>
>>
>>
>> ("(Y | \\[\p{IsBasicLatin}]*)")
>>
>>
>>
>> Rather, it should become
>>
>>
>>
>> ("(Y|(\\[\p{IsBasicLatin}]))*")
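
The difference between the two groupings can be seen with a small Python sketch. Python's `re` has neither Unicode block escapes nor character-class subtraction, so the pieces are translated by hand, and those translations are assumptions: `\p{IsBasicLatin}` as `[\x00-\x7F]`, the Y character class as Latin-1 minus `\p{Cc}` minus the double quote, and XSD `\s` as `[ \t\n\r]`. The names `Y`, `QP`, `star_on_class` and `star_on_group` are hypothetical.

```python
import re

# Y: one Latin-1 character that is neither a control character nor '"', or a run of white space.
Y = r'(?:[\x20\x21\x23-\x7E\xA0-\xFF]|[ \t\n\r]+)'
# Backslash followed by one Basic Latin character.
QP = r'\\[\x00-\x7F]'

# "(Y | \\[BasicLatin]*)" : the star binds to the character class only.
star_on_class = re.compile('"(?:' + Y + '|' + QP + '*)"')
# "(Y | (\\[BasicLatin]))*" : the star binds to the whole alternation.
star_on_group = re.compile('"(?:' + Y + '|(?:' + QP + '))*"')

for sample in ['"abc"', '""', '"a\\"b"', '"a"']:
    print(sample,
          bool(star_on_class.fullmatch(sample)),
          bool(star_on_group.fullmatch(sample)))
```

With the star on the class, the quoted string may contain only a single Y item, or one backslash followed by any number of Basic Latin characters, so an ordinary value such as `"abc"` is rejected. With the star on the group, any sequence of Y items and backslash pairs is accepted.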
>>
>>
>>
>> Regards,
>>
>> Makoto
>>
>>
>>
>> *From:* Francis Cave [mailto:francis at franciscave.com]
>> *Sent:* Saturday, February 28, 2015 4:37 AM
>> *To:* John Haug; Arms, Caroline; 'Rex Jaeschke'; Makoto Murata; Chris
>> Rae; Gareth Horton; Alex Brown; Rich McLain
>> *Subject:* Re: PLEASE PROOF: Day 3 Draft Minutes from the Seattle
>> Meeting of SC 34/WG4
>>
>>
>>
>> John
>>
>> Good job! I just have one niggle, which is with your use of \s in the
>> definitions of both X and Y. The problem is that \s includes \t \n and \r,
>> all of which are in \p{Cc}. The meaning of \s is not SPACE, but 'white
>> space', where this includes all Unicode space characters (i.e. including
>> U+0020 and U+00A0, but also presumably some other space characters), and
>> also includes control characters TAB, CR and LF.
>>
>> In X this isn't so critical, because you're excluding \s, but that means
>> that there is redundancy in the expression, because \t \n and \r are
>> already excluded by excluding \p{Cc}.
>>
>> In Y the problem is more serious, because you are including \s+ as an
>> alternative choice to the rest of the expression. Effectively this allows
>> \t \n and \r in Y expressions that are white space only.
>>
>> I suspect that the only white space character that should be allowed in
>> Y, is U+0020, i.e. the regular SPACE character. This would be the same as
>> SP in the ABNF in RFC 7231.
>>
>> Note that U+00A0 is in the Latin-1 supplement. I'm not sure whether this
>> character should be explicitly excluded from X. It can presumably be
>> included in Y, as this is in a quoted string.
>>
>> Here is a list of Unicode spaces:
>> https://www.cs.tut.fi/~jkorpela/chars/spaces.html. As this isn't an
>> official list, I cannot be certain that this is accurate. My assumption is
>> that \s includes all these.
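
For what it is worth, my reading of the regular expression appendix of XML Schema Part 2 is that `\s` in an XSD pattern is defined as `[#x20\t\n\r]` only, so U+00A0 and the other Unicode spaces would not be matched by it. The Python sketch below contrasts that definition with Python's own Unicode-aware `\s`, which is much closer to the assumption above. `XSD_S` and `PY_S` are hypothetical names, and `XSD_S` is a hand translation, not something an XSD processor exposes.

```python
import re
import unicodedata

XSD_S = re.compile(r"[ \t\n\r]")  # \s as defined for XSD patterns (hand translation)
PY_S = re.compile(r"\s")          # Python's Unicode-aware \s, shown for contrast

for ch in ["\u0020", "\t", "\n", "\r", "\u00a0", "\u2003"]:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        "xsd:", bool(XSD_S.fullmatch(ch)),
        "python:", bool(PY_S.fullmatch(ch)),
    )
```

TAB, LF and CR are reported with general category Cc, which is the overlap with `\p{Cc}` discussed above; U+00A0 is category Zs.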
>>
>> I am assuming that \p{Cc} includes control characters in the Unicode
>> range U+0080 to U+009F, as well as the control characters in the basic
>> ASCII range.
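
That assumption about `\p{Cc}` can be checked against the Unicode character database shipped with Python. This is only a sketch: it relies on `\p{Cc}` in XSD meaning the Unicode general category Cc, and the variable name `cc` is arbitrary.

```python
import unicodedata

# All code points below U+0100 whose general category is Cc (control).
cc = [cp for cp in range(0x100) if unicodedata.category(chr(cp)) == "Cc"]

# Expected: U+0000-U+001F plus U+007F-U+009F.
print(len(cc), hex(cc[0]), hex(cc[-1]))
```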
>>
>> Kind regards,
>>
>> Francis
>>
>>
>>
>>
>> On 28/02/2015 00:31, John Haug wrote:
>>
>> I’ve attached what I came up with, which should fully explain the issue
>> and be step-by-step enough to make it easier to ensure there are no
>> typos/errors.  Please review this before it goes out to all of WG 4.  Once
>> we’re sure there are no typos/errors, we can either include it in the
>> minutes of the Bellevue meeting or I can just attach it to a reply to
>> Murata-san’s last e-mail on the subject to the WG 4 list.
>>
>>
>>
>> Pursuant to the revision, we will need to decide:
>>
>> ·         Whether we are OK with the differences introduced by RFC 7231
>>
>> ·         Whether we should rewrite the XSD regex pattern by translating
>> the RFC 7231 media-type definition (as Part 2 originally did with RFC 2616)
>>
>> ·         Whether we should then simplify the regex in some partial or
>> extreme way, within the limits of what XSD and RNG allow
>>
>>
>>
>> Thanks,
>>
>> John
>>
>>
>>
>> *From:* John Haug
>> *Sent:* Thursday, February 26, 2015 3:08 PM
>> *To:* 'Arms, Caroline'; 'Rex Jaeschke'; Makoto Murata; Chris Rae;
>> Francis Cave; Gareth Horton; Alex Brown; Rich McLain
>> *Subject:* RE: PLEASE PROOF: Day 3 Draft Minutes from the Seattle
>> Meeting of SC 34/WG4
>>
>>
>>
>> Re: the ST_ContentType regex: I am nearly done with a lengthy explanatory
>> break-it-down tutorial document that shows the derivation step by step!
>> Thanks much to Murata-san for all the initial investigation and to him and
>> Francis for talking through it on the screen yesterday at (painful) length.
>>
>>
>>
>> The short version is that the huge 6-line regex in Part 2 is a literal
>> translation of RFC 2616’s definition of media-type into an XSD pattern.  I
>> have that part done and am working on the differences between RFC 2616 and
>> RFC 7231 (and friends).  I think it’s reasonable to change the Part 2
>> normative reference to 7231 (and friends by reference from 7231) since it
>> has obsoleted 2616.  But we ought to understand and discuss the differences
>> before making a concrete decision on that.
>>
>>
>>
>> John
>>
>>
>>
>> *From:* Arms, Caroline [mailto:caar at loc.gov]
>> *Sent:* Thursday, February 26, 2015 2:56 PM
>> *To:* 'Rex Jaeschke'; Makoto Murata; John Haug; Chris Rae; Francis Cave;
>> Gareth Horton; Alex Brown; Rich McLain
>> *Subject:* RE: PLEASE PROOF: Day 3 Draft Minutes from the Seattle
>> Meeting of SC 34/WG4
>>
>>
>>
>> Rex,
>>
>>
>>
>> I think you should get a sentence or two from Murata-san or someone else
>> about what needs to be done to get the regular expression for the
>> ContentTypes schema fixed – the third point in Murata-san’s message.   I
>> wasn’t able to hear that discussion clearly enough.  I believe it was
>> decided that some testing might be needed – but maybe that was only
>> discussed but not decided.
>>
>>
>>
>>     Caroline
>>
>>
>>
>> *From:* Rex Jaeschke [mailto:rex at RexJaeschke.com]
>> *Sent:* Thursday, February 26, 2015 2:51 PM
>> *To:* Makoto Murata; Arms, Caroline; John Haug; Chris Rae; Francis Cave;
>> Gareth Horton; Alex Brown; Rich McLain
>> *Subject:* PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting of
>> SC 34/WG4
>>
>>
>>
>> Attached are the draft minutes as at the end of the meeting.
>>
>>
>>
>> Once I get some words on XAdES, I’ll send out the final draft to WG4 and
>> TC45.
>>
>>
>>
>> I’ll update the DR log to reflect the minutes, in the next hour.
>>
>>
>>
>> Rex
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>>
>> Praying for the victims of the Japan Tohoku earthquake
>>
>> Makoto
>>