Part 2 schema regex for ST_ContentType

MURATA Makoto eb2m-mrt at asahi-net.or.jp
Thu Apr 9 09:07:50 CEST 2015


John,

Thank you very much for this summary.

*Re: Whether we are OK with the differences introduced by RFC 7231*

If there are no strong reasons not to use RFC 7231, we should use
it.  Moreover, our regular expression already adopts some new features
of RFC 7231.

*Re: Whether we should rewrite the XSD regex pattern by translating the
RFC 7231 media-type definition (as Part 2 originally did with RFC 2616)*

One possibility is to drop the regexp and to introduce prose that
references RFC 7231 normatively.  Our life would be easier, since we
would not be obliged to ensure equivalence of our regexp and RFC 7231.
The drawback is that people are unlikely to validate the value of
@ContentType themselves, and so might not notice errors.  But the value
of this attribute is almost always a token followed by "/" followed by
a token, so such errors are probably rare.

Another possibility is to create a regexp that faithfully captures
RFC 7231.  Our life will be harder and the regular expression
will not be easy to read (even if it is better than the present
regexp).  But syntactical errors will be captured by XML
validators.
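
To give a feeling for the size of such a regexp, here is a rough sketch
in Python of the media-type grammar as I read it in RFC 7231 (with token
and quoted-string taken from RFC 7230).  It is not XSD syntax: Python's
re module is used only because it is easy to run, and the names TCHAR,
MEDIA_TYPE and is_media_type are mine, not from any specification.

```python
import re

# Building blocks, following the ABNF of RFC 7230 / RFC 7231 as I read it.
TCHAR = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]"            # tchar
TOKEN = TCHAR + "+"                                # token = 1*tchar
QDTEXT = r"[\t \x21\x23-\x5b\x5d-\x7e\x80-\xff]"   # qdtext (incl. obs-text)
QUOTED_PAIR = r"\\[\t \x21-\x7e\x80-\xff]"         # "\" ( HTAB / SP / VCHAR / obs-text )
QUOTED_STRING = '"(?:' + QDTEXT + "|" + QUOTED_PAIR + ')*"'
PARAMETER = TOKEN + "=(?:" + TOKEN + "|" + QUOTED_STRING + ")"
OWS = r"[ \t]*"                                    # optional whitespace

# media-type = type "/" subtype *( OWS ";" OWS parameter )
MEDIA_TYPE = re.compile(
    TOKEN + "/" + TOKEN + "(?:" + OWS + ";" + OWS + PARAMETER + ")*"
)


def is_media_type(value: str) -> bool:
    """Return True if the whole string matches the sketch above."""
    return MEDIA_TYPE.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["application/xml", "text/html; charset=utf-8", "text/"]:
        print(sample, is_media_type(sample))
```

An XSD version would have to be spelled out as a single pattern, since
XSD has no way to name the sub-expressions.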

*Re: Whether we should then simplify the regex in some partial or
extreme way, within the limits of what XSD and RNG allow*

Unnecessary complexity is harmful, but should we create a loose
regexp only for readability?  I do not like this option, but
it might be a practical solution.
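
Just to illustrate what "loose" could mean, here is a hypothetical
pattern, again sketched in Python rather than XSD.  The pattern and the
name LOOSE are only my illustration, not a proposal: it checks the
overall shape (something, "/", something, optional parameters) and
nothing more.

```python
import re

# Hypothetical loose check: two non-empty parts separated by "/",
# optionally followed by ";" and unchecked parameter text.
LOOSE = re.compile(r"[^\s/;]+/[^\s/;]+(?:\s*;.*)?")


def looks_like_media_type(value: str) -> bool:
    """Return True if the value has the rough shape type/subtype[;params]."""
    return LOOSE.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["text/html; charset=utf-8", "text", "te(xt/plain"]:
        print(sample, looks_like_media_type(sample))
```

Note that the last sample is accepted although "(" is not a token
character, which is exactly the price of looseness.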

Regards,
Makoto

2015-03-17 8:11 GMT+09:00 John Haug <johnhaug at exchange.microsoft.com>:

>  Annex D (and E) have a very lengthy regular expression restricting the
> allowed values for ST_ContentType.  We looked at this in depth during the
> meetings in Bellevue, resulting from Murata-san’s December 2014 e-mail
> “Which RFC(s) for media type should we refer to?” and February 2015 e-mail
> “My proposals: content type and media types”.  There were some corrections
> to make and no clear answer as to what the regex is intended for.
>
>
>
> After the meetings, I took a similar look at the regex, compared it to RFC
> 2616 (the regex is intended to match the RFC), compared RFCs 2616 and 7231
> and documented everything in the attached.  (This includes a fix to the
> error Murata-san noted.)  The e-mail discussion is included below.
>
>
>
> 1. The questions I initially posed are still open:
>
> ·         Whether we are OK with the differences introduced by RFC 7231
>
> ·         Whether we should rewrite the XSD regex pattern by translating
> the RFC 7231 media-type definition (as Part 2 originally did with RFC 2616)
>
> ·         Whether we should then simplify the regex in some partial or
> extreme way, within the limits of what XSD and RNG allow
>
> 2. Francis’ question about \s+ in what I call Y still needs examination.
>
>
>
> For further discussion among the larger group!
>
> John
>
>
>
> *From:* eb2mmrt at gmail.com [mailto:eb2mmrt at gmail.com] *On Behalf Of *MURATA
> Makoto
> *Sent:* Thursday, March 12, 2015 10:15 PM
> *To:* John Haug
> *Cc:* Francis Cave; Arms, Caroline; Rex Jaeschke; Chris Rae; Gareth
> Horton; Alex Brown; Rich McLain; MURATA Makoto (FAMILY Given)
> *Subject:* Re: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting
> of SC 34/WG4
>
>
>
> John,
>
>
>
> First, let us correctly understand what we have right now.  I do not
>
> think this can be done without using some macro (or XML entities).
>
>
>
> Then, compare our (faithfully rewritten) regexp with RFC 2616, and
> then compare it with RFC 7231.
>
>
>
> John, please incorporate my change (as part of the first step) and
>
> post your document to the WG4 mailing list.
>
>
>
> Regards,
>
> Makoto
>
>
>
>
>
>
>
> 2015-03-11 8:51 GMT+09:00 John Haug <johnhaug at exchange.microsoft.com>:
>
>  (I’ve combined the split replies from Francis and Murata-san.)
>
>
>
>
>
> Re: Francis:
>
> My intent here wasn’t to propose a simplification to include in Part 2.  I
> just wanted to reverse engineer what Part 2 currently has and compare it to
> the RFCs.  I therefore did no reduction of the content.  Leaving it as is
> showed that Part 2 contained a quite literal translation of the RFC’s BNF
> to XSD regex syntax.  Your comment about \s in X is true, but I left it
> that way intentionally as part of the reverse engineering.
>
>
>
> Regarding \s+ in Y, I think they again literally translated qdtext, but
> perhaps not quite right?  Maybe \s+ should have been appended after the
> last item in the first bracketed expression in Y?
>
> Y: ([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"]]|\s+)
>
> qdtext         = <any TEXT except <">>
> TEXT           = <any OCTET except CTLs, but including LWS>
> OCTET          = <any 8-bit sequence of data>
> CTL            = <any US-ASCII control character (octets 0 - 31) and DEL
> (127)>
> LWS            = [CRLF] 1*( SP | HT )   ; linear white space
> CRLF           = CR LF
>
> OCTET limits us to 0-255.  CTLs removes 0-31 and 127.  LWS adds back in 9,
> 10, 13 (32 already allowed).  qdtext removes 34.
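
The arithmetic above can be checked mechanically.  The following is a
small Python sketch; it assumes that each octet is read as the Unicode
code point of the same value (Latin-1), and the variable names are mine.
It also compares the result with what Y allows, taking \s in XSD to be
space, tab, LF and CR.

```python
import unicodedata

octets = set(range(256))            # OCTET: 0-255
ctls = set(range(32)) | {127}       # CTL: 0-31 and DEL (127)
lws = {9, 10, 13, 32}               # characters that can occur in LWS
text = (octets - ctls) | lws        # TEXT: OCTET except CTLs, plus LWS
qdtext = text - {34}                # qdtext: TEXT except the double quote

# Y: Basic Latin + Latin-1 Supplement, minus \p{Cc} and '"', or \s+.
bracket = {
    c for c in range(256)
    if unicodedata.category(chr(c)) != "Cc" and c != 34
}
y_chars = bracket | {9, 10, 13, 32}

print(len(qdtext))                  # number of octets allowed by qdtext
print(sorted(qdtext - y_chars))     # allowed by qdtext but not by Y
```

Under these assumptions the only difference is the range 128-159,
which \p{Cc} removes but the RFC 2616 definition of CTL does not.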
>
>
>
> > Note that U+00A0 is in the Latin-1 supplement. I'm not sure whether
> this character should be explicitly excluded from X. It can presumably be
> included in Y, as this is in a quoted string.
>
> I might be missing something.  The content represented by X doesn’t
> include anything above 127 (0x7F).
>
>
>
>
>
> Re: Murata-san:
>
> Ah, I think you’re right.  I believe my omitting that one set of
> parentheses makes the * apply only to the [\p{IsBasicLatin}] and not also
> to the \\.
>
>
>
>
>
> Given that I’m not trying to propose a simplification of the regex, or
> even that we should do so, assuming I’m correct above, shall I make
> Murata-san’s edit to the doc and send it to the WG 4 list purely as
> investigation into what Part 2 currently says?
>
>
>
> John
>
>
>  ------------------------------
>
> *From:* eb2mmrt at gmail.com [mailto:eb2mmrt at gmail.com] *On Behalf Of *MURATA
> Makoto
> *Sent:* Saturday, February 28, 2015 11:51 PM
> *To:* John Haug
> *Cc:* Arms, Caroline; Rex Jaeschke; Chris Rae; Francis Cave; Gareth
> Horton; Alex Brown; Rich McLain
> *Subject:* Re: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting
> of SC 34/WG4
>
>
>
> John,
>
>
>
> Nice work!
>
>
>
> I think that
>
>
>
>
> ("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"]]|\s+)|(
> \\[\p{IsBasicLatin}])*)")
>
>
>
> cannot be simplified to
>
>
>
> ("(Y | \\[\p{IsBasicLatin}]*)")
>
>
>
> Rather, it should become
>
>
>
> ("(Y|(\\[\p{IsBasicLatin}]))*")
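
The difference between the two groupings can be seen with a small
Python sketch.  Python's re module supports neither \p{IsBasicLatin}
nor character class subtraction, so Y and the quoted pair are replaced
by explicit ranges that I believe are equivalent; the names WRONG and
RIGHT are only labels for this illustration.

```python
import re

# Stand-ins for the XSD pieces (assumed equivalents, not XSD syntax):
# Y = Basic Latin + Latin-1 Supplement minus \p{Cc} and '"', or \s+
Y = r"(?:[\x20\x21\x23-\x7e\xa0-\xff]|[ \t\n\r]+)"
PAIR = r"\\[\x00-\x7f]"             # backslash + any Basic Latin character

# "(Y | \\[\p{IsBasicLatin}]*)": the * applies only to the bracket.
WRONG = re.compile('"(?:' + Y + "|" + PAIR + '*)"')
# "(Y|(\\[\p{IsBasicLatin}]))*": the * applies to the whole choice.
RIGHT = re.compile('"(?:' + Y + "|" + PAIR + ')*"')

if __name__ == "__main__":
    for sample in ['""', '"ab"', '"a\\"b"']:
        print(sample,
              WRONG.fullmatch(sample) is not None,
              RIGHT.fullmatch(sample) is not None)
```

With the * inside the group, a quoted string can hold at most one Y
item or one backslash sequence; with the * outside, it can hold any
mixture of them, which is what quoted-string requires.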
>
>
>
> Regards,
>
> Makoto
>
>
>
> *From:* Francis Cave [mailto:francis at franciscave.com]
> *Sent:* Saturday, February 28, 2015 4:37 AM
> *To:* John Haug; Arms, Caroline; 'Rex Jaeschke'; Makoto Murata; Chris
> Rae; Gareth Horton; Alex Brown; Rich McLain
> *Subject:* Re: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting
> of SC 34/WG4
>
>
>
> John
>
> Good job! I just have one niggle, which is with your use of \s in the
> definitions of both X and Y. The problem is that \s includes \t \n and \r,
> all of which are in \p{Cc}. The meaning of \s is not SPACE, but 'white
> space', where this includes all Unicode space characters (i.e. including
> U+0020 and U+00A0, but also presumably some other space characters), and
> also includes control characters TAB, CR and LF.
>
> In X this isn't so critical, because you're excluding \s, but that means
> that there is redundancy in the expression, because \t \n and \r are
> already excluded by excluding \p{Cc}.
>
> In Y the problem is more serious, because you are including \s+ as an
> alternative choice to the rest of the expression. Effectively this allows
> \t \n and \r in Y expressions that are white space only.
>
> I suspect that the only white space character that should be allowed in Y,
> is U+0020, i.e. the regular SPACE character. This would be the same as SP
> in the ABNF in RFC 7231.
>
> Note that U+00A0 is in the Latin-1 supplement. I'm not sure whether this
> character should be explicitly excluded from X. It can presumably be
> included in Y, as this is in a quoted string.
>
> Here is a list of Unicode spaces:
> https://www.cs.tut.fi/~jkorpela/chars/spaces.html. As this isn't an
> official list, I cannot be certain that this is accurate. My assumption is
> that \s includes all these.
>
> I am assuming that \p{Cc} includes control characters in the Unicode range
> U+0080 to U+009F, as well as the control characters in the basic ASCII
> range.
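
The assumptions above can be checked against the Unicode character
database, for example with Python's unicodedata module.  One caveat: as
far as I understand XML Schema Part 2, \s in an XSD regular expression
is only the four characters space, tab, LF and CR, not every Unicode
space, so the sketch below uses that set (the name XSD_S is mine).

```python
import unicodedata

# \s in XSD regular expressions, as I understand XML Schema Part 2.
XSD_S = " \t\n\r"

# General category of each character matched by \s.
categories = {ch: unicodedata.category(ch) for ch in XSD_S}

if __name__ == "__main__":
    for ch, cat in categories.items():
        print(repr(ch), cat)
    # NO-BREAK SPACE and a control character from the U+0080-U+009F range.
    print(repr("\u00a0"), unicodedata.category("\u00a0"))
    print(repr("\u0085"), unicodedata.category("\u0085"))
```

Tab, LF and CR are in category Cc, so excluding \p{Cc} already excludes
them, while the \s+ alternative in Y lets them back in.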
>
> Kind regards,
>
> Francis
>
>
>
>
> On 28/02/2015 00:31, John Haug wrote:
>
> I’ve attached what I came up with, which should fully explain the issue
> and be step-by-step enough to make it easier to ensure there are no
> typos/errors.  Please review this before it goes out to all of WG 4.  Once
> we’re sure there are no typos/errors, we can either include it in the
> minutes of the Bellevue meeting or I can just attach it to a reply to
> Murata-san’s last e-mail on the subject to the WG 4 list.
>
>
>
> Pursuant to the revision, we will need to decide:
>
> ·         Whether we are OK with the differences introduced by RFC 7231
>
> ·         Whether we should rewrite the XSD regex pattern by translating
> the RFC 7231 media-type definition (as Part 2 originally did with RFC 2616)
>
> ·         Whether we should then simplify the regex in some partial or
> extreme way, within the limits of what XSD and RNG allow
>
>
>
> Thanks,
>
> John
>
>
>
> *From:* John Haug
> *Sent:* Thursday, February 26, 2015 3:08 PM
> *To:* 'Arms, Caroline'; 'Rex Jaeschke'; Makoto Murata; Chris Rae; Francis
> Cave; Gareth Horton; Alex Brown; Rich McLain
> *Subject:* RE: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting
> of SC 34/WG4
>
>
>
> Re: the ST_ContentType regex: I am nearly done with a lengthy explanatory
> break-it-down tutorial document that shows the derivation step by step!
> Thanks much to Murata-san for all the initial investigation and to him and
> Francis for talking through it on the screen yesterday at (painful) length.
>
>
>
> The short version is that the huge 6-line regex in Part 2 is a literal
> translation of RFC 2616’s definition of media-type into an XSD pattern.  I
> have that part done and am working on the differences between RFC 2616 and
> RFC 7231 (and friends).  I think it’s reasonable to change the Part 2
> normative reference to 7231 (and friends by reference from 7231) since it
> has obsoleted 2616.  But we ought to understand and discuss the differences
> before making a concrete decision on that.
>
>
>
> John
>
>
>
> *From:* Arms, Caroline [mailto:caar at loc.gov <caar at loc.gov>]
> *Sent:* Thursday, February 26, 2015 2:56 PM
> *To:* 'Rex Jaeschke'; Makoto Murata; John Haug; Chris Rae; Francis Cave;
> Gareth Horton; Alex Brown; Rich McLain
> *Subject:* RE: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting
> of SC 34/WG4
>
>
>
> Rex,
>
>
>
> I think you should get a sentence or two from Murata-san or someone else
> about what needs to be done to get the regular expression for the
> ContentTypes schema fixed – the third point in Murata-san’s message.   I
> wasn’t able to hear that discussion clearly enough.  I believe it was
> decided that some testing might be needed – but maybe that was only
> discussed but not decided.
>
>
>
>     Caroline
>
>
>
> *From:* Rex Jaeschke [mailto:rex at RexJaeschke.com <rex at RexJaeschke.com>]
> *Sent:* Thursday, February 26, 2015 2:51 PM
> *To:* Makoto Murata; Arms, Caroline; John Haug; Chris Rae; Francis Cave;
> Gareth Horton; Alex Brown; Rich McLain
> *Subject:* PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting of
> SC 34/WG4
>
>
>
> Attached are the draft minutes as at the end of the meeting.
>
>
>
> Once I get some words on XAdES, I’ll send out the final draft to WG4 and
> TC45.
>
>
>
> I’ll update the DR log to reflect the minutes, in the next hour.
>
>
>
> Rex
>
>
>
>
>
>
>
>
>
>
>
> --
>
>
> Praying for the victims of the Japan Tohoku earthquake
>
> Makoto
>



-- 

Praying for the victims of the Japan Tohoku earthquake

Makoto