Part 2 schema regex for ST_ContentType

MURATA Makoto eb2m-mrt at asahi-net.or.jp
Tue Apr 7 16:14:38 CEST 2015


John,

Thank you very much for this summary.

Re: Whether we are OK with the differences introduced by RFC 7231

I believe that we are OK.  Some of the changes introduced by
RFC 7231 are already reflected in the regular expression
for OPC @ContentTypes.

Re: Whether we should rewrite the XSD regex pattern by translating
the RFC 7231 media-type definition (as Part 2 originally did with RFC 2616)

Yes, I think that we should slightly adjust our regexp and make
it consistent with RFC 7231.  The current regular expression is
NOT aligned with RFC 2616 either.

Re: Whether we should then simplify the regex in some partial or
extreme way, within the limits of what XSD and RNG allow

I strongly believe that we should make the regexp readable.
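
A minimal sketch of what a readable version could look like, assuming the
pattern is assembled from named fragments that mirror the media-type ABNF in
RFC 7231 and the token and quoted-string rules in RFC 7230.  The fragment
names and the is_media_type helper are hypothetical, the sketch uses Python's
re syntax rather than XSD pattern syntax, and obs-text (octets 0x80-0xFF) is
deliberately left out.  It is not the Part 2 pattern.

```python
import re

# Named fragments following the ABNF of RFC 7230 / RFC 7231 (ASCII only).
TCHAR = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]"
TOKEN = TCHAR + "+"
OWS = r"[ \t]*"                                   # optional whitespace
QDTEXT = r"[\t \x21\x23-\x5B\x5D-\x7E]"           # no DQUOTE, no backslash
QUOTED_PAIR = r"\\[\t \x21-\x7E]"                 # backslash + HTAB/SP/VCHAR
QUOTED_STRING = '"(?:' + QDTEXT + "|" + QUOTED_PAIR + ')*"'
PARAMETER = TOKEN + "=(?:" + TOKEN + "|" + QUOTED_STRING + ")"
MEDIA_TYPE = TOKEN + "/" + TOKEN + "(?:" + OWS + ";" + OWS + PARAMETER + ")*"

MEDIA_TYPE_RE = re.compile(MEDIA_TYPE)


def is_media_type(value: str) -> bool:
    """Return True if the whole string matches the sketched grammar."""
    return MEDIA_TYPE_RE.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ("text/xml", 'application/xml; charset="utf-8"', "text"):
        print(sample, is_media_type(sample))
```

In XSD the same composition would need a macro mechanism such as XML
entities, since the pattern facet itself has no named sub-expressions.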

Regards,
Makoto

2015-03-17 8:11 GMT+09:00 John Haug <johnhaug at exchange.microsoft.com>:

>  Annex D (and E) have a very lengthy regular expression restricting the
> allowed values for ST_ContentType.  We looked at this in depth during the
> meetings in Bellevue, resulting from Murata-san’s December 2014 e-mail
> “Which RFC(s) for media type should we refer to?” and February 2015 e-mail
> “My proposals: content type and media types”.  There were some corrections
> to make and no clear answer as to what the regex is intended for.
>
>
>
> After the meetings, I took a similar look at the regex, compared it to RFC
> 2616 (the regex is intended to match the RFC), compared RFCs 2616 and 7231
> and documented everything in the attached.  (This includes a fix to the
> error Murata-san noted.)  The e-mail discussion is included below.
>
>
>
> 1. The questions I initially posed are still open:
>
> ·         Whether we are OK with the differences introduced by RFC 7231
>
> ·         Whether we should rewrite the XSD regex pattern by translating
> the RFC 7231 media-type definition (as Part 2 originally did with RFC 2616)
>
> ·         Whether we should then simplify the regex in some partial or
> extreme way, within the limits of what XSD and RNG allow
>
> 2. Francis’ question about \s+ in what I call Y still needs examination.
>
>
>
> For further discussion among the larger group!
>
> John
>
>
>
> *From:* eb2mmrt at gmail.com [mailto:eb2mmrt at gmail.com] *On Behalf Of *MURATA
> Makoto
> *Sent:* Thursday, March 12, 2015 10:15 PM
> *To:* John Haug
> *Cc:* Francis Cave; Arms, Caroline; Rex Jaeschke; Chris Rae; Gareth
> Horton; Alex Brown; Rich McLain; MURATA Makoto (FAMILY Given)
> *Subject:* Re: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting
> of SC 34/WG4
>
>
>
> John,
>
>
>
> First, let us correctly understand what we have right now.  I do not
> think this can be done without using some macro (or XML entities).
>
>
>
> Then, compare our (faithfully rewritten) regexp with RFC 2616, and
> then compare it with RFC 7231.
>
>
>
> John, please incorporate my change (as part of the first step) and
> post your document to the WG4 mailing list.
>
>
>
> Regards,
>
> Makoto
>
>
>
>
>
>
>
> 2015-03-11 8:51 GMT+09:00 John Haug <johnhaug at exchange.microsoft.com>:
>
>  (I’ve combined the split replies from Francis and Murata-san.)
>
>
>
>
>
> Re: Francis:
>
> My intent here wasn’t to propose a simplification to include in Part 2.  I
> just wanted to reverse engineer what Part 2 currently has and compare it to
> the RFCs.  I therefore did no reduction of the content.  Leaving it as is
> showed that Part 2 contained a quite literal translation of the RFC’s BNF
> to XSD regex syntax.  Your comment about \s in X is true, but I left it
> that way intentionally as part of the reverse engineering.
>
>
>
> Regarding \s+ in Y, I think they again literally translated qdtext, but
> perhaps not quite right?  Maybe \s+ should have been appended after the
> last item in the first bracketed expression in Y?
>
> Y: ([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"]]|\s+)
>
> qdtext         = <any TEXT except <">>
> TEXT           = <any OCTET except CTLs, but including LWS>
> OCTET          = <any 8-bit sequence of data>
> CTL            = <any US-ASCII control character (octets 0 - 31) and DEL
> (127)>
> LWS            = [CRLF] 1*( SP | HT )   ; linear white space
> CRLF           = CR LF
>
> OCTET limits us to 0-255.  CTLs removes 0-31 and 127.  LWS adds back in 9,
> 10, 13 (32 already allowed).  qdtext removes 34.
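
The arithmetic above can be checked with a short sketch.  It assumes each
octet is read as the code point of the same value, and it treats LWS as a set
of characters although the ABNF defines it as a sequence (CR and LF are only
legal as part of a folding CRLF).  The names are illustrative and do not come
from Part 2 or the RFC.

```python
# Code points allowed by qdtext in RFC 2616, following the arithmetic above.
OCTET = set(range(0, 256))
CTL = set(range(0, 32)) | {127}
LWS_CHARS = {9, 10, 13, 32}            # HT, LF, CR, SP
TEXT = (OCTET - CTL) | LWS_CHARS       # any OCTET except CTLs, plus LWS
QDTEXT = TEXT - {34}                   # any TEXT except the double quote

# Code points accepted by Y: the bracketed class, plus the \s+ alternative.
# \p{Cc} covers U+0000-U+001F, U+007F and also U+0080-U+009F.
Y_CLASS = (set(range(0x20, 0x7F)) | set(range(0xA0, 0x100))) - {34}
Y_CHARS = Y_CLASS | {9, 10, 13, 32}

# Code points that qdtext allows but Y rejects.
ONLY_IN_QDTEXT = sorted(QDTEXT - Y_CHARS)


def in_qdtext(ch: str) -> bool:
    """Return True if the character's code point is allowed by qdtext."""
    return ord(ch) in QDTEXT
```

Under these assumptions the only difference between the two sets is the
range 128-159, which RFC 2616 allows as TEXT but \p{Cc} removes.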
>
>
>
> > Note that U+00A0 is in the Latin-1 supplement. I'm not sure whether
> > this character should be explicitly excluded from X. It can presumably be
> > included in Y, as this is in a quoted string.
>
> I might be missing something.  The content represented by X doesn’t
> include anything above 127 (0x7F).
>
>
>
>
>
> Re: Murata-san:
>
> Ah, I think you’re right.  I believe my omitting that one set of
> parentheses makes the * apply only to the [\p{IsBasicLatin}] and not also
> to the \\.
>
>
>
>
>
> Given that I’m not trying to propose a simplification of the regex, or
> even that we should do so, assuming I’m correct above, shall I make
> Murata-san’s edit to the doc and send it to the WG 4 list purely as
> investigation into what Part 2 currently says?
>
>
>
> John
>
>
>  ------------------------------
>
> *From:* eb2mmrt at gmail.com [mailto:eb2mmrt at gmail.com] *On Behalf Of *MURATA
> Makoto
> *Sent:* Saturday, February 28, 2015 11:51 PM
> *To:* John Haug
> *Cc:* Arms, Caroline; Rex Jaeschke; Chris Rae; Francis Cave; Gareth
> Horton; Alex Brown; Rich McLain
> *Subject:* Re: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting
> of SC 34/WG4
>
>
>
> John,
>
>
>
> Nice work!
>
>
>
> I think that
>
>
>
>
> ("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"]]|\s+)|(\\[\p{IsBasicLatin}])*)")
>
>
>
> cannot be simplified to
>
>
>
> ("(Y | \\[\p{IsBasicLatin}]*)")
>
>
>
> Rather, it should become
>
>
>
> ("(Y|(\\[\p{IsBasicLatin}]))*")
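
The effect of where the * sits can be seen in a small sketch.  Python's re
module has neither \p{...} blocks nor character class subtraction, so the
classes below are hand-expanded ASCII and Latin-1 stand-ins, and \s is
written as the four characters XSD assigns to it.  These are assumptions of
the sketch, not the Part 2 text.

```python
import re

# Stand-in for Y: Basic Latin and Latin-1 Supplement minus \p{Cc} and the
# double quote, or a run of XSD whitespace.
Y = r'(?:[\x20\x21\x23-\x7E\xA0-\xFF]|[ \t\n\r]+)'
BASIC_LATIN = r'[\x00-\x7F]'

# "(Y | \\[BasicLatin]*)": the * binds only to the character class.
STAR_ON_CLASS = re.compile('"(?:' + Y + r'|\\' + BASIC_LATIN + '*)"')

# "(Y|(\\[BasicLatin]))*": the * repeats the whole alternation.
STAR_ON_GROUP = re.compile('"(?:' + Y + r'|(?:\\' + BASIC_LATIN + '))*"')

if __name__ == "__main__":
    for sample in ('""', '"a"', '"ab"', '"a\\"b"'):
        print(sample,
              bool(STAR_ON_CLASS.fullmatch(sample)),
              bool(STAR_ON_GROUP.fullmatch(sample)))
```

With the * on the class only, the quoted string can hold a single Y item or
a single backslash followed by any run of Basic Latin characters, so an
ordinary value such as "ab" is rejected.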
>
>
>
> Regards,
>
> Makoto
>
>
>
> *From:* Francis Cave [mailto:francis at franciscave.com]
> *Sent:* Saturday, February 28, 2015 4:37 AM
> *To:* John Haug; Arms, Caroline; 'Rex Jaeschke'; Makoto Murata; Chris
> Rae; Gareth Horton; Alex Brown; Rich McLain
> *Subject:* Re: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting
> of SC 34/WG4
>
>
>
> John
>
> Good job! I just have one niggle, which is with your use of \s in the
> definitions of both X and Y. The problem is that \s includes \t \n and \r,
> all of which are in \p{Cc}. The meaning of \s is not SPACE, but 'white
> space', where this includes all Unicode space characters (i.e. including
> U+0020 and U+00A0, but also presumably some other space characters), and
> also includes control characters TAB, CR and LF.
>
> In X this isn't so critical, because you're excluding \s, but that means
> that there is redundancy in the expression, because \t \n and \r are
> already excluded by excluding \p{Cc}.
>
> In Y the problem is more serious, because you are including \s+ as an
> alternative choice to the rest of the expression. Effectively this allows
> \t \n and \r in Y expressions that are white space only.
>
> I suspect that the only white space character that should be allowed in Y,
> is U+0020, i.e. the regular SPACE character. This would be the same as SP
> in the ABNF in RFC 7231.
>
> Note that U+00A0 is in the Latin-1 supplement. I'm not sure whether this
> character should be explicitly excluded from X. It can presumably be
> included in Y, as this is in a quoted string.
>
> Here is a list of Unicode spaces:
> https://www.cs.tut.fi/~jkorpela/chars/spaces.html. As this isn't an
> official list, I cannot be certain that this is accurate. My assumption is
> that \s includes all these.
>
> I am assuming that \p{Cc} includes control characters in the Unicode range
> U+0080 to U+009F, as well as the control characters in the basic ASCII
> range.
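
The category assumptions above can be checked against the Unicode character
database with a short sketch.  One caveat: as far as XML Schema Part 2
defines it, \s in an XSD pattern is only the four characters space, tab,
newline and carriage return, so U+00A0 and the other Unicode spaces are not
matched by it.  Python's own \s is broader, which is why the sketch spells
the set out instead of using a regex.  The helper names are illustrative.

```python
import unicodedata

# The characters matched by \s in an XSD regular expression.
XSD_WHITESPACE = {"\t", "\n", "\r", " "}


def is_control(ch: str) -> bool:
    """Return True if the character has general category Cc."""
    return unicodedata.category(ch) == "Cc"


# Members of \s that are also in \p{Cc}.
controls_in_s = sorted(ch for ch in XSD_WHITESPACE if is_control(ch))

# Whether every code point in U+0080..U+009F is a control character.
c1_all_control = all(is_control(chr(cp)) for cp in range(0x80, 0xA0))

# General category of NO-BREAK SPACE (U+00A0).
nbsp_category = unicodedata.category("\u00a0")
```

So excluding \p{Cc} already removes tab, LF and CR, while the \s+ alternative
in Y lets them back in.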
>
> Kind regards,
>
> Francis
>
>
>
>
> On 28/02/2015 00:31, John Haug wrote:
>
> I’ve attached what I came up with, which should fully explain the issue
> and be step-by-step enough to make it easier to ensure there are no
> typos/errors.  Please review this before it goes out to all of WG 4.  Once
> we’re sure there are no typos/errors, we can either include it in the
> minutes of the Bellevue meeting or I can just attach it to a reply to
> Murata-san’s last e-mail on the subject to the WG 4 list.
>
>
>
> Pursuant to the revision, we will need to decide:
>
> ·         Whether we are OK with the differences introduced by RFC 7231
>
> ·         Whether we should rewrite the XSD regex pattern by translating
> the RFC 7231 media-type definition (as Part 2 originally did with RFC 2616)
>
> ·         Whether we should then simplify the regex in some partial or
> extreme way, within the limits of what XSD and RNG allow
>
>
>
> Thanks,
>
> John
>
>
>
> *From:* John Haug
> *Sent:* Thursday, February 26, 2015 3:08 PM
> *To:* 'Arms, Caroline'; 'Rex Jaeschke'; Makoto Murata; Chris Rae; Francis
> Cave; Gareth Horton; Alex Brown; Rich McLain
> *Subject:* RE: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting
> of SC 34/WG4
>
>
>
> Re: the ST_ContentType regex: I am nearly done with a lengthy explanatory
> break-it-down tutorial document that shows the derivation step by step!
> Thanks much to Murata-san for all the initial investigation and to him and
> Francis for talking through it on the screen yesterday at (painful) length.
>
>
>
> The short version is that the huge 6-line regex in Part 2 is a literal
> translation of RFC 2616’s definition of media-type into an XSD pattern.  I
> have that part done and am working on the differences between RFC 2616 and
> RFC 7231 (and friends).  I think it’s reasonable to change the Part 2
> normative reference to 7231 (and friends by reference from 7231) since it
> has obsoleted 2616.  But we ought to understand and discuss the differences
> before making a concrete decision on that.
>
>
>
> John
>
>
>
> *From:* Arms, Caroline [mailto:caar at loc.gov]
> *Sent:* Thursday, February 26, 2015 2:56 PM
> *To:* 'Rex Jaeschke'; Makoto Murata; John Haug; Chris Rae; Francis Cave;
> Gareth Horton; Alex Brown; Rich McLain
> *Subject:* RE: PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting
> of SC 34/WG4
>
>
>
> Rex,
>
>
>
> I think you should get a sentence or two from Murata-san or someone else
> about what needs to be done to get the regular expression for the
> ContentTypes schema fixed – the third point in Murata-san’s message.   I
> wasn’t able to hear that discussion clearly enough.  I believe it was
> decided that some testing might be needed – but maybe that was only
> discussed but not decided.
>
>
>
>     Caroline
>
>
>
> *From:* Rex Jaeschke [mailto:rex at RexJaeschke.com]
> *Sent:* Thursday, February 26, 2015 2:51 PM
> *To:* Makoto Murata; Arms, Caroline; John Haug; Chris Rae; Francis Cave;
> Gareth Horton; Alex Brown; Rich McLain
> *Subject:* PLEASE PROOF: Day 3 Draft Minutes from the Seattle Meeting of
> SC 34/WG4
>
>
>
> Attached are the draft minutes as at the end of the meeting.
>
>
>
> Once I get some words on XAdES, I’ll send out the final draft to WG4 and
> TC45.
>
>
>
> I’ll update the DR log to reflect the minutes, in the next hour.
>
>
>
> Rex
>
>
>
>
>
>
>
>
>
>
>
> --
>
>
> Praying for the victims of the Japan Tohoku earthquake
>
> Makoto
>



-- 

Praying for the victims of the Japan Tohoku earthquake

Makoto