Which RFC(s) for media type should we refer to?

MURATA Makoto eb2m-mrt at asahi-net.or.jp
Sun Nov 2 14:33:21 CET 2014


Folks,

The regular expression in opc-contentTypes.xsd has another interesting
subexpression:

("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"

This subexpression matches a doubly-quoted string.  But which characters
are allowed as part of this doubly-quoted string? Is this
subexpression consistent with RFC 7230?

First, unlike the first regular expression,
neither \(\)<>@,;:\\ nor /\[\]\?=\{\}\s\t are excluded.
This is because RFCs 2045 and 7230 allow tspecials as part of
doubly-quoted strings.  This is nice.

Second, escaped characters such as \a are always allowed by
(\\[\p{IsBasicLatin}]) as long as the escaped character is from #x0000
to #x007F.  RFC 7230 (quoted-pair) allows more characters (#x0080 to
#x00FF) to be escaped, but does not allow control characters other than
HTAB to be escaped.  Thus, there is a discrepancy here.

Third, \p{IsLatin-1Supplement} represents characters
from #x0080 to #x00FF.  See
http://www.w3.org/TR/xmlschema-2/#nt-charClassEsc
This matches obs-text in RFC 7230, except that the subtraction of
\p{Cc} also removes #x0080 to #x009F.

Fourth, RFC 7230 (to be precise, qdtext) does not allow REVERSE
SOLIDUS, but the subexpression does.  Thus, we have another
discrepancy.

     qdtext         = HTAB / SP / %x21 / %x23-5B / %x5D-7E / obs-text
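
The second to fourth points can be checked mechanically.  Below is a
minimal Python sketch, not a definitive analysis.  It assumes that RFC
7230 octets and XSD characters are compared via ISO-8859-1, and it
transcribes the character sets by hand from the subexpression and from
the qdtext, quoted-pair and obs-text rules.  All variable names are
mine.

```python
# Assumption: octets in RFC 7230 map to characters via ISO-8859-1,
# so octet 0xE9 corresponds to U+00E9.
CC = set(range(0x00, 0x20)) | set(range(0x7F, 0xA0))   # \p{Cc}
XSD_WS = {0x09, 0x0A, 0x0D, 0x20}                      # XSD \s
LATIN1 = set(range(0x00, 0x100))   # IsBasicLatin + IsLatin-1Supplement

# Characters allowed unescaped between the quotes.
xsd_plain = (LATIN1 - CC - {0x22, 0x0A, 0x0D}) | XSD_WS
rfc_qdtext = ({0x09, 0x20, 0x21} | set(range(0x23, 0x5C))
              | set(range(0x5D, 0x7F)) | set(range(0x80, 0x100)))

# Characters allowed after a backslash.
xsd_escapable = set(range(0x00, 0x80))                 # \\[\p{IsBasicLatin}]
rfc_escapable = ({0x09, 0x20} | set(range(0x21, 0x7F))
                 | set(range(0x80, 0x100)))            # quoted-pair


def show(label, code_points):
    """Print a label followed by the sorted code points in hex."""
    print(label, " ".join(f"{c:02X}" for c in sorted(code_points)))


show("unescaped, XSD only:", xsd_plain - rfc_qdtext)     # 0A 0D 5C
show("unescaped, RFC only:", rfc_qdtext - xsd_plain)     # 80 .. 9F
show("escaped, XSD only:  ", xsd_escapable - rfc_escapable)
show("escaped, RFC only:  ", rfc_escapable - xsd_escapable)
```

Note that (\s+) puts LF and CR back in, although the character class
itself subtracts \n and \r.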

Regards,
Makoto

2014-11-02 16:56 GMT+09:00 MURATA Makoto <eb2m-mrt at asahi-net.or.jp>:

> Oops, I forgot to point out that \{ and \} are disallowed by
> our regular expression, but they are not tspecials as
> specified in RFC 2045.   RFC 7230 does
> not allow these two characters in a token, and its tchar exactly
> matches the enumerated list in my previous mail.
>
> https://tools.ietf.org/html/rfc7230#section-3.2.6
>
> One could say that our regular expression is already
> aligned with RFC 7230 rather than RFC 2045.
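
The difference between the two token definitions can be computed
directly.  This is a small Python sketch under the assumption that I
transcribed tspecials (RFC 2045) and tchar (RFC 7230, section 3.2.6)
correctly.  The variable names are mine.

```python
import string

ASCII = {chr(c) for c in range(0x80)}
CTLS = {chr(c) for c in range(0x20)} | {chr(0x7F)}

# RFC 2045: token := 1*<any (US-ASCII) CHAR except SPACE, CTLs, or tspecials>
TSPECIALS_2045 = set('()<>@,;:\\"/[]?=')
token_2045 = ASCII - CTLS - {" "} - TSPECIALS_2045

# RFC 7230: tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" /
#                   "." / "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
tchar_7230 = (set("!#$%&'*+-.^_`|~")
              | set(string.digits) | set(string.ascii_letters))

print(sorted(token_2045 - tchar_7230))   # ['{', '}']
print(sorted(tchar_7230 - token_2045))   # []
```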
>
> Regards,
> Makoto
>
> 2014-11-02 16:15 GMT+09:00 MURATA Makoto <eb2m-mrt at asahi-net.or.jp>:
>
>> Folks,
>>
>> I pointed out that
>>
>>    [\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]
>>
>> is used repeatedly.  This appears to represent characters in
>>
>>       token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
>>                  or tspecials>
>>
>> where
>>
>>
>>      tspecials :=  "(" / ")" / "<" / ">" / "@" /
>>                    "," / ";" / ":" / "\" / <">
>>                    "/" / "[" / "]" / "?" / "="
>>                    ; Must be in quoted-string,
>>                    ; to use within parameter values
>>
>> They both represent any of the following characters:
>>
>> - 0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
>> - 0023;NUMBER SIGN;Po;0;ET;;;;;N;;;;;
>> - 0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;
>> - 0025;PERCENT SIGN;Po;0;ET;;;;;N;;;;;
>> - 0026;AMPERSAND;Po;0;ON;;;;;N;;;;;
>> - 0027;APOSTROPHE;Po;0;ON;;;;;N;APOSTROPHE-QUOTE;;;;
>> - 002A;ASTERISK;Po;0;ON;;;;;N;;;;;
>> - 002B;PLUS SIGN;Sm;0;ES;;;;;N;;;;;
>> - 002D;HYPHEN-MINUS;Pd;0;ES;;;;;N;;;;;
>> - 002E;FULL STOP;Po;0;CS;;;;;N;PERIOD;;;;
>> - 0-9
>> - A-Z
>> - 005E;CIRCUMFLEX ACCENT;Sk;0;ON;;;;;N;SPACING CIRCUMFLEX;;;;
>> - 005F;LOW LINE;Pc;0;ON;;;;;N;SPACING UNDERSCORE;;;;
>> - 0060;GRAVE ACCENT;Sk;0;ON;;;;;N;SPACING GRAVE;;;;
>> - a-z
>> - 007C;VERTICAL LINE;Sm;0;ON;;;;;N;VERTICAL BAR;;;;
>> - 007E;TILDE;Sm;0;ON;;;;;N;;;;;
>>
>>
>> The regular expression allows any of these characters as
>> part of a top-level media type name, a second-level
>> media type name, and a parameter name.
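
The list above can be reproduced by enumeration.  Python's re module
supports neither \p{...} blocks nor character class subtraction, so the
sketch below is a hand translation of the XSD class.  It uses a negative
lookahead for the subtracted part and explicit ranges for
\p{IsBasicLatin} and \p{Cc}.  The translation is my assumption, not
something taken from the schema.

```python
import re

# [\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]] by hand:
# "not one of the subtracted characters, and within U+0000..U+007F".
X = re.compile(r'(?![\x00-\x1f\x7f()<>@,;:\\"/\[\]?={} \t\n\r])[\x00-\x7f]')

matched = "".join(chr(c) for c in range(0x100) if X.fullmatch(chr(c)))
print(matched)
print(len(matched))   # 77
```

The printed characters are the same as in the list above.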
>>
>> Regards,
>> Makoto
>>
>> 2014-10-25 22:08 GMT+09:00 MURATA Makoto <eb2m-mrt at asahi-net.or.jp>:
>>
>>> Caroline,
>>>
>>> Thank you for your thorough study!  This is an
>>> eye opener.
>>>
>>> Both RFC 2616 and RFC 7231 allow the use of doubly-quoted
>>> strings and single-octet quoting by \.
>>>
>>> OPC uses content types as part of [Content_Types].xml.
>>> The XSD schema for this document is opc-contentTypes.xsd.
>>> It has an ugly regular expression:
>>>
>>>
>>> "(((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))/((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))((\s+)*;(\s+)*(((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))=((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*)"
>>>
>>> It is not at all clear whether this is equivalent to RFC 2616,
>>> especially because XML has its own mechanism for character
>>> escaping (&#x) and also because double quotation marks
>>> cannot be used within doubly-quoted attribute values.
>>>
>>> I tried to reformulate the above regular expression.  First,
>>>
>>> [\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]
>>>
>>> appears repeatedly.  If we represent this string by an internal
>>> text entity X by introducing
>>>
>>> <!ENTITY X "[\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]">
>>>
>>> the entire expression will become
>>>
>>>
>>> "(((($X)+))/((($X)+))((\s+)*;(\s+)*(((($X)+))=((($X)+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*)"
>>>
>>>
>>> By removing unnecessary parentheses, this can be rewritten as
>>>
>>> "$X+/$X+(\s*;\s*($X+=(($X+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*"
>>>
>>> This looks similar to what RFC 2616 defines.  But
>>> are they equivalent?
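
One way to probe the question is to try the simplified expression on
sample strings.  The Python sketch below is a hand translation and only
an approximation of the XSD semantics.  X is enumerated rather than
written with class subtraction, \s is spelled out as the four XSD
whitespace characters, and the name xsd_accepts is mine.  It shows
behaviour on a few inputs and does not prove equivalence either way.

```python
import re

# $X: the token class, enumerated by hand from the XSD character class.
X = r"[!#$%&'*+\-.0-9A-Z^_`a-z|~]"
# XSD \s is exactly space, tab, LF and CR.
S = r"[ \t\n\r]"
# Body of the quoted string: Latin-1 minus \p{Cc} and '"', or XSD
# whitespace, or a backslash followed by any Basic Latin character.
BODY = r'(?:[\x20\x21\x23-\x7e\xa0-\xff]|[\t\n\r]|\\[\x00-\x7f])'
CONTENT_TYPE = re.compile(
    rf'{X}+/{X}+(?:{S}*;{S}*{X}+=(?:{X}+|"{BODY}*"))*'
)


def xsd_accepts(text):
    """Return True if the whole string matches the translated pattern."""
    return CONTENT_TYPE.fullmatch(text) is not None


print(xsd_accepts("text/html;charset=utf-8"))       # True
print(xsd_accepts('text/html; charset="utf-8"'))    # True
print(xsd_accepts("text/html;charset = utf-8"))     # False
print(xsd_accepts("text/{html}"))                   # False
print(xsd_accepts("text/html;\ncharset=utf-8"))     # True
```

The last input is accepted because \s includes LF, whereas OWS in RFC
7230 is limited to SP and HTAB.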
>>>
>>> Regards,
>>> Makoto
>>>
>>>
>>>
>>> 2014-10-21 6:04 GMT+09:00 Arms, Caroline <caar at loc.gov>:
>>>
>>>> All,
>>>>
>>>> I started back on the Content type vs. Media type issue and ran into
>>>> the question of which RFC(s) we should refer to, thinking that would be a
>>>> good place to start thinking about rewording things.  It's not so simple!
>>>>
>>>> Part 2 currently refers to RFC 2616, which may not have been the most
>>>> appropriate RFC.  That is now moot, because RFC 2616 is obsolete and has
>>>> been replaced by a group of RFCs including RFC 7231.  RFC 7231 refers to
>>>> RFC 2046 in its Media Type subclause but does not elaborate on what a
>>>> media type actually is.  RFC 7231 provides ABNF for media-type, but you need to refer to RFC
>>>> 7230 for an explanation of "OWS" -- used in the ABNF.  RFC 2046 lists the
>>>> top-level media types and common subtypes.  It discusses parameters.  Its
>>>> introduction refers to RFC 2045 for the Content Type context and to RFC 822
>>>> for all relevant ABNF not found in its Appendix A: Collected Grammar.
>>>> Media-type is not mentioned in Appendix A.  RFC 2045 has a copy of the
>>>> relevant ABNF from RFC 822.
>>>>
>>>> More detailed detective work with URLs is attached below.
>>>>
>>>> The question will be how best to refer to this in Part 2.   RFC 7231 is
>>>> most convenient for getting the ABNF syntax, but you need RFC 2046 to
>>>> understand the semantics.
>>>>
>>>>    To be continued, no doubt ...
>>>>
>>>>    Caroline
>>>>
>>>> Caroline Arms
>>>> Library of Congress Contractor
>>>> Co-compiler of Sustainability of Digital Formats resource
>>>> http://www.digitalpreservation.gov/formats/
>>>>
>>>> ** Views expressed are personal and not necessarily those of the
>>>> institution **
>>>>
>>>> ==== DETAILED detective work ====
>>>>
>>>> Part 2 currently refers to RFC 2616
>>>>
>>>> https://www.mnot.net/blog/2014/06/07/rfc2616_is_dead
>>>>
>>>> http://www.rfc-editor.org/info/rfc2616  is marked as obsolete
>>>>
>>>> So I went to one of the replacement RFCs
>>>>
>>>> http://tools.ietf.org/html/rfc7231
>>>>
>>>> 3.1.1.1. Media Type
>>>>
>>>>    HTTP uses Internet media types [RFC2046] in the Content-Type
>>>>    (Section 3.1.1.5) and Accept (Section 5.3.2) header fields in order
>>>>    to provide open and extensible data typing and type negotiation.
>>>>    Media types define both a data format and various processing models:
>>>>    how to process that data in accordance with each context in which it
>>>>    is received.
>>>>
>>>>      media-type = type "/" subtype *( OWS ";" OWS parameter )
>>>>      type       = token
>>>>      subtype    = token
>>>>
>>>>    The type/subtype MAY be followed by parameters in the form of
>>>>    name=value pairs.
>>>>
>>>>      parameter      = token "=" ( token / quoted-string )
>>>>
>>>>    The type, subtype, and parameter name tokens are case-insensitive.
>>>>    Parameter values might or might not be case-sensitive, depending on
>>>>    the semantics of the parameter name.  The presence or absence of a
>>>>    parameter might be significant to the processing of a media-type,
>>>>    depending on its definition within the media type registry.
>>>>
>>>>    A parameter value that matches the token production can be
>>>>    transmitted either as a token or within a quoted-string.  The quoted
>>>>    and unquoted values are equivalent.  For example, the following
>>>>    examples are all equivalent, but the first is preferred for
>>>>    consistency:
>>>>
>>>>      text/html;charset=utf-8
>>>>      text/html;charset=UTF-8
>>>>      Text/HTML;Charset="utf-8"
>>>>      text/html; charset="utf-8"
>>>>
>>>>    Internet media types ought to be registered with IANA according to
>>>>    the procedures defined in [BCP13].
>>>>
>>>>       Note: Unlike some similar constructs in other header fields, media
>>>>       type parameters do not allow whitespace (even "bad" whitespace)
>>>>       around the "=" character.
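
The ABNF quoted above is small enough to transcribe.  Below is a Python
sketch of a parser for media-type, assuming the token and quoted-string
rules of RFC 7230, section 3.2.6.  The function name parse_media_type
and the decision to lowercase the type, subtype and parameter names are
my own choices for illustration.

```python
import re

TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# quoted-string = DQUOTE *( qdtext / quoted-pair ) DQUOTE
QUOTED_STRING = (r'"(?:[\t \x21\x23-\x5b\x5d-\x7e\x80-\xff]'
                 r'|\\[\t \x21-\x7e\x80-\xff])*"')
OWS = r"[ \t]*"

PARAM_RE = re.compile(
    rf"{OWS};{OWS}(?P<name>{TOKEN})=(?P<value>{TOKEN}|{QUOTED_STRING})"
)
MEDIA_TYPE_RE = re.compile(
    rf"(?P<type>{TOKEN})/(?P<subtype>{TOKEN})"
    rf"(?P<params>(?:{OWS};{OWS}{TOKEN}=(?:{TOKEN}|{QUOTED_STRING}))*)"
)


def parse_media_type(text):
    """Return (type, subtype, params) or None if text is not a media-type."""
    m = MEDIA_TYPE_RE.fullmatch(text)
    if m is None:
        return None
    params = {}
    for p in PARAM_RE.finditer(m.group("params")):
        value = p.group("value")
        if value.startswith('"'):
            # Strip the quotes and undo each quoted-pair.
            value = re.sub(r"\\(.)", r"\1", value[1:-1])
        params[p.group("name").lower()] = value
    return m.group("type").lower(), m.group("subtype").lower(), params


print(parse_media_type('Text/HTML;Charset="utf-8"'))
print(parse_media_type("text/html;charset = utf-8"))   # None
```

The parameter value is left untouched, since its case sensitivity
depends on the parameter.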
>>>>
>>>> ===  aside on OWS  -- optional whitespace ===
>>>>
>>>>     OWS           = <OWS, see [RFC7230], Section 3.2.3>
>>>>
>>>> http://tools.ietf.org/html/rfc7230#section-3.2.3
>>>>
>>>> 3.2.3. Whitespace
>>>>
>>>>    This specification uses three rules to denote the use of linear
>>>>    whitespace: OWS (optional whitespace), RWS (required whitespace), and
>>>>    BWS ("bad" whitespace).
>>>>
>>>>    The OWS rule is used where zero or more linear whitespace octets
>>>>    might appear.  For protocol elements where optional whitespace is
>>>>    preferred to improve readability, a sender SHOULD generate the
>>>>    optional whitespace as a single SP; otherwise, a sender SHOULD NOT
>>>>    generate optional whitespace except as needed to white out invalid or
>>>>    unwanted protocol elements during in-place message filtering.
>>>>
>>>>    The RWS rule is used when at least one linear whitespace octet is
>>>>    required to separate field tokens.  A sender SHOULD generate RWS as a
>>>>    single SP.
>>>>
>>>>    The BWS rule is used where the grammar allows optional whitespace
>>>>    only for historical reasons.  A sender MUST NOT generate BWS in
>>>>    messages.  A recipient MUST parse for such bad whitespace and remove
>>>>    it before interpreting the protocol element.
>>>>
>>>>      OWS            = *( SP / HTAB )
>>>>                     ; optional whitespace
>>>>      RWS            = 1*( SP / HTAB )
>>>>                     ; required whitespace
>>>>      BWS            = OWS
>>>>                     ; "bad" whitespace
>>>>
>>>> ==== end of OWS digression
>>>>
>>>>
>>>> http://tools.ietf.org/html/rfc2046
>>>>
>>>> Multipurpose Internet Mail Extensions (MIME) Part Two:  Media Types
>>>>
>>>> Introduction
>>>>
>>>>    The first document in this set, RFC 2045, defines a number of header
>>>>    fields, including Content-Type. The Content-Type field is used to
>>>>    specify the nature of the data in the body of a MIME entity, by
>>>>    giving media type and subtype identifiers, and by providing auxiliary
>>>>    information that may be required for certain media types.  After the
>>>>    type and subtype names, the remainder of the header field is simply a
>>>>    set of parameters, specified in an attribute/value notation.  The
>>>>    ordering of parameters is not significant.
>>>>
>>>>    In general, the top-level media type is used to declare the general
>>>>    type of data, while the subtype specifies a specific format for that
>>>>    type of data.  Thus, a media type of "image/xyz" is enough to tell a
>>>>    user agent that the data is an image, even if the user agent has no
>>>>    knowledge of the specific image format "xyz".  Such information can
>>>>    be used, for example, to decide whether or not to show a user the raw
>>>>    data from an unrecognized subtype -- such an action might be
>>>>    reasonable for unrecognized subtypes of "text", but not for
>>>>    unrecognized subtypes of "image" or "audio".  For this reason,
>>>>    registered subtypes of "text", "image", "audio", and "video" should
>>>>    not contain embedded information that is really of a different type.
>>>>    Such compound formats should be represented using the "multipart" or
>>>>    "application" types.
>>>>
>>>>    Parameters are modifiers of the media subtype, and as such do not
>>>>    fundamentally affect the nature of the content.  The set of
>>>>    meaningful parameters depends on the media type and subtype.  Most
>>>>    parameters are associated with a single specific subtype.  However, a
>>>>    given top-level media type may define parameters which are applicable
>>>>    to any subtype of that type.  Parameters may be required by their
>>>>    defining media type or subtype or they may be optional.  MIME
>>>>    implementations must also ignore any parameters whose names they do
>>>>    not recognize.
>>>>
>>>> RFC 2046 lists the top-level media types and their subtypes.  As shown
>>>> in the excerpt above, it refers to RFC 2045 for the Content Type header
>>>> field in the Introduction.
>>>>
>>>> ABNF for media type is not defined in RFC 2046 but is defined in RFC
>>>> 2045 which copies it from RFC 822.  RFC 2046 has a Collected Grammar
>>>> appendix which refers to RFC 822.
>>>>
>>>> http://tools.ietf.org/html/rfc2045#page-12
>>>>
>>>> 5.1. Syntax of the Content-Type Header Field
>>>>
>>>>    In the Augmented BNF notation of RFC 822, a Content-Type header field
>>>>    value is defined as follows:
>>>>
>>>>      content := "Content-Type" ":" type "/" subtype
>>>>                 *(";" parameter)
>>>>                 ; Matching of media type and subtype
>>>>                 ; is ALWAYS case-insensitive.
>>>>
>>>>      type := discrete-type / composite-type
>>>>
>>>>      discrete-type := "text" / "image" / "audio" / "video" /
>>>>                       "application" / extension-token
>>>>
>>>>      composite-type := "message" / "multipart" / extension-token
>>>>
>>>>      extension-token := ietf-token / x-token
>>>>
>>>>      ietf-token := <An extension token defined by a
>>>>                     standards-track RFC and registered
>>>>                     with IANA.>
>>>>
>>>>      x-token := <The two characters "X-" or "x-" followed, with
>>>>                  no intervening white space, by any token>
>>>>
>>>>      subtype := extension-token / iana-token
>>>>
>>>>      iana-token := <A publicly-defined extension token. Tokens
>>>>                     of this form must be registered with IANA
>>>>                     as specified in RFC 2048.>
>>>>
>>>>      parameter := attribute "=" value
>>>>
>>>>      attribute := token
>>>>                   ; Matching of attributes
>>>>                   ; is ALWAYS case-insensitive.
>>>>
>>>>      value := token / quoted-string
>>>>
>>>>      token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
>>>>                  or tspecials>
>>>>
>>>>      tspecials :=  "(" / ")" / "<" / ">" / "@" /
>>>>                    "," / ";" / ":" / "\" / <">
>>>>                    "/" / "[" / "]" / "?" / "="
>>>>                    ; Must be in quoted-string,
>>>>                    ; to use within parameter values
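
For comparison, Python's standard email package parses Content-Type
header fields along the lines of this RFC 2045 grammar.  The sketch
below only illustrates the case-insensitive matching of type, subtype
and attribute names.  The header value is an example of mine, and
nothing here says how OPC implementations behave.

```python
from email.message import Message

msg = Message()
msg["Content-Type"] = 'Text/HTML; Charset="utf-8"'

print(msg.get_content_type())       # text/html
print(msg.get_content_maintype())   # text
print(msg.get_content_subtype())    # html
print(msg.get_param("charset"))     # utf-8
```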
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Praying for the victims of the Japan Tohoku earthquake
>>>
>>> Makoto
>>>