Which RFC(s) for media type should we refer to?
MURATA Makoto
eb2m-mrt at asahi-net.or.jp
Sun Nov 2 14:33:21 CET 2014
Folks,
The regular expression in opc-contentTypes.xsd has another interesting
subexpression:
("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*")
This subexpression matches a doubly-quoted string. But which characters
are allowed inside this doubly-quoted string? Is this
subexpression consistent with RFC 7230?
First, unlike the first regular expression,
neither \(\)<>@,;:\\ nor /\[\]\?=\{\}\s\t are excluded.
This is because RFCs 2045 and 7230 allow tspecials as part of
doubly-quoted strings. This is nice.
Second, escaped characters such as \a are always allowed by
(\\[\p{IsBasicLatin}]) as long as the escaped character is in the range
#x0000 to #x007F. RFC 7230 allows more characters (#x0080 to #x00FF) to be
escaped, but does not allow control characters other than HTAB to be
escaped. Thus, there is a discrepancy here.
Third, \p{IsLatin-1Supplement} represents characters
from #x0080 to #x00FF. See
http://www.w3.org/TR/xmlschema-2/#nt-charClassEsc
This matches obs-text in RFC 7230.
Fourth, RFC 7230 (to be precise, qdtext) does not allow REVERSE
SOLIDUS, but the subexpression does. Thus, we have another
discrepancy.
qdtext = HTAB / SP /%x21 / %x23-5B / %x5D-7E / obs-text
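The fourth point can be checked mechanically. What follows is only a minimal Python sketch under stated assumptions, not the schema's own logic: Python's re module supports neither \p{Is...} block escapes nor character-class subtraction, so both sets are built by hand from code points. The names QDTEXT and SCHEMA_CLASS are illustrative, and the sketch models only the bracketed character class, leaving out the (\s+) branch and the escape branch of the subexpression.

```python
import unicodedata

# qdtext per RFC 7230, Section 3.2.6:
#   qdtext   = HTAB / SP / %x21 / %x23-5B / %x5D-7E / obs-text
#   obs-text = %x80-FF
QDTEXT = (
    {0x09, 0x20, 0x21}
    | set(range(0x23, 0x5C))   # %x23-5B
    | set(range(0x5D, 0x7F))   # %x5D-7E
    | set(range(0x80, 0x100))  # obs-text
)

# [\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]
# i.e. #x0000-#x00FF minus general category Cc, minus '"', LF and CR.
SCHEMA_CLASS = {
    cp
    for cp in range(0x00, 0x100)
    if unicodedata.category(chr(cp)) != "Cc" and chr(cp) not in '"\n\r'
}

# Code points accepted by the character class but not by qdtext.
only_in_schema = sorted(SCHEMA_CLASS - QDTEXT)

print("REVERSE SOLIDUS in schema class:", 0x5C in SCHEMA_CLASS)
print("REVERSE SOLIDUS in qdtext:      ", 0x5C in QDTEXT)
print("only in schema class:", [hex(cp) for cp in only_in_schema])
```

The opposite difference, QDTEXT - SCHEMA_CLASS, can be inspected the same way to look for characters that qdtext allows but the character class alone does not.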
Regards,
Makoto
2014-11-02 16:56 GMT+09:00 MURATA Makoto <eb2m-mrt at asahi-net.or.jp>:
> Oops, I forgot to point out that \{ and \} are disallowed by
> our regular expression, but they are not tspecials as
> specified in RFC 2045. RFC 7230 does
> not allow these two characters either, and its token syntax exactly
> matches the enumerated list in my previous mail.
>
> https://tools.ietf.org/html/rfc7230#section-3.2.6
>
> One could say that our regular expression is already
> aligned with RFC 7230 rather than RFC 2045.
>
> Regards,
> Makoto
>
> 2014-11-02 16:15 GMT+09:00 MURATA Makoto <eb2m-mrt at asahi-net.or.jp>:
>
>> Folks,
>>
>> I pointed out that
>>
>> [\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]
>>
>> is used repeatedly. This appears to represent characters in
>>
>> token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
>> or tspecials>
>>
>> where
>>
>>
>> tspecials := "(" / ")" / "<" / ">" / "@" /
>> "," / ";" / ":" / "\" / <">
>> "/" / "[" / "]" / "?" / "="
>> ; Must be in quoted-string,
>> ; to use within parameter values
>>
>> They both represent any of the following characters:
>>
>> - 0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
>> - 0023;NUMBER SIGN;Po;0;ET;;;;;N;;;;;
>> - 0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;
>> - 0025;PERCENT SIGN;Po;0;ET;;;;;N;;;;;
>> - 0026;AMPERSAND;Po;0;ON;;;;;N;;;;;
>> - 0027;APOSTROPHE;Po;0;ON;;;;;N;APOSTROPHE-QUOTE;;;;
>> - 002A;ASTERISK;Po;0;ON;;;;;N;;;;;
>> - 002B;PLUS SIGN;Sm;0;ES;;;;;N;;;;;
>> - 002D;HYPHEN-MINUS;Pd;0;ES;;;;;N;;;;;
>> - 002E;FULL STOP;Po;0;CS;;;;;N;PERIOD;;;;
>> - 0-9
>> - A-Z
>> - 005E;CIRCUMFLEX ACCENT;Sk;0;ON;;;;;N;SPACING CIRCUMFLEX;;;;
>> - 005F;LOW LINE;Pc;0;ON;;;;;N;SPACING UNDERSCORE;;;;
>> - 0060;GRAVE ACCENT;Sk;0;ON;;;;;N;SPACING GRAVE;;;;
>> - a-z
>> - 007C;VERTICAL LINE;Sm;0;ON;;;;;N;VERTICAL BAR;;;;
>> - 007E;TILDE;Sm;0;ON;;;;;N;;;;;
>>
>>
>> The regular expression allows any of these characters as
>> part of a top-level media type name, second-level
>> media type name, and parameter name.
>>
>> Regards,
>> Makoto
>>
>> 2014-10-25 22:08 GMT+09:00 MURATA Makoto <eb2m-mrt at asahi-net.or.jp>:
>>
>>> Caroline,
>>>
>>> Thank you for your thorough study! This is an
>>> eye opener.
>>>
>>> Both RFC 2616 and RFC 7231 allow the use of doubly-quoted
>>> strings and single-octet quoting by \.
>>>
>>> OPC uses content types as part of [Content_Types].xml.
>>> The XSD schema for this document is opc-contentTypes.xsd.
>>> It has an ugly regular expression:
>>>
>>>
>>> "(((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))/((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))((\s+)*;(\s+)*(((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))=((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*)"
>>>
>>> It is not at all clear whether this is equivalent to RFC 2616,
>>> especially because XML has its own mechanism for character
>>> escaping (&#x) and also because double quotation marks
>>> cannot be used within doubly-quoted attribute values.
>>>
>>> I tried to reformulate the above regular expression. First,
>>>
>>> [\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]
>>>
>>> appears repeatedly. If we represent this string by an internal
>>> text entity X by introducing
>>>
>>> <!ENTITY X "[\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]">
>>>
>>> the entire expression will become
>>>
>>>
>>> "(((($X)+))/((($X)+))((\s+)*;(\s+)*(((($X)+))=((($X)+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*)"
>>>
>>>
>>> By removing unnecessary parentheses, this can be rewritten as
>>>
>>> "$X+/$X+(\s*;\s*($X+=(($X+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*"
>>>
>>> This looks similar to what RFC 2616 defines. But
>>> are they equivalent?
>>>
>>> Regards,
>>> Makoto
>>>
>>>
>>>
>>> 2014-10-21 6:04 GMT+09:00 Arms, Caroline <caar at loc.gov>:
>>>
>>>> All,
>>>>
>>>> I started back on the Content type vs. Media type issue and ran into
>>>> the question of which RFC(s) we should refer to, thinking that would be a
>>>> good place to start thinking about rewording things. It's not so simple!
>>>>
>>>> Part 2 currently refers to RFC 2616, which may not have been the most
>>>> appropriate RFC but that is now moot, because 2616 is obsolete and has been
>>>> replaced by a group of RFCs including RFC 7231 which refers to RFC 2046 in
>>>> its Media Type subclause but does not elaborate on what media-type actually
>>>> is. RFC 7231 provides ABNF for media-type, but you need to refer to RFC
>>>> 7230 for an explanation of "OWS" -- used in the ABNF. RFC 2046 lists the
>>>> top-level media types and common subtypes. It discusses parameters. Its
>>>> introduction refers to RFC 2045 for the Content Type context and to RFC 822
>>>> for all relevant ABNF not found in its Appendix A: Collected Grammar.
>>>> Media-type is not mentioned in Appendix A. RFC 2045 has a copy of the
>>>> relevant ABNF from RFC 822.
>>>>
>>>> More detailed detective work with URLs is attached below.
>>>>
>>>> The question will be how best to refer to this in Part 2. RFC 7231 is
>>>> most convenient for getting the ABNF syntax, but you need RFC 2046 to
>>>> understand the semantics.
>>>>
>>>> To be continued, no doubt ...
>>>>
>>>> Caroline
>>>>
>>>> Caroline Arms
>>>> Library of Congress Contractor
>>>> Co-compiler of Sustainability of Digital Formats resource
>>>> http://www.digitalpreservation.gov/formats/
>>>>
>>>> ** Views expressed are personal and not necessarily those of the
>>>> institution **
>>>>
>>>> ==== DETAILED detective work ====
>>>>
>>>> Part 2 currently refers to RFC 2616
>>>>
>>>> https://www.mnot.net/blog/2014/06/07/rfc2616_is_dead
>>>>
>>>> http://www.rfc-editor.org/info/rfc2616 is marked as obsolete
>>>>
>>>> So I went to one of the replacement RFCs
>>>>
>>>> http://tools.ietf.org/html/rfc7231
>>>>
>>>> 3.1.1.1. Media Type
>>>>
>>>> HTTP uses Internet media types [RFC2046] in the Content-Type
>>>> (Section 3.1.1.5) and Accept (Section 5.3.2) header fields in order
>>>> to provide open and extensible data typing and type negotiation.
>>>> Media types define both a data format and various processing models:
>>>> how to process that data in accordance with each context in which it
>>>> is received.
>>>>
>>>> media-type = type "/" subtype *( OWS ";" OWS parameter )
>>>> type = token
>>>> subtype = token
>>>>
>>>> The type/subtype MAY be followed by parameters in the form of
>>>> name=value pairs.
>>>>
>>>> parameter = token "=" ( token / quoted-string )
>>>>
>>>> The type, subtype, and parameter name tokens are case-insensitive.
>>>> Parameter values might or might not be case-sensitive, depending on
>>>> the semantics of the parameter name. The presence or absence of a
>>>> parameter might be significant to the processing of a media-type,
>>>> depending on its definition within the media type registry.
>>>>
>>>> A parameter value that matches the token production can be
>>>> transmitted either as a token or within a quoted-string. The quoted
>>>> and unquoted values are equivalent. For example, the following
>>>> examples are all equivalent, but the first is preferred for
>>>> consistency:
>>>>
>>>> text/html;charset=utf-8
>>>> text/html;charset=UTF-8
>>>> Text/HTML;Charset="utf-8"
>>>> text/html; charset="utf-8"
>>>>
>>>> Internet media types ought to be registered with IANA according to
>>>> the procedures defined in [BCP13].
>>>>
>>>> Note: Unlike some similar constructs in other header fields, media
>>>> type parameters do not allow whitespace (even "bad" whitespace)
>>>> around the "=" character.
>>>>
>>>> === aside on OWS -- optional whitespace ===
>>>>
>>>> OWS = <OWS, see [RFC7230], Section 3.2.3>
>>>>
>>>> http://tools.ietf.org/html/rfc7230#section-3.2.3
>>>>
>>>> 3.2.3. Whitespace
>>>>
>>>> This specification uses three rules to denote the use of linear
>>>> whitespace: OWS (optional whitespace), RWS (required whitespace), and
>>>> BWS ("bad" whitespace).
>>>>
>>>> The OWS rule is used where zero or more linear whitespace octets
>>>> might appear. For protocol elements where optional whitespace is
>>>> preferred to improve readability, a sender SHOULD generate the
>>>> optional whitespace as a single SP; otherwise, a sender SHOULD NOT
>>>> generate optional whitespace except as needed to white out invalid or
>>>> unwanted protocol elements during in-place message filtering.
>>>>
>>>> The RWS rule is used when at least one linear whitespace octet is
>>>> required to separate field tokens. A sender SHOULD generate RWS as a
>>>> single SP.
>>>>
>>>> The BWS rule is used where the grammar allows optional whitespace
>>>> only for historical reasons. A sender MUST NOT generate BWS in
>>>> messages. A recipient MUST parse for such bad whitespace and remove
>>>> it before interpreting the protocol element.
>>>>
>>>> OWS = *( SP / HTAB )
>>>> ; optional whitespace
>>>> RWS = 1*( SP / HTAB )
>>>> ; required whitespace
>>>> BWS = OWS
>>>> ; "bad" whitespace
>>>>
>>>> ==== end of OWS digression
>>>>
>>>>
>>>> http://tools.ietf.org/html/rfc2046
>>>>
>>>> Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types
>>>>
>>>> Introduction
>>>>
>>>> The first document in this set, RFC 2045, defines a number of header
>>>> fields, including Content-Type. The Content-Type field is used to
>>>> specify the nature of the data in the body of a MIME entity, by
>>>> giving media type and subtype identifiers, and by providing auxiliary
>>>> information that may be required for certain media types. After the
>>>> type and subtype names, the remainder of the header field is simply a
>>>> set of parameters, specified in an attribute/value notation. The
>>>> ordering of parameters is not significant.
>>>>
>>>> In general, the top-level media type is used to declare the general
>>>> type of data, while the subtype specifies a specific format for that
>>>> type of data. Thus, a media type of "image/xyz" is enough to tell a
>>>> user agent that the data is an image, even if the user agent has no
>>>> knowledge of the specific image format "xyz". Such information can
>>>> be used, for example, to decide whether or not to show a user the raw
>>>> data from an unrecognized subtype -- such an action might be
>>>> reasonable for unrecognized subtypes of "text", but not for
>>>> unrecognized subtypes of "image" or "audio". For this reason,
>>>> registered subtypes of "text", "image", "audio", and "video" should
>>>> not contain embedded information that is really of a different type.
>>>> Such compound formats should be represented using the "multipart" or
>>>> "application" types.
>>>>
>>>> Parameters are modifiers of the media subtype, and as such do not
>>>> fundamentally affect the nature of the content. The set of
>>>> meaningful parameters depends on the media type and subtype. Most
>>>> parameters are associated with a single specific subtype. However, a
>>>> given top-level media type may define parameters which are applicable
>>>> to any subtype of that type. Parameters may be required by their
>>>> defining media type or subtype or they may be optional. MIME
>>>> implementations must also ignore any parameters whose names they do
>>>> not recognize.
>>>>
>>>> RFC 2046 lists the top-level media types and their subtypes. As shown
>>>> in the excerpt above, it refers to RFC 2045 for the Content Type header
>>>> field in the Introduction.
>>>>
>>>> ABNF for media type is not defined in RFC 2046 but is defined in RFC
>>>> 2045 which copies it from RFC 822. RFC 2046 has a Collected Grammar
>>>> appendix which refers to RFC 822.
>>>>
>>>> http://tools.ietf.org/html/rfc2045#page-12
>>>>
>>>> 5.1. Syntax of the Content-Type Header Field
>>>>
>>>> In the Augmented BNF notation of RFC 822, a Content-Type header field
>>>> value is defined as follows:
>>>>
>>>> content := "Content-Type" ":" type "/" subtype
>>>> *(";" parameter)
>>>> ; Matching of media type and subtype
>>>> ; is ALWAYS case-insensitive.
>>>>
>>>> type := discrete-type / composite-type
>>>>
>>>> discrete-type := "text" / "image" / "audio" / "video" /
>>>> "application" / extension-token
>>>>
>>>> composite-type := "message" / "multipart" / extension-token
>>>>
>>>> extension-token := ietf-token / x-token
>>>>
>>>> ietf-token := <An extension token defined by a
>>>> standards-track RFC and registered
>>>> with IANA.>
>>>>
>>>> x-token := <The two characters "X-" or "x-" followed, with
>>>> no intervening white space, by any token>
>>>>
>>>> subtype := extension-token / iana-token
>>>>
>>>> iana-token := <A publicly-defined extension token. Tokens
>>>> of this form must be registered with IANA
>>>> as specified in RFC 2048.>
>>>>
>>>> parameter := attribute "=" value
>>>>
>>>> attribute := token
>>>> ; Matching of attributes
>>>> ; is ALWAYS case-insensitive.
>>>>
>>>> value := token / quoted-string
>>>>
>>>> token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
>>>> or tspecials>
>>>>
>>>> tspecials := "(" / ")" / "<" / ">" / "@" /
>>>> "," / ";" / ":" / "\" / <">
>>>> "/" / "[" / "]" / "?" / "="
>>>> ; Must be in quoted-string,
>>>> ; to use within parameter values
>>>>
--
Praying for the victims of the Japan Tohoku earthquake
Makoto