Which RFC(s) for media type should we refer to?

MURATA Makoto eb2m-mrt at asahi-net.or.jp
Sat Oct 25 15:08:19 CEST 2014


Caroline,

Thank you for your thorough study!  This is an
eye-opener.

Both RFC 2616 and RFC 7231 allow the use of double-quoted
strings and single-octet quoting by \.

OPC uses content types as part of [Content_Types].xml.
The XSD schema for this document is opc-contentTypes.xsd.
It has an ugly regular expression:


"(((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))/((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))((\s+)*;(\s+)*(((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))=((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*)"

It is not at all clear whether this is equivalent to RFC 2616,
especially because XML has its own mechanism for character
escaping (&#x) and also because double quotation marks
cannot appear literally within double-quoted attribute values.
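
To make the escaping point concrete, here is a minimal Python
sketch.  The XML fragment is a made-up example, not taken from a
real package.  It assumes that a schema validator applies the
pattern to the attribute value after the XML parser has resolved
&quot; and similar references, so the regular expression sees a
plain double quotation mark.

```python
import xml.etree.ElementTree as ET

# Namespace used by [Content_Types].xml in OPC packages.
NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Hypothetical fragment: the quoted parameter value is written with
# &quot; because the attribute itself is delimited by double quotes.
doc = f"""<Types xmlns="{NS}">
  <Default Extension="txt"
           ContentType="text/plain;charset=&quot;utf-8&quot;"/>
</Types>"""

root = ET.fromstring(doc)
default = root.find(f"{{{NS}}}Default")

# The parser hands back the unescaped value, which is the string a
# pattern check would be run against under the assumption above.
content_type = default.get("ContentType")
print(content_type)
```

The same value could also be written in a single-quoted attribute,
in which case the double quotation marks need no escaping at all.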

I tried to reformulate the above regular expression.  First,

[\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]

appears repeatedly.  If we represent this string by an internal
text entity X, declared as

<!ENTITY X "[\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]">

and write $X for a reference to it, the entire expression becomes

"(((($X)+))/((($X)+))((\s+)*;(\s+)*(((($X)+))=((($X)+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*)"


By removing unnecessary parentheses and simplifying (\s+)* to \s*,
this can be rewritten as

"$X+/$X+(\s*;\s*($X+=(($X+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*"

This looks similar to what RFC 2616 defines.  But
are they equivalent?
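
One way to start probing the question is to translate both
grammars into Python regular expressions and try a few inputs.
The sketch below is a hand translation, so its character ranges
are assumptions: X is taken to be Basic Latin minus controls,
whitespace and the listed specials, and the RFC side follows the
token, quoted-string and OWS rules of RFC 7230 together with the
media-type rule of RFC 7231, not RFC 2616.  All Python names are
invented for illustration.

```python
import re
import string

# Characters that X subtracts besides controls and whitespace.
SPECIALS = '()<>@,;:\\"/[]?={}'

# Approximation of X: Basic Latin minus \p{Cc}, whitespace and SPECIALS.
x_chars = {
    chr(c) for c in range(0x21, 0x7F)   # skips controls, SP and DEL
    if chr(c) not in SPECIALS
}

# tchar from RFC 7230, Section 3.2.6.
tchar = set("!#$%&'*+-.^_`|~" + string.digits + string.ascii_letters)
same_token_chars = (x_chars == tchar)

TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"

# RFC 7230 quoted-string: qdtext or quoted-pair between DQUOTEs.
RFC_QUOTED = (r'"(?:[\t \x21\x23-\x5B\x5D-\x7E\x80-\xFF]'
              r'|\\[\t \x21-\x7E\x80-\xFF])*"')
# RFC 7231 media-type, with OWS = *( SP / HTAB ).
RFC_MEDIA_TYPE = re.compile(
    rf"{TOKEN}/{TOKEN}(?:[ \t]*;[ \t]*{TOKEN}=(?:{TOKEN}|{RFC_QUOTED}))*"
)

# Simplified OPC pattern.  XSD \s is [ \t\n\r]; \p{Cc} is taken as
# U+0000-001F and U+007F-009F.  The (\s+) branch is folded into the
# character class, which matches the same strings under the outer *.
OPC_QUOTED = r'"(?:[\t\n\r\x20\x21\x23-\x7E\xA0-\xFF]|\\[\x00-\x7F])*"'
OPC_CONTENT_TYPE = re.compile(
    rf"{TOKEN}/{TOKEN}"
    rf"(?:[ \t\n\r]*;[ \t\n\r]*{TOKEN}=(?:{TOKEN}|{OPC_QUOTED}))*"
)

samples = [
    'text/html; charset="utf-8"',    # ordinary quoted parameter
    'text/html ;\ncharset=utf-8',    # line break after the semicolon
    'text/plain;x="a\\"',            # quoted value ending in a lone backslash
]

print("token characters identical:", same_token_chars)
for s in samples:
    print(repr(s),
          "RFC:", bool(RFC_MEDIA_TYPE.fullmatch(s)),
          "OPC:", bool(OPC_CONTENT_TYPE.fullmatch(s)))
```

With these translations the token characters coincide, but the OPC
pattern also accepts a line break next to ";" and a quoted value
that ends in a lone backslash, both of which the RFC 7231 grammar
rejects.  This says nothing yet about RFC 2616 itself.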

Regards,
Makoto



2014-10-21 6:04 GMT+09:00 Arms, Caroline <caar at loc.gov>:

> All,
>
> I started back on the Content type vs. Media type issue and ran into the
> question of which RFC(s) we should refer to, thinking that would be a good
> place to start thinking about rewording things.  It's not so simple!
>
> Part 2 currently refers to RFC 2616, which may not have been the most
> appropriate RFC but that is now moot, because 2616 is obsolete and has been
> replaced by a group of RFCs including RFC 7231 which refers to RFC 2046 in
> its Media Type subclause but does not elaborate on what media-type actually
> is.  RFC 7231 provides ABNF for media-type, but you need to refer to RFC
> 7230 for an explanation of "OWS" -- used in the ABNF.  RFC 2046 lists the
> top-level media types and common subtypes.  It discusses parameters.  Its
> introduction refers to RFC 2045 for the Content Type context and to RFC 822
> for all relevant ABNF not found in its Appendix A: Collected Grammar.
> Media-type is not mentioned in Appendix A.  RFC 2045 has a copy of the
> relevant ABNF from RFC 822.
>
> More detailed detective work with URLs  is attached below.
>
> The question will be how best to refer to this in Part 2.   RFC 7231 is
> most convenient for getting the ABNF syntax, but you need RFC 2046 to
> understand the semantics.
>
>    To be continued, no doubt ...
>
>    Caroline
>
> Caroline Arms
> Library of Congress Contractor
> Co-compiler of Sustainability of Digital Formats resource
> http://www.digitalpreservation.gov/formats/
>
> ** Views expressed are personal and not necessarily those of the
> institution **
>
> ==== DETAILED detective work ====
>
> Part 2 currently refers to RFC 2616
>
> https://www.mnot.net/blog/2014/06/07/rfc2616_is_dead
>
> http://www.rfc-editor.org/info/rfc2616  is marked as obsolete
>
> So I went to one of the replacement RFCs
>
> http://tools.ietf.org/html/rfc7231
>
> 3.1.1.1. Media Type
>
>    HTTP uses Internet media types [RFC2046] in the Content-Type
>    (Section 3.1.1.5) and Accept (Section 5.3.2) header fields in order
>    to provide open and extensible data typing and type negotiation.
>    Media types define both a data format and various processing models:
>    how to process that data in accordance with each context in which it
>    is received.
>
>      media-type = type "/" subtype *( OWS ";" OWS parameter )
>      type       = token
>      subtype    = token
>
>    The type/subtype MAY be followed by parameters in the form of
>    name=value pairs.
>
>      parameter      = token "=" ( token / quoted-string )
>
>    The type, subtype, and parameter name tokens are case-insensitive.
>    Parameter values might or might not be case-sensitive, depending on
>    the semantics of the parameter name.  The presence or absence of a
>    parameter might be significant to the processing of a media-type,
>    depending on its definition within the media type registry.
>
>    A parameter value that matches the token production can be
>    transmitted either as a token or within a quoted-string.  The quoted
>    and unquoted values are equivalent.  For example, the following
>    examples are all equivalent, but the first is preferred for
>    consistency:
>
>      text/html;charset=utf-8
>      text/html;charset=UTF-8
>      Text/HTML;Charset="utf-8"
>      text/html; charset="utf-8"
>
>    Internet media types ought to be registered with IANA according to
>    the procedures defined in [BCP13].
>
>       Note: Unlike some similar constructs in other header fields, media
>       type parameters do not allow whitespace (even "bad" whitespace)
>       around the "=" character.
>
> ===  aside on OWS  -- optional whitespace ===
>
>     OWS           = <OWS, see [RFC7230], Section 3.2.3>
>
> http://tools.ietf.org/html/rfc7230#section-3.2.3
>
> 3.2.3. Whitespace
>
>    This specification uses three rules to denote the use of linear
>    whitespace: OWS (optional whitespace), RWS (required whitespace), and
>    BWS ("bad" whitespace).
>
>    The OWS rule is used where zero or more linear whitespace octets
>    might appear.  For protocol elements where optional whitespace is
>    preferred to improve readability, a sender SHOULD generate the
>    optional whitespace as a single SP; otherwise, a sender SHOULD NOT
>    generate optional whitespace except as needed to white out invalid or
>    unwanted protocol elements during in-place message filtering.
>
>    The RWS rule is used when at least one linear whitespace octet is
>    required to separate field tokens.  A sender SHOULD generate RWS as a
>    single SP.
>
>    The BWS rule is used where the grammar allows optional whitespace
>    only for historical reasons.  A sender MUST NOT generate BWS in
>    messages.  A recipient MUST parse for such bad whitespace and remove
>    it before interpreting the protocol element.
>
>      OWS            = *( SP / HTAB )
>                     ; optional whitespace
>      RWS            = 1*( SP / HTAB )
>                     ; required whitespace
>      BWS            = OWS
>                     ; "bad" whitespace
>
> ==== end of OWS digression
>
>
> http://tools.ietf.org/html/rfc2046
>
> Multipurpose Internet Mail Extensions (MIME) Part Two:  Media Types
>
> Introduction
>
>    The first document in this set, RFC 2045, defines a number of header
>    fields, including Content-Type. The Content-Type field is used to
>    specify the nature of the data in the body of a MIME entity, by
>    giving media type and subtype identifiers, and by providing auxiliary
>    information that may be required for certain media types.  After the
>    type and subtype names, the remainder of the header field is simply a
>    set of parameters, specified in an attribute/value notation.  The
>    ordering of parameters is not significant.
>
>    In general, the top-level media type is used to declare the general
>    type of data, while the subtype specifies a specific format for that
>    type of data.  Thus, a media type of "image/xyz" is enough to tell a
>    user agent that the data is an image, even if the user agent has no
>    knowledge of the specific image format "xyz".  Such information can
>    be used, for example, to decide whether or not to show a user the raw
>    data from an unrecognized subtype -- such an action might be
>    reasonable for unrecognized subtypes of "text", but not for
>    unrecognized subtypes of "image" or "audio".  For this reason,
>    registered subtypes of "text", "image", "audio", and "video" should
>    not contain embedded information that is really of a different type.
>    Such compound formats should be represented using the "multipart" or
>    "application" types.
>
>    Parameters are modifiers of the media subtype, and as such do not
>    fundamentally affect the nature of the content.  The set of
>    meaningful parameters depends on the media type and subtype.  Most
>    parameters are associated with a single specific subtype.  However, a
>    given top-level media type may define parameters which are applicable
>    to any subtype of that type.  Parameters may be required by their
>    defining media type or subtype or they may be optional.  MIME
>    implementations must also ignore any parameters whose names they do
>    not recognize.
>
> RFC 2046 lists the top-level media types and their subtypes.  As shown in
> the excerpt above, it refers to RFC 2045 for the Content Type header field
> in the Introduction.
>
> ABNF for media type is not defined in RFC 2046 but is defined in RFC
> 2045 which copies it from RFC 822.  RFC 2046 has a Collected Grammar
> appendix which refers to RFC 822.
>
> http://tools.ietf.org/html/rfc2045#page-12
>
> 5.1. Syntax of the Content-Type Header Field
>
>    In the Augmented BNF notation of RFC 822, a Content-Type header field
>    value is defined as follows:
>
>      content := "Content-Type" ":" type "/" subtype
>                 *(";" parameter)
>                 ; Matching of media type and subtype
>                 ; is ALWAYS case-insensitive.
>
>      type := discrete-type / composite-type
>
>      discrete-type := "text" / "image" / "audio" / "video" /
>                       "application" / extension-token
>
>      composite-type := "message" / "multipart" / extension-token
>
>      extension-token := ietf-token / x-token
>
>      ietf-token := <An extension token defined by a
>                     standards-track RFC and registered
>                     with IANA.>
>
>      x-token := <The two characters "X-" or "x-" followed, with
>                  no intervening white space, by any token>
>
>      subtype := extension-token / iana-token
>
>      iana-token := <A publicly-defined extension token. Tokens
>                     of this form must be registered with IANA
>                     as specified in RFC 2048.>
>
>      parameter := attribute "=" value
>
>      attribute := token
>                   ; Matching of attributes
>                   ; is ALWAYS case-insensitive.
>
>      value := token / quoted-string
>
>      token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
>                  or tspecials>
>
>      tspecials :=  "(" / ")" / "<" / ">" / "@" /
>                    "," / ";" / ":" / "\" / <">
>                    "/" / "[" / "]" / "?" / "="
>                    ; Must be in quoted-string,
>                    ; to use within parameter values
>



-- 

Praying for the victims of the Japan Tohoku earthquake

Makoto