Which RFC(s) for media type should we refer to?
MURATA Makoto
eb2m-mrt at asahi-net.or.jp
Sat Oct 25 15:08:19 CEST 2014
Caroline,
Thank you for your thorough study! This is an
eye opener.
Both RFC 2616 and RFC 7231 allow the use of doubly-quoted
strings and single-octet quoting by \.
OPC uses content types as part of [Content_Types].xml.
The XSD schema for this document is opc-contentTypes.xsd.
It has an ugly regular expression:
"(((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@
,;:\\"/\[\]\?=\{\}\s\t]])+))/((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@
,;:\\"/\[\]\?=\{\}\s\t]])+))((\s+)*;(\s+)*(((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@
,;:\\"/\[\]\?=\{\}\s\t]])+))=((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@
,;:\\"/\[\]\?=\{\}\s\t]])+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*)"
It is not at all clear whether this is equivalent to RFC 2616,
especially because XML has its own mechanism for character
escaping (&#x) and also because double quotation marks
cannot be used within doubly-quoted attribute values.
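To make the XML side of this concern concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The element is a simplified, namespace-free stand-in for an entry in [Content_Types].xml, and the helper name content_type_of is hypothetical. It only shows what value an XML parser hands on for three ways of writing a quoted parameter; it says nothing about what the schema then accepts.

```python
import xml.etree.ElementTree as ET


def content_type_of(xml_text: str) -> str:
    """Return the ContentType attribute value as reported by the XML parser."""
    return ET.fromstring(xml_text).get("ContentType")


# Double quotes written as the predefined entity &quot; inside a
# double-quoted attribute value.
escaped = '<Default Extension="html" ContentType="text/html;charset=&quot;utf-8&quot;"/>'

# The same value in a single-quoted attribute, where raw double quotes are legal.
single = "<Default Extension='html' ContentType='text/html;charset=\"utf-8\"'/>"

# Raw double quotes inside a double-quoted attribute: not well-formed XML.
raw = '<Default Extension="html" ContentType="text/html;charset="utf-8""/>'

print(content_type_of(escaped))
print(content_type_of(single))
try:
    content_type_of(raw)
except ET.ParseError as err:
    print("not well-formed:", err)
```

The first two spellings give the parser the same attribute value, so any comparison with the RFC grammar presumably has to be made on that parsed value rather than on the characters in the file.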
I tried to reformulate the above regular expression. First,
[\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]
appears repeatedly. If we represent this string by an internal
text entity X by introducing
<!ENTITY X "[\p{IsBasicLatin}-[\p{Cc}\(\)<>@
,;:\\"/\[\]\?=\{\}\s\t]]">
the entire expression will become
"(((($X)+))/((($X)+))((\s+)*;(\s+)*(((($X)+))=((($X)+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*)"
By removing unnecessary parentheses, this can be rewritten as
"$X+/$X+(\s*;\s*
($X+=(($X+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*"
This looks similar to what RFC 2616 defines. But
are they equivalent?
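One way to probe the question, without proving anything, is to translate both grammars into ordinary regular expressions and compare them on sample strings. The sketch below is a hand translation into Python's re module, which has neither \p{IsBasicLatin} nor character-class subtraction, so the character ranges are my own expansion and may contain mistakes. The RFC side follows the media-type ABNF of RFC 7231 with token and quoted-string from RFC 7230, not RFC 2616. The names T, OPC, RFC7231 and compare are hypothetical.

```python
import re

# $X from the simplified expression: printable ASCII minus the separators
# ( ) < > @ , ; : \ " / [ ] ? = { } and minus whitespace. This is also
# the set of token characters in RFC 7230.
T = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]"

# Quoted string as in the OPC pattern. (\s+) under * is written as a single
# whitespace character, which accepts the same strings.
OPC_QUOTED = r'"(?:[\x20\x21\x23-\x7E\xA0-\xFF]|[ \t\n\r]|\\[\x00-\x7F])*"'
OPC = re.compile(
    rf"{T}+/{T}+(?:[ \t\n\r]*;[ \t\n\r]*{T}+=(?:{T}+|{OPC_QUOTED}))*"
)

# RFC 7230 quoted-string: qdtext excludes the backslash, and OWS is SP / HTAB.
RFC_QUOTED = (
    r'"(?:[\t \x21\x23-\x5B\x5D-\x7E\x80-\xFF]'
    r"|\\[\t \x21-\x7E\x80-\xFF])*\""
)
RFC7231 = re.compile(
    rf"{T}+/{T}+(?:[ \t]*;[ \t]*{T}+=(?:{T}+|{RFC_QUOTED}))*"
)


def compare(value):
    """Return (accepted by OPC pattern, accepted by RFC 7231 grammar)."""
    return (
        OPC.fullmatch(value) is not None,
        RFC7231.fullmatch(value) is not None,
    )


samples = [
    "text/html;charset=utf-8",
    'text/html; charset="utf-8"',
    "text/html;\ncharset=utf-8",   # line feed next to ";"
    'text/html;x="a\\"',           # value ends in backslash, then quote
    "text/html;charset = utf-8",   # whitespace around "="
]
for s in samples:
    print(repr(s), compare(s))
```

Under this translation the two agree on the ordinary cases but differ in at least two places: the OPC pattern accepts a line feed or carriage return around ";" because \s covers them, and it accepts a lone backslash inside a quoted string because its quoted-text class does not exclude the backslash. Whether those differences matter in practice is a separate question.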
Regards,
Makoto
2014-10-21 6:04 GMT+09:00 Arms, Caroline <caar at loc.gov>:
> All,
>
> I started back on the Content type vs. Media type issue and ran into the
> question of which RFC(s) we should refer to, thinking that would be a good
> place to start thinking about rewording things. It's not so simple!
>
> Part 2 currently refers to RFC 2616, which may not have been the most
> appropriate RFC but that is now moot, because 2616 is obsolete and has been
> replaced by a group of RFCs including RFC 7231 which refers to RFC 2046 in
> its Media Type subclause but does not elaborate on what media-type actually
> is. RFC 7231 provides ABNF for media-type, but you need to refer to RFC
> 7230 for an explanation of "OWS" -- used in the ABNF. RFC 2046 lists the
> top-level media types and common subtypes. It discusses parameters. Its
> introduction refers to RFC 2045 for the Content Type context and to RFC 822
> for all relevant ABNF not found in its Appendix A: Collected Grammar.
> Media-type is not mentioned in Appendix A. RFC 2045 has a copy of the
> relevant ABNF from RFC 822.
>
> More detailed detective work with URLs is attached below.
>
> The question will be how best to refer to this in Part 2. RFC 7231 is
> most convenient for getting the ABNF syntax, but you need RFC 2046 to
> understand the semantics.
>
> To be continued, no doubt ...
>
> Caroline
>
> Caroline Arms
> Library of Congress Contractor
> Co-compiler of Sustainability of Digital Formats resource
> http://www.digitalpreservation.gov/formats/
>
> ** Views expressed are personal and not necessarily those of the
> institution **
>
> ==== DETAILED detective work ====
>
> Part 2 currently refers to RFC 2616
>
> https://www.mnot.net/blog/2014/06/07/rfc2616_is_dead
>
> http://www.rfc-editor.org/info/rfc2616 is marked as obsolete
>
> So I went to one of the replacement RFCs
>
> http://tools.ietf.org/html/rfc7231
>
> 3.1.1.1. Media Type
>
> HTTP uses Internet media types [RFC2046] in the Content-Type
> (Section 3.1.1.5) and Accept (Section 5.3.2) header fields in order
> to provide open and extensible data typing and type negotiation.
> Media types define both a data format and various processing models:
> how to process that data in accordance with each context in which it
> is received.
>
> media-type = type "/" subtype *( OWS ";" OWS parameter )
> type = token
> subtype = token
>
> The type/subtype MAY be followed by parameters in the form of
> name=value pairs.
>
> parameter = token "=" ( token / quoted-string )
>
> The type, subtype, and parameter name tokens are case-insensitive.
> Parameter values might or might not be case-sensitive, depending on
> the semantics of the parameter name. The presence or absence of a
> parameter might be significant to the processing of a media-type,
> depending on its definition within the media type registry.
>
> A parameter value that matches the token production can be
> transmitted either as a token or within a quoted-string. The quoted
> and unquoted values are equivalent. For example, the following
> examples are all equivalent, but the first is preferred for
> consistency:
>
> text/html;charset=utf-8
> text/html;charset=UTF-8
> Text/HTML;Charset="utf-8"
> text/html; charset="utf-8"
>
> Internet media types ought to be registered with IANA according to
> the procedures defined in [BCP13].
>
> Note: Unlike some similar constructs in other header fields, media
> type parameters do not allow whitespace (even "bad" whitespace)
> around the "=" character.
>
> === aside on OWS -- optional whitespace ===
>
> OWS = <OWS, see [RFC7230], Section 3.2.3>
>
> http://tools.ietf.org/html/rfc7230#section-3.2.3
>
> 3.2.3. Whitespace
>
> This specification uses three rules to denote the use of linear
> whitespace: OWS (optional whitespace), RWS (required whitespace), and
> BWS ("bad" whitespace).
>
> The OWS rule is used where zero or more linear whitespace octets
> might appear. For protocol elements where optional whitespace is
> preferred to improve readability, a sender SHOULD generate the
> optional whitespace as a single SP; otherwise, a sender SHOULD NOT
> generate optional whitespace except as needed to white out invalid or
> unwanted protocol elements during in-place message filtering.
>
> The RWS rule is used when at least one linear whitespace octet is
> required to separate field tokens. A sender SHOULD generate RWS as a
> single SP.
>
> The BWS rule is used where the grammar allows optional whitespace
> only for historical reasons. A sender MUST NOT generate BWS in
> messages. A recipient MUST parse for such bad whitespace and remove
> it before interpreting the protocol element.
>
> OWS = *( SP / HTAB )
> ; optional whitespace
> RWS = 1*( SP / HTAB )
> ; required whitespace
> BWS = OWS
> ; "bad" whitespace
>
> ==== end of OWS digression
>
>
> http://tools.ietf.org/html/rfc2046
>
> Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types
>
> Introduction
>
> The first document in this set, RFC 2045, defines a number of header
> fields, including Content-Type. The Content-Type field is used to
> specify the nature of the data in the body of a MIME entity, by
> giving media type and subtype identifiers, and by providing auxiliary
> information that may be required for certain media types. After the
> type and subtype names, the remainder of the header field is simply a
> set of parameters, specified in an attribute/value notation. The
> ordering of parameters is not significant.
>
> In general, the top-level media type is used to declare the general
> type of data, while the subtype specifies a specific format for that
> type of data. Thus, a media type of "image/xyz" is enough to tell a
> user agent that the data is an image, even if the user agent has no
> knowledge of the specific image format "xyz". Such information can
> be used, for example, to decide whether or not to show a user the raw
> data from an unrecognized subtype -- such an action might be
> reasonable for unrecognized subtypes of "text", but not for
> unrecognized subtypes of "image" or "audio". For this reason,
> registered subtypes of "text", "image", "audio", and "video" should
> not contain embedded information that is really of a different type.
> Such compound formats should be represented using the "multipart" or
> "application" types.
>
> Parameters are modifiers of the media subtype, and as such do not
> fundamentally affect the nature of the content. The set of
> meaningful parameters depends on the media type and subtype. Most
> parameters are associated with a single specific subtype. However, a
> given top-level media type may define parameters which are applicable
> to any subtype of that type. Parameters may be required by their
> defining media type or subtype or they may be optional. MIME
> implementations must also ignore any parameters whose names they do
> not recognize.
>
> RFC 2046 lists the top-level media types and their subtypes. As shown in
> the excerpt above, it refers to RFC 2045 for the Content Type header field
> in the Introduction.
>
> ABNF for media type is not defined in RFC 2046 but is defined in RFC
> 2045 which copies it from RFC 822. RFC 2046 has a Collected Grammar
> appendix which refers to RFC 822.
>
> http://tools.ietf.org/html/rfc2045#page-12
>
> 5.1. Syntax of the Content-Type Header Field
>
> In the Augmented BNF notation of RFC 822, a Content-Type header field
> value is defined as follows:
>
> content := "Content-Type" ":" type "/" subtype
> *(";" parameter)
> ; Matching of media type and subtype
> ; is ALWAYS case-insensitive.
>
> type := discrete-type / composite-type
>
> discrete-type := "text" / "image" / "audio" / "video" /
> "application" / extension-token
>
> composite-type := "message" / "multipart" / extension-token
>
> extension-token := ietf-token / x-token
>
> ietf-token := <An extension token defined by a
> standards-track RFC and registered
> with IANA.>
>
> x-token := <The two characters "X-" or "x-" followed, with
> no intervening white space, by any token>
>
> subtype := extension-token / iana-token
>
> iana-token := <A publicly-defined extension token. Tokens
> of this form must be registered with IANA
> as specified in RFC 2048.>
>
> parameter := attribute "=" value
>
> attribute := token
> ; Matching of attributes
> ; is ALWAYS case-insensitive.
>
> value := token / quoted-string
>
> token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
> or tspecials>
>
> tspecials := "(" / ")" / "<" / ">" / "@" /
> "," / ";" / ":" / "\" / <">
> "/" / "[" / "]" / "?" / "="
> ; Must be in quoted-string,
> ; to use within parameter values
>
--
Praying for the victims of the Japan Tohoku earthquake
Makoto