Which RFC(s) for media type should we refer to?
MURATA Makoto
eb2m-mrt at asahi-net.or.jp
Sun Nov 2 08:15:57 CET 2014
Folks,
I pointed out that
[\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]
is used repeatedly. This appears to represent characters in
  token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
              or tspecials>
where
  tspecials := "(" / ")" / "<" / ">" / "@" /
               "," / ";" / ":" / "\" / <">
               "/" / "[" / "]" / "?" / "="
               ; Must be in quoted-string,
               ; to use within parameter values
Apart from "{" and "}", which the regular expression excludes but
tspecials does not list, they both represent any of the following
characters:
- 0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
- 0023;NUMBER SIGN;Po;0;ET;;;;;N;;;;;
- 0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;
- 0025;PERCENT SIGN;Po;0;ET;;;;;N;;;;;
- 0026;AMPERSAND;Po;0;ON;;;;;N;;;;;
- 0027;APOSTROPHE;Po;0;ON;;;;;N;APOSTROPHE-QUOTE;;;;
- 002A;ASTERISK;Po;0;ON;;;;;N;;;;;
- 002B;PLUS SIGN;Sm;0;ES;;;;;N;;;;;
- 002D;HYPHEN-MINUS;Pd;0;ES;;;;;N;;;;;
- 002E;FULL STOP;Po;0;CS;;;;;N;PERIOD;;;;
- 0-9
- A-Z
- 005E;CIRCUMFLEX ACCENT;Sk;0;ON;;;;;N;SPACING CIRCUMFLEX;;;;
- 005F;LOW LINE;Pc;0;ON;;;;;N;SPACING UNDERSCORE;;;;
- 0060;GRAVE ACCENT;Sk;0;ON;;;;;N;SPACING GRAVE;;;;
- a-z
- 007C;VERTICAL LINE;Sm;0;ON;;;;;N;VERTICAL BAR;;;;
- 007E;TILDE;Sm;0;ON;;;;;N;;;;;
The regular expression allows any of these characters as
part of a top-level media type name, second-level
media type name, and parameter name.
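To check the list above mechanically, here is a minimal Python sketch that enumerates both character sets. The function names are made up for illustration. The sketch assumes that \p{Cc} within Basic Latin means U+0000 to U+001F plus U+007F, and that \s covers space, tab, newline, and carriage return, as in XSD regular expressions.

```python
# Characters subtracted from \p{IsBasicLatin} in the schema's character class,
# in addition to \p{Cc}: ( ) < > @ , ; : \ " / [ ] ? = { } and \s, \t.
REGEX_EXCLUDED = set('()<>@,;:\\"/[]?={}') | set(" \t\n\r")

# tspecials exactly as listed in the RFC 2045 production quoted above.
TSPECIALS = set('()<>@,;:\\"/[]?=')


def is_control(cp: int) -> bool:
    """True for US-ASCII control characters (assumed meaning of CTLs / \\p{Cc})."""
    return cp < 0x20 or cp == 0x7F


def regex_token_chars() -> set:
    """Basic Latin minus controls minus the characters subtracted in the regex."""
    return {chr(cp) for cp in range(0x80)
            if not is_control(cp) and chr(cp) not in REGEX_EXCLUDED}


def rfc2045_token_chars() -> set:
    """Any US-ASCII CHAR except SPACE, CTLs, or tspecials."""
    return {chr(cp) for cp in range(0x80)
            if not is_control(cp) and chr(cp) != " " and chr(cp) not in TSPECIALS}


if __name__ == "__main__":
    from_regex = regex_token_chars()
    from_rfc = rfc2045_token_chars()
    print("regex class:", "".join(sorted(from_regex)))
    print("only in RFC 2045 token:", sorted(from_rfc - from_regex))
    print("only in regex class:", sorted(from_regex - from_rfc))
```

Under these assumptions the regex class has 77 characters, and the only characters of the RFC 2045 token set missing from it are "{" and "}".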
Regards,
Makoto
2014-10-25 22:08 GMT+09:00 MURATA Makoto <eb2m-mrt at asahi-net.or.jp>:
> Caroline,
>
> Thank you for your thorough study! This is an
> eye-opener.
>
> Both RFC 2616 and RFC 7231 allow the use of doubly-quoted
> strings and single-octet quoting by \.
>
> OPC uses content types as part of [Content_Types].xml.
> The XSD schema for this document is opc-contentTypes.xsd.
> It has an ugly regular expression:
>
>
> "(((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))/((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))((\s+)*;(\s+)*(((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))=((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*)"
>
> It is not at all clear whether this is equivalent to RFC 2616,
> especially because XML has its own mechanism for character
> escaping (&#x) and also because double quotation marks
> cannot be used within doubly-quoted attribute values.
>
> I tried to reformulate the above regular expression. First,
>
> [\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]
>
> appears repeatedly. If we represent this string by an internal
> text entity X by introducing
>
> <!ENTITY X "[\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]">
>
> the entire expression will become
>
>
> "(((($X)+))/((($X)+))((\s+)*;(\s+)*(((($X)+))=((($X)+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*)"
>
>
> By removing unnecessary parentheses, this can be rewritten as
>
> "$X+/$X+(\s*;\s*
> ($X+=(($X+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*"
>
> This looks similar to what RFC 2616 defines. But
> are they equivalent?
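One way to probe the question is to run the simplified expression against sample values. Below is a minimal Python sketch, not a definitive translation: Python's re module supports neither \p{IsBasicLatin} nor character class subtraction, so X and the quoted-string class are expanded by hand into explicit ranges, and \s is written as the four XSD whitespace characters. The name matches_schema_pattern is hypothetical. XSD patterns are implicitly anchored, so the sketch uses fullmatch.

```python
import re

# Hand expansion of X: Basic Latin minus \p{Cc} and the subtracted characters.
X = r"[!#$%&'*+\-.0-9A-Z^_`a-z|~]"
# XSD \s: space, tab, newline, carriage return.
WS = r"[ \t\n\r]"
# [\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]] as code point ranges.
QDTEXT = r"[\x20\x21\x23-\x7e\xa0-\xff]"
# "( qdtext | \s+ | \ followed by any Basic Latin character )*"
QUOTED = rf'"(?:{QDTEXT}|{WS}+|\\[\x00-\x7f])*"'
PARAM = rf"{X}+=(?:{X}+|{QUOTED})"
CONTENT_TYPE = re.compile(rf"{X}+/{X}+(?:{WS}*;{WS}*{PARAM})*")


def matches_schema_pattern(value: str) -> bool:
    """Return True if value matches the hand-translated schema pattern."""
    return CONTENT_TYPE.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ['text/html;charset=utf-8',
                   'text/html; charset="utf-8"',
                   'text/html;charset = utf-8',
                   'text/html;\ncharset=utf-8']:
        print(repr(sample), matches_schema_pattern(sample))
```

The last sample is accepted by this translation because \s includes newline, whereas OWS in RFC 7230 is limited to SP and HTAB, so the whitespace around ";" is one place where the two definitions can be compared more closely.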
>
> Regards,
> Makoto
>
>
>
> 2014-10-21 6:04 GMT+09:00 Arms, Caroline <caar at loc.gov>:
>
>> All,
>>
>> I started back on the Content type vs. Media type issue and ran into the
>> question of which RFC(s) we should refer to, thinking that would be a good
>> place to start thinking about rewording things. It's not so simple!
>>
>> Part 2 currently refers to RFC 2616, which may not have been the most
>> appropriate RFC, but that is now moot: RFC 2616 is obsolete and has been
>> replaced by a group of RFCs that includes RFC 7231. RFC 7231 refers to
>> RFC 2046 in its Media Type subclause but does not elaborate on what a
>> media type actually is. RFC 7231 provides ABNF for media-type, but you
>> need to refer to RFC 7230 for an explanation of "OWS", which is used in
>> the ABNF. RFC 2046 lists the top-level media types and common subtypes,
>> and discusses parameters. Its introduction refers to RFC 2045 for the
>> Content-Type context and to RFC 822 for all relevant ABNF not found in
>> its Appendix A: Collected Grammar. Media-type is not mentioned in
>> Appendix A. RFC 2045 defines the relevant ABNF using the augmented BNF
>> notation of RFC 822.
>>
>> More detailed detective work with URLs is attached below.
>>
>> The question will be how best to refer to this in Part 2. RFC 7231 is
>> most convenient for getting the ABNF syntax, but you need RFC 2046 to
>> understand the semantics.
>>
>> To be continued, no doubt ...
>>
>> Caroline
>>
>> Caroline Arms
>> Library of Congress Contractor
>> Co-compiler of Sustainability of Digital Formats resource
>> http://www.digitalpreservation.gov/formats/
>>
>> ** Views expressed are personal and not necessarily those of the
>> institution **
>>
>> ==== DETAILED detective work ====
>>
>> Part 2 currently refers to RFC 2616
>>
>> https://www.mnot.net/blog/2014/06/07/rfc2616_is_dead
>>
>> http://www.rfc-editor.org/info/rfc2616 is marked as obsolete
>>
>> So I went to one of the replacement RFCs
>>
>> http://tools.ietf.org/html/rfc7231
>>
>> 3.1.1.1. Media Type
>>
>> HTTP uses Internet media types [RFC2046] in the Content-Type
>> (Section 3.1.1.5) and Accept (Section 5.3.2) header fields in order
>> to provide open and extensible data typing and type negotiation.
>> Media types define both a data format and various processing models:
>> how to process that data in accordance with each context in which it
>> is received.
>>
>> media-type = type "/" subtype *( OWS ";" OWS parameter )
>> type = token
>> subtype = token
>>
>> The type/subtype MAY be followed by parameters in the form of
>> name=value pairs.
>>
>> parameter = token "=" ( token / quoted-string )
>>
>> The type, subtype, and parameter name tokens are case-insensitive.
>> Parameter values might or might not be case-sensitive, depending on
>> the semantics of the parameter name. The presence or absence of a
>> parameter might be significant to the processing of a media-type,
>> depending on its definition within the media type registry.
>>
>> A parameter value that matches the token production can be
>> transmitted either as a token or within a quoted-string. The quoted
>> and unquoted values are equivalent. For example, the following
>> examples are all equivalent, but the first is preferred for
>> consistency:
>>
>> text/html;charset=utf-8
>> text/html;charset=UTF-8
>> Text/HTML;Charset="utf-8"
>> text/html; charset="utf-8"
>>
>> Internet media types ought to be registered with IANA according to
>> the procedures defined in [BCP13].
>>
>> Note: Unlike some similar constructs in other header fields, media
>> type parameters do not allow whitespace (even "bad" whitespace)
>> around the "=" character.
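The equivalence of the four examples above can be illustrated with a small parser. This is a minimal Python sketch under stated assumptions, not a complete RFC 7231 implementation: the token class follows the tchar set of RFC 7230, quoted-string is simplified (no obs-text), and the names parse_media_type and normalize are hypothetical. Treating the charset value as case-insensitive is an assumption made so that "utf-8" and "UTF-8" compare equal.

```python
import re

# token = 1*tchar (tchar set as defined in RFC 7230).
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# Simplified quoted-string: qdtext or quoted-pair, without obs-text.
QUOTED = r'"(?:[\t\x20\x21\x23-\x5b\x5d-\x7e]|\\[\t\x20-\x7e])*"'
TYPE_RE = re.compile(rf"({TOKEN})/({TOKEN})")
# OWS ";" OWS parameter, with no whitespace allowed around "=".
PARAM_RE = re.compile(rf"[ \t]*;[ \t]*({TOKEN})=({TOKEN}|{QUOTED})")

# Assumption: parameters whose values are compared case-insensitively.
CASE_INSENSITIVE_VALUES = {"charset"}


def parse_media_type(value: str):
    """Parse a media-type into (type, subtype, params) or raise ValueError."""
    m = TYPE_RE.match(value)
    if m is None:
        raise ValueError(f"not a media type: {value!r}")
    mtype, subtype = m.group(1).lower(), m.group(2).lower()
    params = {}
    pos = m.end()
    while pos < len(value):
        p = PARAM_RE.match(value, pos)
        if p is None:
            raise ValueError(f"bad parameter at offset {pos}: {value!r}")
        name, raw = p.group(1).lower(), p.group(2)
        if raw.startswith('"'):
            # Strip the quotes and undo quoted-pair escaping.
            raw = re.sub(r"\\(.)", r"\1", raw[1:-1])
        params[name] = raw
        pos = p.end()
    return mtype, subtype, params


def normalize(value: str):
    """Return a comparable form of a media-type string."""
    mtype, subtype, params = parse_media_type(value)
    items = sorted((k, v.lower() if k in CASE_INSENSITIVE_VALUES else v)
                   for k, v in params.items())
    return mtype, subtype, tuple(items)


if __name__ == "__main__":
    examples = ['text/html;charset=utf-8',
                'text/html;charset=UTF-8',
                'Text/HTML;Charset="utf-8"',
                'text/html; charset="utf-8"']
    print({normalize(e) for e in examples})
```

With these assumptions all four examples normalize to the same value, while a string with whitespace around "=" is rejected, in line with the Note above.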
>>
>> === aside on OWS -- optional whitespace ===
>>
>> OWS = <OWS, see [RFC7230], Section 3.2.3>
>>
>> http://tools.ietf.org/html/rfc7230#section-3.2.3
>>
>> 3.2.3. Whitespace
>>
>> This specification uses three rules to denote the use of linear
>> whitespace: OWS (optional whitespace), RWS (required whitespace), and
>> BWS ("bad" whitespace).
>>
>> The OWS rule is used where zero or more linear whitespace octets
>> might appear. For protocol elements where optional whitespace is
>> preferred to improve readability, a sender SHOULD generate the
>> optional whitespace as a single SP; otherwise, a sender SHOULD NOT
>> generate optional whitespace except as needed to white out invalid or
>> unwanted protocol elements during in-place message filtering.
>>
>> The RWS rule is used when at least one linear whitespace octet is
>> required to separate field tokens. A sender SHOULD generate RWS as a
>> single SP.
>>
>> The BWS rule is used where the grammar allows optional whitespace
>> only for historical reasons. A sender MUST NOT generate BWS in
>> messages. A recipient MUST parse for such bad whitespace and remove
>> it before interpreting the protocol element.
>>
>> OWS = *( SP / HTAB )
>>       ; optional whitespace
>> RWS = 1*( SP / HTAB )
>>       ; required whitespace
>> BWS = OWS
>>       ; "bad" whitespace
>>
>> ==== end of OWS digression
>>
>>
>> http://tools.ietf.org/html/rfc2046
>>
>> Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types
>>
>> Introduction
>>
>> The first document in this set, RFC 2045, defines a number of header
>> fields, including Content-Type. The Content-Type field is used to
>> specify the nature of the data in the body of a MIME entity, by
>> giving media type and subtype identifiers, and by providing auxiliary
>> information that may be required for certain media types. After the
>> type and subtype names, the remainder of the header field is simply a
>> set of parameters, specified in an attribute/value notation. The
>> ordering of parameters is not significant.
>>
>> In general, the top-level media type is used to declare the general
>> type of data, while the subtype specifies a specific format for that
>> type of data. Thus, a media type of "image/xyz" is enough to tell a
>> user agent that the data is an image, even if the user agent has no
>> knowledge of the specific image format "xyz". Such information can
>> be used, for example, to decide whether or not to show a user the raw
>> data from an unrecognized subtype -- such an action might be
>> reasonable for unrecognized subtypes of "text", but not for
>> unrecognized subtypes of "image" or "audio". For this reason,
>> registered subtypes of "text", "image", "audio", and "video" should
>> not contain embedded information that is really of a different type.
>> Such compound formats should be represented using the "multipart" or
>> "application" types.
>>
>> Parameters are modifiers of the media subtype, and as such do not
>> fundamentally affect the nature of the content. The set of
>> meaningful parameters depends on the media type and subtype. Most
>> parameters are associated with a single specific subtype. However, a
>> given top-level media type may define parameters which are applicable
>> to any subtype of that type. Parameters may be required by their
>> defining media type or subtype or they may be optional. MIME
>> implementations must also ignore any parameters whose names they do
>> not recognize.
>>
>> RFC 2046 lists the top-level media types and their subtypes. As shown in
>> the excerpt above, its Introduction refers to RFC 2045 for the
>> Content-Type header field.
>>
>> ABNF for media type is not defined in RFC 2046 but is defined in RFC
>> 2045, which uses the augmented BNF notation of RFC 822. RFC 2046 has a
>> Collected Grammar appendix which refers to RFC 822.
>>
>> http://tools.ietf.org/html/rfc2045#page-12
>>
>> 5.1. Syntax of the Content-Type Header Field
>>
>> In the Augmented BNF notation of RFC 822, a Content-Type header field
>> value is defined as follows:
>>
>> content := "Content-Type" ":" type "/" subtype
>>            *(";" parameter)
>>            ; Matching of media type and subtype
>>            ; is ALWAYS case-insensitive.
>>
>> type := discrete-type / composite-type
>>
>> discrete-type := "text" / "image" / "audio" / "video" /
>>                  "application" / extension-token
>>
>> composite-type := "message" / "multipart" / extension-token
>>
>> extension-token := ietf-token / x-token
>>
>> ietf-token := <An extension token defined by a
>>                standards-track RFC and registered
>>                with IANA.>
>>
>> x-token := <The two characters "X-" or "x-" followed, with
>>             no intervening white space, by any token>
>>
>> subtype := extension-token / iana-token
>>
>> iana-token := <A publicly-defined extension token. Tokens
>>                of this form must be registered with IANA
>>                as specified in RFC 2048.>
>>
>> parameter := attribute "=" value
>>
>> attribute := token
>>              ; Matching of attributes
>>              ; is ALWAYS case-insensitive.
>>
>> value := token / quoted-string
>>
>> token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
>>             or tspecials>
>>
>> tspecials := "(" / ")" / "<" / ">" / "@" /
>>              "," / ";" / ":" / "\" / <">
>>              "/" / "[" / "]" / "?" / "="
>>              ; Must be in quoted-string,
>>              ; to use within parameter values
>>
>
>
>
> --
>
> Praying for the victims of the Japan Tohoku earthquake
>
> Makoto
>
--
Praying for the victims of the Japan Tohoku earthquake
Makoto