Which RFC(s) for media type should we refer to?

MURATA Makoto eb2m-mrt at asahi-net.or.jp
Sun Nov 2 08:15:57 CET 2014


Folks,

I pointed out that

   [\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]

is used repeatedly.  This appears to represent characters in

      token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
                 or tspecials>

where


     tspecials :=  "(" / ")" / "<" / ">" / "@" /
                   "," / ";" / ":" / "\" / <">
                   "/" / "[" / "]" / "?" / "="
                   ; Must be in quoted-string,
                   ; to use within parameter values

Apart from "{" and "}", which the regular expression excludes but
tspecials does not list, they both represent any of the following
characters:

- 0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
- 0023;NUMBER SIGN;Po;0;ET;;;;;N;;;;;
- 0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;
- 0025;PERCENT SIGN;Po;0;ET;;;;;N;;;;;
- 0026;AMPERSAND;Po;0;ON;;;;;N;;;;;
- 0027;APOSTROPHE;Po;0;ON;;;;;N;APOSTROPHE-QUOTE;;;;
- 002A;ASTERISK;Po;0;ON;;;;;N;;;;;
- 002B;PLUS SIGN;Sm;0;ES;;;;;N;;;;;
- 002D;HYPHEN-MINUS;Pd;0;ES;;;;;N;;;;;
- 002E;FULL STOP;Po;0;CS;;;;;N;PERIOD;;;;
- 0-9
- A-Z
- 005E;CIRCUMFLEX ACCENT;Sk;0;ON;;;;;N;SPACING CIRCUMFLEX;;;;
- 005F;LOW LINE;Pc;0;ON;;;;;N;SPACING UNDERSCORE;;;;
- 0060;GRAVE ACCENT;Sk;0;ON;;;;;N;SPACING GRAVE;;;;
- a-z
- 007C;VERTICAL LINE;Sm;0;ON;;;;;N;VERTICAL BAR;;;;
- 007E;TILDE;Sm;0;ON;;;;;N;;;;;
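
One way to check this comparison is to enumerate both sets. The following
is a minimal sketch, assuming that the tspecials string is the one in the
RFC 2045 excerpt above and that the excluded characters are the ones written
in the character class. The names VISIBLE, TSPECIALS_2045, REGEX_EXCLUDED,
token_2045, token_regex, and punctuation are made up for illustration.

```python
# Visible US-ASCII: everything except SPACE (0x20) and the control
# characters (0x00-0x1F and 0x7F).
VISIBLE = {chr(c) for c in range(0x21, 0x7F)}

# tspecials, copied from the RFC 2045 excerpt above.
TSPECIALS_2045 = set('()<>@,;:\\"/[]?=')

# Characters subtracted by the schema's character class, in addition to
# \p{Cc} and whitespace (which VISIBLE already leaves out).
REGEX_EXCLUDED = set('()<>@,;:\\"/[]?={}')

token_2045 = VISIBLE - TSPECIALS_2045
token_regex = VISIBLE - REGEX_EXCLUDED


def punctuation(chars):
    """Return the non-alphanumeric members of chars, sorted by code point."""
    return "".join(sorted(c for c in chars if not c.isalnum()))


print(punctuation(token_regex))           # punctuation allowed by the regex
print(sorted(token_2045 - token_regex))   # allowed by RFC 2045 token only
```

Under these assumptions the first line printed is the punctuation listed
above, and the second shows that the two sets differ only in "{" and "}",
which RFC 2616 lists among its separators.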


The regular expression allows any of these characters as
part of a top-level media type name, second-level
media type name, and parameter name.
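
For experimenting with this outside an XSD processor, the same shape can be
sketched with Python's re module. This is not the schema's regular
expression: Python has no \p{IsBasicLatin} or character-class subtraction,
so the token class is spelled out explicitly, and the quoted-string part is
a simplified assumption. The names TOKEN, QUOTED, MEDIA_TYPE, and
parse_media_type are hypothetical.

```python
import re

# Token characters written out explicitly (visible US-ASCII minus the
# excluded specials).
TOKEN = r"[!#$%&'*+\-.0-9A-Z^_`a-z|~]+"

# Simplified quoted string (an assumption, looser than the schema): any
# character except '"', backslash, CR, LF, or a backslash followed by one
# US-ASCII character.
QUOTED = r'"(?:[^"\\\r\n]|\\[\x00-\x7F])*"'

# type "/" subtype, then zero or more ";"-separated name=value parameters
# with optional whitespace around the ";" but not around the "=".
MEDIA_TYPE = re.compile(
    rf"(?P<type>{TOKEN})/(?P<subtype>{TOKEN})"
    rf"(?P<params>(?:\s*;\s*{TOKEN}=(?:{TOKEN}|{QUOTED}))*)"
)


def parse_media_type(value):
    """Return (type, subtype) in lower case, or None if value does not match."""
    m = MEDIA_TYPE.fullmatch(value)
    if m is None:
        return None
    return m.group("type").lower(), m.group("subtype").lower()


print(parse_media_type("text/html; charset=utf-8"))
print(parse_media_type('Text/HTML;Charset="utf-8"'))
print(parse_media_type("te{xt/html"))
```

The type and subtype are lower-cased only because the RFC text quoted below
says they are case-insensitive; parameter values are left alone.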

Regards,
Makoto

2014-10-25 22:08 GMT+09:00 MURATA Makoto <eb2m-mrt at asahi-net.or.jp>:

> Caroline,
>
> Thank you for your thorough study!  This is an
> eye opener.
>
> Both RFC 2616 and RFC 7231 allow the use of doubly-quoted
> strings and single-octet quoting by \.
>
> OPC uses content types as part of [Content_Types].xml.
> The XSD schema for this document is opc-contentTypes.xsd.
> It has an ugly regular expression
>
>
> "(((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))/((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))((\s+)*;(\s+)*(((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+))=((([\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]])+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*)"
>
> It is not at all clear whether this is equivalent to RFC 2616,
> especially because XML has its own mechanism for character
> escaping (&#x) and also because double quotation marks
> cannot be used within doubly-quoted attribute values.
>
> I tried to reformulate the above regular expression.  First,
>
> [\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]
>
> appears repeatedly.  If we represent this string by an internal
> text entity X by introducing
>
> <!ENTITY X "[\p{IsBasicLatin}-[\p{Cc}\(\)<>@,;:\\"/\[\]\?=\{\}\s\t]]">
>
> the entire expression will become
>
>
> "(((($X)+))/((($X)+))((\s+)*;(\s+)*(((($X)+))=((($X)+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*)"
>
>
> By removing unnecessary parentheses, this can be rewritten as
>
> "$X+/$X+(\s*;\s*($X+=(($X+)|("(([\p{IsLatin-1Supplement}\p{IsBasicLatin}-[\p{Cc}"\n\r]]|(\s+))|(\\[\p{IsBasicLatin}]))*"))))*"
>
> This looks similar to what RFC 2616 defines.  But
> are they equivalent?
>
> Regards,
> Makoto
>
>
>
> 2014-10-21 6:04 GMT+09:00 Arms, Caroline <caar at loc.gov>:
>
>> All,
>>
>> I started back on the Content type vs. Media type issue and ran into the
>> question of which RFC(s) we should refer to, thinking that would be a good
>> place to start thinking about rewording things.  It's not so simple!
>>
>> Part 2 currently refers to RFC 2616, which may not have been the most
>> appropriate RFC.  That is now moot, because RFC 2616 is obsolete and has
>> been replaced by a group of RFCs including RFC 7231.  RFC 7231 refers to
>> RFC 2046 in its Media Type subclause but does not elaborate on what a
>> media type actually is.  It provides ABNF for media-type, but you need to
>> refer to RFC 7230 for an explanation of "OWS", which is used in the ABNF.
>> RFC 2046 lists the top-level media types and common subtypes, and it
>> discusses parameters.  Its introduction refers to RFC 2045 for the
>> Content Type context and to RFC 822 for all relevant ABNF not found in
>> its Appendix A: Collected Grammar.  Media-type is not mentioned in
>> Appendix A.  RFC 2045 has a copy of the relevant ABNF from RFC 822.
>>
>> More detailed detective work with URLs is attached below.
>>
>> The question will be how best to refer to this in Part 2.   RFC 7231 is
>> most convenient for getting the ABNF syntax, but you need RFC 2046 to
>> understand the semantics.
>>
>>    To be continued, no doubt ...
>>
>>    Caroline
>>
>> Caroline Arms
>> Library of Congress Contractor
>> Co-compiler of Sustainability of Digital Formats resource
>> http://www.digitalpreservation.gov/formats/
>>
>> ** Views expressed are personal and not necessarily those of the
>> institution **
>>
>> ==== DETAILED detective work ====
>>
>> Part 2 currently refers to RFC 2616
>>
>> https://www.mnot.net/blog/2014/06/07/rfc2616_is_dead
>>
>> http://www.rfc-editor.org/info/rfc2616  is marked as obsolete
>>
>> So I went to one of the replacement RFCs
>>
>> http://tools.ietf.org/html/rfc7231
>>
>> 3.1.1.1. Media Type
>>
>>    HTTP uses Internet media types [RFC2046] in the Content-Type
>>    (Section 3.1.1.5) and Accept (Section 5.3.2) header fields in order
>>    to provide open and extensible data typing and type negotiation.
>>    Media types define both a data format and various processing models:
>>    how to process that data in accordance with each context in which it
>>    is received.
>>
>>      media-type = type "/" subtype *( OWS ";" OWS parameter )
>>      type       = token
>>      subtype    = token
>>
>>    The type/subtype MAY be followed by parameters in the form of
>>    name=value pairs.
>>
>>      parameter      = token "=" ( token / quoted-string )
>>
>>    The type, subtype, and parameter name tokens are case-insensitive.
>>    Parameter values might or might not be case-sensitive, depending on
>>    the semantics of the parameter name.  The presence or absence of a
>>    parameter might be significant to the processing of a media-type,
>>    depending on its definition within the media type registry.
>>
>>    A parameter value that matches the token production can be
>>    transmitted either as a token or within a quoted-string.  The quoted
>>    and unquoted values are equivalent.  For example, the following
>>    examples are all equivalent, but the first is preferred for
>>    consistency:
>>
>>      text/html;charset=utf-8
>>      text/html;charset=UTF-8
>>      Text/HTML;Charset="utf-8"
>>      text/html; charset="utf-8"
>>
>>    Internet media types ought to be registered with IANA according to
>>    the procedures defined in [BCP13].
>>
>>       Note: Unlike some similar constructs in other header fields, media
>>       type parameters do not allow whitespace (even "bad" whitespace)
>>       around the "=" character.
>>
>> ===  aside on OWS  -- optional whitespace ===
>>
>>     OWS           = <OWS, see [RFC7230], Section 3.2.3>
>>
>> http://tools.ietf.org/html/rfc7230#section-3.2.3
>>
>> 3.2.3. Whitespace
>>
>>    This specification uses three rules to denote the use of linear
>>    whitespace: OWS (optional whitespace), RWS (required whitespace), and
>>    BWS ("bad" whitespace).
>>
>>    The OWS rule is used where zero or more linear whitespace octets
>>    might appear.  For protocol elements where optional whitespace is
>>    preferred to improve readability, a sender SHOULD generate the
>>    optional whitespace as a single SP; otherwise, a sender SHOULD NOT
>>    generate optional whitespace except as needed to white out invalid or
>>    unwanted protocol elements during in-place message filtering.
>>
>>    The RWS rule is used when at least one linear whitespace octet is
>>    required to separate field tokens.  A sender SHOULD generate RWS as a
>>    single SP.
>>
>>    The BWS rule is used where the grammar allows optional whitespace
>>    only for historical reasons.  A sender MUST NOT generate BWS in
>>    messages.  A recipient MUST parse for such bad whitespace and remove
>>    it before interpreting the protocol element.
>>
>>      OWS            = *( SP / HTAB )
>>                     ; optional whitespace
>>      RWS            = 1*( SP / HTAB )
>>                     ; required whitespace
>>      BWS            = OWS
>>                     ; "bad" whitespace
>>
>> ==== end of OWS digression
>>
>>
>> http://tools.ietf.org/html/rfc2046
>>
>> Multipurpose Internet Mail Extensions (MIME) Part Two:  Media Types
>>
>> Introduction
>>
>>    The first document in this set, RFC 2045, defines a number of header
>>    fields, including Content-Type. The Content-Type field is used to
>>    specify the nature of the data in the body of a MIME entity, by
>>    giving media type and subtype identifiers, and by providing auxiliary
>>    information that may be required for certain media types.  After the
>>    type and subtype names, the remainder of the header field is simply a
>>    set of parameters, specified in an attribute/value notation.  The
>>    ordering of parameters is not significant.
>>
>>    In general, the top-level media type is used to declare the general
>>    type of data, while the subtype specifies a specific format for that
>>    type of data.  Thus, a media type of "image/xyz" is enough to tell a
>>    user agent that the data is an image, even if the user agent has no
>>    knowledge of the specific image format "xyz".  Such information can
>>    be used, for example, to decide whether or not to show a user the raw
>>    data from an unrecognized subtype -- such an action might be
>>    reasonable for unrecognized subtypes of "text", but not for
>>    unrecognized subtypes of "image" or "audio".  For this reason,
>>    registered subtypes of "text", "image", "audio", and "video" should
>>    not contain embedded information that is really of a different type.
>>    Such compound formats should be represented using the "multipart" or
>>    "application" types.
>>
>>    Parameters are modifiers of the media subtype, and as such do not
>>    fundamentally affect the nature of the content.  The set of
>>    meaningful parameters depends on the media type and subtype.  Most
>>    parameters are associated with a single specific subtype.  However, a
>>    given top-level media type may define parameters which are applicable
>>    to any subtype of that type.  Parameters may be required by their
>>    defining media type or subtype or they may be optional.  MIME
>>    implementations must also ignore any parameters whose names they do
>>    not recognize.
>>
>> RFC 2046 lists the top-level media types and their subtypes.  As shown in
>> the excerpt above, it refers to RFC 2045 for the Content Type header field
>> in the Introduction.
>>
>> ABNF for media type is not defined in RFC 2046 but is defined in RFC
>> 2045 which copies it from RFC 822.  RFC 2046 has a Collected Grammar
>> appendix which refers to RFC 822.
>>
>> http://tools.ietf.org/html/rfc2045#page-12
>>
>> 5.1. Syntax of the Content-Type Header Field
>>
>>    In the Augmented BNF notation of RFC 822, a Content-Type header field
>>    value is defined as follows:
>>
>>      content := "Content-Type" ":" type "/" subtype
>>                 *(";" parameter)
>>                 ; Matching of media type and subtype
>>                 ; is ALWAYS case-insensitive.
>>
>>      type := discrete-type / composite-type
>>
>>      discrete-type := "text" / "image" / "audio" / "video" /
>>                       "application" / extension-token
>>
>>      composite-type := "message" / "multipart" / extension-token
>>
>>      extension-token := ietf-token / x-token
>>
>>      ietf-token := <An extension token defined by a
>>                     standards-track RFC and registered
>>                     with IANA.>
>>
>>      x-token := <The two characters "X-" or "x-" followed, with
>>                  no intervening white space, by any token>
>>
>>      subtype := extension-token / iana-token
>>
>>      iana-token := <A publicly-defined extension token. Tokens
>>                     of this form must be registered with IANA
>>                     as specified in RFC 2048.>
>>
>>      parameter := attribute "=" value
>>
>>      attribute := token
>>                   ; Matching of attributes
>>                   ; is ALWAYS case-insensitive.
>>
>>      value := token / quoted-string
>>
>>      token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
>>                  or tspecials>
>>
>>      tspecials :=  "(" / ")" / "<" / ">" / "@" /
>>                    "," / ";" / ":" / "\" / <">
>>                    "/" / "[" / "]" / "?" / "="
>>                    ; Must be in quoted-string,
>>                    ; to use within parameter values
>>
>
>
>
> --
>
> Praying for the victims of the Japan Tohoku earthquake
>
> Makoto
>



-- 

Praying for the victims of the Japan Tohoku earthquake

Makoto