What is a character? What is the right term?

Innovimax SARL innovimax at gmail.com
Thu Jun 25 08:05:54 CEST 2009


+1 for Unicode Scalar Value

On Sun, Jun 21, 2009 at 4:25 AM, MURATA Makoto<eb2m-mrt at asahi-net.or.jp> wrote:
> I have an action item about characters.  Here are some entries in
> Appendix G (Glossary) of the Unicode 5.0.0 standard.
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Character. (1) The smallest component of written language that has
> semantic value; refers to the abstract meaning and/or shape, rather
> than a specific shape (see also glyph), though in code tables some
> form of visual representation is essential for the reader’s
> understanding. (2) Synonym for abstract character. (3) The basic unit
> of encoding for the Unicode character encoding. (4) The English name
> for the ideographic written elements of Chinese origin. [See
> ideograph (2).]
>
> Abstract Character. A unit of information used for the organization,
> control, or representation of textual data. (See definition D7 in
> Section 3.4, Characters and Encoding.)
>
> Code Point. Any value in the Unicode codespace; that is, the range of
> integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4,
> Characters and Encoding.)
>
> Code Position. Synonym for code point. Used in ISO character encoding
> standards.
>
> Code Unit. The minimal bit combination that can represent a unit of
> encoded text for processing or interchange. The Unicode Standard uses
> 8-bit code units in the UTF-8 encoding form, 16-bit code units in the
> UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding
> form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)
>
> Code Value. Obsolete synonym for code unit.
>
> Byte. (1) The minimal unit of addressable storage for a particular
> computer architecture.
> (2) An octet. Note that many early computer architectures used bytes
> larger than 8 bits in
> size, but the industry has now standardized almost uniformly on 8-bit
> bytes. The Unicode
> Standard follows the current industry practice in equating the term
> byte with octet and
> using the more familiar term byte in all contexts. (See octet.)
>
> Octet. An ordered sequence of eight bits considered as a unit. The
> Unicode Standard follows current industry practice in referring to an
> octet as a byte. (See byte.)
>
> Unicode Scalar Value. Any Unicode code point except high-surrogate and
> low-surrogate code points. In other words, the ranges of integers 0 to
> D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive. (See definition D76 in
> Section 3.9, Unicode Encoding Forms.)
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> I now think that "Unicode scalar value" is the right term *if*
> U+101D0 PHAISTOS DISC SIGN PEDESTRIAN, for example, is a single
> something.
>
> DR 09-0070 should not be affected by this discussion, since DR 09-0070
> is concerned with the representation given by UTF-16LE.  Meanwhile,
> other DRs do not choose and fix particular encodings.
>
> Cheers,
>
>
> Makoto <EB2M-MRT at asahi-net.or.jp>
>
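
A minimal Python sketch of the distinction the quoted glossary entries
draw between scalar values and code units. It assumes Python 3, where a
str is indexed by code point; the helper names is_scalar_value and
code_unit_counts are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; helper names are hypothetical.

def is_scalar_value(cp: int) -> bool:
    """True if cp is a code point outside the surrogate range D800..DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def code_unit_counts(ch: str) -> dict:
    """Number of code units needed to encode ch in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),             # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-le")) // 2,   # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-le")) // 4,   # 32-bit code units
    }


pedestrian = "\U000101D0"  # U+101D0 PHAISTOS DISC SIGN PEDESTRIAN

print(is_scalar_value(ord(pedestrian)))  # True
print(is_scalar_value(0xD800))           # False: high-surrogate code point
print(code_unit_counts(pedestrian))      # {'UTF-8': 4, 'UTF-16': 2, 'UTF-32': 1}
```

Counted as scalar values, U+101D0 is a single something; counted as
UTF-16 code units, it is two (a surrogate pair).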



-- 
Innovimax SARL
Consulting, Training & XML Development
9, impasse des Orteaux
75020 Paris
Tel : +33 9 52 475787
Fax : +33 1 4356 1746
http://www.innovimax.fr
RCS Paris 488.018.631
SARL au capital de 10.000 €


