What is a character? What is the right term?

MURATA Makoto eb2m-mrt at asahi-net.or.jp
Sun Jun 21 04:25:04 CEST 2009


I have an action item about characters.  Here are some entries in
Appendix G (Glossary) of the Unicode 5.0.0 standard.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Character. (1) The smallest component of written language that has
semantic value; refers to the abstract meaning and/or shape, rather
than a specific shape (see also glyph), though in code tables some
form of visual representation is essential for the reader’s
understanding. (2) Synonym for abstract character. (3) The basic unit
of encoding for the Unicode character encoding. (4) The English name
for the ideographic written elements of Chinese origin. [See
ideograph (2).]

Abstract Character. A unit of information used for the organization,
control, or representation of textual data. (See definition D7 in
Section 3.4, Characters and Encoding.)

Code Point. Any value in the Unicode codespace; that is, the range of
integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4,
Characters and Encoding.)

Code Position. Synonym for code point. Used in ISO character encoding
standards.

Code Unit. The minimal bit combination that can represent a unit of
encoded text for processing or interchange. The Unicode Standard uses
8-bit code units in the UTF-8 encoding form, 16-bit code units in the
UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding
form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)

Code Value. Obsolete synonym for code unit.

Byte. (1) The minimal unit of addressable storage for a particular
computer architecture. (2) An octet. Note that many early computer
architectures used bytes larger than 8 bits in size, but the industry
has now standardized almost uniformly on 8-bit bytes. The Unicode
Standard follows the current industry practice in equating the term
byte with octet and using the more familiar term byte in all contexts.
(See octet.)

Octet. An ordered sequence of eight bits considered as a unit. The
Unicode Standard follows current industry practice in referring to an
octet as a byte. (See byte.)

Unicode Scalar Value. Any Unicode code point except high-surrogate and
low-surrogate code points. In other words, the ranges of integers 0 to
D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive. (See definition D76 in
Section 3.9, Unicode Encoding Forms.)

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I now think that "Unicode scalar value" is the right term *if*
U+101D0 PHAISTOS DISC SIGN PEDESTRIAN, for example, is a single
something.
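
To make the distinction concrete, here is a minimal Python 3 sketch
(my own illustration, not taken from the Unicode Standard) that counts
U+101D0 as scalar values and as code units in each encoding form. The
helper name is_scalar_value is hypothetical; it simply restates the
glossary range given above.

```python
# U+101D0 PHAISTOS DISC SIGN PEDESTRIAN lies outside the BMP.
ch = "\U000101D0"

def is_scalar_value(cp):
    """Hypothetical helper: any code point 0..10FFFF except surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

# A Python 3 str is a sequence of code points, so len() counts them.
code_points = [ord(c) for c in ch]

# Count code units by dividing the encoded length by the unit size.
utf8_units = len(ch.encode("utf-8"))            # 8-bit code units
utf16_units = len(ch.encode("utf-16-le")) // 2  # 16-bit code units
utf32_units = len(ch.encode("utf-32-le")) // 4  # 32-bit code units

print(len(code_points), utf8_units, utf16_units, utf32_units)
```

Under these assumptions the sign is one scalar value, but four, two,
and one code units in UTF-8, UTF-16, and UTF-32 respectively, so it is
"a single something" only when counted as a scalar value (or a UTF-32
code unit).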

DR 09-0070 should not be affected by this discussion, since DR 09-0070
is concerned with the representation given by UTF-16LE.  Meanwhile,
other DRs do not choose and fix particular encodings.
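
For reference, this is a rough Python 3 sketch of how a single scalar
value maps to UTF-16LE bytes, following the usual surrogate-pair
arithmetic. The function name utf16le_bytes is hypothetical and is not
taken from any DR or from the standard; it is only meant to show why a
text tied to UTF-16LE sees two 16-bit code units for U+101D0.

```python
def utf16le_bytes(cp):
    """Hypothetical helper: encode one Unicode scalar value as UTF-16LE."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        # BMP scalar values take a single 16-bit code unit.
        units = [cp]
    else:
        # Supplementary values are split into a surrogate pair.
        v = cp - 0x10000
        units = [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
    # Little-endian: the low byte of each code unit comes first.
    return b"".join(u.to_bytes(2, "little") for u in units)

encoded = utf16le_bytes(0x101D0)
print(encoded.hex())
```

The result can be checked against Python's built-in codec with
"\U000101D0".encode("utf-16-le").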

Cheers,


Makoto <EB2M-MRT at asahi-net.or.jp>



More information about the sc34wg4 mailing list