#
# Id: format.txt,v 1.2 2001/01/02 18:46:20 mleisher Exp
#

CHARACTER DATA
==============

This package generates some data files that contain character properties useful
for text processing.

CHARACTER PROPERTIES
====================

The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.

The following is a property name and code table to be used with the character
data:

NAME CODE DESCRIPTION
---------------------
Mn   0    Mark, Non-Spacing
Mc   1    Mark, Spacing Combining
Me   2    Mark, Enclosing
Nd   3    Number, Decimal Digit
Nl   4    Number, Letter
No   5    Number, Other
Zs   6    Separator, Space
Zl   7    Separator, Line
Zp   8    Separator, Paragraph
Cc   9    Other, Control
Cf   10   Other, Format
Cs   11   Other, Surrogate
Co   12   Other, Private Use
Cn   13   Other, Not Assigned
Lu   14   Letter, Uppercase
Ll   15   Letter, Lowercase
Lt   16   Letter, Titlecase
Lm   17   Letter, Modifier
Lo   18   Letter, Other
Pc   19   Punctuation, Connector
Pd   20   Punctuation, Dash
Ps   21   Punctuation, Open
Pe   22   Punctuation, Close
Po   23   Punctuation, Other
Sm   24   Symbol, Math
Sc   25   Symbol, Currency
Sk   26   Symbol, Modifier
So   27   Symbol, Other
L    28   Left-To-Right
R    29   Right-To-Left
EN   30   European Number
ES   31   European Number Separator
ET   32   European Number Terminator
AN   33   Arabic Number
CS   34   Common Number Separator
B    35   Block Separator
S    36   Segment Separator
WS   37   Whitespace
ON   38   Other Neutrals
Pi   47   Punctuation, Initial
Pf   48   Punctuation, Final
#
# Implementation specific properties.
#
Cm   39   Composite
Nb   40   Non-Breaking
Sy   41   Symmetric (characters which are part of open/close pairs)
Hd   42   Hex Digit
Qm   43   Quote Mark
Mr   44   Mirroring
Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
Cp   46   Defined character

The actual binary data is formatted as follows:

  Assumptions: unsigned short is at least 16 bits in size and unsigned long
               is at least 32 bits in size.

    unsigned short ByteOrderMark
    unsigned short OffsetArraySize
    unsigned long  Bytes
    unsigned short Offsets[OffsetArraySize + 1]
    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]

  The Bytes field provides the total byte count used for the Offsets[] and
  Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
  there is always one extra node on the end to hold the final index of the
  Ranges[] array.  The Ranges[] array contains pairs of 4-byte values
  representing a range of Unicode characters.  The pairs are arranged in
  increasing order by the first character code in the range.

  Determining if a particular character has a property requires a simple
  binary search to check whether the character falls in any of the ranges
  for that property.
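  The binary search described above might look like the following C sketch.
  It rests on assumptions the text does not spell out: that Offsets[p] and
  Offsets[p + 1] delimit the slice of Ranges[] belonging to property code p,
  that offsets count unsigned longs (two per range), and that a property
  with no ranges has two equal consecutive offsets.  The function name
  prop_contains is hypothetical.

```c
#include <stddef.h>

/*
 * Return 1 if `code` lies in one of the ranges of property `prop`, else 0.
 * Assumption: offsets[prop] and offsets[prop + 1] bound the property's
 * slice of ranges[], counted in unsigned longs (two per range).
 */
static int prop_contains(const unsigned short *offsets,
                         const unsigned long *ranges,
                         unsigned prop, unsigned long code)
{
    size_t lo = offsets[prop] / 2;      /* first range pair of the slice */
    size_t hi = offsets[prop + 1] / 2;  /* one past the last range pair */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < ranges[2 * mid])
            hi = mid;                   /* below this range's start */
        else if (code > ranges[2 * mid + 1])
            lo = mid + 1;               /* above this range's end */
        else
            return 1;                   /* start <= code <= end */
    }
    return 0;
}
```

  Because the pairs are sorted by their first code, comparing against the
  start and end of the middle pair is enough to halve the search space.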

  If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
  machine with a different endian order and the values must be byte-swapped.

  To swap a 16-bit value:
     c = (c >> 8) | ((c & 0xff) << 8)

  To swap a 32-bit value:
     c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
         (((c >> 16) & 0xff) << 8) | (c >> 24)
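  As a sketch, the two formulas above can be wrapped in small C helpers.
  The names swap16, swap32, and needs_swap are hypothetical, and the
  unswapped mark value 0xFEFF is an assumption (the text only names the
  swapped value 0xFFFE).  The extra masks guard against an unsigned short
  or unsigned long that is wider than 16 or 32 bits.

```c
/* Swap the two bytes of a 16-bit value held in an unsigned short. */
static unsigned short swap16(unsigned short c)
{
    return (unsigned short)(((c >> 8) & 0xff) | ((c & 0xff) << 8));
}

/* Reverse the four bytes of a 32-bit value held in an unsigned long. */
static unsigned long swap32(unsigned long c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | ((c >> 24) & 0xff);
}

/*
 * Classify a ByteOrderMark read from a data file.
 * Returns 0 if no swapping is needed (assumed native mark 0xFEFF),
 * 1 if every value must be byte-swapped (mark 0xFFFE),
 * and -1 if the mark is neither value.
 */
static int needs_swap(unsigned short bom)
{
    if (bom == 0xFEFF)
        return 0;
    if (bom == 0xFFFE)
        return 1;
    return -1;
}
```

  A reader would typically call needs_swap() once on the header and then
  apply swap16() or swap32() to every field of the matching width.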

CASE MAPPINGS
=============

The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case.  Each table is in
increasing order by character code and each mapping node contains 3 unsigned
longs which represent the possible mappings.

The format for the binary form of these tables is:

  unsigned short ByteOrderMark
  unsigned short NumMappingNodes, count of all mapping nodes
  unsigned short CaseTableSizes[2], upper and lower mapping node counts
  unsigned long  CaseTables[NumMappingNodes * 3]

  The starting indexes of the case tables are calculated as follows:

    UpperIndex = 0;
    LowerIndex = CaseTableSizes[0] * 3;
    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;

  The order of the fields for the three tables is:

    Upper case
    ----------
    unsigned long upper;
    unsigned long lower;
    unsigned long title;

    Lower case
    ----------
    unsigned long lower;
    unsigned long upper;
    unsigned long title;

    Title case
    ----------
    unsigned long title;
    unsigned long upper;
    unsigned long lower;

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  Because the tables are in increasing order by character code, locating a
  mapping requires a simple binary search on the first of the 3 codes that
  make up each node.
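  A minimal C sketch of such a lookup follows.  It assumes that each node
  occupies 3 consecutive unsigned longs, that the search key is the first
  field of the node (the table's own case), and that the title case table
  takes whatever nodes remain after the upper and lower tables.  The names
  case_lookup and CASE_UPPER, CASE_LOWER, CASE_TITLE are hypothetical.

```c
#include <stddef.h>

enum { CASE_UPPER = 0, CASE_LOWER = 1, CASE_TITLE = 2 };

/*
 * Find the node for `code` in the chosen case table.
 * Returns a pointer to its 3 unsigned longs, or NULL if there is none.
 * `sizes` holds the upper and lower node counts (CaseTableSizes).
 */
static const unsigned long *case_lookup(const unsigned long *tables,
                                        unsigned short num_nodes,
                                        const unsigned short sizes[2],
                                        int which, unsigned long code)
{
    size_t first, count, lo, hi;

    /* Node index of the table start; multiply by 3 for the long index. */
    if (which == CASE_UPPER) {
        first = 0;
        count = sizes[0];
    } else if (which == CASE_LOWER) {
        first = sizes[0];
        count = sizes[1];
    } else {
        first = (size_t)sizes[0] + sizes[1];
        count = num_nodes - first;      /* assumed: title gets the rest */
    }

    lo = 0;
    hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = tables + (first + mid) * 3;

        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;
    }
    return NULL;
}
```

  For a node found in the upper case table, node[1] is the lower case
  mapping and node[2] the title case mapping, following the field order
  listed above.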

  It is important to note that there can be at most 65535 mapping nodes in
  total, because NumMappingNodes is usually a 16-bit number.  Divided evenly,
  that allows 21845 nodes for each case mapping table.  An individual table
  may hold more or fewer than 21845 nodes, as long as the total does not
  exceed 65535.

COMPOSITIONS
============

This data file is called "comp.dat" and contains data that tracks character
pairs that have a single Unicode value representing the combination of the two
characters.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumCompositionNodes, count of composition nodes
  unsigned long  Bytes, total number of bytes used for composition nodes
  unsigned long  CompositionNodes[NumCompositionNodes * 4]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The CompositionNodes[] array consists of groups of 4 unsigned longs.  The
  first of these is the character code representing the combination of two
  other character codes, the second records the number of character codes that
  make up the composition (not currently used), and the last two are the pair
  of character codes whose combination is represented by the character code in
  the first field.
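  The node layout above can be read as in the following C sketch.  The text
  does not state how the nodes are ordered, so the sketch makes no ordering
  assumption and simply scans the array.  The name compose_pair is
  hypothetical.

```c
#include <stddef.h>

/*
 * Look for a composition of the pair (first, second).
 * Each node is 4 unsigned longs: composite, count (unused), first, second.
 * Returns 1 and stores the composite in *out if found, else returns 0.
 */
static int compose_pair(const unsigned long *nodes, unsigned short num_nodes,
                        unsigned long first, unsigned long second,
                        unsigned long *out)
{
    size_t i;

    for (i = 0; i < num_nodes; i++) {
        const unsigned long *n = nodes + i * 4;

        if (n[2] == first && n[3] == second) {
            *out = n[0];
            return 1;
        }
    }
    return 0;
}
```

  A linear scan is the simplest reading of the format.  If the nodes turn
  out to be sorted, a binary search could replace the loop.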

DECOMPOSITIONS
==============

The next data file is called "decomp.dat" and contains the decomposition data
for all characters whose decompositions contain more than one character and
are *not* compatibility decompositions.  Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field.  Each list of character codes represents a full
decomposition of a composite character.  The nodes are arranged in increasing
order by character code.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumDecompNodes, count of all decomposition nodes
  unsigned long  Bytes
  unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
  unsigned long  Decomp[N], N = total number of character codes in all
                            decomposition lists

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The DecompNodes[] array consists of pairs of unsigned longs, the first of
  which is the character code and the second is the initial index in Decomp[]
  of the list of character codes representing the decomposition.

  Locating the decomposition of a composite character requires a binary search
  for a character code in the DecompNodes[] array and using its index to
  locate the start of the decomposition.  The length of the decomposition list
  is the index stored in the following node in DecompNodes[] minus the current
  index.
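  The lookup and length calculation above might be sketched in C as follows.
  One point is an assumption: the single extra element at the end of
  DecompNodes[] is taken to hold the final index into Decomp[], so that the
  last node also has a "following" index to subtract from.  The name
  decomp_lookup is hypothetical.

```c
#include <stddef.h>

/*
 * Find the decomposition of `code`.
 * nodes[] holds num_nodes (code, index) pairs plus one trailing long,
 * assumed here to be the final index into decomp[].
 * Returns a pointer to the first code of the list and stores the list
 * length in *len, or returns NULL if `code` has no entry.
 */
static const unsigned long *decomp_lookup(const unsigned long *nodes,
                                          unsigned short num_nodes,
                                          const unsigned long *decomp,
                                          unsigned long code, size_t *len)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        unsigned long key = nodes[2 * mid];

        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            unsigned long start = nodes[2 * mid + 1];
            /* Next node's index, or the trailing element for the last. */
            unsigned long next = (mid + 1 < num_nodes)
                                     ? nodes[2 * mid + 3]
                                     : nodes[2 * (size_t)num_nodes];

            *len = (size_t)(next - start);
            return decomp + start;
        }
    }
    return NULL;
}
```

  Storing only start indexes keeps each node at two longs, and the length
  is recovered by subtraction instead of being stored separately.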

COMBINING CLASSES
=================

The fifth data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumCCLNodes
  unsigned long  Bytes
  unsigned long  CCLNodes[NumCCLNodes * 3]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The CCLNodes[] array consists of groups of three unsigned longs.  The first
  and second are the beginning and ending of a range and the third is the
  combining class of that range.

  If a character is not found in this table, then the combining class is
  assumed to be 0.
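  A combining class lookup over this layout could be sketched in C as shown
  here.  The sketch assumes the ranges are sorted in increasing order and do
  not overlap, which the text does not state for this table.  The name
  combining_class is hypothetical.

```c
#include <stddef.h>

/*
 * Return the combining class of `code`.
 * Each node is 3 unsigned longs: range start, range end, combining class.
 * Assumption: nodes are sorted by range start and do not overlap.
 * A character that is not in the table has combining class 0.
 */
static unsigned long combining_class(const unsigned long *nodes,
                                     unsigned short num_nodes,
                                     unsigned long code)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *n = nodes + mid * 3;

        if (code < n[0])
            hi = mid;
        else if (code > n[1])
            lo = mid + 1;
        else
            return n[2];
    }
    return 0;
}
```

  Returning 0 on a miss mirrors the rule above, so callers do not need a
  separate "not found" result.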

  It is important to note that only 65535 distinct ranges plus combining class
  can be specified because NumCCLNodes is usually a 16-bit number.

NUMBER TABLE
============

The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.

The format for the binary form of the table is:

  unsigned short ByteOrderMark
  unsigned short NumNumberNodes
  unsigned long  Bytes
  unsigned long  NumberNodes[NumNumberNodes]
  unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                            / sizeof(short)]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The NumberNodes array contains pairs of values, the first of which is the
  character code and the second an index into the ValueNodes array.  The
  ValueNodes array contains pairs of integers which represent the numerator
  and denominator of the numeric value of the character.  If the character
  happens to map to an integer, both values in its ValueNodes pair will be
  the same.
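  As an illustration of the two arrays, a numeric value lookup might be
  sketched in C as follows.  Several points are assumptions rather than
  statements from the text: NumNumberNodes is taken to count unsigned longs
  (two per pair), the pairs are taken to be sorted by character code, and
  the stored index is taken to count individual ValueNodes entries.  The
  names number_lookup and struct num_value are hypothetical.

```c
#include <stddef.h>

struct num_value {
    unsigned short numerator;
    unsigned short denominator;
};

/*
 * Find the numeric value of `code`.
 * nodes[] holds (code, index) pairs; num_longs is the number of unsigned
 * longs in it (assumed), so there are num_longs / 2 pairs.
 * Returns 1 and fills *out if found, else returns 0.
 */
static int number_lookup(const unsigned long *nodes, unsigned short num_longs,
                         const unsigned short *values, unsigned long code,
                         struct num_value *out)
{
    size_t lo = 0, hi = num_longs / 2;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        unsigned long key = nodes[2 * mid];

        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            const unsigned short *v = values + nodes[2 * mid + 1];

            out->numerator = v[0];
            out->denominator = v[1];
            return 1;
        }
    }
    return 0;
}
```

  A caller can detect the integer case described above by checking whether
  numerator and denominator are equal.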