#
# Id: format.txt,v 1.2 2001/01/02 18:46:20 mleisher Exp
#

CHARACTER DATA
==============

This package generates some data files that contain character properties useful
for text processing.

CHARACTER PROPERTIES
====================

The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.

The following is a property name and code table to be used with the character
data:

NAME CODE DESCRIPTION
---------------------
Mn   0    Mark, Non-Spacing
Mc   1    Mark, Spacing Combining
Me   2    Mark, Enclosing
Nd   3    Number, Decimal Digit
Nl   4    Number, Letter
No   5    Number, Other
Zs   6    Separator, Space
Zl   7    Separator, Line
Zp   8    Separator, Paragraph
Cc   9    Other, Control
Cf   10   Other, Format
Cs   11   Other, Surrogate
Co   12   Other, Private Use
Cn   13   Other, Not Assigned
Lu   14   Letter, Uppercase
Ll   15   Letter, Lowercase
Lt   16   Letter, Titlecase
Lm   17   Letter, Modifier
Lo   18   Letter, Other
Pc   19   Punctuation, Connector
Pd   20   Punctuation, Dash
Ps   21   Punctuation, Open
Pe   22   Punctuation, Close
Po   23   Punctuation, Other
Sm   24   Symbol, Math
Sc   25   Symbol, Currency
Sk   26   Symbol, Modifier
So   27   Symbol, Other
L    28   Left-To-Right
R    29   Right-To-Left
EN   30   European Number
ES   31   European Number Separator
ET   32   European Number Terminator
AN   33   Arabic Number
CS   34   Common Number Separator
B    35   Block Separator
S    36   Segment Separator
WS   37   Whitespace
ON   38   Other Neutrals
Pi   47   Punctuation, Initial
Pf   48   Punctuation, Final
#
# Implementation specific properties.
#
Cm   39   Composite
Nb   40   Non-Breaking
Sy   41   Symmetric (characters which are part of open/close pairs)
Hd   42   Hex Digit
Qm   43   Quote Mark
Mr   44   Mirroring
Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
Cp   46   Defined character

The actual binary data is formatted as follows:

  Assumptions: unsigned short is at least 16 bits in size and unsigned long
               is at least 32 bits in size.

    unsigned short ByteOrderMark
    unsigned short OffsetArraySize
    unsigned long  Bytes
    unsigned short Offsets[OffsetArraySize + 1]
    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]

  The Bytes field provides the total byte count used for the Offsets[] and
  Ranges[] arrays. The Offsets[] array is aligned on a 4-byte boundary and
  there is always one extra node on the end to hold the final index of the
  Ranges[] array. The Ranges[] array contains pairs of 4-byte values
  representing a range of Unicode characters. The pairs are arranged in
  increasing order by the first character code in the range.

  Determining whether a character has a particular property requires a
  simple binary search over the ranges for that property.

  If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
  machine with a different endian order and the values must be byte-swapped.

  To swap a 16-bit value:
     c = (c >> 8) | ((c & 0xff) << 8)

  To swap a 32-bit value:
     c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
         (((c >> 16) & 0xff) << 8) | (c >> 24)

CASE MAPPINGS
=============

The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case. Each table is in
increasing order by character code and each mapping contains 3 unsigned longs
which represent the possible mappings.
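
As an illustration only, the lookup implied by this layout can be sketched in C
as a binary search over groups of three values. The name case_lookup and its
parameters are hypothetical and not part of the package. The sketch assumes
the table has already been loaded and byte-swapped if necessary, and that the
first value of each node is the search key.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the package: find the mapping node whose
 * first value equals `code` in a table of `count` nodes, each made up of
 * three unsigned longs and sorted by that first value.  Returns a pointer to
 * the node, or NULL if the code has no mapping in this table.
 */
static const unsigned long *
case_lookup(const unsigned long *table, size_t count, unsigned long code)
{
    size_t lo = 0, hi = count;   /* search the half-open range [lo, hi) */

    assert(table != NULL || count == 0);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = table + mid * 3;

        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;
    }
    return NULL;
}
```

The returned pointer addresses all three values of the node, so the caller can
pick whichever of the other two mappings it needs.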

The format for the binary form of these tables is:

    unsigned short ByteOrderMark
    unsigned short NumMappingNodes, count of all mapping nodes
    unsigned short CaseTableSizes[2], upper and lower mapping node counts
    unsigned long  CaseTables[NumMappingNodes * 3]

  The starting indexes of the case tables are calculated as follows:

    UpperIndex = 0;
    LowerIndex = CaseTableSizes[0] * 3;
    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;

  The order of the fields for the three tables is:

    Upper case
    ----------
    unsigned long upper;
    unsigned long lower;
    unsigned long title;

    Lower case
    ----------
    unsigned long lower;
    unsigned long upper;
    unsigned long title;

    Title case
    ----------
    unsigned long title;
    unsigned long upper;
    unsigned long lower;

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  Because the tables are in increasing order by character code, locating a
  mapping requires a simple binary search on one of the 3 codes that make up
  each node.

  It is important to note that there can be at most 65535 mapping nodes,
  because NumMappingNodes is an unsigned short, which is usually 16 bits.
  Divided evenly, that allows 21845 nodes for each case mapping table. An
  individual table may hold more or fewer than 21845 nodes, but the total
  cannot exceed 65535.

COMPOSITIONS
============

This data file is called "comp.dat" and contains data that tracks character
pairs that have a single Unicode value representing the combination of the two
characters.

The format for the binary form of this table is:

    unsigned short ByteOrderMark
    unsigned short NumCompositionNodes, count of composition nodes
    unsigned long  Bytes, total number of bytes used for composition nodes
    unsigned long  CompositionNodes[NumCompositionNodes * 4]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The CompositionNodes[] array consists of groups of 4 unsigned longs. The
  first of these is the character code representing the combination of two
  other character codes, the second records the number of character codes that
  make up the composition (not currently used), and the last two are the pair
  of character codes whose combination is represented by the character code in
  the first field.

DECOMPOSITIONS
==============

The next data file is called "decomp.dat" and contains the decomposition data
for all characters whose decompositions contain more than one character and
are *not* compatibility decompositions. Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field. Each list of character codes represents a full
decomposition of a composite character. The nodes are arranged in increasing
order by character code.

The format for the binary form of this table is:

    unsigned short ByteOrderMark
    unsigned short NumDecompNodes, count of all decomposition nodes
    unsigned long  Bytes
    unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
    unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.
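
The header handling and byte-order check shared by these files can be sketched
in C as follows. This is a minimal illustration, not the package's own code:
the names swap16, swap32, struct header, and read_header are hypothetical. It
also assumes, beyond what the format states, that the three header fields are
stored as exactly 16, 16, and 32 bits with no padding, which is why it uses
the fixed-width types from <stdint.h> instead of unsigned short and unsigned
long.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Swap the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t c)
{
    return (uint16_t)((c >> 8) | ((c & 0xff) << 8));
}

/* Reverse the four bytes of a 32-bit value. */
static uint32_t swap32(uint32_t c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | (c >> 24);
}

/* Hypothetical in-memory form of the three header fields. */
struct header {
    uint16_t bom;      /* ByteOrderMark as read, before any swapping */
    uint16_t count;    /* node count, for example NumDecompNodes */
    uint32_t bytes;    /* the Bytes field */
    int      swapped;  /* nonzero if the remaining data needs swapping too */
};

/*
 * Decode the 8 header bytes at `buf`.  Assumption (not stated by the format):
 * the fields are exactly 16, 16, and 32 bits wide with no padding.
 */
static struct header read_header(const unsigned char *buf)
{
    struct header h;

    assert(buf != NULL);
    memcpy(&h.bom, buf, 2);
    memcpy(&h.count, buf + 2, 2);
    memcpy(&h.bytes, buf + 4, 4);

    /* 0xFFFE means the file was written with the opposite byte order. */
    h.swapped = (h.bom == 0xFFFE);
    if (h.swapped) {
        h.count = swap16(h.count);
        h.bytes = swap32(h.bytes);
    }
    return h;
}
```

When the swapped flag is set, every value read from the arrays that follow the
header needs the same treatment before it is used.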

  The DecompNodes[] array consists of pairs of unsigned longs, the first of
  which is the character code and the second is the initial index of the list
  of character codes representing the decomposition.

  Locating the decomposition of a composite character requires a binary search
  for a character code in the DecompNodes[] array and using its index to
  locate the start of the decomposition. The length of the decomposition list
  is the index in the following element of DecompNodes[] minus the current
  index.

COMBINING CLASSES
=================

The next data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.

The format for the binary form of this table is:

    unsigned short ByteOrderMark
    unsigned short NumCCLNodes
    unsigned long  Bytes
    unsigned long  CCLNodes[NumCCLNodes * 3]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The CCLNodes[] array consists of groups of three unsigned longs. The first
  and second are the beginning and ending of a range and the third is the
  combining class of that range.

  If a character is not found in this table, then the combining class is
  assumed to be 0.

  It is important to note that at most 65535 distinct entries, each a range
  plus its combining class, can be specified because NumCCLNodes is usually a
  16-bit number.

NUMBER TABLE
============

The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.

The format for the binary form of the table is:

    unsigned short ByteOrderMark
    unsigned short NumNumberNodes
    unsigned long  Bytes
    unsigned long  NumberNodes[NumNumberNodes]
    unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                              / sizeof(short)]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The NumberNodes array contains pairs of values, the first of which is the
  character code and the second an index into the ValueNodes array. The
  ValueNodes array contains pairs of integers which represent the numerator
  and denominator of the numeric value of the character. If the character
  happens to map to an integer, both the values in ValueNodes will be the
  same.
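
The relationship between the two arrays can be sketched in C as follows. The
names struct number and number_lookup are hypothetical and not part of the
package. The sketch takes the number of pairs as a parameter, and it assumes
the pairs are sorted by character code like the other tables, which this
section does not state explicitly; a linear scan would avoid that assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: a character's numeric value as a fraction. */
struct number {
    unsigned short numerator;
    unsigned short denominator;
};

/*
 * Hypothetical helper, not part of the package.  `nodes` holds `npairs`
 * (character code, ValueNodes index) pairs and `values` is the ValueNodes
 * array.  Assumes the pairs are sorted by character code, which the format
 * description does not state.  Returns 1 and fills `*out` if `code` has a
 * numeric value, otherwise returns 0.
 */
static int
number_lookup(const unsigned long *nodes, size_t npairs,
              const unsigned short *values, unsigned long code,
              struct number *out)
{
    size_t lo = 0, hi = npairs;   /* search the half-open range [lo, hi) */

    assert(out != NULL);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        unsigned long c = nodes[mid * 2];

        if (code < c)
            hi = mid;
        else if (code > c)
            lo = mid + 1;
        else {
            /* The second value of the pair indexes into ValueNodes. */
            const unsigned short *v = values + nodes[mid * 2 + 1];

            out->numerator = v[0];
            out->denominator = v[1];
            return 1;
        }
    }
    return 0;
}
```

A caller can compare the two fields of the result to detect the integer case
described above.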