#
# Id: README,v 1.33 2001/01/02 18:46:19 mleisher Exp
#

                           MUTT UCData Package 2.5
                           -----------------------

This package supports ctype-like operations for Unicode UCS-2 text (and
surrogates), case mapping, and decomposition lookup, and provides a
bidirectional reordering algorithm.  To use it, you will need to get the
latest "UnicodeData-*.txt" file from the Unicode Web or FTP site.

The character information portion of the package consists of three parts:

  1. A program called "ucgendat" which generates six data files from the
     UnicodeData-*.txt file.  The files are:

     A. case.dat   - the case mappings.
     B. ctype.dat  - the character property tables.
     C. comp.dat   - the character composition pairs.
     D. decomp.dat - the character decompositions.
     E. cmbcl.dat  - the non-zero combining classes.
     F. num.dat    - the codes representing numbers.

  2. The "ucdata.[ch]" files which implement the functions needed to
     check to see if a character matches groups of properties, to map between
     upper, lower, and title case, to look up the decomposition of a
     character, look up the combining class of a character, and get the number
     value of a character.

  3. The UCData.java class which provides the same API (with minor changes for
     the numbers) and loads the same binary data files as the C code.

A short reference to the functions available is in the "api.txt" file.
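
As a rough illustration of item 2 above, a property check can be pictured as a
binary search over sorted code point ranges, each carrying a set of property
bits.  This is only a sketch: the DEMO_* bits, the demo_range structure, and
the tiny table are hypothetical stand-ins, not the masks or table layout used
by ucdata.c (see "api.txt" and "format.txt" for those).

```c
#include <stddef.h>

/* Hypothetical property bits; the package defines its own masks. */
#define DEMO_UPPER 0x01UL
#define DEMO_LOWER 0x02UL
#define DEMO_DIGIT 0x04UL

/* One inclusive code point range carrying a set of property bits. */
struct demo_range {
    unsigned long first;
    unsigned long last;
    unsigned long props;
};

/* Toy table, sorted by code point.  Real tables come from ctype.dat. */
static const struct demo_range demo_table[] = {
    { 0x0030, 0x0039, DEMO_DIGIT },
    { 0x0041, 0x005A, DEMO_UPPER },
    { 0x0061, 0x007A, DEMO_LOWER },
    { 0x0391, 0x03A1, DEMO_UPPER },
    { 0x03B1, 0x03C1, DEMO_LOWER },
};

/* Return 1 if `code` has any of the properties in `mask`, else 0. */
int demo_isprop(unsigned long code, unsigned long mask)
{
    size_t lo = 0;
    size_t hi = sizeof(demo_table) / sizeof(demo_table[0]);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < demo_table[mid].first)
            hi = mid;                 /* look in the lower half */
        else if (code > demo_table[mid].last)
            lo = mid + 1;             /* look in the upper half */
        else
            return (demo_table[mid].props & mask) != 0;
    }
    return 0;                         /* not covered by any range */
}
```

With this table, demo_isprop(0x0041, DEMO_UPPER | DEMO_LOWER) is nonzero
because U+0041 falls in an uppercase range, while
demo_isprop(0x0040, DEMO_UPPER) is zero because no range covers U+0040.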

Techie Details
==============

The "ucgendat" program parses the files named on the command line, all of
which must be in the Unicode Character Database (UCDB) format.  An additional
properties file, "MUTTUCData.txt", provides some extra properties for some
characters.

The program looks for the two character properties fields (2 and 4), the
combining class field (3), the decomposition field (5), the numeric value
field (8), and the case mapping fields (12, 13, and 14), counting fields from
zero.  The decompositions are recursively expanded before being written out.
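
To make the field numbers concrete, here is a minimal sketch of splitting one
UnicodeData-style line on semicolons so that fields can be picked out by
index.  The function name and the DEMO_MAX_FIELDS limit are hypothetical, and
this is not the parser used by "ucgendat".

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_FIELDS 15

/*
 * Split one UnicodeData-style line in place on ';' and store a pointer to
 * the start of each field.  Empty fields are kept, because a field is
 * identified by its position.  Returns the number of fields stored.
 */
size_t demo_split_fields(char *line, char *fields[], size_t max_fields)
{
    size_t n = 0;
    char *p = line;

    /* Cut the line at the first CR or LF so the last field is clean. */
    line[strcspn(line, "\r\n")] = '\0';

    while (n < max_fields) {
        fields[n++] = p;
        p = strchr(p, ';');
        if (p == NULL)
            break;                    /* that was the last field */
        *p++ = '\0';                  /* terminate field, step past ';' */
    }
    return n;
}
```

For the line "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;" this yields
fields[2] = "Lu" and fields[4] = "L" (the two property fields), fields[3] = "0"
(combining class), an empty fields[5] (no decomposition), and
fields[13] = "0061" (the lower case mapping).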

The decomposition table contains all the canonical decompositions.  This means
all decompositions that do not have tags such as "<compat>" or "<font>".
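
The two ideas above, keeping only untagged decompositions and expanding them
recursively, can be sketched as follows.  The demo_* names and the four-entry
table are hypothetical, and the real generator may do more work than this
sketch does.

```c
#include <stddef.h>

/* A decomposition field is canonical if it is non-empty and has no tag. */
int demo_is_canonical(const char *field)
{
    return field[0] != '\0' && field[0] != '<';
}

/* One single-level canonical decomposition. */
struct demo_decomp {
    unsigned long code;
    size_t len;
    unsigned long parts[2];
};

/* Toy table with a few canonical decompositions from UnicodeData. */
static const struct demo_decomp demo_decomps[] = {
    { 0x00C5, 2, { 0x0041, 0x030A } },
    { 0x00E7, 2, { 0x0063, 0x0327 } },
    { 0x1E09, 2, { 0x00E7, 0x0301 } },
    { 0x212B, 1, { 0x00C5, 0 } },
};

static const struct demo_decomp *demo_find(unsigned long code)
{
    size_t i;

    for (i = 0; i < sizeof(demo_decomps) / sizeof(demo_decomps[0]); i++)
        if (demo_decomps[i].code == code)
            return &demo_decomps[i];
    return NULL;
}

/*
 * Append the full expansion of `code` to out[] starting at index `len`
 * and return the new length.  Code points beyond `cap` are dropped.
 */
size_t demo_expand(unsigned long code, unsigned long out[],
                   size_t len, size_t cap)
{
    const struct demo_decomp *d = demo_find(code);
    size_t i;

    if (d == NULL) {                  /* no decomposition: emit as is */
        if (len < cap)
            out[len++] = code;
        return len;
    }
    for (i = 0; i < d->len; i++)      /* expand each part in turn */
        len = demo_expand(d->parts[i], out, len, cap);
    return len;
}
```

Expanding U+1E09 with this table gives 0063 0327 0301, because its first part,
U+00E7, itself decomposes to 0063 0327.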

The data is almost all stored as unsigned longs (32-bits assumed) and the
routines that load the data take care of endian swaps when necessary.  This
also means that supplementary characters (>= 0x10000) can be placed in the
data files the "ucgendat" program parses.
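
A minimal sketch of the endian handling described above, assuming a
hypothetical 32-bit marker word (DEMO_MARKER) at the start of the data.  The
actual header layout is documented in "format.txt"; the names below are
illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker word; see format.txt for the real header. */
#define DEMO_MARKER 0x0000FEFFu

/* Reverse the byte order of one 32-bit value. */
uint32_t demo_swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/*
 * Fix up a table after it has been read.  `marker` is the marker word
 * exactly as read from the file.  Returns 0 if no swap was needed, 1 if
 * the table was swapped in place, and -1 if the marker is unrecognised.
 */
int demo_fix_byte_order(uint32_t marker, uint32_t *table, size_t count)
{
    size_t i;

    if (marker == DEMO_MARKER)
        return 0;                     /* written with our byte order */
    if (demo_swap32(marker) != DEMO_MARKER)
        return -1;                    /* not a file we understand */
    for (i = 0; i < count; i++)
        table[i] = demo_swap32(table[i]);
    return 1;
}
```

Because every entry is a full 32-bit word, the same table can hold code points
at or above 0x10000 without any change to the swap logic.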

The data is written as external files and broken into six parts so it can be
selectively updated at runtime if necessary.

The data files currently generated from the "ucgendat" program total about 56K
in size.

The format of the binary data files is documented in the "format.txt" file.

==========================================================================

                       The "Pretty Good Bidi Algorithm"
                       --------------------------------

This routine provides an alternative to the Unicode Bidi algorithm.  The
difference is that this version of the PGBA does not handle the explicit
directional codes (LRE, RLE, LRO, RLO, PDF).  It should now produce the same
results as the Unicode Bidi algorithm for implicit reordering.  Included are
functions for doing cursor motion in both logical and visual order.

This implementation is provided to demonstrate an effective alternate method
for implicit reordering.  To make this useful for an application, it probably
needs some changes to the memory allocation and deallocation, as well as data
structure additions for rendering.

Mark Leisher <mleisher@crl.nmsu.edu>
19 November 1999

-----------------------------------------------------------------------------

CHANGES
=======
Version 2.5
-----------
1. Changed the number lookup to set the denominator to 1 in cases of digits.
   This restores functional compatibility with John Cowan's UCType package.

2. Added support for the AL property.

3. Modified load and reload functions to return error codes.
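
As an illustration of item 1, a numeric value can be held as a
numerator/denominator pair in which plain digits get a denominator of 1.  The
sketch below parses a numeric value field such as "7" or "1/2".  The
demo_number structure and demo_parse_number() are hypothetical names, not the
package's own lookup code.

```c
#include <stdlib.h>

/* Hypothetical numerator/denominator pair for a numeric value. */
struct demo_number {
    int numerator;
    int denominator;
};

/*
 * Parse a numeric value field such as "7", "-1" or "1/2".  Whole numbers
 * get a denominator of 1.  Returns 1 on success and 0 if the field is
 * empty or malformed.
 */
int demo_parse_number(const char *field, struct demo_number *num)
{
    char *end;
    long n;
    long d = 1;                       /* default for plain digits */

    if (field == NULL || *field == '\0')
        return 0;
    n = strtol(field, &end, 10);
    if (end == field)
        return 0;                     /* no digits at all */
    if (*end == '/') {
        const char *start = end + 1;

        d = strtol(start, &end, 10);
        if (end == start || d == 0)
            return 0;                 /* missing or zero denominator */
    }
    if (*end != '\0')
        return 0;                     /* trailing junk */
    num->numerator = (int)n;
    num->denominator = (int)d;
    return 1;
}
```

A caller can then treat every value uniformly as numerator/denominator without
special-casing digits.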

Version 2.4
-----------
1. Improved some bidi algorithm documentation in the code.

2. Fixed a code mixup that produced a non-working version.

Version 2.3
-----------
1. Fixed a misspelling in the ucpgba.h header file.

2. Fixed a bug which caused trailing weak non-digit sequences to be left out of
   the reordered string in the bidi algorithm.

3. Fixed a problem with weak sequences containing non-spacing marks in the
   bidi algorithm.

4. Fixed a problem with text runs of the opposite direction of the string
   surrounding a weak + neutral text run appearing in the wrong order in the
   bidi algorithm.

5. Added a default overall direction parameter to the reordering function for
   cases of strings with no strong directional characters in the bidi
   algorithm.

6. The bidi API documentation was improved.

7. Added a man page for the bidi API.

Version 2.2
-----------
1. Fixed a problem with the bidi algorithm locating directional section
   boundaries.

2. Fixed a problem with the bidi algorithm starting the reordering correctly.

3. Fixed a problem with the bidi algorithm determining end boundaries for LTR
   segments.

4. Fixed a problem with the bidi algorithm reordering weak (digits and number
   separators) segments.

5. Added automatic switching of symmetrically paired characters when
   reversing RTL segments.

6. Added a missing symmetric character to the extra character properties in
   MUTTUCData.txt.

7. Added support for doing logical and visual cursor traversal.
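
The switching of paired characters mentioned in item 5 can be pictured with
the sketch below, which reverses a run of code points and swaps each bracket
for its partner.  The function names and the short pair list are hypothetical.
The package derives its symmetric characters from the property data, and this
is not the PGBA code itself.

```c
#include <stddef.h>

/* Return the symmetric partner of `code`, or `code` itself if none. */
unsigned long demo_symmetric_pair(unsigned long code)
{
    switch (code) {
    case 0x0028: return 0x0029;       /* ( and ) */
    case 0x0029: return 0x0028;
    case 0x003C: return 0x003E;       /* < and > */
    case 0x003E: return 0x003C;
    case 0x005B: return 0x005D;       /* [ and ] */
    case 0x005D: return 0x005B;
    case 0x007B: return 0x007D;       /* { and } */
    case 0x007D: return 0x007B;
    default:     return code;
    }
}

/* Reverse run[0..len-1] in place, swapping paired characters as it goes. */
void demo_reverse_rtl(unsigned long run[], size_t len)
{
    size_t i, j;

    if (len == 0)
        return;
    for (i = 0, j = len - 1; i < j; i++, j--) {
        unsigned long tmp = demo_symmetric_pair(run[i]);

        run[i] = demo_symmetric_pair(run[j]);
        run[j] = tmp;
    }
    if (i == j)                       /* odd length: fix the middle one */
        run[i] = demo_symmetric_pair(run[i]);
}
```

Reversing the run 05D0 0028 05D1 0029 this way gives 0028 05D1 0029 05D0, so
the parentheses still open and close in the right visual order.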

Version 2.1
-----------
1. Updated the ucgendat program to handle the Unicode 3.0 character database
   properties.  The AL and BN bidi properties get marked as strong RTL and
   Other Neutral respectively; the NSM, LRE, RLE, PDF, LRO, and RLO controls
   all get marked as Other Neutral.

2. Fixed some problems with testing against signed values in the UCData.java
   code and some minor cleanup.

3. Added the "Pretty Good Bidi Algorithm."

Version 2.0
-----------
1. Removed the old Java stuff for a new class that loads directly from the
   same data files as the C code does.

2. Fixed a problem with choosing the correct field when mapping case.

3. Adjusted some search routines to start their search in the correct position.

4. Moved the copyright year to 1999.

Version 1.9
-----------
1. Fixed a problem with an incorrect amount of storage being allocated for the
   combining class nodes.

2. Fixed an invalid initialization in the number code.

3. Changed the Java template file formatting a bit.

4. Added tables and a function for getting decompositions in the Java class.

Version 1.8
-----------
1. Fixed a problem with adding certain ranges.

2. Added two more macros for testing for identifiers.

3. Tested with the UnicodeData-2.1.5.txt file.

Version 1.7
-----------
1. Fixed a problem with looking up decompositions in "ucgendat."

Version 1.6
-----------
1. Added two new properties introduced with UnicodeData-2.1.4.txt.

2. Changed the "ucgendat.c" program a little to automatically align the
   property data on a 4-byte boundary when new properties are added.

3. Changed the "ucgendat.c" program to only generate canonical
   decompositions.

4. Added two new macros ucisinitialpunct() and ucisfinalpunct() to check for
   initial and final punctuation characters.

5. Minor additions and changes to the documentation.

Version 1.5
-----------
1. Changed all file open calls to include binary mode with "b" for DOS/WIN
   platforms.

2. Wrapped the unistd.h include so it won't be included when compiled under
   Win32.

3. Fixed a bad range check for hex digits in ucgendat.c.

4. Fixed a bad endian swap for combining classes.

5. Added code to make a number table and associated lookup functions.
   Functions added are ucnumber(), ucdigit(), and ucgetnumber().  The last
   function is to maintain compatibility with John Cowan's "uctype" package.

Version 1.4
-----------
1. Fixed a bug with adding a range.

2. Fixed a bug with inserting a range in order.

3. Fixed incorrectly specified ucisdefined() and ucisundefined() macros.

4. Added the missing unload for the combining class data.

5. Fixed a bad macro placement in ucisweak().

Version 1.3
-----------
1. Bug with case mapping calculations fixed.

2. Bug with empty character property entries fixed.

3. Bug with incorrect type in the combining class lookup fixed.

4. Some corrections done to api.txt.

5. Bug in certain character property lookups fixed.

6. Added a character property table that records the defined characters.

7. Replaced ucisunknown() with ucisdefined() and ucisundefined().

Version 1.2
-----------
1. Added code to ucgendat to generate a combining class table.

2. Fixed an endian problem with the byte count of decompositions.

3. Fixed some minor problems in the "format.txt" file.

4. Removed some bogus "Ss" values from MUTTUCData.txt file.

5. Added API function to get combining class.

6. Changed the open mode to "rb" so binary data files will be opened correctly
   on DOS/WIN as well as other platforms.

7. Added the "api.txt" file.

Version 1.1
-----------
1. Added ucisxdigit() which I overlooked.

2. Added UC_LT to the ucisalpha() macro which I overlooked.

3. Changed uciscntrl() to include UC_CF.

4. Added ucisocntrl() and ucisfmtcntrl() macros.

5. Added a ucisblank() which I overlooked.

6. Added missing properties to ucissymbol() and ucisnumber().

7. Added ucisgraph() and ucisprint().

8. Changed the "Mr" property to "Sy" to mark this subset of mirroring
   characters as symmetric to avoid trampling the Unicode/ISO10646 sense of
   mirroring.

9. Added another property called "Ss" which includes control characters
   traditionally seen as spaces in the isspace() macro.

10. Added a bunch of macros to be API compatible with John Cowan's package.

ACKNOWLEDGEMENTS
================

Thanks go to John Cowan <cowan@locke.ccil.org> for pointing out lots of
missing things and giving me stuff, particularly a bunch of new macros.

Thanks go to Bob Verbrugge <bob_verbrugge@nl.compuware.com> for pointing out
various bugs.

Thanks go to Christophe Pierret <cpierret@businessobjects.com> for pointing
out that file modes need to have "b" for DOS/WIN machines, pointing out
unistd.h is not a Win 32 header, and pointing out a problem with ucisalnum().

Thanks go to Kent Johnson <kent@pondview.mv.com> for finding a bug that caused
incomplete decompositions to be generated by the "ucgendat" program.

Thanks go to Valeriy E. Ushakov <uwe@ptc.spbu.ru> for spotting an allocation
error and an initialization error.

Thanks go to Stig Venaas <Stig.Venaas@uninett.no> for providing a patch to
support return types on load and reload, and for major updates to handle
canonical composition and decomposition.