This directory contains a mechanism for GCC to have its own internal
implementation of wcwidth functionality (cpp_wcwidth () in libcpp/charset.cc),
as well as a mechanism to update the information about codepoints permitted in
identifiers, which is encoded in libcpp/ucnid.h, and the mapping between
Unicode names and codepoints, which is encoded in libcpp/uname2c.h.

The idea is to produce the necessary lookup tables
(../../libcpp/{ucnid.h,uname2c.h,generated_cpp_wcwidth.h}) in a reproducible
way, starting from the following files that are distributed by the Unicode
Consortium:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt
ftp://ftp.unicode.org/Public/UNIDATA/PropList.txt
ftp://ftp.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
ftp://ftp.unicode.org/Public/UNIDATA/NameAliases.txt

Two additional files are needed for lookup tables in libstdc++:

ftp://ftp.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakProperty.txt
ftp://ftp.unicode.org/Public/UNIDATA/emoji/emoji-data.txt

All these files have been added to source control in this directory;
please see unicode-license.txt for the relevant copyright information.
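
As an illustration only, the eight URLs above can be assembled and fetched
with a few lines of Python.  This is a sketch, not a script shipped in this
directory: the names data_file_urls and fetch_all are hypothetical, the local
files are assumed to be stored flat in this directory, and whether the FTP
server is reachable from a given machine is not checked here.

```python
import os
import urllib.request

# Base URL and relative paths copied from the lists above.
BASE = "ftp://ftp.unicode.org/Public/UNIDATA/"
LIBCPP_FILES = [
    "UnicodeData.txt",
    "EastAsianWidth.txt",
    "PropList.txt",
    "DerivedNormalizationProps.txt",
    "DerivedCoreProperties.txt",
    "NameAliases.txt",
]
LIBSTDCXX_FILES = [
    "auxiliary/GraphemeBreakProperty.txt",
    "emoji/emoji-data.txt",
]


def data_file_urls(include_libstdcxx=True, base=BASE):
    """Return (url, local_name) pairs for the Unicode data files."""
    names = LIBCPP_FILES + (LIBSTDCXX_FILES if include_libstdcxx else [])
    # Assumption: local copies are stored flat, without subdirectories.
    return [(base + name, name.rsplit("/", 1)[-1]) for name in names]


def fetch_all(dest_dir=".", include_libstdcxx=True, base=BASE):
    """Download every data file into dest_dir (needs network access)."""
    for url, local_name in data_file_urls(include_libstdcxx, base):
        urllib.request.urlretrieve(url, os.path.join(dest_dir, local_name))
```

Calling fetch_all() from this directory would overwrite the local copies;
passing a different base lets the same helper be pointed at another mirror.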

In order to keep in sync with glibc's wcwidth as much as possible, it is
desirable for the logic that processes the Unicode data to be the same as
glibc's.  To that end, the from_glibc/ subdirectory of this directory holds
the glibc Python code that implements that logic.  This code was copied
verbatim from glibc, and it can be updated at any time from the glibc source
code repository.  The files copied from that repository are:

localedata/unicode-gen/unicode_utils.py
localedata/unicode-gen/utf8_gen.py

The most recent versions added to GCC are from glibc git commit:
71de3aead9fffe89556e80ebc94aa918d8ee7bca

The script gen_wcwidth.py found here contains the GCC-specific code to
map glibc's output to the lookup tables we require.  This script should not need
to change, unless there are structural changes to the Unicode data files or to
the glibc code.  Similarly, makeucnid.cc in ../../libcpp contains the logic to
produce ucnid.h.
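
To give a feel for what a range-based lookup table built from these data
files looks like, here is a minimal Python sketch that parses records in the
EastAsianWidth.txt format and looks a codepoint up by binary search.  It is
not the glibc logic and not what gen_wcwidth.py emits; the function names are
hypothetical, and the defaults (unlisted codepoints treated as "N", width 2
for "W" and "F", width 1 otherwise) are simplifying assumptions that ignore
zero-width characters.

```python
import bisect

# A few records in the EastAsianWidth.txt format, for illustration.
SAMPLE = [
    "# comment lines and blank lines are skipped",
    "0041..005A;Na    # Lu    [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "1100..115F;W     # Lo    [96] HANGUL CHOSEONG KIYEOK..HANGUL CHOSEONG FILLER",
    "3000;F           # Zs         IDEOGRAPHIC SPACE",
]


def parse_east_asian_width(lines):
    """Return sorted (first, last, property) tuples from data-file lines."""
    ranges = []
    for line in lines:
        record = line.split("#", 1)[0].strip()  # drop trailing comment
        if not record:
            continue
        codepoints, prop = (field.strip() for field in record.split(";"))
        first, _, last = codepoints.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), prop))
    ranges.sort()
    return ranges


def lookup(ranges, codepoint, default="N"):
    """Binary-search the sorted ranges for the property of codepoint."""
    starts = [first for first, _, _ in ranges]
    i = bisect.bisect_right(starts, codepoint) - 1
    if i >= 0 and ranges[i][0] <= codepoint <= ranges[i][1]:
        return ranges[i][2]
    return default  # simplification: unlisted codepoints are Neutral


def simple_width(ranges, codepoint):
    """Simplified width: 2 for Wide/Fullwidth, 1 for everything else."""
    return 2 if lookup(ranges, codepoint) in ("W", "F") else 1
```

With the sample records, simple_width(parse_east_asian_width(SAMPLE), 0x1100)
gives 2 and the same call for 0x41 gives 1.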

The procedure to update GCC's Unicode support is the following:

1.  Update the six Unicode data files from the above URLs.

2.  Update the two glibc files in from_glibc/ from glibc's git.  Update
    the commit number above in this README.

3.  Run ./gen_wcwidth.py X.Y > ../../libcpp/generated_cpp_wcwidth.h
    (where X.Y is the version of the Unicode standard corresponding to the
    Unicode data files being used, most recently, 15.1.0).

4.  Update Unicode Copyright years in libcpp/makeucnid.cc and in
    libcpp/makeuname2c.cc up to the year in which the Unicode
    standard was released.

5.  Compile makeucnid, e.g. with:
      g++ -O2 ../../libcpp/makeucnid.cc -o ../../libcpp/makeucnid

6.  Generate ucnid.h as follows:
      ../../libcpp/makeucnid ../../libcpp/ucnid.tab UnicodeData.txt \
	DerivedNormalizationProps.txt DerivedCoreProperties.txt \
	> ../../libcpp/ucnid.h

7.  Consult the corresponding version of the Unicode standard and update the
    generated_ranges table in libcpp/makeuname2c.cc accordingly (in Unicode 15
    all the needed information was in Table 4-8).

8.  Compile makeuname2c, e.g. with:
      g++ -O2 ../../libcpp/makeuname2c.cc -o ../../libcpp/makeuname2c

9.  Generate uname2c.h as follows:
      ../../libcpp/makeuname2c UnicodeData.txt NameAliases.txt \
	> ../../libcpp/uname2c.h
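
The mechanical steps above (3, 5, 6, 8 and 9) can be collected into a dry-run
plan, which may help when reviewing what will be regenerated.  The sketch
below only prints the commands and runs nothing; update_commands and render
are hypothetical names, and steps 1, 2, 4 and 7 remain manual.

```python
import shlex

LIBCPP = "../../libcpp"


def update_commands(unicode_version):
    """Return (argv, stdout_path) pairs for steps 3, 5, 6, 8 and 9."""
    return [
        # Step 3: regenerate the wcwidth table.
        (["./gen_wcwidth.py", unicode_version],
         f"{LIBCPP}/generated_cpp_wcwidth.h"),
        # Step 5: compile makeucnid.
        (["g++", "-O2", f"{LIBCPP}/makeucnid.cc",
          "-o", f"{LIBCPP}/makeucnid"], None),
        # Step 6: generate ucnid.h.
        ([f"{LIBCPP}/makeucnid", f"{LIBCPP}/ucnid.tab", "UnicodeData.txt",
          "DerivedNormalizationProps.txt", "DerivedCoreProperties.txt"],
         f"{LIBCPP}/ucnid.h"),
        # Step 8: compile makeuname2c.
        (["g++", "-O2", f"{LIBCPP}/makeuname2c.cc",
          "-o", f"{LIBCPP}/makeuname2c"], None),
        # Step 9: generate uname2c.h.
        ([f"{LIBCPP}/makeuname2c", "UnicodeData.txt", "NameAliases.txt"],
         f"{LIBCPP}/uname2c.h"),
    ]


def render(commands):
    """Format each command as a shell line, with redirection if any."""
    lines = []
    for argv, stdout_path in commands:
        line = shlex.join(argv)
        if stdout_path is not None:
            line += " > " + shlex.quote(stdout_path)
        lines.append(line)
    return lines


def print_plan(unicode_version):
    """Print the dry-run plan for the given Unicode version string."""
    for line in render(update_commands(unicode_version)):
        print(line)
```

For example, print_plan("15.1.0") lists the five commands in order, to be run
from this directory once the manual steps are done.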

See gen_libstdcxx_unicode_data.py for instructions on updating the lookup
tables in libstdc++.