1<?xml version="1.0" encoding="UTF-8" ?>
2<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
3                   "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd"
4[
5<!ENTITY % defs SYSTEM "defs.ent"> %defs;
6]>
7
8
9<!-- lifted from troff+ms+XMan by doclifter -->
10<article id="ctext">
11
12<articleinfo>
13   <title>Compound Text Encoding</title>
14   <subtitle>X Consortium Standard</subtitle>
15   <authorgroup>
16      <author>
17         <firstname>Robert</firstname><othername>W.</othername><surname>Scheifler</surname>
18         <affiliation><orgname>X Consortium</orgname></affiliation>
19      </author>
20   </authorgroup>
21   <copyright><year>1989</year><holder>X Consortium</holder></copyright>
22   <releaseinfo>X Version 11, Release &fullrelvers;</releaseinfo>
23   <releaseinfo>Version 1.1</releaseinfo>
24
25<legalnotice>
26<para>
27Permission is hereby granted, free of charge, to any person obtaining a copy
28of this software and associated documentation files (the "Software"), to deal
29in the Software without restriction, including without limitation the rights
30to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
31copies of the Software, and to permit persons to whom the Software is
32furnished to do so, subject to the following conditions:
33</para>
34
35<para>
36The above copyright notice and this permission notice shall be included in
37all copies or substantial portions of the Software.
38</para>
39
40<para>
41THE SOFTWARE IS PROVIDED &ldquo;AS IS&rdquo;, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
42IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
43FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
44X CONSORTIUM BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN
45AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
46CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
47</para>
48
49<para>
50Except as contained in this notice, the name of the X Consortium shall not be
51used in advertising or otherwise to promote the sale, use or other dealings
52in this Software without prior written authorization from the X Consortium.
53</para>
54<para>X Window System is a trademark of The Open Group.</para>
55</legalnotice>
56</articleinfo>
57<sect1 id="Overview">
58<title>Overview</title>
59
60<para>
61Compound Text is a format for multiple character set data, such as
62multi-lingual text.  The format is based on ISO
63standards for encoding and combining character sets.  Compound Text is intended
64to be used in three main contexts: inter-client communication using selections,
65as defined in the
66<emphasis remap='I'>Inter-Client Communication Conventions Manual</emphasis>
67(ICCCM); <!-- xref -->
68window properties (e.g., window manager hints as defined in the ICCCM);
69and resources (e.g., as defined in Xlib and the Xt Intrinsics).
70</para>
71
72<para>
73Compound Text is intended as an external representation, or interchange format,
74not as an internal representation.  It is expected (but not required) that
75clients will convert Compound Text to some internal representation for
76processing and rendering, and convert from that internal representation to
77Compound Text when providing textual data to another client.
78</para>
79</sect1>
80
81<sect1 id="Values">
82<title>Values</title>
83<para>
84<!-- .LP -->
85The name of this encoding is "COMPOUND_TEXT".  When text values are used in
86the ICCCM-compliant selection mechanism or are stored as window properties in
87the server, the type used should be the atom for "COMPOUND_TEXT".
88</para>
89
90<para>
91<!-- .LP -->
92Octet values are represented in this document as two decimal numbers in the
93form col/row.  This means the value (col * 16) + row.  For example, 02/01 means
94the value 33.
95</para>
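<para>
As an illustration only (it is not part of the standard), the col/row
notation can be converted to and from octet values with a short Python
sketch.  The helper names <literal>octet</literal> and
<literal>notation</literal> are hypothetical and are introduced here just for
the example.
</para>
<programlisting language="python"><![CDATA[
```python
def octet(col: int, row: int) -> int:
    """Return the octet value that this document writes as col/row."""
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError("col and row must each be in the range 0 to 15")
    return col * 16 + row


def notation(value: int) -> str:
    """Format an octet value in col/row notation, e.g. 33 becomes '02/01'."""
    if not 0 <= value <= 255:
        raise ValueError("octet value must be in the range 0 to 255")
    return f"{value // 16:02d}/{value % 16:02d}"
```
]]></programlisting>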
96<para>
97For our purposes, the octet encoding space is divided into four ranges:
98</para>
99
100<informaltable frame="none">
101  <?dbfo keep-together="always" ?>
102  <tgroup cols='2' align='left'  colsep='0' rowsep='0'>
103  <colspec colname='c1' colwidth="1.0*"/>
104  <colspec colname='c2' colwidth="9.0*"/>
105  <tbody>
106    <row>
107      <entry>C0</entry>
108      <entry>octets from 00/00 to 01/15</entry>
109    </row>
110    <row>
111      <entry>GL</entry>
112      <entry>octets from 02/00 to 07/15</entry>
113    </row>
114    <row>
115      <entry>C1</entry>
116      <entry>octets from 08/00 to 09/15</entry>
117    </row>
118    <row>
119      <entry>GR</entry>
120      <entry>octets from 10/00 to 15/15</entry>
121    </row>
122  </tbody>
123  </tgroup>
124</informaltable>
125
126<para>
127<!-- .LP -->
128C0 and C1 are "control character" sets, while GL and GR are "graphic
129character" sets.  Only a subset of C0 and C1 octets are used in the encoding,
130and depending on the character set encoding defined as GL or GR, a subset of
131GL and GR octets may be used; see below for details.  All octets (00/00 to
13215/15) may appear inside the text of extended segments (defined below).
133</para>
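<para>
The four ranges can be told apart by simple comparisons on the octet value.
The following Python sketch assumes a hypothetical helper named
<literal>octet_range</literal>; it only classifies an octet and does not
check whether that octet is actually permitted in Compound Text.
</para>
<programlisting language="python"><![CDATA[
```python
def octet_range(value: int) -> str:
    """Return 'C0', 'GL', 'C1' or 'GR' for an octet value."""
    if not 0 <= value <= 0xFF:
        raise ValueError("octet value must be in the range 0 to 255")
    if value < 0x20:      # 00/00 to 01/15
        return "C0"
    if value < 0x80:      # 02/00 to 07/15
        return "GL"
    if value < 0xA0:      # 08/00 to 09/15
        return "C1"
    return "GR"           # 10/00 to 15/15
```
]]></programlisting>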
134<para>
135<!-- .LP -->
136[For those familiar with ISO 2022, we will use only an 8-bit environment, and
137we will always use G0 for GL and G1 for GR.]
138</para>
139</sect1>
140
141<sect1 id="Control_Characters">
142<title>Control Characters</title>
143<para>
144In C0, only the following values will be used:
145</para>
146
147<informaltable frame="none">
148  <?dbfo keep-together="always" ?>
149  <tgroup cols='3' align='left' colsep='0' rowsep='0'>
150  <colspec colname='c1' colwidth="1.0*"/>
151  <colspec colname='c2' colwidth="1.0*"/>
152  <colspec colname='c3' colwidth="5.0*"/>
153  <tbody>
154    <row>
155      <entry>00/09</entry>
156      <entry>HT</entry>
157      <entry>HORIZONTAL TABULATION</entry>
158    </row>
159    <row>
160      <entry>00/10</entry>
161      <entry>NL</entry>
162      <entry>NEW LINE</entry>
163    </row>
164    <row>
165      <entry>01/11</entry>
166      <entry>ESC</entry>
167      <entry>(ESCAPE)</entry>
168    </row>
169  </tbody>
170  </tgroup>
171</informaltable>
172
173<para>
174In C1, only the following value will be used:
175</para>
176
177<informaltable frame="none">
178  <?dbfo keep-together="always" ?>
179  <tgroup cols='3' align='left' colsep='0' rowsep='0'>
180  <colspec colname='c1' colwidth="1.0*"/>
181  <colspec colname='c2' colwidth="1.0*"/>
182  <colspec colname='c3' colwidth="5.0*"/>
183  <tbody>
184    <row>
185      <entry>09/11</entry>
186      <entry>CSI</entry>
187      <entry>CONTROL SEQUENCE INTRODUCER</entry>
188    </row>
189  </tbody>
190  </tgroup>
191</informaltable>
192
193<para>
194<!-- .LP -->
195[The alternate 7-bit CSI encoding 01/11 05/11 is not used in Compound Text.]
196</para>
197<para>
198<!-- .LP -->
199No control sequences are defined in Compound Text for changing the C0 and C1
200sets.
201</para>
202<para>
203<!-- .LP -->
204A horizontal tab can be represented with the octet 00/09.  Specification of
205tabulation width settings is not part of Compound Text and must be obtained
206from context (in an unspecified manner).
207</para>
208<para>
209<!-- .LP -->
210[Inclusion of horizontal tab is for consistency with the STRING type currently
211defined in the ICCCM.]
212</para>
213<para>
214<!-- .LP -->
215A newline (line separator/terminator) can be represented with the octet 00/10.
216</para>
217<para>
218<!-- .LP -->
219[Note that 00/10 is normally LINEFEED, but is being interpreted as NEWLINE.
220This can be thought of as using the (deprecated) NEW LINE mode, E.1.3, in ISO
2216429.  Use of this value instead of 08/05 (NEL, NEXT LINE) is for consistency
222with the STRING type currently defined in the ICCCM.]
223</para>
224<para>
225<!-- .LP -->
226The remaining C0 and C1 values (01/11 and 09/11) are only used in the control
227sequences defined below.
228</para>
229</sect1>
230
231<sect1 id="Standard_Character_Set_Encodings">
232<title>Standard Character Set Encodings</title>
233<para>
234<!-- .LP -->
235The default GL and GR sets in Compound Text correspond to the left and right
236halves of ISO 8859-1 (Latin 1).  As such, any legal instance of a STRING type
237(as defined in the ICCCM) is also a legal instance of type COMPOUND_TEXT.
238</para>
239<para>
[The implied initial state in ISO 2022 is defined with the sequence:
<literallayout class="monospaced">
01/11 02/00 04/03  G0 and G1 in an 8-bit environment only.  Designation also invokes.
01/11 02/00 04/07  In an 8-bit environment, C1 is represented as 8 bits.
01/11 02/00 04/09  Graphic character sets can be 94 or 96.
01/11 02/00 04/11  8-bit code is used.
01/11 02/08 04/02  Designate ASCII into G0.
01/11 02/13 04/01  Designate right-hand part of ISO Latin-1 into G1.
</literallayout>
]
248</para>
249
250<para>
251To define one of the approved standard character set encodings to be
252the GL set, one of the following control sequences is used:
253</para>
254
255<informaltable frame="none">
256  <?dbfo keep-together="always" ?>
257  <tgroup cols='4' align='left' colsep='0' rowsep='0'>
258  <colspec colname='c1' colwidth="1.0*"/>
259  <colspec colname='c2' colwidth="1.0*"/>
260  <colspec colname='c3' colwidth="2.0*"/>
261  <colspec colname='c4' colwidth="8.0*"/>
262  <tbody>
263    <row>
264      <entry>01/11</entry>
265      <entry>02/08</entry>
266      <entry>{I} F</entry>
267      <entry>94 character set</entry>
268    </row>
269    <row>
270      <entry>01/11</entry>
271      <entry>02/04</entry>
      <entry>02/08 {I} F</entry>
273      <entry>94<superscript>N</superscript> character set</entry>
274    </row>
275  </tbody>
276  </tgroup>
277</informaltable>
278
279<para>
280<!-- .LP -->
281To define one of the approved standard character set encodings to be
282the GR set, one of the following control sequences is used:
283</para>
284
285<informaltable frame="none">
286  <?dbfo keep-together="always" ?>
287  <tgroup cols='4' align='left' colsep='0' rowsep='0'>
288  <colspec colname='c1' colwidth="1.0*"/>
289  <colspec colname='c2' colwidth="1.0*"/>
290  <colspec colname='c3' colwidth="2.0*"/>
291  <colspec colname='c4' colwidth="8.0*"/>
292  <tbody>
293    <row>
294      <entry>01/11</entry>
295      <entry>02/09</entry>
296      <entry>{I} F</entry>
297      <entry>94 character set</entry>
298    </row>
299    <row>
300      <entry>01/11</entry>
301      <entry>02/13</entry>
302      <entry>{I} F</entry>
303      <entry>96 character set</entry>
304    </row>
305    <row>
306      <entry>01/11</entry>
307      <entry>02/04</entry>
308      <entry>02/09 {I} F</entry>
309      <entry>94<superscript>N</superscript> character set</entry>
310    </row>
311  </tbody>
312  </tgroup>
313</informaltable>
314
315<para>
316<!-- .LP -->
The "F" in the control sequences above stands for "Final character", which
318is always in the range 04/00 to 07/14.  The "{I}" stands for zero or more
319"intermediate characters", which are always in the range 02/00 to 02/15, with
320the first intermediate character always in the range 02/01 to 02/03.  The
321registration authority has defined an "{I} F" sequence for each registered
322character set encoding.
323</para>
324
325<para>
326<!-- .LP -->
327[Final characters for private encodings (in the range 03/00 to 03/15) are not
328permitted here in Compound Text.]
329</para>
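<para>
The designation sequences for GL and GR can be recognized mechanically.  The
Python sketch below is one possible reading of the rules above, not a
reference implementation; the function name
<literal>parse_designation</literal> and the shape of its return value are
assumptions made for the example.  It does not check whether the
"{I} F" octets name an approved encoding.
</para>
<programlisting language="python"><![CDATA[
```python
ESC = 0x1B

# Octet after ESC (or after ESC 02/04) -> (half being defined, set size).
_SINGLE = {0x28: ("GL", 94), 0x29: ("GR", 94), 0x2D: ("GR", 96)}
_MULTI = {0x28: ("GL", 94), 0x29: ("GR", 94)}


def parse_designation(data: bytes):
    """Parse one designation sequence at the start of data.

    Returns (half, size, multibyte, ident, consumed), where ident holds the
    "{I} F" octets and consumed is the length of the whole sequence.
    """
    if len(data) < 3 or data[0] != ESC:
        raise ValueError("not a designation sequence")
    pos = 1
    multibyte = data[pos] == 0x24          # 02/04 introduces a 94^N set
    if multibyte:
        pos += 1
    table = _MULTI if multibyte else _SINGLE
    if data[pos] not in table:
        raise ValueError("unknown designation introducer")
    half, size = table[data[pos]]
    pos += 1
    start = pos
    while pos < len(data) and 0x20 <= data[pos] <= 0x2F:
        pos += 1                           # intermediate characters
    if pos > start and not 0x21 <= data[start] <= 0x23:
        raise ValueError("first intermediate must be 02/01 to 02/03")
    if pos >= len(data) or not 0x40 <= data[pos] <= 0x7E:
        raise ValueError("missing or invalid Final character")
    pos += 1
    return half, size, multibyte, bytes(data[start:pos]), pos
```
]]></programlisting>
<para>
A private Final character such as 03/00 is rejected by this sketch, which
matches the restriction stated above.
</para>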
330<para>
331<!-- .LP -->
332For GL, octet 02/00 is always defined as SPACE, and octet 07/15 (normally
333DELETE) is never used.  For a 94-character set defined as GR, octets 10/00 and
33415/15 are never used.
335</para>
336<para>
337<!-- .LP -->
338[This is consistent with ISO 2022.]
339</para>
340<para>
341<!-- .LP -->
342A 94<superscript>N</superscript> character set uses N octets (N &gt; 1) for each character.
343The value of N is derived from the column value for F:
344</para>
345
346<informaltable frame="none">
347  <?dbfo keep-together="always" ?>
348  <tgroup cols='2' align='left' colsep='0' rowsep='0'>
349  <colspec colname='c1' colwidth="1.0*"/>
350  <colspec colname='c2' colwidth="3.0*"/>
351  <tbody>
352    <row>
353      <entry>column 04 or 05</entry>
354      <entry>2 octets</entry>
355    </row>
356    <row>
357      <entry>column 06</entry>
358      <entry>3 octets</entry>
359    </row>
360    <row>
361      <entry>column 07</entry>
362      <entry>4 or more octets</entry>
363    </row>
364  </tbody>
365  </tgroup>
366</informaltable>
367<para>
368<!-- .LP -->
369In a 94<superscript>N</superscript> encoding, the octet values 02/00 and 07/15 (in GL) and
37010/00 and 15/15 (in GR) are never used.
371</para>
372<para>
373<!-- .LP -->
374[The column definitions come from ISO 2022.]
375</para>
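<para>
The column rule can be expressed as a small lookup.  In this Python sketch
the name <literal>octets_per_character</literal> is hypothetical, and
returning 4 for column 07 is an assumption of the example: the table only
says "4 or more", so the value should be read as a minimum.
</para>
<programlisting language="python"><![CDATA[
```python
def octets_per_character(final: int) -> int:
    """Return N for a 94^N set, given its Final character F.

    For column 07 the result is the minimum (4); the exact count is not
    determined by F alone.
    """
    if not 0x40 <= final <= 0x7E:
        raise ValueError("Final character must be in 04/00 to 07/14")
    column = final >> 4
    if column in (4, 5):
        return 2
    if column == 6:
        return 3
    return 4
```
]]></programlisting>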
376<para>
377<!-- .LP -->
378Once a GL or GR set has been defined, all further octets in that range (except
379within control sequences and extended segments) are interpreted with respect to
380that character set encoding, until the GL or GR set is redefined.  GL and GR
sets can be defined independently; they do not have to be defined in pairs.
382</para>
383<para>
384<!-- .LP -->
385Note that when actually using a character set encoding as the GR set, you must
386force the most significant bit (08/00) of each octet to be a one, so that it
387falls in the range 10/00 to 15/15.
388</para>
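<para>
For example, character codes tabulated in their 7-bit (GL) form can be moved
into the GR range by setting the top bit of every octet.  The helper name
<literal>to_gr</literal> in this Python sketch is hypothetical.
</para>
<programlisting language="python"><![CDATA[
```python
def to_gr(octets: bytes) -> bytes:
    """Set the most significant bit of each octet (GL form to GR form)."""
    if any(not 0x20 <= b <= 0x7F for b in octets):
        raise ValueError("input octets must be in the range 02/00 to 07/15")
    return bytes(b | 0x80 for b in octets)
```
]]></programlisting>
<para>
With this helper, the two-octet code 03/00 02/01 becomes 11/00 10/01 when its
character set is used as the GR set.
</para>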
389<para>
390<!-- xref -->
391[Control sequences to specify character set encoding revisions (as in section
3926.3.13 of ISO 2022) are not used in Compound Text.  Revision indicators do not
393appear to provide useful information in the context of Compound Text.  The most
394recent revision can always be assumed, since revisions are upward compatible.]
395</para>
396</sect1>
397
398<sect1 id="Approved_Standard_Encodings">
399<title>Approved Standard Encodings</title>
400<para>
401The following are the approved standard encodings to be used with Compound
402Text.  Note that none have Intermediate characters; however, a good parser will
403still deal with Intermediate characters in the event that additional encodings
404are later added to this list.
405</para>
406
407<informaltable frame="topbot">
408  <?dbfo keep-together="auto" ?>
409  <tgroup cols='3' align='left' colsep='0' rowsep='0'>
410  <colspec colname='c1' colwidth="1.0*"/>
411  <colspec colname='c2' colwidth="1.0*"/>
412  <colspec colname='c3' colwidth="10.0*"/>
413  <thead>
414    <row rowsep='1'>
415      <entry>{I} F</entry>
416      <entry>94/96</entry>
417      <entry>Description</entry>
418    </row>
419  </thead>
420  <tbody>
421    <row>
      <entry>04/02</entry>
423      <entry>94</entry>
424      <entry>
4257-bit ASCII graphics (ANSI X3.4-1968), Left half of ISO 8859 sets
426      </entry>
427    </row>
428    <row>
429      <entry>04/09</entry>
430      <entry>94</entry>
431      <entry>
432Right half of JIS X0201-1976 (reaffirmed 1984),
4338-Bit Alphanumeric-Katakana Code
434      </entry>
435    </row>
436    <row>
437      <entry>04/10</entry>
438      <entry>94</entry>
439      <entry>
440Left half of JIS X0201-1976 (reaffirmed 1984),
4418-Bit Alphanumeric-Katakana Code
442      </entry>
443    </row>
444    <row>
445      <entry>04/01</entry>
446      <entry>96</entry>
447      <entry>Right half of ISO 8859-1, Latin alphabet No. 1</entry>
448    </row>
449    <row>
450      <entry>04/02</entry>
451      <entry>96</entry>
452      <entry>Right half of ISO 8859-2, Latin alphabet No. 2</entry>
453    </row>
454    <row>
455      <entry>04/03</entry>
456      <entry>96</entry>
457      <entry>Right half of ISO 8859-3, Latin alphabet No. 3</entry>
458    </row>
459    <row>
460      <entry>04/04</entry>
461      <entry>96</entry>
462      <entry>Right half of ISO 8859-4, Latin alphabet No. 4</entry>
463    </row>
464    <row>
465      <entry>04/06</entry>
466      <entry>96</entry>
467      <entry>Right half of ISO 8859-7, Latin/Greek alphabet</entry>
468    </row>
469    <row>
470      <entry>04/07</entry>
471      <entry>96</entry>
472      <entry>Right half of ISO 8859-6, Latin/Arabic alphabet</entry>
473    </row>
474    <row>
475      <entry>04/08</entry>
476      <entry>96</entry>
477      <entry>Right half of ISO 8859-8, Latin/Hebrew alphabet</entry>
478    </row>
479    <row>
480      <entry>04/12</entry>
481      <entry>96</entry>
482      <entry>Right half of ISO 8859-5, Latin/Cyrillic alphabet</entry>
483    </row>
484    <row>
485      <entry>04/13</entry>
486      <entry>96</entry>
487      <entry>Right half of ISO 8859-9, Latin alphabet No. 5</entry>
488    </row>
489    <row>
490      <entry>04/01</entry>
      <entry>94<superscript>2</superscript></entry>
492      <entry>GB2312-1980, China (PRC) Hanzi</entry>
493    </row>
494    <row>
495      <entry>04/02</entry>
      <entry>94<superscript>2</superscript></entry>
497      <entry>JIS X0208-1983, Japanese Graphic Character Set</entry>
498    </row>
499    <row>
500      <entry>04/03</entry>
      <entry>94<superscript>2</superscript></entry>
502      <entry>KS C5601-1987, Korean Graphic Character Set</entry>
503    </row>
504  </tbody>
505  </tgroup>
506</informaltable>
507
508<para>
509<!-- .LP -->
510The sets listed as "Left half of ..." should always be defined as GL.  The
511sets listed as "Right half of ..." should always be defined as GR.  Other
512sets can be defined either as GL or GR.
513</para>
514</sect1>
515
516<sect1 id="Non_Standard_Character_Set_Encodings">
517<title>Non-Standard Character Set Encodings</title>
518<para>
519Character set encodings that are not in the list of approved standard
520encodings can be included
521using "extended segments".  An extended segment begins with one of the
522following sequences:
523</para>
524
525<informaltable frame="none">
526  <?dbfo keep-together="always" ?>
527  <tgroup cols='2' align='left' colsep='0' rowsep='0'>
528  <colspec colname='c1' colwidth="1.0*"/>
529  <colspec colname='c2' colwidth="2.0*"/>
530  <tbody>
531    <row>
      <entry>01/11 02/05 02/15 03/00 M L</entry>
      <entry>variable number of octets per character</entry>
    </row>
    <row>
      <entry>01/11 02/05 02/15 03/01 M L</entry>
      <entry>1 octet per character</entry>
    </row>
    <row>
      <entry>01/11 02/05 02/15 03/02 M L</entry>
      <entry>2 octets per character</entry>
    </row>
    <row>
      <entry>01/11 02/05 02/15 03/03 M L</entry>
      <entry>3 octets per character</entry>
    </row>
    <row>
      <entry>01/11 02/05 02/15 03/04 M L</entry>
      <entry>4 octets per character</entry>
550    </row>
551  </tbody>
552  </tgroup>
553</informaltable>
554
555<para>
556[This uses the "other coding system" of ISO 2022, using private Final
557characters.]
558</para>
559<para>
560<!-- .LP -->
561The "M" and "L" octets represent a 14-bit unsigned value giving the number
562of octets that appear in the remainder of the segment.  The number is computed
as ((M - 128) * 128) + (L - 128).  The most significant bits of M and L are always
564set to one.  The remainder of the segment consists of two parts, the name of
565the character set encoding and the actual text.  The name of the encoding comes
566first and is separated from the text by the octet 00/02 (STX, START OF TEXT).
567Note that the length defined by M and L includes the encoding name and
568separator.
569</para>
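<para>
The length arithmetic and the name/text split can be sketched in Python as
follows.  This is an illustration under stated assumptions, not a reference
implementation: the function names
<literal>build_extended_segment</literal> and
<literal>parse_extended_segment</literal> are hypothetical, and the encoding
name used in the usage note is made up and not registered.
</para>
<programlisting language="python"><![CDATA[
```python
ESC, STX = 0x1B, 0x02


def build_extended_segment(name: bytes, text: bytes,
                           octets_per_char: int = 0) -> bytes:
    """Build an extended segment; octets_per_char 0 means 'variable'."""
    if not 0 <= octets_per_char <= 4:
        raise ValueError("octets_per_char must be in the range 0 to 4")
    body = name + bytes([STX]) + text      # length covers name and STX too
    if len(body) > 0x3FFF:
        raise ValueError("segment body exceeds the 14-bit length")
    m, l = divmod(len(body), 128)
    header = bytes([ESC, 0x25, 0x2F, 0x30 + octets_per_char,
                    m + 128, l + 128])     # top bit of M and L is set
    return header + body


def parse_extended_segment(data: bytes):
    """Return (octets_per_char, name, text, consumed) for a segment."""
    if len(data) < 6 or data[:3] != b"\x1b%/" or not 0x30 <= data[3] <= 0x34:
        raise ValueError("not an extended segment")
    m, l = data[4], data[5]
    if m < 128 or l < 128:
        raise ValueError("top bit of M and L must be set")
    length = (m - 128) * 128 + (l - 128)
    body = data[6:6 + length]
    if len(body) != length or STX not in body:
        raise ValueError("truncated segment or missing STX separator")
    name, _, text = body.partition(bytes([STX]))
    return data[3] - 0x30, name, text, 6 + length
```
]]></programlisting>
<para>
For instance, a segment carrying two octets of text under the made-up name
"example-0" has a 12-octet remainder, so M is 08/00 and L is 08/12.
</para>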
570<para>
571<!-- .LP -->
572[The encoding of the length is chosen to avoid having zero octets in Compound
573Text when possible, because embedded NUL values are problematic in many C
574language routines.  The use of zero octets cannot be ruled out entirely
575however, since some octets in the actual text of the extended segment may have
576to be zero.]
577</para>
578<para>
579<!-- .LP -->
The name of the encoding should be registered with the X Consortium to avoid
conflicts and should, when appropriate, match the CharSet Registry and Encoding
582registration used in the X Logical Font Description.  The name itself should be
583encoded using ISO 8859-1 (Latin 1), should not use question mark (03/15) or
584asterisk (02/10), and should use hyphen (02/13) only in accordance with the X
585Logical Font Description.
586</para>
587<para>
588<!-- .LP -->
589Extended segments are not to be used for any character set encoding that can
590be constructed from a GL/GR pair of approved standard encodings. For
591example, it is incorrect to use an extended segment for any of the ISO 8859
592family of encodings.
593</para>
594<para>
595<!-- .LP -->
596It should be noted that the contents of an extended segment are arbitrary;
597for example,
598they may contain octets in the C0 and C1 ranges, including 00/00, and
599octets comprising a given character may differ in their most significant bit.
600</para>
601<para>
602<!-- .LP -->
603[ISO-registered "other coding systems" are not used in Compound Text;
604extended segments are the only mechanism for non-2022 encodings.]
605</para>
606</sect1>
607
608<sect1 id="Directionality">
609<title>Directionality</title>
610<para>
611<!-- .LP -->
612If desired, horizontal text direction can be indicated using the following
613control sequences:
614</para>
615
616<informaltable frame="none">
617  <?dbfo keep-together="always" ?>
618  <tgroup cols='2' align='left' colsep='0' rowsep='0'>
619  <colspec colname='c1' colwidth="1.0*"/>
620  <colspec colname='c2' colwidth="2.0*"/>
621  <tbody>
622    <row>
623      <entry>09/11 03/01 05/13</entry>
624      <entry>begin left-to-right text</entry>
625    </row>
626    <row>
627      <entry>09/11 03/02 05/13</entry>
628      <entry>begin right-to-left text</entry>
629    </row>
630    <row>
631      <entry>09/11 05/13</entry>
632      <entry>end of string</entry>
633    </row>
634  </tbody>
635  </tgroup>
636</informaltable>
637
638<para>
639<!-- .LP -->
640[This is a subset of the SDS (START DIRECTED STRING) control in the Draft
641Bidirectional Addendum to ISO 6429.]
642</para>
643<para>
644<!-- .LP -->
645Directionality can be nested.  Logically, a stack of directions is maintained.
646Each of the first two control sequences pushes a new direction on the stack,
647and the third sequence (revert) pops a direction from the stack.  The stack
648starts out empty at the beginning of a Compound Text string.  When the stack is
649empty, the directionality of the text is unspecified.
650</para>
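<para>
The stack discipline can be illustrated with a short Python sketch.  The
function name <literal>direction_runs</literal> is hypothetical.  The sketch
assumes that the input contains no extended segments, whose contents are
arbitrary and could include octets that look like these control sequences,
and it does not interpret any other control sequence.
</para>
<programlisting language="python"><![CDATA[
```python
# The three directionality control sequences and what they push.
# None marks the sequence that pops the stack.
_SEQUENCES = ((b"\x9b1]", "ltr"), (b"\x9b2]", "rtl"), (b"\x9b]", None))


def direction_runs(data: bytes):
    """Split data into (direction, octets) runs.

    direction is 'ltr', 'rtl', or None when the stack is empty.
    """
    stack = []
    runs = []
    pos = start = 0
    while pos < len(data):
        for seq, direction in _SEQUENCES:
            if data.startswith(seq, pos):
                break
        else:
            pos += 1                       # ordinary octet, keep scanning
            continue
        if pos > start:                    # close the run before the control
            runs.append((stack[-1] if stack else None, data[start:pos]))
        if direction is None:
            if not stack:
                raise ValueError("pop with an empty direction stack")
            stack.pop()
        else:
            stack.append(direction)
        pos += len(seq)
        start = pos
    if pos > start:
        runs.append((stack[-1] if stack else None, data[start:pos]))
    return runs
```
]]></programlisting>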
651<para>
652<!-- .LP -->
653Directionality applies to all subsequent text, whether in GL, GR, or an
654extended segment.  If the desired directionality of GL, GR, or extended
655segments differs, then directionality control sequences must be inserted when
656switching between them.
657</para>
658<para>
659<!-- .LP -->
660Note that definition of GL and GR sets is independent of directionality;
661defining a new GL or GR set does not change the current directionality, and
662pushing or popping a directionality does not change the current GL and GR
663definitions.
664</para>
665<para>
666<!-- .LP -->
667Specification of directionality is entirely optional; text direction should be
668clear from context in most cases.  However, it must be the case that either
669all characters in a Compound Text string have explicitly specified direction
670or that all characters have unspecified direction.  That is, if directionality
671control sequences are used, the first such control sequence must precede the
672first graphic character in a Compound Text string, and graphic characters are
673not permitted whenever the directionality stack is empty.
674</para>
675</sect1>
676
677<sect1 id="Resources">
678<title>Resources</title>
679<para>
680<!-- .LP -->
681To use Compound Text in a resource, you can simply treat all octets as if they
682were ASCII/Latin-1 and just replace all "\" octets (05/12) with the two
683octets "\\", all newline octets (00/10) with the two octets "\n", and
684all zero octets with the four octets "\000".
685It is up to the client making use of the resource to interpret the data as
686Compound Text; the policy by which this is ascertained is not constrained by
687the Compound Text specification.
688</para>
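<para>
The three replacements can be applied in a single pass over the octets.  In
this Python sketch the function name
<literal>escape_for_resource</literal> is hypothetical; it performs only the
substitutions listed above.
</para>
<programlisting language="python"><![CDATA[
```python
def escape_for_resource(data: bytes) -> bytes:
    """Escape backslash, newline and zero octets for use in a resource."""
    out = bytearray()
    for b in data:
        if b == 0x5C:            # "\" (05/12) becomes two backslashes
            out += b"\\\\"
        elif b == 0x0A:          # newline (00/10) becomes backslash, n
            out += b"\\n"
        elif b == 0x00:          # zero octet becomes backslash, 000
            out += b"\\000"
        else:
            out.append(b)
    return bytes(out)
```
]]></programlisting>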
689</sect1>
690
691<sect1 id="Font_Names">
692<title>Font Names</title>
693
694<para>
695The following CharSet names for the standard character set encodings are
696registered for use in font names under the X Logical Font Description:
697</para>
698
699<informaltable frame="topbot">
700  <?dbfo keep-together="auto" ?>
701  <tgroup cols='3' align='left' colsep='0' rowsep='0'>
702  <colspec colname='c1' colwidth="1.0*"/>
703  <colspec colname='c2' colwidth="2.0*"/>
704  <colspec colname='c3' colwidth="2.0*"/>
705  <thead>
706    <row rowsep='1'>
707      <entry>Name</entry>
708      <entry>Encoding Standard</entry>
709      <entry>Description</entry>
710    </row>
711  </thead>
712  <tbody>
713    <row>
714      <entry>ISO8859-1</entry>
      <entry>ISO 8859-1</entry>
      <entry>Latin alphabet No. 1</entry>
717    </row>
718    <row>
719      <entry>ISO8859-2</entry>
      <entry>ISO 8859-2</entry>
      <entry>Latin alphabet No. 2</entry>
722    </row>
723    <row>
724      <entry>ISO8859-3</entry>
      <entry>ISO 8859-3</entry>
      <entry>Latin alphabet No. 3</entry>
727    </row>
728    <row>
729      <entry>ISO8859-4</entry>
      <entry>ISO 8859-4</entry>
      <entry>Latin alphabet No. 4</entry>
732    </row>
733    <row>
734      <entry>ISO8859-5</entry>
735      <entry>ISO 8859-5</entry>
736      <entry>Latin/Cyrillic alphabet</entry>
737    </row>
738    <row>
739      <entry>ISO8859-6</entry>
740      <entry>ISO 8859-6</entry>
741      <entry>Latin/Arabic alphabet</entry>
742    </row>
743    <row>
744      <entry>ISO8859-7</entry>
      <entry>ISO 8859-7</entry>
      <entry>Latin/Greek alphabet</entry>
747    </row>
748    <row>
749      <entry>ISO8859-8</entry>
      <entry>ISO 8859-8</entry>
751      <entry>Latin/Hebrew alphabet</entry>
752    </row>
753    <row>
754      <entry>ISO8859-9</entry>
      <entry>ISO 8859-9</entry>
      <entry>Latin alphabet No. 5</entry>
757    </row>
758    <row>
759      <entry>JISX0201.1976-0</entry>
760      <entry>JIS X0201-1976 (reaffirmed 1984)</entry>
761      <entry>8-bit Alphanumeric-Katakana Code</entry>
762    </row>
763    <row>
764      <entry>GB2312.1980-0</entry>
765      <entry>GB2312-1980, GL encoding</entry>
766      <entry>China (PRC) Hanzi</entry>
767    </row>
768    <row>
769      <entry>JISX0208.1983-0</entry>
770      <entry>JIS X0208-1983, GL encoding</entry>
771      <entry>Japanese Graphic Character Set</entry>
772    </row>
773    <row>
774      <entry>KSC5601.1987-0</entry>
775      <entry>KS C5601-1987, GL encoding</entry>
776      <entry>Korean Graphic Character Set</entry>
777    </row>
778  </tbody>
779  </tgroup>
780</informaltable>
781
782</sect1>
783<sect1 id="Extensions">
784<title>Extensions</title>
785<para>
786<!-- .LP -->
787There is no absolute requirement for a parser to deal with anything but the
788particular encoding syntax defined in this specification.  However, it is
789possible that Compound Text may be extended in the future, and as such it may
790be desirable to construct the parser to handle 2022/6429 syntax more generally.
791</para>
792<para>
793<!-- .LP -->
794There are two general formats covering all control sequences that are expected
795to appear in extensions:
796</para>
797
798<para>
79901/11 {I} F
800</para>
801
802<para>
803For this format, I is always in the range 02/00 to 02/15, and F is always
804in the range 03/00 to 07/14.
805</para>
806
807<para>
80809/11 {P} {I} F
809</para>
810
811<para>
812For this format, P is always in the range 03/00 to 03/15, I is always in
813the range 02/00 to 02/15, and F is always in the range 04/00 to 07/14.
814</para>
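<para>
A general parser can use these two formats to find the end of a control
sequence without knowing its meaning.  The Python sketch below assumes a
hypothetical helper named <literal>control_sequence_length</literal>; it only
measures a sequence and leaves interpretation (and any M, L, and segment
octets that follow) to the caller.
</para>
<programlisting language="python"><![CDATA[
```python
def control_sequence_length(data: bytes, pos: int = 0) -> int:
    """Return the length of the ESC or CSI sequence starting at data[pos]."""

    def skip(i: int, lo: int, hi: int) -> int:
        # Advance past octets that fall in the range lo..hi.
        while i < len(data) and lo <= data[i] <= hi:
            i += 1
        return i

    if pos >= len(data):
        raise ValueError("position is past the end of the data")
    if data[pos] == 0x1B:                  # 01/11 {I} F
        i = skip(pos + 1, 0x20, 0x2F)
        final_low = 0x30
    elif data[pos] == 0x9B:                # 09/11 {P} {I} F
        i = skip(skip(pos + 1, 0x30, 0x3F), 0x20, 0x2F)
        final_low = 0x40
    else:
        raise ValueError("no control sequence at this position")
    if i >= len(data) or not final_low <= data[i] <= 0x7E:
        raise ValueError("truncated or malformed control sequence")
    return i + 1 - pos
```
]]></programlisting>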
815
816<para>
817<!-- .LP -->
818In addition, new (singleton) control characters (in the C0 and C1 ranges) might
819be defined in the future.
820</para>
821
822<para>
823<!-- .LP -->
824Finally, new kinds of "segments" might be defined in the future using syntax
825similar to extended segments:
826</para>
827
828<para>
82901/11 02/05 02/15 F M L
830</para>
831
832<para>
For this format, F is in the range 03/05 to 03/15.  M and L are as defined
834in extended segments.  Such a segment will always be followed by the number
835of octets defined by M and L.  These octets can have arbitrary values and
836need not follow the internal structure defined for current extended
837segments.
838</para>
839
840<para>
841<!-- .LP -->
842If extensions to this specification are defined in the future, then any string
843incorporating instances of such extensions must start with one of the following
844control sequences:
845</para>
846
847<informaltable frame="none">
848  <?dbfo keep-together="always" ?>
849  <tgroup cols='2' align='left' colsep='0' rowsep='0'>
850  <colspec colname='c1' colwidth="1.0*"/>
851  <colspec colname='c2' colwidth="2.0*"/>
852  <tbody>
853    <row>
854      <entry>01/11 02/03 V 03/00</entry>
855      <entry>ignoring extensions is OK</entry>
856    </row>
857    <row>
858      <entry>01/11 02/03 V 03/01</entry>
859      <entry>ignoring extensions is not OK</entry>
860    </row>
861  </tbody>
862  </tgroup>
863</informaltable>
864
865<para>
866<!-- .LP -->
In either case, V is in the range 02/00 to 02/15 and indicates the major
version of the specification being used, minus one.  These version control
sequences are
870for use by clients that implement earlier versions, but have implemented a
871general parser.  The first control sequence indicates that it is acceptable to
872ignore all extension control sequences; no mandatory information will be lost
873in the process.  The second control sequence indicates that it is unacceptable
874to ignore any extension control sequences; mandatory information would be lost
875in the process.  In general, it will be up to the client generating the
876Compound Text to decide which control sequence to use.
877</para>
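<para>
As a sketch of how a client might emit or recognize a version control
sequence, consider the following Python fragment.  The function names
<literal>version_sequence</literal> and
<literal>parse_version_sequence</literal> are hypothetical, and returning
<literal>None</literal> when no version sequence is present is a choice made
for the example.
</para>
<programlisting language="python"><![CDATA[
```python
def version_sequence(major: int, ignorable: bool) -> bytes:
    """Build 01/11 02/03 V 03/00 (ignorable) or 01/11 02/03 V 03/01."""
    if not 1 <= major <= 16:
        raise ValueError("major version must be in the range 1 to 16")
    v = 0x20 + (major - 1)                 # V encodes major version minus one
    return bytes([0x1B, 0x23, v, 0x30 if ignorable else 0x31])


def parse_version_sequence(data: bytes):
    """Return (major, ignorable) if data starts with a version sequence."""
    if (len(data) < 4 or data[0] != 0x1B or data[1] != 0x23
            or not 0x20 <= data[2] <= 0x2F or data[3] not in (0x30, 0x31)):
        return None
    return data[2] - 0x20 + 1, data[3] == 0x30
```
]]></programlisting>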
878</sect1>
879
880<sect1 id="Errors">
881<title>Errors</title>
882<para>
883<!-- .LP -->
884If a Compound Text string does not match the specification here (e.g., uses
885undefined control characters, or undefined control sequences, or incorrectly
886formatted extended segments), it is best to treat the entire string as invalid,
887except as indicated by a version control sequence.
888</para>
889</sect1>
890</article>
891