1 2 3 4 5 6 7 Network Working Group A. Costello 8 Request for Comments: 3492 Univ. of California, Berkeley 9 Category: Standards Track March 2003 10 11 12 Punycode: A Bootstring encoding of Unicode 13 for Internationalized Domain Names in Applications (IDNA) 14 15 Status of this Memo 16 17 This document specifies an Internet standards track protocol for the 18 Internet community, and requests discussion and suggestions for 19 improvements. Please refer to the current edition of the "Internet 20 Official Protocol Standards" (STD 1) for the standardization state 21 and status of this protocol. Distribution of this memo is unlimited. 22 23 Copyright Notice 24 25 Copyright (C) The Internet Society (2003). All Rights Reserved. 26 27 Abstract 28 29 Punycode is a simple and efficient transfer encoding syntax designed 30 for use with Internationalized Domain Names in Applications (IDNA). 31 It uniquely and reversibly transforms a Unicode string into an ASCII 32 string. ASCII characters in the Unicode string are represented 33 literally, and non-ASCII characters are represented by ASCII 34 characters that are allowed in host name labels (letters, digits, and 35 hyphens). This document defines a general algorithm called 36 Bootstring that allows a string of basic code points to uniquely 37 represent any string of code points drawn from a larger set. 38 Punycode is an instance of Bootstring that uses particular parameter 39 values specified by this document, appropriate for IDNA. 40 41 Table of Contents 42 43 1. Introduction...............................................2 44 1.1 Features..............................................2 45 1.2 Interaction of protocol parts.........................3 46 2. Terminology................................................3 47 3. Bootstring description.....................................4 48 3.1 Basic code point segregation..........................4 49 3.2 Insertion unsort coding...............................4 50 3.3 Generalized variable-length integers..................5 51 3.4 Bias adaptation.......................................7 52 4. Bootstring parameters......................................8 53 5. Parameter values for Punycode..............................8 54 6. Bootstring algorithms......................................9 55 56 57 58 Costello Standards Track [Page 1] 59 61 RFC 3492 IDNA Punycode March 2003 62 63 64 6.1 Bias adaptation function.............................10 65 6.2 Decoding procedure...................................11 66 6.3 Encoding procedure...................................12 67 6.4 Overflow handling....................................13 68 7. Punycode examples.........................................14 69 7.1 Sample strings.......................................14 70 7.2 Decoding traces......................................17 71 7.3 Encoding traces......................................19 72 8. Security Considerations...................................20 73 9. References................................................21 74 9.1 Normative References.................................21 75 9.2 Informative References...............................21 76 A. Mixed-case annotation.....................................22 77 B. Disclaimer and license....................................22 78 C. Punycode sample implementation............................23 79 Author's Address.............................................34 80 Full Copyright Statement.....................................35 81 82 1. Introduction 83 84 [IDNA] describes an architecture for supporting internationalized 85 domain names. Labels containing non-ASCII characters can be 86 represented by ACE labels, which begin with a special ACE prefix and 87 contain only ASCII characters. The remainder of the label after the 88 prefix is a Punycode encoding of a Unicode string satisfying certain 89 constraints. For the details of the prefix and constraints, see 90 [IDNA] and [NAMEPREP]. 91 92 Punycode is an instance of a more general algorithm called 93 Bootstring, which allows strings composed from a small set of "basic" 94 code points to uniquely represent any string of code points drawn 95 from a larger set. Punycode is Bootstring with particular parameter 96 values appropriate for IDNA. 97 98 1.1 Features 99 100 Bootstring has been designed to have the following features: 101 102 * Completeness: Every extended string (sequence of arbitrary code 103 points) can be represented by a basic string (sequence of basic 104 code points). Restrictions on what strings are allowed, and on 105 length, can be imposed by higher layers. 106 107 * Uniqueness: There is at most one basic string that represents a 108 given extended string. 109 110 * Reversibility: Any extended string mapped to a basic string can 111 be recovered from that basic string. 112 113 114 115 Costello Standards Track [Page 2] 116 118 RFC 3492 IDNA Punycode March 2003 119 120 121 * Efficient encoding: The ratio of basic string length to extended 122 string length is small. This is important in the context of 123 domain names because RFC 1034 [RFC1034] restricts the length of a 124 domain label to 63 characters. 125 126 * Simplicity: The encoding and decoding algorithms are reasonably 127 simple to implement. The goals of efficiency and simplicity are 128 at odds; Bootstring aims at a good balance between them. 129 130 * Readability: Basic code points appearing in the extended string 131 are represented as themselves in the basic string (although the 132 main purpose is to improve efficiency, not readability). 133 134 Punycode can also support an additional feature that is not used by 135 the ToASCII and ToUnicode operations of [IDNA]. When extended 136 strings are case-folded prior to encoding, the basic string can use 137 mixed case to tell how to convert the folded string into a mixed-case 138 string. See appendix A "Mixed-case annotation". 139 140 1.2 Interaction of protocol parts 141 142 Punycode is used by the IDNA protocol [IDNA] for converting domain 143 labels into ASCII; it is not designed for any other purpose. It is 144 explicitly not designed for processing arbitrary free text. 145 146 2. Terminology 147 148 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 149 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 150 document are to be interpreted as described in BCP 14, RFC 2119 151 [RFC2119]. 152 153 A code point is an integral value associated with a character in a 154 coded character set. 155 156 As in the Unicode Standard [UNICODE], Unicode code points are denoted 157 by "U+" followed by four to six hexadecimal digits, while a range of 158 code points is denoted by two hexadecimal numbers separated by "..", 159 with no prefixes. 160 161 The operators div and mod perform integer division; (x div y) is the 162 quotient of x divided by y, discarding the remainder, and (x mod y) 163 is the remainder, so (x div y) * y + (x mod y) == x. Bootstring uses 164 these operators only with nonnegative operands, so the quotient and 165 remainder are always nonnegative. 166 167 The break statement jumps out of the innermost loop (as in C). 168 169 170 171 172 Costello Standards Track [Page 3] 173 175 RFC 3492 IDNA Punycode March 2003 176 177 178 An overflow is an attempt to compute a value that exceeds the maximum 179 value of an integer variable. 180 181 3. Bootstring description 182 183 Bootstring represents an arbitrary sequence of code points (the 184 "extended string") as a sequence of basic code points (the "basic 185 string"). This section describes the representation. Section 6 186 "Bootstring algorithms" presents the algorithms as pseudocode. 187 Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the 188 algorithms for sample inputs. 189 190 The following sections describe the four techniques used in 191 Bootstring. "Basic code point segregation" is a very simple and 192 efficient encoding for basic code points occurring in the extended 193 string: they are simply copied all at once. "Insertion unsort 194 coding" encodes the non-basic code points as deltas, and processes 195 the code points in numerical order rather than in order of 196 appearance, which typically results in smaller deltas. The deltas 197 are represented as "generalized variable-length integers", which use 198 basic code points to represent nonnegative integers. The parameters 199 of this integer representation are dynamically adjusted using "bias 200 adaptation", to improve efficiency when consecutive deltas have 201 similar magnitudes. 202 203 3.1 Basic code point segregation 204 205 All basic code points appearing in the extended string are 206 represented literally at the beginning of the basic string, in their 207 original order, followed by a delimiter if (and only if) the number 208 of basic code points is nonzero. The delimiter is a particular basic 209 code point, which never appears in the remainder of the basic string. 210 The decoder can therefore find the end of the literal portion (if 211 there is one) by scanning for the last delimiter. 212 213 3.2 Insertion unsort coding 214 215 The remainder of the basic string (after the last delimiter if there 216 is one) represents a sequence of nonnegative integral deltas as 217 generalized variable-length integers, described in section 3.3. The 218 meaning of the deltas is best understood in terms of the decoder. 219 220 The decoder builds the extended string incrementally. Initially, the 221 extended string is a copy of the literal portion of the basic string 222 (excluding the last delimiter). The decoder inserts non-basic code 223 points, one for each delta, into the extended string, ultimately 224 arriving at the final decoded string. 225 226 227 228 229 Costello Standards Track [Page 4] 230 232 RFC 3492 IDNA Punycode March 2003 233 234 235 At the heart of this process is a state machine with two state 236 variables: an index i and a counter n. The index i refers to a 237 position in the extended string; it ranges from 0 (the first 238 position) to the current length of the extended string (which refers 239 to a potential position beyond the current end). If the current 240 state is <n,i>, the next state is <n,i+1> if i is less than the 241 length of the extended string, or <n+1,0> if i equals the length of 242 the extended string. In other words, each state change causes i to 243 increment, wrapping around to zero if necessary, and n counts the 244 number of wrap-arounds. 245 246 Notice that the state always advances monotonically (there is no way 247 for the decoder to return to an earlier state). At each state, an 248 insertion is either performed or not performed. At most one 249 insertion is performed in a given state. An insertion inserts the 250 value of n at position i in the extended string. The deltas are a 251 run-length encoding of this sequence of events: they are the lengths 252 of the runs of non-insertion states preceeding the insertion states. 253 Hence, for each delta, the decoder performs delta state changes, then 254 an insertion, and then one more state change. (An implementation 255 need not perform each state change individually, but can instead use 256 division and remainder calculations to compute the next insertion 257 state directly.) It is an error if the inserted code point is a 258 basic code point (because basic code points were supposed to be 259 segregated as described in section 3.1). 260 261 The encoder's main task is to derive the sequence of deltas that will 262 cause the decoder to construct the desired string. It can do this by 263 repeatedly scanning the extended string for the next code point that 264 the decoder would need to insert, and counting the number of state 265 changes the decoder would need to perform, mindful of the fact that 266 the decoder's extended string will include only those code points 267 that have already been inserted. Section 6.3 "Encoding procedure" 268 gives a precise algorithm. 269 270 3.3 Generalized variable-length integers 271 272 In a conventional integer representation the base is the number of 273 distinct symbols for digits, whose values are 0 through base-1. Let 274 digit_0 denote the least significant digit, digit_1 the next least 275 significant, and so on. The value represented is the sum over j of 276 digit_j * w(j), where w(j) = base^j is the weight (scale factor) for 277 position j. For example, in the base 8 integer 437, the digits are 278 7, 3, and 4, and the weights are 1, 8, and 64, so the value is 7 + 279 3*8 + 4*64 = 287. This representation has two disadvantages: First, 280 there are multiple encodings of each value (because there can be 281 extra zeros in the most significant positions), which is inconvenient 282 283 284 285 286 Costello Standards Track [Page 5] 287 289 RFC 3492 IDNA Punycode March 2003 290 291 292 when unique encodings are needed. Second, the integer is not self- 293 delimiting, so if multiple integers are concatenated the boundaries 294 between them are lost. 295 296 The generalized variable-length representation solves these two 297 problems. The digit values are still 0 through base-1, but now the 298 integer is self-delimiting by means of thresholds t(j), each of which 299 is in the range 0 through base-1. Exactly one digit, the most 300 significant, satisfies digit_j < t(j). Therefore, if several 301 integers are concatenated, it is easy to separate them, starting with 302 the first if they are little-endian (least significant digit first), 303 or starting with the last if they are big-endian (most significant 304 digit first). As before, the value is the sum over j of digit_j * 305 w(j), but the weights are different: 306 307 w(0) = 1 308 w(j) = w(j-1) * (base - t(j-1)) for j > 0 309 310 For example, consider the little-endian sequence of base 8 digits 311 734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This 312 implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) = 313 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is not 314 less than 3, but 4 is less than 5, so 4 is the last digit. The value 315 of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is 251, with 316 value 2*1 + 5*6 + 1*30 = 62. Decoding this representation is very 317 similar to decoding a conventional integer: Start with a current 318 value of N = 0 and a weight w = 1. Fetch the next digit d and 319 increase N by d * w. If d is less than the current threshold (t) 320 then stop, otherwise increase w by a factor of (base - t), update t 321 for the next position, and repeat. 322 323 Encoding this representation is similar to encoding a conventional 324 integer: If N < t then output one digit for N and stop, otherwise 325 output the digit for t + ((N - t) mod (base - t)), then replace N 326 with (N - t) div (base - t), update t for the next position, and 327 repeat. 328 329 For any particular set of values of t(j), there is exactly one 330 generalized variable-length representation of each nonnegative 331 integral value. 332 333 Bootstring uses little-endian ordering so that the deltas can be 334 separated starting with the first. The t(j) values are defined in 335 terms of the constants base, tmin, and tmax, and a state variable 336 called bias: 337 338 t(j) = base * (j + 1) - bias, 339 clamped to the range tmin through tmax 340 341 342 343 Costello Standards Track [Page 6] 344 346 RFC 3492 IDNA Punycode March 2003 347 348 349 The clamping means that if the formula yields a value less than tmin 350 or greater than tmax, then t(j) = tmin or tmax, respectively. (In 351 the pseudocode in section 6 "Bootstring algorithms", the expression 352 base * (j + 1) is denoted by k for performance reasons.) These t(j) 353 values cause the representation to favor integers within a particular 354 range determined by the bias. 355 356 3.4 Bias adaptation 357 358 After each delta is encoded or decoded, bias is set for the next 359 delta as follows: 360 361 1. Delta is scaled in order to avoid overflow in the next step: 362 363 let delta = delta div 2 364 365 But when this is the very first delta, the divisor is not 2, but 366 instead a constant called damp. This compensates for the fact 367 that the second delta is usually much smaller than the first. 368 369 2. Delta is increased to compensate for the fact that the next delta 370 will be inserting into a longer string: 371 372 let delta = delta + (delta div numpoints) 373 374 numpoints is the total number of code points encoded/decoded so 375 far (including the one corresponding to this delta itself, and 376 including the basic code points). 377 378 3. Delta is repeatedly divided until it falls within a threshold, to 379 predict the minimum number of digits needed to represent the next 380 delta: 381 382 while delta > ((base - tmin) * tmax) div 2 383 do let delta = delta div (base - tmin) 384 385 4. The bias is set: 386 387 let bias = 388 (base * the number of divisions performed in step 3) + 389 (((base - tmin + 1) * delta) div (delta + skew)) 390 391 The motivation for this procedure is that the current delta 392 provides a hint about the likely size of the next delta, and so 393 t(j) is set to tmax for the more significant digits starting with 394 the one expected to be last, tmin for the less significant digits 395 up through the one expected to be third-last, and somewhere 396 between tmin and tmax for the digit expected to be second-last 397 398 399 400 Costello Standards Track [Page 7] 401 403 RFC 3492 IDNA Punycode March 2003 404 405 406 (balancing the hope of the expected-last digit being unnecessary 407 against the danger of it being insufficient). 408 409 4. Bootstring parameters 410 411 Given a set of basic code points, one needs to be designated as the 412 delimiter. The base cannot be greater than the number of 413 distinguishable basic code points remaining. The digit-values in the 414 range 0 through base-1 need to be associated with distinct non- 415 delimiter basic code points. In some cases multiple code points need 416 to have the same digit-value; for example, uppercase and lowercase 417 versions of the same letter need to be equivalent if basic strings 418 are case-insensitive. 419 420 The initial value of n cannot be greater than the minimum non-basic 421 code point that could appear in extended strings. 422 423 The remaining five parameters (tmin, tmax, skew, damp, and the 424 initial value of bias) need to satisfy the following constraints: 425 426 0 <= tmin <= tmax <= base-1 427 skew >= 1 428 damp >= 2 429 initial_bias mod base <= base - tmin 430 431 Provided the constraints are satisfied, these five parameters affect 432 efficiency but not correctness. They are best chosen empirically. 433 434 If support for mixed-case annotation is desired (see appendix A), 435 make sure that the code points corresponding to 0 through tmax-1 all 436 have both uppercase and lowercase forms. 437 438 5. Parameter values for Punycode 439 440 Punycode uses the following Bootstring parameter values: 441 442 base = 36 443 tmin = 1 444 tmax = 26 445 skew = 38 446 damp = 700 447 initial_bias = 72 448 initial_n = 128 = 0x80 449 450 Although the only restriction Punycode imposes on the input integers 451 is that they be nonnegative, these parameters are especially designed 452 to work well with Unicode [UNICODE] code points, which are integers 453 in the range 0..10FFFF (but not D800..DFFF, which are reserved for 454 455 456 457 Costello Standards Track [Page 8] 458 460 RFC 3492 IDNA Punycode March 2003 461 462 463 use by the UTF-16 encoding of Unicode). The basic code points are 464 the ASCII [ASCII] code points (0..7F), of which U+002D (-) is the 465 delimiter, and some of the others have digit-values as follows: 466 467 code points digit-values 468 ------------ ---------------------- 469 41..5A (A-Z) = 0 to 25, respectively 470 61..7A (a-z) = 0 to 25, respectively 471 30..39 (0-9) = 26 to 35, respectively 472 473 Using hyphen-minus as the delimiter implies that the encoded string 474 can end with a hyphen-minus only if the Unicode string consists 475 entirely of basic code points, but IDNA forbids such strings from 476 being encoded. The encoded string can begin with a hyphen-minus, but 477 IDNA prepends a prefix. Therefore IDNA using Punycode conforms to 478 the RFC 952 rule that host name labels neither begin nor end with a 479 hyphen-minus [RFC952]. 480 481 A decoder MUST recognize the letters in both uppercase and lowercase 482 forms (including mixtures of both forms). An encoder SHOULD output 483 only uppercase forms or only lowercase forms, unless it uses mixed- 484 case annotation (see appendix A). 485 486 Presumably most users will not manually write or type encoded strings 487 (as opposed to cutting and pasting them), but those who do will need 488 to be alert to the potential visual ambiguity between the following 489 sets of characters: 490 491 G 6 492 I l 1 493 O 0 494 S 5 495 U V 496 Z 2 497 498 Such ambiguities are usually resolved by context, but in a Punycode 499 encoded string there is no context apparent to humans. 500 501 6. Bootstring algorithms 502 503 Some parts of the pseudocode can be omitted if the parameters satisfy 504 certain conditions (for which Punycode qualifies). These parts are 505 enclosed in {braces}, and notes immediately following the pseudocode 506 explain the conditions under which they can be omitted. 507 508 509 510 511 512 513 514 Costello Standards Track [Page 9] 515 517 RFC 3492 IDNA Punycode March 2003 518 519 520 Formally, code points are integers, and hence the pseudocode assumes 521 that arithmetic operations can be performed directly on code points. 522 In some programming languages, explicit conversion between code 523 points and integers might be necessary. 524 525 6.1 Bias adaptation function 526 527 function adapt(delta,numpoints,firsttime): 528 if firsttime then let delta = delta div damp 529 else let delta = delta div 2 530 let delta = delta + (delta div numpoints) 531 let k = 0 532 while delta > ((base - tmin) * tmax) div 2 do begin 533 let delta = delta div (base - tmin) 534 let k = k + base 535 end 536 return k + (((base - tmin + 1) * delta) div (delta + skew)) 537 538 It does not matter whether the modifications to delta and k inside 539 adapt() affect variables of the same name inside the 540 encoding/decoding procedures, because after calling adapt() the 541 caller does not read those variables before overwriting them. 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 Costello Standards Track [Page 10] 572 574 RFC 3492 IDNA Punycode March 2003 575 576 577 6.2 Decoding procedure 578 579 let n = initial_n 580 let i = 0 581 let bias = initial_bias 582 let output = an empty string indexed from 0 583 consume all code points before the last delimiter (if there is one) 584 and copy them to output, fail on any non-basic code point 585 if more than zero code points were consumed then consume one more 586 (which will be the last delimiter) 587 while the input is not exhausted do begin 588 let oldi = i 589 let w = 1 590 for k = base to infinity in steps of base do begin 591 consume a code point, or fail if there was none to consume 592 let digit = the code point's digit-value, fail if it has none 593 let i = i + digit * w, fail on overflow 594 let t = tmin if k <= bias {+ tmin}, or 595 tmax if k >= bias + tmax, or k - bias otherwise 596 if digit < t then break 597 let w = w * (base - t), fail on overflow 598 end 599 let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?) 600 let n = n + i div (length(output) + 1), fail on overflow 601 let i = i mod (length(output) + 1) 602 {if n is a basic code point then fail} 603 insert n into output at position i 604 increment i 605 end 606 607 The full statement enclosed in braces (checking whether n is a basic 608 code point) can be omitted if initial_n exceeds all basic code points 609 (which is true for Punycode), because n is never less than initial_n. 610 611 In the assignment of t, where t is clamped to the range tmin through 612 tmax, "+ tmin" can always be omitted. This makes the clamping 613 calculation incorrect when bias < k < bias + tmin, but that cannot 614 happen because of the way bias is computed and because of the 615 constraints on the parameters. 616 617 Because the decoder state can only advance monotonically, and there 618 is only one representation of any delta, there is therefore only one 619 encoded string that can represent a given sequence of integers. The 620 only error conditions are invalid code points, unexpected end-of- 621 input, overflow, and basic code points encoded using deltas instead 622 of appearing literally. If the decoder fails on these errors as 623 shown above, then it cannot produce the same output for two distinct 624 inputs. Without this property it would have been necessary to re- 625 626 627 628 Costello Standards Track [Page 11] 629 631 RFC 3492 IDNA Punycode March 2003 632 633 634 encode the output and verify that it matches the input in order to 635 guarantee the uniqueness of the encoding. 636 637 6.3 Encoding procedure 638 639 let n = initial_n 640 let delta = 0 641 let bias = initial_bias 642 let h = b = the number of basic code points in the input 643 copy them to the output in order, followed by a delimiter if b > 0 644 {if the input contains a non-basic code point < n then fail} 645 while h < length(input) do begin 646 let m = the minimum {non-basic} code point >= n in the input 647 let delta = delta + (m - n) * (h + 1), fail on overflow 648 let n = m 649 for each code point c in the input (in order) do begin 650 if c < n {or c is basic} then increment delta, fail on overflow 651 if c == n then begin 652 let q = delta 653 for k = base to infinity in steps of base do begin 654 let t = tmin if k <= bias {+ tmin}, or 655 tmax if k >= bias + tmax, or k - bias otherwise 656 if q < t then break 657 output the code point for digit t + ((q - t) mod (base - t)) 658 let q = (q - t) div (base - t) 659 end 660 output the code point for digit q 661 let bias = adapt(delta, h + 1, test h equals b?) 662 let delta = 0 663 increment h 664 end 665 end 666 increment delta and n 667 end 668 669 The full statement enclosed in braces (checking whether the input 670 contains a non-basic code point less than n) can be omitted if all 671 code points less than initial_n are basic code points (which is true 672 for Punycode if code points are unsigned). 673 674 The brace-enclosed conditions "non-basic" and "or c is basic" can be 675 omitted if initial_n exceeds all basic code points (which is true for 676 Punycode), because the code point being tested is never less than 677 initial_n. 678 679 In the assignment of t, where t is clamped to the range tmin through 680 tmax, "+ tmin" can always be omitted. This makes the clamping 681 calculation incorrect when bias < k < bias + tmin, but that cannot 682 683 684 685 Costello Standards Track [Page 12] 686 688 RFC 3492 IDNA Punycode March 2003 689 690 691 happen because of the way bias is computed and because of the 692 constraints on the parameters. 693 694 The checks for overflow are necessary to avoid producing invalid 695 output when the input contains very large values or is very long. 696 697 The increment of delta at the bottom of the outer loop cannot 698 overflow because delta < length(input) before the increment, and 699 length(input) is already assumed to be representable. The increment 700 of n could overflow, but only if h == length(input), in which case 701 the procedure is finished anyway. 702 703 6.4 Overflow handling 704 705 For IDNA, 26-bit unsigned integers are sufficient to handle all valid 706 IDNA labels without overflow, because any string that needed a 27-bit 707 delta would have to exceed either the code point limit (0..10FFFF) or 708 the label length limit (63 characters). However, overflow handling 709 is necessary because the inputs are not necessarily valid IDNA 710 labels. 711 712 If the programming language does not provide overflow detection, the 713 following technique can be used. Suppose A, B, and C are 714 representable nonnegative integers and C is nonzero. Then A + B 715 overflows if and only if B > maxint - A, and A + (B * C) overflows if 716 and only if B > (maxint - A) div C, where maxint is the greatest 717 integer for which maxint + 1 cannot be represented. Refer to 718 appendix C "Punycode sample implementation" for demonstrations of 719 this technique in the C language. 720 721 The decoding and encoding algorithms shown in sections 6.2 and 6.3 722 handle overflow by detecting it whenever it happens. Another 723 approach is to enforce limits on the inputs that prevent overflow 724 from happening. For example, if the encoder were to verify that no 725 input code points exceed M and that the input length does not exceed 726 L, then no delta could ever exceed (M - initial_n) * (L + 1), and 727 hence no overflow could occur if integer variables were capable of 728 representing values that large. This prevention approach would 729 impose more restrictions on the input than the detection approach 730 does, but might be considered simpler in some programming languages. 731 732 In theory, the decoder could use an analogous approach, limiting the 733 number of digits in a variable-length integer (that is, limiting the 734 number of iterations in the innermost loop). However, the number of 735 digits that suffice to represent a given delta can sometimes 736 represent much larger deltas (because of the adaptation), and hence 737 this approach would probably need integers wider than 32 bits. 738 739 740 741 742 Costello Standards Track [Page 13] 743 745 RFC 3492 IDNA Punycode March 2003 746 747 748 Yet another approach for the decoder is to allow overflow to occur, 749 but to check the final output string by re-encoding it and comparing 750 to the decoder input. If and only if they do not match (using a 751 case-insensitive ASCII comparison) overflow has occurred. This 752 delayed-detection approach would not impose any more restrictions on 753 the input than the immediate-detection approach does, and might be 754 considered simpler in some programming languages. 755 756 In fact, if the decoder is used only inside the IDNA ToUnicode 757 operation [IDNA], then it need not check for overflow at all, because 758 ToUnicode performs a higher level re-encoding and comparison, and a 759 mismatch has the same consequence as if the Punycode decoder had 760 failed. 761 762 7. Punycode examples 763 764 7.1 Sample strings 765 766 In the Punycode encodings below, the ACE prefix is not shown. 767 Backslashes show where line breaks have been inserted in strings too 768 long for one line. 769 770 The first several examples are all translations of the sentence "Why 771 can't they just speak in <language>?" (courtesy of Michael Kaplan's 772 "provincial" page [PROVINCIAL]). Word breaks and punctuation have 773 been removed, as is often done in domain names. 774 775 (A) Arabic (Egyptian): 776 u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644 777 u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F 778 Punycode: egbpdaj6bu4bxfgehfvwxn 779 780 (B) Chinese (simplified): 781 u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587 782 Punycode: ihqwcrb4cv8a8dqg056pqjye 783 784 (C) Chinese (traditional): 785 u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587 786 Punycode: ihqwctvzc91f659drss3x8bo0yb 787 788 (D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky 789 U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074 790 u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D 791 u+0065 u+0073 u+006B u+0079 792 Punycode: Proprostnemluvesky-uyb24dma41a 793 794 795 796 797 798 799 Costello Standards Track [Page 14] 800 802 RFC 3492 IDNA Punycode March 2003 803 804 805 (E) Hebrew: 806 u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8 807 u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2 808 u+05D1 u+05E8 u+05D9 u+05EA 809 Punycode: 4dbcagdahymbxekheh6e0a7fei0b 810 811 (F) Hindi (Devanagari): 812 u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D 813 u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939 814 u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947 815 u+0939 u+0948 u+0902 816 Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd 817 818 (G) Japanese (kanji and hiragana): 819 u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092 820 u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B 821 Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa 822 823 (H) Korean (Hangul syllables): 824 u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774 825 u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74 826 u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C 827 Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\ 828 psd879ccm6fea98c 829 830 (I) Russian (Cyrillic): 831 U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E 832 u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440 833 u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A 834 u+0438 835 Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l 836 837 (J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol 838 U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070 839 u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070 840 u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061 841 u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070 842 u+0061 u+00F1 u+006F u+006C 843 Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a 844 845 (K) Vietnamese: 846 T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\ 847 <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t 848 U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B 849 u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068 850 u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067 851 U+0056 u+0069 u+1EC7 u+0074 852 Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g 853 854 855 856 Costello Standards Track [Page 15] 857 859 RFC 3492 IDNA Punycode March 2003 860 861 862 The next several examples are all names of Japanese music artists, 863 song titles, and TV programs, just because the author happens to have 864 them handy (but Japanese is useful for providing examples of single- 865 row text, two-row text, ideographic text, and various mixtures 866 thereof). 867 868 (L) 3<nen>B<gumi><kinpachi><sensei> 869 u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F 870 Punycode: 3B-ww4c5e180e575a65lsy2b 871 872 (M) <amuro><namie>-with-SUPER-MONKEYS 873 u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074 874 u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D 875 U+004F U+004E U+004B U+0045 U+0059 U+0053 876 Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n 877 878 (N) Hello-Another-Way-<sorezore><no><basho> 879 U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F 880 u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D 881 u+305D u+308C u+305E u+308C u+306E u+5834 u+6240 882 Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b 883 884 (O) <hitotsu><yane><no><shita>2 885 u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032 886 Punycode: 2-u9tlzr9756bt3uc0v 887 888 (P) Maji<de>Koi<suru>5<byou><mae> 889 U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059 890 u+308B u+0035 u+79D2 u+524D 891 Punycode: MajiKoi5-783gue6qz075azm5e 892 893 (Q) <pafii>de<runba> 894 u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0 895 Punycode: de-jg4avhby1noc0d 896 897 (R) <sono><supiido><de> 898 u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067 899 Punycode: d9juau41awczczp 900 901 The last example is an ASCII string that breaks the existing rules 902 for host name labels. (It is not a realistic example for IDNA, 903 because IDNA never encodes pure ASCII labels.) 904 905 (S) -> $1.00 <- 906 u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 907 u+003C u+002D 908 Punycode: -> $1.00 <-- 909 910 911 912 913 Costello Standards Track [Page 16] 914 916 RFC 3492 IDNA Punycode March 2003 917 918 919 7.2 Decoding traces 920 921 In the following traces, the evolving state of the decoder is shown 922 as a sequence of hexadecimal values, representing the code points in 923 the extended string. An asterisk appears just after the most 924 recently inserted code point, indicating both n (the value preceeding 925 the asterisk) and i (the position of the value just after the 926 asterisk). Other numerical values are decimal. 927 928 Decoding trace of example B from section 7.1: 929 930 n is 128, i is 0, bias is 72 931 input is "ihqwcrb4cv8a8dqg056pqjye" 932 there is no delimiter, so extended string starts empty 933 delta "ihq" decodes to 19853 934 bias becomes 21 935 4E0D * 936 delta "wc" decodes to 64 937 bias becomes 20 938 4E0D 4E2D * 939 delta "rb" decodes to 37 940 bias becomes 13 941 4E3A * 4E0D 4E2D 942 delta "4c" decodes to 56 943 bias becomes 17 944 4E3A 4E48 * 4E0D 4E2D 945 delta "v8a" decodes to 599 946 bias becomes 32 947 4E3A 4EC0 * 4E48 4E0D 4E2D 948 delta "8d" decodes to 130 949 bias becomes 23 950 4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D 951 delta "qg" decodes to 154 952 bias becomes 25 953 4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D 954 delta "056p" decodes to 46301 955 bias becomes 84 956 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 * 957 delta "qjye" decodes to 88531 958 bias becomes 90 959 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587 960 961 962 963 964 965 966 967 968 969 970 Costello Standards Track [Page 17] 971 973 RFC 3492 IDNA Punycode March 2003 974 975 976 Decoding trace of example L from section 7.1: 977 978 n is 128, i is 0, bias is 72 979 input is "3B-ww4c5e180e575a65lsy2b" 980 literal portion is "3B-", so extended string starts as: 981 0033 0042 982 delta "ww4c" decodes to 62042 983 bias becomes 27 984 0033 0042 5148 * 985 delta "5e" decodes to 139 986 bias becomes 24 987 0033 0042 516B * 5148 988 delta "180e" decodes to 16683 989 bias becomes 67 990 0033 5E74 * 0042 516B 5148 991 delta "575a" decodes to 34821 992 bias becomes 82 993 0033 5E74 0042 516B 5148 751F * 994 delta "65l" decodes to 14592 995 bias becomes 67 996 0033 5E74 0042 7D44 * 516B 5148 751F 997 delta "sy2b" decodes to 42088 998 bias becomes 84 999 0033 5E74 0042 7D44 91D1 * 516B 5148 751F 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 Costello Standards Track [Page 18] 1028 1030 RFC 3492 IDNA Punycode March 2003 1031 1032 1033 7.3 Encoding traces 1034 1035 In the following traces, code point values are hexadecimal, while 1036 other numerical values are decimal. 1037 1038 Encoding trace of example B from section 7.1: 1039 1040 bias is 72 1041 input is: 1042 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587 1043 there are no basic code points, so no literal portion 1044 next code point to insert is 4E0D 1045 needed delta is 19853, encodes as "ihq" 1046 bias becomes 21 1047 next code point to insert is 4E2D 1048 needed delta is 64, encodes as "wc" 1049 bias becomes 20 1050 next code point to insert is 4E3A 1051 needed delta is 37, encodes as "rb" 1052 bias becomes 13 1053 next code point to insert is 4E48 1054 needed delta is 56, encodes as "4c" 1055 bias becomes 17 1056 next code point to insert is 4EC0 1057 needed delta is 599, encodes as "v8a" 1058 bias becomes 32 1059 next code point to insert is 4ED6 1060 needed delta is 130, encodes as "8d" 1061 bias becomes 23 1062 next code point to insert is 4EEC 1063 needed delta is 154, encodes as "qg" 1064 bias becomes 25 1065 next code point to insert is 6587 1066 needed delta is 46301, encodes as "056p" 1067 bias becomes 84 1068 next code point to insert is 8BF4 1069 needed delta is 88531, encodes as "qjye" 1070 bias becomes 90 1071 output is "ihqwcrb4cv8a8dqg056pqjye" 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 Costello Standards Track [Page 19] 1085 1087 RFC 3492 IDNA Punycode March 2003 1088 1089 1090 Encoding trace of example L from section 7.1: 1091 1092 bias is 72 1093 input is: 1094 0033 5E74 0042 7D44 91D1 516B 5148 751F 1095 basic code points (0033, 0042) are copied to literal portion: "3B-" 1096 next code point to insert is 5148 1097 needed delta is 62042, encodes as "ww4c" 1098 bias becomes 27 1099 next code point to insert is 516B 1100 needed delta is 139, encodes as "5e" 1101 bias becomes 24 1102 next code point to insert is 5E74 1103 needed delta is 16683, encodes as "180e" 1104 bias becomes 67 1105 next code point to insert is 751F 1106 needed delta is 34821, encodes as "575a" 1107 bias becomes 82 1108 next code point to insert is 7D44 1109 needed delta is 14592, encodes as "65l" 1110 bias becomes 67 1111 next code point to insert is 91D1 1112 needed delta is 42088, encodes as "sy2b" 1113 bias becomes 84 1114 output is "3B-ww4c5e180e575a65lsy2b" 1115 1116 8. Security Considerations 1117 1118 Users expect each domain name in DNS to be controlled by a single 1119 authority. If a Unicode string intended for use as a domain label 1120 could map to multiple ACE labels, then an internationalized domain 1121 name could map to multiple ASCII domain names, each controlled by a 1122 different authority, some of which could be spoofs that hijack 1123 service requests intended for another. Therefore Punycode is 1124 designed so that each Unicode string has a unique encoding. 1125 1126 However, there can still be multiple Unicode representations of the 1127 "same" text, for various definitions of "same". This problem is 1128 addressed to some extent by the Unicode standard under the topic of 1129 canonicalization, and this work is leveraged for domain names by 1130 Nameprep [NAMEPREP]. 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 Costello Standards Track [Page 20] 1142 1144 RFC 3492 IDNA Punycode March 2003 1145 1146 1147 9. References 1148 1149 9.1 Normative References 1150 1151 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1152 Requirement Levels", BCP 14, RFC 2119, March 1997. 1153 1154 9.2 Informative References 1155 1156 [RFC952] Harrenstien, K., Stahl, M. and E. Feinler, "DOD Internet 1157 Host Table Specification", RFC 952, October 1985. 1158 1159 [RFC1034] Mockapetris, P., "Domain Names - Concepts and 1160 Facilities", STD 13, RFC 1034, November 1987. 1161 1162 [IDNA] Faltstrom, P., Hoffman, P. and A. Costello, 1163 "Internationalizing Domain Names in Applications 1164 (IDNA)", RFC 3490, March 2003. 1165 1166 [NAMEPREP] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 1167 Profile for Internationalized Domain Names (IDN)", RFC 1168 3491, March 2003. 1169 1170 [ASCII] Cerf, V., "ASCII format for Network Interchange", RFC 1171 20, October 1969. 1172 1173 [PROVINCIAL] Kaplan, M., "The 'anyone can be provincial!' page", 1174 http://www.trigeminal.com/samples/provincial.html. 1175 1176 [UNICODE] The Unicode Consortium, "The Unicode Standard", 1177 http://www.unicode.org/unicode/standard/standard.html. 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 Costello Standards Track [Page 21] 1199 1201 RFC 3492 IDNA Punycode March 2003 1202 1203 1204 A. Mixed-case annotation 1205 1206 In order to use Punycode to represent case-insensitive strings, 1207 higher layers need to case-fold the strings prior to Punycode 1208 encoding. The encoded string can use mixed case as an annotation 1209 telling how to convert the folded string into a mixed-case string for 1210 display purposes. Note, however, that mixed-case annotation is not 1211 used by the ToASCII and ToUnicode operations specified in [IDNA], and 1212 therefore implementors of IDNA can disregard this appendix. 1213 1214 Basic code points can use mixed case directly, because the decoder 1215 copies them verbatim, leaving lowercase code points lowercase, and 1216 leaving uppercase code points uppercase. Each non-basic code point 1217 is represented by a delta, which is represented by a sequence of 1218 basic code points, the last of which provides the annotation. If it 1219 is uppercase, it is a suggestion to map the non-basic code point to 1220 uppercase (if possible); if it is lowercase, it is a suggestion to 1221 map the non-basic code point to lowercase (if possible). 1222 1223 These annotations do not alter the code points returned by decoders; 1224 the annotations are returned separately, for the caller to use or 1225 ignore. Encoders can accept annotations in addition to code points, 1226 but the annotations do not alter the output, except to influence the 1227 uppercase/lowercase form of ASCII letters. 1228 1229 Punycode encoders and decoders need not support these annotations, 1230 and higher layers need not use them. 1231 1232 B. Disclaimer and license 1233 1234 Regarding this entire document or any portion of it (including the 1235 pseudocode and C code), the author makes no guarantees and is not 1236 responsible for any damage resulting from its use. The author grants 1237 irrevocable permission to anyone to use, modify, and distribute it in 1238 any way that does not diminish the rights of anyone else to use, 1239 modify, and distribute it, provided that redistributed derivative 1240 works do not contain misleading author or version information. 1241 Derivative works need not be licensed under similar terms. 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 Costello Standards Track [Page 22] 1256 1258 RFC 3492 IDNA Punycode March 2003 1259 1260 1261 C. Punycode sample implementation 1262 1263 /* 1264 punycode.c from RFC 3492 1265 http://www.nicemice.net/idn/ 1266 Adam M. Costello 1267 http://www.nicemice.net/amc/ 1268 1269 This is ANSI C code (C89) implementing Punycode (RFC 3492). 1270 1271 */ 1272 1273 1274 /************************************************************/ 1275 /* Public interface (would normally go in its own .h file): */ 1276 1277 #include <limits.h> 1278 1279 enum punycode_status { 1280 punycode_success, 1281 punycode_bad_input, /* Input is invalid. */ 1282 punycode_big_output, /* Output would exceed the space provided. */ 1283 punycode_overflow /* Input needs wider integers to process. */ 1284 }; 1285 1286 #if UINT_MAX >= (1 << 26) - 1 1287 typedef unsigned int punycode_uint; 1288 #else 1289 typedef unsigned long punycode_uint; 1290 #endif 1291 1292 enum punycode_status punycode_encode( 1293 punycode_uint input_length, 1294 const punycode_uint input[], 1295 const unsigned char case_flags[], 1296 punycode_uint *output_length, 1297 char output[] ); 1298 1299 /* punycode_encode() converts Unicode to Punycode. The input */ 1300 /* is represented as an array of Unicode code points (not code */ 1301 /* units; surrogate pairs are not allowed), and the output */ 1302 /* will be represented as an array of ASCII code points. The */ 1303 /* output string is *not* null-terminated; it will contain */ 1304 /* zeros if and only if the input contains zeros. (Of course */ 1305 /* the caller can leave room for a terminator and add one if */ 1306 /* needed.) The input_length is the number of code points in */ 1307 /* the input. The output_length is an in/out argument: the */ 1308 /* caller passes in the maximum number of code points that it */ 1309 1310 1311 1312 Costello Standards Track [Page 23] 1313 1315 RFC 3492 IDNA Punycode March 2003 1316 1317 1318 /* can receive, and on successful return it will contain the */ 1319 /* number of code points actually output. The case_flags array */ 1320 /* holds input_length boolean values, where nonzero suggests that */ 1321 /* the corresponding Unicode character be forced to uppercase */ 1322 /* after being decoded (if possible), and zero suggests that */ 1323 /* it be forced to lowercase (if possible). ASCII code points */ 1324 /* are encoded literally, except that ASCII letters are forced */ 1325 /* to uppercase or lowercase according to the corresponding */ 1326 /* uppercase flags. If case_flags is a null pointer then ASCII */ 1327 /* letters are left as they are, and other code points are */ 1328 /* treated as if their uppercase flags were zero. The return */ 1329 /* value can be any of the punycode_status values defined above */ 1330 /* except punycode_bad_input; if not punycode_success, then */ 1331 /* output_size and output might contain garbage. */ 1332 1333 enum punycode_status punycode_decode( 1334 punycode_uint input_length, 1335 const char input[], 1336 punycode_uint *output_length, 1337 punycode_uint output[], 1338 unsigned char case_flags[] ); 1339 1340 /* punycode_decode() converts Punycode to Unicode. The input is */ 1341 /* represented as an array of ASCII code points, and the output */ 1342 /* will be represented as an array of Unicode code points. The */ 1343 /* input_length is the number of code points in the input. The */ 1344 /* output_length is an in/out argument: the caller passes in */ 1345 /* the maximum number of code points that it can receive, and */ 1346 /* on successful return it will contain the actual number of */ 1347 /* code points output. The case_flags array needs room for at */ 1348 /* least output_length values, or it can be a null pointer if the */ 1349 /* case information is not needed. A nonzero flag suggests that */ 1350 /* the corresponding Unicode character be forced to uppercase */ 1351 /* by the caller (if possible), while zero suggests that it be */ 1352 /* forced to lowercase (if possible). ASCII code points are */ 1353 /* output already in the proper case, but their flags will be set */ 1354 /* appropriately so that applying the flags would be harmless. */ 1355 /* The return value can be any of the punycode_status values */ 1356 /* defined above; if not punycode_success, then output_length, */ 1357 /* output, and case_flags might contain garbage. On success, the */ 1358 /* decoder will never need to write an output_length greater than */ 1359 /* input_length, because of how the encoding is defined. */ 1360 1361 /**********************************************************/ 1362 /* Implementation (would normally go in its own .c file): */ 1363 1364 #include <string.h> 1365 1366 1367 1368 1369 Costello Standards Track [Page 24] 1370 1372 RFC 3492 IDNA Punycode March 2003 1373 1374 1375 /*** Bootstring parameters for Punycode ***/ 1376 1377 enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700, 1378 initial_bias = 72, initial_n = 0x80, delimiter = 0x2D }; 1379 1380 /* basic(cp) tests whether cp is a basic code point: */ 1381 #define basic(cp) ((punycode_uint)(cp) < 0x80) 1382 1383 /* delim(cp) tests whether cp is a delimiter: */ 1384 #define delim(cp) ((cp) == delimiter) 1385 1386 /* decode_digit(cp) returns the numeric value of a basic code */ 1387 /* point (for use in representing integers) in the range 0 to */ 1388 /* base-1, or base if cp is does not represent a value. */ 1389 1390 static punycode_uint decode_digit(punycode_uint cp) 1391 { 1392 return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 : 1393 cp - 97 < 26 ? cp - 97 : base; 1394 } 1395 1396 /* encode_digit(d,flag) returns the basic code point whose value */ 1397 /* (when used for representing integers) is d, which needs to be in */ 1398 /* the range 0 to base-1. The lowercase form is used unless flag is */ 1399 /* nonzero, in which case the uppercase form is used. The behavior */ 1400 /* is undefined if flag is nonzero and digit d has no uppercase form. */ 1401 1402 static char encode_digit(punycode_uint d, int flag) 1403 { 1404 return d + 22 + 75 * (d < 26) - ((flag != 0) << 5); 1405 /* 0..25 map to ASCII a..z or A..Z */ 1406 /* 26..35 map to ASCII 0..9 */ 1407 } 1408 1409 /* flagged(bcp) tests whether a basic code point is flagged */ 1410 /* (uppercase). The behavior is undefined if bcp is not a */ 1411 /* basic code point. */ 1412 1413 #define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26) 1414 1415 /* encode_basic(bcp,flag) forces a basic code point to lowercase */ 1416 /* if flag is zero, uppercase if flag is nonzero, and returns */ 1417 /* the resulting code point. The code point is unchanged if it */ 1418 /* is caseless. The behavior is undefined if bcp is not a basic */ 1419 /* code point. */ 1420 1421 static char encode_basic(punycode_uint bcp, int flag) 1422 { 1423 1424 1425 1426 Costello Standards Track [Page 25] 1427 1429 RFC 3492 IDNA Punycode March 2003 1430 1431 1432 bcp -= (bcp - 97 < 26) << 5; 1433 return bcp + ((!flag && (bcp - 65 < 26)) << 5); 1434 } 1435 1436 /*** Platform-specific constants ***/ 1437 1438 /* maxint is the maximum value of a punycode_uint variable: */ 1439 static const punycode_uint maxint = -1; 1440 /* Because maxint is unsigned, -1 becomes the maximum value. */ 1441 1442 /*** Bias adaptation function ***/ 1443 1444 static punycode_uint adapt( 1445 punycode_uint delta, punycode_uint numpoints, int firsttime ) 1446 { 1447 punycode_uint k; 1448 1449 delta = firsttime ? delta / damp : delta >> 1; 1450 /* delta >> 1 is a faster way of doing delta / 2 */ 1451 delta += delta / numpoints; 1452 1453 for (k = 0; delta > ((base - tmin) * tmax) / 2; k += base) { 1454 delta /= base - tmin; 1455 } 1456 1457 return k + (base - tmin + 1) * delta / (delta + skew); 1458 } 1459 1460 /*** Main encode function ***/ 1461 1462 enum punycode_status punycode_encode( 1463 punycode_uint input_length, 1464 const punycode_uint input[], 1465 const unsigned char case_flags[], 1466 punycode_uint *output_length, 1467 char output[] ) 1468 { 1469 punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t; 1470 1471 /* Initialize the state: */ 1472 1473 n = initial_n; 1474 delta = out = 0; 1475 max_out = *output_length; 1476 bias = initial_bias; 1477 1478 /* Handle the basic code points: */ 1479 1480 1481 1482 1483 Costello Standards Track [Page 26] 1484 1486 RFC 3492 IDNA Punycode March 2003 1487 1488 1489 for (j = 0; j < input_length; ++j) { 1490 if (basic(input[j])) { 1491 if (max_out - out < 2) return punycode_big_output; 1492 output[out++] = 1493 case_flags ? encode_basic(input[j], case_flags[j]) : input[j]; 1494 } 1495 /* else if (input[j] < n) return punycode_bad_input; */ 1496 /* (not needed for Punycode with unsigned code points) */ 1497 } 1498 1499 h = b = out; 1500 1501 /* h is the number of code points that have been handled, b is the */ 1502 /* number of basic code points, and out is the number of characters */ 1503 /* that have been output. */ 1504 1505 if (b > 0) output[out++] = delimiter; 1506 1507 /* Main encoding loop: */ 1508 1509 while (h < input_length) { 1510 /* All non-basic code points < n have been */ 1511 /* handled already. Find the next larger one: */ 1512 1513 for (m = maxint, j = 0; j < input_length; ++j) { 1514 /* if (basic(input[j])) continue; */ 1515 /* (not needed for Punycode) */ 1516 if (input[j] >= n && input[j] < m) m = input[j]; 1517 } 1518 1519 /* Increase delta enough to advance the decoder's */ 1520 /* <n,i> state to <m,0>, but guard against overflow: */ 1521 1522 if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow; 1523 delta += (m - n) * (h + 1); 1524 n = m; 1525 1526 for (j = 0; j < input_length; ++j) { 1527 /* Punycode does not need to check whether input[j] is basic: */ 1528 if (input[j] < n /* || basic(input[j]) */ ) { 1529 if (++delta == 0) return punycode_overflow; 1530 } 1531 1532 if (input[j] == n) { 1533 /* Represent delta as a generalized variable-length integer: */ 1534 1535 for (q = delta, k = base; ; k += base) { 1536 if (out >= max_out) return punycode_big_output; 1537 1538 1539 1540 Costello Standards Track [Page 27] 1541 1543 RFC 3492 IDNA Punycode March 2003 1544 1545 1546 t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */ 1547 k >= bias + tmax ? tmax : k - bias; 1548 if (q < t) break; 1549 output[out++] = encode_digit(t + (q - t) % (base - t), 0); 1550 q = (q - t) / (base - t); 1551 } 1552 1553 output[out++] = encode_digit(q, case_flags && case_flags[j]); 1554 bias = adapt(delta, h + 1, h == b); 1555 delta = 0; 1556 ++h; 1557 } 1558 } 1559 1560 ++delta, ++n; 1561 } 1562 1563 *output_length = out; 1564 return punycode_success; 1565 } 1566 1567 /*** Main decode function ***/ 1568 1569 enum punycode_status punycode_decode( 1570 punycode_uint input_length, 1571 const char input[], 1572 punycode_uint *output_length, 1573 punycode_uint output[], 1574 unsigned char case_flags[] ) 1575 { 1576 punycode_uint n, out, i, max_out, bias, 1577 b, j, in, oldi, w, k, digit, t; 1578 1579 /* Initialize the state: */ 1580 1581 n = initial_n; 1582 out = i = 0; 1583 max_out = *output_length; 1584 bias = initial_bias; 1585 1586 /* Handle the basic code points: Let b be the number of input code */ 1587 /* points before the last delimiter, or 0 if there is none, then */ 1588 /* copy the first b code points to the output. */ 1589 1590 for (b = j = 0; j < input_length; ++j) if (delim(input[j])) b = j; 1591 if (b > max_out) return punycode_big_output; 1592 1593 for (j = 0; j < b; ++j) { 1594 1595 1596 1597 Costello Standards Track [Page 28] 1598 1600 RFC 3492 IDNA Punycode March 2003 1601 1602 1603 if (case_flags) case_flags[out] = flagged(input[j]); 1604 if (!basic(input[j])) return punycode_bad_input; 1605 output[out++] = input[j]; 1606 } 1607 1608 /* Main decoding loop: Start just after the last delimiter if any */ 1609 /* basic code points were copied; start at the beginning otherwise. */ 1610 1611 for (in = b > 0 ? b + 1 : 0; in < input_length; ++out) { 1612 1613 /* in is the index of the next character to be consumed, and */ 1614 /* out is the number of code points in the output array. */ 1615 1616 /* Decode a generalized variable-length integer into delta, */ 1617 /* which gets added to i. The overflow checking is easier */ 1618 /* if we increase i as we go, then subtract off its starting */ 1619 /* value at the end to obtain delta. */ 1620 1621 for (oldi = i, w = 1, k = base; ; k += base) { 1622 if (in >= input_length) return punycode_bad_input; 1623 digit = decode_digit(input[in++]); 1624 if (digit >= base) return punycode_bad_input; 1625 if (digit > (maxint - i) / w) return punycode_overflow; 1626 i += digit * w; 1627 t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */ 1628 k >= bias + tmax ? tmax : k - bias; 1629 if (digit < t) break; 1630 if (w > maxint / (base - t)) return punycode_overflow; 1631 w *= (base - t); 1632 } 1633 1634 bias = adapt(i - oldi, out + 1, oldi == 0); 1635 1636 /* i was supposed to wrap around from out+1 to 0, */ 1637 /* incrementing n each time, so we'll fix that now: */ 1638 1639 if (i / (out + 1) > maxint - n) return punycode_overflow; 1640 n += i / (out + 1); 1641 i %= (out + 1); 1642 1643 /* Insert n at position i of the output: */ 1644 1645 /* not needed for Punycode: */ 1646 /* if (decode_digit(n) <= base) return punycode_invalid_input; */ 1647 if (out >= max_out) return punycode_big_output; 1648 1649 if (case_flags) { 1650 memmove(case_flags + i + 1, case_flags + i, out - i); 1651 1652 1653 1654 Costello Standards Track [Page 29] 1655 1657 RFC 3492 IDNA Punycode March 2003 1658 1659 1660 /* Case of last character determines uppercase flag: */ 1661 case_flags[i] = flagged(input[in - 1]); 1662 } 1663 1664 memmove(output + i + 1, output + i, (out - i) * sizeof *output); 1665 output[i++] = n; 1666 } 1667 1668 *output_length = out; 1669 return punycode_success; 1670 } 1671 1672 /******************************************************************/ 1673 /* Wrapper for testing (would normally go in a separate .c file): */ 1674 1675 #include <assert.h> 1676 #include <stdio.h> 1677 #include <stdlib.h> 1678 #include <string.h> 1679 1680 /* For testing, we'll just set some compile-time limits rather than */ 1681 /* use malloc(), and set a compile-time option rather than using a */ 1682 /* command-line option. */ 1683 1684 enum { 1685 unicode_max_length = 256, 1686 ace_max_length = 256 1687 }; 1688 1689 static void usage(char **argv) 1690 { 1691 fprintf(stderr, 1692 "\n" 1693 "%s -e reads code points and writes a Punycode string.\n" 1694 "%s -d reads a Punycode string and writes code points.\n" 1695 "\n" 1696 "Input and output are plain text in the native character set.\n" 1697 "Code points are in the form u+hex separated by whitespace.\n" 1698 "Although the specification allows Punycode strings to contain\n" 1699 "any characters from the ASCII repertoire, this test code\n" 1700 "supports only the printable characters, and needs the Punycode\n" 1701 "string to be followed by a newline.\n" 1702 "The case of the u in u+hex is the force-to-uppercase flag.\n" 1703 , argv[0], argv[0]); 1704 exit(EXIT_FAILURE); 1705 } 1706 1707 static void fail(const char *msg) 1708 1709 1710 1711 Costello Standards Track [Page 30] 1712 1714 RFC 3492 IDNA Punycode March 2003 1715 1716 1717 { 1718 fputs(msg,stderr); 1719 exit(EXIT_FAILURE); 1720 } 1721 1722 static const char too_big[] = 1723 "input or output is too large, recompile with larger limits\n"; 1724 static const char invalid_input[] = "invalid input\n"; 1725 static const char overflow[] = "arithmetic overflow\n"; 1726 static const char io_error[] = "I/O error\n"; 1727 1728 /* The following string is used to convert printable */ 1729 /* characters between ASCII and the native charset: */ 1730 1731 static const char print_ascii[] = 1732 "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n" 1733 "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n" 1734 " !\"#$%&'()*+,-./" 1735 "0123456789:;<=>?" 1736 "@ABCDEFGHIJKLMNO" 1737 "PQRSTUVWXYZ[\\]^_" 1738 "`abcdefghijklmno" 1739 "pqrstuvwxyz{|}~\n"; 1740 1741 int main(int argc, char **argv) 1742 { 1743 enum punycode_status status; 1744 int r; 1745 unsigned int input_length, output_length, j; 1746 unsigned char case_flags[unicode_max_length]; 1747 1748 if (argc != 2) usage(argv); 1749 if (argv[1][0] != '-') usage(argv); 1750 if (argv[1][2] != 0) usage(argv); 1751 1752 if (argv[1][1] == 'e') { 1753 punycode_uint input[unicode_max_length]; 1754 unsigned long codept; 1755 char output[ace_max_length+1], uplus[3]; 1756 int c; 1757 1758 /* Read the input code points: */ 1759 1760 input_length = 0; 1761 1762 for (;;) { 1763 r = scanf("%2s%lx", uplus, &codept); 1764 if (ferror(stdin)) fail(io_error); 1765 1766 1767 1768 Costello Standards Track [Page 31] 1769 1771 RFC 3492 IDNA Punycode March 2003 1772 1773 1774 if (r == EOF || r == 0) break; 1775 1776 if (r != 2 || uplus[1] != '+' || codept > (punycode_uint)-1) { 1777 fail(invalid_input); 1778 } 1779 1780 if (input_length == unicode_max_length) fail(too_big); 1781 1782 if (uplus[0] == 'u') case_flags[input_length] = 0; 1783 else if (uplus[0] == 'U') case_flags[input_length] = 1; 1784 else fail(invalid_input); 1785 1786 input[input_length++] = codept; 1787 } 1788 1789 /* Encode: */ 1790 1791 output_length = ace_max_length; 1792 status = punycode_encode(input_length, input, case_flags, 1793 &output_length, output); 1794 if (status == punycode_bad_input) fail(invalid_input); 1795 if (status == punycode_big_output) fail(too_big); 1796 if (status == punycode_overflow) fail(overflow); 1797 assert(status == punycode_success); 1798 1799 /* Convert to native charset and output: */ 1800 1801 for (j = 0; j < output_length; ++j) { 1802 c = output[j]; 1803 assert(c >= 0 && c <= 127); 1804 if (print_ascii[c] == 0) fail(invalid_input); 1805 output[j] = print_ascii[c]; 1806 } 1807 1808 output[j] = 0; 1809 r = puts(output); 1810 if (r == EOF) fail(io_error); 1811 return EXIT_SUCCESS; 1812 } 1813 1814 if (argv[1][1] == 'd') { 1815 char input[ace_max_length+2], *p, *pp; 1816 punycode_uint output[unicode_max_length]; 1817 1818 /* Read the Punycode input string and convert to ASCII: */ 1819 1820 fgets(input, ace_max_length+2, stdin); 1821 if (ferror(stdin)) fail(io_error); 1822 1823 1824 1825 Costello Standards Track [Page 32] 1826 1828 RFC 3492 IDNA Punycode March 2003 1829 1830 1831 if (feof(stdin)) fail(invalid_input); 1832 input_length = strlen(input) - 1; 1833 if (input[input_length] != '\n') fail(too_big); 1834 input[input_length] = 0; 1835 1836 for (p = input; *p != 0; ++p) { 1837 pp = strchr(print_ascii, *p); 1838 if (pp == 0) fail(invalid_input); 1839 *p = pp - print_ascii; 1840 } 1841 1842 /* Decode: */ 1843 1844 output_length = unicode_max_length; 1845 status = punycode_decode(input_length, input, &output_length, 1846 output, case_flags); 1847 if (status == punycode_bad_input) fail(invalid_input); 1848 if (status == punycode_big_output) fail(too_big); 1849 if (status == punycode_overflow) fail(overflow); 1850 assert(status == punycode_success); 1851 1852 /* Output the result: */ 1853 1854 for (j = 0; j < output_length; ++j) { 1855 r = printf("%s+%04lX\n", 1856 case_flags[j] ? "U" : "u", 1857 (unsigned long) output[j] ); 1858 if (r < 0) fail(io_error); 1859 } 1860 1861 return EXIT_SUCCESS; 1862 } 1863 1864 usage(argv); 1865 return EXIT_SUCCESS; /* not reached, but quiets compiler warning */ 1866 } 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 Costello Standards Track [Page 33] 1883 1885 RFC 3492 IDNA Punycode March 2003 1886 1887 1888 Author's Address 1889 1890 Adam M. Costello 1891 University of California, Berkeley 1892 http://www.nicemice.net/amc/ 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 Costello Standards Track [Page 34] 1940 1942 RFC 3492 IDNA Punycode March 2003 1943 1944 1945 Full Copyright Statement 1946 1947 Copyright (C) The Internet Society (2003). All Rights Reserved. 1948 1949 This document and translations of it may be copied and furnished to 1950 others, and derivative works that comment on or otherwise explain it 1951 or assist in its implementation may be prepared, copied, published 1952 and distributed, in whole or in part, without restriction of any 1953 kind, provided that the above copyright notice and this paragraph are 1954 included on all such copies and derivative works. However, this 1955 document itself may not be modified in any way, such as by removing 1956 the copyright notice or references to the Internet Society or other 1957 Internet organizations, except as needed for the purpose of 1958 developing Internet standards in which case the procedures for 1959 copyrights defined in the Internet Standards process must be 1960 followed, or as required to translate it into languages other than 1961 English. 1962 1963 The limited permissions granted above are perpetual and will not be 1964 revoked by the Internet Society or its successors or assigns. 1965 1966 This document and the information contained herein is provided on an 1967 "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING 1968 TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING 1969 BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION 1970 HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 1971 MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 1972 1973 Acknowledgement 1974 1975 Funding for the RFC Editor function is currently provided by the 1976 Internet Society. 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 Costello Standards Track [Page 35] 1997 1999