1 1.1 christos 2 1.1 christos 3 1.1 christos 4 1.1 christos 5 1.1 christos 6 1.1 christos 7 1.1 christos Network Working Group P. Deutsch 8 1.1 christos Request for Comments: 1951 Aladdin Enterprises 9 1.1 christos Category: Informational May 1996 10 1.1 christos 11 1.1 christos 12 1.1 christos DEFLATE Compressed Data Format Specification version 1.3 13 1.1 christos 14 1.1 christos Status of This Memo 15 1.1 christos 16 1.1 christos This memo provides information for the Internet community. This memo 17 1.1 christos does not specify an Internet standard of any kind. Distribution of 18 1.1 christos this memo is unlimited. 19 1.1 christos 20 1.1 christos IESG Note: 21 1.1 christos 22 1.1 christos The IESG takes no position on the validity of any Intellectual 23 1.1 christos Property Rights statements contained in this document. 24 1.1 christos 25 1.1 christos Notices 26 1.1 christos 27 1.1 christos Copyright (c) 1996 L. Peter Deutsch 28 1.1 christos 29 1.1 christos Permission is granted to copy and distribute this document for any 30 1.1 christos purpose and without charge, including translations into other 31 1.1 christos languages and incorporation into compilations, provided that the 32 1.1 christos copyright notice and this notice are preserved, and that any 33 1.1 christos substantive changes or deletions from the original are clearly 34 1.1 christos marked. 35 1.1 christos 36 1.1 christos A pointer to the latest version of this and related documentation in 37 1.1 christos HTML format can be found at the URL 38 1.1 christos <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>. 39 1.1 christos 40 1.1 christos Abstract 41 1.1 christos 42 1.1 christos This specification defines a lossless compressed data format that 43 1.1 christos compresses data using a combination of the LZ77 algorithm and Huffman 44 1.1 christos coding, with efficiency comparable to the best currently available 45 1.1 christos general-purpose compression methods. The data can be produced or 46 1.1 christos consumed, even for an arbitrarily long sequentially presented input 47 1.1 christos data stream, using only an a priori bounded amount of intermediate 48 1.1 christos storage. The format can be implemented readily in a manner not 49 1.1 christos covered by patents. 50 1.1 christos 51 1.1 christos 52 1.1 christos 53 1.1 christos 54 1.1 christos 55 1.1 christos 56 1.1 christos 57 1.1 christos 58 1.1 christos Deutsch Informational [Page 1] 59 1.1 christos 61 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 62 1.1 christos 63 1.1 christos 64 1.1 christos Table of Contents 65 1.1 christos 66 1.1 christos 1. Introduction ................................................... 2 67 1.1 christos 1.1. Purpose ................................................... 2 68 1.1 christos 1.2. Intended audience ......................................... 3 69 1.1 christos 1.3. Scope ..................................................... 3 70 1.1 christos 1.4. Compliance ................................................ 3 71 1.1 christos 1.5. Definitions of terms and conventions used ................ 3 72 1.1 christos 1.6. Changes from previous versions ............................ 4 73 1.1 christos 2. Compressed representation overview ............................. 4 74 1.1 christos 3. Detailed specification ......................................... 5 75 1.1 christos 3.1. Overall conventions ....................................... 5 76 1.1 christos 3.1.1. Packing into bytes .................................. 5 77 1.1 christos 3.2. Compressed block format ................................... 6 78 1.1 christos 3.2.1. Synopsis of prefix and Huffman coding ............... 6 79 1.1 christos 3.2.2. Use of Huffman coding in the "deflate" format ....... 7 80 1.1 christos 3.2.3. Details of block format ............................. 9 81 1.1 christos 3.2.4. Non-compressed blocks (BTYPE=00) ................... 11 82 1.1 christos 3.2.5. Compressed blocks (length and distance codes) ...... 11 83 1.1 christos 3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12 84 1.1 christos 3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13 85 1.1 christos 3.3. Compliance ............................................... 14 86 1.1 christos 4. Compression algorithm details ................................. 14 87 1.1 christos 5. References .................................................... 16 88 1.1 christos 6. Security Considerations ....................................... 16 89 1.1 christos 7. Source code ................................................... 16 90 1.1 christos 8. Acknowledgements .............................................. 16 91 1.1 christos 9. Author's Address .............................................. 17 92 1.1 christos 93 1.1 christos 1. Introduction 94 1.1 christos 95 1.1 christos 1.1. Purpose 96 1.1 christos 97 1.1 christos The purpose of this specification is to define a lossless 98 1.1 christos compressed data format that: 99 1.1 christos * Is independent of CPU type, operating system, file system, 100 1.1 christos and character set, and hence can be used for interchange; 101 1.1 christos * Can be produced or consumed, even for an arbitrarily long 102 1.1 christos sequentially presented input data stream, using only an a 103 1.1 christos priori bounded amount of intermediate storage, and hence 104 1.1 christos can be used in data communications or similar structures 105 1.1 christos such as Unix filters; 106 1.1 christos * Compresses data with efficiency comparable to the best 107 1.1 christos currently available general-purpose compression methods, 108 1.1 christos and in particular considerably better than the "compress" 109 1.1 christos program; 110 1.1 christos * Can be implemented readily in a manner not covered by 111 1.1 christos patents, and hence can be practiced freely; 112 1.1 christos 113 1.1 christos 114 1.1 christos 115 1.1 christos Deutsch Informational [Page 2] 116 1.1 christos 118 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 119 1.1 christos 120 1.1 christos 121 1.1 christos * Is compatible with the file format produced by the current 122 1.1 christos widely used gzip utility, in that conforming decompressors 123 1.1 christos will be able to read data produced by the existing gzip 124 1.1 christos compressor. 125 1.1 christos 126 1.1 christos The data format defined by this specification does not attempt to: 127 1.1 christos 128 1.1 christos * Allow random access to compressed data; 129 1.1 christos * Compress specialized data (e.g., raster graphics) as well 130 1.1 christos as the best currently available specialized algorithms. 131 1.1 christos 132 1.1 christos A simple counting argument shows that no lossless compression 133 1.1 christos algorithm can compress every possible input data set. For the 134 1.1 christos format defined here, the worst case expansion is 5 bytes per 32K- 135 1.1 christos byte block, i.e., a size increase of 0.015% for large data sets. 136 1.1 christos English text usually compresses by a factor of 2.5 to 3; 137 1.1 christos executable files usually compress somewhat less; graphical data 138 1.1 christos such as raster images may compress much more. 139 1.1 christos 140 1.1 christos 1.2. Intended audience 141 1.1 christos 142 1.1 christos This specification is intended for use by implementors of software 143 1.1 christos to compress data into "deflate" format and/or decompress data from 144 1.1 christos "deflate" format. 145 1.1 christos 146 1.1 christos The text of the specification assumes a basic background in 147 1.1 christos programming at the level of bits and other primitive data 148 1.1 christos representations. Familiarity with the technique of Huffman coding 149 1.1 christos is helpful but not required. 150 1.1 christos 151 1.1 christos 1.3. Scope 152 1.1 christos 153 1.1 christos The specification specifies a method for representing a sequence 154 1.1 christos of bytes as a (usually shorter) sequence of bits, and a method for 155 1.1 christos packing the latter bit sequence into bytes. 156 1.1 christos 157 1.1 christos 1.4. Compliance 158 1.1 christos 159 1.1 christos Unless otherwise indicated below, a compliant decompressor must be 160 1.1 christos able to accept and decompress any data set that conforms to all 161 1.1 christos the specifications presented here; a compliant compressor must 162 1.1 christos produce data sets that conform to all the specifications presented 163 1.1 christos here. 164 1.1 christos 165 1.1 christos 1.5. Definitions of terms and conventions used 166 1.1 christos 167 1.1 christos Byte: 8 bits stored or transmitted as a unit (same as an octet). 168 1.1 christos For this specification, a byte is exactly 8 bits, even on machines 169 1.1 christos 170 1.1 christos 171 1.1 christos 172 1.1 christos Deutsch Informational [Page 3] 173 1.1 christos 175 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 176 1.1 christos 177 1.1 christos 178 1.1 christos which store a character on a number of bits different from eight. 179 1.1 christos See below, for the numbering of bits within a byte. 180 1.1 christos 181 1.1 christos String: a sequence of arbitrary bytes. 182 1.1 christos 183 1.1 christos 1.6. Changes from previous versions 184 1.1 christos 185 1.1 christos There have been no technical changes to the deflate format since 186 1.1 christos version 1.1 of this specification. In version 1.2, some 187 1.1 christos terminology was changed. Version 1.3 is a conversion of the 188 1.1 christos specification to RFC style. 189 1.1 christos 190 1.1 christos 2. Compressed representation overview 191 1.1 christos 192 1.1 christos A compressed data set consists of a series of blocks, corresponding 193 1.1 christos to successive blocks of input data. The block sizes are arbitrary, 194 1.1 christos except that non-compressible blocks are limited to 65,535 bytes. 195 1.1 christos 196 1.1 christos Each block is compressed using a combination of the LZ77 algorithm 197 1.1 christos and Huffman coding. The Huffman trees for each block are independent 198 1.1 christos of those for previous or subsequent blocks; the LZ77 algorithm may 199 1.1 christos use a reference to a duplicated string occurring in a previous block, 200 1.1 christos up to 32K input bytes before. 201 1.1 christos 202 1.1 christos Each block consists of two parts: a pair of Huffman code trees that 203 1.1 christos describe the representation of the compressed data part, and a 204 1.1 christos compressed data part. (The Huffman trees themselves are compressed 205 1.1 christos using Huffman encoding.) The compressed data consists of a series of 206 1.1 christos elements of two types: literal bytes (of strings that have not been 207 1.1 christos detected as duplicated within the previous 32K input bytes), and 208 1.1 christos pointers to duplicated strings, where a pointer is represented as a 209 1.1 christos pair <length, backward distance>. The representation used in the 210 1.1 christos "deflate" format limits distances to 32K bytes and lengths to 258 211 1.1 christos bytes, but does not limit the size of a block, except for 212 1.1 christos uncompressible blocks, which are limited as noted above. 213 1.1 christos 214 1.1 christos Each type of value (literals, distances, and lengths) in the 215 1.1 christos compressed data is represented using a Huffman code, using one code 216 1.1 christos tree for literals and lengths and a separate code tree for distances. 217 1.1 christos The code trees for each block appear in a compact form just before 218 1.1 christos the compressed data for that block. 219 1.1 christos 220 1.1 christos 221 1.1 christos 222 1.1 christos 223 1.1 christos 224 1.1 christos 225 1.1 christos 226 1.1 christos 227 1.1 christos 228 1.1 christos 229 1.1 christos Deutsch Informational [Page 4] 230 1.1 christos 232 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 233 1.1 christos 234 1.1 christos 235 1.1 christos 3. Detailed specification 236 1.1 christos 237 1.1 christos 3.1. Overall conventions In the diagrams below, a box like this: 238 1.1 christos 239 1.1 christos +---+ 240 1.1 christos | | <-- the vertical bars might be missing 241 1.1 christos +---+ 242 1.1 christos 243 1.1 christos represents one byte; a box like this: 244 1.1 christos 245 1.1 christos +==============+ 246 1.1 christos | | 247 1.1 christos +==============+ 248 1.1 christos 249 1.1 christos represents a variable number of bytes. 250 1.1 christos 251 1.1 christos Bytes stored within a computer do not have a "bit order", since 252 1.1 christos they are always treated as a unit. However, a byte considered as 253 1.1 christos an integer between 0 and 255 does have a most- and least- 254 1.1 christos significant bit, and since we write numbers with the most- 255 1.1 christos significant digit on the left, we also write bytes with the most- 256 1.1 christos significant bit on the left. In the diagrams below, we number the 257 1.1 christos bits of a byte so that bit 0 is the least-significant bit, i.e., 258 1.1 christos the bits are numbered: 259 1.1 christos 260 1.1 christos +--------+ 261 1.1 christos |76543210| 262 1.1 christos +--------+ 263 1.1 christos 264 1.1 christos Within a computer, a number may occupy multiple bytes. All 265 1.1 christos multi-byte numbers in the format described here are stored with 266 1.1 christos the least-significant byte first (at the lower memory address). 267 1.1 christos For example, the decimal number 520 is stored as: 268 1.1 christos 269 1.1 christos 0 1 270 1.1 christos +--------+--------+ 271 1.1 christos |00001000|00000010| 272 1.1 christos +--------+--------+ 273 1.1 christos ^ ^ 274 1.1 christos | | 275 1.1 christos | + more significant byte = 2 x 256 276 1.1 christos + less significant byte = 8 277 1.1 christos 278 1.1 christos 3.1.1. Packing into bytes 279 1.1 christos 280 1.1 christos This document does not address the issue of the order in which 281 1.1 christos bits of a byte are transmitted on a bit-sequential medium, 282 1.1 christos since the final data format described here is byte- rather than 283 1.1 christos 284 1.1 christos 285 1.1 christos 286 1.1 christos Deutsch Informational [Page 5] 287 1.1 christos 289 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 290 1.1 christos 291 1.1 christos 292 1.1 christos bit-oriented. However, we describe the compressed block format 293 1.1 christos in below, as a sequence of data elements of various bit 294 1.1 christos lengths, not a sequence of bytes. We must therefore specify 295 1.1 christos how to pack these data elements into bytes to form the final 296 1.1 christos compressed byte sequence: 297 1.1 christos 298 1.1 christos * Data elements are packed into bytes in order of 299 1.1 christos increasing bit number within the byte, i.e., starting 300 1.1 christos with the least-significant bit of the byte. 301 1.1 christos * Data elements other than Huffman codes are packed 302 1.1 christos starting with the least-significant bit of the data 303 1.1 christos element. 304 1.1 christos * Huffman codes are packed starting with the most- 305 1.1 christos significant bit of the code. 306 1.1 christos 307 1.1 christos In other words, if one were to print out the compressed data as 308 1.1 christos a sequence of bytes, starting with the first byte at the 309 1.1 christos *right* margin and proceeding to the *left*, with the most- 310 1.1 christos significant bit of each byte on the left as usual, one would be 311 1.1 christos able to parse the result from right to left, with fixed-width 312 1.1 christos elements in the correct MSB-to-LSB order and Huffman codes in 313 1.1 christos bit-reversed order (i.e., with the first bit of the code in the 314 1.1 christos relative LSB position). 315 1.1 christos 316 1.1 christos 3.2. Compressed block format 317 1.1 christos 318 1.1 christos 3.2.1. Synopsis of prefix and Huffman coding 319 1.1 christos 320 1.1 christos Prefix coding represents symbols from an a priori known 321 1.1 christos alphabet by bit sequences (codes), one code for each symbol, in 322 1.1 christos a manner such that different symbols may be represented by bit 323 1.1 christos sequences of different lengths, but a parser can always parse 324 1.1 christos an encoded string unambiguously symbol-by-symbol. 325 1.1 christos 326 1.1 christos We define a prefix code in terms of a binary tree in which the 327 1.1 christos two edges descending from each non-leaf node are labeled 0 and 328 1.1 christos 1 and in which the leaf nodes correspond one-for-one with (are 329 1.1 christos labeled with) the symbols of the alphabet; then the code for a 330 1.1 christos symbol is the sequence of 0's and 1's on the edges leading from 331 1.1 christos the root to the leaf labeled with that symbol. For example: 332 1.1 christos 333 1.1 christos 334 1.1 christos 335 1.1 christos 336 1.1 christos 337 1.1 christos 338 1.1 christos 339 1.1 christos 340 1.1 christos 341 1.1 christos 342 1.1 christos 343 1.1 christos Deutsch Informational [Page 6] 344 1.1 christos 346 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 347 1.1 christos 348 1.1 christos 349 1.1 christos /\ Symbol Code 350 1.1 christos 0 1 ------ ---- 351 1.1 christos / \ A 00 352 1.1 christos /\ B B 1 353 1.1 christos 0 1 C 011 354 1.1 christos / \ D 010 355 1.1 christos A /\ 356 1.1 christos 0 1 357 1.1 christos / \ 358 1.1 christos D C 359 1.1 christos 360 1.1 christos A parser can decode the next symbol from an encoded input 361 1.1 christos stream by walking down the tree from the root, at each step 362 1.1 christos choosing the edge corresponding to the next input bit. 363 1.1 christos 364 1.1 christos Given an alphabet with known symbol frequencies, the Huffman 365 1.1 christos algorithm allows the construction of an optimal prefix code 366 1.1 christos (one which represents strings with those symbol frequencies 367 1.1 christos using the fewest bits of any possible prefix codes for that 368 1.1 christos alphabet). Such a code is called a Huffman code. (See 369 1.1 christos reference [1] in Chapter 5, references for additional 370 1.1 christos information on Huffman codes.) 371 1.1 christos 372 1.1 christos Note that in the "deflate" format, the Huffman codes for the 373 1.1 christos various alphabets must not exceed certain maximum code lengths. 374 1.1 christos This constraint complicates the algorithm for computing code 375 1.1 christos lengths from symbol frequencies. Again, see Chapter 5, 376 1.1 christos references for details. 377 1.1 christos 378 1.1 christos 3.2.2. Use of Huffman coding in the "deflate" format 379 1.1 christos 380 1.1 christos The Huffman codes used for each alphabet in the "deflate" 381 1.1 christos format have two additional rules: 382 1.1 christos 383 1.1 christos * All codes of a given bit length have lexicographically 384 1.1 christos consecutive values, in the same order as the symbols 385 1.1 christos they represent; 386 1.1 christos 387 1.1 christos * Shorter codes lexicographically precede longer codes. 388 1.1 christos 389 1.1 christos 390 1.1 christos 391 1.1 christos 392 1.1 christos 393 1.1 christos 394 1.1 christos 395 1.1 christos 396 1.1 christos 397 1.1 christos 398 1.1 christos 399 1.1 christos 400 1.1 christos Deutsch Informational [Page 7] 401 1.1 christos 403 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 404 1.1 christos 405 1.1 christos 406 1.1 christos We could recode the example above to follow this rule as 407 1.1 christos follows, assuming that the order of the alphabet is ABCD: 408 1.1 christos 409 1.1 christos Symbol Code 410 1.1 christos ------ ---- 411 1.1 christos A 10 412 1.1 christos B 0 413 1.1 christos C 110 414 1.1 christos D 111 415 1.1 christos 416 1.1 christos I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are 417 1.1 christos lexicographically consecutive. 418 1.1 christos 419 1.1 christos Given this rule, we can define the Huffman code for an alphabet 420 1.1 christos just by giving the bit lengths of the codes for each symbol of 421 1.1 christos the alphabet in order; this is sufficient to determine the 422 1.1 christos actual codes. In our example, the code is completely defined 423 1.1 christos by the sequence of bit lengths (2, 1, 3, 3). The following 424 1.1 christos algorithm generates the codes as integers, intended to be read 425 1.1 christos from most- to least-significant bit. The code lengths are 426 1.1 christos initially in tree[I].Len; the codes are produced in 427 1.1 christos tree[I].Code. 428 1.1 christos 429 1.1 christos 1) Count the number of codes for each code length. Let 430 1.1 christos bl_count[N] be the number of codes of length N, N >= 1. 431 1.1 christos 432 1.1 christos 2) Find the numerical value of the smallest code for each 433 1.1 christos code length: 434 1.1 christos 435 1.1 christos code = 0; 436 1.1 christos bl_count[0] = 0; 437 1.1 christos for (bits = 1; bits <= MAX_BITS; bits++) { 438 1.1 christos code = (code + bl_count[bits-1]) << 1; 439 1.1 christos next_code[bits] = code; 440 1.1 christos } 441 1.1 christos 442 1.1 christos 3) Assign numerical values to all codes, using consecutive 443 1.1 christos values for all codes of the same length with the base 444 1.1 christos values determined at step 2. Codes that are never used 445 1.1 christos (which have a bit length of zero) must not be assigned a 446 1.1 christos value. 447 1.1 christos 448 1.1 christos for (n = 0; n <= max_code; n++) { 449 1.1 christos len = tree[n].Len; 450 1.1 christos if (len != 0) { 451 1.1 christos tree[n].Code = next_code[len]; 452 1.1 christos next_code[len]++; 453 1.1 christos } 454 1.1 christos 455 1.1 christos 456 1.1 christos 457 1.1 christos Deutsch Informational [Page 8] 458 1.1 christos 460 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 461 1.1 christos 462 1.1 christos 463 1.1 christos } 464 1.1 christos 465 1.1 christos Example: 466 1.1 christos 467 1.1 christos Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3, 468 1.1 christos 3, 2, 4, 4). After step 1, we have: 469 1.1 christos 470 1.1 christos N bl_count[N] 471 1.1 christos - ----------- 472 1.1 christos 2 1 473 1.1 christos 3 5 474 1.1 christos 4 2 475 1.1 christos 476 1.1 christos Step 2 computes the following next_code values: 477 1.1 christos 478 1.1 christos N next_code[N] 479 1.1 christos - ------------ 480 1.1 christos 1 0 481 1.1 christos 2 0 482 1.1 christos 3 2 483 1.1 christos 4 14 484 1.1 christos 485 1.1 christos Step 3 produces the following code values: 486 1.1 christos 487 1.1 christos Symbol Length Code 488 1.1 christos ------ ------ ---- 489 1.1 christos A 3 010 490 1.1 christos B 3 011 491 1.1 christos C 3 100 492 1.1 christos D 3 101 493 1.1 christos E 3 110 494 1.1 christos F 2 00 495 1.1 christos G 4 1110 496 1.1 christos H 4 1111 497 1.1 christos 498 1.1 christos 3.2.3. Details of block format 499 1.1 christos 500 1.1 christos Each block of compressed data begins with 3 header bits 501 1.1 christos containing the following data: 502 1.1 christos 503 1.1 christos first bit BFINAL 504 1.1 christos next 2 bits BTYPE 505 1.1 christos 506 1.1 christos Note that the header bits do not necessarily begin on a byte 507 1.1 christos boundary, since a block does not necessarily occupy an integral 508 1.1 christos number of bytes. 509 1.1 christos 510 1.1 christos 511 1.1 christos 512 1.1 christos 513 1.1 christos 514 1.1 christos Deutsch Informational [Page 9] 515 1.1 christos 517 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 518 1.1 christos 519 1.1 christos 520 1.1 christos BFINAL is set if and only if this is the last block of the data 521 1.1 christos set. 522 1.1 christos 523 1.1 christos BTYPE specifies how the data are compressed, as follows: 524 1.1 christos 525 1.1 christos 00 - no compression 526 1.1 christos 01 - compressed with fixed Huffman codes 527 1.1 christos 10 - compressed with dynamic Huffman codes 528 1.1 christos 11 - reserved (error) 529 1.1 christos 530 1.1 christos The only difference between the two compressed cases is how the 531 1.1 christos Huffman codes for the literal/length and distance alphabets are 532 1.1 christos defined. 533 1.1 christos 534 1.1 christos In all cases, the decoding algorithm for the actual data is as 535 1.1 christos follows: 536 1.1 christos 537 1.1 christos do 538 1.1 christos read block header from input stream. 539 1.1 christos if stored with no compression 540 1.1 christos skip any remaining bits in current partially 541 1.1 christos processed byte 542 1.1 christos read LEN and NLEN (see next section) 543 1.1 christos copy LEN bytes of data to output 544 1.1 christos otherwise 545 1.1 christos if compressed with dynamic Huffman codes 546 1.1 christos read representation of code trees (see 547 1.1 christos subsection below) 548 1.1 christos loop (until end of block code recognized) 549 1.1 christos decode literal/length value from input stream 550 1.1 christos if value < 256 551 1.1 christos copy value (literal byte) to output stream 552 1.1 christos otherwise 553 1.1 christos if value = end of block (256) 554 1.1 christos break from loop 555 1.1 christos otherwise (value = 257..285) 556 1.1 christos decode distance from input stream 557 1.1 christos 558 1.1 christos move backwards distance bytes in the output 559 1.1 christos stream, and copy length bytes from this 560 1.1 christos position to the output stream. 561 1.1 christos end loop 562 1.1 christos while not last block 563 1.1 christos 564 1.1 christos Note that a duplicated string reference may refer to a string 565 1.1 christos in a previous block; i.e., the backward distance may cross one 566 1.1 christos or more block boundaries. However a distance cannot refer past 567 1.1 christos the beginning of the output stream. (An application using a 568 1.1 christos 569 1.1 christos 570 1.1 christos 571 1.1 christos Deutsch Informational [Page 10] 572 1.1 christos 574 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 575 1.1 christos 576 1.1 christos 577 1.1 christos preset dictionary might discard part of the output stream; a 578 1.1 christos distance can refer to that part of the output stream anyway) 579 1.1 christos Note also that the referenced string may overlap the current 580 1.1 christos position; for example, if the last 2 bytes decoded have values 581 1.1 christos X and Y, a string reference with <length = 5, distance = 2> 582 1.1 christos adds X,Y,X,Y,X to the output stream. 583 1.1 christos 584 1.1 christos We now specify each compression method in turn. 585 1.1 christos 586 1.1 christos 3.2.4. Non-compressed blocks (BTYPE=00) 587 1.1 christos 588 1.1 christos Any bits of input up to the next byte boundary are ignored. 589 1.1 christos The rest of the block consists of the following information: 590 1.1 christos 591 1.1 christos 0 1 2 3 4... 592 1.1 christos +---+---+---+---+================================+ 593 1.1 christos | LEN | NLEN |... LEN bytes of literal data...| 594 1.1 christos +---+---+---+---+================================+ 595 1.1 christos 596 1.1 christos LEN is the number of data bytes in the block. NLEN is the 597 1.1 christos one's complement of LEN. 598 1.1 christos 599 1.1 christos 3.2.5. Compressed blocks (length and distance codes) 600 1.1 christos 601 1.1 christos As noted above, encoded data blocks in the "deflate" format 602 1.1 christos consist of sequences of symbols drawn from three conceptually 603 1.1 christos distinct alphabets: either literal bytes, from the alphabet of 604 1.1 christos byte values (0..255), or <length, backward distance> pairs, 605 1.1 christos where the length is drawn from (3..258) and the distance is 606 1.1 christos drawn from (1..32,768). In fact, the literal and length 607 1.1 christos alphabets are merged into a single alphabet (0..285), where 608 1.1 christos values 0..255 represent literal bytes, the value 256 indicates 609 1.1 christos end-of-block, and values 257..285 represent length codes 610 1.1 christos (possibly in conjunction with extra bits following the symbol 611 1.1 christos code) as follows: 612 1.1 christos 613 1.1 christos 614 1.1 christos 615 1.1 christos 616 1.1 christos 617 1.1 christos 618 1.1 christos 619 1.1 christos 620 1.1 christos 621 1.1 christos 622 1.1 christos 623 1.1 christos 624 1.1 christos 625 1.1 christos 626 1.1 christos 627 1.1 christos 628 1.1 christos Deutsch Informational [Page 11] 629 1.1 christos 631 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 632 1.1 christos 633 1.1 christos 634 1.1 christos Extra Extra Extra 635 1.1 christos Code Bits Length(s) Code Bits Lengths Code Bits Length(s) 636 1.1 christos ---- ---- ------ ---- ---- ------- ---- ---- ------- 637 1.1 christos 257 0 3 267 1 15,16 277 4 67-82 638 1.1 christos 258 0 4 268 1 17,18 278 4 83-98 639 1.1 christos 259 0 5 269 2 19-22 279 4 99-114 640 1.1 christos 260 0 6 270 2 23-26 280 4 115-130 641 1.1 christos 261 0 7 271 2 27-30 281 5 131-162 642 1.1 christos 262 0 8 272 2 31-34 282 5 163-194 643 1.1 christos 263 0 9 273 3 35-42 283 5 195-226 644 1.1 christos 264 0 10 274 3 43-50 284 5 227-257 645 1.1 christos 265 1 11,12 275 3 51-58 285 0 258 646 1.1 christos 266 1 13,14 276 3 59-66 647 1.1 christos 648 1.1 christos The extra bits should be interpreted as a machine integer 649 1.1 christos stored with the most-significant bit first, e.g., bits 1110 650 1.1 christos represent the value 14. 651 1.1 christos 652 1.1 christos Extra Extra Extra 653 1.1 christos Code Bits Dist Code Bits Dist Code Bits Distance 654 1.1 christos ---- ---- ---- ---- ---- ------ ---- ---- -------- 655 1.1 christos 0 0 1 10 4 33-48 20 9 1025-1536 656 1.1 christos 1 0 2 11 4 49-64 21 9 1537-2048 657 1.1 christos 2 0 3 12 5 65-96 22 10 2049-3072 658 1.1 christos 3 0 4 13 5 97-128 23 10 3073-4096 659 1.1 christos 4 1 5,6 14 6 129-192 24 11 4097-6144 660 1.1 christos 5 1 7,8 15 6 193-256 25 11 6145-8192 661 1.1 christos 6 2 9-12 16 7 257-384 26 12 8193-12288 662 1.1 christos 7 2 13-16 17 7 385-512 27 12 12289-16384 663 1.1 christos 8 3 17-24 18 8 513-768 28 13 16385-24576 664 1.1 christos 9 3 25-32 19 8 769-1024 29 13 24577-32768 665 1.1 christos 666 1.1 christos 3.2.6. Compression with fixed Huffman codes (BTYPE=01) 667 1.1 christos 668 1.1 christos The Huffman codes for the two alphabets are fixed, and are not 669 1.1 christos represented explicitly in the data. The Huffman code lengths 670 1.1 christos for the literal/length alphabet are: 671 1.1 christos 672 1.1 christos Lit Value Bits Codes 673 1.1 christos --------- ---- ----- 674 1.1 christos 0 - 143 8 00110000 through 675 1.1 christos 10111111 676 1.1 christos 144 - 255 9 110010000 through 677 1.1 christos 111111111 678 1.1 christos 256 - 279 7 0000000 through 679 1.1 christos 0010111 680 1.1 christos 280 - 287 8 11000000 through 681 1.1 christos 11000111 682 1.1 christos 683 1.1 christos 684 1.1 christos 685 1.1 christos Deutsch Informational [Page 12] 686 1.1 christos 688 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 689 1.1 christos 690 1.1 christos 691 1.1 christos The code lengths are sufficient to generate the actual codes, 692 1.1 christos as described above; we show the codes in the table for added 693 1.1 christos clarity. Literal/length values 286-287 will never actually 694 1.1 christos occur in the compressed data, but participate in the code 695 1.1 christos construction. 696 1.1 christos 697 1.1 christos Distance codes 0-31 are represented by (fixed-length) 5-bit 698 1.1 christos codes, with possible additional bits as shown in the table 699 1.1 christos shown in Paragraph 3.2.5, above. Note that distance codes 30- 700 1.1 christos 31 will never actually occur in the compressed data. 701 1.1 christos 702 1.1 christos 3.2.7. Compression with dynamic Huffman codes (BTYPE=10) 703 1.1 christos 704 1.1 christos The Huffman codes for the two alphabets appear in the block 705 1.1 christos immediately after the header bits and before the actual 706 1.1 christos compressed data, first the literal/length code and then the 707 1.1 christos distance code. Each code is defined by a sequence of code 708 1.1 christos lengths, as discussed in Paragraph 3.2.2, above. For even 709 1.1 christos greater compactness, the code length sequences themselves are 710 1.1 christos compressed using a Huffman code. The alphabet for code lengths 711 1.1 christos is as follows: 712 1.1 christos 713 1.1 christos 0 - 15: Represent code lengths of 0 - 15 714 1.1 christos 16: Copy the previous code length 3 - 6 times. 715 1.1 christos The next 2 bits indicate repeat length 716 1.1 christos (0 = 3, ... , 3 = 6) 717 1.1 christos Example: Codes 8, 16 (+2 bits 11), 718 1.1 christos 16 (+2 bits 10) will expand to 719 1.1 christos 12 code lengths of 8 (1 + 6 + 5) 720 1.1 christos 17: Repeat a code length of 0 for 3 - 10 times. 721 1.1 christos (3 bits of length) 722 1.1 christos 18: Repeat a code length of 0 for 11 - 138 times 723 1.1 christos (7 bits of length) 724 1.1 christos 725 1.1 christos A code length of 0 indicates that the corresponding symbol in 726 1.1 christos the literal/length or distance alphabet will not occur in the 727 1.1 christos block, and should not participate in the Huffman code 728 1.1 christos construction algorithm given earlier. If only one distance 729 1.1 christos code is used, it is encoded using one bit, not zero bits; in 730 1.1 christos this case there is a single code length of one, with one unused 731 1.1 christos code. One distance code of zero bits means that there are no 732 1.1 christos distance codes used at all (the data is all literals). 733 1.1 christos 734 1.1 christos We can now define the format of the block: 735 1.1 christos 736 1.1 christos 5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286) 737 1.1 christos 5 Bits: HDIST, # of Distance codes - 1 (1 - 32) 738 1.1 christos 4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19) 739 1.1 christos 740 1.1 christos 741 1.1 christos 742 1.1 christos Deutsch Informational [Page 13] 743 1.1 christos 745 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 746 1.1 christos 747 1.1 christos 748 1.1 christos (HCLEN + 4) x 3 bits: code lengths for the code length 749 1.1 christos alphabet given just above, in the order: 16, 17, 18, 750 1.1 christos 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 751 1.1 christos 752 1.1 christos These code lengths are interpreted as 3-bit integers 753 1.1 christos (0-7); as above, a code length of 0 means the 754 1.1 christos corresponding symbol (literal/length or distance code 755 1.1 christos length) is not used. 756 1.1 christos 757 1.1 christos HLIT + 257 code lengths for the literal/length alphabet, 758 1.1 christos encoded using the code length Huffman code 759 1.1 christos 760 1.1 christos HDIST + 1 code lengths for the distance alphabet, 761 1.1 christos encoded using the code length Huffman code 762 1.1 christos 763 1.1 christos The actual compressed data of the block, 764 1.1 christos encoded using the literal/length and distance Huffman 765 1.1 christos codes 766 1.1 christos 767 1.1 christos The literal/length symbol 256 (end of data), 768 1.1 christos encoded using the literal/length Huffman code 769 1.1 christos 770 1.1 christos The code length repeat codes can cross from HLIT + 257 to the 771 1.1 christos HDIST + 1 code lengths. In other words, all code lengths form 772 1.1 christos a single sequence of HLIT + HDIST + 258 values. 773 1.1 christos 774 1.1 christos 3.3. Compliance 775 1.1 christos 776 1.1 christos A compressor may limit further the ranges of values specified in 777 1.1 christos the previous section and still be compliant; for example, it may 778 1.1 christos limit the range of backward pointers to some value smaller than 779 1.1 christos 32K. Similarly, a compressor may limit the size of blocks so that 780 1.1 christos a compressible block fits in memory. 781 1.1 christos 782 1.1 christos A compliant decompressor must accept the full range of possible 783 1.1 christos values defined in the previous section, and must accept blocks of 784 1.1 christos arbitrary size. 785 1.1 christos 786 1.1 christos 4. Compression algorithm details 787 1.1 christos 788 1.1 christos While it is the intent of this document to define the "deflate" 789 1.1 christos compressed data format without reference to any particular 790 1.1 christos compression algorithm, the format is related to the compressed 791 1.1 christos formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below); 792 1.1 christos since many variations of LZ77 are patented, it is strongly 793 1.1 christos recommended that the implementor of a compressor follow the general 794 1.1 christos algorithm presented here, which is known not to be patented per se. 795 1.1 christos The material in this section is not part of the definition of the 796 1.1 christos 797 1.1 christos 798 1.1 christos 799 1.1 christos Deutsch Informational [Page 14] 800 1.1 christos 802 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 803 1.1 christos 804 1.1 christos 805 1.1 christos specification per se, and a compressor need not follow it in order to 806 1.1 christos be compliant. 807 1.1 christos 808 1.1 christos The compressor terminates a block when it determines that starting a 809 1.1 christos new block with fresh trees would be useful, or when the block size 810 1.1 christos fills up the compressor's block buffer. 811 1.1 christos 812 1.1 christos The compressor uses a chained hash table to find duplicated strings, 813 1.1 christos using a hash function that operates on 3-byte sequences. At any 814 1.1 christos given point during compression, let XYZ be the next 3 input bytes to 815 1.1 christos be examined (not necessarily all different, of course). First, the 816 1.1 christos compressor examines the hash chain for XYZ. If the chain is empty, 817 1.1 christos the compressor simply writes out X as a literal byte and advances one 818 1.1 christos byte in the input. If the hash chain is not empty, indicating that 819 1.1 christos the sequence XYZ (or, if we are unlucky, some other 3 bytes with the 820 1.1 christos same hash function value) has occurred recently, the compressor 821 1.1 christos compares all strings on the XYZ hash chain with the actual input data 822 1.1 christos sequence starting at the current point, and selects the longest 823 1.1 christos match. 824 1.1 christos 825 1.1 christos The compressor searches the hash chains starting with the most recent 826 1.1 christos strings, to favor small distances and thus take advantage of the 827 1.1 christos Huffman encoding. The hash chains are singly linked. There are no 828 1.1 christos deletions from the hash chains; the algorithm simply discards matches 829 1.1 christos that are too old. To avoid a worst-case situation, very long hash 830 1.1 christos chains are arbitrarily truncated at a certain length, determined by a 831 1.1 christos run-time parameter. 832 1.1 christos 833 1.1 christos To improve overall compression, the compressor optionally defers the 834 1.1 christos selection of matches ("lazy matching"): after a match of length N has 835 1.1 christos been found, the compressor searches for a longer match starting at 836 1.1 christos the next input byte. If it finds a longer match, it truncates the 837 1.1 christos previous match to a length of one (thus producing a single literal 838 1.1 christos byte) and then emits the longer match. Otherwise, it emits the 839 1.1 christos original match, and, as described above, advances N bytes before 840 1.1 christos continuing. 841 1.1 christos 842 1.1 christos Run-time parameters also control this "lazy match" procedure. If 843 1.1 christos compression ratio is most important, the compressor attempts a 844 1.1 christos complete second search regardless of the length of the first match. 845 1.1 christos In the normal case, if the current match is "long enough", the 846 1.1 christos compressor reduces the search for a longer match, thus speeding up 847 1.1 christos the process. If speed is most important, the compressor inserts new 848 1.1 christos strings in the hash table only when no match was found, or when the 849 1.1 christos match is not "too long". This degrades the compression ratio but 850 1.1 christos saves time since there are both fewer insertions and fewer searches. 851 1.1 christos 852 1.1 christos 853 1.1 christos 854 1.1 christos 855 1.1 christos 856 1.1 christos Deutsch Informational [Page 15] 857 1.1 christos 859 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 860 1.1 christos 861 1.1 christos 862 1.1 christos 5. References 863 1.1 christos 864 1.1 christos [1] Huffman, D. A., "A Method for the Construction of Minimum 865 1.1 christos Redundancy Codes", Proceedings of the Institute of Radio 866 1.1 christos Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101. 867 1.1 christos 868 1.1 christos [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data 869 1.1 christos Compression", IEEE Transactions on Information Theory, Vol. 23, 870 1.1 christos No. 3, pp. 337-343. 871 1.1 christos 872 1.1 christos [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources, 873 1.1 christos available in ftp://ftp.uu.net/pub/archiving/zip/doc/ 874 1.1 christos 875 1.1 christos [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources, 876 1.1 christos available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/ 877 1.1 christos 878 1.1 christos [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix 879 1.1 christos encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169. 880 1.1 christos 881 1.1 christos [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes," 882 1.1 christos Comm. ACM, 33,4, April 1990, pp. 449-459. 883 1.1 christos 884 1.1 christos 6. Security Considerations 885 1.1 christos 886 1.1 christos Any data compression method involves the reduction of redundancy in 887 1.1 christos the data. Consequently, any corruption of the data is likely to have 888 1.1 christos severe effects and be difficult to correct. Uncompressed text, on 889 1.1 christos the other hand, will probably still be readable despite the presence 890 1.1 christos of some corrupted bytes. 891 1.1 christos 892 1.1 christos It is recommended that systems using this data format provide some 893 1.1 christos means of validating the integrity of the compressed data. See 894 1.1 christos reference [3], for example. 895 1.1 christos 896 1.1 christos 7. Source code 897 1.1 christos 898 1.1 christos Source code for a C language implementation of a "deflate" compliant 899 1.1 christos compressor and decompressor is available within the zlib package at 900 1.1 christos ftp://ftp.uu.net/pub/archiving/zip/zlib/. 901 1.1 christos 902 1.1 christos 8. Acknowledgements 903 1.1 christos 904 1.1 christos Trademarks cited in this document are the property of their 905 1.1 christos respective owners. 906 1.1 christos 907 1.1 christos Phil Katz designed the deflate format. Jean-Loup Gailly and Mark 908 1.1 christos Adler wrote the related software described in this specification. 909 1.1 christos Glenn Randers-Pehrson converted this document to RFC and HTML format. 910 1.1 christos 911 1.1 christos 912 1.1 christos 913 1.1 christos Deutsch Informational [Page 16] 914 1.1 christos 916 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996 917 1.1 christos 918 1.1 christos 919 1.1 christos 9. Author's Address 920 1.1 christos 921 1.1 christos L. Peter Deutsch 922 1.1 christos Aladdin Enterprises 923 1.1 christos 203 Santa Margarita Ave. 924 1.1 christos Menlo Park, CA 94025 925 1.1 christos 926 1.1 christos Phone: (415) 322-0103 (AM only) 927 1.1 christos FAX: (415) 322-1734 928 1.1 christos EMail: <ghost (a] aladdin.com> 929 1.1 christos 930 1.1 christos Questions about the technical content of this specification can be 931 1.1 christos sent by email to: 932 1.1 christos 933 1.1 christos Jean-Loup Gailly <gzip (a] prep.ai.mit.edu> and 934 1.1 christos Mark Adler <madler (a] alumni.caltech.edu> 935 1.1 christos 936 1.1 christos Editorial comments on this specification can be sent by email to: 937 1.1 christos 938 1.1 christos L. Peter Deutsch <ghost (a] aladdin.com> and 939 1.1 christos Glenn Randers-Pehrson <randeg (a] alumni.rpi.edu> 940 1.1 christos 941 1.1 christos 942 1.1 christos 943 1.1 christos 944 1.1 christos 945 1.1 christos 946 1.1 christos 947 1.1 christos 948 1.1 christos 949 1.1 christos 950 1.1 christos 951 1.1 christos 952 1.1 christos 953 1.1 christos 954 1.1 christos 955 1.1 christos 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 Deutsch Informational [Page 17] 971 973