rfc1951.txt revision 1.1 1 1.1 christos
2 1.1 christos
3 1.1 christos
4 1.1 christos
5 1.1 christos
6 1.1 christos
7 1.1 christos Network Working Group P. Deutsch
8 1.1 christos Request for Comments: 1951 Aladdin Enterprises
9 1.1 christos Category: Informational May 1996
10 1.1 christos
11 1.1 christos
12 1.1 christos DEFLATE Compressed Data Format Specification version 1.3
13 1.1 christos
14 1.1 christos Status of This Memo
15 1.1 christos
16 1.1 christos This memo provides information for the Internet community. This memo
17 1.1 christos does not specify an Internet standard of any kind. Distribution of
18 1.1 christos this memo is unlimited.
19 1.1 christos
20 1.1 christos IESG Note:
21 1.1 christos
22 1.1 christos The IESG takes no position on the validity of any Intellectual
23 1.1 christos Property Rights statements contained in this document.
24 1.1 christos
25 1.1 christos Notices
26 1.1 christos
27 1.1 christos Copyright (c) 1996 L. Peter Deutsch
28 1.1 christos
29 1.1 christos Permission is granted to copy and distribute this document for any
30 1.1 christos purpose and without charge, including translations into other
31 1.1 christos languages and incorporation into compilations, provided that the
32 1.1 christos copyright notice and this notice are preserved, and that any
33 1.1 christos substantive changes or deletions from the original are clearly
34 1.1 christos marked.
35 1.1 christos
36 1.1 christos A pointer to the latest version of this and related documentation in
37 1.1 christos HTML format can be found at the URL
38 1.1 christos <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.
39 1.1 christos
40 1.1 christos Abstract
41 1.1 christos
42 1.1 christos This specification defines a lossless compressed data format that
43 1.1 christos compresses data using a combination of the LZ77 algorithm and Huffman
44 1.1 christos coding, with efficiency comparable to the best currently available
45 1.1 christos general-purpose compression methods. The data can be produced or
46 1.1 christos consumed, even for an arbitrarily long sequentially presented input
47 1.1 christos data stream, using only an a priori bounded amount of intermediate
48 1.1 christos storage. The format can be implemented readily in a manner not
49 1.1 christos covered by patents.
50 1.1 christos
51 1.1 christos
52 1.1 christos
53 1.1 christos
54 1.1 christos
55 1.1 christos
56 1.1 christos
57 1.1 christos
58 1.1 christos Deutsch Informational [Page 1]
59 1.1 christos
61 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
62 1.1 christos
63 1.1 christos
64 1.1 christos Table of Contents
65 1.1 christos
66 1.1 christos 1. Introduction ................................................... 2
67 1.1 christos 1.1. Purpose ................................................... 2
68 1.1 christos 1.2. Intended audience ......................................... 3
69 1.1 christos 1.3. Scope ..................................................... 3
70 1.1 christos 1.4. Compliance ................................................ 3
71 1.1 christos 1.5. Definitions of terms and conventions used ................ 3
72 1.1 christos 1.6. Changes from previous versions ............................ 4
73 1.1 christos 2. Compressed representation overview ............................. 4
74 1.1 christos 3. Detailed specification ......................................... 5
75 1.1 christos 3.1. Overall conventions ....................................... 5
76 1.1 christos 3.1.1. Packing into bytes .................................. 5
77 1.1 christos 3.2. Compressed block format ................................... 6
78 1.1 christos 3.2.1. Synopsis of prefix and Huffman coding ............... 6
79 1.1 christos 3.2.2. Use of Huffman coding in the "deflate" format ....... 7
80 1.1 christos 3.2.3. Details of block format ............................. 9
81 1.1 christos 3.2.4. Non-compressed blocks (BTYPE=00) ................... 11
82 1.1 christos 3.2.5. Compressed blocks (length and distance codes) ...... 11
83 1.1 christos 3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12
84 1.1 christos 3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13
85 1.1 christos 3.3. Compliance ............................................... 14
86 1.1 christos 4. Compression algorithm details ................................. 14
87 1.1 christos 5. References .................................................... 16
88 1.1 christos 6. Security Considerations ....................................... 16
89 1.1 christos 7. Source code ................................................... 16
90 1.1 christos 8. Acknowledgements .............................................. 16
91 1.1 christos 9. Author's Address .............................................. 17
92 1.1 christos
93 1.1 christos 1. Introduction
94 1.1 christos
95 1.1 christos 1.1. Purpose
96 1.1 christos
97 1.1 christos The purpose of this specification is to define a lossless
98 1.1 christos compressed data format that:
99 1.1 christos * Is independent of CPU type, operating system, file system,
100 1.1 christos and character set, and hence can be used for interchange;
101 1.1 christos * Can be produced or consumed, even for an arbitrarily long
102 1.1 christos sequentially presented input data stream, using only an a
103 1.1 christos priori bounded amount of intermediate storage, and hence
104 1.1 christos can be used in data communications or similar structures
105 1.1 christos such as Unix filters;
106 1.1 christos * Compresses data with efficiency comparable to the best
107 1.1 christos currently available general-purpose compression methods,
108 1.1 christos and in particular considerably better than the "compress"
109 1.1 christos program;
110 1.1 christos * Can be implemented readily in a manner not covered by
111 1.1 christos patents, and hence can be practiced freely;
112 1.1 christos
113 1.1 christos
114 1.1 christos
115 1.1 christos Deutsch Informational [Page 2]
116 1.1 christos
118 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
119 1.1 christos
120 1.1 christos
121 1.1 christos * Is compatible with the file format produced by the current
122 1.1 christos widely used gzip utility, in that conforming decompressors
123 1.1 christos will be able to read data produced by the existing gzip
124 1.1 christos compressor.
125 1.1 christos
126 1.1 christos The data format defined by this specification does not attempt to:
127 1.1 christos
128 1.1 christos * Allow random access to compressed data;
129 1.1 christos * Compress specialized data (e.g., raster graphics) as well
130 1.1 christos as the best currently available specialized algorithms.
131 1.1 christos
132 1.1 christos A simple counting argument shows that no lossless compression
133 1.1 christos algorithm can compress every possible input data set. For the
134 1.1 christos format defined here, the worst case expansion is 5 bytes per 32K-
135 1.1 christos byte block, i.e., a size increase of 0.015% for large data sets.
136 1.1 christos English text usually compresses by a factor of 2.5 to 3;
137 1.1 christos executable files usually compress somewhat less; graphical data
138 1.1 christos such as raster images may compress much more.
139 1.1 christos
140 1.1 christos 1.2. Intended audience
141 1.1 christos
142 1.1 christos This specification is intended for use by implementors of software
143 1.1 christos to compress data into "deflate" format and/or decompress data from
144 1.1 christos "deflate" format.
145 1.1 christos
146 1.1 christos The text of the specification assumes a basic background in
147 1.1 christos programming at the level of bits and other primitive data
148 1.1 christos representations. Familiarity with the technique of Huffman coding
149 1.1 christos is helpful but not required.
150 1.1 christos
151 1.1 christos 1.3. Scope
152 1.1 christos
153 1.1 christos The specification specifies a method for representing a sequence
154 1.1 christos of bytes as a (usually shorter) sequence of bits, and a method for
155 1.1 christos packing the latter bit sequence into bytes.
156 1.1 christos
157 1.1 christos 1.4. Compliance
158 1.1 christos
159 1.1 christos Unless otherwise indicated below, a compliant decompressor must be
160 1.1 christos able to accept and decompress any data set that conforms to all
161 1.1 christos the specifications presented here; a compliant compressor must
162 1.1 christos produce data sets that conform to all the specifications presented
163 1.1 christos here.
164 1.1 christos
165 1.1 christos 1.5. Definitions of terms and conventions used
166 1.1 christos
167 1.1 christos Byte: 8 bits stored or transmitted as a unit (same as an octet).
168 1.1 christos For this specification, a byte is exactly 8 bits, even on machines
169 1.1 christos
170 1.1 christos
171 1.1 christos
172 1.1 christos Deutsch Informational [Page 3]
173 1.1 christos
175 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
176 1.1 christos
177 1.1 christos
178 1.1 christos which store a character on a number of bits different from eight.
179 1.1 christos See below, for the numbering of bits within a byte.
180 1.1 christos
181 1.1 christos String: a sequence of arbitrary bytes.
182 1.1 christos
183 1.1 christos 1.6. Changes from previous versions
184 1.1 christos
185 1.1 christos There have been no technical changes to the deflate format since
186 1.1 christos version 1.1 of this specification. In version 1.2, some
187 1.1 christos terminology was changed. Version 1.3 is a conversion of the
188 1.1 christos specification to RFC style.
189 1.1 christos
190 1.1 christos 2. Compressed representation overview
191 1.1 christos
192 1.1 christos A compressed data set consists of a series of blocks, corresponding
193 1.1 christos to successive blocks of input data. The block sizes are arbitrary,
194 1.1 christos except that non-compressible blocks are limited to 65,535 bytes.
195 1.1 christos
196 1.1 christos Each block is compressed using a combination of the LZ77 algorithm
197 1.1 christos and Huffman coding. The Huffman trees for each block are independent
198 1.1 christos of those for previous or subsequent blocks; the LZ77 algorithm may
199 1.1 christos use a reference to a duplicated string occurring in a previous block,
200 1.1 christos up to 32K input bytes before.
201 1.1 christos
202 1.1 christos Each block consists of two parts: a pair of Huffman code trees that
203 1.1 christos describe the representation of the compressed data part, and a
204 1.1 christos compressed data part. (The Huffman trees themselves are compressed
205 1.1 christos using Huffman encoding.) The compressed data consists of a series of
206 1.1 christos elements of two types: literal bytes (of strings that have not been
207 1.1 christos detected as duplicated within the previous 32K input bytes), and
208 1.1 christos pointers to duplicated strings, where a pointer is represented as a
209 1.1 christos pair <length, backward distance>. The representation used in the
210 1.1 christos "deflate" format limits distances to 32K bytes and lengths to 258
211 1.1 christos bytes, but does not limit the size of a block, except for
212 1.1 christos uncompressible blocks, which are limited as noted above.
213 1.1 christos
214 1.1 christos Each type of value (literals, distances, and lengths) in the
215 1.1 christos compressed data is represented using a Huffman code, using one code
216 1.1 christos tree for literals and lengths and a separate code tree for distances.
217 1.1 christos The code trees for each block appear in a compact form just before
218 1.1 christos the compressed data for that block.
219 1.1 christos
220 1.1 christos
221 1.1 christos
222 1.1 christos
223 1.1 christos
224 1.1 christos
225 1.1 christos
226 1.1 christos
227 1.1 christos
228 1.1 christos
229 1.1 christos Deutsch Informational [Page 4]
230 1.1 christos
232 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
233 1.1 christos
234 1.1 christos
235 1.1 christos 3. Detailed specification
236 1.1 christos
237 1.1 christos 3.1. Overall conventions In the diagrams below, a box like this:
238 1.1 christos
239 1.1 christos +---+
240 1.1 christos | | <-- the vertical bars might be missing
241 1.1 christos +---+
242 1.1 christos
243 1.1 christos represents one byte; a box like this:
244 1.1 christos
245 1.1 christos +==============+
246 1.1 christos | |
247 1.1 christos +==============+
248 1.1 christos
249 1.1 christos represents a variable number of bytes.
250 1.1 christos
251 1.1 christos Bytes stored within a computer do not have a "bit order", since
252 1.1 christos they are always treated as a unit. However, a byte considered as
253 1.1 christos an integer between 0 and 255 does have a most- and least-
254 1.1 christos significant bit, and since we write numbers with the most-
255 1.1 christos significant digit on the left, we also write bytes with the most-
256 1.1 christos significant bit on the left. In the diagrams below, we number the
257 1.1 christos bits of a byte so that bit 0 is the least-significant bit, i.e.,
258 1.1 christos the bits are numbered:
259 1.1 christos
260 1.1 christos +--------+
261 1.1 christos |76543210|
262 1.1 christos +--------+
263 1.1 christos
264 1.1 christos Within a computer, a number may occupy multiple bytes. All
265 1.1 christos multi-byte numbers in the format described here are stored with
266 1.1 christos the least-significant byte first (at the lower memory address).
267 1.1 christos For example, the decimal number 520 is stored as:
268 1.1 christos
269 1.1 christos 0 1
270 1.1 christos +--------+--------+
271 1.1 christos |00001000|00000010|
272 1.1 christos +--------+--------+
273 1.1 christos ^ ^
274 1.1 christos | |
275 1.1 christos | + more significant byte = 2 x 256
276 1.1 christos + less significant byte = 8
277 1.1 christos
278 1.1 christos 3.1.1. Packing into bytes
279 1.1 christos
280 1.1 christos This document does not address the issue of the order in which
281 1.1 christos bits of a byte are transmitted on a bit-sequential medium,
282 1.1 christos since the final data format described here is byte- rather than
283 1.1 christos
284 1.1 christos
285 1.1 christos
286 1.1 christos Deutsch Informational [Page 5]
287 1.1 christos
289 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
290 1.1 christos
291 1.1 christos
292 1.1 christos bit-oriented. However, we describe the compressed block format
293 1.1 christos in below, as a sequence of data elements of various bit
294 1.1 christos lengths, not a sequence of bytes. We must therefore specify
295 1.1 christos how to pack these data elements into bytes to form the final
296 1.1 christos compressed byte sequence:
297 1.1 christos
298 1.1 christos * Data elements are packed into bytes in order of
299 1.1 christos increasing bit number within the byte, i.e., starting
300 1.1 christos with the least-significant bit of the byte.
301 1.1 christos * Data elements other than Huffman codes are packed
302 1.1 christos starting with the least-significant bit of the data
303 1.1 christos element.
304 1.1 christos * Huffman codes are packed starting with the most-
305 1.1 christos significant bit of the code.
306 1.1 christos
307 1.1 christos In other words, if one were to print out the compressed data as
308 1.1 christos a sequence of bytes, starting with the first byte at the
309 1.1 christos *right* margin and proceeding to the *left*, with the most-
310 1.1 christos significant bit of each byte on the left as usual, one would be
311 1.1 christos able to parse the result from right to left, with fixed-width
312 1.1 christos elements in the correct MSB-to-LSB order and Huffman codes in
313 1.1 christos bit-reversed order (i.e., with the first bit of the code in the
314 1.1 christos relative LSB position).
315 1.1 christos
316 1.1 christos 3.2. Compressed block format
317 1.1 christos
318 1.1 christos 3.2.1. Synopsis of prefix and Huffman coding
319 1.1 christos
320 1.1 christos Prefix coding represents symbols from an a priori known
321 1.1 christos alphabet by bit sequences (codes), one code for each symbol, in
322 1.1 christos a manner such that different symbols may be represented by bit
323 1.1 christos sequences of different lengths, but a parser can always parse
324 1.1 christos an encoded string unambiguously symbol-by-symbol.
325 1.1 christos
326 1.1 christos We define a prefix code in terms of a binary tree in which the
327 1.1 christos two edges descending from each non-leaf node are labeled 0 and
328 1.1 christos 1 and in which the leaf nodes correspond one-for-one with (are
329 1.1 christos labeled with) the symbols of the alphabet; then the code for a
330 1.1 christos symbol is the sequence of 0's and 1's on the edges leading from
331 1.1 christos the root to the leaf labeled with that symbol. For example:
332 1.1 christos
333 1.1 christos
334 1.1 christos
335 1.1 christos
336 1.1 christos
337 1.1 christos
338 1.1 christos
339 1.1 christos
340 1.1 christos
341 1.1 christos
342 1.1 christos
343 1.1 christos Deutsch Informational [Page 6]
344 1.1 christos
346 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
347 1.1 christos
348 1.1 christos
349 1.1 christos /\ Symbol Code
350 1.1 christos 0 1 ------ ----
351 1.1 christos / \ A 00
352 1.1 christos /\ B B 1
353 1.1 christos 0 1 C 011
354 1.1 christos / \ D 010
355 1.1 christos A /\
356 1.1 christos 0 1
357 1.1 christos / \
358 1.1 christos D C
359 1.1 christos
360 1.1 christos A parser can decode the next symbol from an encoded input
361 1.1 christos stream by walking down the tree from the root, at each step
362 1.1 christos choosing the edge corresponding to the next input bit.
363 1.1 christos
364 1.1 christos Given an alphabet with known symbol frequencies, the Huffman
365 1.1 christos algorithm allows the construction of an optimal prefix code
366 1.1 christos (one which represents strings with those symbol frequencies
367 1.1 christos using the fewest bits of any possible prefix codes for that
368 1.1 christos alphabet). Such a code is called a Huffman code. (See
369 1.1 christos reference [1] in Chapter 5, references for additional
370 1.1 christos information on Huffman codes.)
371 1.1 christos
372 1.1 christos Note that in the "deflate" format, the Huffman codes for the
373 1.1 christos various alphabets must not exceed certain maximum code lengths.
374 1.1 christos This constraint complicates the algorithm for computing code
375 1.1 christos lengths from symbol frequencies. Again, see Chapter 5,
376 1.1 christos references for details.
377 1.1 christos
378 1.1 christos 3.2.2. Use of Huffman coding in the "deflate" format
379 1.1 christos
380 1.1 christos The Huffman codes used for each alphabet in the "deflate"
381 1.1 christos format have two additional rules:
382 1.1 christos
383 1.1 christos * All codes of a given bit length have lexicographically
384 1.1 christos consecutive values, in the same order as the symbols
385 1.1 christos they represent;
386 1.1 christos
387 1.1 christos * Shorter codes lexicographically precede longer codes.
388 1.1 christos
389 1.1 christos
390 1.1 christos
391 1.1 christos
392 1.1 christos
393 1.1 christos
394 1.1 christos
395 1.1 christos
396 1.1 christos
397 1.1 christos
398 1.1 christos
399 1.1 christos
400 1.1 christos Deutsch Informational [Page 7]
401 1.1 christos
403 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
404 1.1 christos
405 1.1 christos
406 1.1 christos We could recode the example above to follow this rule as
407 1.1 christos follows, assuming that the order of the alphabet is ABCD:
408 1.1 christos
409 1.1 christos Symbol Code
410 1.1 christos ------ ----
411 1.1 christos A 10
412 1.1 christos B 0
413 1.1 christos C 110
414 1.1 christos D 111
415 1.1 christos
416 1.1 christos I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
417 1.1 christos lexicographically consecutive.
418 1.1 christos
419 1.1 christos Given this rule, we can define the Huffman code for an alphabet
420 1.1 christos just by giving the bit lengths of the codes for each symbol of
421 1.1 christos the alphabet in order; this is sufficient to determine the
422 1.1 christos actual codes. In our example, the code is completely defined
423 1.1 christos by the sequence of bit lengths (2, 1, 3, 3). The following
424 1.1 christos algorithm generates the codes as integers, intended to be read
425 1.1 christos from most- to least-significant bit. The code lengths are
426 1.1 christos initially in tree[I].Len; the codes are produced in
427 1.1 christos tree[I].Code.
428 1.1 christos
429 1.1 christos 1) Count the number of codes for each code length. Let
430 1.1 christos bl_count[N] be the number of codes of length N, N >= 1.
431 1.1 christos
432 1.1 christos 2) Find the numerical value of the smallest code for each
433 1.1 christos code length:
434 1.1 christos
435 1.1 christos code = 0;
436 1.1 christos bl_count[0] = 0;
437 1.1 christos for (bits = 1; bits <= MAX_BITS; bits++) {
438 1.1 christos code = (code + bl_count[bits-1]) << 1;
439 1.1 christos next_code[bits] = code;
440 1.1 christos }
441 1.1 christos
442 1.1 christos 3) Assign numerical values to all codes, using consecutive
443 1.1 christos values for all codes of the same length with the base
444 1.1 christos values determined at step 2. Codes that are never used
445 1.1 christos (which have a bit length of zero) must not be assigned a
446 1.1 christos value.
447 1.1 christos
448 1.1 christos for (n = 0; n <= max_code; n++) {
449 1.1 christos len = tree[n].Len;
450 1.1 christos if (len != 0) {
451 1.1 christos tree[n].Code = next_code[len];
452 1.1 christos next_code[len]++;
453 1.1 christos }
454 1.1 christos
455 1.1 christos
456 1.1 christos
457 1.1 christos Deutsch Informational [Page 8]
458 1.1 christos
460 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
461 1.1 christos
462 1.1 christos
463 1.1 christos }
464 1.1 christos
465 1.1 christos Example:
466 1.1 christos
467 1.1 christos Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
468 1.1 christos 3, 2, 4, 4). After step 1, we have:
469 1.1 christos
470 1.1 christos N bl_count[N]
471 1.1 christos - -----------
472 1.1 christos 2 1
473 1.1 christos 3 5
474 1.1 christos 4 2
475 1.1 christos
476 1.1 christos Step 2 computes the following next_code values:
477 1.1 christos
478 1.1 christos N next_code[N]
479 1.1 christos - ------------
480 1.1 christos 1 0
481 1.1 christos 2 0
482 1.1 christos 3 2
483 1.1 christos 4 14
484 1.1 christos
485 1.1 christos Step 3 produces the following code values:
486 1.1 christos
487 1.1 christos Symbol Length Code
488 1.1 christos ------ ------ ----
489 1.1 christos A 3 010
490 1.1 christos B 3 011
491 1.1 christos C 3 100
492 1.1 christos D 3 101
493 1.1 christos E 3 110
494 1.1 christos F 2 00
495 1.1 christos G 4 1110
496 1.1 christos H 4 1111
497 1.1 christos
498 1.1 christos 3.2.3. Details of block format
499 1.1 christos
500 1.1 christos Each block of compressed data begins with 3 header bits
501 1.1 christos containing the following data:
502 1.1 christos
503 1.1 christos first bit BFINAL
504 1.1 christos next 2 bits BTYPE
505 1.1 christos
506 1.1 christos Note that the header bits do not necessarily begin on a byte
507 1.1 christos boundary, since a block does not necessarily occupy an integral
508 1.1 christos number of bytes.
509 1.1 christos
510 1.1 christos
511 1.1 christos
512 1.1 christos
513 1.1 christos
514 1.1 christos Deutsch Informational [Page 9]
515 1.1 christos
517 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
518 1.1 christos
519 1.1 christos
520 1.1 christos BFINAL is set if and only if this is the last block of the data
521 1.1 christos set.
522 1.1 christos
523 1.1 christos BTYPE specifies how the data are compressed, as follows:
524 1.1 christos
525 1.1 christos 00 - no compression
526 1.1 christos 01 - compressed with fixed Huffman codes
527 1.1 christos 10 - compressed with dynamic Huffman codes
528 1.1 christos 11 - reserved (error)
529 1.1 christos
530 1.1 christos The only difference between the two compressed cases is how the
531 1.1 christos Huffman codes for the literal/length and distance alphabets are
532 1.1 christos defined.
533 1.1 christos
534 1.1 christos In all cases, the decoding algorithm for the actual data is as
535 1.1 christos follows:
536 1.1 christos
537 1.1 christos do
538 1.1 christos read block header from input stream.
539 1.1 christos if stored with no compression
540 1.1 christos skip any remaining bits in current partially
541 1.1 christos processed byte
542 1.1 christos read LEN and NLEN (see next section)
543 1.1 christos copy LEN bytes of data to output
544 1.1 christos otherwise
545 1.1 christos if compressed with dynamic Huffman codes
546 1.1 christos read representation of code trees (see
547 1.1 christos subsection below)
548 1.1 christos loop (until end of block code recognized)
549 1.1 christos decode literal/length value from input stream
550 1.1 christos if value < 256
551 1.1 christos copy value (literal byte) to output stream
552 1.1 christos otherwise
553 1.1 christos if value = end of block (256)
554 1.1 christos break from loop
555 1.1 christos otherwise (value = 257..285)
556 1.1 christos decode distance from input stream
557 1.1 christos
558 1.1 christos move backwards distance bytes in the output
559 1.1 christos stream, and copy length bytes from this
560 1.1 christos position to the output stream.
561 1.1 christos end loop
562 1.1 christos while not last block
563 1.1 christos
564 1.1 christos Note that a duplicated string reference may refer to a string
565 1.1 christos in a previous block; i.e., the backward distance may cross one
566 1.1 christos or more block boundaries. However a distance cannot refer past
567 1.1 christos the beginning of the output stream. (An application using a
568 1.1 christos
569 1.1 christos
570 1.1 christos
571 1.1 christos Deutsch Informational [Page 10]
572 1.1 christos
574 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
575 1.1 christos
576 1.1 christos
577 1.1 christos preset dictionary might discard part of the output stream; a
578 1.1 christos distance can refer to that part of the output stream anyway)
579 1.1 christos Note also that the referenced string may overlap the current
580 1.1 christos position; for example, if the last 2 bytes decoded have values
581 1.1 christos X and Y, a string reference with <length = 5, distance = 2>
582 1.1 christos adds X,Y,X,Y,X to the output stream.
583 1.1 christos
584 1.1 christos We now specify each compression method in turn.
585 1.1 christos
586 1.1 christos 3.2.4. Non-compressed blocks (BTYPE=00)
587 1.1 christos
588 1.1 christos Any bits of input up to the next byte boundary are ignored.
589 1.1 christos The rest of the block consists of the following information:
590 1.1 christos
591 1.1 christos 0 1 2 3 4...
592 1.1 christos +---+---+---+---+================================+
593 1.1 christos | LEN | NLEN |... LEN bytes of literal data...|
594 1.1 christos +---+---+---+---+================================+
595 1.1 christos
596 1.1 christos LEN is the number of data bytes in the block. NLEN is the
597 1.1 christos one's complement of LEN.
598 1.1 christos
599 1.1 christos 3.2.5. Compressed blocks (length and distance codes)
600 1.1 christos
601 1.1 christos As noted above, encoded data blocks in the "deflate" format
602 1.1 christos consist of sequences of symbols drawn from three conceptually
603 1.1 christos distinct alphabets: either literal bytes, from the alphabet of
604 1.1 christos byte values (0..255), or <length, backward distance> pairs,
605 1.1 christos where the length is drawn from (3..258) and the distance is
606 1.1 christos drawn from (1..32,768). In fact, the literal and length
607 1.1 christos alphabets are merged into a single alphabet (0..285), where
608 1.1 christos values 0..255 represent literal bytes, the value 256 indicates
609 1.1 christos end-of-block, and values 257..285 represent length codes
610 1.1 christos (possibly in conjunction with extra bits following the symbol
611 1.1 christos code) as follows:
612 1.1 christos
613 1.1 christos
614 1.1 christos
615 1.1 christos
616 1.1 christos
617 1.1 christos
618 1.1 christos
619 1.1 christos
620 1.1 christos
621 1.1 christos
622 1.1 christos
623 1.1 christos
624 1.1 christos
625 1.1 christos
626 1.1 christos
627 1.1 christos
628 1.1 christos Deutsch Informational [Page 11]
629 1.1 christos
631 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
632 1.1 christos
633 1.1 christos
634 1.1 christos Extra Extra Extra
635 1.1 christos Code Bits Length(s) Code Bits Lengths Code Bits Length(s)
636 1.1 christos ---- ---- ------ ---- ---- ------- ---- ---- -------
637 1.1 christos 257 0 3 267 1 15,16 277 4 67-82
638 1.1 christos 258 0 4 268 1 17,18 278 4 83-98
639 1.1 christos 259 0 5 269 2 19-22 279 4 99-114
640 1.1 christos 260 0 6 270 2 23-26 280 4 115-130
641 1.1 christos 261 0 7 271 2 27-30 281 5 131-162
642 1.1 christos 262 0 8 272 2 31-34 282 5 163-194
643 1.1 christos 263 0 9 273 3 35-42 283 5 195-226
644 1.1 christos 264 0 10 274 3 43-50 284 5 227-257
645 1.1 christos 265 1 11,12 275 3 51-58 285 0 258
646 1.1 christos 266 1 13,14 276 3 59-66
647 1.1 christos
648 1.1 christos The extra bits should be interpreted as a machine integer
649 1.1 christos stored with the most-significant bit first, e.g., bits 1110
650 1.1 christos represent the value 14.
651 1.1 christos
652 1.1 christos Extra Extra Extra
653 1.1 christos Code Bits Dist Code Bits Dist Code Bits Distance
654 1.1 christos ---- ---- ---- ---- ---- ------ ---- ---- --------
655 1.1 christos 0 0 1 10 4 33-48 20 9 1025-1536
656 1.1 christos 1 0 2 11 4 49-64 21 9 1537-2048
657 1.1 christos 2 0 3 12 5 65-96 22 10 2049-3072
658 1.1 christos 3 0 4 13 5 97-128 23 10 3073-4096
659 1.1 christos 4 1 5,6 14 6 129-192 24 11 4097-6144
660 1.1 christos 5 1 7,8 15 6 193-256 25 11 6145-8192
661 1.1 christos 6 2 9-12 16 7 257-384 26 12 8193-12288
662 1.1 christos 7 2 13-16 17 7 385-512 27 12 12289-16384
663 1.1 christos 8 3 17-24 18 8 513-768 28 13 16385-24576
664 1.1 christos 9 3 25-32 19 8 769-1024 29 13 24577-32768
665 1.1 christos
666 1.1 christos 3.2.6. Compression with fixed Huffman codes (BTYPE=01)
667 1.1 christos
668 1.1 christos The Huffman codes for the two alphabets are fixed, and are not
669 1.1 christos represented explicitly in the data. The Huffman code lengths
670 1.1 christos for the literal/length alphabet are:
671 1.1 christos
672 1.1 christos Lit Value Bits Codes
673 1.1 christos --------- ---- -----
674 1.1 christos 0 - 143 8 00110000 through
675 1.1 christos 10111111
676 1.1 christos 144 - 255 9 110010000 through
677 1.1 christos 111111111
678 1.1 christos 256 - 279 7 0000000 through
679 1.1 christos 0010111
680 1.1 christos 280 - 287 8 11000000 through
681 1.1 christos 11000111
682 1.1 christos
683 1.1 christos
684 1.1 christos
685 1.1 christos Deutsch Informational [Page 12]
686 1.1 christos
688 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
689 1.1 christos
690 1.1 christos
691 1.1 christos The code lengths are sufficient to generate the actual codes,
692 1.1 christos as described above; we show the codes in the table for added
693 1.1 christos clarity. Literal/length values 286-287 will never actually
694 1.1 christos occur in the compressed data, but participate in the code
695 1.1 christos construction.
696 1.1 christos
697 1.1 christos Distance codes 0-31 are represented by (fixed-length) 5-bit
698 1.1 christos codes, with possible additional bits as shown in the table
699 1.1 christos shown in Paragraph 3.2.5, above. Note that distance codes 30-
700 1.1 christos 31 will never actually occur in the compressed data.
701 1.1 christos
702 1.1 christos 3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
703 1.1 christos
704 1.1 christos The Huffman codes for the two alphabets appear in the block
705 1.1 christos immediately after the header bits and before the actual
706 1.1 christos compressed data, first the literal/length code and then the
707 1.1 christos distance code. Each code is defined by a sequence of code
708 1.1 christos lengths, as discussed in Paragraph 3.2.2, above. For even
709 1.1 christos greater compactness, the code length sequences themselves are
710 1.1 christos compressed using a Huffman code. The alphabet for code lengths
711 1.1 christos is as follows:
712 1.1 christos
713 1.1 christos 0 - 15: Represent code lengths of 0 - 15
714 1.1 christos 16: Copy the previous code length 3 - 6 times.
715 1.1 christos The next 2 bits indicate repeat length
716 1.1 christos (0 = 3, ... , 3 = 6)
717 1.1 christos Example: Codes 8, 16 (+2 bits 11),
718 1.1 christos 16 (+2 bits 10) will expand to
719 1.1 christos 12 code lengths of 8 (1 + 6 + 5)
720 1.1 christos 17: Repeat a code length of 0 for 3 - 10 times.
721 1.1 christos (3 bits of length)
722 1.1 christos 18: Repeat a code length of 0 for 11 - 138 times
723 1.1 christos (7 bits of length)
724 1.1 christos
725 1.1 christos A code length of 0 indicates that the corresponding symbol in
726 1.1 christos the literal/length or distance alphabet will not occur in the
727 1.1 christos block, and should not participate in the Huffman code
728 1.1 christos construction algorithm given earlier. If only one distance
729 1.1 christos code is used, it is encoded using one bit, not zero bits; in
730 1.1 christos this case there is a single code length of one, with one unused
731 1.1 christos code. One distance code of zero bits means that there are no
732 1.1 christos distance codes used at all (the data is all literals).
733 1.1 christos
734 1.1 christos We can now define the format of the block:
735 1.1 christos
736 1.1 christos 5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
737 1.1 christos 5 Bits: HDIST, # of Distance codes - 1 (1 - 32)
738 1.1 christos 4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)
739 1.1 christos
740 1.1 christos
741 1.1 christos
742 1.1 christos Deutsch Informational [Page 13]
743 1.1 christos
745 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
746 1.1 christos
747 1.1 christos
748 1.1 christos (HCLEN + 4) x 3 bits: code lengths for the code length
749 1.1 christos alphabet given just above, in the order: 16, 17, 18,
750 1.1 christos 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
751 1.1 christos
752 1.1 christos These code lengths are interpreted as 3-bit integers
753 1.1 christos (0-7); as above, a code length of 0 means the
754 1.1 christos corresponding symbol (literal/length or distance code
755 1.1 christos length) is not used.
756 1.1 christos
757 1.1 christos HLIT + 257 code lengths for the literal/length alphabet,
758 1.1 christos encoded using the code length Huffman code
759 1.1 christos
760 1.1 christos HDIST + 1 code lengths for the distance alphabet,
761 1.1 christos encoded using the code length Huffman code
762 1.1 christos
763 1.1 christos The actual compressed data of the block,
764 1.1 christos encoded using the literal/length and distance Huffman
765 1.1 christos codes
766 1.1 christos
767 1.1 christos The literal/length symbol 256 (end of data),
768 1.1 christos encoded using the literal/length Huffman code
769 1.1 christos
770 1.1 christos The code length repeat codes can cross from HLIT + 257 to the
771 1.1 christos HDIST + 1 code lengths. In other words, all code lengths form
772 1.1 christos a single sequence of HLIT + HDIST + 258 values.
773 1.1 christos
774 1.1 christos 3.3. Compliance
775 1.1 christos
776 1.1 christos A compressor may limit further the ranges of values specified in
777 1.1 christos the previous section and still be compliant; for example, it may
778 1.1 christos limit the range of backward pointers to some value smaller than
779 1.1 christos 32K. Similarly, a compressor may limit the size of blocks so that
780 1.1 christos a compressible block fits in memory.
781 1.1 christos
782 1.1 christos A compliant decompressor must accept the full range of possible
783 1.1 christos values defined in the previous section, and must accept blocks of
784 1.1 christos arbitrary size.
785 1.1 christos
786 1.1 christos 4. Compression algorithm details
787 1.1 christos
788 1.1 christos While it is the intent of this document to define the "deflate"
789 1.1 christos compressed data format without reference to any particular
790 1.1 christos compression algorithm, the format is related to the compressed
791 1.1 christos formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below);
792 1.1 christos since many variations of LZ77 are patented, it is strongly
793 1.1 christos recommended that the implementor of a compressor follow the general
794 1.1 christos algorithm presented here, which is known not to be patented per se.
795 1.1 christos The material in this section is not part of the definition of the
796 1.1 christos
797 1.1 christos
798 1.1 christos
799 1.1 christos Deutsch Informational [Page 14]
800 1.1 christos
802 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
803 1.1 christos
804 1.1 christos
805 1.1 christos specification per se, and a compressor need not follow it in order to
806 1.1 christos be compliant.
807 1.1 christos
808 1.1 christos The compressor terminates a block when it determines that starting a
809 1.1 christos new block with fresh trees would be useful, or when the block size
810 1.1 christos fills up the compressor's block buffer.
811 1.1 christos
812 1.1 christos The compressor uses a chained hash table to find duplicated strings,
813 1.1 christos using a hash function that operates on 3-byte sequences. At any
814 1.1 christos given point during compression, let XYZ be the next 3 input bytes to
815 1.1 christos be examined (not necessarily all different, of course). First, the
816 1.1 christos compressor examines the hash chain for XYZ. If the chain is empty,
817 1.1 christos the compressor simply writes out X as a literal byte and advances one
818 1.1 christos byte in the input. If the hash chain is not empty, indicating that
819 1.1 christos the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
820 1.1 christos same hash function value) has occurred recently, the compressor
821 1.1 christos compares all strings on the XYZ hash chain with the actual input data
822 1.1 christos sequence starting at the current point, and selects the longest
823 1.1 christos match.
824 1.1 christos
825 1.1 christos The compressor searches the hash chains starting with the most recent
826 1.1 christos strings, to favor small distances and thus take advantage of the
827 1.1 christos Huffman encoding. The hash chains are singly linked. There are no
828 1.1 christos deletions from the hash chains; the algorithm simply discards matches
829 1.1 christos that are too old. To avoid a worst-case situation, very long hash
830 1.1 christos chains are arbitrarily truncated at a certain length, determined by a
831 1.1 christos run-time parameter.
832 1.1 christos
833 1.1 christos To improve overall compression, the compressor optionally defers the
834 1.1 christos selection of matches ("lazy matching"): after a match of length N has
835 1.1 christos been found, the compressor searches for a longer match starting at
836 1.1 christos the next input byte. If it finds a longer match, it truncates the
837 1.1 christos previous match to a length of one (thus producing a single literal
838 1.1 christos byte) and then emits the longer match. Otherwise, it emits the
839 1.1 christos original match, and, as described above, advances N bytes before
840 1.1 christos continuing.
841 1.1 christos
842 1.1 christos Run-time parameters also control this "lazy match" procedure. If
843 1.1 christos compression ratio is most important, the compressor attempts a
844 1.1 christos complete second search regardless of the length of the first match.
845 1.1 christos In the normal case, if the current match is "long enough", the
846 1.1 christos compressor reduces the search for a longer match, thus speeding up
847 1.1 christos the process. If speed is most important, the compressor inserts new
848 1.1 christos strings in the hash table only when no match was found, or when the
849 1.1 christos match is not "too long". This degrades the compression ratio but
850 1.1 christos saves time since there are both fewer insertions and fewer searches.
851 1.1 christos
852 1.1 christos
853 1.1 christos
854 1.1 christos
855 1.1 christos
856 1.1 christos Deutsch Informational [Page 15]
857 1.1 christos
859 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
860 1.1 christos
861 1.1 christos
862 1.1 christos 5. References
863 1.1 christos
864 1.1 christos [1] Huffman, D. A., "A Method for the Construction of Minimum
865 1.1 christos Redundancy Codes", Proceedings of the Institute of Radio
866 1.1 christos Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.
867 1.1 christos
868 1.1 christos [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
869 1.1 christos Compression", IEEE Transactions on Information Theory, Vol. 23,
870 1.1 christos No. 3, pp. 337-343.
871 1.1 christos
872 1.1 christos [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources,
873 1.1 christos available in ftp://ftp.uu.net/pub/archiving/zip/doc/
874 1.1 christos
875 1.1 christos [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources,
876 1.1 christos available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/
877 1.1 christos
878 1.1 christos [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix
879 1.1 christos encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.
880 1.1 christos
881 1.1 christos [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes,"
882 1.1 christos Comm. ACM, 33,4, April 1990, pp. 449-459.
883 1.1 christos
884 1.1 christos 6. Security Considerations
885 1.1 christos
886 1.1 christos Any data compression method involves the reduction of redundancy in
887 1.1 christos the data. Consequently, any corruption of the data is likely to have
888 1.1 christos severe effects and be difficult to correct. Uncompressed text, on
889 1.1 christos the other hand, will probably still be readable despite the presence
890 1.1 christos of some corrupted bytes.
891 1.1 christos
892 1.1 christos It is recommended that systems using this data format provide some
893 1.1 christos means of validating the integrity of the compressed data. See
894 1.1 christos reference [3], for example.
895 1.1 christos
896 1.1 christos 7. Source code
897 1.1 christos
898 1.1 christos Source code for a C language implementation of a "deflate" compliant
899 1.1 christos compressor and decompressor is available within the zlib package at
900 1.1 christos ftp://ftp.uu.net/pub/archiving/zip/zlib/.
901 1.1 christos
902 1.1 christos 8. Acknowledgements
903 1.1 christos
904 1.1 christos Trademarks cited in this document are the property of their
905 1.1 christos respective owners.
906 1.1 christos
907 1.1 christos Phil Katz designed the deflate format. Jean-Loup Gailly and Mark
908 1.1 christos Adler wrote the related software described in this specification.
909 1.1 christos Glenn Randers-Pehrson converted this document to RFC and HTML format.
910 1.1 christos
911 1.1 christos
912 1.1 christos
913 1.1 christos Deutsch Informational [Page 16]
914 1.1 christos
916 1.1 christos RFC 1951 DEFLATE Compressed Data Format Specification May 1996
917 1.1 christos
918 1.1 christos
919 1.1 christos 9. Author's Address
920 1.1 christos
921 1.1 christos L. Peter Deutsch
922 1.1 christos Aladdin Enterprises
923 1.1 christos 203 Santa Margarita Ave.
924 1.1 christos Menlo Park, CA 94025
925 1.1 christos
926 1.1 christos Phone: (415) 322-0103 (AM only)
927 1.1 christos FAX: (415) 322-1734
928 1.1 christos EMail: <ghost (a] aladdin.com>
929 1.1 christos
930 1.1 christos Questions about the technical content of this specification can be
931 1.1 christos sent by email to:
932 1.1 christos
933 1.1 christos Jean-Loup Gailly <gzip (a] prep.ai.mit.edu> and
934 1.1 christos Mark Adler <madler (a] alumni.caltech.edu>
935 1.1 christos
936 1.1 christos Editorial comments on this specification can be sent by email to:
937 1.1 christos
938 1.1 christos L. Peter Deutsch <ghost (a] aladdin.com> and
939 1.1 christos Glenn Randers-Pehrson <randeg (a] alumni.rpi.edu>
940 1.1 christos
941 1.1 christos
942 1.1 christos
943 1.1 christos
944 1.1 christos
945 1.1 christos
946 1.1 christos
947 1.1 christos
948 1.1 christos
949 1.1 christos
950 1.1 christos
951 1.1 christos
952 1.1 christos
953 1.1 christos
954 1.1 christos
955 1.1 christos
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970 Deutsch Informational [Page 17]
971
973