Home | History | Annotate | Line # | Download | only in examples
enough.c revision 1.1
      1  1.1  christos /* enough.c -- determine the maximum size of inflate's Huffman code tables over
      2  1.1  christos  * all possible valid and complete Huffman codes, subject to a length limit.
      3  1.1  christos  * Copyright (C) 2007, 2008, 2012 Mark Adler
      4  1.1  christos  * Version 1.4  18 August 2012  Mark Adler
      5  1.1  christos  */
      6  1.1  christos 
      7  1.1  christos /* Version history:
      8  1.1  christos    1.0   3 Jan 2007  First version (derived from codecount.c version 1.4)
      9  1.1  christos    1.1   4 Jan 2007  Use faster incremental table usage computation
     10  1.1  christos                      Prune examine() search on previously visited states
     11  1.1  christos    1.2   5 Jan 2007  Comments clean up
     12  1.1  christos                      As inflate does, decrease root for short codes
     13  1.1  christos                      Refuse cases where inflate would increase root
     14  1.1  christos    1.3  17 Feb 2008  Add argument for initial root table size
     15  1.1  christos                      Fix bug for initial root table size == max - 1
     16  1.1  christos                      Use a macro to compute the history index
     17  1.1  christos    1.4  18 Aug 2012  Avoid shifts more than bits in type (caused endless loop!)
     18  1.1  christos                      Clean up comparisons of different types
     19  1.1  christos                      Clean up code indentation
     20  1.1  christos  */
     21  1.1  christos 
     22  1.1  christos /*
     23  1.1  christos    Examine all possible Huffman codes for a given number of symbols and a
     24  1.1  christos    maximum code length in bits to determine the maximum table size for zilb's
     25  1.1  christos    inflate.  Only complete Huffman codes are counted.
     26  1.1  christos 
     27  1.1  christos    Two codes are considered distinct if the vectors of the number of codes per
     28  1.1  christos    length are not identical.  So permutations of the symbol assignments result
     29  1.1  christos    in the same code for the counting, as do permutations of the assignments of
     30  1.1  christos    the bit values to the codes (i.e. only canonical codes are counted).
     31  1.1  christos 
     32  1.1  christos    We build a code from shorter to longer lengths, determining how many symbols
     33  1.1  christos    are coded at each length.  At each step, we have how many symbols remain to
     34  1.1  christos    be coded, what the last code length used was, and how many bit patterns of
     35  1.1  christos    that length remain unused. Then we add one to the code length and double the
     36  1.1  christos    number of unused patterns to graduate to the next code length.  We then
     37  1.1  christos    assign all portions of the remaining symbols to that code length that
     38  1.1  christos    preserve the properties of a correct and eventually complete code.  Those
     39  1.1  christos    properties are: we cannot use more bit patterns than are available; and when
     40  1.1  christos    all the symbols are used, there are exactly zero possible bit patterns
     41  1.1  christos    remaining.
     42  1.1  christos 
     43  1.1  christos    The inflate Huffman decoding algorithm uses two-level lookup tables for
     44  1.1  christos    speed.  There is a single first-level table to decode codes up to root bits
     45  1.1  christos    in length (root == 9 in the current inflate implementation).  The table
     46  1.1  christos    has 1 << root entries and is indexed by the next root bits of input.  Codes
     47  1.1  christos    shorter than root bits have replicated table entries, so that the correct
     48  1.1  christos    entry is pointed to regardless of the bits that follow the short code.  If
     49  1.1  christos    the code is longer than root bits, then the table entry points to a second-
     50  1.1  christos    level table.  The size of that table is determined by the longest code with
     51  1.1  christos    that root-bit prefix.  If that longest code has length len, then the table
     52  1.1  christos    has size 1 << (len - root), to index the remaining bits in that set of
     53  1.1  christos    codes.  Each subsequent root-bit prefix then has its own sub-table.  The
     54  1.1  christos    total number of table entries required by the code is calculated
     55  1.1  christos    incrementally as the number of codes at each bit length is populated.  When
     56  1.1  christos    all of the codes are shorter than root bits, then root is reduced to the
     57  1.1  christos    longest code length, resulting in a single, smaller, one-level table.
     58  1.1  christos 
     59  1.1  christos    The inflate algorithm also provides for small values of root (relative to
     60  1.1  christos    the log2 of the number of symbols), where the shortest code has more bits
     61  1.1  christos    than root.  In that case, root is increased to the length of the shortest
     62  1.1  christos    code.  This program, by design, does not handle that case, so it is verified
     63  1.1  christos    that the number of symbols is less than 2^(root + 1).
     64  1.1  christos 
     65  1.1  christos    In order to speed up the examination (by about ten orders of magnitude for
     66  1.1  christos    the default arguments), the intermediate states in the build-up of a code
     67  1.1  christos    are remembered and previously visited branches are pruned.  The memory
     68  1.1  christos    required for this will increase rapidly with the total number of symbols and
     69  1.1  christos    the maximum code length in bits.  However this is a very small price to pay
     70  1.1  christos    for the vast speedup.
     71  1.1  christos 
     72  1.1  christos    First, all of the possible Huffman codes are counted, and reachable
     73  1.1  christos    intermediate states are noted by a non-zero count in a saved-results array.
     74  1.1  christos    Second, the intermediate states that lead to (root + 1) bit or longer codes
     75  1.1  christos    are used to look at all sub-codes from those junctures for their inflate
     76  1.1  christos    memory usage.  (The amount of memory used is not affected by the number of
     77  1.1  christos    codes of root bits or less in length.)  Third, the visited states in the
     78  1.1  christos    construction of those sub-codes and the associated calculation of the table
     79  1.1  christos    size is recalled in order to avoid recalculating from the same juncture.
     80  1.1  christos    Beginning the code examination at (root + 1) bit codes, which is enabled by
     81  1.1  christos    identifying the reachable nodes, accounts for about six of the orders of
     82  1.1  christos    magnitude of improvement for the default arguments.  About another four
     83  1.1  christos    orders of magnitude come from not revisiting previous states.  Out of
     84  1.1  christos    approximately 2x10^16 possible Huffman codes, only about 2x10^6 sub-codes
     85  1.1  christos    need to be examined to cover all of the possible table memory usage cases
     86  1.1  christos    for the default arguments of 286 symbols limited to 15-bit codes.
     87  1.1  christos 
     88  1.1  christos    Note that an unsigned long long type is used for counting.  It is quite easy
     89  1.1  christos    to exceed the capacity of an eight-byte integer with a large number of
     90  1.1  christos    symbols and a large maximum code length, so multiple-precision arithmetic
     91  1.1  christos    would need to replace the unsigned long long arithmetic in that case.  This
     92  1.1  christos    program will abort if an overflow occurs.  The big_t type identifies where
     93  1.1  christos    the counting takes place.
     94  1.1  christos 
     95  1.1  christos    An unsigned long long type is also used for calculating the number of
     96  1.1  christos    possible codes remaining at the maximum length.  This limits the maximum
     97  1.1  christos    code length to the number of bits in a long long minus the number of bits
     98  1.1  christos    needed to represent the symbols in a flat code.  The code_t type identifies
     99  1.1  christos    where the bit pattern counting takes place.
    100  1.1  christos  */
    101  1.1  christos 
    102  1.1  christos #include <stdio.h>
    103  1.1  christos #include <stdlib.h>
    104  1.1  christos #include <string.h>
    105  1.1  christos #include <assert.h>
    106  1.1  christos 
    107  1.1  christos #define local static
    108  1.1  christos 
    109  1.1  christos /* special data types */
    110  1.1  christos typedef unsigned long long big_t;   /* type for code counting */
    111  1.1  christos typedef unsigned long long code_t;  /* type for bit pattern counting */
    112  1.1  christos struct tab {                        /* type for been here check */
    113  1.1  christos     size_t len;         /* length of bit vector in char's */
    114  1.1  christos     char *vec;          /* allocated bit vector */
    115  1.1  christos };
    116  1.1  christos 
    117  1.1  christos /* The array for saving results, num[], is indexed with this triplet:
    118  1.1  christos 
    119  1.1  christos       syms: number of symbols remaining to code
    120  1.1  christos       left: number of available bit patterns at length len
    121  1.1  christos       len: number of bits in the codes currently being assigned
    122  1.1  christos 
    123  1.1  christos    Those indices are constrained thusly when saving results:
    124  1.1  christos 
    125  1.1  christos       syms: 3..totsym (totsym == total symbols to code)
    126  1.1  christos       left: 2..syms - 1, but only the evens (so syms == 8 -> 2, 4, 6)
    127  1.1  christos       len: 1..max - 1 (max == maximum code length in bits)
    128  1.1  christos 
    129  1.1  christos    syms == 2 is not saved since that immediately leads to a single code.  left
    130  1.1  christos    must be even, since it represents the number of available bit patterns at
    131  1.1  christos    the current length, which is double the number at the previous length.
    132  1.1  christos    left ends at syms-1 since left == syms immediately results in a single code.
    133  1.1  christos    (left > sym is not allowed since that would result in an incomplete code.)
    134  1.1  christos    len is less than max, since the code completes immediately when len == max.
    135  1.1  christos 
    136  1.1  christos    The offset into the array is calculated for the three indices with the
    137  1.1  christos    first one (syms) being outermost, and the last one (len) being innermost.
    138  1.1  christos    We build the array with length max-1 lists for the len index, with syms-3
    139  1.1  christos    of those for each symbol.  There are totsym-2 of those, with each one
    140  1.1  christos    varying in length as a function of sym.  See the calculation of index in
    141  1.1  christos    count() for the index, and the calculation of size in main() for the size
    142  1.1  christos    of the array.
    143  1.1  christos 
    144  1.1  christos    For the deflate example of 286 symbols limited to 15-bit codes, the array
    145  1.1  christos    has 284,284 entries, taking up 2.17 MB for an 8-byte big_t.  More than
    146  1.1  christos    half of the space allocated for saved results is actually used -- not all
    147  1.1  christos    possible triplets are reached in the generation of valid Huffman codes.
    148  1.1  christos  */
    149  1.1  christos 
    150  1.1  christos /* The array for tracking visited states, done[], is itself indexed identically
    151  1.1  christos    to the num[] array as described above for the (syms, left, len) triplet.
    152  1.1  christos    Each element in the array is further indexed by the (mem, rem) doublet,
    153  1.1  christos    where mem is the amount of inflate table space used so far, and rem is the
    154  1.1  christos    remaining unused entries in the current inflate sub-table.  Each indexed
    155  1.1  christos    element is simply one bit indicating whether the state has been visited or
    156  1.1  christos    not.  Since the ranges for mem and rem are not known a priori, each bit
    157  1.1  christos    vector is of a variable size, and grows as needed to accommodate the visited
    158  1.1  christos    states.  mem and rem are used to calculate a single index in a triangular
    159  1.1  christos    array.  Since the range of mem is expected in the default case to be about
    160  1.1  christos    ten times larger than the range of rem, the array is skewed to reduce the
    161  1.1  christos    memory usage, with eight times the range for mem than for rem.  See the
    162  1.1  christos    calculations for offset and bit in beenhere() for the details.
    163  1.1  christos 
    164  1.1  christos    For the deflate example of 286 symbols limited to 15-bit codes, the bit
    165  1.1  christos    vectors grow to total approximately 21 MB, in addition to the 4.3 MB done[]
    166  1.1  christos    array itself.
    167  1.1  christos  */
    168  1.1  christos 
    169  1.1  christos /* Globals to avoid propagating constants or constant pointers recursively */
    170  1.1  christos local int max;          /* maximum allowed bit length for the codes */
    171  1.1  christos local int root;         /* size of base code table in bits */
    172  1.1  christos local int large;        /* largest code table so far */
    173  1.1  christos local size_t size;      /* number of elements in num and done */
    174  1.1  christos local int *code;        /* number of symbols assigned to each bit length */
    175  1.1  christos local big_t *num;       /* saved results array for code counting */
    176  1.1  christos local struct tab *done; /* states already evaluated array */
    177  1.1  christos 
    178  1.1  christos /* Index function for num[] and done[] */
    179  1.1  christos #define INDEX(i,j,k) (((size_t)((i-1)>>1)*((i-2)>>1)+(j>>1)-1)*(max-1)+k-1)
    180  1.1  christos 
    181  1.1  christos /* Free allocated space.  Uses globals code, num, and done. */
    182  1.1  christos local void cleanup(void)
    183  1.1  christos {
    184  1.1  christos     size_t n;
    185  1.1  christos 
    186  1.1  christos     if (done != NULL) {
    187  1.1  christos         for (n = 0; n < size; n++)
    188  1.1  christos             if (done[n].len)
    189  1.1  christos                 free(done[n].vec);
    190  1.1  christos         free(done);
    191  1.1  christos     }
    192  1.1  christos     if (num != NULL)
    193  1.1  christos         free(num);
    194  1.1  christos     if (code != NULL)
    195  1.1  christos         free(code);
    196  1.1  christos }
    197  1.1  christos 
    198  1.1  christos /* Return the number of possible Huffman codes using bit patterns of lengths
    199  1.1  christos    len through max inclusive, coding syms symbols, with left bit patterns of
    200  1.1  christos    length len unused -- return -1 if there is an overflow in the counting.
    201  1.1  christos    Keep a record of previous results in num to prevent repeating the same
    202  1.1  christos    calculation.  Uses the globals max and num. */
    203  1.1  christos local big_t count(int syms, int len, int left)
    204  1.1  christos {
    205  1.1  christos     big_t sum;          /* number of possible codes from this juncture */
    206  1.1  christos     big_t got;          /* value returned from count() */
    207  1.1  christos     int least;          /* least number of syms to use at this juncture */
    208  1.1  christos     int most;           /* most number of syms to use at this juncture */
    209  1.1  christos     int use;            /* number of bit patterns to use in next call */
    210  1.1  christos     size_t index;       /* index of this case in *num */
    211  1.1  christos 
    212  1.1  christos     /* see if only one possible code */
    213  1.1  christos     if (syms == left)
    214  1.1  christos         return 1;
    215  1.1  christos 
    216  1.1  christos     /* note and verify the expected state */
    217  1.1  christos     assert(syms > left && left > 0 && len < max);
    218  1.1  christos 
    219  1.1  christos     /* see if we've done this one already */
    220  1.1  christos     index = INDEX(syms, left, len);
    221  1.1  christos     got = num[index];
    222  1.1  christos     if (got)
    223  1.1  christos         return got;         /* we have -- return the saved result */
    224  1.1  christos 
    225  1.1  christos     /* we need to use at least this many bit patterns so that the code won't be
    226  1.1  christos        incomplete at the next length (more bit patterns than symbols) */
    227  1.1  christos     least = (left << 1) - syms;
    228  1.1  christos     if (least < 0)
    229  1.1  christos         least = 0;
    230  1.1  christos 
    231  1.1  christos     /* we can use at most this many bit patterns, lest there not be enough
    232  1.1  christos        available for the remaining symbols at the maximum length (if there were
    233  1.1  christos        no limit to the code length, this would become: most = left - 1) */
    234  1.1  christos     most = (((code_t)left << (max - len)) - syms) /
    235  1.1  christos             (((code_t)1 << (max - len)) - 1);
    236  1.1  christos 
    237  1.1  christos     /* count all possible codes from this juncture and add them up */
    238  1.1  christos     sum = 0;
    239  1.1  christos     for (use = least; use <= most; use++) {
    240  1.1  christos         got = count(syms - use, len + 1, (left - use) << 1);
    241  1.1  christos         sum += got;
    242  1.1  christos         if (got == (big_t)0 - 1 || sum < got)   /* overflow */
    243  1.1  christos             return (big_t)0 - 1;
    244  1.1  christos     }
    245  1.1  christos 
    246  1.1  christos     /* verify that all recursive calls are productive */
    247  1.1  christos     assert(sum != 0);
    248  1.1  christos 
    249  1.1  christos     /* save the result and return it */
    250  1.1  christos     num[index] = sum;
    251  1.1  christos     return sum;
    252  1.1  christos }
    253  1.1  christos 
    254  1.1  christos /* Return true if we've been here before, set to true if not.  Set a bit in a
    255  1.1  christos    bit vector to indicate visiting this state.  Each (syms,len,left) state
    256  1.1  christos    has a variable size bit vector indexed by (mem,rem).  The bit vector is
    257  1.1  christos    lengthened if needed to allow setting the (mem,rem) bit. */
    258  1.1  christos local int beenhere(int syms, int len, int left, int mem, int rem)
    259  1.1  christos {
    260  1.1  christos     size_t index;       /* index for this state's bit vector */
    261  1.1  christos     size_t offset;      /* offset in this state's bit vector */
    262  1.1  christos     int bit;            /* mask for this state's bit */
    263  1.1  christos     size_t length;      /* length of the bit vector in bytes */
    264  1.1  christos     char *vector;       /* new or enlarged bit vector */
    265  1.1  christos 
    266  1.1  christos     /* point to vector for (syms,left,len), bit in vector for (mem,rem) */
    267  1.1  christos     index = INDEX(syms, left, len);
    268  1.1  christos     mem -= 1 << root;
    269  1.1  christos     offset = (mem >> 3) + rem;
    270  1.1  christos     offset = ((offset * (offset + 1)) >> 1) + rem;
    271  1.1  christos     bit = 1 << (mem & 7);
    272  1.1  christos 
    273  1.1  christos     /* see if we've been here */
    274  1.1  christos     length = done[index].len;
    275  1.1  christos     if (offset < length && (done[index].vec[offset] & bit) != 0)
    276  1.1  christos         return 1;       /* done this! */
    277  1.1  christos 
    278  1.1  christos     /* we haven't been here before -- set the bit to show we have now */
    279  1.1  christos 
    280  1.1  christos     /* see if we need to lengthen the vector in order to set the bit */
    281  1.1  christos     if (length <= offset) {
    282  1.1  christos         /* if we have one already, enlarge it, zero out the appended space */
    283  1.1  christos         if (length) {
    284  1.1  christos             do {
    285  1.1  christos                 length <<= 1;
    286  1.1  christos             } while (length <= offset);
    287  1.1  christos             vector = realloc(done[index].vec, length);
    288  1.1  christos             if (vector != NULL)
    289  1.1  christos                 memset(vector + done[index].len, 0, length - done[index].len);
    290  1.1  christos         }
    291  1.1  christos 
    292  1.1  christos         /* otherwise we need to make a new vector and zero it out */
    293  1.1  christos         else {
    294  1.1  christos             length = 1 << (len - root);
    295  1.1  christos             while (length <= offset)
    296  1.1  christos                 length <<= 1;
    297  1.1  christos             vector = calloc(length, sizeof(char));
    298  1.1  christos         }
    299  1.1  christos 
    300  1.1  christos         /* in either case, bail if we can't get the memory */
    301  1.1  christos         if (vector == NULL) {
    302  1.1  christos             fputs("abort: unable to allocate enough memory\n", stderr);
    303  1.1  christos             cleanup();
    304  1.1  christos             exit(1);
    305  1.1  christos         }
    306  1.1  christos 
    307  1.1  christos         /* install the new vector */
    308  1.1  christos         done[index].len = length;
    309  1.1  christos         done[index].vec = vector;
    310  1.1  christos     }
    311  1.1  christos 
    312  1.1  christos     /* set the bit */
    313  1.1  christos     done[index].vec[offset] |= bit;
    314  1.1  christos     return 0;
    315  1.1  christos }
    316  1.1  christos 
    317  1.1  christos /* Examine all possible codes from the given node (syms, len, left).  Compute
    318  1.1  christos    the amount of memory required to build inflate's decoding tables, where the
    319  1.1  christos    number of code structures used so far is mem, and the number remaining in
    320  1.1  christos    the current sub-table is rem.  Uses the globals max, code, root, large, and
    321  1.1  christos    done. */
    322  1.1  christos local void examine(int syms, int len, int left, int mem, int rem)
    323  1.1  christos {
    324  1.1  christos     int least;          /* least number of syms to use at this juncture */
    325  1.1  christos     int most;           /* most number of syms to use at this juncture */
    326  1.1  christos     int use;            /* number of bit patterns to use in next call */
    327  1.1  christos 
    328  1.1  christos     /* see if we have a complete code */
    329  1.1  christos     if (syms == left) {
    330  1.1  christos         /* set the last code entry */
    331  1.1  christos         code[len] = left;
    332  1.1  christos 
    333  1.1  christos         /* complete computation of memory used by this code */
    334  1.1  christos         while (rem < left) {
    335  1.1  christos             left -= rem;
    336  1.1  christos             rem = 1 << (len - root);
    337  1.1  christos             mem += rem;
    338  1.1  christos         }
    339  1.1  christos         assert(rem == left);
    340  1.1  christos 
    341  1.1  christos         /* if this is a new maximum, show the entries used and the sub-code */
    342  1.1  christos         if (mem > large) {
    343  1.1  christos             large = mem;
    344  1.1  christos             printf("max %d: ", mem);
    345  1.1  christos             for (use = root + 1; use <= max; use++)
    346  1.1  christos                 if (code[use])
    347  1.1  christos                     printf("%d[%d] ", code[use], use);
    348  1.1  christos             putchar('\n');
    349  1.1  christos             fflush(stdout);
    350  1.1  christos         }
    351  1.1  christos 
    352  1.1  christos         /* remove entries as we drop back down in the recursion */
    353  1.1  christos         code[len] = 0;
    354  1.1  christos         return;
    355  1.1  christos     }
    356  1.1  christos 
    357  1.1  christos     /* prune the tree if we can */
    358  1.1  christos     if (beenhere(syms, len, left, mem, rem))
    359  1.1  christos         return;
    360  1.1  christos 
    361  1.1  christos     /* we need to use at least this many bit patterns so that the code won't be
    362  1.1  christos        incomplete at the next length (more bit patterns than symbols) */
    363  1.1  christos     least = (left << 1) - syms;
    364  1.1  christos     if (least < 0)
    365  1.1  christos         least = 0;
    366  1.1  christos 
    367  1.1  christos     /* we can use at most this many bit patterns, lest there not be enough
    368  1.1  christos        available for the remaining symbols at the maximum length (if there were
    369  1.1  christos        no limit to the code length, this would become: most = left - 1) */
    370  1.1  christos     most = (((code_t)left << (max - len)) - syms) /
    371  1.1  christos             (((code_t)1 << (max - len)) - 1);
    372  1.1  christos 
    373  1.1  christos     /* occupy least table spaces, creating new sub-tables as needed */
    374  1.1  christos     use = least;
    375  1.1  christos     while (rem < use) {
    376  1.1  christos         use -= rem;
    377  1.1  christos         rem = 1 << (len - root);
    378  1.1  christos         mem += rem;
    379  1.1  christos     }
    380  1.1  christos     rem -= use;
    381  1.1  christos 
    382  1.1  christos     /* examine codes from here, updating table space as we go */
    383  1.1  christos     for (use = least; use <= most; use++) {
    384  1.1  christos         code[len] = use;
    385  1.1  christos         examine(syms - use, len + 1, (left - use) << 1,
    386  1.1  christos                 mem + (rem ? 1 << (len - root) : 0), rem << 1);
    387  1.1  christos         if (rem == 0) {
    388  1.1  christos             rem = 1 << (len - root);
    389  1.1  christos             mem += rem;
    390  1.1  christos         }
    391  1.1  christos         rem--;
    392  1.1  christos     }
    393  1.1  christos 
    394  1.1  christos     /* remove entries as we drop back down in the recursion */
    395  1.1  christos     code[len] = 0;
    396  1.1  christos }
    397  1.1  christos 
    398  1.1  christos /* Look at all sub-codes starting with root + 1 bits.  Look at only the valid
    399  1.1  christos    intermediate code states (syms, left, len).  For each completed code,
    400  1.1  christos    calculate the amount of memory required by inflate to build the decoding
    401  1.1  christos    tables. Find the maximum amount of memory required and show the code that
    402  1.1  christos    requires that maximum.  Uses the globals max, root, and num. */
    403  1.1  christos local void enough(int syms)
    404  1.1  christos {
    405  1.1  christos     int n;              /* number of remaing symbols for this node */
    406  1.1  christos     int left;           /* number of unused bit patterns at this length */
    407  1.1  christos     size_t index;       /* index of this case in *num */
    408  1.1  christos 
    409  1.1  christos     /* clear code */
    410  1.1  christos     for (n = 0; n <= max; n++)
    411  1.1  christos         code[n] = 0;
    412  1.1  christos 
    413  1.1  christos     /* look at all (root + 1) bit and longer codes */
    414  1.1  christos     large = 1 << root;              /* base table */
    415  1.1  christos     if (root < max)                 /* otherwise, there's only a base table */
    416  1.1  christos         for (n = 3; n <= syms; n++)
    417  1.1  christos             for (left = 2; left < n; left += 2)
    418  1.1  christos             {
    419  1.1  christos                 /* look at all reachable (root + 1) bit nodes, and the
    420  1.1  christos                    resulting codes (complete at root + 2 or more) */
    421  1.1  christos                 index = INDEX(n, left, root + 1);
    422  1.1  christos                 if (root + 1 < max && num[index])       /* reachable node */
    423  1.1  christos                     examine(n, root + 1, left, 1 << root, 0);
    424  1.1  christos 
    425  1.1  christos                 /* also look at root bit codes with completions at root + 1
    426  1.1  christos                    bits (not saved in num, since complete), just in case */
    427  1.1  christos                 if (num[index - 1] && n <= left << 1)
    428  1.1  christos                     examine((n - left) << 1, root + 1, (n - left) << 1,
    429  1.1  christos                             1 << root, 0);
    430  1.1  christos             }
    431  1.1  christos 
    432  1.1  christos     /* done */
    433  1.1  christos     printf("done: maximum of %d table entries\n", large);
    434  1.1  christos }
    435  1.1  christos 
    436  1.1  christos /*
    437  1.1  christos    Examine and show the total number of possible Huffman codes for a given
    438  1.1  christos    maximum number of symbols, initial root table size, and maximum code length
    439  1.1  christos    in bits -- those are the command arguments in that order.  The default
    440  1.1  christos    values are 286, 9, and 15 respectively, for the deflate literal/length code.
    441  1.1  christos    The possible codes are counted for each number of coded symbols from two to
    442  1.1  christos    the maximum.  The counts for each of those and the total number of codes are
    443  1.1  christos    shown.  The maximum number of inflate table entires is then calculated
    444  1.1  christos    across all possible codes.  Each new maximum number of table entries and the
    445  1.1  christos    associated sub-code (starting at root + 1 == 10 bits) is shown.
    446  1.1  christos 
    447  1.1  christos    To count and examine Huffman codes that are not length-limited, provide a
    448  1.1  christos    maximum length equal to the number of symbols minus one.
    449  1.1  christos 
    450  1.1  christos    For the deflate literal/length code, use "enough".  For the deflate distance
    451  1.1  christos    code, use "enough 30 6".
    452  1.1  christos 
    453  1.1  christos    This uses the %llu printf format to print big_t numbers, which assumes that
    454  1.1  christos    big_t is an unsigned long long.  If the big_t type is changed (for example
    455  1.1  christos    to a multiple precision type), the method of printing will also need to be
    456  1.1  christos    updated.
    457  1.1  christos  */
    458  1.1  christos int main(int argc, char **argv)
    459  1.1  christos {
    460  1.1  christos     int syms;           /* total number of symbols to code */
    461  1.1  christos     int n;              /* number of symbols to code for this run */
    462  1.1  christos     big_t got;          /* return value of count() */
    463  1.1  christos     big_t sum;          /* accumulated number of codes over n */
    464  1.1  christos     code_t word;        /* for counting bits in code_t */
    465  1.1  christos 
    466  1.1  christos     /* set up globals for cleanup() */
    467  1.1  christos     code = NULL;
    468  1.1  christos     num = NULL;
    469  1.1  christos     done = NULL;
    470  1.1  christos 
    471  1.1  christos     /* get arguments -- default to the deflate literal/length code */
    472  1.1  christos     syms = 286;
    473  1.1  christos     root = 9;
    474  1.1  christos     max = 15;
    475  1.1  christos     if (argc > 1) {
    476  1.1  christos         syms = atoi(argv[1]);
    477  1.1  christos         if (argc > 2) {
    478  1.1  christos             root = atoi(argv[2]);
    479  1.1  christos             if (argc > 3)
    480  1.1  christos                 max = atoi(argv[3]);
    481  1.1  christos         }
    482  1.1  christos     }
    483  1.1  christos     if (argc > 4 || syms < 2 || root < 1 || max < 1) {
    484  1.1  christos         fputs("invalid arguments, need: [sym >= 2 [root >= 1 [max >= 1]]]\n",
    485  1.1  christos               stderr);
    486  1.1  christos         return 1;
    487  1.1  christos     }
    488  1.1  christos 
    489  1.1  christos     /* if not restricting the code length, the longest is syms - 1 */
    490  1.1  christos     if (max > syms - 1)
    491  1.1  christos         max = syms - 1;
    492  1.1  christos 
    493  1.1  christos     /* determine the number of bits in a code_t */
    494  1.1  christos     for (n = 0, word = 1; word; n++, word <<= 1)
    495  1.1  christos         ;
    496  1.1  christos 
    497  1.1  christos     /* make sure that the calculation of most will not overflow */
    498  1.1  christos     if (max > n || (code_t)(syms - 2) >= (((code_t)0 - 1) >> (max - 1))) {
    499  1.1  christos         fputs("abort: code length too long for internal types\n", stderr);
    500  1.1  christos         return 1;
    501  1.1  christos     }
    502  1.1  christos 
    503  1.1  christos     /* reject impossible code requests */
    504  1.1  christos     if ((code_t)(syms - 1) > ((code_t)1 << max) - 1) {
    505  1.1  christos         fprintf(stderr, "%d symbols cannot be coded in %d bits\n",
    506  1.1  christos                 syms, max);
    507  1.1  christos         return 1;
    508  1.1  christos     }
    509  1.1  christos 
    510  1.1  christos     /* allocate code vector */
    511  1.1  christos     code = calloc(max + 1, sizeof(int));
    512  1.1  christos     if (code == NULL) {
    513  1.1  christos         fputs("abort: unable to allocate enough memory\n", stderr);
    514  1.1  christos         return 1;
    515  1.1  christos     }
    516  1.1  christos 
    517  1.1  christos     /* determine size of saved results array, checking for overflows,
    518  1.1  christos        allocate and clear the array (set all to zero with calloc()) */
    519  1.1  christos     if (syms == 2)              /* iff max == 1 */
    520  1.1  christos         num = NULL;             /* won't be saving any results */
    521  1.1  christos     else {
    522  1.1  christos         size = syms >> 1;
    523  1.1  christos         if (size > ((size_t)0 - 1) / (n = (syms - 1) >> 1) ||
    524  1.1  christos                 (size *= n, size > ((size_t)0 - 1) / (n = max - 1)) ||
    525  1.1  christos                 (size *= n, size > ((size_t)0 - 1) / sizeof(big_t)) ||
    526  1.1  christos                 (num = calloc(size, sizeof(big_t))) == NULL) {
    527  1.1  christos             fputs("abort: unable to allocate enough memory\n", stderr);
    528  1.1  christos             cleanup();
    529  1.1  christos             return 1;
    530  1.1  christos         }
    531  1.1  christos     }
    532  1.1  christos 
    533  1.1  christos     /* count possible codes for all numbers of symbols, add up counts */
    534  1.1  christos     sum = 0;
    535  1.1  christos     for (n = 2; n <= syms; n++) {
    536  1.1  christos         got = count(n, 1, 2);
    537  1.1  christos         sum += got;
    538  1.1  christos         if (got == (big_t)0 - 1 || sum < got) {     /* overflow */
    539  1.1  christos             fputs("abort: can't count that high!\n", stderr);
    540  1.1  christos             cleanup();
    541  1.1  christos             return 1;
    542  1.1  christos         }
    543  1.1  christos         printf("%llu %d-codes\n", got, n);
    544  1.1  christos     }
    545  1.1  christos     printf("%llu total codes for 2 to %d symbols", sum, syms);
    546  1.1  christos     if (max < syms - 1)
    547  1.1  christos         printf(" (%d-bit length limit)\n", max);
    548  1.1  christos     else
    549  1.1  christos         puts(" (no length limit)");
    550  1.1  christos 
    551  1.1  christos     /* allocate and clear done array for beenhere() */
    552  1.1  christos     if (syms == 2)
    553  1.1  christos         done = NULL;
    554  1.1  christos     else if (size > ((size_t)0 - 1) / sizeof(struct tab) ||
    555  1.1  christos              (done = calloc(size, sizeof(struct tab))) == NULL) {
    556  1.1  christos         fputs("abort: unable to allocate enough memory\n", stderr);
    557  1.1  christos         cleanup();
    558  1.1  christos         return 1;
    559  1.1  christos     }
    560  1.1  christos 
    561  1.1  christos     /* find and show maximum inflate table usage */
    562  1.1  christos     if (root > max)                 /* reduce root to max length */
    563  1.1  christos         root = max;
    564  1.1  christos     if ((code_t)syms < ((code_t)1 << (root + 1)))
    565  1.1  christos         enough(syms);
    566  1.1  christos     else
    567  1.1  christos         puts("cannot handle minimum code lengths > root");
    568  1.1  christos 
    569  1.1  christos     /* done */
    570  1.1  christos     cleanup();
    571  1.1  christos     return 0;
    572  1.1  christos }
    573