zlib/doc/txtvsbin.txt

    1.1  christos A Fast Method for Identifying Plain Text Files
    1.1  christos ==============================================
    1.1  christos
    1.1  christos
    1.1  christos Introduction
    1.1  christos ------------
    1.1  christos
    1.1  christos Given a file coming from an unknown source, it is sometimes desirable
    1.1  christos to find out whether the format of that file is plain text.  Although
    1.1  christos this may appear like a simple task, a fully accurate detection of the
    1.1  christos file type requires heavy-duty semantic analysis on the file contents.
    1.1  christos It is, however, possible to obtain satisfactory results by employing
    1.1  christos various heuristics.
    1.1  christos
    1.1  christos Previous versions of PKZip and other zip-compatible compression tools
    1.1  christos were using a crude detection scheme: if more than 80% (4/5) of the bytes
    1.1  christos found in a certain buffer are within the range [7..127], the file is
    1.1  christos labeled as plain text, otherwise it is labeled as binary.  A prominent
    1.1  christos limitation of this scheme is the restriction to Latin-based alphabets.
    1.1  christos Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
    1.1  christos the bytes within the range [128..255], and texts using these alphabets
    1.1  christos are most often misidentified by this scheme; in other words, the rate
    1.1  christos of false negatives is sometimes too high, which means that the recall
    1.1  christos is low.  Another weakness of this scheme is a reduced precision, due to
    1.1  christos the false positives that may occur when binary files containing large
    1.1  christos amounts of textual characters are misidentified as plain text.
    1.1  christos
    1.1  christos In this article we propose a new, simple detection scheme that features
    1.1  christos a much increased precision and a near-100% recall.  This scheme is
    1.1  christos designed to work on ASCII, Unicode and other ASCII-derived alphabets,
    1.1  christos and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.)
    1.1  christos and variable-sized encodings (ISO-2022, UTF-8, etc.).  Wider encodings
    1.1  christos (UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however.
    1.1  christos
    1.1  christos
    1.1  christos The Algorithm
    1.1  christos -------------
    1.1  christos
    1.1  christos The algorithm works by dividing the set of bytecodes [0..255] into three
    1.1  christos categories:
1.1.1.2  christos - The allow list of textual bytecodes:
    1.1  christos   9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
    1.1  christos - The gray list of tolerated bytecodes:
    1.1  christos   7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
1.1.1.2  christos - The block list of undesired, non-textual bytecodes:
    1.1  christos   0 (NUL) to 6, 14 to 31.
    1.1  christos
1.1.1.2  christos If a file contains at least one byte that belongs to the allow list and
1.1.1.2  christos no byte that belongs to the block list, then the file is categorized as
    1.1  christos plain text; otherwise, it is categorized as binary.  (The boundary case,
    1.1  christos when the file is empty, automatically falls into the latter category.)
    1.1  christos
    1.1  christos
    1.1  christos Rationale
    1.1  christos ---------
    1.1  christos
    1.1  christos The idea behind this algorithm relies on two observations.
    1.1  christos
    1.1  christos The first observation is that, although the full range of 7-bit codes
    1.1  christos [0..127] is properly specified by the ASCII standard, most control
    1.1  christos characters in the range [0..31] are not used in practice.  The only
    1.1  christos widely-used, almost universally-portable control codes are 9 (TAB),
    1.1  christos 10 (LF) and 13 (CR).  There are a few more control codes that are
    1.1  christos recognized on a reduced range of platforms and text viewers/editors:
    1.1  christos 7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these
    1.1  christos codes are rarely (if ever) used alone, without being accompanied by
    1.1  christos some printable text.  Even the newer, portable text formats such as
    1.1  christos XML avoid using control characters outside the list mentioned here.
    1.1  christos
    1.1  christos The second observation is that most of the binary files tend to contain
    1.1  christos control characters, especially 0 (NUL).  Even though the older text
    1.1  christos detection schemes observe the presence of non-ASCII codes from the range
    1.1  christos [128..255], the precision rarely has to suffer if this upper range is
    1.1  christos labeled as textual, because the files that are genuinely binary tend to
    1.1  christos contain both control characters and codes from the upper range.  On the
    1.1  christos other hand, the upper range needs to be labeled as textual, because it
    1.1  christos is used by virtually all ASCII extensions.  In particular, this range is
    1.1  christos used for encoding non-Latin scripts.
    1.1  christos
    1.1  christos Since there is no counting involved, other than simply observing the
    1.1  christos presence or the absence of some byte values, the algorithm produces
    1.1  christos consistent results, regardless what alphabet encoding is being used.
    1.1  christos (If counting were involved, it could be possible to obtain different
    1.1  christos results on a text encoded, say, using ISO-8859-16 versus UTF-8.)
    1.1  christos
    1.1  christos There is an extra category of plain text files that are "polluted" with
1.1.1.2  christos one or more block-listed codes, either by mistake or by peculiar design
    1.1  christos considerations.  In such cases, a scheme that tolerates a small fraction
1.1.1.2  christos of block-listed codes would provide an increased recall (i.e. more true
    1.1  christos positives).  This, however, incurs a reduced precision overall, since
    1.1  christos false positives are more likely to appear in binary files that contain
    1.1  christos large chunks of textual data.  Furthermore, "polluted" plain text should
    1.1  christos be regarded as binary by general-purpose text detection schemes, because
    1.1  christos general-purpose text processing algorithms might not be applicable.
    1.1  christos Under this premise, it is safe to say that our detection method provides
    1.1  christos a near-100% recall.
    1.1  christos
    1.1  christos Experiments have been run on many files coming from various platforms
    1.1  christos and applications.  We tried plain text files, system logs, source code,
    1.1  christos formatted office documents, compiled object code, etc.  The results
    1.1  christos confirm the optimistic assumptions about the capabilities of this
    1.1  christos algorithm.
    1.1  christos
    1.1  christos
    1.1  christos --
    1.1  christos Cosmin Truta
    1.1  christos Last updated: 2006-May-28