[//]: # ($NetBSD: README.md,v 1.17 2024/03/28 21:04:48 rillig Exp $)

# Introduction

Lint1 analyzes a single translation unit of C code.

* It reads the output of the C preprocessor, retaining the comments.
* The lexer in `scan.l` and `lex.c` splits the input into tokens.
* The parser in `cgram.y` creates types and expressions from the tokens.
* The checks for declarations are in `decl.c`.
* The checks for initializations are in `init.c`.
* The checks for types and expressions are in `tree.c`.

To see how a specific lint message is triggered, read the corresponding unit
test in `tests/usr.bin/xlint/lint1/msg_???.c`.

# Features

## Type checking

Lint has stricter type checking than most C compilers.

In _strict bool mode_, lint treats `bool` as a type that is incompatible with
other scalar types, as in C#, Go and Java.
See the test `d_c99_bool_strict.c` for details.
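As a rough illustration of what strict bool mode is about, the following sketch contrasts a condition that uses an `int` as a truth value with one that produces a `bool` explicitly. The function names are made up for this example, and the exact messages that lint prints for the first function are not reproduced here; the test `d_c99_bool_strict.c` is the authoritative reference.

```c
#include <stdbool.h>

/*
 * Valid C, but the controlling expression has type 'int'.
 * In strict bool mode, lint is expected to complain about it.
 */
int
is_set_loose(int flags)
{
	if (flags)
		return 1;
	return 0;
}

/*
 * The comparison yields a truth value explicitly, so 'bool' is never
 * mixed with another scalar type.
 */
int
is_set_strict(int flags)
{
	bool is_set = flags != 0;

	if (is_set)
		return 1;
	return 0;
}
```

Both functions behave the same at run time; the difference only matters to the type checker.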

Lint warns about type conversions that may result in alignment problems.
See the test `msg_135.c` for examples.
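A minimal sketch of the kind of code this is about, assuming a byte buffer that holds a 32-bit value at an arbitrary offset. The function name `load_u32` is hypothetical. Casting the byte pointer to `const uint32_t *` would raise the required alignment; copying the bytes avoids the cast.

```c
#include <stdint.h>
#include <string.h>

/*
 * Reads a 32-bit value in host byte order from a buffer that may not
 * be aligned for 'uint32_t'.  The tempting alternative,
 * '*(const uint32_t *)buf', converts a pointer with alignment 1 to a
 * pointer with a stricter alignment requirement.
 */
uint32_t
load_u32(const unsigned char *buf)
{
	uint32_t value;

	memcpy(&value, buf, sizeof(value));
	return value;
}
```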

## Control flow analysis

Lint roughly tracks the control flow inside a single function.
It doesn't follow `goto` statements precisely, though;
instead, it assumes that each label is reachable.
See the test `msg_193.c` for examples.
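The following sketch shows both aspects in one function. The function name is made up, and the comments describe what lint would be expected to conclude from the description above, not verified output.

```c
/* Returns the index of the first negative element, or -1. */
int
index_of_first_negative(const int *v, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (v[i] < 0)
			goto found;
	return -1;
	i = -2;		/* directly follows 'return', so it is never reached */
found:
	return i;	/* a label, so lint assumes this is reachable */
}
```

Here the label really is reachable through the `goto`, but lint would make the same assumption even if no `goto` referred to it.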

## Error handling

Lint tries to continue parsing and checking even after seeing errors.
This part of lint is not robust though, so expect some crashes here,
as variables may be uninitialized or null pointers.
The cleanup after handling a parse error is often incomplete.

## Configurable diagnostic messages

Whether lint prints a message and whether each message is an error, a warning
or just informational depends on several things:

* The language level, with its possible values:
    * traditional C (`-t`)
    * migration from traditional C to C90 (default)
    * C90 (`-s`)
    * C99 (`-S`)
    * C11 (`-Ac11`)
    * C23 (`-Ac23`)
* In GCC mode (`-g`), lint allows several GNU extensions,
  reducing the number of printed messages.
* In strict bool mode (`-T`), lint issues errors when `bool` is mixed with
  other scalar types, reusing the existing messages 107 and 211, while also
  defining new messages that are specific to strict bool mode.
* The option `-a` enables the check for lossy conversions from large integer
  types; the option `-aa` extends this check to small integer types as well,
  reusing the same message ID.
* The option `-X` suppresses arbitrary messages by their message ID.
* The option `-q` enables additional queries that are not suitable as regular
  warnings but may be interesting to look at on a case-by-case basis.
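To make the `-a` item more concrete, here is a sketch of a lossy conversion from a large integer type, assuming a platform where `long` is wider than `int`. The function names are hypothetical, and whether lint stays silent about the explicit cast in the second function has not been verified here.

```c
#include <limits.h>

/*
 * The return statement implicitly converts 'long' to 'int', which may
 * lose information.  This is the kind of conversion that the '-a'
 * check is about.
 */
int
narrow_implicit(long big)
{
	return big;
}

/*
 * Checking the range first and converting with an explicit cast
 * documents that the narrowing is intended.
 */
int
narrow_checked(long big, int fallback)
{
	if (big < INT_MIN || big > INT_MAX)
		return fallback;
	return (int)big;
}
```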

# Limitations

Lint operates on the level of individual expressions.

* It does not build an AST of the statements of a function, therefore it
  cannot reliably analyze the control flow in a single function.
* It does not store the control flow properties of functions, therefore it
  cannot relate parameter nullability with the return value.
* It does not have information about functions, except for their prototypes,
  therefore it cannot relate them across translation units.
* It does not store detailed information about complex data types, therefore
  it cannot cross-check them across translation units.

# Fundamental types

Lint mainly analyzes expressions (`tnode_t`), which are formed from operators
(`op_t`) and their operands (`tnode_t`).
Each node has a data type (`type_t`) and a few other properties that depend on
the operator.

## type_t

The basic types are `int`, `_Bool`, `unsigned long`, `pointer` and so on,
as defined in `tspec_t`.

Concrete types like `int` or `const char *` are created by `gettyp(INT)`,
or by deriving new types from existing types, using `block_derive_pointer`,
`block_derive_array` and `block_derive_function`.
(See [below](#memory-management) for the meaning of the prefix `block_`.)

After a type has been created, it should not be modified anymore.
Ideally all references to types would be `const`, but that's still on the
to-do list and not trivial.
In the meantime, before modifying a type,
it needs to be copied using `block_dup_type` or `expr_dup_type`.
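The copy-before-modify rule can be sketched with a heavily simplified model. Everything prefixed with `sketch_` or `SK_` below is hypothetical and only mirrors the idea; lint's real `type_t` has more members, and lint allocates types from its own memory pools instead of calling `malloc` directly.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for lint's 'type_t'. */
typedef enum { SK_INT, SK_CHAR, SK_PTR } sketch_tspec;

typedef struct sketch_type {
	sketch_tspec t;			/* the basic type */
	bool is_const;			/* a qualifier */
	const struct sketch_type *subt;	/* target of a pointer, else NULL */
} sketch_type;

/* Derives 'pointer to target' without touching the target type. */
sketch_type *
sketch_derive_pointer(const sketch_type *target)
{
	sketch_type *tp = malloc(sizeof(*tp));

	if (tp == NULL)
		abort();
	tp->t = SK_PTR;
	tp->is_const = false;
	tp->subt = target;
	return tp;
}

/* Returns a const-qualified copy, leaving the shared original intact. */
sketch_type *
sketch_add_const(const sketch_type *orig)
{
	sketch_type *tp = malloc(sizeof(*tp));

	if (tp == NULL)
		abort();
	*tp = *orig;		/* copy first ... */
	tp->is_const = true;	/* ... then modify only the copy */
	return tp;
}
```

Building `const char *` in this model means copying `char`, qualifying the copy, and deriving a pointer from it; the original `char` type stays usable by every other expression that refers to it.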

## tnode_t

When lint parses an expression,
it builds a tree of nodes representing the AST.
Each node has an operator that defines which other members may be accessed.
The operators and their properties are defined in `oper.c`.
Some examples of operators:

| Operator | Meaning                                        |
|----------|------------------------------------------------|
| CON      | compile-time constant in `u.value`             |
| NAME     | references the identifier in `u.sym`           |
| UPLUS    | the unary operator `+u.ops.left`               |
| PLUS     | the binary operator `u.ops.left + u.ops.right` |
| CALL     | a direct function call                         |
| ICALL    | an indirect function call                      |
| CVT      | an implicit conversion or an explicit cast     |

As an example, the expression `strcmp(names[i], "name")` has this internal
structure:

~~~text
 1: 'call' type 'int'
 2:   '&' type 'pointer to function(pointer to const char, pointer to const char) returning int'
 3:     'name' 'strcmp' with extern 'function(pointer to const char, pointer to const char) returning int'
 4:   'load' type 'pointer to const char'
 5:     '*' type 'pointer to const char', lvalue
 6:       '+' type 'pointer to pointer to const char'
 7:         'load' type 'pointer to pointer to const char'
 8:           'name' 'names' with auto 'pointer to pointer to const char', lvalue
 9:         '*' type 'long'
10:           'convert' type 'long'
11:             'load' type 'int'
12:               'name' 'i' with auto 'int', lvalue
13:           'constant' type 'long', value 8
14:   'convert' type 'pointer to const char'
15:     '&' type 'pointer to char'
16:       'string' type 'array[5] of char', lvalue, "name"
~~~

| Lines       | Notes                                                       |
|-------------|-------------------------------------------------------------|
| 1, 2, 4, 14 | A function call consists of the function and its arguments. |
| 4, 14       | The arguments of a call are ordered from left to right.     |
| 5, 6        | Array access is represented as `*(left + right)`.           |
| 9, 13       | Array and struct offsets are in premultiplied form.         |
| 9           | The type `ptrdiff_t` on this platform is `long`, not `int`. |
| 13          | The size of a pointer on this platform is 8 bytes.          |

See `debug_node` for how to interpret the members of `tnode_t`.
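The representation of `names[i]` in the tree above can be spelled out in plain C. The sketch below is only an illustration of the premultiplied form; the function name is made up, and the casts through a byte pointer are there to make the byte offset visible, not as a recommended style.

```c
#include <stddef.h>

/*
 * Computes 'names[i]' the way the tree represents it: the index is
 * converted to a wider integer type, multiplied by the element size,
 * and added to the base address before the result is dereferenced.
 */
const char *
element_at(const char **names, int i)
{
	ptrdiff_t offset = (ptrdiff_t)i * (ptrdiff_t)sizeof(names[0]);
	const char *base = (const char *)names;

	return *(const char *const *)(base + offset);
}
```

On the platform from the example, `sizeof(names[0])` is the constant 8 from line 13 of the tree.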

## sym_t

There is a single symbol table (`symtab`) for the whole translation unit.
This means that the same identifier may appear multiple times.
To distinguish the identifiers, each symbol has a block level.
Symbols from inner scopes are added to the beginning of the table,
so they are found first when looking for the identifier.
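The lookup order described above can be modeled with a single linked list. This is a simplified sketch under that assumption: all `sketch_` names are hypothetical, and lint's actual `sym_t` and `symtab` carry considerably more state.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for lint's 'sym_t'. */
typedef struct sketch_sym {
	const char *name;
	int block_level;	/* 0 = file scope, 1 and up = nested blocks */
	struct sketch_sym *next;
} sketch_sym;

/* Adds a symbol at the beginning, so that it shadows older ones. */
sketch_sym *
sketch_push(sketch_sym *table, const char *name, int block_level)
{
	sketch_sym *sym = malloc(sizeof(*sym));

	if (sym == NULL)
		abort();
	sym->name = name;
	sym->block_level = block_level;
	sym->next = table;
	return sym;
}

/* Returns the innermost symbol with the given name, or NULL. */
const sketch_sym *
sketch_lookup(const sketch_sym *table, const char *name)
{
	for (; table != NULL; table = table->next)
		if (strcmp(table->name, name) == 0)
			return table;
	return NULL;
}

/* Removes the symbols of the given block level and deeper. */
sketch_sym *
sketch_leave_block(sketch_sym *table, int block_level)
{
	while (table != NULL && table->block_level >= block_level) {
		sketch_sym *next = table->next;

		free(table);
		table = next;
	}
	return table;
}
```

Because inner symbols sit at the front, leaving a block only has to drop entries from the front of the list, after which the outer declaration of the same identifier becomes visible again.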

# Memory management

## Block scope

The memory that is allocated by the `block_*_alloc` functions is freed at the
end of analyzing the block, that is, after the closing `}`.
See `compound_statement_rbrace:` in `cgram.y`.

## Expression scope

The memory that is allocated by the `expr_*_alloc` functions is freed at the
end of analyzing the expression.
See `expr_free_all`.
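Both scopes follow the same pattern: allocate freely while analyzing, then release everything in one step. The sketch below models that pattern with a hypothetical pool type; the `sketch_` names are not part of lint, and lint's real allocator is organized differently.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header in front of each allocation; the union keeps the payload aligned. */
typedef union sketch_chunk {
	union sketch_chunk *next;
	max_align_t align;
} sketch_chunk;

typedef struct sketch_pool {
	sketch_chunk *chunks;	/* everything allocated from this pool */
	size_t count;		/* number of live allocations */
} sketch_pool;

/* Allocates zeroed memory whose lifetime is tied to the pool. */
void *
sketch_pool_alloc(sketch_pool *pool, size_t size)
{
	sketch_chunk *chunk = calloc(1, sizeof(*chunk) + size);

	if (chunk == NULL)
		abort();
	chunk->next = pool->chunks;
	pool->chunks = chunk;
	pool->count++;
	return chunk + 1;	/* the payload follows the header */
}

/* Frees all allocations at once, as at the end of a block or expression. */
void
sketch_pool_free_all(sketch_pool *pool)
{
	while (pool->chunks != NULL) {
		sketch_chunk *next = pool->chunks->next;

		free(pool->chunks);
		pool->chunks = next;
	}
	pool->count = 0;
}
```

The consequence for callers is that pointers into such a pool must not outlive the scope they were allocated for.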

# Null pointers

* Expressions can be null.
    * This typically happens in case of syntax errors or other errors.
* The subtype of a pointer, array or function is never null.

# Common variable names

| Name | Type      | Meaning                                              |
|------|-----------|------------------------------------------------------|
| t    | `tspec_t` | a simple type such as `INT`, `FUNC`, `PTR`           |
| tp   | `type_t`  | a complete type such as `pointer to array[3] of int` |
| stp  | `type_t`  | the subtype of a pointer, array or function          |
| tn   | `tnode_t` | a tree node, mostly used for expressions             |
| op   | `op_t`    | an operator used in an expression                    |
| ln   | `tnode_t` | the left-hand operand of a binary operator           |
| rn   | `tnode_t` | the right-hand operand of a binary operator          |
| sym  | `sym_t`   | a symbol from the symbol table                       |

# Abbreviations in variable names

| Abbr | Expanded                                     |
|------|----------------------------------------------|
| l    | left                                         |
| r    | right                                        |
| o    | old (during type conversions)                |
| n    | new (during type conversions)                |
| op   | operator                                     |
| arg  | the number of the parameter, for diagnostics |

# Debugging

Useful breakpoints are:

| Function/Code       | File    | Remarks                                              |
|---------------------|---------|------------------------------------------------------|
| build_binary        | tree.c  | Creates an expression for a unary or binary operator |
| initialization_expr | init.c  | Checks a single initializer                          |
| expr                | tree.c  | Checks a full expression                             |
| typeok              | tree.c  | Checks two types for compatibility                   |
| vwarning_at         | err.c   | Prints a warning                                     |
| verror_at           | err.c   | Prints an error                                      |
| assert_failed       | err.c   | Prints the location of a failed assertion            |
| `switch (yyn)`      | cgram.c | Reduction of a grammar rule                          |

# Tests

The tests are in `tests/usr.bin/xlint`.
By default, each test is run with the lint flags `-g` for GNU mode,
`-S` for C99 mode and `-w` to report warnings as errors.

Each test can override the lint flags using comments of the following forms:

* `/* lint1-flags: -tw */` replaces the default flags.
* `/* lint1-extra-flags: -p */` adds to the default flags.

Most tests check the diagnostics that lint generates.
They do this by placing `expect` comments near the location of the diagnostic.
The comment `/* expect+1: ... */` expects a diagnostic to be generated for the
code 1 line below, `/* expect-5: ... */` expects a diagnostic to be generated
for the code 5 lines above.
An `expect` comment cannot span multiple lines.
At the start and the end of the comment, the placeholder `...` stands for an
arbitrary sequence of characters.
There may be other code or comments in the same line of the `.c` file.
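A test file using these comments might look like the following sketch. The function names are made up, and the message text "statement not reached" is an assumption about what lint prints for unreachable code; whether lint reports further diagnostics for this code has not been verified.

```c
/* lint1-extra-flags: -p */

int
twice(int x)
{
	return 2 * x;
	/* expect+1: ... statement not reached ... */
	x++;
}

int
thrice(int x)
{
	return 3 * x;
	x++;
	/* expect-1: ... statement not reached ... */
}
```

The first function places the comment one line above the offending statement, the second one places it one line below; the `...` placeholders at both ends avoid having to spell out the severity and the message number.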

Each diagnostic has its own test `msg_???.c` that triggers the corresponding
diagnostic.
Most other tests focus on a single feature.

## Adding a new test

1. Run `make add-test NAME=test_name`.
2. Run `cd ../../../tests/usr.bin/xlint/lint1`.
3. Make the test generate the desired diagnostics.
4. Run `./accept.sh test_name` until it no longer complains.
5. Run `cd ../../..`.
6. Run `cvs commit distrib/sets/lists/tests/mi tests/usr.bin/xlint`.