[//]: # ($NetBSD: README.md,v 1.13 2023/08/02 18:51:25 rillig Exp $)

# Introduction

Lint1 analyzes a single translation unit of C code.

* It reads the output of the C preprocessor, retaining the comments.
* The lexer in `scan.l` and `lex.c` splits the input into tokens.
* The parser in `cgram.y` creates types and expressions from the tokens.
* It checks declarations in `decl.c`.
* It checks initializations in `init.c`.
* It checks types and expressions in `tree.c`.

To see how a specific lint message is triggered, read the corresponding unit
test in `tests/usr.bin/xlint/lint1/msg_???.c`.

# Features

## Type checking

Lint has stricter type checking than most C compilers.

In _strict bool mode_, lint treats `bool` as a type that is incompatible with
other scalar types, as in C#, Go and Java.
See the test `d_c99_bool_strict.c` for details.
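
As a rough illustration (not taken from the test suite), the following sketch
contrasts a function that mixes `bool` with other scalar types, which strict
bool mode would presumably reject, with a variant that only uses explicit
comparisons. The function names are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: the operands of '&&' are a pointer and a size_t,
 * and the int result is converted to bool.  This is the kind of mixing
 * that strict bool mode is about.
 */
bool
has_text_implicit(const char *s)
{
	return s && strlen(s);
}

/* The same check with explicit comparisons; every operand of '&&' is bool. */
bool
has_text_explicit(const char *s)
{
	return s != NULL && strlen(s) > 0;
}
```

Both variants behave the same at run time; only the second one keeps `bool`
separate from pointers and integers.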

Lint warns about type conversions that may result in alignment problems.
See the test `msg_135.c` for examples.
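
A minimal sketch of such a conversion, assuming a platform where `uint32_t`
needs stricter alignment than `unsigned char`. The function names are
hypothetical, and the exact diagnostic that lint prints is not shown here.

```c
#include <stdint.h>
#include <string.h>

/*
 * The cast converts 'pointer to unsigned char' (alignment 1) to
 * 'pointer to uint32_t' (typically alignment 4).  Dereferencing the
 * result is only valid if 'p' really points to a uint32_t object.
 */
uint32_t
read_u32_cast(const unsigned char *p)
{
	return *(const uint32_t *)p;
}

/* Copying the bytes avoids the pointer conversion altogether. */
uint32_t
read_u32_copy(const unsigned char *p)
{
	uint32_t value;

	memcpy(&value, p, sizeof(value));
	return value;
}
```

The `memcpy` variant works for any byte pointer, regardless of its alignment.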

## Control flow analysis

Lint roughly tracks the control flow inside a single function.
It does not follow `goto` statements precisely, though;
instead, it assumes that each label is reachable.
See the test `msg_193.c` for examples.
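
The following hypothetical function illustrates the situation. The code after
the first `return` can only be reached through the label `again`. Based on
the description above, lint does not trace the jumps but simply treats the
labeled statement as reachable.

```c
/* Hypothetical example; count how many times the loop body runs. */
int
count_down(int n)
{
	int steps = 0;

	if (n > 0)
		goto again;
	return steps;

again:	/* only reachable via 'goto', never by falling through */
	steps++;
	if (--n > 0)
		goto again;
	return steps;
}
```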

## Error handling

Lint tries to continue parsing and checking even after seeing errors.
This part of lint is not robust though, so expect some crashes here,
as variables may not be properly initialized or be null pointers.
The cleanup after handling a parse error is often incomplete.

## Configurable diagnostic messages

Whether lint prints a message and whether each message is an error, a warning
or just informational depends on several things:

* The language level, with its possible values:
    * traditional C (`-t`)
    * migration from traditional C to C90 (default)
    * C90 (`-s`)
    * C99 (`-S`)
    * C11 (`-Ac11`)
* In GCC mode (`-g`), lint allows several GNU extensions,
  reducing the number of printed messages.
* In strict bool mode (`-T`), lint issues errors when `bool` is mixed with
  other scalar types, reusing the existing messages 107 and 211, while also
  defining new messages that are specific to strict bool mode.
* The option `-a` performs the check for lossy conversions from large integer
  types; the option `-aa` extends this check to small integer types as well,
  reusing the same message ID.
* The option `-X` suppresses arbitrary messages by their message ID.
* The option `-q` enables additional queries that are not suitable as regular
  warnings but may be interesting to look at on a case-by-case basis.
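
To make the difference between `-a` and `-aa` more concrete, here is a sketch
of the two kinds of conversions, under the assumption that `long` counts as a
large integer type and `int` as a small one. The function names are
hypothetical, and which diagnostics lint actually prints for them has not
been verified.

```c
/* A conversion from a large integer type: 'long' to 'int'. */
int
narrow_long(long value)
{
	int result = value;	/* the value may not fit into an int */

	return result;
}

/* A conversion between small integer types: 'int' to 'signed char'. */
signed char
narrow_int(int value)
{
	signed char result = value;	/* the value may not fit either */

	return result;
}
```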

# Limitations

Lint operates on the level of individual expressions.

* It does not build an AST of the statements of a function, therefore it
  cannot reliably analyze the control flow in a single function.
* It does not store the control flow properties of functions, therefore it
  cannot relate parameter nullability with the return value.
* It does not have information about functions, except for their prototypes,
  therefore it cannot relate them across translation units.
* It does not store detailed information about complex data types, therefore
  it cannot cross-check them across translation units.

# Fundamental types

Lint mainly analyzes expressions (`tnode_t`), which are formed from operators
(`op_t`) and their operands (`tnode_t`).
Each node has a data type (`type_t`) and a few other properties that depend on
the operator.

## type_t

The basic types are `int`, `_Bool`, `unsigned long`, `pointer` and so on,
as defined in `tspec_t`.

Concrete types like `int` or `const char *` are created by `gettyp(INT)`,
or by deriving new types from existing types, using `block_derive_pointer`,
`block_derive_array` and `block_derive_function`.
(See [below](#memory-management) for the meaning of the prefix `block_`.)

After a type has been created, it should not be modified anymore.
Ideally all references to types would be `const`, but that's still on the
to-do list and not trivial.
In the meantime, before modifying a type,
it needs to be copied using `block_dup_type` or `expr_dup_type`.
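
The copy-before-modify rule can be sketched with a heavily simplified model.
All names starting with `model_` are hypothetical stand-ins and do not match
lint's actual definitions of `tspec_t`, `type_t` or the `*_derive_*` and
`*_dup_type` functions.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for lint's tspec_t and type_t. */
typedef enum { T_INT, T_CHAR, T_PTR } model_tspec;

typedef struct model_type {
	model_tspec t;			/* the basic type */
	bool is_const;
	const struct model_type *subt;	/* pointee for T_PTR, else NULL */
} model_type;

/* Derive 'pointer to stp' without touching 'stp' itself. */
model_type *
model_derive_pointer(const model_type *stp)
{
	model_type *tp = calloc(1, sizeof(*tp));

	if (tp == NULL)
		abort();
	tp->t = T_PTR;
	tp->subt = stp;
	return tp;
}

/* Copy a type so that the copy can be modified; the original stays intact. */
model_type *
model_dup_type(const model_type *tp)
{
	model_type *copy = malloc(sizeof(*copy));

	if (copy == NULL)
		abort();
	*copy = *tp;
	return copy;
}
```

In this model, `const char *` is built by duplicating the type `char`,
setting `is_const` on the copy, and deriving a pointer from that copy.
The shared `char` type is never modified.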

## tnode_t

When lint parses an expression,
it builds a tree of nodes representing the AST.
Each node has an operator that defines which other members may be accessed.
The operators and their properties are defined in `ops.def`.
Some examples of operators:

| Operator | Meaning                                    |
|----------|--------------------------------------------|
| CON      | compile-time constant in `tn_val`          |
| NAME     | references the identifier in `tn_sym`      |
| UPLUS    | the unary operator `+tn_left`              |
| PLUS     | the binary operator `tn_left + tn_right`   |
| CALL     | a direct function call                     |
| ICALL    | an indirect function call                  |
| CVT      | an implicit conversion or an explicit cast |

As an example, the expression `strcmp(names[i], "name")` has this internal
structure:

~~~text
 1: 'call' type 'int'
 2:   '&' type 'pointer to function(pointer to const char, pointer to const char) returning int'
 3:     'name' 'strcmp' with extern 'function(pointer to const char, pointer to const char) returning int'
 4:   'push' type 'pointer to const char'
 5:     'convert' type 'pointer to const char'
 6:       '&' type 'pointer to char'
 7:         'string' type 'array[5] of char', lvalue, length 4, "name"
 8:     'push' type 'pointer to const char'
 9:       'load' type 'pointer to const char'
10:         '*' type 'pointer to const char', lvalue
11:           '+' type 'pointer to pointer to const char'
12:             'load' type 'pointer to pointer to const char'
13:               'name' 'names' with auto 'pointer to pointer to const char', lvalue
14:             '*' type 'long'
15:               'convert' type 'long'
16:                 'load' type 'int'
17:                   'name' 'i' with auto 'int', lvalue
18:               'constant' type 'long', value 8
~~~

| Lines  | Notes                                                            |
|--------|------------------------------------------------------------------|
| 4, 8   | Each argument of the function call corresponds to a `PUSH` node. |
| 5, 9   | The left operand of a `PUSH` node is the actual argument.        |
| 8      | The right operand is the `PUSH` node of the previous argument.   |
| 5, 9   | The arguments of a call are ordered from right to left.          |
| 10, 11 | Array access is represented as `*(left + right)`.                |
| 14, 18 | Array and struct offsets are in premultiplied form.              |
| 18     | The size of a pointer on this platform is 8 bytes.               |

See `debug_node` for how to interpret the members of `tnode_t`.
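
The premultiplied form from lines 14 and 18 of the tree can be mirrored in
plain C. The following sketch uses a hypothetical function `element_at` that
computes `names[i]` by multiplying the index with the element size first and
then adding the result as a byte offset. It only illustrates the arithmetic
and is not code from lint.

```c
#include <stddef.h>

/*
 * Source-level equivalent of names[i]: the index is multiplied by the
 * size of an element, sizeof(const char *), before it is added to the
 * base address as a number of bytes.
 */
const char *
element_at(const char **names, int i)
{
	const char *base = (const char *)names;
	long offset = (long)i * (long)sizeof(*names);

	return *(const char *const *)(base + offset);
}
```

On a platform with 8-byte pointers, `sizeof(*names)` is the constant 8 from
line 18.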

## sym_t

There is a single symbol table (`symtab`) for the whole translation unit.
This means that the same identifier may appear multiple times.
To distinguish the identifiers, each symbol has a block level.
Symbols from inner scopes are added to the beginning of the table,
so they are found first when looking for the identifier.
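
A minimal sketch of this lookup order, assuming a plain singly linked list.
The `model_` names are hypothetical, and lint's real `symtab` may be
organized differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified model of a symbol; not lint's sym_t. */
typedef struct model_sym {
	const char *name;
	int block_level;	/* 0 = file scope, 1 and above = nested blocks */
	struct model_sym *next;
} model_sym;

/* Add a symbol at the beginning of the table. */
void
model_symtab_add(model_sym **table, model_sym *sym)
{
	sym->next = *table;
	*table = sym;
}

/* Return the first symbol with the given name, or NULL if there is none. */
model_sym *
model_symtab_search(model_sym *table, const char *name)
{
	for (model_sym *sym = table; sym != NULL; sym = sym->next)
		if (strcmp(sym->name, name) == 0)
			return sym;
	return NULL;
}
```

Because the inner `i` is added after the outer `i`, it sits closer to the
beginning of the list, so a search for `i` finds it first.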

# Memory management

## Block scope

The memory that is allocated by the `block_*_alloc` functions is freed at the
end of analyzing the block, that is, after the closing `}`.
See `compound_statement_rbrace:` in `cgram.y`.

## Expression scope

The memory that is allocated by the `expr_*_alloc` functions is freed at the
end of analyzing the expression.
See `expr_free_all`.
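
Both scopes follow the same pattern: allocations are collected and then
released together. The sketch below shows that pattern with a hypothetical
arena type. The names `model_arena_alloc` and `model_arena_free_all` are
invented for this example and do not describe how lint implements its
allocators.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical header in front of each allocation. */
typedef union model_chunk {
	union model_chunk *next;
	max_align_t align;	/* keeps the payload suitably aligned */
} model_chunk;

typedef struct {
	model_chunk *chunks;	/* all allocations of the current scope */
	size_t live;		/* number of allocations not yet freed */
} model_arena;

/* Allocate zeroed memory that lives until model_arena_free_all. */
void *
model_arena_alloc(model_arena *arena, size_t size)
{
	model_chunk *chunk = calloc(1, sizeof(*chunk) + size);

	if (chunk == NULL)
		abort();
	chunk->next = arena->chunks;
	arena->chunks = chunk;
	arena->live++;
	return chunk + 1;
}

/* Free everything at once, for example at the end of an expression. */
void
model_arena_free_all(model_arena *arena)
{
	while (arena->chunks != NULL) {
		model_chunk *next = arena->chunks->next;

		free(arena->chunks);
		arena->chunks = next;
		arena->live--;
	}
}
```

With this design, the code that allocates a node never has to free it
individually; it only has to pick the arena whose lifetime matches.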

# Null pointers

* Expressions can be null.
    * This typically happens in case of syntax errors or other errors.
* The subtype of a pointer, array or function is never null.
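
In practice, this means that code working on expressions should check for
null before accessing any member. A minimal sketch, using a hypothetical node
type instead of lint's `tnode_t`:

```c
#include <stddef.h>

/* Hypothetical, simplified expression node; not lint's tnode_t. */
typedef struct model_node {
	int op;
	const struct model_node *left;
	const struct model_node *right;
} model_node;

/*
 * Count the nodes of an expression.  A null expression, as may be left
 * behind after a syntax error, counts as zero nodes instead of crashing.
 */
int
model_count_nodes(const model_node *tn)
{
	if (tn == NULL)
		return 0;
	return 1 + model_count_nodes(tn->left) + model_count_nodes(tn->right);
}
```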

# Common variable names

| Name | Type      | Meaning                                              |
|------|-----------|------------------------------------------------------|
| t    | `tspec_t` | a simple type such as `INT`, `FUNC`, `PTR`           |
| tp   | `type_t`  | a complete type such as `pointer to array[3] of int` |
| stp  | `type_t`  | the subtype of a pointer, array or function          |
| tn   | `tnode_t` | a tree node, mostly used for expressions             |
| op   | `op_t`    | an operator used in an expression                    |
| ln   | `tnode_t` | the left-hand operand of a binary operator           |
| rn   | `tnode_t` | the right-hand operand of a binary operator          |
| sym  | `sym_t`   | a symbol from the symbol table                       |

# Abbreviations in variable names

| Abbr | Expanded                                     |
|------|----------------------------------------------|
| l    | left                                         |
| r    | right                                        |
| o    | old (during type conversions)                |
| n    | new (during type conversions)                |
| op   | operator                                     |
| arg  | the number of the parameter, for diagnostics |

# Debugging

Useful breakpoints are:

| Function/Code       | File    | Remarks                                              |
|---------------------|---------|------------------------------------------------------|
| build_binary        | tree.c  | Creates an expression for a unary or binary operator |
| initialization_expr | init.c  | Checks a single initializer                          |
| expr                | tree.c  | Checks a full expression                             |
| typeok              | tree.c  | Checks two types for compatibility                   |
| vwarning_at         | err.c   | Prints a warning                                     |
| verror_at           | err.c   | Prints an error                                      |
| assert_failed       | err.c   | Prints the location of a failed assertion            |
| `switch (yyn)`      | cgram.c | Reduction of a grammar rule                          |

# Tests

The tests are in `tests/usr.bin/xlint`.
By default, each test is run with the lint flags `-g` for GNU mode,
`-S` for C99 mode and `-w` to report warnings as errors.

Each test can override the lint flags using comments of the following forms:

* `/* lint1-flags: -tw */` replaces the default flags.
* `/* lint1-extra-flags: -p */` adds to the default flags.

Most tests check the diagnostics that lint generates.
They do this by placing `expect` comments near the location of the diagnostic.
The comment `/* expect+1: ... */` expects a diagnostic to be generated for the
code 1 line below, `/* expect-5: ... */` expects a diagnostic to be generated
for the code 5 lines above.
An `expect` comment cannot span multiple lines.
At the start and the end of the comment, the placeholder `...` stands for an
arbitrary sequence of characters.
There may be other code or comments in the same line of the `.c` file.
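
A test input using such a comment might look like the following sketch.
The function is hypothetical, and the message text in the `expect` comment is
an assumption that has not been checked against lint's actual output.

```c
/* Hypothetical test input; the expected message text is an assumption. */
int
reached_or_not(int x)
{
	if (x > 0)
		return 1;
	return 0;
	/* expect+1: warning: statement not reached [193] */
	return 2;
}
```

According to the rules above, a comment `/* expect-1: ... */` placed on the
line directly after `return 2;` would refer to the same line of code.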

Each diagnostic has its own test `msg_???.c` that triggers the corresponding
diagnostic.
Most other tests focus on a single feature.

## Adding a new test

1. Run `make add-test NAME=test_name`.
2. Run `cd ../../../tests/usr.bin/xlint/lint1`.
3. Make the test generate the desired diagnostics.
4. Run `./accept.sh test_name` until it no longer complains.
5. Run `cd ../../../..` to return to the top of the source tree.
6. Run `cvs commit distrib/sets/lists/tests/mi tests/usr.bin/xlint`.