Home | History | Annotate | Line # | Download | only in doc
      1 \input texinfo
      2 @setfilename cppinternals.info
      3 @settitle The GNU C Preprocessor Internals
      4 
      5 @include gcc-common.texi
      6 
      7 @ifinfo
      8 @dircategory Software development
      9 @direntry
     10 * Cpplib: (cppinternals).      Cpplib internals.
     11 @end direntry
     12 @end ifinfo
     13 
     14 @c @smallbook
     15 @c @cropmarks
     16 @c @finalout
     17 @setchapternewpage odd
     18 @ifinfo
     19 This file documents the internals of the GNU C Preprocessor.
     20 
     21 Copyright (C) 2000-2022 Free Software Foundation, Inc.
     22 
     23 Permission is granted to make and distribute verbatim copies of
     24 this manual provided the copyright notice and this permission notice
     25 are preserved on all copies.
     26 
     27 @ignore
     28 Permission is granted to process this file through Tex and print the
     29 results, provided the printed document carries copying permission
     30 notice identical to this one except for the removal of this paragraph
     31 (this paragraph not being relevant to the printed manual).
     32 
     33 @end ignore
     34 Permission is granted to copy and distribute modified versions of this
     35 manual under the conditions for verbatim copying, provided also that
     36 the entire resulting derived work is distributed under the terms of a
     37 permission notice identical to this one.
     38 
     39 Permission is granted to copy and distribute translations of this manual
     40 into another language, under the above conditions for modified versions.
     41 @end ifinfo
     42 
     43 @titlepage
     44 @title Cpplib Internals
     45 @versionsubtitle
     46 @author Neil Booth
     47 @page
     48 @vskip 0pt plus 1filll
     49 @c man begin COPYRIGHT
     50 Copyright @copyright{} 2000-2022 Free Software Foundation, Inc.
     51 
     52 Permission is granted to make and distribute verbatim copies of
     53 this manual provided the copyright notice and this permission notice
     54 are preserved on all copies.
     55 
     56 Permission is granted to copy and distribute modified versions of this
     57 manual under the conditions for verbatim copying, provided also that
     58 the entire resulting derived work is distributed under the terms of a
     59 permission notice identical to this one.
     60 
     61 Permission is granted to copy and distribute translations of this manual
     62 into another language, under the above conditions for modified versions.
     63 @c man end
     64 @end titlepage
     65 @contents
     66 @page
     67 
     68 @ifnottex
     69 @node Top
     70 @top
     71 @chapter Cpplib---the GNU C Preprocessor
     72 
     73 The GNU C preprocessor is
     74 implemented as a library, @dfn{cpplib}, so it can be easily shared between
     75 a stand-alone preprocessor, and a preprocessor integrated with the C,
     76 C++ and Objective-C front ends.  It is also available for use by other
     77 programs, though this is not recommended as its exposed interface has
     78 not yet reached a point of reasonable stability.
     79 
     80 The library has been written to be re-entrant, so that it can be used
     81 to preprocess many files simultaneously if necessary.  It has also been
     82 written with the preprocessing token as the fundamental unit; the
     83 preprocessor in previous versions of GCC would operate on text strings
     84 as the fundamental unit.
     85 
     86 This brief manual documents the internals of cpplib, and explains some
     87 of the tricky issues.  It is intended that, along with the comments in
     88 the source code, a reasonably competent C programmer should be able to
     89 figure out what the code is doing, and why things have been implemented
     90 the way they have.
     91 
     92 @menu
     93 * Conventions::         Conventions used in the code.
     94 * Lexer::               The combined C, C++ and Objective-C Lexer.
     95 * Hash Nodes::          All identifiers are entered into a hash table.
     96 * Macro Expansion::     Macro expansion algorithm.
     97 * Token Spacing::       Spacing and paste avoidance issues.
     98 * Line Numbering::      Tracking location within files.
     99 * Guard Macros::        Optimizing header files with guard macros.
    100 * Files::               File handling.
    101 * Concept Index::       Index.
    102 @end menu
    103 @end ifnottex
    104 
    105 @node Conventions
    106 @unnumbered Conventions
    107 @cindex interface
    108 @cindex header files
    109 
    110 cpplib has two interfaces---one is exposed internally only, and the
    111 other is for both internal and external use.
    112 
    113 The convention is that functions and types that are exposed to multiple
    114 files internally are prefixed with @samp{_cpp_}, and are to be found in
    115 the file @file{internal.h}.  Functions and types exposed to external
    116 clients are in @file{cpplib.h}, and prefixed with @samp{cpp_}.  For
    117 historical reasons this is no longer quite true, but we should strive to
    118 stick to it.
    119 
    120 We are striving to reduce the information exposed in @file{cpplib.h} to the
    121 bare minimum necessary, and then to keep it there.  This makes clear
    122 exactly what external clients are entitled to assume, and allows us to
    123 change internals in the future without worrying whether library clients
    124 are perhaps relying on some kind of undocumented implementation-specific
    125 behavior.
    126 
    127 @node Lexer
    128 @unnumbered The Lexer
    129 @cindex lexer
    130 @cindex newlines
    131 @cindex escaped newlines
    132 
    133 @section Overview
    134 The lexer is contained in the file @file{lex.cc}.  It is a hand-coded
    135 lexer, and not implemented as a state machine.  It can understand C, C++
    136 and Objective-C source code, and has been extended to allow reasonably
    137 successful preprocessing of assembly language.  The lexer does not make
    138 an initial pass to strip out trigraphs and escaped newlines, but handles
    139 them as they are encountered in a single pass of the input file.  It
    140 returns preprocessing tokens individually, not a line at a time.
    141 
    142 It is mostly transparent to users of the library, since the library's
    143 interface for obtaining the next token, @code{cpp_get_token}, takes care
    144 of lexing new tokens, handling directives, and expanding macros as
    145 necessary.  However, the lexer does expose some functionality so that
    146 clients of the library can easily spell a given token, such as
    147 @code{cpp_spell_token} and @code{cpp_token_len}.  These functions are
    148 useful when generating diagnostics, and for emitting the preprocessed
    149 output.
    150 
    151 @section Lexing a token
    152 Lexing of an individual token is handled by @code{_cpp_lex_direct} and
    153 its subroutines.  In its current form the code is quite complicated,
    154 with read ahead characters and such-like, since it strives to not step
    155 back in the character stream in preparation for handling non-ASCII file
    156 encodings.  The current plan is to convert any such files to UTF-8
    157 before processing them.  This complexity is therefore unnecessary and
    158 will be removed, so I'll not discuss it further here.
    159 
    160 The job of @code{_cpp_lex_direct} is simply to lex a token.  It is not
    161 responsible for issues like directive handling, returning lookahead
    162 tokens directly, multiple-include optimization, or conditional block
    163 skipping.  It necessarily has a minor r@^ole to play in memory
    164 management of lexed lines.  I discuss these issues in a separate section
    165 (@pxref{Lexing a line}).
    166 
    167 The lexer places the token it lexes into storage pointed to by the
    168 variable @code{cur_token}, and then increments it.  This variable is
    169 important for correct diagnostic positioning.  Unless a specific line
    170 and column are passed to the diagnostic routines, they will examine the
    171 @code{line} and @code{col} values of the token just before the location
    172 that @code{cur_token} points to, and use that location to report the
    173 diagnostic.
    174 
    175 The lexer does not consider whitespace to be a token in its own right.
    176 If whitespace (other than a new line) precedes a token, it sets the
    177 @code{PREV_WHITE} bit in the token's flags.  Each token has its
    178 @code{line} and @code{col} variables set to the line and column of the
    179 first character of the token.  This line number is the line number in
    180 the translation unit, and can be converted to a source (file, line) pair
    181 using the line map code.
    182 
    183 The first token on a logical, i.e.@: unescaped, line has the flag
    184 @code{BOL} set for beginning-of-line.  This flag is intended for
    185 internal use, both to distinguish a @samp{#} that begins a directive
    186 from one that doesn't, and to generate a call-back to clients that want
    187 to be notified about the start of every non-directive line with tokens
    188 on it.  Clients cannot reliably determine this for themselves: the first
    189 token might be a macro, and the tokens of a macro expansion do not have
    190 the @code{BOL} flag set.  The macro expansion may even be empty, and the
    191 next token on the line certainly won't have the @code{BOL} flag set.
    192 
    193 New lines are treated specially; exactly how the lexer handles them is
    194 context-dependent.  The C standard mandates that directives are
    195 terminated by the first unescaped newline character, even if it appears
    196 in the middle of a macro expansion.  Therefore, if the state variable
    197 @code{in_directive} is set, the lexer returns a @code{CPP_EOF} token,
    198 which is normally used to indicate end-of-file, to indicate
    199 end-of-directive.  In a directive a @code{CPP_EOF} token never means
    200 end-of-file.  Conveniently, if the caller was @code{collect_args}, it
    201 already handles @code{CPP_EOF} as if it were end-of-file, and reports an
    202 error about an unterminated macro argument list.
    203 
    204 The C standard also specifies that a new line in the middle of the
    205 arguments to a macro is treated as whitespace.  This white space is
    206 important in case the macro argument is stringized.  The state variable
    207 @code{parsing_args} is nonzero when the preprocessor is collecting the
    208 arguments to a macro call.  It is set to 1 when looking for the opening
    209 parenthesis to a function-like macro, and 2 when collecting the actual
    210 arguments up to the closing parenthesis, since these two cases need to
    211 be distinguished sometimes.  One such time is here: the lexer sets the
    212 @code{PREV_WHITE} flag of a token if it meets a new line when
    213 @code{parsing_args} is set to 2.  It doesn't set it if it meets a new
    214 line when @code{parsing_args} is 1, since then code like
    215 
    216 @smallexample
    217 #define foo() bar
    218 foo
    219 baz
    220 @end smallexample
    221 
    222 @noindent would be output with an erroneous space before @samp{baz}:
    223 
    224 @smallexample
    225 foo
    226  baz
    227 @end smallexample
    228 
    229 This is a good example of the subtlety of getting token spacing correct
    230 in the preprocessor; there are plenty of tests in the testsuite for
    231 corner cases like this.
    232 
    233 The lexer is written to treat each of @samp{\r}, @samp{\n}, @samp{\r\n}
    234 and @samp{\n\r} as a single new line indicator.  This allows it to
    235 transparently preprocess MS-DOS, Macintosh and Unix files without their
    236 needing to pass through a special filter beforehand.
    237 
    238 We also decided to treat a backslash, either @samp{\} or the trigraph
    239 @samp{??/}, separated from one of the above newline indicators by
    240 non-comment whitespace only, as intending to escape the newline.  It
    241 tends to be a typing mistake, and cannot reasonably be mistaken for
    242 anything else in any of the C-family grammars.  Since handling it this
    243 way is not strictly conforming to the ISO standard, the library issues a
    244 warning wherever it encounters it.
    245 
    246 Handling newlines like this is made simpler by doing it in one place
    247 only.  The function @code{handle_newline} takes care of all newline
    248 characters, and @code{skip_escaped_newlines} takes care of arbitrarily
    249 long sequences of escaped newlines, deferring to @code{handle_newline}
    250 to handle the newlines themselves.
    251 
    252 The most painful aspect of lexing ISO-standard C and C++ is handling
    253 trigraphs and backlash-escaped newlines.  Trigraphs are processed before
    254 any interpretation of the meaning of a character is made, and unfortunately
    255 there is a trigraph representation for a backslash, so it is possible for
    256 the trigraph @samp{??/} to introduce an escaped newline.
    257 
    258 Escaped newlines are tedious because theoretically they can occur
    259 anywhere---between the @samp{+} and @samp{=} of the @samp{+=} token,
    260 within the characters of an identifier, and even between the @samp{*}
    261 and @samp{/} that terminates a comment.  Moreover, you cannot be sure
    262 there is just one---there might be an arbitrarily long sequence of them.
    263 
    264 So, for example, the routine that lexes a number, @code{parse_number},
    265 cannot assume that it can scan forwards until the first non-number
    266 character and be done with it, because this could be the @samp{\}
    267 introducing an escaped newline, or the @samp{?} introducing the trigraph
    268 sequence that represents the @samp{\} of an escaped newline.  If it
    269 encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}
    270 to skip over any potential escaped newlines before checking whether the
    271 number has been finished.
    272 
    273 Similarly code in the main body of @code{_cpp_lex_direct} cannot simply
    274 check for a @samp{=} after a @samp{+} character to determine whether it
    275 has a @samp{+=} token; it needs to be prepared for an escaped newline of
    276 some sort.  Such cases use the function @code{get_effective_char}, which
    277 returns the first character after any intervening escaped newlines.
    278 
    279 The lexer needs to keep track of the correct column position, including
    280 counting tabs as specified by the @option{-ftabstop=} option.  This
    281 should be done even within C-style comments; they can appear in the
    282 middle of a line, and we want to report diagnostics in the correct
    283 position for text appearing after the end of the comment.
    284 
    285 @anchor{Invalid identifiers}
    286 Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
    287 may be invalid and require a diagnostic.  However, if they appear in a
    288 macro expansion we don't want to complain with each use of the macro.
    289 It is therefore best to catch them during the lexing stage, in
    290 @code{parse_identifier}.  In both cases, whether a diagnostic is needed
    291 or not is dependent upon the lexer's state.  For example, we don't want
    292 to issue a diagnostic for re-poisoning a poisoned identifier, or for
    293 using @code{__VA_ARGS__} in the expansion of a variable-argument macro.
    294 Therefore @code{parse_identifier} makes use of state flags to determine
    295 whether a diagnostic is appropriate.  Since we change state on a
    296 per-token basis, and don't lex whole lines at a time, this is not a
    297 problem.
    298 
    299 Another place where state flags are used to change behavior is whilst
    300 lexing header names.  Normally, a @samp{<} would be lexed as a single
    301 token.  After a @code{#include} directive, though, it should be lexed as
    302 a single token as far as the nearest @samp{>} character.  Note that we
    303 don't allow the terminators of header names to be escaped; the first
    304 @samp{"} or @samp{>} terminates the header name.
    305 
    306 Interpretation of some character sequences depends upon whether we are
    307 lexing C, C++ or Objective-C, and on the revision of the standard in
    308 force.  For example, @samp{::} is a single token in C++, but in C it is
    309 two separate @samp{:} tokens and almost certainly a syntax error.  Such
    310 cases are handled by @code{_cpp_lex_direct} based upon command-line
    311 flags stored in the @code{cpp_options} structure.
    312 
    313 Once a token has been lexed, it leads an independent existence.  The
    314 spelling of numbers, identifiers and strings is copied to permanent
    315 storage from the original input buffer, so a token remains valid and
    316 correct even if its source buffer is freed with @code{_cpp_pop_buffer}.
    317 The storage holding the spellings of such tokens remains until the
    318 client program calls cpp_destroy, probably at the end of the translation
    319 unit.
    320 
    321 @anchor{Lexing a line}
    322 @section Lexing a line
    323 @cindex token run
    324 
    325 When the preprocessor was changed to return pointers to tokens, one
    326 feature I wanted was some sort of guarantee regarding how long a
    327 returned pointer remains valid.  This is important to the stand-alone
    328 preprocessor, the future direction of the C family front ends, and even
    329 to cpplib itself internally.
    330 
    331 Occasionally the preprocessor wants to be able to peek ahead in the
    332 token stream.  For example, after the name of a function-like macro, it
    333 wants to check the next token to see if it is an opening parenthesis.
    334 Another example is that, after reading the first few tokens of a
    335 @code{#pragma} directive and not recognizing it as a registered pragma,
    336 it wants to backtrack and allow the user-defined handler for unknown
    337 pragmas to access the full @code{#pragma} token stream.  The stand-alone
    338 preprocessor wants to be able to test the current token with the
    339 previous one to see if a space needs to be inserted to preserve their
    340 separate tokenization upon re-lexing (paste avoidance), so it needs to
    341 be sure the pointer to the previous token is still valid.  The
    342 recursive-descent C++ parser wants to be able to perform tentative
    343 parsing arbitrarily far ahead in the token stream, and then to be able
    344 to jump back to a prior position in that stream if necessary.
    345 
    346 The rule I chose, which is fairly natural, is to arrange that the
    347 preprocessor lex all tokens on a line consecutively into a token buffer,
    348 which I call a @dfn{token run}, and when meeting an unescaped new line
    349 (newlines within comments do not count either), to start lexing back at
    350 the beginning of the run.  Note that we do @emph{not} lex a line of
    351 tokens at once; if we did that @code{parse_identifier} would not have
    352 state flags available to warn about invalid identifiers (@pxref{Invalid
    353 identifiers}).
    354 
    355 In other words, accessing tokens that appeared earlier in the current
    356 line is valid, but since each logical line overwrites the tokens of the
    357 previous line, tokens from prior lines are unavailable.  In particular,
    358 since a directive only occupies a single logical line, this means that
    359 the directive handlers like the @code{#pragma} handler can jump around
    360 in the directive's tokens if necessary.
    361 
    362 Two issues remain: what about tokens that arise from macro expansions,
    363 and what happens when we have a long line that overflows the token run?
    364 
    365 Since we promise clients that we preserve the validity of pointers that
    366 we have already returned for tokens that appeared earlier in the line,
    367 we cannot reallocate the run.  Instead, on overflow it is expanded by
    368 chaining a new token run on to the end of the existing one.
    369 
    370 The tokens forming a macro's replacement list are collected by the
    371 @code{#define} handler, and placed in storage that is only freed by
    372 @code{cpp_destroy}.  So if a macro is expanded in the line of tokens,
    373 the pointers to the tokens of its expansion that are returned will always
    374 remain valid.  However, macros are a little trickier than that, since
    375 they give rise to three sources of fresh tokens.  They are the built-in
    376 macros like @code{__LINE__}, and the @samp{#} and @samp{##} operators
    377 for stringizing and token pasting.  I handled this by allocating
    378 space for these tokens from the lexer's token run chain.  This means
    379 they automatically receive the same lifetime guarantees as lexed tokens,
    380 and we don't need to concern ourselves with freeing them.
    381 
    382 Lexing into a line of tokens solves some of the token memory management
    383 issues, but not all.  The opening parenthesis after a function-like
    384 macro name might lie on a different line, and the front ends definitely
    385 want the ability to look ahead past the end of the current line.  So
    386 cpplib only moves back to the start of the token run at the end of a
    387 line if the variable @code{keep_tokens} is zero.  Line-buffering is
    388 quite natural for the preprocessor, and as a result the only time cpplib
    389 needs to increment this variable is whilst looking for the opening
    390 parenthesis to, and reading the arguments of, a function-like macro.  In
    391 the near future cpplib will export an interface to increment and
    392 decrement this variable, so that clients can share full control over the
    393 lifetime of token pointers too.
    394 
    395 The routine @code{_cpp_lex_token} handles moving to new token runs,
    396 calling @code{_cpp_lex_direct} to lex new tokens, or returning
    397 previously-lexed tokens if we stepped back in the token stream.  It also
    398 checks each token for the @code{BOL} flag, which might indicate a
    399 directive that needs to be handled, or require a start-of-line call-back
    400 to be made.  @code{_cpp_lex_token} also handles skipping over tokens in
    401 failed conditional blocks, and invalidates the control macro of the
    402 multiple-include optimization if a token was successfully lexed outside
    403 a directive.  In other words, its callers do not need to concern
    404 themselves with such issues.
    405 
    406 @node Hash Nodes
    407 @unnumbered Hash Nodes
    408 @cindex hash table
    409 @cindex identifiers
    410 @cindex macros
    411 @cindex assertions
    412 @cindex named operators
    413 
    414 When cpplib encounters an ``identifier'', it generates a hash code for
    415 it and stores it in the hash table.  By ``identifier'' we mean tokens
    416 with type @code{CPP_NAME}; this includes identifiers in the usual C
    417 sense, as well as keywords, directive names, macro names and so on.  For
    418 example, all of @code{pragma}, @code{int}, @code{foo} and
    419 @code{__GNUC__} are identifiers and hashed when lexed.
    420 
    421 Each node in the hash table contain various information about the
    422 identifier it represents.  For example, its length and type.  At any one
    423 time, each identifier falls into exactly one of three categories:
    424 
    425 @itemize @bullet
    426 @item Macros
    427 
    428 These have been declared to be macros, either on the command line or
    429 with @code{#define}.  A few, such as @code{__TIME__} are built-ins
    430 entered in the hash table during initialization.  The hash node for a
    431 normal macro points to a structure with more information about the
    432 macro, such as whether it is function-like, how many arguments it takes,
    433 and its expansion.  Built-in macros are flagged as special, and instead
    434 contain an enum indicating which of the various built-in macros it is.
    435 
    436 @item Assertions
    437 
    438 Assertions are in a separate namespace to macros.  To enforce this, cpp
    439 actually prepends a @code{#} character before hashing and entering it in
    440 the hash table.  An assertion's node points to a chain of answers to
    441 that assertion.
    442 
    443 @item Void
    444 
    445 Everything else falls into this category---an identifier that is not
    446 currently a macro, or a macro that has since been undefined with
    447 @code{#undef}.
    448 
    449 When preprocessing C++, this category also includes the named operators,
    450 such as @code{xor}.  In expressions these behave like the operators they
    451 represent, but in contexts where the spelling of a token matters they
    452 are spelt differently.  This spelling distinction is relevant when they
    453 are operands of the stringizing and pasting macro operators @code{#} and
    454 @code{##}.  Named operator hash nodes are flagged, both to catch the
    455 spelling distinction and to prevent them from being defined as macros.
    456 @end itemize
    457 
    458 The same identifiers share the same hash node.  Since each identifier
    459 token, after lexing, contains a pointer to its hash node, this is used
    460 to provide rapid lookup of various information.  For example, when
    461 parsing a @code{#define} statement, CPP flags each argument's identifier
    462 hash node with the index of that argument.  This makes duplicated
    463 argument checking an O(1) operation for each argument.  Similarly, for
    464 each identifier in the macro's expansion, lookup to see if it is an
    465 argument, and which argument it is, is also an O(1) operation.  Further,
    466 each directive name, such as @code{endif}, has an associated directive
    467 enum stored in its hash node, so that directive lookup is also O(1).
    468 
    469 @node Macro Expansion
    470 @unnumbered Macro Expansion Algorithm
    471 @cindex macro expansion
    472 
    473 Macro expansion is a tricky operation, fraught with nasty corner cases
    474 and situations that render what you thought was a nifty way to
    475 optimize the preprocessor's expansion algorithm wrong in quite subtle
    476 ways.
    477 
    478 I strongly recommend you have a good grasp of how the C and C++
    479 standards require macros to be expanded before diving into this
    480 section, let alone the code!.  If you don't have a clear mental
    481 picture of how things like nested macro expansion, stringizing and
    482 token pasting are supposed to work, damage to your sanity can quickly
    483 result.
    484 
    485 @section Internal representation of macros
    486 @cindex macro representation (internal)
    487 
    488 The preprocessor stores macro expansions in tokenized form.  This
    489 saves repeated lexing passes during expansion, at the cost of a small
    490 increase in memory consumption on average.  The tokens are stored
    491 contiguously in memory, so a pointer to the first one and a token
    492 count is all you need to get the replacement list of a macro.
    493 
    494 If the macro is a function-like macro the preprocessor also stores its
    495 parameters, in the form of an ordered list of pointers to the hash
    496 table entry of each parameter's identifier.  Further, in the macro's
    497 stored expansion each occurrence of a parameter is replaced with a
    498 special token of type @code{CPP_MACRO_ARG}.  Each such token holds the
    499 index of the parameter it represents in the parameter list, which
    500 allows rapid replacement of parameters with their arguments during
    501 expansion.  Despite this optimization it is still necessary to store
    502 the original parameters to the macro, both for dumping with e.g.,
    503 @option{-dD}, and to warn about non-trivial macro redefinitions when
    504 the parameter names have changed.
    505 
    506 @section Macro expansion overview
    507 The preprocessor maintains a @dfn{context stack}, implemented as a
    508 linked list of @code{cpp_context} structures, which together represent
    509 the macro expansion state at any one time.  The @code{struct
    510 cpp_reader} member variable @code{context} points to the current top
    511 of this stack.  The top normally holds the unexpanded replacement list
    512 of the innermost macro under expansion, except when cpplib is about to
    513 pre-expand an argument, in which case it holds that argument's
    514 unexpanded tokens.
    515 
    516 When there are no macros under expansion, cpplib is in @dfn{base
    517 context}.  All contexts other than the base context contain a
    518 contiguous list of tokens delimited by a starting and ending token.
    519 When not in base context, cpplib obtains the next token from the list
    520 of the top context.  If there are no tokens left in the list, it pops
    521 that context off the stack, and subsequent ones if necessary, until an
    522 unexhausted context is found or it returns to base context.  In base
    523 context, cpplib reads tokens directly from the lexer.
    524 
    525 If it encounters an identifier that is both a macro and enabled for
    526 expansion, cpplib prepares to push a new context for that macro on the
    527 stack by calling the routine @code{enter_macro_context}.  When this
    528 routine returns, the new context will contain the unexpanded tokens of
    529 the replacement list of that macro.  In the case of function-like
    530 macros, @code{enter_macro_context} also replaces any parameters in the
    531 replacement list, stored as @code{CPP_MACRO_ARG} tokens, with the
    532 appropriate macro argument.  If the standard requires that the
    533 parameter be replaced with its expanded argument, the argument will
    534 have been fully macro expanded first.
    535 
    536 @code{enter_macro_context} also handles special macros like
    537 @code{__LINE__}.  Although these macros expand to a single token which
    538 cannot contain any further macros, for reasons of token spacing
    539 (@pxref{Token Spacing}) and simplicity of implementation, cpplib
    540 handles these special macros by pushing a context containing just that
    541 one token.
    542 
    543 The final thing that @code{enter_macro_context} does before returning
    544 is to mark the macro disabled for expansion (except for special macros
    545 like @code{__TIME__}).  The macro is re-enabled when its context is
    546 later popped from the context stack, as described above.  This strict
    547 ordering ensures that a macro is disabled whilst its expansion is
    548 being scanned, but that it is @emph{not} disabled whilst any arguments
    549 to it are being expanded.
    550 
    551 @section Scanning the replacement list for macros to expand
    552 The C standard states that, after any parameters have been replaced
    553 with their possibly-expanded arguments, the replacement list is
    554 scanned for nested macros.  Further, any identifiers in the
    555 replacement list that are not expanded during this scan are never
    556 again eligible for expansion in the future, if the reason they were
    557 not expanded is that the macro in question was disabled.
    558 
    559 Clearly this latter condition can only apply to tokens resulting from
    560 argument pre-expansion.  Other tokens never have an opportunity to be
    561 re-tested for expansion.  It is possible for identifiers that are
    562 function-like macros to not expand initially but to expand during a
    563 later scan.  This occurs when the identifier is the last token of an
    564 argument (and therefore originally followed by a comma or a closing
    565 parenthesis in its macro's argument list), and when it replaces its
    566 parameter in the macro's replacement list, the subsequent token
    567 happens to be an opening parenthesis (itself possibly the first token
    568 of an argument).
    569 
    570 It is important to note that when cpplib reads the last token of a
    571 given context, that context still remains on the stack.  Only when
    572 looking for the @emph{next} token do we pop it off the stack and drop
    573 to a lower context.  This makes backing up by one token easy, but more
    574 importantly ensures that the macro corresponding to the current
    575 context is still disabled when we are considering the last token of
    576 its replacement list for expansion (or indeed expanding it).  As an
    577 example, which illustrates many of the points above, consider
    578 
    579 @smallexample
    580 #define foo(x) bar x
    581 foo(foo) (2)
    582 @end smallexample
    583 
    584 @noindent which fully expands to @samp{bar foo (2)}.  During pre-expansion
    585 of the argument, @samp{foo} does not expand even though the macro is
    586 enabled, since it has no following parenthesis [pre-expansion of an
    587 argument only uses tokens from that argument; it cannot take tokens
    588 from whatever follows the macro invocation].  This still leaves the
    589 argument token @samp{foo} eligible for future expansion.  Then, when
    590 re-scanning after argument replacement, the token @samp{foo} is
    591 rejected for expansion, and marked ineligible for future expansion,
    592 since the macro is now disabled.  It is disabled because the
    593 replacement list @samp{bar foo} of the macro is still on the context
    594 stack.
    595 
    596 If instead the algorithm looked for an opening parenthesis first and
    597 then tested whether the macro were disabled it would be subtly wrong.
    598 In the example above, the replacement list of @samp{foo} would be
    599 popped in the process of finding the parenthesis, re-enabling
    600 @samp{foo} and expanding it a second time.
    601 
    602 @section Looking for a function-like macro's opening parenthesis
    603 Function-like macros only expand when immediately followed by a
    604 parenthesis.  To do this cpplib needs to temporarily disable macros
    605 and read the next token.  Unfortunately, because of spacing issues
    606 (@pxref{Token Spacing}), there can be fake padding tokens in-between,
    607 and if the next real token is not a parenthesis cpplib needs to be
    608 able to back up that one token as well as retain the information in
    609 any intervening padding tokens.
    610 
    611 Backing up more than one token when macros are involved is not
    612 permitted by cpplib, because in general it might involve issues like
    613 restoring popped contexts onto the context stack, which are too hard.
    614 Instead, searching for the parenthesis is handled by a special
    615 function, @code{funlike_invocation_p}, which remembers padding
    616 information as it reads tokens.  If the next real token is not an
    617 opening parenthesis, it backs up that one token, and then pushes an
    618 extra context just containing the padding information if necessary.
    619 
    620 @section Marking tokens ineligible for future expansion
    621 As discussed above, cpplib needs a way of marking tokens as
    622 unexpandable.  Since the tokens cpplib handles are read-only once they
    623 have been lexed, it instead makes a copy of the token and adds the
    624 flag @code{NO_EXPAND} to the copy.
    625 
    626 For efficiency and to simplify memory management by avoiding having to
    627 remember to free these tokens, they are allocated as temporary tokens
    628 from the lexer's current token run (@pxref{Lexing a line}) using the
    629 function @code{_cpp_temp_token}.  The tokens are then re-used once the
    630 current line of tokens has been read in.
    631 
    632 This might sound unsafe.  However, tokens runs are not re-used at the
    633 end of a line if it happens to be in the middle of a macro argument
    634 list, and cpplib only wants to back-up more than one lexer token in
    635 situations where no macro expansion is involved, so the optimization
    636 is safe.
    637 
    638 @node Token Spacing
    639 @unnumbered Token Spacing
    640 @cindex paste avoidance
    641 @cindex spacing
    642 @cindex token spacing
    643 
    644 First, consider an issue that only concerns the stand-alone
    645 preprocessor: there needs to be a guarantee that re-reading its preprocessed
    646 output results in an identical token stream.  Without taking special
    647 measures, this might not be the case because of macro substitution.
    648 For example:
    649 
    650 @smallexample
    651 #define PLUS +
    652 #define EMPTY
    653 #define f(x) =x=
    654 +PLUS -EMPTY- PLUS+ f(=)
    655         @expansion{} + + - - + + = = =
    656 @emph{not}
    657         @expansion{} ++ -- ++ ===
    658 @end smallexample
    659 
    660 One solution would be to simply insert a space between all adjacent
    661 tokens.  However, we would like to keep space insertion to a minimum,
    662 both for aesthetic reasons and because it causes problems for people who
    663 still try to abuse the preprocessor for things like Fortran source and
    664 Makefiles.
    665 
    666 For now, just notice that when tokens are added (or removed, as shown by
    667 the @code{EMPTY} example) from the original lexed token stream, we need
    668 to check for accidental token pasting.  We call this @dfn{paste
    669 avoidance}.  Token addition and removal can only occur because of macro
    670 expansion, but accidental pasting can occur in many places: both before
    671 and after each macro replacement, each argument replacement, and
    672 additionally each token created by the @samp{#} and @samp{##} operators.
    673 
    674 Look at how the preprocessor gets whitespace output correct
    675 normally.  The @code{cpp_token} structure contains a flags byte, and one
    676 of those flags is @code{PREV_WHITE}.  This is flagged by the lexer, and
    677 indicates that the token was preceded by whitespace of some form other
    678 than a new line.  The stand-alone preprocessor can use this flag to
    679 decide whether to insert a space between tokens in the output.
    680 
    681 Now consider the result of the following macro expansion:
    682 
    683 @smallexample
    684 #define add(x, y, z) x + y +z;
    685 sum = add (1,2, 3);
    686         @expansion{} sum = 1 + 2 +3;
    687 @end smallexample
    688 
    689 The interesting thing here is that the tokens @samp{1} and @samp{2} are
    690 output with a preceding space, and @samp{3} is output without a
    691 preceding space, but when lexed none of these tokens had that property.
    692 Careful consideration reveals that @samp{1} gets its preceding
    693 whitespace from the space preceding @samp{add} in the macro invocation,
    694 @emph{not} replacement list.  @samp{2} gets its whitespace from the
    695 space preceding the parameter @samp{y} in the macro replacement list,
    696 and @samp{3} has no preceding space because parameter @samp{z} has none
    697 in the replacement list.
    698 
    699 Once lexed, tokens are effectively fixed and cannot be altered, since
    700 pointers to them might be held in many places, in particular by
    701 in-progress macro expansions.  So instead of modifying the two tokens
    702 above, the preprocessor inserts a special token, which I call a
    703 @dfn{padding token}, into the token stream to indicate that spacing of
    704 the subsequent token is special.  The preprocessor inserts padding
    705 tokens in front of every macro expansion and expanded macro argument.
    706 These point to a @dfn{source token} from which the subsequent real token
    707 should inherit its spacing.  In the above example, the source tokens are
    708 @samp{add} in the macro invocation, and @samp{y} and @samp{z} in the
    709 macro replacement list, respectively.
    710 
    711 It is quite easy to get multiple padding tokens in a row, for example if
    712 a macro's first replacement token expands straight into another macro.
    713 
    714 @smallexample
    715 #define foo bar
    716 #define bar baz
    717 [foo]
    718         @expansion{} [baz]
    719 @end smallexample
    720 
    721 Here, two padding tokens are generated with sources the @samp{foo} token
    722 between the brackets, and the @samp{bar} token from foo's replacement
    723 list, respectively.  Clearly the first padding token is the one to
    724 use, so the output code should contain a rule that the first
    725 padding token in a sequence is the one that matters.
    726 
    727 But what if a macro expansion is left?  Adjusting the above
    728 example slightly:
    729 
    730 @smallexample
    731 #define foo bar
    732 #define bar EMPTY baz
    733 #define EMPTY
    734 [foo] EMPTY;
    735         @expansion{} [ baz] ;
    736 @end smallexample
    737 
    738 As shown, now there should be a space before @samp{baz} and the
    739 semicolon in the output.
    740 
    741 The rules we decided above fail for @samp{baz}: we generate three
    742 padding tokens, one per macro invocation, before the token @samp{baz}.
    743 We would then have it take its spacing from the first of these, which
    744 carries source token @samp{foo} with no leading space.
    745 
    746 It is vital that cpplib get spacing correct in these examples since any
    747 of these macro expansions could be stringized, where spacing matters.
    748 
    749 So, this demonstrates that not just entering macro and argument
    750 expansions, but leaving them requires special handling too.  I made
    751 cpplib insert a padding token with a @code{NULL} source token when
    752 leaving macro expansions, as well as after each replaced argument in a
    753 macro's replacement list.  It also inserts appropriate padding tokens on
    754 either side of tokens created by the @samp{#} and @samp{##} operators.
    755 I expanded the rule so that, if we see a padding token with a
    756 @code{NULL} source token, @emph{and} that source token has no leading
    757 space, then we behave as if we have seen no padding tokens at all.  A
    758 quick check shows this rule will then get the above example correct as
    759 well.
    760 
    761 Now a relationship with paste avoidance is apparent: we have to be
    762 careful about paste avoidance in exactly the same locations we have
    763 padding tokens in order to get white space correct.  This makes
    764 implementation of paste avoidance easy: wherever the stand-alone
    765 preprocessor is fixing up spacing because of padding tokens, and it
    766 turns out that no space is needed, it has to take the extra step to
    767 check that a space is not needed after all to avoid an accidental paste.
    768 The function @code{cpp_avoid_paste} advises whether a space is required
    769 between two consecutive tokens.  To avoid excessive spacing, it tries
    770 hard to only require a space if one is likely to be necessary, but for
    771 reasons of efficiency it is slightly conservative and might recommend a
    772 space where one is not strictly needed.
    773 
    774 @node Line Numbering
    775 @unnumbered Line numbering
    776 @cindex line numbers
    777 
    778 @section Just which line number anyway?
    779 
    780 There are three reasonable requirements a cpplib client might have for
    781 the line number of a token passed to it:
    782 
    783 @itemize @bullet
    784 @item
    785 The source line it was lexed on.
    786 @item
    787 The line it is output on.  This can be different to the line it was
    788 lexed on if, for example, there are intervening escaped newlines or
    789 C-style comments.  For example:
    790 
    791 @smallexample
    792 foo /* @r{A long
    793 comment} */ bar \
    794 baz
    795 @result{}
    796 foo bar baz
    797 @end smallexample
    798 
    799 @item
    800 If the token results from a macro expansion, the line of the macro name,
    801 or possibly the line of the closing parenthesis in the case of
    802 function-like macro expansion.
    803 @end itemize
    804 
    805 The @code{cpp_token} structure contains @code{line} and @code{col}
    806 members.  The lexer fills these in with the line and column of the first
    807 character of the token.  Consequently, but maybe unexpectedly, a token
    808 from the replacement list of a macro expansion carries the location of
    809 the token within the @code{#define} directive, because cpplib expands a
    810 macro by returning pointers to the tokens in its replacement list.  The
    811 current implementation of cpplib assigns tokens created from built-in
    812 macros and the @samp{#} and @samp{##} operators the location of the most
    813 recently lexed token.  This is a because they are allocated from the
    814 lexer's token runs, and because of the way the diagnostic routines infer
    815 the appropriate location to report.
    816 
    817 The diagnostic routines in cpplib display the location of the most
    818 recently @emph{lexed} token, unless they are passed a specific line and
    819 column to report.  For diagnostics regarding tokens that arise from
    820 macro expansions, it might also be helpful for the user to see the
    821 original location in the macro definition that the token came from.
    822 Since that is exactly the information each token carries, such an
    823 enhancement could be made relatively easily in future.
    824 
    825 The stand-alone preprocessor faces a similar problem when determining
    826 the correct line to output the token on: the position attached to a
    827 token is fairly useless if the token came from a macro expansion.  All
    828 tokens on a logical line should be output on its first physical line, so
    829 the token's reported location is also wrong if it is part of a physical
    830 line other than the first.
    831 
    832 To solve these issues, cpplib provides a callback that is generated
    833 whenever it lexes a preprocessing token that starts a new logical line
    834 other than a directive.  It passes this token (which may be a
    835 @code{CPP_EOF} token indicating the end of the translation unit) to the
    836 callback routine, which can then use the line and column of this token
    837 to produce correct output.
    838 
    839 @section Representation of line numbers
    840 
    841 As mentioned above, cpplib stores with each token the line number that
    842 it was lexed on.  In fact, this number is not the number of the line in
    843 the source file, but instead bears more resemblance to the number of the
    844 line in the translation unit.
    845 
    846 The preprocessor maintains a monotonic increasing line count, which is
    847 incremented at every new line character (and also at the end of any
    848 buffer that does not end in a new line).  Since a line number of zero is
    849 useful to indicate certain special states and conditions, this variable
    850 starts counting from one.
    851 
    852 This variable therefore uniquely enumerates each line in the translation
    853 unit.  With some simple infrastructure, it is straight forward to map
    854 from this to the original source file and line number pair, saving space
    855 whenever line number information needs to be saved.  The code the
    856 implements this mapping lies in the files @file{line-map.cc} and
    857 @file{line-map.h}.
    858 
    859 Command-line macros and assertions are implemented by pushing a buffer
    860 containing the right hand side of an equivalent @code{#define} or
    861 @code{#assert} directive.  Some built-in macros are handled similarly.
    862 Since these are all processed before the first line of the main input
    863 file, it will typically have an assigned line closer to twenty than to
    864 one.
    865 
    866 @node Guard Macros
    867 @unnumbered The Multiple-Include Optimization
    868 @cindex guard macros
    869 @cindex controlling macros
    870 @cindex multiple-include optimization
    871 
    872 Header files are often of the form
    873 
    874 @smallexample
    875 #ifndef FOO
    876 #define FOO
    877 @dots{}
    878 #endif
    879 @end smallexample
    880 
    881 @noindent
    882 to prevent the compiler from processing them more than once.  The
    883 preprocessor notices such header files, so that if the header file
    884 appears in a subsequent @code{#include} directive and @code{FOO} is
    885 defined, then it is ignored and it doesn't preprocess or even re-open
    886 the file a second time.  This is referred to as the @dfn{multiple
    887 include optimization}.
    888 
    889 Under what circumstances is such an optimization valid?  If the file
    890 were included a second time, it can only be optimized away if that
    891 inclusion would result in no tokens to return, and no relevant
    892 directives to process.  Therefore the current implementation imposes
    893 requirements and makes some allowances as follows:
    894 
    895 @enumerate
    896 @item
    897 There must be no tokens outside the controlling @code{#if}-@code{#endif}
    898 pair, but whitespace and comments are permitted.
    899 
    900 @item
    901 There must be no directives outside the controlling directive pair, but
    902 the @dfn{null directive} (a line containing nothing other than a single
    903 @samp{#} and possibly whitespace) is permitted.
    904 
    905 @item
    906 The opening directive must be of the form
    907 
    908 @smallexample
    909 #ifndef FOO
    910 @end smallexample
    911 
    912 or
    913 
    914 @smallexample
    915 #if !defined FOO     [equivalently, #if !defined(FOO)]
    916 @end smallexample
    917 
    918 @item
    919 In the second form above, the tokens forming the @code{#if} expression
    920 must have come directly from the source file---no macro expansion must
    921 have been involved.  This is because macro definitions can change, and
    922 tracking whether or not a relevant change has been made is not worth the
    923 implementation cost.
    924 
    925 @item
    926 There can be no @code{#else} or @code{#elif} directives at the outer
    927 conditional block level, because they would probably contain something
    928 of interest to a subsequent pass.
    929 @end enumerate
    930 
    931 First, when pushing a new file on the buffer stack,
    932 @code{_stack_include_file} sets the controlling macro @code{mi_cmacro} to
    933 @code{NULL}, and sets @code{mi_valid} to @code{true}.  This indicates
    934 that the preprocessor has not yet encountered anything that would
    935 invalidate the multiple-include optimization.  As described in the next
    936 few paragraphs, these two variables having these values effectively
    937 indicates top-of-file.
    938 
    939 When about to return a token that is not part of a directive,
    940 @code{_cpp_lex_token} sets @code{mi_valid} to @code{false}.  This
    941 enforces the constraint that tokens outside the controlling conditional
    942 block invalidate the optimization.
    943 
    944 The @code{do_if}, when appropriate, and @code{do_ifndef} directive
    945 handlers pass the controlling macro to the function
    946 @code{push_conditional}.  cpplib maintains a stack of nested conditional
    947 blocks, and after processing every opening conditional this function
    948 pushes an @code{if_stack} structure onto the stack.  In this structure
    949 it records the controlling macro for the block, provided there is one
    950 and we're at top-of-file (as described above).  If an @code{#elif} or
    951 @code{#else} directive is encountered, the controlling macro for that
    952 block is cleared to @code{NULL}.  Otherwise, it survives until the
    953 @code{#endif} closing the block, upon which @code{do_endif} sets
    954 @code{mi_valid} to true and stores the controlling macro in
    955 @code{mi_cmacro}.
    956 
    957 @code{_cpp_handle_directive} clears @code{mi_valid} when processing any
    958 directive other than an opening conditional and the null directive.
    959 With this, and requiring top-of-file to record a controlling macro, and
    960 no @code{#else} or @code{#elif} for it to survive and be copied to
    961 @code{mi_cmacro} by @code{do_endif}, we have enforced the absence of
    962 directives outside the main conditional block for the optimization to be
    963 on.
    964 
    965 Note that whilst we are inside the conditional block, @code{mi_valid} is
    966 likely to be reset to @code{false}, but this does not matter since
    967 the closing @code{#endif} restores it to @code{true} if appropriate.
    968 
    969 Finally, since @code{_cpp_lex_direct} pops the file off the buffer stack
    970 at @code{EOF} without returning a token, if the @code{#endif} directive
    971 was not followed by any tokens, @code{mi_valid} is @code{true} and
    972 @code{_cpp_pop_file_buffer} remembers the controlling macro associated
    973 with the file.  Subsequent calls to @code{stack_include_file} result in
    974 no buffer being pushed if the controlling macro is defined, effecting
    975 the optimization.
    976 
    977 A quick word on how we handle the
    978 
    979 @smallexample
    980 #if !defined FOO
    981 @end smallexample
    982 
    983 @noindent
    984 case.  @code{_cpp_parse_expr} and @code{parse_defined} take steps to see
    985 whether the three stages @samp{!}, @samp{defined-expression} and
    986 @samp{end-of-directive} occur in order in a @code{#if} expression.  If
    987 so, they return the guard macro to @code{do_if} in the variable
    988 @code{mi_ind_cmacro}, and otherwise set it to @code{NULL}.
    989 @code{enter_macro_context} sets @code{mi_valid} to false, so if a macro
    990 was expanded whilst parsing any part of the expression, then the
    991 top-of-file test in @code{push_conditional} fails and the optimization
    992 is turned off.
    993 
    994 @node Files
    995 @unnumbered File Handling
    996 @cindex files
    997 
    998 Fairly obviously, the file handling code of cpplib resides in the file
    999 @file{files.cc}.  It takes care of the details of file searching,
   1000 opening, reading and caching, for both the main source file and all the
   1001 headers it recursively includes.
   1002 
   1003 The basic strategy is to minimize the number of system calls.  On many
   1004 systems, the basic @code{open ()} and @code{fstat ()} system calls can
   1005 be quite expensive.  For every @code{#include}-d file, we need to try
   1006 all the directories in the search path until we find a match.  Some
   1007 projects, such as glibc, pass twenty or thirty include paths on the
   1008 command line, so this can rapidly become time consuming.
   1009 
   1010 For a header file we have not encountered before we have little choice
   1011 but to do this.  However, it is often the case that the same headers are
   1012 repeatedly included, and in these cases we try to avoid repeating the
   1013 filesystem queries whilst searching for the correct file.
   1014 
   1015 For each file we try to open, we store the constructed path in a splay
   1016 tree.  This path first undergoes simplification by the function
   1017 @code{_cpp_simplify_pathname}.  For example,
   1018 @file{/usr/include/bits/../foo.h} is simplified to
   1019 @file{/usr/include/foo.h} before we enter it in the splay tree and try
   1020 to @code{open ()} the file.  CPP will then find subsequent uses of
   1021 @file{foo.h}, even as @file{/usr/include/foo.h}, in the splay tree and
   1022 save system calls.
   1023 
   1024 Further, it is likely the file contents have also been cached, saving a
   1025 @code{read ()} system call.  We don't bother caching the contents of
   1026 header files that are re-inclusion protected, and whose re-inclusion
   1027 macro is defined when we leave the header file for the first time.  If
   1028 the host supports it, we try to map suitably large files into memory,
   1029 rather than reading them in directly.
   1030 
   1031 The include paths are internally stored on a null-terminated
   1032 singly-linked list, starting with the @code{"header.h"} directory search
   1033 chain, which then links into the @code{<header.h>} directory chain.
   1034 
   1035 Files included with the @code{<foo.h>} syntax start the lookup directly
   1036 in the second half of this chain.  However, files included with the
   1037 @code{"foo.h"} syntax start at the beginning of the chain, but with one
   1038 extra directory prepended.  This is the directory of the current file;
   1039 the one containing the @code{#include} directive.  Prepending this
   1040 directory on a per-file basis is handled by the function
   1041 @code{search_from}.
   1042 
   1043 Note that a header included with a directory component, such as
   1044 @code{#include "mydir/foo.h"} and opened as
   1045 @file{/usr/local/include/mydir/foo.h}, will have the complete path minus
   1046 the basename @samp{foo.h} as the current directory.
   1047 
   1048 Enough information is stored in the splay tree that CPP can immediately
   1049 tell whether it can skip the header file because of the multiple include
   1050 optimization, whether the file didn't exist or couldn't be opened for
   1051 some reason, or whether the header was flagged not to be re-used, as it
   1052 is with the obsolete @code{#import} directive.
   1053 
   1054 For the benefit of MS-DOS filesystems with an 8.3 filename limitation,
   1055 CPP offers the ability to treat various include file names as aliases
   1056 for the real header files with shorter names.  The map from one to the
   1057 other is found in a special file called @samp{header.gcc}, stored in the
   1058 command line (or system) include directories to which the mapping
   1059 applies.  This may be higher up the directory tree than the full path to
   1060 the file minus the base name.
   1061 
   1062 @node Concept Index
   1063 @unnumbered Concept Index
   1064 @printindex cp
   1065 
   1066 @bye
   1067