1 @c Copyright (C) 2010-2022 Free Software Foundation, Inc. 2 @c This is part of the GCC manual. 3 @c For copying conditions, see the file gcc.texi. 4 @c Contributed by Jan Hubicka <jh (a] suse.cz> and 5 @c Diego Novillo <dnovillo (a] google.com> 6 7 @node LTO 8 @chapter Link Time Optimization 9 @cindex lto 10 @cindex whopr 11 @cindex wpa 12 @cindex ltrans 13 14 Link Time Optimization (LTO) gives GCC the capability of 15 dumping its internal representation (GIMPLE) to disk, 16 so that all the different compilation units that make up 17 a single executable can be optimized as a single module. 18 This expands the scope of inter-procedural optimizations 19 to encompass the whole program (or, rather, everything 20 that is visible at link time). 21 22 @menu 23 * LTO Overview:: Overview of LTO. 24 * LTO object file layout:: LTO file sections in ELF. 25 * IPA:: Using summary information in IPA passes. 26 * WHOPR:: Whole program assumptions, 27 linker plugin and symbol visibilities. 28 * Internal flags:: Internal flags controlling @code{lto1}. 29 @end menu 30 31 @node LTO Overview 32 @section Design Overview 33 34 Link time optimization is implemented as a GCC front end for a 35 bytecode representation of GIMPLE that is emitted in special sections 36 of @code{.o} files. Currently, LTO support is enabled in most 37 ELF-based systems, as well as darwin, cygwin and mingw systems. 38 39 By default, object files generated with LTO support contain only GIMPLE 40 bytecode. Such objects are called ``slim'', and they require that 41 tools like @code{ar} and @code{nm} understand symbol tables of LTO 42 sections. For most targets these tools have been extended to use the 43 plugin infrastructure, so GCC can support ``slim'' objects consisting 44 of the intermediate code alone. 45 46 GIMPLE bytecode could also be saved alongside final object code if 47 the @option{-ffat-lto-objects} option is passed, or if no plugin support 48 is detected for @code{ar} and @code{nm} when GCC is configured. It makes 49 the object files generated with LTO support larger than regular object 50 files. This ``fat'' object format allows to ship one set of fat 51 objects which could be used both for development and the production of 52 optimized builds. A, perhaps surprising, side effect of this feature 53 is that any mistake in the toolchain leads to LTO information not 54 being used (e.g.@: an older @code{libtool} calling @code{ld} directly). 55 This is both an advantage, as the system is more robust, and a 56 disadvantage, as the user is not informed that the optimization has 57 been disabled. 58 59 At the highest level, LTO splits the compiler in two. The first half 60 (the ``writer'') produces a streaming representation of all the 61 internal data structures needed to optimize and generate code. This 62 includes declarations, types, the callgraph and the GIMPLE representation 63 of function bodies. 64 65 When @option{-flto} is given during compilation of a source file, the 66 pass manager executes all the passes in @code{all_lto_gen_passes}. 67 Currently, this phase is composed of two IPA passes: 68 69 @itemize @bullet 70 @item @code{pass_ipa_lto_gimple_out} 71 This pass executes the function @code{lto_output} in 72 @file{lto-streamer-out.cc}, which traverses the call graph encoding 73 every reachable declaration, type and function. This generates a 74 memory representation of all the file sections described below. 75 76 @item @code{pass_ipa_lto_finish_out} 77 This pass executes the function @code{produce_asm_for_decls} in 78 @file{lto-streamer-out.cc}, which takes the memory image built in the 79 previous pass and encodes it in the corresponding ELF file sections. 80 @end itemize 81 82 The second half of LTO support is the ``reader''. This is implemented 83 as the GCC front end @file{lto1} in @file{lto/lto.cc}. When 84 @file{collect2} detects a link set of @code{.o}/@code{.a} files with 85 LTO information and the @option{-flto} is enabled, it invokes 86 @file{lto1} which reads the set of files and aggregates them into a 87 single translation unit for optimization. The main entry point for 88 the reader is @file{lto/lto.cc}:@code{lto_main}. 89 90 @subsection LTO modes of operation 91 92 One of the main goals of the GCC link-time infrastructure was to allow 93 effective compilation of large programs. For this reason GCC implements two 94 link-time compilation modes. 95 96 @enumerate 97 @item @emph{LTO mode}, in which the whole program is read into the 98 compiler at link-time and optimized in a similar way as if it 99 were a single source-level compilation unit. 100 101 @item @emph{WHOPR or partitioned mode}, designed to utilize multiple 102 CPUs and/or a distributed compilation environment to quickly link 103 large applications. WHOPR stands for WHOle Program optimizeR (not to 104 be confused with the semantics of @option{-fwhole-program}). It 105 partitions the aggregated callgraph from many different @code{.o} 106 files and distributes the compilation of the sub-graphs to different 107 CPUs. 108 109 Note that distributed compilation is not implemented yet, but since 110 the parallelism is facilitated via generating a @code{Makefile}, it 111 would be easy to implement. 112 @end enumerate 113 114 WHOPR splits LTO into three main stages: 115 @enumerate 116 @item Local generation (LGEN) 117 This stage executes in parallel. Every file in the program is compiled 118 into the intermediate language and packaged together with the local 119 call-graph and summary information. This stage is the same for both 120 the LTO and WHOPR compilation mode. 121 122 @item Whole Program Analysis (WPA) 123 WPA is performed sequentially. The global call-graph is generated, and 124 a global analysis procedure makes transformation decisions. The global 125 call-graph is partitioned to facilitate parallel optimization during 126 phase 3. The results of the WPA stage are stored into new object files 127 which contain the partitions of program expressed in the intermediate 128 language and the optimization decisions. 129 130 @item Local transformations (LTRANS) 131 This stage executes in parallel. All the decisions made during phase 2 132 are implemented locally in each partitioned object file, and the final 133 object code is generated. Optimizations which cannot be decided 134 efficiently during the phase 2 may be performed on the local 135 call-graph partitions. 136 @end enumerate 137 138 WHOPR can be seen as an extension of the usual LTO mode of 139 compilation. In LTO, WPA and LTRANS are executed within a single 140 execution of the compiler, after the whole program has been read into 141 memory. 142 143 When compiling in WHOPR mode, the callgraph is partitioned during 144 the WPA stage. The whole program is split into a given number of 145 partitions of roughly the same size. The compiler tries to 146 minimize the number of references which cross partition boundaries. 147 The main advantage of WHOPR is to allow the parallel execution of 148 LTRANS stages, which are the most time-consuming part of the 149 compilation process. Additionally, it avoids the need to load the 150 whole program into memory. 151 152 153 @node LTO object file layout 154 @section LTO file sections 155 156 LTO information is stored in several ELF sections inside object files. 157 Data structures and enum codes for sections are defined in 158 @file{lto-streamer.h}. 159 160 These sections are emitted from @file{lto-streamer-out.cc} and mapped 161 in all at once from @file{lto/lto.cc}:@code{lto_file_read}. The 162 individual functions dealing with the reading/writing of each section 163 are described below. 164 165 @itemize @bullet 166 @item Command line options (@code{.gnu.lto_.opts}) 167 168 This section contains the command line options used to generate the 169 object files. This is used at link time to determine the optimization 170 level and other settings when they are not explicitly specified at the 171 linker command line. 172 173 Currently, GCC does not support combining LTO object files compiled 174 with different set of the command line options into a single binary. 175 At link time, the options given on the command line and the options 176 saved on all the files in a link-time set are applied globally. No 177 attempt is made at validating the combination of flags (other than the 178 usual validation done by option processing). This is implemented in 179 @file{lto/lto.cc}:@code{lto_read_all_file_options}. 180 181 182 @item Symbol table (@code{.gnu.lto_.symtab}) 183 184 This table replaces the ELF symbol table for functions and variables 185 represented in the LTO IL. Symbols used and exported by the optimized 186 assembly code of ``fat'' objects might not match the ones used and 187 exported by the intermediate code. This table is necessary because 188 the intermediate code is less optimized and thus requires a separate 189 symbol table. 190 191 Additionally, the binary code in the ``fat'' object will lack a call 192 to a function, since the call was optimized out at compilation time 193 after the intermediate language was streamed out. In some special 194 cases, the same optimization may not happen during link-time 195 optimization. This would lead to an undefined symbol if only one 196 symbol table was used. 197 198 The symbol table is emitted in 199 @file{lto-streamer-out.cc}:@code{produce_symtab}. 200 201 202 @item Global declarations and types (@code{.gnu.lto_.decls}) 203 204 This section contains an intermediate language dump of all 205 declarations and types required to represent the callgraph, static 206 variables and top-level debug info. 207 208 The contents of this section are emitted in 209 @file{lto-streamer-out.cc}:@code{produce_asm_for_decls}. Types and 210 symbols are emitted in a topological order that preserves the sharing 211 of pointers when the file is read back in 212 (@file{lto.cc}:@code{read_cgraph_and_symbols}). 213 214 215 @item The callgraph (@code{.gnu.lto_.cgraph}) 216 217 This section contains the basic data structure used by the GCC 218 inter-procedural optimization infrastructure. This section stores an 219 annotated multi-graph which represents the functions and call sites as 220 well as the variables, aliases and top-level @code{asm} statements. 221 222 This section is emitted in 223 @file{lto-streamer-out.cc}:@code{output_cgraph} and read in 224 @file{lto-cgraph.cc}:@code{input_cgraph}. 225 226 227 @item IPA references (@code{.gnu.lto_.refs}) 228 229 This section contains references between function and static 230 variables. It is emitted by @file{lto-cgraph.cc}:@code{output_refs} 231 and read by @file{lto-cgraph.cc}:@code{input_refs}. 232 233 234 @item Function bodies (@code{.gnu.lto_.function_body.<name>}) 235 236 This section contains function bodies in the intermediate language 237 representation. Every function body is in a separate section to allow 238 copying of the section independently to different object files or 239 reading the function on demand. 240 241 Functions are emitted in 242 @file{lto-streamer-out.cc}:@code{output_function} and read in 243 @file{lto-streamer-in.cc}:@code{input_function}. 244 245 246 @item Static variable initializers (@code{.gnu.lto_.vars}) 247 248 This section contains all the symbols in the global variable pool. It 249 is emitted by @file{lto-cgraph.cc}:@code{output_varpool} and read in 250 @file{lto-cgraph.cc}:@code{input_cgraph}. 251 252 @item Summaries and optimization summaries used by IPA passes 253 (@code{.gnu.lto_.<xxx>}, where @code{<xxx>} is one of @code{jmpfuncs}, 254 @code{pureconst} or @code{reference}) 255 256 These sections are used by IPA passes that need to emit summary 257 information during LTO generation to be read and aggregated at 258 link time. Each pass is responsible for implementing two pass manager 259 hooks: one for writing the summary and another for reading it in. The 260 format of these sections is entirely up to each individual pass. The 261 only requirement is that the writer and reader hooks agree on the 262 format. 263 @end itemize 264 265 266 @node IPA 267 @section Using summary information in IPA passes 268 269 Programs are represented internally as a @emph{callgraph} (a 270 multi-graph where nodes are functions and edges are call sites) 271 and a @emph{varpool} (a list of static and external variables in 272 the program). 273 274 The inter-procedural optimization is organized as a sequence of 275 individual passes, which operate on the callgraph and the 276 varpool. To make the implementation of WHOPR possible, every 277 inter-procedural optimization pass is split into several stages 278 that are executed at different times during WHOPR compilation: 279 280 @itemize @bullet 281 @item LGEN time 282 @enumerate 283 @item @emph{Generate summary} (@code{generate_summary} in 284 @code{struct ipa_opt_pass_d}). This stage analyzes every function 285 body and variable initializer is examined and stores relevant 286 information into a pass-specific data structure. 287 288 @item @emph{Write summary} (@code{write_summary} in 289 @code{struct ipa_opt_pass_d}). This stage writes all the 290 pass-specific information generated by @code{generate_summary}. 291 Summaries go into their own @code{LTO_section_*} sections that 292 have to be declared in @file{lto-streamer.h}:@code{enum 293 lto_section_type}. A new section is created by calling 294 @code{create_output_block} and data can be written using the 295 @code{lto_output_*} routines. 296 @end enumerate 297 298 @item WPA time 299 @enumerate 300 @item @emph{Read summary} (@code{read_summary} in 301 @code{struct ipa_opt_pass_d}). This stage reads all the 302 pass-specific information in exactly the same order that it was 303 written by @code{write_summary}. 304 305 @item @emph{Execute} (@code{execute} in @code{struct 306 opt_pass}). This performs inter-procedural propagation. This 307 must be done without actual access to the individual function 308 bodies or variable initializers. Typically, this results in a 309 transitive closure operation over the summary information of all 310 the nodes in the callgraph. 311 312 @item @emph{Write optimization summary} 313 (@code{write_optimization_summary} in @code{struct 314 ipa_opt_pass_d}). This writes the result of the inter-procedural 315 propagation into the object file. This can use the same data 316 structures and helper routines used in @code{write_summary}. 317 @end enumerate 318 319 @item LTRANS time 320 @enumerate 321 @item @emph{Read optimization summary} 322 (@code{read_optimization_summary} in @code{struct 323 ipa_opt_pass_d}). The counterpart to 324 @code{write_optimization_summary}. This reads the interprocedural 325 optimization decisions in exactly the same format emitted by 326 @code{write_optimization_summary}. 327 328 @item @emph{Transform} (@code{function_transform} and 329 @code{variable_transform} in @code{struct ipa_opt_pass_d}). 330 The actual function bodies and variable initializers are updated 331 based on the information passed down from the @emph{Execute} stage. 332 @end enumerate 333 @end itemize 334 335 The implementation of the inter-procedural passes are shared 336 between LTO, WHOPR and classic non-LTO compilation. 337 338 @itemize 339 @item During the traditional file-by-file mode every pass executes its 340 own @emph{Generate summary}, @emph{Execute}, and @emph{Transform} 341 stages within the single execution context of the compiler. 342 343 @item In LTO compilation mode, every pass uses @emph{Generate 344 summary} and @emph{Write summary} stages at compilation time, 345 while the @emph{Read summary}, @emph{Execute}, and 346 @emph{Transform} stages are executed at link time. 347 348 @item In WHOPR mode all stages are used. 349 @end itemize 350 351 To simplify development, the GCC pass manager differentiates 352 between normal inter-procedural passes (@pxref{Regular IPA passes}), 353 small inter-procedural passes (@pxref{Small IPA passes}) 354 and late inter-procedural passes (@pxref{Late IPA passes}). 355 A small or late IPA pass (@code{SIMPLE_IPA_PASS}) does 356 everything at once and thus cannot be executed during WPA in 357 WHOPR mode. It defines only the @emph{Execute} stage and during 358 this stage it accesses and modifies the function bodies. Such 359 passes are useful for optimization at LGEN or LTRANS time and are 360 used, for example, to implement early optimization before writing 361 object files. The simple inter-procedural passes can also be used 362 for easier prototyping and development of a new inter-procedural 363 pass. 364 365 366 @subsection Virtual clones 367 368 One of the main challenges of introducing the WHOPR compilation 369 mode was addressing the interactions between optimization passes. 370 In LTO compilation mode, the passes are executed in a sequence, 371 each of which consists of analysis (or @emph{Generate summary}), 372 propagation (or @emph{Execute}) and @emph{Transform} stages. 373 Once the work of one pass is finished, the next pass sees the 374 updated program representation and can execute. This makes the 375 individual passes dependent on each other. 376 377 In WHOPR mode all passes first execute their @emph{Generate 378 summary} stage. Then summary writing marks the end of the LGEN 379 stage. At WPA time, 380 the summaries are read back into memory and all passes run the 381 @emph{Execute} stage. Optimization summaries are streamed and 382 sent to LTRANS, where all the passes execute the @emph{Transform} 383 stage. 384 385 Most optimization passes split naturally into analysis, 386 propagation and transformation stages. But some do not. The 387 main problem arises when one pass performs changes and the 388 following pass gets confused by seeing different callgraphs 389 between the @emph{Transform} stage and the @emph{Generate summary} 390 or @emph{Execute} stage. This means that the passes are required 391 to communicate their decisions with each other. 392 393 To facilitate this communication, the GCC callgraph 394 infrastructure implements @emph{virtual clones}, a method of 395 representing the changes performed by the optimization passes in 396 the callgraph without needing to update function bodies. 397 398 A @emph{virtual clone} in the callgraph is a function that has no 399 associated body, just a description of how to create its body based 400 on a different function (which itself may be a virtual clone). 401 402 The description of function modifications includes adjustments to 403 the function's signature (which allows, for example, removing or 404 adding function arguments), substitutions to perform on the 405 function body, and, for inlined functions, a pointer to the 406 function that it will be inlined into. 407 408 It is also possible to redirect any edge of the callgraph from a 409 function to its virtual clone. This implies updating of the call 410 site to adjust for the new function signature. 411 412 Most of the transformations performed by inter-procedural 413 optimizations can be represented via virtual clones. For 414 instance, a constant propagation pass can produce a virtual clone 415 of the function which replaces one of its arguments by a 416 constant. The inliner can represent its decisions by producing a 417 clone of a function whose body will be later integrated into 418 a given function. 419 420 Using @emph{virtual clones}, the program can be easily updated 421 during the @emph{Execute} stage, solving most of pass interactions 422 problems that would otherwise occur during @emph{Transform}. 423 424 Virtual clones are later materialized in the LTRANS stage and 425 turned into real functions. Passes executed after the virtual 426 clone were introduced also perform their @emph{Transform} stage 427 on new functions, so for a pass there is no significant 428 difference between operating on a real function or a virtual 429 clone introduced before its @emph{Execute} stage. 430 431 Optimization passes then work on virtual clones introduced before 432 their @emph{Execute} stage as if they were real functions. The 433 only difference is that clones are not visible during the 434 @emph{Generate Summary} stage. 435 436 To keep function summaries updated, the callgraph interface 437 allows an optimizer to register a callback that is called every 438 time a new clone is introduced as well as when the actual 439 function or variable is generated or when a function or variable 440 is removed. These hooks are registered in the @emph{Generate 441 summary} stage and allow the pass to keep its information intact 442 until the @emph{Execute} stage. The same hooks can also be 443 registered during the @emph{Execute} stage to keep the 444 optimization summaries updated for the @emph{Transform} stage. 445 446 @subsection IPA references 447 448 GCC represents IPA references in the callgraph. For a function 449 or variable @code{A}, the @emph{IPA reference} is a list of all 450 locations where the address of @code{A} is taken and, when 451 @code{A} is a variable, a list of all direct stores and reads 452 to/from @code{A}. References represent an oriented multi-graph on 453 the union of nodes of the callgraph and the varpool. See 454 @file{ipa-reference.cc}:@code{ipa_reference_write_optimization_summary} 455 and 456 @file{ipa-reference.cc}:@code{ipa_reference_read_optimization_summary} 457 for details. 458 459 @subsection Jump functions 460 Suppose that an optimization pass sees a function @code{A} and it 461 knows the values of (some of) its arguments. The @emph{jump 462 function} describes the value of a parameter of a given function 463 call in function @code{A} based on this knowledge. 464 465 Jump functions are used by several optimizations, such as the 466 inter-procedural constant propagation pass and the 467 devirtualization pass. The inliner also uses jump functions to 468 perform inlining of callbacks. 469 470 @node WHOPR 471 @section Whole program assumptions, linker plugin and symbol visibilities 472 473 Link-time optimization gives relatively minor benefits when used 474 alone. The problem is that propagation of inter-procedural 475 information does not work well across functions and variables 476 that are called or referenced by other compilation units (such as 477 from a dynamically linked library). We say that such functions 478 and variables are @emph{externally visible}. 479 480 To make the situation even more difficult, many applications 481 organize themselves as a set of shared libraries, and the default 482 ELF visibility rules allow one to overwrite any externally 483 visible symbol with a different symbol at runtime. This 484 basically disables any optimizations across such functions and 485 variables, because the compiler cannot be sure that the function 486 body it is seeing is the same function body that will be used at 487 runtime. Any function or variable not declared @code{static} in 488 the sources degrades the quality of inter-procedural 489 optimization. 490 491 To avoid this problem the compiler must assume that it sees the 492 whole program when doing link-time optimization. Strictly 493 speaking, the whole program is rarely visible even at link-time. 494 Standard system libraries are usually linked dynamically or not 495 provided with the link-time information. In GCC, the whole 496 program option (@option{-fwhole-program}) asserts that every 497 function and variable defined in the current compilation 498 unit is static, except for function @code{main} (note: at 499 link time, the current unit is the union of all objects compiled 500 with LTO). Since some functions and variables need to 501 be referenced externally, for example by another DSO or from an 502 assembler file, GCC also provides the function and variable 503 attribute @code{externally_visible} which can be used to disable 504 the effect of @option{-fwhole-program} on a specific symbol. 505 506 The whole program mode assumptions are slightly more complex in 507 C++, where inline functions in headers are put into @emph{COMDAT} 508 sections. COMDAT function and variables can be defined by 509 multiple object files and their bodies are unified at link-time 510 and dynamic link-time. COMDAT functions are changed to local only 511 when their address is not taken and thus un-sharing them with a 512 library is not harmful. COMDAT variables always remain externally 513 visible, however for readonly variables it is assumed that their 514 initializers cannot be overwritten by a different value. 515 516 GCC provides the function and variable attribute 517 @code{visibility} that can be used to specify the visibility of 518 externally visible symbols (or alternatively an 519 @option{-fdefault-visibility} command line option). ELF defines 520 the @code{default}, @code{protected}, @code{hidden} and 521 @code{internal} visibilities. 522 523 The most commonly used is visibility is @code{hidden}. It 524 specifies that the symbol cannot be referenced from outside of 525 the current shared library. Unfortunately, this information 526 cannot be used directly by the link-time optimization in the 527 compiler since the whole shared library also might contain 528 non-LTO objects and those are not visible to the compiler. 529 530 GCC solves this problem using linker plugins. A @emph{linker 531 plugin} is an interface to the linker that allows an external 532 program to claim the ownership of a given object file. The linker 533 then performs the linking procedure by querying the plugin about 534 the symbol table of the claimed objects and once the linking 535 decisions are complete, the plugin is allowed to provide the 536 final object file before the actual linking is made. The linker 537 plugin obtains the symbol resolution information which specifies 538 which symbols provided by the claimed objects are bound from the 539 rest of a binary being linked. 540 541 GCC is designed to be independent of the rest of the toolchain 542 and aims to support linkers without plugin support. For this 543 reason it does not use the linker plugin by default. Instead, 544 the object files are examined by @command{collect2} before being 545 passed to the linker and objects found to have LTO sections are 546 passed to @command{lto1} first. This mode does not work for 547 library archives. The decision on what object files from the 548 archive are needed depends on the actual linking and thus GCC 549 would have to implement the linker itself. The resolution 550 information is missing too and thus GCC needs to make an educated 551 guess based on @option{-fwhole-program}. Without the linker 552 plugin GCC also assumes that symbols are declared @code{hidden} 553 and not referred by non-LTO code by default. 554 555 @node Internal flags 556 @section Internal flags controlling @code{lto1} 557 558 The following flags are passed into @command{lto1} and are not 559 meant to be used directly from the command line. 560 561 @itemize 562 @item -fwpa 563 @opindex fwpa 564 This option runs the serial part of the link-time optimizer 565 performing the inter-procedural propagation (WPA mode). The 566 compiler reads in summary information from all inputs and 567 performs an analysis based on summary information only. It 568 generates object files for subsequent runs of the link-time 569 optimizer where individual object files are optimized using both 570 summary information from the WPA mode and the actual function 571 bodies. It then drives the LTRANS phase. 572 573 @item -fltrans 574 @opindex fltrans 575 This option runs the link-time optimizer in the 576 local-transformation (LTRANS) mode, which reads in output from a 577 previous run of the LTO in WPA mode. In the LTRANS mode, LTO 578 optimizes an object and produces the final assembly. 579 580 @item -fltrans-output-list=@var{file} 581 @opindex fltrans-output-list 582 This option specifies a file to which the names of LTRANS output 583 files are written. This option is only meaningful in conjunction 584 with @option{-fwpa}. 585 586 @item -fresolution=@var{file} 587 @opindex fresolution 588 This option specifies the linker resolution file. This option is 589 only meaningful in conjunction with @option{-fwpa} and as option 590 to pass through to the LTO linker plugin. 591 @end itemize 592