1 =============================================================== 2 Tutorial for building tools using LibTooling and LibASTMatchers 3 =============================================================== 4 5 This document is intended to show how to build a useful source-to-source 6 translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is 7 explicitly aimed at people who are new to Clang, so all you should need 8 is a working knowledge of C++ and the command line. 9 10 In order to work on the compiler, you need some basic knowledge of the 11 abstract syntax tree (AST). To this end, the reader is encouraged to 12 skim the :doc:`Introduction to the Clang 13 AST <IntroductionToTheClangAST>` 14 15 Step 0: Obtaining Clang 16 ======================= 17 18 As Clang is part of the LLVM project, you'll need to download LLVM's 19 source code first. Both Clang and LLVM are in the same git repository, 20 under different directories. For further information, see the `getting 21 started guide <https://llvm.org/docs/GettingStarted.html>`_. 22 23 .. code-block:: console 24 25 cd ~/clang-llvm 26 git clone https://github.com/llvm/llvm-project.git 27 28 Next you need to obtain the CMake build system and Ninja build tool. 29 30 .. code-block:: console 31 32 cd ~/clang-llvm 33 git clone https://github.com/martine/ninja.git 34 cd ninja 35 git checkout release 36 ./bootstrap.py 37 sudo cp ninja /usr/bin/ 38 39 cd ~/clang-llvm 40 git clone git://cmake.org/stage/cmake.git 41 cd cmake 42 git checkout next 43 ./bootstrap 44 make 45 sudo make install 46 47 Okay. Now we'll build Clang! 48 49 .. code-block:: console 50 51 cd ~/clang-llvm 52 mkdir build && cd build 53 cmake -G Ninja ../llvm -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_BUILD_TESTS=ON # Enable tests; default is off. 54 ninja 55 ninja check # Test LLVM only. 56 ninja clang-test # Test Clang only. 57 ninja install 58 59 And we're live. 60 61 All of the tests should pass. 62 63 Finally, we want to set Clang as its own compiler. 64 65 .. code-block:: console 66 67 cd ~/clang-llvm/build 68 ccmake ../llvm 69 70 The second command will bring up a GUI for configuring Clang. You need 71 to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on 72 advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to 73 ``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to 74 configure, then ``'g'`` to generate CMake's files. 75 76 Finally, run ninja one last time, and you're done. 77 78 Step 1: Create a ClangTool 79 ========================== 80 81 Now that we have enough background knowledge, it's time to create the 82 simplest productive ClangTool in existence: a syntax checker. While this 83 already exists as ``clang-check``, it's important to understand what's 84 going on. 85 86 First, we'll need to create a new directory for our tool and tell CMake 87 that it exists. As this is not going to be a core clang tool, it will 88 live in the ``clang-tools-extra`` repository. 89 90 .. code-block:: console 91 92 cd ~/clang-llvm 93 mkdir clang-tools-extra/loop-convert 94 echo 'add_subdirectory(loop-convert)' >> clang-tools-extra/CMakeLists.txt 95 vim clang-tools-extra/loop-convert/CMakeLists.txt 96 97 CMakeLists.txt should have the following contents: 98 99 :: 100 101 set(LLVM_LINK_COMPONENTS support) 102 103 add_clang_executable(loop-convert 104 LoopConvert.cpp 105 ) 106 target_link_libraries(loop-convert 107 PRIVATE 108 clangTooling 109 clangBasic 110 clangASTMatchers 111 ) 112 113 With that done, Ninja will be able to compile our tool. Let's give it 114 something to compile! Put the following into 115 ``clang-tools-extra/loop-convert/LoopConvert.cpp``. A detailed explanation of 116 why the different parts are needed can be found in the `LibTooling 117 documentation <LibTooling.html>`_. 118 119 .. code-block:: c++ 120 121 // Declares clang::SyntaxOnlyAction. 122 #include "clang/Frontend/FrontendActions.h" 123 #include "clang/Tooling/CommonOptionsParser.h" 124 #include "clang/Tooling/Tooling.h" 125 // Declares llvm::cl::extrahelp. 126 #include "llvm/Support/CommandLine.h" 127 128 using namespace clang::tooling; 129 using namespace llvm; 130 131 // Apply a custom category to all command-line options so that they are the 132 // only ones displayed. 133 static llvm::cl::OptionCategory MyToolCategory("my-tool options"); 134 135 // CommonOptionsParser declares HelpMessage with a description of the common 136 // command-line options related to the compilation database and input files. 137 // It's nice to have this help message in all tools. 138 static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage); 139 140 // A help message for this specific tool can be added afterwards. 141 static cl::extrahelp MoreHelp("\nMore help text...\n"); 142 143 int main(int argc, const char **argv) { 144 auto ExpectedParser = CommonOptionsParser::create(argc, argv, MyToolCategory); 145 if (!ExpectedParser) { 146 // Fail gracefully for unsupported options. 147 llvm::errs() << ExpectedParser.takeError(); 148 return 1; 149 } 150 CommonOptionsParser& OptionsParser = ExpectedParser.get(); 151 ClangTool Tool(OptionsParser.getCompilations(), 152 OptionsParser.getSourcePathList()); 153 return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get()); 154 } 155 156 And that's it! You can compile our new tool by running ninja from the 157 ``build`` directory. 158 159 .. code-block:: console 160 161 cd ~/clang-llvm/build 162 ninja 163 164 You should now be able to run the syntax checker, which is located in 165 ``~/clang-llvm/build/bin``, on any source file. Try it! 166 167 .. code-block:: console 168 169 echo "int main() { return 0; }" > test.cpp 170 bin/loop-convert test.cpp -- 171 172 Note the two dashes after we specify the source file. The additional 173 options for the compiler are passed after the dashes rather than loading 174 them from a compilation database - there just aren't any options needed 175 right now. 176 177 Intermezzo: Learn AST matcher basics 178 ==================================== 179 180 Clang recently introduced the :doc:`ASTMatcher 181 library <LibASTMatchers>` to provide a simple, powerful, and 182 concise way to describe specific patterns in the AST. Implemented as a 183 DSL powered by macros and templates (see 184 `ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're 185 curious), matchers offer the feel of algebraic data types common to 186 functional programming languages. 187 188 For example, suppose you wanted to examine only binary operators. There 189 is a matcher to do exactly that, conveniently named ``binaryOperator``. 190 I'll give you one guess what this matcher does: 191 192 .. code-block:: c++ 193 194 binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0)))) 195 196 Shockingly, it will match against addition expressions whose left hand 197 side is exactly the literal 0. It will not match against other forms of 198 0, such as ``'\0'`` or ``NULL``, but it will match against macros that 199 expand to 0. The matcher will also not match against calls to the 200 overloaded operator ``'+'``, as there is a separate ``operatorCallExpr`` 201 matcher to handle overloaded operators. 202 203 There are AST matchers to match all the different nodes of the AST, 204 narrowing matchers to only match AST nodes fulfilling specific criteria, 205 and traversal matchers to get from one kind of AST node to another. For 206 a complete list of AST matchers, take a look at the `AST Matcher 207 References <LibASTMatchersReference.html>`_ 208 209 All matcher that are nouns describe entities in the AST and can be 210 bound, so that they can be referred to whenever a match is found. To do 211 so, simply call the method ``bind`` on these matchers, e.g.: 212 213 .. code-block:: c++ 214 215 variable(hasType(isInteger())).bind("intvar") 216 217 Step 2: Using AST matchers 218 ========================== 219 220 Okay, on to using matchers for real. Let's start by defining a matcher 221 which will capture all ``for`` statements that define a new variable 222 initialized to zero. Let's start with matching all ``for`` loops: 223 224 .. code-block:: c++ 225 226 forStmt() 227 228 Next, we want to specify that a single variable is declared in the first 229 portion of the loop, so we can extend the matcher to 230 231 .. code-block:: c++ 232 233 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl())))) 234 235 Finally, we can add the condition that the variable is initialized to 236 zero. 237 238 .. code-block:: c++ 239 240 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 241 hasInitializer(integerLiteral(equals(0)))))))) 242 243 It is fairly easy to read and understand the matcher definition ("match 244 loops whose init portion declares a single variable which is initialized 245 to the integer literal 0"), but deciding that every piece is necessary 246 is more difficult. Note that this matcher will not match loops whose 247 variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of 248 zero besides the integer 0. 249 250 The last step is giving the matcher a name and binding the ``ForStmt`` 251 as we will want to do something with it: 252 253 .. code-block:: c++ 254 255 StatementMatcher LoopMatcher = 256 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 257 hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); 258 259 Once you have defined your matchers, you will need to add a little more 260 scaffolding in order to run them. Matchers are paired with a 261 ``MatchCallback`` and registered with a ``MatchFinder`` object, then run 262 from a ``ClangTool``. More code! 263 264 Add the following to ``LoopConvert.cpp``: 265 266 .. code-block:: c++ 267 268 #include "clang/ASTMatchers/ASTMatchers.h" 269 #include "clang/ASTMatchers/ASTMatchFinder.h" 270 271 using namespace clang; 272 using namespace clang::ast_matchers; 273 274 StatementMatcher LoopMatcher = 275 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 276 hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); 277 278 class LoopPrinter : public MatchFinder::MatchCallback { 279 public : 280 virtual void run(const MatchFinder::MatchResult &Result) { 281 if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop")) 282 FS->dump(); 283 } 284 }; 285 286 And change ``main()`` to: 287 288 .. code-block:: c++ 289 290 int main(int argc, const char **argv) { 291 auto ExpectedParser = CommonOptionsParser::create(argc, argv, MyToolCategory); 292 if (!ExpectedParser) { 293 // Fail gracefully for unsupported options. 294 llvm::errs() << ExpectedParser.takeError(); 295 return 1; 296 } 297 CommonOptionsParser& OptionsParser = ExpectedParser.get(); 298 ClangTool Tool(OptionsParser.getCompilations(), 299 OptionsParser.getSourcePathList()); 300 301 LoopPrinter Printer; 302 MatchFinder Finder; 303 Finder.addMatcher(LoopMatcher, &Printer); 304 305 return Tool.run(newFrontendActionFactory(&Finder).get()); 306 } 307 308 Now, you should be able to recompile and run the code to discover for 309 loops. Create a new file with a few examples, and test out our new 310 handiwork: 311 312 .. code-block:: console 313 314 cd ~/clang-llvm/llvm/llvm_build/ 315 ninja loop-convert 316 vim ~/test-files/simple-loops.cc 317 bin/loop-convert ~/test-files/simple-loops.cc 318 319 Step 3.5: More Complicated Matchers 320 =================================== 321 322 Our simple matcher is capable of discovering for loops, but we would 323 still need to filter out many more ourselves. We can do a good portion 324 of the remaining work with some cleverly chosen matchers, but first we 325 need to decide exactly which properties we want to allow. 326 327 How can we characterize for loops over arrays which would be eligible 328 for translation to range-based syntax? Range based loops over arrays of 329 size ``N`` that: 330 331 - start at index ``0`` 332 - iterate consecutively 333 - end at index ``N-1`` 334 335 We already check for (1), so all we need to add is a check to the loop's 336 condition to ensure that the loop's index variable is compared against 337 ``N`` and another check to ensure that the increment step just 338 increments this same variable. The matcher for (2) is straightforward: 339 require a pre- or post-increment of the same variable declared in the 340 init portion. 341 342 Unfortunately, such a matcher is impossible to write. Matchers contain 343 no logic for comparing two arbitrary AST nodes and determining whether 344 or not they are equal, so the best we can do is matching more than we 345 would like to allow, and punting extra comparisons to the callback. 346 347 In any case, we can start building this sub-matcher. We can require that 348 the increment step be a unary increment like this: 349 350 .. code-block:: c++ 351 352 hasIncrement(unaryOperator(hasOperatorName("++"))) 353 354 Specifying what is incremented introduces another quirk of Clang's AST: 355 Usages of variables are represented as ``DeclRefExpr``'s ("declaration 356 reference expressions") because they are expressions which refer to 357 variable declarations. To find a ``unaryOperator`` that refers to a 358 specific declaration, we can simply add a second condition to it: 359 360 .. code-block:: c++ 361 362 hasIncrement(unaryOperator( 363 hasOperatorName("++"), 364 hasUnaryOperand(declRefExpr()))) 365 366 Furthermore, we can restrict our matcher to only match if the 367 incremented variable is an integer: 368 369 .. code-block:: c++ 370 371 hasIncrement(unaryOperator( 372 hasOperatorName("++"), 373 hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger()))))))) 374 375 And the last step will be to attach an identifier to this variable, so 376 that we can retrieve it in the callback: 377 378 .. code-block:: c++ 379 380 hasIncrement(unaryOperator( 381 hasOperatorName("++"), 382 hasUnaryOperand(declRefExpr(to( 383 varDecl(hasType(isInteger())).bind("incrementVariable")))))) 384 385 We can add this code to the definition of ``LoopMatcher`` and make sure 386 that our program, outfitted with the new matcher, only prints out loops 387 that declare a single variable initialized to zero and have an increment 388 step consisting of a unary increment of some variable. 389 390 Now, we just need to add a matcher to check if the condition part of the 391 ``for`` loop compares a variable against the size of the array. There is 392 only one problem - we don't know which array we're iterating over 393 without looking at the body of the loop! We are again restricted to 394 approximating the result we want with matchers, filling in the details 395 in the callback. So we start with: 396 397 .. code-block:: c++ 398 399 hasCondition(binaryOperator(hasOperatorName("<")) 400 401 It makes sense to ensure that the left-hand side is a reference to a 402 variable, and that the right-hand side has integer type. 403 404 .. code-block:: c++ 405 406 hasCondition(binaryOperator( 407 hasOperatorName("<"), 408 hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))), 409 hasRHS(expr(hasType(isInteger()))))) 410 411 Why? Because it doesn't work. Of the three loops provided in 412 ``test-files/simple.cpp``, zero of them have a matching condition. A 413 quick look at the AST dump of the first for loop, produced by the 414 previous iteration of loop-convert, shows us the answer: 415 416 :: 417 418 (ForStmt 0x173b240 419 (DeclStmt 0x173afc8 420 0x173af50 "int i = 421 (IntegerLiteral 0x173afa8 'int' 0)") 422 <<>> 423 (BinaryOperator 0x173b060 '_Bool' '<' 424 (ImplicitCastExpr 0x173b030 'int' 425 (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int')) 426 (ImplicitCastExpr 0x173b048 'int' 427 (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int'))) 428 (UnaryOperator 0x173b0b0 'int' lvalue prefix '++' 429 (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int')) 430 (CompoundStatement ... 431 432 We already know that the declaration and increments both match, or this 433 loop wouldn't have been dumped. The culprit lies in the implicit cast 434 applied to the first operand (i.e. the LHS) of the less-than operator, 435 an L-value to R-value conversion applied to the expression referencing 436 ``i``. Thankfully, the matcher library offers a solution to this problem 437 in the form of ``ignoringParenImpCasts``, which instructs the matcher to 438 ignore implicit casts and parentheses before continuing to match. 439 Adjusting the condition operator will restore the desired match. 440 441 .. code-block:: c++ 442 443 hasCondition(binaryOperator( 444 hasOperatorName("<"), 445 hasLHS(ignoringParenImpCasts(declRefExpr( 446 to(varDecl(hasType(isInteger())))))), 447 hasRHS(expr(hasType(isInteger()))))) 448 449 After adding binds to the expressions we wished to capture and 450 extracting the identifier strings into variables, we have array-step-2 451 completed. 452 453 Step 4: Retrieving Matched Nodes 454 ================================ 455 456 So far, the matcher callback isn't very interesting: it just dumps the 457 loop's AST. At some point, we will need to make changes to the input 458 source code. Next, we'll work on using the nodes we bound in the 459 previous step. 460 461 The ``MatchFinder::run()`` callback takes a 462 ``MatchFinder::MatchResult&`` as its parameter. We're most interested in 463 its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext`` 464 class to represent contextual information about the AST, as the name 465 implies, though the most functionally important detail is that several 466 operations require an ``ASTContext*`` parameter. More immediately useful 467 is the set of matched nodes, and how we retrieve them. 468 469 Since we bind three variables (identified by ConditionVarName, 470 InitVarName, and IncrementVarName), we can obtain the matched nodes by 471 using the ``getNodeAs()`` member function. 472 473 In ``LoopConvert.cpp`` add 474 475 .. code-block:: c++ 476 477 #include "clang/AST/ASTContext.h" 478 479 Change ``LoopMatcher`` to 480 481 .. code-block:: c++ 482 483 StatementMatcher LoopMatcher = 484 forStmt(hasLoopInit(declStmt( 485 hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0)))) 486 .bind("initVarName")))), 487 hasIncrement(unaryOperator( 488 hasOperatorName("++"), 489 hasUnaryOperand(declRefExpr( 490 to(varDecl(hasType(isInteger())).bind("incVarName")))))), 491 hasCondition(binaryOperator( 492 hasOperatorName("<"), 493 hasLHS(ignoringParenImpCasts(declRefExpr( 494 to(varDecl(hasType(isInteger())).bind("condVarName"))))), 495 hasRHS(expr(hasType(isInteger())))))).bind("forLoop"); 496 497 And change ``LoopPrinter::run`` to 498 499 .. code-block:: c++ 500 501 void LoopPrinter::run(const MatchFinder::MatchResult &Result) { 502 ASTContext *Context = Result.Context; 503 const ForStmt *FS = Result.Nodes.getNodeAs<ForStmt>("forLoop"); 504 // We do not want to convert header files! 505 if (!FS || !Context->getSourceManager().isWrittenInMainFile(FS->getForLoc())) 506 return; 507 const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName"); 508 const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName"); 509 const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName"); 510 511 if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar)) 512 return; 513 llvm::outs() << "Potential array-based loop discovered.\n"; 514 } 515 516 Clang associates a ``VarDecl`` with each variable to represent the variable's 517 declaration. Since the "canonical" form of each declaration is unique by 518 address, all we need to do is make sure neither ``ValueDecl`` (base class of 519 ``VarDecl``) is ``NULL`` and compare the canonical Decls. 520 521 .. code-block:: c++ 522 523 static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) { 524 return First && Second && 525 First->getCanonicalDecl() == Second->getCanonicalDecl(); 526 } 527 528 If execution reaches the end of ``LoopPrinter::run()``, we know that the 529 loop shell that looks like 530 531 .. code-block:: c++ 532 533 for (int i= 0; i < expr(); ++i) { ... } 534 535 For now, we will just print a message explaining that we found a loop. 536 The next section will deal with recursively traversing the AST to 537 discover all changes needed. 538 539 As a side note, it's not as trivial to test if two expressions are the same, 540 though Clang has already done the hard work for us by providing a way to 541 canonicalize expressions: 542 543 .. code-block:: c++ 544 545 static bool areSameExpr(ASTContext *Context, const Expr *First, 546 const Expr *Second) { 547 if (!First || !Second) 548 return false; 549 llvm::FoldingSetNodeID FirstID, SecondID; 550 First->Profile(FirstID, *Context, true); 551 Second->Profile(SecondID, *Context, true); 552 return FirstID == SecondID; 553 } 554 555 This code relies on the comparison between two 556 ``llvm::FoldingSetNodeIDs``. As the documentation for 557 ``Stmt::Profile()`` indicates, the ``Profile()`` member function builds 558 a description of a node in the AST, based on its properties, along with 559 those of its children. ``FoldingSetNodeID`` then serves as a hash we can 560 use to compare expressions. We will need ``areSameExpr`` later. Before 561 you run the new code on the additional loops added to 562 test-files/simple.cpp, try to figure out which ones will be considered 563 potentially convertible. 564