Home | History | Annotate | Line # | Download | only in docs
      1 ===============================================================
      2 Tutorial for building tools using LibTooling and LibASTMatchers
      3 ===============================================================
      4 
      5 This document is intended to show how to build a useful source-to-source
      6 translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is
      7 explicitly aimed at people who are new to Clang, so all you should need
      8 is a working knowledge of C++ and the command line.
      9 
     10 In order to work on the compiler, you need some basic knowledge of the
     11 abstract syntax tree (AST). To this end, the reader is encouraged to
     12 skim the :doc:`Introduction to the Clang
     13 AST <IntroductionToTheClangAST>`
     14 
     15 Step 0: Obtaining Clang
     16 =======================
     17 
     18 As Clang is part of the LLVM project, you'll need to download LLVM's
     19 source code first. Both Clang and LLVM are in the same git repository,
     20 under different directories. For further information, see the `getting
     21 started guide <https://llvm.org/docs/GettingStarted.html>`_.
     22 
     23 .. code-block:: console
     24 
     25       cd ~/clang-llvm
     26       git clone https://github.com/llvm/llvm-project.git
     27 
     28 Next you need to obtain the CMake build system and Ninja build tool.
     29 
     30 .. code-block:: console
     31 
     32       cd ~/clang-llvm
     33       git clone https://github.com/martine/ninja.git
     34       cd ninja
     35       git checkout release
     36       ./bootstrap.py
     37       sudo cp ninja /usr/bin/
     38 
     39       cd ~/clang-llvm
     40       git clone git://cmake.org/stage/cmake.git
     41       cd cmake
     42       git checkout next
     43       ./bootstrap
     44       make
     45       sudo make install
     46 
     47 Okay. Now we'll build Clang!
     48 
     49 .. code-block:: console
     50 
     51       cd ~/clang-llvm
     52       mkdir build && cd build
     53       cmake -G Ninja ../llvm -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_BUILD_TESTS=ON  # Enable tests; default is off.
     54       ninja
     55       ninja check       # Test LLVM only.
     56       ninja clang-test  # Test Clang only.
     57       ninja install
     58 
     59 And we're live.
     60 
     61 All of the tests should pass.
     62 
     63 Finally, we want to set Clang as its own compiler.
     64 
     65 .. code-block:: console
     66 
     67       cd ~/clang-llvm/build
     68       ccmake ../llvm
     69 
     70 The second command will bring up a GUI for configuring Clang. You need
     71 to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on
     72 advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to
     73 ``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to
     74 configure, then ``'g'`` to generate CMake's files.
     75 
     76 Finally, run ninja one last time, and you're done.
     77 
     78 Step 1: Create a ClangTool
     79 ==========================
     80 
     81 Now that we have enough background knowledge, it's time to create the
     82 simplest productive ClangTool in existence: a syntax checker. While this
     83 already exists as ``clang-check``, it's important to understand what's
     84 going on.
     85 
     86 First, we'll need to create a new directory for our tool and tell CMake
     87 that it exists. As this is not going to be a core clang tool, it will
     88 live in the ``clang-tools-extra`` repository.
     89 
     90 .. code-block:: console
     91 
     92       cd ~/clang-llvm
     93       mkdir clang-tools-extra/loop-convert
     94       echo 'add_subdirectory(loop-convert)' >> clang-tools-extra/CMakeLists.txt
     95       vim clang-tools-extra/loop-convert/CMakeLists.txt
     96 
     97 CMakeLists.txt should have the following contents:
     98 
     99 ::
    100 
    101       set(LLVM_LINK_COMPONENTS support)
    102 
    103       add_clang_executable(loop-convert
    104         LoopConvert.cpp
    105         )
    106       target_link_libraries(loop-convert
    107         PRIVATE
    108         clangTooling
    109         clangBasic
    110         clangASTMatchers
    111         )
    112 
    113 With that done, Ninja will be able to compile our tool. Let's give it
    114 something to compile! Put the following into
    115 ``clang-tools-extra/loop-convert/LoopConvert.cpp``. A detailed explanation of
    116 why the different parts are needed can be found in the `LibTooling
    117 documentation <LibTooling.html>`_.
    118 
    119 .. code-block:: c++
    120 
    121       // Declares clang::SyntaxOnlyAction.
    122       #include "clang/Frontend/FrontendActions.h"
    123       #include "clang/Tooling/CommonOptionsParser.h"
    124       #include "clang/Tooling/Tooling.h"
    125       // Declares llvm::cl::extrahelp.
    126       #include "llvm/Support/CommandLine.h"
    127 
    128       using namespace clang::tooling;
    129       using namespace llvm;
    130 
    131       // Apply a custom category to all command-line options so that they are the
    132       // only ones displayed.
    133       static llvm::cl::OptionCategory MyToolCategory("my-tool options");
    134 
    135       // CommonOptionsParser declares HelpMessage with a description of the common
    136       // command-line options related to the compilation database and input files.
    137       // It's nice to have this help message in all tools.
    138       static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage);
    139 
    140       // A help message for this specific tool can be added afterwards.
    141       static cl::extrahelp MoreHelp("\nMore help text...\n");
    142 
    143       int main(int argc, const char **argv) {
    144         auto ExpectedParser = CommonOptionsParser::create(argc, argv, MyToolCategory);
    145         if (!ExpectedParser) {
    146           // Fail gracefully for unsupported options.
    147           llvm::errs() << ExpectedParser.takeError();
    148           return 1;
    149         }
    150         CommonOptionsParser& OptionsParser = ExpectedParser.get();
    151         ClangTool Tool(OptionsParser.getCompilations(),
    152                        OptionsParser.getSourcePathList());
    153         return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get());
    154       }
    155 
    156 And that's it! You can compile our new tool by running ninja from the
    157 ``build`` directory.
    158 
    159 .. code-block:: console
    160 
    161       cd ~/clang-llvm/build
    162       ninja
    163 
    164 You should now be able to run the syntax checker, which is located in
    165 ``~/clang-llvm/build/bin``, on any source file. Try it!
    166 
    167 .. code-block:: console
    168 
    169       echo "int main() { return 0; }" > test.cpp
    170       bin/loop-convert test.cpp --
    171 
    172 Note the two dashes after we specify the source file. The additional
    173 options for the compiler are passed after the dashes rather than loading
    174 them from a compilation database - there just aren't any options needed
    175 right now.
    176 
    177 Intermezzo: Learn AST matcher basics
    178 ====================================
    179 
    180 Clang recently introduced the :doc:`ASTMatcher
    181 library <LibASTMatchers>` to provide a simple, powerful, and
    182 concise way to describe specific patterns in the AST. Implemented as a
    183 DSL powered by macros and templates (see
    184 `ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're
    185 curious), matchers offer the feel of algebraic data types common to
    186 functional programming languages.
    187 
    188 For example, suppose you wanted to examine only binary operators. There
    189 is a matcher to do exactly that, conveniently named ``binaryOperator``.
    190 I'll give you one guess what this matcher does:
    191 
    192 .. code-block:: c++
    193 
    194       binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0))))
    195 
    196 Shockingly, it will match against addition expressions whose left hand
    197 side is exactly the literal 0. It will not match against other forms of
    198 0, such as ``'\0'`` or ``NULL``, but it will match against macros that
    199 expand to 0. The matcher will also not match against calls to the
    200 overloaded operator ``'+'``, as there is a separate ``operatorCallExpr``
    201 matcher to handle overloaded operators.
    202 
    203 There are AST matchers to match all the different nodes of the AST,
    204 narrowing matchers to only match AST nodes fulfilling specific criteria,
    205 and traversal matchers to get from one kind of AST node to another. For
    206 a complete list of AST matchers, take a look at the `AST Matcher
    207 References <LibASTMatchersReference.html>`_
    208 
    209 All matcher that are nouns describe entities in the AST and can be
    210 bound, so that they can be referred to whenever a match is found. To do
    211 so, simply call the method ``bind`` on these matchers, e.g.:
    212 
    213 .. code-block:: c++
    214 
    215       variable(hasType(isInteger())).bind("intvar")
    216 
    217 Step 2: Using AST matchers
    218 ==========================
    219 
    220 Okay, on to using matchers for real. Let's start by defining a matcher
    221 which will capture all ``for`` statements that define a new variable
    222 initialized to zero. Let's start with matching all ``for`` loops:
    223 
    224 .. code-block:: c++
    225 
    226       forStmt()
    227 
    228 Next, we want to specify that a single variable is declared in the first
    229 portion of the loop, so we can extend the matcher to
    230 
    231 .. code-block:: c++
    232 
    233       forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl()))))
    234 
    235 Finally, we can add the condition that the variable is initialized to
    236 zero.
    237 
    238 .. code-block:: c++
    239 
    240       forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    241         hasInitializer(integerLiteral(equals(0))))))))
    242 
    243 It is fairly easy to read and understand the matcher definition ("match
    244 loops whose init portion declares a single variable which is initialized
    245 to the integer literal 0"), but deciding that every piece is necessary
    246 is more difficult. Note that this matcher will not match loops whose
    247 variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of
    248 zero besides the integer 0.
    249 
    250 The last step is giving the matcher a name and binding the ``ForStmt``
    251 as we will want to do something with it:
    252 
    253 .. code-block:: c++
    254 
    255       StatementMatcher LoopMatcher =
    256         forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    257           hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
    258 
    259 Once you have defined your matchers, you will need to add a little more
    260 scaffolding in order to run them. Matchers are paired with a
    261 ``MatchCallback`` and registered with a ``MatchFinder`` object, then run
    262 from a ``ClangTool``. More code!
    263 
    264 Add the following to ``LoopConvert.cpp``:
    265 
    266 .. code-block:: c++
    267 
    268       #include "clang/ASTMatchers/ASTMatchers.h"
    269       #include "clang/ASTMatchers/ASTMatchFinder.h"
    270 
    271       using namespace clang;
    272       using namespace clang::ast_matchers;
    273 
    274       StatementMatcher LoopMatcher =
    275         forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
    276           hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");
    277 
    278       class LoopPrinter : public MatchFinder::MatchCallback {
    279       public :
    280         virtual void run(const MatchFinder::MatchResult &Result) {
    281           if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop"))
    282             FS->dump();
    283         }
    284       };
    285 
    286 And change ``main()`` to:
    287 
    288 .. code-block:: c++
    289 
    290       int main(int argc, const char **argv) {
    291         auto ExpectedParser = CommonOptionsParser::create(argc, argv, MyToolCategory);
    292         if (!ExpectedParser) {
    293           // Fail gracefully for unsupported options.
    294           llvm::errs() << ExpectedParser.takeError();
    295           return 1;
    296         }
    297         CommonOptionsParser& OptionsParser = ExpectedParser.get();
    298         ClangTool Tool(OptionsParser.getCompilations(),
    299                        OptionsParser.getSourcePathList());
    300 
    301         LoopPrinter Printer;
    302         MatchFinder Finder;
    303         Finder.addMatcher(LoopMatcher, &Printer);
    304 
    305         return Tool.run(newFrontendActionFactory(&Finder).get());
    306       }
    307 
    308 Now, you should be able to recompile and run the code to discover for
    309 loops. Create a new file with a few examples, and test out our new
    310 handiwork:
    311 
    312 .. code-block:: console
    313 
    314       cd ~/clang-llvm/llvm/llvm_build/
    315       ninja loop-convert
    316       vim ~/test-files/simple-loops.cc
    317       bin/loop-convert ~/test-files/simple-loops.cc
    318 
    319 Step 3.5: More Complicated Matchers
    320 ===================================
    321 
    322 Our simple matcher is capable of discovering for loops, but we would
    323 still need to filter out many more ourselves. We can do a good portion
    324 of the remaining work with some cleverly chosen matchers, but first we
    325 need to decide exactly which properties we want to allow.
    326 
    327 How can we characterize for loops over arrays which would be eligible
    328 for translation to range-based syntax? Range based loops over arrays of
    329 size ``N`` that:
    330 
    331 -  start at index ``0``
    332 -  iterate consecutively
    333 -  end at index ``N-1``
    334 
    335 We already check for (1), so all we need to add is a check to the loop's
    336 condition to ensure that the loop's index variable is compared against
    337 ``N`` and another check to ensure that the increment step just
    338 increments this same variable. The matcher for (2) is straightforward:
    339 require a pre- or post-increment of the same variable declared in the
    340 init portion.
    341 
    342 Unfortunately, such a matcher is impossible to write. Matchers contain
    343 no logic for comparing two arbitrary AST nodes and determining whether
    344 or not they are equal, so the best we can do is matching more than we
    345 would like to allow, and punting extra comparisons to the callback.
    346 
    347 In any case, we can start building this sub-matcher. We can require that
    348 the increment step be a unary increment like this:
    349 
    350 .. code-block:: c++
    351 
    352       hasIncrement(unaryOperator(hasOperatorName("++")))
    353 
    354 Specifying what is incremented introduces another quirk of Clang's AST:
    355 Usages of variables are represented as ``DeclRefExpr``'s ("declaration
    356 reference expressions") because they are expressions which refer to
    357 variable declarations. To find a ``unaryOperator`` that refers to a
    358 specific declaration, we can simply add a second condition to it:
    359 
    360 .. code-block:: c++
    361 
    362       hasIncrement(unaryOperator(
    363         hasOperatorName("++"),
    364         hasUnaryOperand(declRefExpr())))
    365 
    366 Furthermore, we can restrict our matcher to only match if the
    367 incremented variable is an integer:
    368 
    369 .. code-block:: c++
    370 
    371       hasIncrement(unaryOperator(
    372         hasOperatorName("++"),
    373         hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger())))))))
    374 
    375 And the last step will be to attach an identifier to this variable, so
    376 that we can retrieve it in the callback:
    377 
    378 .. code-block:: c++
    379 
    380       hasIncrement(unaryOperator(
    381         hasOperatorName("++"),
    382         hasUnaryOperand(declRefExpr(to(
    383           varDecl(hasType(isInteger())).bind("incrementVariable"))))))
    384 
    385 We can add this code to the definition of ``LoopMatcher`` and make sure
    386 that our program, outfitted with the new matcher, only prints out loops
    387 that declare a single variable initialized to zero and have an increment
    388 step consisting of a unary increment of some variable.
    389 
    390 Now, we just need to add a matcher to check if the condition part of the
    391 ``for`` loop compares a variable against the size of the array. There is
    392 only one problem - we don't know which array we're iterating over
    393 without looking at the body of the loop! We are again restricted to
    394 approximating the result we want with matchers, filling in the details
    395 in the callback. So we start with:
    396 
    397 .. code-block:: c++
    398 
    399       hasCondition(binaryOperator(hasOperatorName("<"))
    400 
    401 It makes sense to ensure that the left-hand side is a reference to a
    402 variable, and that the right-hand side has integer type.
    403 
    404 .. code-block:: c++
    405 
    406       hasCondition(binaryOperator(
    407         hasOperatorName("<"),
    408         hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))),
    409         hasRHS(expr(hasType(isInteger())))))
    410 
    411 Why? Because it doesn't work. Of the three loops provided in
    412 ``test-files/simple.cpp``, zero of them have a matching condition. A
    413 quick look at the AST dump of the first for loop, produced by the
    414 previous iteration of loop-convert, shows us the answer:
    415 
    416 ::
    417 
    418       (ForStmt 0x173b240
    419         (DeclStmt 0x173afc8
    420           0x173af50 "int i =
    421             (IntegerLiteral 0x173afa8 'int' 0)")
    422         <<>>
    423         (BinaryOperator 0x173b060 '_Bool' '<'
    424           (ImplicitCastExpr 0x173b030 'int'
    425             (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int'))
    426           (ImplicitCastExpr 0x173b048 'int'
    427             (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int')))
    428         (UnaryOperator 0x173b0b0 'int' lvalue prefix '++'
    429           (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int'))
    430         (CompoundStatement ...
    431 
    432 We already know that the declaration and increments both match, or this
    433 loop wouldn't have been dumped. The culprit lies in the implicit cast
    434 applied to the first operand (i.e. the LHS) of the less-than operator,
    435 an L-value to R-value conversion applied to the expression referencing
    436 ``i``. Thankfully, the matcher library offers a solution to this problem
    437 in the form of ``ignoringParenImpCasts``, which instructs the matcher to
    438 ignore implicit casts and parentheses before continuing to match.
    439 Adjusting the condition operator will restore the desired match.
    440 
    441 .. code-block:: c++
    442 
    443       hasCondition(binaryOperator(
    444         hasOperatorName("<"),
    445         hasLHS(ignoringParenImpCasts(declRefExpr(
    446           to(varDecl(hasType(isInteger())))))),
    447         hasRHS(expr(hasType(isInteger())))))
    448 
    449 After adding binds to the expressions we wished to capture and
    450 extracting the identifier strings into variables, we have array-step-2
    451 completed.
    452 
    453 Step 4: Retrieving Matched Nodes
    454 ================================
    455 
    456 So far, the matcher callback isn't very interesting: it just dumps the
    457 loop's AST. At some point, we will need to make changes to the input
    458 source code. Next, we'll work on using the nodes we bound in the
    459 previous step.
    460 
    461 The ``MatchFinder::run()`` callback takes a
    462 ``MatchFinder::MatchResult&`` as its parameter. We're most interested in
    463 its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext``
    464 class to represent contextual information about the AST, as the name
    465 implies, though the most functionally important detail is that several
    466 operations require an ``ASTContext*`` parameter. More immediately useful
    467 is the set of matched nodes, and how we retrieve them.
    468 
    469 Since we bind three variables (identified by ConditionVarName,
    470 InitVarName, and IncrementVarName), we can obtain the matched nodes by
    471 using the ``getNodeAs()`` member function.
    472 
    473 In ``LoopConvert.cpp`` add
    474 
    475 .. code-block:: c++
    476 
    477       #include "clang/AST/ASTContext.h"
    478 
    479 Change ``LoopMatcher`` to
    480 
    481 .. code-block:: c++
    482 
    483       StatementMatcher LoopMatcher =
    484           forStmt(hasLoopInit(declStmt(
    485                       hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0))))
    486                                         .bind("initVarName")))),
    487                   hasIncrement(unaryOperator(
    488                       hasOperatorName("++"),
    489                       hasUnaryOperand(declRefExpr(
    490                           to(varDecl(hasType(isInteger())).bind("incVarName")))))),
    491                   hasCondition(binaryOperator(
    492                       hasOperatorName("<"),
    493                       hasLHS(ignoringParenImpCasts(declRefExpr(
    494                           to(varDecl(hasType(isInteger())).bind("condVarName"))))),
    495                       hasRHS(expr(hasType(isInteger())))))).bind("forLoop");
    496 
    497 And change ``LoopPrinter::run`` to
    498 
    499 .. code-block:: c++
    500 
    501       void LoopPrinter::run(const MatchFinder::MatchResult &Result) {
    502         ASTContext *Context = Result.Context;
    503         const ForStmt *FS = Result.Nodes.getNodeAs<ForStmt>("forLoop");
    504         // We do not want to convert header files!
    505         if (!FS || !Context->getSourceManager().isWrittenInMainFile(FS->getForLoc()))
    506           return;
    507         const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName");
    508         const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName");
    509         const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName");
    510 
    511         if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar))
    512           return;
    513         llvm::outs() << "Potential array-based loop discovered.\n";
    514       }
    515 
    516 Clang associates a ``VarDecl`` with each variable to represent the variable's
    517 declaration. Since the "canonical" form of each declaration is unique by
    518 address, all we need to do is make sure neither ``ValueDecl`` (base class of
    519 ``VarDecl``) is ``NULL`` and compare the canonical Decls.
    520 
    521 .. code-block:: c++
    522 
    523       static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) {
    524         return First && Second &&
    525                First->getCanonicalDecl() == Second->getCanonicalDecl();
    526       }
    527 
    528 If execution reaches the end of ``LoopPrinter::run()``, we know that the
    529 loop shell that looks like
    530 
    531 .. code-block:: c++
    532 
    533       for (int i= 0; i < expr(); ++i) { ... }
    534 
    535 For now, we will just print a message explaining that we found a loop.
    536 The next section will deal with recursively traversing the AST to
    537 discover all changes needed.
    538 
    539 As a side note, it's not as trivial to test if two expressions are the same,
    540 though Clang has already done the hard work for us by providing a way to
    541 canonicalize expressions:
    542 
    543 .. code-block:: c++
    544 
    545       static bool areSameExpr(ASTContext *Context, const Expr *First,
    546                               const Expr *Second) {
    547         if (!First || !Second)
    548           return false;
    549         llvm::FoldingSetNodeID FirstID, SecondID;
    550         First->Profile(FirstID, *Context, true);
    551         Second->Profile(SecondID, *Context, true);
    552         return FirstID == SecondID;
    553       }
    554 
    555 This code relies on the comparison between two
    556 ``llvm::FoldingSetNodeIDs``. As the documentation for
    557 ``Stmt::Profile()`` indicates, the ``Profile()`` member function builds
    558 a description of a node in the AST, based on its properties, along with
    559 those of its children. ``FoldingSetNodeID`` then serves as a hash we can
    560 use to compare expressions. We will need ``areSameExpr`` later. Before
    561 you run the new code on the additional loops added to
    562 test-files/simple.cpp, try to figure out which ones will be considered
    563 potentially convertible.
    564