Home | History | Annotate | Line # | Download | only in doc
      1 \input texinfo @c -*-texinfo-*-
      2 
      3 @c ----------------------------------------------------------------------------
      4 @c This is the Texinfo source file for the GPROFNG manual.  This manual
      5 @c includes the man pages for the various tools.
      6 @c
      7 @c Author: Ruud van der Pas
      8 @c ----------------------------------------------------------------------------
      9 
     10 @c %**start of header
     11 
     12 @setfilename gprofng.info
     13 @settitle GNU gprofng
     14 
     15 @c -- Set the indent for the @example command to 1 space, not 5 ---------------
     16 @exampleindent 1
     17 
     18 @paragraphindent 3
     19 
     20 @c %**end of header
     21 
     22 @c -- Start a new chapter on a new, odd numbered, page ------------------------
     23 @setchapternewpage odd
     24 
     25 @c -- Merge all index entries into the Concepts Index -------------------------
     26 @syncodeindex fn cp
     27 @syncodeindex ky cp
     28 @syncodeindex pg cp
     29 @syncodeindex vr cp
     30 
     31 @c -- Macros specific to gprofng ----------------------------------------------
     32 @include gp-macros.texi
     33 
     34 @c -- Get the version information ---------------------------------------------
     35 @include version.texi
     36 
     37 @c -- Entry for the Info dir structure ----------------------------------------
     38 @ifnottex
     39 @dircategory Software development
     40 @direntry
     41 * gprofng: (gprofng).                    The next generation profiling tool for Linux
     42 @end direntry
     43 @end ifnottex
     44 
     45 @c -- Copyright stuff ---------------------------------------------------------
     46 @copying
     47 This document is the manual for @ProductName{}, last updated @value{UPDATED}.
     48 
     49 Copyright @copyright{} 2022-2025 Free Software Foundation, Inc.
     50 
     51 @c -- @quotation
     52 Permission is granted to copy, distribute and/or modify this document
     53 under the terms of the GNU Free Documentation License,
     54 Version 1.3 or any later version published by the Free Software
     55 Foundation; with no Invariant Sections, with no Front-Cover texts,
     56 and with no Back-Cover Texts.  A copy of the license is included in the
     57 section entitled ``GNU Free Documentation License.''
     58 
     59 @c -- @end quotation
     60 @end copying
     61 
     62 @finalout
     63 @smallbook
     64 
     65 @c -- Define the title page ---------------------------------------------------
     66 @titlepage
     67 @title GNU gprofng
     68 @subtitle The next generation profiling tool for Linux
     69 @subtitle version @value{VERSION} (last updated @value{UPDATED})
     70 @author Ruud van der Pas
     71 @page
     72 @vskip 0pt plus 1filll
     73 @insertcopying
     74 @end titlepage
     75 
     76 @c -- Generate the Table of Contents ------------------------------------------
     77 @contents
     78 
     79 @c -- The Top node ------------------------------------------------------------
     80 @c Should contain a short summary, copying permissions and a master menu.
     81 @c ----------------------------------------------------------------------------
     82 @ifnottex
     83 @node Top
     84 @top  GNU Gprofng
     85 
     86 @insertcopying
     87 @end ifnottex
     88 
     89 @ifinfo
     90 @c -- The menu entries --------------------------------------------------------
     91 
     92 @c * Display Source Code::             Display the source code and disassembly.
     93 @c * Archive Experiment Data::         Archive an experiment.
     94 
     95 @menu
     96 * Introduction::                      About this manual.
     97 * Overview::                          A brief overview of @ProductName{}.
     98 * A Mini Tutorial::                   A short tutorial with the key features.
     99 * The gprofng Tools::                 An overview of the tools supported.
    100 * Performance Data Collection::       Record the performance information.
    101 * View the Performance Information::  Different ways to view the data.
    102 * Terminology::                       Concepts and terminology explained.
    103 * Other Document Formats::            Create this document in other formats.
    104 * The gprofng Man Pages::             The gprofng man pages.
    105 * Index::                             The index.
    106 
    107 @detailmenu
    108 
    109 --- The Detailed Node Listing ---
    110 
    111 Introduction
    112 
    113 Overview
    114 
    115 * Main Features::                     A high level overview.
    116 * Sampling versus Tracing::           The pros and cons of both approaches.
    117 * Steps Needed to Create a Profile::  How to create a profile.
    118 
    119 A Mini Tutorial
    120 
    121 * Getting Started::                 The basics of profiling with gprofng.
    122 * Support for Multithreading::      Commands for multithreaded applications.
    123 * View Multiple Experiments::       Analyze multiple experiments. 
    124 * Profile Hardware Event Counters:: How to use hardware event counters.
    125 * Java Profiling::                  How to profile a Java application.
    126 
    127 The gprofng Tools
    128 
    129 * Tools Overview::                            A brief description of the tools.
    130 * The gprofng.rc file with default settings:: Customize the settings.
    131 * Filters::                                   Filters.
    132 * Supported Environment Variables::           The supported environment variables.
    133 
    134 Performance Data Collection
    135 
    136 * The @command{gprofng collect app} Tool::    Collect application performance data.
    137 
    138 View the Performance Information
    139 
    140 * The @command{gprofng display text} Tool::   View the performance data in plain text.
    141 
    142 Terminology
    143 
    144 * The Program Counter::                    What is a Program Counter?
    145 * Inclusive and Exclusive Metrics::        An explanation of inclusive and exclusive metrics.
    146 * Metric Definitions::                     Definitions associated with metrics.
    147 * The Viewmode::                           Select the way call stacks are presented.
    148 * The Selection List::                     How to define a selection.
    149 * Load Objects and Functions::             The components in an application.
    150 * The Concept of a CPU in gprofng::        The definition of a CPU.
    151 * Hardware Event Counters Explained::      What are event counters?
    152 * apath::                                  Our generic definition of a path.
    153 
    154 The gprofng Man Pages
    155 
    156 * Man page for @command{gprofng collect app}::   The man page for gprofng collect app.
    157 * Man page for @command{gprofng display text}::  The man page for gprofng display text.
    158 * Man page for @command{gprofng display html}::  The man page for gprofng display html.
    159 * Man page for @command{gprofng display src}::   The man page for gprofng display src.
    160 * Man page for @command{gprofng archive}::       The man page for gprofng archive.
    161 
    162 @c -- Index
    163 
    164 @end detailmenu
    165 @end menu
    166 @end ifinfo
    167 
    168 @c -- A new node --------------------------------------------------------------
    169 @node    Introduction
    170 @chapter Introduction
    171 @c ----------------------------------------------------------------------------
    172 The @ProductName{} tool is the next generation profiler for Linux.  It consists
    173 of various commands to generate and display profile information.
    174 
    175 This manual starts with a tutorial how to create and interpret a profile.  This
    176 part is highly practical and has the goal to get users up to speed as quickly
    177 as possible.  As soon as possible, we would like to show you how to get your
    178 first profile on your screen.
    179 
    180 This is followed by more examples, covering many of the features.  At the
    181 end of this tutorial, you should feel confident enough to tackle the more
    182 complex tasks.
    183 
    184 In a future update a more formal reference manual will be included as well.
    185 Since even in this tutorial we use certain terminology, we have included a
    186 chapter with descriptions at the end.  In case you encounter unfamiliar
    187 wordings or terminology, please check this chapter.
    188 
    189 One word of caution.  In several cases we had to somewhat tweak the screen
    190 output in order to make it fit.  This is why the output may look somewhat
    191 different when you try things yourself.
    192 
    193 For now, we wish you a smooth profiling experience with @ProductName{} and
    194 good luck tackling performance bottlenecks.
    195 
    196 @c -- A new node --------------------------------------------------------------
    197 @c cccccc @node    A Brief Overview of @ProductName{}
    198 @node    Overview
    199 @chapter A Brief Overview of gprofng
    200 @c ----------------------------------------------------------------------------
    201 
    202 @menu
    203 * Main Features::                     A high level overview.
    204 * Sampling versus Tracing::           The pros and cons of sampling versus tracing.
    205 * Steps Needed to Create a Profile::  How to create a profile.
    206 @end menu
    207 
    208 Before we cover this tool in quite some detail, we start with a brief overview
    209 of what it is, and the main features.  Since we know that many of you would
    210 like to get started rightaway, already in this first chapter we explain the
    211 basics of profiling with @ToolName{}.
    212 
    213 @c ----------------------------------------------------------------------------
    214 @c TBD Review this text.  Probably be more specific on the gcc releases and
    215 @c processor specifics.
    216 @c ----------------------------------------------------------------------------
    217 
    218 @c -- A new node --------------------------------------------------------------
    219 @node    Main Features
    220 @section Main Features
    221 @c ----------------------------------------------------------------------------
    222 
    223 @noindent
    224 These are the main features of the @ProductName{} tool:
    225 
    226 @itemize @bullet
    227 
    228 @item
    229 Profiling is supported for an application written in C, C++, Java, or Scala.
    230 
    231 @c TBD Java: up to 1.8 full support, support other than for modules
    232 
    233 @item
    234 Shared libraries are supported.  The information is presented at the
    235 instruction level.
    236 
    237 @item
    238 The following multithreading programming models are supported: Pthreads,
    239 OpenMP, and Java threads.
    240 
    241 @item
    242 This tool works with unmodified production level executables.  There is no need
    243 to recompile the code, but if the @samp{-g} option has been used when building
    244 the application, source line level information is available.
    245 
    246 @item
    247 The focus is on support for code generated with the @command{gcc} compiler, but
    248 there is some limited support for the @command{icc} compiler as well.  Future
    249 improvements and enhancements will focus on @command{gcc} though.
    250 
    251 @item
    252 Processors from Intel, AMD, and Arm are supported, but the level of support
    253 depends on the architectural details.  In particular, hardware event counters
    254 may not be supported.  If this is the case, all views not related to these
    255 counters still ought to work though.
    256 
    257 @item
    258 Several views into the data are supported.  For example, a function overview
    259 where the time is spent, but also a source line, disassembly, call tree and
    260 a caller-callees overview are available.
    261 
    262 @item
    263 Through filters, the user can zoom in on an area of interest.
    264 
    265 @item
    266 Two or more profiles can be aggregated, or used in a comparison.  This
    267 comparison can be obtained at the function, source line, and disassembly
    268 level.
    269 
    270 @item
    271 Through a simple scripting language, and customization of the metrics shown,
    272 the generation and creation of a profile can be fully automated and provide
    273 tailored output.
    274 
    275 @end itemize
    276 
    277 @c -- A new node --------------------------------------------------------------
    278 @node    Sampling versus Tracing
    279 @section Sampling versus Tracing
    280 @c ----------------------------------------------------------------------------
    281 
    282 A key difference with some other profiling tools is that the main data
    283 collection command @CollectApp{} mostly uses
    284 @cindex Program Counter sampling
    285 @cindex PC sampling
    286 Program Counter (PC) sampling
    287 under the hood.
    288 
    289 With @emph{sampling}, the executable is interrupted at regular intervals.  Each
    290 time it is halted, key information is gathered and stored.  This includes the
    291 Program Counter that keeps track of where the execution is.  Hence the name.
    292 
    293 Together with operational data, this information is stored in the experiment
    294 directory and can be viewed in the second phase.
    295 
    296 For example, the PC information is used to derive where the program was when
    297 it was halted.  Since the sampling interval is known, it is relatively easy to
    298 derive how much time was spent in the various parts of the program.
    299 
    300 The opposite technique is generally referred to as @emph{tracing}.  With
    301 tracing, the target is instrumented with specific calls that collect the
    302 requested information.
    303 
    304 These are some of the pros and cons of PC sampling verus tracing:
    305 
    306 @itemize
    307 
    308 @item
    309 Since there is no need to recompile, existing executables can be used
    310 and the profile measures the behaviour of exactly the same executable that is
    311 used in production runs.
    312 
    313 With tracing, one inherently profiles a different executable, because
    314 the calls to the instrumentation library may affect runtime behaviour.
    315 
    316 @item
    317 With sampling, there are very few restrictions on what can be profiled and
    318 even without access to the source code, a basic profile can be made.
    319 
    320 @item
    321 A downside of sampling is that, depending on the sampling frequency, small
    322 functions may be missed or not captured accurately.  Although this is rare,
    323 this may happen and is the reason why the user has control over the sampling
    324 rate.
    325 
    326 @item
    327 While tracing produces precise information, sampling is statistical in nature.
    328 As a result, small variations may occur across seemingly identical runs.  We
    329 have not observed more than a few percent deviation though.  Especially if
    330 the target job executed for a sufficiently long time.
    331 
    332 @item
    333 With sampling, it is not possible to get an accurate count how often
    334 functions are called.
    335 
    336 @end itemize
    337 
    338 @c -- A new node --------------------------------------------------------------
    339 @node    Steps Needed to Create a Profile
    340 @section Steps Needed to Create a Profile
    341 @c ----------------------------------------------------------------------------
    342 
    343 Creating a profile takes two steps.  First the profile data needs to be
    344 generated.  This is followed by a viewing step to create a report from the
    345 information that has been gathered.
    346 
    347 Every @ProductName{} command starts with @ToolName{}, the name of the driver.
    348 This is followed by a keyword to define the high level functionality.
    349 Depending on this keyword, a third qualifier may be needed to further narrow
    350 down the request.
    351 This combination is then followed by options that are specific to the
    352 functionality desired.
    353 
    354 The command to gather, or ``collect'', the performance data is called
    355 @CollectApp{}.  Aside from numerous options, this command takes the name
    356 of the target executable as an input parameter.
    357 
    358 Upon completion of the run, the performance data can be
    359 found in the newly created
    360 @cindex Experiment directory
    361 experiment directory.
    362 
    363 Unless explicitly specified otherwise, a default
    364 name for this directory is chosen.  The name is @file{test.<n>.er} where
    365 @var{<n>} is the first integer number not in use yet for such a name.
    366 
    367 For example, the first time @CollectApp{} is invoked, an experiment
    368 directory with the name @file{test.1.er} is created.
    369 Upon a subsequent invocation of @CollectApp{} in the same directory,
    370 an experiment directory with the name @file{test.2.er} will be created,
    371 and so forth.
    372 
    373 Note that @CollectApp{} supports an option to explicitly name the experiment
    374 directory.
    375 Aside from the restriction that the name of this directory has to end
    376 with @samp{.er}, any valid directory name can be used for this.
    377 
    378 Now that we have the performance data, the next step is to display it.
    379 
    380 @IndexSubentry{@code{gprofng}, @code{display text}}
    381 The most commonly used command to view the performance information is
    382 @DisplayText{}.   This is a very extensive and customizable tool that
    383 produces the information in ASCII format.
    384 
    385 @IndexSubentry{@code{gprofng}, @code{display html}}
    386 Another option is to use @DisplayHTML{}.  This tool generates a directory with
    387 files in html format.  These can be viewed in a browser, allowing for easy
    388 navigation through the profile data.
    389 
    390 @c -- A new node --------------------------------------------------------------
    391 @node    A Mini Tutorial
    392 @chapter A Mini Tutorial
    393 @c ----------------------------------------------------------------------------
    394 
    395 In this chapter we present and discuss the main functionality of @ToolName{}.
    396 This will be a practical approach, using an example code to generate profile
    397 data and show how to get various performance reports.
    398 
    399 @menu
    400 * Getting Started::                 The basics of profiling with @ProductName().
    401 * Support for Multithreading::      Commands specific to multithreaded applications.
    402 * View Multiple Experiments::       Analyze multiple experiments simultaneously.
    403 * Profile Hardware Event Counters:: How to use hardware event counters.
    404 * Java Profiling::                  How to profile a Java application.
    405 @end menu
    406 
    407 @c -- A new node --------------------------------------------------------------
    408 @node    Getting Started
    409 @section Getting Started
    410 @c ----------------------------------------------------------------------------
    411 
    412 The information presented here provides a good and common basis for many
    413 profiling tasks, but there are more features that you may want to leverage.
    414 
    415 These are covered in subsequent sections in this chapter.
    416 
    417 @menu
    418 * The Example Program::                        A description of the example program used.
    419 * A First Profile::                            How to get the first profile.
    420 * The Source Code View::                       Display the metrics in the source code.
    421 * The Disassembly View::                       Display the metrics at the instruction level.
    422 * Display and Define the Metrics::             An example how to customize the metrics.
    423 * Customization of the Output::                An example how to customize the output.
    424 * Name the Experiment Directory::              Change the name of the experiment directory.
    425 * Control the Number of Lines in the Output::  Change the number of lines in the tables.
    426 * Sorting the Performance Data::               How to set the metric to sort by.
    427 * Scripting::                                  Use a script to execute the commands.
    428 * A More Elaborate Example::                   An example of customization.
    429 * The Call Tree::                              Display the dynamic call tree.
    430 * More Information on the Experiment::         How to get additional statistics.
    431 * Control the Sampling Frequency::             How to control the sampling granularity.
    432 * Information on Load Objects::                How to get more information on load objects.
    433 @end menu
    434 
    435 @c -- A new node --------------------------------------------------------------
    436 @node       The Example Program
    437 @subsection The Example Program
    438 @c ----------------------------------------------------------------------------
    439 
    440 Throughout this guide we use the same example C code that implements the
    441 multiplication of a vector of length @math{n} by an @math{m} by @math{n}
    442 matrix.  The result is stored in a vector of length @math{m}.
    443 @cindex Pthreads
    444 @cindex Posix Threads
    445 The algorithm has been parallelized using Posix Threads, or Pthreads for short.
    446 
    447 The code was built using the @code{gcc} compiler and the name of the executable
    448 is
    449 @cindex mxv-pthreads
    450 @command{mxv-pthreads}.
    451 
    452 The matrix sizes can be set through the @code{-m} and @code{-n} options.  The
    453 number of threads is set with the @code{-t} option.  These are additional
    454 threads that are used in the multiplication.  To increase the duration of
    455 the run, the computations are executed repeatedly.
    456 
    457 This is an example that multiplies a @math{8000} by @math{4000} matrix with
    458 a vector of length @math{4000}.  Although this is a multithreaded application,
    459 initially we will be using @math{1} threads.  Later on we will show examples
    460 using multiple threads.
    461 
    462 @smallexample
    463 @verbatim
    464 $ ./mxv-pthreads -m 8000 -n 4000 -t 1
    465 mxv: error check passed - rows = 8000 columns = 4000 threads = 1
    466 $
    467 @end verbatim
    468 @end smallexample
    469 
    470 The program performs an internal check to verify that the computed results
    471 are correct.  The result of this check is printed, as well as the matrix
    472 sizes and the number of threads used.
    473 
    474 @c -- A new node --------------------------------------------------------------
    475 @node       A First Profile
    476 @subsection A First Profile
    477 @c ----------------------------------------------------------------------------
    478 
    479 The first step is to collect the performance data.  It is important to remember
    480 that much more information is gathered than may be shown by default.  Often a
    481 single data collection run is sufficient to get a lot of insight.
    482 
    483 The @CollectApp{} command is used for the data collection.  Nothing needs to be
    484 changed in the way the application is executed.  The only difference is that it
    485 is now run under control of the tool, as shown below:
    486 
    487 @cartouche
    488 @smallexample
    489 $ gprofng collect app ./mxv-pthreads -m 8000 -n 4000 -t 1
    490 @end smallexample
    491 @end cartouche
    492 
    493 @noindent
    494 This produces the following output:
    495 
    496 @smallexample
    497 @verbatim
    498 Creating experiment directory test.1.er (Process ID: 2749878) ...
    499 mxv: error check passed - rows = 8000 columns = 4000 threads = 1
    500 @end verbatim
    501 @end smallexample
    502 
    503 We see a message that an experiment directory with the name @file{test.1.er}
    504 has been created.  The process id is also echoed.  The application completes
    505 as usual and we have our first experiment directory that can be analyzed.
    506 
    507 The tool we use for this is called @DisplayText{}.  It takes the name of
    508 the experiment directory as an argument.
    509 
    510 @cindex Interpreter mode
    511 If invoked this way, the tool starts in the interactive @emph{interpreter}
    512 mode.
    513 While in this environment, commands can be given and the tool responds.  This
    514 is illustrated below:
    515 
    516 @smallexample
    517 @verbatim
    518 $ gprofng display text test.1.er
    519 Warning: History and command editing is not supported on this system.
    520 (gprofng-display-text) quit
    521 $
    522 @end verbatim
    523 @end smallexample
    524 
    525 @cindex Command line mode
    526 While useful in certain cases, we prefer to use this tool in command line mode
    527 by specifying the commands to be issued when invoking the tool.  The way to do
    528 this is to prepend the command(s) with a hyphen (@samp{-}) if used on the
    529 command line.
    530 
    531 Since this makes the command appear to be an option, they are also sometimes
    532 referred to as such, but technically they are commands.  This is the
    533 terminology we will use in this user guide, but for convenience the commands
    534 are also listed as options in the index.
    535 
    536 For example,
    537 @IndexSubentry{Options,  @code{-functions}}
    538 @IndexSubentry{Commands, @code{functions}}
    539 below we use the @command{functions} command to request a list of the functions
    540 that have been executed, plus their respective CPU times:
    541 
    542 @cartouche
    543 @smallexample
    544 $ gprofng display text -functions test.1.er
    545 @end smallexample
    546 @end cartouche
    547 
    548 @smallexample
    549 @verbatim
    550 Functions sorted by metric: Exclusive Total CPU Time
    551 
    552 Excl. Total   Incl. Total    Name
    553 CPU           CPU
    554  sec.      %   sec.      %
    555 9.367 100.00  9.367 100.00   <Total>
    556 8.926  95.30  8.926  95.30   mxv_core
    557 0.210   2.24  0.420   4.49   init_data
    558 0.080   0.85  0.210   2.24   drand48
    559 0.070   0.75  0.130   1.39   erand48_r
    560 0.060   0.64  0.060   0.64   __drand48_iterate
    561 0.010   0.11  0.020   0.21   _int_malloc
    562 0.010   0.11  0.010   0.11   sysmalloc
    563 0.      0.    8.926  95.30   <static>@0x47960 (<libgp-collector.so>)
    564 0.      0.    0.440   4.70   __libc_start_main
    565 0.      0.    0.020   0.21   allocate_data
    566 0.      0.    8.926  95.30   driver_mxv
    567 0.      0.    0.440   4.70   main
    568 0.      0.    0.020   0.21   malloc
    569 0.      0.    8.926  95.30   start_thread
    570 @end verbatim
    571 @end smallexample
    572 
    573 @noindent
    574 As easy and simple as these steps are, we do have a first profile of our
    575 program!
    576 
    577 There are five columns.  The first four contain the
    578 @cindex Total CPU time
    579 ''Total CPU Time'', which
    580 is the sum of the user and system time. @xref{Inclusive and Exclusive Metrics}
    581 for an explanation of ``exclusive'' and ``inclusive'' times.
    582 
    583 The first line echoes the metric that is used to sort the output.  By default,
    584 this is the exclusive CPU time, but through the @command{sort} command, the
    585 sort metric can be changed by the user.
    586 
    587 Next, there are four columns with the exclusive and inclusive CPU times and the
    588 respective percentages.  This is followed by the name of the function.
    589 
    590 @IndexSubentry{Miscellaneous, @code{<Total>}}
    591 The function with the name @code{<Total>} is not a user function.  It is a
    592 pseudo function introduced by @ToolName{}.  It is used to display the
    593 accumulated measured metric values.  In this example, we see that the total
    594 CPU time of this job was 9.367 seconds and it is scaled to 100%.  All
    595 other percentages in the same column are relative to this number.
    596 
    597 @c -- If the metric is derived, for example the @code{IPC}, the value shown
    598 @c -- under
    599 @c -- @code{<Total>} is based upon the total values of the that are metrics
    600 @c -- used to
    601 @c -- compute the derived metric.
    602 @c -- @IndexSubentry{Hardware event counters, IPC}
    603 
    604 With 8.926 seconds, function @code{mxv_core} takes 95.30% of the
    605 total time and is by far the most time consuming function.
    606 The exclusive and inclusive metrics are identical, which means that it
    607 is a leaf function not calling any other functions.
    608 
    609 The next function in the list is @code{init_data}.  Although with 4.49%,
    610 the CPU time spent in this part is modest, this is an interesting entry because
    611 the inclusive CPU time of 0.420 seconds is twice the exclusive CPU time
    612 of 0.210 seconds.  Clearly this function is calling another function,
    613 or even more than one function and collectively this takes 0.210 seconds.
    614 Below we show the call tree feature that provides more details on the call
    615 structure of the application.
    616 
    617 The function @code{<static>@@0x47960 (<libgp-collector.so>)} does odd and
    618 certainly not familiar.  It is one of the internal functions used by
    619 @CollectApp{} and can be ignored.  Also, while the inclusive time is high,
    620 the exclusive time is zero.  This means it doesn't contribute to the
    621 performance.
    622 
    623 The question is how we know where this function originates from.  There are
    624 several commands to dig deeper an get more details on a function.
    625 @xref{Information on Load Objects}.
    626 
    627 @c -- A new node --------------------------------------------------------------
    628 @node       The Source Code View
    629 @subsection The Source Code View
    630 @c ----------------------------------------------------------------------------
    631 
    632 In general, the tuning efforts are best focused on the most time consuming
    633 part(s) of an application.  In this case that is easy, since over 95% of
    634 the total CPU time is spent in function @code{mxv_core}.
    635 It is now time to dig deeper and look
    636 @cindex Source level metrics
    637 at the metrics distribution at the source code level.  Since we measured
    638 CPU times, these are the metrics shown.
    639 
    640 @IndexSubentry{Options,  @code{-source}}
    641 @IndexSubentry{Commands, @code{source}}
    642 The @code{source} command is used to accomplish this.  It takes the name of the
    643 function, not the source filename, as an argument.  This is demonstrated
    644 below, where the @DisplayText{} command is used to show the annotated
    645 source listing of function @code{mxv_core}.
    646 
    647 Be aware that when using the @command{gcc} compiler, the source code has to
    648 be compiled with the @code{-g} option in order for the source code feature
    649 to work.  Otherwise the location(s) can not be determined.  For other compilers
    650 we recommend to check the documentation for such an option.
    651 
    652 Below the command to display the source code of a function is shown.  Since at
    653 this point we are primarily interested in the timings only, we use the
    654 @code{metrics} command to request the exclusive and inclusive total CPU
    655 timings only. @xref{Display and Define the Metrics} for more information
    656 how to define the metrics to be displayed.
    657 
    658 @cartouche
    659 @smallexample
    660 $ gprofng display text -metrics ei.totalcpu -source mxv_core test.1.er
    661 @end smallexample
    662 @end cartouche
    663 
    664 The output is shown below.  It has been somewhat modified to fit the formatting
    665 constraints and reduce the number of lines.
    666 
    667 @smallexample
    668 @verbatim
    669 Current metrics: e.totalcpu:i.totalcpu:name
    670 Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
    671 Source file: <apath>/mxv.c
    672 Object file: mxv-pthreads (found as test.1.er/archives/...)
    673 Load Object: mxv-pthreads (found as test.1.er/archives/...)
    674 
    675    Excl.     Incl.
    676    Total     Total
    677    CPU sec.  CPU sec.
    678 
    679    <lines deleted>
    680                                 <Function: mxv_core>
    681                             43. void __attribute__ ((noinline)) 
    682                                      mxv_core (int64_t row_index_start,
    683                             44.                int64_t row_index_end,
    684                             45.                int64_t m,
    685                             46.                int64_t n,
    686                             47.                double **restrict A,
    687                             48.                double *restrict b,
    688                             49.                double *restrict c)
    689                             50. {
    690    0.        0.             50. {
    691    0.        0.             51.   for (int64_t i=row_index_start;
    692                                                  i<=row_index_end; i++)
    693                             52.     {
    694    0.        0.             53.       double row_sum = 0.0;
    695 ## 4.613     4.613          54.       for (int64_t j=0; j<n; j++)
    696 ## 4.313     4.313          55.         row_sum += A[i][j] * b[j];
    697    0.        0.             56.       c[i] = row_sum;
    698                             57.     }
    699    0.        0.             58. }
    700 @end verbatim
    701 @end smallexample
    702 
    703 The first line echoes the metrics that have been selected.  The second line
    704 is not very meaningful when looking at the source code listing, but it shows
    705 the metric that is used to sort the data.
    706 
    707 The next three lines provide information on the location of the source file,
    708 the object file and the load object (@xref{Load Objects and Functions}).
    709 
    710 Function @code{mxv_core} is part of a source file that has other functions
    711 as well.  These functions will be shown with the values for the metrics, but
    712 for lay-out purposes they have been removed in the output shown above.
    713 
    714 The header is followed by the annotated source code listing.  The selected
    715 metrics are shown first, followed by a source line number, and the source code.
    716 @IndexSubentry{Miscellaneous ,@code{##}}
    717 The most time consuming line(s) are marked with the @code{##} symbol.  In
    718 this way they are easier to identify and find with a search.
    719 
    720 What we see is that all of the time is spent in lines 54-55.
    721 
    722 @IndexSubentry{Options,  @code{-lines}}
    723 @IndexSubentry{Commands, @code{lines}}
    724 A related command sometimes comes handy as well.  It is called @code{lines}
    725 and displays a list of the source lines and their metrics, ordered according
    726 to the current sort metric (@xref{Sorting the Performance Data}).
    727 
    728 Below the command and the output.  For lay-out reasons, only the top 10 is
    729 shown here and the last part of the text on some lines has been replaced
    730 by dots.  The full text is @samp{instructions without line numbers} and
    731 means that the line number information for that function was not found.
    732 
    733 @cartouche
    734 @smallexample
    735 $ gprofng display text -lines test.1.er
    736 @end smallexample
    737 @end cartouche
    738 
    739 @smallexample
    740 @verbatim
    741 Lines sorted by metric: Exclusive Total CPU Time
    742 
    743 Excl. Total   Incl. Total    Name
    744 CPU           CPU
    745  sec.      %   sec.      %
    746 9.367 100.00  9.367 100.00   <Total>
    747 4.613  49.25  4.613  49.25   mxv_core, line 54 in "mxv.c"
    748 4.313  46.05  4.313  46.05   mxv_core, line 55 in "mxv.c"
    749 0.160   1.71  0.370   3.95   init_data, line 118 in "manage_data.c"
    750 0.080   0.85  0.210   2.24   <Function: drand48, instructions ...>
    751 0.070   0.75  0.130   1.39   <Function: erand48_r, instructions ...>
    752 0.060   0.64  0.060   0.64   <Function: __drand48_iterate, ...>
    753 0.040   0.43  0.040   0.43   init_data, line 124 in "manage_data.c"
    754 0.010   0.11  0.020   0.21   <Function: _int_malloc, instructions ...>
    755 0.010   0.11  0.010   0.11   <Function: sysmalloc, instructions ...>
    756 @end verbatim
    757 @end smallexample
    758 
    759 What this overview immediately highlights is that the third most time consuming
    760 source line takes 0.370 seconds only.  This means that the inclusive time is
    761 only 3.95% and clearly this branch of the code hardly impacts the performance.
    762 
    763 @c -- A new node --------------------------------------------------------------
    764 @node       The Disassembly View
    765 @subsection The Disassembly View
    766 @c ----------------------------------------------------------------------------
    767 
    768 The source view is very useful to obtain more insight where the time is spent,
    769 but sometimes this is not sufficient.  The disassembly view provides more
    770 details since it shows the metrics at the instruction level.
    771 
    772 This view is displayed with the
    773 @IndexSubentry{Options,  @code{-disasm}}
    774 @IndexSubentry{Commands, @code{disasm}}
    775 @command{disasm}
    776 command and as with the source view, it displays an annotated listing.  In this
    777 @cindex Instruction level metrics
    778 case it shows the instructions with the metrics, interleaved with the
    779 source lines.  The
    780 instructions have a reference in square brackets (@code{[} and @code{]})
    781 to the source line they correspond to.
    782 
    783 @noindent
    784 We again focus on the tmings only and set the metrics accordingly:
    785 
    786 @cartouche
    787 @smallexample
    788 $ gprofng display text -metrics ei.totalcpu -disasm mxv_core test.1.er
    789 @end smallexample
    790 @end cartouche
    791 
    792 @smallexample
    793 @verbatim
    794 Current metrics: e.totalcpu:i.totalcpu:name
    795 Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
    796 Source file: <apath>/src/mxv.c
    797 Object file: mxv-pthreads (found as test.1.er/archives/...)
    798 Load Object: mxv-pthreads (found as test.1.er/archives/...)
    799 
    800    Excl.     Incl.
    801    Total     Total
    802    CPU sec.  CPU sec.
    803 
    804    <lines deleted>
    805                       43. void __attribute__ ((noinline)) 
    806                                mxv_core (int64_t row_index_start,
    807                       44.                int64_t row_index_end,
    808                       45.                int64_t m,
    809                       46.                int64_t n,
    810                       47.                double **restrict A,
    811                       48.                double *restrict b,
    812                       49.                double *restrict c)
    813                       50. {
    814                           <Function: mxv_core>
    815    0.        0.           [50]   401d56:  mov    0x8(%rsp),%r10
    816                       51.   for (int64_t i=row_index_start;
    817                                            i<=row_index_end; i++)
    818    0.        0.           [51]   401d5b:  cmp    %rsi,%rdi
    819    0.        0.           [51]   401d5e:  jg     0x47
    820    0.        0.           [51]   401d60:  add    $0x1,%rsi
    821    0.        0.           [51]   401d64:  jmp    0x36
    822                       52.     {
    823                       53.       double row_sum = 0.0;
    824                       54.       for (int64_t j=0; j<n; j++)
    825                       55          row_sum += A[i][j] * b[j];
    826    0.        0.           [55]   401d66:  mov    (%r8,%rdi,8),%rdx
    827    0.        0.           [54]   401d6a:  mov    $0x0,%eax
    828    0.        0.           [53]   401d6f:  pxor   %xmm1,%xmm1
    829    0.110     0.110        [55]   401d73:  movsd  (%rdx,%rax,8),%xmm0
    830    1.921     1.921        [55]   401d78:  mulsd  (%r9,%rax,8),%xmm0
    831    2.282     2.282        [55]   401d7e:  addsd  %xmm0,%xmm1
    832 ## 4.613     4.613        [54]   401d82:  add    $0x1,%rax
    833    0.        0.           [54]   401d86:  cmp    %rax,%rcx
    834    0.        0.           [54]   401d89:  jne    0xffffffffffffffea
    835                       56.       c[i] = row_sum;
    836    0.        0.           [56]   401d8b:  movsd  %xmm1,(%r10,%rdi,8)
    837    0.        0.           [51]   401d91:  add    $0x1,%rdi
    838    0.        0.           [51]   401d95:  cmp    %rsi,%rdi
    839    0.        0.           [51]   401d98:  je     0xd
    840    0.        0.           [53]   401d9a:  pxor   %xmm1,%xmm1
    841    0.        0.           [54]   401d9e:  test   %rcx,%rcx
    842    0.        0.           [54]   401da1:  jg     0xffffffffffffffc5
    843    0.        0.           [54]   401da3:  jmp    0xffffffffffffffe8
    844                       57.     }
    845                       58. }
    846    0.        0.           [58]   401da5:  ret
    847 @end verbatim
    848 @end smallexample
    849 
    850 For each instruction, the timing values are given and we can immediately
    851 identify the most expensive instructions.  As with the source level view,
    852 these are marked with the @code{##} symbol.
    853 
    854 It comes as no surprise that the time consuming instructions originate from
    855 the source code at lines 54-55.
    856 One thing to note is that the source line numbers no longer appear in
    857 sequential order.
    858 This is because the compiler has re-ordered the instructions as part of
    859 the code optimizations it has performed.
    860 
    861 As illustrated below and similar to the @command{lines} command, we can get
    862 an overview of the instructions executed by using the
    863 @IndexSubentry{Options,  @code{-pcs}}
    864 @IndexSubentry{Commands, @code{pcs}}
    865 @command{pcs}
    866 command.
    867 
    868 @noindent
    869 Below the command and the output, which again has been restricted
    870 to 10 lines.  As before, some lines have been shortened for lay-out
    871 purposes.
    872 
    873 @cartouche
    874 @smallexample
    875 $ gprofng display text -pcs test.1.er
    876 @end smallexample
    877 @end cartouche
    878 
    879 @smallexample
    880 @verbatim
    881 PCs sorted by metric: Exclusive Total CPU Time
    882 
    883 Excl. Total   Incl. Total    Name
    884 CPU           CPU
    885  sec.      %   sec.      %
    886 9.367 100.00  9.367 100.00   <Total>
    887 4.613  49.25  4.613  49.25   mxv_core + 0x0000002C, line 54 in "mxv.c"
    888 2.282  24.36  2.282  24.36   mxv_core + 0x00000028, line 55 in "mxv.c"
    889 1.921  20.51  1.921  20.51   mxv_core + 0x00000022, line 55 in "mxv.c"
    890 0.150   1.60  0.150   1.60   init_data + 0x000000AC, line 118 in ...
    891 0.110   1.18  0.110   1.18   mxv_core + 0x0000001D, line 55 in "mxv.c"
    892 0.040   0.43  0.040   0.43   drand48 + 0x00000022
    893 0.040   0.43  0.040   0.43   init_data + 0x000000F1, line 124 in ...
    894 0.030   0.32  0.030   0.32   __drand48_iterate + 0x0000001E
    895 0.020   0.21  0.020   0.21   __drand48_iterate + 0x00000038
    896 @end verbatim
    897 @end smallexample
    898 
    899 @noindent
    900 What we see is that the top three instructions take 94% of the total CPU time
    901 and any optimizations should focus on this part of the code..
    902 
    903 @c -- A new node --------------------------------------------------------------
    904 @node       Display and Define the Metrics
    905 @subsection Display and Define the Metrics
    906 @c ----------------------------------------------------------------------------
    907 
    908 The metrics shown by @DisplayText{} are useful, but there is more recorded
    909 than displayed by default.  We can customize the values shown by defining the
    910 metrics ourselves.
    911 
    912 There are two commands related to changing the metrics shown:
    913 @IndexSubentry{Options,  @code{-metric_list}}
    914 @IndexSubentry{Commands, @code{metric_list}}
    915 @command{metric_list} and
    916 @IndexSubentry{Options,  @code{-metrics}}
    917 @IndexSubentry{Commands, @code{metrics}}
    918 @command{metrics}.
    919 
    920 The first command shows the currently selected metrics, plus all the metrics
    921 that have been stored as part of the experiment.  The second command may be
    922 used to define the metric list.
    923 
    924 @noindent
    925 This is the way to get the information about the metrics:
    926 
    927 @IndexSubentry{Options,  @code{-metric_list}}
    928 @IndexSubentry{Commands, @code{metric_list}}
    929 @cartouche
    930 @smallexample
    931 $ gprofng display text -metric_list test.1.er
    932 @end smallexample
    933 @end cartouche
    934 
    935 @noindent
    936 This is the output:
    937 
    938 @smallexample
    939 @verbatim
    940 Current metrics: e.%totalcpu:i.%totalcpu:name
    941 Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
    942 Available metrics:
    943 Exclusive Total CPU Time: e.%totalcpu
    944 Inclusive Total CPU Time: i.%totalcpu
    945                     Size: size
    946               PC Address: address
    947                     Name: name
    948 @end verbatim
    949 @end smallexample
    950 
    951 This shows the metrics that are currently used, the metric that is used to sort
    952 the data and all the metrics that have been recorded, but are not necessarily
    953 shown.
    954 
    955 @cindex Default metrics
    956 In this case, the current metrics are set to the exclusive and inclusive
    957 total CPU times, the respective percentages, and the name of the function,
    958 or load object.
    959 
    960 @IndexSubentry{Options,  @code{-metrics}}
    961 @IndexSubentry{Commands, @code{metrics}}
    962 The @code{metrics} command is used to define the metrics that need to be
    963 displayed.
    964 
    965 For example, to swap the exclusive and inclusive metrics, use the following
    966 metric definition: @code{i.%totalcpu:e.%totalcpu}.
    967 
    968 Since the metrics can be tailored for different views, there is also a way
    969 to reset them to the default.  This is done through the special keyword
    970 @code{default} for the metrics definition (@command{-metrics default}).
    971 @IndexSubentry{Metrics, Reset to default}
    972 
    973 @c -- A new node --------------------------------------------------------------
    974 @node       Customization of the Output
    975 @subsection Customization of the Output
    976 @c ----------------------------------------------------------------------------
    977 
    978 With the information just given, the function overview can be customized.
    979 For sake of the example, we would like to display the name of the function
    980 first, only followed by the exclusive CPU time, given as an absolute number
    981 and a percentage.
    982 
    983 Note that the commands are parsed in order of appearance.  This is why we
    984 need to define the metrics @emph{before} requesting the function overview:
    985 
    986 @cartouche
    987 @smallexample
    988 $ gprofng display text -metrics name:e.%totalcpu -functions test.1.er
    989 @end smallexample
    990 @end cartouche
    991 
    992 @smallexample
    993 @verbatim
    994 Current metrics: name:e.%totalcpu
    995 Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
    996 Functions sorted by metric: Exclusive Total CPU Time
    997 
    998 Name                                     Excl. Total
    999                                          CPU
   1000                                           sec.      %
   1001  <Total>                                 9.367 100.00
   1002  mxv_core                                8.926  95.30
   1003  init_data                               0.210   2.24
   1004  drand48                                 0.080   0.85
   1005  erand48_r                               0.070   0.75
   1006  __drand48_iterate                       0.060   0.64
   1007  _int_malloc                             0.010   0.11
   1008  sysmalloc                               0.010   0.11
   1009  <static>@0x47960 (<libgp-collector.so>) 0.      0.
   1010  __libc_start_main                       0.      0.
   1011  allocate_data                           0.      0.
   1012  driver_mxv                              0.      0.
   1013  main                                    0.      0.
   1014  malloc                                  0.      0.
   1015  start_thread                            0.      0.
   1016 @end verbatim
   1017 @end smallexample
   1018 
   1019 This was a first and simple example how to customize the output.  Note that we
   1020 did not rerun our profiling job and merely modified the display settings.
   1021 Below we will show other and also more advanced examples of customization.
   1022 
   1023 @c -- A new node --------------------------------------------------------------
   1024 @node       Name the Experiment Directory
   1025 @subsection Name the Experiment Directory
   1026 @c ----------------------------------------------------------------------------
   1027 
   1028 When using @CollectApp{}, the default names for experiments work fine, but
   1029 they are quite generic.  It is often more convenient to select a more
   1030 descriptive name.  For example, one that reflects conditions for the experiment
   1031 conducted, like the number of threads used.
   1032 
   1033 For this, the mutually exclusive @code{-o} and @code{-O} options come in handy.
   1034 Both may be used to provide a name for the experiment directory, but the
   1035 behaviour of @CollectApp{} is different.
   1036 
   1037 With the
   1038 @IndexSubentry{Options, @code{-o}}
   1039 @samp{-o}
   1040 option, an existing experiment directory is not overwritten.  Any directory
   1041 with the same name either needs to be renamed, moved, or removed, before the
   1042 experiment can be conducted.
   1043 
   1044 This is in contrast with the behaviour for the
   1045 @IndexSubentry{Options, @code{-O}}
   1046 @samp{-O}
   1047 option.  Any existing directory with the same name is silently overwritten.
   1048 
   1049 Be aware that the name of the experiment directory has to end with @file{.er}.
   1050 
   1051 @c -- A new node --------------------------------------------------------------
   1052 @node       Control the Number of Lines in the Output
   1053 @subsection Control the Number of Lines in the Output
   1054 @c ----------------------------------------------------------------------------
   1055 
   1056 @IndexSubentry{Options,  @code{-limit}}
   1057 @IndexSubentry{Commands, @code{limit}}
   1058 The @command{limit} @var{<n>} command can be used to control the number of
   1059 lines printed in various views.  For example it impacts the function view, but
   1060 also takes effect for other display commands, like @command{lines}.
   1061 
   1062 The argument @var{<n>} should be a positive integer number.  It sets the number
   1063 of lines in the (function) view.  A value of zero resets the limit to the
   1064 default.
   1065 
   1066 Be aware that the pseudo-function @code{<Total>} counts as a regular function.
   1067 For example @command{limit 10} displays nine user level functions.
   1068 
   1069 @c -- A new node --------------------------------------------------------------
   1070 @node       Sorting the Performance Data
   1071 @subsection Sorting the Performance Data
   1072 @c ----------------------------------------------------------------------------
   1073 
   1074 @IndexSubentry{Options,  @code{-sort}}
   1075 @IndexSubentry{Commands, @code{sort}}
   1076 The @command{sort} @var{<key>} command sets the key to be used when sorting the
   1077 performance data.
   1078 
   1079 The key is a valid metric definition, but the
   1080 @IndexSubentry{Metrics, Visibility field}
   1081 visibility field
   1082 (@xref{Metric Definitions})
   1083 in the metric
   1084 definition is ignored, since this does not affect the outcome of the sorting
   1085 operation.
   1086 For example if the sort key is set to @code{e.totalcpu}, the values
   1087 will be sorted in descending order with respect to the exclusive total
   1088 CPU time.
   1089 
   1090 @IndexSubentry{Sort, Reverse order}
   1091 The data can be sorted in reverse order by prepending the metric definition
   1092 with a minus (@samp{-}) sign.  For example @command{sort -e.totalcpu}.
   1093 
   1094 @IndexSubentry{Sort, Reset to default}
   1095 A default metric for the sort operation has been defined and since this is
   1096 a persistent command, this default can be restored with @code{default} as
   1097 the key (@command{sort default}).
   1098 
   1099 @c -- A new node --------------------------------------------------------------
   1100 @node       Scripting
   1101 @subsection Scripting
   1102 @c ----------------------------------------------------------------------------
   1103 
   1104 @cindex Script files
   1105 The list with commands for @DisplayText{} can be very long.  This is tedious
   1106 and also error prone.  Luckily, there is an easier and elegant way to control
   1107 the output of this tool.
   1108 
   1109 @IndexSubentry{Options,  @code{-script}}
   1110 @IndexSubentry{Commands, @code{script}}
   1111 Through the @command{script} command, the name of a file with commands can be
   1112 passed in.  These commands are parsed and executed as if they appeared on
   1113 the command line in the same order as encountered in the file.  The commands
   1114 in this script file can actually be mixed with commands on the command line
   1115 and multiple script files may be used.
   1116 The difference between the commands in the script file and those used on the
   1117 command line is that the latter require a leading dash (@samp{-}) symbol.
   1118 
   1119 Comment lines in a script file are supported.  They need to start with the
   1120 @samp{#} symbol.
   1121 
   1122 @c -- A new node --------------------------------------------------------------
   1123 @node       A More Elaborate Example
   1124 @subsection A More Elaborate Example
   1125 @c ----------------------------------------------------------------------------
   1126 
   1127 With the information presented so far, we can customize our data
   1128 gathering and display commands.
   1129 
   1130 As an example, we would like to use @file{mxv.1.thr.er} as the name for the
   1131 experiment directory.  In this way, the name of the algorithm and the
   1132 number of threads that were used are included in the name.
   1133 We also don't mind to overwrite an existing
   1134 experiment directory with the same name.
   1135 
   1136 All that needs to be done is to use the
   1137 @IndexSubentry{Options, @code{-O}}
   1138 @samp{-O}
   1139 option, followed by the directory name of choice when running @CollectApp{}:
   1140 
   1141 @cartouche
   1142 @smallexample
   1143 $ exe=mxv-pthreads
   1144 $ m=8000
   1145 $ n=4000
   1146 $ gprofng collect app -O mxv.1.thr.er ./$exe -m $m -n $n -t 1
   1147 @end smallexample
   1148 @end cartouche
   1149 
   1150 Since we want to customize the profile and prefer to keep the command line
   1151 short, the commands to generate the profile are put into a file with the
   1152 name @file{my-script}:
   1153 
   1154 @smallexample
   1155 @verbatim
   1156 $ cat my-script
   1157 # This is my first gprofng script
   1158 # Set the metrics
   1159 metrics i.%totalcpu:e.%totalcpu:name
   1160 # Use the exclusive time to sort
   1161 sort e.totalcpu
   1162 # Limit the function list to 5 lines
   1163 limit 5
   1164 # Show the function list
   1165 functions
   1166 @end verbatim
   1167 @end smallexample
   1168 
   1169 This script file is specified as input to the @DisplayText{} command
   1170 that is used to display the performance information stored in experiment
   1171 directory @file{mxv.1.thr.er}:
   1172 
   1173 @cartouche
   1174 @smallexample
   1175 $ gprofng display text -script my-script mxv.1.thr.er
   1176 @end smallexample
   1177 @end cartouche
   1178 
   1179 This command produces the following output:
   1180 
   1181 @smallexample
   1182 @verbatim
   1183 # This is my first gprofng script
   1184 # Set the metrics
   1185 Current metrics: i.%totalcpu:e.%totalcpu:name
   1186 Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
   1187 # Use the exclusive time to sort
   1188 Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
   1189 # Limit the function list to 5 lines
   1190 Print limit set to 5
   1191 # Show the function list
   1192 Functions sorted by metric: Exclusive Total CPU Time
   1193 
   1194 Incl. Total   Excl. Total    Name
   1195 CPU           CPU
   1196  sec.      %   sec.      %
   1197 9.703 100.00  9.703 100.00   <Total>
   1198 9.226  95.09  9.226  95.09   mxv_core
   1199 0.455   4.69  0.210   2.17   init_data
   1200 0.169   1.75  0.123   1.26   erand48_r
   1201 0.244   2.52  0.075   0.77   drand48
   1202 @end verbatim
   1203 @end smallexample
   1204 
   1205 In the first part of the output the comment lines in the script file are
   1206 echoed.  These are interleaved with an acknowledgement message for the
   1207 commands.
   1208 
   1209 This is followed by a profile consisting of 5 lines only.  For both metrics,
   1210 the percentages plus the timings are given.  The numbers are sorted with
   1211 respect to the exclusive total CPU time.  Although this is the default, for
   1212 demonstration purposes we use the @command{sort} command to explicitly define
   1213 the metric for the sort.
   1214 
   1215 While we executed the same job as before and only changed the name of the
   1216 experiment directory, the results are somewhat different.  This is sampling
   1217 in action.  The numbers are not all that different though.
   1218 It is seen that function @code{mxv_core} is responsbile for
   1219 95% of the CPU time and @code{init_data} takes 4.5% only.
   1220 
   1221 @c -- A new node --------------------------------------------------------------
   1222 @node       The Call Tree
   1223 @subsection The Call Tree
   1224 @c ----------------------------------------------------------------------------
   1225 
   1226 The call tree shows the dynamic structure of the application by displaying the
   1227 functions executed and their parent.  The CPU time attributed to each function
   1228 is shown as well.  This view helps to find the most expensive
   1229 execution path in the program.
   1230 
   1231 @IndexSubentry{Options,  @code{-calltree}}
   1232 @IndexSubentry{Commands, @code{calltree}}
   1233 This feature is enabled through the @command{calltree} command.  For example,
   1234 this is how to get the call tree for our current experiment:
   1235 
   1236 @cartouche
   1237 @smallexample
   1238 $ gprofng display text -calltree mxv.1.thr.er
   1239 @end smallexample
   1240 @end cartouche
   1241 
   1242 This displays the following structure:
   1243 
   1244 @smallexample
   1245 @verbatim
   1246 Functions Call Tree. Metric: Attributed Total CPU Time
   1247 
   1248 Attr. Total    Name
   1249 CPU
   1250  sec.      %
   1251 9.703 100.00   +-<Total>
   1252 9.226  95.09     +-start_thread
   1253 9.226  95.09     |  +-<static>@0x47960 (<libgp-collector.so>)
   1254 9.226  95.09     |    +-driver_mxv
   1255 9.226  95.09     |      +-mxv_core
   1256 0.477   4.91     +-__libc_start_main
   1257 0.477   4.91       +-main
   1258 0.455   4.69         +-init_data
   1259 0.244   2.52         |  +-drand48
   1260 0.169   1.75         |    +-erand48_r
   1261 0.047   0.48         |      +-__drand48_iterate
   1262 0.021   0.22         +-allocate_data
   1263 0.021   0.22         |  +-malloc
   1264 0.021   0.22         |    +-_int_malloc
   1265 0.006   0.06         |      +-sysmalloc
   1266 0.003   0.03         |        +-__default_morecore
   1267 0.003   0.03         |          +-sbrk
   1268 0.003   0.03         |            +-brk
   1269 0.001   0.01         +-pthread_create
   1270 0.001   0.01           +-__pthread_create_2_1
   1271 @end verbatim
   1272 @end smallexample
   1273 
   1274 At first sight this may not be what is expected and some explanation is in
   1275 place.
   1276 
   1277 @c ----------------------------------------------------------------------------
   1278 @c TBD: Revise this text when we have user and machine mode.
   1279 @c ----------------------------------------------------------------------------
   1280 The top function is the pseudo-function @code{<Total>} that we have seen
   1281 before.  It is introduced and shown here to provide the total value of the
   1282 metric(s).
   1283 
   1284 We also see function @code{<static>@@0x47960} in the call tree and apparently
   1285 it is from @code{libgp-collector.so}, a library that is internal to
   1286 @ToolName{}.
   1287 The @code{<static>} marker, followed by the program counter, is shown if the
   1288 name of the function cannot be found.  This function is part of the
   1289 implementation of the data collection process and should be hidden to the
   1290 user.  This is part of a planned future enhancement.
   1291 
   1292 In general, if a view has a function that does not appear to be part of the
   1293 user code, or seems odd anyhow, the @command{objects} and @command{fsingle}
   1294 @IndexSubentry{Options,  @code{-objects}}
   1295 @IndexSubentry{Commands, @code{objects}}
   1296 @IndexSubentry{Options,  @code{-fsingle}}
   1297 @IndexSubentry{Commands, @code{fsingle}}
   1298 commands are very useful
   1299 to find out more about load objects in general, but also to help identify
   1300 an unknown entry in the function overview. @xref{Load Objects and Functions}.
   1301 
   1302 Another thing to note is that there are two main branches.  The one under
   1303 @code{<static>@@0x47960} and the second one under @code{__libc_start_main}.
   1304 This reflects the fact that this is a multithreaded program and the
   1305 threaded part shows up as a separate branch in the call tree.
   1306 
   1307 The way to interpret this structure is as follows.  The program starts
   1308 under control of @code{__libc_start_main}.  This executes the main program
   1309 called @code{main}, which at the top level executes functions
   1310 @code{init_data}, @code{allocate_data}, and @code{pthread_create}.
   1311 The latter function creates and executes the additional thread(s).
   1312 
   1313 For this multithreaded part of the code, we need to look at the branch
   1314 under function @code{start_thread} that calls the driver code for the
   1315 matrix-vector multiplication (@code{driver_mxv}), which executes the function
   1316 that performs the actual multiplication (@code{mxv_core}).
   1317 
   1318 There are two things worth noting for the call tree feature:
   1319 
   1320 @itemize
   1321 
   1322 @item
   1323 This is a dynamic tree and since sampling is used, it most likely looks
   1324 slighlty different across seemingly identical profile runs.  In case the
   1325 run times are short, it is worth considering to use a high resolution
   1326 through the
   1327 @IndexSubentry{Options, @code{-p}}
   1328 @samp{-p}
   1329 option.  For example use @samp{-p hi} to increase the sampling rate.
   1330 
   1331 @item
   1332 In case hardware event counters have been enabled
   1333 (@xref{Profile Hardware Event Counters}), these values are also displayed
   1334 in the call tree view.
   1335 
   1336 @end itemize
   1337 
   1338 @c -- A new node --------------------------------------------------------------
   1339 @node       More Information on the Experiment
   1340 @subsection More Information on the Experiment
   1341 @c ----------------------------------------------------------------------------
   1342 
   1343 The experiment directory not only contains performance related data.  Several
   1344 system characteristics, the profiling command executed, plus some global
   1345 performance statistics are stored and can be displayed.
   1346 
   1347 @IndexSubentry{Options,  @code{-header}}
   1348 @IndexSubentry{Commands, @code{header}}
   1349 The @command{header} command displays information about the experiment(s).
   1350 For example, this is command is used to extract this data from for our
   1351 experiment directory:
   1352 
   1353 @cartouche
   1354 @smallexample
   1355 $ gprofng display text -header mxv.1.thr.er
   1356 @end smallexample
   1357 @end cartouche
   1358 
   1359 The above command prints the following information.  Note that some of the
   1360 lay-out and the information has been modified.  Directory paths have been
   1361 replaced @code{<apath>} for example.  Textual changes are
   1362 marked with the @samp{<} and @samp{>} symbols.
   1363 
   1364 @smallexample
   1365 @verbatim
   1366 Experiment: mxv.1.thr.er
   1367 No errors
   1368 No warnings
   1369 Archive command ` /usr/bin/gprofng-archive -n -a on --outfile
   1370                      <apath>/archive.log <apath>/mxv.1.thr.er'
   1371 
   1372 Target command (64-bit): './mxv-pthreads -m 8000 -n 4000 -t 1'
   1373 Process pid 2750071, ppid 2750069, pgrp 2749860, sid 2742080
   1374 Current working directory: <apath>
   1375 Collector version: `2.40.00'; experiment version 12.4 (64-bit)
   1376 Host `<the-host-name>', OS `Linux <version>', page size 4096,
   1377                      architecture `x86_64'
   1378   4 CPUs, clock speed 2294 MHz.
   1379   Memory: 3506491 pages @  4096 = 13697 MB.
   1380 Data collection parameters:
   1381   Clock-profiling, interval = 997 microsecs.
   1382   Periodic sampling, 1 secs.
   1383   Follow descendant processes from: fork|exec|combo
   1384 
   1385 Experiment started <date and time>
   1386 
   1387 Experiment Ended: 9.801216173
   1388 Data Collection Duration: 9.801216173
   1389 @end verbatim
   1390 @end smallexample
   1391 
   1392 The output above may assist in troubleshooting, or to verify some of the
   1393 operational conditions and we recommend to include this command when
   1394 generating a profile.
   1395 
   1396 @IndexSubentry{Options, @code{-C}}
   1397 Related to this command there is a useful option to record comment(s) in
   1398 an experiment.
   1399 To this end, use the @samp{-C} option on the @CollectApp{} tool to
   1400 specify a comment string.  Up to ten comment lines can be included.
   1401 These comments are displayed with the @command{header} command on
   1402 the @DisplayText{} tool.
   1403 
   1404 @IndexSubentry{Options,  @code{-overview}}
   1405 @IndexSubentry{Commands, @code{overview}}
   1406 The @command{overview} command displays information on the experiment(s) and
   1407 also shows a summary of the values for the metric(s) used.  This is an example
   1408 how to use it on the newly created experiment directory:
   1409 
   1410 @cartouche
   1411 @smallexample
   1412 $ gprofng display text -overview mxv.1.thr.er
   1413 @end smallexample
   1414 @end cartouche
   1415 
   1416 @smallexample
   1417 @verbatim
   1418 Experiment(s):
   1419 
   1420 Experiment      :mxv.1.thr.er
   1421   Target        : './mxv-pthreads -m 8000 -n 4000 -t 1'
   1422   Host          : <hostname> (<ISA>, Linux <version>)
   1423   Start Time    : <date and time>
   1424   Duration      : 9.801 Seconds
   1425 
   1426 Metrics:
   1427 
   1428   Experiment Duration (Seconds): [9.801]
   1429   Clock Profiling
   1430     [X]Total CPU Time - totalcpu (Seconds): [*9.703]
   1431 
   1432 Notes: '*' indicates hot metrics, '[X]' indicates currently enabled
   1433        metrics.
   1434        The metrics command can be used to change selections. The
   1435        metric_list command lists all available metrics.
   1436 @end verbatim
   1437 @end smallexample
   1438 
   1439 This command provides a dashboard overview that helps to easily identify
   1440 where the time is spent and in case hardware event counters are used, it
   1441 shows their total values.
   1442 
   1443 @c -- A new node --------------------------------------------------------------
   1444 @node       Control the Sampling Frequency
   1445 @subsection Control the Sampling Frequency
   1446 @c ----------------------------------------------------------------------------
   1447 
   1448 @cindex Sampling frequency
   1449 So far we did not go into details on the frequency of the sampling process,
   1450 but in some cases it is useful to change the default of 10 milliseconds.
   1451 
   1452 The advantage of increasing the sampling frequency is that functions that
   1453 do not take much time per invocation are more accurately captured.  The
   1454 downside is that more data is gathered.  This has an impact on the overhead
   1455 of the collection process and more disk space is required.
   1456 
   1457 In general this is not an immediate concern, but with heavily threaded
   1458 applications that run for an extended period of time, increasing the
   1459 frequency may have a more noticeable impact.
   1460 
   1461 @IndexSubentry{Options, @code{-p}}
   1462 The @code{-p} option on the @CollectApp{} tool is used to enable or disable
   1463 clock based profiling, or to explicitly set the sampling rate.
   1464 @cindex Sampling interval
   1465 This option takes one of the following keywords:
   1466 
   1467 @table @code
   1468 
   1469 @item off
   1470 Disable clock based profiling.
   1471 
   1472 @item on
   1473 Enable clock based profiling with a per thread sampling interval of 10 ms.
   1474 This is the default.
   1475 
   1476 @item lo
   1477 Enable clock based profiling with a per thread sampling interval of 100 ms.
   1478 
   1479 @item hi
   1480 Enable clock based profiling with a per thread sampling interval of 1 ms.
   1481 
   1482 @item @var{value}
   1483 @cindex Sampling interval
   1484 Enable clock based profiling with a per thread sampling interval of
   1485 @var{value}.
   1486 
   1487 @end table
   1488 
   1489 It may seem unnecessary to have an option to disable clock based profiling,
   1490 but there is a good reason to support this.
   1491 By default, clock profiling  is enabled when conducting hardware event counter
   1492 experiments (@xref{Profile Hardware Event Counters}).
   1493 With the @code{-p off} option, this can be disabled.
   1494 
   1495 If an explicit value is set for the sampling, the number can be an integer or
   1496 a floating-point number.
   1497 A suffix of @samp{u} for microseconds, or @samp{m} for milliseconds is
   1498 supported.  If no suffix is used, the value is assumed to be in milliseconds.
   1499 For example, the following command sets the sampling rate to
   1500 5123.4 microseconds:
   1501 
   1502 @cartouche
   1503 @smallexample
   1504 $ gprofng collect app -p 5123.4u ./mxv-pthreads -m 8000 -n 4000 -t 1
   1505 @end smallexample
   1506 @end cartouche
   1507 
   1508 If the value is smaller than the clock profiling minimum, a warning message
   1509 is issued and it is set to the minimum.
   1510 In case it is not a multiple of the clock profiling resolution, it is
   1511 silently rounded down to the nearest multiple of the clock resolution.
   1512 If the value exceeds the clock profiling maximum, is negative, or zero, an
   1513 error is reported.
   1514 
   1515 @IndexSubentry{Options,  @code{-header}}
   1516 @IndexSubentry{Commands, @code{header}}
   1517 @noindent
   1518 Note that the @code{header} command echoes the sampling rate used.
   1519 
   1520 @c -- A new node --------------------------------------------------------------
   1521 @node       Information on Load Objects
   1522 @subsection Information on Load Objects
   1523 @c ----------------------------------------------------------------------------
   1524 
   1525 It may happen that the function view shows a function that is not known to
   1526 the user.  This can easily happen with library functions for example.
   1527 Luckily there are three commands that come in handy then.
   1528 
   1529 @IndexSubentry{Options,  @code{-objects}}
   1530 @IndexSubentry{Commands, @code{objects}}
   1531 @IndexSubentry{Options,  @code{-fsingle}}
   1532 @IndexSubentry{Commands, @code{fsingle}}
   1533 @IndexSubentry{Options,  @code{-fsummary}}
   1534 @IndexSubentry{Commands, @code{fsummary}}
   1535 These commands are @command{objects}, @command{fsingle}, and
   1536 @command{fsummary}.  They provide details on
   1537 @cindex Load objects
   1538 load objects (@xref{Load Objects and Functions}).
   1539 
   1540 The @command{objects} command lists all load objects that have been referenced
   1541 during the performance experiment.
   1542 Below we show the command and the result for our profile job.  Like before,
   1543 some path names in the output have been shortened and replaced by the
   1544 @IndexSubentry{Miscellaneous, @code{<apath>}}
   1545 @code{<apath>} symbol that represents an absolute directory path.
   1546 
   1547 @cartouche
   1548 @smallexample
   1549 $ gprofng display text -objects mxv.1.thr.er
   1550 @end smallexample
   1551 @end cartouche
   1552 
   1553 The output includes the name and path of the target executable:
   1554 
   1555 @smallexample
   1556 @verbatim
   1557 <Unknown> (<Unknown>)
   1558 <mxv-pthreads> (<apath>/mxv-pthreads)
   1559 <libdl-2.28.so> (/usr/lib64/libdl-2.28.so)
   1560 <librt-2.28.so> (/usr/lib64/librt-2.28.so)
   1561 <libc-2.28.so> (/usr/lib64/libc-2.28.so)
   1562 <libpthread-2.28.so> (/usr/lib64/libpthread-2.28.so)
   1563 <libm-2.28.so> (/usr/lib64/libm-2.28.so)
   1564 <libgp-collector.so> (/usr/lib64/gprofng/libgp-collector.so)
   1565 <ld-2.28.so> (/usr/lib64/ld-2.28.so)
   1566 <DYNAMIC_FUNCTIONS> (DYNAMIC_FUNCTIONS)
   1567 @end verbatim
   1568 @end smallexample
   1569 
   1570 @IndexSubentry{Options,  @code{-fsingle}}
   1571 @IndexSubentry{Commands, @code{fsingle}}
   1572 The @command{fsingle} command may be used to get more details on a specific
   1573 entry in the function view, say.  For example, the command below provides
   1574 additional information on the @code{pthread_create} function shown in the
   1575 function overview.
   1576 
   1577 @cartouche
   1578 @smallexample
   1579 $ gprofng display text -fsingle pthread_create mxv.1.thr.er
   1580 @end smallexample
   1581 @end cartouche
   1582 
   1583 Below the output from this command.  It has been somewhat modified to match
   1584 the display requirements.
   1585 
   1586 @smallexample
   1587 @verbatim
   1588 + gprofng display text -fsingle pthread_create mxv.1.thr.er
   1589 pthread_create
   1590         Exclusive Total CPU Time: 0.    (  0. %)
   1591         Inclusive Total CPU Time: 0.001 (  0.0%)
   1592                             Size:   258
   1593                       PC Address: 8:0x00049f60
   1594                      Source File: (unknown)
   1595                      Object File: (unknown)
   1596                      Load Object: /usr/lib64/gprofng/libgp-collector.so
   1597                     Mangled Name:
   1598                          Aliases:
   1599 @end verbatim
   1600 @end smallexample
   1601 
   1602 In this table we not only see how much time was spent in this function, we
   1603 also see where it originates from.  In addition to this, the size and start
   1604 address are given as well.  If the source code location is known it is also
   1605 shown here.
   1606 
   1607 @IndexSubentry{Options,  @code{-fsummary}}
   1608 @IndexSubentry{Commands, @code{fsummary}}
   1609 The related @code{fsummary} command displays the same information as
   1610 @code{fsingle}, but for all functions in the function overview,
   1611 including @code{<Total>}:
   1612 
   1613 @cartouche
   1614 @smallexample
   1615 $ gprofng display text -fsummary mxv.1.thr.er
   1616 @end smallexample
   1617 @end cartouche
   1618 
   1619 @smallexample
   1620 @verbatim
   1621 Functions sorted by metric: Exclusive Total CPU Time
   1622 
   1623 <Total>
   1624        Exclusive Total CPU Time: 9.703 (100.0%)
   1625        Inclusive Total CPU Time: 9.703 (100.0%)
   1626                            Size:     0
   1627                      PC Address: 1:0x00000000
   1628                     Source File: (unknown)
   1629                     Object File: (unknown)
   1630                     Load Object: <Total>
   1631                    Mangled Name:
   1632                         Aliases:
   1633 
   1634 mxv_core
   1635        Exclusive Total CPU Time: 9.226 ( 95.1%)
   1636        Inclusive Total CPU Time: 9.226 ( 95.1%)
   1637                            Size:    80
   1638                      PC Address: 2:0x00001d56
   1639                     Source File: <apath>/src/mxv.c
   1640                     Object File: mxv.1.thr.er/archives/mxv-pthreads_ss_pf53V__5
   1641                     Load Object: <apath>/mxv-pthreads
   1642                    Mangled Name:
   1643                         Aliases:
   1644 
   1645           ... etc ...
   1646 @end verbatim
   1647 @end smallexample
   1648 
   1649 @c -- A new node --------------------------------------------------------------
   1650 @node    Support for Multithreading
   1651 @section Support for Multithreading
   1652 @c ----------------------------------------------------------------------------
   1653 
   1654 In this chapter the support for multithreading is introduced and discussed.
   1655 As is shown below, nothing needs to be changed when collecting the performance
   1656 data.
   1657 
   1658 The difference is that additional commands are available to get more
   1659 information on the multithreading details, plus that several filters allow
   1660 the user to zoom in on specific threads.
   1661 
   1662 @c -- A new node --------------------------------------------------------------
   1663 @node       Creating a Multithreading Experiment
   1664 @subsection Creating a Multithreading Experiment
   1665 @c ----------------------------------------------------------------------------
   1666 
   1667 We demonstrate the support for multithreading using the same code and settings
   1668 as before, but this time 2 threads are used:
   1669 
   1670 @cartouche
   1671 @smallexample
   1672 $ exe=mxv-pthreads
   1673 $ m=8000
   1674 $ n=4000
   1675 $ gprofng collect app -O mxv.2.thr.er ./$exe -m $m -n $n -t 2
   1676 @end smallexample
   1677 @end cartouche
   1678 
   1679 First of all, in as far as @ProductName{} is concerned, no changes are needed.
   1680 Nothing special is needed to profile a multithreaded job when using
   1681 @ToolName{}.
   1682 
   1683 The same is true when displaying the performance results.  The same commands
   1684 that were used before work unmodified.  For example, this is all that is
   1685 needed to get a function overview:
   1686 
   1687 @cartouche
   1688 @smallexample
   1689 $ gprofng display text -limit 5 -functions mxv.2.thr.er
   1690 @end smallexample
   1691 @end cartouche
   1692 
   1693 @noindent
   1694 This produces the following familiar looking output:
   1695 
   1696 @smallexample
   1697 @verbatim
   1698 Print limit set to 5
   1699 Functions sorted by metric: Exclusive Total CPU Time
   1700 
   1701 Excl. Total   Incl. Total    Name
   1702 CPU           CPU
   1703  sec.      %   sec.      %
   1704 9.464 100.00  9.464 100.00   <Total>
   1705 8.961  94.69  8.961  94.69   mxv_core
   1706 0.224   2.37  0.469   4.95   init_data
   1707 0.105   1.11  0.177   1.88   erand48_r
   1708 0.073   0.77  0.073   0.77   __drand48_iterate
   1709 @end verbatim
   1710 @end smallexample
   1711 
   1712 @c -- A new node --------------------------------------------------------------
   1713 @node       Commands Specific to Multithreading
   1714 @subsection Commands Specific to Multithreading
   1715 @c ----------------------------------------------------------------------------
   1716 
   1717 The function overview shown above shows the results aggregated over all the
   1718 threads.  The interesting new element is that we can also look at the
   1719 performance data for the individual threads.
   1720 
   1721 @IndexSubentry{Options,  @code{-thread_list}}
   1722 @IndexSubentry{Commands, @code{thread_list}}
   1723 The @command{thread_list} command displays how many threads have been used:
   1724 
   1725 @cartouche
   1726 @smallexample
   1727 $ gprofng display text -thread_list mxv.2.thr.er
   1728 @end smallexample
   1729 @end cartouche
   1730 
   1731 This produces the following output, showing that three threads have
   1732 been used:
   1733 
   1734 @smallexample
   1735 @verbatim
   1736 Exp Sel Total
   1737 === === =====
   1738   1 all     3
   1739 @end verbatim
   1740 @end smallexample
   1741 
   1742 The output confirms there is one experiment and that by default all
   1743 threads are selected.
   1744 
   1745 It may seem surprising to see three threads here, since we used the
   1746 @code{-t 2} option, but it is common for a Pthreads program to use one
   1747 additional thread.
   1748 Typically, there is one main thread that runs from start to finish.
   1749 It handles the sequential portions of the code, as well as thread
   1750 management related tasks.
   1751 It is no different in the example code.  At some point, the main thread
   1752 creates and activates the two threads that perform the multiplication
   1753 of the matrix with the vector.  Upon completion of this computation,
   1754 the main thread continues.
   1755 
   1756 @IndexSubentry{Options,  @code{-threads}}
   1757 @IndexSubentry{Commands, @code{threads}}
   1758 The @command{threads} command is simple, yet very powerful.  It shows the
   1759 total value of the metrics for each thread.
   1760 
   1761 @cartouche
   1762 @smallexample
   1763 $ gprofng display text -threads mxv.2.thr.er
   1764 @end smallexample
   1765 @end cartouche
   1766 
   1767 @noindent
   1768 The command above produces the following overview:
   1769 
   1770 @smallexample
   1771 @verbatim
   1772 Objects sorted by metric: Exclusive Total CPU Time
   1773 
   1774 Excl. Total    Name
   1775 CPU
   1776  sec.      %
   1777 9.464 100.00   <Total>
   1778 4.547  48.05   Process 1, Thread 3
   1779 4.414  46.64   Process 1, Thread 2
   1780 0.502   5.31   Process 1, Thread 1
   1781 @end verbatim
   1782 @end smallexample
   1783 
   1784 The first line gives the total CPU time accumulated over the threads
   1785 selected.  This is followed by the metric value(s) for each thread.
   1786 
   1787 From this it is clear that the main thread is responsible for a
   1788 little over 5% of the total CPU time, while the other two threads
   1789 take 47-48% each.
   1790 
   1791 This view is ideally suited to verify if there are any load balancing
   1792 issues and also to find the most time consuming thread(s).
   1793 
   1794 @IndexSubentry{Filters, Thread selection}
   1795 While useful, often more information than this is needed.  This is
   1796 @IndexSubentry{Options,  @code{-thread_select}}
   1797 @IndexSubentry{Commands, @code{thread_select}}
   1798 where the thread selection filter comes in.  Through the
   1799 @command{thread_select}
   1800 command, one or more threads may be selected.
   1801 @xref{The Selection List} how to define the selection list.
   1802 
   1803 Since it is most common to use this command in a script, we do so as
   1804 well here.  Below the script we are using:
   1805 
   1806 @cartouche
   1807 @smallexample
   1808 # Define the metrics
   1809 metrics e.%totalcpu
   1810 # Limit the output to 5 lines
   1811 limit 5
   1812 # Get the function overview for thread 1
   1813 thread_select 1
   1814 functions
   1815 # Get the function overview for thread 2
   1816 thread_select 2
   1817 functions
   1818 # Get the function overview for thread 3
   1819 thread_select 3
   1820 functions
   1821 @end smallexample
   1822 @end cartouche
   1823 
   1824 The definition of the metrics and the output limit have been shown and
   1825 explained earlier.  The new command to focus on is @command{thread_select}.
   1826 
   1827 This command takes a list (@xref{The Selection List}) to select specific
   1828 threads.  In this case, the individual thread numbers that were
   1829 obtained earlier with the @command{thread_list} command are selected.
   1830 
   1831 This restricts the output of the @command{functions} command to the thread
   1832 number(s) specified.  This means that the script above shows which
   1833 function(s) each thread executes and how much CPU time they consumed.
   1834 Both the exclusive timings and their percentages are given.
   1835 
   1836 Note that technically this command is a filter and persistent.  The
   1837 selection remains active until changed through another thread selection
   1838 command, or when it is reset with the @samp{all} selection list.
   1839 
   1840 @noindent
   1841 This is the relevant part of the output for the first thread:
   1842 
   1843 @smallexample
   1844 @verbatim
   1845 Exp Sel Total
   1846 === === =====
   1847   1 1       3
   1848 Functions sorted by metric: Exclusive Total CPU Time
   1849 
   1850 Excl. Total    Name
   1851 CPU
   1852  sec.      %
   1853 0.502 100.00   <Total>
   1854 0.224  44.64   init_data
   1855 0.105  20.83   erand48_r
   1856 0.073  14.48   __drand48_iterate
   1857 0.067  13.29   drand48
   1858 @end verbatim
   1859 @end smallexample
   1860 
   1861 As usual, the comment lines are echoed.  This is followed by a confirmation
   1862 of the selection.  The first table shows that one experiment is loaded and
   1863 that thread 1 out of the three threads has been selected.  What is
   1864 displayed next is the function overview for this particular thread.  Due to
   1865 the @code{limit 5} command, there are only five functions in this list.
   1866 
   1867 Clearly, this thread handles the data initialization part and as we know
   1868 from the call tree output, function @code{init_data} executes the 3 other
   1869 functions shown in this profile.
   1870 
   1871 Below are the overviews for threads 2 and 3 respectively.  It is seen that all
   1872 of the CPU time is spent in function @code{mxv_core} and that this time
   1873 is approximately the same for both threads.
   1874 
   1875 @smallexample
   1876 @verbatim
   1877 # Get the function overview for thread 2
   1878 Exp Sel Total
   1879 === === =====
   1880   1 2       3
   1881 Functions sorted by metric: Exclusive Total CPU Time
   1882 
   1883 Excl. Total    Name
   1884 CPU
   1885  sec.      %
   1886 4.414 100.00   <Total>
   1887 4.414 100.00   mxv_core
   1888 0.      0.     <static>@0x48630 (<libgp-collector.so>)
   1889 0.      0.     driver_mxv
   1890 0.      0.     start_thread
   1891 
   1892 # Get the function overview for thread 3
   1893 Exp Sel Total
   1894 === === =====
   1895   1 3       3
   1896 Functions sorted by metric: Exclusive Total CPU Time
   1897 
   1898 Excl. Total    Name
   1899 CPU
   1900  sec.      %
   1901 4.547 100.00   <Total>
   1902 4.547 100.00   mxv_core
   1903 0.      0.     <static>@0x48630 (<libgp-collector.so>)
   1904 0.      0.     driver_mxv
   1905 0.      0.     start_thread
   1906 @end verbatim
   1907 @end smallexample
   1908 
   1909 When analyzing the performance of a multithreaded application, it is sometimes
   1910 useful to know whether threads have mostly executed on the same core, say, or
   1911 if they have wandered across multiple cores.  This sort of stickiness is
   1912 usually referred to as
   1913 @cindex Thread affinity
   1914 @emph{thread affinity}.
   1915 
   1916 Similar to the commands for the threads, there are several commands related
   1917 to the usage of the cores, or @emph{CPUs} as they are called in @ToolName{}
   1918 (@xref{The Concept of a CPU in gprofng}).
   1919 
   1920 @IndexSubentry{Options,  @code{-cpu_list}}
   1921 @IndexSubentry{Commands, @code{cpu_list}}
   1922 Similar to the @command{thread_list} command, the @command{cpu_list} command
   1923 displays how many CPUs have been used.
   1924 @IndexSubentry{Options,  @code{-cpus}}
   1925 @IndexSubentry{Commands, @code{cpus}}
   1926 The equivalent of the @command{threads} threads command, is the @command{cpus}
   1927 command, which shows the numbers of the CPUs that were used and the metric
   1928 values for each one of them.  Both commands are demonstrated below.
   1929 
   1930 @cartouche
   1931 @smallexample
   1932 $ gprofng display text -cpu_list -cpus mxv.2.thr.er
   1933 @end smallexample
   1934 @end cartouche
   1935 
   1936 @noindent
   1937 This command produces the following output:
   1938 
   1939 @smallexample
   1940 @verbatim
   1941 + gprofng display text -cpu_list -cpus mxv.2.thr.er
   1942 Exp Sel Total
   1943 === === =====
   1944   1 all     4
   1945 Objects sorted by metric: Exclusive Total CPU Time
   1946 
   1947 Excl. Total    Name
   1948 CPU
   1949  sec.      %
   1950 9.464 100.00   <Total>
   1951 4.414  46.64   CPU 2
   1952 2.696  28.49   CPU 0
   1953 1.851  19.56   CPU 1
   1954 0.502   5.31   CPU 3
   1955 @end verbatim
   1956 @end smallexample
   1957 
   1958 The first table shows that there is only one experiment and that all of the
   1959 four CPUs have been selected.  The second table shows the exclusive metrics
   1960 for each of the CPUs that have been used.
   1961 
   1962 As also echoed in the output, the data is sorted with respect to the
   1963 exclusive CPU time, but it is very easy to sort the data by the CPU id
   1964 @IndexSubentry{Options, -sort}
   1965 @IndexSubentry{Commands, sort}
   1966 by using the @command{sort} command:
   1967 
   1968 @cartouche
   1969 @smallexample
   1970 $ gprofng display text -cpu_list -sort name -cpus mxv.2.thr.er
   1971 @end smallexample
   1972 @end cartouche
   1973 
   1974 @noindent
   1975 With the @command{sort} added, the output is as follows:
   1976 
   1977 @smallexample
   1978 @verbatim
   1979 Exp Sel Total
   1980 === === =====
   1981   1 all     4
   1982 Current Sort Metric: Name ( name )
   1983 Objects sorted by metric: Name
   1984 
   1985 Excl. Total    Name
   1986 CPU
   1987  sec.      %
   1988 9.464 100.00   <Total>
   1989 2.696  28.49   CPU 0
   1990 1.851  19.56   CPU 1
   1991 4.414  46.64   CPU 2
   1992 0.502   5.31   CPU 3
   1993 @end verbatim
   1994 @end smallexample
   1995 
   1996 While the table with thread times shown earlier may point at a load imbalance
   1997 in the application, this overview has a different purpose.
   1998 
   1999 For example, we see that 4 CPUs have been used, but we know that the
   2000 application uses 3 threads only.
   2001 We will now demonstrate how filters can be used to help answer the
   2002 question why 4 CPUs are used, while the application has 3 threads only.
   2003 This means that at least one thread has executed on more than one CPU.
   2004 
   2005 Recall the thread level timings:
   2006 
   2007 @smallexample
   2008 @verbatim
   2009 Excl. Total    Name
   2010 CPU
   2011  sec.      %
   2012 9.464 100.00   <Total>
   2013 4.547  48.05   Process 1, Thread 3
   2014 4.414  46.64   Process 1, Thread 2
   2015 0.502   5.31   Process 1, Thread 1
   2016 @end verbatim
   2017 @end smallexample
   2018 
   2019 Compared to the CPU timings above, it seems very likely that thread 3 has
   2020 used more than one CPU, because the thread and CPU timings are the same
   2021 for both other threads.
   2022 
   2023 The command below selects thread number 3 and then requests the CPU
   2024 utilization for this thread:
   2025 
   2026 @cartouche
   2027 @smallexample
   2028 $ gprofng display text -thread_select 3 -sort name -cpus mxv.2.thr.er
   2029 @end smallexample
   2030 @end cartouche
   2031 
   2032 The output shown below confirms that thread 3 is selected and then displays
   2033 the CPU(s) that have been used by this thread:
   2034 
   2035 @smallexample
   2036 @verbatim
   2037 Exp Sel Total
   2038 === === =====
   2039   1 3       3
   2040 
   2041 Objects sorted by metric: Exclusive Total CPU Time
   2042 
   2043 Excl. Total    Name
   2044 CPU
   2045  sec.      %
   2046 4.547 100.00   <Total>
   2047 2.696  59.29   CPU 0
   2048 1.851  40.71   CPU 1
   2049 @end verbatim
   2050 @end smallexample
   2051 
   2052 The results show that this thread has used CPU 0 nearly 60% of the time
   2053 and CPU 1 for the remaining 40%.
   2054 
   2055 To confirm that this is the only thread that has used more than one CPU, the
   2056 same approach can be used for threads 1 and 2:
   2057 
   2058 @smallexample
   2059 @verbatim
   2060 $ gprofng display text -thread_select 1 -cpus mxv.2.thr.er
   2061 Exp Sel Total
   2062 === === =====
   2063   1 1       3
   2064 Objects sorted by metric: Exclusive Total CPU Time
   2065 
   2066 Excl. Total    Name
   2067 CPU
   2068  sec.      %
   2069 0.502 100.00   <Total>
   2070 0.502 100.00   CPU 3
   2071 @end verbatim
   2072 @end smallexample
   2073 
   2074 @smallexample
   2075 @verbatim
   2076 $ gprofng display text -thread_select 2 -cpus mxv.2.thr.er
   2077 Exp Sel Total
   2078 === === =====
   2079   1 2       3
   2080 Objects sorted by metric: Exclusive Total CPU Time
   2081 
   2082 Excl. Total    Name
   2083 CPU
   2084  sec.      %
   2085 4.414 100.00   <Total>
   2086 4.414 100.00   CPU 2
   2087 @end verbatim
   2088 @end smallexample
   2089 
   2090 @noindent
   2091 The output above shows that indeed threads 1 and 2 each have used a single
   2092 CPU only.
   2093 
   2094 @c -- A new node --------------------------------------------------------------
   2095 @node    View Multiple Experiments
   2096 @section View Multiple Experiments
   2097 @c ----------------------------------------------------------------------------
   2098 
   2099 One thing we did not cover sofar is that @ToolName{} fully supports the
   2100 analysis of multiple experiments.  The @DisplayText{} tool accepts a list of
   2101 experiments.  The data can either be aggregated across the experiments, or
   2102 used in a comparison.
   2103 
   2104 The default is to aggregate the metric values across the experiments that have
   2105 been loaded.  The @command{compare} command can be used to enable the
   2106 @IndexSubentry{Options, @code{-compare}}
   2107 @IndexSubentry{Commands, @code{compare}}
   2108 comparison of results.
   2109 
   2110 In this section both modes are illustrated with an example.
   2111 
   2112 @c -- A new node --------------------------------------------------------------
   2113 @node       Aggregation of Experiments
   2114 @subsection Aggregation of Experiments
   2115 @c ----------------------------------------------------------------------------
   2116 
   2117 If the data for multiple experiments is aggregrated, the @DisplayText{} tool
   2118 shows the combined results.
   2119 For example, below is the script to show the function view for the data
   2120 aggregated over two experiments, drop the first experiment and then show
   2121 the function view for the second experiment only.
   2122 We will call it @file{my-script-agg}.
   2123 
   2124 @cartouche
   2125 @smallexample
   2126 # Define the metrics
   2127 metrics e.%totalcpu
   2128 # Limit the output to 5 lines
   2129 limit 5
   2130 # Get the list with experiments
   2131 experiment_list
   2132 # Get the function overview for all
   2133 functions
   2134 # Drop the first experiment
   2135 drop_exp mxv.2.thr.er
   2136 # Get the function overview for exp #2
   2137 functions
   2138 @end smallexample
   2139 @end cartouche
   2140 
   2141 @IndexSubentry{Options,  @code{-experiment_list}}
   2142 @IndexSubentry{Commands, @code{experiment_list}}
   2143 With the exception of the @command{experiment_list} command, all commands
   2144 used have been discussed earlier.
   2145 
   2146 The @command{experiment_list} command provides a list of the experiments
   2147 that have been loaded.  This may be used to get the experiment IDs and
   2148 to verify the correct experiments are loaded for the aggregation.
   2149 
   2150 @noindent
   2151 Below is an example that loads two experiments and uses the above
   2152 script to display different function views.
   2153 
   2154 @cartouche
   2155 @smallexample
   2156 $ gprofng display text -script my-script-agg mxv.2.thr.er mxv.4.thr.er
   2157 @end smallexample
   2158 @end cartouche
   2159 
   2160 @noindent
   2161 This produces the following output:
   2162 
   2163 @smallexample
   2164 @verbatim
   2165 # Define the metrics
   2166 Current metrics: e.%totalcpu:name
   2167 Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
   2168 # Limit the output to 5 lines
   2169 Print limit set to 5
   2170 # Get the list with experiments
   2171 ID Sel     PID Experiment
   2172 == === ======= ============
   2173  1 yes 1339450 mxv.2.thr.er
   2174  2 yes 3579561 mxv.4.thr.er
   2175 # Get the function overview for all
   2176 Functions sorted by metric: Exclusive Total CPU Time
   2177 
   2178 Excl. Total     Name
   2179 CPU
   2180   sec.      %
   2181 20.567 100.00   <Total>
   2182 19.553  95.07   mxv_core
   2183  0.474   2.30   init_data
   2184  0.198   0.96   erand48_r
   2185  0.149   0.72   drand48
   2186 
   2187 # Drop the first experiment
   2188 Experiment mxv.2.thr.er has been dropped
   2189 # Get the function overview for exp #2
   2190 Functions sorted by metric: Exclusive Total CPU Time
   2191 
   2192 Excl. Total     Name
   2193 CPU
   2194   sec.      %
   2195 11.104 100.00   <Total>
   2196 10.592  95.39   mxv_core
   2197  0.249   2.24   init_data
   2198  0.094   0.84   erand48_r
   2199  0.082   0.74   drand48
   2200 @end verbatim
   2201 @end smallexample
   2202 
   2203 The first five lines should look familiar.  The five lines following echo
   2204 the comment line in the script and show the overview of the experiments.
   2205 This confirms two experiments have been loaded and that both are active.
   2206 This is followed by the function overview.  The timings have been summed
   2207 up and the percentages are adjusted accordingly.
   2208 
   2209 @c -- A new node --------------------------------------------------------------
   2210 @node       Comparison of Experiments
   2211 @subsection Comparison of Experiments
   2212 @c ----------------------------------------------------------------------------
   2213 
   2214 The support for multiple experiments really shines in comparison mode.
   2215 @cindex Compare experiments
   2216 In comparison mode, the data for the various experiments is shown side by
   2217 side, as illustrated below where we compare the results for the multithreaded
   2218 experiments using two and four threads respectively.
   2219 
   2220 This
   2221 feature is controlled through the
   2222 @IndexSubentry{Options,  @code{-compare}}
   2223 @IndexSubentry{Commands, @code{compare}}
   2224 @code{compare} command.
   2225 
   2226 The comparison mode is enabled through @command{compare on} and with
   2227 @command{compare off} it is disabled again.
   2228 In addition to @samp{on}, or @samp{off}, this command also supports
   2229 the @samp{delta} and @samp{ratio} keywords.
   2230 
   2231 This is the script that will be used in our example.  It sets the comparison
   2232 mode to @samp{on}:
   2233 
   2234 @smallexample
   2235 @verbatim
   2236 # Define the metrics
   2237 metrics e.%totalcpu
   2238 # Limit the output to 5 lines
   2239 limit 5
   2240 # Set the comparison mode to differences
   2241 compare on
   2242 # Get the function overview
   2243 functions
   2244 @end verbatim
   2245 @end smallexample
   2246 
   2247 Assuming this script file is called @file{my-script-comp}, this is how
   2248 it is used to display the differences:
   2249 
   2250 @cartouche
   2251 @smallexample
   2252 $ gprofng display text -script my-script-comp mxv.2.thr.er mxv.4.thr.er
   2253 @end smallexample
   2254 @end cartouche
   2255 
   2256 @noindent
   2257 This produces the output shown below.  The data for the first experiment
   2258 is shown as absolute numbers.  The timings for the other experiment are
   2259 shown as a delta relative to these reference numbers:
   2260 
   2261 @smallexample
   2262 @verbatim
   2263 
   2264 mxv.2.thr.er  mxv.4.thr.er
   2265 Excl. Total   Excl. Total     Name
   2266 CPU           CPU
   2267  sec.      %    sec.      %
   2268 9.464 100.00  11.104 100.00   <Total>
   2269 8.961  94.69  10.592  95.39   mxv_core
   2270 0.224   2.37   0.249   2.24   init_data
   2271 0.105   1.11   0.094   0.84   erand48_r
   2272 0.073   0.77   0.060   0.54   __drand48_iterate
   2273 @end verbatim
   2274 @end smallexample
   2275 
   2276 This table is already helpful to more easily compare (two) profiles, but
   2277 there is more that we can do here.
   2278 
   2279 By default, in comparison mode, all measured values are shown.  Often
   2280 profiling is about comparing performance data.  It is therefore
   2281 sometimes more useful to look at differences or ratios, using one
   2282 experiment as a reference.
   2283 
   2284 The values shown are relative to this difference.  For example if a ratio
   2285 is below one, it means the reference value was higher.
   2286 
   2287 In the example below, we use the same two experiments used in the comparison
   2288 above.  The script is also nearly identical.  The only change is that we now
   2289 use the @samp{delta} keyword.
   2290 
   2291 As before, the number of lines is restricted to 5 and we focus on
   2292 the exclusive timings plus percentages.  For the comparison part we are
   2293 interested in the differences.
   2294 
   2295 This is the script that produces such an overview:
   2296 
   2297 @smallexample
   2298 @verbatim
   2299 # Define the metrics
   2300 metrics e.%totalcpu
   2301 # Limit the output to 5 lines
   2302 limit 5
   2303 # Set the comparison mode to differences
   2304 compare delta
   2305 # Get the function overview
   2306 functions
   2307 @end verbatim
   2308 @end smallexample
   2309 
   2310 Assuming this script file is called @file{my-script-comp2}, this is how we
   2311 get the table displayed on our screen:
   2312 
   2313 @cartouche
   2314 @smallexample
   2315 $ gprofng display text -script my-script-comp2 mxv.2.thr.er mxv.4.thr.er
   2316 @end smallexample
   2317 @end cartouche
   2318 
   2319 Leaving out some of the lines printed, but we have seen before, we get
   2320 the following table:
   2321 
   2322 @smallexample
   2323 @verbatim
   2324 mxv.2.thr.er  mxv.4.thr.er
   2325 Excl. Total   Excl. Total     Name
   2326 CPU           CPU
   2327  sec.      %   delta      %
   2328 9.464 100.00  +1.640 100.00   <Total>
   2329 8.961  94.69  +1.631  95.39   mxv_core
   2330 0.224   2.37  +0.025   2.24   init_data
   2331 0.105   1.11  -0.011   0.84   erand48_r
   2332 0.073   0.77  -0.013   0.54   __drand48_iterate
   2333 @end verbatim
   2334 @end smallexample
   2335 
   2336 It is now easier to see that the CPU times for the most time consuming
   2337 functions in this code are practically the same.
   2338 
   2339 It is also possible to show ratio's through the @command{compare ratio}
   2340 @IndexSubentry{Options, @code{-compare}}
   2341 @IndexSubentry{Commands, @code{compare}}
   2342 command.  The first colum is used as a reference and the values for
   2343 the other columns with metrics are derived by dividing the value by
   2344 the reference.  The result for such a comparison is shown below:
   2345 
   2346 @smallexample
   2347 @verbatim
   2348 mxv.2.thr.er  mxv.4.thr.er
   2349 Excl. Total   Excl. Total CPU    Name
   2350 CPU
   2351  sec.      %      ratio      %
   2352 9.464 100.00  x   1.173 100.00   <Total>
   2353 8.961  94.69  x   1.182  95.39   mxv_core
   2354 0.224   2.37  x   1.111   2.24   init_data
   2355 0.105   1.11  x   0.895   0.84   erand48_r
   2356 0.073   0.77  x   0.822   0.54   __drand48_iterate
   2357 @end verbatim
   2358 @end smallexample
   2359 
   2360 Note that the comparison feature is supported at the function, source, and
   2361 disassembly level.  There is no practical limit on the number of experiments
   2362 that can be used in a comparison.
   2363 
   2364 @c -- A new node --------------------------------------------------------------
   2365 @node    Profile Hardware Event Counters
   2366 @section Profile Hardware Event Counters
   2367 @c ----------------------------------------------------------------------------
   2368 
   2369 Many processors provide a set of hardware event counters and @ToolName{}
   2370 provides support for this feature.
   2371 @xref{Hardware Event Counters Explained} for those readers that are not
   2372 familiar with such counters and like to learn more.
   2373 
   2374 In this section we explain how to get the details on the event counter
   2375 support for the processor used in the experiment(s), and show several
   2376 examples.
   2377 
   2378 @c -- A new node --------------------------------------------------------------
   2379 @node       Getting Information on the Counters Supported
   2380 @subsection Getting Information on the Counters Supported
   2381 @c ----------------------------------------------------------------------------
   2382 
   2383 The first step is to check if the processor used for the experiments is
   2384 supported by @ToolName{}.
   2385 @IndexSubentry{Options, @code{-h}}
   2386 The @code{-h} option on @CollectApp{} will show the event counter
   2387 information:
   2388 
   2389 @cartouche
   2390 @smallexample
   2391 $ gprofng collect app -h
   2392 @end smallexample
   2393 @end cartouche
   2394 
   2395 In case the counters are supported, a list with the events is printed.
   2396 Otherwise, a warning message will be issued.
   2397 
   2398 For example, below we show this command and the output on an Intel Xeon
   2399 Platinum 8167M (aka ``Skylake'') processor.  The output has been split
   2400 into several sections and each section is commented upon separately.
   2401 
   2402 @smallexample
   2403 @verbatim
   2404 Run "gprofng collect app --help" for a usage message.
   2405 
   2406 Specifying HW counters on `Intel Arch PerfMon v2 on Family 6 Model 85'
   2407 (cpuver=2499):
   2408 
   2409   -h {auto|lo|on|hi}
   2410 	turn on default set of HW counters at the specified rate
   2411   -h <ctr_def> [-h <ctr_def>]...
   2412   -h <ctr_def>[,<ctr_def>]...
   2413 	specify HW counter profiling for up to 4 HW counters
   2414 @end verbatim
   2415 @end smallexample
   2416 
   2417 The first line shows how to get a usage overview.  This is followed by
   2418 some information on the target processor.
   2419 The next five lines explain in what ways the @code{-h} option can be
   2420 used to define the events to be monitored.
   2421 
   2422 The first version shown above enables a default set of counters.  This
   2423 default depends on the processor this command is executed on.  The
   2424 keyword following the @code{-h} option defines the sampling rate:
   2425 
   2426 @table @code
   2427 
   2428 @item auto
   2429 Match the sample rate of used by clock profiling.  If the latter is disabled,
   2430 Use a per thread sampling rate of approximately 100 samples per second.
   2431 This setting is the default and preferred.
   2432 
   2433 @item on
   2434 Use a per thread sampling rate of approximately 100 samples per second.
   2435 
   2436 @item lo
   2437 Use a per thread sampling rate of approximately 10 samples per second.
   2438 
   2439 @item hi
   2440 Use a per thread sampling rate of approximately 1000 samples per second.
   2441 
   2442 @end table
   2443 
   2444 The second and third variant define the events to be monitored.  Note
   2445 that the number of simultaneous events supported is printed.  In this
   2446 case we can monitor four events in a single profiling job.
   2447 
   2448 It is a matter of preference whether you like to use the @code{-h}
   2449 option for each event, or use it once, followed by a comma separated
   2450 list.
   2451 
   2452 There is one slight catch though.  The counter definition below has
   2453 mandatory comma (@code{,}) between the event and the rate.  While a
   2454 default can be used for the rate, the comma cannot be omitted.
   2455 This may result in a somewhat awkward counter definition in case
   2456 the default sampling rate is used.
   2457 
   2458 For example, the following two commands are equivalent.  Note
   2459 the double comma in the second command.  This is not a typo.
   2460 
   2461 @cartouche
   2462 @smallexample
   2463 $ gprofng collect app -h cycles -h insts ...
   2464 $ gprofng collect app -h cycles,,insts ...
   2465 @end smallexample
   2466 @end cartouche
   2467 
   2468 In the first command this comma is not needed, because a
   2469 comma (``@code{,}'') immediately followed by white space may
   2470 be omitted.
   2471 
   2472 This is why we prefer the this syntax and in the remainder will
   2473 use the first version of this command.
   2474 
   2475 @IndexSubentry{Hardware event counters, counter definition}
   2476 The counter definition takes an event name, plus optionally one or
   2477 more attributes, followed by a comma, and optionally the sampling rate.
   2478 The output section below shows the formal definition.
   2479 
   2480 @cartouche
   2481 @smallexample
   2482   <ctr_def> == <ctr>[[~<attr>=<val>]...],[<rate>]
   2483 @end smallexample
   2484 @end cartouche
   2485 
   2486 The printed help then explains this syntax.  Below we have summarized
   2487 and expanded this output:
   2488 
   2489 @table @code
   2490 
   2491 @item @var{<ctr>}
   2492 The counter name must be selected from the available counters listed
   2493 as part of the output printed with the @code{-h} option.
   2494 On most systems, if a counter is not listed, it may still be specified
   2495 by its numeric value.
   2496 
   2497 @item @var{~<attr>=<val>}
   2498 This is an optional attribute that depends on the processor.  The list
   2499 of supported attributes is printed in the output.  Examples of
   2500 attributes are ``user'', or ``system''.  The value can given in decimal
   2501 or hexadecimal format.
   2502 Multiple attributes may be specified, and each must be preceded
   2503 by a ~.
   2504 
   2505 @item @var{<rate>}
   2506 
   2507 The sampling rate is one of the following:
   2508 
   2509 @table @code
   2510 
   2511 @item auto
   2512 This is the default and matches the rate used by clock profiling.
   2513 If clock profiling is disabled, use @samp{on}.
   2514 
   2515 @item on
   2516 Set the per thread maximum sampling rate to ~100 samples/second
   2517 
   2518 @item lo
   2519 Set the per thread maximum sampling rate to ~10 samples/second
   2520 
   2521 @item hi
   2522 Set the per thread maximum sampling rate to ~1000 samples/second
   2523 
   2524 @item @var{<interval>}
   2525 Define the sampling interval.
   2526 @xref{Control the Sampling Frequency} how to define this.
   2527 
   2528 @end table
   2529 
   2530 @end table
   2531 
   2532 After the section with the formal definition of events and counters, a
   2533 processor specific list is displayed.  This part starts with an overview
   2534 of the default set of counters and the aliased names supported
   2535 @emph{on this specific processor}.
   2536 
   2537 @smallexample
   2538 @verbatim
   2539 Default set of HW counters:
   2540 
   2541     -h cycles,,insts,,llm
   2542 
   2543 Aliases for most useful HW counters:
   2544 
   2545  alias    raw name                   type units regs description
   2546 
   2547  cycles   unhalted-core-cycles   CPU-cycles 0123 CPU Cycles
   2548  insts    instruction-retired        events 0123 Instructions Executed
   2549  llm      llc-misses                 events 0123 Last-Level Cache Misses
   2550  br_msp   branch-misses-retired      events 0123 Branch Mispredict
   2551  br_ins   branch-instruction-retired events 0123 Branch Instructions
   2552 @end verbatim
   2553 @end smallexample
   2554 
   2555 @noindent
   2556 The definitions given above may or may not be available on other processors.
   2557 
   2558 The table above shows the default set of counters defined for this processor,
   2559 and the aliases.  For each alias the full ``raw'' name is given, plus the
   2560 unit of the number returned by the counter (CPU cycles, or a raw count),
   2561 the hardware counter the event is allowed to be mapped onto, and a short
   2562 description.
   2563 
   2564 The last part of the output contains all the events that can be monitored:
   2565 
   2566 @smallexample
   2567 @verbatim
   2568 Raw HW counters:
   2569 
   2570     name                                type      units regs description
   2571 
   2572     unhalted-core-cycles                     CPU-cycles 0123
   2573     unhalted-reference-cycles                    events 0123
   2574     instruction-retired                          events 0123
   2575     llc-reference                                events 0123
   2576     llc-misses                                   events 0123
   2577     branch-instruction-retired                   events 0123
   2578     branch-misses-retired                        events 0123
   2579     ld_blocks.store_forward                      events 0123
   2580     ld_blocks.no_sr                              events 0123
   2581     ld_blocks_partial.address_alias              events 0123
   2582     dtlb_load_misses.miss_causes_a_walk          events 0123
   2583     dtlb_load_misses.walk_completed_4k           events 0123
   2584 
   2585     <many lines deleted>
   2586 
   2587     l2_lines_out.silent                          events 0123
   2588     l2_lines_out.non_silent                      events 0123
   2589     l2_lines_out.useless_hwpf                    events 0123
   2590     sq_misc.split_lock                           events 0123
   2591 @end verbatim
   2592 @end smallexample
   2593 
   2594 As can be seen, these names are not always easy to correlate to a specific
   2595 event of interest.  The processor manual should provide more clarity on this.
   2596 
   2597 @c -- A new node --------------------------------------------------------------
   2598 @node       Examples Using Hardware Event Counters
   2599 @subsection Examples Using Hardware Event Counters
   2600 @c ----------------------------------------------------------------------------
   2601 
   2602 The previous section may give the impression that these counters are hard to
   2603 use, but as we will show now, in practice it is quite simple.
   2604 
   2605 With the information from the @code{-h} option, we can easily set up our first
   2606 event counter experiment.
   2607 
   2608 We start by using the default set of counters defined for our processor and we
   2609 use 2 threads:
   2610 
   2611 @cartouche
   2612 @smallexample
   2613 $ exe=mxv-pthreads
   2614 $ m=8000
   2615 $ n=4000
   2616 $ exp=mxv.hwc.def.2.thr.er
   2617 $ gprofng collect app -O $exp -h auto ./$exe -m $m -n $n -t 2
   2618 @end smallexample
   2619 @end cartouche
   2620 
   2621 @IndexSubentry{Options, @code{-h}}
   2622 @IndexSubentry{Hardware event counters, @code{auto} option}
   2623 The new option here is @code{-h auto}.  The @code{auto} keyword enables
   2624 hardware event counter profiling and selects the default set of counters
   2625 defined for this processor.
   2626 
   2627 As before, we can display the information, but there is one practical hurdle
   2628 to take.  Unless we like to view all metrics recorded, we would need to know
   2629 the names of the events that have been enabled.  This is tedious and also not
   2630 portable in case we would like to repeat this experiment on another processor.
   2631 
   2632 @IndexSubentry{Hardware event counters, @code{hwc} metric}
   2633 This is where the special @code{hwc} metric comes very handy.  It
   2634 automatically expands to the active set of hardware event counters used
   2635 in the experiment(s).
   2636 
   2637 With this, it is very easy to display the event counter values.  Note that
   2638 although the regular clock based profiling was enabled, we only want to see
   2639 the counter values.  We also request to see the percentages and limit the
   2640 output to the first 5 lines:
   2641 
   2642 @cartouche
   2643 @smallexample
   2644 $ exp=mxv.hwc.def.2.thr.er
   2645 $ gprofng display text -metrics e.%hwc -limit 5 -functions $exp
   2646 @end smallexample
   2647 @end cartouche
   2648 
   2649 @smallexample
   2650 @verbatim
   2651 Current metrics: e.%cycles:e+%insts:e+%llm:name
   2652 Current Sort Metric: Exclusive CPU Cycles ( e.%cycles )
   2653 Print limit set to 5
   2654 Functions sorted by metric: Exclusive CPU Cycles
   2655 
   2656 Excl. CPU     Excl. Instructions  Excl. Last-Level   Name
   2657 Cycles        Executed            Cache Misses
   2658  sec.      %                  %                 %
   2659 2.691 100.00  7906475309 100.00   122658983 100.00   <Total>
   2660 2.598  96.54  7432724378  94.01   121745696  99.26   mxv_core
   2661 0.035   1.31   188860269   2.39       70084   0.06   erand48_r
   2662 0.026   0.95    73623396   0.93      763116   0.62   init_data
   2663 0.018   0.66    76824434   0.97       40040   0.03   drand48
   2664 @end verbatim
   2665 @end smallexample
   2666 
   2667 As we have seen before, the first few lines echo the settings.
   2668 This includes a list with the hardware event counters used by
   2669 default.
   2670 
   2671 The table that follows makes it very easy to get an overview where the
   2672 time is spent and how many of the target events have occurred.
   2673 
   2674 As before, we can drill down deeper and see the same metrics at the source
   2675 line and instruction level.  Other than using @code{hwc} in the metrics
   2676 definitions, nothing has changed compared to the previous examples:
   2677 
   2678 @cartouche
   2679 @smallexample
   2680 $ exp=mxv.hwc.def.2.thr.er
   2681 $ gprofng display text -metrics e.hwc -source mxv_core $exp
   2682 @end smallexample
   2683 @end cartouche
   2684 
   2685 This is the relevant part of the output.  Since the lines get very long,
   2686 we have somewhat modified the lay-out:
   2687 
   2688 @smallexample
   2689 @verbatim
   2690    Excl. CPU Excl.        Excl.
   2691    Cycles    Instructions Last-Level
   2692     sec.     Executed     Cache Misses
   2693                                          <Function: mxv_core>
   2694    0.                 0          0   32. void __attribute__ ((noinline))
   2695                                          mxv_core(...)
   2696    0.                 0          0   33. {
   2697    0.                 0          0   34.   for (uint64_t i=...) {
   2698    0.                 0          0   35.     double row_sum = 0.0;
   2699 ## 1.872     7291879319   88150571   36.     for (int64_t j=0; j<n; j++)
   2700    0.725      140845059   33595125   37.        row_sum += A[i][j]*b[j];
   2701    0.                 0          0   38.     c[i] = row_sum;
   2702                                      39.    }
   2703    0.                 0          0   40. }
   2704 @end verbatim
   2705 @end smallexample
   2706 
   2707 In a smiliar way we can display the event counter values at the instruction
   2708 level.  Again we have modified the lay-out due to page width limitations:
   2709 
   2710 @cartouche
   2711 @smallexample
   2712 $ exp=mxv.hwc.def.2.thr.er
   2713 $ gprofng display text -metrics e.hwc -disasm mxv_core $exp
   2714 @end smallexample
   2715 @end cartouche
   2716 
   2717 @smallexample
   2718 @verbatim
   2719    Excl. CPU Excl.        Excl.
   2720    Cycles    Instructions Last-Level
   2721     sec.     Executed     Cache Misses
   2722                                                 <Function: mxv_core>
   2723    0.                 0          0  [33] 4021ba: mov   0x8(%rsp),%r10
   2724                                     34.   for (uint64_t i=...) {
   2725    0.                 0          0  [34] 4021bf: cmp   %rsi,%rdi
   2726    0.                 0          0  [34] 4021c2: jbe   0x37
   2727    0.                 0          0  [34] 4021c4: ret
   2728                                     35.       double row_sum = 0.0;
   2729                                     36.       for (int64_t j=0; j<n; j++)
   2730                                     37.         row_sum += A[i][j]*b[j];
   2731    0.                 0          0  [37] 4021c5: mov   (%r8,%rdi,8),%rdx
   2732    0.                 0          0  [36] 4021c9: mov   $0x0,%eax
   2733    0.                 0          0  [35] 4021ce: pxor  %xmm1,%xmm1
   2734    0.002       12804230     321394  [37] 4021d2: movsd (%rdx,%rax,8),%xmm0
   2735    0.141       60819025    3866677  [37] 4021d7: mulsd (%r9,%rax,8),%xmm0
   2736    0.582       67221804   29407054  [37] 4021dd: addsd %xmm0,%xmm1
   2737 ## 1.871     7279075109   87989870  [36] 4021e1: add   $0x1,%rax
   2738    0.002       12804210      80351  [36] 4021e5: cmp   %rax,%rcx
   2739    0.                 0          0  [36] 4021e8: jne   0xffffffffffffffea
   2740                                     38.       c[i] = row_sum;
   2741    0.                 0          0  [38] 4021ea: movsd %xmm1,(%r10,%rdi,8)
   2742    0.                 0          0  [34] 4021f0: add   $0x1,%rdi
   2743    0.                 0          0  [34] 4021f4: cmp   %rdi,%rsi
   2744    0.                 0          0  [34] 4021f7: jb    0xd
   2745    0.                 0          0  [35] 4021f9: pxor  %xmm1,%xmm1
   2746    0.                 0          0  [36] 4021fd: test  %rcx,%rcx
   2747    0.                 0      80350  [36] 402200: jne   0xffffffffffffffc5
   2748    0.                 0          0  [36] 402202: jmp   0xffffffffffffffe8
   2749                                     39.   }
   2750                                     40. }
   2751    0.                 0          0  [40]  402204:  ret
   2752 @end verbatim
   2753 @end smallexample
   2754 
   2755 So far we have used the default settings for the event counters.  It is
   2756 quite straightforward to select specific counters.  For sake of the
   2757 example, let's assume we would like to count how many branch instructions
   2758 and retired memory load instructions that missed in the L1 cache have been
   2759 executed.  We also want to count these events with a high resolution.
   2760 
   2761 This is the command to do so:
   2762 
   2763 @cartouche
   2764 @smallexample
   2765 $ exe=mxv-pthreads
   2766 $ m=8000
   2767 $ n=4000
   2768 $ exp=mxv.hwc.sel.2.thr.er
   2769 $ hwc1=br_ins,hi
   2770 $ hwc2=mem_load_retired.l1_miss,hi
   2771 $ gprofng collect app -O $exp -h $hwc1 -h $hwc2 $exe -m $m -n $n -t 2
   2772 @end smallexample
   2773 @end cartouche
   2774 
   2775 As before, we get a table with the event counts.  Due to the very
   2776 long name for the second counter, we have somewhat modified the
   2777 output.
   2778 
   2779 @cartouche
   2780 @smallexample
   2781 $ gprofng display text -limit 10 -functions mxv.hwc.sel.2.thr.er
   2782 @end smallexample
   2783 @end cartouche
   2784 
   2785 @smallexample
   2786 @verbatim
   2787 Functions sorted by metric: Exclusive Total CPU Time
   2788 Excl.     Incl.     Excl. Branch  Excl.                 Name
   2789 Total     Total     Instructions  mem_load_retired.l1_miss
   2790 CPU sec.  CPU sec.                Events
   2791 2.597     2.597     1305305319    4021340               <Total>
   2792 2.481     2.481     1233233242    3982327               mxv_core
   2793 0.040     0.107       19019012       9003               init_data
   2794 0.028     0.052       23023048      15006               erand48_r
   2795 0.024     0.024       19019008       9004               __drand48_iterate
   2796 0.015     0.067       11011009       2998               drand48
   2797 0.008     0.010              0       3002               _int_malloc
   2798 0.001     0.001              0          0               brk
   2799 0.001     0.002              0          0               sysmalloc
   2800 0.        0.001              0          0               __default_morecore
   2801 @end verbatim
   2802 @end smallexample
   2803 
   2804 @IndexSubentry{Options,  @code{-compare}}
   2805 @IndexSubentry{Commands, @code{compare}}
   2806 When using event counters, the values could be very large and it is not easy
   2807 to compare the numbers.  As we will show next, the @code{ratio} feature is
   2808 very useful when comparing such profiles.
   2809 
   2810 To demonstrate this, we have set up another event counter experiment where
   2811 we would like to compare the number of last level cache miss and the number
   2812 of branch instructions executed when using a single thread, or two threads.
   2813 
   2814 These are the commands used to generate the experiment directories:
   2815 
   2816 @cartouche
   2817 @smallexample
   2818 $ exe=./mxv-pthreads
   2819 $ m=8000
   2820 $ n=4000
   2821 $ exp1=mxv.hwc.comp.1.thr.er
   2822 $ exp2=mxv.hwc.comp.2.thr.er
   2823 $ gprofng collect app -O $exp1 -h llm -h br_ins $exe -m $m -n $n -t 1
   2824 $ gprofng collect app -O $exp2 -h llm -h br_ins $exe -m $m -n $n -t 2
   2825 @end smallexample
   2826 @end cartouche
   2827 
   2828 The following script has been used to get the tables.  Due to lay-out
   2829 restrictions, we have to create two tables, one for each counter.
   2830 
   2831 @cartouche
   2832 @smallexample
   2833 # Limit the output to 5 lines
   2834 limit 5
   2835 # Define the metrics
   2836 metrics name:e.llm
   2837 # Set the comparison to ratio
   2838 compare ratio
   2839 functions
   2840 # Define the metrics
   2841 metrics name:e.br_ins
   2842 # Set the comparison to ratio
   2843 compare ratio
   2844 functions
   2845 @end smallexample
   2846 @end cartouche
   2847 
   2848 Note that we print the name of the function first, followed by the counter
   2849 data.
   2850 The new element is that we set the comparison mode to @code{ratio}.  This
   2851 divides the data in a column by its counterpart in the reference experiment.
   2852 
   2853 This is the command using this script and the two experiment directories as
   2854 input:
   2855 
   2856 @cartouche
   2857 @smallexample
   2858 $ gprofng display text -script my-script-comp-counters \
   2859   mxv.hwc.comp.1.thr.er \
   2860   mxv.hwc.comp.2.thr.er
   2861 @end smallexample
   2862 @end cartouche
   2863 
   2864 By design, we get two tables, one for each counter:
   2865 
   2866 @smallexample
   2867 @verbatim
   2868 Functions sorted by metric: Exclusive Last-Level Cache Misses
   2869 
   2870                               mxv.hwc.comp.1.thr.er  mxv.hwc.comp.2.thr.er
   2871 Name                          Excl. Last-Level       Excl. Last-Level
   2872                               Cache Misses           Cache Misses
   2873                                                          ratio
   2874  <Total>                      122709276              x   0.788
   2875  mxv_core                     121796001              x   0.787
   2876  init_data                       723064              x   1.055
   2877  erand48_r                       100111              x   0.500
   2878  drand48                          60065              x   1.167
   2879 
   2880 Functions sorted by metric: Exclusive Branch Instructions
   2881 
   2882                               mxv.hwc.comp.1.thr.er  mxv.hwc.comp.2.thr.er
   2883 Name                          Excl. Branch           Excl. Branch
   2884                               Instructions           Instructions
   2885                                                        ratio
   2886  <Total>                      1307307316             x 0.997
   2887  mxv_core                     1235235239             x 0.997
   2888  erand48_r                      23023033             x 0.957
   2889  drand48                        20020009             x 0.600
   2890  __drand48_iterate              17017028             x 0.882
   2891 @end verbatim
   2892 @end smallexample
   2893 
   2894 A ratio less than one in the second column, means that this counter
   2895 value was smaller than the value from the reference experiment shown
   2896 in the first column.
   2897 
   2898 This kind of presentation of the results makes it much easier to
   2899 quickly interpret the data.
   2900 
   2901 We conclude this section with thread-level event counter overviews,
   2902 but before we go into this, there is an important metric we need to
   2903 mention.
   2904 
   2905 @c -- TBD Explain <Total> for IPC
   2906 
   2907 @IndexSubentry{Hardware event counters, IPC}
   2908 In case it is known how many instructions and CPU cycles have been executed,
   2909 the value for the IPC (``Instructions Per Clockycle'') can be computed.
   2910 @xref{Hardware Event Counters Explained}.
   2911 This is a derived metric that gives an indication how well the processor
   2912 is utilized.  The inverse of the IPC is called CPI.
   2913 @IndexSubentry{Hardware event counters, CPI}
   2914 
   2915 The @DisplayText{} command automatically computes the IPC and CPI values
   2916 if an experiment contains the event counter values for the instructions
   2917 and CPU cycles executed.  These are part of the metric list and can be
   2918 displayed, just like any other metric.
   2919 
   2920 @IndexSubentry{Options,  @code{-metric_list}}
   2921 @IndexSubentry{Commands, @code{metric_list}}
   2922 This can be verified through the @command{metric_list} command.  If we go
   2923 back to our earlier experiment with the default event counters, we get
   2924 the following result.
   2925 
   2926 @cartouche
   2927 @smallexample
   2928 $ gprofng display text -metric_list mxv.hwc.def.2.thr.er
   2929 @end smallexample
   2930 @end cartouche
   2931 
   2932 @smallexample
   2933 @verbatim
   2934 Current metrics: e.totalcpu:i.totalcpu:e.cycles:e+insts:e+llm:name
   2935 Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
   2936 Available metrics:
   2937          Exclusive Total CPU Time: e.%totalcpu
   2938          Inclusive Total CPU Time: i.%totalcpu
   2939              Exclusive CPU Cycles: e.+%cycles
   2940              Inclusive CPU Cycles: i.+%cycles
   2941   Exclusive Instructions Executed: e+%insts
   2942   Inclusive Instructions Executed: i+%insts
   2943 Exclusive Last-Level Cache Misses: e+%llm
   2944 Inclusive Last-Level Cache Misses: i+%llm
   2945  Exclusive Instructions Per Cycle: e+IPC
   2946  Inclusive Instructions Per Cycle: i+IPC
   2947  Exclusive Cycles Per Instruction: e+CPI
   2948  Inclusive Cycles Per Instruction: i+CPI
   2949                              Size: size
   2950                        PC Address: address
   2951                              Name: name
   2952 @end verbatim
   2953 @end smallexample
   2954 
   2955 Among the other metrics, we see the new metrics for the IPC and CPI
   2956 listed.
   2957 
   2958 In the script below, we use this information and add the IPC and CPI
   2959 to the metrics to be displayed.  We also use a the thread filter to
   2960 display these values for the individual threads.
   2961 
   2962 This is the complete script we have used.  Other than a different selection
   2963 of the metrics, there are no new features.
   2964 
   2965 @cartouche
   2966 @smallexample
   2967 # Define the metrics
   2968 metrics e.insts:e.%cycles:e.IPC:e.CPI
   2969 # Sort with respect to cycles
   2970 sort e.cycles
   2971 # Limit the output to 5 lines
   2972 limit 5
   2973 # Get the function overview for all threads
   2974 functions
   2975 # Get the function overview for thread 1
   2976 thread_select 1
   2977 functions
   2978 # Get the function overview for thread 2
   2979 thread_select 2
   2980 functions
   2981 # Get the function overview for thread 3
   2982 thread_select 3
   2983 functions
   2984 @end smallexample
   2985 @end cartouche
   2986 
   2987 In the metrics definition on the second line, we explicitly request the
   2988 counter values for the instructions (@code{e.insts}) and CPU cycles
   2989 (@code{e.cycles}) executed.  These names can be found in output from the
   2990 @IndexSubentry{Options, @code{-metric_list}}
   2991 @IndexSubentry{Commands, @code{metric_list}}
   2992 @command{metric_list} command above.
   2993 In addition to these metrics, we also request the IPC and CPI to be shown.
   2994 
   2995 @IndexSubentry{Options, @code{-limit}}
   2996 @IndexSubentry{Commands, @code{limit}}
   2997 As before, we used the @command{limit} command to control the number of
   2998 functions displayed.  We then request an overview for all the threads,
   2999 followed by three sets of two commands to select a thread and display the
   3000 function overview.
   3001 
   3002 The script above is used as follows:
   3003 
   3004 @cartouche
   3005 @smallexample
   3006 $ gprofng display text -script my-script-ipc mxv.hwc.def.2.thr.er
   3007 @end smallexample
   3008 @end cartouche
   3009 
   3010 @noindent
   3011 This script produces four tables.  We list them separately below,
   3012 and have left out the additional output.
   3013 
   3014 @noindent
   3015 The first table shows the accumulated values across the three
   3016 threads that have been active.
   3017 
   3018 @smallexample
   3019 @verbatim
   3020 Functions sorted by metric: Exclusive CPU Cycles
   3021 
   3022 Excl.         Excl. CPU     Excl.  Excl.   Name
   3023 Instructions  Cycles        IPC    CPI
   3024 Executed       sec.      %
   3025 7906475309    2.691 100.00  1.473  0.679   <Total>
   3026 7432724378    2.598  96.54  1.434  0.697   mxv_core
   3027  188860269    0.035   1.31  2.682  0.373   erand48_r
   3028   73623396    0.026   0.95  1.438  0.696   init_data
   3029   76824434    0.018   0.66  2.182  0.458   drand48
   3030 @end verbatim
   3031 @end smallexample
   3032 
   3033 @noindent
   3034 This shows that IPC of this program is completely dominated
   3035 by function @code{mxv_core}.  It has a fairly low IPC value
   3036 of 1.43.
   3037 
   3038 @noindent
   3039 The next table is for thread 1 and shows the values for the
   3040 main thread.
   3041 
   3042 @smallexample
   3043 @verbatim
   3044 Exp Sel Total
   3045 === === =====
   3046   1 1       3
   3047 Functions sorted by metric: Exclusive CPU Cycles
   3048 
   3049 Excl.         Excl. CPU     Excl.  Excl.   Name
   3050 Instructions  Cycles        IPC    CPI
   3051 Executed       sec.      %
   3052 473750931     0.093 100.00  2.552  0.392   <Total>
   3053 188860269     0.035  37.93  2.682  0.373   erand48_r
   3054  73623396     0.026  27.59  1.438  0.696   init_data
   3055  76824434     0.018  18.97  2.182  0.458   drand48
   3056 134442832     0.013  13.79  5.250  0.190   __drand48_iterate
   3057 @end verbatim
   3058 @end smallexample
   3059 
   3060 @noindent
   3061 Although this thread hardly uses any CPU cycles, the overall IPC
   3062 of 2.55 is not all that bad.
   3063 
   3064 @noindent
   3065 Last, we show the tables for threads 2 and 3:
   3066 
   3067 @smallexample
   3068 @verbatim
   3069 Exp Sel Total
   3070 === === =====
   3071   1 2       3
   3072 Functions sorted by metric: Exclusive CPU Cycles
   3073 
   3074 Excl.         Excl. CPU     Excl.  Excl.   Name
   3075 Instructions  Cycles        IPC    CPI
   3076 Executed       sec.      %
   3077 3716362189    1.298 100.00  1.435  0.697   <Total>
   3078 3716362189    1.298 100.00  1.435  0.697   mxv_core
   3079          0    0.      0.    0.     0.      collector_root
   3080          0    0.      0.    0.     0.      driver_mxv
   3081 
   3082 Exp Sel Total
   3083 === === =====
   3084   1 3       3
   3085 Functions sorted by metric: Exclusive CPU Cycles
   3086 
   3087 Excl.         Excl. CPU     Excl.  Excl.   Name
   3088 Instructions  Cycles        IPC    CPI
   3089 Executed       sec.      %
   3090 3716362189    1.300 100.00  1.433  0.698   <Total>
   3091 3716362189    1.300 100.00  1.433  0.698   mxv_core
   3092          0    0.      0.    0.     0.      collector_root
   3093          0    0.      0.    0.     0.      driver_mxv
   3094 @end verbatim
   3095 @end smallexample
   3096 
   3097 It is seen that both execute the same number of instructions and
   3098 take about the same number of CPU cycles.  As a result, the IPC is
   3099 the same for both threads.
   3100 
   3101 @c -- A new node --------------------------------------------------------------
   3102 @c TBD @node    Additional Features
   3103 @c TBD @section Additional Features
   3104 @c ----------------------------------------------------------------------------
   3105 
   3106 @c -- A new node --------------------------------------------------------------
   3107 @c TBD @node    More Filtering Capabilities
   3108 @c TBD @subsection More Filtering Capabilities
   3109 @c ----------------------------------------------------------------------------
   3110 
   3111 @c TBD Cover @code{samples} and @code{seconds}
   3112 
   3113 @c -- A new node --------------------------------------------------------------
   3114 @node    Java Profiling
   3115 @section Java Profiling
   3116 @c ----------------------------------------------------------------------------
   3117 
   3118 @IndexSubentry{Options, @code{-j}}
   3119 @IndexSubentry{Java profiling, @code{-j on/off}}
   3120 The @CollectApp{} command supports Java profiling.  The @code{-j on} option
   3121 can be used for this, but since this feature is enabled by default, there is
   3122 no need to set this explicitly.  Java profiling may be disabled through the
   3123 @code{-j off} option.
   3124 
   3125 The program is compiled as usual and the experiment directory is created
   3126 similar to what we have seen before.  The only difference with a C/C++
   3127 application is that the program has to be explicitly executed by java.
   3128 
   3129 For example, this is how to generate the experiment data for a Java
   3130 program that has the source code stored in file @code{Pi.java}:
   3131 
   3132 @cartouche
   3133 @smallexample
   3134 $ javac Pi.java
   3135 $ gprofng collect app -j on -O pi.demo.er java Pi < pi.in
   3136 @end smallexample
   3137 @end cartouche
   3138 
   3139 Regarding which java is selected to generate the data, @ToolName{}
   3140 first looks for the JDK in the path set in either the
   3141 @IndexSubentry{Java profiling, @code{JDK_HOME}}
   3142 @code{JDK_HOME} environment variable, or in the
   3143 @IndexSubentry{Java profiling, @code{JAVA_PATH}}
   3144 @code{JAVA_PATH} environment variable.  If neither of these variables is
   3145 set, it checks for a JDK in the search path (set in the PATH
   3146 environment variable).  If there is no JDK in this path, it checks for
   3147 the java executable in @code{/usr/java/bin/java}.
   3148 
   3149 In case additional options need to be passed on to the JVM, the
   3150 @IndexSubentry{Options, @code{-J}}
   3151 @IndexSubentry{Java profiling, @code{-J <string>}}
   3152 @code{-J <string>} option can be used.  The string with the
   3153 option(s) has to be delimited by quotation marks in case
   3154 there is more than one argument.
   3155 
   3156 The @DisplayText{} command may be used to view the performance data.  There is
   3157 no need for any special options and the same commands as previously discussed
   3158 are supported.
   3159 
   3160 @IndexSubentry{Options,  @code{-viewmode}}
   3161 @IndexSubentry{Commands, @code{viewmode}}
   3162 @IndexSubentry{Java profiling, different view modes}
   3163 The @code{viewmode} command
   3164 @xref{The Viewmode}
   3165 is very useful to examine the call stacks.
   3166 
   3167 For example, this is how one can see the native call stacks.  For
   3168 lay-out purposes we have restricted the list to the first five entries:
   3169 
   3170 @cartouche
   3171 @smallexample
   3172 $ gprofng display text -limit 5 -viewmode machine -calltree pi.demo.er
   3173 @end smallexample
   3174 @end cartouche
   3175 
   3176 @smallexample
   3177 @verbatim
   3178 Print limit set to 5
   3179 Viewmode set to machine
   3180 Functions Call Tree. Metric: Attributed Total CPU Time
   3181 
   3182 Attr.      Name
   3183 Total
   3184 CPU sec.
   3185 1.381      +-<Total>
   3186 1.171        +-Pi.calculatePi(double)
   3187 0.110        +-collector_root
   3188 0.110        |  +-JavaMain
   3189 0.070        |    +-jni_CallStaticVoidMethod
   3190 @end verbatim
   3191 @end smallexample
   3192 
   3193 @noindent
   3194 Note that the selection of the viewmode is echoed in the output.
   3195 
   3196 @c -- A new node --------------------------------------------------------------
   3197 @node    The gprofng Tools
   3198 @chapter The gprofng Tools
   3199 @c ----------------------------------------------------------------------------
   3200 
   3201 Several tools are included in @ProductName{}.  In subsequent chapters these
   3202 are discussed in detail.  Below a brief description is given, followed by an
   3203 overview of the environment variables that are supported.
   3204 
   3205 @c -- A new node --------------------------------------------------------------
   3206 @node    Tools Overview
   3207 @section Tools Overview
   3208 @c ----------------------------------------------------------------------------
   3209 
   3210 The following tools are supported by @ProductName{}:
   3211 
   3212 @table @code
   3213 
   3214 @item @CollectApp{}
   3215 @IndexSubentry{@code{gprofng}, @code{collect app}}
   3216 
   3217 Collects the performance data and stores the results in an experiment
   3218 directory.  There are many options on this tool, but quite often the
   3219 defaults are sufficient.
   3220 An experiment directory is required for the subsequent analysis of
   3221 the results.
   3222 
   3223 @item @DisplayText{}
   3224 @IndexSubentry{@code{gprofng}, @code{display text}}
   3225 
   3226 Generates performance reports in ASCII format.  Commandline
   3227 options, and/or commands in a script file are used to control the contents
   3228 and lay-out of the generated report(s).
   3229 
   3230 @item @DisplayHTML{}
   3231 @IndexSubentry{@code{gprofng}, @code{display html}}
   3232 
   3233 Takes one or more experiment directories and generates a directory with
   3234 HTML files.  Starting from the index.html file, the performance data
   3235 may be examined in a browser.
   3236 
   3237 @item @DisplaySRC{}
   3238 @IndexSubentry{@code{gprofng}, @code{display src}}
   3239 
   3240 Displays the source code, interleaved with the disassembled instructions.
   3241 
   3242 @item @Archive{}
   3243 @IndexSubentry{@code{gprofng}, @code{archive}}
   3244 
   3245 Archives an experiment directory by (optionally) including source code and
   3246 object files, as well as the shared libraries that have been used.
   3247 
   3248 @item @GUI{}
   3249 @IndexSubentry{@code{gprofng}, @code{display gui}}
   3250 
   3251 This is an optional component that can be installed in addition to the
   3252 command line gprofng tools listed above.  It supports the graphical
   3253 analysis of one or more experiments that have been created using
   3254 @CollectApp{}.
   3255 
   3256 The GUI part of @ProductName{} is a GNU project.  This is the link to the
   3257 @url{https://savannah.gnu.org/projects/gprofng-gui, gprofng GUI page}.
   3258 This page contains more information (e.g. how to clone the repo).
   3259 There is also a
   3260 @url{https://ftp.gnu.org/gnu/gprofng-gui, tar file distribution directory}
   3261 with tar files that include everything that is needed to build and install
   3262 the GUI code.  Various versions are available here.
   3263 Be aware that in order to build and use the gprofng GUI, Java needs to be
   3264 installed first.  The minimum Java version required is Java 8.
   3265 
   3266 @end table
   3267 
   3268 @c -- A new section -----------------------------------------------------------
   3269 @node    The gprofng.rc file with default settings
   3270 @section The gprofng.rc file with default settings
   3271 @c ----------------------------------------------------------------------------
   3272 The @file{gprofng.rc}
   3273 @cindex gprofng.rc
   3274 file is used to define default settings for the @DisplayText{}, @Archive{},
   3275 and @DisplaySRC{} tools, but the user can override these defaults through
   3276 local configuration settings when building and installing from the source
   3277 code.
   3278 
   3279 There are three files that are checked when the tool starts up.  The first
   3280 file has pre-defined settings and comes with the installation, but through
   3281 a hidden file called @file{.gprofng.rc}, the user can (re)define the defaults.
   3282 
   3283 These are the locations and files that are checked upon starting the above
   3284 mentioned tools:
   3285 
   3286 @enumerate
   3287 
   3288 @item
   3289 The system-wide filename is called @file{gprofng.rc} and is located in
   3290 the @file{/etc} subdirectory in case an RPM was used for the installation.
   3291 
   3292 If @ProductName{} has been built from the source, this file is in
   3293 subdirectory @file{etc} in the top level installation directory.
   3294 
   3295 @item
   3296 The user's home directory may have a hidden file called @file{.gprofng.rc}.
   3297 
   3298 @item
   3299 The directory where @DisplayText{} (or @DisplaySRC{}) is invoked from, may
   3300 have a hidden file called @file{.gprofng.rc}.
   3301 
   3302 @end enumerate
   3303 
   3304 The settings of each file override the settings of the file(s) read before it.
   3305 Defaults in the system-wide file are overruled by the file in the user home
   3306 directory (if any) and any settings in the @file{.gprofng.rc} file in the
   3307 current directory override those.
   3308 
   3309 Note that the settings in these files only affect the defaults.  Unlike
   3310 the commands used in a script file, they are not commands for the tools.
   3311 
   3312 @c -- TBD indxobj_define, 
   3313 @c -- TBD object_show, 
   3314 @c -- TBD object_hide, 
   3315 @c -- TBD object_api, 
   3316 
   3317 The @file{.gprofng.rc} configuration files can contain the
   3318 @command{addpath},
   3319 @command{compare}, 
   3320 @command{dthresh}, 
   3321 @command{name}, 
   3322 @command{pathmap},
   3323 @command{printmode}, 
   3324 @command{sthresh},
   3325 and
   3326 @command{viewmode} 
   3327 commands as described in this user guide.
   3328 
   3329 They can also contain the following commands, @emph{which cannot be used on
   3330 either the command line, or in a script file}:
   3331 
   3332 @table @code
   3333 
   3334 @item dmetrics @var{metric-spec}
   3335 @IndexSubentry{Commands, @code{dmetrics}}
   3336 
   3337 Specify the default metrics to be displayed or printed in the function list.
   3338 The syntax and use of the metric list is described in section
   3339 @ref{Metric Definitions}.
   3340 The order of the metric keywords in the list determines the order in which
   3341 the metrics are presented.
   3342 
   3343 Default metrics for the @code{callers-callees} list are derived from the
   3344 function list default metrics by adding the corresponding attributed metric
   3345 before the first occurrence of each metric name in the list.
   3346 
   3347 @item dsort @var{metric-spec}
   3348 @IndexSubentry{Commands, @code{dsort}}
   3349 
   3350 Specify the default metric by which the function list is sorted.  The sort
   3351 metric is the first metric in this list that matches a metric in any loaded
   3352 experiment, subject to the following conditions:
   3353 
   3354 @itemize @bullet
   3355 
   3356 @item
   3357 If the entry in @var{metric-spec} has a visibility string of an exclamation
   3358 point (@samp{!}), the first metric whose name matches is used, regardless of
   3359 whether it is visible.
   3360 
   3361 @item
   3362 If the entry in @var{metric-spec} has any other visibility string, the first
   3363 visible metric whose name matches is used.
   3364 
   3365 @end itemize
   3366 
   3367 The syntax and use of the metric list is described in section
   3368 @ref{Metric Definitions}.
   3369 The default sort metric for the @code{callers-callees} list is the attributed
   3370 metric corresponding to the default sort metric for the function list.
   3371 
   3372 @item en_desc @{on | off | =@var{regex}@}
   3373 @IndexSubentry{Commands, @code{en_desc}}
   3374 
   3375 Set the mode for reading descendant experiments to @samp{on} (enable all
   3376 descendants) or @samp{off} to disable all descendants.  If
   3377 @samp{=}@var{regex} is used, enable data from those experiments whose
   3378 executable name matches the regular expression.
   3379 
   3380 The default setting is @samp{on} to follow all descendants.  In reading
   3381 experiments with descendants, any sub-experiments that contain little or
   3382 no performance data are ignored by @DisplayText{}.
   3383 
   3384 @end table
   3385 
   3386 @c -- A new section -----------------------------------------------------------
   3387 @node    Filters
   3388 @section Filters
   3389 @c ----------------------------------------------------------------------------
   3390 
   3391 Various filter commands are supported by @DisplayText{}.
   3392 Thanks to the use of filters, the user can zoom in on a certain area of
   3393 interest.  With filters, it is possible to select one or more threads to
   3394 focus on, define a window in time, select specific call stacks, etc.
   3395 @IndexSubentry{Filters, Intro}
   3396 
   3397 While already powerful by themselves, filters may be combined to further
   3398 narrow down the view into the data.
   3399 
   3400 @IndexSubentry{Filters, Persistence}
   3401 It is important to note that filters are @emph{persistent}.  A filter is
   3402 active until it is reset.  This means that successive filter commands
   3403 increasingly narrow down the view until one or more are reset.
   3404 
   3405 @noindent
   3406 An example is the following:
   3407 
   3408 @cartouche
   3409 @smallexample
   3410 $ gprofng display text -thread_select 1 -functions \
   3411                        -cpu_select 2 -functions @dots{}
   3412 @end smallexample
   3413 @end cartouche
   3414 
   3415 This command selects thread 1 and requests the function view for this thread.
   3416 The third (@command{cpu_select 2}) command @emph{adds} the
   3417 constraint that only the events on CPU 2 are to be selected.  This means
   3418 that the next function view selects events that were executed by thread 1 and
   3419 have been running on CPU 2.
   3420 
   3421 @noindent
   3422 In contrast with this single command line, the two commands below look similar,
   3423 but behave very differently:
   3424 
   3425 @cartouche
   3426 @smallexample
   3427 $ gprofng display text -thread_select 1 -functions @dots{}
   3428 $ gprofng display text -cpu_select 2 -functions @dots{}
   3429 @end smallexample
   3430 @end cartouche
   3431 
   3432 The first command displays the function view for thread 1.  The second
   3433 command shows the function view for CPU 2 for @emph{all} threads that have
   3434 been running on this CPU.
   3435 
   3436 As the following example demonstrates, things get a little more tricky in
   3437 case a script file is used.  Consider the following script file:
   3438 
   3439 @smallexample
   3440 @verbatim
   3441 thread_select 1
   3442 functions
   3443 cpu_select 2
   3444 functions
   3445 @end verbatim
   3446 @end smallexample
   3447 
   3448 This script file displays the function view for thread 1 first.  This is
   3449 followed by those functions that were executed by thread 1 @emph{and} have
   3450 been run on CPU 2.
   3451 
   3452 If however, the script should behave like the two command line invocations
   3453 shown above, the thread selection filter needs to be reset before CPU 2 is
   3454 selected:
   3455 
   3456 @smallexample
   3457 @verbatim
   3458 thread_select 1
   3459 functions
   3460 # Reset the thread selection filter:
   3461 thread_select all
   3462 cpu_select 2
   3463 functions
   3464 @end verbatim
   3465 @end smallexample
   3466 
   3467 In general, filters behave differently than commands or options.  In
   3468 particular there may be an interaction between different filter definitions.
   3469 
   3470 For example, as explained above, in the first script file the
   3471 @command{thread_select} and @command{cpu_select} commands interact.
   3472 
   3473 For a list of all the predefined filters see @ref{Predefined Filters}.
   3474 
   3475 @c -- A new section -----------------------------------------------------------
   3476 @node    Supported Environment Variables
   3477 @section Supported Environment Variables
   3478 @c ----------------------------------------------------------------------------
   3479 
   3480 Various environment variables are supported.  We refer to the man page for
   3481 gprofng(1) for an overview and description
   3482 (@xref{Man page for @command{gprofng}}).
   3483 
   3484 @c -- A new chapter -----------------------------------------------------------
   3485 @node    Performance Data Collection
   3486 @chapter Performance Data Collection
   3487 @c ----------------------------------------------------------------------------
   3488 
   3489 The @CollectApp{} command is used to gather the application performance data
   3490 while the application executes.
   3491 At regular intervals, program execution is halted and the required data is
   3492 recorded.
   3493 @cindex Experiment directory
   3494 An experiment directory is created when the tool starts.  This directory is
   3495 used to store the relevant information and forms the basis for a subsequent
   3496 analysis with one of the viewing tools.
   3497 
   3498 @c -- A new section -----------------------------------------------------------
   3499 @node    The @command{gprofng collect app} Tool
   3500 @section The @command{gprofng collect app} Tool
   3501 @c ----------------------------------------------------------------------------
   3502 
   3503 This is the command to collect the performance information for the target
   3504 application.  The usage is as follows:
   3505 
   3506 @cartouche
   3507 @smallexample
   3508 $ gprofng collect app [OPTION(S)] TARGET [TARGET_ARGUMENTS]
   3509 @end smallexample
   3510 @end cartouche
   3511 
   3512 @noindent
   3513 Options to the command are passed in first.  This is followed by the name of
   3514 the target, which is typically a binary executable or a script, followed by
   3515 any options that may be required by the target.
   3516 
   3517 @c -- A new section -----------------------------------------------------------
   3518 @node    View the Performance Information
   3519 @chapter View the Performance Information
   3520 @c ----------------------------------------------------------------------------
   3521 Various tools to view the performance data stored in one or more experiment
   3522 directories are available.  In this chapter, these will all be covered in
   3523 detail.
   3524 
   3525 @c -- A new section -----------------------------------------------------------
   3526 @node    The @command{gprofng display text} Tool
   3527 @section The @command{gprofng display text} Tool
   3528 @c ----------------------------------------------------------------------------
   3529 
   3530 This tool displays the performance information in ASCII format.  It supports
   3531 a variety of views into the data recorded.
   3532 These views can be specified in two ways and both may be used simultaneously:
   3533 
   3534 @itemize @bullet
   3535 
   3536 @item
   3537 Command line options start with a dash (@samp{-}) symbol and may take an
   3538 argument.
   3539 
   3540 @item
   3541 Options may also be included in a file, the ``script file''.  In this case,
   3542 the dash symbol should @emph{not} be included.  Multiple script files can
   3543 be used on the same command line.
   3544 
   3545 @end itemize
   3546 
   3547 While they may appear as an option, they are really commands and this is
   3548 why they will be referred to as @emph{commands}
   3549 @cindex Commands
   3550 in the documentation.
   3551 
   3552 As a general rule, @emph{the order of options matters} and if the same option,
   3553 or command, occurs multiple times, the rightmost setting is selected.
   3554 
   3555 @c -- A new sub section -------------------------------------------------------
   3556 @node       The @code{gprofng display text} Commands
   3557 @subsection The @code{gprofng display text} Commands
   3558 @c ----------------------------------------------------------------------------
   3559 
   3560 The most commonly used commands are documented in the man page for this tool
   3561 (@xref{Man page for @command{gprofng display text}}).  In this section we list
   3562 and describe all other commands that are supported.
   3563 
   3564 @c -- A new sub subsection ----------------------------------------------------
   3565 @node                Commands that List Experiment Details
   3566 @unnumberedsubsubsec Commands that List Experiment Details
   3567 @c ----------------------------------------------------------------------------
   3568 
   3569 @table @code
   3570 
   3571 @item experiment_ids
   3572 @ifclear man
   3573 @IndexSubentry{Options,  @code{-experiment_ids}}
   3574 @IndexSubentry{Commands, @code{experiment_ids}}
   3575 @end ifclear
   3576 
   3577 For each experiment that has been loaded, show the totals of the metrics
   3578 recorded, plus some other operational characteristics like the name of
   3579 the executable, PID, etc.  The top line contains the accumulated totals
   3580 for the metrics.
   3581 
   3582 @item experiment_list
   3583 @ifclear man
   3584 @IndexSubentry{Options,  @code{-experiment_list}}
   3585 @IndexSubentry{Commands, @code{experiment_list}}
   3586 @end ifclear
   3587 
   3588 Display the list of experiments that are loaded.  Each experiment is listed
   3589 with an index, which is used when selecting samples, threads, or LWPs, and
   3590 a process id (PID), which can be used for advanced filtering.
   3591 
   3592 @item cpu_list
   3593 @IndexSubentry{Options,  @code{-cpu_list}}
   3594 @IndexSubentry{Commands, @code{cpu_list}}
   3595 
   3596 Display the total number of CPUs that have been used during the experiment(s).
   3597 
   3598 @item cpus
   3599 @IndexSubentry{Options,  @code{-cpus}}
   3600 @IndexSubentry{Commands, @code{cpus}}
   3601 
   3602 Show a list of CPUs that were used by the application, along with the metrics
   3603 that have been recorded.  The CPUs are represented by a CPU number and show
   3604 the Total CPU time by default.
   3605 
   3606 Note that since the data is sorted with respect to the default metric, it may
   3607 be useful to use the @command{sort name} command to show the list sorted with
   3608 respect to the CPU id.
   3609 
   3610 @item GCEvents
   3611 @IndexSubentry{Options,  @code{-GCEvents}}
   3612 @IndexSubentry{Commands, @code{GCEvents}}
   3613 
   3614 This commands is for Java applications only.  It shows any Garbage Collection
   3615 (GC) events that have occurred while the application was executing.
   3616 
   3617 @item lwp_list
   3618 @IndexSubentry{Options,  @code{-lwp_list}}
   3619 @IndexSubentry{Commands, @code{lwp_list}}
   3620 
   3621 Displays the list of LWPs processed during the experiment(s).
   3622 
   3623 @item processes
   3624 @IndexSubentry{Options,  @code{-processes}}
   3625 @IndexSubentry{Commands, @code{processes}}
   3626 
   3627 For each experiment that has been loaded, this command displays a list of
   3628 processes that were created by the application, along with their metrics.
   3629 The processes are represented by process ID (PID) numbers and show the
   3630 Total CPU time metric by default.  If additional metrics are recorded in
   3631 an experiment, these are shown as well.
   3632 
   3633 @item samples
   3634 @IndexSubentry{Options,  @code{-samples}}
   3635 @IndexSubentry{Commands, @code{samples}}
   3636 
   3637 Display a list of sample points and their metrics, which reflect the
   3638 microstates recorded at each sample point in the loaded experiment.
   3639 The samples are represented by sample numbers and show the Total CPU time
   3640 by default.  Other metrics might also be displayed if enabled.
   3641 
   3642 @item sample_list
   3643 @IndexSubentry{Options,  @code{-sample_list}}
   3644 @IndexSubentry{Commands, @code{sample_list}}
   3645 
   3646 For each experiment loaded, display the list of samples currently selected.
   3647 
   3648 @item seconds
   3649 @IndexSubentry{Options,  @code{-seconds}}
   3650 @IndexSubentry{Commands, @code{seconds}}
   3651 
   3652 Show each second of the profiling run that was captured in the experiment,
   3653 along with the metrics collected in that second.  The seconds view differs
   3654 from the samples view in that it shows periodic samples that occur every
   3655 second beginning at 0 and the interval cannot be changed.
   3656 
   3657 The seconds view lists the seconds of execution with the Total CPU time by
   3658 default.  Other metrics might also be displayed if the metrics are present
   3659 in the loaded experiments.
   3660 
   3661 @item threads
   3662 @IndexSubentry{Options,  @code{-threads}}
   3663 @IndexSubentry{Commands, @code{threads}}
   3664 
   3665 Show a list of threads and their metrics.  The threads are represented
   3666 by a process and thread pair and show the Total CPU time by default.
   3667 Other metrics might also be displayed by default if the metrics are
   3668 present in the loaded experiment.
   3669 
   3670 @item thread_list
   3671 @IndexSubentry{Options,  @code{-thread_list}}
   3672 @IndexSubentry{Commands, @code{thread_list}}
   3673 
   3674 Display the list of threads currently selected for the analysis.
   3675 
   3676 @end table
   3677 
   3678 @noindent
   3679 @emph{The commands below are for use in scripts and interactive mode only.
   3680 They are not allowed on the command line.}
   3681 
   3682 @table @code
   3683 
   3684 @item add_exp @var{exp-name}
   3685 @IndexSubentry{Commands, @code{add_exp}}
   3686 
   3687 Add the named experiment to the current session.
   3688 
   3689 @item drop_exp @var{exp-name}
   3690 @IndexSubentry{Commands, @code{drop_exp}}
   3691 
   3692 Drop the named experiment from the current session.
   3693 
   3694 @item open_exp @var{exp-name}
   3695 @IndexSubentry{Commands, @code{open_exp}}
   3696 
   3697 Drop all loaded experiments from the session, and then load the named
   3698 experiment.
   3699 
   3700 @end table
   3701 
   3702 @c -- A new sub subsection ----------------------------------------------------
   3703 @node                Commands that Affect Listings and Output
   3704 @unnumberedsubsubsec Commands that Affect Listings and Output
   3705 @c ----------------------------------------------------------------------------
   3706 
   3707 @table @code
   3708 
   3709 @item dthresh @var{value}
   3710 @IndexSubentry{Options,  @code{-dthresh}}
   3711 @IndexSubentry{Commands, @code{dthresh}}
   3712 
   3713 Specify the threshold percentage for highlighting metrics in the annotated
   3714 disassembly code.  If the value of any metric is equal to or greater than
   3715 @var{value} as a percentage of the maximum value of that metric for any
   3716 instruction line in the file, the line on which the metrics occur has a
   3717 @samp{##} marker inserted at the beginning of the line.  The default is 75.
   3718 
   3719 @item printmode @{text | html | @var{single-char}@}
   3720 @IndexSubentry{Options,  @code{-printmode}}
   3721 @IndexSubentry{Commands, @code{printmode}}
   3722 
   3723 Set the print mode.  If the keyword is @code{text}, printing will be done in
   3724 tabular form using plain text.  In case the @code{html} keyword is selected,
   3725 the output is formatted as an HTML table.
   3726 
   3727 Alternatively, @var{single-char} may be used in a delimiter separated list,
   3728 with the single character @var{single-char} as the delimiter.
   3729 
   3730 The printmode setting is used only for those commands that generate tables,
   3731 such as @command{functions}.  The setting is ignored for other printing
   3732 commands, including those showing source and disassembly listings.
   3733 
   3734 @item sthresh @var{value}
   3735 @IndexSubentry{Options,  @code{-sthresh}}
   3736 @IndexSubentry{Commands, @code{sthresh}}
   3737 
   3738 Specify the threshold percentage for highlighting metrics in the annotated
   3739 source code.  If the value of any metric is equal to or greater than
   3740 @var{value} (as a percentage) of the maximum value of that metric for any
   3741 source line in the file, the line on which the metrics occur has a @samp{##}
   3742 marker inserted at the beginning of the line.  The default is 75.
   3743 
   3744 @end table
   3745 
   3746 @c -- A new sub subsection ----------------------------------------------------
   3747 @node                Predefined Filters
   3748 @unnumberedsubsubsec Predefined Filters
   3749 @c ----------------------------------------------------------------------------
   3750 
   3751 The filters below use a list, the selection list, to define a sequence of
   3752 numbers. @xref{The Selection List}.
   3753 Note that this selection is persistent, but the filter can be reset by using
   3754 @samp{all} as the @var{selection-list}.
   3755 
   3756 @table @code
   3757 
   3758 @item cpu_select @var{selection-list}
   3759 @IndexSubentry{Options,  @code{-cpu_select}}
   3760 @IndexSubentry{Commands, @code{cpu_select}}
   3761 
   3762 Select the CPU ids specified in the @var{selection-list}.
   3763 
   3764 @item lwp_select @var{selection-list}
   3765 @IndexSubentry{Options,  @code{-lwp_select}}
   3766 @IndexSubentry{Commands, @code{lwp_select}}
   3767 
   3768 Select the LWPs specified in the @var{selection-list}.
   3769 
   3770 @item sample_select @var{selection-list}
   3771 @IndexSubentry{Options,  @code{-sample-select}}
   3772 @IndexSubentry{Commands, @code{sample-select}}
   3773 
   3774 @item thread_select @var{selection-list}
   3775 @IndexSubentry{Options,  @code{-thread_select}}
   3776 @IndexSubentry{Commands, @code{thread_select}}
   3777 
   3778 Select a series of threads, or just one, to be used in subsequent views.
   3779 The @var{selection-list} consists of a sequence of comma separated numbers.
   3780 This may include a range of the form @samp{n-m}.
   3781 
   3782 @end table
   3783 
   3784 
   3785 @c -- A new sub subsection ----------------------------------------------------
   3786 @node                Commands to Set and Change Search Paths
   3787 @unnumberedsubsubsec Commands to Set and Change Search Paths
   3788 @c ----------------------------------------------------------------------------
   3789 
   3790 @table @code
   3791 
   3792 @item addpath @var{path-list}
   3793 @IndexSubentry{Options,  @code{-addpath}}
   3794 @IndexSubentry{Commands, @code{addpath}}
   3795 
   3796 Append @var{path-list} to the current setpath settings.  Note that multiple
   3797 @command{addpath} commands can be used in @file{.gprofng.rc} files, and will
   3798 be concatenated.
   3799 
   3800 @item pathmap @var{old-prefix} @var{new-prefix}
   3801 @IndexSubentry{Options,  @code{-pathmap}}
   3802 @IndexSubentry{Commands, @code{pathmap}}
   3803 
   3804 If a file cannot be found using the path list set by @command{addpath}, or
   3805 the @command{setpath} command, one or more path remappings may be set with the
   3806 @command{pathmap} command.
   3807 
   3808 With path mapping, the user can specify how to replace the leading component
   3809 in a full path by a different string.
   3810 
   3811 With this command, any path name for a source file, object file, or shared
   3812 object that begins with the prefix specified with @var{old-prefix}, the
   3813 old prefix is replaced by the prefix specified with @var{new-prefix}.
   3814 The resulting path is used to find the file.
   3815 
   3816 For example, if a source file located in directory @file{/tmp}
   3817 is shown in the @DisplayText{} output, but should instead be taken from
   3818 @file{/home/demo}, the following @file{pathmap} command redefines the
   3819 path:
   3820 
   3821 @smallexample
   3822 $ gprofng diplay text -pathmap /tmp /home/demo -source ...
   3823 @end smallexample
   3824 
   3825 Note that multiple @command{pathmap} commands can be supplied, and each is
   3826 tried until the file is found.
   3827 
   3828 @item setpath @var{path-list}
   3829 @IndexSubentry{Options,  @code{-setpath}}
   3830 @IndexSubentry{Commands, @code{setpath}}
   3831 
   3832 Set the path used to find source and object files.  The path is defined
   3833 through the @var{path-list} keyword.  It is a colon separated list of
   3834 directories, jar files, or zip files.
   3835 If any directory has a colon character in it, escape it with a
   3836 backslash (@samp{\}).
   3837 
   3838 The special directory name @code{$expts}, refers
   3839 to the set of current experiments in the order in which they were loaded.
   3840 You can abbreviate it with a single @samp{$} character.
   3841 
   3842 The default path is @samp{$expts:..} which is the directories of the
   3843 loaded experiments and the current working directory.
   3844 
   3845 Use @command{setpath} with no argument to display the current path.
   3846 
   3847 Note that @command{setpath} commands @emph{are not allowed .gprofng.rc
   3848 configuration files}.
   3849 
   3850 @end table
   3851 
   3852 @c -- A new subsection --------------------------------------------------------
   3853 @c -- TBD @node       Usage examples for @code{gprofng display text}
   3854 @c -- TBD @subsection Usage examples for @code{gprofng display text}
   3855 @c ----------------------------------------------------------------------------
   3856 
   3857 @c -- TBD In this section we present usage examples.
   3858 
   3859 @c -- A new chapter -----------------------------------------------------------
   3860 @c  TBD @node    The @code{gprofng display html} Tool
   3861 @c  TBD @section The @code{gprofng display html} Tool
   3862 @c ----------------------------------------------------------------------------
   3863 @c  TBD The options are documented in the man page for this tool. In this section we
   3864 @c  TBD present usage examples.
   3865 
   3866 @c -- A new chapter -----------------------------------------------------------
   3867 @c  TBD @node    Display Source Code
   3868 @c  TBD @chapter Display Source Code
   3869 @c ----------------------------------------------------------------------------
   3870 @c  TBD The options are documented in the man page for this tool. In this section we
   3871 @c  TBD present usage examples.
   3872 
   3873 
   3874 @c -- A new chapter -----------------------------------------------------------
   3875 @c  TBD @node    Archive Experiment Data
   3876 @c  TBD @chapter Archive Experiment Data
   3877 @c ----------------------------------------------------------------------------
   3878 @c  TBD The options are documented in the man page for this tool. In this section we
   3879 @c  TBD present usage examples.
   3880 
   3881 @c -- A new chapter -----------------------------------------------------------
   3882 @node    Terminology
   3883 @chapter Terminology
   3884 @c ----------------------------------------------------------------------------
   3885 
   3886 Throughout this manual, certain terminology specific to profiling tools,
   3887 or @ToolName{}, or even to this document only, is used.  In this chapter
   3888 this terminology is explained in detail.
   3889 
   3890 @menu
   3891 * The Program Counter::                    What is a Program Counter?
   3892 * Inclusive and Exclusive Metrics::        An explanation of inclusive and exclusive metrics.
   3893 * Metric Definitions::                     Definitions associated with metrics.
   3894 * The Viewmode::                           Select the way call stacks are presented.
   3895 * The Selection List::                     How to define a selection.
   3896 * Load Objects and Functions::             The components in an application.
   3897 * The Concept of a CPU in gprofng::        The definition of a CPU.
   3898 * Hardware Event Counters Explained::      What are event counters?
   3899 * apath::                                  Our generic definition of a path.
   3900 @end menu
   3901 
   3902 @c ----------------------------------------------------------------------------
   3903 @node    The Program Counter
   3904 @section The Program Counter
   3905 @c ----------------------------------------------------------------------------
   3906 
   3907 @cindex PC
   3908 @cindex Program Counter
   3909 The @emph{Program Counter}, or PC for short, keeps track where program
   3910 execution is.  The address of the next instruction to be executed is stored
   3911 in a special purpose register in the processor, or core.
   3912 
   3913 @cindex Instruction pointer
   3914 The PC is sometimes also referred to as the @emph{instruction pointer}, but
   3915 we will use Program Counter or PC throughout this document.
   3916 
   3917 @c ----------------------------------------------------------------------------
   3918 @node    Inclusive and Exclusive Metrics
   3919 @section Inclusive and Exclusive Metrics
   3920 @c ----------------------------------------------------------------------------
   3921 
   3922 In the remainder, these two concepts occur quite often and for lack of a better
   3923 place, they are explained here.
   3924 
   3925 @cindex Inclusive metric
   3926 The @emph{inclusive} value for a metric includes all values that are part of
   3927 the dynamic extent of the target function.  For example if function @code{A}
   3928 calls functions @code{B} and @code{C}, the inclusive CPU time for @code{A}
   3929 includes the CPU time spent in @code{B} and @code{C}.
   3930 
   3931 @cindex Exclusive metric
   3932 In contrast with this, the @emph{exclusive} value for a metric is computed
   3933 by excluding the metric values used by other functions called.  In our
   3934 imaginary example, the exclusive CPU time for function @code{A} is the
   3935 time spent outside calling functions @code{B} and @code{C}.
   3936 
   3937 @cindex Leaf function
   3938 In case of a @emph{leaf function}, the inclusive and exclusive values for the
   3939 metric are the same since by definition, it is not calling any other
   3940 function(s).
   3941 
   3942 Why do we use these two different values? The inclusive metric shows the most
   3943 expensive path, in terms of this metric, in the application.  For example, if
   3944 the metric is cache misses, the function with the highest inclusive metric
   3945 tells you where most of the cache misses come from.
   3946 
   3947 Within this branch of the application, the exclusive metric points to the
   3948 functions that contribute and help to identify which part(s) to consider
   3949 for further analysis.
   3950 
   3951 @c ----------------------------------------------------------------------------
   3952 @node    Metric Definitions
   3953 @section Metric Definitions
   3954 @c ----------------------------------------------------------------------------
   3955 The metrics displayed in the various views are highly customizable.  In this
   3956 section it is explained how to construct the metrics definition(s).
   3957 
   3958 @IndexSubentry{Options,  @code{-metrics}}
   3959 @IndexSubentry{Commands, @code{metrics}}
   3960 The @command{metrics} command takes a colon (@samp{:}) separated list, where
   3961 each item in the list consists of the following three fields:
   3962 @var{<flavor>}@var{<visibility>}@var{<metric-name>}.
   3963 
   3964 @cindex Flavor field
   3965 @cindex Visibility field
   3966 @cindex Metric name field
   3967 @IndexSubentry{Metrics, Flavor field}
   3968 @IndexSubentry{Metrics, Visibility field}
   3969 @IndexSubentry{Metrics, Metric name field}
   3970 The @var{<flavor>} field is either @samp{e} for ``exclusive'', and/or
   3971 @samp{i} for ``inclusive''.  The @var{<metric-name>} field is the name of
   3972 the metric and the @var{<visibility>} field consists of one ore more characters
   3973 from the following table:
   3974 
   3975 @table @code
   3976 
   3977 @item .
   3978 Show the metric as time.  This applies to timing metrics and hardware event
   3979 counters that measure cycles.  Interpret as @samp{+} for other metrics.
   3980 
   3981 @item %
   3982 Show the metric as a percentage of the total value for this metric.
   3983 
   3984 @item +
   3985 Show the metric as an absolute value.  For hardware event counters this is
   3986 the event count.  Interpret as @samp{.} for timing metrics.
   3987 
   3988 @item !
   3989 Do not show any metric value.  Cannot be used with other visibility characters.
   3990 This visibility is meant to be used in a @command{dmetrics} command to set
   3991 default metrics that override the built-in visibility defaults
   3992 for each type of metric.
   3993 
   3994 @end table
   3995 
   3996 Both the @var{<flavor>} and @var{<visibility>} strings may have more than one
   3997 character.  If both strings have more than one character, the @var{<flavor>}
   3998 string is expanded first.  For example, @code{ie.%user} is first expanded to
   3999 @code{i.%user:e.%user}, which is then expanded into
   4000 @code{i.user:i%user:e.user:e%user}.
   4001 
   4002 @c ----------------------------------------------------------------------------
   4003 @node    The Viewmode
   4004 @section The Viewmode
   4005 
   4006 @cindex Viewmode
   4007 @IndexSubentry{Options,  @code{-viewmode}}
   4008 @IndexSubentry{Commands, @code{viewmode}}
   4009 
   4010 There are different ways to view a call stack in Java.  In @ToolName{}, this
   4011 is called the @emph{viewmode} and the setting is controlled through a command
   4012 with the same name.
   4013 
   4014 The @code{viewmode} command takes one of the following keywords:
   4015 
   4016 @table @code
   4017 
   4018 @item user
   4019 This is the default and shows the Java call stacks for Java threads.
   4020 No call stacks for any housekeeping threads are shown.  The function
   4021 list contains a function
   4022 @IndexSubentry{Java profiling, @code{<JVM-System>}}
   4023 @code{<JVM-System>} that represents the aggregated time from non-Java
   4024 threads.
   4025 When the JVM software does not report a Java call stack, time is reported
   4026 against the function
   4027 @IndexSubentry{Java profiling, @code{<no Java callstack recorded>}}
   4028 @code{<no Java callstack recorded>}.
   4029 
   4030 @item expert
   4031 Show the Java call stacks for Java threads when the Java code from the
   4032 user is executed and machine call stacks when JVM code is executed, or
   4033 when the JVM software does not report a Java call stack.
   4034 Show the machine call stacks for housekeeping threads.
   4035 
   4036 @item machine
   4037 Show the actual native call stacks for all threads.
   4038 
   4039 @end table
   4040 
   4041 @c ----------------------------------------------------------------------------
   4042 @node    The Selection List
   4043 @section The Selection List
   4044 @c ----------------------------------------------------------------------------
   4045 
   4046 @cindex Selection list
   4047 @cindex List specification
   4048 Several commands allow the user to specify a sequence of numbers called the
   4049 @emph{selection list}.  Such a list may for example be used to select specific
   4050 threads from all the threads that have been used when conducting the
   4051 experiment(s).
   4052 
   4053 A selection list (or ``list'' in the remainder of this section) can be a
   4054 single number, a contiguous range of numbers with the start and end numbers
   4055 separated by a hyphen (@samp{-}), a comma-separated list of numbers and
   4056 ranges, or the @code{all} keyword that resets the filter.
   4057 @IndexSubentry{Filters, Reset to default}
   4058 Lists must not contain spaces.
   4059 
   4060 Each list can optionally be preceded by an experiment list with a similar
   4061 format, separated from the list by a colon (:).
   4062 If no experiment list is included, the list applies to all experiments.
   4063 
   4064 Multiple lists can be concatenated by separating the individual lists
   4065 by a plus sign.
   4066 
   4067 These are some examples of various filters using a list:
   4068 
   4069 @table @code
   4070 
   4071 @item thread_select 1
   4072 Select thread 1 from all experiments.
   4073 
   4074 @item thread_select all:1
   4075 Select thread 1 from all experiments.
   4076 
   4077 @item thread_select 1:all
   4078 Select all the threads from the first experiment loaded.
   4079 
   4080 @item thread_select 1:2+3:4
   4081 Select thread 2 from experiment 1 and thread 4 from experiment 3.
   4082 
   4083 @item cpu_select all:1,3,5
   4084 Selects cores 1, 3, and 5 from all experiments.
   4085 
   4086 @item cpu_select 1,2:all
   4087 Select all cores from experiments 1 and 2.
   4088 
   4089 @end table
   4090 
   4091 Recall that there are several list commands that show the mapping between the
   4092 numbers and the targets.
   4093 
   4094 @IndexSubentry{Options, @code{-experiment_list}}
   4095 @IndexSubentry{Commands, @code{experiment_list}}
   4096 For example, the @command{experiment_list} command shows the name(s) of the
   4097 experiment(s) loaded and the associated number.  In this example it is used
   4098 to get this information for a range of experiments:
   4099 
   4100 @cartouche
   4101 @smallexample
   4102 $ gprofng display text -experiment_list mxv.?.thr.er
   4103 @end smallexample
   4104 @end cartouche
   4105 
   4106 @noindent
   4107 This is the output, showing for each experiment the ID, the PID, and the name:
   4108 
   4109 @smallexample
   4110 @verbatim
   4111 ID Sel     PID Experiment
   4112 == === ======= ============
   4113  1 yes 2750071 mxv.1.thr.er
   4114  2 yes 1339450 mxv.2.thr.er
   4115  3 yes 3579561 mxv.4.thr.er
   4116 @end verbatim
   4117 @end smallexample
   4118 
   4119 @c ----------------------------------------------------------------------------
   4120 @node    Load Objects and Functions
   4121 @section Load Objects and Functions
   4122 @c ----------------------------------------------------------------------------
   4123 
   4124 An application consists of various components.  The source code files are
   4125 compiled into object files.  These are then glued together at link time to form
   4126 the executable.
   4127 During execution, the program may also dynamically load objects.
   4128 
   4129 @cindex Load object
   4130 A @emph{load object} is defined to be an executable, or shared object.  A
   4131 shared library is an example of a load object in @ToolName{}.
   4132 
   4133 Each load object, contains a text section with the instructions generated by
   4134 the compiler, a data section for data, and various symbol tables.
   4135 All load objects must contain an
   4136 @cindex ELF
   4137 ELF
   4138 symbol table, which gives the names and addresses of all the globally known
   4139 functions in that object.
   4140 
   4141 Load objects compiled with the -g option contain additional symbolic
   4142 information that can augment the ELF symbol table and provide information
   4143 about functions that are not global, additional information about object
   4144 modules from which the functions came, and line number information relating
   4145 addresses to source lines.
   4146 
   4147 The term
   4148 @cindex Function
   4149 @emph{function}
   4150 is used to describe a set of instructions that represent a high-level operation
   4151 described in the source code.  The term also covers methods as used in C++
   4152 and in the Java programming language.
   4153 
   4154 In the @ToolName{} context, functions are provided in source code format.
   4155 Normally their names appear in the symbol table representing a set of
   4156 addresses.
   4157 @cindex Program Counter
   4158 @cindex PC
   4159 If the Program Counter (PC) is within that set, the program is executing within
   4160 that function.
   4161 
   4162 In principle, any address within the text segment of a load object can be
   4163 mapped to a function.  Exactly the same mapping is used for the leaf PC and
   4164 all the other PCs on the call stack.
   4165 
   4166 Most of the functions correspond directly to the source model of the program,
   4167 but there are exceptions.  This topic is however outside of the scope of this
   4168 guide.
   4169 
   4170 @c ----------------------------------------------------------------------------
   4171 @node    The Concept of a CPU in gprofng
   4172 @section The Concept of a CPU in gprofng
   4173 @c ----------------------------------------------------------------------------
   4174 
   4175 @cindex CPU
   4176 In @ProductName{}, there is the concept of a CPU.  Admittedly, this is not the
   4177 best word to describe what is meant here and may be replaced in the future.
   4178 
   4179 The word CPU is used in many of the displays.
   4180 In the context of @ProductName{}, it is meant to denote a part of the
   4181 processor that is capable of executing instructions and with its own state,
   4182 like the program counter.
   4183 
   4184 For example, on a contemporary processor, a CPU could be a core.  In case
   4185 hardware threads are supported within a core, a CPU is one of those
   4186 hardware threads.
   4187 
   4188 To see which CPUs have been used in the experiment, use the @command{cpu}
   4189 command in @DisplayText{}.
   4190 
   4191 @c ----------------------------------------------------------------------------
   4192 @node    Hardware Event Counters Explained
   4193 @section Hardware Event Counters Explained
   4194 @c ----------------------------------------------------------------------------
   4195 
   4196 @IndexSubentry{Hardware event counters, description}
   4197 For quite a number of years now, many microprocessors have supported hardware
   4198 event counters.
   4199 
   4200 On the hardware side, this means that in the processor there are one or more
   4201 registers dedicated to count certain activities, or ``events''.
   4202 Examples of such events are the number of instructions executed, or the
   4203 number of cache misses at level 2 in the memory hierarchy.
   4204 
   4205 While there is a limited set of such registers, the user can map events onto
   4206 them.  In case more than one register is available, this allows for the
   4207 simultaenous measurement of various events.
   4208 
   4209 A simple, yet powerful, example is to simultaneously count the number of CPU
   4210 cycles and the number of instructions excuted.  These two numbers can then be
   4211 used to compute the
   4212 @cindex IPC
   4213 @emph{IPC} value.  IPC stands for ``Instructions Per Clockcycle'' and each
   4214 processor has an architecturally defined maximum.  For example, if this
   4215 maximum number is 2, it means the processor is capable of executing two
   4216 instructions every clock cycle.
   4217 
   4218 Whether this is actually achieved, depends on several factors, including the
   4219 instruction characteristics.
   4220 However, in case the IPC value is well below this maximum in a time critical
   4221 part of the application and this cannot be easily explained, further
   4222 investigation is probably warranted.
   4223 
   4224 @cindex CPI
   4225 A related metric is called @emph{CPI}, or ``Clockcycles Per Instruction''.
   4226 It is the inverse of the CPI and can be compared against the theoretical
   4227 value(s) of the target instruction(s).  A significant difference may point
   4228 at a bottleneck.
   4229 
   4230 One thing to keep in mind is that the value returned by a counter can either
   4231 be the number of times the event occured, or a CPU cycle count.  In case of
   4232 the latter it is possible to convert this number to time.
   4233 
   4234 @IndexSubentry{Hardware event counters, variable CPU frequency}
   4235 This is often easier to interpret than a simple count, but there is one
   4236 caveat to keep in mind.  The CPU frequency may not have been constant while
   4237 the experimen was recorded and this impacts the time reported.
   4238 
   4239 These event counters, or ``counters'' for short, provide great insight into
   4240 what happens deep inside the processor.  In case higher level information does
   4241 not provide the insight needed, the counters provide the information to get
   4242 to the bottom of a performance problem.
   4243 
   4244 There are some things to consider though.
   4245 
   4246 @itemize @bullet
   4247 
   4248 @item
   4249 The event definitions and names vary across processors and it may even happen
   4250 that some events change with an update.
   4251 Unfortunately and this is luckily rare, there are sometimes bugs causing the
   4252 wrong count to be returned.
   4253 
   4254 @IndexSubentry{Hardware event counters, alias name}
   4255 In @ToolName{}, some of the processor specific event names have an alias
   4256 name.  For example @code{insts} measures the instructions executed.
   4257 These aliases not only makes it easier to identify the functionality, but also
   4258 provide portability of certain events across processors.
   4259 
   4260 @item
   4261 Another complexity is that there are typically many events one can monitor.
   4262 There may up to hundreds of events available and it could require several
   4263 experiments to zoom in on the root cause of a performance problem.
   4264 
   4265 @item
   4266 There may be restrictions regarding the mapping of event(s) onto the
   4267 counters.  For example, certain events may be restricted to specific
   4268 counters only.  As a result, one may have to conduct additional experiments
   4269 to cover all the events of interest.
   4270 
   4271 @item
   4272 The names of the events may also not be easy to interpret.  In such cases,
   4273 the description can be found in the architecture manual for the processor.
   4274 
   4275 @end itemize
   4276 
   4277 Despite these drawbacks, hardware event counters are extremely useful and
   4278 may even turn out to be indispensable.
   4279 
   4280 @c ----------------------------------------------------------------------------
   4281 @node    apath
   4282 @section What is <apath>?
   4283 @c ----------------------------------------------------------------------------
   4284 
   4285 In most cases, @ToolName{} shows the absolute pathnames of directories.  These
   4286 tend to be rather long, causing display issues in this document.
   4287 
   4288 Instead of wrapping these long pathnames over multiple lines, we decided to
   4289 represent them by the @code{<apath>} symbol, which stands for ``an absolute
   4290 pathname''.
   4291 
   4292 Note that different occurrences of @code{<apath>} may represent different
   4293 absolute pathnames.
   4294 
   4295 @c -- A new node --------------------------------------------------------------
   4296 @node    Other Document Formats
   4297 @chapter Other Document Formats
   4298 @c ----------------------------------------------------------------------------
   4299 
   4300 @emph{This chapter is applicable when building gprofng from the
   4301 binutils source.}
   4302 
   4303 This document is written in Texinfo and the source text is made available as
   4304 part of the binutils distribution.  The file name is @code{gprofng.texi} and
   4305 can be found in subdirectory @code{gprofng/doc} of the top level binutils
   4306 directory.
   4307 
   4308 The default installation procedure creates a file in the @code{info} format and
   4309 stores it in the documentation section of binutils.
   4310 This source file can however also be used to generate the document in the
   4311 @code{html} and @code{pdf} formats.  These may be easier to read and search.
   4312 
   4313 To generate this documentation file in a different format, go to the directory
   4314 that was used to build the tools.  The make file to build the other formats is
   4315 in the @code{gprofng/doc} subdirectory.
   4316 
   4317 For example, if you have set the build directory to be @var{<my-build-dir>},
   4318 go to subdirectory @var{<my-build-dir>/gprofng/doc}.
   4319 
   4320 This subdirectory has a single file called @file{Makefile} that can be used to
   4321 build the documentation in various formats.  We recommend to use these
   4322 commands.
   4323 
   4324 There are four commands to generate the documentation in the @code{html} or
   4325 @code{pdf} format.  It is assumed that you are in directory @code{gprofng/doc}
   4326 under the main directory @var{<my-build-dir>}.
   4327 
   4328 @table @code
   4329 
   4330 @item make html
   4331 Create the html file in the current directory.
   4332 
   4333 @item make pdf
   4334 Create the pdf file in the current directory.
   4335 
   4336 @item make install-html
   4337 Create and install the html file in the binutils documentation directory.
   4338 
   4339 @item make install-pdf
   4340 Create and install the pdf file in the binutils documentation directory.
   4341 
   4342 @end table
   4343 
   4344 For example, to install this document in the binutils documentation directory,
   4345 the commands below may be executed.  In this notation, @var{<format>}
   4346 is one of @code{html}, or @code{pdf}:
   4347 
   4348 @smallexample
   4349 @verbatim
   4350 $ cd <my-build-dir>/gprofng/doc
   4351 $ make install-<format>
   4352 @end verbatim
   4353 @end smallexample
   4354 
   4355 The binutils installation directory is either the default @code{/usr/local} or
   4356 the one that has been set with the @code{--prefix} option as part of the
   4357 @code{configure} command.  In this example we symbolize this location with
   4358 @code{<install>}.
   4359 
   4360 The documentation directory is @code{<install>/share/doc/gprofng} in case
   4361 @code{html} or @code{pdf} is selected and @code{<install>/share/info} for the
   4362 file in the @code{info} format.
   4363 
   4364 @noindent
   4365 Some things to note:
   4366 
   4367 @itemize
   4368 
   4369 @item
   4370 For the @code{pdf} file to be generated, the @code{texi2dvi} tool is required.
   4371 @cindex texi2dvi
   4372 It is for example available as part of the @code{texinfo-tex} package.
   4373 
   4374 @item
   4375 Instead of generating a single file in the @code{html} format, it is also
   4376 possible to create a directory with individual files for the various chapters.
   4377 To do so, remove the use of @code{--no-split} in variable @code{MAKEINFOHTML}
   4378 in the make file in the @code{<my-build-dir/gprofng/doc} directory.
   4379 
   4380 @end itemize
   4381 
   4382 @c -- An appendix -------------------------------------------------------------
   4383 @node     The gprofng Man Pages
   4384 @appendix The gprofng Man Pages
   4385 @c ----------------------------------------------------------------------------
   4386 
   4387 In this appendix the man pages for the various @ProductName{} tools are listed.
   4388 
   4389 @c -- A new node --------------------------------------------------------------
   4390 @c  @node    gprofng driver
   4391 @node    Man page for @command{gprofng}
   4392 @section Man page for @command{gprofng}
   4393 @c ----------------------------------------------------------------------------
   4394 
   4395 @include gprofng.texi
   4396 
   4397 @c -- A new node --------------------------------------------------------------
   4398 @page
   4399 @node    Man page for @command{gprofng collect app}
   4400 @section Man page for @command{gprofng collect app}
   4401 @c ----------------------------------------------------------------------------
   4402 
   4403 @include gprofng-collect-app.texi
   4404 
   4405 @c -- A new node --------------------------------------------------------------
   4406 @page
   4407 @node    Man page for @command{gprofng display text}
   4408 @section Man page for @command{gprofng display text}
   4409 @c ----------------------------------------------------------------------------
   4410 
   4411 @include gprofng-display-text.texi
   4412 
   4413 @c -- A new node --------------------------------------------------------------
   4414 @page
   4415 @node    Man page for @command{gprofng display html}
   4416 @section Man page for @command{gprofng display html}
   4417 @c ----------------------------------------------------------------------------
   4418 
   4419 @include gprofng-display-html.texi
   4420 
   4421 @c -- A new node --------------------------------------------------------------
   4422 @page
   4423 @node    Man page for @command{gprofng display src}
   4424 @section Man page for @command{gprofng display src}
   4425 @c ----------------------------------------------------------------------------
   4426 
   4427 @include gprofng-display-src.texi
   4428 
   4429 @c -- A new node --------------------------------------------------------------
   4430 @page
   4431 @node    Man page for @command{gprofng archive}
   4432 @section Man page for @command{gprofng archive}
   4433 @c ----------------------------------------------------------------------------
   4434 
   4435 @include gprofng-archive.texi
   4436 
   4437 @ifnothtml
   4438 @node       Index
   4439 @unnumbered Index
   4440 @printindex cp
   4441 @end ifnothtml
   4442 
   4443 @bye
   4444