r600-sb
=======

* * * * *

Debugging
---------

### Environment variables

-   **R600\_DEBUG**

    There are new flags:

    -   **nosb** - Disable sb backend for graphics shaders
    -   **sbcl** - Enable optimization of compute shaders (experimental)
    -   **sbdry** - Dry run, optimize but use source bytecode -
        useful if you only want to check shader dumps
        without the risk of lockups and other problems
    -   **sbstat** - Print optimization statistics (so far, only timing)
    -   **sbdump** - Print IR after some passes
    -   **sbnofallback** - Abort on errors instead of falling back
    -   **sbdisasm** - Use sb disassembler for shader dumps
    -   **sbsafemath** - Disable unsafe math optimizations

### Regression debugging

If there are any regressions compared to the non-sb backend
(R600\_DEBUG=nosb), the following environment variables can be used
to find the incorrectly optimized shader that causes the regression.

-   **R600\_SB\_DSKIP\_MODE** - allows skipping optimization for some
    shaders
    -   0 - disabled (default)
    -   1 - skip optimization for the shaders in the range
        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
        optimize only the shaders that are not in this range
    -   2 - optimize only the shaders in the range
        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]

-   **R600\_SB\_DSKIP\_START** - start of the range (1-based)

-   **R600\_SB\_DSKIP\_END** - end of the range (1-based)

Example - optimize only the shaders 5, 6, and 7:

    R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
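The three modes can be summarized by a small predicate. The following Python sketch only illustrates the semantics described above; it is not the driver's code, and the function name `should_optimize` and its parameters are hypothetical. It assumes that the range is inclusive on both ends (as in the example with shaders 5, 6, and 7) and that any mode other than 1 or 2 behaves like 0.

```python
def should_optimize(index, mode, start, end):
    """Return True if the shader with the given 1-based index is optimized.

    Hypothetical helper mirroring the documented R600_SB_DSKIP_MODE values;
    the range [start; end] is treated as inclusive on both ends.
    """
    in_range = start <= index <= end
    if mode == 1:
        # Skip optimization inside the range, optimize everything else.
        return not in_range
    if mode == 2:
        # Optimize only inside the range.
        return in_range
    # Mode 0 (default): skipping is disabled, so every shader is optimized.
    return True


# With START=5, END=7, MODE=2 only shaders 5, 6 and 7 are optimized.
optimized = [i for i in range(1, 11) if should_optimize(i, 2, 5, 7)]
```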

All shaders compiled by the application are numbered starting from 1.
The number of shaders used by the application can be obtained by running
it with "R600_DEBUG=sbstat" - it will print "sb: shader \#index\#"
for each compiled shader.

After figuring out the total number of shaders used by the application,
the variables above can be used to bisect for the shader that causes
the regression. E.g. if the application uses 100 shaders, we
can split the range [1; 100] in half and run the application with
optimization enabled only for the first half of the shaders:

    R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>

If the regression is reproduced with these parameters, then the failing
shader is in the range [1; 50]; if it's not reproduced, then it's in
the range [51; 100]. Then we can divide the new range again and repeat
the testing until the range is reduced to a single failing shader.

*NOTE: This method relies on the assumption that the application
produces the same sequence of shaders on each run. This is not always
true: some applications may produce different sequences of shaders.
In such cases, tools like apitrace may be used to record a trace
of the application, and this method may then be applied when replaying
the trace. This may also be faster and/or more convenient than testing
the application itself.*
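The bisection procedure described above can be sketched as follows. This is a minimal illustration in Python, not a tool shipped with the driver: `bisect_failing_shader`, `dskip_env`, and the `reproduces` callback are hypothetical names, and the sketch assumes that exactly one shader is miscompiled and that the shader order is the same on every run.

```python
def bisect_failing_shader(total, reproduces):
    """Narrow the range [1; total] down to a single shader index.

    `reproduces(start, end)` is a caller-supplied (hypothetical) callback
    that runs the application with optimization enabled only for shaders in
    [start; end] (R600_SB_DSKIP_MODE=2) and returns True if the regression
    shows up. Assumes exactly one shader is miscompiled.
    """
    lo, hi = 1, total
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(lo, mid):
            hi = mid        # the failing shader is in the first half
        else:
            lo = mid + 1    # otherwise it must be in the second half
    return lo


def dskip_env(start, end):
    """Environment variables that optimize only the shaders in [start; end]."""
    return {
        "R600_SB_DSKIP_MODE": "2",
        "R600_SB_DSKIP_START": str(start),
        "R600_SB_DSKIP_END": str(end),
    }


# Simulated run: pretend shader 37 out of 100 is the miscompiled one.
found = bisect_failing_shader(100, lambda start, end: start <= 37 <= end)
```

In a real session the callback would launch the application (or an apitrace replay) with the variables from `dskip_env` added to its environment and report whether the regression is visible.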

* * * * *

Intermediate Representation
---------------------------

### Values

All kinds of operands (literal constants, references to kcache
constants, references to GPRs, etc.) are currently represented by the
**value** class (it might make sense to switch to a hierarchy of
classes derived from **value** instead, to save some memory).

All values (except some pseudo values like the exec\_mask or predicate
register) represent 32-bit scalar values - there are no vector values;
CF/FETCH instructions use groups of 4 values for src and dst operands.

### Nodes

Shader programs are represented using a tree data structure; some
nodes contain a list of subnodes.

#### Control flow nodes

Control flow information is represented using four special node types
(based on the ideas from [[1]](#references)):

-   **region\_node** - single-entry, single-exit region.

    All loops and ifs in the program are enclosed in region nodes.
    Region nodes have two containers for phi nodes -
    region\_node::loop\_phi contains the phi expressions to be executed
    at the region entry, region\_node::phi contains the phi expressions
    to be executed at the region exit. It's the only node type
    that has associated phi expressions.

-   **depart\_node** - "depart region \$id after { ... }"

    Depart target region (jump to exit point) after executing contained
    code.

-   **repeat\_node** - "repeat region \$id after { ... }"

    Repeat target region (jump to entry point) after executing contained
    code.

-   **if\_node** - "if (cond) { ... }"

    Execute contained code if condition is true. The difference from
    [[1]](#references) is that we don't have associated phi expressions
    for the **if\_node**; we enclose the **if\_node** in a
    **region\_node** and store the corresponding phis in the
    **region\_node**, which allows more uniform handling.

The target region of depart and repeat nodes is always a region in which
they are located (possibly inside a nested region); there are no
arbitrary jumps/gotos - control flow in the program is always structured.

Typical control flow constructs can be represented as in the following
examples:

GLSL:

    if (cond) {
        < 1 >
    } else {
        < 2 >
    }

IR:

    region #0 {
        depart region #0 after {
            if (cond) {
                depart region #0 after {
                    < 1 >
                }
            }
            < 2 >
        }
        <region #0 phi nodes >
    }

GLSL:

    while (cond) {
        < 1 >
    }

IR:

    region #0 {
        <region #0 loop_phi nodes>
        repeat region #0 after {
            region #1 {
                depart region #1 after {
                    if (!cond) {
                        depart region #0
                    }
                }
            }
            < 1 >
        }
        <region #0 phi nodes>
    }

'Break' and 'continue' inside loops are translated directly to
depart and repeat nodes for the corresponding loop region.

This may look a bit too complicated, but in fact it allows simpler
and more uniform handling of the control flow.

All loop\_phi nodes for a given region have the same number of source
operands, as do all of its phi nodes. The number of source operands for
region\_node::loop\_phi nodes is 1 + number of repeat nodes that
reference this region as a target. The number of source operands for
region\_node::phi nodes is equal to the number of depart nodes that
reference this region as a target. All depart/repeat nodes for the
region have unique indices equal to the index of the corresponding
source operand of the phi/loop\_phi nodes.

The first source operand of region\_node::loop\_phi nodes (src[0]) is
the incoming value that enters the region from the outside. Each
remaining source operand comes from the corresponding repeat node.

More complex example:

GLSL:

    a = 1;
    while (a < 5) {
        a = a * 2;
        if (b == 3) {
            continue;
        } else {
            a = 6;
        }
        if (c == 4)
            break;
        a = a + 1;
    }

IR with SSA form:

    a.1 = 1;
    region #0 {
        // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
        region #0 loop_phi: a.2 = phi a.1, a.6, a.3

        repeat_1 region #0 after {
            a.3 = a.2 * 2;
            cond1 = (b == 3);
            region #1 {
                depart_0 region #1 after {
                    if (cond1) {
                        repeat_2 region #0;
                    }
                    a.4 = 6;
                }

                region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
            }
            cond2 = (c == 4);
            region #2 {
                depart_0 region #2 after {
                    if (cond2) {
                        depart_0 region #0;
                    }
                }
            }
            a.6 = a.5 + 1;
        }

        region #0 phi: a.7 = phi a.5 // src[0] - from depart_0
    }

Phi nodes with a single source operand are just copies; they are not
really necessary, but they allow all **depart\_node**s to be handled in
a uniform way.
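The operand-count rules above can be checked against the more complex example with a small model of the tree. This is a Python sketch for illustration only: the `Node` class, its fields, and the helper functions are hypothetical and much simpler than the real node classes, and plain instructions are left out of the tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                  # "region", "depart", "repeat" or "if"
    target: int = -1           # own id for regions, target region id for depart/repeat
    children: list = field(default_factory=list)


def count_targets(node, region_id, kind):
    """Count nodes of `kind` in the subtree that target region `region_id`."""
    hits = 1 if (node.kind == kind and node.target == region_id) else 0
    return hits + sum(count_targets(c, region_id, kind) for c in node.children)


def phi_operand_counts(region):
    """Return (loop_phi operand count, phi operand count) for a region node."""
    repeats = count_targets(region, region.target, "repeat")
    departs = count_targets(region, region.target, "depart")
    return 1 + repeats, departs


# Tree for the "more complex example" (plain instructions omitted).
region1 = Node("region", 1, [
    # depart_0 region #1 containing "if (cond1) { repeat_2 region #0 }"
    Node("depart", 1, [Node("if", children=[Node("repeat", 0)])]),
])
region2 = Node("region", 2, [
    # depart_0 region #2 containing "if (cond2) { depart_0 region #0 }"
    Node("depart", 2, [Node("if", children=[Node("depart", 0)])]),
])
region0 = Node("region", 0, [
    Node("repeat", 0, [region1, region2]),   # repeat_1 region #0
])
```

For region #0 the helper yields 3 loop\_phi operands (incoming value plus two repeat nodes) and 1 phi operand (one depart node), which matches the phi expressions written in the example.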

#### Instruction nodes

Instruction nodes represent different kinds of instructions -
**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
a "bc" structure where all fields of the bytecode are stored (the type
is **bc\_alu** for **alu\_node**, etc.). The operands are represented
using vectors of pointers to the **value** class (node::src, node::dst).

#### SSA-specific nodes

Phi nodes currently don't have a special node class; they are stored as
**node**. The destination vector contains a single destination value,
and the source vector contains 1 or more source values.

Psi nodes [[5], [6]](#references) also don't have a special node class
and are stored as **node**. The source vector contains 3 values for each
source operand - the **value** of the predicate, the **value** of the
corresponding PRED\_SEL field, and the source **value** itself.
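The layout of the psi source vector can be pictured as a flat list that is read three entries at a time. The Python sketch below is only an illustration under that assumption: the helper name `psi_operands`, the use of strings in place of **value** pointers, and all value names in the sample data are made up.

```python
def psi_operands(src):
    """Split a psi node's flat source vector into (pred, pred_sel, value) triples.

    Assumes the layout described above: 3 consecutive entries per operand.
    Strings stand in for pointers to value objects.
    """
    if len(src) % 3 != 0:
        raise ValueError("psi source vector length must be a multiple of 3")
    return [tuple(src[i:i + 3]) for i in range(0, len(src), 3)]


# Two predicated definitions merged by one psi node (placeholder names).
flat_src = ["pred.1", "sel_a", "a.1", "pred.1", "sel_b", "a.2"]
operands = psi_operands(flat_src)
```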

### Indirect addressing

A special kind of value (VLK\_RELREG) is used to represent indirect
operands. These values don't have SSA versions. The representation is
mostly based on [[2]](#references). An indirect operand contains the
"offset/address" value (value::rel) (e.g. some SSA version of the AR
register value, though after some passes it may be any value - constant,
register, etc.). It also contains the maydef and mayuse vectors of
pointers to **value**s (similar to the dst/src vectors in the **node**)
to represent the effects of aliasing in SSA form.

E.g. if we have the array R5.x ... R8.x and the following instruction:

    MOV R0.x, R[5 + AR].x

then the source indirect operand is represented with a VLK\_RELREG value,
value::rel is AR, value::maydef is empty (in fact it always contains the
same number of elements as mayuse to simplify the handling, but they are
NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
corresponding SSA versions after ssa\_rename).

Additional "virtual variables" as in [HSSA [2]](#references) are not
used, and there is no special handling for "zero versions". Typical
programs in our case are small, indirect addressing is rare, and array
sizes are limited by the max GPR number, so we don't really need special
tricks to avoid an explosion of value versions. This also allows more
precise liveness computation for array elements without modifications to
the algorithms.

With the following instruction:

    MOV R[5+AR].x, R0.x

we'll have both the maydef and mayuse vectors for the dst operand filled
with array values initially: [R5.x, R6.x, R7.x, R8.x]. After the
ssa\_rename pass, mayuse will contain the previous versions and maydef
will contain the new potentially-defined versions.
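The two cases above (indirect source and indirect destination) can be modelled with a small data structure. This is a Python sketch for illustration, not the layout of the real **value** class: `RelReg` and the helper functions are hypothetical, strings stand in for **value** pointers, and `None` stands in for NULL.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RelReg:
    """Illustrative stand-in for a VLK_RELREG value."""
    rel: str                                        # offset/address value, e.g. "AR"
    mayuse: List[Optional[str]] = field(default_factory=list)
    maydef: List[Optional[str]] = field(default_factory=list)


def indirect_src(rel, array):
    # A read may use any array element and defines none; maydef is padded
    # with None so that it has the same length as mayuse.
    return RelReg(rel, mayuse=list(array), maydef=[None] * len(array))


def indirect_dst(rel, array):
    # Before ssa_rename both vectors hold the same array values.
    return RelReg(rel, mayuse=list(array), maydef=list(array))


def rename_dst(op, new_versions):
    # After renaming, mayuse keeps the previous versions and maydef holds
    # the new, potentially defined ones.
    return RelReg(op.rel, mayuse=list(op.mayuse), maydef=list(new_versions))


array = ["R5.x", "R6.x", "R7.x", "R8.x"]
src = indirect_src("AR", array)    # MOV R0.x, R[5 + AR].x
dst = indirect_dst("AR", array)    # MOV R[5+AR].x, R0.x
```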

* * * * *

Passes
------

-   **bc\_parser** - creates the IR from the source bytecode,
    initializes src and dst value vectors for instruction nodes. Most
    ALU nodes have one dst operand and the number of source operands is
    equal to the number of source operands for the ISA instruction.
    Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is the
    dst gpr as in the original instruction, the other two are
    pseudo-operands that represent the possibly updated predicate and
    exec\_mask. Predicate values are used in the predicated alu
    instructions (node::pred), exec\_mask values are used in the
    if\_nodes (if\_node::cond). Each vector operand in the CF/TEX/VTX
    instructions is represented with 4 values - components of the vector.

-   **ssa\_prepare** - creates phi expressions.

-   **ssa\_rename** - renames the values (assigns versions).

-   **liveness** - liveness computation, sets 'dead' flag for unused
    nodes and values, optionally computes interference information for
    the values.

-   **dce\_cleanup** - eliminates 'dead' nodes, also removes some
    unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
    instructions in the source, containers for ALU groups (they were
    only needed for the ssa\_rename pass).

-   **if\_conversion** - converts control flow with if\_nodes to
    data flow in cases where it can improve performance (small alu-only
    branches). Both branches are executed speculatively and the phi
    expressions are replaced with conditional moves (CNDxx) to select
    the final value using the same condition predicate as was used by
    the original if\_node. E.g. if the **if\_node** used dst[2] from a
    PREDSETxx instruction, CNDxx now uses dst[0] from the same PREDSETxx
    instruction.

-   **peephole** - peephole optimizations.

-   **gvn** - Global Value Numbering [[2]](#references),
    [[3]](#references).

-   **gcm** - Global Code Motion [[3]](#references). Also performs
    grouping of the instructions of the same kind (CF/FETCH/ALU).

-   register allocation passes - some ideas are taken from
    [[4]](#references), but the implementation is simplified to make it
    more efficient in terms of compilation speed (e.g. no recursive
    recoloring) while achieving good enough results.

    -   **ra\_split** - prepares the program for register allocation.
        Splits live ranges for constrained values by inserting
        copies to/from temporary values, so that the live range of the
        constrained values becomes minimal.

    -   **ra\_coalesce** - performs global allocation on registers used
        in CF/FETCH instructions. It's performed first to make sure they
        end up in the same GPR. Also tries to allocate all values
        involved in copies (inserted by the ra\_split pass) to the same
        register, so that the copies may be eliminated.

    -   **ra\_init** - allocates gpr arrays (if indirect addressing is
        used), and remaining values.

-   **post\_scheduler** - ALU scheduler, handles VLIW packing and
    performs the final register allocation for local values inside ALU
    clauses. Eliminates all coalesced copies (if src and dst of the copy
    are allocated to the same register).

-   **ra\_checker** - optional debugging pass that tries to catch basic
    errors of the scheduler or regalloc.

-   **bc\_finalize** - propagates the regalloc information from values
    in node::src and node::dst vectors to the bytecode fields, converts
    control flow structure (region/depart/repeat) to the target
    instructions (JUMP/ELSE/POP,
    LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).

-   **bc\_builder** - builds final bytecode.
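The effect of the **if\_conversion** pass from the list above can be illustrated on ordinary code. This Python sketch is a conceptual model only and does not operate on the sb IR: all function names are hypothetical, and `cnd_select` reduces the CNDxx comparison of a source operand against zero to a plain boolean.

```python
def branchy(cond, x):
    # Original control flow: only one side is executed.
    if cond:
        r = x * 2.0
    else:
        r = x + 1.0
    return r


def cnd_select(cond, if_true, if_false):
    # Stand-in for a conditional move (CNDxx), simplified to a boolean test.
    return if_true if cond else if_false


def if_converted(cond, x):
    # Both sides are computed speculatively (they have no side effects),
    # then the phi at the region exit becomes a select on the same condition.
    t = x * 2.0
    f = x + 1.0
    return cnd_select(cond, t, f)
```

The conversion is only worthwhile when both branches are small and consist of ALU instructions, since the work of the branch that is not taken is now always executed.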

* * * * *

References
----------

[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)

[2] ["Effective Representation of Aliases and Indirect Memory Operations
in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)

[3] ["Global Code Motion. Global Value Numbering.", Cliff
Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)

[4] ["Register Allocation for Programs in SSA Form", Sebastian
Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)

[5] ["An extension to the SSA representation for predicated code",
Francois de
Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)

[6] ["Improvements to the Psi-SSA Representation", F. de
Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)