r600-sb
=======

* * * * *

Debugging
---------

### Environment variables

-   **R600\_DEBUG**

    The sb backend adds the following flags:

    -   **nosb** - Disable the sb backend for graphics shaders
    -   **sbcl** - Enable optimization of compute shaders (experimental)
    -   **sbdry** - Dry run: optimize, but use the source bytecode.
        Useful if you only want to check shader dumps
        without the risk of lockups and other problems
    -   **sbstat** - Print optimization statistics (only timing so far)
    -   **sbdump** - Print the IR after some passes
    -   **sbnofallback** - Abort on errors instead of falling back
    -   **sbdisasm** - Use the sb disassembler for shader dumps
    -   **sbsafemath** - Disable unsafe math optimizations

### Regression debugging

If there are any regressions compared to running with the sb backend
disabled (R600\_DEBUG=nosb), the following environment variables can be
used to find the incorrectly optimized shader that causes the regression.

-   **R600\_SB\_DSKIP\_MODE** - allows skipping optimization for some
    shaders
    -   0 - disabled (default)
    -   1 - skip optimization for the shaders in the range
        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
        optimize only the shaders that are not in this range
    -   2 - optimize only the shaders in the range
        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]

-   **R600\_SB\_DSKIP\_START** - start of the range (1-based)

-   **R600\_SB\_DSKIP\_END** - end of the range (1-based)

Example - optimize only shaders 5, 6, and 7:

    R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
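The three modes can be summarized as a small decision function. This is only an illustrative sketch of the semantics described above, not the driver's code: the function name `should_optimize` is hypothetical, and the range is assumed to be inclusive on both ends, as the bracket notation suggests.

```python
def should_optimize(shader_id, mode=0, start=0, end=0):
    """Return True if the shader with this 1-based index would be optimized.

    mode/start/end stand for R600_SB_DSKIP_MODE/_START/_END (assumed inclusive).
    """
    in_range = start <= shader_id <= end
    if mode == 1:
        # Mode 1: skip optimization for the shaders inside the range.
        return not in_range
    if mode == 2:
        # Mode 2: optimize only the shaders inside the range.
        return in_range
    # Mode 0: skipping is disabled, every shader is optimized.
    return True


# The example above: START=5, END=7, MODE=2, for an application with 10 shaders.
optimized = [i for i in range(1, 11) if should_optimize(i, mode=2, start=5, end=7)]
print(optimized)
```

With mode 1 and the same range, the result would be the complement: every shader except 5, 6, and 7.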

All shaders compiled by the application are numbered starting from 1.
The number of shaders used by the application may be obtained by running
it with "R600_DEBUG=sbstat" - it will print "sb: shader \#index\#"
for each compiled shader.

After figuring out the total number of shaders used by the application,
the variables above can be used to bisect for the shader that causes
the regression. E.g. if the application uses 100 shaders, we
can divide the range [1; 100] and run the application with the
optimization enabled only for the first half of the shaders:

    R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>

If the regression is reproduced with these parameters, then the failing
shader is in the range [1; 50]; if it's not reproduced, then it's in
the range [51; 100]. Then we can divide the new range again and repeat
the testing until the range is reduced to a single failing shader.
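The bisection loop can be sketched as follows. This is a minimal illustration under two assumptions: exactly one shader is miscompiled, and the application produces the same shader sequence on each run. The `reproduces(start, end)` callback is hypothetical; in practice it would run the application with R600_SB_DSKIP_MODE=2 and the given START/END values and report whether the regression shows up.

```python
def find_bad_shader(total, reproduces):
    """Narrow [1; total] down to the single shader that triggers the regression.

    reproduces(start, end) must return True if the regression appears when
    only the shaders in [start; end] are optimized (DSKIP mode 2).
    """
    lo, hi = 1, total
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(lo, mid):
            hi = mid          # the failing shader is in the first half
        else:
            lo = mid + 1      # otherwise it must be in the second half
    return lo


# Simulated run: pretend shader 37 out of 100 is the miscompiled one.
BAD = 37
result = find_bad_shader(100, lambda start, end: start <= BAD <= end)
print(result)
```

For 100 shaders this takes about seven runs of the application instead of up to 100.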

*NOTE: This method relies on the assumption that the application
produces the same sequence of shaders on each run. This is not always
true - some applications may produce different sequences of shaders.
In such cases, tools like apitrace may be used to record a trace
of the application, and this method may then be applied when replaying
the trace. This may also be faster and/or more convenient than testing
the application itself.*

* * * * *

Intermediate Representation
---------------------------

### Values

All kinds of operands (literal constants, references to kcache
constants, references to GPRs, etc.) are currently represented by the
**value** class (it may make sense to switch to a hierarchy of
classes derived from **value** instead, to save some memory).

All values (except some pseudo values like the exec\_mask or predicate
register) represent 32-bit scalar values - there are no vector values.
CF/FETCH instructions use groups of 4 values for src and dst operands.

### Nodes

Shader programs are represented using a tree data structure; some
nodes contain a list of subnodes.

#### Control flow nodes

Control flow information is represented using four special node types
(based on the ideas from [[1]](#references)):

-   **region\_node** - single-entry, single-exit region.

    All loops and ifs in the program are enclosed in region nodes.
    Region nodes have two containers for phi nodes:
    region\_node::loop\_phi contains the phi expressions to be executed
    at the region entry, and region\_node::phi contains the phi
    expressions to be executed at the region exit. It's the only node
    type that has associated phi expressions.

-   **depart\_node** - "depart region \$id after { ... }"

    Depart the target region (jump to its exit point) after executing
    the contained code.

-   **repeat\_node** - "repeat region \$id after { ... }"

    Repeat the target region (jump to its entry point) after executing
    the contained code.

-   **if\_node** - "if (cond) { ... }"

    Execute the contained code if the condition is true. The difference
    from [[1]](#references) is that we don't have associated phi
    expressions for the **if\_node**; we enclose the **if\_node** in a
    **region\_node** and store the corresponding phis in the
    **region\_node**, which allows more uniform handling.

The target region of a depart or repeat node is always a region that
contains the node (not necessarily the innermost one). There are no
arbitrary jumps/gotos - control flow in the program is always structured.

Typical control flow constructs can be represented as in the following
examples:

GLSL:

    if (cond) {
        < 1 >
    } else {
        < 2 >
    }

IR:

    region #0 {
        depart region #0 after {
            if (cond) {
                depart region #0 after {
                    < 1 >
                }
            }
            < 2 >
        }
        <region #0 phi nodes>
    }

GLSL:

    while (cond) {
        < 1 >
    }

IR:

    region #0 {
        <region #0 loop_phi nodes>
        repeat region #0 after {
            region #1 {
                depart region #1 after {
                    if (!cond) {
                        depart region #0
                    }
                }
            }
            < 1 >
        }
        <region #0 phi nodes>
    }

'Break' and 'continue' inside loops are translated directly to
depart and repeat nodes for the corresponding loop region.

This may look a bit too complicated, but in fact it allows simpler
and more uniform handling of the control flow.
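To make the tree shape concrete, here is a small sketch that builds the if/else example as a tree of nodes with subnode lists and prints it in the notation used above. The `Node` class and `dump` helper are hypothetical stand-ins for illustration only and are not the classes used by sb.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str                                      # e.g. "region #0" or "if (cond)"
    children: list = field(default_factory=list)   # subnodes, in program order


def dump(node, depth=0):
    """Return the tree as indented lines, one container per brace pair."""
    pad = "    " * depth
    if not node.children:
        return [pad + node.text]
    lines = [pad + node.text + " {"]
    for child in node.children:
        lines += dump(child, depth + 1)
    lines.append(pad + "}")
    return lines


# if (cond) { < 1 > } else { < 2 > }
if_else = Node("region #0", [
    Node("depart region #0 after", [
        Node("if (cond)", [
            Node("depart region #0 after", [Node("< 1 >")]),
        ]),
        Node("< 2 >"),
    ]),
    Node("<region #0 phi nodes>"),
])

lines = dump(if_else)
print("\n".join(lines))
```

Note that the else branch needs no node of its own: it is simply the code that follows the if inside the outer depart, reached only when the inner depart was not taken.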

All loop\_phi nodes of a region have the same number of source operands,
and so do all of its phi nodes. The number of source operands for
region\_node::loop\_phi nodes is 1 + the number of repeat nodes that
reference this region as a target. The number of source operands for
region\_node::phi nodes is equal to the number of depart nodes that
reference this region as a target. All depart/repeat nodes for the
region have unique indices equal to the index of the corresponding
source operand in the phi/loop\_phi nodes.

The first source operand of region\_node::loop\_phi nodes (src[0]) is
the incoming value that enters the region from the outside. Each
remaining source operand comes from the corresponding repeat node.
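The operand-count rule can be checked with a short sketch. The `Node` class and both helper functions are hypothetical and only model the counting rule stated above; the tree describes a loop whose body contains one 'continue' and one 'break'.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                  # "region", "depart", "repeat", "if" or "code"
    target: int = -1           # own id for a region, target region id for depart/repeat
    children: list = field(default_factory=list)


def count_jumps(node, region_id, kind):
    """Count depart or repeat nodes under `node` that target `region_id`."""
    own = 1 if (node.kind == kind and node.target == region_id) else 0
    return own + sum(count_jumps(c, region_id, kind) for c in node.children)


def phi_operand_counts(region):
    """Source operand counts for the loop_phi and phi nodes of a region."""
    repeats = count_jumps(region, region.target, "repeat")
    departs = count_jumps(region, region.target, "depart")
    return {"loop_phi": 1 + repeats, "phi": departs}


# Loop region #0 with a 'continue' (repeat #0) and a 'break' (depart #0).
loop = Node("region", 0, [
    Node("repeat", 0, [                                   # normal end of the loop body
        Node("region", 1, [Node("depart", 1, [
            Node("if", children=[Node("repeat", 0)]),     # continue
        ])]),
        Node("region", 2, [Node("depart", 2, [
            Node("if", children=[Node("depart", 0)]),     # break
        ])]),
        Node("code"),
    ]),
])

counts = phi_operand_counts(loop)
print(counts)
```

Here the loop\_phi nodes of region #0 get three operands (the incoming value plus two repeats) and its phi nodes get one (the single depart from the 'break').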

More complex example:

GLSL:

    a = 1;
    while (a < 5) {
        a = a * 2;
        if (b == 3) {
            continue;
        } else {
            a = 6;
        }
        if (c == 4)
            break;
        a = a + 1;
    }

IR with SSA form:

    a.1 = 1;
    region #0 {
        // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
        region#0 loop_phi: a.2 = phi a.1, a.6, a.3

        repeat_1 region #0 after {
            a.3 = a.2 * 2;
            cond1 = (b == 3);
            region #1 {
                depart_0 region #1 after {
                    if (cond1) {
                        repeat_2 region #0;
                    }
                    a.4 = 6;
                }

                region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
            }
            cond2 = (c == 4);
            region #2 {
                depart_0 region #2 after {
                    if (cond2) {
                        depart_0 region #0;
                    }
                }
            }
            a.6 = a.5 + 1;
        }

        region #0 phi: a.7 = phi a.5 // src[0] from depart_0
    }

Phi nodes with a single source operand are just copies. They are not
really necessary, but they allow all **depart\_node**s to be handled in
a uniform way.

#### Instruction nodes

Instruction nodes represent different kinds of instructions -
**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
a "bc" structure where all fields of the bytecode are stored (the type
is **bc\_alu** for **alu\_node**, etc.). The operands are represented
using vectors of pointers to the **value** class (node::src, node::dst).

#### SSA-specific nodes

Phi nodes currently don't have a special node class; they are stored as
**node**. The destination vector contains a single destination value,
and the source vector contains one or more source values.

Psi nodes [[5], [6]](#references) also don't have a special node class
and are stored as **node**. The source vector contains 3 values for each
source operand - the **value** of the predicate, the **value** of the
corresponding PRED\_SEL field, and the source **value** itself.
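As an illustration of this layout, a flat psi source vector can be regrouped into per-operand triples. This is only a sketch: the function name is hypothetical, plain strings stand in for **value** pointers, and the order inside each triple is assumed to follow the order in which the three values are listed above.

```python
def psi_operands(src):
    """Split a flat psi source vector into (predicate, pred_sel, value) triples."""
    if len(src) % 3 != 0:
        raise ValueError("psi source vector length must be a multiple of 3")
    return [tuple(src[i:i + 3]) for i in range(0, len(src), 3)]


# Two source operands, so the flat vector holds six entries.
flat = ["pred.1", "sel0", "a.1",
        "pred.1", "sel1", "a.2"]
operands = psi_operands(flat)
for predicate, pred_sel, value in operands:
    print(predicate, pred_sel, value)
```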

### Indirect addressing

A special kind of value (VLK\_RELREG) is used to represent indirect
operands. These values don't have SSA versions. The representation is
mostly based on [[2]](#references). An indirect operand contains the
"offset/address" value (value::rel), e.g. some SSA version of the AR
register value (though after some passes it may be any value - constant,
register, etc.). It also contains the maydef and mayuse vectors of
pointers to **value**s (similar to the dst/src vectors in the **node**)
to represent the effects of aliasing in SSA form.

E.g. if we have the array R5.x ... R8.x and the following instruction:

    MOV R0.x, R[5 + AR].x

then the indirect source operand is represented with a VLK\_RELREG
value: value::rel is AR, value::maydef is empty (in fact it always
contains the same number of elements as mayuse to simplify the handling,
but they are NULLs), and value::mayuse contains [R5.x, R6.x, R7.x, R8.x]
(or the corresponding SSA versions after ssa\_rename).

Additional "virtual variables" as in [HSSA [2]](#references) are not
used, and there is no special handling for "zero versions". Typical
programs in our case are small, indirect addressing is rare, and array
sizes are limited by the maximum GPR number, so we don't really need
special tricks to avoid an explosion of value versions. This also allows
more precise liveness computation for array elements without
modifications to the algorithms.

With the following instruction:

    MOV R[5+AR].x, R0.x

we'll have both the maydef and mayuse vectors of the dst operand filled
with the array values initially: [R5.x, R6.x, R7.x, R8.x]. After the
ssa\_rename pass, mayuse will contain the previous versions and maydef
will contain the new potentially-defined versions.
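The two cases above can be modelled with a small sketch. The `RelValue` class and the helper functions are hypothetical; they only mirror the rel/mayuse/maydef fields described in this section, with strings standing in for **value** pointers and `None` for NULL.

```python
from dataclasses import dataclass, field


@dataclass
class RelValue:
    rel: str                                      # the offset/address value, e.g. "AR"
    mayuse: list = field(default_factory=list)    # values that may be read
    maydef: list = field(default_factory=list)    # values that may be written


def array_elems(first, last, chan="x"):
    """Names of the array elements R<first>.<chan> ... R<last>.<chan>."""
    return [f"R{i}.{chan}" for i in range(first, last + 1)]


def indirect_src(rel, first, last):
    # A read may use any element; maydef has the same length but holds NULLs.
    elems = array_elems(first, last)
    return RelValue(rel, mayuse=elems, maydef=[None] * len(elems))


def indirect_dst(rel, first, last):
    # Before ssa_rename, a write lists the array elements in both vectors.
    elems = array_elems(first, last)
    return RelValue(rel, mayuse=list(elems), maydef=list(elems))


src = indirect_src("AR", 5, 8)   # MOV R0.x, R[5 + AR].x
dst = indirect_dst("AR", 5, 8)   # MOV R[5+AR].x, R0.x
print(src)
print(dst)
```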

* * * * *

Passes
------

-   **bc\_parser** - creates the IR from the source bytecode and
    initializes the src and dst value vectors for instruction nodes.
    Most ALU nodes have one dst operand, and the number of source
    operands is equal to the number of source operands of the ISA
    instruction. Nodes for PREDSETxx instructions have 3 dst operands -
    dst[0] is the dst gpr as in the original instruction, the other two
    are pseudo-operands that represent the possibly updated predicate
    and exec\_mask. Predicate values are used in predicated alu
    instructions (node::pred), exec\_mask values are used in if\_nodes
    (if\_node::cond). Each vector operand in CF/TEX/VTX instructions is
    represented with 4 values - the components of the vector.

-   **ssa\_prepare** - creates phi expressions.

-   **ssa\_rename** - renames the values (assigns versions).

-   **liveness** - liveness computation, sets the 'dead' flag for unused
    nodes and values, optionally computes interference information for
    the values.

-   **dce\_cleanup** - eliminates 'dead' nodes, also removes some
    unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
    instructions in the source and the containers for ALU groups (they
    were only needed for the ssa\_rename pass).

-   **if\_conversion** - converts control flow with if\_nodes to
    data flow in cases where it can improve performance (small alu-only
    branches). Both branches are executed speculatively and the phi
    expressions are replaced with conditional moves (CNDxx) that select
    the final value using the same condition predicate as was used by
    the original if\_node. E.g. if the **if\_node** used dst[2] of a
    PREDSETxx instruction, the CNDxx now uses dst[0] of the same
    PREDSETxx instruction.

-   **peephole** - peephole optimizations.

-   **gvn** - Global Value Numbering [[2]](#references),
    [[3]](#references).

-   **gcm** - Global Code Motion [[3]](#references). Also performs
    grouping of instructions of the same kind (CF/FETCH/ALU).

-   register allocation passes - some ideas are taken from
    [[4]](#references), but the implementation is simplified to make it
    more efficient in terms of compilation speed (e.g. no recursive
    recoloring) while achieving good enough results.

    -   **ra\_split** - prepares the program for register allocation.
        Splits live ranges for constrained values by inserting
        copies to/from temporary values, so that the live range of the
        constrained values becomes minimal.

    -   **ra\_coalesce** - performs global allocation of the registers
        used in CF/FETCH instructions. It's performed first to make sure
        they end up in the same GPR. Also tries to allocate all values
        involved in copies (inserted by the ra\_split pass) to the same
        register, so that the copies may be eliminated.

    -   **ra\_init** - allocates gpr arrays (if indirect addressing is
        used) and the remaining values.

-   **post\_scheduler** - ALU scheduler, handles VLIW packing and
    performs the final register allocation for local values inside ALU
    clauses. Eliminates all coalesced copies (if src and dst of the copy
    are allocated to the same register).

-   **ra\_checker** - optional debugging pass that tries to catch basic
    errors of the scheduler or regalloc.

-   **bc\_finalize** - propagates the regalloc information from values
    in the node::src and node::dst vectors to the bytecode fields,
    converts the control flow structure (region/depart/repeat) to the
    target instructions (JUMP/ELSE/POP,
    LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).

-   **bc\_builder** - builds the final bytecode.
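The idea behind the if\_conversion pass in the list above can be shown on ordinary scalar code. This is a conceptual sketch only: the real pass rewrites IR nodes and emits CNDxx instructions, whereas here a Python conditional expression stands in for the conditional move, and both function names are hypothetical.

```python
def branchy(cond, x):
    """Control-flow form: only one branch is executed."""
    if cond:
        y = x * 2.0
    else:
        y = x + 1.0
    return y


def if_converted(cond, x):
    """Data-flow form: both branches run speculatively, then one result is selected."""
    y_then = x * 2.0                     # former 'then' branch, always executed
    y_else = x + 1.0                     # former 'else' branch, always executed
    return y_then if cond else y_else    # stands in for the CNDxx select


for cond in (True, False):
    for x in (0.0, 1.5, -3.0):
        print(cond, x, branchy(cond, x), if_converted(cond, x))
```

The transformation is only valid because the branches are small and alu-only, so executing both has no side effects, and it pays off only when the extra ALU work is cheaper than the control flow it replaces.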

* * * * *

References
----------

[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)

[2] ["Effective Representation of Aliases and Indirect Memory Operations
in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)

[3] ["Global Code Motion. Global Value Numbering.", Cliff
Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)

[4] ["Register Allocation for Programs in SSA Form", Sebastian
Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)

[5] ["An extension to the SSA representation for predicated code",
Francois de
Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)

[6] ["Improvements to the Psi-SSA Representation", F. de
Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)