1848b8605Smrgr600-sb
2848b8605Smrg=======
3848b8605Smrg
4848b8605Smrg* * * * *
5848b8605Smrg
6848b8605SmrgDebugging
7848b8605Smrg---------
8848b8605Smrg
9848b8605Smrg### Environment variables
10848b8605Smrg
11848b8605Smrg-   **R600\_DEBUG**
12848b8605Smrg
13848b8605Smrg    There are new flags:
14848b8605Smrg
15b8e80941Smrg    -   **nosb** - Disable sb backend for graphics shaders
16848b8605Smrg    -   **sbcl** - Enable optimization of compute shaders (experimental)
17b8e80941Smrg    -   **sbdry** - Dry run, optimize but use source bytecode -
18b8e80941Smrg        useful if you only want to check shader dumps
19848b8605Smrg        without the risk of lockups and other problems
20848b8605Smrg    -   **sbstat** - Print optimization statistics (only time so far)
21848b8605Smrg    -   **sbdump** - Print IR after some passes.
22b8e80941Smrg    -   **sbnofallback** - Abort on errors instead of fallback
23b8e80941Smrg    -   **sbdisasm** - Use sb disassembler for shader dumps
24b8e80941Smrg    -   **sbsafemath** - Disable unsafe math optimizations
25848b8605Smrg
26848b8605Smrg### Regression debugging
27848b8605Smrg
28848b8605SmrgIf there are any regressions as compared to the default backend
29848b8605Smrg(R600\_SB=0), it's possible to use the following environment variables
30848b8605Smrgto find the incorrectly optimized shader that causes the regression.
31848b8605Smrg
32848b8605Smrg-   **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
33848b8605Smrg    shaders
34848b8605Smrg    -   0 - disabled (default)
35848b8605Smrg    -   1 - skip optimization for the shaders in the range
36848b8605Smrg        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
37848b8605Smrg        optimize only the shaders that are not in this range
38848b8605Smrg    -   2 - optimize only the shaders in the range
39848b8605Smrg        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
40848b8605Smrg
41848b8605Smrg-   **R600\_SB\_DSKIP\_START** - start of the range (1-based)
42848b8605Smrg
43848b8605Smrg-   **R600\_SB\_DSKIP\_END** - end of the range (1-based)
44848b8605Smrg
45848b8605SmrgExample - optimize only the shaders 5, 6, and 7:
46848b8605Smrg
47848b8605Smrg    R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
48848b8605Smrg
49848b8605SmrgAll shaders compiled by the application are numbered starting from 1,
50848b8605Smrgthe number of shaders used by the application may be obtained by running
51848b8605Smrgit with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
52848b8605Smrgfor each compiled shader.
53848b8605Smrg
54848b8605SmrgAfter figuring out the total number of shaders used by the application,
55848b8605Smrgthe variables above allow to use bisection to find the shader that is
56848b8605Smrgthe cause of regression. E.g. if the application uses 100 shaders, we
57848b8605Smrgcan divide the range [1; 100] and run the application with the
58848b8605Smrgoptimization enabled only for the first half of the shaders:
59848b8605Smrg
60848b8605Smrg    R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
61848b8605Smrg
62848b8605SmrgIf the regression is reproduced with these parameters, then the failing
63848b8605Smrgshader is in the range [1; 50], if it's not reproduced - then it's in
64848b8605Smrgthe range [51; 100]. Then we can divide the new range again and repeat
65848b8605Smrgthe testing, until we'll reduce the range to a single failing shader.
66848b8605Smrg
67848b8605Smrg*NOTE: This method relies on the assumption that the application
68848b8605Smrgproduces the same sequence of the shaders on each run. It's not always
69848b8605Smrgtrue - some applications may produce different sequences of the shaders,
70848b8605Smrgin such cases the tools like apitrace may be used to record the trace
71848b8605Smrgwith the application, then this method may be applied when replaying the
72848b8605Smrgtrace - also this may be faster and/or more convenient than testing the
73848b8605Smrgapplication itself.*
74848b8605Smrg
75848b8605Smrg* * * * *
76848b8605Smrg
77848b8605SmrgIntermediate Representation
78848b8605Smrg---------------------------
79848b8605Smrg
80848b8605Smrg### Values
81848b8605Smrg
82848b8605SmrgAll kinds of the operands (literal constants, references to kcache
83848b8605Smrgconstants, references to GPRs, etc) are currently represented by the
84848b8605Smrg**value** class (possibly it makes sense to switch to hierarchy of
85848b8605Smrgclasses derived from **value** instead, to save some memory).
86848b8605Smrg
87848b8605SmrgAll values (except some pseudo values like the exec\_mask or predicate
88848b8605Smrgregister) represent 32bit scalar values - there are no vector values,
89848b8605SmrgCF/FETCH instructions use groups of 4 values for src and dst operands.
90848b8605Smrg
91848b8605Smrg### Nodes
92848b8605Smrg
93848b8605SmrgShader programs are represented using the tree data structure, some
94848b8605Smrgnodes contain a list of subnodes.
95848b8605Smrg
96848b8605Smrg#### Control flow nodes
97848b8605Smrg
98848b8605SmrgControl flow information is represented using four special node types
99848b8605Smrg(based on the ideas from [[1]](#references) )
100848b8605Smrg
101848b8605Smrg-   **region\_node** - single-entry, single-exit region.
102848b8605Smrg
103848b8605Smrg    All loops and if's in the program are enclosed in region nodes.
104848b8605Smrg    Region nodes have two containers for phi nodes -
105848b8605Smrg    region\_node::loop\_phi contains the phi expressions to be executed
106848b8605Smrg    at the region entry, region\_node::phi contains the phi expressions
107848b8605Smrg    to be executed at the region exit. It's the only type of the node
108848b8605Smrg    that contains associated phi expressions.
109848b8605Smrg
110848b8605Smrg-   **depart\_node** - "depart region \$id after { ... }"
111848b8605Smrg
112848b8605Smrg    Depart target region (jump to exit point) after executing contained
113848b8605Smrg    code.
114848b8605Smrg
115848b8605Smrg-   **repeat\_node** - "repeat region \$id after { ... }"
116848b8605Smrg
117848b8605Smrg    Repeat target region (jump to entry point) after executing contained
118848b8605Smrg    code.
119848b8605Smrg
120848b8605Smrg-   **if\_node** - "if (cond) { ... }"
121848b8605Smrg
122848b8605Smrg    Execute contained code if condition is true. The difference from
123848b8605Smrg    [[1]](#references) is that we don't have associated phi expressions
124848b8605Smrg    for the **if\_node**, we enclose **if\_node** in the
125848b8605Smrg    **region\_node** and store corresponding phi's in the
126848b8605Smrg    **region\_node**, this allows more uniform handling.
127848b8605Smrg
128848b8605SmrgThe target region of depart and repeat nodes is always the region where
129848b8605Smrgthey are located (possibly in the nested region), there are no arbitrary
130848b8605Smrgjumps/goto's - control flow in the program is always structured.
131848b8605Smrg
132848b8605SmrgTypical control flow constructs can be represented as in the following
133848b8605Smrgexamples:
134848b8605Smrg
135848b8605SmrgGLSL:
136848b8605Smrg
137848b8605Smrg    if (cond) {
138848b8605Smrg        < 1 >
139848b8605Smrg    } else {
140848b8605Smrg        < 2 >
141848b8605Smrg    }
142848b8605Smrg
143848b8605SmrgIR:
144848b8605Smrg
145848b8605Smrg    region #0 {
146848b8605Smrg        depart region #0 after {
147848b8605Smrg            if (cond) {
148848b8605Smrg                depart region #0 after {
149848b8605Smrg                    < 1 >
150848b8605Smrg                }
151848b8605Smrg            }
152848b8605Smrg            < 2 >
153848b8605Smrg        }
154848b8605Smrg        <region #0 phi nodes >
155848b8605Smrg    }
156848b8605Smrg
157848b8605SmrgGLSL:
158848b8605Smrg
159848b8605Smrg    while (cond) {
160848b8605Smrg        < 1 >
161848b8605Smrg    }
162848b8605Smrg
163848b8605SmrgIR:
164848b8605Smrg
165848b8605Smrg    region #0 {
166848b8605Smrg        <region #0 loop_phi nodes>
167848b8605Smrg        repeat region #0 after {
168848b8605Smrg            region #1 {
169848b8605Smrg                depart region #1 after {
170848b8605Smrg                    if (!cond) {
171848b8605Smrg                        depart region #0
172848b8605Smrg                    }
173848b8605Smrg                }
174848b8605Smrg            }
175848b8605Smrg            < 1 >
176848b8605Smrg        }
177848b8605Smrg        <region #0 phi nodes>
178848b8605Smrg    }
179848b8605Smrg
180848b8605Smrg'Break' and 'continue' inside the loops are directly translated to the
181848b8605Smrgdepart and repeat nodes for the corresponding loop region.
182848b8605Smrg
183848b8605SmrgThis may look a bit too complicated, but in fact this allows more simple
184848b8605Smrgand uniform handling of the control flow.
185848b8605Smrg
186848b8605SmrgAll loop\_phi and phi nodes for some region always have the same number
187848b8605Smrgof source operands. The number of source operands for
188848b8605Smrgregion\_node::loop\_phi nodes is 1 + number of repeat nodes that
189848b8605Smrgreference this region as a target. The number of source operands for
190848b8605Smrgregion\_node::phi nodes is equal to the number of depart nodes that
191848b8605Smrgreference this region as a target. All depart/repeat nodes for the
192848b8605Smrgregion have unique indices equal to the index of source operand for
193848b8605Smrgphi/loop\_phi nodes.
194848b8605Smrg
195848b8605SmrgFirst source operand for region\_node::loop\_phi nodes (src[0]) is an
196848b8605Smrgincoming value that enters the region from the outside. Each remaining
197848b8605Smrgsource operand comes from the corresponding repeat node.
198848b8605Smrg
199848b8605SmrgMore complex example:
200848b8605Smrg
201848b8605SmrgGLSL:
202848b8605Smrg
203848b8605Smrg    a = 1;
204848b8605Smrg    while (a < 5) {
205848b8605Smrg        a = a * 2;
206848b8605Smrg        if (b == 3) {
207848b8605Smrg            continue;
208848b8605Smrg        } else {
209848b8605Smrg            a = 6;
210848b8605Smrg        }
211848b8605Smrg        if (c == 4)
212848b8605Smrg            break;
213848b8605Smrg        a = a + 1;
214848b8605Smrg    }
215848b8605Smrg
216848b8605SmrgIR with SSA form:
217848b8605Smrg
218848b8605Smrg    a.1 = 1;
219848b8605Smrg    region #0 {
220848b8605Smrg        // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
221848b8605Smrg        region#0 loop_phi: a.2 = phi a.1, a.6, a.3
222848b8605Smrg
223848b8605Smrg        repeat_1 region #0 after {
224848b8605Smrg            a.3 = a.2 * 2;
225848b8605Smrg            cond1 = (b == 3);
226848b8605Smrg            region #1 {
227848b8605Smrg                depart_0 region #1 after {
228848b8605Smrg                    if (cond1) {
229848b8605Smrg                        repeat_2 region #0;
230848b8605Smrg                    }
231848b8605Smrg                }
232848b8605Smrg                a.4 = 6;
233848b8605Smrg
234848b8605Smrg                region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
235848b8605Smrg            }
236848b8605Smrg            cond2 = (c == 4);
237848b8605Smrg            region #2 {
238848b8605Smrg                depart_0 region #2 after {
239848b8605Smrg                    if (cond2) {
240848b8605Smrg                        depart_0 region #0;
241848b8605Smrg                    }
242848b8605Smrg                }
243848b8605Smrg            }
244848b8605Smrg            a.6 = a.5 + 1;
245848b8605Smrg        }
246848b8605Smrg
247848b8605Smrg        region #0 phi: a.7 = phi a.5 // src[0] from depart_0
248848b8605Smrg    }
249848b8605Smrg
250848b8605SmrgPhi nodes with single source operand are just copies, they are not
251848b8605Smrgreally necessary, but this allows to handle all **depart\_node**s in the
252848b8605Smrguniform way.
253848b8605Smrg
254848b8605Smrg#### Instruction nodes
255848b8605Smrg
256848b8605SmrgInstruction nodes represent different kinds of instructions -
257848b8605Smrg**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
258848b8605Smrgthe "bc" structure where all fields of the bytecode are stored (the type
259848b8605Smrgis **bc\_alu** for **alu\_node**, etc). The operands are represented
260848b8605Smrgusing the vectors of pointers to **value** class (node::src, node::dst)
261848b8605Smrg
262848b8605Smrg#### SSA-specific nodes
263848b8605Smrg
264848b8605SmrgPhi nodes currently don't have special node class, they are stored as
265848b8605Smrg**node**. Destination vector contains a single destination value, source
266848b8605Smrgvector contains 1 or more source values.
267848b8605Smrg
268848b8605SmrgPsi nodes [[5], [6]](#references) also don't have a special node class
269848b8605Smrgand stored as **node**. Source vector contains 3 values for each source
270848b8605Smrgoperand - the **value** of predicate, **value** of corresponding
271848b8605SmrgPRED\_SEL field, and the source **value** itself.
272848b8605Smrg
273848b8605Smrg### Indirect addressing
274848b8605Smrg
275848b8605SmrgSpecial kind of values (VLK\_RELREG) is used to represent indirect
276848b8605Smrgoperands. These values don't have SSA versions. The representation is
277848b8605Smrgmostly based on the [[2]](#references). Indirect operand contains the
278848b8605Smrg"offset/address" value (value::rel), (e.g. some SSA version of the AR
279848b8605Smrgregister value, though after some passes it may be any value - constant,
280848b8605Smrgregister, etc), also it contains the maydef and mayuse vectors of
281848b8605Smrgpointers to **value**s (similar to dst/src vectors in the **node**) to
282848b8605Smrgrepresent the effects of aliasing in the SSA form.
283848b8605Smrg
284848b8605SmrgE.g. if we have the array R5.x ... R8.x and the following instruction :
285848b8605Smrg
286848b8605Smrg    MOV R0.x, R[5 + AR].x
287848b8605Smrg
288848b8605Smrgthen source indirect operand is represented with the VLK\_RELREG value,
289848b8605Smrgvalue::rel is AR, value::maydef is empty (in fact it always contain the
290848b8605Smrgsame number of elements as mayuse to simplify the handling, but they are
291848b8605SmrgNULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
292848b8605Smrgcorresponding SSA versions after ssa\_rename).
293848b8605Smrg
294848b8605SmrgAdditional "virtual variables" as in [HSSA [2]](#references) are not
295848b8605Smrgused, also there is no special handling for "zero versions". Typical
296848b8605Smrgprograms in our case are small, indirect addressing is rare, array sizes
297848b8605Smrgare limited by max gpr number, so we don't really need to use special
298848b8605Smrgtricks to avoid the explosion of value versions. Also this allows more
299848b8605Smrgprecise liveness computation for array elements without modifications to
300848b8605Smrgthe algorithms.
301848b8605Smrg
302848b8605SmrgWith the following instruction:
303848b8605Smrg
304848b8605Smrg    MOV R[5+AR].x, R0.x
305848b8605Smrg
306848b8605Smrgwe'll have both maydef and mayuse vectors for dst operand filled with
307848b8605Smrgarray values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
308848b8605Smrgpass mayuse will contain previous versions, maydef will contain new
309848b8605Smrgpotentially-defined versions.
310848b8605Smrg
311848b8605Smrg* * * * *
312848b8605Smrg
313848b8605SmrgPasses
314848b8605Smrg------
315848b8605Smrg
316848b8605Smrg-   **bc\_parser** - creates the IR from the source bytecode,
317848b8605Smrg    initializes src and dst value vectors for instruction nodes. Most
318848b8605Smrg    ALU nodes have one dst operand and the number of source operands is
319848b8605Smrg    equal to the number of source operands for the ISA instruction.
320848b8605Smrg    Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
321848b8605Smrg    gpr as in the original instruction, other two are pseudo-operands
322848b8605Smrg    that represent possibly updated predicate and exec\_mask. Predicate
323848b8605Smrg    values are used in the predicated alu instructions (node::pred),
324848b8605Smrg    exec\_mask values are used in the if\_nodes (if\_node::cond). Each
325848b8605Smrg    vector operand in the CF/TEX/VTX instructions is represented with 4
326848b8605Smrg    values - components of the vector.
327848b8605Smrg
328848b8605Smrg-   **ssa\_prepare** - creates phi expressions.
329848b8605Smrg
330848b8605Smrg-   **ssa\_rename** - renames the values (assigns versions).
331848b8605Smrg
332848b8605Smrg-   **liveness** - liveness computation, sets 'dead' flag for unused
333848b8605Smrg    nodes and values, optionally computes interference information for
334848b8605Smrg    the values.
335848b8605Smrg
336848b8605Smrg-   **dce\_cleanup** - eliminates 'dead' nodes, also removes some
337848b8605Smrg    unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
338848b8605Smrg    instructions in the source, containers for ALU groups (they were
339848b8605Smrg    only needed for the ssa\_rename pass)
340848b8605Smrg
341848b8605Smrg-   **if\_conversion** - converts control flow with if\_nodes to the
342848b8605Smrg    data flow in cases where it can improve performance (small alu-only
343848b8605Smrg    branches). Both branches are executed speculatively and the phi
344848b8605Smrg    expressions are replaced with conditional moves (CNDxx) to select
345848b8605Smrg    the final value using the same condition predicate as was used by
346848b8605Smrg    the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
347848b8605Smrg    instruction, CNDxx now uses dst[0] from the same PREDSETxx
348848b8605Smrg    instruction.
349848b8605Smrg
350848b8605Smrg-   **peephole** - peephole optimizations
351848b8605Smrg
352848b8605Smrg-   **gvn** - Global Value Numbering [[2]](#references),
353848b8605Smrg    [[3]](#references)
354848b8605Smrg
355848b8605Smrg-   **gcm** - Global Code Motion [[3]](#references). Also performs
356848b8605Smrg    grouping of the instructions of the same kind (CF/FETCH/ALU).
357848b8605Smrg
358848b8605Smrg-   register allocation passes, some ideas are used from
359848b8605Smrg    [[4]](#references), but implementation is simplified to make it more
360848b8605Smrg    efficient in terms of the compilation speed (e.g. no recursive
361848b8605Smrg    recoloring) while achieving good enough results.
362848b8605Smrg
363848b8605Smrg    -   **ra\_split** - prepares the program to register allocation.
364848b8605Smrg        Splits live ranges for constrained values by inserting the
365848b8605Smrg        copies to/from temporary values, so that the live range of the
366848b8605Smrg        constrained values becomes minimal.
367848b8605Smrg
368848b8605Smrg    -   **ra\_coalesce** - performs global allocation on registers used
369848b8605Smrg        in CF/FETCH instructions. It's performed first to make sure they
370848b8605Smrg        end up in the same GPR. Also tries to allocate all values
371848b8605Smrg        involved in copies (inserted by the ra\_split pass) to the same
372848b8605Smrg        register, so that the copies may be eliminated.
373848b8605Smrg
374848b8605Smrg    -   **ra\_init** - allocates gpr arrays (if indirect addressing is
375848b8605Smrg        used), and remaining values.
376848b8605Smrg
377848b8605Smrg-   **post\_scheduler** - ALU scheduler, handles VLIW packing and
378848b8605Smrg    performs the final register allocation for local values inside ALU
379848b8605Smrg    clauses. Eliminates all coalesced copies (if src and dst of the copy
380848b8605Smrg    are allocated to the same register).
381848b8605Smrg
382848b8605Smrg-   **ra\_checker** - optional debugging pass that tries to catch basic
383848b8605Smrg    errors of the scheduler or regalloc,
384848b8605Smrg
385848b8605Smrg-   **bc\_finalize** - propagates the regalloc information from values
386848b8605Smrg    in node::src and node::dst vectors to the bytecode fields, converts
387848b8605Smrg    control flow structure (region/depart/repeat) to the target
388848b8605Smrg    instructions (JUMP/ELSE/POP,
389848b8605Smrg    LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
390848b8605Smrg
391848b8605Smrg-   **bc\_builder** - builds final bytecode,
392848b8605Smrg
393848b8605Smrg* * * * *
394848b8605Smrg
395848b8605SmrgReferences
396848b8605Smrg----------
397848b8605Smrg
398848b8605Smrg[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
399848b8605SmrgMcConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
400848b8605Smrg
401848b8605Smrg[2] ["Effective Representation of Aliases and Indirect Memory Operations
402848b8605Smrgin SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
403848b8605SmrgStreich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
404848b8605Smrg
405848b8605Smrg[3] ["Global Code Motion. Global Value Numbering.", Cliff
406848b8605SmrgClick](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
407848b8605Smrg
408848b8605Smrg[4] ["Register Allocation for Programs in SSA Form", Sebastian
409848b8605SmrgHack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
410848b8605Smrg
411848b8605Smrg[5] ["An extension to the SSA representation for predicated code",
412848b8605SmrgFrancois de
413848b8605SmrgFerriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
414848b8605Smrg
415848b8605Smrg[6] ["Improvements to the Psi-SSA Representation", F. de
416848b8605SmrgFerriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)
417