1af69d88dSmrgr600-sb
2af69d88dSmrg=======
3af69d88dSmrg
4af69d88dSmrg* * * * *
5af69d88dSmrg
6af69d88dSmrgDebugging
7af69d88dSmrg---------
8af69d88dSmrg
9af69d88dSmrg### Environment variables
10af69d88dSmrg
11af69d88dSmrg-   **R600\_DEBUG**
12af69d88dSmrg
13af69d88dSmrg    There are new flags:
14af69d88dSmrg
1501e04c3fSmrg    -   **nosb** - Disable sb backend for graphics shaders
16af69d88dSmrg    -   **sbcl** - Enable optimization of compute shaders (experimental)
1701e04c3fSmrg    -   **sbdry** - Dry run, optimize but use source bytecode -
1801e04c3fSmrg        useful if you only want to check shader dumps
19af69d88dSmrg        without the risk of lockups and other problems
20af69d88dSmrg    -   **sbstat** - Print optimization statistics (only time so far)
21af69d88dSmrg    -   **sbdump** - Print IR after some passes.
2201e04c3fSmrg    -   **sbnofallback** - Abort on errors instead of fallback
2301e04c3fSmrg    -   **sbdisasm** - Use sb disassembler for shader dumps
2401e04c3fSmrg    -   **sbsafemath** - Disable unsafe math optimizations
25af69d88dSmrg
26af69d88dSmrg### Regression debugging
27af69d88dSmrg
28af69d88dSmrgIf there are any regressions as compared to the default backend
29af69d88dSmrg(R600\_SB=0), it's possible to use the following environment variables
30af69d88dSmrgto find the incorrectly optimized shader that causes the regression.
31af69d88dSmrg
32af69d88dSmrg-   **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
33af69d88dSmrg    shaders
34af69d88dSmrg    -   0 - disabled (default)
35af69d88dSmrg    -   1 - skip optimization for the shaders in the range
36af69d88dSmrg        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
37af69d88dSmrg        optimize only the shaders that are not in this range
38af69d88dSmrg    -   2 - optimize only the shaders in the range
39af69d88dSmrg        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
40af69d88dSmrg
41af69d88dSmrg-   **R600\_SB\_DSKIP\_START** - start of the range (1-based)
42af69d88dSmrg
43af69d88dSmrg-   **R600\_SB\_DSKIP\_END** - end of the range (1-based)
44af69d88dSmrg
45af69d88dSmrgExample - optimize only the shaders 5, 6, and 7:
46af69d88dSmrg
47af69d88dSmrg    R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
48af69d88dSmrg
49af69d88dSmrgAll shaders compiled by the application are numbered starting from 1,
50af69d88dSmrgthe number of shaders used by the application may be obtained by running
51af69d88dSmrgit with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
52af69d88dSmrgfor each compiled shader.
53af69d88dSmrg
54af69d88dSmrgAfter figuring out the total number of shaders used by the application,
55af69d88dSmrgthe variables above allow to use bisection to find the shader that is
56af69d88dSmrgthe cause of regression. E.g. if the application uses 100 shaders, we
57af69d88dSmrgcan divide the range [1; 100] and run the application with the
58af69d88dSmrgoptimization enabled only for the first half of the shaders:
59af69d88dSmrg
60af69d88dSmrg    R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
61af69d88dSmrg
62af69d88dSmrgIf the regression is reproduced with these parameters, then the failing
63af69d88dSmrgshader is in the range [1; 50], if it's not reproduced - then it's in
64af69d88dSmrgthe range [51; 100]. Then we can divide the new range again and repeat
65af69d88dSmrgthe testing, until we'll reduce the range to a single failing shader.
66af69d88dSmrg
67af69d88dSmrg*NOTE: This method relies on the assumption that the application
68af69d88dSmrgproduces the same sequence of the shaders on each run. It's not always
69af69d88dSmrgtrue - some applications may produce different sequences of the shaders,
70af69d88dSmrgin such cases the tools like apitrace may be used to record the trace
71af69d88dSmrgwith the application, then this method may be applied when replaying the
72af69d88dSmrgtrace - also this may be faster and/or more convenient than testing the
73af69d88dSmrgapplication itself.*
74af69d88dSmrg
75af69d88dSmrg* * * * *
76af69d88dSmrg
77af69d88dSmrgIntermediate Representation
78af69d88dSmrg---------------------------
79af69d88dSmrg
80af69d88dSmrg### Values
81af69d88dSmrg
82af69d88dSmrgAll kinds of the operands (literal constants, references to kcache
83af69d88dSmrgconstants, references to GPRs, etc) are currently represented by the
84af69d88dSmrg**value** class (possibly it makes sense to switch to hierarchy of
85af69d88dSmrgclasses derived from **value** instead, to save some memory).
86af69d88dSmrg
87af69d88dSmrgAll values (except some pseudo values like the exec\_mask or predicate
88af69d88dSmrgregister) represent 32bit scalar values - there are no vector values,
89af69d88dSmrgCF/FETCH instructions use groups of 4 values for src and dst operands.
90af69d88dSmrg
91af69d88dSmrg### Nodes
92af69d88dSmrg
93af69d88dSmrgShader programs are represented using the tree data structure, some
94af69d88dSmrgnodes contain a list of subnodes.
95af69d88dSmrg
96af69d88dSmrg#### Control flow nodes
97af69d88dSmrg
98af69d88dSmrgControl flow information is represented using four special node types
99af69d88dSmrg(based on the ideas from [[1]](#references) )
100af69d88dSmrg
101af69d88dSmrg-   **region\_node** - single-entry, single-exit region.
102af69d88dSmrg
103af69d88dSmrg    All loops and if's in the program are enclosed in region nodes.
104af69d88dSmrg    Region nodes have two containers for phi nodes -
105af69d88dSmrg    region\_node::loop\_phi contains the phi expressions to be executed
106af69d88dSmrg    at the region entry, region\_node::phi contains the phi expressions
107af69d88dSmrg    to be executed at the region exit. It's the only type of the node
108af69d88dSmrg    that contains associated phi expressions.
109af69d88dSmrg
110af69d88dSmrg-   **depart\_node** - "depart region \$id after { ... }"
111af69d88dSmrg
112af69d88dSmrg    Depart target region (jump to exit point) after executing contained
113af69d88dSmrg    code.
114af69d88dSmrg
115af69d88dSmrg-   **repeat\_node** - "repeat region \$id after { ... }"
116af69d88dSmrg
117af69d88dSmrg    Repeat target region (jump to entry point) after executing contained
118af69d88dSmrg    code.
119af69d88dSmrg
120af69d88dSmrg-   **if\_node** - "if (cond) { ... }"
121af69d88dSmrg
122af69d88dSmrg    Execute contained code if condition is true. The difference from
123af69d88dSmrg    [[1]](#references) is that we don't have associated phi expressions
124af69d88dSmrg    for the **if\_node**, we enclose **if\_node** in the
125af69d88dSmrg    **region\_node** and store corresponding phi's in the
126af69d88dSmrg    **region\_node**, this allows more uniform handling.
127af69d88dSmrg
128af69d88dSmrgThe target region of depart and repeat nodes is always the region where
129af69d88dSmrgthey are located (possibly in the nested region), there are no arbitrary
130af69d88dSmrgjumps/goto's - control flow in the program is always structured.
131af69d88dSmrg
132af69d88dSmrgTypical control flow constructs can be represented as in the following
133af69d88dSmrgexamples:
134af69d88dSmrg
135af69d88dSmrgGLSL:
136af69d88dSmrg
137af69d88dSmrg    if (cond) {
138af69d88dSmrg        < 1 >
139af69d88dSmrg    } else {
140af69d88dSmrg        < 2 >
141af69d88dSmrg    }
142af69d88dSmrg
143af69d88dSmrgIR:
144af69d88dSmrg
145af69d88dSmrg    region #0 {
146af69d88dSmrg        depart region #0 after {
147af69d88dSmrg            if (cond) {
148af69d88dSmrg                depart region #0 after {
149af69d88dSmrg                    < 1 >
150af69d88dSmrg                }
151af69d88dSmrg            }
152af69d88dSmrg            < 2 >
153af69d88dSmrg        }
154af69d88dSmrg        <region #0 phi nodes >
155af69d88dSmrg    }
156af69d88dSmrg
157af69d88dSmrgGLSL:
158af69d88dSmrg
159af69d88dSmrg    while (cond) {
160af69d88dSmrg        < 1 >
161af69d88dSmrg    }
162af69d88dSmrg
163af69d88dSmrgIR:
164af69d88dSmrg
165af69d88dSmrg    region #0 {
166af69d88dSmrg        <region #0 loop_phi nodes>
167af69d88dSmrg        repeat region #0 after {
168af69d88dSmrg            region #1 {
169af69d88dSmrg                depart region #1 after {
170af69d88dSmrg                    if (!cond) {
171af69d88dSmrg                        depart region #0
172af69d88dSmrg                    }
173af69d88dSmrg                }
174af69d88dSmrg            }
175af69d88dSmrg            < 1 >
176af69d88dSmrg        }
177af69d88dSmrg        <region #0 phi nodes>
178af69d88dSmrg    }
179af69d88dSmrg
180af69d88dSmrg'Break' and 'continue' inside the loops are directly translated to the
181af69d88dSmrgdepart and repeat nodes for the corresponding loop region.
182af69d88dSmrg
183af69d88dSmrgThis may look a bit too complicated, but in fact this allows more simple
184af69d88dSmrgand uniform handling of the control flow.
185af69d88dSmrg
186af69d88dSmrgAll loop\_phi and phi nodes for some region always have the same number
187af69d88dSmrgof source operands. The number of source operands for
188af69d88dSmrgregion\_node::loop\_phi nodes is 1 + number of repeat nodes that
189af69d88dSmrgreference this region as a target. The number of source operands for
190af69d88dSmrgregion\_node::phi nodes is equal to the number of depart nodes that
191af69d88dSmrgreference this region as a target. All depart/repeat nodes for the
192af69d88dSmrgregion have unique indices equal to the index of source operand for
193af69d88dSmrgphi/loop\_phi nodes.
194af69d88dSmrg
195af69d88dSmrgFirst source operand for region\_node::loop\_phi nodes (src[0]) is an
196af69d88dSmrgincoming value that enters the region from the outside. Each remaining
197af69d88dSmrgsource operand comes from the corresponding repeat node.
198af69d88dSmrg
199af69d88dSmrgMore complex example:
200af69d88dSmrg
201af69d88dSmrgGLSL:
202af69d88dSmrg
203af69d88dSmrg    a = 1;
204af69d88dSmrg    while (a < 5) {
205af69d88dSmrg        a = a * 2;
206af69d88dSmrg        if (b == 3) {
207af69d88dSmrg            continue;
208af69d88dSmrg        } else {
209af69d88dSmrg            a = 6;
210af69d88dSmrg        }
211af69d88dSmrg        if (c == 4)
212af69d88dSmrg            break;
213af69d88dSmrg        a = a + 1;
214af69d88dSmrg    }
215af69d88dSmrg
216af69d88dSmrgIR with SSA form:
217af69d88dSmrg
218af69d88dSmrg    a.1 = 1;
219af69d88dSmrg    region #0 {
220af69d88dSmrg        // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
221af69d88dSmrg        region#0 loop_phi: a.2 = phi a.1, a.6, a.3
222af69d88dSmrg
223af69d88dSmrg        repeat_1 region #0 after {
224af69d88dSmrg            a.3 = a.2 * 2;
225af69d88dSmrg            cond1 = (b == 3);
226af69d88dSmrg            region #1 {
227af69d88dSmrg                depart_0 region #1 after {
228af69d88dSmrg                    if (cond1) {
229af69d88dSmrg                        repeat_2 region #0;
230af69d88dSmrg                    }
231af69d88dSmrg                }
232af69d88dSmrg                a.4 = 6;
233af69d88dSmrg
234af69d88dSmrg                region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
235af69d88dSmrg            }
236af69d88dSmrg            cond2 = (c == 4);
237af69d88dSmrg            region #2 {
238af69d88dSmrg                depart_0 region #2 after {
239af69d88dSmrg                    if (cond2) {
240af69d88dSmrg                        depart_0 region #0;
241af69d88dSmrg                    }
242af69d88dSmrg                }
243af69d88dSmrg            }
244af69d88dSmrg            a.6 = a.5 + 1;
245af69d88dSmrg        }
246af69d88dSmrg
247af69d88dSmrg        region #0 phi: a.7 = phi a.5 // src[0] from depart_0
248af69d88dSmrg    }
249af69d88dSmrg
250af69d88dSmrgPhi nodes with single source operand are just copies, they are not
251af69d88dSmrgreally necessary, but this allows to handle all **depart\_node**s in the
252af69d88dSmrguniform way.
253af69d88dSmrg
254af69d88dSmrg#### Instruction nodes
255af69d88dSmrg
256af69d88dSmrgInstruction nodes represent different kinds of instructions -
257af69d88dSmrg**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
258af69d88dSmrgthe "bc" structure where all fields of the bytecode are stored (the type
259af69d88dSmrgis **bc\_alu** for **alu\_node**, etc). The operands are represented
260af69d88dSmrgusing the vectors of pointers to **value** class (node::src, node::dst)
261af69d88dSmrg
262af69d88dSmrg#### SSA-specific nodes
263af69d88dSmrg
264af69d88dSmrgPhi nodes currently don't have special node class, they are stored as
265af69d88dSmrg**node**. Destination vector contains a single destination value, source
266af69d88dSmrgvector contains 1 or more source values.
267af69d88dSmrg
268af69d88dSmrgPsi nodes [[5], [6]](#references) also don't have a special node class
269af69d88dSmrgand stored as **node**. Source vector contains 3 values for each source
270af69d88dSmrgoperand - the **value** of predicate, **value** of corresponding
271af69d88dSmrgPRED\_SEL field, and the source **value** itself.
272af69d88dSmrg
273af69d88dSmrg### Indirect addressing
274af69d88dSmrg
275af69d88dSmrgSpecial kind of values (VLK\_RELREG) is used to represent indirect
276af69d88dSmrgoperands. These values don't have SSA versions. The representation is
277af69d88dSmrgmostly based on the [[2]](#references). Indirect operand contains the
278af69d88dSmrg"offset/address" value (value::rel), (e.g. some SSA version of the AR
279af69d88dSmrgregister value, though after some passes it may be any value - constant,
280af69d88dSmrgregister, etc), also it contains the maydef and mayuse vectors of
281af69d88dSmrgpointers to **value**s (similar to dst/src vectors in the **node**) to
282af69d88dSmrgrepresent the effects of aliasing in the SSA form.
283af69d88dSmrg
284af69d88dSmrgE.g. if we have the array R5.x ... R8.x and the following instruction :
285af69d88dSmrg
286af69d88dSmrg    MOV R0.x, R[5 + AR].x
287af69d88dSmrg
288af69d88dSmrgthen source indirect operand is represented with the VLK\_RELREG value,
289af69d88dSmrgvalue::rel is AR, value::maydef is empty (in fact it always contain the
290af69d88dSmrgsame number of elements as mayuse to simplify the handling, but they are
291af69d88dSmrgNULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
292af69d88dSmrgcorresponding SSA versions after ssa\_rename).
293af69d88dSmrg
294af69d88dSmrgAdditional "virtual variables" as in [HSSA [2]](#references) are not
295af69d88dSmrgused, also there is no special handling for "zero versions". Typical
296af69d88dSmrgprograms in our case are small, indirect addressing is rare, array sizes
297af69d88dSmrgare limited by max gpr number, so we don't really need to use special
298af69d88dSmrgtricks to avoid the explosion of value versions. Also this allows more
299af69d88dSmrgprecise liveness computation for array elements without modifications to
300af69d88dSmrgthe algorithms.
301af69d88dSmrg
302af69d88dSmrgWith the following instruction:
303af69d88dSmrg
304af69d88dSmrg    MOV R[5+AR].x, R0.x
305af69d88dSmrg
306af69d88dSmrgwe'll have both maydef and mayuse vectors for dst operand filled with
307af69d88dSmrgarray values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
308af69d88dSmrgpass mayuse will contain previous versions, maydef will contain new
309af69d88dSmrgpotentially-defined versions.
310af69d88dSmrg
311af69d88dSmrg* * * * *
312af69d88dSmrg
313af69d88dSmrgPasses
314af69d88dSmrg------
315af69d88dSmrg
316af69d88dSmrg-   **bc\_parser** - creates the IR from the source bytecode,
317af69d88dSmrg    initializes src and dst value vectors for instruction nodes. Most
318af69d88dSmrg    ALU nodes have one dst operand and the number of source operands is
319af69d88dSmrg    equal to the number of source operands for the ISA instruction.
320af69d88dSmrg    Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
321af69d88dSmrg    gpr as in the original instruction, other two are pseudo-operands
322af69d88dSmrg    that represent possibly updated predicate and exec\_mask. Predicate
323af69d88dSmrg    values are used in the predicated alu instructions (node::pred),
324af69d88dSmrg    exec\_mask values are used in the if\_nodes (if\_node::cond). Each
325af69d88dSmrg    vector operand in the CF/TEX/VTX instructions is represented with 4
326af69d88dSmrg    values - components of the vector.
327af69d88dSmrg
328af69d88dSmrg-   **ssa\_prepare** - creates phi expressions.
329af69d88dSmrg
330af69d88dSmrg-   **ssa\_rename** - renames the values (assigns versions).
331af69d88dSmrg
332af69d88dSmrg-   **liveness** - liveness computation, sets 'dead' flag for unused
333af69d88dSmrg    nodes and values, optionally computes interference information for
334af69d88dSmrg    the values.
335af69d88dSmrg
336af69d88dSmrg-   **dce\_cleanup** - eliminates 'dead' nodes, also removes some
337af69d88dSmrg    unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
338af69d88dSmrg    instructions in the source, containers for ALU groups (they were
339af69d88dSmrg    only needed for the ssa\_rename pass)
340af69d88dSmrg
341af69d88dSmrg-   **if\_conversion** - converts control flow with if\_nodes to the
342af69d88dSmrg    data flow in cases where it can improve performance (small alu-only
343af69d88dSmrg    branches). Both branches are executed speculatively and the phi
344af69d88dSmrg    expressions are replaced with conditional moves (CNDxx) to select
345af69d88dSmrg    the final value using the same condition predicate as was used by
346af69d88dSmrg    the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
347af69d88dSmrg    instruction, CNDxx now uses dst[0] from the same PREDSETxx
348af69d88dSmrg    instruction.
349af69d88dSmrg
350af69d88dSmrg-   **peephole** - peephole optimizations
351af69d88dSmrg
352af69d88dSmrg-   **gvn** - Global Value Numbering [[2]](#references),
353af69d88dSmrg    [[3]](#references)
354af69d88dSmrg
355af69d88dSmrg-   **gcm** - Global Code Motion [[3]](#references). Also performs
356af69d88dSmrg    grouping of the instructions of the same kind (CF/FETCH/ALU).
357af69d88dSmrg
358af69d88dSmrg-   register allocation passes, some ideas are used from
359af69d88dSmrg    [[4]](#references), but implementation is simplified to make it more
360af69d88dSmrg    efficient in terms of the compilation speed (e.g. no recursive
361af69d88dSmrg    recoloring) while achieving good enough results.
362af69d88dSmrg
363af69d88dSmrg    -   **ra\_split** - prepares the program to register allocation.
364af69d88dSmrg        Splits live ranges for constrained values by inserting the
365af69d88dSmrg        copies to/from temporary values, so that the live range of the
366af69d88dSmrg        constrained values becomes minimal.
367af69d88dSmrg
368af69d88dSmrg    -   **ra\_coalesce** - performs global allocation on registers used
369af69d88dSmrg        in CF/FETCH instructions. It's performed first to make sure they
370af69d88dSmrg        end up in the same GPR. Also tries to allocate all values
371af69d88dSmrg        involved in copies (inserted by the ra\_split pass) to the same
372af69d88dSmrg        register, so that the copies may be eliminated.
373af69d88dSmrg
374af69d88dSmrg    -   **ra\_init** - allocates gpr arrays (if indirect addressing is
375af69d88dSmrg        used), and remaining values.
376af69d88dSmrg
377af69d88dSmrg-   **post\_scheduler** - ALU scheduler, handles VLIW packing and
378af69d88dSmrg    performs the final register allocation for local values inside ALU
379af69d88dSmrg    clauses. Eliminates all coalesced copies (if src and dst of the copy
380af69d88dSmrg    are allocated to the same register).
381af69d88dSmrg
382af69d88dSmrg-   **ra\_checker** - optional debugging pass that tries to catch basic
383af69d88dSmrg    errors of the scheduler or regalloc,
384af69d88dSmrg
385af69d88dSmrg-   **bc\_finalize** - propagates the regalloc information from values
386af69d88dSmrg    in node::src and node::dst vectors to the bytecode fields, converts
387af69d88dSmrg    control flow structure (region/depart/repeat) to the target
388af69d88dSmrg    instructions (JUMP/ELSE/POP,
389af69d88dSmrg    LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
390af69d88dSmrg
391af69d88dSmrg-   **bc\_builder** - builds final bytecode,
392af69d88dSmrg
393af69d88dSmrg* * * * *
394af69d88dSmrg
395af69d88dSmrgReferences
396af69d88dSmrg----------
397af69d88dSmrg
398af69d88dSmrg[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
399af69d88dSmrgMcConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
400af69d88dSmrg
401af69d88dSmrg[2] ["Effective Representation of Aliases and Indirect Memory Operations
402af69d88dSmrgin SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
403af69d88dSmrgStreich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
404af69d88dSmrg
405af69d88dSmrg[3] ["Global Code Motion. Global Value Numbering.", Cliff
406af69d88dSmrgClick](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
407af69d88dSmrg
408af69d88dSmrg[4] ["Register Allocation for Programs in SSA Form", Sebastian
409af69d88dSmrgHack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
410af69d88dSmrg
411af69d88dSmrg[5] ["An extension to the SSA representation for predicated code",
412af69d88dSmrgFrancois de
413af69d88dSmrgFerriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
414af69d88dSmrg
415af69d88dSmrg[6] ["Improvements to the Psi-SSA Representation", F. de
416af69d88dSmrgFerriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)
417