1848b8605Smrgr600-sb 2848b8605Smrg======= 3848b8605Smrg 4848b8605Smrg* * * * * 5848b8605Smrg 6848b8605SmrgDebugging 7848b8605Smrg--------- 8848b8605Smrg 9848b8605Smrg### Environment variables 10848b8605Smrg 11848b8605Smrg- **R600\_DEBUG** 12848b8605Smrg 13848b8605Smrg There are new flags: 14848b8605Smrg 15b8e80941Smrg - **nosb** - Disable sb backend for graphics shaders 16848b8605Smrg - **sbcl** - Enable optimization of compute shaders (experimental) 17b8e80941Smrg - **sbdry** - Dry run, optimize but use source bytecode - 18b8e80941Smrg useful if you only want to check shader dumps 19848b8605Smrg without the risk of lockups and other problems 20848b8605Smrg - **sbstat** - Print optimization statistics (only time so far) 21848b8605Smrg - **sbdump** - Print IR after some passes. 22b8e80941Smrg - **sbnofallback** - Abort on errors instead of fallback 23b8e80941Smrg - **sbdisasm** - Use sb disassembler for shader dumps 24b8e80941Smrg - **sbsafemath** - Disable unsafe math optimizations 25848b8605Smrg 26848b8605Smrg### Regression debugging 27848b8605Smrg 28848b8605SmrgIf there are any regressions as compared to the default backend 29848b8605Smrg(R600\_SB=0), it's possible to use the following environment variables 30848b8605Smrgto find the incorrectly optimized shader that causes the regression. 31848b8605Smrg 32848b8605Smrg- **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some 33848b8605Smrg shaders 34848b8605Smrg - 0 - disabled (default) 35848b8605Smrg - 1 - skip optimization for the shaders in the range 36848b8605Smrg [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is, 37848b8605Smrg optimize only the shaders that are not in this range 38848b8605Smrg - 2 - optimize only the shaders in the range 39848b8605Smrg [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END] 40848b8605Smrg 41848b8605Smrg- **R600\_SB\_DSKIP\_START** - start of the range (1-based) 42848b8605Smrg 43848b8605Smrg- **R600\_SB\_DSKIP\_END** - end of the range (1-based) 44848b8605Smrg 45848b8605SmrgExample - optimize only the shaders 5, 6, and 7: 46848b8605Smrg 47848b8605Smrg R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2 48848b8605Smrg 49848b8605SmrgAll shaders compiled by the application are numbered starting from 1, 50848b8605Smrgthe number of shaders used by the application may be obtained by running 51848b8605Smrgit with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#" 52848b8605Smrgfor each compiled shader. 53848b8605Smrg 54848b8605SmrgAfter figuring out the total number of shaders used by the application, 55848b8605Smrgthe variables above allow to use bisection to find the shader that is 56848b8605Smrgthe cause of regression. E.g. if the application uses 100 shaders, we 57848b8605Smrgcan divide the range [1; 100] and run the application with the 58848b8605Smrgoptimization enabled only for the first half of the shaders: 59848b8605Smrg 60848b8605Smrg R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app> 61848b8605Smrg 62848b8605SmrgIf the regression is reproduced with these parameters, then the failing 63848b8605Smrgshader is in the range [1; 50], if it's not reproduced - then it's in 64848b8605Smrgthe range [51; 100]. Then we can divide the new range again and repeat 65848b8605Smrgthe testing, until we'll reduce the range to a single failing shader. 66848b8605Smrg 67848b8605Smrg*NOTE: This method relies on the assumption that the application 68848b8605Smrgproduces the same sequence of the shaders on each run. It's not always 69848b8605Smrgtrue - some applications may produce different sequences of the shaders, 70848b8605Smrgin such cases the tools like apitrace may be used to record the trace 71848b8605Smrgwith the application, then this method may be applied when replaying the 72848b8605Smrgtrace - also this may be faster and/or more convenient than testing the 73848b8605Smrgapplication itself.* 74848b8605Smrg 75848b8605Smrg* * * * * 76848b8605Smrg 77848b8605SmrgIntermediate Representation 78848b8605Smrg--------------------------- 79848b8605Smrg 80848b8605Smrg### Values 81848b8605Smrg 82848b8605SmrgAll kinds of the operands (literal constants, references to kcache 83848b8605Smrgconstants, references to GPRs, etc) are currently represented by the 84848b8605Smrg**value** class (possibly it makes sense to switch to hierarchy of 85848b8605Smrgclasses derived from **value** instead, to save some memory). 86848b8605Smrg 87848b8605SmrgAll values (except some pseudo values like the exec\_mask or predicate 88848b8605Smrgregister) represent 32bit scalar values - there are no vector values, 89848b8605SmrgCF/FETCH instructions use groups of 4 values for src and dst operands. 90848b8605Smrg 91848b8605Smrg### Nodes 92848b8605Smrg 93848b8605SmrgShader programs are represented using the tree data structure, some 94848b8605Smrgnodes contain a list of subnodes. 95848b8605Smrg 96848b8605Smrg#### Control flow nodes 97848b8605Smrg 98848b8605SmrgControl flow information is represented using four special node types 99848b8605Smrg(based on the ideas from [[1]](#references) ) 100848b8605Smrg 101848b8605Smrg- **region\_node** - single-entry, single-exit region. 102848b8605Smrg 103848b8605Smrg All loops and if's in the program are enclosed in region nodes. 104848b8605Smrg Region nodes have two containers for phi nodes - 105848b8605Smrg region\_node::loop\_phi contains the phi expressions to be executed 106848b8605Smrg at the region entry, region\_node::phi contains the phi expressions 107848b8605Smrg to be executed at the region exit. It's the only type of the node 108848b8605Smrg that contains associated phi expressions. 109848b8605Smrg 110848b8605Smrg- **depart\_node** - "depart region \$id after { ... }" 111848b8605Smrg 112848b8605Smrg Depart target region (jump to exit point) after executing contained 113848b8605Smrg code. 114848b8605Smrg 115848b8605Smrg- **repeat\_node** - "repeat region \$id after { ... }" 116848b8605Smrg 117848b8605Smrg Repeat target region (jump to entry point) after executing contained 118848b8605Smrg code. 119848b8605Smrg 120848b8605Smrg- **if\_node** - "if (cond) { ... }" 121848b8605Smrg 122848b8605Smrg Execute contained code if condition is true. The difference from 123848b8605Smrg [[1]](#references) is that we don't have associated phi expressions 124848b8605Smrg for the **if\_node**, we enclose **if\_node** in the 125848b8605Smrg **region\_node** and store corresponding phi's in the 126848b8605Smrg **region\_node**, this allows more uniform handling. 127848b8605Smrg 128848b8605SmrgThe target region of depart and repeat nodes is always the region where 129848b8605Smrgthey are located (possibly in the nested region), there are no arbitrary 130848b8605Smrgjumps/goto's - control flow in the program is always structured. 131848b8605Smrg 132848b8605SmrgTypical control flow constructs can be represented as in the following 133848b8605Smrgexamples: 134848b8605Smrg 135848b8605SmrgGLSL: 136848b8605Smrg 137848b8605Smrg if (cond) { 138848b8605Smrg < 1 > 139848b8605Smrg } else { 140848b8605Smrg < 2 > 141848b8605Smrg } 142848b8605Smrg 143848b8605SmrgIR: 144848b8605Smrg 145848b8605Smrg region #0 { 146848b8605Smrg depart region #0 after { 147848b8605Smrg if (cond) { 148848b8605Smrg depart region #0 after { 149848b8605Smrg < 1 > 150848b8605Smrg } 151848b8605Smrg } 152848b8605Smrg < 2 > 153848b8605Smrg } 154848b8605Smrg <region #0 phi nodes > 155848b8605Smrg } 156848b8605Smrg 157848b8605SmrgGLSL: 158848b8605Smrg 159848b8605Smrg while (cond) { 160848b8605Smrg < 1 > 161848b8605Smrg } 162848b8605Smrg 163848b8605SmrgIR: 164848b8605Smrg 165848b8605Smrg region #0 { 166848b8605Smrg <region #0 loop_phi nodes> 167848b8605Smrg repeat region #0 after { 168848b8605Smrg region #1 { 169848b8605Smrg depart region #1 after { 170848b8605Smrg if (!cond) { 171848b8605Smrg depart region #0 172848b8605Smrg } 173848b8605Smrg } 174848b8605Smrg } 175848b8605Smrg < 1 > 176848b8605Smrg } 177848b8605Smrg <region #0 phi nodes> 178848b8605Smrg } 179848b8605Smrg 180848b8605Smrg'Break' and 'continue' inside the loops are directly translated to the 181848b8605Smrgdepart and repeat nodes for the corresponding loop region. 182848b8605Smrg 183848b8605SmrgThis may look a bit too complicated, but in fact this allows more simple 184848b8605Smrgand uniform handling of the control flow. 185848b8605Smrg 186848b8605SmrgAll loop\_phi and phi nodes for some region always have the same number 187848b8605Smrgof source operands. The number of source operands for 188848b8605Smrgregion\_node::loop\_phi nodes is 1 + number of repeat nodes that 189848b8605Smrgreference this region as a target. The number of source operands for 190848b8605Smrgregion\_node::phi nodes is equal to the number of depart nodes that 191848b8605Smrgreference this region as a target. All depart/repeat nodes for the 192848b8605Smrgregion have unique indices equal to the index of source operand for 193848b8605Smrgphi/loop\_phi nodes. 194848b8605Smrg 195848b8605SmrgFirst source operand for region\_node::loop\_phi nodes (src[0]) is an 196848b8605Smrgincoming value that enters the region from the outside. Each remaining 197848b8605Smrgsource operand comes from the corresponding repeat node. 198848b8605Smrg 199848b8605SmrgMore complex example: 200848b8605Smrg 201848b8605SmrgGLSL: 202848b8605Smrg 203848b8605Smrg a = 1; 204848b8605Smrg while (a < 5) { 205848b8605Smrg a = a * 2; 206848b8605Smrg if (b == 3) { 207848b8605Smrg continue; 208848b8605Smrg } else { 209848b8605Smrg a = 6; 210848b8605Smrg } 211848b8605Smrg if (c == 4) 212848b8605Smrg break; 213848b8605Smrg a = a + 1; 214848b8605Smrg } 215848b8605Smrg 216848b8605SmrgIR with SSA form: 217848b8605Smrg 218848b8605Smrg a.1 = 1; 219848b8605Smrg region #0 { 220848b8605Smrg // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2 221848b8605Smrg region#0 loop_phi: a.2 = phi a.1, a.6, a.3 222848b8605Smrg 223848b8605Smrg repeat_1 region #0 after { 224848b8605Smrg a.3 = a.2 * 2; 225848b8605Smrg cond1 = (b == 3); 226848b8605Smrg region #1 { 227848b8605Smrg depart_0 region #1 after { 228848b8605Smrg if (cond1) { 229848b8605Smrg repeat_2 region #0; 230848b8605Smrg } 231848b8605Smrg } 232848b8605Smrg a.4 = 6; 233848b8605Smrg 234848b8605Smrg region #1 phi: a.5 = phi a.4; // src[0] - from depart_0 235848b8605Smrg } 236848b8605Smrg cond2 = (c == 4); 237848b8605Smrg region #2 { 238848b8605Smrg depart_0 region #2 after { 239848b8605Smrg if (cond2) { 240848b8605Smrg depart_0 region #0; 241848b8605Smrg } 242848b8605Smrg } 243848b8605Smrg } 244848b8605Smrg a.6 = a.5 + 1; 245848b8605Smrg } 246848b8605Smrg 247848b8605Smrg region #0 phi: a.7 = phi a.5 // src[0] from depart_0 248848b8605Smrg } 249848b8605Smrg 250848b8605SmrgPhi nodes with single source operand are just copies, they are not 251848b8605Smrgreally necessary, but this allows to handle all **depart\_node**s in the 252848b8605Smrguniform way. 253848b8605Smrg 254848b8605Smrg#### Instruction nodes 255848b8605Smrg 256848b8605SmrgInstruction nodes represent different kinds of instructions - 257848b8605Smrg**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains 258848b8605Smrgthe "bc" structure where all fields of the bytecode are stored (the type 259848b8605Smrgis **bc\_alu** for **alu\_node**, etc). The operands are represented 260848b8605Smrgusing the vectors of pointers to **value** class (node::src, node::dst) 261848b8605Smrg 262848b8605Smrg#### SSA-specific nodes 263848b8605Smrg 264848b8605SmrgPhi nodes currently don't have special node class, they are stored as 265848b8605Smrg**node**. Destination vector contains a single destination value, source 266848b8605Smrgvector contains 1 or more source values. 267848b8605Smrg 268848b8605SmrgPsi nodes [[5], [6]](#references) also don't have a special node class 269848b8605Smrgand stored as **node**. Source vector contains 3 values for each source 270848b8605Smrgoperand - the **value** of predicate, **value** of corresponding 271848b8605SmrgPRED\_SEL field, and the source **value** itself. 272848b8605Smrg 273848b8605Smrg### Indirect addressing 274848b8605Smrg 275848b8605SmrgSpecial kind of values (VLK\_RELREG) is used to represent indirect 276848b8605Smrgoperands. These values don't have SSA versions. The representation is 277848b8605Smrgmostly based on the [[2]](#references). Indirect operand contains the 278848b8605Smrg"offset/address" value (value::rel), (e.g. some SSA version of the AR 279848b8605Smrgregister value, though after some passes it may be any value - constant, 280848b8605Smrgregister, etc), also it contains the maydef and mayuse vectors of 281848b8605Smrgpointers to **value**s (similar to dst/src vectors in the **node**) to 282848b8605Smrgrepresent the effects of aliasing in the SSA form. 283848b8605Smrg 284848b8605SmrgE.g. if we have the array R5.x ... R8.x and the following instruction : 285848b8605Smrg 286848b8605Smrg MOV R0.x, R[5 + AR].x 287848b8605Smrg 288848b8605Smrgthen source indirect operand is represented with the VLK\_RELREG value, 289848b8605Smrgvalue::rel is AR, value::maydef is empty (in fact it always contain the 290848b8605Smrgsame number of elements as mayuse to simplify the handling, but they are 291848b8605SmrgNULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the 292848b8605Smrgcorresponding SSA versions after ssa\_rename). 293848b8605Smrg 294848b8605SmrgAdditional "virtual variables" as in [HSSA [2]](#references) are not 295848b8605Smrgused, also there is no special handling for "zero versions". Typical 296848b8605Smrgprograms in our case are small, indirect addressing is rare, array sizes 297848b8605Smrgare limited by max gpr number, so we don't really need to use special 298848b8605Smrgtricks to avoid the explosion of value versions. Also this allows more 299848b8605Smrgprecise liveness computation for array elements without modifications to 300848b8605Smrgthe algorithms. 301848b8605Smrg 302848b8605SmrgWith the following instruction: 303848b8605Smrg 304848b8605Smrg MOV R[5+AR].x, R0.x 305848b8605Smrg 306848b8605Smrgwe'll have both maydef and mayuse vectors for dst operand filled with 307848b8605Smrgarray values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename 308848b8605Smrgpass mayuse will contain previous versions, maydef will contain new 309848b8605Smrgpotentially-defined versions. 310848b8605Smrg 311848b8605Smrg* * * * * 312848b8605Smrg 313848b8605SmrgPasses 314848b8605Smrg------ 315848b8605Smrg 316848b8605Smrg- **bc\_parser** - creates the IR from the source bytecode, 317848b8605Smrg initializes src and dst value vectors for instruction nodes. Most 318848b8605Smrg ALU nodes have one dst operand and the number of source operands is 319848b8605Smrg equal to the number of source operands for the ISA instruction. 320848b8605Smrg Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst 321848b8605Smrg gpr as in the original instruction, other two are pseudo-operands 322848b8605Smrg that represent possibly updated predicate and exec\_mask. Predicate 323848b8605Smrg values are used in the predicated alu instructions (node::pred), 324848b8605Smrg exec\_mask values are used in the if\_nodes (if\_node::cond). Each 325848b8605Smrg vector operand in the CF/TEX/VTX instructions is represented with 4 326848b8605Smrg values - components of the vector. 327848b8605Smrg 328848b8605Smrg- **ssa\_prepare** - creates phi expressions. 329848b8605Smrg 330848b8605Smrg- **ssa\_rename** - renames the values (assigns versions). 331848b8605Smrg 332848b8605Smrg- **liveness** - liveness computation, sets 'dead' flag for unused 333848b8605Smrg nodes and values, optionally computes interference information for 334848b8605Smrg the values. 335848b8605Smrg 336848b8605Smrg- **dce\_cleanup** - eliminates 'dead' nodes, also removes some 337848b8605Smrg unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP 338848b8605Smrg instructions in the source, containers for ALU groups (they were 339848b8605Smrg only needed for the ssa\_rename pass) 340848b8605Smrg 341848b8605Smrg- **if\_conversion** - converts control flow with if\_nodes to the 342848b8605Smrg data flow in cases where it can improve performance (small alu-only 343848b8605Smrg branches). Both branches are executed speculatively and the phi 344848b8605Smrg expressions are replaced with conditional moves (CNDxx) to select 345848b8605Smrg the final value using the same condition predicate as was used by 346848b8605Smrg the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx 347848b8605Smrg instruction, CNDxx now uses dst[0] from the same PREDSETxx 348848b8605Smrg instruction. 349848b8605Smrg 350848b8605Smrg- **peephole** - peephole optimizations 351848b8605Smrg 352848b8605Smrg- **gvn** - Global Value Numbering [[2]](#references), 353848b8605Smrg [[3]](#references) 354848b8605Smrg 355848b8605Smrg- **gcm** - Global Code Motion [[3]](#references). Also performs 356848b8605Smrg grouping of the instructions of the same kind (CF/FETCH/ALU). 357848b8605Smrg 358848b8605Smrg- register allocation passes, some ideas are used from 359848b8605Smrg [[4]](#references), but implementation is simplified to make it more 360848b8605Smrg efficient in terms of the compilation speed (e.g. no recursive 361848b8605Smrg recoloring) while achieving good enough results. 362848b8605Smrg 363848b8605Smrg - **ra\_split** - prepares the program to register allocation. 364848b8605Smrg Splits live ranges for constrained values by inserting the 365848b8605Smrg copies to/from temporary values, so that the live range of the 366848b8605Smrg constrained values becomes minimal. 367848b8605Smrg 368848b8605Smrg - **ra\_coalesce** - performs global allocation on registers used 369848b8605Smrg in CF/FETCH instructions. It's performed first to make sure they 370848b8605Smrg end up in the same GPR. Also tries to allocate all values 371848b8605Smrg involved in copies (inserted by the ra\_split pass) to the same 372848b8605Smrg register, so that the copies may be eliminated. 373848b8605Smrg 374848b8605Smrg - **ra\_init** - allocates gpr arrays (if indirect addressing is 375848b8605Smrg used), and remaining values. 376848b8605Smrg 377848b8605Smrg- **post\_scheduler** - ALU scheduler, handles VLIW packing and 378848b8605Smrg performs the final register allocation for local values inside ALU 379848b8605Smrg clauses. Eliminates all coalesced copies (if src and dst of the copy 380848b8605Smrg are allocated to the same register). 381848b8605Smrg 382848b8605Smrg- **ra\_checker** - optional debugging pass that tries to catch basic 383848b8605Smrg errors of the scheduler or regalloc, 384848b8605Smrg 385848b8605Smrg- **bc\_finalize** - propagates the regalloc information from values 386848b8605Smrg in node::src and node::dst vectors to the bytecode fields, converts 387848b8605Smrg control flow structure (region/depart/repeat) to the target 388848b8605Smrg instructions (JUMP/ELSE/POP, 389848b8605Smrg LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK). 390848b8605Smrg 391848b8605Smrg- **bc\_builder** - builds final bytecode, 392848b8605Smrg 393848b8605Smrg* * * * * 394848b8605Smrg 395848b8605SmrgReferences 396848b8605Smrg---------- 397848b8605Smrg 398848b8605Smrg[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl 399848b8605SmrgMcConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf) 400848b8605Smrg 401848b8605Smrg[2] ["Effective Representation of Aliases and Indirect Memory Operations 402848b8605Smrgin SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark 403848b8605SmrgStreich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf) 404848b8605Smrg 405848b8605Smrg[3] ["Global Code Motion. Global Value Numbering.", Cliff 406848b8605SmrgClick](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf) 407848b8605Smrg 408848b8605Smrg[4] ["Register Allocation for Programs in SSA Form", Sebastian 409848b8605SmrgHack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532) 410848b8605Smrg 411848b8605Smrg[5] ["An extension to the SSA representation for predicated code", 412848b8605SmrgFrancois de 413848b8605SmrgFerriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf) 414848b8605Smrg 415848b8605Smrg[6] ["Improvements to the Psi-SSA Representation", F. de 416848b8605SmrgFerriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf) 417