1af69d88dSmrgr600-sb 2af69d88dSmrg======= 3af69d88dSmrg 4af69d88dSmrg* * * * * 5af69d88dSmrg 6af69d88dSmrgDebugging 7af69d88dSmrg--------- 8af69d88dSmrg 9af69d88dSmrg### Environment variables 10af69d88dSmrg 11af69d88dSmrg- **R600\_DEBUG** 12af69d88dSmrg 13af69d88dSmrg There are new flags: 14af69d88dSmrg 1501e04c3fSmrg - **nosb** - Disable sb backend for graphics shaders 16af69d88dSmrg - **sbcl** - Enable optimization of compute shaders (experimental) 1701e04c3fSmrg - **sbdry** - Dry run, optimize but use source bytecode - 1801e04c3fSmrg useful if you only want to check shader dumps 19af69d88dSmrg without the risk of lockups and other problems 20af69d88dSmrg - **sbstat** - Print optimization statistics (only time so far) 21af69d88dSmrg - **sbdump** - Print IR after some passes. 2201e04c3fSmrg - **sbnofallback** - Abort on errors instead of fallback 2301e04c3fSmrg - **sbdisasm** - Use sb disassembler for shader dumps 2401e04c3fSmrg - **sbsafemath** - Disable unsafe math optimizations 25af69d88dSmrg 26af69d88dSmrg### Regression debugging 27af69d88dSmrg 28af69d88dSmrgIf there are any regressions as compared to the default backend 29af69d88dSmrg(R600\_SB=0), it's possible to use the following environment variables 30af69d88dSmrgto find the incorrectly optimized shader that causes the regression. 31af69d88dSmrg 32af69d88dSmrg- **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some 33af69d88dSmrg shaders 34af69d88dSmrg - 0 - disabled (default) 35af69d88dSmrg - 1 - skip optimization for the shaders in the range 36af69d88dSmrg [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is, 37af69d88dSmrg optimize only the shaders that are not in this range 38af69d88dSmrg - 2 - optimize only the shaders in the range 39af69d88dSmrg [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END] 40af69d88dSmrg 41af69d88dSmrg- **R600\_SB\_DSKIP\_START** - start of the range (1-based) 42af69d88dSmrg 43af69d88dSmrg- **R600\_SB\_DSKIP\_END** - end of the range (1-based) 44af69d88dSmrg 45af69d88dSmrgExample - optimize only the shaders 5, 6, and 7: 46af69d88dSmrg 47af69d88dSmrg R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2 48af69d88dSmrg 49af69d88dSmrgAll shaders compiled by the application are numbered starting from 1, 50af69d88dSmrgthe number of shaders used by the application may be obtained by running 51af69d88dSmrgit with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#" 52af69d88dSmrgfor each compiled shader. 53af69d88dSmrg 54af69d88dSmrgAfter figuring out the total number of shaders used by the application, 55af69d88dSmrgthe variables above allow to use bisection to find the shader that is 56af69d88dSmrgthe cause of regression. E.g. if the application uses 100 shaders, we 57af69d88dSmrgcan divide the range [1; 100] and run the application with the 58af69d88dSmrgoptimization enabled only for the first half of the shaders: 59af69d88dSmrg 60af69d88dSmrg R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app> 61af69d88dSmrg 62af69d88dSmrgIf the regression is reproduced with these parameters, then the failing 63af69d88dSmrgshader is in the range [1; 50], if it's not reproduced - then it's in 64af69d88dSmrgthe range [51; 100]. Then we can divide the new range again and repeat 65af69d88dSmrgthe testing, until we'll reduce the range to a single failing shader. 66af69d88dSmrg 67af69d88dSmrg*NOTE: This method relies on the assumption that the application 68af69d88dSmrgproduces the same sequence of the shaders on each run. It's not always 69af69d88dSmrgtrue - some applications may produce different sequences of the shaders, 70af69d88dSmrgin such cases the tools like apitrace may be used to record the trace 71af69d88dSmrgwith the application, then this method may be applied when replaying the 72af69d88dSmrgtrace - also this may be faster and/or more convenient than testing the 73af69d88dSmrgapplication itself.* 74af69d88dSmrg 75af69d88dSmrg* * * * * 76af69d88dSmrg 77af69d88dSmrgIntermediate Representation 78af69d88dSmrg--------------------------- 79af69d88dSmrg 80af69d88dSmrg### Values 81af69d88dSmrg 82af69d88dSmrgAll kinds of the operands (literal constants, references to kcache 83af69d88dSmrgconstants, references to GPRs, etc) are currently represented by the 84af69d88dSmrg**value** class (possibly it makes sense to switch to hierarchy of 85af69d88dSmrgclasses derived from **value** instead, to save some memory). 86af69d88dSmrg 87af69d88dSmrgAll values (except some pseudo values like the exec\_mask or predicate 88af69d88dSmrgregister) represent 32bit scalar values - there are no vector values, 89af69d88dSmrgCF/FETCH instructions use groups of 4 values for src and dst operands. 90af69d88dSmrg 91af69d88dSmrg### Nodes 92af69d88dSmrg 93af69d88dSmrgShader programs are represented using the tree data structure, some 94af69d88dSmrgnodes contain a list of subnodes. 95af69d88dSmrg 96af69d88dSmrg#### Control flow nodes 97af69d88dSmrg 98af69d88dSmrgControl flow information is represented using four special node types 99af69d88dSmrg(based on the ideas from [[1]](#references) ) 100af69d88dSmrg 101af69d88dSmrg- **region\_node** - single-entry, single-exit region. 102af69d88dSmrg 103af69d88dSmrg All loops and if's in the program are enclosed in region nodes. 104af69d88dSmrg Region nodes have two containers for phi nodes - 105af69d88dSmrg region\_node::loop\_phi contains the phi expressions to be executed 106af69d88dSmrg at the region entry, region\_node::phi contains the phi expressions 107af69d88dSmrg to be executed at the region exit. It's the only type of the node 108af69d88dSmrg that contains associated phi expressions. 109af69d88dSmrg 110af69d88dSmrg- **depart\_node** - "depart region \$id after { ... }" 111af69d88dSmrg 112af69d88dSmrg Depart target region (jump to exit point) after executing contained 113af69d88dSmrg code. 114af69d88dSmrg 115af69d88dSmrg- **repeat\_node** - "repeat region \$id after { ... }" 116af69d88dSmrg 117af69d88dSmrg Repeat target region (jump to entry point) after executing contained 118af69d88dSmrg code. 119af69d88dSmrg 120af69d88dSmrg- **if\_node** - "if (cond) { ... }" 121af69d88dSmrg 122af69d88dSmrg Execute contained code if condition is true. The difference from 123af69d88dSmrg [[1]](#references) is that we don't have associated phi expressions 124af69d88dSmrg for the **if\_node**, we enclose **if\_node** in the 125af69d88dSmrg **region\_node** and store corresponding phi's in the 126af69d88dSmrg **region\_node**, this allows more uniform handling. 127af69d88dSmrg 128af69d88dSmrgThe target region of depart and repeat nodes is always the region where 129af69d88dSmrgthey are located (possibly in the nested region), there are no arbitrary 130af69d88dSmrgjumps/goto's - control flow in the program is always structured. 131af69d88dSmrg 132af69d88dSmrgTypical control flow constructs can be represented as in the following 133af69d88dSmrgexamples: 134af69d88dSmrg 135af69d88dSmrgGLSL: 136af69d88dSmrg 137af69d88dSmrg if (cond) { 138af69d88dSmrg < 1 > 139af69d88dSmrg } else { 140af69d88dSmrg < 2 > 141af69d88dSmrg } 142af69d88dSmrg 143af69d88dSmrgIR: 144af69d88dSmrg 145af69d88dSmrg region #0 { 146af69d88dSmrg depart region #0 after { 147af69d88dSmrg if (cond) { 148af69d88dSmrg depart region #0 after { 149af69d88dSmrg < 1 > 150af69d88dSmrg } 151af69d88dSmrg } 152af69d88dSmrg < 2 > 153af69d88dSmrg } 154af69d88dSmrg <region #0 phi nodes > 155af69d88dSmrg } 156af69d88dSmrg 157af69d88dSmrgGLSL: 158af69d88dSmrg 159af69d88dSmrg while (cond) { 160af69d88dSmrg < 1 > 161af69d88dSmrg } 162af69d88dSmrg 163af69d88dSmrgIR: 164af69d88dSmrg 165af69d88dSmrg region #0 { 166af69d88dSmrg <region #0 loop_phi nodes> 167af69d88dSmrg repeat region #0 after { 168af69d88dSmrg region #1 { 169af69d88dSmrg depart region #1 after { 170af69d88dSmrg if (!cond) { 171af69d88dSmrg depart region #0 172af69d88dSmrg } 173af69d88dSmrg } 174af69d88dSmrg } 175af69d88dSmrg < 1 > 176af69d88dSmrg } 177af69d88dSmrg <region #0 phi nodes> 178af69d88dSmrg } 179af69d88dSmrg 180af69d88dSmrg'Break' and 'continue' inside the loops are directly translated to the 181af69d88dSmrgdepart and repeat nodes for the corresponding loop region. 182af69d88dSmrg 183af69d88dSmrgThis may look a bit too complicated, but in fact this allows more simple 184af69d88dSmrgand uniform handling of the control flow. 185af69d88dSmrg 186af69d88dSmrgAll loop\_phi and phi nodes for some region always have the same number 187af69d88dSmrgof source operands. The number of source operands for 188af69d88dSmrgregion\_node::loop\_phi nodes is 1 + number of repeat nodes that 189af69d88dSmrgreference this region as a target. The number of source operands for 190af69d88dSmrgregion\_node::phi nodes is equal to the number of depart nodes that 191af69d88dSmrgreference this region as a target. All depart/repeat nodes for the 192af69d88dSmrgregion have unique indices equal to the index of source operand for 193af69d88dSmrgphi/loop\_phi nodes. 194af69d88dSmrg 195af69d88dSmrgFirst source operand for region\_node::loop\_phi nodes (src[0]) is an 196af69d88dSmrgincoming value that enters the region from the outside. Each remaining 197af69d88dSmrgsource operand comes from the corresponding repeat node. 198af69d88dSmrg 199af69d88dSmrgMore complex example: 200af69d88dSmrg 201af69d88dSmrgGLSL: 202af69d88dSmrg 203af69d88dSmrg a = 1; 204af69d88dSmrg while (a < 5) { 205af69d88dSmrg a = a * 2; 206af69d88dSmrg if (b == 3) { 207af69d88dSmrg continue; 208af69d88dSmrg } else { 209af69d88dSmrg a = 6; 210af69d88dSmrg } 211af69d88dSmrg if (c == 4) 212af69d88dSmrg break; 213af69d88dSmrg a = a + 1; 214af69d88dSmrg } 215af69d88dSmrg 216af69d88dSmrgIR with SSA form: 217af69d88dSmrg 218af69d88dSmrg a.1 = 1; 219af69d88dSmrg region #0 { 220af69d88dSmrg // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2 221af69d88dSmrg region#0 loop_phi: a.2 = phi a.1, a.6, a.3 222af69d88dSmrg 223af69d88dSmrg repeat_1 region #0 after { 224af69d88dSmrg a.3 = a.2 * 2; 225af69d88dSmrg cond1 = (b == 3); 226af69d88dSmrg region #1 { 227af69d88dSmrg depart_0 region #1 after { 228af69d88dSmrg if (cond1) { 229af69d88dSmrg repeat_2 region #0; 230af69d88dSmrg } 231af69d88dSmrg } 232af69d88dSmrg a.4 = 6; 233af69d88dSmrg 234af69d88dSmrg region #1 phi: a.5 = phi a.4; // src[0] - from depart_0 235af69d88dSmrg } 236af69d88dSmrg cond2 = (c == 4); 237af69d88dSmrg region #2 { 238af69d88dSmrg depart_0 region #2 after { 239af69d88dSmrg if (cond2) { 240af69d88dSmrg depart_0 region #0; 241af69d88dSmrg } 242af69d88dSmrg } 243af69d88dSmrg } 244af69d88dSmrg a.6 = a.5 + 1; 245af69d88dSmrg } 246af69d88dSmrg 247af69d88dSmrg region #0 phi: a.7 = phi a.5 // src[0] from depart_0 248af69d88dSmrg } 249af69d88dSmrg 250af69d88dSmrgPhi nodes with single source operand are just copies, they are not 251af69d88dSmrgreally necessary, but this allows to handle all **depart\_node**s in the 252af69d88dSmrguniform way. 253af69d88dSmrg 254af69d88dSmrg#### Instruction nodes 255af69d88dSmrg 256af69d88dSmrgInstruction nodes represent different kinds of instructions - 257af69d88dSmrg**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains 258af69d88dSmrgthe "bc" structure where all fields of the bytecode are stored (the type 259af69d88dSmrgis **bc\_alu** for **alu\_node**, etc). The operands are represented 260af69d88dSmrgusing the vectors of pointers to **value** class (node::src, node::dst) 261af69d88dSmrg 262af69d88dSmrg#### SSA-specific nodes 263af69d88dSmrg 264af69d88dSmrgPhi nodes currently don't have special node class, they are stored as 265af69d88dSmrg**node**. Destination vector contains a single destination value, source 266af69d88dSmrgvector contains 1 or more source values. 267af69d88dSmrg 268af69d88dSmrgPsi nodes [[5], [6]](#references) also don't have a special node class 269af69d88dSmrgand stored as **node**. Source vector contains 3 values for each source 270af69d88dSmrgoperand - the **value** of predicate, **value** of corresponding 271af69d88dSmrgPRED\_SEL field, and the source **value** itself. 272af69d88dSmrg 273af69d88dSmrg### Indirect addressing 274af69d88dSmrg 275af69d88dSmrgSpecial kind of values (VLK\_RELREG) is used to represent indirect 276af69d88dSmrgoperands. These values don't have SSA versions. The representation is 277af69d88dSmrgmostly based on the [[2]](#references). Indirect operand contains the 278af69d88dSmrg"offset/address" value (value::rel), (e.g. some SSA version of the AR 279af69d88dSmrgregister value, though after some passes it may be any value - constant, 280af69d88dSmrgregister, etc), also it contains the maydef and mayuse vectors of 281af69d88dSmrgpointers to **value**s (similar to dst/src vectors in the **node**) to 282af69d88dSmrgrepresent the effects of aliasing in the SSA form. 283af69d88dSmrg 284af69d88dSmrgE.g. if we have the array R5.x ... R8.x and the following instruction : 285af69d88dSmrg 286af69d88dSmrg MOV R0.x, R[5 + AR].x 287af69d88dSmrg 288af69d88dSmrgthen source indirect operand is represented with the VLK\_RELREG value, 289af69d88dSmrgvalue::rel is AR, value::maydef is empty (in fact it always contain the 290af69d88dSmrgsame number of elements as mayuse to simplify the handling, but they are 291af69d88dSmrgNULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the 292af69d88dSmrgcorresponding SSA versions after ssa\_rename). 293af69d88dSmrg 294af69d88dSmrgAdditional "virtual variables" as in [HSSA [2]](#references) are not 295af69d88dSmrgused, also there is no special handling for "zero versions". Typical 296af69d88dSmrgprograms in our case are small, indirect addressing is rare, array sizes 297af69d88dSmrgare limited by max gpr number, so we don't really need to use special 298af69d88dSmrgtricks to avoid the explosion of value versions. Also this allows more 299af69d88dSmrgprecise liveness computation for array elements without modifications to 300af69d88dSmrgthe algorithms. 301af69d88dSmrg 302af69d88dSmrgWith the following instruction: 303af69d88dSmrg 304af69d88dSmrg MOV R[5+AR].x, R0.x 305af69d88dSmrg 306af69d88dSmrgwe'll have both maydef and mayuse vectors for dst operand filled with 307af69d88dSmrgarray values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename 308af69d88dSmrgpass mayuse will contain previous versions, maydef will contain new 309af69d88dSmrgpotentially-defined versions. 310af69d88dSmrg 311af69d88dSmrg* * * * * 312af69d88dSmrg 313af69d88dSmrgPasses 314af69d88dSmrg------ 315af69d88dSmrg 316af69d88dSmrg- **bc\_parser** - creates the IR from the source bytecode, 317af69d88dSmrg initializes src and dst value vectors for instruction nodes. Most 318af69d88dSmrg ALU nodes have one dst operand and the number of source operands is 319af69d88dSmrg equal to the number of source operands for the ISA instruction. 320af69d88dSmrg Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst 321af69d88dSmrg gpr as in the original instruction, other two are pseudo-operands 322af69d88dSmrg that represent possibly updated predicate and exec\_mask. Predicate 323af69d88dSmrg values are used in the predicated alu instructions (node::pred), 324af69d88dSmrg exec\_mask values are used in the if\_nodes (if\_node::cond). Each 325af69d88dSmrg vector operand in the CF/TEX/VTX instructions is represented with 4 326af69d88dSmrg values - components of the vector. 327af69d88dSmrg 328af69d88dSmrg- **ssa\_prepare** - creates phi expressions. 329af69d88dSmrg 330af69d88dSmrg- **ssa\_rename** - renames the values (assigns versions). 331af69d88dSmrg 332af69d88dSmrg- **liveness** - liveness computation, sets 'dead' flag for unused 333af69d88dSmrg nodes and values, optionally computes interference information for 334af69d88dSmrg the values. 335af69d88dSmrg 336af69d88dSmrg- **dce\_cleanup** - eliminates 'dead' nodes, also removes some 337af69d88dSmrg unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP 338af69d88dSmrg instructions in the source, containers for ALU groups (they were 339af69d88dSmrg only needed for the ssa\_rename pass) 340af69d88dSmrg 341af69d88dSmrg- **if\_conversion** - converts control flow with if\_nodes to the 342af69d88dSmrg data flow in cases where it can improve performance (small alu-only 343af69d88dSmrg branches). Both branches are executed speculatively and the phi 344af69d88dSmrg expressions are replaced with conditional moves (CNDxx) to select 345af69d88dSmrg the final value using the same condition predicate as was used by 346af69d88dSmrg the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx 347af69d88dSmrg instruction, CNDxx now uses dst[0] from the same PREDSETxx 348af69d88dSmrg instruction. 349af69d88dSmrg 350af69d88dSmrg- **peephole** - peephole optimizations 351af69d88dSmrg 352af69d88dSmrg- **gvn** - Global Value Numbering [[2]](#references), 353af69d88dSmrg [[3]](#references) 354af69d88dSmrg 355af69d88dSmrg- **gcm** - Global Code Motion [[3]](#references). Also performs 356af69d88dSmrg grouping of the instructions of the same kind (CF/FETCH/ALU). 357af69d88dSmrg 358af69d88dSmrg- register allocation passes, some ideas are used from 359af69d88dSmrg [[4]](#references), but implementation is simplified to make it more 360af69d88dSmrg efficient in terms of the compilation speed (e.g. no recursive 361af69d88dSmrg recoloring) while achieving good enough results. 362af69d88dSmrg 363af69d88dSmrg - **ra\_split** - prepares the program to register allocation. 364af69d88dSmrg Splits live ranges for constrained values by inserting the 365af69d88dSmrg copies to/from temporary values, so that the live range of the 366af69d88dSmrg constrained values becomes minimal. 367af69d88dSmrg 368af69d88dSmrg - **ra\_coalesce** - performs global allocation on registers used 369af69d88dSmrg in CF/FETCH instructions. It's performed first to make sure they 370af69d88dSmrg end up in the same GPR. Also tries to allocate all values 371af69d88dSmrg involved in copies (inserted by the ra\_split pass) to the same 372af69d88dSmrg register, so that the copies may be eliminated. 373af69d88dSmrg 374af69d88dSmrg - **ra\_init** - allocates gpr arrays (if indirect addressing is 375af69d88dSmrg used), and remaining values. 376af69d88dSmrg 377af69d88dSmrg- **post\_scheduler** - ALU scheduler, handles VLIW packing and 378af69d88dSmrg performs the final register allocation for local values inside ALU 379af69d88dSmrg clauses. Eliminates all coalesced copies (if src and dst of the copy 380af69d88dSmrg are allocated to the same register). 381af69d88dSmrg 382af69d88dSmrg- **ra\_checker** - optional debugging pass that tries to catch basic 383af69d88dSmrg errors of the scheduler or regalloc, 384af69d88dSmrg 385af69d88dSmrg- **bc\_finalize** - propagates the regalloc information from values 386af69d88dSmrg in node::src and node::dst vectors to the bytecode fields, converts 387af69d88dSmrg control flow structure (region/depart/repeat) to the target 388af69d88dSmrg instructions (JUMP/ELSE/POP, 389af69d88dSmrg LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK). 390af69d88dSmrg 391af69d88dSmrg- **bc\_builder** - builds final bytecode, 392af69d88dSmrg 393af69d88dSmrg* * * * * 394af69d88dSmrg 395af69d88dSmrgReferences 396af69d88dSmrg---------- 397af69d88dSmrg 398af69d88dSmrg[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl 399af69d88dSmrgMcConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf) 400af69d88dSmrg 401af69d88dSmrg[2] ["Effective Representation of Aliases and Indirect Memory Operations 402af69d88dSmrgin SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark 403af69d88dSmrgStreich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf) 404af69d88dSmrg 405af69d88dSmrg[3] ["Global Code Motion. Global Value Numbering.", Cliff 406af69d88dSmrgClick](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf) 407af69d88dSmrg 408af69d88dSmrg[4] ["Register Allocation for Programs in SSA Form", Sebastian 409af69d88dSmrgHack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532) 410af69d88dSmrg 411af69d88dSmrg[5] ["An extension to the SSA representation for predicated code", 412af69d88dSmrgFrancois de 413af69d88dSmrgFerriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf) 414af69d88dSmrg 415af69d88dSmrg[6] ["Improvements to the Psi-SSA Representation", F. de 416af69d88dSmrgFerriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf) 417