IR3 NOTES
=========

Some notes about ir3, the compiler and machine-specific IR for the shader ISA introduced with adreno a3xx. The same shader ISA is present, with some small differences, in adreno a4xx.

Compared to the previous generation a2xx ISA (ir2), the a3xx ISA is a "simple" scalar instruction set. However, the compiler is responsible, in most cases, for scheduling the instructions: the hardware does not try to hide the shader core pipeline stages. For example, a common (cat2) ALU instruction takes four cycles, so a subsequent cat2 instruction which uses the result must have three intervening instructions (or NOPs). When operating on vec4's, the corresponding scalar instructions for the remaining three components can typically fill those slots. However, that leaves a lot of edge cases where things fall over, like:

::

  ADD TEMP[0], TEMP[1], TEMP[2]
  MUL TEMP[0], TEMP[1], TEMP[0].wzyx

Here, the second instruction needs the output of the first group of scalar instructions in the wrong order, resulting in not enough instruction slots between the ``add r0.w, r1.w, r2.w`` and ``mul r0.x, r1.x, r0.w``. This is why the original (old) compiler, which translated nearly literally from TGSI to ir3, had a strong tendency to fall over.

So the current compiler instead generates, in the frontend, a directed acyclic graph of instructions and basic blocks, which goes through various additional passes to eventually schedule and do register assignment.

For additional documentation about the hardware, see the wiki: `a3xx ISA
<https://github.com/freedreno/freedreno/wiki/A3xx-shader-instruction-set-architecture>`_.
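
The delay-slot arithmetic behind the ``.wzyx`` example can be sketched as follows. This is only an illustration, not code from the compiler: ``nops_needed`` and ``nops_for_swizzle`` are hypothetical helpers, and the sketch assumes the four scalar ``add``\s are issued in x, y, z, w order, followed by the four ``mul``\s in the same order.

```c
#include <assert.h>

/* From the text: a cat2 result needs three intervening instruction
 * slots before a dependent cat2 instruction can use it. */
#define CAT2_DELAY_SLOTS 3

/* Hypothetical helper: nops required in front of a consumer that would
 * otherwise sit at slot use_idx, when its source was issued at def_idx. */
unsigned nops_needed(unsigned def_idx, unsigned use_idx)
{
   assert(use_idx > def_idx);
   unsigned between = use_idx - def_idx - 1;
   return between >= CAT2_DELAY_SLOTS ? 0 : CAT2_DELAY_SLOTS - between;
}

/* Hypothetical helper: total nops for a scalarized
 * "ADD t0, ...; MUL t0, t1, t0.<swz>" pair, where mul component i
 * reads add component swz[i]. The adds occupy slots 0..3. */
unsigned nops_for_swizzle(const unsigned swz[4])
{
   unsigned pos = 4;   /* next free slot after the four adds */
   unsigned total = 0;

   for (unsigned i = 0; i < 4; i++) {
      assert(swz[i] < 4);
      unsigned n = nops_needed(swz[i], pos);
      total += n;
      pos += n + 1;    /* the nops, then the mul itself */
   }
   return total;
}
```

With the identity swizzle every ``mul`` finds its source three slots back and no ``nop`` is needed. With ``.wzyx`` the first ``mul`` reads the ``add`` issued immediately before it, so three ``nop``\s are required under these assumptions.
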

External Structure
------------------

``ir3_shader``
  A single vertex/fragment/etc shader from the gallium perspective (i.e.
  maps to a single TGSI shader). It manages a set of shader variants
  which are generated on demand based on the shader key.

``ir3_shader_key``
  The configuration key that identifies a shader variant. I.e. based
  on other GL state (two-sided-color, render-to-alpha, etc) or render
  stages (binning-pass vertex shader) different shader variants are
  generated.

``ir3_shader_variant``
  The actual hw shader generated based on input TGSI and shader key.

``ir3_compiler``
  Compiler frontend which generates ir3 and runs the various backend
  stages to schedule and do register assignment.

The IR
------

The ir3 IR maps quite directly to the hardware, in that instruction opcodes map directly to hardware opcodes, and dst/src register(s) map directly to the hardware dst/src register(s). But there are a few extensions, in the form of meta_ instructions. Additionally, for normal (non-const, etc) src registers, the ``IR3_REG_SSA`` flag is set and ``reg->instr`` points to the source instruction which produced that value. So, for example, the following TGSI shader:

::

  VERT
  DCL IN[0]
  DCL IN[1]
  DCL OUT[0], POSITION
  DCL TEMP[0], LOCAL
    1: DP3 TEMP[0].x, IN[0].xyzz, IN[1].xyzz
    2: MOV OUT[0], TEMP[0].xxxx
    3: END

eventually generates:

.. graphviz::

  digraph G {
  rankdir=RL;
  nodesep=0.25;
  ranksep=1.5;
  subgraph clusterdce198 {
  label="vert";
  inputdce198 [shape=record,label="inputs|<in0> i0.x|<in1> i0.y|<in2> i0.z|<in4> i1.x|<in5> i1.y|<in6> i1.z"];
  instrdcf348 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
  inputdce198:<in2>:w -> instrdcedd0:<src0>
  inputdce198:<in6>:w -> instrdcedd0:<src1>
  instrdcec30 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
  inputdce198:<in1>:w -> instrdcec30:<src0>
  inputdce198:<in5>:w -> instrdcec30:<src1>
  instrdceb60 [shape=record,style=filled,fillcolor=lightgrey,label="{mul.f|<dst0>|<src0> |<src1> }"];
  inputdce198:<in0>:w -> instrdceb60:<src0>
  inputdce198:<in4>:w -> instrdceb60:<src1>
  instrdceb60:<dst0> -> instrdcec30:<src2>
  instrdcec30:<dst0> -> instrdcedd0:<src2>
  instrdcedd0:<dst0> -> instrdcf348:<src0>
  instrdcf400 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0:<dst0> -> instrdcf400:<src0>
  instrdcf4b8 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0:<dst0> -> instrdcf4b8:<src0>
  outputdce198 [shape=record,label="outputs|<out0> o0.x|<out1> o0.y|<out2> o0.z|<out3> o0.w"];
  instrdcf348:<dst0> -> outputdce198:<out0>:e
  instrdcf400:<dst0> -> outputdce198:<out1>:e
  instrdcf4b8:<dst0> -> outputdce198:<out2>:e
  instrdcedd0:<dst0> -> outputdce198:<out3>:e
  }
  }

(after scheduling, etc, but before register assignment).

Internal Structure
~~~~~~~~~~~~~~~~~~

``ir3_block``
  Represents a basic block.

  TODO: currently blocks are nested, but I think I need to change that
  to a more conventional arrangement before implementing proper flow
  control. Currently the only flow control handled is if/else, which
  gets flattened out and results chosen with ``sel`` instructions.

``ir3_instruction``
  Represents a machine instruction or meta_ instruction. Has pointers
  to dst register (``regs[0]``) and src register(s) (``regs[1..n]``),
  as needed.

``ir3_register``
  Represents a src or dst register, flags indicate const/relative/etc.
  If ``IR3_REG_SSA`` is set on a src register, the actual register
  number (name) has not been assigned yet, and instead the ``instr``
  field points to the src instruction.

In addition there are various util macros/functions to simplify manipulation/traversal of the graph:

``foreach_src(srcreg, instr)``
  Iterate each instruction's source ``ir3_register``\s

``foreach_src_n(srcreg, n, instr)``
  Like ``foreach_src``, also setting ``n`` to the source number (starting
  with ``0``).

``foreach_ssa_src(srcinstr, instr)``
  Iterate each instruction's SSA source ``ir3_instruction``\s. This skips
  non-SSA sources (consts, etc), but includes virtual sources (such as the
  address register if `relative addressing`_ is used).

``foreach_ssa_src_n(srcinstr, n, instr)``
  Like ``foreach_ssa_src``, also setting ``n`` to the source number.

For example:

.. code-block:: c

  foreach_ssa_src_n(src, i, instr) {
    unsigned d = delay_calc_srcn(ctx, src, instr, i);
    delay = MAX2(delay, d);
  }


TODO probably other helper/util stuff worth mentioning here

.. _meta:

Meta Instructions
~~~~~~~~~~~~~~~~~

**input**
  Used for shader inputs (registers configured in the command-stream
  to hold particular input values, written by the shader core before
  start of execution). Also used for connecting up values within a
  basic block to an output of a previous block.

**output**
  Used to hold outputs of a basic block.

**flow**
  TODO

**phi**
  TODO

**collect**
  Groups registers which need to be assigned to consecutive scalar
  registers, for example ``sam`` (texture fetch) src instructions (see
  `register groups`_) or array element dereference
  (see `relative addressing`_).

**split**
  The counterpart to **collect**: when an instruction such as ``sam``
  writes multiple components, splits the result into individual
  scalar components to be consumed by other instructions.


.. _`flow control`:

Flow Control
~~~~~~~~~~~~

TODO


.. _`register groups`:

Register Groups
~~~~~~~~~~~~~~~

Certain instructions, such as texture sample instructions, consume multiple consecutive scalar registers via a single src register encoded in the instruction, and/or write multiple consecutive scalar registers. In the simplest example:

::

  sam (f32)(xyz)r2.x, r0.z, s#0, t#0

for a 2d texture, would read ``r0.zw`` to get the coordinate, and write ``r2.xyz``.

Before register assignment, to group the two components of the texture src together:

.. graphviz::

  digraph G {
    { rank=same;
      collect;
    };
    { rank=same;
      coord_x;
      coord_y;
    };
    sam -> collect [label="regs[1]"];
    collect -> coord_x [label="regs[1]"];
    collect -> coord_y [label="regs[2]"];
    coord_x -> coord_y [label="right",style=dotted];
    coord_y -> coord_x [label="left",style=dotted];
    coord_x [label="coord.x"];
    coord_y [label="coord.y"];
  }

The frontend sets up the SSA pointers from the ``sam`` source register to the ``collect`` meta instruction, which in turn points to the instructions producing the ``coord.x`` and ``coord.y`` values. The grouping_ pass then sets up the ``left`` and ``right`` neighbor pointers on the ``collect``\'s sources, which are used later by the `register assignment`_ pass to assign blocks of scalar registers.
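
A minimal sketch of how such neighbor links could be set up and later consumed is shown here. ``struct node``, ``group_sources`` and ``assign_group`` are hypothetical names for illustration, not the actual ir3 structures or passes, and the conflict handling is reduced to returning an error where the real compiler would insert a ``mov``.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for an instruction node. */
struct node {
   struct node *left, *right;  /* neighbor links */
   int reg;                    /* assigned scalar register, -1 if none */
};

/* True if target is reachable from `from` by following right links. */
static int reaches_right(const struct node *from, const struct node *target)
{
   for (; from; from = from->right)
      if (from == target)
         return 1;
   return 0;
}

/* Link the n sources of one group so srcs[i] sits directly left of
 * srcs[i + 1]. Returns 0 on success, or -1 on a conflict (a node would
 * need two different neighbors on one side, or a cycle would form).
 * Links made before the conflict was found are left in place. */
int group_sources(struct node **srcs, size_t n)
{
   for (size_t i = 0; i + 1 < n; i++) {
      struct node *l = srcs[i], *r = srcs[i + 1];

      if (l->right == r && r->left == l)
         continue;  /* already neighbors */
      if (l->right || r->left || reaches_right(r, l))
         return -1;
      l->right = r;
      r->left = l;
   }
   return 0;
}

/* Give a whole neighbor chain consecutive registers, starting at base
 * for the leftmost member. Returns the number of nodes assigned. */
unsigned assign_group(struct node *any, int base)
{
   assert(any != NULL);
   while (any->left)
      any = any->left;

   unsigned count = 0;
   for (struct node *n = any; n; n = n->right)
      n->reg = base + (int)count++;
   return count;
}
```

For the ``sam`` example above, linking ``coord.x`` and ``coord.y`` and then assigning the chain from scalar register 2 would place them in ``r0.z`` and ``r0.w``, assuming scalar registers are numbered four per ``r`` register.
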

Likewise, the consecutive scalar registers for the destination are grouped through ``split`` instructions:

.. graphviz::

  digraph {
    { rank=same;
      A;
      B;
      C;
    };
    { rank=same;
      split_0;
      split_1;
      split_2;
    };
    A -> split_0;
    B -> split_1;
    C -> split_2;
    split_0 [label="split\noff=0"];
    split_0 -> sam;
    split_1 [label="split\noff=1"];
    split_1 -> sam;
    split_2 [label="split\noff=2"];
    split_2 -> sam;
    split_0 -> split_1 [label="right",style=dotted];
    split_1 -> split_0 [label="left",style=dotted];
    split_1 -> split_2 [label="right",style=dotted];
    split_2 -> split_1 [label="left",style=dotted];
    sam;
  }

.. _`relative addressing`:

Relative Addressing
~~~~~~~~~~~~~~~~~~~

Most instructions support addressing indirectly (relative to the address register) into the const or gpr register file in some or all of their src/dst registers. In this case the register accessed is taken from ``r<a0.x + n>`` or ``c<a0.x + n>``, i.e. the address register (``a0.x``) value plus ``n``, where ``n`` is encoded in the instruction (rather than the absolute register number).

  Note that cat5 (texture sample) instructions are the notable exception, not
  supporting relative addressing of src or dst.

Relative addressing of the const file (for example, a uniform array) is relatively simple. We don't do register assignment of the const file, so all that is required is to schedule things properly. That is,
the instruction that writes the address register must be scheduled first, and we cannot have two different address register values live at one time.

But relative addressing of the gpr file (which can be as src or dst) has additional restrictions on register assignment (i.e. the array elements must be assigned to consecutive scalar registers). And in the case of relative dst, subsequent instructions now depend on both the relative write and the previous instruction which wrote that register, since we do not know at compile time which actual register was written.

Each instruction has an optional ``address`` pointer, to capture the dependency on the address register value when relative addressing is used for any of the src/dst register(s). This behaves as an additional virtual src register, i.e. ``foreach_ssa_src()`` will also iterate the address register (last).

  Note that ``nop``\'s for timing constraints, type specifiers (i.e.
  ``add.f`` vs ``add.u``), etc, are omitted for brevity in the examples.

::

  mova a0.x, hr1.y
  sub r1.y, r2.x, r3.x
  add r0.x, r1.y, c<a0.x + 2>

results in:

.. graphviz::

  digraph {
    rankdir=LR;
    sub;
    const [label="const file"];
    add;
    mova;
    add -> mova;
    add -> sub;
    add -> const [label="off=2"];
  }

The scheduling pass has some smarts to schedule things such that only a single ``a0.x`` value is used at any one time.
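
The "address register as a virtual last source" behaviour can be sketched with simplified types. The names below (``struct instr``, ``struct src``, ``ssa_sources``) are assumptions made for this example and are not the real ir3 definitions or the real ``foreach_ssa_src()`` macro.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SRCS 4

struct instr;

/* Hypothetical source operand: SSA when def is set, otherwise a
 * non-SSA operand such as a const-file reference. */
struct src {
   struct instr *def;
};

/* Hypothetical, simplified instruction. */
struct instr {
   const char *name;
   unsigned nsrcs;
   struct src srcs[MAX_SRCS];
   struct instr *address;   /* writer of a0.x, or NULL if unused */
};

/* Collect the SSA sources in iteration order: regular SSA sources
 * first (non-SSA ones are skipped), then the address writer last.
 * out must have room for MAX_SRCS + 1 entries. */
unsigned ssa_sources(const struct instr *in, const struct instr **out)
{
   assert(in->nsrcs <= MAX_SRCS);

   unsigned n = 0;
   for (unsigned i = 0; i < in->nsrcs; i++)
      if (in->srcs[i].def)
         out[n++] = in->srcs[i].def;
   if (in->address)
      out[n++] = in->address;
   return n;
}
```

Applied to the ``mova`` / ``sub`` / ``add`` example, the ``add`` would report two SSA sources, ``sub`` and then ``mova``, while the const-file operand is skipped. This matches the edges ``add -> sub`` and ``add -> mova`` in the graph above.
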

To implement variable arrays, the NIR registers are stored as an ``ir3_array``,
which will be register allocated to consecutive hardware registers. The array
access uses the id field in the ``ir3_register`` to map to the array being
accessed, and the offset field for the fixed offset within the array. A NIR
indirect register read such as:

::

  decl_reg vec2 32 r0[2]
  ...
  vec2 32 ssa_19 = mov r0[0 + ssa_9]


results in:

::

  0000:0000:001: shl.b hssa_19, hssa_17, himm[0.000000,1,0x1]
  0000:0000:002: mov.s16s16 hr61.x, hssa_19
  0000:0000:002: mov.u32u32 ssa_21, arr[id=1, offset=0, size=4, ssa_12], address=_[0000:0000:002: mov.s16s16]
  0000:0000:002: mov.u32u32 ssa_22, arr[id=1, offset=1, size=4, ssa_12], address=_[0000:0000:002: mov.s16s16]


Array writes write to the array in ``instr->regs[0]->array.id``. A NIR indirect
register write such as:

::

  decl_reg vec2 32 r0[2]
  ...
  r0[0 + ssa_12] = mov ssa_13

results in:

::

  0000:0000:001: shl.b hssa_29, hssa_27, himm[0.000000,1,0x1]
  0000:0000:002: mov.s16s16 hr61.x, hssa_29
  0000:0000:001: mov.u32u32 arr[id=1, offset=0, size=4, ssa_17], c2.y, address=_[0000:0000:002: mov.s16s16]
  0000:0000:004: mov.u32u32 arr[id=1, offset=1, size=4, ssa_31], c2.z, address=_[0000:0000:002: mov.s16s16]

Note that only cat1 (mov) can do indirect write, and thus NIR register stores
may need to introduce an extra mov.

ir3 array accesses in the DAG get serialized by ``instr->barrier_class``
containing ``IR3_BARRIER_ARRAY_W`` or ``IR3_BARRIER_ARRAY_R``.

Shader Passes
-------------

After the frontend has generated the use-def graph of instructions, they are run through various passes which include scheduling_ and `register assignment`_. Because inserting ``mov`` instructions after scheduling would also require inserting additional ``nop`` instructions (since it is too late to reschedule to try and fill the bubbles), the earlier stages try to ensure (at least given an infinite supply of registers) that `register assignment`_ after scheduling_ cannot fail.

  Note that we essentially have ~256 scalar registers in the
  architecture (although larger register usage will at some thresholds
  limit the number of threads which can run in parallel). And at some
  point we will have to deal with spilling.

.. _flatten:

Flatten
~~~~~~~

In this stage, simple if/else blocks are flattened into a single block with ``phi`` nodes converted into ``sel`` instructions. The a3xx ISA has very few predicated instructions, and we would prefer not to use branches for simple if/else.


.. _`copy propagation`:

Copy Propagation
~~~~~~~~~~~~~~~~

Currently the frontend inserts ``mov``\s in various cases, because certain categories of instructions have limitations about const regs as sources. The CP pass simply removes all simple ``mov``\s (i.e. src-type is same as dst-type, no abs/neg flags, etc).

The eventual plan is to invert that, with the frontend inserting no ``mov``\s and CP legalizing things.


.. _grouping:

Grouping
~~~~~~~~

In the grouping pass, instructions which need to be grouped (for ``collect``\s, etc) have their ``left`` / ``right`` neighbor pointers set up. In cases where there is a conflict (i.e. one instruction cannot have two unique left or right neighbors), an additional ``mov`` instruction is inserted. This ensures that there is some possible valid `register assignment`_ at the later stages.


.. _depth:

Depth
~~~~~

In the depth pass, a depth is calculated for each instruction node within its basic block. The depth is the sum of the required cycles (delay slots needed between two instructions plus one) of each instruction plus the max depth of any of its source instructions. (meta_ instructions don't add to the depth).
As an instruction's depth is calculated, it is inserted into a per block list sorted by deepest instruction. Unreachable instructions and inputs are marked.

  TODO: we should probably calculate both hard and soft depths (?) to
  try to coax additional instructions to fit in places where we need
  to use sync bits, such as after a texture fetch or SFU.

.. _scheduling:

Scheduling
~~~~~~~~~~

After the grouping_ pass, there are no more instructions to insert or remove. Start scheduling each basic block from the deepest node in the depth sorted list created by the depth_ pass, recursively trying to schedule each instruction after its source instructions plus delay slots. Insert ``nop``\s as required.

.. _`register assignment`:

Register Assignment
~~~~~~~~~~~~~~~~~~~

TODO
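
Until this section is written, the input that register assignment receives is determined by the depth_ and scheduling_ passes above. As a rough illustration of the depth_ calculation only, the following sketch uses a hypothetical ``struct dnode`` and ``calc_depth`` helper. It assumes each non-meta instruction costs its own delay slots plus one, which is one reading of the description in the depth_ section, and the delay values used with it are made up.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SRCS 3

/* Hypothetical, simplified instruction node. */
struct dnode {
   unsigned delay;               /* delay slots required (assumed value) */
   int is_meta;                  /* meta instructions add no depth */
   unsigned nsrcs;
   struct dnode *srcs[MAX_SRCS];
   int have_depth;               /* set once depth has been computed */
   unsigned depth;
};

/* depth = own cost + max depth over all sources, where the cost is
 * delay + 1 for real instructions and 0 for meta instructions. */
unsigned calc_depth(struct dnode *n)
{
   assert(n->nsrcs <= MAX_SRCS);
   if (n->have_depth)
      return n->depth;

   unsigned max_src = 0;
   for (unsigned i = 0; i < n->nsrcs; i++) {
      unsigned d = calc_depth(n->srcs[i]);
      if (d > max_src)
         max_src = d;
   }

   n->depth = max_src + (n->is_meta ? 0 : n->delay + 1);
   n->have_depth = 1;
   return n->depth;
}
```

For a ``mul`` feeding two chained ``mad``\s that feed a meta output, with an assumed delay of three slots each, the depths come out as 4, 8, 12 and 12. A scheduler in the style described in scheduling_ would then start from the node with the largest depth.
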