IR3 NOTES
=========

Some notes about ir3, the compiler and machine-specific IR for the shader ISA introduced with adreno a3xx.  The same shader ISA is present, with some small differences, in adreno a4xx.

Compared to the previous generation a2xx ISA (ir2), the a3xx ISA is a "simple" scalar instruction set.  However, the compiler is responsible, in most cases, for scheduling the instructions.  The hardware does not try to hide the shader core pipeline stages.  For example, a common (cat2) ALU instruction takes four cycles, so a subsequent cat2 instruction which uses the result must have three intervening instructions (or NOPs).  When operating on vec4's, the corresponding scalar instructions for the remaining three components can typically fill those slots.  However, that leaves a lot of edge cases where things fall over, like:

::

  ADD TEMP[0], TEMP[1], TEMP[2]
  MUL TEMP[0], TEMP[1], TEMP[0].wzyx

Here, the second instruction needs the output of the first group of scalar instructions in the wrong order, so there are not enough instruction slots between the ``add r0.w, r1.w, r2.w`` and the ``mul r0.x, r1.x, r0.w``.  This is why the original (old) compiler, which merely translated nearly literally from TGSI to ir3, had a strong tendency to fall over.
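
To make the timing constraint concrete, here is a minimal sketch (not taken from the compiler) that counts the NOPs needed when a scalarized sequence is issued strictly in order.  It assumes the three-slot cat2 rule above applies uniformly.  The ``toy_`` names and the flat register numbering are hypothetical, and the ``mul`` writes a separate register ``r3`` instead of ``TEMP[0]`` so that the toy model only has to track read-after-write timing.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NUM_REGS    16  /* r0.x..r3.w flattened to 0..15 (toy numbering) */
#define TOY_DELAY_SLOTS 3   /* cat2 -> cat2, as described above */

/* Hypothetical scalar instruction: one dst, up to two srcs (-1 = unused). */
struct toy_instr {
   int dst;
   int src[2];
};

/* Issue the instructions in order and return how many NOPs are needed so
 * that no instruction reads a register before its producer's result is
 * available. */
static unsigned
toy_count_nops(const struct toy_instr *instrs, size_t count)
{
   unsigned ready[TOY_NUM_REGS] = {0};  /* first slot each reg may be read */
   unsigned slot = 0, nops = 0;

   for (size_t i = 0; i < count; i++) {
      unsigned issue = slot;

      for (int s = 0; s < 2; s++) {
         int src = instrs[i].src[s];
         if (src < 0)
            continue;
         assert(src < TOY_NUM_REGS);
         if (ready[src] > issue)
            issue = ready[src];
      }

      assert(instrs[i].dst >= 0 && instrs[i].dst < TOY_NUM_REGS);
      nops += issue - slot;
      ready[instrs[i].dst] = issue + 1 + TOY_DELAY_SLOTS;
      slot = issue + 1;
   }
   return nops;
}

/* add r0, r1, r2 followed by mul r3, r1, r0 (swizzled = 0) or
 * mul r3, r1, r0.wzyx (swizzled = 1), scalarized component by component. */
static unsigned
toy_vec4_nops(int swizzled)
{
   struct toy_instr seq[8];

   for (int c = 0; c < 4; c++) {
      seq[c].dst = 0 + c;                        /* r0.c */
      seq[c].src[0] = 4 + c;                     /* r1.c */
      seq[c].src[1] = 8 + c;                     /* r2.c */
      seq[4 + c].dst = 12 + c;                   /* r3.c */
      seq[4 + c].src[0] = 4 + c;                 /* r1.c */
      seq[4 + c].src[1] = swizzled ? 3 - c : c;  /* r0.wzyx or r0.xyzw */
   }
   return toy_count_nops(seq, 8);
}
```

Under these assumptions the unswizzled case needs no NOPs, while the ``.wzyx`` case needs three, all of them in front of the first ``mul``.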

So the current compiler instead, in the frontend, generates a directed-acyclic-graph of instructions and basic blocks, which go through various additional passes to eventually schedule and do register assignment.

For additional documentation about the hardware, see wiki: `a3xx ISA
<https://github.com/freedreno/freedreno/wiki/A3xx-shader-instruction-set-architecture>`_.

External Structure
------------------

``ir3_shader``
    A single vertex/fragment/etc shader from gallium perspective (i.e.
    maps to a single TGSI shader), and manages a set of shader variants
    which are generated on demand based on the shader key.

``ir3_shader_key``
    The configuration key that identifies a shader variant.  I.e. based
    on other GL state (two-sided-color, render-to-alpha, etc) or render
    stages (binning-pass vertex shader) different shader variants are
    generated.

``ir3_shader_variant``
    The actual hw shader generated based on input TGSI and shader key.

``ir3_compiler``
    Compiler frontend which generates ir3 and runs the various backend
    stages to schedule and do register assignment.

The IR
------

The ir3 IR maps quite directly to the hardware, in that instruction opcodes map directly to hardware opcodes, and that dst/src register(s) map directly to the hardware dst/src register(s).  But there are a few extensions, in the form of meta_ instructions.  And additionally, for normal (non-const, etc) src registers, the ``IR3_REG_SSA`` flag is set and ``reg->instr`` points to the source instruction which produced that value.  So, for example, the following TGSI shader:

::

  VERT
  DCL IN[0]
  DCL IN[1]
  DCL OUT[0], POSITION
  DCL TEMP[0], LOCAL
    1: DP3 TEMP[0].x, IN[0].xyzz, IN[1].xyzz
    2: MOV OUT[0], TEMP[0].xxxx
    3: END

eventually generates:

.. graphviz::

  digraph G {
  rankdir=RL;
  nodesep=0.25;
  ranksep=1.5;
  subgraph clusterdce198 {
  label="vert";
  inputdce198 [shape=record,label="inputs|<in0> i0.x|<in1> i0.y|<in2> i0.z|<in4> i1.x|<in5> i1.y|<in6> i1.z"];
  instrdcf348 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
  inputdce198:<in2>:w -> instrdcedd0:<src0>
  inputdce198:<in6>:w -> instrdcedd0:<src1>
  instrdcec30 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
  inputdce198:<in1>:w -> instrdcec30:<src0>
  inputdce198:<in5>:w -> instrdcec30:<src1>
  instrdceb60 [shape=record,style=filled,fillcolor=lightgrey,label="{mul.f|<dst0>|<src0> |<src1> }"];
  inputdce198:<in0>:w -> instrdceb60:<src0>
  inputdce198:<in4>:w -> instrdceb60:<src1>
  instrdceb60:<dst0> -> instrdcec30:<src2>
  instrdcec30:<dst0> -> instrdcedd0:<src2>
  instrdcedd0:<dst0> -> instrdcf348:<src0>
  instrdcf400 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0:<dst0> -> instrdcf400:<src0>
  instrdcf4b8 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0:<dst0> -> instrdcf4b8:<src0>
  outputdce198 [shape=record,label="outputs|<out0> o0.x|<out1> o0.y|<out2> o0.z|<out3> o0.w"];
  instrdcf348:<dst0> -> outputdce198:<out0>:e
  instrdcf400:<dst0> -> outputdce198:<out1>:e
  instrdcf4b8:<dst0> -> outputdce198:<out2>:e
  instrdcedd0:<dst0> -> outputdce198:<out3>:e
  }
  }

(after scheduling, etc, but before register assignment).

Internal Structure
~~~~~~~~~~~~~~~~~~

``ir3_block``
    Represents a basic block.

    TODO: currently blocks are nested, but I think I need to change that
    to a more conventional arrangement before implementing proper flow
    control.  Currently the only flow control handled is if/else, which
    gets flattened out, with the results chosen by ``sel`` instructions.

``ir3_instruction``
    Represents a machine instruction or meta_ instruction.  Has pointers
    to dst register (``regs[0]``) and src register(s) (``regs[1..n]``),
    as needed.

``ir3_register``
    Represents a src or dst register, flags indicate const/relative/etc.
    If ``IR3_REG_SSA`` is set on a src register, the actual register
    number (name) has not been assigned yet, and instead the ``instr``
    field points to the src instruction.
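
The relationships described above can be sketched with a few simplified declarations.  These are not the real definitions from the compiler sources: the field layout, the flag values, and every ``toy_`` name are assumptions made for illustration.  Only the ``regs[0]`` dst / ``regs[1..n]`` src convention, the role of ``IR3_REG_SSA``, and the ``instr`` back-pointer come from the descriptions above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_REG_SSA   0x1  /* stands in for IR3_REG_SSA; the value is made up */
#define TOY_REG_CONST 0x2  /* hypothetical flag for a const-file source */
#define TOY_MAX_REGS  4

struct toy_instruction;

struct toy_register {
   unsigned flags;
   int num;                         /* register number, once assigned */
   struct toy_instruction *instr;   /* producer, when TOY_REG_SSA is set */
};

struct toy_instruction {
   const char *name;
   unsigned regs_count;
   struct toy_register *regs[TOY_MAX_REGS];  /* [0] = dst, [1..n] = srcs */
};

/* Count the sources that still point at a producing instruction. */
static unsigned
toy_count_ssa_srcs(const struct toy_instruction *instr)
{
   unsigned n = 0;

   assert(instr->regs_count >= 1 && instr->regs_count <= TOY_MAX_REGS);
   for (unsigned i = 1; i < instr->regs_count; i++) {
      const struct toy_register *src = instr->regs[i];
      if (src->flags & TOY_REG_SSA) {
         assert(src->instr != NULL);
         n++;
      }
   }
   return n;
}

/* Build "mul t, c0, c1" feeding "add d, t, c2" and return the number of
 * SSA sources of the add. */
static unsigned
toy_demo_ssa_srcs(void)
{
   struct toy_register c0 = { TOY_REG_CONST, 0, NULL };
   struct toy_register c1 = { TOY_REG_CONST, 1, NULL };
   struct toy_register c2 = { TOY_REG_CONST, 2, NULL };
   struct toy_register mul_dst = { 0, -1, NULL };
   struct toy_register add_dst = { 0, -1, NULL };
   struct toy_instruction mul = { "mul", 3, { &mul_dst, &c0, &c1 } };
   struct toy_register add_src = { TOY_REG_SSA, -1, &mul };
   struct toy_instruction add = { "add", 3, { &add_dst, &add_src, &c2 } };

   assert(toy_count_ssa_srcs(&mul) == 0);  /* both srcs are consts */
   return toy_count_ssa_srcs(&add);
}
```

In this toy graph the ``add`` has exactly one SSA source (the ``mul``), while its const source carries a register number and no producer.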

In addition there are various util macros/functions to simplify manipulation/traversal of the graph:

``foreach_src(srcreg, instr)``
    Iterate each instruction's source ``ir3_register``\s

``foreach_src_n(srcreg, n, instr)``
    Like ``foreach_src``, also setting ``n`` to the source number (starting
    with ``0``).

``foreach_ssa_src(srcinstr, instr)``
    Iterate each instruction's SSA source ``ir3_instruction``\s.  This skips
    non-SSA sources (consts, etc), but includes virtual sources (such as the
    address register if `relative addressing`_ is used).

``foreach_ssa_src_n(srcinstr, n, instr)``
    Like ``foreach_ssa_src``, also setting ``n`` to the source number.

For example:

.. code-block:: c

  foreach_ssa_src_n(src, i, instr) {
    unsigned d = delay_calc_srcn(ctx, src, instr, i);
    delay = MAX2(delay, d);
  }


TODO probably other helper/util stuff worth mentioning here

.. _meta:

Meta Instructions
~~~~~~~~~~~~~~~~~

**input**
    Used for shader inputs (registers configured in the command-stream
    to hold particular input values, written by the shader core before
    start of execution).  Also used for connecting up values within a
    basic block to an output of a previous block.

**output**
    Used to hold outputs of a basic block.

**flow**
    TODO

**phi**
    TODO

**collect**
    Groups registers which need to be assigned to consecutive scalar
    registers, for example ``sam`` (texture fetch) src instructions (see
    `register groups`_) or array element dereference
    (see `relative addressing`_).

**split**
    The counterpart to **collect**: when an instruction such as ``sam``
    writes multiple components, splits the result into individual
    scalar components to be consumed by other instructions.


.. _`flow control`:

Flow Control
~~~~~~~~~~~~

TODO


.. _`register groups`:

Register Groups
~~~~~~~~~~~~~~~

Certain instructions, such as texture sample instructions, consume multiple consecutive scalar registers via a single src register encoded in the instruction, and/or write multiple consecutive scalar registers.  In the simplest example:

::

  sam (f32)(xyz)r2.x, r0.z, s#0, t#0

for a 2d texture, would read ``r0.zw`` to get the coordinate, and write ``r2.xyz``.

Before register assignment, to group the two components of the texture src together:

.. graphviz::

  digraph G {
    { rank=same;
      collect;
    };
    { rank=same;
      coord_x;
      coord_y;
    };
    sam -> collect [label="regs[1]"];
    collect -> coord_x [label="regs[1]"];
    collect -> coord_y [label="regs[2]"];
    coord_x -> coord_y [label="right",style=dotted];
    coord_y -> coord_x [label="left",style=dotted];
    coord_x [label="coord.x"];
    coord_y [label="coord.y"];
  }

The frontend sets up the SSA pointers from the ``sam`` source register to the ``collect`` meta instruction, which in turn points to the instructions producing the ``coord.x`` and ``coord.y`` values.  The grouping_ pass then sets up the ``left`` and ``right`` neighbor pointers on the ``collect``\'s sources, used later by the `register assignment`_ pass to assign blocks of scalar registers.

And likewise, for the consecutive scalar registers for the destination:

.. graphviz::

  digraph {
    { rank=same;
      A;
      B;
      C;
    };
    { rank=same;
      split_0;
      split_1;
      split_2;
    };
    A -> split_0;
    B -> split_1;
    C -> split_2;
    split_0 [label="split\noff=0"];
    split_0 -> sam;
    split_1 [label="split\noff=1"];
    split_1 -> sam;
    split_2 [label="split\noff=2"];
    split_2 -> sam;
    split_0 -> split_1 [label="right",style=dotted];
    split_1 -> split_0 [label="left",style=dotted];
    split_1 -> split_2 [label="right",style=dotted];
    split_2 -> split_1 [label="left",style=dotted];
    sam;
  }
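
One way the ``left`` / ``right`` neighbor pointers could be consumed is sketched below: walk to the leftmost member of a group, then hand out consecutive scalar registers going right.  This is only an illustration of why the chain is useful, not the actual register assignment pass (which is not described in these notes).  The ``toy_`` names and the flat numbering ``rN.x = 4 * N`` are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical group member with neighbor pointers (NULL at the ends). */
struct toy_node {
   struct toy_node *left, *right;
   int reg;                         /* assigned scalar register, -1 if none */
};

/* Assign consecutive registers, starting at base, to the whole group that
 * node belongs to.  Returns the number of registers used. */
static int
toy_assign_group(struct toy_node *node, int base)
{
   int n = 0;

   assert(node != NULL);
   while (node->left)               /* rewind to the leftmost member */
      node = node->left;
   for (; node; node = node->right)
      node->reg = base + n++;
   return n;
}

/* Three split nodes for the xyz result of a sam writing r2.x. */
static int
toy_demo_sam_dst(void)
{
   struct toy_node split[3];

   for (int i = 0; i < 3; i++) {
      split[i].left = i > 0 ? &split[i - 1] : NULL;
      split[i].right = i < 2 ? &split[i + 1] : NULL;
      split[i].reg = -1;
   }

   /* Start from the middle node; r2.x is 2 * 4 + 0 = 8 in the toy numbering. */
   int used = toy_assign_group(&split[1], 8);

   return used == 3 && split[0].reg == 8 && split[1].reg == 9 &&
          split[2].reg == 10;
}
```

Starting from any member of the group gives the same result, because the walk always rewinds to the leftmost node first.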

.. _`relative addressing`:

Relative Addressing
~~~~~~~~~~~~~~~~~~~

Most instructions support indirect addressing (relative to the address register) into the const or gpr register file in some or all of their src/dst registers.  In this case the register accessed is taken from ``r<a0.x + n>`` or ``c<a0.x + n>``, i.e. address register (``a0.x``) value plus ``n``, where ``n`` is encoded in the instruction (rather than the absolute register number).

    Note that cat5 (texture sample) instructions are the notable exception, not
    supporting relative addressing of src or dst.

Relative addressing of the const file (for example, a uniform array) is relatively simple.  We don't do register assignment of the const file, so all that is required is to schedule things properly.  I.e. the instruction that writes the address register must be scheduled first, and we cannot have two different address register values live at one time.

But relative addressing of the gpr file (which can be as src or dst) has additional restrictions on register assignment (i.e. the array elements must be assigned to consecutive scalar registers).  And in the case of relative dst, subsequent instructions now depend on both the relative write and the previous instruction which wrote that register, since we do not know at compile time which actual register was written.

Each instruction has an optional ``address`` pointer, to capture the dependency on the address register value when relative addressing is used for any of the src/dst register(s).  This behaves as an additional virtual src register, i.e. ``foreach_ssa_src()`` will also iterate the address register (last).

    Note that ``nop``\'s for timing constraints, type specifiers (i.e.
    ``add.f`` vs ``add.u``), etc, are omitted for brevity in the examples.

::

  mova a0.x, hr1.y
  sub r1.y, r2.x, r3.x
  add r0.x, r1.y, c<a0.x + 2>

results in:

.. graphviz::

  digraph {
    rankdir=LR;
    sub;
    const [label="const file"];
    add;
    mova;
    add -> mova;
    add -> sub;
    add -> const [label="off=2"];
  }

The scheduling pass has some smarts to schedule things such that only a single ``a0.x`` value is used at any one time.
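
The single-live-``a0.x`` constraint can be stated as a check over an already scheduled sequence: every instruction with an ``address`` dependency must find that the most recent ``a0.x`` writer is the one it depends on.  The sketch below is a hypothetical validator written for illustration, not code from the scheduler; the ``toy_`` types are assumptions.

```c
#include <assert.h>
#include <stddef.h>

struct toy_instr {
   int writes_a0;                    /* nonzero for a mova-like instruction */
   const struct toy_instr *address;  /* a0.x writer this instr depends on */
};

/* Return 1 if, at each relative access, the live a0.x value is the one the
 * instruction expects, 0 otherwise. */
static int
toy_address_order_ok(const struct toy_instr *const *sched, size_t count)
{
   const struct toy_instr *live = NULL;  /* most recent a0.x writer */

   for (size_t i = 0; i < count; i++) {
      const struct toy_instr *instr = sched[i];

      assert(instr != NULL);
      if (instr->address && instr->address != live)
         return 0;
      if (instr->writes_a0)
         live = instr;
   }
   return 1;
}

/* Two mova/user pairs, scheduled either pair by pair (clobbered = 0) or
 * with the second mova hoisted above the first user (clobbered = 1). */
static int
toy_demo_address_order(int clobbered)
{
   struct toy_instr mova1 = { 1, NULL }, mova2 = { 1, NULL };
   struct toy_instr use1 = { 0, &mova1 }, use2 = { 0, &mova2 };
   const struct toy_instr *good[] = { &mova1, &use1, &mova2, &use2 };
   const struct toy_instr *bad[] = { &mova1, &mova2, &use1, &use2 };

   return toy_address_order_ok(clobbered ? bad : good, 4);
}
```

The hoisted order is rejected because the first user would read ``a0.x`` after the second ``mova`` has already overwritten it.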

To implement variable arrays, the NIR registers are stored as an ``ir3_array``,
which will be register allocated to consecutive hardware registers.  The array
access uses the id field in the ``ir3_register`` to map to the array being
accessed, and the offset field for the fixed offset within the array.  A NIR
indirect register read such as:

::

  decl_reg vec2 32 r0[2]
  ...
  vec2 32 ssa_19 = mov r0[0 + ssa_9]


results in:

::

  0000:0000:001:  shl.b hssa_19, hssa_17, himm[0.000000,1,0x1]
  0000:0000:002:  mov.s16s16 hr61.x, hssa_19
  0000:0000:002:  mov.u32u32 ssa_21, arr[id=1, offset=0, size=4, ssa_12], address=_[0000:0000:002:  mov.s16s16]
  0000:0000:002:  mov.u32u32 ssa_22, arr[id=1, offset=1, size=4, ssa_12], address=_[0000:0000:002:  mov.s16s16]


Array writes write to the array in ``instr->regs[0]->array.id``.  A NIR indirect
register write such as:

::

  decl_reg vec2 32 r0[2]
  ...
  r0[0 + ssa_12] = mov ssa_13

results in:

::

  0000:0000:001:  shl.b hssa_29, hssa_27, himm[0.000000,1,0x1]
  0000:0000:002:  mov.s16s16 hr61.x, hssa_29
  0000:0000:001:  mov.u32u32 arr[id=1, offset=0, size=4, ssa_17], c2.y, address=_[0000:0000:002:  mov.s16s16]
  0000:0000:004:  mov.u32u32 arr[id=1, offset=1, size=4, ssa_31], c2.z, address=_[0000:0000:002:  mov.s16s16]

Note that only cat1 (mov) can do an indirect write, and thus NIR register stores
may need to introduce an extra mov.

ir3 array accesses in the DAG get serialized by ``instr->barrier_class``
containing ``IR3_BARRIER_ARRAY_W`` or ``IR3_BARRIER_ARRAY_R``.

Shader Passes
-------------

After the frontend has generated the use-def graph of instructions, they are run through various passes which include scheduling_ and `register assignment`_.  Because inserting ``mov`` instructions after scheduling would also require inserting additional ``nop`` instructions (since it is too late to reschedule to try and fill the bubbles), the earlier stages try to ensure (at least given an infinite supply of registers) that `register assignment`_ after scheduling_ cannot fail.

    Note that we essentially have ~256 scalar registers in the
    architecture (although larger register usage will at some thresholds
    limit the number of threads which can run in parallel).  And at some
    point we will have to deal with spilling.

.. _flatten:

Flatten
~~~~~~~

In this stage, simple if/else blocks are flattened into a single block with ``phi`` nodes converted into ``sel`` instructions.  The a3xx ISA has very few predicated instructions, and we would prefer not to use branches for simple if/else.
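
As a rough analogy in C (an illustration only; the pass itself operates on ir3 blocks, and all names here are made up), flattening turns a branch into straight-line code that computes both sides and then selects one result:

```c
#include <assert.h>

/* Original control flow: only one side is evaluated. */
static int
toy_branchy(int cond, int x)
{
   int r;

   if (cond)
      r = x * 2;
   else
      r = x + 1;
   return r;
}

/* Stand-in for a sel instruction: pick one of two already computed values. */
static int
toy_sel(int cond, int if_true, int if_false)
{
   return cond ? if_true : if_false;
}

/* Flattened form: both sides are computed, then the phi becomes a sel. */
static int
toy_flattened(int cond, int x)
{
   int then_val = x * 2;   /* former "then" block */
   int else_val = x + 1;   /* former "else" block */

   return toy_sel(cond, then_val, else_val);
}

/* Check that both forms agree for a given input. */
static void
toy_check_flatten(int x)
{
   assert(toy_flattened(0, x) == toy_branchy(0, x));
   assert(toy_flattened(1, x) == toy_branchy(1, x));
}
```

The trade-off is that both sides always execute, which is only attractive when the blocks are small and free of side effects.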


.. _`copy propagation`:

Copy Propagation
~~~~~~~~~~~~~~~~

Currently the frontend inserts ``mov``\s in various cases, because certain categories of instructions have limitations about const regs as sources.  The CP pass then simply removes all simple ``mov``\s (i.e. src-type is same as dst-type, no abs/neg flags, etc).

The eventual plan is to invert that, with the frontend inserting no ``mov``\s and CP legalizing things.
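
The "simple ``mov``" test named above could look roughly like the following predicate.  It covers only the two conditions spelled out in the text (matching types, no abs/neg); the "etc" in the description means the real pass checks more.  The enum, the flag values, and the function name are hypothetical.

```c
#include <assert.h>

enum toy_type { TOY_F16, TOY_F32, TOY_U32, TOY_S32 };

#define TOY_SRC_ABS 0x1   /* hypothetical flag values */
#define TOY_SRC_NEG 0x2

/* Return 1 if a mov with these properties is a plain copy that could be
 * removed by forwarding its source to the consumers. */
static int
toy_is_simple_mov(enum toy_type src_type, enum toy_type dst_type,
                  unsigned src_flags)
{
   /* Only the two toy flags are modelled here. */
   assert((src_flags & ~(unsigned)(TOY_SRC_ABS | TOY_SRC_NEG)) == 0);

   if (src_type != dst_type)
      return 0;                /* the mov is really a type conversion */
   if (src_flags & (TOY_SRC_ABS | TOY_SRC_NEG))
      return 0;                /* the mov applies a source modifier */
   return 1;
}
```

For example, an f32-to-f32 copy with no modifiers qualifies, while an f32-to-f16 conversion or a negated copy does not.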


.. _grouping:

Grouping
~~~~~~~~

In the grouping pass, instructions which need to be grouped (for ``collect``\s, etc) have their ``left`` / ``right`` neighbor pointers set up.  In cases where there is a conflict (i.e. one instruction cannot have two unique left or right neighbors), an additional ``mov`` instruction is inserted.  This ensures that there is some possible valid `register assignment`_ at the later stages.
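
The conflict rule can be illustrated with a small linking helper: a link succeeds only if neither node already has a different neighbor on that side, and on failure a copy (standing in for the result of the extra ``mov``) is linked instead.  This is a sketch under those assumptions, with made-up names, not the pass itself.

```c
#include <assert.h>
#include <stddef.h>

struct toy_node {
   struct toy_node *left, *right;
};

/* Try to make r the right neighbor of l.  Returns 1 on success, 0 if either
 * node already has a different neighbor on the relevant side. */
static int
toy_try_link(struct toy_node *l, struct toy_node *r)
{
   assert(l != NULL && r != NULL && l != r);
   if ((l->right && l->right != r) || (r->left && r->left != l))
      return 0;
   l->right = r;
   r->left = l;
   return 1;
}

/* x must sit directly left of y in one group and directly left of z in
 * another.  Returns the number of extra movs the toy model needs. */
static int
toy_demo_conflict(void)
{
   struct toy_node x = { NULL, NULL }, y = { NULL, NULL };
   struct toy_node z = { NULL, NULL }, x_copy = { NULL, NULL };
   int movs = 0;

   if (!toy_try_link(&x, &y))
      movs++;
   if (!toy_try_link(&x, &z)) {
      /* Conflict: x already has y on its right, so link a copy of x. */
      int ok = toy_try_link(&x_copy, &z);
      assert(ok);
      (void)ok;
      movs++;
   }
   return movs;
}
```

Here exactly one extra ``mov`` is needed, since the second group cannot reuse ``x`` once ``y`` occupies its right side.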


.. _depth:

Depth
~~~~~

In the depth pass, a depth is calculated for each instruction node within its basic block.  The depth is the sum of the required cycles (delay slots needed between two instructions plus one) of each instruction plus the max depth of any of its source instructions.  (meta_ instructions don't add to the depth.)  As an instruction's depth is calculated, it is inserted into a per-block list sorted by deepest instruction.  Unreachable instructions and inputs are marked.

    TODO: we should probably calculate both hard and soft depths (?) to
    try to coax additional instructions to fit in places where we need
    to use sync bits, such as after a texture fetch or SFU.
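
The depth recurrence described above can be written as a short recursive function.  The sketch assumes a fixed per-instruction delay and leaves out the sorted per-block list and the unreachable marking; the struct and the ``toy_`` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SRCS 3

struct toy_instr {
   int is_meta;                           /* meta instructions add no depth */
   unsigned delay;                        /* delay slots this instr needs */
   unsigned nsrcs;
   struct toy_instr *srcs[TOY_MAX_SRCS];
   unsigned depth;
   int visited;
};

/* depth = (delay + 1, or 0 for meta) + max depth over all sources */
static unsigned
toy_depth(struct toy_instr *instr)
{
   unsigned max_src = 0;

   if (instr->visited)
      return instr->depth;

   assert(instr->nsrcs <= TOY_MAX_SRCS);
   for (unsigned i = 0; i < instr->nsrcs; i++) {
      unsigned d = toy_depth(instr->srcs[i]);
      if (d > max_src)
         max_src = d;
   }

   instr->depth = max_src + (instr->is_meta ? 0 : instr->delay + 1);
   instr->visited = 1;
   return instr->depth;
}

/* input -> mul -> mad -> mad, the dp3 chain from the earlier graph, with
 * three delay slots assumed per ALU instruction. */
static unsigned
toy_demo_dp3_depth(void)
{
   struct toy_instr in = { 1, 0, 0, { NULL }, 0, 0 };
   struct toy_instr mul = { 0, 3, 2, { &in, &in }, 0, 0 };
   struct toy_instr mad1 = { 0, 3, 3, { &in, &in, &mul }, 0, 0 };
   struct toy_instr mad2 = { 0, 3, 3, { &in, &in, &mad1 }, 0, 0 };

   return toy_depth(&mad2);
}
```

With these numbers the chain yields depths of 4, 8, and 12 for the ``mul`` and the two ``mad`` instructions, and the meta input contributes nothing.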

.. _scheduling:

Scheduling
~~~~~~~~~~

After the grouping_ pass, there are no more instructions to insert or remove.  Start scheduling each basic block from the deepest node in the depth sorted list created by the depth_ pass, recursively trying to schedule each instruction after its source instructions plus delay slots.  Insert ``nop``\s as required.

.. _`register assignment`:

Register Assignment
~~~~~~~~~~~~~~~~~~~

TODO