IR3 NOTES
=========

Some notes about ir3, the compiler and machine-specific IR for the shader ISA introduced with adreno a3xx.  The same shader ISA is present, with some small differences, in adreno a4xx.

Compared to the previous generation a2xx ISA (ir2), the a3xx ISA is a "simple" scalar instruction set.  However, the compiler is responsible, in most cases, for scheduling the instructions.  The hardware does not try to hide the shader core pipeline stages.  For example, a common (cat2) ALU instruction takes four cycles, so a subsequent cat2 instruction which uses the result must have three intervening instructions (or NOPs).  When operating on vec4's, the corresponding scalar instructions for the remaining three components can typically fill those slots.  That still leaves a lot of edge cases where things fall over, like:

::

  ADD TEMP[0], TEMP[1], TEMP[2]
  MUL TEMP[0], TEMP[1], TEMP[0].wzyx

Here, the second instruction needs the output of the first group of scalar instructions in the wrong order, resulting in not enough instruction slots between the ``add r0.w, r1.w, r2.w`` and ``mul r0.x, r1.x, r0.w``.  This is why the original (old) compiler, which translated nearly literally from TGSI to ir3, had a strong tendency to fall over.

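As a rough illustration of the delay-slot arithmetic above, the following sketch counts the ``nop``\s needed between a producer and its consumer.  The helper ``nops_needed`` and the constant ``CAT2_DELAY_SLOTS`` are made up for this example, and the model (a fixed three slots, positions counted in issue order) is a simplification rather than a description of the real scheduler.

```c
#include <assert.h>

/* Assumption for this sketch: a cat2 result becomes usable four cycles after
 * issue, so three instruction slots must separate producer and consumer. */
#define CAT2_DELAY_SLOTS 3

/* Number of nops to insert before a consumer issued at position `use` that
 * reads the result of a producer issued at position `def`. */
static unsigned nops_needed(unsigned def, unsigned use)
{
    assert(use > def);
    unsigned between = use - def - 1;   /* instructions already in between */
    return between >= CAT2_DELAY_SLOTS ? 0 : CAT2_DELAY_SLOTS - between;
}
```

With the scalarized ``ADD`` occupying positions 0 to 3, a ``mul`` at position 4 that reads ``r0.w`` (written at position 3) gives ``nops_needed(3, 4) == 3``, whereas reading ``r0.x`` (written at position 0) would need none.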

So the current compiler's frontend instead generates a directed acyclic graph (DAG) of instructions and basic blocks, which goes through various additional passes that eventually schedule the instructions and assign registers.

For additional documentation about the hardware, see the wiki: `a3xx ISA
<https://github.com/freedreno/freedreno/wiki/A3xx-shader-instruction-set-architecture>`_.

External Structure
------------------

``ir3_shader``
    A single vertex/fragment/etc shader from the gallium perspective (i.e.
    maps to a single TGSI shader), which manages a set of shader variants
    that are generated on demand based on the shader key.

``ir3_shader_key``
    The configuration key that identifies a shader variant.  Different
    shader variants are generated based on other GL state
    (two-sided-color, render-to-alpha, etc) or render stages
    (binning-pass vertex shader).

``ir3_shader_variant``
    The actual hw shader generated based on input TGSI and shader key.

``ir3_compiler``
    Compiler frontend which generates ir3 and runs the various backend
    stages to schedule and do register assignment.

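The relationship between a shader, its key and its variants can be pictured as a small cache that compiles a variant the first time a given key is seen.  The sketch below is only that, a picture: the ``ex_`` types, their fields and ``ex_get_variant`` are hypothetical and much smaller than the real structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-ins for the structures listed above. */
struct ex_key {
    bool binning_pass;
    bool two_sided_color;
};

struct ex_variant {
    struct ex_key key;
    struct ex_variant *next;
};

struct ex_shader {
    struct ex_variant *variants;   /* variants generated so far */
    unsigned compile_count;        /* how many times we had to "compile" */
};

static bool ex_key_equal(const struct ex_key *a, const struct ex_key *b)
{
    return a->binning_pass == b->binning_pass &&
           a->two_sided_color == b->two_sided_color;
}

/* Return the variant matching `key`, generating it on first use. */
static struct ex_variant *
ex_get_variant(struct ex_shader *so, const struct ex_key *key)
{
    assert(so && key);
    for (struct ex_variant *v = so->variants; v; v = v->next)
        if (ex_key_equal(&v->key, key))
            return v;

    struct ex_variant *v = calloc(1, sizeof(*v));
    if (!v)
        return NULL;
    v->key = *key;
    v->next = so->variants;
    so->variants = v;
    so->compile_count++;           /* a real driver would compile here */
    return v;
}
```

Asking twice for the same key returns the same variant without recompiling; a different key (for example the binning-pass flavour of a vertex shader) produces a second variant.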

The IR
------

The ir3 IR maps quite directly to the hardware: instruction opcodes map directly to hardware opcodes, and dst/src register(s) map directly to the hardware dst/src register(s).  There are a few extensions, in the form of meta_ instructions.  Additionally, for normal (non-const, etc) src registers, the ``IR3_REG_SSA`` flag is set and ``reg->instr`` points to the source instruction which produced that value.  So, for example, the following TGSI shader:

::

  VERT
  DCL IN[0]
  DCL IN[1]
  DCL OUT[0], POSITION
  DCL TEMP[0], LOCAL
    1: DP3 TEMP[0].x, IN[0].xyzz, IN[1].xyzz
    2: MOV OUT[0], TEMP[0].xxxx
    3: END

eventually generates:

.. graphviz::

  digraph G {
  rankdir=RL;
  nodesep=0.25;
  ranksep=1.5;
  subgraph clusterdce198 {
  label="vert";
  inputdce198 [shape=record,label="inputs|<in0> i0.x|<in1> i0.y|<in2> i0.z|<in4> i1.x|<in5> i1.y|<in6> i1.z"];
  instrdcf348 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
  inputdce198:<in2>:w -> instrdcedd0:<src0>
  inputdce198:<in6>:w -> instrdcedd0:<src1>
  instrdcec30 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
  inputdce198:<in1>:w -> instrdcec30:<src0>
  inputdce198:<in5>:w -> instrdcec30:<src1>
  instrdceb60 [shape=record,style=filled,fillcolor=lightgrey,label="{mul.f|<dst0>|<src0> |<src1> }"];
  inputdce198:<in0>:w -> instrdceb60:<src0>
  inputdce198:<in4>:w -> instrdceb60:<src1>
  instrdceb60:<dst0> -> instrdcec30:<src2>
  instrdcec30:<dst0> -> instrdcedd0:<src2>
  instrdcedd0:<dst0> -> instrdcf348:<src0>
  instrdcf400 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0:<dst0> -> instrdcf400:<src0>
  instrdcf4b8 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0:<dst0> -> instrdcf4b8:<src0>
  outputdce198 [shape=record,label="outputs|<out0> o0.x|<out1> o0.y|<out2> o0.z|<out3> o0.w"];
  instrdcf348:<dst0> -> outputdce198:<out0>:e
  instrdcf400:<dst0> -> outputdce198:<out1>:e
  instrdcf4b8:<dst0> -> outputdce198:<out2>:e
  instrdcedd0:<dst0> -> outputdce198:<out3>:e
  }
  }

(after scheduling, etc, but before register assignment).

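Reading the graph from the inputs towards the outputs, the ``DP3`` has become one ``mul.f`` feeding two chained ``mad.f32`` instructions, and the scalar result is then routed to all four output components.  The same arithmetic written as plain C, purely as a sketch (``vert_example`` is a hypothetical name, not compiler code):

```c
#include <assert.h>

/* Same arithmetic as the graph: DP3 as mul + mad + mad, with the scalar
 * result broadcast to all four components of the output. */
static void vert_example(const float in0[3], const float in1[3], float out[4])
{
    assert(in0 && in1 && out);
    float t = in0[0] * in1[0];     /* mul.f                         */
    t = in0[1] * in1[1] + t;       /* mad.f32, src2 = mul result    */
    t = in0[2] * in1[2] + t;       /* mad.f32, src2 = previous mad  */
    for (int i = 0; i < 4; i++)
        out[i] = t;                /* MOV OUT[0], TEMP[0].xxxx      */
}
```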

Internal Structure
~~~~~~~~~~~~~~~~~~

``ir3_block``
    Represents a basic block.

    TODO: currently blocks are nested, but I think I need to change that
    to a more conventional arrangement before implementing proper flow
    control.  Currently the only flow control handled is if/else, which
    gets flattened out, with results chosen by ``sel`` instructions.

``ir3_instruction``
    Represents a machine instruction or meta_ instruction.  Has pointers
    to dst register (``regs[0]``) and src register(s) (``regs[1..n]``),
    as needed.

``ir3_register``
    Represents a src or dst register; flags indicate const/relative/etc.
    If ``IR3_REG_SSA`` is set on a src register, the actual register
    number (name) has not been assigned yet, and instead the ``instr``
    field points to the src instruction.

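A minimal sketch of how these pieces hang together, assuming nothing beyond what is described above.  The ``ex_`` names and the exact fields are hypothetical stand-ins, not the real ir3 definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flags, loosely mirroring the const and SSA flags. */
enum { EX_REG_CONST = 1 << 0, EX_REG_SSA = 1 << 1 };

struct ex_instruction;

struct ex_register {
    unsigned flags;
    int num;                         /* register number, -1 if unassigned */
    struct ex_instruction *instr;    /* producer, valid if EX_REG_SSA set */
};

struct ex_instruction {
    const char *name;
    unsigned regs_count;
    struct ex_register *regs[4];     /* regs[0] = dst, regs[1..n] = srcs */
};

/* Return the instruction producing source `n` (0-based), or NULL if that
 * source is not an SSA value (e.g. a const). */
static struct ex_instruction *
ex_ssa_src(const struct ex_instruction *instr, unsigned n)
{
    assert(instr && n + 1 < instr->regs_count);
    const struct ex_register *src = instr->regs[n + 1];
    return (src->flags & EX_REG_SSA) ? src->instr : NULL;
}
```

For an ``add`` whose first source is the result of a ``mul`` and whose second source is a const, ``ex_ssa_src(add, 0)`` yields the ``mul`` and ``ex_ssa_src(add, 1)`` yields ``NULL``.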

In addition, there are various util macros/functions to simplify manipulation/traversal of the graph:

``foreach_src(srcreg, instr)``
    Iterate over each of the instruction's source ``ir3_register``\s.

``foreach_src_n(srcreg, n, instr)``
    Like ``foreach_src``, also setting ``n`` to the source number (starting
    with ``0``).

``foreach_ssa_src(srcinstr, instr)``
    Iterate over each of the instruction's SSA source ``ir3_instruction``\s.
    This skips non-SSA sources (consts, etc), but includes virtual sources
    (such as the address register if `relative addressing`_ is used).

``foreach_ssa_src_n(srcinstr, n, instr)``
    Like ``foreach_ssa_src``, also setting ``n`` to the source number.

For example:

.. code-block:: c

  /* delay ends up as the largest delay required by any SSA source */
  foreach_ssa_src_n(src, i, instr) {
    unsigned d = delay_calc_srcn(ctx, src, instr, i);
    delay = MAX2(delay, d);
  }

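One plausible way to build such an iterator is a ``for`` loop over source slots combined with an ``if`` that skips non-SSA sources, with the address dependency exposed as one extra slot at the end.  The following is a sketch under that assumption; the ``ex_`` types and macro are hypothetical and are not the actual ir3 macros.

```c
#include <assert.h>
#include <stddef.h>

enum { EX_REG_SSA = 1 << 0 };

struct ex_instr;
struct ex_reg { unsigned flags; struct ex_instr *instr; };
struct ex_instr {
    unsigned regs_count;           /* dst plus sources, so always >= 1 */
    struct ex_reg *regs[4];        /* regs[0] is the dst */
    struct ex_instr *address;      /* optional virtual source */
};

/* Source slots: the real sources plus one trailing slot for the address. */
static unsigned ex_ssa_src_cnt(const struct ex_instr *instr)
{
    assert(instr->regs_count >= 1);
    return (instr->regs_count - 1) + (instr->address ? 1 : 0);
}

/* Producer for slot n, or NULL if that source is not an SSA value. */
static struct ex_instr *ex_ssa_src_n(const struct ex_instr *instr, unsigned n)
{
    assert(n < ex_ssa_src_cnt(instr));
    if (n == instr->regs_count - 1)
        return instr->address;     /* the virtual source comes last */
    const struct ex_reg *reg = instr->regs[n + 1];
    return (reg->flags & EX_REG_SSA) ? reg->instr : NULL;
}

#define ex_foreach_ssa_src_n(srcp, n, instr)                      \
    for (unsigned n = 0; n < ex_ssa_src_cnt(instr); n++)          \
        if (((srcp) = ex_ssa_src_n(instr, n)) != NULL)

/* Count the SSA sources of `instr` and return the last one visited. */
static struct ex_instr *
ex_last_ssa_src(const struct ex_instr *instr, unsigned *count)
{
    struct ex_instr *src, *last = NULL;
    *count = 0;
    ex_foreach_ssa_src_n(src, i, instr) {
        last = src;
        (*count)++;
    }
    return last;
}
```

For an instruction with one SSA source, one const source and an ``address`` dependency, the loop body runs twice and the address producer is visited last.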

TODO probably other helper/util stuff worth mentioning here

.. _meta:

Meta Instructions
~~~~~~~~~~~~~~~~~

**input**
    Used for shader inputs (registers configured in the command-stream
    to hold particular input values, written by the shader core before
    start of execution).  Also used for connecting up values within a
    basic block to an output of a previous block.

**output**
    Used to hold outputs of a basic block.

**flow**
    TODO

**phi**
    TODO

**collect**
    Groups registers which need to be assigned to consecutive scalar
    registers, for example the src instructions of a ``sam`` (texture
    fetch) (see `register groups`_) or an array element dereference
    (see `relative addressing`_).

**split**
    The counterpart to **collect**: when an instruction such as ``sam``
    writes multiple components, it splits the result into individual
    scalar components to be consumed by other instructions.


.. _`flow control`:

Flow Control
~~~~~~~~~~~~

TODO


.. _`register groups`:

Register Groups
~~~~~~~~~~~~~~~

Certain instructions, such as texture sample instructions, consume multiple consecutive scalar registers via a single src register encoded in the instruction, and/or write multiple consecutive scalar registers.  In the simplest example:

::

  sam (f32)(xyz)r2.x, r0.z, s#0, t#0

for a 2d texture, would read ``r0.zw`` to get the coordinate, and write ``r2.xyz``.

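To make the register ranges in this example concrete, the sketch below numbers scalar registers linearly.  That numbering (``rN.c`` maps to ``4 * N + c``) is an assumption made for illustration, and both helper names are hypothetical.

```c
#include <assert.h>

/* Assumed linear numbering of scalar registers: rN.x, rN.y, rN.z and rN.w
 * map to 4*N + 0..3. */
static unsigned scalar_reg(unsigned n, unsigned comp)
{
    assert(comp < 4);
    return 4 * n + comp;
}

/* Write the `count` consecutive scalar registers starting at `first` into
 * `out`, which must have room for `count` entries. */
static void reg_range(unsigned first, unsigned count, unsigned *out)
{
    assert(out || count == 0);
    for (unsigned i = 0; i < count; i++)
        out[i] = first + i;
}
```

Under this numbering the 2d coordinate starting at ``r0.z`` occupies scalar registers 2 and 3 (``r0.z``, ``r0.w``), and the ``(xyz)`` result starting at ``r2.x`` occupies 8, 9 and 10.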

Before register assignment, the two components of the texture src are grouped together:

.. graphviz::

  digraph G {
    { rank=same;
      collect;
    };
    { rank=same;
      coord_x;
      coord_y;
    };
    sam -> collect [label="regs[1]"];
    collect -> coord_x [label="regs[1]"];
    collect -> coord_y [label="regs[2]"];
    coord_x -> coord_y [label="right",style=dotted];
    coord_y -> coord_x [label="left",style=dotted];
    coord_x [label="coord.x"];
    coord_y [label="coord.y"];
  }

The frontend sets up the SSA ptrs from the ``sam`` source register to the ``collect`` meta instruction, which in turn points to the instructions producing the ``coord.x`` and ``coord.y`` values.  The grouping_ pass then sets up the ``left`` and ``right`` neighbor pointers of the ``collect``\'s sources, used later by the `register assignment`_ pass to assign blocks of scalar registers.

Likewise, for the consecutive scalar registers of the destination:

.. graphviz::

  digraph {
    { rank=same;
      A;
      B;
      C;
    };
    { rank=same;
      split_0;
      split_1;
      split_2;
    };
    A -> split_0;
    B -> split_1;
    C -> split_2;
    split_0 [label="split\noff=0"];
    split_0 -> sam;
    split_1 [label="split\noff=1"];
    split_1 -> sam;
    split_2 [label="split\noff=2"];
    split_2 -> sam;
    split_0 -> split_1 [label="right",style=dotted];
    split_1 -> split_0 [label="left",style=dotted];
    split_1 -> split_2 [label="right",style=dotted];
    split_2 -> split_1 [label="left",style=dotted];
    sam;
  }

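The point of the ``left``/``right`` links is that a later pass can start from any member of a group, rewind to the leftmost neighbor, and hand out consecutive registers from there.  A small sketch of that walk follows; ``ex_node`` and ``assign_group`` are hypothetical, and since register assignment itself is still a TODO in these notes this shows only the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node carrying the neighbor pointers described above. */
struct ex_node {
    struct ex_node *left, *right;
    int reg;                     /* assigned scalar register, -1 if none */
};

/* Assign consecutive scalar registers to the whole group containing `n`,
 * giving `base` to the leftmost member.  Returns the group size. */
static unsigned assign_group(struct ex_node *n, int base)
{
    assert(n);
    while (n->left)              /* rewind to the leftmost neighbor */
        n = n->left;
    unsigned count = 0;
    for (; n; n = n->right)
        n->reg = base + (int)count++;
    return count;
}
```

Starting from the middle ``split`` of the three shown above with ``base`` 8 assigns registers 8, 9 and 10 to ``off=0``, ``off=1`` and ``off=2`` respectively.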

.. _`relative addressing`:

Relative Addressing
~~~~~~~~~~~~~~~~~~~

Most instructions support indirect addressing (relative to the address register) into the const or gpr register file for some or all of their src/dst registers.  In this case the register accessed is taken from ``r<a0.x + n>`` or ``c<a0.x + n>``, i.e. the address register (``a0.x``) value plus ``n``, where ``n`` is encoded in the instruction (rather than the absolute register number).

    Note that cat5 (texture sample) instructions are the notable exception, not
    supporting relative addressing of src or dst.

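In other words, a relative source is an ordinary indexed load whose index is only known at run time.  A tiny model of ``c<a0.x + n>``, with the const file represented as a plain array; the function name and the bounds check are part of this sketch only.

```c
#include <assert.h>
#include <stddef.h>

/* Read c<a0.x + n> from a const file modelled as a plain array. */
static float read_const_relative(const float *constfile, size_t size,
                                 int a0_x, int n)
{
    long idx = (long)a0_x + n;   /* run-time address plus encoded offset */
    assert(constfile && idx >= 0 && (size_t)idx < size);
    return constfile[idx];
}
```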

Relative addressing of the const file (for example, a uniform array) is relatively simple.  We don't do register assignment of the const file, so all that is required is to schedule things properly.  I.e. the instruction that writes the address register must be scheduled first, and we cannot have two different address register values live at one time.

Relative addressing of the gpr file (which can be used for src or dst), however, has additional restrictions on register assignment (i.e. the array elements must be assigned to consecutive scalar registers).  And in the case of a relative dst, subsequent instructions now depend on both the relative write and the previous instruction which wrote that register, since we do not know at compile time which actual register was written.

Each instruction has an optional ``address`` pointer, to capture the dependency on the address register value when relative addressing is used for any of the src/dst register(s).  This behaves as an additional virtual src register, i.e. ``foreach_ssa_src()`` will also iterate the address register (last).

    Note that ``nop``\s for timing constraints, type specifiers (i.e.
    ``add.f`` vs ``add.u``), etc, are omitted for brevity in the examples.

For example:

::

  mova a0.x, hr1.y
  sub r1.y, r2.x, r3.x
  add r0.x, r1.y, c<a0.x + 2>

results in:

.. graphviz::

  digraph {
    rankdir=LR;
    sub;
    const [label="const file"];
    add;
    mova;
    add -> mova;
    add -> sub;
    add -> const [label="off=2"];
  }

The scheduling pass has some smarts to schedule things such that only a single ``a0.x`` value is used at any one time.

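The constraint can be stated as a property of the final instruction order: every user of ``a0.x`` must run while the value written by its own ``address`` producer is still the current one.  The checker below is a sketch of that property only, not of how the scheduler achieves it, and the ``ex_`` struct is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scheduled instruction: it may write a0.x, and/or depend on
 * the a0.x value written by another instruction (the `address` pointer). */
struct ex_instr {
    bool writes_addr;
    const struct ex_instr *address;
};

/* Check that each user of a0.x sees the value written by its own producer,
 * i.e. that no other a0.x write was scheduled in between. */
static bool addr_schedule_ok(const struct ex_instr *const *sched, size_t count)
{
    assert(sched || count == 0);
    const struct ex_instr *current = NULL;   /* latest writer of a0.x */
    for (size_t i = 0; i < count; i++) {
        if (sched[i]->address && sched[i]->address != current)
            return false;
        if (sched[i]->writes_addr)
            current = sched[i];
    }
    return true;
}
```

An order such as ``mova1, use1, mova2, use2`` passes, while ``mova1, mova2, use1, use2`` does not, because ``use1`` would read the value written by ``mova2``.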

To implement variable arrays, the NIR registers are stored as an ``ir3_array``,
which will be register allocated to consecutive hardware registers.  The array
access uses the id field in the ``ir3_register`` to map to the array being
accessed, and the offset field for the fixed offset within the array.  A NIR
indirect register read such as:

::

  decl_reg vec2 32 r0[2]
  ...
  vec2 32 ssa_19 = mov r0[0 + ssa_9]


results in:

::

  0000:0000:001:  shl.b hssa_19, hssa_17, himm[0.000000,1,0x1]
  0000:0000:002:  mov.s16s16 hr61.x, hssa_19
  0000:0000:002:  mov.u32u32 ssa_21, arr[id=1, offset=0, size=4, ssa_12], address=_[0000:0000:002:  mov.s16s16]
  0000:0000:002:  mov.u32u32 ssa_22, arr[id=1, offset=1, size=4, ssa_12], address=_[0000:0000:002:  mov.s16s16]


Array writes write to the array in ``instr->regs[0]->array.id``.  A NIR indirect
register write such as:

::

  decl_reg vec2 32 r0[2]
  ...
  r0[0 + ssa_12] = mov ssa_13

results in:

::

  0000:0000:001:  shl.b hssa_29, hssa_27, himm[0.000000,1,0x1]
  0000:0000:002:  mov.s16s16 hr61.x, hssa_29
  0000:0000:001:  mov.u32u32 arr[id=1, offset=0, size=4, ssa_17], c2.y, address=_[0000:0000:002:  mov.s16s16]
  0000:0000:004:  mov.u32u32 arr[id=1, offset=1, size=4, ssa_31], c2.z, address=_[0000:0000:002:  mov.s16s16]

Note that only cat1 (``mov``) can do an indirect write, and thus NIR register
stores may need to introduce an extra ``mov``.

ir3 array accesses in the DAG get serialized by setting ``IR3_BARRIER_ARRAY_W``
or ``IR3_BARRIER_ARRAY_R`` in ``instr->barrier_class``.


Shader Passes
-------------

After the frontend has generated the use-def graph of instructions, the graph is run through various passes which include scheduling_ and `register assignment`_.  Because inserting ``mov`` instructions after scheduling would also require inserting additional ``nop`` instructions (since it is too late to reschedule to try and fill the bubbles), the earlier stages try to ensure (at least given an infinite supply of registers) that `register assignment`_ after scheduling_ cannot fail.

    Note that we essentially have ~256 scalar registers in the
    architecture (although larger register usage will at some thresholds
    limit the number of threads which can run in parallel).  And at some
    point we will have to deal with spilling.

.. _flatten:

Flatten
~~~~~~~

In this stage, simple if/else blocks are flattened into a single block, with ``phi`` nodes converted into ``sel`` instructions.  The a3xx ISA has very few predicated instructions, and we would prefer not to use branches for simple if/else.

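The effect of flattening can be shown with ordinary C: instead of executing one side of the branch, both sides are computed and a select picks the result that the ``phi`` would have merged.  ``ex_sel`` is a stand-in for a ``sel`` instruction, and its operand order is illustrative only.

```c
#include <assert.h>

/* Stand-in for a ``sel`` instruction: picks one of two values that have
 * both already been computed. */
static int ex_sel(int cond, int if_true, int if_false)
{
    return cond ? if_true : if_false;
}

/* Original control flow: only one side executes. */
static int with_branch(int cond, int x)
{
    int r;
    if (cond)
        r = x + 1;
    else
        r = x * 2;
    return r;                  /* conceptually r = phi(then, else) */
}

/* Flattened form: a single block, with the phi replaced by a select. */
static int flattened(int cond, int x)
{
    int then_value = x + 1;
    int else_value = x * 2;
    return ex_sel(cond, then_value, else_value);
}
```

Both functions return the same value for every input; the flattened one simply trades a branch for the extra work of evaluating both sides.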

.. _`copy propagation`:

Copy Propagation
~~~~~~~~~~~~~~~~

Currently the frontend inserts ``mov``\s in various cases, because certain categories of instructions have limitations about const regs as sources.  The CP pass then simply removes all simple ``mov``\s (i.e. src-type is same as dst-type, no abs/neg flags, etc).

The eventual plan is to invert that, with the frontend inserting no ``mov``\s and CP legalizing things.

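Removing a simple ``mov`` amounts to rewriting each consumer's source pointer to skip over it.  A minimal sketch, with a hypothetical ``ex_instr`` that tracks only the properties named above; in this sketch a ``mov`` whose source is not an SSA value (for example a const) is left alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_MAX_SRCS 3

struct ex_instr {
    bool is_mov;
    bool same_type;                     /* src type equals dst type */
    bool has_modifiers;                 /* abs/neg etc. on the mov source */
    struct ex_instr *src[EX_MAX_SRCS];  /* SSA sources, NULL if unused */
};

static bool is_simple_mov(const struct ex_instr *i)
{
    return i->is_mov && i->same_type && !i->has_modifiers &&
           i->src[0] != NULL;
}

/* Follow chains of simple movs back to the instruction that really
 * produced the value. */
static struct ex_instr *cp_resolve(struct ex_instr *i)
{
    while (i && is_simple_mov(i))
        i = i->src[0];
    return i;
}

/* Rewrite all sources of `instr` so that they bypass simple movs. */
static void cp_instr(struct ex_instr *instr)
{
    assert(instr);
    for (size_t n = 0; n < EX_MAX_SRCS; n++)
        instr->src[n] = cp_resolve(instr->src[n]);
}
```

A type-converting ``mov`` (``same_type`` false) is not simple and therefore stays in place as a source.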

.. _grouping:

Grouping
~~~~~~~~

In the grouping pass, instructions which need to be grouped (for ``collect``\s, etc) have their ``left`` / ``right`` neighbor pointers set up.  In cases where there is a conflict (i.e. one instruction cannot have two unique left or right neighbors), an additional ``mov`` instruction is inserted.  This ensures that there is some possible valid `register assignment`_ at the later stages.

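A deliberately conservative sketch of the idea: link the members of a group from left to right, and whenever a member already has a neighbor from an earlier group, substitute a fresh copy (standing in for the inserted ``mov``) instead.  The real pass presumably only does this for genuine conflicts; ``ex_node`` and ``ex_group`` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct ex_node {
    struct ex_node *left, *right;
};

/* Link `count` members as neighbors, left to right.  A member that already
 * belongs to another group is replaced in `members` by the next unused
 * entry of `movs` (which must hold at least `count` nodes).  Returns the
 * number of movs that were needed. */
static unsigned ex_group(struct ex_node **members, struct ex_node *movs,
                         unsigned count)
{
    assert(members && movs);
    unsigned used = 0;
    struct ex_node *prev = NULL;
    for (unsigned i = 0; i < count; i++) {
        struct ex_node *n = members[i];
        if (n->left || n->right) {       /* conflict: needs its own copy */
            n = &movs[used++];
            n->left = n->right = NULL;
            members[i] = n;
        }
        n->left = prev;
        if (prev)
            prev->right = n;
        prev = n;
    }
    return used;
}
```

Grouping ``{a, b}`` first and ``{c, b}`` afterwards leaves ``a`` and ``b`` as neighbors and pairs ``c`` with a copy of ``b``, so both groups can later receive consecutive registers.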

.. _depth:

Depth
~~~~~

In the depth pass, a depth is calculated for each instruction node within its basic block.  The depth of an instruction is its required cycles (the delay slots needed between two instructions, plus one) plus the max depth of any of its source instructions.  (meta_ instructions don't add to the depth.)  As an instruction's depth is calculated, it is inserted into a per-block list sorted by deepest instruction.  Unreachable instructions and inputs are marked.

    TODO: we should probably calculate both hard and soft depths (?) to
    try to coax additional instructions to fit in places where we need
    to use sync bits, such as after a texture fetch or SFU.

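The depth rule is a straightforward recursion over the sources.  The sketch below implements just that rule (it leaves out the sorted per-block list and the marking of unreachable instructions); the struct and its field names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_MAX_SRCS 3

struct ex_instr {
    bool is_meta;
    unsigned delay_slots;               /* slots needed before a consumer */
    struct ex_instr *src[EX_MAX_SRCS];  /* NULL if unused */
    unsigned depth;
    bool visited;
};

/* depth = (delay slots + 1, or 0 for meta) + max depth of any source. */
static unsigned ex_depth(struct ex_instr *instr)
{
    assert(instr);
    if (instr->visited)
        return instr->depth;

    unsigned max_src = 0;
    for (size_t n = 0; n < EX_MAX_SRCS; n++) {
        if (!instr->src[n])
            continue;
        unsigned d = ex_depth(instr->src[n]);
        if (d > max_src)
            max_src = d;
    }
    unsigned cycles = instr->is_meta ? 0 : instr->delay_slots + 1;
    instr->depth = max_src + cycles;
    instr->visited = true;
    return instr->depth;
}
```

For an input (meta) feeding a ``mul`` feeding a ``mad``, each with three delay slots, the depths come out as 0, 4 and 8.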

.. _scheduling:

Scheduling
~~~~~~~~~~

After the grouping_ pass, there are no more instructions to insert or remove.  Scheduling of each basic block starts from the deepest node in the depth-sorted list created by the depth_ pass, recursively trying to schedule each instruction after its source instructions plus delay slots.  ``nop``\s are inserted as required.

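A stripped-down sketch of the recursive part: schedule the sources first, then place the instruction in the earliest slot at which all of its source results are available, counting any gap as ``nop``\s.  It does not try to fill gaps with unrelated instructions, which the real pass would want to do, and all ``ex_`` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_MAX_SRCS 3

struct ex_instr {
    unsigned delay_slots;               /* slots required before a consumer */
    struct ex_instr *src[EX_MAX_SRCS];  /* NULL if unused */
    bool scheduled;
    unsigned pos;                       /* issue slot once scheduled */
};

struct ex_sched {
    unsigned next_slot;                 /* next free issue slot */
    unsigned nops;                      /* nops inserted so far */
};

/* Schedule `instr` after all of its sources, padding with nops until every
 * source result is available. */
static void ex_schedule(struct ex_sched *s, struct ex_instr *instr)
{
    assert(s && instr);
    if (instr->scheduled)
        return;

    for (size_t n = 0; n < EX_MAX_SRCS; n++)
        if (instr->src[n])
            ex_schedule(s, instr->src[n]);

    unsigned earliest = s->next_slot;
    for (size_t n = 0; n < EX_MAX_SRCS; n++) {
        const struct ex_instr *src = instr->src[n];
        if (src && src->pos + src->delay_slots + 1 > earliest)
            earliest = src->pos + src->delay_slots + 1;
    }

    s->nops += earliest - s->next_slot; /* bubbles we could not fill */
    instr->pos = earliest;
    instr->scheduled = true;
    s->next_slot = earliest + 1;
}
```

With two independent producers ``a`` and ``b`` (three delay slots each) feeding ``c``, the result is ``a`` in slot 0, ``b`` in slot 1, three ``nop``\s, and ``c`` in slot 5.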

.. _`register assignment`:

Register Assignment
~~~~~~~~~~~~~~~~~~~

TODO