=====================
Adreno Five Microcode
=====================

.. contents::

.. _afuc-introduction:

Introduction
============

Adreno GPUs prior to 6xx use two micro-controllers to parse the command-stream,
set up the hardware for draws (or compute jobs), and do various GPU
housekeeping.  They are relatively simple (basically glorified
register writers) and basically all their state is in a collection
of registers.  I.e. there is no stack, and no memory assigned to
them; any global state like which bank of context registers is to
be used in the next draw is stored in a register.

The setup is similar to radeon; in fact Adreno 2xx thru 4xx used
basically the same instruction set as r600.  There is a "PFP"
(Prefetch Parser) and "ME" (Micro Engine, also confusingly referred
to as "PM4").  These make up the "CP" ("Command Parser").  The
PFP runs ahead of the ME, with some PM4 packets handled entirely
in the PFP.  Between the PFP and ME is a FIFO ("MEQ").  In the
generations prior to Adreno 5xx, the PFP and ME had different
instruction sets.

Starting with Adreno 5xx, a new microcontroller with a unified
instruction set was introduced, although the overall architecture
and purpose of the two microcontrollers remains the same.

For lack of a better name, this new instruction set is called
"Adreno Five MicroCode" or "afuc".  (No idea what Qualcomm calls
it internally.)

With Adreno 6xx, the separate PFP and ME are replaced with a single
SQE microcontroller using the same instruction set as 5xx.

.. _afuc-overview:

Instruction Set Overview
========================

32bit instruction set with basic arithmetic ops that can take
either two source registers or one src and a 16b immediate.

32 registers, although some are special purpose:

- ``$00`` - always reads zero, otherwise seems to be the PC
- ``$01`` - current PM4 packet header
- ``$1c`` - alias ``$rem``, remaining data in packet
- ``$1d`` - alias ``$addr``
- ``$1f`` - alias ``$data``

Branch instructions have a delay slot, so the following instruction
is always executed regardless of whether the branch is taken or not.


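
The delay-slot behaviour can be modelled with a small Python sketch.  This is
a toy interpreter, not the real encoding: the tuple instruction format, the
``run`` helper and the use of absolute instruction indices as branch targets
are all assumptions made for illustration (real branches encode a relative
offset).

.. code-block:: python

   MASK32 = 0xFFFFFFFF

   def run(program, regs, max_steps=100):
       """Execute a toy program in which every branch has one delay slot."""
       pc, pending, trace = 0, None, []
       while pc < len(program) and len(trace) < max_steps:
           op, *args = program[pc]
           trace.append(pc)
           target = None
           if op == "add":                  # add $dst, $src, imm
               dst, src, imm = args
               regs[dst] = (regs[src] + imm) & MASK32
           elif op == "breq":               # breq $src, imm, target
               src, imm, label = args
               if regs[src] == imm:
                   target = label
           # A branch decided by the previous instruction only lands now,
           # after this (delay slot) instruction has executed.
           pc = pending if pending is not None else pc + 1
           pending = target
       return trace

   program = [
       ("breq", 2, 0, 4),   # 0: branch to index 4 if $02 == 0
       ("add", 3, 3, 1),    # 1: delay slot, executes either way
       ("add", 4, 4, 1),    # 2: skipped when the branch is taken
       ("add", 4, 4, 1),    # 3: skipped when the branch is taken
       ("add", 5, 5, 1),    # 4: branch target
   ]
   regs = {2: 0, 3: 0, 4: 0, 5: 0}
   trace = run(program, regs)

With ``$02`` equal to zero the branch is taken, yet instruction 1 still runs
before control reaches instruction 4.
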
.. _afuc-alu:

ALU Instructions
================

The following instructions are available:

- ``add``   - add
- ``addhi`` - add + carry (for upper 32b of 64b value)
- ``sub``   - subtract
- ``subhi`` - subtract + carry (for upper 32b of 64b value)
- ``and``   - bitwise AND
- ``or``    - bitwise OR
- ``xor``   - bitwise XOR
- ``not``   - bitwise NOT (no src1)
- ``shl``   - shift-left
- ``ushr``  - unsigned shift-right
- ``ishr``  - signed shift-right
- ``rot``   - rotate-left (like shift-left with wrap-around)
- ``mul8``  - multiply low 8b of two src
- ``min``   - minimum
- ``max``   - maximum
- ``cmp``   - compare two values

The ALU instructions can take either two src registers, or a src
plus 16b immediate as 2nd src, ex::

  add $dst, $src, 0x1234   ; src2 is immed
  add $dst, $src1, $src2   ; src2 is reg

The ``not`` instruction only takes a single source::

  not $dst, $src
  not $dst, 0x1234

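
As a rough illustration of how ``add`` and ``addhi`` pair up for 64b
arithmetic, here is a Python sketch.  How the carry is actually passed between
the two instructions is not documented here; the explicit ``carry`` value and
the helper names ``add``, ``addhi`` and ``add64`` are assumptions for
illustration only.

.. code-block:: python

   MASK32 = 0xFFFFFFFF

   def add(a, b):
       """32b add; returns (result, carry-out)."""
       total = (a & MASK32) + (b & MASK32)
       return total & MASK32, total >> 32

   def addhi(a, b, carry):
       """32b add that also consumes the carry from a preceding add."""
       return ((a & MASK32) + (b & MASK32) + carry) & MASK32

   def add64(lo, hi, imm_lo, imm_hi=0):
       """Add a value to a 64b quantity held as two 32b halves."""
       new_lo, carry = add(lo, imm_lo)
       new_hi = addhi(hi, imm_hi, carry)
       return new_lo, new_hi

   # 0x1_FFFFFFFF + 1 == 0x2_00000000
   result = add64(0xFFFFFFFF, 0x1, 0x1)
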
.. _afuc-alu-cmp:

The ``cmp`` instruction returns:

- ``0x00`` if src1 > src2
- ``0x2b`` if src1 == src2
- ``0x1e`` if src1 < src2

See explanation in :ref:`afuc-branch`.


.. _afuc-branch:

Branch Instructions
===================

The following branch/jump instructions are available:

- ``brne`` - branch if not equal (or bit not set)
- ``breq`` - branch if equal (or bit set)
- ``jump`` - unconditional jump

Both ``brne`` and ``breq`` have two forms, comparing the src register
against either a small immediate (up to 5 bits) or a specific bit::

  breq $src, b3, #somelabel  ; branch if src & (1 << 3)
  breq $src, 0x3, #somelabel ; branch if src == 3

The branch instructions are encoded with a 16b relative offset.
Since ``$00`` always reads back zero, it can be used to construct
an unconditional relative jump.

The :ref:`cmp <afuc-alu-cmp>` instruction can be paired with the
bit-test variants of ``brne``/``breq`` to implement gt/ge/lt/le,
due to the bit pattern it returns, for example::

  cmp $04, $02, $03
  breq $04, b1, #somelabel

will branch if ``$02`` is less than or equal to ``$03``.


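
To see why the bit-test works, the three ``cmp`` results can be decoded bit by
bit.  The Python sketch below is only a model of the values listed for ``cmp``;
the helper names are made up, and it assumes non-negative operands since
whether the hardware compare is signed or unsigned is not stated here.

.. code-block:: python

   CMP_GT, CMP_EQ, CMP_LT = 0x00, 0x2B, 0x1E

   def afuc_cmp(src1, src2):
       """Return the documented cmp result for two non-negative values."""
       if src1 > src2:
           return CMP_GT
       if src1 == src2:
           return CMP_EQ
       return CMP_LT

   def bit(value, n):
       """Model of the bN bit-test used by breq/brne."""
       return (value >> n) & 1

   def classify_bits():
       """For each result bit, list the relations under which it is set."""
       names = {CMP_GT: "gt", CMP_EQ: "eq", CMP_LT: "lt"}
       return {
           n: sorted(name for pat, name in names.items() if bit(pat, n))
           for n in range(6)
       }

   table = classify_bits()

In this model ``b1`` and ``b3`` are set for both "eq" and "lt" (so ``breq``
on them means le and ``brne`` means gt), ``b2`` and ``b4`` are set only for
"lt" (``brne`` gives ge), and ``b0`` and ``b5`` are set only for "eq".
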
.. _afuc-call:

Call/Return
===========

Simple subroutines can be implemented with ``call``/``ret``.  The
jump instruction encodes a fixed offset.

  TODO not sure how many levels deep function calls can be nested.
  There isn't really a stack.  Definitely seems to be multiple
  levels of function call, see in PFP: CP_CONTEXT_SWITCH_YIELD -> f13 ->
  f22.


.. _afuc-control:

Config Instructions
===================

These seem to read/write config state in other parts of CP.  In at
least some cases I expect these map to CP registers (but possibly
not directly??)

- ``cread $dst, [$off + addr], flags``
- ``cwrite $src, [$off + addr], flags``

In cases where no offset is needed, ``$00`` is frequently used as
the offset.

For example, the following sequence sets ``CP_IBn_BASE_LO``/``HI``/``BUFSIZE``
from the parameters of a ``CP_INDIRECT_BUFFER`` packet::

  ; load CP_INDIRECT_BUFFER parameters from cmdstream:
  mov $02, $data   ; low 32b of IB target address
  mov $03, $data   ; high 32b of IB target
  mov $04, $data   ; IB size in dwords

  ; sanity check # of dwords:
  breq $04, 0x0, #l23 (#69, 04a2)

  ; this seems something to do with figuring out whether
  ; we are going from RB->IB1 or IB1->IB2 (ie. so the
  ; below cwrite instructions update either
  ; CP_IB1_BASE_LO/HI/BUFSIZE or CP_IB2_BASE_LO/HI/BUFSIZE)
  and $05, $18, 0x0003
  shl $05, $05, 0x0002

  ; update CP_IBn_BASE_LO/HI/BUFSIZE:
  cwrite $02, [$05 + 0x0b0], 0x8
  cwrite $03, [$05 + 0x0b1], 0x8
  cwrite $04, [$05 + 0x0b2], 0x8



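
The offset arithmetic in the example above can be checked with a short Python
sketch.  It only reproduces the ``and``/``shl`` steps and the ``[$05 + addr]``
addressing; the function name is made up, and which value of ``$18`` selects
which IB level is not known, so the mapping below is purely numeric.

.. code-block:: python

   def ib_control_addrs(reg18):
       """Control addresses hit by the three cwrite instructions."""
       off = reg18 & 0x0003          # and $05, $18, 0x0003
       off = (off << 2) & 0xFFFFFFFF # shl $05, $05, 0x0002
       return [off + base for base in (0x0B0, 0x0B1, 0x0B2)]

   slots = {sel: ib_control_addrs(sel) for sel in range(3)}

Each selector value moves the three writes to the next group of four control
addresses: ``0x0b0``-``0x0b2``, ``0x0b4``-``0x0b6``, then ``0x0b8``-``0x0ba``.
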
.. _afuc-reg-access:

Register Access
===============

The special registers ``$addr`` and ``$data`` can be used to write GPU
registers, for example, to write::

  mov $addr, CP_SCRATCH_REG[0x2] ; set register to write
  mov $data, $03                 ; CP_SCRATCH_REG[0x2]
  mov $data, $04                 ; CP_SCRATCH_REG[0x3]
  ...

Subsequent writes to ``$data`` will increment the address of the register
to write, so a sequence of consecutive registers can be written.

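
The auto-increment behaviour can be pictured with a minimal Python model.  The
``RegisterPort`` class and the ``SCRATCH_BASE`` offset are hypothetical, chosen
only for illustration; the real register offsets are not given here.

.. code-block:: python

   SCRATCH_BASE = 0x100  # hypothetical offset of CP_SCRATCH_REG[0]

   class RegisterPort:
       """Toy model of the $addr/$data register write path."""

       def __init__(self):
           self.regs = {}
           self.addr = 0

       def set_addr(self, reg):        # mov $addr, <reg>
           self.addr = reg

       def write_data(self, value):    # mov $data, <value>
           self.regs[self.addr] = value & 0xFFFFFFFF
           self.addr += 1              # next write goes to the next register

   port = RegisterPort()
   port.set_addr(SCRATCH_BASE + 0x2)
   port.write_data(0x11)   # lands in CP_SCRATCH_REG[0x2]
   port.write_data(0x22)   # lands in CP_SCRATCH_REG[0x3]
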
To read::

  mov $addr, CP_SCRATCH_REG[0x2]
  mov $03, $addr
  mov $04, $addr

Many registers that are updated frequently have two banks, so they can be
updated without stalling for the previous draw to finish.  These banks are
arranged so bit 11 is zero for bank 0 and 1 for bank 1.  The ME fw (at
least the version I'm looking at) stores this in ``$17``, so to update
these registers from ME::

  or $addr, $17, VFD_INDEX_OFFSET
  mov $data, $03
  ...

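
In other words, the bank selection is just an OR of bit 11 into the register
offset.  The Python sketch below assumes ``$17`` holds either ``0`` or
``1 << 11``, which is a guess based on the ``or`` above, and uses a made-up
register offset rather than the real ``VFD_INDEX_OFFSET`` value.

.. code-block:: python

   BANK_BIT = 1 << 11            # bit 11 selects bank 1
   EXAMPLE_REG = 0x0123          # hypothetical banked register offset

   def banked_addr(bank_reg, reg_offset):
       """Model of: or $addr, $17, <reg>."""
       if reg_offset & BANK_BIT:
           raise ValueError("register offset already has bit 11 set")
       return bank_reg | reg_offset

   bank0 = banked_addr(0, EXAMPLE_REG)
   bank1 = banked_addr(BANK_BIT, EXAMPLE_REG)
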
Note that PFP doesn't seem to use this approach, instead it does something
like::

  mov $0c, CP_SCRATCH_REG[0x7]
  mov $02, 0x789a   ; value
  cwrite $0c, [$00 + 0x010], 0x8
  cwrite $02, [$00 + 0x011], 0x8

Like with the ``$addr``/``$data`` approach, the destination register address
increments on each write.

.. _afuc-mem:

Memory Access
=============

There are no load/store instructions, as such.  The microcontrollers
have only indirect memory access via GPU registers.  There are two
possible mechanisms.

Read/Write via CP_NRT Registers
-------------------------------

This seems to be only used by ME.  If PFP were also using it, they would
race with each other.  It seems to be primarily used for small reads.

- ``CP_ME_NRT_ADDR_LO``/``_HI`` - write to set the address to read or write
- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR``

The address register increments with successive reads or writes.

Memory Write example::

  ; store 64b value in $04+$05 to 64b address in $02+$03
  mov $addr, CP_ME_NRT_ADDR_LO
  mov $data, $02
  mov $data, $03
  mov $addr, CP_ME_NRT_DATA
  mov $data, $04
  mov $data, $05

Memory Read example::

  ; load 64b value from address in $02+$03 into $04+$05
  mov $addr, CP_ME_NRT_ADDR_LO
  mov $data, $02
  mov $data, $03
  mov $04, $addr
  mov $05, $addr


Read via Control Instructions
-----------------------------

This is used by PFP whenever it needs to read memory.  Also seems to be
used by ME for streaming reads (larger amounts of data).  The DMA access
seems to be done by ROQ.

  TODO might also be possible for write access

  TODO some of the control commands might be synchronizing access
  between PFP and ME??

An example from ``CP_DRAW_INDIRECT`` packet handler::

  mov $07, 0x0004  ; # of dwords to read from draw-indirect buffer
  ; load address of indirect buffer from cmdstream:
  cwrite $data, [$00 + 0x0b8], 0x8
  cwrite $data, [$00 + 0x0b9], 0x8
  ; set # of dwords to read:
  cwrite $07, [$00 + 0x0ba], 0x8
  ...
  ; read parameters from draw-indirect buffer:
  mov $09, $addr
  mov $07, $addr
  cread $12, [$00 + 0x040], 0x8
  ; the start parameter gets written into MEQ, which ME writes
  ; to VFD_INDEX_OFFSET register:
  mov $data, $addr


A6XX NOTES
==========

The ``$14`` register holds global flags set by:

  CP_SKIP_IB2_ENABLE_LOCAL - b8
  CP_SKIP_IB2_ENABLE_GLOBAL - b9
  CP_SET_MARKER
    MODE=GMEM - sets b15
    MODE=BLIT2D - clears b15, b12, b7
  CP_SET_MODE - b29+b30
  CP_SET_VISIBILITY_OVERRIDE - b11, b21, b30?
  CP_SET_DRAW_STATE - checks b29+b30

  CP_COND_REG_EXEC - checks b10, which should be predicate flag?