=====================
Adreno Five Microcode
=====================

.. contents::

.. _afuc-introduction:

Introduction
============

Adreno GPUs prior to 6xx use two micro-controllers to parse the command-stream,
set up the hardware for draws (or compute jobs), and do various GPU
housekeeping. They are relatively simple (basically glorified
register writers) and basically all their state is in a collection
of registers. I.e. there is no stack, and no memory assigned to
them; any global state like which bank of context registers is to
be used in the next draw is stored in a register.

The setup is similar to radeon; in fact Adreno 2xx thru 4xx used
basically the same instruction set as r600. There is a "PFP"
(Prefetch Parser) and "ME" (Micro Engine, also confusingly referred
to as "PM4"). These make up the "CP" ("Command Parser"). The
PFP runs ahead of the ME, with some PM4 packets handled entirely
in the PFP. Between the PFP and ME is a FIFO ("MEQ"). In the
generations prior to Adreno 5xx, the PFP and ME had different
instruction sets.

Starting with Adreno 5xx, a new microcontroller with a unified
instruction set was introduced, although the overall architecture
and purpose of the two microcontrollers remains the same.

For lack of a better name, this new instruction set is called
"Adreno Five MicroCode" or "afuc". (No idea what Qualcomm calls
it internally.)

With Adreno 6xx, the separate PFP and ME are replaced with a single
SQE microcontroller using the same instruction set as 5xx.

.. _afuc-overview:

Instruction Set Overview
========================

Instructions are 32 bits wide, with basic arithmetic ops that can take
either two source registers or one src and a 16b immediate.

There are 32 registers, although some are special purpose:

- ``$00`` - always reads zero, otherwise seems to be the PC
- ``$01`` - current PM4 packet header
- ``$1c`` - alias ``$rem``, remaining data in packet
- ``$1d`` - alias ``$addr``
- ``$1f`` - alias ``$data``

Branch instructions have a delay slot, so the following instruction
is always executed regardless of whether the branch is taken or not.

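As a rough illustration of the delay-slot rule above, here is a toy Python
model. It is not afuc: the instruction tuples, the ``run`` helper and the use
of absolute branch targets (instead of encoded relative offsets) are all
assumptions made up for the example.

```python
def run(program, regs):
    """Run a toy program that has one branch delay slot.

    Hypothetical instruction tuples (not real afuc encodings):
      ("set", reg, value)         -> regs[reg] = value
      ("breq", reg, imm, target)  -> branch to index `target` if regs[reg] == imm
    Returns the list of instruction indices that were executed.
    """
    trace = []
    pc = 0
    pending = None  # target of a taken branch whose delay slot has not run yet
    while pc < len(program):
        insn = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if pending is not None:
            # This instruction is the delay slot: it still executes,
            # and only afterwards does control move to the branch target.
            next_pc = pending
            pending = None
        if insn[0] == "set":
            regs[insn[1]] = insn[2]
        elif insn[0] == "breq" and regs[insn[1]] == insn[2]:
            pending = insn[3]
        pc = next_pc
    return trace


program = [
    ("set", 2, 0),      # 0
    ("breq", 2, 0, 4),  # 1: taken
    ("set", 3, 1),      # 2: delay slot, executed even though the branch is taken
    ("set", 4, 1),      # 3: skipped
    ("set", 5, 1),      # 4: branch target
]
regs = [0] * 32
trace = run(program, regs)
```

In this toy run the instruction at index 2 executes before control reaches
index 4, while index 3 is never reached.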
.. _afuc-alu:

ALU Instructions
================

The following instructions are available:

- ``add`` - add
- ``addhi`` - add + carry (for upper 32b of 64b value)
- ``sub`` - subtract
- ``subhi`` - subtract + carry (for upper 32b of 64b value)
- ``and`` - bitwise AND
- ``or`` - bitwise OR
- ``xor`` - bitwise XOR
- ``not`` - bitwise NOT (no src1)
- ``shl`` - shift-left
- ``ushr`` - unsigned shift-right
- ``ishr`` - signed shift-right
- ``rot`` - rotate-left (like shift-left with wrap-around)
- ``mul8`` - multiply low 8b of two src
- ``min`` - minimum
- ``max`` - maximum
- ``cmp`` - compare two values

The ALU instructions can take either two src registers, or a src
plus 16b immediate as 2nd src, ex::

   add $dst, $src, 0x1234   ; src2 is immed
   add $dst, $src1, $src2   ; src2 is reg

The ``not`` instruction only takes a single source::

   not $dst, $src
   not $dst, 0x1234

.. _afuc-alu-cmp:

The ``cmp`` instruction returns:

- ``0x00`` if src1 > src2
- ``0x2b`` if src1 == src2
- ``0x1e`` if src1 < src2

See explanation in :ref:`afuc-branch`

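The three return values listed above can be picked apart bit by bit, which is
what makes the bit-test branches useful. The following Python sketch models
this; the helper names are hypothetical, and whether the hardware compares
signed or unsigned values is not stated here, so plain Python integers are
used. Only the bit 1 test (less-or-equal) is taken from the documented
example; the other bit choices are simply read off the listed constants.

```python
CMP_GT = 0x00  # src1 > src2
CMP_EQ = 0x2b  # src1 == src2 (binary 101011)
CMP_LT = 0x1e  # src1 < src2  (binary 011110)


def cmp_result(src1, src2):
    """Model of the value `cmp` returns, per the list above."""
    if src1 > src2:
        return CMP_GT
    if src1 == src2:
        return CMP_EQ
    return CMP_LT


def bit_set(value, bit):
    """True if `bit` is set in `value`, like the bit-test form of breq."""
    return (value >> bit) & 1 == 1


def less_or_equal(src1, src2):
    # Bit 1 is set in both 0x2b and 0x1e, and clear in 0x00.
    return bit_set(cmp_result(src1, src2), 1)


def less_than(src1, src2):
    # Bit 2 is set only in 0x1e.
    return bit_set(cmp_result(src1, src2), 2)


def greater_than(src1, src2):
    # Same bit as less_or_equal, tested for "not set" (the brne form).
    return not bit_set(cmp_result(src1, src2), 1)


def greater_or_equal(src1, src2):
    # Same bit as less_than, tested for "not set" (the brne form).
    return not bit_set(cmp_result(src1, src2), 2)
```

So a single ``cmp`` result is enough for all four ordered comparisons,
depending on which bit is tested and whether ``breq`` or ``brne`` is used.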
.. _afuc-branch:

Branch Instructions
===================

The following branch/jump instructions are available:

- ``brne`` - branch if not equal (or bit not set)
- ``breq`` - branch if equal (or bit set)
- ``jump`` - unconditional jump

Both ``brne`` and ``breq`` have two forms, comparing the src register
against either a small immediate (up to 5 bits) or a specific bit::

   breq $src, b3, #somelabel  ; branch if src & (1 << 3)
   breq $src, 0x3, #somelabel ; branch if src == 3

The branch instructions are encoded with a 16b relative offset.
Since ``$00`` always reads back zero, it can be used to construct
an unconditional relative jump.

The :ref:`cmp <afuc-alu-cmp>` instruction can be paired with the
bit-test variants of ``brne``/``breq`` to implement gt/ge/lt/le,
due to the bit pattern it returns, for example::

   cmp $04, $02, $03
   breq $04, b1, #somelabel

will branch if ``$02`` is less than or equal to ``$03``.


.. _afuc-call:

Call/Return
===========

Simple subroutines can be implemented with ``call``/``ret``. The
jump instruction encodes a fixed offset.

  TODO not sure how many levels deep function calls can be nested.
  There isn't really a stack. Definitely seems to be multiple
  levels of fxn call, see in PFP: CP_CONTEXT_SWITCH_YIELD -> f13 ->
  f22.


.. _afuc-control:

Config Instructions
===================

These seem to read/write config state in other parts of CP. In at
least some cases I expect these map to CP registers (but possibly
not directly??)

- ``cread $dst, [$off + addr], flags``
- ``cwrite $src, [$off + addr], flags``

In cases where no offset is needed, ``$00`` is frequently used as
the offset.

For example, the following sequence sets the IB base address and size::

   ; load CP_INDIRECT_BUFFER parameters from cmdstream:
   mov $02, $data   ; low 32b of IB target address
   mov $03, $data   ; high 32b of IB target
   mov $04, $data   ; IB size in dwords

   ; sanity check # of dwords:
   breq $04, 0x0, #l23 (#69, 04a2)

   ; this seems something to do with figuring out whether
   ; we are going from RB->IB1 or IB1->IB2 (ie. so the
   ; below cwrite instructions update either
   ; CP_IB1_BASE_LO/HI/BUFSIZE or CP_IB2_BASE_LO/HI/BUFSIZE)
   and $05, $18, 0x0003
   shl $05, $05, 0x0002

   ; update CP_IBn_BASE_LO/HI/BUFSIZE:
   cwrite $02, [$05 + 0x0b0], 0x8
   cwrite $03, [$05 + 0x0b1], 0x8
   cwrite $04, [$05 + 0x0b2], 0x8


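The offset arithmetic in the example above (``and`` with 3, then ``shl`` by 2)
can be written out as a small Python sketch. The function name is
hypothetical, and which value of ``$18`` corresponds to which IB level is not
established by the example, so the sketch only reproduces the address
calculation.

```python
def ib_control_addrs(reg18):
    """Control-register addresses targeted by the three cwrite instructions.

    Mirrors:  and $05, $18, 0x0003
              shl $05, $05, 0x0002
              cwrite ..., [$05 + 0x0b0 / 0x0b1 / 0x0b2], 0x8
    """
    off = (reg18 & 0x3) << 2  # low two bits select a group of four slots
    return [off + 0x0b0, off + 0x0b1, off + 0x0b2]


# Each step of the two low bits moves the BASE_LO/HI/BUFSIZE triple by 4.
addrs0 = ib_control_addrs(0)
addrs1 = ib_control_addrs(1)
addrs2 = ib_control_addrs(2)
```

With the low bits equal to 0, 1 and 2 this gives the triples starting at
``0x0b0``, ``0x0b4`` and ``0x0b8`` respectively.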
.. _afuc-reg-access:

Register Access
===============

The special registers ``$addr`` and ``$data`` can be used to write GPU
registers, for example, to write::

   mov $addr, CP_SCRATCH_REG[0x2] ; set register to write
   mov $data, $03                 ; CP_SCRATCH_REG[0x2]
   mov $data, $04                 ; CP_SCRATCH_REG[0x3]
   ...

Subsequent writes to ``$data`` will increment the address of the register
to write, so a sequence of consecutive registers can be written.

To read::

   mov $addr, CP_SCRATCH_REG[0x2]
   mov $03, $addr
   mov $04, $addr

Many registers that are updated frequently have two banks, so they can be
updated without stalling for the previous draw to finish. These banks are
arranged so bit 11 is zero for bank 0 and 1 for bank 1. The ME fw (at
least the version I'm looking at) stores this in ``$17``, so to update
these registers from ME::

   or $addr, $17, VFD_INDEX_OFFSET
   mov $data, $03
   ...

Note that PFP doesn't seem to use this approach; instead it does something
like::

   mov $0c, CP_SCRATCH_REG[0x7]
   mov $02, 0x789a   ; value
   cwrite $0c, [$00 + 0x010], 0x8
   cwrite $02, [$00 + 0x011], 0x8

Like with the ``$addr``/``$data`` approach, the destination register address
increments on each write.

.. _afuc-mem:

Memory Access
=============

There are no load/store instructions, as such.
The microcontrollers have only indirect memory access via GPU registers.
There are two possible mechanisms.

Read/Write via CP_NRT Registers
-------------------------------

This seems to be only used by ME. If PFP were also using it, they would
race with each other. It seems to be primarily used for small reads.

- ``CP_ME_NRT_ADDR_LO``/``_HI`` - write to set the address to read or write
- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR``

The address register increments with successive reads or writes.

Memory Write example::

   ; store 64b value in $04+$05 to 64b address in $02+$03
   mov $addr, CP_ME_NRT_ADDR_LO
   mov $data, $02
   mov $data, $03
   mov $addr, CP_ME_NRT_DATA
   mov $data, $04
   mov $data, $05

Memory Read example::

   ; load 64b value from address in $02+$03 into $04+$05
   mov $addr, CP_ME_NRT_ADDR_LO
   mov $data, $02
   mov $data, $03
   mov $04, $addr
   mov $05, $addr


Read via Control Instructions
-----------------------------

This is used by PFP whenever it needs to read memory. Also seems to be
used by ME for streaming reads (larger amounts of data). The DMA access
seems to be done by ROQ.

  TODO might also be possible for write access

  TODO some of the control commands might be synchronizing access
  between PFP and ME??

An example from ``CP_DRAW_INDIRECT`` packet handler::

   mov $07, 0x0004  ; # of dwords to read from draw-indirect buffer
   ; load address of indirect buffer from cmdstream:
   cwrite $data, [$00 + 0x0b8], 0x8
   cwrite $data, [$00 + 0x0b9], 0x8
   ; set # of dwords to read:
   cwrite $07, [$00 + 0x0ba], 0x8
   ...
   ; read parameters from draw-indirect buffer:
   mov $09, $addr
   mov $07, $addr
   cread $12, [$00 + 0x040], 0x8
   ; the start parameter gets written into MEQ, which ME writes
   ; to VFD_INDEX_OFFSET register:
   mov $data, $addr


A6XX NOTES
==========

The ``$14`` register holds global flags set by:

  CP_SKIP_IB2_ENABLE_LOCAL - b8
  CP_SKIP_IB2_ENABLE_GLOBAL - b9
  CP_SET_MARKER
    MODE=GMEM - sets b15
    MODE=BLIT2D - clears b15, b12, b7
  CP_SET_MODE - b29+b30
  CP_SET_VISIBILITY_OVERRIDE - b11, b21, b30?
  CP_SET_DRAW_STATE - checks b29+b30

  CP_COND_REG_EXEC - checks b10, which should be predicate flag?
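
When staring at a dumped ``$14`` value, it can help to decode the
single-bit flags from the notes above. The Python sketch below does that; the
flag names and the ``decode_flags`` helper are made up for illustration, only
the bits whose meaning is given above are included, and the b10 entry
inherits the uncertainty of the note it comes from.

```python
# Bit positions taken from the notes above; the names are hypothetical.
FLAG_BITS = {
    "skip_ib2_enable_local": 8,     # CP_SKIP_IB2_ENABLE_LOCAL
    "skip_ib2_enable_global": 9,    # CP_SKIP_IB2_ENABLE_GLOBAL
    "cond_reg_exec_predicate": 10,  # checked by CP_COND_REG_EXEC (uncertain)
    "marker_mode_gmem": 15,         # set by CP_SET_MARKER MODE=GMEM
}


def decode_flags(value):
    """Return a dict mapping each known flag name to True/False."""
    return {name: bool((value >> bit) & 1) for name, bit in FLAG_BITS.items()}


# Example: a value with only b8 and b15 set.
flags = decode_flags((1 << 8) | (1 << 15))
```

The multi-bit fields (b29+b30 for ``CP_SET_MODE``, and the bits touched by
``CP_SET_VISIBILITY_OVERRIDE``) are left out because their encoding is not
described here.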