# Unofficial GCN/RDNA ISA reference errata

## `v_sad_u32`

The Vega ISA reference writes its behaviour as:

```
D.u = abs(S0.i - S1.i) + S2.u.
```

This is incorrect. The actual behaviour is what is written in the GCN3 reference
guide:

```
ABS_DIFF (A,B) = (A>B) ? (A-B) : (B-A)
D.u = ABS_DIFF (S0.u,S1.u) + S2.u
```

The instruction doesn't subtract S1 from S0 and take the absolute value (the
_signed_ distance); it uses the _unsigned_ distance between the operands. So
`v_sad_u32(-5, 0, 0)` would return `4294967291` (`-5` interpreted as unsigned),
not `5`.

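The difference between the two descriptions can be checked with a small Python model. This is only a sketch: the function names are made up for illustration, and wrapping the result to 32 bits is an assumption about the destination register width rather than something the references spell out.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret the low 32 bits of x as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def v_sad_u32_vega_doc(s0, s1, s2):
    """Behaviour as written in the Vega reference (signed difference)."""
    return (abs(to_signed32(s0) - to_signed32(s1)) + (s2 & MASK32)) & MASK32

def v_sad_u32_actual(s0, s1, s2):
    """Behaviour as written in the GCN3 reference (unsigned difference)."""
    a, b = s0 & MASK32, s1 & MASK32
    abs_diff = a - b if a > b else b - a
    # Assumption: the sum wraps to the 32-bit destination.
    return (abs_diff + (s2 & MASK32)) & MASK32
```

With these definitions, `v_sad_u32_actual(-5, 0, 0)` gives `4294967291`, while the Vega wording would give `5`.
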
## `s_bfe_*`

The RDNA, Vega and GCN3 ISA references all state that these instructions don't
write SCC. They do.

## `v_bcnt_u32_b32`

The Vega ISA reference writes its behaviour as:

```
D.u = 0;
for i in 0 ... 31 do
D.u += (S0.u[i] == 1 ? 1 : 0);
endfor.
```

This is incorrect. The actual behaviour (and number of operands) is what
is written in the GCN3 reference guide:

```
D.u = CountOneBits(S0.u) + S1.u.
```

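As a quick illustration of the two-operand form, here is a minimal Python model. The function name simply mirrors the instruction, and the 32-bit wrap of the result is an assumption.

```python
MASK32 = 0xFFFFFFFF

def v_bcnt_u32_b32(s0, s1):
    """Count the set bits in the low 32 bits of s0, then add s1."""
    ones = bin(s0 & MASK32).count("1")
    # Assumption: the sum wraps to the 32-bit destination.
    return (ones + (s1 & MASK32)) & MASK32
```

Passing `0` as the second operand reproduces the plain population count that the Vega reference describes.
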
## `v_alignbyte_b32`

All versions of the ISA document are vague about it, but after some trial and
error we discovered that only the low 2 bits of the third operand are used.
Therefore, this instruction can't shift by more than 24 bits.

The correct description of `v_alignbyte_b32` is probably the following:

```
D.u = ({S0, S1} >> (8 * S2.u[1:0])) & 0xffffffff
```

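The formula above can be modelled in a few lines of Python. This sketch assumes that `{S0, S1}` is a 64-bit concatenation with S0 in the high half, which is how the notation is usually read; the function name just mirrors the instruction.

```python
MASK32 = 0xFFFFFFFF

def v_alignbyte_b32(s0, s1, s2):
    """Model of the 'probably correct' description of v_alignbyte_b32."""
    # Assumption: {S0, S1} places S0 in the upper 32 bits.
    combined = ((s0 & MASK32) << 32) | (s1 & MASK32)
    shift_bits = 8 * (s2 & 0b11)  # only S2[1:0] is used
    return (combined >> shift_bits) & MASK32
```

Because only two bits of the shift operand are used, a shift operand of `4` behaves like `0`.
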
## SMEM stores

The Vega ISA reference doesn't say this (or doesn't make it clear), but
the offset for SMEM stores must be in m0 if IMM == 0.

The RDNA ISA doc doesn't mention SMEM stores at all, but they seem to be supported
by the chip and are present in LLVM. However, AMD devs highly recommend avoiding
these instructions.

## SMEM atomics

As with SMEM stores, the RDNA ISA doc pretends they don't exist, but they
are present in LLVM.

## VMEM stores

All reference guides say (under "Vector Memory Instruction Data Dependencies"):

> When a VM instruction is issued, the address is immediately read out of VGPRs
> and sent to the texture cache. Any texture or buffer resources and samplers
> are also sent immediately. However, write-data is not immediately sent to the
> texture cache.

Reading that, one might think that waitcnts need to be added when writing to
the registers used for a VMEM store's data. Experimentation has shown that this
does not seem to be the case on GFX8 and GFX9 (GFX6 and GFX7 are untested). It
also seems unlikely, since NOPs are apparently needed in a subset of these
situations.

## MIMG opcodes on GFX8/GCN3

The `image_atomic_{swap,cmpswap,add,sub}` opcodes in the GCN3 ISA reference
guide are incorrect. The Vega ISA reference guide has the correct ones.

## VINTRP encoding

The Vega ISA doc says the encoding should be `110010`, but `110101` works.

## VOP1 instructions encoded as VOP3

The RDNA ISA doc says that `0x140` should be added to the opcode, but that doesn't
work. What works is adding `0x180`, which LLVM also does.

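In code, the working rule is a single addition. The helper and constant names below are hypothetical, and the opcode value in the usage note is only an example input.

```python
# Offset given in the RDNA ISA doc, which does not work.
VOP1_TO_VOP3_OFFSET_DOC = 0x140
# Offset that works in practice and that LLVM also uses.
VOP1_TO_VOP3_OFFSET = 0x180

def vop1_as_vop3_opcode(vop1_opcode):
    """Return the VOP3 opcode for a VOP1 instruction on RDNA."""
    return vop1_opcode + VOP1_TO_VOP3_OFFSET
```

For example, a VOP1 opcode of `0x01` becomes `0x181` in the VOP3 encoding, not `0x141`.
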
## FLAT, Scratch, Global instructions

The NV bit was removed in RDNA, but some parts of the doc still mention it.

RDNA ISA doc 13.8.1 says that SADDR should be set to 0x7f when ADDR is used, but
9.3.1 says it should be set to NULL. We assume 9.3.1 is correct and set it to
SGPR_NULL.

## Legacy instructions

Some instructions have a `_LEGACY` variant which implements "DX9 rules", in which
the zero "wins" in multiplications, i.e. `0.0*x` is always `0.0`. The Vega ISA doc
mentions `V_MAC_LEGACY_F32`, but this instruction does not actually exist on Vega.

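The "zero wins" rule is easiest to see next to IEEE multiplication. The sketch below uses a hypothetical helper name and Python's double-precision floats instead of 32-bit floats; returning a positive zero is also an assumption, since the text above only says the result is `0.0`.

```python
import math

def mul_legacy(a, b):
    """Multiply with DX9 rules: a zero operand always yields 0.0."""
    if a == 0.0 or b == 0.0:
        # Assumption: the sign of the zero result is not modelled.
        return 0.0
    return a * b
```

Under IEEE rules `0.0 * math.inf` is NaN, whereas `mul_legacy(0.0, math.inf)` is `0.0`. Non-zero operands behave as usual.
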
## `m0` with LDS instructions on Vega and newer

The Vega ISA doc (both the old one and the "7nm" one) claims that LDS instructions
use the `m0` register for address clamping like older GPUs, but this is not the case.

In reality, only the `_addtid` variants of LDS instructions use `m0` on Vega and
newer GPUs, so the relevant section of the RDNA ISA doc seems to apply.
LLVM also doesn't emit any initialization of `m0` for LDS instructions, and this
was also confirmed by AMD devs.

## RDNA L0, L1 cache and DLC, GLC bits

The old L1 cache was renamed to L0, and a new L1 cache was added to RDNA. There
is one L1 cache per shader array. Some instruction encodings have DLC and
GLC bits that interact with the cache.

* DLC ("device level coherent") bit: controls the L1 cache
* GLC ("globally coherent") bit: controls the L0 cache

The recommendation from AMD devs is to always set these two bits at the same time,
as it doesn't make too much sense to set them independently, aside from some
circumstances (e.g. we needn't set DLC when only one shader array is used).

Stores and atomics always bypass the L1 cache, so they don't support the DLC bit,
and it shouldn't be set in these cases. Setting DLC in these cases can result
in graphical glitches or hangs.

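The rules above boil down to a small decision function. This is a sketch with a hypothetical helper name and string tags for the instruction kind. It follows the general recommendation (DLC mirrors GLC on loads) and ignores the single-shader-array exception.

```python
def pick_dlc(kind, glc):
    """Choose the DLC bit to go with a given GLC bit.

    kind is one of "load", "store" or "atomic" (hypothetical tags).
    """
    if kind in ("store", "atomic"):
        # Stores and atomics bypass L1, so DLC must never be set.
        return False
    if kind == "load":
        # Recommendation: set DLC and GLC together.
        return glc
    raise ValueError(f"unknown instruction kind: {kind!r}")
```

A caller would decide GLC from the access's coherency requirements and then derive DLC from it, so the two bits cannot drift apart on loads.
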
## RDNA `s_dcache_wb`

`s_dcache_wb` is not mentioned in the RDNA ISA doc, but it is needed in order
to achieve correct behavior in some SSBO CTS tests.

## RDNA subvector mode

The documentation of `s_subvector_loop_begin` and `s_subvector_loop_end` is not clear
on what sort of addressing should be used, but it says that it
"is equivalent to an `S_CBRANCH` with extra math", so the subvector loop handling
in ACO is done according to the `s_cbranch` doc.

## RDNA early rasterization

The ISA documentation says about `s_endpgm`:

> The hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0
> before executing this instruction.

What the doc doesn't say is that in case of NGG (and legacy VS) when there
are no param exports, the driver sets `NO_PC_EXPORT=1` for optimal performance,
and when this is set, the hardware will start clipping and rasterization
as soon as it encounters a position export with `DONE=1`, without waiting
for the NGG (or VS) to finish.

It can even launch PS waves before NGG (or VS) ends.

When this happens, any store performed by a VS is not guaranteed
to be complete when PS tries to load it, so we need to manually
make sure to insert wait instructions before the position exports.

# Hardware Bugs

## SMEM corrupts VCCZ on SI/CI

[See this LLVM source.](https://github.com/llvm/llvm-project/blob/acb089e12ae48b82c0b05c42326196a030df9b82/llvm/lib/Target/AMDGPU/SIInsertWaits.cpp#L580-L616)

After issuing an SMEM instruction, we need to wait for it to finish and then
write to vcc (for example, `s_mov_b64 vcc, vcc`) to correct vccz.

Currently, we don't do this.

## SGPR offset on MUBUF prevents addr clamping on SI/CI

[See this LLVM source.](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp#L1917-L1922)

This leads to wrong bounds checking; using a VGPR offset fixes it.

## GCN / GFX6 hazards

### VINTRP followed by a read with `v_readfirstlane` or `v_readlane`

To avoid GPU hangs on GFX6, 1 wait state must be inserted if a write to the dst
VGPR of any `v_interp_*` is followed by a read of that VGPR with
`v_readfirstlane` or `v_readlane`.
Note that `v_writelane_*` is apparently not affected. This hazard isn't
documented anywhere but AMD confirmed it.

## RDNA / GFX10 hazards

### SMEM store followed by a load with the same address

We found that an `s_buffer_load` will produce incorrect results if it is preceded
by an `s_buffer_store` with the same address. Inserting an `s_nop` between them
does not mitigate the issue, so an `s_waitcnt lgkmcnt(0)` must be inserted.
This is not mentioned by LLVM among the other GFX10 bugs, but LLVM doesn't use
SMEM stores, so it's not surprising that they didn't notice it.

### VMEMtoScalarWriteHazard

Triggered by:
VMEM/FLAT/GLOBAL/SCRATCH/DS instruction reads an SGPR (or EXEC, or M0).
Then, a SALU/SMEM instruction writes the same SGPR.

Mitigated by:
A VALU instruction or an `s_waitcnt vmcnt(0)` between the two instructions.

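A toy detector for the VMEMtoScalarWriteHazard pattern might look like the following. Everything here is hypothetical: the `Instr` record, the kind tags and the register names are invented for the sketch and do not reflect ACO's or LLVM's actual data structures. It also conservatively assumes that the hazard stays live until one of the two listed mitigations is seen.

```python
from dataclasses import dataclass

VECTOR_MEM_KINDS = {"VMEM", "FLAT", "GLOBAL", "SCRATCH", "DS"}
SCALAR_WRITER_KINDS = {"SALU", "SMEM"}
# "WAIT_VMCNT0" stands for an `s_waitcnt vmcnt(0)`.
MITIGATING_KINDS = {"VALU", "WAIT_VMCNT0"}

@dataclass(frozen=True)
class Instr:
    kind: str
    reads: frozenset = frozenset()   # scalar registers read (SGPRs, exec, m0)
    writes: frozenset = frozenset()  # scalar registers written

def find_vmem_to_scalar_write_hazards(program):
    """Return the indices of scalar writes that hit the hazard."""
    pending = set()  # scalar registers read by earlier vector memory ops
    hazards = []
    for index, instr in enumerate(program):
        if instr.kind in MITIGATING_KINDS:
            pending.clear()
        elif instr.kind in VECTOR_MEM_KINDS:
            pending |= instr.reads
        elif instr.kind in SCALAR_WRITER_KINDS and pending & instr.writes:
            hazards.append(index)
    return hazards
```

A real pass would insert a mitigation at each reported index instead of only collecting it. The detector form keeps the trigger and mitigation conditions easy to compare with the description above.
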
### SMEMtoVectorWriteHazard

Triggered by:
An SMEM instruction reads an SGPR. Then, a VALU instruction writes that same SGPR.

Mitigated by:
Any non-SOPP SALU instruction (except `s_setvskip`, `s_version`, and any non-lgkmcnt `s_waitcnt`).

### Offset3fBug

Any branch that is located at offset 0x3f will be buggy. Just insert some NOPs to make sure no branch
is located at this offset.

### InstFwdPrefetchBug

According to LLVM, the `s_inst_prefetch` instruction can cause a hang.
There are no further details.

### LdsMisalignedBug

When there is a misaligned multi-dword FLAT load/store instruction in WGP mode,
it needs to be split into multiple single-dword FLAT instructions.

ACO doesn't use FLAT load/store on GFX10, so it is unaffected.

### FlatSegmentOffsetBug

The 12-bit immediate OFFSET field of FLAT instructions must always be 0.
GLOBAL and SCRATCH are unaffected.

ACO doesn't use FLAT load/store on GFX10, so it is unaffected.

### VcmpxPermlaneHazard

Triggered by:
Any permlane instruction that follows any VOPC instruction.
AMD devs confirmed that, despite the name, this doesn't only affect `v_cmpx`.

Mitigated by: any VALU instruction except `v_nop`.

### VcmpxExecWARHazard

Triggered by:
Any non-VALU instruction reads the EXEC mask. Then, any VALU instruction writes the EXEC mask.

Mitigated by:
A VALU instruction that writes an SGPR (or has a valid SDST operand), or `s_waitcnt_depctr 0xfffe`.
Note: `s_waitcnt_depctr` is an internal instruction, so there is no further information
about what it does or what its operand means.

### LdsBranchVmemWARHazard

Triggered by:
VMEM/GLOBAL/SCRATCH instruction, then a branch, then a DS instruction,
or vice versa: DS instruction, then a branch, then a VMEM/GLOBAL/SCRATCH instruction.

Mitigated by:
Only `s_waitcnt_vscnt null, 0`. Needed even if the first instruction is a load.

### NSAClauseBug

"MIMG-NSA in a hard clause has unpredictable results on GFX10.1"

### NSAMaxSize5

NSA MIMG instructions should be limited to 3 dwords before GFX10.3 to avoid
stability issues: https://reviews.llvm.org/D103348