# Unofficial GCN/RDNA ISA reference errata

## `v_sad_u32`

The Vega ISA reference writes its behaviour as:

```
D.u = abs(S0.i - S1.i) + S2.u.
```

This is incorrect. The actual behaviour is what is written in the GCN3 reference
guide:

```
ABS_DIFF (A,B) = (A>B) ? (A-B) : (B-A)
D.u = ABS_DIFF (S0.u,S1.u) + S2.u
```

The instruction doesn't subtract S1 from S0 and take the absolute value (the
_signed_ distance); it uses the _unsigned_ distance between the operands. So
`v_sad_u32(-5, 0, 0)` would return `4294967291` (`-5` interpreted as unsigned),
not `5`.

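The difference between the two descriptions can be sketched in Python. This is only an illustrative emulation, not hardware-verified code: the helper names (`to_signed`, `v_sad_u32`, `v_sad_u32_vega_doc`) are made up for this example, and operands are masked to 32 bits to mimic registers.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Reinterpret a 32-bit value as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def v_sad_u32(s0, s1, s2):
    """Behaviour per the GCN3 reference: unsigned absolute difference."""
    a, b = s0 & MASK32, s1 & MASK32
    diff = a - b if a > b else b - a
    return (diff + s2) & MASK32

def v_sad_u32_vega_doc(s0, s1, s2):
    """Behaviour as (incorrectly) written in the Vega reference."""
    return (abs(to_signed(s0) - to_signed(s1)) + s2) & MASK32

print(v_sad_u32(-5, 0, 0))           # 4294967291
print(v_sad_u32_vega_doc(-5, 0, 0))  # 5
```
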
## `s_bfe_*`

The RDNA, Vega and GCN3 ISA references all state that these instructions don't
write SCC. They do.

## `v_bcnt_u32_b32`

The Vega ISA reference writes its behaviour as:

```
D.u = 0;
for i in 0 ... 31 do
D.u += (S0.u[i] == 1 ? 1 : 0);
endfor.
```

This is incorrect. The actual behaviour (and number of operands) is what
is written in the GCN3 reference guide:

```
D.u = CountOneBits(S0.u) + S1.u.
```

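A minimal Python sketch of the two-operand form is shown here for illustration. The function names are hypothetical, and wrapping the result to 32 bits is an assumption based on the `D.u` destination.

```python
MASK32 = 0xFFFFFFFF

def v_bcnt_u32_b32(s0, s1):
    """D.u = CountOneBits(S0.u) + S1.u, wrapped to 32 bits."""
    return (bin(s0 & MASK32).count("1") + (s1 & MASK32)) & MASK32

def popcount64(value):
    """Chain two calls, using S1 as an accumulator, to count bits in 64 bits."""
    lo = value & MASK32
    hi = (value >> 32) & MASK32
    return v_bcnt_u32_b32(hi, v_bcnt_u32_b32(lo, 0))

print(v_bcnt_u32_b32(0xF0F0, 0))       # 8
print(v_bcnt_u32_b32(0xF0F0, 10))      # 18
print(popcount64(0xFFFFFFFF00000001))  # 33
```

The accumulator operand is what the Vega description omits: with it, popcounts can be summed without a separate add.
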
## `v_alignbyte_b32`

All versions of the ISA document are vague about it, but after some trial and
error we discovered that only 2 bits of the 3rd operand are used.
Therefore, this instruction can't shift more than 24 bits.

The correct description of `v_alignbyte_b32` is probably the following:

```
D.u = ({S0, S1} >> (8 * S2.u[1:0])) & 0xffffffff
```

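The formula above can be emulated with a few lines of Python. This sketch assumes `{S0, S1}` means S0 is the high dword of the 64-bit value; the function name is simply reused from the instruction for readability.

```python
MASK32 = 0xFFFFFFFF

def v_alignbyte_b32(s0, s1, s2):
    """D.u = ({S0, S1} >> (8 * S2.u[1:0])) & 0xffffffff"""
    combined = ((s0 & MASK32) << 32) | (s1 & MASK32)  # S0 is the high dword
    shift = 8 * (s2 & 0x3)                            # only bits [1:0] are used
    return (combined >> shift) & MASK32

print(hex(v_alignbyte_b32(0x11223344, 0x55667788, 1)))  # 0x44556677
print(hex(v_alignbyte_b32(0x11223344, 0x55667788, 5)))  # 0x44556677 (5 & 3 == 1)
print(hex(v_alignbyte_b32(0x11223344, 0x55667788, 4)))  # 0x55667788 (4 & 3 == 0)
```

A byte shift of 4 or more therefore wraps around instead of selecting bytes further into S0, which is why the maximum effective shift is 24 bits.
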
## SMEM stores

The Vega ISA reference doesn't say this (or doesn't make it clear), but
the offset for SMEM stores must be in m0 if IMM == 0.

The RDNA ISA doesn't mention SMEM stores at all, but they seem to be supported
by the chip and are present in LLVM. AMD devs, however, strongly recommend
avoiding these instructions.

## SMEM atomics

As with SMEM stores, the RDNA ISA doc pretends they don't exist, but they
are present in LLVM.

## VMEM stores

All reference guides say (under "Vector Memory Instruction Data Dependencies"):

> When a VM instruction is issued, the address is immediately read out of VGPRs
> and sent to the texture cache. Any texture or buffer resources and samplers
> are also sent immediately. However, write-data is not immediately sent to the
> texture cache.

Reading that, one might think that waitcnts need to be added when writing to
the registers used for a VMEM store's data. Experimentation has shown that this
does not seem to be the case on GFX8 and GFX9 (GFX6 and GFX7 are untested). It
also seems unlikely, since NOPs are apparently needed in a subset of these
situations.

## MIMG opcodes on GFX8/GCN3

The `image_atomic_{swap,cmpswap,add,sub}` opcodes in the GCN3 ISA reference
guide are incorrect. The Vega ISA reference guide has the correct ones.

## VINTRP encoding

The Vega ISA doc says the encoding should be `110010`, but `110101` works.

## VOP1 instructions encoded as VOP3

The RDNA ISA doc says that `0x140` should be added to the opcode, but that
doesn't work. What works is adding `0x180`, which LLVM also does.

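As a tiny illustration of the opcode arithmetic on RDNA, the constant and function names below are hypothetical, and `0x01` is an arbitrary VOP1 opcode chosen only for the example:

```python
VOP1_AS_VOP3_OFFSET_DOC = 0x140      # value given in the RDNA ISA doc (doesn't work)
VOP1_AS_VOP3_OFFSET_WORKING = 0x180  # value that works and that LLVM also uses

def vop1_to_vop3_opcode(vop1_opcode):
    """Return the VOP3 opcode for a VOP1 instruction on RDNA."""
    return vop1_opcode + VOP1_AS_VOP3_OFFSET_WORKING

print(hex(vop1_to_vop3_opcode(0x01)))  # 0x181
```
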
## FLAT, Scratch, Global instructions

The NV bit was removed in RDNA, but some parts of the doc still mention it.

Section 13.8.1 of the RDNA ISA doc says that SADDR should be set to 0x7f when
ADDR is used, but section 9.3.1 says it should be set to NULL. We assume 9.3.1
is correct and set it to SGPR_NULL.

## Legacy instructions

Some instructions have a `_LEGACY` variant which implements "DX9 rules", in which
the zero "wins" in multiplications, i.e. `0.0*x` is always `0.0`. The Vega ISA
doc mentions `V_MAC_LEGACY_F32`, but this instruction does not actually exist on
Vega.

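The "zero wins" rule can be illustrated with a small Python sketch. `mul_legacy` is a hypothetical helper, not an emulation of any specific instruction, and it does not try to model the sign of the resulting zero.

```python
import math

def mul_legacy(a, b):
    """Multiply under "DX9 rules": a zero operand always yields 0.0."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return a * b

print(0.0 * math.inf)             # nan (IEEE 754 multiplication)
print(mul_legacy(0.0, math.inf))  # 0.0 (DX9 rules)
print(mul_legacy(0.0, math.nan))  # 0.0 (DX9 rules)
print(mul_legacy(2.0, 3.0))       # 6.0
```
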
## `m0` with LDS instructions on Vega and newer

The Vega ISA doc (both the old one and the "7nm" one) claims that LDS instructions
use the `m0` register for address clamping like older GPUs, but this is not the case.

In reality, only the `_addtid` variants of LDS instructions use `m0` on Vega and
newer GPUs, so the relevant section of the RDNA ISA doc seems to apply.
LLVM also doesn't emit any initialization of `m0` for LDS instructions, and this
was also confirmed by AMD devs.

## RDNA L0, L1 cache and DLC, GLC bits

The old L1 cache was renamed to L0, and a new L1 cache was added to RDNA. There
is one L1 cache per shader array. Some instruction encodings have DLC and
GLC bits that interact with the cache.

* DLC ("device level coherent") bit: controls the L1 cache
* GLC ("globally coherent") bit: controls the L0 cache

The recommendation from AMD devs is to always set these two bits at the same time,
as it doesn't make much sense to set them independently, aside from some
circumstances (e.g. we needn't set DLC when only one shader array is used).

Stores and atomics always bypass the L1 cache, so they don't support the DLC bit,
and it shouldn't be set in these cases. Setting DLC in these cases can result
in graphical glitches or hangs.

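The rules above could be condensed into a helper like the following. This is a hypothetical sketch, not ACO's actual logic: the function name and parameters are invented, and it only covers coherent loads and stores, because the text above says nothing about atomics beyond DLC being unsupported for them.

```python
def coherent_cache_bits(is_load, single_shader_array=False):
    """Return (glc, dlc) for a coherent load or store, per the notes above."""
    glc = True
    # Stores bypass the L1 cache, so DLC must never be set for them.
    # For loads, DLC may be skipped when only one shader array is in use.
    dlc = is_load and not single_shader_array
    return glc, dlc

print(coherent_cache_bits(is_load=True))                            # (True, True)
print(coherent_cache_bits(is_load=False))                           # (True, False)
print(coherent_cache_bits(is_load=True, single_shader_array=True))  # (True, False)
```
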
## RDNA `s_dcache_wb`

`s_dcache_wb` is not mentioned in the RDNA ISA doc, but it is needed in order
to achieve correct behavior in some SSBO CTS tests.

## RDNA subvector mode

The documentation of `s_subvector_loop_begin` and `s_subvector_loop_end` is not clear
on what sort of addressing should be used, but it says that it
"is equivalent to an `S_CBRANCH` with extra math", so the subvector loop handling
in ACO is done according to the `s_cbranch` doc.

## RDNA early rasterization

The ISA documentation says about `s_endpgm`:

> The hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0
> before executing this instruction.

What the doc doesn't say is that in case of NGG (and legacy VS) when there
are no param exports, the driver sets `NO_PC_EXPORT=1` for optimal performance,
and when this is set, the hardware will start clipping and rasterization
as soon as it encounters a position export with `DONE=1`, without waiting
for the NGG (or VS) to finish.

It can even launch PS waves before NGG (or VS) ends.

When this happens, any store performed by a VS is not guaranteed
to be complete when PS tries to load it, so we need to manually
make sure to insert wait instructions before the position exports.

# Hardware Bugs

## SMEM corrupts VCCZ on SI/CI

[See this LLVM source.](https://github.com/llvm/llvm-project/blob/acb089e12ae48b82c0b05c42326196a030df9b82/llvm/lib/Target/AMDGPU/SIInsertWaits.cpp#L580-L616)

After issuing an SMEM instruction, we need to wait for it to finish and then
write to vcc (for example, `s_mov_b64 vcc, vcc`) to correct vccz.

Currently, we don't do this.

## SGPR offset on MUBUF prevents addr clamping on SI/CI

[See this LLVM source.](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp#L1917-L1922)

This leads to wrong bounds checking; using a VGPR offset fixes it.

## GCN / GFX6 hazards

### VINTRP followed by a read with `v_readfirstlane` or `v_readlane`

One wait state must be inserted if the dst VGPR of any `v_interp_*` is
subsequently read by `v_readfirstlane` or `v_readlane`, to fix GPU hangs on GFX6.
Note that `v_writelane_*` is apparently not affected. This hazard isn't
documented anywhere, but AMD confirmed it.

## RDNA / GFX10 hazards

### SMEM store followed by a load with the same address

We found that an `s_buffer_load` will produce incorrect results if it is preceded
by an `s_buffer_store` with the same address. Inserting an `s_nop` between them
does not mitigate the issue, so an `s_waitcnt lgkmcnt(0)` must be inserted.
This is not mentioned by LLVM among the other GFX10 bugs, but LLVM doesn't use
SMEM stores, so it's not surprising that they didn't notice it.

### VMEMtoScalarWriteHazard

Triggered by:
VMEM/FLAT/GLOBAL/SCRATCH/DS instruction reads an SGPR (or EXEC, or M0).
Then, a SALU/SMEM instruction writes the same SGPR.

Mitigated by:
A VALU instruction or an `s_waitcnt vmcnt(0)` between the two instructions.

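To make the trigger and mitigation conditions concrete, here is a toy detector over a simplified instruction list. It is only a sketch under stated assumptions: the `Instr` class, its `kind` strings and the register names are invented for this example and do not correspond to ACO's or LLVM's data structures.

```python
from dataclasses import dataclass

VECTOR_MEM_KINDS = {"vmem", "flat", "global", "scratch", "ds"}

@dataclass
class Instr:
    kind: str  # e.g. "vmem", "ds", "salu", "smem", "valu", "waitcnt_vmcnt0"
    sgpr_reads: frozenset = frozenset()   # SGPRs (or "exec", "m0") read
    sgpr_writes: frozenset = frozenset()  # SGPRs (or "exec", "m0") written

def find_vmem_to_scalar_write_hazards(instrs):
    """Return indices of SALU/SMEM instructions that would trigger the hazard."""
    pending = set()  # SGPRs read by vector memory instructions, not yet mitigated
    hazards = []
    for index, instr in enumerate(instrs):
        if instr.kind in VECTOR_MEM_KINDS:
            pending |= instr.sgpr_reads
        elif instr.kind in ("valu", "waitcnt_vmcnt0"):
            pending.clear()  # either of these mitigates the hazard
        elif instr.kind in ("salu", "smem") and pending & instr.sgpr_writes:
            hazards.append(index)
    return hazards

load = Instr("vmem", sgpr_reads=frozenset({"s4"}))
write = Instr("salu", sgpr_writes=frozenset({"s4"}))
print(find_vmem_to_scalar_write_hazards([load, write]))                 # [1]
print(find_vmem_to_scalar_write_hazards([load, Instr("valu"), write]))  # []
```

A real pass would insert the mitigation at each reported index rather than just listing it.
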
### SMEMtoVectorWriteHazard

Triggered by:
An SMEM instruction reads an SGPR. Then, a VALU instruction writes that same SGPR.

Mitigated by:
Any non-SOPP SALU instruction (except `s_setvskip`, `s_version`, and any non-lgkmcnt `s_waitcnt`).

### Offset3fBug

Any branch that is located at offset 0x3f will be buggy. Just insert some NOPs to make sure no branch
is located at this offset.

### InstFwdPrefetchBug

According to LLVM, the `s_inst_prefetch` instruction can cause a hang.
There are no further details.

### LdsMisalignedBug

When there is a misaligned multi-dword FLAT load/store instruction in WGP mode,
it needs to be split into multiple single-dword FLAT instructions.

ACO doesn't use FLAT load/store on GFX10, so it is unaffected.

### FlatSegmentOffsetBug

The 12-bit immediate OFFSET field of FLAT instructions must always be 0.
GLOBAL and SCRATCH are unaffected.

ACO doesn't use FLAT load/store on GFX10, so it is unaffected.

### VcmpxPermlaneHazard

Triggered by:
Any permlane instruction that follows any VOPC instruction.
AMD devs confirmed that, despite the name, this doesn't only affect `v_cmpx`.

Mitigated by: any VALU instruction except `v_nop`.

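The condition can be sketched as a small scan over instruction kinds. This is an illustrative toy, not ACO's implementation: the kind strings are invented, and it conservatively assumes that "follows" means "with no mitigating VALU in between", not only "immediately after".

```python
def find_vcmpx_permlane_hazards(kinds):
    """kinds: list of "vopc", "permlane", "valu", "v_nop" or "other" (non-VALU).

    Returns the indices of permlane instructions that would trigger the hazard.
    """
    after_vopc = False
    hazards = []
    for index, kind in enumerate(kinds):
        if kind == "permlane" and after_vopc:
            hazards.append(index)
        if kind == "vopc":
            after_vopc = True
        elif kind in ("valu", "permlane"):
            # Any VALU other than v_nop mitigates; "v_nop" and "other" do not.
            after_vopc = False
    return hazards

print(find_vcmpx_permlane_hazards(["vopc", "permlane"]))                    # [1]
print(find_vcmpx_permlane_hazards(["vopc", "v_nop", "other", "permlane"]))  # [3]
print(find_vcmpx_permlane_hazards(["vopc", "valu", "permlane"]))            # []
```
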
### VcmpxExecWARHazard

Triggered by:
Any non-VALU instruction reads the EXEC mask. Then, any VALU instruction writes the EXEC mask.

Mitigated by:
A VALU instruction that writes an SGPR (or has a valid SDST operand), or `s_waitcnt_depctr 0xfffe`.
Note: `s_waitcnt_depctr` is an internal instruction, so there is no further information
about what it does or what its operand means.

### LdsBranchVmemWARHazard

Triggered by:
VMEM/GLOBAL/SCRATCH instruction, then a branch, then a DS instruction,
or vice versa: DS instruction, then a branch, then a VMEM/GLOBAL/SCRATCH instruction.

Mitigated by:
Only `s_waitcnt_vscnt null, 0`. Needed even if the first instruction is a load.

### NSAClauseBug

"MIMG-NSA in a hard clause has unpredictable results on GFX10.1"

### NSAMaxSize5

NSA MIMG instructions should be limited to 3 dwords before GFX10.3 to avoid
stability issues: https://reviews.llvm.org/D103348