# Unofficial GCN/RDNA ISA reference errata

## `v_sad_u32`

The Vega ISA reference writes its behaviour as:

```
D.u = abs(S0.i - S1.i) + S2.u.
```

This is incorrect. The actual behaviour is what is written in the GCN3 reference
guide:

```
ABS_DIFF (A,B) = (A>B) ? (A-B) : (B-A)
D.u = ABS_DIFF (S0.u,S1.u) + S2.u
```

The instruction doesn't subtract S1 from S0 and take the absolute value (the
_signed_ distance); it uses the _unsigned_ distance between the operands. So
`v_sad_u32(-5, 0, 0)` would return `4294967291` (`-5` interpreted as unsigned),
not `5`.

## `s_bfe_*`

The RDNA, Vega and GCN3 ISA references all state that these instructions don't
write SCC. They do.

## `v_bcnt_u32_b32`

The Vega ISA reference writes its behaviour as:

```
D.u = 0;
for i in 0 ... 31 do
    D.u += (S0.u[i] == 1 ? 1 : 0);
endfor.
```

This is incorrect. The actual behaviour (and number of operands) is what
is written in the GCN3 reference guide:

```
D.u = CountOneBits(S0.u) + S1.u.
```

## `v_alignbyte_b32`

All versions of the ISA document are vague about it, but after some trial and
error we discovered that only 2 bits of the 3rd operand are used.
Therefore, this instruction can't shift more than 24 bits.
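
As a rough illustration of the corrected ALU semantics above, including the
two-bit limit just described, here is a small Python model. It is only a sketch:
the function names mirror the instruction names for readability, operands are
plain integers masked to 32 bits, and it assumes that results wrap to 32 bits and
that S0 forms the high dword of the 64-bit value in `v_alignbyte_b32`.

```python
MASK32 = 0xFFFFFFFF


def v_sad_u32(s0, s1, s2):
    """Unsigned absolute difference of S0 and S1, plus S2 (GCN3 wording)."""
    a, b = s0 & MASK32, s1 & MASK32
    diff = a - b if a > b else b - a
    # Assumption: the sum wraps around at 32 bits.
    return (diff + s2) & MASK32


def v_bcnt_u32_b32(s0, s1):
    """Count the set bits of S0 and add S1 (two operands, not one)."""
    return (bin(s0 & MASK32).count("1") + s1) & MASK32


def v_alignbyte_b32(s0, s1, s2):
    """Byte-wise right shift of {S0, S1}; only S2[1:0] is honoured."""
    combined = ((s0 & MASK32) << 32) | (s1 & MASK32)
    return (combined >> (8 * (s2 & 0b11))) & MASK32
```

In this model `v_sad_u32(-5, 0, 0)` evaluates to `4294967291`, and
`v_alignbyte_b32(a, b, 5)` gives the same result as a shift operand of `1`,
because the upper bits of the third operand are ignored.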

The correct description of `v_alignbyte_b32` is probably the following:

```
D.u = ({S0, S1} >> (8 * S2.u[1:0])) & 0xffffffff
```

## SMEM stores

The Vega ISA reference doesn't say this (or doesn't make it clear), but
the offset for SMEM stores must be in m0 if IMM == 0.

The RDNA ISA doesn't mention SMEM stores at all, but they seem to be supported
by the chip and are present in LLVM. AMD devs however highly recommend avoiding
these instructions.

## SMEM atomics

RDNA ISA: same as the SMEM stores, the ISA pretends they don't exist, but they
are there in LLVM.

## VMEM stores

All reference guides say (under "Vector Memory Instruction Data Dependencies"):

> When a VM instruction is issued, the address is immediately read out of VGPRs
> and sent to the texture cache. Any texture or buffer resources and samplers
> are also sent immediately. However, write-data is not immediately sent to the
> texture cache.

Reading that, one might think that waitcnts need to be added when writing to
the registers used for a VMEM store's data. Experimentation has shown that this
does not seem to be the case on GFX8 and GFX9 (GFX6 and GFX7 are untested). It
also seems unlikely, since NOPs are apparently needed in a subset of these
situations.

## MIMG opcodes on GFX8/GCN3

The `image_atomic_{swap,cmpswap,add,sub}` opcodes in the GCN3 ISA reference
guide are incorrect. The Vega ISA reference guide has the correct ones.

## VINTRP encoding

The Vega ISA doc says the encoding should be `110010`, but `110101` works.

## VOP1 instructions encoded as VOP3

The RDNA ISA doc says that `0x140` should be added to the opcode, but that doesn't
work. What works is adding `0x180`, which LLVM also does.

## FLAT, Scratch, Global instructions

The NV bit was removed in RDNA, but some parts of the doc still mention it.

RDNA ISA doc 13.8.1 says that SADDR should be set to 0x7f when ADDR is used, but
9.3.1 says it should be set to NULL. We assume 9.3.1 is correct and set it to
SGPR_NULL.

## Legacy instructions

Some instructions have a `_LEGACY` variant which implements "DX9 rules", in which
the zero "wins" in multiplications, i.e. `0.0*x` is always `0.0`. The Vega ISA
mentions `V_MAC_LEGACY_F32`, but this instruction is not really there on Vega.

## `m0` with LDS instructions on Vega and newer

The Vega ISA doc (both the old one and the "7nm" one) claims that LDS instructions
use the `m0` register for address clamping like older GPUs, but this is not the case.

In reality, only the `_addtid` variants of LDS instructions use `m0` on Vega and
newer GPUs, so the relevant section of the RDNA ISA doc seems to apply.
LLVM also doesn't emit any initialization of `m0` for LDS instructions, and this
was also confirmed by AMD devs.

## RDNA L0, L1 cache and DLC, GLC bits

The old L1 cache was renamed to L0, and a new L1 cache was added to RDNA.
There is one L1 cache per shader array. Some instruction encodings have DLC and
GLC bits that interact with the cache.

* DLC ("device level coherent") bit: controls the L1 cache
* GLC ("globally coherent") bit: controls the L0 cache

The recommendation from AMD devs is to always set these two bits at the same time,
as it doesn't make much sense to set them independently, aside from some
circumstances (e.g. we needn't set DLC when only one shader array is used).

Stores and atomics always bypass the L1 cache, so they don't support the DLC bit,
and it shouldn't be set in these cases. Setting the DLC bit in these cases can result
in graphical glitches or hangs.

## RDNA `s_dcache_wb`

`s_dcache_wb` is not mentioned in the RDNA ISA doc, but it is needed in order
to achieve correct behavior in some SSBO CTS tests.

## RDNA subvector mode

The documentation of `s_subvector_loop_begin` and `s_subvector_mode_end` is not clear
on what sort of addressing should be used, but it says that it
"is equivalent to an `S_CBRANCH` with extra math", so the subvector loop handling
in ACO is done according to the `s_cbranch` doc.

## RDNA early rasterization

The ISA documentation says about `s_endpgm`:

> The hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0
> before executing this instruction.

What the doc doesn't say is that in case of NGG (and legacy VS) when there
are no param exports, the driver sets `NO_PC_EXPORT=1` for optimal performance,
and when this is set, the hardware will start clipping and rasterization
as soon as it encounters a position export with `DONE=1`, without waiting
for the NGG (or VS) to finish.

It can even launch PS waves before NGG (or VS) ends.

When this happens, any store performed by a VS is not guaranteed
to be complete when PS tries to load it, so we need to manually
make sure to insert wait instructions before the position exports.

# Hardware Bugs

## SMEM corrupts VCCZ on SI/CI

[See this LLVM source.](https://github.com/llvm/llvm-project/blob/acb089e12ae48b82c0b05c42326196a030df9b82/llvm/lib/Target/AMDGPU/SIInsertWaits.cpp#L580-L616)

After issuing an SMEM instruction, we need to wait for the SMEM instruction to
finish and then write to vcc (for example, `s_mov_b64 vcc, vcc`) to correct vccz.

Currently, we don't do this.

## SGPR offset on MUBUF prevents addr clamping on SI/CI

[See this LLVM source.](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp#L1917-L1922)

This leads to wrong bounds checking; using a VGPR offset fixes it.

## GCN / GFX6 hazards

### VINTRP followed by a read with `v_readfirstlane` or `v_readlane`

It's required to insert 1 wait state if the dst VGPR of any `v_interp_*` is
followed by a read with `v_readfirstlane` or `v_readlane` to fix GPU hangs on GFX6.
Note that `v_writelane_*` is apparently not affected. This hazard isn't
documented anywhere but AMD confirmed it.

## RDNA / GFX10 hazards

### SMEM store followed by a load with the same address

We found that an `s_buffer_load` will produce incorrect results if it is preceded
by an `s_buffer_store` with the same address. Inserting an `s_nop` between them
does not mitigate the issue, so an `s_waitcnt lgkmcnt(0)` must be inserted.
This is not mentioned by LLVM among the other GFX10 bugs, but LLVM doesn't use
SMEM stores, so it's not surprising that they didn't notice it.

### VMEMtoScalarWriteHazard

Triggered by:
VMEM/FLAT/GLOBAL/SCRATCH/DS instruction reads an SGPR (or EXEC, or M0).
Then, a SALU/SMEM instruction writes the same SGPR.

Mitigated by:
A VALU instruction or an `s_waitcnt vmcnt(0)` between the two instructions.

### SMEMtoVectorWriteHazard

Triggered by:
An SMEM instruction reads an SGPR. Then, a VALU instruction writes that same SGPR.

Mitigated by:
Any non-SOPP SALU instruction (except `s_setvskip`, `s_version`, and any non-lgkmcnt `s_waitcnt`).
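
The two hazards above can be sketched as a linear scan over an instruction list.
This is a simplified model and not how ACO or LLVM implement it: the `Inst` class,
the `kind` strings and `find_hazards` are hypothetical names made up for this
example, the `s_waitcnt vmcnt(0)` mitigation is matched by name only, and an
`s_waitcnt` with lgkmcnt is not modelled as a mitigation at all.

```python
from dataclasses import dataclass

# Hypothetical instruction classes, used only by this sketch.
VMEM_LIKE = {"VMEM", "FLAT", "GLOBAL", "SCRATCH", "DS"}
SALU_NON_MITIGATING = {"s_setvskip", "s_version"}


@dataclass
class Inst:
    kind: str                        # "VMEM", "DS", "SMEM", "SALU", "VALU", "SOPP", ...
    name: str
    reads: frozenset = frozenset()   # SGPRs read (may include "exec" or "m0")
    writes: frozenset = frozenset()  # SGPRs written


def find_hazards(insts):
    """Return (index, hazard name) for every instruction that completes a hazard."""
    vmem_reads = set()  # SGPRs read by VMEM-like instructions, not yet mitigated
    smem_reads = set()  # SGPRs read by SMEM instructions, not yet mitigated
    hazards = []
    for i, inst in enumerate(insts):
        # A SALU/SMEM write to an SGPR that a VMEM-like instruction still reads.
        if inst.kind in ("SALU", "SMEM") and vmem_reads & inst.writes:
            hazards.append((i, "VMEMtoScalarWriteHazard"))
        # A VALU write to an SGPR that an SMEM instruction still reads.
        if inst.kind == "VALU" and smem_reads & inst.writes:
            hazards.append((i, "SMEMtoVectorWriteHazard"))

        # Mitigations: forget the pending reads once a mitigating instruction passes.
        if inst.kind == "VALU" or inst.name == "s_waitcnt vmcnt(0)":
            vmem_reads.clear()
        if inst.kind == "SALU" and inst.name not in SALU_NON_MITIGATING:
            smem_reads.clear()

        # Record the SGPR reads of the current instruction.
        if inst.kind in VMEM_LIKE:
            vmem_reads |= inst.reads
        if inst.kind == "SMEM":
            smem_reads |= inst.reads
    return hazards
```

A real hazard pass would insert the mitigating instruction at the reported
index instead of only reporting it, and would also have to follow branches
rather than a single straight-line list.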

### Offset3fBug

Any branch that is located at offset 0x3f will be buggy. Just insert some NOPs to make sure no branch
is located at this offset.

### InstFwdPrefetchBug

According to LLVM, the `s_inst_prefetch` instruction can cause a hang.
There are no further details.

### LdsMisalignedBug

When there is a misaligned multi-dword FLAT load/store instruction in WGP mode,
it needs to be split into multiple single-dword FLAT instructions.

ACO doesn't use FLAT load/store on GFX10, so it is unaffected.

### FlatSegmentOffsetBug

The 12-bit immediate OFFSET field of FLAT instructions must always be 0.
GLOBAL and SCRATCH are unaffected.

ACO doesn't use FLAT load/store on GFX10, so it is unaffected.

### VcmpxPermlaneHazard

Triggered by:
Any permlane instruction that follows any VOPC instruction.
AMD devs confirmed that, despite the name, this doesn't only affect `v_cmpx`.

Mitigated by: any VALU instruction except `v_nop`.

### VcmpxExecWARHazard

Triggered by:
Any non-VALU instruction reads the EXEC mask. Then, any VALU instruction writes the EXEC mask.

Mitigated by:
A VALU instruction that writes an SGPR (or has a valid SDST operand), or `s_waitcnt_depctr 0xfffe`.
Note: `s_waitcnt_depctr` is an internal instruction, so there is no further information
about what it does or what its operand means.

### LdsBranchVmemWARHazard

Triggered by:
VMEM/GLOBAL/SCRATCH instruction, then a branch, then a DS instruction,
or vice versa: DS instruction, then a branch, then a VMEM/GLOBAL/SCRATCH instruction.

Mitigated by:
Only `s_waitcnt_vscnt null, 0`. Needed even if the first instruction is a load.

### NSAClauseBug

"MIMG-NSA in a hard clause has unpredictable results on GFX10.1"

### NSAMaxSize5

NSA MIMG instructions should be limited to 3 dwords before GFX10.3 to avoid
stability issues: https://reviews.llvm.org/D103348