Human-written content
This is a vibe-coded fuzzer for the AMDGPU path in LLVM.
We test the full LLVM IR -> AMDGPU assembly compilation path, although in practice most of the bugs we're finding are in the AMDGPU-specific parts of the compiler.
The idea is to:
- generate programs that have defined semantics (no UB or poison),
- compile them with -O0 and -O2,
- ensure that -O0 and -O2 have the same result, and
- compare that result to that of a trusted interpreter.
In most of the reproducers we've found, -O0 gives the wrong result and -O2 gives the correct result. My untested hypothesis is that we could find reproducers for most of these bugs at -O2 as well; it's just that LLVM is good at simplifying code, and simpler code is less likely to hit a backend bug.
I initially used LLVM HEAD as the primary fuzzing target, but many of the bugs I found didn't reproduce in the latest ROCm release. (IOW HEAD has regressions compared to the release.) Seeing this, I figured I should be fuzzing the release instead. After m038, AMD asked us to switch active fuzzing back to HEAD builds; the current upstream LLVM HEAD column has llvm/llvm-project#196418, llvm/llvm-project#198412, llvm/llvm-project#198491, llvm/llvm-project#198508, and llvm/llvm-project#198556 applied locally (the last three are AMD-provided fixes for the m001, m003/m005/m012/m014, and m026-m029 bug classes; 198556 supersedes the older 198373 and 198419 bitop3 fixes that previous builds carried). In any case, the table of results below shows which versions reproduce which bugs.
Everything below this line is AI-generated. You probably only care about the "bugs generated" table. Good luck.
This directory contains the AMDGPU fuzzer work area. It is intentionally
separate from the PTX / ptxas fuzzer in ../ptx/.
The AMDGPU fuzzer is the directed C++ libFuzzer target in fuzzer/. Its only
input format is an LLVM bitcode module containing an AMDGPU kernel named
fuzz_kernel. For each input module, the fuzzer compiles the kernel through
-O0 and -O2 LLVM pipelines, links both code objects into one HSACO, runs
both kernels through HIP, and compares device output. Set
FUZZX_USE_LLVM_INTERPRETER_ORACLE=1 to also run an LLVM-interpreter oracle
for modules that do not use AMDGPU-specific intrinsics beyond workgroup and
workitem IDs and do not use FP types. Pure LLVM integer bit-counting and
byte-swap intrinsics are allowed in oracle-compatible modules. The interpreter
clone scalarizes vector integer intrinsics and lowers safe LLVM integer min/max,
saturation, absolute value, funnel-shift, bit-reverse, and overflow intrinsics
to plain IR before execution. Oracle findings include the expected value in
mismatch.txt.
Set FUZZX_REQUIRE_LLVM_INTERPRETER_ORACLE=1 for an oracle-focused campaign
where mutation and crossover keep only interpreter-compatible modules.
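The comparison described above can be sketched roughly as follows. This is a minimal Python model, not the fuzzer's C++ code: the function name `classify`, the verdict strings, and the representation of device output as a list of integers are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def classify(o0: Sequence[int], o2: Sequence[int],
             oracle: Optional[Sequence[int]] = None) -> str:
    """Return a verdict for one input module (labels are hypothetical)."""
    if oracle is not None:
        # With an oracle we can say which pipeline disagrees with it.
        o0_ok = list(o0) == list(oracle)
        o2_ok = list(o2) == list(oracle)
        if o0_ok and o2_ok:
            return "pass"
        if o0_ok:
            return "O2-wrong"
        if o2_ok:
            return "O0-wrong"
        return "both-wrong"
    # Without an oracle, only a disagreement between the two is observable.
    return "pass" if list(o0) == list(o2) else "O0/O2-mismatch"
```

Note that without the oracle, a bug that affects -O0 and -O2 identically would go unnoticed in this model, which is one reason an independent reference is useful.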
The custom mutator and crossover operate on LLVM IR rather than on raw bytes.
They currently build a conservative, defined subset of integer IR: no undef,
no explicit poison values, no nuw / nsw / exact, no inbounds, no
integer division except nonzero-denominator udiv / urem, only masked or
constant shift amounts, and only the fixed skeleton input load/output store.
Coverage includes:

- scalar i32 integer arithmetic, bitwise ops, compares/selects, masked dynamic
  shifts, and rare signed division/remainder by proven-positive divisors;
- standalone i8 / i16 scalar subexpressions and i64 subexpressions truncated
  to i32;
- <2 x i32> / <4 x i32> vector subexpressions including fixed shufflevector
  masks, and narrow <4/8 x i8> / <4/8 x i16> vector subexpressions reduced
  back to i32;
- scalar and vector forms of LLVM bit/min/max/saturation/absolute intrinsics;
- narrow scalar funnel shifts and unsigned division/remainder by
  proven-nonzero denominators;
- explicit i1 boolean subexpressions reduced back to i32;
- pure-IR idioms: unsigned min/max and saturating add/sub select, masked
  funnel-shift/rotate, signed add/sub overflow select, predicate-mask
  blend/sign, bitfield extract/insert, byte/word pack-unpack, widening
  multiply-high/low, byte dot-product chain, bit-count/bit-twiddle,
  average/absolute-difference, lane clamp/saturating-pack, vector
  shuffle/horizontal-reduction, carry/borrow-chain, dynamic byte
  extraction/permutation, compare-rank/mask, ternary bit-logic, 64-bit pair
  arithmetic, byte-prefix/permutation, overflow-chain, select lookup-table,
  nibble reduction, SWAR bit tricks, byte compare/mask, limb multiply/add,
  select-network, vector compare/mask pack, byte Horner-mix, bit
  ballot/matrix-pack, halfword compare/pack, nibble table-lookup, bit
  deposit/extract, i64 byte-permutation, narrow-vector min/max, byte-lane
  select, halfword dot-accumulate, rotate/mask cascade, vector byte gather,
  byte-prefix compare and byte median/range, i64 cross-lane fold, vector
  pairwise byte arithmetic, byte permute-control, bit-run mask, i64
  multiply-fold, halfword blend-network, byte ternary-blend, halfword
  prefix-sum, i64 rotate-add, vector compare bitmask, byte carry propagation,
  bit-slice boolean, vector splat/blend, i64 compare/pack, nibble carry-chain,
  halfword saturating-difference, i64 bitfield-mix, vector lane mix/pack, byte
  saturating pack, halfword multiply-high, i64 prefix-fold, and vector byte
  rotate/pack;
- LLVM bit, min/max, saturation, absolute-value, funnel-shift, and integer
  overflow intrinsics.

It also emits a small AMDGPU-specific pure
integer-intrinsic subset covering BFE, SAD/MSAD, lerp, 24-bit multiply,
packed SAD/MQSAD, alignbyte, signed first-bit-high, mbcnt, perm,
explicit bitop3, readfirstlane, wave reductions, and integer dot-product
operations, plus bounded AMDGPU FP/packing intrinsics such as
fmed3, frexp, fract, class, and packed FP/int conversions. Known
sudot* and fma.legacy instruction-selection crashes are gated off by
default. It also emits a finite
scalar FP subset by masking
inputs to small nonnegative integers, converting with uitofp, using exact
fadd / fmul / nonzero-denominator fdiv / fcmp / select shapes, and
converting back with in-range fptoui; a signed variant uses small
sign-extended integers, sitofp, fadd / fsub / fmul /
nonzero-denominator fdiv, and in-range fptosi. It also emits finite scalar
half and <2/4 x half> / <2/4 x float> vector FP subexpressions reduced
back to i32. The mutator can
also wrap the current result in structured two-way
branches, wider multi-way switches, branch/PHI cascades, and deeper bounded CFG
subgraphs with i32 phi joins. Those subgraphs can nest more diamonds, switches,
cascades, and small counted loops with optional guarded early exits. The mutator
also generates top-level counted loops with bounded constant or dynamically
masked trip counts whose bodies can contain nested diamonds, switches, cascades,
and inner loops. A dedicated loop-nest mutation wraps an inner counted loop and
optional tail CFG inside an outer bounded loop. A complex-CFG mutation chains
several nested subgraphs before the final store, so a single corpus entry can
contain multiple high-fanout joins and loop nests instead of just one wrapper
around the result. Some generated loops carry two independent i32 accumulator
phis, combine them after the loop, take a guarded early exit from the loop
body through an exit phi, or switch from the loop body to multiple distinct exit
values before one joined exit phi, so corpus entries exercise both expression
simplification and CFG and loop transforms. CFG arms include the same scalar
integer, bit, boolean, narrowing, saturating, funnel-shift, finite-FP, and vector
expression families as the linear mutator. Scalar and CFG expressions can also
mix in extra i32 global input loads from in[seed % n]; these loads are only
emitted inside the existing idx < n guard and are bounded by the module
validator.
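To illustrate why the restrictions above keep generated programs free of UB and poison, here is a small Python model of two of the guarded shapes. It is a sketch under assumptions: the helper names are hypothetical, and `d | 1` is only one way to make a denominator provably nonzero (the bug descriptions mention `x | 1` divisors, but the mutator's exact guards are not specified here).

```python
MASK32 = 0xFFFFFFFF  # model i32 values as Python ints in [0, 2**32)


def shl_masked(x: int, amt: int) -> int:
    """Model of `shl i32 x, (amt & 31)`.

    An LLVM shift by an amount >= the bit width yields poison; masking the
    amount to 0..31 rules that out for i32.
    """
    return (x << (amt & 31)) & MASK32


def udiv_nonzero(x: int, d: int) -> int:
    """Model of `udiv i32 x, (d | 1)`; division by zero would be UB."""
    return (x & MASK32) // ((d | 1) & MASK32)


def urem_nonzero(x: int, d: int) -> int:
    """Model of `urem i32 x, (d | 1)`."""
    return (x & MASK32) % ((d | 1) & MASK32)
```

For example, `shl_masked(1, 33)` shifts by 1 rather than by 33, and `udiv_nonzero(10, 0)` divides by 1 rather than by 0.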
Corpus files can be inspected directly with opt -S corpus-entry -o -.
| Component | Notes |
|---|---|
| ROCm LLVM | Defaults to /opt/rocm-7.1.1/lib/llvm/bin/clang-20, lld, and llvm-objdump; override with CLANG, LLD, and LLVM_OBJDUMP. |
| HIP | hipcc is used to build the module runner. |
| AMDGPU | Defaults to gfx950; override with --mcpu. |
Build the current upstream-HEAD LLVM fuzzing toolchain and run the directed C++ GPU differential fuzzer:
scripts/build_instrumented_llvm.sh
scripts/build_directed_fuzzer.sh
HIP_DEVICE=0 scripts/run_directed_fuzzer.sh -runs=100 -max_len=131072

Run one directed fuzzer process per GPU:

scripts/run_directed_multigpu_fuzzer.sh -runs=1000 -max_len=131072

Run multiple directed fuzzer workers on each selected GPU:

WORKERS_PER_GPU=2 scripts/run_directed_multigpu_fuzzer.sh -runs=1000 -max_len=131072

Multi-GPU runs share one live libFuzzer corpus by default, so workers can
reload inputs discovered by other workers while keeping per-worker logs and
artifact directories. Set FUZZX_CORPUS_MODE=isolated to return to one
independent corpus directory per worker.
Fresh corpus directories are seeded with a valid LLVM bitcode module before
libFuzzer starts. Set FUZZX_IMPORT_CORPUS to one or more colon-separated
files or directories to copy an older corpus into the fresh corpus before
workers launch.
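The import step can be pictured with the following Python sketch. It is not the run script's actual implementation: the function names are hypothetical, and the handling of missing paths and file-name clashes is an assumption made for the example.

```python
import shutil
from pathlib import Path
from typing import List


def split_import_spec(spec: str) -> List[str]:
    """Split a colon-separated path list, ignoring empty entries."""
    return [entry for entry in spec.split(":") if entry]


def import_corpus(spec: str, dest: Path) -> int:
    """Copy every regular file named by `spec` into `dest`; return the count."""
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for entry in split_import_spec(spec):
        src = Path(entry)
        if src.is_file():
            files = [src]
        elif src.is_dir():
            files = sorted(p for p in src.rglob("*") if p.is_file())
        else:
            continue  # assumption: missing paths are skipped silently
        for f in files:
            target = dest / f.name
            if not target.exists():  # assumption: first copy wins on a clash
                shutil.copy2(f, target)
                copied += 1
    return copied
```

Under these assumptions, a call such as `import_corpus("/tmp/old-run/corpus:/tmp/extra.bc", Path("/tmp/fresh"))` would flatten both sources into one directory.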
For the current upstream-HEAD campaign, run multiple workers across all GPUs:
GPUS="0 1 2 3 4 5 6 7" WORKERS_PER_GPU=12 \
FUZZX_REQUIRE_LLVM_INTERPRETER_ORACLE=1 \
FUZZX_IMPORT_CORPUS=/tmp/old-run/corpus/directed-gpu/shared \
scripts/run_directed_multigpu_fuzzer.sh \
-max_total_time=900 -max_len=131072 -rss_limit_mb=8192 -use_value_profile=1

With an optimized LLVM build using sanitizer coverage and no ASan, the directed
fuzzer currently reaches about 500 exec/s aggregate across 8 GPUs.
Keep the corpus, logs, artifacts, findings, and TMPDIR on a local filesystem;
the run scripts default these hot paths to /tmp/fuzzx-amdgpu-$USER through
FUZZX_RUNTIME_ROOT. Avoid putting them on WekaFS or another shared filesystem,
because libFuzzer produces a high rate of tiny metadata and log writes. The run
scripts also copy the fuzzer binary into the local runtime root by default
before spawning workers; set FUZZX_LOCALIZE_FUZZER=0 to disable that. When
Weka client frontend processes reserve dedicated CPU cores, the run scripts
default FUZZX_CPUSET=auto, detect single-core-pinned wekanode processes, and
run fuzzer workers through taskset on the remaining CPUs. Set
FUZZX_CPUSET=none to disable this or FUZZX_CPUSET=0-63 to use an explicit
CPU set.
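As an illustration of the CPU-set arithmetic, the sketch below parses a list such as `0-63`, removes reserved cores, and prints the remainder in the comma/range form that `taskset -c` accepts. The function names are hypothetical, and how the run scripts actually detect the reserved cores is not modelled.

```python
from typing import Iterable, Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Parse a CPU list such as "0-3,8" into a set of CPU numbers."""
    cpus: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def format_cpu_list(cpus: Iterable[int]) -> str:
    """Collapse CPU numbers into comma-separated ranges."""
    runs = []
    for cpu in sorted(set(cpus)):
        if runs and cpu == runs[-1][1] + 1:
            runs[-1][1] = cpu  # extend the current run
        else:
            runs.append([cpu, cpu])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)


def remaining_cpus(all_cpus: str, reserved: Iterable[int]) -> str:
    """CPU list left for fuzzer workers after removing reserved cores."""
    return format_cpu_list(parse_cpu_list(all_cpus) - set(reserved))
```

For instance, with CPUs `0-7` and cores 2 and 5 reserved, the workers would be pinned to `0-1,3-4,6-7`.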
For historical ROCm 7.2.3 release fuzzing, use the release wrapper:
scripts/run_rocm_7_2_3_release_fuzzer.sh -max_total_time=900 -max_len=131072 -rss_limit_mb=8192 -use_value_profile=1

That wrapper selects the ROCm 7.2.3 fuzzer build instead of the current upstream-HEAD fuzzer build.
Candidate compiler crashes, compile/link failures, or output mismatches are
saved under $FUZZX_RUNTIME_ROOT/findings by default. Generated corpora and
findings are local artifacts and are ignored by git; set FUZZX_RUNTIME_ROOT,
CORPUS_ROOT, LOG_DIR, ARTIFACT_ROOT, or FUZZX_FINDINGS_DIR to override
the default local runtime paths.
Known bug patterns are suppressed by default so continued fuzzing does not keep rediscovering the same issue.
| Flag | Default | Meaning |
|---|---|---|
| FUZZX_ALLOW_M016_SCALAR_FSHL=1 | unset | Re-enable scalar llvm.fshl.i32 generation for m015, m016, and m070; the legacy FUZZX_ALLOW_M015_SCALAR_FSHL_ZERO=1 flag is also accepted. |
| FUZZX_ALLOW_M026_UMAX_XOR_AND_HIGHBIT=1 | unset | Re-enable (umax(a, b) ^ b) & umax(a, b) shapes for m026. |
| FUZZX_ALLOW_M028_UMAX_AND_NOT=1 | unset | Re-enable (umax((y & ~x), C) & x) & ~x shapes for m028. |
| FUZZX_ALLOW_M030_CTLZ_SHL_OR_BITOP3=1 | unset | Re-enable or(add(shl(...), z), z) and or(smin(add(shl(...), z), z), z) tails for m030. |
| FUZZX_ALLOW_M031_VECTOR_OR_EXTRACT_SUB=1 | unset | Re-enable subtracting two scalar extracts from the same vector or for m031. |
| FUZZX_ALLOW_M032_LOOP_VECTOR_SELECT=1 | unset | Re-enable loop-carried values whose backedge depends on a vector select for m032. |
| FUZZX_ALLOW_M035_WAVE_REDUCE_XOR=1 | unset | Re-enable llvm.amdgcn.wave.reduce.xor generation for m035. |
| FUZZX_ALLOW_M036_WAVE_REDUCE_ADD=1 | unset | Re-enable llvm.amdgcn.wave.reduce.add generation for m036. |
| FUZZX_ALLOW_M039_SEXT_I8_HIGHBYTE=1 | unset | Re-enable sext i8 to i32 values feeding high-byte extraction for m039. |
| FUZZX_ALLOW_M040_SIGNED_DIVREM24=1 | unset | Re-enable signed sdiv / srem by small odd denominators when the numerator is not known to fit signed 24-bit for m040. |
| FUZZX_ALLOW_M041_ASHR_HIGHBYTE_PACK=1 | unset | Re-enable high-byte extraction from ashr i32 values feeding byte-pack shapes for m041. |
| FUZZX_ALLOW_M045_UREM_OR_ONE=1 | unset | Re-enable urem x, (x \| 1) shapes for m045. |
| FUZZX_ALLOW_M046_V4I16_CTTZ=1 | unset | Re-enable llvm.cttz.v4i16 shapes for m046. |
| FUZZX_ALLOW_M047_V8I8_SHL=1 | unset | Re-enable <8 x i8> vector shl shapes for m047. |
| FUZZX_ALLOW_M048_V8I8_UADD_SAT=1 | unset | Re-enable llvm.uadd.sat.v8i8 shapes for m048. |
| FUZZX_ALLOW_M049_VECTOR_FSHL=1 | unset | Re-enable vector llvm.fshl calls for m049; the legacy FUZZX_ALLOW_M049_VECTOR_FSHL_ZERO=1 flag is also accepted. |
| FUZZX_ALLOW_M051_VECTOR_FSHR_LOOP=1 | unset | Re-enable vector llvm.fshr calls for m051. |
| FUZZX_ALLOW_M052_TERNARY_BLEND_SHIFT=1 | unset | Re-enable ((a ^ b) \| (b & ~(a ^ b))) & 31 shift masks for m052. |
| FUZZX_ALLOW_M053_BYTEDOT_HIGHBIT=1 | unset | Re-enable byte-dot result values feeding a high-bit mask for m053. |
| FUZZX_ALLOW_M054_I64_PAIR_LOW_ADD=1 | unset | Re-enable ((zext x << 32) \| 0xffff) + zext x pair-add shapes for m054. |
| FUZZX_ALLOW_M055_I64BYTEPERM_LOOP=1 | unset | Re-enable loop-carried values depending on i64 byte-permutation idioms for m055. |
| FUZZX_ALLOW_M056_HALFDOT_BRANCH=1 | unset | Re-enable low-bit branch keys depending on halfword-dot pack values for m056. |
| FUZZX_ALLOW_M057_ROTCASCADE_STORE=1 | unset | Re-enable final stores depending on rotate-cascade values for m057. |
| FUZZX_ALLOW_M058_NIBBLE_BYTESEL_HIGHBIT=1 | unset | Re-enable byte-lane select carry values derived from nibble-table packs for m058. |
| FUZZX_ALLOW_M060_PACKUNPACK_BYTEDOT=1 | unset | Re-enable final stores depending on generated packunpack byte-dot sums for m060. |
| FUZZX_ALLOW_M061_OVMASKPACK_OVERFLOW=1 | unset | Re-enable final stores depending on generated ovmaskpack overflow/byte-pack values for m061. |
| FUZZX_ALLOW_M062_BYTEHIST_BITMUX=1 | unset | Re-enable final stores depending on both generated bytehist and bitmux values for m062. |
| FUZZX_ALLOW_M063_OVERFLOW_CARRY_BITOP3=1 | unset | Re-enable final stores depending on generated carry values for m063. |
| FUZZX_ALLOW_M064_NIBBLECARRY_LOOP=1 | unset | Re-enable loop-carried final stores depending on generated nibblecarry values for m064. |
| FUZZX_ALLOW_M065_USUB_OVERFLOW_XOR_FOLD=1 | unset | Re-enable final stores depending on generated ovbytegather values for m065. |
| FUZZX_ALLOW_M066_VECI16ZEXTMUL_BITOP3_LOOP=1 | unset | Re-enable loop-carried final stores depending on generated veci16zextmul values for m066. |
| FUZZX_ALLOW_M067_BYTECONDSEL_SELF_AND=1 | unset | Re-enable final stores depending on generated bytecondsel values for m067. |
| FUZZX_ALLOW_M068_LOOP_VOP3FUSED_UMAXBITOP3=1 | unset | Re-enable final stores depending on generated umaxbitop3cascade values for m068 (shares a suppressor with m069). |
| FUZZX_ALLOW_M069_UMAXBITOP3CASCADE_STORE=1 | unset | Same umaxbitop3cascade suppressor as m068; see m069. |
| FUZZX_ALLOW_C001_SUDOT_ISEL_ICE=1 | unset | Re-enable llvm.amdgcn.sudot4 / llvm.amdgcn.sudot8 generation for c001. |
| FUZZX_ALLOW_C002_FMA_LEGACY_ISEL_ICE=1 | unset | Re-enable llvm.amdgcn.fma.legacy generation for c002. |
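Several of the suppressed shapes involve funnel shifts, so a reference model can help when reading the reproducers. The sketch below follows the LLVM language reference definition of `llvm.fshl` / `llvm.fshr` for i32 (concatenate the first two operands, shift by the third modulo the bit width, keep the high or low half). The Python function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def fshl32(a: int, b: int, c: int) -> int:
    """Reference model of llvm.fshl.i32(a, b, c)."""
    wide = ((a & MASK32) << 32) | (b & MASK32)  # a is the high half
    return ((wide << (c % 32)) >> 32) & MASK32  # keep the high 32 bits


def fshr32(a: int, b: int, c: int) -> int:
    """Reference model of llvm.fshr.i32(a, b, c)."""
    wide = ((a & MASK32) << 32) | (b & MASK32)
    return (wide >> (c % 32)) & MASK32  # keep the low 32 bits
```

In this model `fshl32(x, y, 0)` is `x` and `fshl32(x, 0, 8)` is `x << 8` truncated to 32 bits, which are the expected results that the m015 and m070 descriptions contrast with the observed -O0 output.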
| Path | Purpose |
|---|---|
| third_party/llvm-project | LLVM source checkout, pinned as a git submodule. |
| patches/llvm-pr-198373.diff | Local source-fix patch for the current HEAD campaigns; scripts/build_instrumented_llvm.sh applies it by default to the selected LLVM_PROJECT_DIR. |
| patches/llvm-pr-196418.diff | Local patch for unsigned LowerDIVREM24; scripts/build_instrumented_llvm.sh applies it by default to the selected LLVM_PROJECT_DIR. |
| patches/llvm-pr-198412.diff | Local patch for non-add AMDGPU dot-product add-chain matching; scripts/build_instrumented_llvm.sh applies it by default to the selected LLVM_PROJECT_DIR. |
| patches/llvm-pr-198419.diff | Local source-fix patch for AMDGPU BitOp3_Op shared-source aliasing; scripts/build_instrumented_llvm.sh applies it by default to the selected LLVM_PROJECT_DIR. |
| scripts/build_instrumented_llvm.sh | Helper for configuring a sanitizer-coverage LLVM source build. |
| scripts/build_directed_fuzzer.sh | Builds the C++ GPU differential libFuzzer target. |
| scripts/seed_ir_corpus.sh | Writes the initial LLVM bitcode corpus seed. |
| scripts/run_directed_fuzzer.sh | Runs the C++ directed fuzzer on one GPU. |
| scripts/run_directed_multigpu_fuzzer.sh | Runs one or more C++ directed fuzzer processes per selected GPU. |
| scripts/run_rocm_7_2_3_release_fuzzer.sh | Runs the C++ directed fuzzer with the ROCm 7.2.3 release build. |
| fuzzer/ | LLVM API plus HIP differential libFuzzer target. |
| runner/hip_module_runner.cpp | HIP module loader used to execute generated HSACO files. |
| known-miscompiles/ | Reduced or standalone reproducers for confirmed findings. |
Except where otherwise noted, these have been tested on gfx950. The result
columns report the generic known-miscompiles/run_ll_reproducer.sh
differential test: ✅ means no mismatch was observed for that reproducer, and
❌ means the toolchain reproduces the -O0 / -O2 mismatch.
Confirmed compiler ICEs should be recorded here too, with the table entry
describing the crashing toolchain and phase instead of a differential result.
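Some of the expected values quoted in the results table can be checked by hand. The Python sketch below does this for a few entries (m035, m036, m045, m052, m075). It is only a model: the helper names are hypothetical, and the wave reductions assume all 64 lanes of a wave are active and hold the same value.

```python
import struct

MASK32 = 0xFFFFFFFF


def wave_reduce_xor_uniform(value: int, active_lanes: int = 64) -> int:
    """XOR of the same value over all active lanes (assumed uniform)."""
    return value if active_lanes % 2 else 0


def wave_reduce_add_uniform(value: int, active_lanes: int = 64) -> int:
    """i32 sum of the same value over all active lanes (assumed uniform)."""
    return (value * active_lanes) & MASK32


def urem_or_one(x: int) -> int:
    """Model of `urem i32 x, (x | 1)`."""
    return (x & MASK32) % ((x | 1) & MASK32)


def ternary_blend_mask(a: int, b: int) -> int:
    """Model of the m052 shift mask ((a ^ b) | (b & ~(a ^ b))) & 31."""
    t = (a ^ b) & MASK32
    return (t | (b & ~t & MASK32)) & 31


def f32_bits(value: float) -> int:
    """Bit pattern of `value` rounded to IEEE-754 binary32."""
    return struct.unpack("<I", struct.pack("<f", value))[0]
```

Under these assumptions the xor reduction of 30 is 0, the add reduction of 65536 is 0x00400000, `urem x, (x | 1)` is x for even x and 0 for odd x, the m052 mask equals `(a | b) & 31`, and the exact reciprocal of 2^127 has the denormal bit pattern 0x00400000.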
Tested toolchains as of 2026-05-19:
| Column | Toolchain |
|---|---|
| ROCm release | ROCm 7.2.3 source tag, commit f58b06dce1f9c15707c5f808fd002e18c2accf7e; also checked against the matching ROCm 7.2.3 rocm-llvm package, package SHA256 4c406e184f88949cea60869949454e5392e1cbd9480c4c87274f7b59e9f810e5. |
| LLVM HEAD | https://github.com/llvm/llvm-project/commit/0dd29960cd6102b37651cc3f58f872652099b83b (2026-05-18) plus llvm/llvm-project#196418, llvm/llvm-project#198412, llvm/llvm-project#198491, llvm/llvm-project#198508, and llvm/llvm-project#198556, built Release with sanitizer coverage, no ASan. |
| ROCm HEAD | https://github.com/ROCm/llvm-project/commit/a5de13684ba84db953b28e632ea304080a4318d0 (2026-05-18) plus llvm/llvm-project#196418, llvm/llvm-project#198412, llvm/llvm-project#198491, llvm/llvm-project#198508 (source-only; the patch's .ll test diffs do not apply against ROCm-staging baseline checks), and llvm/llvm-project#198556, built with assertions, ASan, and sanitizer coverage. |
| Bug | ROCm 7.2.3 | LLVM HEAD | ROCm HEAD | Description |
|---|---|---|---|---|
| m001-ashr-i16-zext | ❌ | ✅ | ✅ | ashr i16 feeding zext i16 to i32 is folded to a sign-extending SDWA byte select; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198491. |
| m002-i8-clear-xor | ✅ | ✅ | ✅ | -O0 lowers a byte-clear xor through v_bitop3_b32 with the wrong result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373. |
| m003-shl3-add-chain | ✅ | ✅ | ✅ | -O0 scalarizes a divergent shl3/add chain through v_readfirstlane_b32; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198508. |
| m004-vector-identity-xor | ✅ | ✅ | ✅ | -O0 loses a lane-0 vector identity before xor; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373. |
| m005-shl1-add-chain | ✅ | ✅ | ✅ | -O0 scalarizes a divergent shl1/add chain through the same class of bug as m003; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198508. |
| m006-i8-xor-clear | ✅ | ✅ | ✅ | -O0 lowers another adjacent i8 narrow byte-clear xor through the wrong v_bitop3_b32 result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373. |
| m007-vector-shl-identity-xor | ✅ | ✅ | ✅ | -O0 loses a vector shift-by-zero lane-0 identity before xor; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373. |
| m008-i8-separated-clear | ✅ | ✅ | ✅ | -O0 miscompiles an i8 identity byte-clear xor when prior narrow ops are separated by no-op adds; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373. |
| m009-i16-clear-xor | ✅ | ✅ | ✅ | -O0 miscompiles an i16 identity low-16 clear xor through the wrong v_bitop3_b32 result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373. |
| m010-i16-sext-clear-xor | ✅ | ✅ | ✅ | -O0 miscompiles an i16 sign-extended identity clear xor through the wrong v_bitop3_b32 result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373. |
| m011-i8-sext-clear-xor | ✅ | ✅ | ✅ | -O0 miscompiles an i8 sign-extended masked clear xor through the wrong v_bitop3_b32 result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373. |
| m012-add-shl-ladder | ✅ | ✅ | ✅ | -O0 scalarizes a divergent add/shl ladder through v_readfirstlane_b32; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198508. |
| m013-private-memory-fshl | ❌ | ❌ | ❌ | -O0 lowers fixed private-memory allocas through a dynamic scratch stack sequence that can return intermittent wrong values. |
| m014-shl-add-ctpop | ✅ | ✅ | ✅ | -O0 scalarizes a four-step shl/add chain feeding ctpop through lane 0; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198508. |
| m015-scalar-fshl-zero | ✅ | ❌ | ❌ | -O0 lowers scalar fshl.i32(x, y, 0) through a 64-bit shift-by--1 sequence that returns zero. |
| m016-scalar-fshl-one | ✅ | ❌ | ❌ | -O0 lowers scalar fshl.i32(x, y, 1) through a 64-bit shift-by--1 sequence that returns only bit 31. |
| m017-vector-and-lane0-clear-xor | ❌ | ✅ | ✅ | ROCm 7.2.3 -O0 drops a vector lane-0 and/extractelement clear before xor; LLVM HEAD and ROCm HEAD already pass. |
| m018-two-private-memory-ops | ❌ | ✅ | ✅ | ROCm 7.2.3 -O0 intermittently reads stale scratch data across two private-memory sequences; LLVM HEAD and ROCm HEAD pass 50 repeated combined runs. |
| m019-highbit-or-xor | ❌ | ✅ | ✅ | -O0 combines a high-bit (x \| C) ^ x expression into v_bitop3_b32 with the wrong truth table or operands; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198419. |
| m020-or-xor-and | ❌ | ✅ | ✅ | -O0 combines ((a \| b) ^ b) & (a \| b) into v_bitop3_b32 with the wrong result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198419. |
| m021-fshl-or-xor | ❌ | ✅ | ✅ | -O0 combines a dynamic (a \| b) ^ a expression after fshl into v_bitop3_b32 with the wrong result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198419. |
| m022-and-xor-constant | ❌ | ✅ | ✅ | -O0 combines ((x ^ C) & x) after a dynamic and into v_bitop3_b32 with the wrong low bit; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198419. |
| m023-and-xor-identity | ❌ | ✅ | ✅ | -O0 combines (x & y) ^ x into v_bitop3_b32 with the wrong identity result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198419. |
| m024-udiv-or-one | ❌ | ✅ | ✅ | -O0 lowers unsigned division of a sign-extended i16 value by x \| 1 through an imprecise float reciprocal path; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#196418. |
| m025-urem-or-one | ❌ | ✅ | ✅ | -O0 lowers unsigned remainder of a sign-extended i16 value by x \| 1 through the same imprecise reciprocal path; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#196418. |
| m026-shl-umax-xor-and | ❌ | ❌ | ❌ | -O2 combines a shifted umax high-bit extraction into v_bitop3_b32 using the input and salt instead of their xor; llvm/llvm-project#198556 does not catch this shape. |
| m027-xor-and-or | ❌ | ✅ | ✅ | -O0 combines (((y ^ x) & x) \| base) into v_bitop3_b32 with the wrong bit when x is (base ^ z) & base; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198556. |
| m028-umax-and-not | ❌ | ❌ | ❌ | -O0 combines (umax((y & ~x), C) & x) & ~x into v_bitop3_b32 using the input and salt separately; llvm/llvm-project#198556 does not catch this shape. |
| m029-fshl-select-phi | ❌ | ✅ | ✅ | -O2 lowers a signed compare/select over y & x, where x is a complemented masked fshl, so the true zero arm is chosen when the signed compare is false; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198556. |
| m030-ctlz-shl-or-bitop3 | ❌ | ❌ | ❌ | -O2 lowers a low-bit or through v_bitop3_b32 using the unmasked %n value instead of %n & 1. |
| m031-vector-or-extract-sub | ❌ | ✅ | ✅ | ROCm 7.2.3 -O2 scalarizes a vector or extract/sub as or(x, 255) - x instead of or(x, 255) - -1. |
| m032-loop-vector-select | ❌ | ✅ | ✅ | ROCm 7.2.3 -O2 kills the loop EXEC mask before storing a loop-carried value derived from a vector select. |
| m033-sub-zext-bool-fp | ❌ | ✅ | ✅ | -O2 lowers sub i32 X, zext(i1 Cond) through s_subb_u32 with the wrong false-case borrow before a masked FP accumulation; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198412. |
| m034-fshl-add-workitem-product | ❌ | ✅ | ✅ | -O2 rewrites a workitem-product fshl/add chain as a byte dot product that returns 0xffffffff instead of 0xc0000000 for x == 0; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198412. |
| m035-wave-reduce-xor-constant | ❌ | ✅ | ✅ | ROCm 7.2.3 -O2 folds llvm.amdgcn.wave.reduce.xor.i32(30, 0) to 30 instead of the even-wave XOR result 0. |
| m036-wave-reduce-add-constant | ❌ | ✅ | ✅ | ROCm 7.2.3 -O2 folds llvm.amdgcn.wave.reduce.add.i32(65536, 1) to 65536 instead of the full-wave sum 0x00400000. |
| m037-dot4-square-lowbit | ❌ | ✅ | ✅ | -O2 lowers a byte-masked x*x + (x*x & 1) expression to v_perm_b32 / v_dot4_u32_u8 with an extra constant accumulator; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198412. |
| m038-loop-fp-mask-xor | ❌ | ✅ | ✅ | -O2 unrolls nested xor loops and folds a masked integer-to-FP round-trip into a byte-dot sequence that adds 1023 for input zero; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198412. |
| m039-sext-i8-highbyte-pack | ❌ | ❌ | ❌ | -O2 packs bytes after an i8 sign-extension but clears the byte lanes contributed by the sign bits. |
| m040-sdivrem24-boundary | ❌ | ❌ | ❌ | -O2 applies the signed 24-bit reciprocal division lowering when the positive numerator has bit 23 set, returning a quotient one too large. |
| m041-ashr-highbyte-pack | ❌ | ❌ | ❌ | -O2 lowers a byte pack after ashr i32 to v_perm_b32 with the wrong high-byte lane. |
| m042-or-lshr-zero-xor | ✅ | ✅ | ✅ | -O0 lowered or x, (lshr x, 0) where x is (a ^ b) \| ((a ^ b) >> 1) through the wrong v_bitop3_b32; LLVM HEAD passes after llvm/llvm-project#198373. |
| m043-zext-i8-self-xor | ✅ | ✅ | ✅ | -O0 lowered xor x, x, where x is zext(trunc(workitem.id.x)) ^ 1, through v_bitop3_b32; LLVM HEAD passes after llvm/llvm-project#198373. |
| m044-v4i32-self-and-zero-shuffle | ✅ | ✅ | ✅ | -O0 lowered a <4 x i32> and x, x lane ORed with a zero shuffle through v_bitop3_b32; LLVM HEAD passes after llvm/llvm-project#198373. |
| m045-urem-or-one-known24 | ❌ | ❌ | ❌ | -O2 lowers urem x, (x \| 1) with known 24-bit x to 0x00ffffff instead of x when even x is smaller than x \| 1; explicit masking can make -O0 wrong too. |
| m046-v4i16-cttz-funnel-loop | ✅ | ❌ | ❌ | -O2 miscomputes a dynamic-trip nested loop whose body extracts a lane from llvm.cttz.v4i16 and feeds a funnel-shift-shaped scalar expression. |
| m047-bytedot-v8i8-shl-loop | ✅ | ❌ | ❌ | -O2 folds a byte-dot-style dynamic loop with a <8 x i8> vector shift to 4 for lanes where -O0 produces smaller values. |
| m048-v8i8-uadd-sat-vecreduce-loop | ✅ | ❌ | ❌ | -O2 miscomputes a loop using llvm.uadd.sat.v8i8 followed by byte extraction and a two-lane vector-reduce xor/and idiom, changing the low bits by two. |
| m049-vector-fshl-zero | ✅ | ❌ | ❌ | -O0 lowers vector llvm.fshl.v4i32(x, 0, 0) through a 64-bit shift-by--1 sequence that returns zero instead of the selected vector lane. |
| m050-bitcount-and-sub-zero | ✅ | ✅ | ✅ | -O0 lowered and X, (X - 0) feeding ctpop through the wrong v_bitop3_b32; LLVM HEAD passes after llvm/llvm-project#198373. |
| m051-vector-fshr-divergent-loop | ✅ | ❌ | ❌ | -O2 scalarizes a vector llvm.fshr.v2i32 loop tail and carries one scalar inner-loop result into divergent lanes that exited earlier. |
| m052-ternary-blend-shift | ✅ | ❌ | ❌ | -O0 lowers ((a ^ b) \| (b & ~(a ^ b))) & 31 as a & 31, dropping b before a funnel-shift-like expression. |
| m053-bytedot-highbit | ✅ | ❌ | ❌ | LLVM HEAD and ROCm HEAD -O0 lower a byte-dot/high-bit expression through a changed v_bitop3_b32 / v_bfi_b32 sequence that clears a high bit before a final xor. |
| m054-i64-pair-low-add | ❌ | ❌ | ❌ | -O2 folds ((zext x << 32) \| 0xffff) + zext x into a u24 multiply-add-like sequence that drops the high-half copy of x. |
| m055-i64byteperm-loop-readfirstlane | ✅ | ❌ | ✅ | LLVM HEAD -O0 miscompiles a loop-carried value depending on an i64 byte-permutation fold, returning 0xffffffff instead of 0xff22dd00; ROCm 7.2.3 and ROCm HEAD pass. |
| m056-halfdot-lowbit-branch | ✅ | ❌ | ❌ | LLVM HEAD and ROCm HEAD -O0 miscompute a low-bit branch key derived from a halfword-dot byte pack and store zero instead of 0xfffd7ffc. |
| m057-rotcascade-store | ✅ | ❌ | ✅ | LLVM HEAD -O0 miscomputes a repeated rotate/popcount/bitselect cascade before the final store; ROCm 7.2.3 and ROCm HEAD pass. |
| m058-nibble-bytesel-highbit | ❌ | ❌ | ❌ | -O0/-O2 disagree on the high bit of a funnel-shift-shaped final store when a byte-lane select carry is derived from a nibble-table pack; the original oracle finding has LLVM HEAD -O0 wrong. |
| m059-srem-loop-branch | ✅ | ✅ | ✅ | A stale LLVM HEAD build missing llvm/llvm-project#198373 skipped a live lane when a multi-exit loop branch key came from srem; the current patched toolchains pass. |
| m060-packunpack-bytedot-dot4 | ❌ | ❌ | ❌ | -O2 folds a generated packunpack three-term byte-dot sum into v_dot4_u32_u8 with the wrong packed byte or accumulator, returning 0x1e35 instead of 0x1f98. |
| m061-ovmaskpack-o0-overflow-lowering | ✅ | ❌ | ❌ | LLVM HEAD and ROCm HEAD -O0 mislower an unoptimized overflow-mask-pack chain and store 0xa1df8800 instead of the oracle/-O2 result 0xa0df8400; ROCm 7.2.3 passes. |
| m062-bytehist-bitmux-lowbyte | ✅ | ❌ | ❌ | LLVM HEAD and ROCm HEAD -O0 lower a bytehist/bitmux low-byte expression through v_bitop3_b32 and store 0xb81c0001 instead of the oracle/-O2 result 0xb81c0002; ROCm 7.2.3 passes. |
| m063-overflow-carry-bitop3 | ✅ | ❌ | ❌ | LLVM HEAD and ROCm HEAD -O0 lower an overflow-derived duplicated carry expression through v_bitop3_b32 and store 0x6 instead of the oracle/-O2 result 0x2; ROCm 7.2.3 passes. |
| m064-nibblecarry-loop-readfirstlane | ✅ | ❌ | ❌ | LLVM HEAD and ROCm HEAD -O0 scalarize a divergent nibble-carry-derived loop value through v_readfirstlane_b32 and store 0x1805d9 instead of the oracle/-O2 result 0xc1b09; ROCm 7.2.3 passes. |
| m065-usub-overflow-xor-fold | ✅ | ❌ | ❌ | LLVM HEAD and ROCm HEAD -O0 fold (lane ^ fold) & 1 after usub.with.overflow into a single v_bitop3_b32 with the wrong truth table, storing 0x0 instead of the oracle/-O2 result 0x1; ROCm 7.2.3 passes. |
| m066-veci16zextmul-bitop3-loop | ❌ | ❌ | ❌ | -O2 miscompiles a 12-iteration loop whose body builds <4 x i16> from the accumulator halves, zext-multiplies against constants, xor-reduces, smaxes two lanes, and xors the result back; exit value goes through a bitop3 cascade and stores 0x8BD601F1 instead of the oracle/-O0 result 0x2BE83DE2. |
| m067-bytecondsel-and-i1-self | ✅ | ❌ | ❌ | LLVM HEAD and ROCm HEAD -O0 mislower select i1 (and i1 X, X) c, 0 (where X = icmp ult i32 a, 0, always false) by evaluating the select as if the condition were true, storing 0xCE instead of the oracle/-O2 result 0x59; ROCm 7.2.3 passes. |
| m068-loop-vop3fused-umaxbitop3 | ❌ | ❌ | ❌ | -O2 miscompiles a nested loop whose accumulator is seeded from vop3fused + umaxbitop3cascade shapes, storing 0x937E instead of the oracle/-O0 0x8210A05D. |
| m069-umaxbitop3cascade-store | ❌? | ❌ | ❌? | -O2 miscompiles a final store whose value is fuzz.umaxbitop3cascade.idiom.a.add, storing 0x5C83AF47 instead of the oracle/-O0 0x814EF57. Sibling bug to m068; ROCm 7.2.3 / ROCm HEAD not yet verified. |
| m070-scalar-fshl-shift8 | ✅ | ❌ | ❌ | -O0 lowers scalar fshl.i32(x, 0, 8) to a 64-bit shift by -8, returning x >> 24 instead of x << 8; same lowering family as m015/m016 but shows the bug applies to every non-zero constant shift, not just c=1. |
| m071-bxorand-or-and-not-bitop3 | ❌ | ❌ | ❌ | -O0 lowers `((b ^ (c & a)) |
| m072-bitop3-tand-or-and-not-zero | ✅ | ❌ | ❌ | -O0 lowers `((b & (a & c)) |
| m073-bitop3-t1t2-and-or-xor | ❌ | ❌ | ❌ | -O0 lowers `((a&b) & (a |
| m074-fmed3-nan-ieee-off-maxmin | ❌ | ❌ | ❌ | -O2 InstCombine fold of amdgcn.fmed3(x, y, NaN) in IEEE-off mode produces maximumnum(x, y) instead of minimumnum(x, y); the polarity check in AMDGPUInstCombineIntrinsic.cpp only treats -inf as "Min" and defaults NaN to "Max", inconsistent with both the documented behaviour table and the parallel arms for Src0/Src1. |
| m075-rcp-constant-denormal-flush | ❌ | ❌ | ❌ | -O2 InstCombine fold of amdgcn.rcp.f32(C) returns the exact 1/C even when the kernel's f32 denormal mode is PreserveSign (the default) and the hardware would have flushed the denormal result to ±0. For C = 2^127 the fold returns 0x00400000 while v_rcp_f32 returns 0. A TODO next to the fold already calls out this issue. |
| m076-sffbh-umin-knownbits-check | ✅ | ❌ | ❌ | -O2 SDAG fold of umin(amdgcn.sffbh(x), Clamp) to sffbh(x) fires when x is provably non-zero but x = -1 is still reachable, because the negative side of the check uses the weak !Known.isAllOnes() ("not provably all-ones") instead of "provably not all-ones". For x = (load | 1) with input 0xFFFFFFFE the fold returns -1 instead of Clamp = 32. HEAD-only regression. |
| m077-rcp-constant-denormal-input | ❌ | ❌ | ❌ | -O2 InstCombine fold of amdgcn.rcp.f32(C) ignores the kernel's f32 denormal mode on the input side: for a denormal constant C = 2^-127 (0x00400000) the fold returns the exact 2^127 (0x7f000000) while v_rcp_f32 on gfx950 with the default PreserveSign mode first flushes the denormal input to ±0 and then returns +Inf (0x7f800000). Distinct from m075 (which is the same fold's output-side flush bug). |
| m078-wave-reduce-fsub-f64-dpp-identity | ❌ | ❌ | ❌ | The DPP-strategy lowering of llvm.amdgcn.wave.reduce.fsub.f64 (and the SGPR-uniform V_ADD_F64 arm) uses the generic FP64 additive identity -0.0 from getIdentityValueForWaveReduction, but the ITERATIVE strategy explicitly overrides that to +0.0 with the comment "+0.0 for double sub reduction". For all-zero input the two strategies disagree on the sign of zero (iterative: +0.0, DPP/uniform: -0.0); IEEE chained 0-0-...-0 rounds to +0.0, so the DPP/uniform path is wrong. Not an -O0/-O2 mismatch (strategy is an immarg), so the reproducer XORs the two strategies' bit-patterns inside one kernel. |
| m079-fcmp-icmp-i64-wave32-fold | ❌ | ❌ | ❌ | -O2 InstCombine "always-true" fold for amdgcn.fcmp/icmp blindly uses II.getType() as the type for read_register("exec", ...). On wave32 with .i64 return the fold therefore reads the full 64-bit EXEC pair, leaking the architecturally-unused EXEC_HI into the high 32 bits; -O0's SDAG path correctly emits v_cmp + zext i32 -> i64 so the high bits are zero. Sibling miscompile to the c007 ICE (wave64/.i32). Demonstrated as static asm divergence -- the FuzzX box has no wave32 GPU. |
| m080-gisel-clamp-i64-i16-degenerate | ❌ | ❌ | ❌ | The GlobalISel AMDGPUPreLegalizerCombiner::matchClampI64ToI16 validator ORs both orderings of (Cmp1, Cmp2), but the matcher distinguishes two patterns. For pattern 1 = smin(smax(X, Cmp2), Cmp1) with Cmp1 < Cmp2 the IR is identically Cmp1, but the combiner rewrites it to med3(min, X_packed, max) -- a real clamp that returns X whenever it falls inside [Cmp1, Cmp2]. For Cmp1=5, Cmp2=100, X=50 the IR evaluates to 5 but the compiled code returns 50. GISel-only (-mllvm -global-isel); standard SDAG path is unaffected. |
| m081-gisel-wave-shuffle-half-check | ❌ | ❌ | ❌ | GISel selectWaveShuffleIntrin for wave64 GFX10/GFX11 builds the same-or-other-half check by XORing ThreadID with set_inactive(Index << 2) instead of the unshifted Index. The & 32 then extracts bit 3 of the original index instead of bit 5, so for any index where bit 3 ≠ bit 5 the selector routes through the wrong one of {ds_bpermute, permlane64-then-bpermute} and returns the value from the opposite 32-lane half. SDAG's lowerWaveShuffle keeps the unshifted Index for the XOR and is correct. Demonstrated as static asm divergence on gfx1100 +wavefrontsize64 -- the box has no wave64 GFX10/GFX11 GPU. |
| m082-kernarg-range-md-width-mismatch | ❌ | ❌ | ❌ | AMDGPULowerKernelArguments widens any sub-dword scalar kernarg load to i32, then transplants the argument's range ParamAttr onto the widened load via MDB.createRange(Range.getLower(), Range.getUpper()) -- but the APInts are still at the argument's original (sub-dword) width, so the load gets !range !{i8 0, i8 4} on an i32 instruction. The IR verifier rejects this ("Range types must match instruction type!"), so opt -passes=amdgpu-lower-kernel-arguments aborts; the default clang -O2 pipeline has no in-pipeline verifier so the wrong-typed MD survives to codegen and is a latent miscompile risk for any downstream pass that consults the range MD on the un-truncated i32 load. Sibling nonnull/dereferenceable/align block is correctly guarded by isa<PointerType>(ArgTy). |
| m083-rewrite-out-args-mayalias-swap | ❌ | ❌ | ❌ | AMDGPURewriteOutArguments uses a single MemoryDependence query to find "the store" for each out-arg without checking that the returned store's pointer is actually that out-arg. For two non-noalias ptr args MDA returns the last store in the block as the def of both, so the pass pairs each out-arg with the OTHER store's value -- producing a clean value swap (store 1, %a; store 2, %b becomes ret { 2, 1 } consumed as *%a=2, *%b=1). Egregiously, the existing LIT test multiple_same_return_mayalias in rewrite-out-arguments.ll encodes the buggy { 2, 1 } output as the expected result. Pass not in default pipeline -- reachable via opt -amdgpu-rewrite-out-arguments. |
| m084-s-barrier-init-unmasked-membercount | ❌ | ❌ | ❌ | SDAG lowering of llvm.amdgcn.s.barrier.init / s.barrier.signal.var at SIISelLowering.cpp:12450-12459 builds the masked member-count SDValue but then immediately overwrites it with SHL CntOp, 16 using the unmasked raw CntOp. Bits CntOp[15:6] leak into M0[31:22], above the legal 6-bit M0[21:16] member-count field; for %cnt >= 64 the hardware named-barrier sees a corrupted member count. The GISel counterpart at AMDGPUInstructionSelector.cpp:7240-7250 masks correctly. gfx12+ intrinsic so demonstrated as static SDAG-vs-GISel asm divergence -- the FuzzX box has no gfx12 GPU. |
| m085-fatptr-array-vec-elem-store-vs-alloc-stride | ❌ | ❌ | ❌ | AMDGPULowerBufferFatPointers at AMDGPULowerBufferFatPointers.cpp:978-985 (load) and :1098-1105 (store) uses getTypeStoreSize(ElemTy) for the per-element stride when lowering an [N x vec] load/store, but LLVM lays out array elements at multiples of getTypeAllocSize(ElemTy). For <3 x i32> (storeSize=12, allocSize=16 on AMDGPU's v96:128 layout), [2 x <3 x i32>] element[1] is at byte offset 16 but the pass loads/stores from offset 12 -- short-reading 4 bytes of element[0]'s padding plus only 8 bytes of element[1]. Pass IS in the default clang -O2 codegen pipeline. Demonstrated at IR (opt -passes=amdgpu-lower-buffer-fat-pointers) and asm (buffer_load_dwordx3 ... offset:12) levels. |
| m086-set-inactive-known-bits-overclaim | ❌ | ❌ | ❌ | AMDGPUTargetLowering::SimplifyDemandedBitsForTargetNode (AMDGPUISelLowering.cpp:5838-5846) handles amdgcn.set_inactive in the same case body as readfirstlane/readlane/wwm, populating Known only from Op.getOperand(1) (the active-lane value) and never visiting Op.getOperand(2) (the inactive_value). When value is a constant, the generic SimplifyDemandedBits framework constant-folds the entire call to that constant, silently dropping inactive_value. Asm-level proof: set_inactive(0xAAAAAAAA, 0x55555555) & 0xFFFF at -O0 emits a v_cndmask_b32_e64 selecting between both constants; at -O2 it collapses to s_mov_b32 s2, 0xaaaa with no cndmask and no 0x55555555 anywhere. Same shape as m076 (target-node knownbits lying). |
| m087-image-store-sparse-dmask-trim | ❌ | ❌ | ❌ | simplifyAMDGCNMemoryIntrinsicDemanded channel-trimming loop (AMDGPUInstCombineIntrinsic.cpp:2317-2342, reached from the image_store_* case) walks DMask bits left-to-right and drops every set DMask bit whose position-among-set-bits is past the contiguous-prefix demanded mask returned by trimTrailingZerosInVector. For a sparse DMask like 0b1010 (Y+W) with vdata = <a, 0>, the W channel is dropped: O0 emits image_store v[0:1], v2, s[0:7] dmask:0xa unorm (writes Y=a, W=0), O2 emits image_store v0, v2, s[0:7] dmask:0x2 unorm (writes only Y; W is left unchanged). Existing LIT tests use only contiguous DMask and miss this. |
| m088-kernarg-noundef-widened-load | ❌ | ❌ | ❌ | AMDGPULowerKernelArguments.cpp:319-320 unconditionally stamps !noundef on the widened i32 kernarg load whenever the original sub-dword arg has the noundef attribute, but the load's high bits come from sibling kernargs or padding whose noundef-ness is not constrained. isGuaranteedNotToBeUndefOrPoison then returns true for the un-truncated i32, and GVN+InstCombine can drop a freeze guarding a branch on a different arg, producing immediate UB from a well-defined source. Sibling to m082 (range MD); silent past --verify-each. Bug is in default-pipeline-emitted IR; weaponization requires a post-amdgpu-lower-kernel-arguments IR opt (LTO post-link, hand-rolled opt, JIT). |
| m089-lowerkernattr-grid-div-not-uniform-gated | ❌ | ❌ | ❌ | AMDGPULowerKernelAttributes.cpp:409-446 ("Upgrade the old method") rewrites udiv(grid_size_x, group_size_x) (floor) into a load of HIDDEN_BLOCK_COUNT_X (ceil per AMDHSA ABI) without checking the uniform-work-group-size attribute; the two sibling rewrites at lines 310-347 and 348-404 in the same file do check it. For non-uniform dispatches (OpenCL ≥2.0 with -cl-uniform-work-group-size=false, or hand-built AQL packets) where grid % group != 0, floor and ceil differ. HIP runtime always sets dispatch.grid_size = gridDim * blockDim so the harness can't observe runtime divergence, but the IR-level transform is unconditionally unsound; the upstream LIT test implicit-arg-block-count.ll encodes the buggy behavior as expected. |
| m090-image-msaa-load-merge-ignores-dmask | ❌ | ❌ | ❌ | AMDGPUImageIntrinsicOptimizer::collectMergeableInsts (AMDGPUImageIntrinsicOptimizer.cpp:114) starts its arg-equality loop at I=1, silently skipping arg 0 (DMaskIndex). Two image_load_2dmsaa calls at the same coords with different DMasks get fused into a single image_msaa_load using only the first call's DMask, so the second extract reads from the wrong channel (e.g., R(f1) instead of A(f1)). The in-source comment claims to check DMask but the code doesn't. Gated off for gfx950 by MSAALoadDstSelBug; reproduces on gfx1150 (and any gfx10/gfx11 wave-graphics target without that erratum). |
| m091-latecgp-widen-load-noundef | ❌ | ❌ | ❌ | AMDGPULateCodeGenPrepare.cpp:538-540 widens a sub-DWORD constant-AS load to i32 via copyMetadata(LI) + setMetadata(MD_range, nullptr) -- but MD_noundef is NOT cleared. The widened i32 load's high bits come from neighbouring bytes whose noundef-ness wasn't implied by the original attribute. GVN+InstCombine can then drop a freeze guarding a branch on bits the source program never claimed noundef on, producing immediate UB. Same shape as m088 (kernarg widening); this pass's WidenLoads cl::opt defaults to true so the bad MD is in default-pipeline-emitted IR. Twin latent in AMDGPUCodeGenPrepare.cpp:1561-1562 where WidenLoads defaults to false. |
| m092-select-fcmp-one-nan-arg | ✅ | ❌ | ❌ | SITargetLowering::performSelectCombine (SIISelLowering.cpp:18335-18374) rewrites select (fcmp one x, K), other, K -> select (fcmp one x, K), other, x to avoid materializing the constant K twice. The fold guards the constant side (excludes NaN/Inf/zero/subnormal/inline-immediates) but never checks the non-constant operand x. When x = NaN, fcmp one NaN, K is false, so the original returns K while the folded form returns x = NaN. HEAD-only regression (fold added after ROCm 7.1.1 snapshot). Runtime confirmed: input 0x7fc00000 -> O0=0x402df850, O2=0x7fc00000. |
| m093-libcalls-pow-sqrt-no-fmf-guard | ❌ | ❌ | ❌ | AMDGPULibCalls::fold_pow (AMDGPULibCalls.cpp:936-950) rewrites pow(x, ±0.5) to sqrt(x)/rsqrt(x) without checking fast-math flags -- it returns before the isUnsafeFiniteOnlyMath guard at the bottom of fold_pow. C99/IEEE pow says pow(-Inf, 0.5)=+Inf and pow(-0.0, 0.5)=+0.0, but sqrt(-Inf)=NaN and sqrt(-0.0)=-0.0. Runtime confirmed: input -Inf -> O0=+Inf, O2=NaN; input -0.0 -> O0=+0.0, O2=-0.0. Reproduces on both LLVM HEAD and ROCm 7.1.1 (not a HEAD-only regression). |
| m094-fmul-legacy-sign-of-zero | ❌ | ❌ | ❌ | canSimplifyLegacyMulToMul (AMDGPUInstCombineIntrinsic.cpp:398-418) lets the amdgcn.fmul.legacy and amdgcn.fma.legacy folds rewrite to plain fmul/fma whenever one operand is finite-nonzero. But V_MUL_LEGACY_F32 returns +0.0 whenever either operand is ±0, while IEEE fmul XORs the signs: fmul.legacy(-2.0, +0.0) = +0.0 but fmul(-2.0, +0.0) = -0.0. Runtime confirmed: O0=+0.0, O2=-0.0. Reproduces on both LLVM HEAD and ROCm 7.1.1; the existing LIT test fmul_legacy.ll:28 encodes the buggy transform as expected. |
| m095-fmed3-sign-of-zero-maxnum | ❌ | ❌ | ❌ | fmed3AMDGCN (AMDGPUInstCombineIntrinsic.cpp:53-68) implements all-constant amdgcn.fmed3 as a chain of three maxnum calls. APFloat::compare treats ±0 as cmpEqual and APFloat::maxnum "treats +0 > -0", so fmed3(-0, -0, +0) folds via maxnum(-0, +0) = +0, while HW v_med3_f32(-0, -0, +0) returns the actual median by sort order = -0. Runtime confirmed: O0=0x80000000, O2=0x00000000. Reproduces on both LLVM HEAD and ROCm 7.1.1. Generalises to any input triple where two operands tie at -0 and the third is +0. |
| m096-fatptr-cmpxchg-weak-success-bool-poison | ❌ | ❌ | ❌ | AMDGPULowerBufferFatPointers.cpp:1881-1886 only fills the cmpxchg result's success-bool field ({T, i1} lane 1) for strong cmpxchg; for cmpxchg weak it leaves i1 as poison. buffer_atomic_cmpswap is non-spurious so the same ICmpEQ(Call, CompareOperand) would work for both forms. Default pipeline. The lowered IR has insertvalue {i32, i1} poison, %r, 0 with field 1 never set; downstream extractvalue ..., 1 returns poison and any branch/store/arithmetic on it is UB. Sibling shape to m085 (same file). |
| m097-swlowerlds-memintrinsic-leak | ❌ | ❌ | ❌ | AMDGPUSwLowerLDS::getLDSMemoryInstructions (AMDGPUSwLowerLDS.cpp:639-663) whitelists only Load/Store/AtomicRMW/AtomicCmpXchg/AddrSpaceCast, omitting MemIntrinsic. replaceKernelLDSAccesses then blanket-rewrites the dest pointer arg of llvm.memset.p3.* / llvm.memcpy.p3.* / memmove to point into the SwLDS metadata cell, while translateLDSMemoryOperationsToGlobalMemory never lowers the intrinsic itself. Net: memset/memcpy on an LDS pointer in an asan-instrumented kernel writes into the malloc-pointer slot and adjacent SwLDS metadata, never reaching the global backing store, and ASAN sees no access. Gated by sanitize_address. |
| m098-unify-exit-nodes-musttail-bitcast | ❌ | ❌ | ❌ | AMDGPUUnifyDivergentExitNodes.cpp:256-261 skips musttail blocks by checking RI->getPrevNode() for a CallInst. The Verifier explicitly permits an optional bitcast between a musttail call and the ret (Verifier.cpp:4283-4290); when that bitcast is present, getPrevNode() returns the BitCastInst, the block is NOT skipped, the ret is replaced with br UnifiedReturnBlock, and the musttail invariant is destroyed. opt aborts with musttail call must precede a ret with an optional bitcast; -disable-verify produces silently-broken IR. The companion LIT test do-not-unify-divergent-exit-nodes-with-musttail.ll only covers the bitcast-LESS form. Default pipeline. |
| m099-tti-uniform-and-tid-divergent-mask | ❌ | ❌ | ❌ | GCNTTIImpl::isAlwaysUniform (AMDGPUTargetTransformInfo.cpp:1219-1225) matches (workitem.id.x & Mask) and returns AlwaysUniform when Mask's countMinTrailingZeros >= log2(wavefrontSize), without checking that Mask is uniform. A divergent Mask = shl %div, log2(wave) satisfies the trailing-zero check but the AND has divergent high bits. AMDGPUUniformIntrinsicCombine then deletes a readlane(over-claimed_val, lane), so each lane stores its own value instead of the requested lane's. Runtime confirmed: lane 65 stores 0x40 (its own val) instead of 0x00 (lane 0's val). Same shape as m086 (target-node uniformity hook lying about an operand it never inspects). |
| m100-performfmacombine-fdot2-ignores-denormal-mode | ❌ | ❌ | ❌ | performFMACombine (SIISelLowering.cpp:17729-17800) folds the FMA-chain pattern fma(fpext(ax),fpext(bx),fma(fpext(ay),fpext(by),z)) into AMDGPUISD::FDOT2 whenever both FMAs carry contract. The in-source comment justifies the contract-only guard by claiming "fdot2_f32_f16 always flushes fp32 denormal regardless of mode" -- conflating two orthogonal properties. contract is a fusion permit; it does NOT license flushing denormals. A kernel compiled with denormal-fp-math-f32="ieee,ieee" silently switches from v_fma_mix_f32 (mode-respecting) to v_dot2c_f32_f16 (always FTZ), losing any intermediate subnormals the source asked to preserve. dot10-insts is gfx950 default. |
| m101-performaddcarry-rewrites-carryout | ❌ | ❌ | ❌ | SITargetLowering::performAddCarrySubCarryCombine (SIISelLowering.cpp:17527-17533) folds UADDO_CARRY((add x,y), 0, cc) -> UADDO_CARRY(x, y, cc) reusing N->getVTList(), so CombineTo rewires BOTH value and carry-out. The carry-outs are not equivalent: original ((x+y) mod 2^32 + cc) >= 2^32 vs folded (x+y+cc) >= 2^32. They differ whenever x+y wraps. No hasOneUse or hasAnyUseOfValue(1) guard. Symmetric USUBO_CARRY((x-y), 0, cc) borrow output has the same flaw. With x=0xFFFFFFFF, y=1, cc=0: IR carry=0, O2 produces 1. |
| m102-f64-flog-silent-undef | ❌ | ❌ | ❌ | AMDGPUISelLowering.cpp:419-427 marks {FEXP,FEXP2,FEXP10} Custom for f64 (handled by lowerFEXPF64), but the LOG family is not. FLOG/FLOG2/FLOG10 on f64 fall through generic Expand to a nonexistent libcall. llc prints error: no libcall available for flog2, exits 0, and emits a kernel that stores {0, undef} instead of the log value. The LowerFLOG2/LowerFLOGCommon helpers can't handle f64 either -- they emit AMDGPUISD::LOG which only has a V_LOG_F32 selector. Affects gfx900/gfx950/gfx1100. Under strictfp, hard crash instead of silent miscompile. |
| m103-lowersdivrem-i64-int32min-narrowing | ❌ | ❌ | ❌ | AMDGPUTargetLowering::LowerSDIVREM (AMDGPUISelLowering.cpp:2415-2430) narrows i64 SDIVREM to i32 whenever both operands have ComputeNumSignBits > 32. That admits LHS = sext(INT32_MIN) (33 sign bits) and RHS = sext(-1). The narrowed sdiv i32 0x80000000, -1 is poison; lowering wraps to 0x80000000, and the outer SIGN_EXTEND produces -2^31 instead of the well-defined i64 result +2^31. Mirrored in AMDGPUCodeGenPrepare::expandDivRem32 (AMDGPUCodeGenPrepare.cpp:1219), so the O0-vs-O2 oracle agrees wrong unless InstCombine pre-folds the divisor. Fix: tighten gate to > 33. |
| m104-sdag-rcp-constant-denormal | ❌ | ❌ | ❌ | AMDGPUTargetLowering::performRcpCombine (AMDGPUISelLowering.cpp:5549-5558) folds AMDGPUISD::RCP(C) to APFloat(1.0)/Val without consulting the kernel's denormal mode. Literal in-source comment: // XXX - Should this flush denormals?. SDAG twin of m075 (output-denormal) + m077 (input-denormal): the InstCombine fixes only fire on direct @llvm.amdgcn.rcp(C), but fdiv afn 1.0, C is rewritten by lowerFastUnsafeFDIV (SIISelLowering.cpp:13117) into AMDGPUISD::RCP, then hits this fold. 1.0 / 2^127 under denormal-fp-math-f32="preserve-sign,preserve-sign" produces subnormal 0x00400000 instead of HW's +0.0. Both O0 and O2 run the same combine. |
| m105-fptosisat-bf16-i64-clamps-at-i32 | ❌ | ❌ | ❌ | LowerFP_TO_INT_SAT (AMDGPUISelLowering.cpp:3979-3986) groups bf16 with f16 in the "saturate at i32 then ext to i64" shortcut. Sound for f16 (max finite 65504 fits in i32) but wrong for bf16, which shares f32's 8-bit exponent. Values in [INT32_MAX+1, INT64_MAX] silently clamp to INT32_MAX instead of returning the correct in-range i64 or INT64_MAX. Symmetric fptoui.sat defect. For bf16 input 0x4f80 (= 2^32): expected 0x100000000, observed 0x7fffffff. Bug is in Custom legalization (runs at every -O), so both -O0 and -O2 are wrong. |
| m106-selectvop3mods-fsub-pzero-sign-of-zero | ❌ | ❌ | ❌ | SelectVOP3ModsImpl (AMDGPUISelDAGToDAG.cpp:3415-3423) folds fsub C, x into the VOP3 NEG source modifier whenever LHS->isZero(). APFloat::isZero() matches BOTH +0.0 AND -0.0. Under IEEE 754, fsub +0.0, x is NOT equivalent to fneg x when x = +0.0: the former returns +0.0, the latter -0.0. No nsz gate. SDAG: v_mul_f32_e64 v1, -v1, v2; GISel: preserves the fsub. Sibling shape to m094 (fmul.legacy sign-of-zero) but at ISel layer. Fix: restrict to LHS->isNegZero() (since -0 - x = -x is unconditional). |
| m107-performfnegcombine-fmul-flips-nan-sign | ❌ | ❌ | ❌ | AMDGPUTargetLowering::performFNegCombine FMUL/FMUL_LEGACY/FADD/FMA arms (AMDGPUISelLowering.cpp:5298-5318) fold fneg(fmul x,y) -> fmul(x, fneg y) (and the FADD/FMA siblings) with no nnan guard. Under IEEE 754, fneg(x) flips the sign bit of every value including NaN. But on AMDGPU HW, v_mul_f32(NaN, -y) propagates the input NaN's sign bit unchanged -- the VOP3 NEG src-modifier on the other operand has no effect on the propagated NaN's sign. For x=NaN, y=1.0: O0 produces 0xFFC00000 (sign flipped), O2 produces 0x7FC00000 (sign preserved). Asm: O0 v_sub -0, mul; O2 v_mul x, -y. |
| m108-lowerkernattrs-grid-dims-folded-from-reqd-wgsize | ❌ | ❌ | ❌ | AMDGPULowerKernelAttributes (AMDGPULowerKernelAttributes.cpp:121-142, 146-158, 271-274) replaces the dispatch-time hidden_grid_dims load (at implicitarg.ptr + 64, COV5) with a constant derived from the kernel-static !reqd_work_group_size metadata. But hidden_grid_dims is the AQL dispatch packet's setup.DIMENSIONS field -- set by the runtime when the kernel is launched. OpenCL/HIP explicitly permit dispatching a kernel with reqd_work_group_size(N,1,1) as a 2-D or 3-D NDRange. get_work_dim() then returns the wrong value (always 1 instead of the actual runtime dispatch dim). Per AMDGPUUsage.rst:5358: "hidden_grid_dims = AQL dispatch packet dimensionality". Post-ROCm-7.2.3 upstream regression. |
| m109-sifixsgprcopies-s_mov_b64-truncates-imm | ❌ | ❌ | ❌ | SIFixSGPRCopies::tryMoveVGPRConstToSGPR (SIFixSGPRCopies.cpp:887) picks MoveOp = MoveSize == 64 ? S_MOV_B64 : S_MOV_B32. S_MOV_B64 only encodes a 32-bit literal -- non-inline 64-bit immediates have their high 32 bits silently dropped at encoding time. Sibling helper isSafeToFoldImmIntoCopy at line 386 correctly uses S_MOV_B64_IMM_PSEUDO. For input V_MOV_B64_PSEUDO 0x123456789ABCDEF0, asm prints the full value but encoding BE8001FF 9ABCDEF0 is one literal slot; disassembly shows s_mov_b64 s[0:1], 0x9abcdef0. High 32 bits 0x12345678 zero-extended away. Triggered when a uniform PHI takes a V_MOV_B64_PSEUDO non-inline imm. Reproduces on ROCm 7.2.3 -- not HEAD-only. |
| m110-performfnegcombine-fmed3-nan-asymmetry | ❌ | ❌ | ❌ | AMDGPUTargetLowering::performFNegCombine AMDGPUISD::FMED3 arm (AMDGPUISelLowering.cpp:5383-5401) folds fneg(fmed3 x,y,z) -> fmed3(-x,-y,-z) with no nnan guard. But v_med3_f32 treats NaN asymmetrically: NaN sorts as smaller-than-everything regardless of NaN's sign bit. So negating operands doesn't yield a sign-flipped median. For med3(NaN, 1.0, 2.0) = 1.0, fneg = -1.0; but med3(-NaN, -1.0, -2.0) = -2.0. O0: 0xBF800000; O2: 0xC0000000. Sibling shape to m107. Reproduces on ROCm 7.1.1. Fix: gate on nnan. |
| m111-vop3p-madfmamix-fsub-fpext-nan-sign | ❌ | ❌ | ❌ | VOP3P MadFmaMixFP32Pats TableGen pattern (VOP3PInstructions.td:240-251) matches canonical IR fneg(fpext h) (= fsub -0.0, fpext(h)) and lowers to v_fma_mix_f32(h, -1.0, -0.0). The VOP3 NEG src-modifier on -1.0 does NOT flip the sign of a NaN propagated from h -- HW preserves the input NaN's sign through v_fma_mix. Result at O0: 0x7FC00000 (sign NOT flipped, wrong per LangRef fneg). At O2 the performFNegCombine FP_EXTEND arm fires first and produces correct v_cvt_f32_f16_e64 -h (= 0xFFC00000). HEAD-only regression at O0 (ROCm 7.1.1 does not have the racing TableGen pattern). |
| m114-promotekernargs-flat-to-global-unconditional | ❌ | ❌ | ❌ | AMDGPUPromoteKernelArguments (AMDGPUPromoteKernelArguments.cpp:105-128) unconditionally wraps every FLAT_ADDRESS kernel-arg pointer in addrspacecast(addrspacecast(p to ptr addrspace(1)) to ptr) so InferAddressSpaces converts downstream memops to global_*. NO check that the flat pointer is actually in the global aperture. A flat kernarg can legitimately carry LDS / private aperture pointers (host stuffs addrspacecast(@LDS to ptr)). Per LangRef, addrspacecast to a non-containing AS is poison; AMDGPU lowering strips the aperture base. Result: global_store_dword to garbage address instead of correct flat/LDS/scratch dispatch. Default pipeline at -O2. Reproduces on ROCm 7.1.1. |
| m115-fcanonicalize-v2f16-undef-lane-asymmetric | ❌ | ❌ | ❌ | performFCanonicalizeCombine v2f16 build_vector path (SIISelLowering.cpp:15910-15915) has a dead ternary: if (isa<ConstantFPSDNode>(NewElts[1])) NewElts[0] = isa<ConstantFPSDNode>(NewElts[1]) ? NewElts[1] : DAG.getConstantFP(0.0, ...); -- the false branch is unreachable. When the OTHER lane is a non-constant (e.g. fcanon(runtime)), NewElts[0] stays undef and the combined build_vector lets the low lane decay to raw register bits at O2. The symmetric branch at 15917-15921 is correctly written. O0 fcanonicalizes low half to 0, O2 leaves raw sNaN/denormal bits. Sibling shape to m086 (target hook over-claims). |
| m116-regforinlineasm-i64-named-single-vgpr-truncate | ❌ | ❌ | ❌ | getRegForInlineAsmConstraint (SIISelLowering.cpp:19164-19211) silently accepts ={v0} for an i64 result type and binds it to a single 32-bit VGPR. The width check at 19175-19183 only runs when NumRegs > 1; the NumRegs == 1 path at 19204-19208 only checks VT.isVector(), not scalar bit-width. Codegen synthesises the upper half from thin air (v_mov_b32_e32 v1, 0 before the asm), so a 64-bit-writing asm reads back as {v0=asm_value, v1=0}. The range form {v[0:0]} is correctly rejected, so the off-by-one is an asymmetry between the two parsing paths. Reproduces on ROCm 7.1.1. |
| m117-attributor-waves-per-eu-max-of-lower | ❌ | ❌ | ❌ | AAAMDWavesPerEU::updateImpl (AMDGPUAttributor.cpp:1165-1172) uses std::max for BOTH endpoints of the waves-per-eu range when computing the union over callers. Lower bound should be std::min. A callee shared by kernels with [1,1] and [8,8] becomes [8,8] -- the tightest kernel's register budget is imposed on the callee even when invoked from the relaxed kernel, forcing spills. Also overwrites state non-monotonically and uses a wrong type-comparison for fixpoint detection. Acknowledged but unaddressed upstream as min-waves-per-eu-not-respected.ll. |
| m118-iscanonicalized-frexpmant-rcp-sqrt-overclaim | ❌ | ❌ | ❌ | SITargetLowering::isCanonicalized (SIISelLowering.cpp:15670-15683 + 15541-15559) lists ~12 AMDGPU target intrinsics/ISD nodes as unconditionally canonical with no input check: amdgcn.frexp_mant, rcp, rsq, sqrt, exp2, log, trig_preop, cubeid, cvt_pkrtz, fdot2, rcp_legacy, rsq_legacy, rsq_clamp. HW v_frexp_mant_f32 etc. propagate NaN payload unchanged (sNaN in -> sNaN out). The fcanonicalize_canonicalized PatFrag (SIInstrInfo.td:1013) lowers to COPY, so canonicalize of any of these is elided. Codebase contradicts itself: known-never-snan.ll:566 (v_test_NOT_known_frexp_mant_input_fmed3_r_i_i_f32) explicitly asserts frexp_mant output is NOT known-never-sNaN. |
| m119-sssid-merge-drops-stronger-scope | ❌ | ❌ | ❌ | SIMemOpAccess::constructFromMIWithMMO (SIMemoryLegalizer.cpp:836) merges SSIDs from multi-MMO atomics by checking isSyncScopeInclusion(A,B); if neither subsumes the other (e.g. agent-one-as vs workgroup), the optional returns false and the code overwrites SSID := B, silently dropping the agent-level scope. Asm divergence verified: [agent-one-as, workgroup] MMO order emits no BUFFER_WBL2 16 / BUFFER_INV 16; reversed order emits both. Order-dependent codegen on multi-MMO atomics (created by cloneMergedMemRefs etc.). Should compute LUB instead. |
| m120-performfmulcombine-fneg-lhs-flips-nan-sign | ❌ | ❌ | ❌ | performFMulCombine (SIISelLowering.cpp:17719-17721) folds fmul x, (select y, -A, -B) -> ldexp(fneg x, ...) with no nnan guard. HW v_mul_f64(NaN, -K) preserves input NaN's sign; HW v_ldexp_f64(-x, k) lowers FNEG into the VOP3 NEG src-modifier which XORs the sign bit before ldexp sees the operand, flipping the NaN sign. For x=+qNaN, y=true: O0 stores 0x7FF8000000000000, O2 stores 0xFFF8000000000000. Inverse of m107 (m107 preserved sign; m120 flips it). Affects f64 always and f32/f16 in divergent contexts. |
| m113-preloadkernargs-explicit-budget-omits-baseoffset | ❌ | ❌ | ❌ | AMDGPUPreloadKernelArguments (AMDGPUPreloadKernelArguments.cpp:181-183, 295-329) computes the explicit-arg preload budget using only the raw ExplicitArgOffset, omitting BaseOffset = ST.getExplicitKernelArgOffset() (= 36 on non-AMDHSA/PAL/Mesa3D triples). On amdgcn-- triple, the pass over-marks explicit args as inreg, then SIISelLowering's allocatePreloadKernArgSGPRs bails partway through and AMDGPULowerKernelArguments skips lowering all inreg args -- dropped args read garbage. Verified asm divergence: amdgcn-- emits second load at offset 0x44 (correctly excluding baseoffset); amdgcn-amd-amdhsa at 0x20. Companion of m088 (kernarg widening). |
| m112-printfbinding-pct-s-strlen-mod4-zero-offset | ❌ | ❌ | ❌ | AMDGPUPrintfRuntimeBinding (AMDGPUPrintfRuntimeBinding.cpp:218-220, 357-389) sizes the %s printf slot as alignTo(strlen(s)+1, 4) (reserves room for NUL), but the store loop only writes strlen(s) bytes and the next-arg GEP advances by getTypeAllocSize of the actual stored value, not the metadata-recorded size. For any string with strlen % 4 == 0, the metadata-recorded offset of arg N+1 is +4 ahead of the IR-emitted offset. Runtime reads %d arg from uninitialized buffer bytes. Repro: printf("%s %d", "abcd", x). |
| m121-lowerkernattrs-udiv-blockcount-drops-volatile-load | ❌ | ❌ | ❌ | AMDGPULowerKernelAttributes (AMDGPULowerKernelAttributes.cpp:421-444) "upgrade old block-size calc" rewrite matches m_UDiv(m_ZExtOrSelf(m_Load(m_GEP(dispatch_ptr, GRID_SIZE_X+I*4))), m_Value()) without isSimple() on m_Load. Volatile and atomic loads match; the pass RAUW's the UDiv with a freshly-built non-volatile load of HIDDEN_BLOCK_COUNT_X from a different memory location, silently discarding the value the user's volatile/atomic load produced. The original load survives as a dead side-effecting op, masking the loss. All other load matches in the same pass (line 205) correctly require isSimple(). Sibling of m108. |
| m122-lowerfdiv64-ignores-denormal-mode | ❌ | ❌ | ❌ | SITargetLowering::LowerFDIV64 (SIISelLowering.cpp:13471-13538) lowers f64 fdiv through a NR refinement chain (DIV_SCALE / FMA*4 / FMUL / DIV_FMAS / DIV_FIXUP) with no denormal-mode toggle. Sibling LowerFDIV32 correctly wraps its NR chain in S_SETREG_B32 / DENORM_MODE writes (lines 13364-13416 / 13436-13462) to force IEEE denormals around the FMA chain, saving and restoring under denormal-fp-math-f32="preserve-sign,...". Under denormal-fp-math="preserve-sign,preserve-sign" the f64 v_fma_f64 chain runs with FTZ; near-denormal intermediates get flushed and NR converges to the wrong value for divisors near 2^-1022. Same lowering at -O0 and -O2 (Custom legalization, not a combine). Sibling of m075/m077/m104 at the f64 fdiv layer. |
| m123-lowerfastunsafefdiv64-nan-on-zero-divisor | ❌ | ❌ | ❌ | lowerFastUnsafeFDIV64 (SIISelLowering.cpp:13140-13178) lowers fdiv afn double X, Y to RCP + 4-step NR refinement. For runtime Y=0: RCP(0) = +/-Inf; FMA(-0, +Inf, 1.0) = NaN + 1.0 = NaN; the entire NR chain returns NaN. IEEE / AMDGCN-RCP say X / +0 = sign(X)*Inf. LangRef afn allows imprecise approximation but does NOT permit +Inf -> NaN (that would need ninf+nnan). f32 fast path uses simple X * RCP(Y) and is safe. Same buggy asm at -O0 and -O2 (Custom legalization). Sibling of m075/m077/m104/m122. |
| m124-fcanonicalize-v2f16-both-undef-decays-to-zero | ❌ | ❌ | ❌ | performFCanonicalizeCombine v2f16 BUILD_VECTOR path (SIISelLowering.cpp:15885-15924) decays fcanonicalize(<2 x half> undef) to <0.0, 0.0> instead of <qNaN, qNaN>. LangRef says canonicalize(undef) should yield a quiet NaN. The correct scalar undef arm at line 15868 returns qNaN, but SDAG lowers <2 x half> undef as BUILD_VECTOR undef, undef, bypassing it. The lane-1 fixup correctly falls back to 0.0, then the lane-0 ternary (m115's dead-branch shape) sees NewElts[1]=ConstantFPSDNode(0.0) and splats. v4f16 type-legalizes to scalar arms and is correct, demonstrating the v2f16 asymmetry. Distinct from m115 (lane-0 undef + lane-1 runtime canonicalize). |
| m125-printfbinding-half-sext-corrupts-fp-bits | ❌ | ❌ | ❌ | AMDGPUPrintfRuntimeBinding (AMDGPUPrintfRuntimeBinding.cpp:191-203) widens scalar half/bfloat printf args via bitcast (half) -> i16 -> sext -> i32. sext is the wrong widener: for negative half values (sign bit set), the top 16 bits flip from 0x0000 to 0xFFFF, corrupting the FP bit pattern. half -2.0 (0xC000) becomes 0xFFFFC000 instead of 0x0000C000; runtime %f reads garbage. Companion to m112 (%s size off-by-4 in same pass). Fix: replace CreateSExt with CreateZExt (or, better, emit fpext half -> float for %f args). |
| m126-lowerfsqrt-ignores-denormal-mode | ❌ | ❌ | ❌ | lowerFSQRTF32 (SIISelLowering.cpp:13682-13770) and lowerFSQRTF64 (13772-13862) emit NR refinement chains with v_fma_f32/v_fma_f64 and no AMDGPUISD::DENORM_MODE toggle. Sibling LowerFDIV32 (13334-13468) wraps its NR chain in S_SETREG_B32 / DENORM_MODE writes (lines 13379-13416 / 13436-13462) to force IEEE denormals; sqrt does not. Under denormal-fp-math="preserve-sign,..." the FMA NR intermediates flush and the result diverges from IEEE. f32 ELSE branch (SqrtE = 0.5 - SqrtH*SqrtS) and f64 Goldschmidt chain both affected. Also missing Flags.setNoFPExcept(true). Sibling of m122 at the sqrt layer. |
| m127-performfsubcombine-fadd-folds-flip-nan-sign | ❌ | ❌ | ❌ | performFSubCombine (SIISelLowering.cpp:17579-17624) has two arms: Arm 1 (fsub (fadd a,a), c) -> fma(a, 2.0, fneg(c)) and Arm 2 (fsub c, (fadd a,a)) -> fma(a, -2.0, c). Both fire on contract-flagged inputs without an nnan guard. HW v_sub_f32 c, sum flips the propagated NaN's sign via implicit NEG-on-b; HW v_fma_f32 a, 2.0, -c (NEG src-modifier on c) propagates NaN sign UNCHANGED. For a=1.0, c=+qNaN: O0 stores 0xFFC00000, O2 stores 0x7FC00000. Sibling of m107 (FMul) and m120 (FMul fneg-LHS). Reproduces on ROCm 7.1.1. |
| m128-performfmacombine-fdot2-flips-nan-sign | ❌ | ❌ | ❌ | performFMACombine FDOT2 fold (SIISelLowering.cpp:17729-17800) gates only on contract (and dot10-insts) -- no nnan guard. Source IR lowers to two v_fma_mix_f32 (preserves input NaN sign); folded form is one v_dot2c_f32_f16 which unconditionally sets the sign bit of any NaN output regardless of input NaN sign. For a=<+qNaN,1>, b=<1,1>, z=0: two-FMA path stores 0x7FC00000, FDOT2 stores 0xFFC00000. Bug also fires via z-only NaN propagation. Distinct from m100 (same fold, denormal). Reproduces on ROCm 7.1.1. |
| m130-libcalls-powr-negative-base-folds-finite | ❌ | ❌ | ❌ | AMDGPULibCalls::fold_pow constant-exponent shortcuts (AMDGPULibCalls.cpp:900-1005) don't check FInfo.getId(), firing for EI_POWR / EI_POWR_FAST too. OpenCL powr(x<0, y) = NaN (base must be >= 0), powr(NaN/+0/-0, 0) = NaN. The fold instead produces x*x, 1.0/x, 1.0, or sqrt(x). Repro: powr(-2.0, 2.0) returns 4.0 (0x40800000) instead of NaN. Sibling of m093 but for powr semantics. Reproduces on ROCm 7.1.1. |
| m131-simplifydemandedbits-set-inactive-divergent-witness | ❌ | ❌ | ❌ | AMDGPUTargetLowering::SimplifyDemandedBitsForTargetNode (AMDGPUISelLowering.cpp:5841) treats amdgcn_set_inactive as 1-source, propagating Known from operand(1) only. The intrinsic signature is set_inactive(active, inactive); Known should be the intersection of both. Witness: divergent EXEC + set_inactive(active=tid&0xFF, inactive=0xFFFF0000) + readlane to expose inactive-lane bits + lshr 16 -- buggy fold collapses to 0 (over-promised "high bits zero"); operand-swap reference correctly stores 0xFFFF. Verified on gfx1100 (gfx950 ICEs in Branch Relaxation on the divergent shape -- separate bug). |
| m132-codegenprepare-vector-sdiv-int32min-narrowing | ❌ | ❌ | ❌ | AMDGPUCodeGenPrepare vector scalarizer (AMDGPUCodeGenPrepare.cpp:1488-1520) extracts each lane of a vector sdiv/srem and calls shrinkDivRem64 per element, composing the m103 i64 INT32_MIN/-1 narrowing per lane. v2i64 sdiv with lane 0 = sext(INT32_MIN), lane 1 = sext(100), divisor = literal splat <i64 -1, i64 -1>: O2 InstCombine folds sdiv x, splat(-1) -> 0 - x (correct); O0 takes the buggy per-lane narrowing. Clean O0/O2 mismatch on lane-0's high32: 0xFFFFFFFF vs 0x00000000. Same fix as m103 (tighten >32 to >33) but must apply at both scalar AND vector entry points. |
| m133-getcanonicalconstantfp-drops-nan-payload | ❌ | ❌ | ❌ | getCanonicalConstantFP (SIISelLowering.cpp:15820-15854) drops SNaN and non-default-QNaN payload, returning the default-payload canonical QNaN. SNaN-quietening at line 15841 has an in-source FIXME: "Is this supposed to preserve payload bits?". AMDGPU HW v_max_f32(SNaN, SNaN) quiets by setting bit 22 only and preserves the rest of the payload (and QNaN payload is preserved entirely). For input SNaN 0x7F8A5A5A (payload 0x0a5a5a): O0 stores 0x7FCA5A5A (HW), O2 stores 0x7FC00000 (constant fold). Payload divergence is observable via bit-pattern inspection. Same defect applies to QNaN with non-default payload, and to v2f16/v2bf16 vector paths. Reproduces on ROCm 7.1.1. |
| m134-amdgpuisel-bitop3-stale-slot-rhs-not-reset | ✅ | ❌ | ❌ | v_bitop3_b32 selector stale-slot patch (AMDGPUISelDAGToDAG.cpp:4413-4450) only resets LHSBits = LHSBitsOrig and NumOpcodes = 0 when the recursive RHS mutated src slots; the parallel RHSBits = RHSBitsOrig reset is missing. Result: the returned LHSBitsOrig OP RHSBits_recursed truth-table encodes inconsistent slot semantics. Reduced repro (6 IR ops): `r = v2 ^ (~v2 |
| m135-libcalls-rootn-2-folds-to-sqrt-without-fmf | ✅ | ❌ | ❌ | AMDGPULibCalls::fold_rootn (AMDGPULibCalls.cpp:1171-1189 for rootn(x, 2); 1209-1235 for rootn(x, -2)) rewrites to Intrinsic::sqrt / Intrinsic::rsqrt without any FMF gate. Per OpenCL spec for even n: rootn(-0.0, 2) = +0.0 (sign dropped); rootn(-Inf, 2) = NaN; rootn(-0.0, -2) = +Inf; rootn(-Inf, -2) = +0.0. llvm.sqrt follows IEEE: sqrt(-0) = -0, sqrt(-Inf) = NaN with sign. O0/O2 mismatch verified: input -0.0 gives O0 +0.0 vs O2 -0.0; input -Inf gives O0 +qNaN vs O2 -qNaN. Unlike m093 the fold uses Intrinsic::sqrt directly so doesn't need a module-visible _Z4sqrtf body to fire. |
| m136-fatptr-seqcst-atomic-loses-scope-bits | ❌ | ❌ | ❌ | addrspace(7) seq_cst atomicrmw/cmpxchg/load/store lowering: SIISelLowering::getTgtMemIntrinsic (SIISelLowering.cpp:1473-1475, atomic-buffer branch) sets only MOVolatile on the MMO -- no atomic ordering, no SSID. AMDGPULowerBufferFatPointers.cpp:1656-1680 lowers seq_cst to fence-pair + non-atomic intrinsic call. SIMemoryLegalizer then sees getSuccessOrdering() == NotAtomic, skips toSIAtomicScope, never calls enableRMWCacheBypass -- SC0/SC1 scope bits at gfx940/950 RMW lines 1137-1154 are never set. Asm divergence: addrspace(7) emits buffer_atomic_add with NO SC bits (default/wavefront scope), while equivalent addrspace(1) emits global_atomic_add ... sc1 (system scope). Cross-agent contention can lose updates. rmw_sys and rmw_agent emit byte-identical asm -- scope is lost entirely. |
| m137-lowerf64tof16safe-drops-nan-payload | ❌ | ❌ | ❌ | LowerF64ToF16Safe (AMDGPUISelLowering.cpp:3787-3873, 3824-3827) handles f64 NaN/Inf by collapsing to 0x7c00 \| 0x200 (canonical qNaN), dropping ALL payload bits. The HW chain f64→f32→f16 (v_cvt_f32_f64 + v_cvt_f16_f32) preserves the top 9 bits of payload. The same kernel written as a direct fptrunc double->half versus a chained fptrunc double->float + fptrunc float->half produces different NaN payloads (0x7e00 vs 0x7e6f). Toggling the afn flag flips between the two paths -- LangRef afn doesn't license changing NaN payload preservation. Sibling of m133 (constfold drops payload). |
| m138-bitop3-selector-revert-missing-return | ✅ | ❌ | ❌ | v_bitop3_b32 selector stale-slot revert at AMDGPUISelDAGToDAG.cpp:4446-4451 sets NumOpcodes = 0 and restores LHSBits/Src but DOES NOT RETURN. Control falls through to TTbl computation, returning (1, LHSBitsOrig OP RHSBits_recursed) = (1, garbage). Random IR fuzzing found 5 distinct O0/O2 miscompiles in ~500 random 5-8-op bitwise chains, each emitting a different wrong truth-table immediate (0x22/0x40/0x33/0x7e/0x3c). Cleanest: always-zero chain (c & v1) & (c & ~v1) = 0 emits v_bitop3_b32 ... bitop3:0x40 computing a & b & ~c = 0xC888A886 (nonzero). m134 covered one symptom (RHSBits reset); m138 is the structural root cause -- early-return is needed, plus findSlot polarity-blindness and missing RHSBitsOrig snapshot are adjacent contributing defects. |
| m139-performfnegcombine-fma-flips-nan-sign | ❌ | ❌ | ❌ | performFNegCombine FMA arm (AMDGPUISelLowering.cpp:5319-5347) folds (fneg (fma x, y, z)) -> (fma x, -y, -z) gated only on nsz. For NaN inputs, fneg must flip the sign bit precisely; the substituted fma(x, -y, -z) selects a NaN payload from a different operand chain. v2f16 repro (in0=0xfe00fc00, in2=0x7c007c00): O0 emits v_pk_fma_f16 + v_xor -> 0x7e007c00; O2 emits v_pk_fma_f16 ... neg_lo:[0,1,1] neg_hi:[0,1,1] -> 0xfe007c00 (top-half NaN sign kept negative). Sibling of m107/m110/m111/m120/m127/m128 -- all need nnan not just nsz. |
| m140-performfnegcombine-fadd-flips-nan-sign | ❌ | ❌ | ❌ | performFNegCombine FADD arm (AMDGPUISelLowering.cpp:5273-5297) folds (fneg (fadd x, y)) -> (fadd -x, -y) gated only on nsz. For NaN-producing fadds (e.g. +Inf + -Inf, or NaN input), the substitution changes which operand's NaN sign survives. v2f16 repro (in0=0xfe00fc00, in1=0x7c007c00): O0 emits v_pk_add_f16 + v_xor 0x80008000 -> 0x7e007e00; O2 emits v_pk_add_f16 ... neg_lo:[1,1] neg_hi:[1,1] -> 0x7e00fe00 (bottom-half NaN sign kept negative -- m139 sibling on FADD arm). |
| m141-iscanonicalized-bitcast-loses-fp-type | ❌ | ❌ | ❌ | SITargetLowering::isCanonicalized recurses through ISD::BITCAST (SIISelLowering.cpp:15649-15653) WITHOUT consulting source/dest FP semantics. In-source TODO acknowledges the bug. v2bf16 (8-bit exp) and v2f16 (5-bit exp) have different denormal ranges, so the same 16-bit pattern can be normal in one type and denormal in the other -- and a NaN-payload bit-pattern in one type may not be canonical-NaN in the other. The function gates is_canonicalized_1/2 PatFrags (AMDGPUInstructions.td:189,207; SIInstrInfo.td:1017,1025) which decide whether V_PACK_B32_F16 / min/max selection may omit the explicit canonicalize. Effect: O2 drops the canonicalize and lets a denormal/sNaN through where O0 emits v_pk_max_f16 v, v that FTZs/quiets. Sibling of m118 (same function, different arm), m115/m124, m133. |
| m142-image-d16-bf16-skipped | ❌ | ❌ | ❌ | SITargetLowering::lowerImage D16 detection at SIISelLowering.cpp:10190 (store) / 10203 (load) / 11926 (handleD16VData) / 12056/12084/12120/12170 (TBUFFER) uses getScalarType() == MVT::f16 only. bf16 is a distinct MVT, so <N x bfloat> image data/result silently skips handleD16VData, selects the non-D16 MIMG opcode, computes wrong NumVDataDwords, and bypasses the HasD16 guard. GISel correctly handles bf16 via getScalarType() == S16 (AMDGPULegalizerInfo.cpp:7182) -- SDAG-only miscompile. Same IR, same bits, different image opcode depending on -global-isel. No upstream lit test covers bf16 image. |
| m143-strict-fp-round-f64-bf16-drops-chain | ❌ | ❌ | ❌ | SITargetLowering::lowerFP_ROUND (SIISelLowering.cpp:8604-8613, f64->bf16 path) lacks the strictfp-bail guard that lives on the f64->f16 path (line 8585). For ISD::STRICT_FP_ROUND with src f64 and dst bf16, it calls expandRoundInexactToOdd (non-strict graph) then emits non-strict ISD::FP_ROUND -- silently dropping the strict chain (operand 0), losing the second result (chain), and dropping exception semantics from the inexact-to-odd expansion. Downstream side-effecting ops can reorder past the strict round. Sibling to m137 (lowering drops payload) and m141 (bf16 family). |
| m161-verifyinstruction-atomic-av-class | ❌ | ❌ | ❌ | SIInstrInfo::verifyInstruction atomic vdst/vdata file-match check at SIInstrInfo.cpp:5857-5869 uses RI.isAGPR(X) != RI.isAGPR(Y). isAGPR = hasAGPRs && !hasVGPRs returns false for AV_* classes (have both bits). On gfx90A+ where getLargestLegalSuperClass widens VReg/AReg pairs to AV, AV-class virtuals are the norm. Two failure modes: (1) AV vdst + AGPR vdata -> false != true spurious reject; (2) AV vdst + VGPR vdata -> false == false silently accepted but allocator may split vdst across A-half while vdata stays VGPR, corrupting atomic encoding. Sibling family m149/m152/m153 -- all isAGPR/isVGPRClass-blindness defects in gfx950 AV-class handling. |
| m160-shrinkdivrem-24bit-narrowing-int24-min-neg-one | ❌ | ❌ | ❌ | shrinkDivRem64 24-bit narrowing path (AMDGPUCodeGenPrepare.cpp:1354-1361 + expandDivRem24Impl at :1155-1162) computes INT24_MIN / -1 = +2^23 = 0x00800000 correctly, then sign-extends from 24-bit via SHL 8; AShr 8 which inverts to 0xFF800000 (= -2^23). Same gate reachable from i32 sdiv when LHS has >=9 sign bits + RHS=-1 (getDivNumBits returns 24). Sibling of m103 (32-bit), m132 (vector). Full family covers INT_MIN_at_N / -1 for N ∈ {24, 32}. Reproducer: sdiv i32 -8388608, -1 stores 0xFF800000 instead of 0x00800000. |
| m159-fcanon-v2bf16-undef-lane | ❌ | ❌ | ❌ | performFCanonicalizeCombine BUILD_VECTOR per-lane undef-fixup at SIISelLowering.cpp:15885 is guarded by VT == MVT::v2f16 -- v2bf16 missing. On gfx950, v2bf16 fcanonicalize is Legal but the combine falls through -- O0 reads garbage bits for undef lane via v_max_f32_e64; O2 splats a constant qNaN. Example: input 0x00007fc0 (lane0 undef, lane1 bf16 qNaN) -> O0 stores 0x7fc00000, O2 stores 0x7fc07fc0. Found by random fcanon-chain fuzz: 29/500 mismatches all of this shape. Direct sibling of m115/m124 for v2f16 -- same defect, different element type. Fix: extend guard to include v2bf16. |
| m158-lowerfcopysign-v2f16-trunc-drops-sign | ❌ | ❌ | ❌ | SITargetLowering::lowerFCOPYSIGN v2f16/v2bf16 mag + wider sign path (SIISelLowering.cpp:8817-8823) bitcasts sign to v2i32 then TRUNCATEs to v2i16. TRUNCATE takes low 16 bits of each i32 -- drops bit 31 (the f32 sign bit), substitutes mantissa bit 15. Subsequent FCOPYSIGN(v2f16, v2f16) reads bit 15 of the truncated value -> the produced half carries random sign instead of the input f32's sign. Fix: SRL by 16 before TRUNCATE, or extract high half. Reachable when sign comes from a v2f32 producer not wrapped in fpround (performFCopySignCombine only peeks through FP_ROUND). |
| m157-lower-fminmax-num-v32f16-missing | ❌ | ❌ | ❌ | SIISelLowering.cpp:826-829 sets ISD::FMINIMUMNUM/FMAXIMUMNUM Custom for {v4f16, v8f16, v16f16, v32f16}. lowerFMINIMUMNUM_FMAXIMUMNUM (SIISelLowering.cpp:8637-8651) handles only {v4f16, v8f16, v16f16, v16bf16} -- v32f16 missing, v16bf16 is dead (not Custom anywhere). In non-IEEE mode, v32f16 falls through to return Op; -- legalizer treats as "no change" and either infinite-loops or asserts. Reachable via llvm.minimumnum.v32f16/llvm.maximumnum.v32f16 on gfx950. Fix: add v32f16 to splitter, drop dead v16bf16; or set v32f16 Expand matching FMINNUM/FMAXNUM at line 831. |
| m156-hasnon16bitaccesses-copypaste-tempotherop | ❌ | ❌ | ❌ | SIISelLowering.cpp:14923-14924 (hasNon16BitAccesses): OpIs16Bit check uses TempOtherOp.getValueSizeInBits() == 16 instead of TempOp.getValueSizeInBits() == 16. The symmetric OtherOpIs16Bit clause two lines down correctly uses TempOtherOp on both sides -- this is a clear copy-paste defect. Called from performOrCombine -> matchPERM (SIISelLowering.cpp:15070). When Op is 32-bit and OtherOp is 16-bit, OpIs16Bit becomes spuriously true; combine concludes "both 16-bit" and skips v_perm -- can drop upper 16 bits of Op on i16->i32 zext / v2i16->v2i32 lane patterns under an or-tree. Same defect in ROCm fork at :14978-14979. |
| m155-sched-barrier-ds-aggregate-ldsdma | ❌ | ❌ | ❌ | amdgcn.sched.barrier(0x800) (LDSDMA-allow) still blocks LDSDMA via the DS aggregate bit. invertSchedBarrierMask (AMDGPUIGroupLP.cpp:2678-2685) DS clause does NOT check the LDSDMA bit; only DS_READ/DS_WRITE imply clearing DS. For input 2048 the inverted mask is 2031 (0b011111101111) -- DS (0x80) stays set. canAddMI DS branch (AMDGPUIGroupLP.cpp:2482-2484) matches LDSDMA via isLDSDMA(MI) on the DS bit, pinning the op. Direct sibling of m144 (which fixed the VMEM aggregate route); both must be fixed for the LDSDMA-allow semantic to work. Fix: add (InvertedMask & LDSDMA) == NONE to the DS clause symmetric to lines 2672-2676. |
| m154-performfmulcombine-ldexp-flips-nan-sign | ❌ | ❌ | ❌ | performFMulCombine ldexp arm (SIISelLowering.cpp:17684-17722) folds fmul x, (select c, -A, -B) -> ldexp(fneg x, select c, log2\|A\|, log2\|B\|) gated only on TrueNode->isNegative() == FalseNode->isNegative() + exact-log2 -- NO nnan check. For x=NaN, original v_mul_f32(NaN, -K) propagates input NaN sign; rewrite v_ldexp_f32(-NaN, exp) via VOP3 NEG src-modifier XORs sign before ldexp passthrough -> output NaN sign flipped. Applies to f16/f32/f64. Asm verified: v_ldexp_f32 v0, -v0, v1. Sibling family of m107/m110/m111/m120/m127/m128/m139/m140 -- all need nnan not just nsz/no-FMF. |
| m153-wholewave-func-prologue-exec-xor-not-mov | ❌ | ❌ | ❌ | SIFrameLowering.cpp:1041-1048 WholeWaveFunction prologue path: when no WWM scratch / CSR spills exist, calls buildScratchExecCopy(..., EnableInactiveLanes=true) which emits S_XOR_SAVEEXEC_B{32,64} tmp, -1. Per ISA: tmp = EXEC; EXEC = -1 XOR EXEC = ~EXEC. The in-source comment claims "set EXEC to -1 here," but the result is EXEC = ~entryEXEC -- bit-inverted entry mask, not all-ones. Triggered by trivial amdgpu_gfx_whole_wave function with no strict.wwm MFMA chains. Body executes with previously-active lanes inactive and vice versa -- whole-wave semantic violated end-to-end. Fix: call EnableAllLanes() (S_MOV_B64 EXEC, -1) or pass EnableInactiveLanes=false (S_OR_SAVEEXEC reaches EXEC=-1 from any entry mask). |
| m152-getdestequivalentvgpr-strips-av-class | ❌ | ❌ | ❌ | SIInstrInfo::getDestEquivalentVGPRClass (SIInstrInfo.cpp:9684-9688, SrcRC-not-AGPR branch) early-exits with RI.isVGPRClass(NewDstRC). isVGPRClass = hasVGPRs && !hasAGPRs is false for AV_* classes (have AGPR bits). On gfx90A+ (gfx950), getLargestLegalSuperClass promotes VReg_/AReg_ to AV_, so AV-class virtregs are the norm for COPY/PHI/REG_SEQUENCE/INSERT_SUBREG dests. The early bail-out is skipped and dest is replaced by getEquivalentVGPRClass(AV_xx) -- a strict-VGPR class. moveToVALU then drops AGPR legality; subsequent MFMA/V_ACCVGPR_ uses see class mismatch -> extra V_ACCVGPR_READ/WRITE moves or illegal-copy verifier failures. Same isVGPRClass-blindness as m149. |
| m151-cs-chain-undefined-flags-silent-fallthrough | ❌ | ❌ | ❌ | SITargetLowering::LowerCall (SIISelLowering.cpp:4248-4266) dispatches on llvm.amdgcn.cs.chain Flags immarg with two recognized cases (isZero() -> error if extra args; isOneBitSet(0) -> DVGPR path with 3 fallback args). There is NO else clause. Any Flags value that is neither 0 nor 1 (e.g. 2, 3, 5) falls through silently: UsesDynamicVGPRs stays false, ChainCallSpecialArgs only contains exec, NumVGPRs/FallbackExec/FallbackCallee are never pushed. Opcode picked is non-DVGPR SI_CS_CHAIN_TC_W32/W64; trailing IR-level variadic args are dropped from the lowered call without diagnostic. IR Verifier (Verifier.cpp:7061-7092) does not range-check the immarg. Sibling of m145 (chain-call ExternalSymbol drops target flag). |
| m150-s-barrier-init-m0-mask-discarded | ❌ | ❌ | ❌ | SIISelLowering.cpp:12450-12459 (lowering of amdgcn.s.barrier.init / amdgcn.s.barrier.signal.var) computes M0Val = (CntOp & 0x3F) then IMMEDIATELY overwrites with M0Val = (CntOp << 16) using the UNMASKED CntOp. The mask is dead code; the intended (CntOp & 0x3F) << 16 expression is never produced. If member-count has bits >=6 set (e.g. 0x40, 0x80), those bits land in M0[27:22], corrupting unrelated fields when the HW decodes M0. Adjacent (not filed): SOPInstructions.td:507-573 -- barrier pseudos lack mayLoad/mayStore, combined with IntrNoMem on intrinsics, MachineScheduler can reorder memops across s_barrier. |
| m149-preallocate-wwm-skips-av-class-gfx950 | ❌ | ❌ | ❌ | SIPreAllocateWWMRegs.cpp:102 (processDef) early-exits when TRI->isVGPR(*MRI, Reg) is false. isVGPR calls isVGPRClass, which is true only when a regclass has VGPRs AND no AGPRs. On gfx90A+ (gfx950 included), MAI-capable register classes are unified AV_* super-classes (hasVGPRs && hasAGPRs), so isVGPRClass == false and the WWM pre-allocator silently drops the virtreg. The physreg is never added to WWMReservedRegs, so VGPRAllocator may reuse it across EXIT_STRICT_WWM -- post-EXIT writes under restored EXEC corrupt inactive-lane data WWM was supposed to preserve. Same defect in SILowerWWMCopies.cpp:135 (addToWWMSpills). Reachable via amdgcn.strict.wwm on MFMA results. |
| m148-v2f64-to-v2f16-gcnpat-double-rounds | ❌ | ❌ | ❌ | VOP3Instructions.td:1461-1463 GCNPat for (v2f16 (fpround v2f64:$src)) emits V_CVT_F32_F64 + V_CVT_PK_F16_F32 -- classic IEEE double-rounding (f64->f32->f16). The scalar fptrunc double to half path (SIISelLowering.cpp:8599 -> LowerF64ToF16Safe, AMDGPUISelLowering.cpp:3787-3873) emulates single-step rounding via inexact-to-odd correction. Same IR, same gfx950: scalar vs lane-0-of-vector fpround produce different f16 bits for f64 values near half-way f16 boundaries. Also bypasses the m137 NaN-payload preservation logic. lowerFP_ROUND at line 8573 bails on non-f32 source for v2f16 dst, letting the buggy GCNPat fire. Sibling of m137. |
| m147-performclampcombine-drops-snan-quietening | ❌ | ❌ | ❌ | performClampCombine (SIISelLowering.cpp:18284-18303) constant-folds AMDGPUISD::CLAMP(c) and returns the input sNaN bit-pattern unchanged when DX10Clamp is OFF (default for compute kernels). AMDGPUISD::CLAMP lowers via ClampPat (SIInstructions.td:2030-2036) to V_MAX_F32_e64 src, src, DSTCLAMP.ENABLE; with IEEE_MODE=1 the HW v_max_f32(sNaN, sNaN) quiets the sNaN (sets mantissa bit 22) but preserves payload. Constant-fold path returns 0x7F800001 (raw sNaN); HW path returns 0x7FC00001 (quieted). Reachable via amdgcn.fmed3(c, 0.0, 1.0), fminnum(fmaxnum(c, 0.0), 1.0), or fmed3 pattern-matchers. Sibling family of m133. |
| m146-resource-usage-agpr-undercount-with-calls | ❌ | ❌ | ❌ | AMDGPUResourceUsageAnalysis.cpp:173-177 sets NumAGPR and NumExplicitSGPR ONCE via getNumUsedPhysRegs(..., IncludeCalls=false). The per-MI call-path scan at lines 188-321 only updates MaxVGPR; AGPR/SGPR are skipped via the !TRI.isVGPRClass(RC) filter at line 244. On gfx950 (MAI/AGPR-heavy kernels with external calls), AGPR usage is under-reported. Combined with unified VGPR+AGPR allocation on gfx90a/gfx950, downstream max(NumVGPR, NumAGPR) for occupancy may select stale AGPR count -> kernel descriptor over-allocates waves per CU. Secondary: call-path MaxVGPR scan lacks the getAddressableNumArchVGPRs() clip that the fast-path applies, so wave32-vs-wave64 mismatch is unhandled. |
| m145-mcinstlower-externalsymbol-drops-target-flag | ❌ | ❌ | ❌ | AMDGPUMCInstLower.cpp:104-109 (MO_ExternalSymbol branch) constructs MCSymbolRefExpr::create(Sym, Ctx) WITHOUT passing getSpecifier(MO.getTargetFlags()). The sibling MO_GlobalAddress branch at lines 89-103 correctly applies the specifier. Any AMDGPU symbol specifier (MO_GOTPCREL/MO_REL32_LO/MO_REL32_HI/MO_ABS32_LO/MO_ABS32_HI/MO_REL64/MO_ABS64) on an ExternalSymbol is silently dropped, producing wrong relocation type in the emitted object. Reachable via RuntimeLibcalls (__divdi3 etc.) and SI_TCRETURN_CHAIN with external dest. Same defect class in the MO_MCSymbol branch at 113-119 (only handles MO_FAR_BRANCH_OFFSET). |
| m144-sched-barrier-ldsdma-mask-ineffective | ❌ | ❌ | ❌ | llvm.amdgcn.sched.barrier mask = 0x800 (allow LDSDMA past barrier) is silently ineffective. invertSchedBarrierMask (AMDGPUIGroupLP.cpp:2667-2676) clears the aggregate VMEM bit but leaves VMEM_READ (0x10) and VMEM_WRITE (0x20) sub-bits set. Every LDSDMA instruction also satisfies isVMEM && mayLoad/Store (SIInstrInfo.h:631), so canAddMI (AMDGPUIGroupLP.cpp:2474-2480) classifies LDSDMA into the SchedGroup via VMEM_READ/WRITE branches -- the instruction receives ordering edges to the SCHED_BARRIER and cannot move past, contradicting the AMDGPUUsage.rst:1626 documented semantics. Asymmetry: requesting DS-allow (mask=0x80) correctly allows LDSDMA (line 2680), but reverse path fails. Existing lit test only verifies the debug-print mask value, not the actual scheduling effect. |
| c001-sudot-isel-ice | ❌ | ❌ | ❌ | llvm.amdgcn.sudot4 / llvm.amdgcn.sudot8 abort in AMDGPU instruction selection with Cannot select. |
| c002-fma-legacy-isel-ice | ❌ | ❌ | ❌ | -O0 leaves llvm.amdgcn.fma.legacy for AMDGPU instruction selection, which aborts with Cannot select; -O2 compiles the reduced case. |
| c003-permlane16-isel-ice | ❌ | ❌ | ❌ | llvm.amdgcn.permlane16 ICEs with Cannot select on every CDNA target (gfx9xx); the instruction is GFX10+/RDNA only but the intrinsic is declared target-unconditional. |
| c004-mov-dpp8-isel-ice | ❌ | ❌ | ❌ | llvm.amdgcn.mov.dpp8 ICEs with Cannot select on every CDNA target; same root cause as c003 -- DPP8 is GFX10+/RDNA only. |
| c005-global-load-lds-isel-ice | ❌ | ❌ | ❌ | llvm.amdgcn.global.load.lds ICEs with Cannot select on gfx950; same family as c003/c004. llvm.amdgcn.ds.ordered.add ICEs the same way (mentioned in the c005 notes rather than getting its own entry). |
| c006-tanh-isel-ice | ❌ | ❌ | ❌ | llvm.amdgcn.tanh (both .f32 and .f16) ICEs with Cannot select on gfx950; v_tanh_* is a GFX12 instruction not available on CDNA. Same fix shape as c003. |
| c007-fcmp-i32-wave64-fold-ice | ❌ | ❌ | ❌ | llvm.amdgcn.fcmp.i32 with two equal constant FP operands ICEs at -O2 on any wave64 target with invalid type for register "exec"; the constant folder doesn't validate that the requested return width matches the wave size. Distinct shape from c003--c006 -- a constant-folder bug rather than a missing subtarget predicate. |
| c008-amdgcn-class-bf16-isel-ice | ❌ | ❌ | ❌ | llvm.amdgcn.class.bf16 ICEs at both -O0 and -O2 with LLVM ERROR: Cannot select: i1 = AMDGPUISD::FP_CLASS ... bf16. SIISelLowering.cpp:10931-10933 unconditionally lowers int_amdgcn_class (polymorphic over llvm_anyfloat_ty) to AMDGPUISD::FP_CLASS for any source VT; VOPCInstructions.td:1223-1229 defines VOPCClassPat64 only for _F16/_F32/_F64 -- no V_CMP_CLASS_BF16. llvm.is.fpclass.bf16 correctly expands via i16 compares; the amdgcn-specific intrinsic skips that path. Sibling to c001/c003/c006 and m118 (bf16 over-promise). |
| c009-ballot-wrong-width-cannot-select | ❌ | ❌ | ❌ | llvm.amdgcn.ballot.<N> with <N> != WavefrontSize ICEs in ISel for non-constant args. lowerBALLOTIntrinsic (SIISelLowering.cpp:7811-7852) emits AMDGPUISD::SETCC directly in the user-requested return type without first emitting the wave-sized SETCC and getZExtOrTrunc'ing. ISel has no pattern matching a wave-mask SETCC at the wrong width: LLVM ERROR: Cannot select: i32 = AMDGPUISD::SETCC ..., setne:ch. Reproduces on ballot.i32 / wave64 (gfx950) and ballot.i64 / wave32 (gfx1030). Distinct from c007 (constant-fold ICE) -- c009 fires for arbitrary non-constant inputs. Only the isOne() fast-path checks the active wave width. Sibling lowerICMPIntrinsic does it correctly. |
| c010-strict-fp-extend-bf16-unreachable | ❌ | ❌ | ❌ | STRICT_FP_EXTEND bf16 -> f32/f64 ICEs with llvm_unreachable("Need STRICT_BF16_TO_FP") at SIISelLowering.cpp:4914-4915 (lowerFP_EXTEND). STRICT_FP_EXTEND is set Custom for f32/f64 dst (lines 580-581); the bf16 source case has no strict expansion. Reproducer: llvm.experimental.constrained.fpext.f32.bf16 with strictfp attribute. Crashes a release-asserts compiler at any opt level. Sibling to c001/c003/c006/c008 (intrinsic without selector / wrong target gate) and m143 (sibling f64->bf16 strict-round drops chain). |
| c011-buffer-load-format-tfe-illegal-data-type-ice | ❌ | ❌ | ❌ | amdgcn.struct.ptr.buffer.load.format with TFE enabled AND illegal data type (<3 x i16>, likely <6 x i16>/<3 x bfloat>) ICEs in SDAG legalization. Two cooperating defects: (1) SITargetLowering::lowerIntrinsicLoad illegal-type branch (SIISelLowering.cpp:7740) unconditionally builds a 2-result VTList (CastVT, MVT::Other), ignoring IsTFE -- the TFE i32 result is never plumbed; legal-type branch at 7735 correctly uses M->getVTList(). (2) ReplaceNodeResults else branch (8263-8266) pushes exactly 2 results regardless of N->getNumValues(), dropping the 3rd value for TFE's (data, status, chain). Crash in SelectionDAG::ReplaceAllUsesWith. SDAG-only (GISel cleanly errors with "unable to legalize"). |
| c018-image-atomic-illegal-data-type-ice | ❌ | ❌ | ❌ | amdgcn.image.atomic.<op>.<dim> ICEs/miscompiles in SDAG for any data width != 32/64 bits. lowerImage atomic branch (SIISelLowering.cpp:10156-10181) is a binary Is64Bit = VData.getValueSizeInBits() == 64 dispatch -- everything else gets NumVDataDwords=1, DMask=0x1 -> MIMG selector picks 1-dword V1 opcode for arbitrary-width VData. 7-variant matrix: <3 x i32> ICE in copyPhysReg/MCInstPrinter; <3 x i16>/v6i16/v3bf16 widen-result error; i128 expand; i128 cmpswap Cannot select; bfloat SILENT MISCOMPILE -- 1-dword image_atomic_swap dmask:0x1 corrupts upper 16 bits of texel. Family sibling m142/c011/c014/c015/c016/c017. |
| c017-buffer-atomic-illegal-data-type-ice | ❌ | ❌ | ❌ | amdgcn.{raw,struct}.ptr.buffer.atomic.* ICEs in SDAG for illegal data types. lowerRawBufferAtomicIntrin (SIISelLowering.cpp:11196-11222), lowerStructBufferAtomicIntrin (:11224-11250), and cmpswap arms (:11541-11587) hand the user-typed value straight to getMemIntrinsicNode with no illegal-type bitcast/widen fallback (unlike lowerIntrinsicLoad's 7739-7745). 7-variant reproducer matrix all ICE on LLVM HEAD and ROCm 7.2.3: add.v3i16, add.i128, swap.v6i8, swap.i24 (segfault), fadd.bf16 (Cannot select), struct.add.v3i16, struct.cmpswap.i128. Sibling family c011/c014/c015/c016. |
| c016-s-buffer-load-illegal-data-type-ice | ❌ | ❌ | ❌ | llvm.amdgcn.s.buffer.load.<T> ICEs in SDAG when T is an illegal data type. lowerSBuffer (SIISelLowering.cpp:10537-10632) handles only i16 (line 10559) and v3-of-legal-scalar (10567) for widening; everything else falls through to getMemIntrinsicNode with illegal VT. Divergent path asserts scalar in {i32, f32}. ReplaceNodeResults at :8199-8246 hard-asserts VT == MVT::i8 (line 8212). 7-variant reproducer matrix all ICE on both LLVM HEAD and ROCm 7.2.3: i1, i4, i24, <2 x i1>, <3 x i16>, <6 x i8>, i128. Sibling family c011/c014/c015. |
| c015-buffer-load-format-i8-drops-format | ❌ | ❌ | ❌ | amdgcn.{raw,struct,struct.ptr}.buffer.load.format.i8 (and store mirror) drop format encoding in SDAG. SIISelLowering.cpp:7730-7732 routes scalar i8 format loads to handleByteShortBufferLoads which (line 12760) ignores IsFormat and emits BUFFER_LOAD_UBYTE instead of BUFFER_LOAD_FORMAT_X. The buffer-rsrc format descriptor is not applied. Store mirror at SIISelLowering.cpp:12151-12153, 12202-12205 -> handleByteShortBufferStores (12798). i16 escapes via D16 early-return at 7725. SDAG vs GISel divergence: GISel correctly emits buffer_load_format_x. Sibling family c011/c014. |
| c014-tbuffer-load-illegal-vector-data-ice | ❌ | ❌ | ❌ | amdgcn.raw.ptr.tbuffer.load.v3i16 (also v6i16, v3bf16, struct variants, and store mirrors) ICEs at -O0/-O2 with LLVM ERROR: Do not know how to widen the result of this operator!. SIISelLowering.cpp:11394-11399 (raw) / :11421-11426 (struct) D16 fast-path checks ONLY MVT::f16; all other illegal vector return types skip that branch and fall through to a plain getMemIntrinsicNode(TBUFFER_LOAD_FORMAT, ..., LoadVT, ...) with the illegal type. No CastVT fallback (sibling lowerIntrinsicLoad at 7739-7745 has one). ReplaceNodeResults at 8256 returns the still-illegal value. Distinct from c011 (TFE chain-drop) -- tbuffer has no TFE form. |
| c013-cube-intrinsics-wrong-gate-cdna | ❌ | ❌ | ❌ | llvm.amdgcn.cube{id,ma,sc,tc} mis-select on gfx940/gfx950 (CDNA, no cube ALU). HasCubeInsts is gated correctly on V_CUBE*_F32 patterns at VOP3Instructions.td:264-269, BUT FeatureGFX9 itself unconditionally includes FeatureCubeInsts at AMDGPU.td:1462. gfx940/941/942/950 inherit FeatureGFX9 via FeatureISAVersion9_4_Common (AMDGPU.td:1747) and never subtract FeatureCubeInsts. llc -mcpu=gfx950 -O2 cleanly emits v_cubeid_f32/v_cubema_f32/v_cubesc_f32/v_cubetc_f32 -- HW would trap as illegal instructions. Adjacent: GISel isCanonicalized over-promises cubema/cubesc/cubetc as canonical even though they propagate NaN. |
| c012-pops-exiting-wave-id-wrong-gate-cdna | ❌ | ❌ | ❌ | llvm.amdgcn.pops.exiting.wave.id selects to invalid SRC_POPS_EXITING_WAVE_ID SGPR on gfx940/gfx950. SOPInstructions.td:2050-2054 gates the pattern with isGFX9GFX10 (true for gfx940/gfx950, Generation=GFX9), but POPS is a graphics-only HW feature absent on the CDNA line. MC layer either rejects the encoding or accepts a binary that triggers an illegal-instruction trap at runtime. Adjacent defect (not filed separately): SIISelLowering.cpp:12024 amdgcn.exp.compr guard uses hasCompressedExport() = !HasGFX11Insts which is true on gfx950, but gfx950 has no export HW (hasExportInsts() = !hasGFX940Insts() = false); no diagnostic fires and an unencodable EXP machine node is emitted. Sibling family: c001/c003/c004/c005/c006/c008. |
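
The width dispatch described in the c018 row can be modeled in a few lines. The sketch below is a simplified stand-in, not the code in SIISelLowering.cpp: `AtomicVData`, `modelAtomicDispatch`, `coversExactly`, and `checkModel` are hypothetical names, and the values used for the 64-bit arm (2 dwords, DMask 0x3) are an assumption, since the row only states the fallback values. It shows why every width other than 32 or 64 bits ends up with a 1-dword operand that does not match the value's real size.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model of the atomic branch described in the c018 row.
// Illustration only; this is not the LLVM source.
struct AtomicVData {
  unsigned NumVDataDwords;
  uint32_t DMask;
};

// Hypothetical helper: the binary "64-bit or everything else" split.
constexpr AtomicVData modelAtomicDispatch(unsigned ValueSizeInBits) {
  // Assumption: the 64-bit arm uses 2 dwords and DMask 0x3. The row only
  // states the fallback (1 dword, DMask 0x1).
  return ValueSizeInBits == 64 ? AtomicVData{2u, 0x3u} : AtomicVData{1u, 0x1u};
}

// True when the dwords chosen by the model hold exactly the value's bits.
constexpr bool coversExactly(unsigned ValueSizeInBits) {
  return modelAtomicDispatch(ValueSizeInBits).NumVDataDwords * 32u ==
         ValueSizeInBits;
}

// Spot-checks over the widths listed in the c018 reproducer matrix.
inline void checkModel() {
  assert(coversExactly(32));
  assert(coversExactly(64));
  assert(!coversExactly(16));   // bfloat: 1 dword is wider than the value
  assert(!coversExactly(48));   // <3 x i16>
  assert(!coversExactly(96));   // <3 x i32>, v6i16
  assert(!coversExactly(128));  // i128
}
```

Calling `checkModel()` from a test passes only for the 32-bit and 64-bit cases; every other width from the matrix falls into the 1-dword fallback, which is the mismatch the row attributes the ICEs and the bfloat miscompile to.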
Human-written note: Up through bug m016 I was testing against upstream LLVM. But then it became clear that the ROCm 7.2.3 release didn't have many of these bugs, so I switched to testing the release. After m038, AMD asked me to switch fuzzing back to upstream.
The fuzzer can use an installed ROCm LLVM today. For coverage-guided compiler
fuzzing, initialize the LLVM submodule and build an instrumented LLVM. To use a
different LLVM checkout or fork, set LLVM_PROJECT_DIR=/path/to/llvm-project.
Typical directed-fuzzing setup:
git submodule update --init --depth 1 third_party/llvm-project
scripts/build_instrumented_llvm.sh
scripts/build_directed_fuzzer.sh
scripts/run_directed_fuzzer.sh