FuzzX AMDGPU

Human-written content

This is a vibe-coded fuzzer for the AMDGPU path in LLVM.

We test the full LLVM IR -> AMDGPU assembly compilation path, although in practice most of the bugs we're finding are in the AMDGPU-specific parts of the compiler.

The idea is to:

1. generate programs that have defined semantics (no UB or poison),
2. compile them with -O0 and -O2,
3. ensure that -O0 and -O2 have the same result, and
4. compare that result to that of a trusted interpreter.
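
The comparison in the last two steps can be sketched as a small classifier. This is only an illustration in Python, not the fuzzer's actual C++ logic; the `Verdict` names and the `classify` helper are hypothetical, and outputs are assumed to be plain lists of per-work-item integers.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    O0_WRONG = "-O0 wrong"
    O2_WRONG = "-O2 wrong"
    BOTH_WRONG = "both wrong"
    MISMATCH_NO_ORACLE = "-O0/-O2 mismatch, no oracle"


def classify(o0_out, o2_out, oracle_out=None):
    """Classify one run. oracle_out is None when no interpreter result exists."""
    if oracle_out is None:
        # Without an oracle we can only say whether the two pipelines agree.
        return Verdict.OK if o0_out == o2_out else Verdict.MISMATCH_NO_ORACLE
    o0_ok = o0_out == oracle_out
    o2_ok = o2_out == oracle_out
    if o0_ok and o2_ok:
        return Verdict.OK
    if o2_ok:
        return Verdict.O0_WRONG
    if o0_ok:
        return Verdict.O2_WRONG
    return Verdict.BOTH_WRONG


if __name__ == "__main__":
    # -O0 disagrees with both -O2 and the oracle.
    print(classify([0, 0], [1, 2], [1, 2]).value)
```

The oracle matters because an -O0 / -O2 mismatch alone does not say which side is wrong, and two pipelines that agree can still both be wrong.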

In most of the reproducers we've found, -O0 gives the wrong result and -O2 gives the correct result. My untested hypothesis is that we could find reproducers for most of these bugs at -O2 as well; it's just that LLVM is good at simplifying code, and simpler code is less likely to hit a backend bug.

I initially used LLVM HEAD as the primary fuzzing target, but many of the bugs I found didn't reproduce in the latest ROCm release. (IOW HEAD has regressions compared to the release.) Seeing this, I figured I should be fuzzing the release instead. After m038, AMD asked us to switch active fuzzing back to HEAD builds; the current upstream LLVM HEAD column has llvm/llvm-project#196418, llvm/llvm-project#198412, llvm/llvm-project#198491, llvm/llvm-project#198508, and llvm/llvm-project#198556 applied locally (the last three are AMD-provided fixes for the m001, m003/m005/m012/m014, and m026-m029 bug classes; 198556 supersedes the older 198373 and 198419 bitop3 fixes that previous builds carried). In any case, the table of results below shows which versions reproduce which bugs.

Everything below this line is AI-generated. You probably only care about the "AMDGPU Bugs Found" table. Good luck.

This directory contains the AMDGPU fuzzer work area. It is intentionally separate from the PTX / ptxas fuzzer in ../ptx/.

The AMDGPU fuzzer is the directed C++ libFuzzer target in fuzzer/. Its only input format is an LLVM bitcode module containing an AMDGPU kernel named fuzz_kernel. For each input module, the fuzzer compiles the kernel through -O0 and -O2 LLVM pipelines, links both code objects into one HSACO, runs both kernels through HIP, and compares device output.

Set FUZZX_USE_LLVM_INTERPRETER_ORACLE=1 to also run an LLVM-interpreter oracle for modules that do not use AMDGPU-specific intrinsics beyond workgroup and workitem IDs and do not use FP types. Pure LLVM integer bit-counting and byte-swap intrinsics are allowed in oracle-compatible modules. The interpreter clone scalarizes vector integer intrinsics and lowers safe LLVM integer min/max, saturation, absolute value, funnel-shift, bit-reverse, and overflow intrinsics to plain IR before execution. Oracle findings include the expected value in mismatch.txt.

Set FUZZX_REQUIRE_LLVM_INTERPRETER_ORACLE=1 for an oracle-focused campaign where mutation and crossover keep only interpreter-compatible modules.

The custom mutator and crossover operate on LLVM IR rather than on raw bytes. They currently build a conservative, defined subset of integer IR: no undef, no explicit poison values, no nuw / nsw / exact, no inbounds, no integer division except nonzero-denominator udiv / urem, only masked or constant shift amounts, and only the fixed skeleton input load/output store.

Coverage includes:

- scalar i32 integer arithmetic, bitwise ops, compares/selects, and masked dynamic shifts;
- rare signed division/remainder by proven-positive divisors;
- standalone i8 / i16 scalar subexpressions;
- i64 subexpressions truncated to i32;
- <2 x i32> / <4 x i32> vector subexpressions, including fixed shufflevector masks;
- narrow <4/8 x i8> / <4/8 x i16> vector subexpressions reduced back to i32;
- scalar and vector forms of LLVM bit/min/max/saturation/absolute intrinsics;
- narrow scalar funnel shifts and unsigned division/remainder by proven-nonzero denominators;
- explicit i1 boolean subexpressions reduced back to i32.

It also includes the following pure-IR idiom families: unsigned min/max and saturating add/sub select; masked funnel-shift/rotate; signed add/sub overflow select; predicate-mask blend/sign; bitfield extract/insert; byte/word pack-unpack; widening multiply-high/low; byte dot-product chain; bit-count/bit-twiddle; average/absolute-difference; lane clamp/saturating-pack; vector shuffle/horizontal-reduction; carry/borrow-chain; dynamic byte extraction/permutation; compare-rank/mask; ternary bit-logic; 64-bit pair arithmetic; byte-prefix/permutation; overflow-chain; select lookup-table; nibble reduction; SWAR bit tricks; byte compare/mask; limb multiply/add; select-network; vector compare/mask pack; byte Horner-mix; bit ballot/matrix-pack; halfword compare/pack; nibble table-lookup; bit deposit/extract; i64 byte-permutation; narrow-vector min/max; byte-lane select; halfword dot-accumulate; rotate/mask cascade; vector byte gather; byte-prefix compare and byte median/range; i64 cross-lane fold; vector pairwise byte arithmetic; byte permute-control; bit-run mask; i64 multiply-fold; halfword blend-network; byte ternary-blend; halfword prefix-sum; i64 rotate-add; vector compare bitmask; byte carry propagation; bit-slice boolean; vector splat/blend; i64 compare/pack; nibble carry-chain; halfword saturating-difference; i64 bitfield-mix; vector lane mix/pack; byte saturating pack; halfword multiply-high; i64 prefix-fold; and vector byte rotate/pack. These sit alongside LLVM bit, min/max, saturation, absolute-value, funnel-shift, and integer overflow intrinsics.

It also emits a small AMDGPU-specific pure integer-intrinsic subset covering BFE, SAD/MSAD, lerp, 24-bit multiply, packed SAD/MQSAD, alignbyte, signed first-bit-high, mbcnt, perm, explicit bitop3, readfirstlane, wave reductions, and integer dot-product operations, plus bounded AMDGPU FP/packing intrinsics such as fmed3, frexp, fract, class, and packed FP/int conversions. Known sudot* and fma.legacy instruction-selection crashes are gated off by default.
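
To illustrate why masked shift amounts and nonzero denominators keep every generated operation defined, here is a minimal Python model of 32-bit semantics. The helper names are hypothetical, and forcing the denominator nonzero with `| 1` is only one possible way to prove it nonzero (the same `x | 1` shape appears in several reproducers); the real generator is C++ and may use other proofs.

```python
MASK32 = 0xFFFFFFFF


def safe_shl(x, amount):
    # Models: shl i32 %x, (and i32 %amount, 31).
    # The mask keeps the shift amount below the bit width, so no poison.
    return (x << (amount & 31)) & MASK32


def safe_lshr(x, amount):
    # Models: lshr i32 %x, (and i32 %amount, 31).
    return (x & MASK32) >> (amount & 31)


def safe_udiv(x, d):
    # Models: udiv i32 %x, (or i32 %d, 1). The denominator is never zero.
    return (x & MASK32) // ((d & MASK32) | 1)


def safe_urem(x, d):
    # Models: urem i32 %x, (or i32 %d, 1).
    return (x & MASK32) % ((d & MASK32) | 1)


if __name__ == "__main__":
    print(hex(safe_shl(1, 33)))   # shift amount is masked down to 1
    print(safe_udiv(10, 0))       # denominator becomes 1
```

Because every input has exactly one defined result, any difference between -O0, -O2, and the interpreter points at a compiler bug rather than at undefined behavior in the test program.
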
It also emits a finite scalar FP subset by masking inputs to small nonnegative integers, converting with uitofp, using exact fadd / fmul / nonzero-denominator fdiv / fcmp / select shapes, and converting back with in-range fptoui; a signed variant uses small sign-extended integers, sitofp, fadd / fsub / fmul / nonzero-denominator fdiv, and in-range fptosi. It also emits finite scalar half and <2/4 x half> / <2/4 x float> vector FP subexpressions reduced back to i32.

The mutator can also wrap the current result in structured two-way branches, wider multi-way switches, branch/PHI cascades, and deeper bounded CFG subgraphs with i32 phi joins. Those subgraphs can nest more diamonds, switches, cascades, and small counted loops with optional guarded early exits. The mutator also generates top-level counted loops with bounded constant or dynamically masked trip counts whose bodies can contain nested diamonds, switches, cascades, and inner loops. A dedicated loop-nest mutation wraps an inner counted loop and optional tail CFG inside an outer bounded loop. A complex-CFG mutation chains several nested subgraphs before the final store, so a single corpus entry can contain multiple high-fanout joins and loop nests instead of just one wrapper around the result. Some generated loops carry two independent i32 accumulator phis, combine them after the loop, take a guarded early exit from the loop body through an exit phi, or switch from the loop body to multiple distinct exit values before one joined exit phi, so corpus entries exercise both expression simplification and CFG and loop transforms. CFG arms include the same scalar integer, bit, boolean, narrowing, saturating, funnel-shift, finite-FP, and vector expression families as the linear mutator.

Scalar and CFG expressions can also mix in extra i32 global input loads from in[seed % n]; these loads are only emitted inside the existing idx < n guard and are bounded by the module validator.

Corpus files can be inspected directly with opt -S corpus-entry -o -.
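
The finite FP subset stays deterministic because small masked integers and their sums and products are exactly representable in binary32. The sketch below models a uitofp / fmul / fadd / fptoui round trip in Python; the 8-bit mask width and the `fp_round_trip` helper are assumptions chosen for illustration, not the generator's actual parameters.

```python
import struct


def to_f32(value):
    """Round a Python float to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def fp_round_trip(a, b, c, mask=0xFF):
    # Mask inputs to small nonnegative integers (assumed 8-bit here).
    a, b, c = a & mask, b & mask, c & mask
    # uitofp: small integers convert to binary32 without rounding.
    fa, fb, fc = to_f32(float(a)), to_f32(float(b)), to_f32(float(c))
    # fmul then fadd, rounding to binary32 after each step.
    result = to_f32(to_f32(fa * fb) + fc)
    # fptoui: the value is a nonnegative integer far below 2**32.
    return int(result)


if __name__ == "__main__":
    # Largest case for 8-bit inputs: 255 * 255 + 255 is still below 2**24.
    print(fp_round_trip(255, 255, 255))
```

With 8-bit inputs the largest intermediate is below 2**24, so every step is exact and the FP path must agree with plain integer arithmetic.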

Requirements

Component	Notes
ROCm LLVM	Defaults to `/opt/rocm-7.1.1/lib/llvm/bin/clang-20`, `lld`, and `llvm-objdump`; override with `CLANG`, `LLD`, and `LLVM_OBJDUMP`.
HIP	`hipcc` is used to build the module runner.
AMDGPU	Defaults to `gfx950`; override with `--mcpu`.

Run

Build the current upstream-HEAD LLVM fuzzing toolchain and run the directed C++ GPU differential fuzzer:

scripts/build_instrumented_llvm.sh
scripts/build_directed_fuzzer.sh
HIP_DEVICE=0 scripts/run_directed_fuzzer.sh -runs=100 -max_len=131072

Run one directed fuzzer process per GPU:

scripts/run_directed_multigpu_fuzzer.sh -runs=1000 -max_len=131072

Run multiple directed fuzzer workers on each selected GPU:

WORKERS_PER_GPU=2 scripts/run_directed_multigpu_fuzzer.sh -runs=1000 -max_len=131072

Multi-GPU runs share one live libFuzzer corpus by default, so workers can reload inputs discovered by other workers while keeping per-worker logs and artifact directories. Set FUZZX_CORPUS_MODE=isolated to return to one independent corpus directory per worker. Fresh corpus directories are seeded with a valid LLVM bitcode module before libFuzzer starts. Set FUZZX_IMPORT_CORPUS to one or more colon-separated files or directories to copy an older corpus into the fresh corpus before workers launch.
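
As a rough sketch of what a colon-separated FUZZX_IMPORT_CORPUS value implies, the following Python function copies each listed file, and every file under each listed directory, into a fresh corpus directory. The `import_corpus` name, the flattening by file name, and the silent skipping of missing paths are all assumptions; the run scripts may behave differently.

```python
import shutil
from pathlib import Path


def import_corpus(spec, dest):
    """Copy files named by a colon-separated spec into dest; return the count."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for entry in spec.split(":"):
        if not entry:
            continue  # tolerate empty components such as a trailing colon
        src = Path(entry)
        if src.is_dir():
            files = [p for p in src.rglob("*") if p.is_file()]
        else:
            files = [src]
        for path in files:
            if path.is_file():  # skip paths that do not exist
                # Assumption: flatten by name; a name collision overwrites.
                shutil.copy2(path, dest / path.name)
                copied += 1
    return copied
```

A call such as `import_corpus("/tmp/old-run/corpus/directed-gpu/shared", "/tmp/fresh-corpus")` would mirror the single-directory case used in the campaign command; the destination path here is made up.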

For the current upstream-HEAD campaign, run multiple workers across all GPUs:

GPUS="0 1 2 3 4 5 6 7" WORKERS_PER_GPU=12 \
  FUZZX_REQUIRE_LLVM_INTERPRETER_ORACLE=1 \
  FUZZX_IMPORT_CORPUS=/tmp/old-run/corpus/directed-gpu/shared \
  scripts/run_directed_multigpu_fuzzer.sh \
    -max_total_time=900 -max_len=131072 -rss_limit_mb=8192 -use_value_profile=1

With an optimized LLVM build using sanitizer coverage and no ASan, the directed fuzzer currently reaches about 500 exec/s aggregate across 8 GPUs. Keep the corpus, logs, artifacts, findings, and TMPDIR on a local filesystem; the run scripts default these hot paths to /tmp/fuzzx-amdgpu-$USER through FUZZX_RUNTIME_ROOT. Avoid putting them on WekaFS or another shared filesystem, because libFuzzer produces a high rate of tiny metadata and log writes. The run scripts also copy the fuzzer binary into the local runtime root by default before spawning workers; set FUZZX_LOCALIZE_FUZZER=0 to disable that. When Weka client frontend processes reserve dedicated CPU cores, the run scripts default FUZZX_CPUSET=auto, detect single-core-pinned wekanode processes, and run fuzzer workers through taskset on the remaining CPUs. Set FUZZX_CPUSET=none to disable this or FUZZX_CPUSET=0-63 to use an explicit CPU set.
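
The CPU-set handling can be pictured as set arithmetic over CPU lists: parse the available CPUs, subtract the cores reserved by pinned processes, and hand the remainder to taskset. The Python below is only a sketch of that idea; the function names are hypothetical and the detection of pinned wekanode processes is not modeled.

```python
def parse_cpu_list(spec):
    """Parse a CPU list such as "0-3,8" into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def format_cpu_list(cpus):
    """Format a set of CPU numbers as a compact list such as "0-1,4-7"."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu  # extend the current run
        else:
            ranges.append([cpu, cpu])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def worker_cpus(available, reserved):
    """Return the CPU list left for fuzzer workers after removing reserved cores."""
    return format_cpu_list(parse_cpu_list(available) - set(reserved))


if __name__ == "__main__":
    # Cores 2 and 3 are assumed to be reserved in this example.
    print(worker_cpus("0-7", {2, 3}))
```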

For historical ROCm 7.2.3 release fuzzing, use the release wrapper:

scripts/run_rocm_7_2_3_release_fuzzer.sh -max_total_time=900 -max_len=131072 -rss_limit_mb=8192 -use_value_profile=1

That wrapper selects the ROCm 7.2.3 fuzzer build instead of the current upstream-HEAD fuzzer build.

Candidate compiler crashes, compile/link failures, or output mismatches are saved under $FUZZX_RUNTIME_ROOT/findings by default. Generated corpora and findings are local artifacts and are ignored by git; set FUZZX_RUNTIME_ROOT, CORPUS_ROOT, LOG_DIR, ARTIFACT_ROOT, or FUZZX_FINDINGS_DIR to override the default local runtime paths.

Known-Bug Suppression

Known bug patterns are suppressed by default so continued fuzzing does not keep rediscovering the same issue.

Flag	Default	Meaning
`FUZZX_ALLOW_M016_SCALAR_FSHL=1`	unset	Re-enable scalar `llvm.fshl.i32` generation for m015, m016, and m070; the legacy `FUZZX_ALLOW_M015_SCALAR_FSHL_ZERO=1` flag is also accepted.
`FUZZX_ALLOW_M026_UMAX_XOR_AND_HIGHBIT=1`	unset	Re-enable `(umax(a, b) ^ b) & umax(a, b)` shapes for m026.
`FUZZX_ALLOW_M028_UMAX_AND_NOT=1`	unset	Re-enable `(umax((y & ~x), C) & x) & ~x` shapes for m028.
`FUZZX_ALLOW_M030_CTLZ_SHL_OR_BITOP3=1`	unset	Re-enable `or(add(shl(...), z), z)` and `or(smin(add(shl(...), z), z), z)` tails for m030.
`FUZZX_ALLOW_M031_VECTOR_OR_EXTRACT_SUB=1`	unset	Re-enable subtracting two scalar extracts from the same vector `or` for m031.
`FUZZX_ALLOW_M032_LOOP_VECTOR_SELECT=1`	unset	Re-enable loop-carried values whose backedge depends on a vector `select` for m032.
`FUZZX_ALLOW_M035_WAVE_REDUCE_XOR=1`	unset	Re-enable `llvm.amdgcn.wave.reduce.xor` generation for m035.
`FUZZX_ALLOW_M036_WAVE_REDUCE_ADD=1`	unset	Re-enable `llvm.amdgcn.wave.reduce.add` generation for m036.
`FUZZX_ALLOW_M039_SEXT_I8_HIGHBYTE=1`	unset	Re-enable `sext i8 to i32` values feeding high-byte extraction for m039.
`FUZZX_ALLOW_M040_SIGNED_DIVREM24=1`	unset	Re-enable signed `sdiv` / `srem` by small odd denominators when the numerator is not known to fit signed 24-bit for m040.
`FUZZX_ALLOW_M041_ASHR_HIGHBYTE_PACK=1`	unset	Re-enable high-byte extraction from `ashr i32` values feeding byte-pack shapes for m041.
`FUZZX_ALLOW_M045_UREM_OR_ONE=1`	unset	Re-enable `urem x, (x \| 1)` shapes for m045.
`FUZZX_ALLOW_M046_V4I16_CTTZ=1`	unset	Re-enable `llvm.cttz.v4i16` shapes for m046.
`FUZZX_ALLOW_M047_V8I8_SHL=1`	unset	Re-enable `<8 x i8>` vector `shl` shapes for m047.
`FUZZX_ALLOW_M048_V8I8_UADD_SAT=1`	unset	Re-enable `llvm.uadd.sat.v8i8` shapes for m048.
`FUZZX_ALLOW_M049_VECTOR_FSHL=1`	unset	Re-enable vector `llvm.fshl` calls for m049; the legacy `FUZZX_ALLOW_M049_VECTOR_FSHL_ZERO=1` flag is also accepted.
`FUZZX_ALLOW_M051_VECTOR_FSHR_LOOP=1`	unset	Re-enable vector `llvm.fshr` calls for m051.
`FUZZX_ALLOW_M052_TERNARY_BLEND_SHIFT=1`	unset	Re-enable `((a ^ b) \| (b & ~(a ^ b))) & 31` shift masks for m052.
`FUZZX_ALLOW_M053_BYTEDOT_HIGHBIT=1`	unset	Re-enable byte-dot result values feeding a high-bit mask for m053.
`FUZZX_ALLOW_M054_I64_PAIR_LOW_ADD=1`	unset	Re-enable `((zext x << 32) \| 0xffff) + zext x` pair-add shapes for m054.
`FUZZX_ALLOW_M055_I64BYTEPERM_LOOP=1`	unset	Re-enable loop-carried values depending on i64 byte-permutation idioms for m055.
`FUZZX_ALLOW_M056_HALFDOT_BRANCH=1`	unset	Re-enable low-bit branch keys depending on halfword-dot pack values for m056.
`FUZZX_ALLOW_M057_ROTCASCADE_STORE=1`	unset	Re-enable final stores depending on rotate-cascade values for m057.
`FUZZX_ALLOW_M058_NIBBLE_BYTESEL_HIGHBIT=1`	unset	Re-enable byte-lane select carry values derived from nibble-table packs for m058.
`FUZZX_ALLOW_M060_PACKUNPACK_BYTEDOT=1`	unset	Re-enable final stores depending on generated `packunpack` byte-dot sums for m060.
`FUZZX_ALLOW_M061_OVMASKPACK_OVERFLOW=1`	unset	Re-enable final stores depending on generated `ovmaskpack` overflow/byte-pack values for m061.
`FUZZX_ALLOW_M062_BYTEHIST_BITMUX=1`	unset	Re-enable final stores depending on both generated `bytehist` and `bitmux` values for m062.
`FUZZX_ALLOW_M063_OVERFLOW_CARRY_BITOP3=1`	unset	Re-enable final stores depending on generated `carry` values for m063.
`FUZZX_ALLOW_M064_NIBBLECARRY_LOOP=1`	unset	Re-enable loop-carried final stores depending on generated `nibblecarry` values for m064.
`FUZZX_ALLOW_M065_USUB_OVERFLOW_XOR_FOLD=1`	unset	Re-enable final stores depending on generated `ovbytegather` values for m065.
`FUZZX_ALLOW_M066_VECI16ZEXTMUL_BITOP3_LOOP=1`	unset	Re-enable loop-carried final stores depending on generated `veci16zextmul` values for m066.
`FUZZX_ALLOW_M067_BYTECONDSEL_SELF_AND=1`	unset	Re-enable final stores depending on generated `bytecondsel` values for m067.
`FUZZX_ALLOW_M068_LOOP_VOP3FUSED_UMAXBITOP3=1`	unset	Re-enable final stores depending on generated `umaxbitop3cascade` values for m068 (shares a suppressor with m069).
`FUZZX_ALLOW_M069_UMAXBITOP3CASCADE_STORE=1`	unset	Same `umaxbitop3cascade` suppressor as m068; see m069.
`FUZZX_ALLOW_C001_SUDOT_ISEL_ICE=1`	unset	Re-enable `llvm.amdgcn.sudot4` / `llvm.amdgcn.sudot8` generation for c001.
`FUZZX_ALLOW_C002_FMA_LEGACY_ISEL_ICE=1`	unset	Re-enable `llvm.amdgcn.fma.legacy` generation for c002.
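
A minimal sketch of how such a suppression gate could be evaluated, written in Python for brevity. The `flag_enabled` helper is hypothetical, and treating only the exact value `1` as "enabled" is an assumption; the C++ fuzzer may accept other values.

```python
import os


def flag_enabled(name, legacy_names=(), env=None):
    """Return True if the flag or one of its legacy aliases is set to "1"."""
    env = os.environ if env is None else env
    return any(env.get(candidate) == "1" for candidate in (name, *legacy_names))


if __name__ == "__main__":
    # Unset flags leave the known-bug shape suppressed.
    allow_vector_fshl = flag_enabled(
        "FUZZX_ALLOW_M049_VECTOR_FSHL",
        legacy_names=("FUZZX_ALLOW_M049_VECTOR_FSHL_ZERO",),
    )
    print("vector fshl generation:", "on" if allow_vector_fshl else "suppressed")
```

Defaulting to suppressed keeps long campaigns focused on new shapes, while a single environment variable is enough to re-check whether a fixed toolchain still reproduces an old bug.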

Layout

Path	Purpose
`third_party/llvm-project`	LLVM source checkout, pinned as a git submodule.
`patches/llvm-pr-198373.diff`	Local source-fix patch for the current HEAD campaigns; `scripts/build_instrumented_llvm.sh` applies it by default to the selected `LLVM_PROJECT_DIR`.
`patches/llvm-pr-196418.diff`	Local patch for unsigned `LowerDIVREM24`; `scripts/build_instrumented_llvm.sh` applies it by default to the selected `LLVM_PROJECT_DIR`.
`patches/llvm-pr-198412.diff`	Local patch for non-add AMDGPU dot-product add-chain matching; `scripts/build_instrumented_llvm.sh` applies it by default to the selected `LLVM_PROJECT_DIR`.
`patches/llvm-pr-198419.diff`	Local source-fix patch for AMDGPU `BitOp3_Op` shared-source aliasing; `scripts/build_instrumented_llvm.sh` applies it by default to the selected `LLVM_PROJECT_DIR`.
`scripts/build_instrumented_llvm.sh`	Helper for configuring a sanitizer-coverage LLVM source build.
`scripts/build_directed_fuzzer.sh`	Builds the C++ GPU differential libFuzzer target.
`scripts/seed_ir_corpus.sh`	Writes the initial LLVM bitcode corpus seed.
`scripts/run_directed_fuzzer.sh`	Runs the C++ directed fuzzer on one GPU.
`scripts/run_directed_multigpu_fuzzer.sh`	Runs one or more C++ directed fuzzer processes per selected GPU.
`scripts/run_rocm_7_2_3_release_fuzzer.sh`	Runs the C++ directed fuzzer with the ROCm 7.2.3 release build.
`fuzzer/`	LLVM API plus HIP differential libFuzzer target.
`runner/hip_module_runner.cpp`	HIP module loader used to execute generated HSACO files.
`known-miscompiles/`	Reduced or standalone reproducers for confirmed findings.

AMDGPU Bugs Found

Except where otherwise noted, these have been tested on gfx950. The result columns report the generic known-miscompiles/run_ll_reproducer.sh differential test: ✅ means no mismatch was observed for that reproducer, and ❌ means the toolchain reproduces the -O0 / -O2 mismatch. Confirmed compiler ICEs should be recorded here too, with the table entry describing the crashing toolchain and phase instead of a differential result.

Tested toolchains as of 2026-05-19:

Column	Toolchain
ROCm release	ROCm 7.2.3 source tag, commit `f58b06dce1f9c15707c5f808fd002e18c2accf7e`; also checked against the matching ROCm 7.2.3 `rocm-llvm` package, package SHA256 `4c406e184f88949cea60869949454e5392e1cbd9480c4c87274f7b59e9f810e5`.
LLVM HEAD	https://github.com/llvm/llvm-project/commit/0dd29960cd6102b37651cc3f58f872652099b83b (2026-05-18) plus llvm/llvm-project#196418, llvm/llvm-project#198412, llvm/llvm-project#198491, llvm/llvm-project#198508, and llvm/llvm-project#198556, built `Release` with sanitizer coverage, no ASan.
ROCm HEAD	https://github.com/ROCm/llvm-project/commit/a5de13684ba84db953b28e632ea304080a4318d0 (2026-05-18) plus llvm/llvm-project#196418, llvm/llvm-project#198412, llvm/llvm-project#198491, llvm/llvm-project#198508 (source-only; the patch's `.ll` test diffs do not apply against ROCm-staging baseline checks), and llvm/llvm-project#198556, built with assertions, ASan, and sanitizer coverage.

Bug	ROCm 7.2.3	LLVM HEAD	ROCm HEAD	Description
m001-ashr-i16-zext	❌	✅	✅	`ashr i16` feeding `zext i16 to i32` is folded to a sign-extending SDWA byte select; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198491.
m002-i8-clear-xor	✅	✅	✅	`-O0` lowers a byte-clear xor through `v_bitop3_b32` with the wrong result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373.
m003-shl3-add-chain	✅	✅	✅	`-O0` scalarizes a divergent `shl3/add` chain through `v_readfirstlane_b32`; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198508.
m004-vector-identity-xor	✅	✅	✅	`-O0` loses a lane-0 vector identity before `xor`; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373.
m005-shl1-add-chain	✅	✅	✅	`-O0` scalarizes a divergent `shl1/add` chain through the same class of bug as m003; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198508.
m006-i8-xor-clear	✅	✅	✅	`-O0` lowers another adjacent `i8` narrow byte-clear xor through the wrong `v_bitop3_b32` result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373.
m007-vector-shl-identity-xor	✅	✅	✅	`-O0` loses a vector shift-by-zero lane-0 identity before `xor`; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373.
m008-i8-separated-clear	✅	✅	✅	`-O0` miscompiles an `i8` identity byte-clear xor when prior narrow ops are separated by no-op adds; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373.
m009-i16-clear-xor	✅	✅	✅	`-O0` miscompiles an `i16` identity low-16 clear xor through the wrong `v_bitop3_b32` result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373.
m010-i16-sext-clear-xor	✅	✅	✅	`-O0` miscompiles an `i16` sign-extended identity clear xor through the wrong `v_bitop3_b32` result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373.
m011-i8-sext-clear-xor	✅	✅	✅	`-O0` miscompiles an `i8` sign-extended masked clear xor through the wrong `v_bitop3_b32` result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198373.
m012-add-shl-ladder	✅	✅	✅	`-O0` scalarizes a divergent `add/shl` ladder through `v_readfirstlane_b32`; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198508.
m013-private-memory-fshl	❌	❌	❌	`-O0` lowers fixed private-memory allocas through a dynamic scratch stack sequence that can return intermittent wrong values.
m014-shl-add-ctpop	✅	✅	✅	`-O0` scalarizes a four-step `shl/add` chain feeding `ctpop` through lane 0; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198508.
m015-scalar-fshl-zero	✅	❌	❌	`-O0` lowers scalar `fshl.i32(x, y, 0)` through a 64-bit shift-by-`-1` sequence that returns zero.
m016-scalar-fshl-one	✅	❌	❌	`-O0` lowers scalar `fshl.i32(x, y, 1)` through a 64-bit shift-by-`-1` sequence that returns only bit 31.
m017-vector-and-lane0-clear-xor	❌	✅	✅	ROCm 7.2.3 `-O0` drops a vector lane-0 `and`/`extractelement` clear before `xor`; LLVM HEAD and ROCm HEAD already pass.
m018-two-private-memory-ops	❌	✅	✅	ROCm 7.2.3 `-O0` intermittently reads stale scratch data across two private-memory sequences; LLVM HEAD and ROCm HEAD pass 50 repeated combined runs.
m019-highbit-or-xor	❌	✅	✅	`-O0` combines a high-bit `(x \| C) ^ x` expression into `v_bitop3_b32` with the wrong truth table or operands; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198419.
m020-or-xor-and	❌	✅	✅	`-O0` combines `((a \| b) ^ b) & (a \| b)` into `v_bitop3_b32` with the wrong result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198419.
m021-fshl-or-xor	❌	✅	✅	`-O0` combines a dynamic `(a \| b) ^ a` expression after `fshl` into `v_bitop3_b32` with the wrong result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198419.
m022-and-xor-constant	❌	✅	✅	`-O0` combines `((x ^ C) & x)` after a dynamic `and` into `v_bitop3_b32` with the wrong low bit; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198419.
m023-and-xor-identity	❌	✅	✅	`-O0` combines `(x & y) ^ x` into `v_bitop3_b32` with the wrong identity result; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198419.
m024-udiv-or-one	❌	✅	✅	`-O0` lowers unsigned division of a sign-extended `i16` value by `x \| 1` through an imprecise float reciprocal path; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#196418.
m025-urem-or-one	❌	✅	✅	`-O0` lowers unsigned remainder of a sign-extended `i16` value by `x \| 1` through the same imprecise reciprocal path; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#196418.
m026-shl-umax-xor-and	❌	❌	❌	`-O2` combines a shifted `umax` high-bit extraction into `v_bitop3_b32` using the input and salt instead of their xor; llvm/llvm-project#198556 does not catch this shape.
m027-xor-and-or	❌	✅	✅	`-O0` combines `(((y ^ x) & x) \| base)` into `v_bitop3_b32` with the wrong bit when `x` is `(base ^ z) & base`; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198556.
m028-umax-and-not	❌	❌	❌	`-O0` combines `(umax((y & ~x), C) & x) & ~x` into `v_bitop3_b32` using the input and salt separately; llvm/llvm-project#198556 does not catch this shape.
m029-fshl-select-phi	❌	✅	✅	`-O2` lowers a signed compare/select over `y & x`, where `x` is a complemented masked `fshl`, so the true zero arm is chosen when the signed compare is false; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198556.
m030-ctlz-shl-or-bitop3	❌	❌	❌	`-O2` lowers a low-bit `or` through `v_bitop3_b32` using the unmasked `%n` value instead of `%n & 1`.
m031-vector-or-extract-sub	❌	✅	✅	ROCm 7.2.3 `-O2` scalarizes a vector `or` extract/sub as `or(x, 255) - x` instead of `or(x, 255) - -1`.
m032-loop-vector-select	❌	✅	✅	ROCm 7.2.3 `-O2` kills the loop EXEC mask before storing a loop-carried value derived from a vector `select`.
m033-sub-zext-bool-fp	❌	✅	✅	`-O2` lowers `sub i32 X, zext(i1 Cond)` through `s_subb_u32` with the wrong false-case borrow before a masked FP accumulation; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198412.
m034-fshl-add-workitem-product	❌	✅	✅	`-O2` rewrites a workitem-product `fshl`/add chain as a byte dot product that returns `0xffffffff` instead of `0xc0000000` for `x == 0`; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198412.
m035-wave-reduce-xor-constant	❌	✅	✅	ROCm 7.2.3 `-O2` folds `llvm.amdgcn.wave.reduce.xor.i32(30, 0)` to `30` instead of the even-wave XOR result `0`.
m036-wave-reduce-add-constant	❌	✅	✅	ROCm 7.2.3 `-O2` folds `llvm.amdgcn.wave.reduce.add.i32(65536, 1)` to `65536` instead of the full-wave sum `0x00400000`.
m037-dot4-square-lowbit	❌	✅	✅	`-O2` lowers a byte-masked `xx + (xx & 1)` expression to `v_perm_b32` / `v_dot4_u32_u8` with an extra constant accumulator; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198412.
m038-loop-fp-mask-xor	❌	✅	✅	`-O2` unrolls nested xor loops and folds a masked integer-to-FP round-trip into a byte-dot sequence that adds `1023` for input zero; LLVM HEAD and ROCm HEAD pass after llvm/llvm-project#198412.
m039-sext-i8-highbyte-pack	❌	❌	❌	`-O2` packs bytes after an `i8` sign-extension but clears the byte lanes contributed by the sign bits.
m040-sdivrem24-boundary	❌	❌	❌	`-O2` applies the signed 24-bit reciprocal division lowering when the positive numerator has bit 23 set, returning a quotient one too large.
m041-ashr-highbyte-pack	❌	❌	❌	`-O2` lowers a byte pack after `ashr i32` to `v_perm_b32` with the wrong high-byte lane.
m042-or-lshr-zero-xor	✅	✅	✅	`-O0` lowered `or x, (lshr x, 0)` where `x` is `(a ^ b) \| ((a ^ b) >> 1)` through the wrong `v_bitop3_b32`; LLVM HEAD passes after llvm/llvm-project#198373.
m043-zext-i8-self-xor	✅	✅	✅	`-O0` lowered `xor x, x`, where `x` is `zext(trunc(workitem.id.x)) ^ 1`, through `v_bitop3_b32`; LLVM HEAD passes after llvm/llvm-project#198373.
m044-v4i32-self-and-zero-shuffle	✅	✅	✅	`-O0` lowered a `<4 x i32>` `and x, x` lane ORed with a zero shuffle through `v_bitop3_b32`; LLVM HEAD passes after llvm/llvm-project#198373.
m045-urem-or-one-known24	❌	❌	❌	`-O2` lowers `urem x, (x \| 1)` with known 24-bit `x` to `0x00ffffff` instead of `x` when even `x` is smaller than `x \| 1`; explicit masking can make `-O0` wrong too.
m046-v4i16-cttz-funnel-loop	✅	❌	❌	`-O2` miscomputes a dynamic-trip nested loop whose body extracts a lane from `llvm.cttz.v4i16` and feeds a funnel-shift-shaped scalar expression.
m047-bytedot-v8i8-shl-loop	✅	❌	❌	`-O2` folds a byte-dot-style dynamic loop with a `<8 x i8>` vector shift to `4` for lanes where `-O0` produces smaller values.
m048-v8i8-uadd-sat-vecreduce-loop	✅	❌	❌	`-O2` miscomputes a loop using `llvm.uadd.sat.v8i8` followed by byte extraction and a two-lane vector-reduce xor/and idiom, changing the low bits by two.
m049-vector-fshl-zero	✅	❌	❌	`-O0` lowers vector `llvm.fshl.v4i32(x, 0, 0)` through a 64-bit shift-by-`-1` sequence that returns zero instead of the selected vector lane.
m050-bitcount-and-sub-zero	✅	✅	✅	`-O0` lowered `and X, (X - 0)` feeding `ctpop` through the wrong `v_bitop3_b32`; LLVM HEAD passes after llvm/llvm-project#198373.
m051-vector-fshr-divergent-loop	✅	❌	❌	`-O2` scalarizes a vector `llvm.fshr.v2i32` loop tail and carries one scalar inner-loop result into divergent lanes that exited earlier.
m052-ternary-blend-shift	✅	❌	❌	`-O0` lowers `((a ^ b) \| (b & ~(a ^ b))) & 31` as `a & 31`, dropping `b` before a funnel-shift-like expression.
m053-bytedot-highbit	✅	❌	❌	LLVM HEAD and ROCm HEAD `-O0` lower a byte-dot/high-bit expression through a changed `v_bitop3_b32` / `v_bfi_b32` sequence that clears a high bit before a final xor.
m054-i64-pair-low-add	❌	❌	❌	`-O2` folds `((zext x << 32) \| 0xffff) + zext x` into a u24 multiply-add-like sequence that drops the high-half copy of `x`.
m055-i64byteperm-loop-readfirstlane	✅	❌	✅	LLVM HEAD `-O0` miscompiles a loop-carried value depending on an i64 byte-permutation fold, returning `0xffffffff` instead of `0xff22dd00`; ROCm 7.2.3 and ROCm HEAD pass.
m056-halfdot-lowbit-branch	✅	❌	❌	LLVM HEAD and ROCm HEAD `-O0` miscompute a low-bit branch key derived from a halfword-dot byte pack and store zero instead of `0xfffd7ffc`.
m057-rotcascade-store	✅	❌	✅	LLVM HEAD `-O0` miscomputes a repeated rotate/popcount/bitselect cascade before the final store; ROCm 7.2.3 and ROCm HEAD pass.
m058-nibble-bytesel-highbit	❌	❌	❌	`-O0`/`-O2` disagree on the high bit of a funnel-shift-shaped final store when a byte-lane select carry is derived from a nibble-table pack; the original oracle finding has LLVM HEAD `-O0` wrong.
m059-srem-loop-branch	✅	✅	✅	A stale LLVM HEAD build missing llvm/llvm-project#198373 skipped a live lane when a multi-exit loop branch key came from `srem`; the current patched toolchains pass.
m060-packunpack-bytedot-dot4	❌	❌	❌	`-O2` folds a generated `packunpack` three-term byte-dot sum into `v_dot4_u32_u8` with the wrong packed byte or accumulator, returning `0x1e35` instead of `0x1f98`.
m061-ovmaskpack-o0-overflow-lowering	✅	❌	❌	LLVM HEAD and ROCm HEAD `-O0` mislower an unoptimized overflow-mask-pack chain and store `0xa1df8800` instead of the oracle/`-O2` result `0xa0df8400`; ROCm 7.2.3 passes.
m062-bytehist-bitmux-lowbyte	✅	❌	❌	LLVM HEAD and ROCm HEAD `-O0` lower a bytehist/bitmux low-byte expression through `v_bitop3_b32` and store `0xb81c0001` instead of the oracle/`-O2` result `0xb81c0002`; ROCm 7.2.3 passes.
m063-overflow-carry-bitop3	✅	❌	❌	LLVM HEAD and ROCm HEAD `-O0` lower an overflow-derived duplicated carry expression through `v_bitop3_b32` and store `0x6` instead of the oracle/`-O2` result `0x2`; ROCm 7.2.3 passes.
m064-nibblecarry-loop-readfirstlane	✅	❌	❌	LLVM HEAD and ROCm HEAD `-O0` scalarize a divergent nibble-carry-derived loop value through `v_readfirstlane_b32` and store `0x1805d9` instead of the oracle/`-O2` result `0xc1b09`; ROCm 7.2.3 passes.
m065-usub-overflow-xor-fold	✅	❌	❌	LLVM HEAD and ROCm HEAD `-O0` fold `(lane ^ fold) & 1` after `usub.with.overflow` into a single `v_bitop3_b32` with the wrong truth table, storing `0x0` instead of the oracle/`-O2` result `0x1`; ROCm 7.2.3 passes.
m066-veci16zextmul-bitop3-loop	❌	❌	❌	`-O2` miscompiles a 12-iteration loop whose body builds `<4 x i16>` from the accumulator halves, zext-multiplies against constants, xor-reduces, smaxes two lanes, and xors the result back; exit value goes through a bitop3 cascade and stores `0x8BD601F1` instead of the oracle/`-O0` result `0x2BE83DE2`.
m067-bytecondsel-and-i1-self	✅	❌	❌	LLVM HEAD and ROCm HEAD `-O0` mis-lower `select i1 (and i1 X, X) c, 0` (where `X = icmp ult i32 a, 0`, always false) by evaluating the select as if the condition were true, storing `0xCE` instead of the oracle/`-O2` result `0x59`; ROCm 7.2.3 passes.
m068-loop-vop3fused-umaxbitop3	❌	❌	❌	`-O2` miscompiles a nested loop whose accumulator is seeded from `vop3fused` + `umaxbitop3cascade` shapes, storing `0x937E` instead of the oracle/`-O0` `0x8210A05D`.
m069-umaxbitop3cascade-store	❌?	❌	❌?	`-O2` miscompiles a final store whose value is `fuzz.umaxbitop3cascade.idiom.a.add`, storing `0x5C83AF47` instead of the oracle/`-O0` `0x814EF57`. Sibling bug to m068; ROCm 7.2.3 / ROCm HEAD not yet verified.
m070-scalar-fshl-shift8	✅	❌	❌	`-O0` lowers scalar `fshl.i32(x, 0, 8)` to a 64-bit shift by `-8`, returning `x >> 24` instead of `x << 8`; same lowering family as m015/m016 but shows the bug applies to every non-zero constant shift, not just `c=1`.
m071-bxorand-or-and-not-bitop3	❌	❌	❌	`-O0` lowers `((b ^ (c & a))
m072-bitop3-tand-or-and-not-zero	✅	❌	❌	`-O0` lowers `((b & (a & c))
m073-bitop3-t1t2-and-or-xor	❌	❌	❌	`-O0` lowers `((a&b) & (a
m074-fmed3-nan-ieee-off-maxmin	❌	❌	❌	`-O2` InstCombine fold of `amdgcn.fmed3(x, y, NaN)` in IEEE-off mode produces `maximumnum(x, y)` instead of `minimumnum(x, y)`; the polarity check in `AMDGPUInstCombineIntrinsic.cpp` only treats `-inf` as "Min" and defaults NaN to "Max", inconsistent with both the documented behaviour table and the parallel arms for `Src0`/`Src1`.
m075-rcp-constant-denormal-flush	❌	❌	❌	`-O2` InstCombine fold of `amdgcn.rcp.f32(C)` returns the exact `1/C` even when the kernel's f32 denormal mode is `PreserveSign` (the default) and the hardware would have flushed the denormal result to `±0`. For `C = 2^127` the fold returns `0x00400000` while `v_rcp_f32` returns `0`. A `TODO` next to the fold already calls out this issue.
m076-sffbh-umin-knownbits-check	✅	❌	❌	`-O2` SDAG fold of `umin(amdgcn.sffbh(x), Clamp)` to `sffbh(x)` fires when `x` is provably non-zero but `x = -1` is still reachable, because the negative side of the check uses the weak `!Known.isAllOnes()` ("not provably all-ones") instead of "provably not all-ones". For `x = (load | 1)` with input `0xFFFFFFFE` the fold returns `-1` instead of `Clamp = 32`. HEAD-only regression.
m077-rcp-constant-denormal-input	❌	❌	❌	`-O2` InstCombine fold of `amdgcn.rcp.f32(C)` ignores the kernel's f32 denormal mode on the input side: for a denormal constant `C = 2^-127` (`0x00400000`) the fold returns the exact `2^127` (`0x7f000000`) while `v_rcp_f32` on gfx950 with the default `PreserveSign` mode first flushes the denormal input to `±0` and then returns `+Inf` (`0x7f800000`). Distinct from m075 (which is the same fold's output-side flush bug).
m078-wave-reduce-fsub-f64-dpp-identity	❌	❌	❌	The DPP-strategy lowering of `llvm.amdgcn.wave.reduce.fsub.f64` (and the SGPR-uniform `V_ADD_F64` arm) uses the generic FP64 additive identity `-0.0` from `getIdentityValueForWaveReduction`, but the ITERATIVE strategy explicitly overrides that to `+0.0` with the comment "`+0.0 for double sub reduction`". For all-zero input the two strategies disagree on the sign of zero (iterative: `+0.0`, DPP/uniform: `-0.0`); IEEE chained `0-0-...-0` rounds to `+0.0`, so the DPP/uniform path is wrong. Not an `-O0`/`-O2` mismatch (strategy is an `immarg`), so the reproducer XORs the two strategies' bit-patterns inside one kernel.
m079-fcmp-icmp-i64-wave32-fold	❌	❌	❌	`-O2` InstCombine "always-true" fold for `amdgcn.fcmp`/`icmp` blindly uses `II.getType()` as the type for `read_register("exec", ...)`. On wave32 with `.i64` return the fold therefore reads the full 64-bit `EXEC` pair, leaking the architecturally-unused `EXEC_HI` into the high 32 bits; `-O0`'s SDAG path correctly emits `v_cmp + zext i32 -> i64` so the high bits are zero. Sibling miscompile to the c007 ICE (wave64/`.i32`). Demonstrated as static asm divergence -- the FuzzX box has no wave32 GPU.
m080-gisel-clamp-i64-i16-degenerate	❌	❌	❌	GlobalISel `AMDGPUPreLegalizerCombiner::matchClampI64ToI16` validator OR's both orderings of `(Cmp1, Cmp2)` but the matcher distinguishes two patterns. For pattern 1 = `smin(smax(X, Cmp2), Cmp1)` with `Cmp1 < Cmp2` the IR is identically `Cmp1`, but the combiner rewrites it to `med3(min, X_packed, max)` -- a real clamp that returns `X` whenever it falls inside `[Cmp1, Cmp2]`. For `Cmp1=5, Cmp2=100, X=50` the IR semantic is `5` but compiled code returns `50`. GISel-only (`-mllvm -global-isel`); standard SDAG path is unaffected.
m081-gisel-wave-shuffle-half-check	❌	❌	❌	GISel `selectWaveShuffleIntrin` for wave64 GFX10/GFX11 builds the same-or-other-half check by XORing `ThreadID` with `set_inactive(Index << 2)` instead of the unshifted `Index`. The `& 32` then extracts bit 3 of the original index instead of bit 5, so for any index where bit 3 ≠ bit 5 the selector routes through the wrong one of `{ds_bpermute, permlane64-then-bpermute}` and returns the value from the opposite 32-lane half. SDAG's `lowerWaveShuffle` keeps unshifted `Index` for the XOR and is correct. Demonstrated as static asm divergence on `gfx1100 +wavefrontsize64` -- the box has no wave64 GFX10/GFX11 GPU.
m082-kernarg-range-md-width-mismatch	❌	❌	❌	`AMDGPULowerKernelArguments` widens any sub-dword scalar kernarg load to `i32`, then transplants the argument's `range` ParamAttr onto the widened load via `MDB.createRange(Range.getLower(), Range.getUpper())` -- but the APInts are still at the argument's original (sub-dword) width, so the load gets `!range !{i8 0, i8 4}` on an `i32` instruction. The IR verifier rejects this ("Range types must match instruction type!"), so `opt -passes=amdgpu-lower-kernel-arguments` aborts; the default `clang -O2` pipeline has no in-pipeline verifier so the wrong-typed MD survives to codegen and is a latent miscompile risk for any downstream pass that consults the range MD on the un-truncated `i32` load. Sibling `nonnull`/`dereferenceable`/`align` block is correctly guarded by `isa<PointerType>(ArgTy)`.
m083-rewrite-out-args-mayalias-swap	❌	❌	❌	`AMDGPURewriteOutArguments` uses a single `MemoryDependence` query to find "the store" for each out-arg without checking that the returned store's pointer is actually that out-arg. For two non-`noalias` ptr args MDA returns the last store in the block as the def of both, so the pass pairs each out-arg with the OTHER store's value -- producing a clean value swap (`store 1, %a; store 2, %b` becomes `ret { 2, 1 }` consumed as `%a=2, %b=1`). Egregiously, the existing LIT test `multiple_same_return_mayalias` in `rewrite-out-arguments.ll` encodes the buggy `{ 2, 1 }` output as the expected result. Pass not in default pipeline -- reachable via `opt -amdgpu-rewrite-out-arguments`.
m084-s-barrier-init-unmasked-membercount	❌	❌	❌	SDAG lowering of `llvm.amdgcn.s.barrier.init` / `s.barrier.signal.var` at `SIISelLowering.cpp:12450-12459` builds the masked member-count SDValue but then immediately overwrites it with `SHL CntOp, 16` using the unmasked raw `CntOp`. Bits `CntOp[15:6]` leak into `M0[31:22]`, above the legal 6-bit `M0[21:16]` member-count field; for `%cnt >= 64` the hardware named-barrier sees a corrupted member count. The GISel counterpart at `AMDGPUInstructionSelector.cpp:7240-7250` masks correctly. gfx12+ intrinsic so demonstrated as static SDAG-vs-GISel asm divergence -- the FuzzX box has no gfx12 GPU.
m085-fatptr-array-vec-elem-store-vs-alloc-stride	❌	❌	❌	`AMDGPULowerBufferFatPointers` at `AMDGPULowerBufferFatPointers.cpp:978-985` (load) and `:1098-1105` (store) uses `getTypeStoreSize(ElemTy)` for the per-element stride when lowering an `[N x vec]` load/store, but LLVM lays out array elements at multiples of `getTypeAllocSize(ElemTy)`. For `<3 x i32>` (storeSize=12, allocSize=16 on AMDGPU's `v96:128` layout), `[2 x <3 x i32>]` element[1] is at byte offset 16 but the pass loads/stores from offset 12 -- short-reading 4 bytes of element[0]'s padding plus only 8 bytes of element[1]. Pass IS in the default `clang -O2` codegen pipeline. Demonstrated at IR (`opt -passes=amdgpu-lower-buffer-fat-pointers`) and asm (`buffer_load_dwordx3 ... offset:12`) levels.
m086-set-inactive-known-bits-overclaim	❌	❌	❌	`AMDGPUTargetLowering::SimplifyDemandedBitsForTargetNode` (`AMDGPUISelLowering.cpp:5838-5846`) handles `amdgcn.set_inactive` in the same `case` body as `readfirstlane`/`readlane`/`wwm`, populating `Known` only from `Op.getOperand(1)` (the active-lane `value`) and never visiting `Op.getOperand(2)` (the `inactive_value`). When `value` is a constant, the generic SimplifyDemandedBits framework constant-folds the entire call to that constant, silently dropping `inactive_value`. Asm-level proof: `set_inactive(0xAAAAAAAA, 0x55555555) & 0xFFFF` at `-O0` emits a `v_cndmask_b32_e64` selecting between both constants; at `-O2` it collapses to `s_mov_b32 s2, 0xaaaa` with no cndmask and no `0x55555555` anywhere. Same shape as m076 (target-node knownbits lying).
m087-image-store-sparse-dmask-trim	❌	❌	❌	`simplifyAMDGCNMemoryIntrinsicDemanded` channel-trimming loop (`AMDGPUInstCombineIntrinsic.cpp:2317-2342`, reached from the `image_store_*` case) walks DMask bits left-to-right and drops every set DMask bit whose position-among-set-bits is past the contiguous-prefix demanded mask returned by `trimTrailingZerosInVector`. For a sparse DMask like `0b1010` (Y+W) with `vdata = <a, 0>`, the W channel is dropped: O0 emits `image_store v[0:1], v2, s[0:7] dmask:0xa unorm` (writes Y=a, W=0), O2 emits `image_store v0, v2, s[0:7] dmask:0x2 unorm` (writes only Y; W is left unchanged). Existing LIT tests use only contiguous DMask and miss this.
m088-kernarg-noundef-widened-load	❌	❌	❌	`AMDGPULowerKernelArguments.cpp:319-320` unconditionally stamps `!noundef` on the widened i32 kernarg load whenever the original sub-dword arg has the `noundef` attribute, but the load's high bits come from sibling kernargs or padding whose noundef-ness is not constrained. `isGuaranteedNotToBeUndefOrPoison` then returns true for the un-truncated i32, and GVN+InstCombine can drop a `freeze` guarding a branch on a different arg, producing immediate UB from a well-defined source. Sibling to m082 (range MD); silent past `--verify-each`. Bug is in default-pipeline-emitted IR; weaponization requires a post-`amdgpu-lower-kernel-arguments` IR opt (LTO post-link, hand-rolled `opt`, JIT).
m089-lowerkernattr-grid-div-not-uniform-gated	❌	❌	❌	`AMDGPULowerKernelAttributes.cpp:409-446` ("Upgrade the old method") rewrites `udiv(grid_size_x, group_size_x)` (floor) into a load of `HIDDEN_BLOCK_COUNT_X` (ceil per AMDHSA ABI) without checking the `uniform-work-group-size` attribute; the two sibling rewrites at lines 310-347 and 348-404 in the same file do check it. For non-uniform dispatches (OpenCL ≥2.0 with `-cl-uniform-work-group-size=false`, or hand-built AQL packets) where `grid % group != 0`, floor and ceil differ. HIP runtime always sets `dispatch.grid_size = gridDim * blockDim` so the harness can't observe runtime divergence, but the IR-level transform is unconditionally unsound; the upstream LIT test `implicit-arg-block-count.ll` encodes the buggy behavior as expected.
m090-image-msaa-load-merge-ignores-dmask	❌	❌	❌	`AMDGPUImageIntrinsicOptimizer::collectMergeableInsts` (`AMDGPUImageIntrinsicOptimizer.cpp:114`) starts its arg-equality loop at `I=1`, silently skipping arg 0 (`DMaskIndex`). Two `image_load_2dmsaa` calls at the same coords with different DMasks get fused into a single `image_msaa_load` using only the first call's DMask, so the second extract reads from the wrong channel (e.g., `R(f1)` instead of `A(f1)`). The in-source comment claims to check DMask but the code doesn't. Gated off for gfx950 by `MSAALoadDstSelBug`; reproduces on gfx1150 (and any gfx10/gfx11 wave-graphics target without that erratum).
m091-latecgp-widen-load-noundef	❌	❌	❌	`AMDGPULateCodeGenPrepare.cpp:538-540` widens a sub-DWORD constant-AS load to i32 via `copyMetadata(LI) + setMetadata(MD_range, nullptr)` -- but `MD_noundef` is NOT cleared. The widened i32 load's high bits come from neighbouring bytes whose noundef-ness wasn't implied by the original attribute. GVN+InstCombine can then drop a `freeze` guarding a branch on bits the source program never claimed noundef on, producing immediate UB. Same shape as m088 (kernarg widening); this pass's `WidenLoads` cl::opt defaults to `true` so the bad MD is in default-pipeline-emitted IR. Twin latent in `AMDGPUCodeGenPrepare.cpp:1561-1562` where `WidenLoads` defaults to `false`.
m092-select-fcmp-one-nan-arg	✅	❌	❌	`SITargetLowering::performSelectCombine` (`SIISelLowering.cpp:18335-18374`) rewrites `select (fcmp one x, K), other, K` -> `select (fcmp one x, K), other, x` to avoid materializing the constant `K` twice. The fold guards the constant side (excludes NaN/Inf/zero/subnormal/inline-immediates) but never checks the non-constant operand `x`. When `x = NaN`, `fcmp one NaN, K` is `false`, so the original returns `K` while the folded form returns `x = NaN`. HEAD-only regression (fold added after ROCm 7.1.1 snapshot). Runtime confirmed: input `0x7fc00000` -> O0=`0x402df850`, O2=`0x7fc00000`.
m093-libcalls-pow-sqrt-no-fmf-guard	❌	❌	❌	`AMDGPULibCalls::fold_pow` (`AMDGPULibCalls.cpp:936-950`) rewrites `pow(x, ±0.5)` to `sqrt(x)`/`rsqrt(x)` without checking fast-math flags -- it returns before the `isUnsafeFiniteOnlyMath` guard at the bottom of `fold_pow`. C99/IEEE `pow` says `pow(-Inf, 0.5)=+Inf` and `pow(-0.0, 0.5)=+0.0`, but `sqrt(-Inf)=NaN` and `sqrt(-0.0)=-0.0`. Runtime confirmed: input `-Inf` -> O0=`+Inf`, O2=`NaN`; input `-0.0` -> O0=`+0.0`, O2=`-0.0`. Reproduces on both LLVM HEAD and ROCm 7.1.1 (not a HEAD-only regression).
m094-fmul-legacy-sign-of-zero	❌	❌	❌	`canSimplifyLegacyMulToMul` (`AMDGPUInstCombineIntrinsic.cpp:398-418`) lets the `amdgcn.fmul.legacy` and `amdgcn.fma.legacy` folds rewrite to plain `fmul`/`fma` whenever one operand is finite-nonzero. But `V_MUL_LEGACY_F32` returns `+0.0` whenever either operand is `±0`, while IEEE `fmul` XORs the signs: `fmul.legacy(-2.0, +0.0) = +0.0` but `fmul(-2.0, +0.0) = -0.0`. Runtime confirmed: O0=`+0.0`, O2=`-0.0`. Reproduces on both LLVM HEAD and ROCm 7.1.1; the existing LIT test `fmul_legacy.ll:28` encodes the buggy transform as expected.
m095-fmed3-sign-of-zero-maxnum	❌	❌	❌	`fmed3AMDGCN` (`AMDGPUInstCombineIntrinsic.cpp:53-68`) implements all-constant `amdgcn.fmed3` as a chain of three `maxnum` calls. `APFloat::compare` treats `±0` as `cmpEqual` and `APFloat::maxnum` "treats +0 > -0", so `fmed3(-0, -0, +0)` folds via `maxnum(-0, +0) = +0`, while HW `v_med3_f32(-0, -0, +0)` returns the actual median by sort order = `-0`. Runtime confirmed: O0=`0x80000000`, O2=`0x00000000`. Reproduces on both LLVM HEAD and ROCm 7.1.1. Generalises to any input triple where two operands tie at `-0` and the third is `+0`.
m096-fatptr-cmpxchg-weak-success-bool-poison	❌	❌	❌	`AMDGPULowerBufferFatPointers.cpp:1881-1886` only fills the `cmpxchg` result's success-bool field (`{T, i1}` lane 1) for strong cmpxchg; for `cmpxchg weak` it leaves `i1` as `poison`. `buffer_atomic_cmpswap` is non-spurious so the same `ICmpEQ(Call, CompareOperand)` would work for both forms. Default pipeline. The lowered IR has `insertvalue {i32, i1} poison, %r, 0` with field 1 never set; downstream `extractvalue ..., 1` returns poison and any branch/store/arithmetic on it is UB. Sibling shape to m085 (same file).
m097-swlowerlds-memintrinsic-leak	❌	❌	❌	`AMDGPUSwLowerLDS::getLDSMemoryInstructions` (`AMDGPUSwLowerLDS.cpp:639-663`) whitelists only `Load`/`Store`/`AtomicRMW`/`AtomicCmpXchg`/`AddrSpaceCast`, omitting `MemIntrinsic`. `replaceKernelLDSAccesses` then blanket-rewrites the dest pointer arg of `llvm.memset.p3.` / `llvm.memcpy.p3.` / `memmove` to point into the SwLDS metadata cell, while `translateLDSMemoryOperationsToGlobalMemory` never lowers the intrinsic itself. Net: memset/memcpy on an LDS pointer in an asan-instrumented kernel writes into the malloc-pointer slot and adjacent SwLDS metadata, never reaching the global backing store, and ASAN sees no access. Gated by `sanitize_address`.
m098-unify-exit-nodes-musttail-bitcast	❌	❌	❌	`AMDGPUUnifyDivergentExitNodes.cpp:256-261` skips musttail blocks by checking `RI->getPrevNode()` for a `CallInst`. The Verifier explicitly permits an optional bitcast between a `musttail` call and the `ret` (`Verifier.cpp:4283-4290`); when that bitcast is present, `getPrevNode()` returns the BitCastInst, the block is NOT skipped, the `ret` is replaced with `br UnifiedReturnBlock`, and the musttail invariant is destroyed. `opt` aborts with `musttail call must precede a ret with an optional bitcast`; `-disable-verify` produces silently-broken IR. The companion LIT test `do-not-unify-divergent-exit-nodes-with-musttail.ll` only covers the bitcast-LESS form. Default pipeline.
m099-tti-uniform-and-tid-divergent-mask	❌	❌	❌	`GCNTTIImpl::isAlwaysUniform` (`AMDGPUTargetTransformInfo.cpp:1219-1225`) matches `(workitem.id.x & Mask)` and returns `AlwaysUniform` when `Mask`'s `countMinTrailingZeros >= log2(wavefrontSize)`, without checking that `Mask` is uniform. A divergent `Mask = shl %div, log2(wave)` satisfies the trailing-zero check but the AND has divergent high bits. `AMDGPUUniformIntrinsicCombine` then deletes a `readlane(over-claimed_val, lane)`, so each lane stores its own value instead of the requested lane's. Runtime confirmed: lane 65 stores `0x40` (its own `val`) instead of `0x00` (lane 0's `val`). Same shape as m086 (target-node uniformity hook lying about an operand it never inspects).
m100-performfmacombine-fdot2-ignores-denormal-mode	❌	❌	❌	`performFMACombine` (`SIISelLowering.cpp:17729-17800`) folds the FMA-chain pattern `fma(fpext(ax),fpext(bx),fma(fpext(ay),fpext(by),z))` into `AMDGPUISD::FDOT2` whenever both FMAs carry `contract`. The in-source comment justifies the contract-only guard by claiming "fdot2_f32_f16 always flushes fp32 denormal regardless of mode" -- conflating two orthogonal properties. `contract` is a fusion permit; it does NOT license flushing denormals. A kernel compiled with `denormal-fp-math-f32="ieee,ieee"` silently switches from `v_fma_mix_f32` (mode-respecting) to `v_dot2c_f32_f16` (always FTZ), losing any intermediate subnormals the source asked to preserve. `dot10-insts` is gfx950 default.
m101-performaddcarry-rewrites-carryout	❌	❌	❌	`SITargetLowering::performAddCarrySubCarryCombine` (`SIISelLowering.cpp:17527-17533`) folds `UADDO_CARRY((add x,y), 0, cc) -> UADDO_CARRY(x, y, cc)` reusing `N->getVTList()`, so `CombineTo` rewires BOTH value and carry-out. The carry-outs are not equivalent: original `((x+y) mod 2^32 + cc) >= 2^32` vs folded `(x+y+cc) >= 2^32`. They differ whenever `x+y` wraps. No `hasOneUse` or `hasAnyUseOfValue(1)` guard. Symmetric `USUBO_CARRY((x-y), 0, cc)` borrow output has the same flaw. With `x=0xFFFFFFFF, y=1, cc=0`: IR carry=0, O2 produces 1.
m102-f64-flog-silent-undef	❌	❌	❌	`AMDGPUISelLowering.cpp:419-427` marks `{FEXP,FEXP2,FEXP10}` Custom for f64 (handled by `lowerFEXPF64`), but the LOG family is not. `FLOG/FLOG2/FLOG10` on f64 fall through generic Expand to a nonexistent libcall. llc prints `error: no libcall available for flog2`, exits 0, and emits a kernel that stores `{0, undef}` instead of the log value. The `LowerFLOG2`/`LowerFLOGCommon` helpers can't handle f64 either -- they emit `AMDGPUISD::LOG` which only has a `V_LOG_F32` selector. Affects gfx900/gfx950/gfx1100. Under `strictfp`, hard crash instead of silent miscompile.
m103-lowersdivrem-i64-int32min-narrowing	❌	❌	❌	`AMDGPUTargetLowering::LowerSDIVREM` (`AMDGPUISelLowering.cpp:2415-2430`) narrows i64 SDIVREM to i32 whenever both operands have `ComputeNumSignBits > 32`. That admits `LHS = sext(INT32_MIN)` (33 sign bits) and `RHS = sext(-1)`. The narrowed `sdiv i32 0x80000000, -1` is poison; lowering wraps to `0x80000000`, and the outer `SIGN_EXTEND` produces `-2^31` instead of the well-defined i64 result `+2^31`. Mirrored in `AMDGPUCodeGenPrepare::expandDivRem32` (`AMDGPUCodeGenPrepare.cpp:1219`), so `-O0` and `-O2` agree on the same wrong value (and the O0-vs-O2 check passes) unless InstCombine pre-folds the divisor. Fix: tighten gate to `> 33`.
m104-sdag-rcp-constant-denormal	❌	❌	❌	`AMDGPUTargetLowering::performRcpCombine` (`AMDGPUISelLowering.cpp:5549-5558`) folds `AMDGPUISD::RCP(C)` to `APFloat(1.0)/Val` without consulting the kernel's denormal mode. Literal in-source comment: `// XXX - Should this flush denormals?`. SDAG twin of m075 (output-denormal) + m077 (input-denormal): the InstCombine fixes only fire on direct `@llvm.amdgcn.rcp(C)`, but `fdiv afn 1.0, C` is rewritten by `lowerFastUnsafeFDIV` (`SIISelLowering.cpp:13117`) into `AMDGPUISD::RCP`, then hits this fold. `1.0 / 2^127` under `denormal-fp-math-f32="preserve-sign,preserve-sign"` produces subnormal `0x00400000` instead of HW's `+0.0`. Both O0 and O2 run the same combine.
m105-fptosisat-bf16-i64-clamps-at-i32	❌	❌	❌	`LowerFP_TO_INT_SAT` (`AMDGPUISelLowering.cpp:3979-3986`) groups `bf16` with `f16` in the "saturate at i32 then ext to i64" shortcut. Sound for f16 (max finite 65504 fits in i32) but wrong for bf16, which shares f32's 8-bit exponent. Values in `[INT32_MAX+1, INT64_MAX]` silently clamp to `INT32_MAX` instead of returning the correct in-range i64 or `INT64_MAX`. Symmetric `fptoui.sat` defect. For bf16 input `0x4f80` (= 2^32): expected `0x100000000`, observed `0x7fffffff`. Bug is in Custom legalization (runs at every -O), so both -O0 and -O2 are wrong.
m106-selectvop3mods-fsub-pzero-sign-of-zero	❌	❌	❌	`SelectVOP3ModsImpl` (`AMDGPUISelDAGToDAG.cpp:3415-3423`) folds `fsub C, x` into the VOP3 `NEG` source modifier whenever `LHS->isZero()`. `APFloat::isZero()` matches BOTH `+0.0` AND `-0.0`. Under IEEE 754, `fsub +0.0, x` is NOT equivalent to `fneg x` when `x = +0.0`: the former returns `+0.0`, the latter `-0.0`. No `nsz` gate. SDAG: `v_mul_f32_e64 v1, -v1, v2`; GISel: preserves the `fsub`. Sibling shape to m094 (`fmul.legacy` sign-of-zero) but at ISel layer. Fix: restrict to `LHS->isNegZero()` (since `-0 - x = -x` is unconditional).
m107-performfnegcombine-fmul-flips-nan-sign	❌	❌	❌	`AMDGPUTargetLowering::performFNegCombine` FMUL/FMUL_LEGACY/FADD/FMA arms (`AMDGPUISelLowering.cpp:5298-5318`) fold `fneg(fmul x,y) -> fmul(x, fneg y)` (and the FADD/FMA siblings) with no `nnan` guard. Under IEEE 754, `fneg(x)` flips the sign bit of every value including NaN. But on AMDGPU HW, `v_mul_f32(NaN, -y)` propagates the input NaN's sign bit unchanged -- the VOP3 NEG src-modifier on the other operand has no effect on the propagated NaN's sign. For `x=NaN, y=1.0`: O0 produces `0xFFC00000` (sign flipped), O2 produces `0x7FC00000` (sign preserved). Asm: O0 `v_sub -0, mul`; O2 `v_mul x, -y`.
m108-lowerkernattrs-grid-dims-folded-from-reqd-wgsize	❌	❌	❌	`AMDGPULowerKernelAttributes` (`AMDGPULowerKernelAttributes.cpp:121-142, 146-158, 271-274`) replaces the dispatch-time `hidden_grid_dims` load (at `implicitarg.ptr + 64`, COV5) with a constant derived from the kernel-static `!reqd_work_group_size` metadata. But `hidden_grid_dims` is the AQL dispatch packet's `setup.DIMENSIONS` field -- set by the runtime when the kernel is launched. OpenCL/HIP explicitly permit dispatching a kernel with `reqd_work_group_size(N,1,1)` as a 2-D or 3-D NDRange. `get_work_dim()` then returns the wrong value (always 1 instead of the actual runtime dispatch dim). Per `AMDGPUUsage.rst:5358`: "hidden_grid_dims = AQL dispatch packet dimensionality". Post-ROCm-7.2.3 upstream regression.
m109-sifixsgprcopies-s_mov_b64-truncates-imm	❌	❌	❌	`SIFixSGPRCopies::tryMoveVGPRConstToSGPR` (`SIFixSGPRCopies.cpp:887`) picks `MoveOp = MoveSize == 64 ? S_MOV_B64 : S_MOV_B32`. `S_MOV_B64` only encodes a 32-bit literal -- non-inline 64-bit immediates have their high 32 bits silently dropped at encoding time. Sibling helper `isSafeToFoldImmIntoCopy` at line 386 correctly uses `S_MOV_B64_IMM_PSEUDO`. For input `V_MOV_B64_PSEUDO 0x123456789ABCDEF0`, asm prints the full value but encoding `BE8001FF 9ABCDEF0` is one literal slot; disassembly shows `s_mov_b64 s[0:1], 0x9abcdef0`. High 32 bits `0x12345678` zero-extended away. Triggered when a uniform PHI takes a `V_MOV_B64_PSEUDO` non-inline imm. Reproduces on ROCm 7.2.3 -- not HEAD-only.
m110-performfnegcombine-fmed3-nan-asymmetry	❌	❌	❌	`AMDGPUTargetLowering::performFNegCombine` `AMDGPUISD::FMED3` arm (`AMDGPUISelLowering.cpp:5383-5401`) folds `fneg(fmed3 x,y,z) -> fmed3(-x,-y,-z)` with no `nnan` guard. But `v_med3_f32` treats NaN asymmetrically: NaN sorts as smaller-than-everything regardless of NaN's sign bit. So negating operands doesn't yield a sign-flipped median. For `med3(NaN, 1.0, 2.0) = 1.0`, fneg = `-1.0`; but `med3(-NaN, -1.0, -2.0) = -2.0`. O0: `0xBF800000`; O2: `0xC0000000`. Sibling shape to m107. Reproduces on ROCm 7.1.1. Fix: gate on `nnan`.
m111-vop3p-madfmamix-fsub-fpext-nan-sign	❌	❌	❌	VOP3P `MadFmaMixFP32Pats` TableGen pattern (`VOP3PInstructions.td:240-251`) matches canonical IR `fneg(fpext h)` (= `fsub -0.0, fpext(h)`) and lowers to `v_fma_mix_f32(h, -1.0, -0.0)`. The VOP3 NEG src-modifier on `-1.0` does NOT flip the sign of a NaN propagated from `h` -- HW preserves the input NaN's sign through `v_fma_mix`. Result at O0: `0x7FC00000` (sign NOT flipped, wrong per LangRef `fneg`). At O2 the `performFNegCombine` FP_EXTEND arm fires first and produces correct `v_cvt_f32_f16_e64 -h` (= `0xFFC00000`). HEAD-only regression at O0 (ROCm 7.1.1 does not have the racing TableGen pattern).
m114-promotekernargs-flat-to-global-unconditional	❌	❌	❌	`AMDGPUPromoteKernelArguments` (`AMDGPUPromoteKernelArguments.cpp:105-128`) unconditionally wraps every `FLAT_ADDRESS` kernel-arg pointer in `addrspacecast(addrspacecast(p to ptr addrspace(1)) to ptr)` so `InferAddressSpaces` converts downstream memops to `global_*`. NO check that the flat pointer is actually in the global aperture. A flat kernarg can legitimately carry LDS / private aperture pointers (host stuffs `addrspacecast(@LDS to ptr)`). Per LangRef, `addrspacecast` to a non-containing AS is poison; AMDGPU lowering strips the aperture base. Result: `global_store_dword` to garbage address instead of correct flat/LDS/scratch dispatch. Default pipeline at -O2. Reproduces on ROCm 7.1.1.
m115-fcanonicalize-v2f16-undef-lane-asymmetric	❌	❌	❌	`performFCanonicalizeCombine` v2f16 build_vector path (`SIISelLowering.cpp:15910-15915`) has a dead ternary: `if (isa<ConstantFPSDNode>(NewElts[1])) NewElts[0] = isa<ConstantFPSDNode>(NewElts[1]) ? NewElts[1] : DAG.getConstantFP(0.0, ...);` -- the false branch is unreachable. When the OTHER lane is a non-constant (e.g. fcanon(runtime)), NewElts[0] stays `undef` and the combined build_vector lets the low lane decay to raw register bits at O2. The symmetric branch at 15917-15921 is correctly written. O0 fcanonicalizes low half to 0, O2 leaves raw sNaN/denormal bits. Sibling shape to m086 (target hook over-claims).
m116-regforinlineasm-i64-named-single-vgpr-truncate	❌	❌	❌	`getRegForInlineAsmConstraint` (`SIISelLowering.cpp:19164-19211`) silently accepts `={v0}` for an `i64` result type and binds it to a single 32-bit VGPR. The width check at 19175-19183 only runs when `NumRegs > 1`; the `NumRegs == 1` path at 19204-19208 only checks `VT.isVector()`, not scalar bit-width. Codegen synthesises the upper half from thin air (`v_mov_b32_e32 v1, 0` before the asm), so a 64-bit-writing asm reads back as `{v0=asm_value, v1=0}`. The range form `{v[0:0]}` is correctly rejected, so the two spellings of the same single register are handled asymmetrically. Reproduces on ROCm 7.1.1.
m117-attributor-waves-per-eu-max-of-lower	❌	❌	❌	`AAAMDWavesPerEU::updateImpl` (`AMDGPUAttributor.cpp:1165-1172`) uses `std::max` for BOTH endpoints of the waves-per-eu range when computing the union over callers. Lower bound should be `std::min`. A callee shared by kernels with `[1,1]` and `[8,8]` becomes `[8,8]` -- the tightest kernel's register budget is imposed on the callee even when invoked from the relaxed kernel, forcing spills. Also overwrites state non-monotonically and uses a wrong type-comparison for fixpoint detection. Acknowledged but unaddressed upstream as `min-waves-per-eu-not-respected.ll`.
m118-iscanonicalized-frexpmant-rcp-sqrt-overclaim	❌	❌	❌	`SITargetLowering::isCanonicalized` (`SIISelLowering.cpp:15670-15683` + `15541-15559`) lists ~12 AMDGPU target intrinsics/ISD nodes as unconditionally canonical with no input check: `amdgcn.frexp_mant`, `rcp`, `rsq`, `sqrt`, `exp2`, `log`, `trig_preop`, `cubeid`, `cvt_pkrtz`, `fdot2`, `rcp_legacy`, `rsq_legacy`, `rsq_clamp`. HW `v_frexp_mant_f32` etc. propagate NaN payload unchanged (sNaN in -> sNaN out). The `fcanonicalize_canonicalized` PatFrag (`SIInstrInfo.td:1013`) lowers to COPY, so `canonicalize` of any of these is elided. Codebase contradicts itself: `known-never-snan.ll:566` (`v_test_NOT_known_frexp_mant_input_fmed3_r_i_i_f32`) explicitly asserts frexp_mant output is NOT known-never-sNaN.
m119-sssid-merge-drops-stronger-scope	❌	❌	❌	`SIMemOpAccess::constructFromMIWithMMO` (`SIMemoryLegalizer.cpp:836`) merges SSIDs from multi-MMO atomics by checking `isSyncScopeInclusion(A,B)`; if neither subsumes the other (e.g. `agent-one-as` vs `workgroup`), the optional returns `false` and the code overwrites `SSID := B`, silently dropping the agent-level scope. Asm divergence verified: `[agent-one-as, workgroup]` MMO order emits no `BUFFER_WBL2 16 / BUFFER_INV 16`; reversed order emits both. Order-dependent codegen on multi-MMO atomics (created by `cloneMergedMemRefs` etc.). Should compute LUB instead.
m120-performfmulcombine-fneg-lhs-flips-nan-sign	❌	❌	❌	`performFMulCombine` (`SIISelLowering.cpp:17719-17721`) folds `fmul x, (select y, -A, -B)` -> `ldexp(fneg x, ...)` with no `nnan` guard. HW `v_mul_f64(NaN, -K)` preserves input NaN's sign; HW `v_ldexp_f64(-x, k)` lowers FNEG into the VOP3 NEG src-modifier which XORs the sign bit before ldexp sees the operand, flipping the NaN sign. For x=+qNaN, y=true: O0 stores `0x7FF8000000000000`, O2 stores `0xFFF8000000000000`. Inverse of m107 (m107 preserved sign; m120 flips it). Affects f64 always and f32/f16 in divergent contexts.
m113-preloadkernargs-explicit-budget-omits-baseoffset	❌	❌	❌	`AMDGPUPreloadKernelArguments` (`AMDGPUPreloadKernelArguments.cpp:181-183, 295-329`) computes the explicit-arg preload budget using only the raw `ExplicitArgOffset`, omitting `BaseOffset = ST.getExplicitKernelArgOffset()` (= 36 on non-AMDHSA/PAL/Mesa3D triples). On `amdgcn--` triple, the pass over-marks explicit args as `inreg`, then SIISelLowering's `allocatePreloadKernArgSGPRs` bails partway through and `AMDGPULowerKernelArguments` skips lowering all `inreg` args -- dropped args read garbage. Verified asm divergence: `amdgcn--` emits second load at offset `0x44` (correctly excluding baseoffset); `amdgcn-amd-amdhsa` at `0x20`. Companion of m088 (kernarg widening).
m112-printfbinding-pct-s-strlen-mod4-zero-offset	❌	❌	❌	`AMDGPUPrintfRuntimeBinding` (`AMDGPUPrintfRuntimeBinding.cpp:218-220, 357-389`) sizes the `%s` printf slot as `alignTo(strlen(s)+1, 4)` (reserves room for NUL), but the store loop only writes `strlen(s)` bytes and the next-arg GEP advances by `getTypeAllocSize` of the actual stored value, not the metadata-recorded size. For any string with `strlen % 4 == 0`, the metadata-recorded offset of arg N+1 is +4 ahead of the IR-emitted offset. Runtime reads `%d` arg from uninitialized buffer bytes. Repro: `printf("%s %d", "abcd", x)`.
m121-lowerkernattrs-udiv-blockcount-drops-volatile-load	❌	❌	❌	`AMDGPULowerKernelAttributes` (`AMDGPULowerKernelAttributes.cpp:421-444`) "upgrade old block-size calc" rewrite matches `m_UDiv(m_ZExtOrSelf(m_Load(m_GEP(dispatch_ptr, GRID_SIZE_X+I4))), m_Value())` without `isSimple()` on `m_Load`. Volatile and atomic loads match; the pass RAUW's the UDiv with a freshly-built non-volatile load of `HIDDEN_BLOCK_COUNT_X` from a different memory location, silently discarding the value the user's volatile/atomic load produced. The original load survives as a dead side-effecting op, masking the loss. All other load matches in the same pass (line 205) correctly require `isSimple()`. Sibling of m108.
m122-lowerfdiv64-ignores-denormal-mode	❌	❌	❌	`SITargetLowering::LowerFDIV64` (`SIISelLowering.cpp:13471-13538`) lowers f64 fdiv through a NR refinement chain (`DIV_SCALE / FMA*4 / FMUL / DIV_FMAS / DIV_FIXUP`) with no denormal-mode toggle. Sibling `LowerFDIV32` correctly wraps its NR chain in `S_SETREG_B32 / DENORM_MODE` writes (lines 13364-13416 / 13436-13462) to force IEEE denormals around the FMA chain, saving and restoring under `denormal-fp-math-f32="preserve-sign,..."`. Under `denormal-fp-math="preserve-sign,preserve-sign"` the f64 `v_fma_f64` chain runs with FTZ; near-denormal intermediates get flushed and NR converges to the wrong value for divisors near `2^-1022`. Same lowering at -O0 and -O2 (Custom legalization, not a combine). Sibling of m075/m077/m104 at the f64 fdiv layer.
m123-lowerfastunsafefdiv64-nan-on-zero-divisor	❌	❌	❌	`lowerFastUnsafeFDIV64` (`SIISelLowering.cpp:13140-13178`) lowers `fdiv afn double X, Y` to RCP + 4-step NR refinement. For runtime `Y=0`: `RCP(0) = +/-Inf`; `FMA(-0, +Inf, 1.0) = NaN + 1.0 = NaN`; the entire NR chain returns NaN. IEEE / AMDGCN-RCP say `X / +0 = sign(X) * Inf`. LangRef `afn` allows imprecise approximation but does NOT permit `+Inf -> NaN` (that would need `ninf+nnan`). f32 fast path uses simple `X * RCP(Y)` and is safe. Same buggy asm at -O0 and -O2 (Custom legalization). Sibling of m075/m077/m104/m122.
m124-fcanonicalize-v2f16-both-undef-decays-to-zero	❌	❌	❌	`performFCanonicalizeCombine` v2f16 BUILD_VECTOR path (`SIISelLowering.cpp:15885-15924`) decays `fcanonicalize(<2 x half> undef)` to `<0.0, 0.0>` instead of `<qNaN, qNaN>`. LangRef says `canonicalize(undef)` should yield a quiet NaN. The correct scalar undef arm at line 15868 returns qNaN, but SDAG lowers `<2 x half> undef` as `BUILD_VECTOR undef, undef`, bypassing it. The lane-1 fixup correctly falls back to `0.0`, then the lane-0 ternary (m115's dead-branch shape) sees NewElts[1]=ConstantFPSDNode(0.0) and splats. v4f16 type-legalizes to scalar arms and is correct, demonstrating the v2f16 asymmetry. Distinct from m115 (lane-0 undef + lane-1 runtime canonicalize).
m125-printfbinding-half-sext-corrupts-fp-bits	❌	❌	❌	`AMDGPUPrintfRuntimeBinding` (`AMDGPUPrintfRuntimeBinding.cpp:191-203`) widens scalar `half`/`bfloat` printf args via `bitcast (half) -> i16 -> sext -> i32`. `sext` is the wrong widener: for negative half values (sign bit set), the top 16 bits flip from `0x0000` to `0xFFFF`, corrupting the FP bit pattern. `half -2.0` (`0xC000`) becomes `0xFFFFC000` instead of `0x0000C000`; runtime `%f` reads garbage. Companion to m112 (`%s` size off-by-4 in same pass). Fix: replace `CreateSExt` with `CreateZExt` (or, better, emit `fpext half -> float` for `%f` args).
m126-lowerfsqrt-ignores-denormal-mode	❌	❌	❌	`lowerFSQRTF32` (`SIISelLowering.cpp:13682-13770`) and `lowerFSQRTF64` (`13772-13862`) emit NR refinement chains with `v_fma_f32`/`v_fma_f64` and no `AMDGPUISD::DENORM_MODE` toggle. Sibling `LowerFDIV32` (`13334-13468`) wraps its NR chain in `S_SETREG_B32 / DENORM_MODE` writes (lines 13379-13416 / 13436-13462) to force IEEE denormals; sqrt does not. Under `denormal-fp-math="preserve-sign,..."` the FMA NR intermediates flush and the result diverges from IEEE. f32 ELSE branch (`SqrtE = 0.5 - SqrtH*SqrtS`) and f64 Goldschmidt chain both affected. Also missing `Flags.setNoFPExcept(true)`. Sibling of m122 at the sqrt layer.
m127-performfsubcombine-fadd-folds-flip-nan-sign	❌	❌	❌	`performFSubCombine` (`SIISelLowering.cpp:17579-17624`) two arms: Arm 1 `(fsub (fadd a,a), c) -> fma(a, 2.0, fneg(c))` and Arm 2 `(fsub c, (fadd a,a)) -> fma(a, -2.0, c)`. Both fire on `contract`-flagged inputs without an `nnan` guard. HW `v_sub_f32 c, sum` flips the propagated NaN's sign via implicit NEG-on-b; HW `v_fma_f32 a, 2.0, -c` (NEG src-modifier on c) propagates NaN sign UNCHANGED. For `a=1.0, c=+qNaN`: O0 stores `0xFFC00000`, O2 stores `0x7FC00000`. Sibling of m107 (FMul) and m120 (FMul fneg-LHS). Reproduces on ROCm 7.1.1.
m128-performfmacombine-fdot2-flips-nan-sign	❌	❌	❌	`performFMACombine` FDOT2 fold (`SIISelLowering.cpp:17729-17800`) gates only on `contract` (and `dot10-insts`) -- no `nnan` guard. Source IR lowers to two `v_fma_mix_f32` (preserves input NaN sign); folded form is one `v_dot2c_f32_f16` which unconditionally sets the sign bit of any NaN output regardless of input NaN sign. For `a=<+qNaN,1>, b=<1,1>, z=0`: two-FMA path stores `0x7FC00000`, FDOT2 stores `0xFFC00000`. Bug also fires via z-only NaN propagation. Distinct from m100 (same fold, denormal). Reproduces on ROCm 7.1.1.
m130-libcalls-powr-negative-base-folds-finite	❌	❌	❌	`AMDGPULibCalls::fold_pow` constant-exponent shortcuts (`AMDGPULibCalls.cpp:900-1005`) don't check `FInfo.getId()`, firing for `EI_POWR` / `EI_POWR_FAST` too. OpenCL `powr(x<0, y) = NaN` (base must be >= 0), `powr(NaN/+0/-0, 0) = NaN`. Fold instead produces `x*x` / `1.0/x` / `1.0` / `sqrt(x)`. Repro: `powr(-2.0, 2.0)` returns `4.0` (`0x40800000`) instead of NaN. Sibling of m093 but for `powr` semantics. Reproduces on ROCm 7.1.1.
m131-simplifydemandedbits-set-inactive-divergent-witness	❌	❌	❌	`AMDGPUTargetLowering::SimplifyDemandedBitsForTargetNode` (`AMDGPUISelLowering.cpp:5841`) treats `amdgcn_set_inactive` as 1-source, propagating Known from operand(1) only. The intrinsic signature is `set_inactive(active, inactive)`; Known should be the intersection of both. Witness: divergent EXEC + `set_inactive(active=tid&0xFF, inactive=0xFFFF0000)` + `readlane` to expose inactive-lane bits + `lshr 16` -- buggy fold collapses to `0` (over-promised "high bits zero"); operand-swap reference correctly stores `0xFFFF`. Verified on gfx1100 (gfx950 ICEs in Branch Relaxation on the divergent shape -- separate bug).
m132-codegenprepare-vector-sdiv-int32min-narrowing	❌	❌	❌	`AMDGPUCodeGenPrepare` vector scalarizer (`AMDGPUCodeGenPrepare.cpp:1488-1520`) extracts each lane of a vector sdiv/srem and calls `shrinkDivRem64` per element, composing the m103 i64 `INT32_MIN/-1` narrowing per lane. v2i64 sdiv with lane 0 = `sext(INT32_MIN)`, lane 1 = `sext(100)`, divisor = literal splat `<i64 -1, i64 -1>`: O2 InstCombine folds `sdiv x, splat(-1) -> 0 - x` (correct); O0 takes the buggy per-lane narrowing. Clean O0/O2 mismatch on lane-0's high32: 0xFFFFFFFF vs 0x00000000. Same fix as m103 (tighten `>32` to `>33`) but must apply at both scalar AND vector entry points.
m133-getcanonicalconstantfp-drops-nan-payload	❌	❌	❌	`getCanonicalConstantFP` (`SIISelLowering.cpp:15820-15854`) drops SNaN and non-default-QNaN payload, returning the default-payload canonical QNaN. SNaN-quietening at line 15841 has an in-source FIXME: "Is this supposed to preserve payload bits?". AMDGPU HW `v_max_f32(SNaN, SNaN)` quiets by setting bit 22 only and preserves the rest of the payload (and QNaN payload is preserved entirely). For input SNaN `0x7F8A5A5A` (payload `0x0a5a5a`): O0 stores `0x7FCA5A5A` (HW), O2 stores `0x7FC00000` (constant fold). Payload divergence is observable via bit-pattern inspection. Same defect applies to QNaN with non-default payload, and to v2f16/v2bf16 vector paths. Reproduces on ROCm 7.1.1.
m134-amdgpuisel-bitop3-stale-slot-rhs-not-reset	✅	❌	❌	`v_bitop3_b32` selector stale-slot patch (`AMDGPUISelDAGToDAG.cpp:4413-4450`) only resets `LHSBits = LHSBitsOrig` and `NumOpcodes = 0` when the recursive RHS mutated src slots; the parallel `RHSBits = RHSBitsOrig` reset is missing. Result: the returned `LHSBitsOrig OP RHSBits_recursed` truth-table encodes inconsistent slot semantics. Reduced repro (6 IR ops): `r = v2 ^ (~v2
m135-libcalls-rootn-2-folds-to-sqrt-without-fmf	✅	❌	❌	`AMDGPULibCalls::fold_rootn` (`AMDGPULibCalls.cpp:1171-1189` for `rootn(x, 2)`; `1209-1235` for `rootn(x, -2)`) rewrites to `Intrinsic::sqrt` / `Intrinsic::rsqrt` without any FMF gate. Per OpenCL spec for even `n`: `rootn(-0.0, 2) = +0.0` (sign dropped); `rootn(-Inf, 2) = NaN`; `rootn(-0.0, -2) = +Inf`; `rootn(-Inf, -2) = +0.0`. `llvm.sqrt` follows IEEE: `sqrt(-0) = -0`, `sqrt(-Inf) = NaN with sign`. O0/O2 mismatch verified: input `-0.0` gives O0 `+0.0` vs O2 `-0.0`; input `-Inf` gives O0 `+qNaN` vs O2 `-qNaN`. Unlike m093 the fold uses Intrinsic::sqrt directly so doesn't need a module-visible `_Z4sqrtf` body to fire.
m136-fatptr-seqcst-atomic-loses-scope-bits	❌	❌	❌	addrspace(7) seq_cst atomicrmw/cmpxchg/load/store lowering: `SIISelLowering::getTgtMemIntrinsic` (`SIISelLowering.cpp:1473-1475`, atomic-buffer branch) sets only `MOVolatile` on the MMO -- no atomic ordering, no SSID. `AMDGPULowerBufferFatPointers.cpp:1656-1680` lowers seq_cst to fence-pair + non-atomic intrinsic call. `SIMemoryLegalizer` then sees `getSuccessOrdering() == NotAtomic`, skips `toSIAtomicScope`, never calls `enableRMWCacheBypass` -- SC0/SC1 scope bits at gfx940/950 RMW lines 1137-1154 are never set. Asm divergence: `addrspace(7)` emits `buffer_atomic_add` with NO SC bits (default/wavefront scope), while equivalent `addrspace(1)` emits `global_atomic_add ... sc1` (system scope). Cross-agent contention can lose updates. `rmw_sys` and `rmw_agent` emit byte-identical asm -- scope is lost entirely.
m137-lowerf64tof16safe-drops-nan-payload	❌	❌	❌	`LowerF64ToF16Safe` (`AMDGPUISelLowering.cpp:3787-3873, 3824-3827`) handles f64 NaN/Inf by collapsing to `0x7c00 \| 0x200` (canonical qNaN), dropping ALL payload bits. The HW chain f64→f32→f16 (`v_cvt_f32_f64` + `v_cvt_f16_f32`) preserves the top 9 bits of payload. Same kernel with `fptrunc double->half` direct vs `fptrunc double->float; fptrunc float->half` chain produces different NaN payloads (`0x7e00` vs `0x7e6f`). Toggling `afn` flag flips between the two paths -- LangRef `afn` doesn't license changing NaN payload preservation. Sibling of m133 (constfold drops payload).
m138-bitop3-selector-revert-missing-return	✅	❌	❌	`v_bitop3_b32` selector stale-slot revert at `AMDGPUISelDAGToDAG.cpp:4446-4451` sets `NumOpcodes = 0` and restores `LHSBits`/`Src` but DOES NOT RETURN. Control falls through to TTbl computation, returning `(1, LHSBitsOrig OP RHSBits_recursed) = (1, garbage)`. Random IR fuzzing found 5 distinct O0/O2 miscompiles in ~500 random 5-8-op bitwise chains, each emitting a different wrong truth-table immediate (`0x22`/`0x40`/`0x33`/`0x7e`/`0x3c`). Cleanest: always-zero chain `(c & v1) & (c & ~v1) = 0` emits `v_bitop3_b32 ... bitop3:0x40` computing `a & b & ~c = 0xC888A886` (nonzero). m134 covered one symptom (RHSBits reset); m138 is the structural root cause -- early-return is needed, plus `findSlot` polarity-blindness and missing `RHSBitsOrig` snapshot are adjacent contributing defects.
m139-performfnegcombine-fma-flips-nan-sign	❌	❌	❌	`performFNegCombine` FMA arm (`AMDGPUISelLowering.cpp:5319-5347`) folds `(fneg (fma x, y, z)) -> (fma x, -y, -z)` gated only on `nsz`. For NaN inputs, `fneg` must flip the sign bit precisely; the substituted `fma(x, -y, -z)` selects a NaN payload from a different operand chain. v2f16 repro (`in0=0xfe00fc00, in2=0x7c007c00`): O0 emits `v_pk_fma_f16 + v_xor` -> `0x7e007c00`; O2 emits `v_pk_fma_f16 ... neg_lo:[0,1,1] neg_hi:[0,1,1]` -> `0xfe007c00` (top-half NaN sign kept negative). Sibling of m107/m110/m111/m120/m127/m128 -- all need `nnan` not just `nsz`.
m140-performfnegcombine-fadd-flips-nan-sign	❌	❌	❌	`performFNegCombine` FADD arm (`AMDGPUISelLowering.cpp:5273-5297`) folds `(fneg (fadd x, y)) -> (fadd -x, -y)` gated only on `nsz`. For NaN-producing fadds (e.g. `+Inf + -Inf`, or NaN input), the substitution changes which operand's NaN sign survives. v2f16 repro (`in0=0xfe00fc00, in1=0x7c007c00`): O0 emits `v_pk_add_f16 + v_xor 0x80008000` -> `0x7e007e00`; O2 emits `v_pk_add_f16 ... neg_lo:[1,1] neg_hi:[1,1]` -> `0x7e00fe00` (bottom-half NaN sign kept negative -- m139 sibling on FADD arm).
m141-iscanonicalized-bitcast-loses-fp-type	❌	❌	❌	`SITargetLowering::isCanonicalized` recurses through `ISD::BITCAST` (`SIISelLowering.cpp:15649-15653`) WITHOUT consulting source/dest FP semantics. In-source TODO acknowledges the bug. v2bf16 (8-bit exp) and v2f16 (5-bit exp) have different denormal ranges, so the same 16-bit pattern can be normal in one type and denormal in the other -- and a NaN-payload bit-pattern in one type may not be canonical-NaN in the other. The function gates `is_canonicalized_1/2` PatFrags (`AMDGPUInstructions.td:189,207`; `SIInstrInfo.td:1017,1025`) which decide whether `V_PACK_B32_F16` / `min/max` selection may omit the explicit canonicalize. Effect: O2 drops the canonicalize and lets a denormal/sNaN through where O0 emits `v_pk_max_f16 v, v` that FTZs/quiets. Sibling of m118 (same function, different arm), m115/m124, m133.
m142-image-d16-bf16-skipped	❌	❌	❌	`SITargetLowering::lowerImage` D16 detection at `SIISelLowering.cpp:10190` (store) / `10203` (load) / `11926` (handleD16VData) / `12056/12084/12120/12170` (TBUFFER) uses `getScalarType() == MVT::f16` only. bf16 is a distinct MVT, so `<N x bfloat>` image data/result silently skips `handleD16VData`, selects the non-D16 MIMG opcode, computes wrong NumVDataDwords, and bypasses the `HasD16` guard. GISel correctly handles bf16 via `getScalarType() == S16` (`AMDGPULegalizerInfo.cpp:7182`) -- SDAG-only miscompile. Same IR, same bits, different image opcode depending on `-global-isel`. No upstream lit test covers bf16 image.
m143-strict-fp-round-f64-bf16-drops-chain	❌	❌	❌	`SITargetLowering::lowerFP_ROUND` (`SIISelLowering.cpp:8604-8613`, f64->bf16 path) lacks the strictfp-bail guard that lives on the f64->f16 path (line 8585). For `ISD::STRICT_FP_ROUND` with src f64 and dst bf16, it calls `expandRoundInexactToOdd` (non-strict graph) then emits non-strict `ISD::FP_ROUND` -- silently dropping the strict chain (operand 0), losing the second result (chain), and dropping exception semantics from the inexact-to-odd expansion. Downstream side-effecting ops can reorder past the strict round. Sibling to m137 (lowering drops payload) and m141 (bf16 family).
m161-verifyinstruction-atomic-av-class	❌	❌	❌	`SIInstrInfo::verifyInstruction` atomic vdst/vdata file-match check at `SIInstrInfo.cpp:5857-5869` uses `RI.isAGPR(X) != RI.isAGPR(Y)`. `isAGPR = hasAGPRs && !hasVGPRs` returns false for AV_* classes (have both bits). On gfx90A+ where `getLargestLegalSuperClass` widens VReg/AReg pairs to AV, AV-class virtuals are the norm. Two failure modes: (1) AV vdst + AGPR vdata -> `false != true` spurious reject; (2) AV vdst + VGPR vdata -> `false == false` silently accepted but allocator may split vdst across A-half while vdata stays VGPR, corrupting atomic encoding. Sibling family m149/m152/m153 -- all isAGPR/isVGPRClass-blindness defects in gfx950 AV-class handling.
m160-shrinkdivrem-24bit-narrowing-int24-min-neg-one	❌	❌	❌	`shrinkDivRem64` 24-bit narrowing path (`AMDGPUCodeGenPrepare.cpp:1354-1361` + `expandDivRem24Impl` at `:1155-1162`) computes `INT24_MIN / -1 = +2^23 = 0x00800000` correctly, then sign-extends from 24-bit via `SHL 8; AShr 8` which inverts to `0xFF800000` (= -2^23). Same gate reachable from i32 sdiv when LHS has >=9 sign bits + RHS=-1 (getDivNumBits returns 24). Sibling of m103 (32-bit), m132 (vector). Full family covers `INT_MIN_at_N / -1` for N ∈ {24, 32}. Reproducer: `sdiv i32 -8388608, -1` stores `0xFF800000` instead of `0x00800000`.
m159-fcanon-v2bf16-undef-lane	❌	❌	❌	`performFCanonicalizeCombine` BUILD_VECTOR per-lane undef-fixup at `SIISelLowering.cpp:15885` is guarded by `VT == MVT::v2f16` -- v2bf16 missing. On gfx950, v2bf16 fcanonicalize is Legal but the combine falls through -- O0 reads garbage bits for undef lane via `v_max_f32_e64`; O2 splats a constant qNaN. Example: input `0x00007fc0` (lane0 undef, lane1 bf16 qNaN) -> O0 stores `0x7fc00000`, O2 stores `0x7fc07fc0`. Found by random fcanon-chain fuzz: 29/500 mismatches all of this shape. Direct sibling of m115/m124 for v2f16 -- same defect, different element type. Fix: extend guard to include v2bf16.
m158-lowerfcopysign-v2f16-trunc-drops-sign	❌	❌	❌	`SITargetLowering::lowerFCOPYSIGN` v2f16/v2bf16 mag + wider sign path (`SIISelLowering.cpp:8817-8823`) bitcasts sign to `v2i32` then `TRUNCATE`s to `v2i16`. TRUNCATE takes low 16 bits of each i32 -- drops bit 31 (the f32 sign bit), substitutes mantissa bit 15. Subsequent `FCOPYSIGN(v2f16, v2f16)` reads bit 15 of the truncated value -> the produced half carries random sign instead of the input f32's sign. Fix: SRL by 16 before TRUNCATE, or extract high half. Reachable when sign comes from a v2f32 producer not wrapped in fpround (performFCopySignCombine only peeks through FP_ROUND).
m157-lower-fminmax-num-v32f16-missing	❌	❌	❌	`SIISelLowering.cpp:826-829` sets `ISD::FMINIMUMNUM`/`FMAXIMUMNUM` Custom for `{v4f16, v8f16, v16f16, v32f16}`. `lowerFMINIMUMNUM_FMAXIMUMNUM` (`SIISelLowering.cpp:8637-8651`) handles only `{v4f16, v8f16, v16f16, v16bf16}` -- v32f16 missing, v16bf16 is dead (not Custom anywhere). In non-IEEE mode, v32f16 falls through to `return Op;` -- legalizer treats as "no change" and either infinite-loops or asserts. Reachable via `llvm.minimumnum.v32f16`/`llvm.maximumnum.v32f16` on gfx950. Fix: add v32f16 to splitter, drop dead v16bf16; or set v32f16 Expand matching FMINNUM/FMAXNUM at line 831.
m156-hasnon16bitaccesses-copypaste-tempotherop	❌	❌	❌	`SIISelLowering.cpp:14923-14924` (`hasNon16BitAccesses`): `OpIs16Bit` check uses `TempOtherOp.getValueSizeInBits() == 16` instead of `TempOp.getValueSizeInBits() == 16`. The symmetric `OtherOpIs16Bit` clause two lines down correctly uses `TempOtherOp` on both sides -- this is a clear copy-paste defect. Called from `performOrCombine` -> `matchPERM` (`SIISelLowering.cpp:15070`). When Op is 32-bit and OtherOp is 16-bit, `OpIs16Bit` becomes spuriously true; combine concludes "both 16-bit" and skips v_perm -- can drop upper 16 bits of Op on i16->i32 zext / v2i16->v2i32 lane patterns under an or-tree. Same defect in ROCm fork at `:14978-14979`.
m155-sched-barrier-ds-aggregate-ldsdma	❌	❌	❌	`amdgcn.sched.barrier(0x800)` (LDSDMA-allow) still blocks LDSDMA via the DS aggregate bit. `invertSchedBarrierMask` (`AMDGPUIGroupLP.cpp:2678-2685`) DS clause does NOT check the LDSDMA bit; only DS_READ/DS_WRITE imply clearing DS. For input 2048 the inverted mask is 2031 (`0b011111101111`) -- DS (0x80) stays set. `canAddMI` DS branch (`AMDGPUIGroupLP.cpp:2482-2484`) matches LDSDMA via `isLDSDMA(MI)` on the DS bit, pinning the op. Direct sibling of m144 (which fixed the VMEM aggregate route); both must be fixed for the LDSDMA-allow semantic to work. Fix: add `(InvertedMask & LDSDMA) == NONE` to the DS clause symmetric to lines 2672-2676.
m154-performfmulcombine-ldexp-flips-nan-sign	❌	❌	❌	`performFMulCombine` ldexp arm (`SIISelLowering.cpp:17684-17722`) folds `fmul x, (select c, -A, -B)` -> `ldexp(fneg x, select c, log2\|A\|, log2\|B\|)` gated only on `TrueNode->isNegative() == FalseNode->isNegative()` + exact-log2 -- NO `nnan` check. For x=NaN, original `v_mul_f32(NaN, -K)` propagates input NaN sign; rewrite `v_ldexp_f32(-NaN, exp)` via VOP3 NEG src-modifier XORs sign before ldexp passthrough -> output NaN sign flipped. Applies to f16/f32/f64. Asm verified: `v_ldexp_f32 v0, -v0, v1`. Sibling family of m107/m110/m111/m120/m127/m128/m139/m140 -- all need `nnan` not just `nsz`/no-FMF.
m153-wholewave-func-prologue-exec-xor-not-mov	❌	❌	❌	`SIFrameLowering.cpp:1041-1048` WholeWaveFunction prologue path: when no WWM scratch / CSR spills exist, calls `buildScratchExecCopy(..., EnableInactiveLanes=true)` which emits `S_XOR_SAVEEXEC_B{32,64} tmp, -1`. Per ISA: `tmp = EXEC; EXEC = -1 XOR EXEC = ~EXEC`. The in-source comment claims "set EXEC to -1 here," but the result is `EXEC = ~entryEXEC` -- bit-inverted entry mask, not all-ones. Triggered by trivial `amdgpu_gfx_whole_wave` function with no `strict.wwm` MFMA chains. Body executes with previously-active lanes inactive and vice versa -- whole-wave semantic violated end-to-end. Fix: call `EnableAllLanes()` (S_MOV_B64 EXEC, -1) or pass `EnableInactiveLanes=false` (S_OR_SAVEEXEC reaches EXEC=-1 from any entry mask).
m152-getdestequivalentvgpr-strips-av-class	❌	❌	❌	`SIInstrInfo::getDestEquivalentVGPRClass` (`SIInstrInfo.cpp:9684-9688`, SrcRC-not-AGPR branch) early-exits with `RI.isVGPRClass(NewDstRC)`. `isVGPRClass = hasVGPRs && !hasAGPRs` is false for AV_* classes (have AGPR bits). On gfx90A+ (gfx950), `getLargestLegalSuperClass` promotes VReg_/AReg_ to AV_, so AV-class virtregs are the norm for COPY/PHI/REG_SEQUENCE/INSERT_SUBREG dests. The early bail-out is skipped and dest is replaced by `getEquivalentVGPRClass(AV_xx)` -- a strict-VGPR class. `moveToVALU` then drops AGPR legality; subsequent MFMA/V_ACCVGPR_ uses see class mismatch -> extra V_ACCVGPR_READ/WRITE moves or illegal-copy verifier failures. Same isVGPRClass-blindness as m149.
m151-cs-chain-undefined-flags-silent-fallthrough	❌	❌	❌	`SITargetLowering::LowerCall` (`SIISelLowering.cpp:4248-4266`) dispatches on `llvm.amdgcn.cs.chain` Flags immarg with two recognized cases (`isZero()` -> error if extra args; `isOneBitSet(0)` -> DVGPR path with 3 fallback args). There is NO else clause. Any Flags value that is neither 0 nor 1 (e.g. 2, 3, 5) falls through silently: `UsesDynamicVGPRs` stays false, `ChainCallSpecialArgs` only contains `exec`, `NumVGPRs/FallbackExec/FallbackCallee` are never pushed. Opcode picked is non-DVGPR `SI_CS_CHAIN_TC_W32/W64`; trailing IR-level variadic args are dropped from the lowered call without diagnostic. IR Verifier (`Verifier.cpp:7061-7092`) does not range-check the immarg. Sibling of m145 (chain-call ExternalSymbol drops target flag).
m150-s-barrier-init-m0-mask-discarded	❌	❌	❌	`SIISelLowering.cpp:12450-12459` (lowering of `amdgcn.s.barrier.init` / `amdgcn.s.barrier.signal.var`) computes `M0Val = (CntOp & 0x3F)` then IMMEDIATELY overwrites with `M0Val = (CntOp << 16)` using the UNMASKED `CntOp`. The mask is dead code; the intended `(CntOp & 0x3F) << 16` expression is never produced. If member-count has bits >=6 set (e.g. 0x40, 0x80), those bits land in M0[27:22], corrupting unrelated fields when the HW decodes M0. Adjacent (not filed): `SOPInstructions.td:507-573` -- barrier pseudos lack `mayLoad/mayStore`, combined with `IntrNoMem` on intrinsics, MachineScheduler can reorder memops across `s_barrier`.
m149-preallocate-wwm-skips-av-class-gfx950	❌	❌	❌	`SIPreAllocateWWMRegs.cpp:102` (`processDef`) early-exits when `TRI->isVGPR(MRI, Reg)` is false. `isVGPR` calls `isVGPRClass`, which is true only when a regclass has VGPRs AND no AGPRs. On gfx90A+ (gfx950 included), MAI-capable register classes are unified AV_ super-classes (`hasVGPRs && hasAGPRs`), so `isVGPRClass == false` and the WWM pre-allocator silently drops the virtreg. The physreg is never added to `WWMReservedRegs`, so VGPRAllocator may reuse it across `EXIT_STRICT_WWM` -- post-EXIT writes under restored EXEC corrupt inactive-lane data WWM was supposed to preserve. Same defect in `SILowerWWMCopies.cpp:135` (`addToWWMSpills`). Reachable via `amdgcn.strict.wwm` on MFMA results.
m148-v2f64-to-v2f16-gcnpat-double-rounds	❌	❌	❌	`VOP3Instructions.td:1461-1463` GCNPat for `(v2f16 (fpround v2f64:$src))` emits `V_CVT_F32_F64` + `V_CVT_PK_F16_F32` -- classic IEEE double-rounding (f64->f32->f16). The scalar `fptrunc double to half` path (`SIISelLowering.cpp:8599` -> `LowerF64ToF16Safe`, `AMDGPUISelLowering.cpp:3787-3873`) emulates single-step rounding via inexact-to-odd correction. Same IR, same gfx950: scalar vs lane-0-of-vector fpround produce different f16 bits for f64 values near half-way f16 boundaries. Also bypasses the m137 NaN-payload preservation logic. `lowerFP_ROUND` at line 8573 bails on non-f32 source for v2f16 dst, letting the buggy GCNPat fire. Sibling of m137.
m147-performclampcombine-drops-snan-quietening	❌	❌	❌	`performClampCombine` (`SIISelLowering.cpp:18284-18303`) constant-folds `AMDGPUISD::CLAMP(c)` and returns the input sNaN bit-pattern unchanged when DX10Clamp is OFF (default for compute kernels). `AMDGPUISD::CLAMP` lowers via `ClampPat` (`SIInstructions.td:2030-2036`) to `V_MAX_F32_e64 src, src, DSTCLAMP.ENABLE`; with IEEE_MODE=1 the HW `v_max_f32(sNaN, sNaN)` quiets the sNaN (sets mantissa bit 22) but preserves payload. Constant-fold path returns `0x7F800001` (raw sNaN); HW path returns `0x7FC00001` (quieted). Reachable via `amdgcn.fmed3(c, 0.0, 1.0)`, `fminnum(fmaxnum(c, 0.0), 1.0)`, or `fmed3` pattern-matchers. Sibling family of m133.
m146-resource-usage-agpr-undercount-with-calls	❌	❌	❌	`AMDGPUResourceUsageAnalysis.cpp:173-177` sets `NumAGPR` and `NumExplicitSGPR` ONCE via `getNumUsedPhysRegs(..., IncludeCalls=false)`. The per-MI call-path scan at lines 188-321 only updates `MaxVGPR`; AGPR/SGPR are skipped via the `!TRI.isVGPRClass(RC)` filter at line 244. On gfx950 (MAI/AGPR-heavy kernels with external calls), AGPR usage is under-reported. Combined with unified VGPR+AGPR allocation on gfx90a/gfx950, downstream `max(NumVGPR, NumAGPR)` for occupancy may select stale AGPR count -> kernel descriptor over-allocates waves per CU. Secondary: call-path MaxVGPR scan lacks the `getAddressableNumArchVGPRs()` clip that the fast-path applies, so wave32-vs-wave64 mismatch is unhandled.
m145-mcinstlower-externalsymbol-drops-target-flag	❌	❌	❌	`AMDGPUMCInstLower.cpp:104-109` (`MO_ExternalSymbol` branch) constructs `MCSymbolRefExpr::create(Sym, Ctx)` WITHOUT passing `getSpecifier(MO.getTargetFlags())`. The sibling `MO_GlobalAddress` branch at lines 89-103 correctly applies the specifier. Any AMDGPU symbol specifier (`MO_GOTPCREL`/`MO_REL32_LO`/`MO_REL32_HI`/`MO_ABS32_LO`/`MO_ABS32_HI`/`MO_REL64`/`MO_ABS64`) on an ExternalSymbol is silently dropped, producing wrong relocation type in the emitted object. Reachable via RuntimeLibcalls (`__divdi3` etc.) and `SI_TCRETURN_CHAIN` with external dest. Same defect class in the `MO_MCSymbol` branch at 113-119 (only handles `MO_FAR_BRANCH_OFFSET`).
m144-sched-barrier-ldsdma-mask-ineffective	❌	❌	❌	`llvm.amdgcn.sched.barrier` mask = 0x800 (allow LDSDMA past barrier) is silently ineffective. `invertSchedBarrierMask` (`AMDGPUIGroupLP.cpp:2667-2676`) clears the aggregate VMEM bit (0x10) but leaves VMEM_READ (0x20) and VMEM_WRITE (0x40) sub-bits set. Every LDSDMA instruction also satisfies `isVMEM && mayLoad/Store` (`SIInstrInfo.h:631`), so `canAddMI` (`AMDGPUIGroupLP.cpp:2474-2480`) classifies LDSDMA into the SchedGroup via VMEM_READ/WRITE branches -- the instruction receives ordering edges to the SCHED_BARRIER and cannot move past, contradicting the `AMDGPUUsage.rst:1626` documented semantics. Asymmetry: requesting DS-allow (mask=0x80) correctly allows LDSDMA (line 2680), but reverse path fails. Existing lit test only verifies the debug-print mask value, not the actual scheduling effect.
c001-sudot-isel-ice	❌	❌	❌	`llvm.amdgcn.sudot4` / `llvm.amdgcn.sudot8` abort in AMDGPU instruction selection with `Cannot select`.
c002-fma-legacy-isel-ice	❌	❌	❌	`-O0` leaves `llvm.amdgcn.fma.legacy` for AMDGPU instruction selection, which aborts with `Cannot select`; `-O2` compiles the reduced case.
c003-permlane16-isel-ice	❌	❌	❌	`llvm.amdgcn.permlane16` ICEs with `Cannot select` on every CDNA target (gfx9xx); the instruction is GFX10+/RDNA only but the intrinsic is declared target-unconditional.
c004-mov-dpp8-isel-ice	❌	❌	❌	`llvm.amdgcn.mov.dpp8` ICEs with `Cannot select` on every CDNA target; same root cause as c003 -- DPP8 is GFX10+/RDNA only.
c005-global-load-lds-isel-ice	❌	❌	❌	`llvm.amdgcn.global.load.lds` ICEs with `Cannot select` on gfx950; same family as c003/c004. `llvm.amdgcn.ds.ordered.add` ICEs the same way (mentioned in the c005 notes rather than getting its own entry).
c006-tanh-isel-ice	❌	❌	❌	`llvm.amdgcn.tanh` (both `.f32` and `.f16`) ICEs with `Cannot select` on gfx950; `v_tanh_*` is a GFX12 instruction not available on CDNA. Same fix shape as c003.
c007-fcmp-i32-wave64-fold-ice	❌	❌	❌	`llvm.amdgcn.fcmp.i32` with two equal constant FP operands ICEs at `-O2` on any wave64 target with `invalid type for register "exec"`; the constant folder doesn't validate that the requested return width matches the wave size. Distinct shape from c003--c006 -- a constant-folder bug rather than a missing subtarget predicate.
c008-amdgcn-class-bf16-isel-ice	❌	❌	❌	`llvm.amdgcn.class.bf16` ICEs at both -O0 and -O2 with `LLVM ERROR: Cannot select: i1 = AMDGPUISD::FP_CLASS ... bf16`. `SIISelLowering.cpp:10931-10933` unconditionally lowers `int_amdgcn_class` (polymorphic over `llvm_anyfloat_ty`) to `AMDGPUISD::FP_CLASS` for any source VT; `VOPCInstructions.td:1223-1229` defines `VOPCClassPat64` only for `_F16/_F32/_F64` -- no `V_CMP_CLASS_BF16`. `llvm.is.fpclass.bf16` correctly expands via i16 compares; the amdgcn-specific intrinsic skips that path. Sibling to c001/c003/c006 and m118 (bf16 over-promise).
c009-ballot-wrong-width-cannot-select	❌	❌	❌	`llvm.amdgcn.ballot.<N>` with `<N> != WavefrontSize` ICEs in ISel for non-constant args. `lowerBALLOTIntrinsic` (`SIISelLowering.cpp:7811-7852`) emits `AMDGPUISD::SETCC` directly in the user-requested return type without first emitting the wave-sized SETCC and `getZExtOrTrunc`'ing. ISel has no pattern matching a wave-mask SETCC at the wrong width: `LLVM ERROR: Cannot select: i32 = AMDGPUISD::SETCC ..., setne:ch`. Reproduces on `ballot.i32` / wave64 (gfx950) and `ballot.i64` / wave32 (gfx1030). Distinct from c007 (constant-fold ICE) -- c009 fires for arbitrary non-constant inputs. Only the `isOne()` fast-path checks the active wave width. Sibling `lowerICMPIntrinsic` does it correctly.
c010-strict-fp-extend-bf16-unreachable	❌	❌	❌	`STRICT_FP_EXTEND bf16 -> f32/f64` ICEs with `llvm_unreachable("Need STRICT_BF16_TO_FP")` at `SIISelLowering.cpp:4914-4915` (`lowerFP_EXTEND`). `STRICT_FP_EXTEND` is set Custom for f32/f64 dst (lines 580-581); the bf16 source case has no strict expansion. Reproducer: `llvm.experimental.constrained.fpext.f32.bf16` with `strictfp` attribute. Crashes a release-asserts compiler at any opt level. Sibling to c001/c003/c006/c008 (intrinsic without selector / wrong target gate) and m143 (sibling f64->bf16 strict-round drops chain).
c011-buffer-load-format-tfe-illegal-data-type-ice	❌	❌	❌	`amdgcn.struct.ptr.buffer.load.format` with TFE enabled AND illegal data type (`<3 x i16>`, likely `<6 x i16>`/`<3 x bfloat>`) ICEs in SDAG legalization. Two cooperating defects: (1) `SITargetLowering::lowerIntrinsicLoad` illegal-type branch (`SIISelLowering.cpp:7740`) unconditionally builds a 2-result VTList `(CastVT, MVT::Other)`, ignoring `IsTFE` -- the TFE i32 result is never plumbed; legal-type branch at 7735 correctly uses `M->getVTList()`. (2) `ReplaceNodeResults` else branch (`8263-8266`) pushes exactly 2 results regardless of `N->getNumValues()`, dropping the 3rd value for TFE's `(data, status, chain)`. Crash in `SelectionDAG::ReplaceAllUsesWith`. SDAG-only (GISel cleanly errors with "unable to legalize").
c018-image-atomic-illegal-data-type-ice	❌	❌	❌	`amdgcn.image.atomic.<op>.<dim>` ICEs/miscompiles in SDAG for any data width != 32/64 bits. `lowerImage` atomic branch (`SIISelLowering.cpp:10156-10181`) is a binary `Is64Bit = VData.getValueSizeInBits() == 64` dispatch -- everything else gets `NumVDataDwords=1, DMask=0x1` -> MIMG selector picks 1-dword V1 opcode for arbitrary-width VData. 7-variant matrix: `<3 x i32>` ICE in copyPhysReg/MCInstPrinter; `<3 x i16>/v6i16/v3bf16` widen-result error; `i128` expand; `i128 cmpswap` Cannot select; bfloat SILENT MISCOMPILE -- 1-dword `image_atomic_swap dmask:0x1` corrupts upper 16 bits of texel. Family sibling m142/c011/c014/c015/c016/c017.
c017-buffer-atomic-illegal-data-type-ice	❌	❌	❌	`amdgcn.{raw,struct}.ptr.buffer.atomic.*` ICEs in SDAG for illegal data types. `lowerRawBufferAtomicIntrin` (`SIISelLowering.cpp:11196-11222`), `lowerStructBufferAtomicIntrin` (`:11224-11250`), and cmpswap arms (`:11541-11587`) hand the user-typed value straight to `getMemIntrinsicNode` with no illegal-type bitcast/widen fallback (unlike `lowerIntrinsicLoad`'s 7739-7745). 7-variant reproducer matrix all ICE on LLVM HEAD and ROCm 7.2.3: `add.v3i16`, `add.i128`, `swap.v6i8`, `swap.i24` (segfault), `fadd.bf16` (Cannot select), `struct.add.v3i16`, `struct.cmpswap.i128`. Sibling family c011/c014/c015/c016.
c016-s-buffer-load-illegal-data-type-ice	❌	❌	❌	`llvm.amdgcn.s.buffer.load.<T>` ICEs in SDAG when T is an illegal data type. `lowerSBuffer` (`SIISelLowering.cpp:10537-10632`) handles only `i16` (line 10559) and v3-of-legal-scalar (10567) for widening; everything else falls through to `getMemIntrinsicNode` with illegal VT. Divergent path asserts scalar in {i32, f32}. `ReplaceNodeResults` at `:8199-8246` hard-asserts `VT == MVT::i8` (line 8212). 7-variant reproducer matrix all ICE on both LLVM HEAD and ROCm 7.2.3: `i1`, `i4`, `i24`, `<2 x i1>`, `<3 x i16>`, `<6 x i8>`, `i128`. Sibling family c011/c014/c015.
c015-buffer-load-format-i8-drops-format	❌	❌	❌	`amdgcn.{raw,struct,struct.ptr}.buffer.load.format.i8` (and store mirror) drop format encoding in SDAG. `SIISelLowering.cpp:7730-7732` routes scalar i8 format loads to `handleByteShortBufferLoads` which (line 12760) ignores `IsFormat` and emits `BUFFER_LOAD_UBYTE` instead of `BUFFER_LOAD_FORMAT_X`. The buffer-rsrc format descriptor is not applied. Store mirror at `SIISelLowering.cpp:12151-12153, 12202-12205` -> `handleByteShortBufferStores` (12798). i16 escapes via D16 early-return at 7725. SDAG vs GISel divergence: GISel correctly emits `buffer_load_format_x`. Sibling family c011/c014.
c014-tbuffer-load-illegal-vector-data-ice	❌	❌	❌	`amdgcn.raw.ptr.tbuffer.load.v3i16` (also v6i16, v3bf16, struct variants, and store mirrors) ICEs at -O0/-O2 with `LLVM ERROR: Do not know how to widen the result of this operator!`. `SIISelLowering.cpp:11394-11399` (raw) / `:11421-11426` (struct) D16 fast-path checks ONLY `MVT::f16`; all other illegal vector return types skip that branch and fall through to a plain `getMemIntrinsicNode(TBUFFER_LOAD_FORMAT, ..., LoadVT, ...)` with the illegal type. No CastVT fallback (sibling lowerIntrinsicLoad at 7739-7745 has one). `ReplaceNodeResults` at 8256 returns the still-illegal value. Distinct from c011 (TFE chain-drop) -- tbuffer has no TFE form.
c013-cube-intrinsics-wrong-gate-cdna	❌	❌	❌	`llvm.amdgcn.cube{id,ma,sc,tc}` mis-select on gfx940/gfx950 (CDNA, no cube ALU). `HasCubeInsts` is gated correctly on V_CUBE*_F32 patterns at `VOP3Instructions.td:264-269`, BUT `FeatureGFX9` itself unconditionally includes `FeatureCubeInsts` at `AMDGPU.td:1462`. gfx940/941/942/950 inherit FeatureGFX9 via FeatureISAVersion9_4_Common (`AMDGPU.td:1747`) and never subtract `FeatureCubeInsts`. `llc -mcpu=gfx950 -O2` cleanly emits `v_cubeid_f32`/`v_cubema_f32`/`v_cubesc_f32`/`v_cubetc_f32` -- HW would trap as illegal instructions. Adjacent: GISel `isCanonicalized` over-promises cubema/cubesc/cubetc as canonical even though they propagate NaN.
c012-pops-exiting-wave-id-wrong-gate-cdna	❌	❌	❌	`llvm.amdgcn.pops.exiting.wave.id` selects to invalid `SRC_POPS_EXITING_WAVE_ID` SGPR on gfx940/gfx950. `SOPInstructions.td:2050-2054` gates the pattern with `isGFX9GFX10` (true for gfx940/gfx950, Generation=GFX9), but POPS is a graphics-only HW feature absent on the CDNA line. MC layer either rejects the encoding or accepts a binary that triggers an illegal-instruction trap at runtime. Adjacent defect (not filed separately): `SIISelLowering.cpp:12024` `amdgcn.exp.compr` guard uses `hasCompressedExport() = !HasGFX11Insts` which is true on gfx950, but gfx950 has no export HW (`hasExportInsts() = !hasGFX940Insts() = false`); no diagnostic fires and an unencodable EXP machine node is emitted. Sibling family: c001/c003/c004/c005/c006/c008.

Human-written note: Up through bug m016 I was testing against upstream LLVM. But then it became clear that the ROCm 7.2.3 release didn't have many of these bugs, so I switched to testing the release. After m038, AMD asked me to switch fuzzing back to upstream.

LLVM Source Builds

The fuzzer can run against an installed ROCm LLVM as-is. Coverage-guided compiler fuzzing additionally needs an instrumented LLVM build: initialize the LLVM submodule (third_party/llvm-project) and build it with the scripts below. To use a different LLVM checkout or fork, set LLVM_PROJECT_DIR=/path/to/llvm-project.

Typical directed-fuzzing setup:

git submodule update --init --depth 1 third_party/llvm-project
scripts/build_instrumented_llvm.sh
scripts/build_directed_fuzzer.sh
scripts/run_directed_fuzzer.sh
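
The build scripts themselves are not reproduced here, so how they consume LLVM_PROJECT_DIR is not shown. As a rough pre-flight sketch, the snippet below picks the checkout the way the text above suggests: LLVM_PROJECT_DIR if set, otherwise the submodule path. The helper name `resolve_llvm_dir` is hypothetical, and the fallback to third_party/llvm-project is an assumption rather than confirmed script behavior. The `llvm/` subdirectory check relies only on the standard llvm-project layout.

```shell
# Hypothetical helper: print the llvm-project checkout to build from.
# Assumption: when LLVM_PROJECT_DIR is unset, the submodule path is the default.
resolve_llvm_dir() {
  printf '%s\n' "${LLVM_PROJECT_DIR:-third_party/llvm-project}"
}

llvm_dir="$(resolve_llvm_dir)"

# A usable llvm-project checkout contains an llvm/ subdirectory.
if [ -d "$llvm_dir/llvm" ]; then
  echo "using LLVM checkout: $llvm_dir"
else
  echo "no LLVM checkout at $llvm_dir; initialize the submodule or set LLVM_PROJECT_DIR" >&2
fi
```

Running this before scripts/build_instrumented_llvm.sh should catch an uninitialized submodule or a mistyped path early, before a long build starts.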
