This section is human-written.
This directory contains the FuzzX PTX fuzzer. It looks for correctness bugs in NVIDIA's PTX toolchain.
Currently it fuzzes ptxas.
The process of looking for a bug is:
- Generate a random PTX program.
- Compile it with `-O0` and `-O2`.
- Run both programs.
- Compare their outputs.
So long as the random program is "legal" (meaning, mostly, it doesn't have undefined behavior) the output from the two programs should be identical. If they are not the same, that indicates a likely miscompile.
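The steps above can be sketched in a few lines of Rust. This is only an illustration, not the fuzzer's actual code: `ptxas_args`, `compare_outputs`, and `Verdict` are hypothetical names, and the file names are placeholders. The `-O<N>`, `-arch`, and `-o` options are standard ptxas options.

```rust
/// Build the ptxas argument list for one optimization level.
/// The helper itself is hypothetical; only the option spellings are ptxas's.
fn ptxas_args(opt_level: u8, arch: &str, ptx: &str, cubin: &str) -> Vec<String> {
    vec![
        format!("-O{}", opt_level),
        format!("-arch={}", arch),
        "-o".to_string(),
        cubin.to_string(),
        ptx.to_string(),
    ]
}

/// Result of comparing the output buffers produced by the two builds.
#[derive(Debug, PartialEq)]
enum Verdict {
    Match,
    /// Byte offset of the first difference. If one buffer is a strict
    /// prefix of the other, this is the length of the shorter buffer.
    Diverged { first_diff: usize },
}

/// Compare the two output buffers byte for byte.
fn compare_outputs(low_opt: &[u8], high_opt: &[u8]) -> Verdict {
    let common = low_opt.len().min(high_opt.len());
    if let Some(i) = (0..common).find(|&i| low_opt[i] != high_opt[i]) {
        return Verdict::Diverged { first_diff: i };
    }
    if low_opt.len() != high_opt.len() {
        return Verdict::Diverged { first_diff: common };
    }
    Verdict::Match
}
```

A driver would call `ptxas_args` once per optimization level, launch both cubins on the same input, and treat any `Verdict::Diverged` as a candidate miscompile.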
Most of the complexity in the fuzzer is around generating random programs. Obviously we can't generate truly arbitrary programs; they might have UB or infinite loops. Perhaps less obvious is that, after we've found one bug, we need to generate programs that avoid that bug, otherwise we'll just keep finding it over and over. So we have many flags that let you disable particular known-buggy idioms.
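As a rough picture of how a disable flag steers generation, the sketch below filters an opcode pool by per-feature switches. The `GenConfig` struct, its three fields, and the tiny pool are assumptions made for illustration; the real generator has far more switches and its internals may look nothing like this.

```rust
/// Hypothetical subset of generator switches, mirroring flags such as
/// DIV_DISABLE_LOP3, DIV_DISABLE_PRMT, and DIV_DISABLE_BREV.
#[derive(Default)]
struct GenConfig {
    disable_lop3: bool,
    disable_prmt: bool,
    disable_brev: bool,
}

/// Return the opcodes the generator may still pick from.
fn enabled_opcodes(cfg: &GenConfig) -> Vec<&'static str> {
    // Each entry pairs an opcode with "is it currently disabled?".
    let pool = [
        ("add.u32", false),
        ("lop3.b32", cfg.disable_lop3),
        ("prmt.b32", cfg.disable_prmt),
        ("brev.b32", cfg.disable_brev),
    ];
    pool.iter()
        .filter(|(_, disabled)| !*disabled)
        .map(|(op, _)| *op)
        .collect()
}
```

Once a bug involving, say, `lop3.b32` is understood, setting the matching switch removes that opcode from the pool so later runs surface different bugs.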
Fuzzers like libFuzzer and AFL++ allow you to do "directed" fuzzing, where you
observe the branches taken by the binary under test and steer fuzzing towards
"interesting" inputs. It is possible to do directed fuzzing on black-box
binaries like ptxas using e.g. AFL++'s QEMU mode. But we don't currently do
this, because we've found that undirected fuzzing is sufficient (for now).
All of the code here is AI-written, using ChatGPT 5.5 and Opus 4.7. I haven't read it at all. Fuzzing is inherently messy, and anyway the goal here is to find bugs, not to build a beautiful fuzzer.
After finding a miscompile, you'll want to:
- come up with a minimal testcase,
- root-cause the bug,
- write a reproducer to share with the vendor, and
- add a flag to the fuzzer so it avoids finding the same bug again.
I also use AI for this. Eventually it writes a reproducer into the
known-miscompiles directory.
I've had good luck using /goal to get the AI to run the fuzzer, wait for a
bug to appear, process it as above, and then restart the fuzzer. The biggest
issue seems to be that it's slow at minimizing testcases.
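For a sense of why minimization is slow, here is a minimal greedy line-removal reducer. It is a generic sketch, not what the reduce tool in this repo does; `reduce_lines` and the `still_fails` oracle are hypothetical names. In practice every oracle call means compiling the candidate twice and running both builds, which is where the time goes.

```rust
/// Greedy one-line-at-a-time reducer. `still_fails` is a caller-supplied
/// oracle that returns true while the candidate still shows the divergence.
fn reduce_lines<F>(lines: &[String], mut still_fails: F) -> Vec<String>
where
    F: FnMut(&[String]) -> bool,
{
    let mut current: Vec<String> = lines.to_vec();
    let mut changed = true;
    // Repeat full passes until a pass removes nothing.
    while changed {
        changed = false;
        let mut i = 0;
        while i < current.len() {
            let mut candidate = current.clone();
            candidate.remove(i);
            if still_fails(&candidate) {
                // Keep the smaller program and retry the same index.
                current = candidate;
                changed = true;
            } else {
                i += 1;
            }
        }
    }
    current
}
```

The result is 1-minimal with respect to the oracle: removing any single remaining line makes the divergence disappear.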
Everything below this line is AI-written slop. Good luck!
| Component | Notes |
|---|---|
| Rust | Uses the toolchain in rust-toolchain.toml. |
| CUDA driver + libcuda | Required for fuzzing, verification, and reduction. |
| CUDA Toolkit ptxas | Set PTXAS=/path/to/ptxas for reproducible runs. |
| CUDA Toolkit nvcc | Required for the standalone CUDA inline-PTX reproducers. |
| NVIDIA GPU matching TARGET_ARCH | fuzzx-execgen currently defaults to sm_103. |
| Path | Purpose |
|---|---|
| `crates/fuzzx-execgen` | PTX kernel generator for differential testing. |
| `crates/fuzzx-exec` | ptxas compiler wrapper plus CUDA launch/diff helpers. |
| `crates/fuzzx-diff` | Differential fuzzer plus show/verify/reduce helpers. |
| `known-miscompiles/` | Reduced or standalone reproducers for confirmed findings. |
| `scripts/check-gen.sh` | Generator acceptance-rate smoke test against ptxas. |
Except where otherwise noted, these have been tested on sm_103 (i.e. B300).
The primary table lists reproducers still observed on the latest tested ptxas.
| Version | Description |
|---|---|
| 13.2.78 | m001-seed-050f: Uniform loop-latch optimization mishandles divergent loop-header entry. |
| 13.2.78 | m002-structured-lop3: selp / lop3 / xor fold computes the wrong truth-table result. |
| 13.2.78 | m003-no-lop3-max-chain: sub.u32 plus max.s32 chain fold incorrectly includes the pre-subtract value. |
| 13.2.78 | m051-sat-sub-add-fold: sub.sat.s32 followed by adding back the subtrahend folds as if saturation cannot occur. |
| 13.2.78 | m004-mulhi-loop-tripcount: Loop removal drops two mul.hi.s32 accumulator updates. |
| 13.2.78 | m005-prmt-ifconvert-mask: If-converted prmt.b32 mask fold drops a source operand. |
| 13.2.78 | m006-ifconvert-not-xor: If-converted not.b32 plus xor.b32 fold uses the wrong truth table. |
| 13.2.78 | m007-signed-unsigned-ifconvert: Nested if-conversion conflates signed and unsigned predicates. |
| 13.2.78 | m008-funnel-shift-loop-unroll: Loop unroll rewrites a loop-carried shf.r.wrap.b32 recurrence incorrectly. |
| 13.2.78 | m009-neg-loop-after-counted-loop: Loop deletion stores a pre-neg.s32 value after counted-loop simplification. |
| 13.2.78 | m010-shr-s32-range-fold: Range fold treats shr.s32 as if it were unsigned before an unsigned compare. |
| 13.2.78 | m011-bfind-after-empty-loop: Empty-loop simplification folds a bfind.u32-derived value incorrectly. |
| 13.2.78 | m012-empty-loop-intmax-sub: Counted empty-loop fold miscomputes an INT_MAX subtraction sequence. |
| 13.2.78 | m048-intmax-popc-sub-mask-fold: Likely related to m012; structured branch context misfolds a popc-derived INT_MAX subtract before an and mask. |
| 13.2.78 | m013-set-true-cmp-one: set.eq materialization is folded as a predicate instead of 0xffffffff. |
| 13.2.78 | m047-selp-ge-zero-branch-fold: selp materialization of 0xffffffff feeding an unsigned >= 0 branch fold skips an always-taken arm. |
| 13.2.78 | m014-vsub4-divergent-branch: vsub4.u32.u32.u32 constant fold uses the wrong byte-lane intermediate. |
| 13.2.78 | m015-abs-loop-bmsk-fold: Loop deletion uses the pre-abs.s32 live-out value in a bmsk expression. |
| 13.2.78 | m016-slct-s32-immediate-fold: slct.s32.s32 immediate fold selects the wrong arm for a positive value. |
| 13.2.78 | m017-addc-shift-carry-fold: add.cc.u32 / addc.u32 fold injects an incorrect carry-in. |
| 13.2.78 | m029-addc-mul-carry-fold: Likely same root cause as m017; addc.u32 fold injects an incorrect carry-in after multiply-derived operands. |
| 13.2.78 | m018-subc-cnot-shift-borrow-fold: sub.cc.u32 / subc.u32 fold injects an incorrect borrow-in after cnot. |
| 13.2.78 | m027-subc-shr-mul-borrow-fold: Likely same root cause as m018; subc.u32 fold uses the wrong borrow source after shift and multiply. |
| 13.2.78 | m082-subc-shared-byte-borrow-fold: Likely same root cause as m018; subc.u32 borrow-chain fold is wrong after a shared signed-byte roundtrip and scalar 16-bit producer. |
| 13.2.78 | m019-structured-loop-uniform-counter: Structured loop counters are promoted to uniform state and lose per-lane values. |
| 13.2.78 | m020-mixed-minmax-signedness-fold: Mixed signed/unsigned min / max fold drops the runtime input. |
| 13.2.78 | m021-cnot-funnel-add: shf.r.wrap.b32 plus add fold loses part of the shifted value. |
| 13.2.78 | m081-cnot-shf-left-add: Likely related to m021; cnot.b32 feeding shf.l.wrap.b32 plus add flips the shifted contribution under optimized ptxas. |
| 13.2.78 | m022-neg-funnel-left-add: neg.s32 plus shf.l.wrap.b32 fold produces a sign-extension-shaped error. |
| 13.2.78 | m023-mul-wide-hi-ice: Optimized compile crashes on a mul.wide low-half feeding signed high multiply. |
| 13.2.78 | m024-prmt-cvt-u16-fold: prmt.b32 plus cvt.u16 fold drops the permuted source contribution. |
| 13.2.78 | m055-prmt-reg-control-eq-fold: Register-control prmt.b32 feeding an equality fold selects the wrong arm. |
| 13.2.78 | m066-prmt-sign-byte-and-fold: Sign-control prmt.b32 feeding a low-byte and.b32 fold drops the sign-filled byte. |
| 13.2.78 | m057-s16-unary-intmin-fold: abs.s16 / neg.s16 of INT16_MIN feeding cvt.s32.s16 is treated as a positive value. |
| 13.2.78 | m058-scalar16-min-cvt-fold: Scalar min.{u16,s16} through .b16 scratch registers folds a following equality predicate incorrectly. |
| 13.2.78 | m060-scalar16-sub-intmin-fold: Scalar sub.s16 0, INT16_MIN feeding cvt.s32.s16 is zero-extended by optimized ptxas. |
| 13.2.78 | m061-f32-div-pred-neg-cvt-fold: div.approx.ftz.f32 feeding cvt.rzi.s32.f32 and a skipped predicated neg.f32 loses the fallback move at -O0. |
| 13.2.78 | m062-packed-max-u16x2-liveout-fold: max.u16x2 in a live-range-heavy block corrupts later global roundtrip live-outs under optimized ptxas. |
| 13.2.78 | m063-f32-approx-div-cvt-boundary: div.approx.ftz.f32 can legally straddle an integer boundary before cvt.rzi.s32.f32, producing an exact-output oracle mismatch. |
| 13.2.78 | m064-scalar16-max-cvt-predicate-fold: Scalar max.u16 through .b16 scratch registers folds a following zero predicate incorrectly. |
| 13.2.78 | m065-red-global-min-loop-fold: Loop-carried value feeding a per-thread red.global.min.u32 roundtrip uses an earlier store value under optimized ptxas. |
| 13.2.78 | m067-atom-global-dec-loop-fold: Loop-carried state around a per-thread atom.global.dec.u32 roundtrip changes a final predicate-controlled live-out under optimized ptxas. |
| 13.2.78 | m068-atom-global-xor-loop-fold: Loop-carried value feeding a per-thread atom.global.xor.b32 roundtrip uses the original input word under optimized ptxas. |
| 13.2.78 | m069-wide-subc-loop-borrow-fold: Loop-carried state around a predicated sub.cc.u64 / subc.u64 pair drops the borrow/value feeding the final live-out. |
| 13.2.78 | m070-lop3-loop-liveout-fold: Loop-carried lop3.b32 live-out is replaced with the original input-size value under optimized ptxas. |
| 13.2.78 | m071-f32-neg-loop-liveout-fold: Loop-carried neg.f32 / cvt.rzi.s32.f32 live-out is replaced with the original input word under optimized ptxas. |
| 13.2.78 | m072-const-u16-highbits-fold: ld.const.u16 into a reused 32-bit register preserves stale high bits under optimized ptxas. |
| 13.2.78 | m073-cached-global-load-loop-entry: Likely related to m001; divergent loop-header entry drops a loop-body cached narrow global load under optimized ptxas. |
| 13.2.78 | m074-mad-hi-carry-loop-tripcount: Likely related to m004; loop optimization drops mad.hi.cc.s32 high-multiply carry-chain updates. |
| 13.2.78 | m075-wide-bfi-loop-liveout-fold: Loop-carried bfi.b64 live-out is folded to shifted or sign-shaped values under optimized ptxas. |
| 13.2.78 | m076-predicated-mad-loop-liveout-fold: Loop-body predicated mad.lo.u32 update is dropped, leaving %tid.x instead of %lanemask_gt under optimized ptxas. |
| 13.2.78 | m077-global-store-loop-liveout-fold: Loop-carried value reset after a per-thread global store is ignored, so the final mad.lo.s32 uses stale pre-store state. |
| 13.2.78 | m078-mul-lo-loop-recurrence-fold: Low-multiply loop recurrence is skipped for some lanes, leaving `input |
| 13.2.78 | m079-predicated-packed-add-high-half: Branch-local packed add leaves the low half correct but fills the high half with a lane value under optimized ptxas. |
| 13.2.78 | m080-ldu-signed-branch-fold: Signed branch control around predicated ldu.global.u32 leaves original per-lane inputs instead of the uniform load value under optimized ptxas. |
| 13.2.78 | m059-scalar16-pred-mulwide-fold: Scalar max.s16 feeding a predicate-guarded mul.wide.u16 is optimized as if the multiply did not execute. |
| 13.2.78 | m025-shl-xor-square-lowbits: Fold loses the fact that a value is shifted left before testing low bits. |
| 13.2.78 | m026-shr-abs-ult-fold: Fold reasons about 0 - abs(n) as signed or non-wrapping before unsigned compare. |
| 13.2.78 | m028-shf-r-wrap-sub-fold: shf.r.wrap.b32 output is folded to zero before a final subtract. |
| 13.2.78 | m030-not-clz-predicate-fold: Guarded path fold drops or misapplies not.b32 before clz.b32. |
| 13.2.78 | m031-guarded-sub-sub-fold: Guarded x - (0x80000000 - x) fold drops the 2*x contribution. |
| 13.2.78 | m032-cnot-neg-ugt-fold: cnot / neg chain feeding an unsigned-greater-than predicate folds to the wrong arm. |
| 13.2.78 | m046-cnot-underflow-ugt-fold: Likely same root cause as m032; cnot feeding wrapped subtraction before an unsigned comparison selects the wrong arm. |
| 13.2.78 | m033-not-xor-branch-fold: Branch-specialized not / xor path folds the wrong value into the store. |
| 13.2.78 | m035-xor-not-predicate-fold: Likely same root cause as m033; xor.b32 by 0xffffffff feeding a predicate selects the wrong arm. |
| 13.2.78 | m034-bfind-zero-branch-fold: Branch fold treats bfind.u32 0 as 0 instead of 0xffffffff. |
| 13.2.78 | m036-mulhi-control-fold: Control-flow fold around mul.hi.s32 uses an incorrect folded constant. |
| 13.2.78 | m037-bmsk-clz-bfi-fold: bmsk / clz / bfi / mad.lo value-chain fold sets an extra output bit. |
| 13.2.78 | m038-structured-empty-else-fold: Always-false structured branch with an empty else arm folds as if the untaken then arm executed. |
| 13.2.78 | m039-else-redefinition-fold: Branch fold drops the executed else-path redefinition of a value initialized before the branch. |
| 13.2.78 | m040-mulwide-neg-shr-fold: mul.wide low word feeding wrapped negation and logical shift loses the shifted high-bit contribution. |
| 13.2.78 | m049-wide-or-shift-mask-fold: Likely related to m040; or.b64 low word feeding a shift/add mask fold computes the wrong mask. |
| 13.2.78 | m041-or-shifted-square-fold: or.b32 after a square known to have zero low 32 bits folds with a missing output bit. |
| 13.2.78 | m044-mul-lo-square-fold: Likely same root cause as m041; square of a shifted mul.lo value folds to 0x80000000 instead of zero. |
| 13.2.78 | m042-vsub4-else-ifconvert-fold: If-converted else arm using vsub4 computes the wrong value for the one lane that takes it. |
| 13.2.78 | m043-shr-sub-branch-fold: Branch-sensitive unsigned shift after wrapped subtraction loses the shifted high bit. |
| 13.2.78 | m050-reg-shl-mask-fold: Masked register-count shl.b32 chains fold to the wrong shifted value. |
| 13.2.78 | m052-bfe-reg-pos-fold: Register-position bfe.s32 with an out-of-range start folds to the wrong sign-filled value. |
| 13.2.78 | m053-bfi-reg-len-fold: Likely related to m052; register-length bfi.b32 preserves high base bits that should be overwritten. |
| 13.2.78 | m045-brev-branch-fold: Branch-join fold around brev.b32 computes 0x8000001d instead of 0x8000001f. |
| 13.2.78 | m083-orphan-param-ld: ptxas segfaults at every optimization level on an 11-line kernel that declares a local .param and reads it with ld.param without ever using the .param as a call argument or return value. Also reproduces on 13.0.88. |
| 13.2.78 | m084-multi-island-o-opt-crash: ptxas optimizer segfaults at -O1 and above (clean at -O0) on a 66-line kernel whose body simultaneously contains cvt.pack.sat.u8, bar.red.popc.u32, shfl.sync.up.b32, elect.sync, redux.sync.max.u32, createpolicy.fractional.L2 + cache-hint loads, bf16/tf32 conversions, and sub.rn.f16x2; removing any single category makes the crash disappear. Also reproduces on 13.0.88. |
| 13.2.78 | m085-cond-skip-or-imm-neg1: 19-line repro — mov.b32 %r1, 0x3f800000; followed by a kernel-param-dependent setp and @%p bra done; over or.b32 %r1, -1, %r0; causes -O3 to skip the OR and keep the 0x3f800000 initialiser. Specific to the 0x3f800000 (1.0f) bit pattern — tested 11 other constants, none trigger. Also reproduces on 13.0.88. |
| 13.2.78 | m086-predicated-addc-carry-crash: 16-line kernel with a predicated add.cc.u32 / addc.u32 carry chain triggers a C7907 internal compiler error at every opt level above -O0; the same input segfaults ptxas 13.0.88 at -O3. |
| 13.2.78 | m087-redux-xor-guarded-f64-cvt: Same shape as m085 — setp.ne.u32 %p, %r_redux, 0 guarding a single-arm bra over an f64-to-s32 cvt that writes %r3 causes -O3 to skip the cvt and leave %r3 at whatever an upstream predicated mov.b64 {%r3, %r9}, %rd7; left it. The differing reduce predicate is computed from a redux.sync.xor.b32 reduction. Also reproduces on 13.0.88. |
| Found In | Fixed In | Description |
|---|---|---|
| 13.0.88 | 13.2.78 | m054-packed-add-cvt-fold: add.s16x2 feeding cvt.u16 and another packed add drops the first packed-add contribution. |
| 13.0.88 | 13.2.78 | m056-packed-add-cvt-s16-fold: Likely same root cause as m054; add.u16x2 feeding cvt.s16 drops the packed-add contribution. |
Build the tools:

    cargo build --release -p fuzzx-diff

Run a differential sweep:

    target/release/fuzzx-diff \
      --ptxas /usr/local/cuda/bin/ptxas \
      --max-iters 100000

Divergences are saved under DIV_OUT_DIR (default: divergences/) as
directories containing seed.bin, program.ptx, input.bin, output_o0.*,
output_o3.*, and summary.txt.

Useful follow-up commands:

    target/release/fuzzx-diff-show divergences/div-...
    target/release/fuzzx-diff-verify divergences/div-...
    target/release/fuzzx-diff-reduce divergences/div-...
    target/release/fuzzx-diff-test divergences/div-.../program.ptx divergences/div-.../input.bin
    target/release/fuzzx-diff-inspect-outputs divergences/div-.../program.ptx divergences/div-.../input.bin

Check how often generated PTX assembles:

    PTXAS=/usr/local/cuda/bin/ptxas scripts/check-gen.sh 200

fuzzx-diff accepts kebab-case CLI flags for the run-control and generator
settings below; target/release/fuzzx-diff --help lists the full set. The
same settings can still be supplied as environment variables, which is useful
for long-running scripted sweeps. Boolean environment variables accept 1,
true, yes, or on for true, and 0, false, no, or off for false.
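The accepted boolean spellings map onto a small parser. The following is a sketch, not the tool's actual parsing code: `parse_bool_setting` is a hypothetical name, and the case-insensitive matching and whitespace trimming are assumptions.

```rust
/// Parse a boolean setting using the documented spellings.
/// Returns None for anything else.
/// Assumption: matching is case-insensitive and ignores surrounding whitespace.
fn parse_bool_setting(raw: &str) -> Option<bool> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "1" | "true" | "yes" | "on" => Some(true),
        "0" | "false" | "no" | "off" => Some(false),
        _ => None,
    }
}
```

Returning `None` for unknown spellings lets the caller decide whether to fall back to the default or reject the value.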
| Variable | Default | Meaning |
|---|---|---|
| `PTXAS` | /usr/local/cuda/bin/ptxas, then $HOME/bin/ptxas, then ptxas | Target ptxas binary. Set this explicitly for reproducible runs. |
| `TMPDIR` | Caller value; some tools use /dev/shm when unset and available. | Temporary directory for PTX/cubin files. |
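The `PTXAS` default order can be read as a fallback chain. Below is a sketch of one plausible interpretation: an explicit `PTXAS` value wins, then the first candidate path that exists, then the bare name `ptxas` resolved through `PATH`. The function name `resolve_ptxas`, the injected `exists` check, and the "first existing path" rule are assumptions, not the tool's confirmed behavior.

```rust
use std::path::{Path, PathBuf};

/// Pick a ptxas binary. `exists` is injected so the logic can be exercised
/// without touching the file system; a real caller could pass `|p| p.exists()`.
fn resolve_ptxas<F>(ptxas_env: Option<&str>, home: Option<&str>, exists: F) -> PathBuf
where
    F: Fn(&Path) -> bool,
{
    // An explicit PTXAS setting always wins.
    if let Some(p) = ptxas_env {
        return PathBuf::from(p);
    }
    let system = PathBuf::from("/usr/local/cuda/bin/ptxas");
    if exists(&system) {
        return system;
    }
    if let Some(h) = home {
        let user = Path::new(h).join("bin").join("ptxas");
        if exists(&user) {
            return user;
        }
    }
    // Last resort: rely on PATH lookup of the bare name.
    PathBuf::from("ptxas")
}
```

Because the fallback depends on what happens to be installed, setting `PTXAS` explicitly is the only way to be sure two runs used the same compiler.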
| Variable | Default | Meaning |
|---|---|---|
| `DIV_OUT_DIR` | divergences | Directory for saved divergence bundles. |
| `DIV_STARTING_SEED` | nanoseconds since epoch | First seed in the deterministic seed stream. |
| `DIV_MAX_ITERS` | unlimited | Stop after this many generated candidates. |
| `DIV_PRINT_EVERY_SECS` | 5 | Progress-report interval. |
| `DIV_PROGRAM_BYTES` | 4096 | Bytes derived from each seed and consumed by the generator. |
| `DIV_GPUS` | all visible CUDA devices | Comma-separated CUDA device ordinals, for example 0,1,2. |
| `DIV_WORKERS_PER_GPU` | 16 | Worker threads per selected GPU. |
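To make the `DIV_GPUS` format concrete, here is a sketch that parses a comma-separated ordinal list and computes the resulting worker count. `parse_gpu_list` and `total_workers` are hypothetical helpers; trimming whitespace and dropping duplicate ordinals are assumptions of this sketch.

```rust
/// Parse a list such as "0,1,2" into device ordinals.
/// Assumption: whitespace around entries is ignored and duplicates are dropped.
fn parse_gpu_list(raw: &str) -> Result<Vec<u32>, String> {
    let mut out = Vec::new();
    for part in raw.split(',') {
        let part = part.trim();
        let ordinal: u32 = part
            .parse()
            .map_err(|_| format!("bad device ordinal: {:?}", part))?;
        if !out.contains(&ordinal) {
            out.push(ordinal);
        }
    }
    Ok(out)
}

/// Total worker threads when each selected GPU gets the same worker count.
fn total_workers(gpus: &[u32], workers_per_gpu: usize) -> usize {
    gpus.len() * workers_per_gpu
}
```

With `DIV_GPUS=0,1,2` and the default `DIV_WORKERS_PER_GPU` of 16, this arithmetic gives 48 worker threads.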
| Variable | Default | Meaning |
|---|---|---|
| `DIV_STRUCTURED_CONTROL_FLOW` | false | Use structured single-entry if/loop generation instead of arbitrary CFG generation. |
| `DIV_MIN_BLOCKS` / `DIV_MAX_BLOCKS` | 1 / 10 | Block-count bounds. |
| `DIV_MIN_INSTS_PER_BLOCK` / `DIV_MAX_INSTS_PER_BLOCK` | 1 / 6 | Instruction-count bounds per block. |
| `DIV_WORKING_REGS` | 8 | Number of working u32 registers. |
| `DIV_MAX_LOOP_ITERS` | 16 | Maximum generated loop-trip count. |
| `DIV_MAX_IMMEDIATE` | 32 | Maximum ordinary immediate value. |
| `DIV_MAX_STRUCTURED_DEPTH` | 3 | Maximum nesting depth for structured control flow. |
All variables in this table default to false; setting one to true suppresses
that feature.
| Variable | Suppresses |
|---|---|
| `DIV_DISABLE_STRUCTURED_LOOPS` | Counted-loop shapes in structured mode. |
| `DIV_DISABLE_ARBITRARY_LOOPS` | Backedge loop terminators in arbitrary CFG mode. |
| `DIV_DISABLE_LOP3` | lop3.b32. |
| `DIV_DISABLE_PREDICATED_LOP3` | Predicated lop3.b32 instructions. |
| `DIV_DISABLE_MINMAX` | min.u32, max.u32, min.s32, max.s32. |
| `DIV_DISABLE_SELP` | Direct integer selp.{b32,u32,s32} instructions. |
| `DIV_DISABLE_TYPED_SELP` | Direct integer selp.u32 and selp.s32 instructions. |
| `DIV_DISABLE_SUB` | Random sub.u32 ALU instructions. |
| `DIV_DISABLE_MUL_LO` | mul.lo.u32 and mad.lo.u32. |
| `DIV_DISABLE_SIGNED_LO_ALU` | Signed low-ALU spellings, including saturating add/sub. |
| `DIV_DISABLE_SAT_ARITH` | add.sat.s32 and sub.sat.s32. |
| `DIV_DISABLE_PACKED_ADD` | add.u16x2 and add.s16x2. |
| `DIV_DISABLE_SIGNED_PACKED_ADD` | add.s16x2 only. |
| `DIV_DISABLE_PREDICATED_PACKED_ADD` | Predicated add.u16x2 and add.s16x2 instructions. |
| `DIV_DISABLE_PACKED_MINMAX` | min/max.{u16x2,s16x2}. |
| `DIV_DISABLE_SIGNED_PACKED_MINMAX` | min/max.s16x2 only. |
| `DIV_DISABLE_PREDICATED_PACKED_MINMAX` | Predicated min/max.{u16x2,s16x2} instructions. |
| `DIV_DISABLE_SCALAR_16BIT` | Scalar 16-bit ALU through .b16 scratch registers. |
| `DIV_DISABLE_SIGNED_SCALAR_16BIT` | Signed scalar 16-bit ALU while retaining unsigned u16 ops. |
| `DIV_DISABLE_SCALAR_16BIT_MIN` | Scalar 16-bit min.{u16,s16} and max.{u16,s16} while retaining other scalar 16-bit arithmetic instructions. |
| `DIV_DISABLE_SCALAR_16BIT_SIGNED_UNARY` | abs.s16 and neg.s16 while retaining other scalar 16-bit ALU instructions. |
| `DIV_DISABLE_SCALAR_16BIT_BITWISE` | and.b16, or.b16, xor.b16, and not.b16. |
| `DIV_DISABLE_SCALAR_16BIT_SHIFTS` | shl.b16, shr.u16, and shr.s16 with immediate counts in 0..15. |
| `DIV_DISABLE_SCALAR_16BIT_COMPARE` | Scalar 16-bit setp and set through .b16 scratch registers. |
| `DIV_DISABLE_SCALAR_16BIT_SELP` | Scalar 16-bit selp.u16 and selp.s16; also disabled by DIV_DISABLE_SCALAR_16BIT_COMPARE. |
| `DIV_DISABLE_PREDICATED_SCALAR_16BIT` | Predicated scalar 16-bit ALU instructions. |
| `DIV_DISABLE_MULHI` | mul.hi.u32 and mul.hi.s32. |
| `DIV_DISABLE_SIGNED_MULHI` | mul.hi.s32 only. |
| `DIV_DISABLE_MAD_HI` | mad.hi.u32 and mad.hi.s32. |
| `DIV_DISABLE_SIGNED_MAD_HI` | mad.hi.s32 only. |
| `DIV_DISABLE_BITWISE_BINOPS` | and.b32, or.b32, xor.b32. |
| `DIV_DISABLE_OR` | or.b32 while retaining and.b32 and xor.b32. |
| `DIV_DISABLE_XOR` | xor.b32 while retaining and.b32 and or.b32. |
| `DIV_DISABLE_PRMT` | prmt.b32. |
| `DIV_DISABLE_PREDICATED_PRMT` | Predicated prmt.b32 instructions. |
| `DIV_DISABLE_REG_PRMT` | Register-control prmt.b32 instructions. |
| `DIV_DISABLE_PREDICATED_REG_PRMT` | Predicated register-control prmt.b32 instructions. |
| `DIV_DISABLE_PRMT_MODES` | prmt.b32 mode variants such as .f4e, .ecl, and .rc16. |
| `DIV_DISABLE_NOT` | not.b32 and xor-by-0xffffffff forms. |
| `DIV_DISABLE_CLZ` | clz.b32. |
| `DIV_DISABLE_BREV` | brev.b32. |
| `DIV_DISABLE_CNOT` | cnot.b32. |
| `DIV_DISABLE_POPC` | popc.b32. |
| `DIV_DISABLE_ABS` | abs.s32. |
| `DIV_DISABLE_SPECIAL_REGS` | Deterministic special-register reads and constants such as %laneid, %nwarpid, WARP_SZ, and `%lanemask_*`. |
| `DIV_DISABLE_PREDICATED_SPECIAL_REGS` | Predicated deterministic special-register reads. |
| `DIV_DISABLE_GLOBAL_LOADS` | Bounded read-only ld.global.{u8,s8,u16,s16,u32,u64,s64} loads from the input buffer. |
| `DIV_DISABLE_UNIFORM_GLOBAL_LOADS` | Uniform-address scalar/vector ldu.global loads from the input buffer. |
| `DIV_DISABLE_GLOBAL_STORE_ROUNDTRIPS` | Per-thread st.global.{u8,u16,u32,u64} plus ld.global.{u8,s8,u16,s16,u32,u64,s64} roundtrips through the output buffer. |
| `DIV_DISABLE_GLOBAL_ATOMICS` | Per-thread atom.global.{add,exch,cas,inc,dec,min,max,and,or,xor} roundtrips through the output buffer. |
| `DIV_DISABLE_GLOBAL_ATOMIC_DEC` | atom.global.dec.u32 roundtrips only; other global atomics remain enabled. |
| `DIV_DISABLE_GLOBAL_ATOMIC_XOR` | atom.global.xor.b32 roundtrips only; other global atomics remain enabled. |
| `DIV_DISABLE_PREDICATED_GLOBAL_ATOMICS` | Predicated per-thread global atomic roundtrips. |
| `DIV_DISABLE_GLOBAL_REDUCTIONS` | Per-thread red.global.{add,inc,dec,min,max,and,or,xor} roundtrips through the output buffer. |
| `DIV_DISABLE_PREDICATED_GLOBAL_REDUCTIONS` | Predicated per-thread global reduction roundtrips. |
| `DIV_DISABLE_GENERIC_MEMORY` | Generic-address ld/st roundtrips through the output buffer, plus isspacep.global. |
| `DIV_DISABLE_PREDICATED_GENERIC_MEMORY` | Predicated generic-address ld/st roundtrips. |
| `DIV_DISABLE_CONST_MEMORY` | Bounded read-only ld.const.{u8,s8,u16,s16,u32,u64,s64} loads from a module-scope constant buffer. |
| `DIV_DISABLE_LOCAL_MEMORY` | Bounded private local-memory store/load roundtrips, including 64-bit forms. |
| `DIV_DISABLE_SHARED_MEMORY` | Race-free per-thread shared-memory store/load roundtrips, including 64-bit forms. |
| `DIV_DISABLE_SHARED_ATOMICS` | Per-thread atom.shared.{add,exch,cas,inc,dec,min,max,and,or,xor} roundtrips through a private shared-memory slot. |
| `DIV_DISABLE_PREDICATED_SHARED_ATOMICS` | Predicated per-thread shared atomic roundtrips. |
| `DIV_DISABLE_SHARED_REDUCTIONS` | Per-thread red.shared.{add,inc,dec,min,max,and,or,xor} roundtrips through a private shared-memory slot. |
| `DIV_DISABLE_PREDICATED_SHARED_REDUCTIONS` | Predicated per-thread shared reduction roundtrips. |
| `DIV_DISABLE_PREDICATED_MEMORY` | Predicated forms of bounded scalar/vector memory loads and store/load roundtrips. |
| `DIV_DISABLE_VECTOR_MEMORY` | Aligned v2/v4 u32 and v2.u64 memory loads and store/load roundtrips. |
| `DIV_DISABLE_WIDE_MEMORY` | Scalar 64-bit and v2.u64 memory loads and store/load roundtrips. |
| `DIV_DISABLE_MEMORY_CACHE_OPS` | Scalar/vector global-memory cache-policy variants such as ld.global.ca, ld.global.nc, and st.global.wt. |
| `DIV_DISABLE_VOLATILE_MEMORY` | Volatile scalar/vector global and shared memory loads and store/load roundtrips. |
| `DIV_DISABLE_BIT_MEMORY` | Scalar/vector memory load/store type suffixes such as .b8, .b16, .b32, and .b64. |
| `DIV_DISABLE_MEMORY_FENCES` | Uniform membar.{cta,gl,sys} and fence.{acq_rel,sc}.{cta,gpu,sys} memory-ordering instructions. |
| `DIV_DISABLE_WARP_BARRIERS` | Full-warp bar.warp.sync emitted before generated divergent control flow. |
| `DIV_DISABLE_WARP_COLLECTIVES` | Full-mask activemask, vote.sync, match.sync, elect.sync, shfl.sync, and redux.sync emitted before generated divergent control flow, including randomized uniform-prefix dataflow islands. |
| `DIV_DISABLE_CTA_BARRIERS` | Full-CTA bar.sync and barrier.sync emitted before generated divergent control flow. |
| `DIV_DISABLE_CTA_BARRIER_REDUCTIONS` | Full-CTA bar.red and barrier.red reductions emitted before generated divergent control flow. |
| `DIV_DISABLE_PREFETCH` | Bounded `prefetch.global.*` and prefetchu.L1 cache hints against the input buffer. |
| `DIV_DISABLE_CACHE_POLICY_HELPERS` | Deterministic prologue coverage for createpolicy, applypriority, cache-hint ld.global, ld.global.nc, st.global, and, when enabled, cache-hint global atomic/reduction forms. |
| `DIV_DISABLE_HELPER_CALLS` | Deterministic calls to a generated leaf .func in the entry prologue, plus randomized helper-call islands that re-invoke the available helper variants with call.uni/call spellings. |
| `DIV_DISABLE_RICH_HELPER_CALLS` | Deterministic prologue coverage for multi-argument, multi-return, mixed-width return, 64-bit/f32/f64 return, nested acyclic, explicit .param ABI, mixed register/.param, and predicate/select helper .func calls, plus randomized helper-call islands covering the same variants. |
| `DIV_DISABLE_F32_ARITH` | Sanitized add/sub/mul/div/fma/copysign/min/max.f32 arithmetic, including approximate f32 division, f32 .sat arithmetic, and .ftz min/max. |
| `DIV_DISABLE_F32_ROUNDING` | Sanitized non-default rounding and .ftz f32 add/sub/mul/div/fma arithmetic. |
| `DIV_DISABLE_F32_UNARY` | Sanitized abs/neg.f32, including .ftz forms. |
| `DIV_DISABLE_F32_CVT` | Sanitized signed/unsigned 32/64-bit f32/int, saturating f32-to-int, f64-to-f32, and .ftz conversion chains. |
| `DIV_DISABLE_F32_SPECIAL_MATH` | Sanitized rounded and .ftz f32 sqrt/rcp plus approx rcp, rsqrt, ex2, lg2, sin, and cos. |
| `DIV_DISABLE_F32_COMPARE` | Sanitized ordered/unordered `set.*.u32.f32` and `setp.*.f32` comparisons, including .ftz forms, plus `testp.*.f32` classification. |
| `DIV_DISABLE_F32_SELP` | Sanitized `setp.*.f32`, including .ftz forms, feeding selp.f32. |
| `DIV_DISABLE_F16_ARITH` | Deterministic prologue coverage and randomized in-body coverage for scalar .f16 and packed .f16x2 add/sub/mul/fma/min/max/abs/neg, including predicated forms. |
| `DIV_DISABLE_F16_COMPARE` | Deterministic prologue and randomized in-body coverage for scalar .f16 and packed .f16x2 ordered/unordered set, setp, .and/.or/.xor setp forms, and predicate-fed selp forms. |
| `DIV_DISABLE_F16_CVT` | Deterministic prologue coverage and randomized in-body coverage for .f16 conversion chains through f32, f64, integer, and packed .f16x2 forms, including predicated forms. |
| `DIV_DISABLE_BF16_TF32_CVT` | Deterministic prologue coverage and randomized in-body coverage for .bf16, .bf16x2, and .tf32 conversion forms, including predicated forms. |
| `DIV_DISABLE_F64_ARITH` | Sanitized add/sub/mul/div/fma/copysign/min/max.f64 arithmetic. |
| `DIV_DISABLE_F64_ROUNDING` | Sanitized .rz/.rm/.rp f64 add/sub/mul/div/fma arithmetic. |
| `DIV_DISABLE_F64_UNARY` | Sanitized abs.f64 and neg.f64. |
| `DIV_DISABLE_F64_CVT` | Sanitized signed/unsigned 32/64-bit f64/int, saturating f64-to-int, and f32-to-f64 conversion chains. |
| `DIV_DISABLE_F64_SPECIAL_MATH` | Sanitized rounded f64 sqrt and rcp. |
| `DIV_DISABLE_F64_COMPARE` | Sanitized ordered/unordered `set.*.u32.f64` and `setp.*.f64` comparisons plus `testp.*.f64` classification. |
| `DIV_DISABLE_F64_SELP` | Sanitized `setp.*.f64` feeding selp.f64. |
| `DIV_DISABLE_SIGNED_CMP` | Signed predicate comparisons. |
| `DIV_DISABLE_SIGNED_DIVREM` | div.s32 and rem.s32. |
| `DIV_DISABLE_REG_DIVREM` | Register-divisor div.u32 and rem.u32 with sanitized divisors. |
| `DIV_DISABLE_PREDICATED_REG_DIVREM` | Predicated register-divisor div.u32 and rem.u32. |
| `DIV_DISABLE_PREDICATED_DIVREM` | Predicated div and rem instructions. |
| `DIV_DISABLE_BRANCH_TABLES` | Bounded prologue and randomized in-body brx.idx branch tables with four rejoining targets. |
| `DIV_DISABLE_FUNNEL` | shf.{l,r}.{wrap,clamp}.b32. |
| `DIV_DISABLE_REG_FUNNEL` | Register-count shf.{l,r}.{wrap,clamp}.b32. |
| `DIV_DISABLE_PREDICATED_FUNNEL` | Predicated shf.{l,r}.{wrap,clamp}.b32. |
| `DIV_DISABLE_FUNNEL_CLAMP` | shf.l.clamp.b32 and shf.r.clamp.b32. |
| `DIV_DISABLE_NEG` | neg.s32. |
| `DIV_DISABLE_SHL` | shl.b32. |
| `DIV_DISABLE_SHR` | shr.u32. |
| `DIV_DISABLE_SIGNED_SHR` | shr.s32. |
| `DIV_DISABLE_REG_SHIFTS` | Masked register-count shifts. |
| `DIV_DISABLE_PREDICATED_SHIFTS` | Predicated immediate-count shifts. |
| `DIV_DISABLE_PREDICATED_REG_SHIFTS` | Predicated masked register-count shifts. |
| `DIV_DISABLE_BFIND` | bfind and bfind.shiftamt instructions. |
| `DIV_DISABLE_SIGNED_BFIND` | bfind.s32 and bfind.shiftamt.s32. |
| `DIV_DISABLE_WIDE_BFIND` | 64-bit-source bfind and bfind.shiftamt instructions. |
| `DIV_DISABLE_SIGNED_WIDE_BFIND` | bfind.s64 and bfind.shiftamt.s64. |
| `DIV_DISABLE_PREDICATED_BFIND` | Predicated bfind and bfind.shiftamt instructions. |
| `DIV_DISABLE_PREDICATED_WIDE_BFIND` | Predicated 64-bit-source bfind and bfind.shiftamt instructions. |
| `DIV_DISABLE_FNS` | fns.b32. |
| `DIV_DISABLE_REG_FNS` | fns.b32 with a sanitized register base or offset operand. |
| `DIV_DISABLE_PREDICATED_FNS` | Predicated fns.b32 instructions. |
| `DIV_DISABLE_PREDICATED_REG_FNS` | Predicated fns.b32 instructions with a sanitized register base or offset operand. |
| `DIV_DISABLE_BFI` | bfi.b32. |
| `DIV_DISABLE_BFE` | bfe.{u32,s32}. |
| `DIV_DISABLE_BMSK` | bmsk.{clamp,wrap}.b32. |
| `DIV_DISABLE_BMSK_WRAP` | bmsk.wrap.b32. |
| `DIV_DISABLE_PREDICATED_BITFIELD` | Predicated bfe, bfi, and bmsk instructions. |
| `DIV_DISABLE_REG_BITFIELD` | Register pos/len operands for bfe, bfi, and bmsk. |
| `DIV_DISABLE_PREDICATED_REG_BITFIELD` | Predicated bfe, bfi, and bmsk instructions with register pos/len operands. |
| `DIV_DISABLE_WIDE_BFE` | 64-bit scratch-register bfe.{u64,s64} instructions. |
| `DIV_DISABLE_SIGNED_WIDE_BFE` | 64-bit scratch-register bfe.s64 instructions. |
| `DIV_DISABLE_WIDE_BFI` | 64-bit scratch-register bfi.b64 instructions. |
| `DIV_DISABLE_PREDICATED_WIDE_BITFIELD` | Predicated 64-bit scratch-register bfe and bfi instructions. |
| `DIV_DISABLE_REG_WIDE_BITFIELD` | Sanitized register pos/len operands for 64-bit scratch-register bfe and bfi. |
| `DIV_DISABLE_PREDICATED_REG_WIDE_BITFIELD` | Predicated 64-bit scratch-register bfe and bfi instructions with register pos/len operands. |
DIV_DISABLE_MAD24 |
mad24.lo.u32 and mad24.hi.u32. |
DIV_DISABLE_MUL24 |
mul24.{lo,hi}.{u32,s32}. |
DIV_DISABLE_PREDICATED_24BIT |
Predicated mad24 and mul24 instructions. |
DIV_DISABLE_SUBWORD_WIDE |
16-bit-source mul.wide and mad.wide through .b16 scratch registers. |
DIV_DISABLE_SIGNED_SUBWORD_WIDE |
Signed 16-bit-source mul.wide.s16 and mad.wide.s16. |
DIV_DISABLE_PREDICATED_SUBWORD_WIDE |
Predicated 16-bit-source mul.wide and mad.wide instructions. |
DIV_DISABLE_MUL_WIDE |
mul.wide.{u32,s32}. |
DIV_DISABLE_PREDICATED_MUL_WIDE |
Predicated mul.wide.{u32,s32} instructions. |
DIV_DISABLE_MAD_WIDE |
mad.wide.{u32,s32}. |
DIV_DISABLE_SIGNED_MAD_WIDE |
mad.wide.s32. |
DIV_DISABLE_PREDICATED_MAD_WIDE |
Predicated mad.wide.{u32,s32} instructions. |
DIV_DISABLE_WIDE_HIGH_RESULT |
High-half extraction from mul.wide and mad.wide results. |
DIV_DISABLE_WIDE_INT |
64-bit scratch-register ALU generation. |
DIV_DISABLE_WIDE_MINMAX |
64-bit scratch-register min/max.{u64,s64} instructions. |
DIV_DISABLE_WIDE_MULHI |
64-bit scratch-register mul.hi.{u64,s64} instructions. |
DIV_DISABLE_PREDICATED_WIDE_INT |
Predicated 64-bit scratch-register ALU generation. |
DIV_DISABLE_WIDE_MAD64 |
64-bit operand mad.{lo,hi}.{u64,s64} instructions. |
DIV_DISABLE_SIGNED_WIDE_MAD64 |
64-bit operand mad.{lo,hi}.s64 instructions. |
DIV_DISABLE_PREDICATED_WIDE_MAD64 |
Predicated 64-bit operand mad instructions. |
DIV_DISABLE_WIDE_SET |
64-bit scratch-register set.{cmp}.u32.{u64,s64} materialization. |
DIV_DISABLE_PREDICATED_WIDE_SET |
Predicated 64-bit scratch-register set materialization. |
DIV_DISABLE_WIDE_SETP |
64-bit scratch-register setp-fed guarded ALU instructions. |
DIV_DISABLE_WIDE_SETP_BOOL |
64-bit scratch-register setp.<cmp>.{and,or,xor} predicate-combiner instructions. |
DIV_DISABLE_WIDE_SELP |
64-bit scratch-register selp.{b64,u64,s64} instructions. |
DIV_DISABLE_WIDE_UNARY |
64-bit scratch-register not, cnot, popc, clz, brev, neg, and abs instructions. |
DIV_DISABLE_SIGNED_WIDE_UNARY |
neg.s64 and abs.s64. |
DIV_DISABLE_PREDICATED_WIDE_UNARY |
Predicated 64-bit scratch-register unary instructions. |
DIV_DISABLE_WIDE_SHIFTS |
64-bit scratch-register shifts. |
DIV_DISABLE_WIDE_REG_SHIFTS |
Masked register-count 64-bit scratch-register shifts. |
DIV_DISABLE_PREDICATED_WIDE_SHIFTS |
Predicated 64-bit scratch-register shifts. |
DIV_DISABLE_PREDICATED_WIDE_REG_SHIFTS |
Predicated masked register-count 64-bit scratch-register shifts. |
DIV_DISABLE_WIDE_DIVREM |
64-bit scratch-register div/rem.{u64,s64} instructions. |
DIV_DISABLE_SIGNED_WIDE_DIVREM |
64-bit scratch-register div/rem.s64 instructions. |
DIV_DISABLE_REG_WIDE_DIVREM |
Register-divisor 64-bit scratch-register div/rem.{u64,s64} instructions with sanitized divisors. |
DIV_DISABLE_PREDICATED_REG_WIDE_DIVREM |
Predicated register-divisor 64-bit scratch-register div/rem instructions. |
DIV_DISABLE_PREDICATED_WIDE_DIVREM |
Predicated 64-bit scratch-register div/rem instructions. |
DIV_DISABLE_WIDE_ADDC |
64-bit scratch-register add.cc.u64 / addc.u64 pairs. |
DIV_DISABLE_WIDE_SUBC |
64-bit scratch-register sub.cc.u64 / subc.u64 pairs. |
DIV_DISABLE_PREDICATED_WIDE_CARRY |
Predicated 64-bit scratch-register carry pairs. |
DIV_DISABLE_WIDE_CARRY_CHAIN |
Three-instruction 64-bit scratch-register carry chains using addc.cc.u64 or subc.cc.u64. |
DIV_DISABLE_PREDICATED_WIDE_CARRY_CHAIN |
Predicated three-instruction 64-bit scratch-register carry chains. |
DIV_DISABLE_ADDC |
add.cc.u32 / addc.u32 pairs. |
DIV_DISABLE_SUBC |
sub.cc.u32 / subc.u32 pairs. |
DIV_DISABLE_PREDICATED_CARRY |
Predicated add.cc / addc and sub.cc / subc pairs. |
DIV_DISABLE_CARRY_CHAIN |
Three-instruction add/sub.cc plus addc/subc.cc carry chains. |
DIV_DISABLE_PREDICATED_CARRY_CHAIN |
Predicated three-instruction add/sub carry chains. |
DIV_DISABLE_I32_BOUNDARY_IMMS |
Immediate 0x7fffffff / 0x80000000 generation. |
DIV_DISABLE_DP4A |
dp4a.{u32,s32}.{u32,s32}. |
DIV_DISABLE_DP2A |
dp2a.{lo,hi}.{u32,s32}.{u32,s32}. |
DIV_DISABLE_NEGATED_PREDICATES |
Negated @!%p instruction predicates. |
DIV_DISABLE_PREDICATED_ALU |
Predicated integer ALU and floating-point arithmetic instructions. |
DIV_DISABLE_PREDICATED_UNARY |
Predicated integer unary, floating-point unary, and floating-point special-math instructions. |
DIV_DISABLE_CVT |
Direct base cvt.{u32,s32}.{u8,u16,s8,s16} instructions; narrow and wide round-trips have separate flags. |
DIV_DISABLE_PREDICATED_CVT |
Predicated integer and floating-point cvt instructions. |
DIV_DISABLE_NARROW_CVT |
Narrow cvt round-trips through 8/16-bit destination types. |
DIV_DISABLE_SIGNED_NARROW_CVT |
Signed narrow cvt round-trips. |
DIV_DISABLE_PREDICATED_NARROW_CVT |
Predicated narrow cvt round-trips. |
DIV_DISABLE_WIDE_CVT |
64-bit-source cvt round-trips. |
DIV_DISABLE_SIGNED_WIDE_CVT |
Signed 64-bit-source cvt round-trips. |
DIV_DISABLE_PREDICATED_WIDE_CVT |
Predicated 64-bit-source cvt round-trips. |
DIV_DISABLE_CVT_PACK |
Deterministic prologue and randomized in-body coverage for saturating cvt.pack integer packing, including predicated forms. |
DIV_DISABLE_SZEXT |
szext.{wrap,clamp}.{u32,s32}. |
DIV_DISABLE_SIGNED_SZEXT |
szext.*.s32. |
DIV_DISABLE_PREDICATED_SZEXT |
Predicated szext instructions. |
DIV_DISABLE_SETP_BOOL |
Integer and floating-point setp.<cmp>.{and,or,xor} predicate-combiner instructions. |
DIV_DISABLE_SETP_DUAL |
Dual-destination setp.<cmp> forms that write two predicate registers. |
DIV_DISABLE_PRED_LOGIC |
and.pred, or.pred, xor.pred, and not.pred. |
DIV_DISABLE_PREDICATED_MAD |
Predicated mad.lo.{u32,s32} instructions. |
DIV_DISABLE_PREDICATED_MAD_HI |
Predicated mad.hi.{u32,s32} instructions. |
DIV_DISABLE_MAD_CARRY |
Three-instruction mad.cc / madc.cc / madc carry chains. |
DIV_DISABLE_SIGNED_MAD_CARRY |
Signed mad.cc / madc.cc / madc carry chains. |
DIV_DISABLE_PREDICATED_MAD_CARRY |
Predicated mad.cc / madc.cc / madc carry chains. |
DIV_DISABLE_PREDICATED_SET |
Predicated integer and floating-point set, setp, and testp instructions. |
DIV_DISABLE_PREDICATED_SELP |
Instruction-predicated selp.{b32,u32,s32}, selp.f32, and selp.f64 instructions. |
DIV_DISABLE_SAD |
sad.{u32,s32}. |
DIV_DISABLE_SLCT |
slct.{u32,s32,b32}.s32. |
DIV_DISABLE_PREDICATED_SAD |
Predicated sad.{u32,s32} instructions. |
DIV_DISABLE_PREDICATED_SLCT |
Predicated slct instructions. |
DIV_DISABLE_PREDICATED_DP |
Predicated dp4a and dp2a instructions. |
DIV_DISABLE_PREDICATED_VIDEO |
Predicated video instructions. |
DIV_DISABLE_SET |
set.{cmp}.u32.{u32,s32}. |
DIV_DISABLE_S32_SLCT |
slct.s32.*. |
DIV_DISABLE_F32_SLCT |
slct forms with f32 result or selector. |
DIV_DISABLE_WIDE_SLCT |
slct forms with u64, s64, or b64 result. |
DIV_DISABLE_F64_SLCT |
slct.f64.*. |
DIV_DISABLE_VIDEO |
PTX video instructions. |
DIV_DISABLE_SIGNED_VIDEO |
Signed/mixed s32 PTX video type suffixes. |
DIV_DISABLE_VIDEO_SAT |
PTX video .sat forms. |
DIV_DISABLE_VSUB4 |
vsub4.* video forms. |
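
As a rough illustration of how flags like these could gate program generation, here is a minimal sketch. It is not FuzzX's actual code: the idiom names, the `IDIOM_FLAGS` mapping, and the rule that a flag counts as set when its value is non-empty and not `0` are all assumptions made for the example. Only the `DIV_DISABLE_*` names come from the table above.

```python
import random

# Hypothetical mapping from a generator idiom to the flags that suppress it.
# Assumption: a predicated idiom is suppressed by either its base flag or
# its predicated flag. The real fuzzer may wire these up differently.
IDIOM_FLAGS = {
    "dp4a": ["DIV_DISABLE_DP4A"],
    "predicated_dp4a": ["DIV_DISABLE_DP4A", "DIV_DISABLE_PREDICATED_DP"],
    "sad": ["DIV_DISABLE_SAD"],
    "predicated_sad": ["DIV_DISABLE_SAD", "DIV_DISABLE_PREDICATED_SAD"],
}


def flag_set(env, name):
    """Assumed convention: any value other than empty or "0" sets the flag."""
    return env.get(name, "") not in ("", "0")


def enabled_idioms(env):
    """Return the idioms that no flag in `env` disables, sorted by name."""
    return sorted(
        idiom
        for idiom, flags in IDIOM_FLAGS.items()
        if not any(flag_set(env, flag) for flag in flags)
    )


def pick_idiom(env, rng):
    """Pick one enabled idiom at random, or None if all are disabled."""
    choices = enabled_idioms(env)
    return rng.choice(choices) if choices else None
```

Passing `os.environ` as `env` would read the flags from the process environment. Taking the mapping as a parameter keeps the sketch easy to test.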

| Variable | Default | Meaning |
|---|---|---|
| `REDUCE_GPUS` | `DIV_GPUS`, then all visible devices | CUDA devices used by `fuzzx-diff-reduce`. |
| `REDUCE_WORKERS_PER_GPU` | `DIV_WORKERS_PER_GPU`, then host-core based default capped at 16 | Reducer worker count per GPU. |
| `REDUCE_NO_PROGRESS_SECS` | `120` | Reducer timeout when no candidate completes. |
| `DIV_HANG_SECS` | `4` | `fuzzx-diff-sweep` no-progress threshold before reporting hangs. |

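
The fallback order in the Default column can be read as the sketch below. This is an illustration under stated assumptions, not the reducer's real code: the function name, the comma-separated device-list format, and the exact host-core formula are guesses. The README only says the worker default is "host-core based" and "capped at 16".

```python
def resolve_reducer_settings(env, visible_gpus, host_cores):
    """Hypothetical helper that mirrors the documented fallback order."""
    # REDUCE_GPUS wins, then DIV_GPUS, then every visible device.
    raw_gpus = env.get("REDUCE_GPUS") or env.get("DIV_GPUS")
    if raw_gpus:
        # Assumption: the value is a comma-separated list of device indices.
        gpus = [int(tok) for tok in raw_gpus.split(",") if tok.strip()]
    else:
        gpus = list(visible_gpus)

    # REDUCE_WORKERS_PER_GPU wins, then DIV_WORKERS_PER_GPU, then a default.
    raw_workers = env.get("REDUCE_WORKERS_PER_GPU") or env.get("DIV_WORKERS_PER_GPU")
    if raw_workers:
        workers = int(raw_workers)
    else:
        # Assumption: cores are split evenly across GPUs, then capped at 16.
        workers = max(1, min(16, host_cores // max(1, len(gpus))))

    return {
        "gpus": gpus,
        "workers_per_gpu": workers,
        # Documented default of 120 seconds.
        "no_progress_secs": int(env.get("REDUCE_NO_PROGRESS_SECS", "120")),
    }
```

For example, with no variables set, two visible devices, and eight host cores, this sketch selects both devices, four workers per GPU, and a 120-second no-progress timeout.
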
FuzzX is licensed under the Apache License, Version 2.0. See LICENSE.