Skip to content

gfx1250: address review findings — correctness fixes, safety checks, ISA improvements#58

Open
AviralGoelAMD wants to merge 36 commits into
HazyResearch:neoblizz/gfx1250from
AviralGoelAMD:users/avirgoel/ck/review-fixes
Open

gfx1250: address review findings — correctness fixes, safety checks, ISA improvements#58
AviralGoelAMD wants to merge 36 commits into
HazyResearch:neoblizz/gfx1250from
AviralGoelAMD:users/avirgoel/ck/review-fixes

Conversation

@AviralGoelAMD
Copy link
Copy Markdown
Contributor

@AviralGoelAMD AviralGoelAMD commented May 23, 2026

Summary

Systematic review addressing for the gfx1250 HipKittens PR. 48 items addressed across 36 commits.

Correctness fixes (P1):

  • RC-01: cluster::sync() barrier ID -1 → -3 (cluster-wide)
  • RC-02: mfma323232 const C aliasing — use local accumulator (verified on gfx950 MI350X)
  • RC-03: init_barrier count validation assert
  • RC-04: packed_shfl_down width=64 → WARP_THREADS (verified on the gfx1250 simulator: sum 1056→528)
  • RC-05: Slab race analysis — no race (verified by barrier ordering trace)

Safety checks (P2):

  • SR-01: M/N/K alignment asserts in grid()
  • SR-02: shared_allocator overflow via assert (device-side assert verified on gfx950)
  • SR-03: Per-segment watermarks for allocate_in (verified on the gfx1250 simulator)
  • SR-05: static_assert 16-byte pad alignment for ds_load_b128
  • SR-07: laneid()==0 guard moved into async_barrier_arrive
  • SR-08: D# barrier address 16-bit range assert
  • SR-09: static_assert pad interval power-of-2 for D# encoding
  • Plus SR-04/06/10-14 (documentation and minor fixes)

ISA improvements (P4):

  • IV-01: Skip last-iteration sync/wait_async in all 7 kernels
  • IV-03: Move arrive() before async loads (s_barrier_signal 24 instructions earlier)
  • IV-02: Documented why expert mode scope must be wide (compiler reorders s_setreg)

Missing features (P3):

  • MF-01: Named barrier wrappers (IDs 1-16)
  • MF-02: SCHED_MODE bit[4] DISABLE_VALU_STALL + 5-bit HWREG
  • MF-08: s_wakeup and s_sleep_var wrappers
  • MF-09: WMMA co-execution documentation

Code quality (P5):

  • CQ-01: -w → -Wall with targeted suppressions (0 warnings)
  • CQ-04/06: bf16 round-trip for CPU reference + K-scaled tolerance
  • CQ-10: Trait-based st_shape concept for auto-matching padded configs
  • Plus CQ-02/03/05/07/11-14/16-21/23

Verification:

  • All 8 kernels build with 0 warnings on gfx1250
  • 7/7 runnable kernels pass correctness on the gfx1250 simulator (TDM skipped — known simulator limitation)
  • RC-02 verified on gfx950 MI350X hardware
  • RC-04 verified on the gfx1250 simulator
  • SR-03 interleaved segment allocation verified on the gfx1250 simulator
  • ISA dumps confirm barrier and scheduling changes

…f WG barrier (-1)

s_barrier_signal/wait -1 only syncs waves within a single workgroup.
Cluster-wide CGA sync requires barrier ID -3 (cluster user barrier)
per MI400 ISA Section 4.3.6. Verified via ISA dump.
First MFMA was writing result into C (stripping const via cast), then
second MFMA read the modified C. Use a local accumulator instead so C
is never mutated. Compiler optimizes it away when D aliases C.
Bug confirmed on gfx950 MI350X: C corrupted from 42.0 to 58.0.
count==0 wraps to pending=0xFFFF causing silent hang.
count>65535 wraps silently. Assert catches both at runtime.
Hardcoded width=64 causes ds_bpermute to wrap around on wave-32,
reading from lane (L+delta)%32 instead of returning identity. This
doubles every value in tree reductions (sum=1056 instead of 528).
Verified on the gfx1250 simulator.
Non-aligned dimensions silently produce wrong results: grid()
truncates via integer division, store_acc writes partial tiles.
Assert fires for M=100 on the gfx1250 simulator, passes for M=64.
All three allocate variants now assert ptr stays within bounds.
Regular allocate() checks against MAX_SHARED_MEMORY. Segmented
allocate_in() checks against the segment's end boundary.
Unified base pointer storage across UDNA1 and CDNA paths.

Uses device-side assert() which prints file:line and traps. Verified
on gfx950 MI350X: pointer-comparison assert fires correctly when
overflow condition is runtime-determined.
allocate_in() used a single ptr for all segments. Allocating into
segment 0 after segment 1 would silently place data in segment 1
because ptr can only advance forward. Now each segment tracks its
own cursor via seg_ptr[LDS_NUM_SEGMENTS], allowing interleaved
segment allocations. Verified on the gfx1250 simulator: gemm_segment produces
correct results.
load_b32 does not apply padding offsets (unlike load_b128 which takes
a Pad template param). Adding Pad support was attempted but blocked
by include ordering (lds_nopad defined in global_to_shared.cuh which
is included after shared_to_register.cuh). Added comment directing
users to load_b128<Pad> for padded layouts.
ds_load_b128 requires 16-byte aligned LDS addresses. A custom
lds_padded<N, M> where M * sizeof(bf16) is not a multiple of 16
would silently misalign loads. static_assert catches this at
compile time.
No bounds clamping on DRAM reads — out-of-bounds tile coordinates
read from invalid memory silently. Added doc comment to load() and
load_async() noting caller must ensure alignment. Enforced at
dispatch level by SR-01's M/N/K alignment asserts.
DS_ATOMIC_ASYNC_BARRIER_ARRIVE_B64 fires per active lane. Without
a lane guard, one wave produces 32 arrivals instead of 1, causing
the barrier to flip prematurely. Moved the guard into the function
so callers can't forget it. Removed redundant guards from all 4
call sites in gemm_tdm_arrive.
The TDM D# descriptor's atomic_barrier_address field is 16 bits,
limiting barrier cells to the first 64KB of LDS. Added assert at
the point of truncation in build_tdm_d_2d() so any kernel passing
an out-of-range barrier address fails immediately instead of
silently arriving at the wrong LDS location.
__builtin_ctz encodes pad_interval as log2 for the TDM D# descriptor.
Non-power-of-2 intervals produce wrong encoding (ctz returns lowest
set bit, not log2). static_assert catches this at compile time.
Boundary workgroup at limit should be in the partial tail (identity
mapping), not enter the full-block remapping. Change > to >=.
Cosmetic on all tested parameters (boundary always maps to itself),
but matches the documented intent. No-op on gfx1250 (NUM_XCDS=1).
SR-11: Added bank conflict derivation for lds_pad_default<128,8>.
SR-12: Added comment that Pad type must match between load_async
(write) and load_b128 (read) — mismatch causes silent corruption.
SR-13: Added static_assert that double-buffered A and B each fit in
one LDS segment (64KB). Catches oversized tile configs at compile time.
SR-14: Added WARNING comment to wait_barrier documenting infinite loop
risk if arrive never fires. Timeout deferred — can't test on the gfx1250 simulator.
The final loop iteration syncs to protect a next-iteration load that
never happens. Guard with if(k+1 < k_iters) — uniform scalar branch,
no divergence. ISA confirms the barrier is conditionally skipped via
s_cbranch. All 7 kernels pass correctness on the gfx1250 simulator.
Attempted to scope RAII guard tightly around mma_ABt_burst. ISA showed
compiler reordering scalar s_setreg past VALU WMMA instructions,
making the expert mode window empty. Wide scope is correct — ensures
s_setreg reset stays after all WMMAs. Added comment.
arrive() signals completion of previous iteration's LDS reads. Async
loads write to nxt slab (different from cur), so signaling before them
is safe. ISA confirms s_barrier_signal moved 24 instructions earlier,
allowing other waves to unblock from wait() sooner. All 3 kernels
pass correctness on the gfx1250 simulator.
-w hid all compiler warnings. Now using -Wall with specific -Wno
flags for upstream framework warnings (unused-variable, unused-local-
typedef, duplicate-decl-specifier, unused-value, pass-failed). All 8
kernels build with 0 warnings.
CQ-01: Replace -w with -Wall + targeted -Wno suppressions. 0 warnings.
CQ-02: Add HK_HEADERS wildcard dep so header changes trigger rebuild.
CQ-18: Change `all` target from $(BIN) to `ladder` (builds all 8).
CQ-03: Use __FILE__ instead of hardcoded "gemm_naive" in printf.
CQ-05: Add HIP_OK(hipGetLastError()) after warmup dispatch to catch
silent launch failures (e.g., invalid shared memory size).
CQ-07: MASK_ALL now 0xFFFFFFFF on gfx1250 (wave-32), 0xFFFFFFFFFFFFFFFF
on CDNA (wave-64).
CQ-11: Removed commented-out static_asserts in shared_allocator.
Two separate asm volatile blocks allowed compiler to theoretically
reorder them. Single block ensures both loads are emitted together.
CQ-04: Round-trip CPU inputs through bf16 before reference computation
so CPU and GPU see identical inputs. Drops max_abs_err 0.0368 → 0.0311.
CQ-06: Replace fixed tolerance (1.0) with K-scaled formula:
tol = 2 * sqrt(K) * 2^-7. Require zero bad elements. Passes K=32
(tol=0.088, err=0.031) and K=256 (tol=0.250, err=0.123).
make isa KERNEL=gemm_naive produces .s file via --save-temps.
Identical offset computation was duplicated between load_b128 and
load_b32. Extracted into detail::gfx1250_lane_offset(sub_id, row, half).
Integer division by sub_elems and SUB_COLS compiles to shifts only
when these are power-of-2. Added static_asserts to catch non-power-of-2
subtile configs that would emit expensive division.
…gment docs

CQ-10: Replace manual st_16x32_padded<> enumeration in UDNA1 concept
with template specialization (is_st_16x32_padded_inst). Any padded
config auto-satisfies the concept without manual addition. Keeps the
closed whitelist — only named shapes + padded instantiations pass.
CDNA concept left unchanged.
CQ-20: Add comment that st_16x32_padded::swizzle() applies padding,
not XOR swizzle (named for API compat).
CQ-21: Document segment 0 LDS waste tradeoff in gemm_segment.
Added limited_nostall mode (limited + bit[4]) to sched::mode enum.
Widened HWREG encoding from 2 to 5 bits to cover the full SCHED_MODE
field. Applied to gemm_expert kernel — ISA confirms s_setreg value
changed from 2 to 18. Correctness verified on the gfx1250 simulator.
sched::wakeup() wakes all sleeping waves in the workgroup — useful for
waking consumers after producer finishes. sched::sleep_var() sleeps for
a runtime-variable duration (SGPR * 64 cycles). Both use inline asm
since no builtins exist in ROCm 7.x.
Each WMMA takes 16 cycles; the SIMD can issue up to 8 independent VALU
ops for free during this time. Added documentation near compiler_fence().
Added init_named(), join<ID>(), signal<ID>(), wait_named<ID>(), and
wakeup_barrier<ID>(). s_barrier_leave is not available on gfx1250
(assembler rejects it). Named barriers require dispatch-time allocation
via HIP launch attributes — wrappers handle the device-side instructions
only. Verified: all wrappers compile and emit correct ISA on gfx1250.
@AviralGoelAMD AviralGoelAMD force-pushed the users/avirgoel/ck/review-fixes branch from 582ff2f to aae0401 Compare May 26, 2026 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant