gfx1250: address review findings — correctness fixes, safety checks, ISA improvements by AviralGoelAMD · Pull Request #58 · HazyResearch/HipKittens

AviralGoelAMD · 2026-05-23T23:25:20Z

Summary

Systematic review addressing for the gfx1250 HipKittens PR. 48 items addressed across 36 commits.

Correctness fixes (P1):

RC-01: cluster::sync() barrier ID -1 → -3 (cluster-wide)
RC-02: mfma323232 const C aliasing — use local accumulator (verified on gfx950 MI350X)
RC-03: init_barrier count validation assert
RC-04: packed_shfl_down width=64 → WARP_THREADS (verified on the gfx1250 simulator: sum 1056→528)
RC-05: Slab race analysis — no race (verified by barrier ordering trace)

Safety checks (P2):

SR-01: M/N/K alignment asserts in grid()
SR-02: shared_allocator overflow via assert (device-side assert verified on gfx950)
SR-03: Per-segment watermarks for allocate_in (verified on the gfx1250 simulator)
SR-05: static_assert 16-byte pad alignment for ds_load_b128
SR-07: laneid()==0 guard moved into async_barrier_arrive
SR-08: D# barrier address 16-bit range assert
SR-09: static_assert pad interval power-of-2 for D# encoding
Plus SR-04/06/10-14 (documentation and minor fixes)

ISA improvements (P4):

IV-01: Skip last-iteration sync/wait_async in all 7 kernels
IV-03: Move arrive() before async loads (s_barrier_signal 24 instructions earlier)
IV-02: Documented why expert mode scope must be wide (compiler reorders s_setreg)

Missing features (P3):

MF-01: Named barrier wrappers (IDs 1-16)
MF-02: SCHED_MODE bit[4] DISABLE_VALU_STALL + 5-bit HWREG
MF-08: s_wakeup and s_sleep_var wrappers
MF-09: WMMA co-execution documentation

Code quality (P5):

CQ-01: -w → -Wall with targeted suppressions (0 warnings)
CQ-04/06: bf16 round-trip for CPU reference + K-scaled tolerance
CQ-10: Trait-based st_shape concept for auto-matching padded configs
Plus CQ-02/03/05/07/11-14/16-21/23

Verification:

All 8 kernels build with 0 warnings on gfx1250
7/7 runnable kernels pass correctness on the gfx1250 simulator (TDM skipped — known simulator limitation)
RC-02 verified on gfx950 MI350X hardware
RC-04 verified on the gfx1250 simulator
SR-03 interleaved segment allocation verified on the gfx1250 simulator
ISA dumps confirm barrier and scheduling changes

…f WG barrier (-1) s_barrier_signal/wait -1 only syncs waves within a single workgroup. Cluster-wide CGA sync requires barrier ID -3 (cluster user barrier) per MI400 ISA Section 4.3.6. Verified via ISA dump.

First MFMA was writing result into C (stripping const via cast), then second MFMA read the modified C. Use a local accumulator instead so C is never mutated. Compiler optimizes it away when D aliases C. Bug confirmed on gfx950 MI350X: C corrupted from 42.0 to 58.0.

count==0 wraps to pending=0xFFFF causing silent hang. count>65535 wraps silently. Assert catches both at runtime.

Hardcoded width=64 causes ds_bpermute to wrap around on wave-32, reading from lane (L+delta)%32 instead of returning identity. This doubles every value in tree reductions (sum=1056 instead of 528). Verified on the gfx1250 simulator.

Non-aligned dimensions silently produce wrong results: grid() truncates via integer division, store_acc writes partial tiles. Assert fires for M=100 on the gfx1250 simulator, passes for M=64.

All three allocate variants now assert ptr stays within bounds. Regular allocate() checks against MAX_SHARED_MEMORY. Segmented allocate_in() checks against the segment's end boundary. Unified base pointer storage across UDNA1 and CDNA paths. Uses device-side assert() which prints file:line and traps. Verified on gfx950 MI350X: pointer-comparison assert fires correctly when overflow condition is runtime-determined.

allocate_in() used a single ptr for all segments. Allocating into segment 0 after segment 1 would silently place data in segment 1 because ptr can only advance forward. Now each segment tracks its own cursor via seg_ptr[LDS_NUM_SEGMENTS], allowing interleaved segment allocations. Verified on the gfx1250 simulator: gemm_segment produces correct results.

load_b32 does not apply padding offsets (unlike load_b128 which takes a Pad template param). Adding Pad support was attempted but blocked by include ordering (lds_nopad defined in global_to_shared.cuh which is included after shared_to_register.cuh). Added comment directing users to load_b128<Pad> for padded layouts.

ds_load_b128 requires 16-byte aligned LDS addresses. A custom lds_padded<N, M> where M * sizeof(bf16) is not a multiple of 16 would silently misalign loads. static_assert catches this at compile time.

No bounds clamping on DRAM reads — out-of-bounds tile coordinates read from invalid memory silently. Added doc comment to load() and load_async() noting caller must ensure alignment. Enforced at dispatch level by SR-01's M/N/K alignment asserts.

DS_ATOMIC_ASYNC_BARRIER_ARRIVE_B64 fires per active lane. Without a lane guard, one wave produces 32 arrivals instead of 1, causing the barrier to flip prematurely. Moved the guard into the function so callers can't forget it. Removed redundant guards from all 4 call sites in gemm_tdm_arrive.

The TDM D# descriptor's atomic_barrier_address field is 16 bits, limiting barrier cells to the first 64KB of LDS. Added assert at the point of truncation in build_tdm_d_2d() so any kernel passing an out-of-range barrier address fails immediately instead of silently arriving at the wrong LDS location.

__builtin_ctz encodes pad_interval as log2 for the TDM D# descriptor. Non-power-of-2 intervals produce wrong encoding (ctz returns lowest set bit, not log2). static_assert catches this at compile time.

Boundary workgroup at limit should be in the partial tail (identity mapping), not enter the full-block remapping. Change > to >=. Cosmetic on all tested parameters (boundary always maps to itself), but matches the documented intent. No-op on gfx1250 (NUM_XCDS=1).

SR-11: Added bank conflict derivation for lds_pad_default<128,8>. SR-12: Added comment that Pad type must match between load_async (write) and load_b128 (read) — mismatch causes silent corruption.

SR-13: Added static_assert that double-buffered A and B each fit in one LDS segment (64KB). Catches oversized tile configs at compile time. SR-14: Added WARNING comment to wait_barrier documenting infinite loop risk if arrive never fires. Timeout deferred — can't test on the gfx1250 simulator.

The final loop iteration syncs to protect a next-iteration load that never happens. Guard with if(k+1 < k_iters) — uniform scalar branch, no divergence. ISA confirms the barrier is conditionally skipped via s_cbranch. All 7 kernels pass correctness on the gfx1250 simulator.

Attempted to scope RAII guard tightly around mma_ABt_burst. ISA showed compiler reordering scalar s_setreg past VALU WMMA instructions, making the expert mode window empty. Wide scope is correct — ensures s_setreg reset stays after all WMMAs. Added comment.

arrive() signals completion of previous iteration's LDS reads. Async loads write to nxt slab (different from cur), so signaling before them is safe. ISA confirms s_barrier_signal moved 24 instructions earlier, allowing other waves to unblock from wait() sooner. All 3 kernels pass correctness on the gfx1250 simulator.

-w hid all compiler warnings. Now using -Wall with specific -Wno flags for upstream framework warnings (unused-variable, unused-local- typedef, duplicate-decl-specifier, unused-value, pass-failed). All 8 kernels build with 0 warnings.

CQ-01: Replace -w with -Wall + targeted -Wno suppressions. 0 warnings. CQ-02: Add HK_HEADERS wildcard dep so header changes trigger rebuild. CQ-18: Change `all` target from $(BIN) to `ladder` (builds all 8).

CQ-03: Use __FILE__ instead of hardcoded "gemm_naive" in printf. CQ-05: Add HIP_OK(hipGetLastError()) after warmup dispatch to catch silent launch failures (e.g., invalid shared memory size).

CQ-07: MASK_ALL now 0xFFFFFFFF on gfx1250 (wave-32), 0xFFFFFFFFFFFFFFFF on CDNA (wave-64). CQ-11: Removed commented-out static_asserts in shared_allocator.

Two separate asm volatile blocks allowed compiler to theoretically reorder them. Single block ensures both loads are emitted together.

CQ-04: Round-trip CPU inputs through bf16 before reference computation so CPU and GPU see identical inputs. Drops max_abs_err 0.0368 → 0.0311. CQ-06: Replace fixed tolerance (1.0) with K-scaled formula: tol = 2 * sqrt(K) * 2^-7. Require zero bad elements. Passes K=32 (tol=0.088, err=0.031) and K=256 (tol=0.250, err=0.123).

make isa KERNEL=gemm_naive produces .s file via --save-temps.

Identical offset computation was duplicated between load_b128 and load_b32. Extracted into detail::gfx1250_lane_offset(sub_id, row, half).

Integer division by sub_elems and SUB_COLS compiles to shifts only when these are power-of-2. Added static_asserts to catch non-power-of-2 subtile configs that would emit expensive division.

…gment docs CQ-10: Replace manual st_16x32_padded<> enumeration in UDNA1 concept with template specialization (is_st_16x32_padded_inst). Any padded config auto-satisfies the concept without manual addition. Keeps the closed whitelist — only named shapes + padded instantiations pass. CDNA concept left unchanged. CQ-20: Add comment that st_16x32_padded::swizzle() applies padding, not XOR swizzle (named for API compat). CQ-21: Document segment 0 LDS waste tradeoff in gemm_segment.

Added limited_nostall mode (limited + bit[4]) to sched::mode enum. Widened HWREG encoding from 2 to 5 bits to cover the full SCHED_MODE field. Applied to gemm_expert kernel — ISA confirms s_setreg value changed from 2 to 18. Correctness verified on the gfx1250 simulator.

sched::wakeup() wakes all sleeping waves in the workgroup — useful for waking consumers after producer finishes. sched::sleep_var() sleeps for a runtime-variable duration (SGPR * 64 cycles). Both use inline asm since no builtins exist in ROCm 7.x.

Each WMMA takes 16 cycles; the SIMD can issue up to 8 independent VALU ops for free during this time. Added documentation near compiler_fence().

Added init_named(), join<ID>(), signal<ID>(), wait_named<ID>(), and wakeup_barrier<ID>(). s_barrier_leave is not available on gfx1250 (assembler rejects it). Named barriers require dispatch-time allocation via HIP launch attributes — wrappers handle the device-side instructions only. Verified: all wrappers compile and emit correct ISA on gfx1250.

AviralGoelAMD added 30 commits May 23, 2026 17:08

RC-01: fix cluster::sync() to use cluster user barrier (-3) instead o…

3b6b7c6

…f WG barrier (-1) s_barrier_signal/wait -1 only syncs waves within a single workgroup. Cluster-wide CGA sync requires barrier ID -3 (cluster user barrier) per MI400 ISA Section 4.3.6. Verified via ISA dump.

RC-03: add assert to init_barrier validating count range

c02694f

count==0 wraps to pending=0xFFFF causing silent hang. count>65535 wraps silently. Assert catches both at runtime.

SR-01: add M/N/K alignment asserts in grid() to catch silent truncation

4d9afaf

Non-aligned dimensions silently produce wrong results: grid() truncates via integer division, store_acc writes partial tiles. Assert fires for M=100 on the gfx1250 simulator, passes for M=64.

SR-05: add static_assert for 16-byte pad alignment in load_b128

63eed26

ds_load_b128 requires 16-byte aligned LDS addresses. A custom lds_padded<N, M> where M * sizeof(bf16) is not a multiple of 16 would silently misalign loads. static_assert catches this at compile time.

SR-09: static_assert pad interval is power-of-2 for D# encoding

3dbae36

__builtin_ctz encodes pad_interval as log2 for the TDM D# descriptor. Non-power-of-2 intervals produce wrong encoding (ctz returns lowest set bit, not log2). static_assert catches this at compile time.

SR-11/SR-12: document padding derivation and write/read consistency

3c45296

SR-11: Added bank conflict derivation for lds_pad_default<128,8>. SR-12: Added comment that Pad type must match between load_async (write) and load_b128 (read) — mismatch causes silent corruption.

CQ-01/CQ-02/CQ-18: Makefile warning flags, header deps, all target

94a463a

CQ-01: Replace -w with -Wall + targeted -Wno suppressions. 0 warnings. CQ-02: Add HK_HEADERS wildcard dep so header changes trigger rebuild. CQ-18: Change `all` target from $(BIN) to `ladder` (builds all 8).

CQ-03/CQ-05: fix kernel name printf and add hipGetLastError after launch

ddddb1a

CQ-03: Use __FILE__ instead of hardcoded "gemm_naive" in printf. CQ-05: Add HIP_OK(hipGetLastError()) after warmup dispatch to catch silent launch failures (e.g., invalid shared memory size).

CQ-07/CQ-11: fix MASK_ALL for wave-32 and remove dead static_asserts

e749779

CQ-07: MASK_ALL now 0xFFFFFFFF on gfx1250 (wave-32), 0xFFFFFFFFFFFFFFFF on CDNA (wave-64). CQ-11: Removed commented-out static_asserts in shared_allocator.

CQ-13: remove unused bar_bytes variable in TDM dispatch

2088056

CQ-23: combine paired ds_load_b128 into single asm volatile block

52bf9fe

Two separate asm volatile blocks allowed compiler to theoretically reorder them. Single block ensures both loads are emitted together.

CQ-14: add hipFree and hipEventDestroy cleanup in harness

aedc8b9

CQ-19: add isa target to Makefile for GPU assembly dump

1802e7b

make isa KERNEL=gemm_naive produces .s file via --save-temps.

CQ-12: document fence() scope — drains loadcnt + dscnt only

ac71d74

CQ-16: extract gfx1250_lane_offset helper for shared_to_register

50d19fa

Identical offset computation was duplicated between load_b128 and load_b32. Extracted into detail::gfx1250_lane_offset(sub_id, row, half).

AviralGoelAMD added 6 commits May 26, 2026 11:28

CQ-17: static_assert subtile dims are power-of-2 in subtile_flat

17d0275

Integer division by sub_elems and SUB_COLS compiles to shifts only when these are power-of-2. Added static_asserts to catch non-power-of-2 subtile configs that would emit expensive division.

MF-09: document WMMA co-execution opportunity

12cdbab

Each WMMA takes 16 cycles; the SIMD can issue up to 8 independent VALU ops for free during this time. Added documentation near compiler_fence().

AviralGoelAMD force-pushed the users/avirgoel/ck/review-fixes branch from 582ff2f to aae0401 Compare May 26, 2026 15:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gfx1250: address review findings — correctness fixes, safety checks, ISA improvements#58

gfx1250: address review findings — correctness fixes, safety checks, ISA improvements#58
AviralGoelAMD wants to merge 36 commits into
HazyResearch:neoblizz/gfx1250from
AviralGoelAMD:users/avirgoel/ck/review-fixes

AviralGoelAMD commented May 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

AviralGoelAMD commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

AviralGoelAMD commented May 23, 2026 •

edited

Loading