gfx1250: address review findings — correctness fixes, safety checks, ISA improvements#58
Open
AviralGoelAMD wants to merge 36 commits into
Open
Conversation
…f WG barrier (-1) s_barrier_signal/wait -1 only syncs waves within a single workgroup. Cluster-wide CGA sync requires barrier ID -3 (cluster user barrier) per MI400 ISA Section 4.3.6. Verified via ISA dump.
First MFMA was writing result into C (stripping const via cast), then second MFMA read the modified C. Use a local accumulator instead so C is never mutated. Compiler optimizes it away when D aliases C. Bug confirmed on gfx950 MI350X: C corrupted from 42.0 to 58.0.
count==0 wraps to pending=0xFFFF causing silent hang. count>65535 wraps silently. Assert catches both at runtime.
Hardcoded width=64 causes ds_bpermute to wrap around on wave-32, reading from lane (L+delta)%32 instead of returning identity. This doubles every value in tree reductions (sum=1056 instead of 528). Verified on the gfx1250 simulator.
Non-aligned dimensions silently produce wrong results: grid() truncates via integer division, store_acc writes partial tiles. Assert fires for M=100 on the gfx1250 simulator, passes for M=64.
All three allocate variants now assert ptr stays within bounds. Regular allocate() checks against MAX_SHARED_MEMORY. Segmented allocate_in() checks against the segment's end boundary. Unified base pointer storage across UDNA1 and CDNA paths. Uses device-side assert() which prints file:line and traps. Verified on gfx950 MI350X: pointer-comparison assert fires correctly when overflow condition is runtime-determined.
allocate_in() used a single ptr for all segments. Allocating into segment 0 after segment 1 would silently place data in segment 1 because ptr can only advance forward. Now each segment tracks its own cursor via seg_ptr[LDS_NUM_SEGMENTS], allowing interleaved segment allocations. Verified on the gfx1250 simulator: gemm_segment produces correct results.
load_b32 does not apply padding offsets (unlike load_b128 which takes a Pad template param). Adding Pad support was attempted but blocked by include ordering (lds_nopad defined in global_to_shared.cuh which is included after shared_to_register.cuh). Added comment directing users to load_b128<Pad> for padded layouts.
ds_load_b128 requires 16-byte aligned LDS addresses. A custom lds_padded<N, M> where M * sizeof(bf16) is not a multiple of 16 would silently misalign loads. static_assert catches this at compile time.
No bounds clamping on DRAM reads — out-of-bounds tile coordinates read from invalid memory silently. Added doc comment to load() and load_async() noting caller must ensure alignment. Enforced at dispatch level by SR-01's M/N/K alignment asserts.
DS_ATOMIC_ASYNC_BARRIER_ARRIVE_B64 fires per active lane. Without a lane guard, one wave produces 32 arrivals instead of 1, causing the barrier to flip prematurely. Moved the guard into the function so callers can't forget it. Removed redundant guards from all 4 call sites in gemm_tdm_arrive.
The TDM D# descriptor's atomic_barrier_address field is 16 bits, limiting barrier cells to the first 64KB of LDS. Added assert at the point of truncation in build_tdm_d_2d() so any kernel passing an out-of-range barrier address fails immediately instead of silently arriving at the wrong LDS location.
__builtin_ctz encodes pad_interval as log2 for the TDM D# descriptor. Non-power-of-2 intervals produce wrong encoding (ctz returns lowest set bit, not log2). static_assert catches this at compile time.
Boundary workgroup at limit should be in the partial tail (identity mapping), not enter the full-block remapping. Change > to >=. Cosmetic on all tested parameters (boundary always maps to itself), but matches the documented intent. No-op on gfx1250 (NUM_XCDS=1).
SR-11: Added bank conflict derivation for lds_pad_default<128,8>. SR-12: Added comment that Pad type must match between load_async (write) and load_b128 (read) — mismatch causes silent corruption.
SR-13: Added static_assert that double-buffered A and B each fit in one LDS segment (64KB). Catches oversized tile configs at compile time. SR-14: Added WARNING comment to wait_barrier documenting infinite loop risk if arrive never fires. Timeout deferred — can't test on the gfx1250 simulator.
The final loop iteration syncs to protect a next-iteration load that never happens. Guard with if(k+1 < k_iters) — uniform scalar branch, no divergence. ISA confirms the barrier is conditionally skipped via s_cbranch. All 7 kernels pass correctness on the gfx1250 simulator.
Attempted to scope RAII guard tightly around mma_ABt_burst. ISA showed compiler reordering scalar s_setreg past VALU WMMA instructions, making the expert mode window empty. Wide scope is correct — ensures s_setreg reset stays after all WMMAs. Added comment.
arrive() signals completion of previous iteration's LDS reads. Async loads write to nxt slab (different from cur), so signaling before them is safe. ISA confirms s_barrier_signal moved 24 instructions earlier, allowing other waves to unblock from wait() sooner. All 3 kernels pass correctness on the gfx1250 simulator.
-w hid all compiler warnings. Now using -Wall with specific -Wno flags for upstream framework warnings (unused-variable, unused-local- typedef, duplicate-decl-specifier, unused-value, pass-failed). All 8 kernels build with 0 warnings.
CQ-01: Replace -w with -Wall + targeted -Wno suppressions. 0 warnings. CQ-02: Add HK_HEADERS wildcard dep so header changes trigger rebuild. CQ-18: Change `all` target from $(BIN) to `ladder` (builds all 8).
CQ-03: Use __FILE__ instead of hardcoded "gemm_naive" in printf. CQ-05: Add HIP_OK(hipGetLastError()) after warmup dispatch to catch silent launch failures (e.g., invalid shared memory size).
CQ-07: MASK_ALL now 0xFFFFFFFF on gfx1250 (wave-32), 0xFFFFFFFFFFFFFFFF on CDNA (wave-64). CQ-11: Removed commented-out static_asserts in shared_allocator.
Two separate asm volatile blocks allowed compiler to theoretically reorder them. Single block ensures both loads are emitted together.
CQ-04: Round-trip CPU inputs through bf16 before reference computation so CPU and GPU see identical inputs. Drops max_abs_err 0.0368 → 0.0311. CQ-06: Replace fixed tolerance (1.0) with K-scaled formula: tol = 2 * sqrt(K) * 2^-7. Require zero bad elements. Passes K=32 (tol=0.088, err=0.031) and K=256 (tol=0.250, err=0.123).
make isa KERNEL=gemm_naive produces .s file via --save-temps.
Identical offset computation was duplicated between load_b128 and load_b32. Extracted into detail::gfx1250_lane_offset(sub_id, row, half).
Integer division by sub_elems and SUB_COLS compiles to shifts only when these are power-of-2. Added static_asserts to catch non-power-of-2 subtile configs that would emit expensive division.
…gment docs CQ-10: Replace manual st_16x32_padded<> enumeration in UDNA1 concept with template specialization (is_st_16x32_padded_inst). Any padded config auto-satisfies the concept without manual addition. Keeps the closed whitelist — only named shapes + padded instantiations pass. CDNA concept left unchanged. CQ-20: Add comment that st_16x32_padded::swizzle() applies padding, not XOR swizzle (named for API compat). CQ-21: Document segment 0 LDS waste tradeoff in gemm_segment.
Added limited_nostall mode (limited + bit[4]) to sched::mode enum. Widened HWREG encoding from 2 to 5 bits to cover the full SCHED_MODE field. Applied to gemm_expert kernel — ISA confirms s_setreg value changed from 2 to 18. Correctness verified on the gfx1250 simulator.
sched::wakeup() wakes all sleeping waves in the workgroup — useful for waking consumers after producer finishes. sched::sleep_var() sleeps for a runtime-variable duration (SGPR * 64 cycles). Both use inline asm since no builtins exist in ROCm 7.x.
Each WMMA takes 16 cycles; the SIMD can issue up to 8 independent VALU ops for free during this time. Added documentation near compiler_fence().
Added init_named(), join<ID>(), signal<ID>(), wait_named<ID>(), and wakeup_barrier<ID>(). s_barrier_leave is not available on gfx1250 (assembler rejects it). Named barriers require dispatch-time allocation via HIP launch attributes — wrappers handle the device-side instructions only. Verified: all wrappers compile and emit correct ISA on gfx1250.
582ff2f to
aae0401
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Systematic review addressing for the gfx1250 HipKittens PR. 48 items addressed across 36 commits.
Correctness fixes (P1):
Safety checks (P2):
ISA improvements (P4):
Missing features (P3):
Code quality (P5):
Verification: