feat(ck-tile): stream-K GEMM TE to dispatcher bridge by ozturkosu · Pull Request #9028 · ROCm/rocm-libraries

ozturkosu · 2026-07-01T17:54:43Z

Re-opened from #8136 with a policy-compliant branch name (users/muozturk/ck-tile/dispatcher-te-bridge-streamk-gemm). Supersedes #8136.

Summary

Routes the stream_k GEMM variant through the same Tile Engine (TE) →
Dispatcher bridge already landed for regular GEMM (Phase 1, #8123) and grouped
GEMM (Phase 3, #8130). Goal of the overall effort: the Dispatcher is the single
source of truth for codegen/build/runtime, and TE only produces configs +
benchmarks.

This PR is stacked on muozturk/dispatcher-gemm-bridge (#8123) — please merge
that first. Its own diff is just two commits:

[CK_TILE] Add stream_k variant to GEMM Dispatcher codegen (cherry-picked)
[CK_TILE] Add Tile Engine -> Dispatcher bridge for Stream-K GEMM (this work)

What Stream-K needs that regular GEMM doesn't

Stream-K is a single-problem GEMM (one A/B/C, one M/N/K) with the same C
ABI as regular GEMM, so the Python side (GpuGemmRunner / GemmDispatcherLib
/ GemmProblem) and the GPU worker are reused unchanged. The differences are
internal to the .so:

The generated launch has a Stream-K signature
SelectedKernel::launch(const ck_tile::StreamKHostArgs&, const stream_config&),
which allocates the reduction workspace internally (DeviceMem) and uses the
Atomic reduction strategy.
The registry path (generated_tile_backend.hpp::run()) hard-codes the
single-problem GemmHostArgs launch and won't compile against a Stream-K
SelectedKernel. So the Stream-K ctypes lib bypasses the registry and calls
SelectedKernel::launch(args, stream) directly, reporting the name from the
KERNEL_NAME macro (same approach grouped uses).

Changes

New

dispatcher/bindings/ctypes/streamk_gemm_ctypes_lib.cpp — same single-problem
C ABI (dispatcher_run_gemm(A,B,C,M,N,K,time_ms)); hipMalloc + copy A/B,
memset C=0 (Atomic accumulates into C), build StreamKHostArgs with rcr
strides (stride_A=K, stride_B=K, stride_C=N, k_batch=1), launch, copy C back.
Returns 0 / -1 (HIP or throw) / -2 (kernel reports args unsupported).
tile_engine/ops/gemm/streamk_gemm_full_benchmark.py — 3-phase driver
(expand configs → setup_multiple_gemm_dispatchers build → subprocess-isolated
benchmark), mirroring gemm_full_benchmark.py with variant="stream_k".
tile_engine/ops/gemm/run_one_streamk_gemm_kernel.py — disposable GPU worker
(identical to the regular worker since the ABI matches).
tile_engine/ops/gemm/gemm_streamk/configs/default_config.json — small sweep
config (128x128x{32,64}, 2x2x1, 32x32x16, compv3/compv4, intrawave, cshuffle,
pad true, persistent false) → 4 kernels.
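As a rough illustration of how the sweep above comes out to 4 kernels, the axes can be expanded as a Cartesian product. This is a minimal sketch: the dictionary layout and the `expand` helper are hypothetical and do not reflect the real JSON schema or the `expand_sweep` signature.

```python
from itertools import product

# Sweep axes as listed for default_config.json in this description.
# The key names and layout are illustrative, not the real schema.
SWEEP = {
    "tile": [(128, 128, 32), (128, 128, 64)],
    "warp": [(2, 2, 1)],
    "warp_tile": [(32, 32, 16)],
    "pipeline": ["compv3", "compv4"],
    "scheduler": ["intrawave"],
    "epilogue": ["cshuffle"],
    "pad": [True],
    "persistent": [False],
}


def expand(sweep):
    """Return one dict per point in the Cartesian product of the axes."""
    keys = list(sweep)
    return [dict(zip(keys, values)) for values in product(*(sweep[k] for k in keys))]


configs = expand(SWEEP)
print(len(configs))  # 2 tile shapes x 2 pipelines = 4
```

Only the tile K dimension and the pipeline have more than one value, so they alone determine the kernel count.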

Modified

dispatcher/python/gemm_utils.py — _ctypes_source_name() selects
streamk_gemm_ctypes_lib.cpp for variant=="stream_k" (in both
_build_compile_jobs and setup_multiple_gemm_dispatchers); .name appends
_streamk; variant threaded into codegen_args and expand_sweep.
dispatcher/python/ctypes_utils.py — pass the requested variant to codegen
--variants instead of hard-coding "standard".

Validation (gfx942 / MI300X, fp16 / rcr)

Numeric parity vs a numpy fp32 reference (A.f32 @ B.f32). Stream-K's Atomic
reduction does multiple fp16 atomic-adds (one per K-split partial) vs
regular/grouped's single fp32→fp16 store, so it is inherently noisier; tolerance
is widened to max_rel ≤ 2.5e-3, frob_rel ≤ 1.5e-3 (regular/grouped use 5e-4).
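The two error metrics can be sketched in plain Python. The max_rel form follows the global metric max|out-ref|/max|ref| quoted later in this PR; the Frobenius form is an assumed definition, and the function names are illustrative rather than taken from the worker script.

```python
import math


def max_rel(out, ref):
    """Global max-error metric: max|out - ref| / max|ref|."""
    num = max(abs(o - r) for o, r in zip(out, ref))
    den = max(abs(r) for r in ref)
    return num / den


def frob_rel(out, ref):
    """Relative Frobenius error ||out - ref||_F / ||ref||_F (assumed definition)."""
    num = math.sqrt(sum((o - r) ** 2 for o, r in zip(out, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den


def passes(out, ref, max_tol=2.5e-3, frob_tol=1.5e-3):
    """Apply the Stream-K tolerances stated above to flattened outputs."""
    return max_rel(out, ref) <= max_tol and frob_rel(out, ref) <= frob_tol
```

Both functions take flattened sequences, so a matrix result would be flattened before the comparison.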

Shape (M,N,K)	TFLOPS	max_rel	frob_rel	result
1024 x 1024 x 1024	23.1	1.10e-3	5.47e-4	PASS
2048 x 2048 x 2048	129.5	7.02e-4	3.72e-4	PASS
512 x 768 x 4096	24.6	2.01e-3	8.19e-4	PASS
256 x 256 x 256	2.1	3.44e-4	2.08e-4	PASS
4096 x 4096 x 1024	149.8	5.89e-4	2.65e-4	PASS

Full TE driver run: 4 kernels x 4 problems = 16/16 OK, 0 failures. The default
problem set targets Stream-K's sweet spot (squares plus a large-K skinny shape):

1024 x 1024 x 1024
2048 x 2048 x 2048
4096 x 4096 x 4096
512 x 512 x 8192 (large-K skinny)

All status 0, positive TFLOPS, nonzero output. Name parity holds end-to-end:
the runtime name reported by each .so equals
GemmKernelConfig(variant="stream_k").name, ending in _streamk.

Unsupported-shape handling: a tiny 257^3 problem is correctly reported as
unsupported by the kernel (status -2, too few tiles to partition across CUs)
and surfaced gracefully by the bridge — not a crash.
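A caller could interpret the three return codes along these lines. This is a sketch only: the code meanings come from the description of streamk_gemm_ctypes_lib.cpp above, while `classify` and the choice to treat -2 as a skip rather than a failure are assumptions.

```python
# Return codes of dispatcher_run_gemm as described for the Stream-K ctypes lib.
STATUS_MEANING = {
    0: "ok",
    -1: "HIP error or C++ exception",
    -2: "kernel reports arguments unsupported",
}


def classify(status):
    """Map a return code to (label, is_failure)."""
    label = STATUS_MEANING.get(status, f"unknown status {status}")
    # Assumption of this sketch: -2 means the kernel declined the shape
    # without crashing, so it is reported but not counted as a failure.
    is_failure = status not in (0, -2)
    return label, is_failure


print(classify(-2))
```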

Test plan

unified_gemm_codegen.py ... --variants stream_k emits a *_streamk.hpp
whose stem == GemmKernelConfig(variant="stream_k").name
setup_multiple_gemm_dispatchers builds the Stream-K config set → .so
compiles & links against streamk_gemm_ctypes_lib.cpp
Numeric parity (table above) passes under the fp16 Atomic tolerance
Full driver run 16/16 OK; name parity verified
Unsupported tiny shape returns status -2 gracefully

Update 2026-06-12 — brought current with #8123 + Copilot fixes

This branch had forked at #8123's first commit, so it lacked every later
regular-GEMM bridge improvement. Merged the current #8123 HEAD and ported the
Stream-K-specific analogues (the Stream-K bridge keeps its own driver, worker and
ctypes lib, so those fixes do not arrive via the merge):

Benchmark-param parity / perf-methodology fix (streamk_gemm_ctypes_lib.cpp):
benchmark knobs defaulted to warmup=3/repeat=10 — a cold, un-ramped clock, the
root cause of the regular bridge's spurious "perf gap." Now default to old-TE's
warmup=50/repeat=100, env-overridable via
CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. rotating_count stays 1 for
Stream-K: the Atomic preprocess re-zeros only the original C buffer, so rotating
C would leave rotated copies un-zeroed and corrupt the accumulation.
--verify correctness gate (driver + worker): opt-in fp32 numpy reference
check (global max|out-ref|/max|ref|, verified/max_rel in the CSV); a
mismatch counts as a failure.
Multi-GPU benchmarking (driver): fans the (kernel × problem) work across all
visible GPUs via device-pinned HIP_VISIBLE_DEVICES workers (--devices, device
CSV column); also fixes a latent proc-unbound error in the batch handler.
--dtype/--layout guards (driver): constrained to the supported fp16/rcr
surface so a mismatch fails fast.
Copilot codegen nits: std::stoi → std::stoll for M/N/K in
03_streamk_gemm_driver.cpp; stride-aware C zeroing via hipMemset2DAsync
(CLayout-aware, checked HIP status) in _launch_function_streamk.
README: new Stream-K bridge subsection.
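The benchmark-knob defaults and overrides described in this update can be modeled as follows. The real logic lives in C++ inside the ctypes lib, so this Python model is only an approximation; the expanded variable names (CK_TILE_BENCH_WARMUP, CK_TILE_BENCH_REPEAT, CK_TILE_BENCH_FLUSH, CK_TILE_BENCH_ROTATING) are read off the shorthand above, and the flush default and the accepted false spellings are assumptions.

```python
import os


def env_int(name, default, env=os.environ):
    """Read an integer knob, falling back to the default when unset or empty."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


def env_bool(name, default, env=os.environ):
    """Read a boolean knob case-insensitively (so False/Off/0 all disable it)."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() not in ("0", "false", "off", "no")


def bench_knobs(env=os.environ):
    return {
        "warmup": env_int("CK_TILE_BENCH_WARMUP", 50, env),
        "repeat": env_int("CK_TILE_BENCH_REPEAT", 100, env),
        # Flush default is assumed here; the PR only states warmup/repeat/rotating.
        "flush_cache": env_bool("CK_TILE_BENCH_FLUSH", True, env),
        # Stream-K keeps rotating_count at 1 so only the original C is re-zeroed.
        "rotating_count": env_int("CK_TILE_BENCH_ROTATING", 1, env),
    }
```

Passing a plain dict as `env` makes the defaults easy to check without touching the process environment.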

Validation status: DONE on gfx942/MI300X (ctr-cx64-mi300x-4, enroot container). Bridge build+benchmark+--verify on gemm_streamk/configs/default_config.json = 16/16 OK, all verified (max_rel ≤ 3.0e-3, fp16 atomic tol), name parity holds. Bridge-vs-Old-TE parity (perf + correctness, byte-identical device kernel) posted as a comment with the full table + streamk_bridge_oldTE.csv.

Add the stream-K GEMM variant to the unified GEMM dispatcher codegen the dispatcher way: a single-GEMM launch(args, stream) that allocates the reduction workspace internally via DeviceMem (GetWorkSpaceSize / SetWorkSpacePointer), zeroes it, and launches StreamKKernel with an atomic-reduction preprocess that resets C between timed iterations. No external workspace pointer (not the Tile Engine way).
- arch_filter.py: add OperatorType.GEMM_STREAMK + tile constraints.
- unified_gemm_codegen.py: add GemmVariant.STREAM_K, CLI --variants stream_k, naming, includes, _launch_function_streamk, variant->operator map, cshuffle-only config selection, and A/B/CLayout export in the CK_TILE_SINGLE_KERNEL_INCLUDE block.
- examples/gemm/cpp/03_streamk_gemm_driver.cpp: standalone single-kernel driver that calls SelectedKernel::launch and verifies vs reference_gemm.
Parity vs Tile Engine on MI300X (gfx942), fp16 rcr atomic 128x128x64_2x2x1_32x32x16, 3840x4096x2048, warmup=10/repeat=50: dispatcher 0.242 ms / 266 TFLOPS PASS vs TE 0.24 ms / 266 TFLOPS correct.

Introduce the shared-config bridge that lets Tile Engine drive the GEMM dispatcher the same way FMHA and grouped conv already do: one config dataclass owned by the dispatcher, imported by Tile Engine, with no translator between two vocabularies.
- dispatcher/python/gemm_utils.py: GemmKernelConfig (the common contract; .name mirrors the codegen KERNEL_NAME byte-for-byte), GemmProblem, GemmDispatcherLib, GpuGemmRunner, setup_multiple_gemm_dispatchers (codegen + hipcc -> .so paths, CPU-only/parallel), and expand_sweep.
- dispatcher/bindings/ctypes/gemm_ctypes_lib.cpp: add the indexed multi-kernel ABI dispatcher_get_kernel_name_at(index, buf, size); legacy single-kernel dispatcher_get_kernel_name retained.
- tile_engine/ops/gemm/gemm_full_benchmark.py: 3-phase TE driver (compile -> load problems -> benchmark) that generates no binaries.
- tile_engine/ops/gemm/run_one_gemm_kernel.py: disposable GPU worker for subprocess fault isolation.
Scope: regular GEMM, fp16, rcr (Phase 1). Name parity verified end-to-end (config.name == generated .hpp stem == runtime registry name).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Routes the stream_k GEMM variant through the same TE->Dispatcher bridge as regular GEMM (Phase 1) and grouped GEMM (Phase 3). Stream-K is a single-problem GEMM with the same C ABI as regular GEMM, so the Python runner side is reused unchanged; only the .so internals and variant routing differ.
- streamk_gemm_ctypes_lib.cpp (new): same single-problem C ABI (dispatcher_run_gemm) but builds a ck_tile::StreamKHostArgs and calls SelectedKernel::launch(args, stream) directly, bypassing the registry (whose generated backend hard-codes the GemmHostArgs launch signature). The launch allocates the Atomic-reduction workspace internally; C is zeroed per run.
- gemm_utils.py: _ctypes_source_name() selects streamk_gemm_ctypes_lib.cpp for variant "stream_k"; .name appends _streamk; variant threaded through codegen_args and expand_sweep.
- ctypes_utils.py: pass the requested variant to codegen --variants instead of hard-coding "standard".
- TE driver/worker/config (new): streamk_gemm_full_benchmark.py, run_one_streamk_gemm_kernel.py, gemm_streamk/configs/default_config.json.
Validated end-to-end on gfx942/MI300X: full driver run 16/16 OK (4 kernels x 4 problems), name parity holds (.so name == config .name, ends _streamk). Numeric parity vs fp32 reference passes under an fp16 Atomic-reduction tolerance (max_rel <= 2.5e-3) which is wider than regular/grouped because Atomic does multiple fp16 atomic-adds per K-split. Tiny problems (e.g. 257^3) are correctly reported unsupported (status -2) by the kernel and surfaced gracefully.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

setup_multiple_gemm_dispatchers built its per-config codegen args without the config's variant, and _generate_single_kernel_subprocess hard-coded --variants standard. A GemmKernelConfig(variant='preshuffle'/'multi_d') passed to the bridge would therefore be code-generated as a standard kernel, while its name (and the hpp_glob_pattern derived from it) still carried the variant suffix -- so the lookup could never match the emitted header. Pass the variant through; the subprocess default stays "standard" so all existing callers are unaffected. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…dler If subprocess.Popen() itself raises, the generic except handler referenced proc before it was bound, masking the real error with an UnboundLocalError. Initialize proc = None before the try; the handler already guards with `if proc and proc.poll() is None`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
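The pattern this commit describes can be shown in a few lines. This is a minimal sketch: `run_batch` and its arguments are hypothetical and are not the driver's actual batch handler.

```python
import subprocess
import sys


def run_batch(cmd, timeout=30):
    """Run one worker command and return (returncode, stdout)."""
    proc = None  # bound before the try so the handler can always test it
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out
    except Exception:
        # If Popen() itself raised, proc is still None and there is nothing to kill.
        if proc and proc.poll() is None:
            proc.kill()
            proc.wait()
        raise  # the original error propagates instead of an UnboundLocalError


rc, out = run_batch([sys.executable, "-c", "print('ok')"])
```

With `proc` initialized up front, a failure to start the worker surfaces as the real exception (for example FileNotFoundError for a missing executable).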

Copilot review item: the driver exposed unrestricted --dtype/--layout while the Phase-1 worker hard-codes fp16 inputs and an rcr (column-major B) host transpose. Passing e.g. --dtype bf16 would codegen bf16 kernels but feed them fp16 data, silently benchmarking the wrong thing. Add SUPPORTED_DTYPES/ SUPPORTED_LAYOUTS and wire them into argparse choices so a mismatch fails fast. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
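A fail-fast guard of this kind can be sketched with argparse choices. SUPPORTED_DTYPES and SUPPORTED_LAYOUTS are named in the commit; their exact values and the `build_parser` helper are assumptions based on the fp16/rcr surface described here.

```python
import argparse

# Assumed contents, matching the fp16/rcr surface this commit describes.
SUPPORTED_DTYPES = ("fp16",)
SUPPORTED_LAYOUTS = ("rcr",)


def build_parser():
    parser = argparse.ArgumentParser(description="GEMM bridge driver (sketch)")
    # argparse rejects any value outside `choices` before the driver runs.
    parser.add_argument("--dtype", choices=SUPPORTED_DTYPES, default="fp16")
    parser.add_argument("--layout", choices=SUPPORTED_LAYOUTS, default="rcr")
    return parser


args = build_parser().parse_args([])
print(args.dtype, args.layout)
```

Passing `--dtype bf16` to this parser exits with a usage error instead of generating bf16 kernels that would then be fed fp16 data.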

Addresses the three review items on the TE->Dispatcher GEMM bridge driver, scoped to this foundation PR's fp16/rcr surface (bf16/layouts follow in the #8190/#8191 stack):
1. Example configs to sweep
   - gemm_full_benchmark.py defaults to the selected variant's configs/default_ci_config.json (small CI sweep) when no config is passed, and to configs/example_problems.json when --problems is omitted; configs/default_config.json remains the full sweep.
   - New gemm_universal/configs/example_problems.json (square / rectangular / large M,N,K). Nightly-test JSON drops into the same configs/ dir -- no driver change needed.
2. Multi-GPU launch in parallel (supersedes grouped_conv's serial-GPU design)
   - Phase 3 fans the (kernel x problem) work across every visible GPU: one worker thread per device pulls batches from a shared queue and spawns a disposable subprocess pinned with HIP_VISIBLE_DEVICES, so an N-GPU box runs ~Nx faster while keeping per-batch fault isolation.
   - Devices auto-detected (HIP_VISIBLE_DEVICES, then rocm-smi/amd-smi); override with --devices (count, explicit ids, or all).
3. Variant organization + README + deprecation note
   - --variant selects the per-variant configs/ directory.
   - New README "Dispatcher Bridge Workflow" section: scripts, per-variant config layout, run examples, multi-GPU explanation, supported surface (fp16/rcr here), and a deprecation note for the legacy *_instance_builder.py generators.
Driver --dtype/--layout choices stay fp16/rcr to match this PR's dispatcher host path; run_one_gemm_kernel.py (fp16 host gen) is unchanged.
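The multi-GPU fan-out in item 2 can be sketched as one thread per device draining a shared queue, with each batch run in a subprocess pinned via HIP_VISIBLE_DEVICES. This is an illustrative model, not the driver's code: `run_on_devices` and the echo worker are hypothetical stand-ins, and no GPU is needed to run it.

```python
import os
import queue
import subprocess
import sys
import threading


def run_on_devices(batches, devices, worker_cmd):
    """Fan batches across devices: one thread per device, one subprocess per batch."""
    work = queue.Queue()
    for batch in batches:
        work.put(batch)
    results = []
    lock = threading.Lock()

    def device_loop(device_id):
        while True:
            try:
                batch = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this device is done
            # Pin the disposable worker to exactly one device.
            env = dict(os.environ, HIP_VISIBLE_DEVICES=str(device_id))
            done = subprocess.run(worker_cmd + [str(batch)], env=env,
                                  capture_output=True, text=True)
            with lock:
                results.append((batch, device_id, done.returncode, done.stdout.strip()))

    threads = [threading.Thread(target=device_loop, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Stand-in worker: prints the device it was pinned to and the batch it received.
ECHO_WORKER = [sys.executable, "-c",
               "import os, sys; print(os.environ['HIP_VISIBLE_DEVICES'], sys.argv[1])"]

results = run_on_devices(range(6), [0, 1], ECHO_WORKER)
```

Because each batch runs in its own process, a crash in one worker costs only that batch, which is the fault-isolation property the commit keeps.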

Adds a "Variant scope" section clarifying that the bridge is one shared, variant-aware driver (not per-variant driver copies), that only gemm_universal is wired and validated through the bridge on this PR, and that the gemm_multi_d/gemm_preshuffle/grouped_gemm configs/ dirs are scaffolding following the per-variant convention -- not yet working support. Notes that grouped GEMM and stream-K are separate bridge efforts (#8136 stream-K).

…layout
Phase 4: remove the legacy standalone regular-GEMM build path and reorganize the bridged path to mirror the merged fmha/ and grouped_conv/ bridges.
- Delete the gemm_universal standalone generator/benchmark/profiler/CMake and the dormant test/ck_tile/gemm_tile_engine suite (its only other consumer).
- Promote gemm_universal/configs/ to the op-root configs/ (flat), matching the fmha/grouped_conv convention; remove the now-empty gemm_universal/ folder.
- Keep the bridge driver + worker at the gemm op root (gemm_full_benchmark.py, run_one_gemm_kernel.py); VARIANT_CONFIGS[gemm_universal] -> configs.
- Drop gemm_universal from the sampling foreach loops and add_subdirectory.
- Refresh README (folder layout, running examples, removal note).
Shared harness (gemm_instance_builder.py, gemm_validation_utils.py, gemm C++ harness) and the not-yet-bridged variants (gemm_multi_d, gemm_preshuffle, grouped_gemm) are unchanged. Dispatcher untouched.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…s gate
Address quality-check findings on the Tile-Engine -> Dispatcher GEMM bridge:
- default_ci_config.json warp_tile [16,16,32] -> [32,32,16]; the old value is rejected by the gfx942/fp16 validator, so the documented default command (`python gemm_full_benchmark.py`) expanded to 0 kernels and aborted. Now it expands to 16 valid configs and benchmarks 80/80 OK.
- Add an opt-in `--verify` (with `--verify-tol`, default 2e-2) path: the worker compares each output against an fp32 numpy reference using the global metric max|out-ref|/max|ref|, results read VERIFY/MISMATCH, a mismatch counts as a failure, and max_rel/verified are written to the CSV. Previously an OK status meant liveness (non-zero output), not correctness.
- README: document the default-vs-correctness distinction and the new flag.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ples perf The bridge hardcoded warmup=5/repeat=10/flush_cache=false/rotating_count=1 in the generated-kernel stream_config and ignored the requested warmup/repeat. Old TE defaults to 50/100/flush=true/rotating=1000, so bridge-vs-old-TE comparisons measured different windows (the source of the ~13% "gap vs default" in PR #8123, and the tiny 5-iter warmup left the GPU clock un-ramped, producing the spurious 2048^3 dip). Default to the old-TE values and allow per-run override via CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. Device code is unchanged; parity re-verified (12/12, max_rel <= 3.82e-4). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…duler The codegen arch filter hard-coded pipeline=compv4 / scheduler=intrawave when validating tile geometry, ignoring each config's actual traits. Since compv4 has the strictest MFMA constraints, tiles legal under mem/compv3 were wrongly rejected -- collapsing the generated fp16/rcr set from ~1520 to 512 kernels (compv3 and mem each decimated ~5x; compv4 roughly preserved). Thread the trait's real pipeline/scheduler into _is_tile_arch_valid at both call sites; the tile pre-filter now keeps a tile if it is legal under any configured pipeline/scheduler, with the precise per-trait check deferred to _get_configs_for_variant. Verified on the 6144 fp16/rcr config population: emitted kernels 512 -> 1520 (compv3 464, compv4 128, mem 928), and a previously-rejected compv3 64x64x192 config now generates a header end-to-end. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…artifact, not a speedup)
The sweep's >=20% "bridge faster" cells (all compv4/intrawave/1024^3) are NOT a bridge speedup. Proven on MI300X: the device kernel is byte-identical (same ck_tile::kentry<1,GemmKernel<...>> symbol, rocprof), and through any uniform harness it runs at the same speed on both sides. Ruled out empirically: kernel, compiler/flags (4 toolchain rebuilds incl. clang++-HIP with old-TE flags all give ~189), all bench knobs (warmup/repeat/flush/rotating/timer), allocation/placement (DeviceMem, 4GB decoys), and stale timing headers (byte-identical across trees). rocprof confirms the slowdown is real device time (13.78us vs 11.34us): old TE's *standalone benchmark binary* runs the identical kernel ~18-20% slower purely due to that process's GPU clock/execution state (+8% stall cycles under PMC, plus ~13% lower sustained SCLK).
Fix:
- ab_same_harness.py: apples-to-apples A/B that builds the old-TE kernel into a .so and runs BOTH it and the bridge .so through the SAME worker. Gap collapses to ~+/-0.5% at 1024^3 (was +20..+24% vs the standalone binary). Proof in ab_same_harness.out.
- diagnose.md sec.4 rewritten: the prior hipcc-vs-clang++ toolchain theory is disproven and replaced with the evidence above.
- generated_tile_backend.hpp: correct the misleading comment that claimed matching bench knobs makes bridge-vs-old-TE "apples-to-apples".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Resolve CMakeLists.txt conflict in tile_engine/ops/gemm: keep develop's expanded op list and new add_subdirectory entries while preserving this branch's retirement of legacy gemm_universal (dropped from both budget foreach loops and from add_subdirectory).

Remove dispatcher/parity_diag/regression/diagnose.md from the PR; the content now lives on Confluence (MLSE) as a child page under the fp16/rcr A/B sweep report: https://amd.atlassian.net/wiki/spaces/MLSE/pages/1737132108

- ab_same_harness.py: derive ROOT from __file__ and take old-TE header dir from OLD_TE_GEN env var (was hard-coded dev paths); drop unused statistics import
- generated_tile_backend.hpp: make env_bool case-insensitive (handles False/Off) and align its comment
- gemm_full_benchmark.py: clarify in --devices help and resolve_devices docstring that a bare digit is a count; a single id needs the comma form (5,)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
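The --devices convention clarified here (a bare digit is a count, a trailing comma marks a single id) could be parsed roughly as below. The function name resolve_devices appears in the commit, but this signature, the `visible` argument and the error handling are assumptions made for the sketch.

```python
def resolve_devices(spec, visible):
    """Resolve a --devices value against the list of visible device ids.

    "all"  -> every visible device
    "2"    -> a count: the first two visible devices
    "5,"   -> the single device id 5 (the comma marks an id list)
    "0,3"  -> explicit ids 0 and 3
    """
    spec = spec.strip()
    if spec == "all":
        return list(visible)
    if "," in spec:
        ids = [int(tok) for tok in spec.split(",") if tok.strip()]
        unknown = [i for i in ids if i not in visible]
        if unknown:
            raise ValueError(f"devices not visible: {unknown}")
        return ids
    count = int(spec)  # a bare number is always a count, never an id
    if not 1 <= count <= len(visible):
        raise ValueError(f"count {count} out of range 1..{len(visible)}")
    return list(visible)[:count]


print(resolve_devices("5,", range(8)))
```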

The Stream-K bridge (#8136) was branched at #8123's first commit, so it lacked all subsequent regular-GEMM bridge improvements (arch-validated tile filtering, the develop merge + legacy gemm_universal retirement, benchmark-param/--verify work on the shared driver, README). Merge the current #8123 HEAD to pick those up; the Stream-K-specific analogues that live in the duplicated driver/worker/ctypes lib are ported in follow-up commits. Sole conflict: dispatcher/python/gemm_utils.py variant threading. Kept the Stream-K routing (_ctypes_source_name -> streamk_gemm_ctypes_lib.cpp, .name _streamk suffix, variant through codegen_args/expand_sweep) and adopted #8123's explanatory comment.

The Stream-K bridge keeps its own driver, worker and ctypes lib, so the regular-GEMM bridge improvements that landed on #8123 after this branch forked did not arrive via the merge. Port the Stream-K-specific analogues:
- streamk_gemm_ctypes_lib.cpp: benchmark knobs now default to old-TE's warmup=50/repeat=100 (was 3/10 -- a cold, un-ramped clock, the root of #8123's spurious "perf gap") and are env-overridable via CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. Unlike the regular path, rotating_count defaults to 1: the Atomic preprocess re-zeros only the original C buffer, so rotating C would corrupt the accumulation.
- streamk_gemm_full_benchmark.py: fan the (kernel x problem) work across every visible GPU (device-pinned HIP_VISIBLE_DEVICES workers, --devices, device CSV column), add the --verify/--verify-tol fp32-reference gate, and constrain --dtype/--layout to the supported fp16/rcr surface. Also fixes a latent proc-unbound error in the batch handler.
- run_one_streamk_gemm_kernel.py: add the fp32 numpy reference check (global max|out-ref|/max|ref|, verified/max_rel) behind --verify.
- README: document the Stream-K bridge driver/worker, flags, _streamk name suffix, fp16 Atomic tolerance, and the rotating_count divergence.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two Copilot findings on the Stream-K codegen carried in from #8094:
- 03_streamk_gemm_driver.cpp: parse M/N/K with std::stoll (not std::stoi) before narrowing to ck_tile::index_t; stoi throws std::out_of_range past INT_MAX, needlessly rejecting large GEMM sizes.
- unified_gemm_codegen.py (_launch_function_streamk): the Atomic reduction's per-iteration C reset zeroed args.M*args.N as a flat contiguous block, which skips elements when C has a padded leading dimension and corrupts the accumulation. Zero the used MxN region honoring stride_E via hipMemset2DAsync (CLayout-aware row/col-major), and check the HIP status instead of discarding it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…uction
Previously the stream-K codegen hard-coded the Atomic reduction strategy. This makes the reduction strategy a first-class config axis so linear and tree reductions can be generated, named, and selected alongside atomic:
- unified_gemm_codegen.py: add reduction_strategy to KernelConfig; encode it in key_name (redux_*) and KernelNaming.generate (atomic keeps the bare "_streamk" suffix for name parity, linear/tree are disambiguated); _launch_function_streamk now emits the config's StreamKReductionStrategy (the existing reset lambda already zeroes C for atomic vs the workspace for linear/tree); _get_configs_for_variant iterates strategies from a new streamk_config section, which is added to the default config (atomic, linear, tree).
- gemm_utils.py: GemmKernelConfig gains reduction_strategy, threaded into .name, to_codegen_json (per-kernel streamk_config so single-kernel codegen emits exactly the requested strategy) and expand_sweep (reduction-strategy sweep axis).
- ctypes_utils.py: reduction_strategy field on KernelConfig for end-to-end parity.
arch_specs.json intentionally unchanged: stream-K reuses the standard warp-tile combos and arch_filter reads no stream-K-specific keys, so adding them would be dead data.
Validated on gfx942 (MI300X): atomic/linear/tree each codegen + compile + run and pass fp32 verification (max_rel 3.1e-4) at 256x256x4096.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…rategy fields
First slice of moving Stream-K into the dispatcher core (registry-addressable), per the deep-core checklist. Additive and inert by default:
- KernelKey: new ReductionStrategy enum {None,Atomic,Linear,Tree}; Algorithm gains streamk / reduction_strategy / workspace. tie() includes them so the three strategies are distinct keys. encode_identifier() appends the Stream-K suffix ("_streamk" / "_streamk_linear" / "_streamk_tree") byte-for-byte with unified_gemm_codegen.py KernelNaming.generate(), guarded by algorithm.streamk so non-Stream-K identifiers are unchanged.
- Problem: streamk / reduction_strategy request fields + ProblemBuilder::stream_k().
Validated on gfx90a (hipcc 7.12): non-SK encode_identifier byte-identical; atomic/linear/tree suffixes correct; tie() distinguishes strategies.
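The suffix scheme described for encode_identifier() can be modeled in a few lines. This is a Python illustration of C++ behavior: the three suffix strings come from the commit, while the enum spelling and the `streamk_suffix` helper are hypothetical.

```python
from enum import Enum


class ReductionStrategy(Enum):
    NONE = "none"
    ATOMIC = "atomic"
    LINEAR = "linear"
    TREE = "tree"


def streamk_suffix(streamk, strategy):
    """Return the identifier suffix for a kernel key (sketch of the naming rule)."""
    if not streamk:
        return ""  # non-Stream-K identifiers stay unchanged
    if strategy is ReductionStrategy.ATOMIC:
        return "_streamk"  # atomic keeps the bare suffix for name parity
    if strategy in (ReductionStrategy.LINEAR, ReductionStrategy.TREE):
        return f"_streamk_{strategy.value}"
    raise ValueError("a Stream-K key needs a reduction strategy")


print(streamk_suffix(True, ReductionStrategy.TREE))
```

Keeping atomic on the bare suffix means names generated before the strategy axis existed remain valid.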

Add two non-pure virtuals so existing GEMM/FMHA/Conv instances compile unchanged: - get_workspace_size(Problem) -> bytes (default 0) - run(a,b,c,d_ptrs, void* workspace, problem, stream) overload whose default forwards to the existing no-workspace run(). The Dispatcher invokes these through a base KernelInstance* pointer (so the new overload is visible despite derived 6-arg run() overrides). The Stream-K backend (PR-C) overrides both to size and bind the reduction workspace. Validated on gfx90a (hipcc 7.12): a concrete instance overriding only the pre-existing pure virtuals compiles; default get_workspace_size==0 and the workspace-run forwards correctly via base pointer.

…tree codegen
Adds the C++ backend that lets Stream-K ride the registry, plus the reduction strategy codegen needed to generate the three variants on this branch.
- generated_tile_backend_streamk.hpp (NEW): GeneratedStreamKKernelInstance wraps a generated Stream-K kernel and builds ck_tile::StreamKHostArgs (the ABI-incompatible args the GemmHostArgs path could not). supports() gates on Problem.streamk + reduction_strategy so atomic/linear/tree coexist in the registry and the Dispatcher's first-fit selection picks the requested one. create_generated_streamk_kernel<> mirrors create_generated_tile_kernel<>.
- codegen: reduction_strategy axis (atomic/linear/tree) -> KernelConfig field, key_name redux_*, KernelNaming "_streamk"/"_streamk_linear"/"_streamk_tree" (matches KernelKey::encode_identifier from PR-A), per-strategy StreamKReductionStrategy in the generated launch, and a streamk_config sweep axis. (Ported from the bridge branch reduction-strategy work.)
PR-C keeps the generated launch's internal workspace/reset; PR-D relocates those to Dispatcher::run() via get_workspace_size()/the workspace-aware run().
Validated on gfx90a (hipcc 7.12): codegen emits 584 atomic + 584 linear + 584 tree headers with correct names; the backend device-compiles (22s) against a generated header and supports() accepts the matching strategy while rejecting the others and non-Stream-K problems.

…pace
Relocate the Stream-K reduction-workspace buffer from the per-call generated launch() to a grow-on-demand buffer owned by the Dispatcher, so a long-lived dispatcher stops paying a hipMalloc/hipFree on every invocation.
- codegen: hoist the StreamKGemmKernel type to struct scope and add GetWorkSpaceSize() + an external-workspace launch(args, cfg, workspace) overload. The existing 2-arg launch (internal DeviceMem) is unchanged so the bridge ctypes lib and the standalone 03 driver keep working.
- backend: override get_workspace_size() and the workspace-aware run(); the no-workspace run() delegates with a null buffer. The per-iteration reset stays in the backend (it needs CDataType + the reduction strategy).
- dispatcher: own a grow-on-demand workspace (raw void*/size_t to keep HIP out of the public header), size it via get_workspace_size(), and pass it through run_fused()/run_explicit(); free it in the destructor.
Atomic needs none (size 0 -> null -> internal path); linear/tree consume the owned buffer.
Validated on MI210/gfx90a: atomic/linear/tree all verify vs reference_gemm at unchanged perf, with linear/tree now running on the dispatcher-owned workspace.
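The grow-on-demand idea can be modeled on the host with a bytearray standing in for device memory. This is only an analogy to the C++ Dispatcher: the class name, the `acquire` method and the allocation counter are hypothetical.

```python
class GrowOnDemandWorkspace:
    """Keeps one buffer alive and reallocates only when a larger one is needed."""

    def __init__(self):
        self._buf = None
        self.allocations = 0  # counts stand-in "hipMalloc" calls

    def acquire(self, nbytes):
        """Return a buffer of at least nbytes, or None when none is required."""
        if nbytes == 0:
            return None  # e.g. atomic: size 0 -> null -> internal path
        if self._buf is None or len(self._buf) < nbytes:
            self._buf = bytearray(nbytes)  # stands in for free + hipMalloc
            self.allocations += 1
        return self._buf


ws = GrowOnDemandWorkspace()
first = ws.acquire(1024)
second = ws.acquire(512)   # smaller request reuses the existing buffer
third = ws.acquire(4096)   # larger request triggers one reallocation
print(ws.allocations)
```

Repeated calls with the same or smaller size cost no allocation, which is the saving the commit targets for a long-lived dispatcher.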

…river Add 04_streamk_registry_driver.cpp: a runnable proof of the full deep-core path (Registry::register_kernel -> Dispatcher::run -> first-fit supports() gate on reduction_strategy -> GeneratedStreamKKernelInstance::run -> generated launch -> verify vs reference_gemm). Unlike 03_streamk_gemm_driver.cpp, which calls SelectedKernel::launch() directly and bypasses the dispatcher, this exercises the registry selection and the Dispatcher-owned workspace. Selectable strategy via --strategy {atomic,linear,tree}. Validated on MI210/gfx90a for all three (distinct registry identifiers, each PASS).

…K backend The dispatcher-wrapper generator emitted ONE template for every variant: backends::GeneratedKernelInstance<KernelStruct> with no streamk/reduction_strategy on the key. For Stream-K that is wrong twice over -- the regular backend calls launch(GemmHostArgs,...) which the SK kernel struct does not have (so the aggregate register_all_kernels.hpp would not compile against SK), and the key omits the SK fields so encode_identifier() emits no _streamk suffix and atomic/linear/tree collide in the registry. Make the wrapper variant-aware: for STREAM_K configs include generated_tile_backend_streamk.hpp, set key.algorithm.streamk + reduction_strategy + workspace (and pad flags for identifier parity), and return create_generated_streamk_kernel<KernelStruct, KernelStruct::ADataType, ...>. All other variants are unchanged. Validated on MI210/gfx90a: a registry populated via the generated wrappers holds atomic+linear+tree side by side; Dispatcher::run() selects each by Problem::reduction_strategy and all three verify vs reference_gemm.

…are atomic reset
P2: GeneratedStreamKKernelInstance::supports() now ends with SelectedKernel::IsSupported(make_args(problem)) (a new generated static that runs MakeKernelArgs + IsSupportedArgument). A problem too small to partition across CUs is rejected during selection, so first-fit falls back to a non-Stream-K kernel instead of throwing std::runtime_error at launch.
P3: the atomic reduction reset zeroes C with a stride-aware hipMemset2DAsync (pitch = stride_E * sizeof(C), width = N * sizeof(C), height = M) instead of a flat hipMemsetAsync over M*N. Correct for a padded/strided C; identical coverage for the contiguous rcr case. Applied to both the internal and external-workspace launch overloads.
Validated on MI210/gfx90a: atomic/linear/tree still select + run + verify from a multi-kernel registry; valid small problems are accepted (no false-negatives).
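The difference between the flat and the stride-aware reset in P3 can be reproduced on a host-side list. This is a model in element units for a row-major C with a padded leading dimension; the helper names are hypothetical and no HIP call is involved.

```python
def zero_flat(buf, m, n):
    """Flat reset: clears the first m*n elements and ignores the row stride."""
    for i in range(m * n):
        buf[i] = 0


def zero_2d(buf, m, n, stride):
    """Stride-aware reset: clears n elements in each of m rows, like a 2D memset
    with pitch = stride, width = n, height = m (all in elements here)."""
    for row in range(m):
        start = row * stride
        buf[start:start + n] = [0] * n


def used_region(buf, m, n, stride):
    """Collect the m x n elements the GEMM actually writes."""
    return [buf[r * stride + c] for r in range(m) for c in range(n)]


M, N, STRIDE = 3, 4, 6  # two padding elements at the end of each row

flat = [1] * (M * STRIDE)
zero_flat(flat, M, N)

strided = [1] * (M * STRIDE)
zero_2d(strided, M, N, STRIDE)

print(sum(used_region(flat, M, N, STRIDE)))     # stale elements left by the flat reset
print(sum(used_region(strided, M, N, STRIDE)))  # none left by the 2D reset
```

With these sizes the flat reset wastes part of its coverage on padding and leaves the last row of C untouched, which is how stale partial sums would leak into the next timed iteration.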

The bridge dispatcher's tile-divisibility gate rejected any problem where M % TileM != 0 for every layout, returning status -2 ("No suitable kernel") at runtime even though the .so built fine. This wrongly excluded bf16 rcr/rrr kernels with a non-power-of-two TileM (e.g. 192) on standard shapes like 1024^3 -- cases Old-TE compiles, runs, and verifies as correct.
Root cause: supports() was layout-blind, while the underlying ck_tile::GemmKernel::IsSupportedArgument only constrains a dimension when an operand whose inner axis is that dimension participates without padding:
  RowMajor A -> K, ColMajor A -> M
  RowMajor B -> N, ColMajor B -> K
  RowMajor C -> N, ColMajor C -> M
So for rcr (RowMajor A & C) M is never gated, which is why Old-TE runs M=192 tiles on M-indivisible problems.
Make supports() compute require_m/n/k from the kernel key's A/B/C layouts so it mirrors IsSupportedArgument exactly (also honoring k_batch in the K grain). Anything it now lets through is still validated by the kernel's own IsSupportedArgument inside launch(), so the bridge stays a strict functional equivalent of Old-TE. Applied to both generated_tile_backend.hpp (the GEMM .so path) and the sibling tile_backend.hpp.
Validated on gfx942 (MI300X): 85 previously status-2 rcr/rrr bf16 192-tile .so now run at 1024^3 (Old-TE runs the same, verification correct); the 8 remaining rejects are tile N=192 cases that Old-TE also reports "Arguments not supported" at N=1024 -- parity preserved in both directions.
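The layout-to-dimension mapping in this commit can be turned into a small host-side check. This is a sketch under stated assumptions: no operand is padded, the K grain is taken as TileK * k_batch, and `gated_dims` and `supports` are illustrative names rather than the backend's API.

```python
# Inner (contiguous) axis per operand and layout, as listed in the commit.
INNER_AXIS = {
    ("A", "r"): "K", ("A", "c"): "M",
    ("B", "r"): "N", ("B", "c"): "K",
    ("C", "r"): "N", ("C", "c"): "M",
}


def gated_dims(layout):
    """Return the dims that must be tile-divisible for a layout such as "rcr".

    Each character is r (RowMajor) or c (ColMajor) for A, B, C in that order.
    Assumes no operand is padded.
    """
    a, b, c = layout
    return {INNER_AXIS[("A", a)], INNER_AXIS[("B", b)], INNER_AXIS[("C", c)]}


def supports(layout, problem, tile, k_batch=1):
    """Layout-aware divisibility gate (K grain = TileK * k_batch is assumed)."""
    m, n, k = problem
    tile_m, tile_n, tile_k = tile
    need = gated_dims(layout)
    if "M" in need and m % tile_m:
        return False
    if "N" in need and n % tile_n:
        return False
    if "K" in need and k % (tile_k * k_batch):
        return False
    return True


cube = (1024, 1024, 1024)
print(supports("rcr", cube, (192, 128, 64)))  # M is not gated for rcr
```

The same table explains the follow-up fix for crr: neither ColMajor A nor RowMajor B has K as its inner axis, so a TileK of 192 is acceptable there but not under an rcr key.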

…oding rcr
dispatcher_initialize() in gemm_ctypes_lib.cpp hardcoded the KernelKey layout to rcr (RowMajor/ColMajor/RowMajor) for every kernel. Now that supports() is layout-aware, that wrong key layout makes the dispatcher reject valid problems: a crr kernel does not gate K (neither A=ColMajor nor B=RowMajor has K as its inner axis), but with a hardcoded rcr key supports() applies rcr's K-gate and returns status -2 for TileK=192 problems (e.g. crr 64x64x192 at 1024^3) that Old-TE compiles, runs, and verifies (~87 TFLOPS).
Derive signature.layout_a/b/c from the force-included kernel's own ALayout/BLayout/CLayout types via std::is_same_v with tensor_layout::gemm::RowMajor. The key now matches the kernel, so the layout-aware gate is correct for all four layouts. Execution was already layout-correct (the kernel uses its own compile-time layouts); only the host-side selection metadata was wrong.
Validated on gfx942 (MI300X): crr 64x64x192 now runs on the bridge (93 TFLOPS), restoring parity with Old-TE.
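The layout-aware gate described in the two commits above can be sketched in Python as follows. This is an illustration under stated assumptions, not the generated C++: `required_dims` and `supports` are hypothetical names, padding is assumed to be off for every operand, and the K grain is assumed to be `TileK * k_batch`.

```python
def required_dims(layout: str) -> set:
    """Return the dimensions that must be tile-divisible for a 3-letter
    layout string such as "rcr" (A, B, C; r = RowMajor, c = ColMajor)."""
    a, b, c = layout
    return {
        "K" if a == "r" else "M",   # RowMajor A -> K, ColMajor A -> M
        "N" if b == "r" else "K",   # RowMajor B -> N, ColMajor B -> K
        "N" if c == "r" else "M",   # RowMajor C -> N, ColMajor C -> M
    }

def supports(layout, tile, problem, k_batch=1):
    """Check divisibility only on the dimensions the layout actually gates."""
    tile_m, tile_n, tile_k = tile
    m, n, k = problem
    checks = {
        "M": m % tile_m == 0,
        "N": n % tile_n == 0,
        "K": k % (tile_k * k_batch) == 0,   # assumed K grain
    }
    return all(checks[d] for d in required_dims(layout))

cube = (1024, 1024, 1024)
print(supports("rcr", (192, 128, 64), cube))   # M is not gated for rcr
print(supports("crr", (64, 64, 192), cube))    # K is not gated for crr
print(supports("rcr", (64, 64, 192), cube))    # a wrong rcr key applies the K gate
```

The last two calls show the failure mode of a hardcoded rcr key: the same 64x64x192 tile passes under its real crr layout and is rejected when the key claims rcr.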

The >=20% bridge-vs-old-TE perf gaps in the parity sweep are a harness artifact: the sweep timed the bridge in-process but timed old-TE via its separate standalone benchmark binary, which runs the byte-identical kernel at a lower sustained SCLK. Measured through one harness the gap is <1%. ab_same_harness.py removed that artifact but hardcoded the old-TE header dir to fp16/rcr. Derive it per stem as <base>/<dtype>/<layout> so one run covers rcr/rrr/ccr/crr and fp16+bf16, add a --stems-file/--csv resume-aware sweep mode, and use the median (not max) per point.

For a full ~2000-stem sweep on a single GPU: batch all shapes into one worker call per side (5x fewer process startups), cache the compiled old-TE .so, and add a parallel --build-only pre-pass so hipcc compilation uses all CPU cores while GPU measurement stays serial.

…eductionStrategy)
Close two review nits on the Stream-K drivers:
- Parse M/N/K with std::stoll instead of std::stoi in the 03/04 drivers so large GEMM dimensions no longer overflow int or make std::stoi throw on out-of-range input (Copilot nit).
- Add inline to_string(ReductionStrategy) in kernel_key.hpp and route the 04 driver through it, removing the driver-local strategy_name() duplicate so callers share one spelling that matches the codegen suffix scheme.

Adds dispatcher_test_streamk_registry, a GPU test that generates the three reduction-strategy kernels (atomic/linear/tree) from one tile config, builds the 04 registry driver once per strategy (each force-including its own header, since SkReductionStrategy is a compile-time constexpr), and asserts for each that the encode_identifier() suffix matches, the Dispatcher selects it by Problem::reduction_strategy, and the result verifies against the reference. This converts the previously manual deep-core validation into a regression-guarded CTest. It SKIPs (return 77) when no GPU or hipcc is present, so CPU-only CI is unaffected.

…IBRARY_PATH
meas()/meas_all() built the worker env without /opt/rocm/lib on LD_LIBRARY_PATH, so run_one_gemm_kernel.py failed to load every .so ("libamdhip64.so.7: cannot open shared object file") and every cell came back nan. Set it the same way ab_efficient_sweep.py does.
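A minimal sketch of the kind of environment fix described above, assuming the worker is launched with an explicit `env` dict. `worker_env` is a hypothetical helper, not the function in the sweep scripts; only the `/opt/rocm/lib` path comes from the commit message.

```python
import os

def worker_env(rocm_lib="/opt/rocm/lib", base=None):
    """Copy the environment and make sure rocm_lib is on LD_LIBRARY_PATH,
    so the worker process can resolve libamdhip64 when it loads a .so."""
    env = dict(os.environ if base is None else base)
    parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    if rocm_lib not in parts:
        parts.insert(0, rocm_lib)   # prepend so the ROCm runtime is found first
    env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return env

# Typical use: subprocess.run([...worker command...], env=worker_env())
```

Building the dict once and passing it to every worker launch keeps the two measurement sides on the same library search path.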

… guard)
The bridge-vs-old-TE A/B reported phantom regressions from two MEASUREMENT bugs, not real codegen gaps:
- ab_same_harness.py built the old-TE side WITHOUT the TE codegen flags the bridge (and real old-TE's own CMake) use, so -enable-post-misched defaulted back on and old-TE ran ~10-40% faster -> the bridge looked regressed when it is at parity. Now both sides build with identical flags.
- ab_efficient_sweep.py measured whatever libgemm_<stem>.so existed with no freshness check, so 3-day-old binaries built from an obsolete codegen showed up as -78%/+703% gaps. Added a guard: skip any .so older than its generated header (treated as missing) instead of reporting a phantom gap.
With both fixes the 41 former >15% outlier stems measure within +/-10% (median +0.01%); no bridge codegen regression exists.
Note: a separate, deliberately UNCOMMITTED perf change in gemm_utils.py (gate -enable-post-misched=0 on persistent) makes non-persistent large tiles ~9-40% faster; held back pending a broader persistent-kernel no-regression sweep.
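The freshness guard can be sketched with file modification times. This is an assumption about how such a check might look, not the code in ab_efficient_sweep.py; `is_fresh`, `demo`, and the file names are hypothetical.

```python
import os
import tempfile

def is_fresh(so_path, header_path):
    """True only if the .so exists and is at least as new as its generated
    header. A stale or absent .so is treated the same way: as missing."""
    if not (os.path.exists(so_path) and os.path.exists(header_path)):
        return False
    return os.path.getmtime(so_path) >= os.path.getmtime(header_path)

def demo():
    """Exercise the guard on throwaway files with controlled timestamps."""
    with tempfile.TemporaryDirectory() as d:
        header = os.path.join(d, "kernel.hpp")
        so = os.path.join(d, "libgemm_stem.so")
        for path in (header, so):
            open(path, "w").close()
        os.utime(header, (2000, 2000))
        os.utime(so, (1000, 1000))            # built before the header changed
        stale = is_fresh(so, header)
        os.utime(so, (3000, 3000))            # rebuilt after the header changed
        fresh = is_fresh(so, header)
        missing = is_fresh(os.path.join(d, "absent.so"), header)
    return stale, fresh, missing
```

A sweep would call `is_fresh` before measuring each stem and skip the point when it returns False, rather than timing a binary built from an older codegen.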

… driver
The standalone stream-K driver verified atomic results with the single-pass GEMM tolerance get_*_threshold<...>(K). Atomic reduction accumulates K-split partials directly into low-precision C (workspace size 0), incurring rounding error that grows with the split factor -- correct results were flagged FAIL on small-M/N, large-K shapes (e.g. 512x512x8192) where tiles < CUs.
Mirror tile_engine's calculate_rtol_atol (validation.hpp): derive kbatch from the kernel's tile partitioner (estimate_num_wgs_per_tile), widen atol/rtol with the split-K CDataType accumulation term, and take the max with the per-split tolerance. The driver and tile_engine now verify identically; the kernel is unchanged.

…gine
The standalone stream-K driver built its stream_config as {stream, true, 0, warmup, repeat}, leaving is_gpu_timer/flush_cache/rotating_count at defaults (flush_cache=false, rotating_count=1). The tile_engine benchmark instead times with flush_cache=true and rotating_count=1000, so the driver measured a warm-cache best case while tile_engine measured cold-cache -- the entire source of the reported dispatcher-vs-TE "performance gap" at low tile counts.
Add --timer/--flush_cache/--rotating_count (defaulting to the tile_engine values) and pass them through stream_config so both sides use identical timing methodology. A validating run still times a single cold shot, mirroring tile_engine's repeat_once_if_verify(); collect perf with a separate --validate 0 pass.

The 04 registry driver hardcoded the KernelKey signature to DataType::FP16 and an rcr layout, so fp8/bf8/bf16 Stream-K kernels registered under the wrong key and failed dispatch/identifier checks. Derive dtype_a/b/c/acc and layout tags from the generated kernel's actual A/B/C types via compile-time dtype_enum_of<T>()/layout_tag_of<Layout>() helpers (fp8/bf8 inputs accumulate in fp32 and write fp16 C, matching Tile Engine). Parametrize test_streamk_registry.py over fp16/bf16/fp8/bf8 (dtype-independent core objects built once; per-dtype codegen + build + verify with per-dtype identifier assertions). All four datatypes register, dispatch, and verify across atomic/linear/tree on gfx942 (MI300X).

Port #8136's Tile-Engine->Dispatcher Stream-K bridge onto the rewritten deep-core #8094 engine (KernelKey reduction fields, KernelInstance workspace virtuals, StreamK backend, Dispatcher-owned reduction workspace, registry + validation driver).
3-way merge over the shared stream_k ancestor; only the streamk launch emitter in unified_gemm_codegen.py and 03_streamk_gemm_driver.cpp conflicted -- both resolved to the deep-core side:
- codegen now emits the struct-scope Sk* kernel type + GetWorkSpaceSize + IsSupported, keeps the 2-arg internal-workspace launch the bridge ctypes lib calls, and adds the 3-arg dispatcher-owned-workspace launch.
- driver takes deep-core's stoll parse + apples-to-apples timing + validate cold shot.
Bridge ctypes lib still bypasses the registry and calls the 2-arg launch directly, so the bridge runs the exact deep-core kernels. Codegen smoke: atomic/linear/tree + regular gemm all generate cleanly (0 failed).

Build the Stream-K bridge .so without the dispatcher static lib and with TE-streamk-matching flags:
- compile flags per-variant: Stream-K matches tile_engine/ops/gemm_streamk CMake (only -Wno-* + --offload-compress, NO -mllvm codegen flags and NOT -enable-noalias-to-md-conversion=0 which is a gemm_universal-bridge concern). This keeps the A/B fair; the regular path is unchanged.
- link skips libck_tile_dispatcher.a for Stream-K (the ctypes lib launches the force-included kernel directly, no registry/dispatcher symbols), and the build guard no longer requires the static lib for Stream-K.
- ensure build/examples exists before hipcc writes there (the cmake build that normally creates it is skipped on the Stream-K path).
Validated on MI300X (gfx942): atomic/linear/tree fp16/rcr all build, run, and verify PASS (max_rel_err <8e-4) through the bridge GpuGemmRunner path.

Correct the Stream-K compile flags to match Tile Engine's gemm_streamk build verbatim (ground truth: a TE streamk build's compile_commands.json). The -mllvm codegen flags come from the composablekernel project-root add_compile_options applied globally to the TE benchmark -- they are NOT in the per-target options, so the earlier "minimal flags" assumption was wrong and would have produced a phantom A/B gap (different occupancy).
Flags now:
  -std=c++20 -fno-offload-uniform-block -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false --offload-compress
-enable-post-misched=0 is applied unconditionally (TE does so for streamk), and -enable-noalias-to-md-conversion=0 is not used (TE streamk omits it).

Add a bf16 codec to the bridge runner so bf16 Stream-K kernels can run through the same ctypes path (the ABI is void*+sizeof, so 2-byte bf16 shares the fp16 path; only the bit pattern differs). Dtype is inferred from the kernel name. ENCODE is round-to-nearest-even to bf16 bits; DECODE is bit-exact to device bf16_t so the numpy reference multiplies the same values the GPU does.
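For readers unfamiliar with the bit manipulation, a bf16 codec of this kind can be sketched in NumPy as below. It is a generic round-to-nearest-even implementation written for illustration, not the runner's code; `bf16_encode` and `bf16_decode` are hypothetical names, and NaN inputs are assumed not to occur (they get no special handling here).

```python
import numpy as np

def bf16_encode(x):
    """float32 -> bf16 bit patterns (uint16), rounding to nearest even.
    Elementwise ops keep the input's memory order (C or Fortran)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> 16) & 1                       # lowest bit that survives
    rounded = bits + np.uint32(0x7FFF) + lsb     # ties go to the even pattern
    return (rounded >> 16).astype(np.uint16)

def bf16_decode(b):
    """bf16 bit patterns (uint16) -> float32, exact: bf16 is the top 16 bits."""
    return (np.asarray(b).astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, 1.00390625, 1.01171875], dtype=np.float32)
codes = bf16_encode(x)        # the last two inputs are exact ties
reference = bf16_decode(codes)
```

Computing the NumPy reference from `bf16_decode(bf16_encode(x))` rather than from `x` means the host multiplies exactly the values the device sees.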

…ing)
Make the Stream-K bridge layout-generic instead of rcr-hardcoded, so all 4 A/B/C layouts (rcr/rrr/ccr/crr) work end to end:
- streamk_gemm_ctypes_lib.cpp: derive stride_A/B/C at compile time from the kernel's ALayout/BLayout/CLayout (RowMajor RxC -> ld=C, ColumnMajor -> ld=R) instead of the hardcoded K/K/N.
- generated_tile_backend_streamk.hpp (registry path): same layout-derived strides.
- GpuGemmRunner: read dtype AND layout off the kernel name; arrange each operand per layout (RowMajor=C-contiguous, ColumnMajor=F-contiguous); bf16 encode is now memory-order-preserving so column-major operands stay column-major.
- run_one_streamk_gemm_kernel.py: dtype/layout-aware A/B + reference (was fp16-only).
- streamk_gemm_full_benchmark.py: SUPPORTED_LAYOUTS now rcr/rrr/ccr/crr, SUPPORTED_DTYPES fp16+bf16 (fp8/bf8/int8 still need runner codecs).
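The stride rule in the first bullet (RowMajor RxC -> ld=C, ColumnMajor -> ld=R) can be written out as a small Python sketch. The real derivation happens at compile time in C++; `leading_dim` and `gemm_strides` are hypothetical names used only for illustration.

```python
def leading_dim(rows, cols, layout):
    """Leading dimension of a rows x cols matrix: the column count for
    RowMajor ("r"), the row count for ColumnMajor ("c")."""
    return cols if layout == "r" else rows

def gemm_strides(m, n, k, layout):
    """(stride_A, stride_B, stride_C) for A[M,K], B[K,N], C[M,N] given a
    3-letter layout string such as "rcr"."""
    la, lb, lc = layout
    return (
        leading_dim(m, k, la),
        leading_dim(k, n, lb),
        leading_dim(m, n, lc),
    )

for layout in ("rcr", "rrr", "ccr", "crr"):
    print(layout, gemm_strides(512, 256, 128, layout))
```

For rcr this reduces to (K, K, N), the values that were previously hardcoded.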

Extend the Tile-Engine -> Dispatcher Stream-K bridge (PR #8136) beyond fp16/bf16 to the FNUZ fp8 (E4M3) and bf8 (E5M2) formats used by gfx942/MI300.
GpuGemmRunner (dispatcher/python/gemm_utils.py):
- Port the tested FNUZ codecs from the sibling fp8 bridge (PR #8887): bit-exact decode tables + nearest-representable/saturating encode, carried as uint8 bit patterns (sizeof fp8_t/bf8_t == 1). Encode preserves operand C/F contiguity so the layout-generic _to_buf path holds for the new dtypes.
- run() now sizes the C buffer per get_output_dtype: fp8/bf8 -> fp16 store, int8 -> int32; bf16 still carried as raw uint16. fp16/bf16 paths unchanged.
- Arch guard: fp8/bf8 raise a clear error on a non-gfx942 GPU (gfx950/MI350 uses OCP fp8, a different bit layout) rather than silently mis-decoding.
- An int8 codec is included for when the engine supports it (see below).
Reference + surface:
- run_one_streamk_gemm_kernel.py verify reference is now dtype-aware (decode(encode(x)) per dtype; int8 = exact int32 matmul).
- streamk_gemm_full_benchmark.py SUPPORTED_DTYPES += fp8, bf8.
int8 is intentionally left OUT of SUPPORTED_DTYPES: it is blocked at the ck_tile engine, not the bridge. The int8 kernel codegens but fails to compile for every reduction strategy -- warp_gemm_dispatcher has no Dispatcher<int8,int8,float,32,32,16,...> specialization for the streamk CompV3 path, so the BlockUniversalGemmAsBsCr WarpGemm static_asserts fail. Matches the PR #8094 decision to leave int8 out.
GPU-validated on gfx942 (MI300X), 2048^3, both reduction + layout variants:
  fp8 atomic/linear/tree rcr: PASS (192/180/183 TFLOPS, max_rel <= 9.4e-4)
  bf8 atomic/linear/tree rcr: PASS (192/181/181 TFLOPS, max_rel <= 7.8e-4)
  fp8 ccr / bf8 crr (col-major): PASS (245/210 TFLOPS)
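As a rough illustration of what a bit-exact FNUZ decode table contains, the sketch below builds one from the format definition. It is not the table ported from PR #8887: `fnuz_decode_table` is a hypothetical helper, and it assumes the usual FNUZ conventions (exponent bias 2^(exp_bits-1), no infinities, no negative zero, and 0x80 as the single NaN encoding).

```python
import numpy as np

def fnuz_decode_table(exp_bits, man_bits):
    """256-entry float32 table mapping each 8-bit FNUZ code to its value.
    Use (4, 3) for fp8 E4M3 and (5, 2) for bf8 E5M2."""
    bias = 2 ** (exp_bits - 1)            # 8 for E4M3, 16 for E5M2
    table = np.empty(256, dtype=np.float32)
    for code in range(256):
        sign = -1.0 if code & 0x80 else 1.0
        exp = (code >> man_bits) & ((1 << exp_bits) - 1)
        frac = (code & ((1 << man_bits) - 1)) / (1 << man_bits)
        if code == 0x80:                  # the only NaN; there is no -0
            table[code] = np.nan
        elif exp == 0:                    # subnormal: no implicit leading 1
            table[code] = sign * frac * 2.0 ** (1 - bias)
        else:                             # normal: implicit leading 1
            table[code] = sign * (1.0 + frac) * 2.0 ** (exp - bias)
    return table

fp8 = fnuz_decode_table(4, 3)   # largest finite magnitude is fp8[0x7F]
bf8 = fnuz_decode_table(5, 2)   # largest finite magnitude is bf8[0x7F]
```

Decoding is then a single `table[codes]` lookup on the uint8 bit patterns, which keeps the host reference aligned with the values the device reads.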

therock-pr-bot · 2026-07-01T17:59:44Z

✅ All Checks Passed — Ready for Review

Check	Status	Details
🌿 Branch Name	✅ Pass	—
📝 PR Title/Description	✅ Pass	—
⛔ Forbidden Files	✅ Pass	—
🧪 Unit Test	✅ Pass	—
🔎 pre-commit	✅ Pass	—
🚫 Draft PR	🔜 To Be Enabled	—
🚩 Feature Flag	🔜 To Be Enabled	—
📊 Code Coverage	🔜 To Be Enabled	—
🤖 therock-pr-bot	✅ Pass	—

🎉 All checks passed! This PR is ready for review.

📖 Need help? See the Policy FAQ for details on every check and how to fix failures.

therock-pr-bot · 2026-07-01T18:04:56Z

🎉 All checks passed! This PR is ready for review.

Muhammed Ozturk and others added 30 commits June 5, 2026 02:48

[CK_TILE] GEMM bridge: move perf-gap diagnosis to Confluence

d68ac64

Remove dispatcher/parity_diag/regression/diagnose.md from the PR; the content now lives on Confluence (MLSE) as a child page under the fp16/rcr A/B sweep report: https://amd.atlassian.net/wiki/spaces/MLSE/pages/1737132108

ozturkosu added 15 commits June 17, 2026 22:19

ozturkosu requested review from a team as code owners July 1, 2026 17:54

ozturkosu mentioned this pull request Jul 1, 2026

[CK_TILE] Add Tile Engine -> Dispatcher bridge for Stream-K GEMM #8136

Closed

5 tasks

github-actions Bot added documentation project: composablekernel labels Jul 1, 2026

ozturkosu self-assigned this Jul 1, 2026

assistant-librarian Bot added the organization: ROCm label Jul 1, 2026

ozturkosu wants to merge 45 commits into develop from users/muozturk/ck-tile/dispatcher-te-bridge-streamk-gemm
