feat(ck-tile): stream-K GEMM TE to dispatcher bridge #9028

Open
ozturkosu wants to merge 45 commits into develop from users/muozturk/ck-tile/dispatcher-te-bridge-streamk-gemm

@ozturkosu

Re-opened from #8136 with a policy-compliant branch name (users/muozturk/ck-tile/dispatcher-te-bridge-streamk-gemm). Supersedes #8136.

Summary

Routes the stream_k GEMM variant through the same Tile Engine (TE) →
Dispatcher bridge already landed for regular GEMM (Phase 1, #8123) and grouped
GEMM (Phase 3, #8130). Goal of the overall effort: the Dispatcher is the single
source of truth for codegen/build/runtime, and TE only produces configs +
benchmarks.

This PR is stacked on muozturk/dispatcher-gemm-bridge (#8123) — please merge
that first. Its own diff is just two commits:

  1. [CK_TILE] Add stream_k variant to GEMM Dispatcher codegen (cherry-picked)
  2. [CK_TILE] Add Tile Engine -> Dispatcher bridge for Stream-K GEMM (this work)

What Stream-K needs that regular GEMM doesn't

Stream-K is a single-problem GEMM (one A/B/C, one M/N/K) with the same C ABI
as regular GEMM, so the Python side (GpuGemmRunner / GemmDispatcherLib /
GemmProblem) and the GPU worker are reused unchanged. The differences are
internal to the .so:

  • The generated launch has a Stream-K signature
    SelectedKernel::launch(const ck_tile::StreamKHostArgs&, const stream_config&),
    which allocates the reduction workspace internally (DeviceMem) and uses the
    Atomic reduction strategy.
  • The registry path (generated_tile_backend.hpp::run()) hard-codes the
    single-problem GemmHostArgs launch and won't compile against a Stream-K
    SelectedKernel. So the Stream-K ctypes lib bypasses the registry and calls
    SelectedKernel::launch(args, stream) directly, reporting the name from the
    KERNEL_NAME macro (same approach grouped uses).

Changes

New

  • dispatcher/bindings/ctypes/streamk_gemm_ctypes_lib.cpp — same single-problem
    C ABI (dispatcher_run_gemm(A,B,C,M,N,K,time_ms)); hipMalloc + copy A/B,
    memset C=0 (Atomic accumulates into C), build StreamKHostArgs with rcr
    strides (stride_A=K, stride_B=K, stride_C=N, k_batch=1), launch, copy C back.
    Returns 0 / -1 (HIP or throw) / -2 (kernel reports args unsupported).
  • tile_engine/ops/gemm/streamk_gemm_full_benchmark.py — 3-phase driver
    (expand configs → setup_multiple_gemm_dispatchers build → subprocess-isolated
    benchmark), mirroring gemm_full_benchmark.py with variant="stream_k".
  • tile_engine/ops/gemm/run_one_streamk_gemm_kernel.py — disposable GPU worker
    (identical to the regular worker since the ABI matches).
  • tile_engine/ops/gemm/gemm_streamk/configs/default_config.json — small sweep
    config (128x128x{32,64}, 2x2x1, 32x32x16, compv3/compv4, intrawave, cshuffle,
    pad true, persistent false) → 4 kernels.
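
As a rough illustration of the host-side contract described for streamk_gemm_ctypes_lib.cpp, the sketch below restates the rcr strides and the three return codes in Python. The helper names rcr_strides and describe_status are hypothetical and exist only for this example; the real logic lives in the C++ library.

```python
def rcr_strides(M, N, K):
    """Leading dimensions for rcr: row-major A (M x K), column-major B (K x N),
    row-major C (M x N). M is accepted only for symmetry; it does not affect
    any stride in this layout."""
    return {"stride_A": K, "stride_B": K, "stride_C": N, "k_batch": 1}


# Return codes as listed in the PR description.
STATUS_MEANING = {
    0: "ok",
    -1: "HIP error or thrown exception",
    -2: "kernel reports args unsupported",
}


def describe_status(code):
    """Map a dispatcher_run_gemm return code to a readable label."""
    return STATUS_MEANING.get(code, f"unknown status {code}")
```

A caller could use describe_status to separate a graceful "unsupported shape" result (-2) from a genuine failure (-1) when tallying benchmark outcomes.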

Modified

  • dispatcher/python/gemm_utils.py: _ctypes_source_name() selects
    streamk_gemm_ctypes_lib.cpp for variant=="stream_k" (in both
    _build_compile_jobs and setup_multiple_gemm_dispatchers); .name appends
    _streamk; variant is threaded into codegen_args and expand_sweep.
  • dispatcher/python/ctypes_utils.py — pass the requested variant to codegen
    --variants instead of hard-coding "standard".
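
The variant routing described in the two bullets above might look roughly like the following. This is a simplified sketch, not the code in gemm_utils.py or ctypes_utils.py: it only distinguishes stream_k from everything else, and the function codegen_variant_args is a hypothetical name.

```python
def ctypes_source_name(variant):
    """Pick the ctypes library source for a GEMM variant.

    Assumption: only two sources are modelled here, the Stream-K one and
    the regular gemm_ctypes_lib.cpp.
    """
    if variant == "stream_k":
        return "streamk_gemm_ctypes_lib.cpp"
    return "gemm_ctypes_lib.cpp"


def codegen_variant_args(variant="standard"):
    """Forward the requested variant to codegen instead of hard-coding it.

    The default stays "standard" so callers that pass nothing are unaffected.
    """
    return ["--variants", variant]
```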

Validation (gfx942 / MI300X, fp16 / rcr)

Numeric parity vs a numpy fp32 reference (A.f32 @ B.f32). Stream-K's Atomic
reduction does multiple fp16 atomic-adds (one per K-split partial) vs
regular/grouped's single fp32→fp16 store, so it is inherently noisier; tolerance
is widened to max_rel ≤ 2.5e-3, frob_rel ≤ 1.5e-3 (regular/grouped use 5e-4).
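
For readers who want to reproduce the check, here is a minimal pure-Python sketch of the two error metrics. It assumes max_rel is the global metric max|out-ref|/max|ref| used by the --verify gate and that frob_rel is the Frobenius norm of the error divided by the Frobenius norm of the reference; the PR text does not spell out the second definition, so treat it as an assumption. Inputs are flat sequences of floats.

```python
import math


def max_rel(out, ref):
    """Global relative error: max|out - ref| / max|ref|."""
    max_err = max(abs(o - r) for o, r in zip(out, ref))
    return max_err / max(abs(r) for r in ref)


def frob_rel(out, ref):
    """Assumed definition: ||out - ref||_F / ||ref||_F."""
    num = math.sqrt(sum((o - r) ** 2 for o, r in zip(out, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den


def within_streamk_tolerance(out, ref, max_tol=2.5e-3, frob_tol=1.5e-3):
    """Apply the widened Stream-K tolerances quoted above."""
    return max_rel(out, ref) <= max_tol and frob_rel(out, ref) <= frob_tol
```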

Shape (M,N,K)         status  TFLOPS  max_rel  frob_rel  result
1024 x 1024 x 1024    0       23.1    1.10e-3  5.47e-4   PASS
2048 x 2048 x 2048    0       129.5   7.02e-4  3.72e-4   PASS
512 x 768 x 4096      0       24.6    2.01e-3  8.19e-4   PASS
256 x 256 x 256       0       2.1     3.44e-4  2.08e-4   PASS
4096 x 4096 x 1024    0       149.8   5.89e-4  2.65e-4   PASS

Full TE driver run: 4 kernels x 4 problems = 16/16 OK, 0 failures. The
default problem set targets Stream-K's sweet spot (squares plus a large-K skinny shape):

  • 1024 x 1024 x 1024
  • 2048 x 2048 x 2048
  • 4096 x 4096 x 4096
  • 512 x 512 x 8192 (large-K skinny)

All status 0, positive TFLOPS, nonzero output. Name parity holds end-to-end:
the runtime name reported by each .so equals
GemmKernelConfig(variant="stream_k").name, ending in _streamk.

Unsupported-shape handling: a tiny 257^3 problem is correctly reported as
unsupported by the kernel (status -2, too few tiles to partition across CUs)
and surfaced gracefully by the bridge — not a crash.

Test plan

  • unified_gemm_codegen.py ... --variants stream_k emits a *_streamk.hpp
    whose stem == GemmKernelConfig(variant="stream_k").name
  • setup_multiple_gemm_dispatchers builds the Stream-K config set → .so
    compiles & links against streamk_gemm_ctypes_lib.cpp
  • Numeric parity (table above) passes under the fp16 Atomic tolerance
  • Full driver run 16/16 OK; name parity verified
  • Unsupported tiny shape returns status -2 gracefully

Next

Land #8123, then this; afterwards delete the legacy tile_engine/ops/gemm_streamk/ machinery (Phase 4).


Update 2026-06-12 — brought current with #8123 + Copilot fixes

This branch had forked at #8123's first commit, so it lacked every later
regular-GEMM bridge improvement. Merged the current #8123 HEAD and ported the
Stream-K-specific analogues (the Stream-K bridge keeps its own driver, worker and
ctypes lib, so those fixes do not arrive via the merge):

  • Benchmark-param parity / perf-methodology fix (streamk_gemm_ctypes_lib.cpp):
    benchmark knobs defaulted to warmup=3/repeat=10 — a cold, un-ramped clock, the
    root cause of the regular bridge's spurious "perf gap." Now default to old-TE's
    warmup=50/repeat=100, env-overridable via
    CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. rotating_count stays 1 for
    Stream-K: the Atomic preprocess re-zeros only the original C buffer, so rotating
    C would leave rotated copies un-zeroed and corrupt the accumulation.
  • --verify correctness gate (driver + worker): opt-in fp32 numpy reference
    check (global max|out-ref|/max|ref|, verified/max_rel in the CSV); a
    mismatch counts as a failure.
  • Multi-GPU benchmarking (driver): fans the (kernel × problem) work across all
    visible GPUs via device-pinned HIP_VISIBLE_DEVICES workers (--devices, device
    CSV column); also fixes a latent proc-unbound error in the batch handler.
  • --dtype/--layout guards (driver): constrained to the supported fp16/rcr
    surface so a mismatch fails fast.
  • Copilot codegen nits: std::stoi → std::stoll for M/N/K in
    03_streamk_gemm_driver.cpp; stride-aware C zeroing via hipMemset2DAsync
    (CLayout-aware, checked HIP status) in _launch_function_streamk.
  • README: new Stream-K bridge subsection.
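
The env-overridable benchmark knobs can be pictured with the Python sketch below. The actual parsing happens in C++ inside streamk_gemm_ctypes_lib.cpp; this is only an illustration. It assumes the shorthand CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING expands to four variables sharing the CK_TILE_BENCH_ prefix, and the flush default of True is an assumption borrowed from the old-TE defaults rather than something stated for Stream-K.

```python
import os


def env_int(env, name, default):
    """Read an integer knob, falling back to the default when unset or empty."""
    raw = env.get(name)
    return int(raw) if raw not in (None, "") else default


def env_bool(env, name, default):
    """Case-insensitive boolean knob: 0/false/off/no disable it."""
    raw = env.get(name)
    if raw is None or raw == "":
        return default
    return raw.strip().lower() not in ("0", "false", "off", "no")


def streamk_bench_knobs(env=os.environ):
    """Benchmark knobs for the Stream-K bridge with old-TE style defaults."""
    return {
        "warmup": env_int(env, "CK_TILE_BENCH_WARMUP", 50),
        "repeat": env_int(env, "CK_TILE_BENCH_REPEAT", 100),
        "flush_cache": env_bool(env, "CK_TILE_BENCH_FLUSH", True),  # assumed default
        # Stays 1 by default: rotated C copies would not be re-zeroed.
        "rotating_count": env_int(env, "CK_TILE_BENCH_ROTATING", 1),
    }
```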

Validation status: DONE on gfx942/MI300X (ctr-cx64-mi300x-4, enroot container). Bridge build+benchmark+--verify on gemm_streamk/configs/default_config.json = 16/16 OK, all verified (max_rel ≤ 3.0e-3, fp16 atomic tol), name parity holds. Bridge-vs-Old-TE parity (perf + correctness, byte-identical device kernel) posted as a comment with the full table + streamk_bridge_oldTE.csv.

Muhammed Ozturk and others added 30 commits June 5, 2026 02:48
Add the stream-K GEMM variant to the unified GEMM dispatcher codegen the
dispatcher way: a single-GEMM launch(args, stream) that allocates the
reduction workspace internally via DeviceMem (GetWorkSpaceSize /
SetWorkSpacePointer), zeroes it, and launches StreamKKernel with an
atomic-reduction preprocess that resets C between timed iterations. No
external workspace pointer (not the Tile Engine way).

- arch_filter.py: add OperatorType.GEMM_STREAMK + tile constraints.
- unified_gemm_codegen.py: add GemmVariant.STREAM_K, CLI --variants
  stream_k, naming, includes, _launch_function_streamk, variant->operator
  map, cshuffle-only config selection, and A/B/CLayout export in the
  CK_TILE_SINGLE_KERNEL_INCLUDE block.
- examples/gemm/cpp/03_streamk_gemm_driver.cpp: standalone single-kernel
  driver that calls SelectedKernel::launch and verifies vs reference_gemm.

Parity vs Tile Engine on MI300X (gfx942), fp16 rcr atomic
128x128x64_2x2x1_32x32x16, 3840x4096x2048, warmup=10/repeat=50:
dispatcher 0.242 ms / 266 TFLOPS PASS vs TE 0.24 ms / 266 TFLOPS correct.
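
The quoted throughput can be sanity-checked with the conventional 2*M*N*K FLOP count for a GEMM. The sketch below assumes that convention (the commit message does not state the formula) and uses a hypothetical helper name.

```python
def gemm_tflops(M, N, K, time_ms):
    """TFLOPS for one GEMM, counting one multiply and one add per MAC."""
    flops = 2.0 * M * N * K
    return flops / (time_ms * 1e-3) / 1e12


# 3840 x 4096 x 2048 in 0.242 ms, as reported for the dispatcher path.
print(round(gemm_tflops(3840, 4096, 2048, 0.242)))
```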
Introduce the shared-config bridge that lets Tile Engine drive the GEMM
dispatcher the same way FMHA and grouped conv already do: one config
dataclass owned by the dispatcher, imported by Tile Engine, with no
translator between two vocabularies.

- dispatcher/python/gemm_utils.py: GemmKernelConfig (the common contract;
  .name mirrors the codegen KERNEL_NAME byte-for-byte), GemmProblem,
  GemmDispatcherLib, GpuGemmRunner, setup_multiple_gemm_dispatchers
  (codegen + hipcc -> .so paths, CPU-only/parallel), and expand_sweep.
- dispatcher/bindings/ctypes/gemm_ctypes_lib.cpp: add the indexed
  multi-kernel ABI dispatcher_get_kernel_name_at(index, buf, size);
  legacy single-kernel dispatcher_get_kernel_name retained.
- tile_engine/ops/gemm/gemm_full_benchmark.py: 3-phase TE driver
  (compile -> load problems -> benchmark) that generates no binaries.
- tile_engine/ops/gemm/run_one_gemm_kernel.py: disposable GPU worker
  for subprocess fault isolation.

Scope: regular GEMM, fp16, rcr (Phase 1). Name parity verified
end-to-end (config.name == generated .hpp stem == runtime registry name).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add the stream-K GEMM variant to the unified GEMM dispatcher codegen the
dispatcher way: a single-GEMM launch(args, stream) that allocates the
reduction workspace internally via DeviceMem (GetWorkSpaceSize /
SetWorkSpacePointer), zeroes it, and launches StreamKKernel with an
atomic-reduction preprocess that resets C between timed iterations. No
external workspace pointer (not the Tile Engine way).

- arch_filter.py: add OperatorType.GEMM_STREAMK + tile constraints.
- unified_gemm_codegen.py: add GemmVariant.STREAM_K, CLI --variants
  stream_k, naming, includes, _launch_function_streamk, variant->operator
  map, cshuffle-only config selection, and A/B/CLayout export in the
  CK_TILE_SINGLE_KERNEL_INCLUDE block.
- examples/gemm/cpp/03_streamk_gemm_driver.cpp: standalone single-kernel
  driver that calls SelectedKernel::launch and verifies vs reference_gemm.

Parity vs Tile Engine on MI300X (gfx942), fp16 rcr atomic
128x128x64_2x2x1_32x32x16, 3840x4096x2048, warmup=10/repeat=50:
dispatcher 0.242 ms / 266 TFLOPS PASS vs TE 0.24 ms / 266 TFLOPS correct.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Routes the stream_k GEMM variant through the same TE->Dispatcher bridge as
regular GEMM (Phase 1) and grouped GEMM (Phase 3). Stream-K is a single-problem
GEMM with the same C ABI as regular GEMM, so the Python runner side is reused
unchanged; only the .so internals and variant routing differ.

- streamk_gemm_ctypes_lib.cpp (new): same single-problem C ABI
  (dispatcher_run_gemm) but builds a ck_tile::StreamKHostArgs and calls
  SelectedKernel::launch(args, stream) directly, bypassing the registry (whose
  generated backend hard-codes the GemmHostArgs launch signature). The launch
  allocates the Atomic-reduction workspace internally; C is zeroed per run.
- gemm_utils.py: _ctypes_source_name() selects streamk_gemm_ctypes_lib.cpp for
  variant "stream_k"; .name appends _streamk; variant threaded through
  codegen_args and expand_sweep.
- ctypes_utils.py: pass the requested variant to codegen --variants instead of
  hard-coding "standard".
- TE driver/worker/config (new): streamk_gemm_full_benchmark.py,
  run_one_streamk_gemm_kernel.py, gemm_streamk/configs/default_config.json.

Validated end-to-end on gfx942/MI300X: full driver run 16/16 OK (4 kernels x 4
problems), name parity holds (.so name == config .name, ends _streamk). Numeric
parity vs fp32 reference passes under an fp16 Atomic-reduction tolerance
(max_rel <= 2.5e-3) which is wider than regular/grouped because Atomic does
multiple fp16 atomic-adds per K-split. Tiny problems (e.g. 257^3) are correctly
reported unsupported (status -2) by the kernel and surfaced gracefully.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
setup_multiple_gemm_dispatchers built its per-config codegen args without
the config's variant, and _generate_single_kernel_subprocess hard-coded
--variants standard. A GemmKernelConfig(variant='preshuffle'/'multi_d')
passed to the bridge would therefore be code-generated as a standard
kernel, while its name (and the hpp_glob_pattern derived from it) still
carried the variant suffix -- so the lookup could never match the emitted
header. Pass the variant through; the subprocess default stays "standard"
so all existing callers are unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…dler

If subprocess.Popen() itself raises, the generic except handler referenced
proc before it was bound, masking the real error with an UnboundLocalError.
Initialize proc = None before the try; the handler already guards with
`if proc and proc.poll() is None`.
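
The fix follows a common pattern, sketched here in isolation. The function name run_batch and the injectable popen parameter are hypothetical conveniences for this example and are not taken from the driver.

```python
import subprocess


def run_batch(cmd, popen=subprocess.Popen, timeout=60):
    """Run one worker command, reporting the real error if the spawn fails."""
    proc = None  # bound before the try so the handler can always inspect it
    try:
        proc = popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        proc.communicate(timeout=timeout)
        return {"ok": proc.returncode == 0, "error": None}
    except Exception as exc:
        # Without the proc = None initialisation, a failing popen() would
        # raise UnboundLocalError here and hide the original exception.
        if proc and proc.poll() is None:
            proc.kill()
            proc.wait()
        return {"ok": False, "error": exc}
```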

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot review item: the driver exposed unrestricted --dtype/--layout while
the Phase-1 worker hard-codes fp16 inputs and an rcr (column-major B) host
transpose. Passing e.g. --dtype bf16 would codegen bf16 kernels but feed them
fp16 data, silently benchmarking the wrong thing. Add SUPPORTED_DTYPES/
SUPPORTED_LAYOUTS and wire them into argparse choices so a mismatch fails fast.
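
A minimal sketch of such a guard, assuming the supported surface is exactly fp16/rcr as stated; build_parser is a hypothetical name and the real driver defines many more options.

```python
import argparse

SUPPORTED_DTYPES = ("fp16",)
SUPPORTED_LAYOUTS = ("rcr",)


def build_parser():
    """Restrict --dtype/--layout to what the worker actually feeds the kernel."""
    parser = argparse.ArgumentParser(description="GEMM bridge driver (sketch)")
    parser.add_argument("--dtype", choices=SUPPORTED_DTYPES, default="fp16")
    parser.add_argument("--layout", choices=SUPPORTED_LAYOUTS, default="rcr")
    return parser
```

With choices in place, argparse rejects an unsupported value such as bf16 before any codegen or benchmarking starts.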

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses the three review items on the TE->Dispatcher GEMM bridge driver,
scoped to this foundation PR's fp16/rcr surface (bf16/layouts follow in the
#8190/#8191 stack):

1. Example configs to sweep
   - gemm_full_benchmark.py defaults to the selected variant's
     configs/default_ci_config.json (small CI sweep) when no config is
     passed, and to configs/example_problems.json when --problems is
     omitted; configs/default_config.json remains the full sweep.
   - New gemm_universal/configs/example_problems.json (square / rectangular
     / large M,N,K). Nightly-test JSON drops into the same configs/ dir --
     no driver change needed.

2. Multi-GPU launch in parallel (supersedes grouped_conv's serial-GPU design)
   - Phase 3 fans the (kernel x problem) work across every visible GPU: one
     worker thread per device pulls batches from a shared queue and spawns a
     disposable subprocess pinned with HIP_VISIBLE_DEVICES, so an N-GPU box
     runs ~Nx faster while keeping per-batch fault isolation.
   - Devices auto-detected (HIP_VISIBLE_DEVICES, then rocm-smi/amd-smi);
     override with --devices (count, explicit ids, or all).

3. Variant organization + README + deprecation note
   - --variant selects the per-variant configs/ directory.
   - New README "Dispatcher Bridge Workflow" section: scripts, per-variant
     config layout, run examples, multi-GPU explanation, supported surface
     (fp16/rcr here), and a deprecation note for the legacy
     *_instance_builder.py generators.

Driver --dtype/--layout choices stay fp16/rcr to match this PR's dispatcher
host path; run_one_gemm_kernel.py (fp16 host gen) is unchanged.
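
The multi-GPU fan-out in item 2 can be sketched as one thread per device draining a shared queue. This is an illustration of the scheduling shape only: fan_out and its run_batch callback are hypothetical, and in the real driver the callback would spawn a disposable subprocess with the pinned environment.

```python
import os
import queue
import threading


def fan_out(batches, devices, run_batch):
    """Distribute batches across devices; run_batch(batch, env) does the work.

    Each worker thread passes an environment whose HIP_VISIBLE_DEVICES is
    pinned to its own device.
    """
    work = queue.Queue()
    for batch in batches:
        work.put(batch)
    results = []
    lock = threading.Lock()

    def worker(device):
        env = dict(os.environ, HIP_VISIBLE_DEVICES=str(device))
        while True:
            try:
                batch = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this device is done
            outcome = run_batch(batch, env)
            with lock:
                results.append((device, batch, outcome))

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because devices pull work rather than receiving a fixed share, a slow batch on one GPU does not stall the others.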
Adds a "Variant scope" section clarifying that the bridge is one shared,
variant-aware driver (not per-variant driver copies), that only gemm_universal
is wired and validated through the bridge on this PR, and that the
gemm_multi_d/gemm_preshuffle/grouped_gemm configs/ dirs are scaffolding
following the per-variant convention -- not yet working support. Notes that
grouped GEMM and stream-K are separate bridge efforts (#8136 stream-K).
…layout

Phase 4: remove the legacy standalone regular-GEMM build path and reorganize
the bridged path to mirror the merged fmha/ and grouped_conv/ bridges.

- Delete the gemm_universal standalone generator/benchmark/profiler/CMake and
  the dormant test/ck_tile/gemm_tile_engine suite (its only other consumer).
- Promote gemm_universal/configs/ to the op-root configs/ (flat), matching the
  fmha/grouped_conv convention; remove the now-empty gemm_universal/ folder.
- Keep the bridge driver + worker at the gemm op root (gemm_full_benchmark.py,
  run_one_gemm_kernel.py); VARIANT_CONFIGS[gemm_universal] -> configs.
- Drop gemm_universal from the sampling foreach loops and add_subdirectory.
- Refresh README (folder layout, running examples, removal note).

Shared harness (gemm_instance_builder.py, gemm_validation_utils.py, gemm C++
harness) and the not-yet-bridged variants (gemm_multi_d, gemm_preshuffle,
grouped_gemm) are unchanged. Dispatcher untouched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s gate

Address quality-check findings on the Tile-Engine -> Dispatcher GEMM bridge:

- default_ci_config.json warp_tile [16,16,32] -> [32,32,16]; the old value is
  rejected by the gfx942/fp16 validator, so the documented default command
  (`python gemm_full_benchmark.py`) expanded to 0 kernels and aborted. Now it
  expands to 16 valid configs and benchmarks 80/80 OK.
- Add an opt-in `--verify` (with `--verify-tol`, default 2e-2) path: the worker
  compares each output against an fp32 numpy reference using the global metric
  max|out-ref|/max|ref|, results read VERIFY/MISMATCH, a mismatch counts as a
  failure, and max_rel/verified are written to the CSV. Previously an OK status
  meant liveness (non-zero output), not correctness.
- README: document the default-vs-correctness distinction and the new flag.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ples perf

The bridge hardcoded warmup=5/repeat=10/flush_cache=false/rotating_count=1 in
the generated-kernel stream_config and ignored the requested warmup/repeat. Old
TE defaults to 50/100/flush=true/rotating=1000, so bridge-vs-old-TE comparisons
measured different windows (the source of the ~13% "gap vs default" in PR #8123,
and the tiny 5-iter warmup left the GPU clock un-ramped, producing the spurious
2048^3 dip). Default to the old-TE values and allow per-run override via
CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. Device code is unchanged; parity
re-verified (12/12, max_rel <= 3.82e-4).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…duler

The codegen arch filter hard-coded pipeline=compv4 / scheduler=intrawave when
validating tile geometry, ignoring each config's actual traits. Since compv4
has the strictest MFMA constraints, tiles legal under mem/compv3 were wrongly
rejected -- collapsing the generated fp16/rcr set from ~1520 to 512 kernels
(compv3 and mem each decimated ~5x; compv4 roughly preserved).

Thread the trait's real pipeline/scheduler into _is_tile_arch_valid at both
call sites; the tile pre-filter now keeps a tile if it is legal under any
configured pipeline/scheduler, with the precise per-trait check deferred to
_get_configs_for_variant. Verified on the 6144 fp16/rcr config population:
emitted kernels 512 -> 1520 (compv3 464, compv4 128, mem 928), and a
previously-rejected compv3 64x64x192 config now generates a header end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…artifact, not a speedup)

The sweep's >=20% "bridge faster" cells (all compv4/intrawave/1024^3) are NOT a
bridge speedup. Proven on MI300X: the device kernel is byte-identical (same
ck_tile::kentry<1,GemmKernel<...>> symbol, rocprof), and through any uniform
harness it runs at the same speed on both sides.

Ruled out empirically: kernel, compiler/flags (4 toolchain rebuilds incl.
clang++-HIP with old-TE flags all give ~189), all bench knobs
(warmup/repeat/flush/rotating/timer), allocation/placement (DeviceMem, 4GB
decoys), and stale timing headers (byte-identical across trees). rocprof
confirms the slowdown is real device time (13.78us vs 11.34us): old TE's
*standalone benchmark binary* runs the identical kernel ~18-20% slower purely
due to that process's GPU clock/execution state (+8% stall cycles under PMC,
plus ~13% lower sustained SCLK).

Fix:
- ab_same_harness.py: apples-to-apples A/B that builds the old-TE kernel into a
  .so and runs BOTH it and the bridge .so through the SAME worker. Gap collapses
  to ~+/-0.5% at 1024^3 (was +20..+24% vs the standalone binary). Proof in
  ab_same_harness.out.
- diagnose.md sec.4 rewritten: the prior hipcc-vs-clang++ toolchain theory is
  disproven and replaced with the evidence above.
- generated_tile_backend.hpp: correct the misleading comment that claimed
  matching bench knobs makes bridge-vs-old-TE "apples-to-apples".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resolve CMakeLists.txt conflict in tile_engine/ops/gemm: keep develop's
expanded op list and new add_subdirectory entries while preserving this
branch's retirement of legacy gemm_universal (dropped from both budget
foreach loops and from add_subdirectory).
Remove dispatcher/parity_diag/regression/diagnose.md from the PR; the
content now lives on Confluence (MLSE) as a child page under the fp16/rcr
A/B sweep report:
https://amd.atlassian.net/wiki/spaces/MLSE/pages/1737132108
- ab_same_harness.py: derive ROOT from __file__ and take old-TE header dir
  from OLD_TE_GEN env var (was hard-coded dev paths); drop unused statistics import
- generated_tile_backend.hpp: make env_bool case-insensitive (handles
  False/Off) and align its comment
- gemm_full_benchmark.py: clarify in --devices help and resolve_devices
  docstring that a bare digit is a count; a single id needs the comma form (5,)
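
The --devices convention (bare digit is a count, comma form is an explicit id list) could be parsed along these lines. This is a sketch, not the driver's resolve_devices: the detected list is passed in rather than probed, and taking the first N detected devices for a count is an assumption.

```python
def resolve_devices(spec, detected):
    """Turn a --devices value into a list of device ids.

    "all"  -> every detected device
    "2"    -> a count: the first two detected devices (assumed policy)
    "5,"   -> the single device id 5
    "1,3"  -> explicit ids 1 and 3
    """
    spec = spec.strip()
    if spec == "all":
        return list(detected)
    if "," in spec:
        return [int(tok) for tok in spec.split(",") if tok.strip()]
    return list(detected)[: int(spec)]
```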

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Stream-K bridge (#8136) was branched at #8123's first commit, so it
lacked all subsequent regular-GEMM bridge improvements (arch-validated
tile filtering, the develop merge + legacy gemm_universal retirement,
benchmark-param/--verify work on the shared driver, README). Merge the
current #8123 HEAD to pick those up; the Stream-K-specific analogues that
live in the duplicated driver/worker/ctypes lib are ported in follow-up
commits.

Sole conflict: dispatcher/python/gemm_utils.py variant threading. Kept
the Stream-K routing (_ctypes_source_name -> streamk_gemm_ctypes_lib.cpp,
.name _streamk suffix, variant through codegen_args/expand_sweep) and
adopted #8123's explanatory comment.
The Stream-K bridge keeps its own driver, worker and ctypes lib, so the
regular-GEMM bridge improvements that landed on #8123 after this branch
forked did not arrive via the merge. Port the Stream-K-specific analogues:

- streamk_gemm_ctypes_lib.cpp: benchmark knobs now default to old-TE's
  warmup=50/repeat=100 (was 3/10 -- a cold, un-ramped clock, the root of
  #8123's spurious "perf gap") and are env-overridable via
  CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. Unlike the regular path,
  rotating_count defaults to 1: the Atomic preprocess re-zeros only the
  original C buffer, so rotating C would corrupt the accumulation.
- streamk_gemm_full_benchmark.py: fan the (kernel x problem) work across
  every visible GPU (device-pinned HIP_VISIBLE_DEVICES workers, --devices,
  device CSV column), add the --verify/--verify-tol fp32-reference gate, and
  constrain --dtype/--layout to the supported fp16/rcr surface. Also fixes a
  latent proc-unbound error in the batch handler.
- run_one_streamk_gemm_kernel.py: add the fp32 numpy reference check
  (global max|out-ref|/max|ref|, verified/max_rel) behind --verify.
- README: document the Stream-K bridge driver/worker, flags, _streamk name
  suffix, fp16 Atomic tolerance, and the rotating_count divergence.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two Copilot findings on the Stream-K codegen carried in from #8094:

- 03_streamk_gemm_driver.cpp: parse M/N/K with std::stoll (not std::stoi)
  before narrowing to ck_tile::index_t; stoi throws std::out_of_range past
  INT_MAX, needlessly rejecting large GEMM sizes.
- unified_gemm_codegen.py (_launch_function_streamk): the Atomic reduction's
  per-iteration C reset zeroed args.M*args.N as a flat contiguous block,
  which skips elements when C has a padded leading dimension and corrupts
  the accumulation. Zero the used MxN region honoring stride_E via
  hipMemset2DAsync (CLayout-aware row/col-major), and check the HIP status
  instead of discarding it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…uction

Previously the stream-K codegen hard-coded the Atomic reduction strategy. This
makes the reduction strategy a first-class config axis so linear and tree
reductions can be generated, named, and selected alongside atomic:

- unified_gemm_codegen.py: add reduction_strategy to KernelConfig; encode it in
  key_name (redux_*) and KernelNaming.generate (atomic keeps the bare "_streamk"
  suffix for name parity, linear/tree are disambiguated); _launch_function_streamk
  now emits the config's StreamKReductionStrategy (the existing reset lambda
  already zeroes C for atomic vs the workspace for linear/tree);
  _get_configs_for_variant iterates strategies from a new streamk_config section,
  which is added to the default config (atomic, linear, tree).
- gemm_utils.py: GemmKernelConfig gains reduction_strategy, threaded into .name,
  to_codegen_json (per-kernel streamk_config so single-kernel codegen emits exactly
  the requested strategy) and expand_sweep (reduction-strategy sweep axis).
- ctypes_utils.py: reduction_strategy field on KernelConfig for end-to-end parity.

arch_specs.json intentionally unchanged: stream-K reuses the standard warp-tile
combos and arch_filter reads no stream-K-specific keys, so adding them would be
dead data.

Validated on gfx942 (MI300X): atomic/linear/tree each codegen + compile + run and
pass fp32 verification (max_rel 3.1e-4) at 256x256x4096.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rategy fields

First slice of moving Stream-K into the dispatcher core (registry-addressable),
per the deep-core checklist. Additive and inert by default:

- KernelKey: new ReductionStrategy enum {None,Atomic,Linear,Tree}; Algorithm
  gains streamk / reduction_strategy / workspace. tie() includes them so the
  three strategies are distinct keys. encode_identifier() appends the Stream-K
  suffix ("_streamk" / "_streamk_linear" / "_streamk_tree") byte-for-byte with
  unified_gemm_codegen.py KernelNaming.generate(), guarded by algorithm.streamk
  so non-Stream-K identifiers are unchanged.
- Problem: streamk / reduction_strategy request fields + ProblemBuilder::stream_k().
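
The suffix rule can be restated as a small Python sketch. The real encode_identifier() is C++ and builds the whole identifier; here the base string is a placeholder and streamk_identifier is a hypothetical name.

```python
STREAMK_SUFFIX = {
    "atomic": "_streamk",         # bare suffix kept for name parity
    "linear": "_streamk_linear",
    "tree": "_streamk_tree",
}


def streamk_identifier(base, streamk=False, reduction_strategy="atomic"):
    """Append the Stream-K suffix only when the key is a Stream-K key."""
    if not streamk:
        return base  # non-Stream-K identifiers are unchanged
    return base + STREAMK_SUFFIX[reduction_strategy]
```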

Validated on gfx90a (hipcc 7.12): non-SK encode_identifier byte-identical;
atomic/linear/tree suffixes correct; tie() distinguishes strategies.
Add two non-pure virtuals so existing GEMM/FMHA/Conv instances compile
unchanged:
- get_workspace_size(Problem) -> bytes (default 0)
- run(a,b,c,d_ptrs, void* workspace, problem, stream) overload whose default
  forwards to the existing no-workspace run().

The Dispatcher invokes these through a base KernelInstance* pointer (so the new
overload is visible despite derived 6-arg run() overrides). The Stream-K backend
(PR-C) overrides both to size and bind the reduction workspace.

Validated on gfx90a (hipcc 7.12): a concrete instance overriding only the
pre-existing pure virtuals compiles; default get_workspace_size==0 and the
workspace-run forwards correctly via base pointer.
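
A Python analogue of this default-forwarding design is sketched below. Python has no overloading, so the workspace-aware overload is given the hypothetical name run_with_workspace; the class names are also illustrative.

```python
class KernelInstance:
    """Base interface: run() must be overridden, the rest have defaults."""

    def run(self, a, b, c, d_ptrs, problem, stream):
        raise NotImplementedError

    def get_workspace_size(self, problem):
        # Default: kernels need no workspace.
        return 0

    def run_with_workspace(self, a, b, c, d_ptrs, workspace, problem, stream):
        # Default: ignore the workspace and forward to the plain run().
        return self.run(a, b, c, d_ptrs, problem, stream)


class PlainGemm(KernelInstance):
    """An existing instance that only overrides the original method."""

    def run(self, a, b, c, d_ptrs, problem, stream):
        return "ran without workspace"
```

Existing instances like PlainGemm keep working untouched, while a Stream-K backend would override both defaults.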
…tree codegen

Adds the C++ backend that lets Stream-K ride the registry, plus the reduction
strategy codegen needed to generate the three variants on this branch.

- generated_tile_backend_streamk.hpp (NEW): GeneratedStreamKKernelInstance wraps
  a generated Stream-K kernel and builds ck_tile::StreamKHostArgs (the
  ABI-incompatible args the GemmHostArgs path could not). supports() gates on
  Problem.streamk + reduction_strategy so atomic/linear/tree coexist in the
  registry and the Dispatcher's first-fit selection picks the requested one.
  create_generated_streamk_kernel<> mirrors create_generated_tile_kernel<>.
- codegen: reduction_strategy axis (atomic/linear/tree) -> KernelConfig field,
  key_name redux_*, KernelNaming "_streamk"/"_streamk_linear"/"_streamk_tree"
  (matches KernelKey::encode_identifier from PR-A), per-strategy
  StreamKReductionStrategy in the generated launch, and a streamk_config sweep
  axis. (Ported from the bridge branch reduction-strategy work.)

PR-C keeps the generated launch's internal workspace/reset; PR-D relocates those
to Dispatcher::run() via get_workspace_size()/the workspace-aware run().

Validated on gfx90a (hipcc 7.12): codegen emits 584 atomic + 584 linear + 584
tree headers with correct names; the backend device-compiles (22s) against a
generated header and supports() accepts the matching strategy while rejecting
the others and non-Stream-K problems.
…pace

Relocate the Stream-K reduction-workspace buffer from the per-call generated
launch() to a grow-on-demand buffer owned by the Dispatcher, so a long-lived
dispatcher stops paying a hipMalloc/hipFree on every invocation.

- codegen: hoist the StreamKGemmKernel type to struct scope and add
  GetWorkSpaceSize() + an external-workspace launch(args, cfg, workspace)
  overload. The existing 2-arg launch (internal DeviceMem) is unchanged so the
  bridge ctypes lib and the standalone 03 driver keep working.
- backend: override get_workspace_size() and the workspace-aware run(); the
  no-workspace run() delegates with a null buffer. The per-iteration reset stays
  in the backend (it needs CDataType + the reduction strategy).
- dispatcher: own a grow-on-demand workspace (raw void*/size_t to keep HIP out
  of the public header), size it via get_workspace_size(), and pass it through
  run_fused()/run_explicit(); free it in the destructor. Atomic needs none
  (size 0 -> null -> internal path); linear/tree consume the owned buffer.
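
The grow-on-demand policy can be modelled without a GPU. In this sketch a bytearray stands in for the device allocation, and the class name WorkspaceOwner and its allocations counter are hypothetical.

```python
class WorkspaceOwner:
    """Owns one buffer that is reallocated only when a larger one is needed."""

    def __init__(self):
        self._buffer = None
        self._capacity = 0
        self.allocations = 0  # counts stand-in "hipMalloc" calls

    def acquire(self, required_bytes):
        if required_bytes == 0:
            return None  # e.g. atomic: no workspace, internal path is used
        if required_bytes > self._capacity:
            self._buffer = bytearray(required_bytes)  # stand-in for hipMalloc
            self._capacity = required_bytes
            self.allocations += 1
        return self._buffer

    def release(self):
        """Mirror of the destructor freeing the buffer."""
        self._buffer = None
        self._capacity = 0
```

Repeated calls with the same or a smaller size reuse the buffer, which is where the saving over a per-call allocation comes from.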

Validated on MI210/gfx90a: atomic/linear/tree all verify vs reference_gemm at
unchanged perf, with linear/tree now running on the dispatcher-owned workspace.
…river

Add 04_streamk_registry_driver.cpp: a runnable proof of the full deep-core path
(Registry::register_kernel -> Dispatcher::run -> first-fit supports() gate on
reduction_strategy -> GeneratedStreamKKernelInstance::run -> generated launch ->
verify vs reference_gemm). Unlike 03_streamk_gemm_driver.cpp, which calls
SelectedKernel::launch() directly and bypasses the dispatcher, this exercises the
registry selection and the Dispatcher-owned workspace.

Selectable strategy via --strategy {atomic,linear,tree}. Validated on
MI210/gfx90a for all three (distinct registry identifiers, each PASS).
…K backend

The dispatcher-wrapper generator emitted ONE template for every variant:
backends::GeneratedKernelInstance<KernelStruct> with no streamk/reduction_strategy
on the key. For Stream-K that is wrong twice over -- the regular backend calls
launch(GemmHostArgs,...) which the SK kernel struct does not have (so the
aggregate register_all_kernels.hpp would not compile against SK), and the key
omits the SK fields so encode_identifier() emits no _streamk suffix and
atomic/linear/tree collide in the registry.

Make the wrapper variant-aware: for STREAM_K configs include
generated_tile_backend_streamk.hpp, set key.algorithm.streamk +
reduction_strategy + workspace (and pad flags for identifier parity), and return
create_generated_streamk_kernel<KernelStruct, KernelStruct::ADataType, ...>.
All other variants are unchanged.

Validated on MI210/gfx90a: a registry populated via the generated wrappers holds
atomic+linear+tree side by side; Dispatcher::run() selects each by
Problem::reduction_strategy and all three verify vs reference_gemm.
…are atomic reset

P2: GeneratedStreamKKernelInstance::supports() now ends with
SelectedKernel::IsSupported(make_args(problem)) (a new generated static that runs
MakeKernelArgs + IsSupportedArgument). A problem too small to partition across CUs
is rejected during selection, so first-fit falls back to a non-Stream-K kernel
instead of throwing std::runtime_error at launch.
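
The selection behaviour described in P2 can be illustrated with a toy registry. Everything here is hypothetical: kernels are plain dicts, the tiles/min_tiles comparison is a stand-in for the generated IsSupported check, and the regular kernel is assumed to accept any problem so that a fallback exists.

```python
def make_streamk_kernel(name, strategy, min_tiles):
    """Stream-K kernel: gates on the request, the strategy, and problem size."""
    def supports(problem):
        return (
            problem.get("streamk", False)
            and problem.get("reduction_strategy") == strategy
            and problem["tiles"] >= min_tiles  # stand-in for IsSupported
        )
    return {"name": name, "supports": supports}


def make_regular_kernel(name):
    """Regular kernel: assumed to accept every problem in this sketch."""
    return {"name": name, "supports": lambda problem: True}


def select_first_fit(kernels, problem):
    """Return the first registered kernel whose supports() accepts the problem."""
    for kernel in kernels:
        if kernel["supports"](problem):
            return kernel
    return None
```

Rejecting an undersized problem inside supports() lets selection move on to the next candidate instead of failing later at launch.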

P3: the atomic reduction reset zeroes C with a stride-aware hipMemset2DAsync
(pitch = stride_E * sizeof(C), width = N * sizeof(C), height = M) instead of a
flat hipMemsetAsync over M*N. Correct for a padded/strided C; identical coverage
for the contiguous rcr case. Applied to both the internal and external-workspace
launch overloads.
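
The difference between the flat and the stride-aware reset can be simulated on the host with a byte buffer. This is only a sketch of the addressing: `memset_flat` and `memset_2d` are hypothetical pure-Python helpers that mimic the byte ranges a flat memset and a pitched 2D memset would touch; they do not call HIP.

```python
def memset_flat(buf: bytearray, value: int, count: int) -> None:
    """Set the first `count` bytes, ignoring any row padding."""
    buf[:count] = bytes([value]) * count


def memset_2d(buf: bytearray, pitch: int, value: int, width: int, height: int) -> None:
    """Set `width` bytes at the start of each of `height` rows spaced `pitch` bytes apart."""
    for row in range(height):
        start = row * pitch
        buf[start:start + width] = bytes([value]) * width


def reset_demo():
    m, n, stride_e, elem_size = 3, 4, 6, 2            # C is 3x4, padded to 6 elements per row
    pitch, width = stride_e * elem_size, n * elem_size  # 12 and 8 bytes
    flat = bytearray(b"\xff" * (m * pitch))
    strided = bytearray(flat)
    memset_flat(flat, 0, m * n * elem_size)            # flat reset over M*N elements
    memset_2d(strided, pitch, 0, width, m)             # stride-aware reset
    return flat, strided, pitch, width
```

In the padded case the flat reset clears padding bytes in the early rows and never reaches the last row's elements, while the 2D reset clears exactly the M x N elements.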

Validated on MI210/gfx90a: atomic/linear/tree still select + run + verify from a
multi-kernel registry; valid small problems are accepted (no false-negatives).

The bridge dispatcher's tile-divisibility gate rejected any problem where
M % TileM != 0 for every layout, returning status -2 ("No suitable kernel")
at runtime even though the .so built fine. This wrongly excluded bf16 rcr/rrr
kernels with a non-power-of-two TileM (e.g. 192) on standard shapes like
1024^3 -- cases Old-TE compiles, runs, and verifies as correct.

Root cause: supports() was layout-blind, while the underlying
ck_tile::GemmKernel::IsSupportedArgument only constrains a dimension when an
operand whose inner axis is that dimension participates without padding:

  RowMajor A -> K, ColMajor A -> M
  RowMajor B -> N, ColMajor B -> K
  RowMajor C -> N, ColMajor C -> M

So for rcr (RowMajor A & C) M is never gated, which is why Old-TE runs M=192
tiles on M-indivisible problems.

Make supports() compute require_m/n/k from the kernel key's A/B/C layouts so
it mirrors IsSupportedArgument exactly (also honoring k_batch in the K grain).
Anything it now lets through is still validated by the kernel's own
IsSupportedArgument inside launch(), so the bridge stays a strict functional
equivalent of Old-TE. Applied to both generated_tile_backend.hpp (the GEMM
.so path) and the sibling tile_backend.hpp.
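
The layout table above translates into a small gate. The following is a sketch under stated assumptions, not the code in generated_tile_backend.hpp: layouts are given as a three-letter string such as "rcr", padding flags are ignored (every operand is treated as unpadded), and the K grain is assumed to be TileK * k_batch.

```python
from typing import Set, Tuple


def required_dims(layout: str) -> Set[str]:
    """Dimensions that must be tile-divisible for an A/B/C layout string like 'rcr'."""
    a, b, c = layout
    return {
        "K" if a == "r" else "M",  # RowMajor A -> K, ColMajor A -> M
        "N" if b == "r" else "K",  # RowMajor B -> N, ColMajor B -> K
        "N" if c == "r" else "M",  # RowMajor C -> N, ColMajor C -> M
    }


def supports(layout: str, tile: Tuple[int, int, int],
             problem: Tuple[int, int, int], k_batch: int = 1) -> bool:
    """Layout-aware divisibility gate (padding flags omitted for brevity)."""
    tile_m, tile_n, tile_k = tile
    m, n, k = problem
    need = required_dims(layout)
    if "M" in need and m % tile_m != 0:
        return False
    if "N" in need and n % tile_n != 0:
        return False
    if "K" in need and k % (tile_k * k_batch) != 0:  # assumed K grain
        return False
    return True
```

Under these assumptions `supports("rcr", (192, 128, 64), (1024, 1024, 1024))` is accepted because M is not gated for rcr, while a TileN=192 kernel at N=1024 is still rejected.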

Validated on gfx942 (MI300X): 85 rcr/rrr bf16 192-tile .so files that previously
returned status -2 now run at 1024^3 (Old-TE runs the same kernels and verification
passes); the 8 remaining rejects are TileN=192 cases that Old-TE also rejects with
"Arguments not supported" at N=1024 -- parity preserved in both directions.
…oding rcr

dispatcher_initialize() in gemm_ctypes_lib.cpp hardcoded the KernelKey layout to
rcr (RowMajor/ColMajor/RowMajor) for every kernel. Now that supports() is
layout-aware, that wrong key layout makes the dispatcher reject valid problems:
a crr kernel does not gate K (neither A=ColMajor nor B=RowMajor has K as its
inner axis), but with a hardcoded rcr key supports() applies rcr's K-gate and
returns status -2 for TileK=192 problems (e.g. crr 64x64x192 at 1024^3) that
Old-TE compiles, runs, and verifies (~87 TFLOPS).

Derive signature.layout_a/b/c from the force-included kernel's own
ALayout/BLayout/CLayout types via std::is_same_v with tensor_layout::gemm::RowMajor.
The key now matches the kernel, so the layout-aware gate is correct for all four
layouts. Execution was already layout-correct (the kernel uses its own compile-time
layouts); only the host-side selection metadata was wrong.

Validated on gfx942 (MI300X): crr 64x64x192 now runs on the bridge (93 TFLOPS),
restoring parity with Old-TE.
ozturkosu added 15 commits June 17, 2026 22:19
The >=20% bridge-vs-old-TE perf gaps in the parity sweep are a harness
artifact: the sweep timed the bridge in-process but timed old-TE via its
separate standalone benchmark binary, which runs the byte-identical kernel
at a lower sustained SCLK. Measured through one harness the gap is <1%.

ab_same_harness.py removed that artifact but hardcoded the old-TE header dir
to fp16/rcr. Derive it per stem as <base>/<dtype>/<layout> so one run covers
rcr/rrr/ccr/crr and fp16+bf16, add a --stems-file/--csv resume-aware sweep
mode, and use the median (not max) per point.

For a full ~2000-stem sweep on a single GPU: batch all shapes into one worker
call per side (5x fewer process startups), cache the compiled old-TE .so, and
add a parallel --build-only pre-pass so hipcc compilation uses all CPU cores
while GPU measurement stays serial.
…eductionStrategy)

Close two review nits on the Stream-K drivers:
- Parse M/N/K with std::stoll instead of std::stoi in the 03/04 drivers so
  large GEMM dimensions no longer exceed int range and throw (Copilot nit).
- Add inline to_string(ReductionStrategy) in kernel_key.hpp and route the 04
  driver through it, removing the driver-local strategy_name() duplicate so
  callers share one spelling that matches the codegen suffix scheme.

Adds dispatcher_test_streamk_registry, a GPU test that generates the three
reduction-strategy kernels (atomic/linear/tree) from one tile config, builds the
04 registry driver once per strategy (each force-including its own header, since
SkReductionStrategy is a compile-time constexpr), and asserts for each that the
encode_identifier() suffix matches, the Dispatcher selects it by
Problem::reduction_strategy, and the result verifies against the reference.

This converts the previously manual deep-core validation into a regression-
guarded CTest. It SKIPs (return 77) when no GPU or hipcc is present, so CPU-only
CI is unaffected.
…IBRARY_PATH

meas()/meas_all() built the worker env without /opt/rocm/lib on
LD_LIBRARY_PATH, so run_one_gemm_kernel.py failed to load every .so
("libamdhip64.so.7: cannot open shared object file") and every cell
came back nan. Set it the same way ab_efficient_sweep.py does.
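
A minimal sketch of the kind of environment setup this fix describes. The helper name `worker_env` is hypothetical and the real meas()/meas_all() code may differ; only the /opt/rocm/lib path comes from the message above.

```python
import os
from typing import Dict, Optional


def worker_env(base: Optional[Dict[str, str]] = None,
               rocm_lib: str = "/opt/rocm/lib") -> Dict[str, str]:
    """Copy the environment and make sure rocm_lib is on LD_LIBRARY_PATH."""
    env = dict(os.environ if base is None else base)
    parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    if rocm_lib not in parts:
        parts.insert(0, rocm_lib)  # put ROCm first so its libamdhip64 is found
    env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return env
```

The returned dict would be passed as `env=` to the subprocess call that starts the worker, so the parent process environment stays untouched.
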
… guard)

The bridge-vs-old-TE A/B reported phantom regressions from two MEASUREMENT
bugs, not real codegen gaps:

- ab_same_harness.py built the old-TE side WITHOUT the TE codegen flags the
  bridge (and real old-TE's own CMake) use, so -enable-post-misched defaulted
  back on and old-TE ran ~10-40% faster -> the bridge looked regressed when it
  is at parity. Now both sides build with identical flags.

- ab_efficient_sweep.py measured whatever libgemm_<stem>.so existed with no
  freshness check, so 3-day-old binaries built from an obsolete codegen showed
  up as -78%/+703% gaps. Added a guard: skip any .so older than its generated
  header (treated as missing) instead of reporting a phantom gap.

With both fixes the 41 former >15% outlier stems measure within +/-10%
(median +0.01%); no bridge codegen regression exists.
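
The freshness guard can be sketched with file modification times. This is an illustration, not the code in ab_efficient_sweep.py: the `<stem>.hpp` header name and the choice to treat a missing header as stale are assumptions; only the `libgemm_<stem>.so` naming comes from the message above.

```python
import os
import tempfile
from typing import List


def is_fresh(so_path: str, header_path: str) -> bool:
    """True only if both files exist and the .so is at least as new as its header."""
    if not (os.path.exists(so_path) and os.path.exists(header_path)):
        return False
    return os.path.getmtime(so_path) >= os.path.getmtime(header_path)


def measurable(stems: List[str], so_dir: str, header_dir: str) -> List[str]:
    """Keep only stems with a fresh binary; stale ones are treated as missing."""
    return [
        stem for stem in stems
        if is_fresh(os.path.join(so_dir, f"libgemm_{stem}.so"),
                    os.path.join(header_dir, f"{stem}.hpp"))  # assumed header name
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(measurable(["some_stem"], d, d))  # nothing built yet, so nothing to measure
```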

Note: a separate, deliberately UNCOMMITTED perf change in gemm_utils.py (apply
-enable-post-misched=0 only to persistent kernels) speeds up non-persistent large
tiles by ~9-40%; it is held back pending a broader persistent-kernel no-regression sweep.
… driver

The standalone stream-K driver verified atomic results with the single-pass
GEMM tolerance get_*_threshold<...>(K). Atomic reduction accumulates K-split
partials directly into low-precision C (workspace size 0), incurring rounding
error that grows with the split factor -- correct results were flagged FAIL on
small-M/N, large-K shapes (e.g. 512x512x8192) where there are fewer output tiles
than CUs.

Mirror tile_engine's calculate_rtol_atol (validation.hpp): derive kbatch from
the kernel's tile partitioner (estimate_num_wgs_per_tile), widen atol/rtol with
the split-K CDataType accumulation term, and take the max with the per-split
tolerance. The driver and tile_engine now verify identically; the kernel is
unchanged.
…gine

The standalone stream-K driver built its stream_config as {stream, true, 0,
warmup, repeat}, leaving is_gpu_timer/flush_cache/rotating_count at defaults
(flush_cache=false, rotating_count=1). The tile_engine benchmark instead times
with flush_cache=true and rotating_count=1000, so the driver measured a
warm-cache best case while tile_engine measured cold-cache -- the entire source
of the reported dispatcher-vs-TE "performance gap" at low tile counts.

Add --timer/--flush_cache/--rotating_count (defaulting to the tile_engine
values) and pass them through stream_config so both sides use identical timing
methodology. A validating run still times a single cold shot, mirroring
tile_engine's repeat_once_if_verify(); collect perf with a separate --validate 0
pass.

The 04 registry driver hardcoded the KernelKey signature to DataType::FP16
and an rcr layout, so fp8/bf8/bf16 Stream-K kernels registered under the
wrong key and failed dispatch/identifier checks. Derive dtype_a/b/c/acc and
layout tags from the generated kernel's actual A/B/C types via compile-time
dtype_enum_of<T>()/layout_tag_of<Layout>() helpers (fp8/bf8 inputs accumulate
in fp32 and write fp16 C, matching Tile Engine).

Parametrize test_streamk_registry.py over fp16/bf16/fp8/bf8 (dtype-independent
core objects built once; per-dtype codegen + build + verify with per-dtype
identifier assertions). All four datatypes register, dispatch, and verify
across atomic/linear/tree on gfx942 (MI300X).

Port #8136's Tile-Engine->Dispatcher Stream-K bridge onto the rewritten
deep-core #8094 engine (KernelKey reduction fields, KernelInstance workspace
virtuals, StreamK backend, Dispatcher-owned reduction workspace, registry +
validation driver). 3-way merge over the shared stream_k ancestor; only the
streamk launch emitter in unified_gemm_codegen.py and 03_streamk_gemm_driver.cpp
conflicted -- both resolved to the deep-core side:
- codegen now emits the struct-scope Sk* kernel type + GetWorkSpaceSize +
  IsSupported, keeps the 2-arg internal-workspace launch the bridge ctypes lib
  calls, and adds the 3-arg dispatcher-owned-workspace launch.
- driver takes deep-core's stoll parse + apples-to-apples timing + validate cold shot.

Bridge ctypes lib still bypasses the registry and calls the 2-arg launch directly,
so the bridge runs the exact deep-core kernels. Codegen smoke: atomic/linear/tree
+ regular gemm all generate cleanly (0 failed).

Build the Stream-K bridge .so without the dispatcher static lib and with
TE-streamk-matching flags:
- compile flags per-variant: Stream-K matches tile_engine/ops/gemm_streamk
  CMake (only -Wno-* + --offload-compress, NO -mllvm codegen flags and NOT
  -enable-noalias-to-md-conversion=0 which is a gemm_universal-bridge concern).
  This keeps the A/B fair; the regular path is unchanged.
- link skips libck_tile_dispatcher.a for Stream-K (the ctypes lib launches the
  force-included kernel directly, no registry/dispatcher symbols), and the
  build guard no longer requires the static lib for Stream-K.
- ensure build/examples exists before hipcc writes there (the cmake build that
  normally creates it is skipped on the Stream-K path).

Validated on MI300X (gfx942): atomic/linear/tree fp16/rcr all build, run, and
verify PASS (max_rel_err <8e-4) through the bridge GpuGemmRunner path.

Correct the Stream-K compile flags to match Tile Engine's gemm_streamk build
verbatim (ground truth: a TE streamk build's compile_commands.json). The
-mllvm codegen flags come from the composablekernel project-root
add_compile_options applied globally to the TE benchmark -- they are NOT in the
per-target options, so the earlier "minimal flags" assumption was wrong and
would have produced a phantom A/B gap (different occupancy). Flags now:
  -std=c++20 -fno-offload-uniform-block
  -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0
  -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false
  --offload-compress
-enable-post-misched=0 is applied unconditionally (TE does so for streamk),
and -enable-noalias-to-md-conversion=0 is not used (TE streamk omits it).

Add a bf16 codec to the bridge runner so bf16 Stream-K kernels can run through
the same ctypes path (the ABI is void*+sizeof, so 2-byte bf16 shares the fp16
path; only the bit pattern differs). Dtype is inferred from the kernel name.
ENCODE is round-to-nearest-even to bf16 bits; DECODE is bit-exact to device
bf16_t so the numpy reference multiplies the same values the GPU does.
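
A scalar sketch of such a codec, assuming the usual definition of bf16 as the upper 16 bits of an IEEE float32. The function names are hypothetical and the runner's real codec works on arrays; this only shows the bit manipulation.

```python
import struct


def bf16_encode(x: float) -> int:
    """Round a float to bf16 bits with round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF  # keep NaN a NaN after truncation
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even bf16 value
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_decode(h: int) -> float:
    """Expand bf16 bits to float32 exactly by placing them in the upper half."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]


def reference_value(x: float) -> float:
    """The value the host reference should multiply: decode(encode(x))."""
    return bf16_decode(bf16_encode(x))
```

Using `reference_value` on every input element keeps the host reference on the same rounded operands the GPU sees, so the remaining error comes from accumulation only.
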
…ing)

Make the Stream-K bridge layout-generic instead of rcr-hardcoded, so all 4 A/B/C
layouts (rcr/rrr/ccr/crr) work end to end:
- streamk_gemm_ctypes_lib.cpp: derive stride_A/B/C at compile time from the
  kernel's ALayout/BLayout/CLayout (RowMajor RxC -> ld=C, ColumnMajor -> ld=R)
  instead of the hardcoded K/K/N.
- generated_tile_backend_streamk.hpp (registry path): same layout-derived strides.
- GpuGemmRunner: read dtype AND layout off the kernel name; arrange each operand
  per layout (RowMajor=C-contiguous, ColumnMajor=F-contiguous); bf16 encode is
  now memory-order-preserving so column-major operands stay column-major.
- run_one_streamk_gemm_kernel.py: dtype/layout-aware A/B + reference (was fp16-only).
- streamk_gemm_full_benchmark.py: SUPPORTED_LAYOUTS now rcr/rrr/ccr/crr,
  SUPPORTED_DTYPES fp16+bf16 (fp8/bf8/int8 still need runner codecs).
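
The stride rule in the first bullet (RowMajor RxC -> ld=C, ColumnMajor -> ld=R) can be written out as follows. This is a host-side sketch with hypothetical helper names; the bridge itself derives these values at compile time from the kernel's layout types.

```python
from typing import Tuple


def leading_dim(rows: int, cols: int, layout: str) -> int:
    """Leading dimension of a rows x cols matrix: 'r' = RowMajor, 'c' = ColumnMajor."""
    return cols if layout == "r" else rows


def gemm_strides(m: int, n: int, k: int, layout: str) -> Tuple[int, int, int]:
    """(stride_A, stride_B, stride_C) for A[MxK], B[KxN], C[MxN] and a layout like 'rcr'."""
    la, lb, lc = layout
    return (leading_dim(m, k, la), leading_dim(k, n, lb), leading_dim(m, n, lc))


if __name__ == "__main__":
    for layout in ("rcr", "rrr", "ccr", "crr"):
        print(layout, gemm_strides(128, 256, 64, layout))
```

For rcr this reproduces the previously hardcoded K/K/N; the other three layouts differ in at least one operand, which is why the hardcoded values only worked for rcr.
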

Extend the Tile-Engine -> Dispatcher Stream-K bridge (PR #8136) beyond
fp16/bf16 to the FNUZ fp8 (E4M3) and bf8 (E5M2) formats used by gfx942/MI300.

GpuGemmRunner (dispatcher/python/gemm_utils.py):
- Port the tested FNUZ codecs from the sibling fp8 bridge (PR #8887):
  bit-exact decode tables + nearest-representable/saturating encode, carried
  as uint8 bit patterns (sizeof fp8_t/bf8_t == 1). Encode preserves operand
  C/F contiguity so the layout-generic _to_buf path holds for the new dtypes.
- run() now sizes the C buffer per get_output_dtype: fp8/bf8 -> fp16 store,
  int8 -> int32; bf16 still carried as raw uint16. fp16/bf16 paths unchanged.
- Arch guard: fp8/bf8 raise a clear error on a non-gfx942 GPU (gfx950/MI350
  uses OCP fp8, a different bit layout) rather than silently mis-decoding.
- An int8 codec is included for when the engine supports it (see below).
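
A sketch of what an FNUZ decode and the arch guard could look like. It follows the commonly documented FNUZ conventions (exponent bias 8 for E4M3 and 16 for E5M2, no infinities, no negative zero, byte 0x80 is NaN) and is not the ported codec itself; the function names and the error text are hypothetical.

```python
import math


def decode_fnuz(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode one FNUZ fp8 byte: (4, 3) for fp8 E4M3, (5, 2) for bf8 E5M2."""
    if byte == 0x80:
        return math.nan                      # the only NaN encoding; no -0, no inf
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = 1 << (exp_bits - 1)               # 8 for E4M3, 16 for E5M2
    if exp == 0:                             # subnormal: no implicit leading 1
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def build_decode_table(exp_bits: int, man_bits: int) -> list:
    """256-entry lookup table indexed by the raw byte."""
    return [decode_fnuz(b, exp_bits, man_bits) for b in range(256)]


def check_fp8_arch(dtype: str, arch: str) -> None:
    """Refuse FNUZ fp8/bf8 on anything but gfx942 instead of mis-decoding."""
    if dtype in ("fp8", "bf8") and arch.split(":")[0] != "gfx942":
        raise RuntimeError(f"{dtype} FNUZ codec requires gfx942, got {arch}")
```

A table built once per dtype makes decoding a plain index lookup over the uint8 bit patterns, which fits the "carried as uint8" design described above.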

Reference + surface:
- run_one_streamk_gemm_kernel.py verify reference is now dtype-aware
  (decode(encode(x)) per dtype; int8 = exact int32 matmul).
- streamk_gemm_full_benchmark.py SUPPORTED_DTYPES += fp8, bf8.

int8 is intentionally left OUT of SUPPORTED_DTYPES: it is blocked at the
ck_tile engine, not the bridge. The int8 kernel codegens but fails to compile
for every reduction strategy -- warp_gemm_dispatcher has no
Dispatcher<int8,int8,float,32,32,16,...> specialization for the streamk CompV3
path, so the BlockUniversalGemmAsBsCr WarpGemm static_asserts fail. Matches the
PR #8094 decision to leave int8 out.

GPU-validated on gfx942 (MI300X), 2048^3, across reduction-strategy and layout variants:
  fp8 atomic/linear/tree rcr: PASS  (192/180/183 TFLOPS, max_rel <= 9.4e-4)
  bf8 atomic/linear/tree rcr: PASS  (192/181/181 TFLOPS, max_rel <= 7.8e-4)
  fp8 ccr / bf8 crr (col-major): PASS (245/210 TFLOPS)
@therock-pr-bot

therock-pr-bot Bot commented Jul 1, 2026

✅ All Checks Passed — Ready for Review

| Check | Status | Details |
|---|---|---|
| 🌿 Branch Name | ✅ Pass | |
| 📝 PR Title/Description | ✅ Pass | |
| Forbidden Files | ✅ Pass | |
| 🧪 Unit Test | ✅ Pass | |
| 🔎 pre-commit | ✅ Pass | |
| 🚫 Draft PR | 🔜 To Be Enabled | |
| 🚩 Feature Flag | 🔜 To Be Enabled | |
| 📊 Code Coverage | 🔜 To Be Enabled | |
| 🤖 therock-pr-bot | ✅ Pass | |

🎉 All checks passed! This PR is ready for review.

📖 Need help? See the Policy FAQ for details on every check and how to fix failures.
