feat(ck-tile): stream-K GEMM TE to dispatcher bridge#9028
Open
ozturkosu wants to merge 45 commits into
Add the stream-K GEMM variant to the unified GEMM dispatcher codegen the dispatcher way: a single-GEMM launch(args, stream) that allocates the reduction workspace internally via DeviceMem (GetWorkSpaceSize / SetWorkSpacePointer), zeroes it, and launches StreamKKernel with an atomic-reduction preprocess that resets C between timed iterations. No external workspace pointer (not the Tile Engine way).
- arch_filter.py: add OperatorType.GEMM_STREAMK + tile constraints.
- unified_gemm_codegen.py: add GemmVariant.STREAM_K, CLI --variants stream_k, naming, includes, _launch_function_streamk, variant->operator map, cshuffle-only config selection, and A/B/CLayout export in the CK_TILE_SINGLE_KERNEL_INCLUDE block.
- examples/gemm/cpp/03_streamk_gemm_driver.cpp: standalone single-kernel driver that calls SelectedKernel::launch and verifies vs reference_gemm.
Parity vs Tile Engine on MI300X (gfx942), fp16 rcr atomic 128x128x64_2x2x1_32x32x16, 3840x4096x2048, warmup=10/repeat=50: dispatcher 0.242 ms / 266 TFLOPS PASS vs TE 0.24 ms / 266 TFLOPS correct.
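For reference, the reported throughput can be reproduced from the problem size and the measured time. This is a minimal sketch that assumes the usual GEMM convention of 2*M*N*K floating-point operations; the helper name gemm_tflops is hypothetical and not part of the PR.

```python
def gemm_tflops(m, n, k, time_ms):
    """Throughput in TFLOPS, counting one multiply and one add per MAC."""
    flops = 2 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12


# Problem size and dispatcher time quoted in the parity line above.
print(round(gemm_tflops(3840, 4096, 2048, 0.242), 1))
```

The TE figure is quoted with one fewer digit of timing precision (0.24 ms), so the two TFLOPS numbers agree only after rounding.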
Introduce the shared-config bridge that lets Tile Engine drive the GEMM dispatcher the same way FMHA and grouped conv already do: one config dataclass owned by the dispatcher, imported by Tile Engine, with no translator between two vocabularies. - dispatcher/python/gemm_utils.py: GemmKernelConfig (the common contract; .name mirrors the codegen KERNEL_NAME byte-for-byte), GemmProblem, GemmDispatcherLib, GpuGemmRunner, setup_multiple_gemm_dispatchers (codegen + hipcc -> .so paths, CPU-only/parallel), and expand_sweep. - dispatcher/bindings/ctypes/gemm_ctypes_lib.cpp: add the indexed multi-kernel ABI dispatcher_get_kernel_name_at(index, buf, size); legacy single-kernel dispatcher_get_kernel_name retained. - tile_engine/ops/gemm/gemm_full_benchmark.py: 3-phase TE driver (compile -> load problems -> benchmark) that generates no binaries. - tile_engine/ops/gemm/run_one_gemm_kernel.py: disposable GPU worker for subprocess fault isolation. Scope: regular GEMM, fp16, rcr (Phase 1). Name parity verified end-to-end (config.name == generated .hpp stem == runtime registry name). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add the stream-K GEMM variant to the unified GEMM dispatcher codegen the dispatcher way: a single-GEMM launch(args, stream) that allocates the reduction workspace internally via DeviceMem (GetWorkSpaceSize / SetWorkSpacePointer), zeroes it, and launches StreamKKernel with an atomic-reduction preprocess that resets C between timed iterations. No external workspace pointer (not the Tile Engine way). - arch_filter.py: add OperatorType.GEMM_STREAMK + tile constraints. - unified_gemm_codegen.py: add GemmVariant.STREAM_K, CLI --variants stream_k, naming, includes, _launch_function_streamk, variant->operator map, cshuffle-only config selection, and A/B/CLayout export in the CK_TILE_SINGLE_KERNEL_INCLUDE block. - examples/gemm/cpp/03_streamk_gemm_driver.cpp: standalone single-kernel driver that calls SelectedKernel::launch and verifies vs reference_gemm. Parity vs Tile Engine on MI300X (gfx942), fp16 rcr atomic 128x128x64_2x2x1_32x32x16, 3840x4096x2048, warmup=10/repeat=50: dispatcher 0.242 ms / 266 TFLOPS PASS vs TE 0.24 ms / 266 TFLOPS correct. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Routes the stream_k GEMM variant through the same TE->Dispatcher bridge as regular GEMM (Phase 1) and grouped GEMM (Phase 3). Stream-K is a single-problem GEMM with the same C ABI as regular GEMM, so the Python runner side is reused unchanged; only the .so internals and variant routing differ. - streamk_gemm_ctypes_lib.cpp (new): same single-problem C ABI (dispatcher_run_gemm) but builds a ck_tile::StreamKHostArgs and calls SelectedKernel::launch(args, stream) directly, bypassing the registry (whose generated backend hard-codes the GemmHostArgs launch signature). The launch allocates the Atomic-reduction workspace internally; C is zeroed per run. - gemm_utils.py: _ctypes_source_name() selects streamk_gemm_ctypes_lib.cpp for variant "stream_k"; .name appends _streamk; variant threaded through codegen_args and expand_sweep. - ctypes_utils.py: pass the requested variant to codegen --variants instead of hard-coding "standard". - TE driver/worker/config (new): streamk_gemm_full_benchmark.py, run_one_streamk_gemm_kernel.py, gemm_streamk/configs/default_config.json. Validated end-to-end on gfx942/MI300X: full driver run 16/16 OK (4 kernels x 4 problems), name parity holds (.so name == config .name, ends _streamk). Numeric parity vs fp32 reference passes under an fp16 Atomic-reduction tolerance (max_rel <= 2.5e-3) which is wider than regular/grouped because Atomic does multiple fp16 atomic-adds per K-split. Tiny problems (e.g. 257^3) are correctly reported unsupported (status -2) by the kernel and surfaced gracefully. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
setup_multiple_gemm_dispatchers built its per-config codegen args without the config's variant, and _generate_single_kernel_subprocess hard-coded --variants standard. A GemmKernelConfig(variant='preshuffle'/'multi_d') passed to the bridge would therefore be code-generated as a standard kernel, while its name (and the hpp_glob_pattern derived from it) still carried the variant suffix -- so the lookup could never match the emitted header. Pass the variant through; the subprocess default stays "standard" so all existing callers are unaffected. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…dler If subprocess.Popen() itself raises, the generic except handler referenced proc before it was bound, masking the real error with an UnboundLocalError. Initialize proc = None before the try; the handler already guards with `if proc and proc.poll() is None`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
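The fix can be illustrated with a minimal sketch. The function name run_batch and its return shape are hypothetical; only the proc = None initialization and the `if proc and proc.poll() is None` guard come from the commit message.

```python
import subprocess


def run_batch(cmd, timeout_s=60):
    """Run one worker command; return (returncode, text) without masking errors."""
    proc = None  # bound before the try so the handler can always read it
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out.decode(errors="replace")
    except Exception as exc:
        # If Popen() itself raised, proc is still None and this guard is skipped.
        if proc and proc.poll() is None:
            proc.kill()
            proc.wait()
        return -1, f"{type(exc).__name__}: {exc}"
```

With the initialization in place, a failing Popen() surfaces its own exception (for example FileNotFoundError) rather than an UnboundLocalError raised inside the handler.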
Copilot review item: the driver exposed unrestricted --dtype/--layout while the Phase-1 worker hard-codes fp16 inputs and an rcr (column-major B) host transpose. Passing e.g. --dtype bf16 would codegen bf16 kernels but feed them fp16 data, silently benchmarking the wrong thing. Add SUPPORTED_DTYPES/ SUPPORTED_LAYOUTS and wire them into argparse choices so a mismatch fails fast. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
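A minimal sketch of the fail-fast wiring, assuming the Phase-1 surface of fp16/rcr described above. SUPPORTED_DTYPES and SUPPORTED_LAYOUTS are named in the commit; build_parser and the defaults are illustrative.

```python
import argparse

# Phase-1 surface: the worker only generates fp16 inputs with an rcr layout.
SUPPORTED_DTYPES = ("fp16",)
SUPPORTED_LAYOUTS = ("rcr",)


def build_parser():
    """Driver parser whose --dtype/--layout reject anything the worker cannot feed."""
    parser = argparse.ArgumentParser(description="bridge driver (sketch)")
    parser.add_argument("--dtype", default="fp16", choices=SUPPORTED_DTYPES)
    parser.add_argument("--layout", default="rcr", choices=SUPPORTED_LAYOUTS)
    return parser
```

Passing --dtype bf16 to this parser exits with an argparse error before any codegen runs.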
Addresses the three review items on the TE->Dispatcher GEMM bridge driver, scoped to this foundation PR's fp16/rcr surface (bf16/layouts follow in the #8190/#8191 stack):
1. Example configs to sweep
   - gemm_full_benchmark.py defaults to the selected variant's configs/default_ci_config.json (small CI sweep) when no config is passed, and to configs/example_problems.json when --problems is omitted; configs/default_config.json remains the full sweep.
   - New gemm_universal/configs/example_problems.json (square / rectangular / large M,N,K). Nightly-test JSON drops into the same configs/ dir -- no driver change needed.
2. Multi-GPU launch in parallel (supersedes grouped_conv's serial-GPU design)
   - Phase 3 fans the (kernel x problem) work across every visible GPU: one worker thread per device pulls batches from a shared queue and spawns a disposable subprocess pinned with HIP_VISIBLE_DEVICES, so an N-GPU box runs ~Nx faster while keeping per-batch fault isolation.
   - Devices auto-detected (HIP_VISIBLE_DEVICES, then rocm-smi/amd-smi); override with --devices (count, explicit ids, or all).
3. Variant organization + README + deprecation note
   - --variant selects the per-variant configs/ directory.
   - New README "Dispatcher Bridge Workflow" section: scripts, per-variant config layout, run examples, multi-GPU explanation, supported surface (fp16/rcr here), and a deprecation note for the legacy *_instance_builder.py generators.
Driver --dtype/--layout choices stay fp16/rcr to match this PR's dispatcher host path; run_one_gemm_kernel.py (fp16 host gen) is unchanged.
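The multi-GPU fan-out in item 2 can be sketched as follows. This is an illustration only: run_on_devices is a hypothetical name, and the subprocess is a stand-in that merely echoes its pinned device and batch instead of running a GPU worker.

```python
import os
import queue
import subprocess
import sys
import threading

# Stand-in for the real worker script: prints the pinned device and its batch.
_WORKER_CODE = "import os, sys; print(os.environ['HIP_VISIBLE_DEVICES'], sys.argv[1])"


def run_on_devices(batches, devices):
    """Fan batches across devices: one thread per device, one subprocess per batch."""
    work = queue.Queue()
    for batch in batches:
        work.put(batch)
    results = []
    lock = threading.Lock()

    def device_loop(device):
        # Pin every subprocess spawned by this thread to a single device.
        env = dict(os.environ, HIP_VISIBLE_DEVICES=str(device))
        while True:
            try:
                batch = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this device is done
            proc = subprocess.run(
                [sys.executable, "-c", _WORKER_CODE, str(batch)],
                env=env, capture_output=True, text=True,
            )
            with lock:
                results.append((batch, device, proc.returncode, proc.stdout.strip()))

    threads = [threading.Thread(target=device_loop, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each batch runs in its own process, a crash in one batch shows up as a non-zero return code for that batch only, and the device thread moves on to the next item in the queue.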
Adds a "Variant scope" section clarifying that the bridge is one shared, variant-aware driver (not per-variant driver copies), that only gemm_universal is wired and validated through the bridge on this PR, and that the gemm_multi_d/gemm_preshuffle/grouped_gemm configs/ dirs are scaffolding following the per-variant convention -- not yet working support. Notes that grouped GEMM and stream-K are separate bridge efforts (#8136 stream-K).
…layout Phase 4: remove the legacy standalone regular-GEMM build path and reorganize the bridged path to mirror the merged fmha/ and grouped_conv/ bridges. - Delete the gemm_universal standalone generator/benchmark/profiler/CMake and the dormant test/ck_tile/gemm_tile_engine suite (its only other consumer). - Promote gemm_universal/configs/ to the op-root configs/ (flat), matching the fmha/grouped_conv convention; remove the now-empty gemm_universal/ folder. - Keep the bridge driver + worker at the gemm op root (gemm_full_benchmark.py, run_one_gemm_kernel.py); VARIANT_CONFIGS[gemm_universal] -> configs. - Drop gemm_universal from the sampling foreach loops and add_subdirectory. - Refresh README (folder layout, running examples, removal note). Shared harness (gemm_instance_builder.py, gemm_validation_utils.py, gemm C++ harness) and the not-yet-bridged variants (gemm_multi_d, gemm_preshuffle, grouped_gemm) are unchanged. Dispatcher untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s gate
Address quality-check findings on the Tile-Engine -> Dispatcher GEMM bridge:
- default_ci_config.json warp_tile [16,16,32] -> [32,32,16]; the old value is rejected by the gfx942/fp16 validator, so the documented default command (`python gemm_full_benchmark.py`) expanded to 0 kernels and aborted. Now it expands to 16 valid configs and benchmarks 80/80 OK.
- Add an opt-in `--verify` (with `--verify-tol`, default 2e-2) path: the worker compares each output against an fp32 numpy reference using the global metric max|out-ref|/max|ref|, results read VERIFY/MISMATCH, a mismatch counts as a failure, and max_rel/verified are written to the CSV. Previously an OK status meant liveness (non-zero output), not correctness.
- README: document the default-vs-correctness distinction and the new flag.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
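The global metric used by --verify can be sketched in plain Python. The worker itself uses numpy on fp32 arrays; the list-based helpers below and the handling of an all-zero reference are assumptions made to keep the example dependency-free.

```python
def max_rel_error(out, ref):
    """Global metric max|out-ref| / max|ref| over flattened outputs."""
    max_ref = max(abs(r) for r in ref)
    max_diff = max(abs(o - r) for o, r in zip(out, ref))
    # Assumption: with an all-zero reference, fall back to the absolute error.
    return max_diff / max_ref if max_ref > 0 else max_diff


def verify(out, ref, tol=2e-2):
    """Return the status string for one (kernel, problem) result."""
    return "VERIFY" if max_rel_error(out, ref) <= tol else "MISMATCH"
```

Because the error is normalized by the largest reference magnitude rather than per element, small outputs near zero do not dominate the result.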
…ples perf The bridge hardcoded warmup=5/repeat=10/flush_cache=false/rotating_count=1 in the generated-kernel stream_config and ignored the requested warmup/repeat. Old TE defaults to 50/100/flush=true/rotating=1000, so bridge-vs-old-TE comparisons measured different windows (the source of the ~13% "gap vs default" in PR #8123, and the tiny 5-iter warmup left the GPU clock un-ramped, producing the spurious 2048^3 dip). Default to the old-TE values and allow per-run override via CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. Device code is unchanged; parity re-verified (12/12, max_rel <= 3.82e-4). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
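The override logic lives in the generated C++ backend; the Python sketch below only illustrates the behavior described above. The four full variable names are an assumed expansion of CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING, and the set of strings treated as false is also an assumption.

```python
import os


def env_int(name, default, env=os.environ):
    """Integer knob: fall back to the default when unset or unparsable."""
    raw = env.get(name)
    try:
        return int(raw) if raw is not None else default
    except ValueError:
        return default


def env_bool(name, default, env=os.environ):
    """Boolean knob, case-insensitive (so False/Off/OFF all disable it)."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in ("", "0", "false", "off", "no")


def bench_config(env=os.environ):
    """Old-TE defaults for the regular GEMM path, each overridable per run."""
    return {
        "warmup": env_int("CK_TILE_BENCH_WARMUP", 50, env),
        "repeat": env_int("CK_TILE_BENCH_REPEAT", 100, env),
        "flush_cache": env_bool("CK_TILE_BENCH_FLUSH", True, env),
        "rotating_count": env_int("CK_TILE_BENCH_ROTATING", 1000, env),
    }
```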
…duler The codegen arch filter hard-coded pipeline=compv4 / scheduler=intrawave when validating tile geometry, ignoring each config's actual traits. Since compv4 has the strictest MFMA constraints, tiles legal under mem/compv3 were wrongly rejected -- collapsing the generated fp16/rcr set from ~1520 to 512 kernels (compv3 and mem each decimated ~5x; compv4 roughly preserved). Thread the trait's real pipeline/scheduler into _is_tile_arch_valid at both call sites; the tile pre-filter now keeps a tile if it is legal under any configured pipeline/scheduler, with the precise per-trait check deferred to _get_configs_for_variant. Verified on the 6144 fp16/rcr config population: emitted kernels 512 -> 1520 (compv3 464, compv4 128, mem 928), and a previously-rejected compv3 64x64x192 config now generates a header end-to-end. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…artifact, not a speedup) The sweep's >=20% "bridge faster" cells (all compv4/intrawave/1024^3) are NOT a bridge speedup. Proven on MI300X: the device kernel is byte-identical (same ck_tile::kentry<1,GemmKernel<...>> symbol, rocprof), and through any uniform harness it runs at the same speed on both sides. Ruled out empirically: kernel, compiler/flags (4 toolchain rebuilds incl. clang++-HIP with old-TE flags all give ~189), all bench knobs (warmup/repeat/flush/rotating/timer), allocation/placement (DeviceMem, 4GB decoys), and stale timing headers (byte-identical across trees). rocprof confirms the slowdown is real device time (13.78us vs 11.34us): old TE's *standalone benchmark binary* runs the identical kernel ~18-20% slower purely due to that process's GPU clock/execution state (+8% stall cycles under PMC, plus ~13% lower sustained SCLK). Fix: - ab_same_harness.py: apples-to-apples A/B that builds the old-TE kernel into a .so and runs BOTH it and the bridge .so through the SAME worker. Gap collapses to ~+/-0.5% at 1024^3 (was +20..+24% vs the standalone binary). Proof in ab_same_harness.out. - diagnose.md sec.4 rewritten: the prior hipcc-vs-clang++ toolchain theory is disproven and replaced with the evidence above. - generated_tile_backend.hpp: correct the misleading comment that claimed matching bench knobs makes bridge-vs-old-TE "apples-to-apples". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resolve CMakeLists.txt conflict in tile_engine/ops/gemm: keep develop's expanded op list and new add_subdirectory entries while preserving this branch's retirement of legacy gemm_universal (dropped from both budget foreach loops and from add_subdirectory).
Remove dispatcher/parity_diag/regression/diagnose.md from the PR; the content now lives on Confluence (MLSE) as a child page under the fp16/rcr A/B sweep report: https://amd.atlassian.net/wiki/spaces/MLSE/pages/1737132108
- ab_same_harness.py: derive ROOT from __file__ and take old-TE header dir from OLD_TE_GEN env var (was hard-coded dev paths); drop unused statistics import - generated_tile_backend.hpp: make env_bool case-insensitive (handles False/Off) and align its comment - gemm_full_benchmark.py: clarify in --devices help and resolve_devices docstring that a bare digit is a count; a single id needs the comma form (5,) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
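The --devices rule (a bare digit is a count, the comma form names ids) can be sketched like this. The signature and the explicit visible list are assumptions; the real resolve_devices may differ.

```python
def resolve_devices(spec, visible):
    """Turn a --devices value into a list of device ids.

    "all"   -> every visible device
    "3"     -> a count: the first three visible devices
    "5,"    -> explicit ids: just device 5
    "0,2,3" -> explicit ids
    """
    spec = spec.strip()
    if spec == "all":
        return list(visible)
    if "," in spec:
        return [int(tok) for tok in spec.split(",") if tok.strip()]
    return list(visible)[: int(spec)]
```

The trailing comma is what distinguishes "device 5" from "five devices", which is why a single id needs the comma form.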
The Stream-K bridge (#8136) was branched at #8123's first commit, so it lacked all subsequent regular-GEMM bridge improvements (arch-validated tile filtering, the develop merge + legacy gemm_universal retirement, benchmark-param/--verify work on the shared driver, README). Merge the current #8123 HEAD to pick those up; the Stream-K-specific analogues that live in the duplicated driver/worker/ctypes lib are ported in follow-up commits. Sole conflict: dispatcher/python/gemm_utils.py variant threading. Kept the Stream-K routing (_ctypes_source_name -> streamk_gemm_ctypes_lib.cpp, .name _streamk suffix, variant through codegen_args/expand_sweep) and adopted #8123's explanatory comment.
The Stream-K bridge keeps its own driver, worker and ctypes lib, so the regular-GEMM bridge improvements that landed on #8123 after this branch forked did not arrive via the merge. Port the Stream-K-specific analogues: - streamk_gemm_ctypes_lib.cpp: benchmark knobs now default to old-TE's warmup=50/repeat=100 (was 3/10 -- a cold, un-ramped clock, the root of #8123's spurious "perf gap") and are env-overridable via CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. Unlike the regular path, rotating_count defaults to 1: the Atomic preprocess re-zeros only the original C buffer, so rotating C would corrupt the accumulation. - streamk_gemm_full_benchmark.py: fan the (kernel x problem) work across every visible GPU (device-pinned HIP_VISIBLE_DEVICES workers, --devices, device CSV column), add the --verify/--verify-tol fp32-reference gate, and constrain --dtype/--layout to the supported fp16/rcr surface. Also fixes a latent proc-unbound error in the batch handler. - run_one_streamk_gemm_kernel.py: add the fp32 numpy reference check (global max|out-ref|/max|ref|, verified/max_rel) behind --verify. - README: document the Stream-K bridge driver/worker, flags, _streamk name suffix, fp16 Atomic tolerance, and the rotating_count divergence. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two Copilot findings on the Stream-K codegen carried in from #8094: - 03_streamk_gemm_driver.cpp: parse M/N/K with std::stoll (not std::stoi) before narrowing to ck_tile::index_t; stoi throws std::out_of_range past INT_MAX, needlessly rejecting large GEMM sizes. - unified_gemm_codegen.py (_launch_function_streamk): the Atomic reduction's per-iteration C reset zeroed args.M*args.N as a flat contiguous block, which skips elements when C has a padded leading dimension and corrupts the accumulation. Zero the used MxN region honoring stride_E via hipMemset2DAsync (CLayout-aware row/col-major), and check the HIP status instead of discarding it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…uction Previously the stream-K codegen hard-coded the Atomic reduction strategy. This makes the reduction strategy a first-class config axis so linear and tree reductions can be generated, named, and selected alongside atomic: - unified_gemm_codegen.py: add reduction_strategy to KernelConfig; encode it in key_name (redux_*) and KernelNaming.generate (atomic keeps the bare "_streamk" suffix for name parity, linear/tree are disambiguated); _launch_function_streamk now emits the config's StreamKReductionStrategy (the existing reset lambda already zeroes C for atomic vs the workspace for linear/tree); _get_configs_for_variant iterates strategies from a new streamk_config section, which is added to the default config (atomic, linear, tree). - gemm_utils.py: GemmKernelConfig gains reduction_strategy, threaded into .name, to_codegen_json (per-kernel streamk_config so single-kernel codegen emits exactly the requested strategy) and expand_sweep (reduction-strategy sweep axis). - ctypes_utils.py: reduction_strategy field on KernelConfig for end-to-end parity. arch_specs.json intentionally unchanged: stream-K reuses the standard warp-tile combos and arch_filter reads no stream-K-specific keys, so adding them would be dead data. Validated on gfx942 (MI300X): atomic/linear/tree each codegen + compile + run and pass fp32 verification (max_rel 3.1e-4) at 256x256x4096. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rategy fields
First slice of moving Stream-K into the dispatcher core (registry-addressable),
per the deep-core checklist. Additive and inert by default:
- KernelKey: new ReductionStrategy enum {None,Atomic,Linear,Tree}; Algorithm
gains streamk / reduction_strategy / workspace. tie() includes them so the
three strategies are distinct keys. encode_identifier() appends the Stream-K
suffix ("_streamk" / "_streamk_linear" / "_streamk_tree") byte-for-byte with
unified_gemm_codegen.py KernelNaming.generate(), guarded by algorithm.streamk
so non-Stream-K identifiers are unchanged.
- Problem: streamk / reduction_strategy request fields + ProblemBuilder::stream_k().
Validated on gfx90a (hipcc 7.12): non-SK encode_identifier byte-identical;
atomic/linear/tree suffixes correct; tie() distinguishes strategies.
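The suffix rule can be sketched in Python. The real logic is C++ in KernelKey::encode_identifier(); the function name streamk_suffix and the error raised for a Stream-K key without a strategy are assumptions.

```python
from enum import Enum


class ReductionStrategy(Enum):
    NONE = "none"
    ATOMIC = "atomic"
    LINEAR = "linear"
    TREE = "tree"


def streamk_suffix(streamk, strategy):
    """Suffix appended to a kernel identifier, empty for non-Stream-K keys."""
    if not streamk:
        return ""  # guarded: non-Stream-K identifiers stay unchanged
    if strategy is ReductionStrategy.NONE:
        raise ValueError("a Stream-K key needs a reduction strategy")  # assumption
    if strategy is ReductionStrategy.ATOMIC:
        return "_streamk"  # atomic keeps the bare suffix for name parity
    return f"_streamk_{strategy.value}"
```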
Add two non-pure virtuals so existing GEMM/FMHA/Conv instances compile unchanged: - get_workspace_size(Problem) -> bytes (default 0) - run(a,b,c,d_ptrs, void* workspace, problem, stream) overload whose default forwards to the existing no-workspace run(). The Dispatcher invokes these through a base KernelInstance* pointer (so the new overload is visible despite derived 6-arg run() overrides). The Stream-K backend (PR-C) overrides both to size and bind the reduction workspace. Validated on gfx90a (hipcc 7.12): a concrete instance overriding only the pre-existing pure virtuals compiles; default get_workspace_size==0 and the workspace-run forwards correctly via base pointer.
…tree codegen Adds the C++ backend that lets Stream-K ride the registry, plus the reduction strategy codegen needed to generate the three variants on this branch. - generated_tile_backend_streamk.hpp (NEW): GeneratedStreamKKernelInstance wraps a generated Stream-K kernel and builds ck_tile::StreamKHostArgs (the ABI-incompatible args the GemmHostArgs path could not). supports() gates on Problem.streamk + reduction_strategy so atomic/linear/tree coexist in the registry and the Dispatcher's first-fit selection picks the requested one. create_generated_streamk_kernel<> mirrors create_generated_tile_kernel<>. - codegen: reduction_strategy axis (atomic/linear/tree) -> KernelConfig field, key_name redux_*, KernelNaming "_streamk"/"_streamk_linear"/"_streamk_tree" (matches KernelKey::encode_identifier from PR-A), per-strategy StreamKReductionStrategy in the generated launch, and a streamk_config sweep axis. (Ported from the bridge branch reduction-strategy work.) PR-C keeps the generated launch's internal workspace/reset; PR-D relocates those to Dispatcher::run() via get_workspace_size()/the workspace-aware run(). Validated on gfx90a (hipcc 7.12): codegen emits 584 atomic + 584 linear + 584 tree headers with correct names; the backend device-compiles (22s) against a generated header and supports() accepts the matching strategy while rejecting the others and non-Stream-K problems.
…pace Relocate the Stream-K reduction-workspace buffer from the per-call generated launch() to a grow-on-demand buffer owned by the Dispatcher, so a long-lived dispatcher stops paying a hipMalloc/hipFree on every invocation. - codegen: hoist the StreamKGemmKernel type to struct scope and add GetWorkSpaceSize() + an external-workspace launch(args, cfg, workspace) overload. The existing 2-arg launch (internal DeviceMem) is unchanged so the bridge ctypes lib and the standalone 03 driver keep working. - backend: override get_workspace_size() and the workspace-aware run(); the no-workspace run() delegates with a null buffer. The per-iteration reset stays in the backend (it needs CDataType + the reduction strategy). - dispatcher: own a grow-on-demand workspace (raw void*/size_t to keep HIP out of the public header), size it via get_workspace_size(), and pass it through run_fused()/run_explicit(); free it in the destructor. Atomic needs none (size 0 -> null -> internal path); linear/tree consume the owned buffer. Validated on MI210/gfx90a: atomic/linear/tree all verify vs reference_gemm at unchanged perf, with linear/tree now running on the dispatcher-owned workspace.
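The grow-on-demand ownership can be sketched in Python, with a bytearray standing in for the device buffer. The class name WorkspaceOwner and the allocations counter are hypothetical; the real dispatcher holds a raw void*/size_t pair and frees it in its destructor.

```python
class WorkspaceOwner:
    """Keeps one buffer alive across calls and reallocates only when it must grow."""

    def __init__(self):
        self._buf = None
        self.allocations = 0  # counts stand-in "hipMalloc" calls

    def acquire(self, nbytes):
        """Return a buffer of at least nbytes, or None when no workspace is needed."""
        if nbytes == 0:
            return None  # e.g. atomic reduction: size 0 -> null -> internal path
        if self._buf is None or len(self._buf) < nbytes:
            self._buf = bytearray(nbytes)
            self.allocations += 1
        return self._buf
```

Repeated calls with the same or a smaller size reuse the existing buffer, which is the per-invocation allocation cost the commit removes.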
…river
Add 04_streamk_registry_driver.cpp: a runnable proof of the full deep-core path
(Registry::register_kernel -> Dispatcher::run -> first-fit supports() gate on
reduction_strategy -> GeneratedStreamKKernelInstance::run -> generated launch ->
verify vs reference_gemm). Unlike 03_streamk_gemm_driver.cpp, which calls
SelectedKernel::launch() directly and bypasses the dispatcher, this exercises the
registry selection and the Dispatcher-owned workspace.
Selectable strategy via --strategy {atomic,linear,tree}. Validated on
MI210/gfx90a for all three (distinct registry identifiers, each PASS).
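The first-fit selection that this driver exercises can be sketched as follows. The classes are simplified stand-ins, not the dispatcher's types, and the rule that a regular kernel accepts only non-Stream-K problems is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    streamk: bool = False
    reduction_strategy: str = "none"


@dataclass(frozen=True)
class Kernel:
    name: str
    streamk: bool = False
    reduction_strategy: str = "none"

    def supports(self, problem):
        """Gate on the Stream-K request fields only (tile checks omitted)."""
        if not self.streamk:
            return not problem.streamk  # assumption for this sketch
        return problem.streamk and problem.reduction_strategy == self.reduction_strategy


def select_first_fit(registry, problem):
    """Return the first registered kernel whose supports() accepts the problem."""
    for kernel in registry:
        if kernel.supports(problem):
            return kernel
    return None
```

With atomic, linear, and tree kernels registered side by side, the requested reduction_strategy is what picks one of them.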
…K backend The dispatcher-wrapper generator emitted ONE template for every variant: backends::GeneratedKernelInstance<KernelStruct> with no streamk/reduction_strategy on the key. For Stream-K that is wrong twice over -- the regular backend calls launch(GemmHostArgs,...) which the SK kernel struct does not have (so the aggregate register_all_kernels.hpp would not compile against SK), and the key omits the SK fields so encode_identifier() emits no _streamk suffix and atomic/linear/tree collide in the registry. Make the wrapper variant-aware: for STREAM_K configs include generated_tile_backend_streamk.hpp, set key.algorithm.streamk + reduction_strategy + workspace (and pad flags for identifier parity), and return create_generated_streamk_kernel<KernelStruct, KernelStruct::ADataType, ...>. All other variants are unchanged. Validated on MI210/gfx90a: a registry populated via the generated wrappers holds atomic+linear+tree side by side; Dispatcher::run() selects each by Problem::reduction_strategy and all three verify vs reference_gemm.
…are atomic reset
P2: GeneratedStreamKKernelInstance::supports() now ends with SelectedKernel::IsSupported(make_args(problem)) (a new generated static that runs MakeKernelArgs + IsSupportedArgument). A problem too small to partition across CUs is rejected during selection, so first-fit falls back to a non-Stream-K kernel instead of throwing std::runtime_error at launch.
P3: the atomic reduction reset zeroes C with a stride-aware hipMemset2DAsync (pitch = stride_E * sizeof(C), width = N * sizeof(C), height = M) instead of a flat hipMemsetAsync over M*N. Correct for a padded/strided C; identical coverage for the contiguous rcr case. Applied to both the internal and external-workspace launch overloads.
Validated on MI210/gfx90a: atomic/linear/tree still select + run + verify from a multi-kernel registry; valid small problems are accepted (no false-negatives).
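Why the flat reset is wrong for a padded C can be shown on a host-side byte buffer. This sketch simulates the two memset shapes in Python for a row-major C; it does not call HIP, and the tiny 2x3 problem with stride 4 is chosen only for illustration.

```python
def memset_flat(buf, nbytes):
    """Zero the first nbytes bytes, like a flat memset over M*N elements."""
    buf[:nbytes] = bytes(nbytes)


def memset_2d(buf, pitch, width, height):
    """Zero width bytes at the start of each of height rows spaced pitch bytes apart."""
    for row in range(height):
        start = row * pitch
        buf[start:start + width] = bytes(width)


ELEM = 2                  # bytes per fp16 element
M, N, STRIDE_E = 2, 3, 4  # row-major C with one padding element per row


def stale_bytes(buf):
    """Count bytes inside the used MxN region that are still non-zero."""
    return sum(buf[r * STRIDE_E * ELEM + c] != 0
               for r in range(M) for c in range(N * ELEM))


flat = bytearray(b"\xff" * (M * STRIDE_E * ELEM))
memset_flat(flat, M * N * ELEM)

strided = bytearray(b"\xff" * (M * STRIDE_E * ELEM))
memset_2d(strided, pitch=STRIDE_E * ELEM, width=N * ELEM, height=M)

print(stale_bytes(flat), stale_bytes(strided))
```

The flat reset spends part of its byte budget on row padding and therefore stops short of the end of the last row, while the pitched reset covers exactly the used region and leaves the padding untouched.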
The bridge dispatcher's tile-divisibility gate rejected any problem where
M % TileM != 0 for every layout, returning status -2 ("No suitable kernel")
at runtime even though the .so built fine. This wrongly excluded bf16 rcr/rrr
kernels with a non-power-of-two TileM (e.g. 192) on standard shapes like
1024^3 -- cases Old-TE compiles, runs, and verifies as correct.
Root cause: supports() was layout-blind, while the underlying
ck_tile::GemmKernel::IsSupportedArgument only constrains a dimension when an
operand whose inner axis is that dimension participates without padding:
RowMajor A -> K, ColMajor A -> M
RowMajor B -> N, ColMajor B -> K
RowMajor C -> N, ColMajor C -> M
So for rcr (RowMajor A & C) M is never gated, which is why Old-TE runs M=192
tiles on M-indivisible problems.
Make supports() compute require_m/n/k from the kernel key's A/B/C layouts so
it mirrors IsSupportedArgument exactly (also honoring k_batch in the K grain).
Anything it now lets through is still validated by the kernel's own
IsSupportedArgument inside launch(), so the bridge stays a strict functional
equivalent of Old-TE. Applied to both generated_tile_backend.hpp (the GEMM
.so path) and the sibling tile_backend.hpp.
Validated on gfx942 (MI300X): 85 rcr/rrr bf16 192-tile .so files that previously
returned status -2 now run at 1024^3 (Old-TE runs the same, verification
correct); the 8 remaining rejects are tile N=192 cases for which Old-TE also
reports "Arguments not supported" at N=1024 -- parity preserved in both directions.
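The layout table above translates directly into a small gate. This is a sketch under simplifying assumptions: it ignores padding flags and k_batch, and the function names are hypothetical rather than the backend's.

```python
def gated_dims(layout):
    """Dimensions that must divide evenly, from a layout string such as 'rcr'.

    Letters are A, B, C in order; 'r' is RowMajor and 'c' is ColMajor.
    """
    a, b, c = layout
    return {
        "K" if a == "r" else "M",  # RowMajor A -> K, ColMajor A -> M
        "N" if b == "r" else "K",  # RowMajor B -> N, ColMajor B -> K
        "N" if c == "r" else "M",  # RowMajor C -> N, ColMajor C -> M
    }


def supports(layout, problem, tile):
    """problem and tile are dicts with keys 'M', 'N', 'K'."""
    return all(problem[d] % tile[d] == 0 for d in gated_dims(layout))
```

For rcr the gated set is {N, K}, so a TileM of 192 is accepted at M=1024, whereas a TileN of 192 is still rejected at N=1024.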
…oding rcr dispatcher_initialize() in gemm_ctypes_lib.cpp hardcoded the KernelKey layout to rcr (RowMajor/ColMajor/RowMajor) for every kernel. Now that supports() is layout-aware, that wrong key layout makes the dispatcher reject valid problems: a crr kernel does not gate K (neither A=ColMajor nor B=RowMajor has K as its inner axis), but with a hardcoded rcr key supports() applies rcr's K-gate and returns status -2 for TileK=192 problems (e.g. crr 64x64x192 at 1024^3) that Old-TE compiles, runs, and verifies (~87 TFLOPS). Derive signature.layout_a/b/c from the force-included kernel's own ALayout/BLayout/CLayout types via std::is_same_v with tensor_layout::gemm::RowMajor. The key now matches the kernel, so the layout-aware gate is correct for all four layouts. Execution was already layout-correct (the kernel uses its own compile-time layouts); only the host-side selection metadata was wrong. Validated on gfx942 (MI300X): crr 64x64x192 now runs on the bridge (93 TFLOPS), restoring parity with Old-TE.
The >=20% bridge-vs-old-TE perf gaps in the parity sweep are a harness artifact: the sweep timed the bridge in-process but timed old-TE via its separate standalone benchmark binary, which runs the byte-identical kernel at a lower sustained SCLK. Measured through one harness the gap is <1%. ab_same_harness.py removed that artifact but hardcoded the old-TE header dir to fp16/rcr. Derive it per stem as <base>/<dtype>/<layout> so one run covers rcr/rrr/ccr/crr and fp16+bf16, add a --stems-file/--csv resume-aware sweep mode, and use the median (not max) per point.
For a full ~2000-stem sweep on a single GPU: batch all shapes into one worker call per side (5x fewer process startups), cache the compiled old-TE .so, and add a parallel --build-only pre-pass so hipcc compilation uses all CPU cores while GPU measurement stays serial.
…eductionStrategy)
Close two review nits on the Stream-K drivers:
- Parse M/N/K with std::stoll instead of std::stoi in the 03/04 drivers so large GEMM dimensions are no longer rejected with std::out_of_range once they exceed INT_MAX (Copilot nit).
- Add inline to_string(ReductionStrategy) in kernel_key.hpp and route the 04 driver through it, removing the driver-local strategy_name() duplicate so callers share one spelling that matches the codegen suffix scheme.
Adds dispatcher_test_streamk_registry, a GPU test that generates the three reduction-strategy kernels (atomic/linear/tree) from one tile config, builds the 04 registry driver once per strategy (each force-including its own header, since SkReductionStrategy is a compile-time constexpr), and asserts for each that the encode_identifier() suffix matches, the Dispatcher selects it by Problem::reduction_strategy, and the result verifies against the reference. This converts the previously manual deep-core validation into a regression- guarded CTest. It SKIPs (return 77) when no GPU or hipcc is present, so CPU-only CI is unaffected.
…IBRARY_PATH
meas()/meas_all() built the worker env without /opt/rocm/lib on
LD_LIBRARY_PATH, so run_one_gemm_kernel.py failed to load every .so
("libamdhip64.so.7: cannot open shared object file") and every cell
came back nan. Set it the same way ab_efficient_sweep.py does.
… guard)
The bridge-vs-old-TE A/B reported phantom regressions from two MEASUREMENT bugs, not real codegen gaps:
- ab_same_harness.py built the old-TE side WITHOUT the TE codegen flags the bridge (and real old-TE's own CMake) use, so -enable-post-misched defaulted back on and old-TE ran ~10-40% faster -> the bridge looked regressed when it is at parity. Now both sides build with identical flags.
- ab_efficient_sweep.py measured whatever libgemm_<stem>.so existed with no freshness check, so 3-day-old binaries built from an obsolete codegen showed up as -78%/+703% gaps. Added a guard: skip any .so older than its generated header (treated as missing) instead of reporting a phantom gap.
With both fixes the 41 former >15% outlier stems measure within +/-10% (median +0.01%); no bridge codegen regression exists.
Note: a separate, deliberately UNCOMMITTED perf change in gemm_utils.py (gate -enable-post-misched=0 on persistent) speeds up non-persistent large tiles by ~9-40%; it is held back pending a broader persistent-kernel no-regression sweep.
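The freshness guard amounts to a modification-time comparison. A minimal sketch, assuming the check is based on file mtimes; the helper name is_fresh is hypothetical.

```python
import os


def is_fresh(so_path, header_path):
    """True only if the .so exists and is at least as new as its generated header."""
    if not (os.path.exists(so_path) and os.path.exists(header_path)):
        return False  # a missing file is never fresh
    return os.path.getmtime(so_path) >= os.path.getmtime(header_path)
```

A sweep would treat any stem for which is_fresh() returns False as missing and rebuild or skip it, rather than timing a binary built from an older codegen.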
… driver The standalone stream-K driver verified atomic results with the single-pass GEMM tolerance get_*_threshold<...>(K). Atomic reduction accumulates K-split partials directly into low-precision C (workspace size 0), incurring rounding error that grows with the split factor -- correct results were flagged FAIL on small-M/N, large-K shapes (e.g. 512x512x8192) where tiles < CUs. Mirror tile_engine's calculate_rtol_atol (validation.hpp): derive kbatch from the kernel's tile partitioner (estimate_num_wgs_per_tile), widen atol/rtol with the split-K CDataType accumulation term, and take the max with the per-split tolerance. The driver and tile_engine now verify identically; the kernel is unchanged.
…gine
The standalone stream-K driver built its stream_config as {stream, true, 0,
warmup, repeat}, leaving is_gpu_timer/flush_cache/rotating_count at defaults
(flush_cache=false, rotating_count=1). The tile_engine benchmark instead times
with flush_cache=true and rotating_count=1000, so the driver measured a
warm-cache best case while tile_engine measured cold-cache -- the entire source
of the reported dispatcher-vs-TE "performance gap" at low tile counts.
Add --timer/--flush_cache/--rotating_count (defaulting to the tile_engine
values) and pass them through stream_config so both sides use identical timing
methodology. A validating run still times a single cold shot, mirroring
tile_engine's repeat_once_if_verify(); collect perf with a separate --validate 0
pass.
The 04 registry driver hardcoded the KernelKey signature to DataType::FP16 and an rcr layout, so fp8/bf8/bf16 Stream-K kernels registered under the wrong key and failed dispatch/identifier checks. Derive dtype_a/b/c/acc and layout tags from the generated kernel's actual A/B/C types via compile-time dtype_enum_of<T>()/layout_tag_of<Layout>() helpers (fp8/bf8 inputs accumulate in fp32 and write fp16 C, matching Tile Engine). Parametrize test_streamk_registry.py over fp16/bf16/fp8/bf8 (dtype-independent core objects built once; per-dtype codegen + build + verify with per-dtype identifier assertions). All four datatypes register, dispatch, and verify across atomic/linear/tree on gfx942 (MI300X).
Port #8136's Tile-Engine->Dispatcher Stream-K bridge onto the rewritten deep-core #8094 engine (KernelKey reduction fields, KernelInstance workspace virtuals, StreamK backend, Dispatcher-owned reduction workspace, registry + validation driver).
3-way merge over the shared stream_k ancestor; only the streamk launch emitter in unified_gemm_codegen.py and 03_streamk_gemm_driver.cpp conflicted -- both resolved to the deep-core side:
- codegen now emits the struct-scope Sk* kernel type + GetWorkSpaceSize + IsSupported, keeps the 2-arg internal-workspace launch the bridge ctypes lib calls, and adds the 3-arg dispatcher-owned-workspace launch.
- driver takes deep-core's stoll parse + apple-to-apple timing + validate cold shot.
Bridge ctypes lib still bypasses the registry and calls the 2-arg launch directly, so the bridge runs the exact deep-core kernels. Codegen smoke: atomic/linear/tree + regular gemm all generate cleanly (0 failed).
Build the Stream-K bridge .so without the dispatcher static lib and with TE-streamk-matching flags:
- compile flags per-variant: Stream-K matches tile_engine/ops/gemm_streamk CMake (only -Wno-* + --offload-compress, NO -mllvm codegen flags and NOT -enable-noalias-to-md-conversion=0 which is a gemm_universal-bridge concern). This keeps the A/B fair; the regular path is unchanged.
- link skips libck_tile_dispatcher.a for Stream-K (the ctypes lib launches the force-included kernel directly, no registry/dispatcher symbols), and the build guard no longer requires the static lib for Stream-K.
- ensure build/examples exists before hipcc writes there (the cmake build that normally creates it is skipped on the Stream-K path).
Validated on MI300X (gfx942): atomic/linear/tree fp16/rcr all build, run, and verify PASS (max_rel_err <8e-4) through the bridge GpuGemmRunner path.
Correct the Stream-K compile flags to match Tile Engine's gemm_streamk build verbatim (ground truth: a TE streamk build's compile_commands.json). The -mllvm codegen flags come from the composablekernel project-root add_compile_options applied globally to the TE benchmark -- they are NOT in the per-target options, so the earlier "minimal flags" assumption was wrong and would have produced a phantom A/B gap (different occupancy).
Flags now:
-std=c++20 -fno-offload-uniform-block -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false --offload-compress
-enable-post-misched=0 is applied unconditionally (TE does so for streamk), and -enable-noalias-to-md-conversion=0 is not used (TE streamk omits it).
Add a bf16 codec to the bridge runner so bf16 Stream-K kernels can run through the same ctypes path (the ABI is void*+sizeof, so 2-byte bf16 shares the fp16 path; only the bit pattern differs). Dtype is inferred from the kernel name. ENCODE is round-to-nearest-even to bf16 bits; DECODE is bit-exact to device bf16_t so the numpy reference multiplies the same values the GPU does.
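As an illustration of the encode/decode pair described above, here is a minimal pure-Python sketch of a bf16 codec operating on one value at a time. It is not the runner's code (which presumably works on numpy arrays); the names bf16_encode and bf16_decode are hypothetical, and the NaN handling is an assumption.

```python
import struct


def bf16_encode(x: float) -> int:
    """Round an fp32-representable value to its 16-bit bf16 pattern
    using round-to-nearest-even on the discarded low 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        # NaN: truncate and force a mantissa bit so it stays a NaN.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Adding 0x7FFF plus the current LSB of the kept half rounds ties to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def bf16_decode(h: int) -> float:
    """Widen a bf16 bit pattern to fp32 by placing it in the high 16 bits.
    No rounding is involved, so the result is exact."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]
```

Building the reference from bf16_decode(bf16_encode(x)) rather than from x itself means the host multiplies the same rounded values the GPU sees, which is the point of keeping the decode exact.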
…ing)
Make the Stream-K bridge layout-generic instead of rcr-hardcoded, so all 4 A/B/C layouts (rcr/rrr/ccr/crr) work end to end:
- streamk_gemm_ctypes_lib.cpp: derive stride_A/B/C at compile time from the kernel's ALayout/BLayout/CLayout (RowMajor RxC -> ld=C, ColumnMajor -> ld=R) instead of the hardcoded K/K/N.
- generated_tile_backend_streamk.hpp (registry path): same layout-derived strides.
- GpuGemmRunner: read dtype AND layout off the kernel name; arrange each operand per layout (RowMajor=C-contiguous, ColumnMajor=F-contiguous); bf16 encode is now memory-order-preserving so column-major operands stay column-major.
- run_one_streamk_gemm_kernel.py: dtype/layout-aware A/B + reference (was fp16-only).
- streamk_gemm_full_benchmark.py: SUPPORTED_LAYOUTS now rcr/rrr/ccr/crr, SUPPORTED_DTYPES fp16+bf16 (fp8/bf8/int8 still need runner codecs).
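The stride rule above (RowMajor RxC -> ld=C, ColumnMajor -> ld=R) can be checked with a small sketch. The real derivation happens at compile time in C++; the Python helpers below (leading_dim, gemm_strides) are hypothetical names used only to show the arithmetic, with A as MxK, B as KxN and C as MxN.

```python
def leading_dim(layout: str, rows: int, cols: int) -> int:
    """Leading dimension of a rows x cols matrix: 'r' = RowMajor, 'c' = ColumnMajor."""
    if layout == "r":
        return cols  # consecutive rows are `cols` elements apart
    if layout == "c":
        return rows  # consecutive columns are `rows` elements apart
    raise ValueError(f"unknown layout tag: {layout!r}")


def gemm_strides(layout: str, m: int, n: int, k: int) -> tuple:
    """Return (stride_A, stride_B, stride_C) for a 3-letter tag such as 'rcr'."""
    a, b, c = layout
    return (leading_dim(a, m, k), leading_dim(b, k, n), leading_dim(c, m, n))


if __name__ == "__main__":
    M, N, K = 3840, 4096, 2048
    for tag in ("rcr", "rrr", "ccr", "crr"):
        print(tag, gemm_strides(tag, M, N, K))
```

For rcr this reproduces the previously hardcoded K/K/N; the other three layouts differ in stride_A and/or stride_B, which is why the hardcoded values only worked for rcr.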
Extend the Tile-Engine -> Dispatcher Stream-K bridge (PR #8136) beyond fp16/bf16 to the FNUZ fp8 (E4M3) and bf8 (E5M2) formats used by gfx942/MI300.
GpuGemmRunner (dispatcher/python/gemm_utils.py):
- Port the tested FNUZ codecs from the sibling fp8 bridge (PR #8887): bit-exact decode tables + nearest-representable/saturating encode, carried as uint8 bit patterns (sizeof fp8_t/bf8_t == 1). Encode preserves operand C/F contiguity so the layout-generic _to_buf path holds for the new dtypes.
- run() now sizes the C buffer per get_output_dtype: fp8/bf8 -> fp16 store, int8 -> int32; bf16 still carried as raw uint16. fp16/bf16 paths unchanged.
- Arch guard: fp8/bf8 raise a clear error on a non-gfx942 GPU (gfx950/MI350 uses OCP fp8, a different bit layout) rather than silently mis-decoding.
- An int8 codec is included for when the engine supports it (see below).
Reference + surface:
- run_one_streamk_gemm_kernel.py verify reference is now dtype-aware (decode(encode(x)) per dtype; int8 = exact int32 matmul).
- streamk_gemm_full_benchmark.py SUPPORTED_DTYPES += fp8, bf8.
int8 is intentionally left OUT of SUPPORTED_DTYPES: it is blocked at the ck_tile engine, not the bridge. The int8 kernel codegens but fails to compile for every reduction strategy -- warp_gemm_dispatcher has no Dispatcher<int8,int8,float,32,32,16,...> specialization for the streamk CompV3 path, so the BlockUniversalGemmAsBsCr WarpGemm static_asserts fail. Matches the PR #8094 decision to leave int8 out.
GPU-validated on gfx942 (MI300X), 2048^3, both reduction + layout variants:
fp8 atomic/linear/tree rcr: PASS (192/180/183 TFLOPS, max_rel <= 9.4e-4)
bf8 atomic/linear/tree rcr: PASS (192/181/181 TFLOPS, max_rel <= 7.8e-4)
fp8 ccr / bf8 crr (col-major): PASS (245/210 TFLOPS)
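To make "decode tables + nearest-representable/saturating encode" concrete, here is a minimal sketch built from the FNUZ format definition (exponent bias 8 for E4M3 and 16 for E5M2, 0x80 as the single NaN, no infinities, no negative zero). It is not the codec ported from PR #8887: the function names are hypothetical, and the brute-force encode breaks ties toward the lower byte value rather than implementing a specific rounding mode.

```python
import math


def fnuz_decode(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode one FNUZ byte (sign | exponent | mantissa) to a Python float."""
    if byte == 0x80:
        return math.nan  # FNUZ reuses the "negative zero" pattern as NaN
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = 1 << (exp_bits - 1)  # 8 for E4M3 FNUZ, 16 for E5M2 FNUZ
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


# 256-entry decode tables, indexed by the raw uint8 bit pattern.
E4M3_TABLE = [fnuz_decode(b, 4, 3) for b in range(256)]
E5M2_TABLE = [fnuz_decode(b, 5, 2) for b in range(256)]


def fnuz_encode(x: float, table: list) -> int:
    """Nearest-representable encode by table search; out-of-range inputs
    saturate to the largest finite magnitude because it is the nearest entry."""
    if math.isnan(x):
        return 0x80
    candidates = [(abs(v - x), b) for b, v in enumerate(table) if not math.isnan(v)]
    return min(candidates)[1]
```

Because both formats have only 256 patterns, a table is enough for an exact decode, and the host reference can be formed from table[encode(x)] so that it uses exactly the values stored in the operand buffers.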
Summary
Routes the stream_k GEMM variant through the same Tile Engine (TE) →
Dispatcher bridge already landed for regular GEMM (Phase 1, #8123) and grouped
GEMM (Phase 3, #8130). Goal of the overall effort: the Dispatcher is the single
source of truth for codegen/build/runtime, and TE only produces configs +
benchmarks.
This PR is stacked on muozturk/dispatcher-gemm-bridge (#8123); please merge that first. Its own diff is just two commits:
- [CK_TILE] Add stream_k variant to GEMM Dispatcher codegen (cherry-picked)
- [CK_TILE] Add Tile Engine -> Dispatcher bridge for Stream-K GEMM (this work)
What Stream-K needs that regular GEMM doesn't
Stream-K is a single-problem GEMM (one A/B/C, one M/N/K) with the same C ABI as regular GEMM, so the Python side (GpuGemmRunner / GemmDispatcherLib / GemmProblem) and the GPU worker are reused unchanged. The differences are internal to the .so:
- The launch entry point is SelectedKernel::launch(const ck_tile::StreamKHostArgs&, const stream_config&), which allocates the reduction workspace internally (DeviceMem) and uses the Atomic reduction strategy.
- The registry path (generated_tile_backend.hpp::run()) hard-codes the single-problem GemmHostArgs launch and won't compile against a Stream-K SelectedKernel. So the Stream-K ctypes lib bypasses the registry and calls SelectedKernel::launch(args, stream) directly, reporting the name from the KERNEL_NAME macro (same approach grouped uses).
Changes
New
- dispatcher/bindings/ctypes/streamk_gemm_ctypes_lib.cpp: same single-problem C ABI (dispatcher_run_gemm(A,B,C,M,N,K,time_ms)); hipMalloc + copy A/B, memset C=0 (Atomic accumulates into C), build StreamKHostArgs with rcr strides (stride_A=K, stride_B=K, stride_C=N, k_batch=1), launch, copy C back. Returns 0 / -1 (HIP or throw) / -2 (kernel reports args unsupported).
- tile_engine/ops/gemm/streamk_gemm_full_benchmark.py: 3-phase driver (expand configs → setup_multiple_gemm_dispatchers build → subprocess-isolated benchmark), mirroring gemm_full_benchmark.py with variant="stream_k".
- tile_engine/ops/gemm/run_one_streamk_gemm_kernel.py: disposable GPU worker (identical to the regular worker since the ABI matches).
- tile_engine/ops/gemm/gemm_streamk/configs/default_config.json: small sweep config (128x128x{32,64}, 2x2x1, 32x32x16, compv3/compv4, intrawave, cshuffle, pad true, persistent false) → 4 kernels.
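The "→ 4 kernels" count follows from the Cartesian product of the swept fields: two K tile sizes times two pipelines, with every other field fixed. A minimal sketch of that expansion is below; the dictionary keys and the expand helper are hypothetical and do not reflect the real default_config.json schema or the dispatcher's expand_sweep.

```python
from itertools import product

# Hypothetical field names; values are the ones listed for the default sweep.
sweep = {
    "tile": ["128x128x32", "128x128x64"],
    "warps": ["2x2x1"],
    "warp_tile": ["32x32x16"],
    "pipeline": ["compv3", "compv4"],
    "scheduler": ["intrawave"],
    "epilogue": ["cshuffle"],
    "pad": [True],
    "persistent": [False],
}


def expand(sweep: dict) -> list:
    """Return one config dict per combination of the listed values."""
    keys = list(sweep)
    return [dict(zip(keys, combo)) for combo in product(*(sweep[k] for k in keys))]


configs = expand(sweep)

if __name__ == "__main__":
    for cfg in configs:
        print(cfg["tile"], cfg["pipeline"])
    print(len(configs), "kernel configs")
```

Only the tile K dimension and the pipeline have more than one value, so the product has 2 x 2 = 4 entries; with four benchmark problems that gives the 16 runs reported in the validation section.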
Modified
- dispatcher/python/gemm_utils.py: _ctypes_source_name() selects streamk_gemm_ctypes_lib.cpp for variant=="stream_k" (in both _build_compile_jobs and setup_multiple_gemm_dispatchers); .name appends _streamk; variant threaded into codegen_args and expand_sweep.
- dispatcher/python/ctypes_utils.py: pass the requested variant to codegen --variants instead of hard-coding "standard".
Validation (gfx942 / MI300X, fp16 / rcr)
Numeric parity vs a numpy fp32 reference (A.f32 @ B.f32). Stream-K's Atomic
reduction does multiple fp16 atomic-adds (one per K-split partial) vs
regular/grouped's single fp32→fp16 store, so it is inherently noisier; tolerance
is widened to max_rel ≤ 2.5e-3, frob_rel ≤ 1.5e-3 (regular/grouped use 5e-4).
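The two tolerances can be read as the following checks. This is a sketch on flat Python lists rather than the worker's numpy code: max_rel uses the global max|out-ref|/max|ref| form stated later in this description, while the Frobenius-norm ratio used for frob_rel is an assumption about what that name means.

```python
import math


def max_rel(out, ref):
    """Global relative error: max|out - ref| / max|ref| over all elements."""
    num = max(abs(o - r) for o, r in zip(out, ref))
    den = max(abs(r) for r in ref)
    return num / den


def frob_rel(out, ref):
    """Assumed definition: ||out - ref||_F / ||ref||_F."""
    num = math.sqrt(sum((o - r) ** 2 for o, r in zip(out, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den


def verify(out, ref, max_rel_tol=2.5e-3, frob_rel_tol=1.5e-3):
    """Pass only if both metrics are within the widened Stream-K tolerances."""
    return max_rel(out, ref) <= max_rel_tol and frob_rel(out, ref) <= frob_rel_tol
```

Normalizing by the global maximum (rather than element by element) keeps near-zero reference entries from dominating the metric, which matters for atomically accumulated fp16 outputs.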
Full TE driver run (4 kernels x 4 problems = 16/16 OK, 0 failures),
default problem set uses Stream-K's sweet spot (squares + a large-K skinny shape):
All status 0, positive TFLOPS, nonzero output. Name parity holds end-to-end:
the runtime name reported by each .so equals
GemmKernelConfig(variant="stream_k").name, ending in _streamk.
Unsupported-shape handling: a tiny 257^3 problem is correctly reported as
unsupported by the kernel (status -2, too few tiles to partition across CUs)
and surfaced gracefully by the bridge rather than crashing.
Test plan
- unified_gemm_codegen.py ... --variants stream_k emits a *_streamk.hpp whose stem == GemmKernelConfig(variant="stream_k").name
- setup_multiple_gemm_dispatchers builds the Stream-K config set → .so compiles & links against streamk_gemm_ctypes_lib.cpp
Next
Land #8123, then this; afterwards delete the legacy tile_engine/ops/gemm_streamk/ machinery (Phase 4).
Update 2026-06-12: brought current with #8123 + Copilot fixes
This branch had forked at #8123's first commit, so it lacked every later
regular-GEMM bridge improvement. Merged the current #8123 HEAD and ported the
Stream-K-specific analogues (the Stream-K bridge keeps its own driver, worker and
ctypes lib, so those fixes do not arrive via the merge):
- … (streamk_gemm_ctypes_lib.cpp): benchmark knobs defaulted to warmup=3/repeat=10 (a cold, un-ramped clock), the root cause of the regular bridge's spurious "perf gap." Now default to old-TE's warmup=50/repeat=100, env-overridable via CK_TILE_BENCH_WARMUP/REPEAT/FLUSH/ROTATING. rotating_count stays 1 for Stream-K: the Atomic preprocess re-zeros only the original C buffer, so rotating C would leave rotated copies un-zeroed and corrupt the accumulation.
- --verify correctness gate (driver + worker): opt-in fp32 numpy reference check (global max|out-ref|/max|ref|, verified/max_rel in the CSV); a mismatch counts as a failure.
- … visible GPUs via device-pinned HIP_VISIBLE_DEVICES workers (--devices, device CSV column); also fixes a latent proc-unbound error in the batch handler.
- --dtype/--layout guards (driver): constrained to the supported fp16/rcr surface so a mismatch fails fast.
- std::stoi → std::stoll for M/N/K in 03_streamk_gemm_driver.cpp; stride-aware C zeroing via hipMemset2DAsync (CLayout-aware, checked HIP status) in _launch_function_streamk.
Validation status: DONE on gfx942/MI300X (ctr-cx64-mi300x-4, enroot container). Bridge build+benchmark+--verify on gemm_streamk/configs/default_config.json = 16/16 OK, all verified (max_rel ≤ 3.0e-3, fp16 atomic tol), name parity holds. Bridge-vs-Old-TE parity (perf + correctness, byte-identical device kernel) posted as a comment with the full table + streamk_bridge_oldTE.csv.