Preprocess: harness shape-faithfulness (+ backend-rewrite subagents) #290
Open
iraj465 wants to merge 20 commits into
sdubagun-amd
approved these changes
Jun 23, 2026
yueliu14
approved these changes
Jun 23, 2026
…>TileLang)
Extend the translation system from the single pytorch->flydsl pair to 10 pairs,
converting the less-optimized source languages (pytorch, triton, ck, hip) toward
the two most-optimized DSLs on MI300X/gfx942 (FlyDSL, TileLang), plus tilelang<->flydsl
cross-conversion.
- translation_registry.py: source detectors (_detect_triton/_tilelang/_ck/_hip/_flydsl),
_tilelang_env_setup, kb_skill_dir on TranslationPair, generalized load_translation_kb,
register 9 new pairs. Targets restricted to flydsl/tilelang (always >= ck/triton/hip).
- mini.py: generic kernel_type->target resolver (<src>2<tgt> or bare target) unlocking
translation for the new targets (was hardcoded flydsl-only).
- config/mini_kernel_to_tilelang.yaml: source-agnostic ->TileLang translation agent config.
- skills/tilelang*, skills/flydsl/docs/psv4_ref: TileLang + FlyDSL KB (ported from PSv4).
- subagents/{triton,ck,hip,tilelang}-to-flydsl + {pytorch,triton,ck,hip,flydsl}-to-tilelang:
9 rewrite subagents with language_match auto-selection.
Validated: 178 tests pass, registry lists 10 pairs + 15 subagents, detectors fire on real
sglang triton + aiter CK files.
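The kernel_type -> target resolution described for mini.py can be sketched as below. This is a minimal illustration, not the code in mini.py: the function name resolve_target and the two name sets are hypothetical, and only the "<src>2<tgt> or bare target" convention and the language names come from the commit message.

```python
from __future__ import annotations

# Hypothetical sketch of the "<src>2<tgt> or bare target" convention.
# The names below are illustrative and are not taken from mini.py.
KNOWN_SOURCES = {"pytorch", "triton", "ck", "hip", "flydsl", "tilelang"}
KNOWN_TARGETS = {"flydsl", "tilelang"}  # targets are restricted to the two DSLs


def resolve_target(kernel_type: str) -> tuple[str | None, str] | None:
    """Return (source, target), (None, target) for a bare target, or None."""
    name = kernel_type.strip().lower()
    if name in KNOWN_TARGETS:
        # Bare target such as "tilelang": the source is detected elsewhere.
        return None, name
    src, sep, tgt = name.partition("2")
    if sep and src in KNOWN_SOURCES and tgt in KNOWN_TARGETS and src != tgt:
        return src, tgt
    return None  # not a translation kernel_type


# Enumerating the valid (source, target) combinations gives the 10 pairs.
ALL_PAIRS = sorted(
    (s, t) for s in KNOWN_SOURCES for t in KNOWN_TARGETS if s != t
)
```

Under these assumptions, resolve_target("triton2flydsl") yields ("triton", "flydsl"), while an unrecognised string yields None so the caller can keep its non-translation path.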
… subagent
In mixed mode GEAK splits workers 50/50: fixed-canonical (in-place, agent_name=general-kernel-optimization) + planned (LLM-diverse). This adds an opt-in env knob GEAK_FIXED_REWRITE_TARGET=flydsl|tilelang that re-routes the canonical FIXED slot through the matching <source>-to-<target> rewrite subagent (triton/hip/ck/pytorch -> flydsl/tilelang) based on the detected kernel_language.
Effect: the fixed half attempts a backend rewrite into the optimized DSL while the planned half does in-place tuning, competing in the same best-of-N round. Default (unset) preserves the existing in-place behavior; no-op rewrites (source==target, unknown lang) fall back to general-kernel-optimization.
task_planner.py: _fixed_rewrite_agent_name() + wire into canonical_fixed.agent_name.
Verified: build_pool stamps triton-to-flydsl on the fixed slot when set; 31 planner/dispatch tests pass.
…-route as pytorch
Two fixes for the fixed-half rewrite path:
1. Rewrite subagents' instance_template referenced {{knowledge_base}}, which is only
provided on the legacy TranslationAgent path — not when dispatched as a fixed-mode
OptimizationAgent worker. Caused jinja StrictUndefined -> exit=UndefinedError (0s).
Removed the KB block from the 9 new subagents (guidance lives in SYSTEM_PROMPT.md,
which IS loaded); pytorch-to-flydsl uses {{ knowledge_base | default("", true) }}
so it works on BOTH paths.
2. task_planner._fixed_rewrite_agent_name now prefers kernel_type (precise
triton/hip/ck classifier) over kernel_language (reports "python" for Triton .py),
so a Triton kernel routes to triton-to-flydsl, not pytorch-to-flydsl.
Verified: type=triton->triton-to-flydsl; 26 planner/dispatch tests pass.
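The second fix (prefer kernel_type, fall back to kernel_language, and keep the in-place agent for no-op rewrites) might look roughly like the sketch below. It is an assumption-laden illustration rather than the body of task_planner._fixed_rewrite_agent_name: the function signature, the PRECISE_TYPES set, and the python -> pytorch fallback mapping are hypothetical.

```python
from __future__ import annotations

FALLBACK_AGENT = "general-kernel-optimization"
TARGETS = {"flydsl", "tilelang"}
# Hypothetical: values a precise kernel_type classifier could report.
PRECISE_TYPES = {"pytorch", "triton", "hip", "ck", "flydsl", "tilelang"}
# Hypothetical: coarse language -> source; "python" alone cannot tell
# Triton from PyTorch, which is why kernel_type is consulted first.
LANGUAGE_TO_SOURCE = {"python": "pytorch"}


def fixed_rewrite_agent_name(
    kernel_type: str | None, kernel_language: str | None, target: str | None
) -> str:
    """Pick the rewrite subagent for the fixed slot, or the in-place agent."""
    source = None
    if kernel_type and kernel_type.lower() in PRECISE_TYPES:
        source = kernel_type.lower()
    elif kernel_language:
        source = LANGUAGE_TO_SOURCE.get(kernel_language.lower())
    # Unset/unknown target, unknown source, or source == target: no rewrite.
    if target not in TARGETS or source is None or source == target:
        return FALLBACK_AGENT
    return f"{source}-to-{target}"
```

With this ordering a Triton .py file whose kernel_language is "python" still routes to triton-to-flydsl, because the precise classifier wins before the coarse language is consulted.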
…ated)
The FlyDSL rewrite subagent prompts referenced non-existent APIs (compile_preshuffle_gemm_a8, build_flash_attn_func_module, bare @flyc.kernel), so GEAK would hallucinate and fail. Replaced with the VERIFIED on-box aiter.ops.flydsl API that PSv4 uses for its real wins (1.77x isolated / +14% E2E on Qwen3.5 fp8 GEMM; +162% Kimi-K2.5 MoE):
- flydsl_hgemm(a,b,tile_m/n/k,split_k,block_*_warps,b_preshuffle,auto_shuffle_b) for dense bf16/fp16
- flydsl_preshuffle_gemm_a8(XQ,WQ,x_scale,w_scale,...) for fp8/a8 block-scale GEMM
- flydsl_moe_stage1/2 for fused MoE
- linear_attention_kernels.flydsl_gdr_decode(...): direct replacement for the GDN fused_recurrent_gated_delta_rule_packed_decode kernel (our k006)
- aiter.tuned_gemm.gemm_a16w16 production dispatch seam
Signatures verified via inspect against /sgl-workspace/aiter. Added "do NOT invent APIs" guard + a self-check command. TileLang prompts already use the genuine on-box T.* API (verified).
…write authoring agents
Add two source-agnostic kernel-authoring subagents that follow AMD's real recipe (verified against PSv4's winning kernels): author an optimized kernel targeting the DSL (split-K, tiling, fused scale/epilogue, per-(M,N,K) config), prefer the real aiter FlyDSL ops / TileLang T.* + @tilelang.autotune where they win on gfx942, and fall back to an optimized authored Triton kernel where the DSL has no edge (e.g. gfx942 block-scaled GEMM has no native MFMA). Parity-gated. This is broader & more correct than the narrow "<src>-to-<tgt> API-swap" pair subagents.
GEAK_FIXED_REWRITE_TARGET=flydsl|tilelang now routes the mixed-mode fixed slot to these dedicated agents by default; set GEAK_REWRITE_USE_PAIR_SUBAGENTS=1 for the legacy pairs.
Validated: both register + render; resolver routes correctly; 49 planner/dispatch tests pass.
…ool candidates
Drop the GEAK_FIXED_REWRITE_TARGET / GEAK_REWRITE_USE_PAIR_SUBAGENTS env flags and the per-target source-mapping. Instead, every mixed-mode round adds both dedicated source-agnostic authoring subagents (flydsl-kernel-rewrite, tilelang-kernel-rewrite) as fixed candidates in the pool, so GEAK naturally tries both rewrites in best-of-N alongside the in-place optimizer and planned strategies, and the dispatcher/selector picks the winner.
No hardcoding, no flags. Canonical fixed slot stays general-kernel-optimization. The rewrite candidates are only added when the subagents are registered. 49 planner/dispatch tests pass.
Instead of always adding both authoring rewrites, pick the DSL with the real edge for the kernel's op-type (matched on kernel_name + function_names, which are already in kernel_meta, so no new signal/infra):
- attention/flash/MLA -> TileLang (FA ~1.5x Triton, MLA ~parity w/ asm)
- MoE/grouped-expert -> FlyDSL (fused-MoE)
- linear-attn/gated-delta/GDN decode -> FlyDSL (aiter flydsl_gdr_decode)
- GEMM / norm / elementwise / unknown -> both (DSL edge small/ambiguous; best-of-N decides)
Avoids wasting a slot on a rewrite that can't win (e.g. TileLang on a blockscale GEMM where gfx942 has no native MFMA). Single ~25-line selector in the existing insertion point: no GEAK-main changes, no env flags. 49 planner/dispatch tests pass.
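A selector of this shape could be sketched as follows. The agent names come from the commit messages above; the keyword patterns, the rule order, and the function name select_rewrites are assumptions made for illustration and are not the planner's actual matching logic.

```python
import re

FLYDSL = "flydsl-kernel-rewrite"
TILELANG = "tilelang-kernel-rewrite"

# Hypothetical keyword rules, checked in order. Linear attention is tested
# before generic attention so "linear_attention" is not sent to TileLang.
_RULES = [
    (re.compile(r"linear_attn|linear_attention|gated_delta|gdn|gdr"), (FLYDSL,)),
    (re.compile(r"moe|expert"), (FLYDSL,)),
    (re.compile(r"attention|attn|flash|mla"), (TILELANG,)),
]


def select_rewrites(kernel_name, function_names):
    """Return the rewrite subagents worth a fixed slot for this kernel."""
    haystack = " ".join([kernel_name, *function_names]).lower()
    for pattern, agents in _RULES:
        if pattern.search(haystack):
            return agents
    # GEMM / norm / elementwise / unknown: keep both, best-of-N decides.
    return (FLYDSL, TILELANG)
```

Substring matching on names is deliberately crude; the point of the sketch is only that the op-class signal is already present in kernel_meta and that the default branch keeps both candidates.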
…tier)
Third authoring subagent for the ceiling tier under the DSLs: MFMA intrinsics ->
inline asm volatile -> raw .s, preferring shipped aiter asm ops (gemm_a8w8_asm,
flatmm_a8w8_blockscale_asm, asm MLA decode, fused_moe_bf16_asm). KB ported from PSv4
asm_mfma (CDNA3 exec model, VGPR/AGPR/LDS occupancy, mfma intrinsics, raw asm, pitfalls).
Adaptive selector now routes ASM to the compute-heavy classes where it beats the DSLs'
ceiling: GEMM -> {flydsl, asm}, MoE -> {flydsl, asm}. Left off norm/elementwise (not worth
hand-asm) and attention (TileLang ~= asm there). 49 planner/dispatch tests pass.
… aiter op before authoring
Root cause of MoE non-win: prompts LISTED flydsl_moe_stage1/2 + fused_moe_bf16_asm as advisory bullets, so the agent free-authored Triton tiling (verified 0.92x regression) instead of dispatching to the shipped fused-MoE ops (the +162% Kimi-K2.5 lever).
Add an explicit MANDATORY STEP 0 to both flydsl/asm subagents: identify op-class -> grep aiter for the matching shipped op -> wire it as candidate #0 -> only author from scratch if none fits/it regresses. Op-class->op table included.
…arity, CK backend hardening
Session work proven on the GLM-4.7 fused-MoE E2E run (Triton->CK ck_moe = +6% live E2E):
- warm-up: baseline_jit_cache_env (repo-keyed shared JIT cache) + warm_up_harness tool so the first cold compile is paid once before the verifier (fixes preprocess thrash/wedge on heavy MoE kernels); slots reuse the warm baseline cache via save_and_test (diff-safe, outside worktrees).
- post-round PROFILE made non-fatal (advisory) in postprocess/evaluation: a slow/timed-out profiler no longer crashes the round and discards a verified benchmark win.
- harness shape-parity: --benchmark and --full-benchmark run the SAME config set (selection signal == final verification), enforced in harness-generator; verifier timeouts equalized + cold-compile timeout reclassified (retry-once, not regenerate).
- subagent self-test iteration cap (GEAK_SUBAGENT_SELFTEST_ITERATIONS) so generator/verifier self-tests are fast; real baseline keeps full count.
- CK authoring backend (subagents/ck-kernel-rewrite) + planner routing (_REWRITE_CK on MoE/GEMM); v3 translate generalized to all targets.
- tests for baseline_jit_cache_env, warm_up_harness, slot cache reuse.
Refresh and extend the kernel-rewrite reference docs for the tilelang, ck, asm, and flydsl subagents against the authoritative upstream repos (ROCm/FlyDSL, ROCm/composable_kernel, tile-ai/tilelang) and the on-box aiter API, so each subagent has accurate, self-contained (offline) material for the rewrite step.
- flydsl: consolidate the language reference under docs/reference/; verify every intrinsic/API against upstream FlyDSL + on-box aiter.ops.flydsl and remove names that don't exist (ds_swizzle, s_wait_loadcnt, global_load_async_lds, arith.* helper fns, get_hip_arch); fix flydsl_hgemm / ck_moe stage signatures.
- tilelang: verify T.* primitives + autotune levers vs tile-ai/tilelang; fix GemmWarpPolicy enum + T.use_swizzle; real FlashAttention-fwd / GEMM skeletons. Register the skill (add SKILL.md).
- asm: fix gfx942 MFMA intrinsic names (require the _1k suffix; the bf16 8-bit-K forms are gfx908), correct the shared VGPR/AGPR register-pool budget, document the shipped aiter asm ops to prefer. Register the skill (add SKILL.md).
- ck: new skill doc set (overview, shipped_aiter_ck_ops, instance_tuning, ck_tile_authoring, pitfalls) sourced from composable_kernel ck_tile examples and the aiter CK codegen tuners. Register the skill (add SKILL.md).
tilelang/asm/ck previously shipped no SKILL.md, so skill_runtime never surfaced their docs; all four now register (verified). Docs are self-contained with no external/network dependency. tests/skills green.
Co-Authored-By: Claude <noreply@anthropic.com>
When the call site provides an importable launcher callable (module:func) plus the exact per-arg traced shapes, the universal harness contract is fully determined, so no LLM authoring is needed. Synthesize the harness directly and run the existing deterministic Path-A sequence (baseline -> profile -> commandment + worktree-bypass gate), short-circuiting the harness-generator. This stops the LLM from burning the whole preprocess budget compiling a CK/.cu kernel from scratch when discovery extracts no callable function.
General, not a hotfix: keyed only on "callable + shapes" (a property of every traced hot kernel), reuses harness.j2's import/_inputs/_ref contract and the _run_prevalidated_path_a precedent, and is strictly additive: it returns None on any miss, so the LLM generator path is unchanged.
Correctness invariants enforced:
- golden != patched op: snapshot the ORIGINAL op output at synthesis time (pre-patch) to golden.pt; _ref replays it (no self-compare tautology).
- aiter worktree routing: emit AITER_META_DIR + per-worktree AITER_JIT_DIR before import aiter so the JIT compiles the PATCHED worktree, not the baseline source (the ~1.00x worktree-bypass bug).
Wiring: run_preprocess_v3 gains reference_entry_point/input_shapes/input_dtypes kwargs (default None; env fallback GEAK_REFERENCE_ENTRY_POINT / GEAK_INPUT_SHAPES_JSON). New _io_dtypes.py is the only new primitive (dtype map incl. fp8 + shape parser + tensor builder). Tier-2 fidelity (reconstructed inputs); trace-time real-input capture is the documented follow-up.
Tests: tests/preprocess_v3/test_reference_harness.py (parser incl. generic-dispatcher rejection, dtype/shape, guards, op-agnostic GPU run on relu+gelu, contract validation). Existing preprocess_v3 suite green.
Co-Authored-By: Claude <noreply@anthropic.com>
…gnature
Collapse to one harness-build route: the LLM orchestrator, now given the complete trace-captured argument signature (tensors + scalar args, in call order) via GEAK_RAW_ARG_SPEC_JSON. The orchestrator reconstructs the exact kernel call, with the correct arity and the scalar values (epsilon, group sizes, flags) that a tensor-shape-only view drops.
Remove the deterministic harness synthesizer and its tensor-only fast path: it could not satisfy ops with required scalar args (failed ck_moe with "incompatible function arguments") and fell through to the LLM path anyway. One route is simpler and, with the full signature in context, sufficient.
- delete reference_harness.py, _io_dtypes.py, test_reference_harness.py
- adapter.py: drop the synthesis bypass + dead kwargs; render the traced arg signature into the orchestrator task
Co-Authored-By: Claude <noreply@anthropic.com>
…hape-faithfulness rule
- baseline.py/orchestrator.py: GEAK_SKIP_PROFILE short-circuits the advisory profiler-mcp roofline pass (it hangs on some kernels, e.g. paged-attention, and starves the harness-init budget). Returns a profile-less ProfileResult so the optimizer proceeds to rounds instead of wedging.
- adapter.py: render the traced raw_arg_spec into the orchestrator task so the harness builder reconstructs the exact kernel call.
- harness-generator SUBAGENT.yaml: faithfulness rule: trust the authoritative traced shapes/args (raw_arg_spec) over a discovered test file or device symbol.
Co-Authored-By: Claude <noreply@anthropic.com>
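Rendering a traced argument signature into task text might look like the sketch below. The JSON schema is not given in this PR, so the "kind" / "shape" / "dtype" / "value" fields and the function name render_arg_signature are purely hypothetical; only the GEAK_RAW_ARG_SPEC_JSON variable name and the "tensors + scalars, in call order" idea come from the commit messages.

```python
from __future__ import annotations

import json
import os


def render_arg_signature(raw_json: str | None) -> str:
    """Render a traced arg spec (hypothetical schema) as prompt text."""
    if not raw_json:
        return ""
    try:
        args = json.loads(raw_json)
    except json.JSONDecodeError:
        return ""  # malformed spec: render nothing rather than guess
    if not isinstance(args, list):
        return ""
    lines = ["Traced call signature (authoritative, in call order):"]
    for i, arg in enumerate(args):
        if arg.get("kind") == "tensor":
            shape = tuple(arg["shape"])
            lines.append(f"  arg{i}: tensor shape={shape} dtype={arg['dtype']}")
        else:
            # Scalars (epsilon, group sizes, flags) keep their exact value.
            lines.append(f"  arg{i}: scalar value={arg['value']!r}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_arg_signature(os.environ.get("GEAK_RAW_ARG_SPEC_JSON")))
```

Keeping scalars in the same ordered list as tensors is what lets a harness builder reproduce the call arity, which a shapes-only view cannot do.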
…ed tools_module import)
Co-Authored-By: Claude <noreply@anthropic.com>
…no logic change)
Co-Authored-By: Claude <noreply@anthropic.com>
Commit 1cdd20e bumped the defaults (min_workers 4->8, workers_per_gpu 3->4) but left this test asserting the old 4/3 values. Update the GPU-count and empty-env cases to the intended new defaults.
Co-Authored-By: Claude <noreply@anthropic.com>
…stically
GEAK's internal round patch-apply (git apply) aborted on every HIP kernel that compiles in-tree: the diff swept in the _geak_build/ build directory (.ninja_deps, .ninja_log, build.ninja) and git apply cannot apply a binary hunk, so the whole patch failed -> verification skipped -> the unverified agent-reported speedup was silently promoted as the round result.
Root cause: _section_is_binary only matched "GIT binary patch" (emitted with git diff --binary). The patches use the default marker "Binary files a/.. and b/.. differ", which slipped through. Detect both markers so any binary artifact section is dropped regardless of path; this is the name-agnostic backstop, so new build/cache dirs need no denylist entry.
Also add _geak_build to the two denylists so it is excluded at diff-capture time, consistent with build/build_harness. Add regression tests covering both binary markers and stripping of unknown build dirs while keeping the real kernel source.
Co-Authored-By: Claude <noreply@anthropic.com>
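The two-marker detection can be sketched as below. This is an illustrative reimplementation under assumptions, not GEAK's _section_is_binary: the helper names and the way the patch is split into per-file sections are hypothetical, while the two marker strings are the ones git emits.

```python
import re

# "GIT binary patch" appears with `git diff --binary`; the default
# `git diff` output uses "Binary files a/.. and b/.. differ" instead.
_BINARY_MARKERS = (
    re.compile(r"^GIT binary patch$", re.MULTILINE),
    re.compile(r"^Binary files .* and .* differ$", re.MULTILINE),
)


def section_is_binary(section: str) -> bool:
    """True if a per-file diff section carries either binary marker."""
    return any(marker.search(section) for marker in _BINARY_MARKERS)


def strip_binary_sections(patch: str) -> str:
    """Drop binary per-file sections; keep text sections byte-for-byte."""
    # Split just before every "diff --git " header so each piece is one
    # file's section (the header stays attached to its own section).
    sections = re.split(r"(?m)^(?=diff --git )", patch)
    return "".join(s for s in sections if not section_is_binary(s))
```

Because text hunk lines always begin with a space, "+", or "-", anchoring the markers at the start of a line avoids dropping a source file that merely mentions the marker text.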
…g error
The Anthropic SDK refuses non-streaming requests whose estimated time (3600 * max_tokens / 128000) exceeds 600s, and amd_claude._query_api does not enable streaming. max_tokens=32000 -> 900s -> raises "Streaming is required for operations that may take longer than 10 minutes", failing every model call. 16000 -> 450s, safe.
Co-Authored-By: Claude <noreply@anthropic.com>
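The arithmetic behind the fix can be checked with a few lines. The sketch takes the estimate formula and the 600s limit exactly as stated in the commit message (it does not call the SDK), and the function name is hypothetical.

```python
LIMIT_SECONDS = 600  # non-streaming ceiling quoted in the commit message


def estimated_nonstreaming_seconds(max_tokens: int) -> float:
    """Estimated request time, using the formula from the commit message."""
    return 3600 * max_tokens / 128_000


for max_tokens in (32_000, 16_000):
    est = estimated_nonstreaming_seconds(max_tokens)
    verdict = "ok" if est <= LIMIT_SECONDS else "rejected"
    print(f"max_tokens={max_tokens}: {est:.0f}s -> {verdict}")
# max_tokens=32000: 900s -> rejected
# max_tokens=16000: 450s -> ok

# Largest max_tokens whose estimate stays within the limit under this formula.
largest_safe = LIMIT_SECONDS * 128_000 // 3600
```

Under the quoted formula the cut-off sits a little above 21,000 tokens, so 16000 leaves headroom without enabling streaming.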
Guard the agent against emitting a bash tool call with `{}` / empty
command, which the gateway can reject as a malformed request. Add the
rule to the strategy-list system prompt and reinforce it in the bash
tool schema description.
Co-Authored-By: Claude <noreply@anthropic.com>
d44e1cb to
9e14d4a
Summary
Brings the chore/subagent-docs-upstream-resource line (built on the backend-rewrite-subagents work, PR #287) into main, plus this session's preprocess fixes:
- adapter.py: renders the full raw_arg_spec into the orchestrator task so the harness builder reconstructs the exact kernel call.
Why
These enabled a faithful, kernel-time-scored GEAK run on Qwen3-30B-A3B ck_moe_stage2 → verified 2.56x kernel-level → +19.2% measured E2E throughput (3-repeat A/B).
Test plan