Preprocess: harness shape-faithfulness (+ backend-rewrite subagents) by iraj465 · Pull Request #290 · AMD-AGI/GEAK

iraj465 · 2026-06-19T19:33:54Z

Summary

Brings the chore/subagent-docs-upstream-resource line (built on the backend-rewrite-subagents work, PR #287) into main, plus this session's preprocess fixes:

- Traced arg-signature into the harness (adapter.py): renders the full raw_arg_spec into the orchestrator task so the harness builder reconstructs the exact kernel call.
- Harness shape-faithfulness rule (harness-generator SUBAGENT.yaml): trust the authoritative traced shapes/args over a discovered test file or device symbol.
- Earlier in the line: asm/ck/flydsl/tilelang rewrite subagents + skill docs, single harness path, CK backend hardening.

Why

These enabled a faithful, kernel-time-scored GEAK run on Qwen3-30B-A3B ck_moe_stage2 → verified 2.56x kernel-level → +19.2% measured E2E throughput (3-repeat A/B).

Test plan

- ruff clean on all changed files (fixed a pre-existing F401/I001 in test_preprocess_v3_bugfixes)
- merged origin/main cleanly (no conflicts)
- verified live: GEAK actor honors GEAK_SCORE_TARGET=kernel
- CI green

…>TileLang)

Extend the translation system from the single pytorch->flydsl pair to 10 pairs, converting the less-optimized source languages (pytorch, triton, ck, hip) toward the two most-optimized DSLs on MI300X/gfx942 (FlyDSL, TileLang), plus tilelang<->flydsl cross-conversion.

- translation_registry.py: source detectors (_detect_triton/_tilelang/_ck/_hip/_flydsl), _tilelang_env_setup, kb_skill_dir on TranslationPair, generalized load_translation_kb, register 9 new pairs. Targets restricted to flydsl/tilelang (always >= ck/triton/hip).
- mini.py: generic kernel_type->target resolver (<src>2<tgt> or bare target) unlocking translation for the new targets (was hardcoded flydsl-only).
- config/mini_kernel_to_tilelang.yaml: source-agnostic ->TileLang translation agent config.
- skills/tilelang*, skills/flydsl/docs/psv4_ref: TileLang + FlyDSL KB (ported from PSv4).
- subagents/{triton,ck,hip,tilelang}-to-flydsl + {pytorch,triton,ck,hip,flydsl}-to-tilelang: 9 rewrite subagents with language_match auto-selection.

Validated: 178 tests pass, registry lists 10 pairs + 15 subagents, detectors fire on real sglang triton + aiter CK files.

… subagent

In mixed mode GEAK splits workers 50/50: fixed-canonical (in-place, agent_name=general-kernel-optimization) + planned (LLM-diverse). This adds an opt-in env knob GEAK_FIXED_REWRITE_TARGET=flydsl|tilelang that re-routes the canonical FIXED slot through the matching <source>-to-<target> rewrite subagent (triton/hip/ck/pytorch -> flydsl/tilelang) based on the detected kernel_language.

Effect: the fixed half attempts a backend rewrite into the optimized DSL while the planned half does in-place tuning, competing in the same best-of-N round. Default (unset) preserves the existing in-place behavior; no-op rewrites (source==target, unknown lang) fall back to general-kernel-optimization.

task_planner.py: _fixed_rewrite_agent_name() + wire into canonical_fixed.agent_name.

Verified: build_pool stamps triton-to-flydsl on the fixed slot when set; 31 planner/dispatch tests pass.

…-route as pytorch

Two fixes for the fixed-half rewrite path:

1. Rewrite subagents' instance_template referenced {{knowledge_base}}, which is only provided on the legacy TranslationAgent path, not when dispatched as a fixed-mode OptimizationAgent worker. Caused jinja StrictUndefined -> exit=UndefinedError (0s). Removed the KB block from the 9 new subagents (guidance lives in SYSTEM_PROMPT.md, which IS loaded); pytorch-to-flydsl uses {{ knowledge_base | default("", true) }} so it works on BOTH paths.
2. task_planner._fixed_rewrite_agent_name now prefers kernel_type (precise triton/hip/ck classifier) over kernel_language (reports "python" for Triton .py), so a Triton kernel routes to triton-to-flydsl, not pytorch-to-flydsl.

Verified: type=triton->triton-to-flydsl; 26 planner/dispatch tests pass.
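The routing preference in fix 2 can be illustrated with a minimal sketch. This is not the code in task_planner.py: the function signature, the set of known sources, and the "python" -> "pytorch" fallback are assumptions made for illustration; only the agent-name pattern and the general-kernel-optimization fallback come from the commit messages.

```python
# Hypothetical sketch of "prefer kernel_type over kernel_language".
# KNOWN_SOURCES and the python->pytorch mapping are assumptions.
KNOWN_SOURCES = {"triton", "hip", "ck", "pytorch"}
KNOWN_TARGETS = {"flydsl", "tilelang"}
DEFAULT_AGENT = "general-kernel-optimization"


def fixed_rewrite_agent_name(target, kernel_type=None, kernel_language=None):
    """Pick the <source>-to-<target> agent, or fall back to in-place tuning."""
    # The precise classifier wins whenever it names a known source.
    source = kernel_type if kernel_type in KNOWN_SOURCES else None
    if source is None:
        # The coarse language field says "python" for any .py file, so it
        # is only consulted when the precise classifier gave nothing usable.
        source = {"python": "pytorch"}.get(kernel_language, kernel_language)
    if target not in KNOWN_TARGETS or source not in KNOWN_SOURCES:
        return DEFAULT_AGENT
    return f"{source}-to-{target}"
```

With both fields present, a Triton kernel stored in a .py file resolves to triton-to-flydsl rather than pytorch-to-flydsl, which is the behavior the commit reports as verified.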

…ated)

The FlyDSL rewrite subagent prompts referenced non-existent APIs (compile_preshuffle_gemm_a8, build_flash_attn_func_module, bare @flyc.kernel), so GEAK would hallucinate and fail. Replaced with the VERIFIED on-box aiter.ops.flydsl API that PSv4 uses for its real wins (1.77x isolated / +14% E2E on Qwen3.5 fp8 GEMM; +162% Kimi-K2.5 MoE):

- flydsl_hgemm(a,b,tile_m/n/k,split_k,block_*_warps,b_preshuffle,auto_shuffle_b) for dense bf16/fp16
- flydsl_preshuffle_gemm_a8(XQ,WQ,x_scale,w_scale,...) for fp8/a8 block-scale GEMM
- flydsl_moe_stage1/2 for fused MoE
- linear_attention_kernels.flydsl_gdr_decode(...): direct replacement for the GDN fused_recurrent_gated_delta_rule_packed_decode kernel (our k006)
- aiter.tuned_gemm.gemm_a16w16 production dispatch seam

Signatures verified via inspect against /sgl-workspace/aiter. Added "do NOT invent APIs" guard + a self-check command. TileLang prompts already use the genuine on-box T.* API (verified).

…write authoring agents

Add two source-agnostic kernel-authoring subagents that follow AMD's real recipe (verified against PSv4's winning kernels): author an optimized kernel targeting the DSL (split-K, tiling, fused scale/epilogue, per-(M,N,K) config), prefer the real aiter FlyDSL ops / TileLang T.* + @tilelang.autotune where they win on gfx942, and fall back to an optimized authored Triton kernel where the DSL has no edge (e.g. gfx942 block-scaled GEMM has no native MFMA). Parity-gated. This is broader & more correct than the narrow "<src>-to-<tgt> API-swap" pair subagents.

GEAK_FIXED_REWRITE_TARGET=flydsl|tilelang now routes the mixed-mode fixed slot to these dedicated agents by default; set GEAK_REWRITE_USE_PAIR_SUBAGENTS=1 for the legacy pairs.

Validated: both register + render; resolver routes correctly; 49 planner/dispatch tests pass.

…ool candidates

Drop the GEAK_FIXED_REWRITE_TARGET / GEAK_REWRITE_USE_PAIR_SUBAGENTS env flags and the per-target source-mapping. Instead, every mixed-mode round adds both dedicated source-agnostic authoring subagents (flydsl-kernel-rewrite, tilelang-kernel-rewrite) as fixed candidates in the pool, so GEAK naturally tries both rewrites in best-of-N alongside the in-place optimizer and planned strategies, and the dispatcher/selector picks the winner. No hardcoding, no flags. Canonical fixed slot stays general-kernel-optimization. Only added when the subagents are registered.

49 planner/dispatch tests pass.

Instead of always adding both authoring rewrites, pick the DSL with the real edge for the kernel's op-type (matched on kernel_name + function_names, already in kernel_meta, no new signal/infra):

- attention/flash/MLA -> TileLang (FA ~1.5x Triton, MLA ~parity w/ asm)
- MoE/grouped-expert -> FlyDSL (fused-MoE)
- linear-attn/gated-delta/GDN decode -> FlyDSL (aiter flydsl_gdr_decode)
- GEMM / norm / elementwise / unknown -> both (DSL edge small/ambiguous; best-of-N decides)

Avoids wasting a slot on a rewrite that can't win (e.g. TileLang on a blockscale GEMM where gfx942 has no native MFMA). Single ~25-line selector in the existing insertion point: no GEAK-main changes, no env flags.

49 planner/dispatch tests pass.
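A selector of this kind might look like the sketch below. The regex patterns, function name, and rule ordering are illustrative assumptions, not the PR's actual ~25-line selector; only the op-class -> DSL mapping and the two subagent names are taken from the commit messages.

```python
import re

# Subagent names as given in the earlier commit message.
FLYDSL = "flydsl-kernel-rewrite"
TILELANG = "tilelang-kernel-rewrite"

# Ordered rules, first match wins. The patterns are guesses for
# illustration. Linear attention is checked before attention because
# "linear_attn" also contains "attn".
_RULES = [
    (re.compile(r"linear_attn|gated_delta|gdn|gdr", re.I), (FLYDSL,)),
    (re.compile(r"attn|attention|flash|mla", re.I), (TILELANG,)),
    (re.compile(r"moe|expert", re.I), (FLYDSL,)),
]


def select_rewrite_agents(kernel_name, function_names=()):
    """Return the rewrite subagents worth a slot for this kernel."""
    haystack = " ".join([kernel_name, *function_names])
    for pattern, agents in _RULES:
        if pattern.search(haystack):
            return agents
    # GEMM / norm / elementwise / unknown: let best-of-N decide.
    return (FLYDSL, TILELANG)
```

Matching on names alone keeps the selector free of new signals, at the cost of misrouting kernels whose names do not reflect their op class; the "both" default limits the damage in that case.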

…tier)

Third authoring subagent for the ceiling tier under the DSLs: MFMA intrinsics -> inline asm volatile -> raw .s, preferring shipped aiter asm ops (gemm_a8w8_asm, flatmm_a8w8_blockscale_asm, asm MLA decode, fused_moe_bf16_asm). KB ported from PSv4 asm_mfma (CDNA3 exec model, VGPR/AGPR/LDS occupancy, mfma intrinsics, raw asm, pitfalls).

Adaptive selector now routes ASM to the compute-heavy classes where it beats the DSLs' ceiling: GEMM -> {flydsl, asm}, MoE -> {flydsl, asm}. Left off norm/elementwise (not worth hand-asm) and attention (TileLang ~= asm there).

49 planner/dispatch tests pass.

… aiter op before authoring

Root cause of MoE non-win: prompts LISTED flydsl_moe_stage1/2 + fused_moe_bf16_asm as advisory bullets, so the agent free-authored Triton tiling (verified 0.92x regression) instead of dispatching to the shipped fused-MoE ops (the +162% Kimi-K2.5 lever).

Add an explicit MANDATORY STEP 0 to both flydsl/asm subagents: identify op-class -> grep aiter for the matching shipped op -> wire it as candidate #0 -> only author from scratch if none fits/it regresses. Op-class->op table included.

…arity, CK backend hardening

Session work proven on the GLM-4.7 fused-MoE E2E run (Triton->CK ck_moe = +6% live E2E):

- warm-up: baseline_jit_cache_env (repo-keyed shared JIT cache) + warm_up_harness tool so the first cold compile is paid once before the verifier (fixes preprocess thrash/wedge on heavy MoE kernels); slots reuse the warm baseline cache via save_and_test (diff-safe, outside worktrees).
- post-round PROFILE made non-fatal (advisory) in postprocess/evaluation: a slow/timed-out profiler no longer crashes the round and discards a verified benchmark win.
- harness shape-parity: --benchmark and --full-benchmark run the SAME config set (selection signal == final verification), enforced in harness-generator; verifier timeouts equalized + cold-compile timeout reclassified (retry-once, not regenerate).
- subagent self-test iteration cap (GEAK_SUBAGENT_SELFTEST_ITERATIONS) so generator/verifier self-tests are fast; real baseline keeps full count.
- CK authoring backend (subagents/ck-kernel-rewrite) + planner routing (_REWRITE_CK on MoE/GEMM); v3 translate generalized to all targets.
- tests for baseline_jit_cache_env, warm_up_harness, slot cache reuse.

Refresh and extend the kernel-rewrite reference docs for the tilelang, ck, asm, and flydsl subagents against the authoritative upstream repos (ROCm/FlyDSL, ROCm/composable_kernel, tile-ai/tilelang) and the on-box aiter API, so each subagent has accurate, self-contained (offline) material for the rewrite step.

- flydsl: consolidate the language reference under docs/reference/; verify every intrinsic/API against upstream FlyDSL + on-box aiter.ops.flydsl and remove names that don't exist (ds_swizzle, s_wait_loadcnt, global_load_async_lds, arith.* helper fns, get_hip_arch); fix flydsl_hgemm / ck_moe stage signatures.
- tilelang: verify T.* primitives + autotune levers vs tile-ai/tilelang; fix GemmWarpPolicy enum + T.use_swizzle; real FlashAttention-fwd / GEMM skeletons. Register the skill (add SKILL.md).
- asm: fix gfx942 MFMA intrinsic names (require the _1k suffix; the bf16 8-bit-K forms are gfx908), correct the shared VGPR/AGPR register-pool budget, document the shipped aiter asm ops to prefer. Register the skill (add SKILL.md).
- ck: new skill doc set (overview, shipped_aiter_ck_ops, instance_tuning, ck_tile_authoring, pitfalls) sourced from composable_kernel ck_tile examples and the aiter CK codegen tuners. Register the skill (add SKILL.md).

tilelang/asm/ck previously shipped no SKILL.md, so skill_runtime never surfaced their docs; all four now register (verified). Docs are self-contained with no external/network dependency. tests/skills green.

Co-Authored-By: Claude <noreply@anthropic.com>

When the call site provides an importable launcher callable (module:func) plus the exact per-arg traced shapes, the universal harness contract is fully determined: no LLM authoring is needed. Synthesize the harness directly and run the existing deterministic Path-A sequence (baseline -> profile -> commandment + worktree-bypass gate), short-circuiting the harness-generator. This stops the LLM from burning the whole preprocess budget compiling a CK/.cu kernel from scratch when discovery extracts no callable function.

General, not a hotfix: keyed only on "callable + shapes" (a property of every traced hot kernel), reuses harness.j2's import/_inputs/_ref contract and the _run_prevalidated_path_a precedent, and is strictly additive: it returns None on any miss so the LLM generator path is unchanged.

Correctness invariants enforced:

- golden != patched op: snapshot the ORIGINAL op output at synthesis time (pre-patch) to golden.pt; _ref replays it (no self-compare tautology).
- aiter worktree routing: emit AITER_META_DIR + per-worktree AITER_JIT_DIR before import aiter so the JIT compiles the PATCHED worktree, not the baseline source (the ~1.00x worktree-bypass bug).

Wiring: run_preprocess_v3 gains reference_entry_point/input_shapes/input_dtypes kwargs (default None; env fallback GEAK_REFERENCE_ENTRY_POINT / GEAK_INPUT_SHAPES_JSON). New _io_dtypes.py is the only new primitive (dtype map incl. fp8 + shape parser + tensor builder). Tier-2 fidelity (reconstructed inputs); trace-time real-input capture is the documented follow-up.

Tests: tests/preprocess_v3/test_reference_harness.py (parser incl. generic-dispatcher rejection, dtype/shape, guards, op-agnostic GPU run on relu+gelu, contract validation). Existing preprocess_v3 suite green.

Co-Authored-By: Claude <noreply@anthropic.com>

…gnature

Collapse to one harness-build route: the LLM orchestrator, now given the complete trace-captured argument signature (tensors + scalar args, in call order) via GEAK_RAW_ARG_SPEC_JSON. The orchestrator reconstructs the exact kernel call, with the correct arity and the scalar values (epsilon, group sizes, flags) that a tensor-shape-only view drops.

Remove the deterministic harness synthesizer and its tensor-only fast path: it could not satisfy ops with required scalar args (failed ck_moe with "incompatible function arguments") and fell through to the LLM path anyway. One route is simpler and, with the full signature in context, sufficient.

- delete reference_harness.py, _io_dtypes.py, test_reference_harness.py
- adapter.py: drop the synthesis bypass + dead kwargs; render the traced arg signature into the orchestrator task

Co-Authored-By: Claude <noreply@anthropic.com>

…hape-faithfulness rule

- baseline.py/orchestrator.py: GEAK_SKIP_PROFILE short-circuits the advisory profiler-mcp roofline pass (it hangs on some kernels, e.g. paged-attention, and starves the harness-init budget). Returns a profile-less ProfileResult so the optimizer proceeds to rounds instead of wedging.
- adapter.py: render the traced raw_arg_spec into the orchestrator task so the harness builder reconstructs the exact kernel call.
- harness-generator SUBAGENT.yaml: faithfulness rule: trust the authoritative traced shapes/args (raw_arg_spec) over a discovered test file or device symbol.

Co-Authored-By: Claude <noreply@anthropic.com>
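The GEAK_SKIP_PROFILE short-circuit can be pictured with the following sketch. The ProfileResult fields, the run_profile signature, and the set of accepted truthy values are assumptions for illustration; the commit only states that a profile-less ProfileResult is returned when the variable is set.

```python
import os
from dataclasses import dataclass, field


@dataclass
class ProfileResult:
    """Hypothetical stand-in; the real class's fields are not shown in the PR."""
    available: bool = False
    metrics: dict = field(default_factory=dict)


def _env_flag(name):
    # Assumed truthy spellings; the PR does not say how the flag is parsed.
    return os.environ.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


def run_profile(kernel, profiler):
    """Run the advisory profiler unless GEAK_SKIP_PROFILE is set."""
    if _env_flag("GEAK_SKIP_PROFILE"):
        # Never invoke the profiler, so a hang cannot eat the budget.
        return ProfileResult()
    return ProfileResult(available=True, metrics=profiler(kernel))
```

Returning an empty result object rather than None means downstream code that only reads the profile needs no extra branch for the skipped case.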

…ed tools_module import) Co-Authored-By: Claude <noreply@anthropic.com>

…no logic change) Co-Authored-By: Claude <noreply@anthropic.com>

Commit 1cdd20e bumped the defaults (min_workers 4->8, workers_per_gpu 3->4) but left this test asserting the old 4/3 values. Update the GPU-count and empty-env cases to the intended new defaults. Co-Authored-By: Claude <noreply@anthropic.com>

…stically

GEAK's internal round patch-apply (git apply) aborted on every HIP kernel that compiles in-tree: the diff swept in the _geak_build/ build directory (.ninja_deps, .ninja_log, build.ninja) and git apply cannot apply a binary hunk, so the whole patch failed -> verification skipped -> the unverified agent-reported speedup was silently promoted as the round result.

Root cause: _section_is_binary only matched "GIT binary patch" (emitted with git diff --binary). The patches use the default marker "Binary files a/.. and b/.. differ", which slipped through. Detect both markers so any binary artifact section is dropped regardless of path. This is the name-agnostic backstop, so new build/cache dirs need no denylist entry.

Also add _geak_build to the two denylists so it is excluded at diff-capture time, consistent with build/build_harness. Add regression tests covering both binary markers and stripping of unknown build dirs while keeping the real kernel source.

Co-Authored-By: Claude <noreply@anthropic.com>
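As a rough illustration of the two-marker check, the sketch below drops any per-file diff section that carries either binary marker. The helper names and the section-splitting approach are assumptions, not the repository's _section_is_binary implementation; the two marker strings are the ones git emits.

```python
import re

# Both ways git marks a binary file in a diff:
#   "GIT binary patch"               (git diff --binary)
#   "Binary files X and Y differ"    (default git diff)
_BINARY_MARKERS = (
    re.compile(r"^GIT binary patch$", re.M),
    re.compile(r"^Binary files .* and .* differ$", re.M),
)


def section_is_binary(section):
    """True if this per-file diff section describes a binary file."""
    return any(marker.search(section) for marker in _BINARY_MARKERS)


def strip_binary_sections(patch):
    """Drop binary sections so `git apply` only sees text hunks."""
    # Split before each "diff --git" header, keeping the header with
    # the section it introduces.
    sections = re.split(r"(?m)^(?=diff --git )", patch)
    return "".join(s for s in sections if not section_is_binary(s))
```

Because the markers are anchored at the start of a line, a text hunk that merely quotes "Binary files ... differ" as added or context content (prefixed with `+`, `-`, or a space) is not misclassified.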

…g error

The Anthropic SDK refuses non-streaming requests whose estimated time (3600 * max_tokens / 128000) exceeds 600s, and amd_claude._query_api does not enable streaming. max_tokens=32000 -> 900s -> raises "Streaming is required for operations that may take longer than 10 minutes", failing every model call. 16000 -> 450s, safe.

Co-Authored-By: Claude <noreply@anthropic.com>
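The arithmetic can be checked with a few lines. The formula and the 600 s limit are taken from the commit message above; the function names are illustrative, and this sketch does not call the SDK.

```python
import math

NONSTREAMING_LIMIT_S = 600  # limit quoted in the commit message


def estimated_seconds(max_tokens):
    """Estimated request time, using the formula from the commit message."""
    return 3600 * max_tokens / 128000


def needs_streaming(max_tokens):
    """True if the estimate exceeds the non-streaming limit."""
    return estimated_seconds(max_tokens) > NONSTREAMING_LIMIT_S


def max_nonstreaming_tokens():
    """Largest max_tokens whose estimate stays within the limit."""
    return math.floor(NONSTREAMING_LIMIT_S * 128000 / 3600)


for tokens in (32000, 16000):
    print(tokens, estimated_seconds(tokens), needs_streaming(tokens))
```

Under this formula the ceiling works out to 21333 tokens, so 16000 leaves some headroom without enabling streaming.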

Guard the agent against emitting a bash tool call with `{}` / empty command, which the gateway can reject as a malformed request. Add the rule to the strategy-list system prompt and reinforce it in the bash tool schema description. Co-Authored-By: Claude <noreply@anthropic.com>
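The commit enforces this rule through prompt and schema wording. A complementary client-side check could look like the sketch below; the function name and the assumption that the tool arguments carry a "command" string are hypothetical and not part of the PR.

```python
def is_valid_bash_call(arguments):
    """Reject `{}` or blank commands before they reach the gateway.

    Assumes the bash tool takes a dict with a "command" string; that
    key name is a guess, not taken from the PR.
    """
    if not isinstance(arguments, dict):
        return False
    command = arguments.get("command")
    return isinstance(command, str) and bool(command.strip())
```

A check like this would turn a malformed-request rejection from the gateway into a local, recoverable validation failure.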

iraj465 requested review from Umangatamd, amd-ethany, chao-xu-spec, sdubagun-amd and yueliu14 as code owners June 19, 2026 19:33

iraj465 changed the title ~~Preprocess: GEAK_SKIP_PROFILE + kernel-time scoring + harness shape-faithfulness (+ backend-rewrite subagents)~~ Preprocess: harness shape-faithfulness (+ backend-rewrite subagents) Jun 19, 2026

sdubagun-amd approved these changes Jun 23, 2026


yueliu14 approved these changes Jun 23, 2026


hyperloom-run and others added 20 commits June 25, 2026 18:15

style(tests): fix ruff F401/I001 in test_preprocess_v3_bugfixes (unused tools_module import)

b91d947

Co-Authored-By: Claude <noreply@anthropic.com>

style: ruff format src/ to satisfy CI format check (pure formatting, no logic change)

32470e7

Co-Authored-By: Claude <noreply@anthropic.com>

sdubagun-amd force-pushed the chore/subagent-docs-upstream-resource branch from d44e1cb to 9e14d4a Compare June 25, 2026 18:15


iraj465 wants to merge 20 commits into main from chore/subagent-docs-upstream-resource

3 participants

Uh oh!

Conversation

iraj465 commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

iraj465 commented Jun 19, 2026 •

edited

Loading