Preprocess: harness shape-faithfulness (+ backend-rewrite subagents) #290

Open
iraj465 wants to merge 20 commits into main from chore/subagent-docs-upstream-resource

iraj465 (Collaborator) commented Jun 19, 2026

Summary

Brings the chore/subagent-docs-upstream-resource line (built on the backend-rewrite-subagents work, PR #287) into main, plus this session's preprocess fixes:

  • Traced arg-signature into the harness (adapter.py): renders the full raw_arg_spec into the orchestrator task so the harness builder reconstructs the exact kernel call.
  • Harness shape-faithfulness rule (harness-generator SUBAGENT.yaml): trust the authoritative traced shapes/args over a discovered test file or device symbol.
  • Earlier in the line: asm/ck/flydsl/tilelang rewrite subagents + skill docs, single harness path, CK backend hardening.

Why

These enabled a faithful, kernel-time-scored GEAK run on Qwen3-30B-A3B ck_moe_stage2 → verified 2.56x kernel-level → +19.2% measured E2E throughput (3-repeat A/B).

Test plan

  • ruff clean on all changed files (fixed a pre-existing F401/I001 in test_preprocess_v3_bugfixes)
  • merged origin/main cleanly (no conflicts)
  • verified live: GEAK actor honors GEAK_SCORE_TARGET=kernel
  • CI green

iraj465 changed the title from "Preprocess: GEAK_SKIP_PROFILE + kernel-time scoring + harness shape-faithfulness (+ backend-rewrite subagents)" to "Preprocess: harness shape-faithfulness (+ backend-rewrite subagents)" on Jun 19, 2026
hyperloom-run and others added 20 commits June 25, 2026 18:15
…>TileLang)

Extend the translation system from the single pytorch->flydsl pair to 10 pairs,
converting the less-optimized source languages (pytorch, triton, ck, hip) toward
the two most-optimized DSLs on MI300X/gfx942 (FlyDSL, TileLang), plus tilelang<->flydsl
cross-conversion.

- translation_registry.py: source detectors (_detect_triton/_tilelang/_ck/_hip/_flydsl),
  _tilelang_env_setup, kb_skill_dir on TranslationPair, generalized load_translation_kb,
  register 9 new pairs. Targets are restricted to flydsl/tilelang (treated as at least as optimized as ck/triton/hip).
- mini.py: generic kernel_type->target resolver (<src>2<tgt> or bare target) unlocking
  translation for the new targets (was hardcoded flydsl-only).
- config/mini_kernel_to_tilelang.yaml: source-agnostic ->TileLang translation agent config.
- skills/tilelang*, skills/flydsl/docs/psv4_ref: TileLang + FlyDSL KB (ported from PSv4).
- subagents/{triton,ck,hip,tilelang}-to-flydsl + {pytorch,triton,ck,hip,flydsl}-to-tilelang:
  9 rewrite subagents with language_match auto-selection.

Validated: 178 tests pass, registry lists 10 pairs + 15 subagents, detectors fire on real
sglang triton + aiter CK files.
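
A minimal sketch of the `<src>2<tgt>` / bare-target resolution described for mini.py. The function name `resolve_target` and the `KNOWN_TARGETS` set are hypothetical; only the two accepted spellings come from the commit text.

```python
from typing import Optional, Tuple

# Hypothetical target set, taken from the commit text (flydsl / tilelang).
KNOWN_TARGETS = {"flydsl", "tilelang"}


def resolve_target(kernel_type: str) -> Optional[Tuple[Optional[str], str]]:
    """Return (source, target) for '<src>2<tgt>', (None, target) for a bare
    target, or None when the string names no known target."""
    kt = kernel_type.strip().lower()
    if kt in KNOWN_TARGETS:
        return None, kt  # bare target, source left to the detectors
    src, sep, tgt = kt.partition("2")
    if sep and src and tgt in KNOWN_TARGETS:
        return src, tgt
    return None
```

Under these assumptions `triton2flydsl` resolves to the triton -> flydsl pair, `tilelang` resolves to a bare target, and a plain `triton` resolves to nothing.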
… subagent

In mixed mode GEAK splits workers 50/50: fixed-canonical (in-place,
agent_name=general-kernel-optimization) + planned (LLM-diverse). This adds an
opt-in env knob GEAK_FIXED_REWRITE_TARGET=flydsl|tilelang that re-routes the
canonical FIXED slot through the matching <source>-to-<target> rewrite subagent
(triton/hip/ck/pytorch -> flydsl/tilelang) based on the detected kernel_language.

Effect: the fixed half attempts a backend rewrite into the optimized DSL while the
planned half does in-place tuning, competing in the same best-of-N round. Default
(unset) preserves the existing in-place behavior; no-op rewrites (source==target,
unknown lang) fall back to general-kernel-optimization.

task_planner.py: _fixed_rewrite_agent_name() + wire into canonical_fixed.agent_name.
Verified: build_pool stamps triton-to-flydsl on the fixed slot when set; 31
planner/dispatch tests pass.
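
The routing described in this commit could look roughly like the sketch below. It is not the actual `_fixed_rewrite_agent_name()` body: the constant names, the `env` parameter, and the exact normalization are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

DEFAULT_AGENT = "general-kernel-optimization"
REWRITE_TARGETS = {"flydsl", "tilelang"}
REWRITE_SOURCES = {"triton", "hip", "ck", "pytorch"}


def fixed_rewrite_agent_name(
    kernel_language: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the agent for the canonical FIXED slot (illustrative only)."""
    env = os.environ if env is None else env
    target = env.get("GEAK_FIXED_REWRITE_TARGET", "").strip().lower()
    source = (kernel_language or "").strip().lower()
    # Knob unset or invalid: keep the existing in-place behavior.
    if target not in REWRITE_TARGETS:
        return DEFAULT_AGENT
    # No-op rewrite or unknown language: fall back to the in-place agent.
    if source == target or source not in REWRITE_SOURCES:
        return DEFAULT_AGENT
    return f"{source}-to-{target}"
```

Passing the environment as a mapping keeps the function easy to unit-test without mutating `os.environ`.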
…-route as pytorch

Two fixes for the fixed-half rewrite path:
1. Rewrite subagents' instance_template referenced {{knowledge_base}}, which is only
   provided on the legacy TranslationAgent path — not when dispatched as a fixed-mode
   OptimizationAgent worker. Caused jinja StrictUndefined -> exit=UndefinedError (0s).
   Removed the KB block from the 9 new subagents (guidance lives in SYSTEM_PROMPT.md,
   which IS loaded); pytorch-to-flydsl uses {{ knowledge_base | default("", true) }}
   so it works on BOTH paths.
2. task_planner._fixed_rewrite_agent_name now prefers kernel_type (precise
   triton/hip/ck classifier) over kernel_language (reports "python" for Triton .py),
   so a Triton kernel routes to triton-to-flydsl, not pytorch-to-flydsl.

Verified: type=triton->triton-to-flydsl; 26 planner/dispatch tests pass.
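
Fix 2 amounts to choosing the more precise classifier first. A hedged sketch of that preference follows; the helper name `rewrite_source` and the `python -> pytorch` fallback table are assumptions inferred from the commit text, not the planner's real code.

```python
from typing import Optional

# kernel_type values the commit calls precise.
PRECISE_TYPES = {"triton", "hip", "ck"}
# Assumed fallback: a generic "python" language is treated as pytorch.
LANGUAGE_FALLBACK = {"python": "pytorch"}


def rewrite_source(kernel_type: Optional[str],
                   kernel_language: Optional[str]) -> Optional[str]:
    """Prefer kernel_type; fall back to kernel_language only when needed."""
    kt = (kernel_type or "").strip().lower()
    if kt in PRECISE_TYPES:
        return kt
    lang = (kernel_language or "").strip().lower()
    return LANGUAGE_FALLBACK.get(lang, lang or None)
```

With this ordering a Triton `.py` kernel (`kernel_type="triton"`, `kernel_language="python"`) resolves to `triton` rather than `pytorch`.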
…ated)

The FlyDSL rewrite subagent prompts referenced non-existent APIs
(compile_preshuffle_gemm_a8, build_flash_attn_func_module, bare @flyc.kernel) — GEAK
would hallucinate and fail. Replaced with the VERIFIED on-box aiter.ops.flydsl API that
PSv4 uses for its real wins (1.77x isolated / +14% E2E on Qwen3.5 fp8 GEMM; +162% Kimi-K2.5 MoE):
- flydsl_hgemm(a,b,tile_m/n/k,split_k,block_*_warps,b_preshuffle,auto_shuffle_b) for dense bf16/fp16
- flydsl_preshuffle_gemm_a8(XQ,WQ,x_scale,w_scale,...) for fp8/a8 block-scale GEMM
- flydsl_moe_stage1/2 for fused MoE
- linear_attention_kernels.flydsl_gdr_decode(...) — direct replacement for the GDN
  fused_recurrent_gated_delta_rule_packed_decode kernel (our k006)
- aiter.tuned_gemm.gemm_a16w16 production dispatch seam
Signatures verified via inspect against /sgl-workspace/aiter. Added "do NOT invent APIs"
guard + a self-check command. TileLang prompts already use the genuine on-box T.* API (verified).
…write authoring agents

Add two source-agnostic kernel-authoring subagents that follow AMD's real recipe (verified
against PSv4's winning kernels): author an optimized kernel targeting the DSL — split-K,
tiling, fused scale/epilogue, per-(M,N,K) config — prefer the real aiter FlyDSL ops /
TileLang T.* + @tilelang.autotune where they win on gfx942, fall back to an optimized
authored Triton kernel where the DSL has no edge (e.g. gfx942 block-scaled GEMM has no
native MFMA). Parity-gated. This is broader & more correct than the narrow
"<src>-to-<tgt> API-swap" pair subagents.

GEAK_FIXED_REWRITE_TARGET=flydsl|tilelang now routes the mixed-mode fixed slot to these
dedicated agents by default; set GEAK_REWRITE_USE_PAIR_SUBAGENTS=1 for the legacy pairs.
Validated: both register + render; resolver routes correctly; 49 planner/dispatch tests pass.
…ool candidates

Drop the GEAK_FIXED_REWRITE_TARGET / GEAK_REWRITE_USE_PAIR_SUBAGENTS env flags and the
per-target source-mapping. Instead, every mixed-mode round adds both dedicated
source-agnostic authoring subagents (flydsl-kernel-rewrite, tilelang-kernel-rewrite) as
fixed candidates in the pool — so GEAK naturally tries both rewrites in best-of-N alongside
the in-place optimizer and planned strategies, and the dispatcher/selector picks the winner.
No hardcoding, no flags. Canonical fixed slot stays general-kernel-optimization.
Only added when the subagents are registered. 49 planner/dispatch tests pass.
Instead of always adding both authoring rewrites, pick the DSL with the real edge for the
kernel's op-type (matched on kernel_name + function_names — already in kernel_meta, no new
signal/infra):
  attention/flash/MLA  -> TileLang (FA ~1.5x Triton, MLA ~parity w/ asm)
  MoE/grouped-expert   -> FlyDSL  (fused-MoE)
  linear-attn/gated-delta/GDN decode -> FlyDSL (aiter flydsl_gdr_decode)
  GEMM / norm / elementwise / unknown -> both (DSL edge small/ambiguous; best-of-N decides)

Avoids wasting a slot on a rewrite that can't win (e.g. TileLang on a blockscale GEMM where
gfx942 has no native MFMA). Single ~25-line selector in the existing insertion point — no
GEAK-main changes, no env flags. 49 planner/dispatch tests pass.
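
The op-type table above can be sketched as a small ordered rule list. The keyword patterns and the function name `select_rewrite_targets` are hypothetical; the real selector may match different substrings. Linear-attention names are checked before generic attention so that `linear_attn` is not captured by the `attn` rule.

```python
import re
from typing import Iterable, List

# Ordered rules: first match wins. Patterns are illustrative guesses.
RULES = [
    (re.compile(r"linear[_-]?attn|gated[_-]?delta|gdn|gdr"), ["flydsl"]),
    (re.compile(r"moe|expert"), ["flydsl"]),
    (re.compile(r"attention|attn|flash|mla"), ["tilelang"]),
]
BOTH = ["flydsl", "tilelang"]


def select_rewrite_targets(kernel_name: str,
                           function_names: Iterable[str] = ()) -> List[str]:
    """Map kernel_name + function_names to the rewrite DSL(s) to try."""
    haystack = " ".join([kernel_name, *function_names]).lower()
    for pattern, targets in RULES:
        if pattern.search(haystack):
            return list(targets)
    # GEMM / norm / elementwise / unknown: let best-of-N decide.
    return list(BOTH)
```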
…tier)

Third authoring subagent for the ceiling tier under the DSLs: MFMA intrinsics ->
inline asm volatile -> raw .s, preferring shipped aiter asm ops (gemm_a8w8_asm,
flatmm_a8w8_blockscale_asm, asm MLA decode, fused_moe_bf16_asm). KB ported from PSv4
asm_mfma (CDNA3 exec model, VGPR/AGPR/LDS occupancy, mfma intrinsics, raw asm, pitfalls).

Adaptive selector now routes ASM to the compute-heavy classes where it beats the DSLs'
ceiling: GEMM -> {flydsl, asm}, MoE -> {flydsl, asm}. Left off norm/elementwise (not worth
hand-asm) and attention (TileLang ~= asm there). 49 planner/dispatch tests pass.
… aiter op before authoring

Root cause of MoE non-win: prompts LISTED flydsl_moe_stage1/2 + fused_moe_bf16_asm as advisory
bullets, so the agent free-authored Triton tiling (verified 0.92x regression) instead of dispatching
to the shipped fused-MoE ops (the +162% Kimi-K2.5 lever). Add an explicit MANDATORY STEP 0 to both
flydsl/asm subagents: identify op-class -> grep aiter for the matching shipped op -> wire it as
candidate #0 -> only author from scratch if none fits/it regresses. Op-class->op table included.
…arity, CK backend hardening

Session work proven on the GLM-4.7 fused-MoE E2E run (Triton->CK ck_moe = +6% live E2E):

- warm-up: baseline_jit_cache_env (repo-keyed shared JIT cache) + warm_up_harness tool so the
  first cold compile is paid once before the verifier (fixes preprocess thrash/wedge on heavy
  MoE kernels); slots reuse the warm baseline cache via save_and_test (diff-safe, outside worktrees).
- post-round PROFILE made non-fatal (advisory) in postprocess/evaluation — a slow/timed-out
  profiler no longer crashes the round and discards a verified benchmark win.
- harness shape-parity: --benchmark and --full-benchmark run the SAME config set (selection signal
  == final verification), enforced in harness-generator; verifier timeouts equalized + cold-compile
  timeout reclassified (retry-once, not regenerate).
- subagent self-test iteration cap (GEAK_SUBAGENT_SELFTEST_ITERATIONS) so generator/verifier
  self-tests are fast; real baseline keeps full count.
- CK authoring backend (subagents/ck-kernel-rewrite) + planner routing (_REWRITE_CK on MoE/GEMM);
  v3 translate generalized to all targets.
- tests for baseline_jit_cache_env, warm_up_harness, slot cache reuse.
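
One way to read "repo-keyed shared JIT cache" is a cache directory derived from the repository path, so every slot working on the same repo reuses the warm baseline. The sketch below assumes that reading; the environment variable name `EXAMPLE_JIT_CACHE_DIR`, the directory layout, and the signature are placeholders, not the real `baseline_jit_cache_env`.

```python
import hashlib
from pathlib import Path
from typing import Dict


def baseline_jit_cache_env(repo_root: str, cache_base: str,
                           var: str = "EXAMPLE_JIT_CACHE_DIR") -> Dict[str, str]:
    """Return env overrides pointing the JIT at a cache keyed on the repo path.

    The variable name is a placeholder; the real knob depends on the JIT.
    """
    resolved = str(Path(repo_root).resolve())
    key = hashlib.sha256(resolved.encode("utf-8")).hexdigest()[:16]
    # The cache lives outside any worktree, so diffs never pick it up.
    return {var: str(Path(cache_base) / f"baseline-{key}")}
```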
Refresh and extend the kernel-rewrite reference docs for the tilelang, ck,
asm, and flydsl subagents against the authoritative upstream repos
(ROCm/FlyDSL, ROCm/composable_kernel, tile-ai/tilelang) and the on-box aiter
API, so each subagent has accurate, self-contained (offline) material for the
rewrite step.

- flydsl: consolidate the language reference under docs/reference/; verify every
  intrinsic/API against upstream FlyDSL + on-box aiter.ops.flydsl and remove
  names that don't exist (ds_swizzle, s_wait_loadcnt, global_load_async_lds,
  arith.* helper fns, get_hip_arch); fix flydsl_hgemm / ck_moe stage signatures.
- tilelang: verify T.* primitives + autotune levers vs tile-ai/tilelang; fix
  GemmWarpPolicy enum + T.use_swizzle; real FlashAttention-fwd / GEMM skeletons.
  Register the skill (add SKILL.md).
- asm: fix gfx942 MFMA intrinsic names (require the _1k suffix; the bf16 8-bit-K
  forms are gfx908), correct the shared VGPR/AGPR register-pool budget, document
  the shipped aiter asm ops to prefer. Register the skill (add SKILL.md).
- ck: new skill doc set (overview, shipped_aiter_ck_ops, instance_tuning,
  ck_tile_authoring, pitfalls) sourced from composable_kernel ck_tile examples
  and the aiter CK codegen tuners. Register the skill (add SKILL.md).

tilelang/asm/ck previously shipped no SKILL.md, so skill_runtime never surfaced
their docs; all four now register (verified). Docs are self-contained with no
external/network dependency. tests/skills green.

Co-Authored-By: Claude <noreply@anthropic.com>
When the call site provides an importable launcher callable (module:func) plus
the exact per-arg traced shapes, the universal harness contract is fully
determined — no LLM authoring is needed. Synthesize the harness directly and run
the existing deterministic Path-A sequence (baseline -> profile -> commandment +
worktree-bypass gate), short-circuiting the harness-generator. This stops the
LLM from burning the whole preprocess budget compiling a CK/.cu kernel from
scratch when discovery extracts no callable function.

General, not a hotfix: keyed only on "callable + shapes" (a property of every
traced hot kernel), reuses harness.j2's import/_inputs/_ref contract and the
_run_prevalidated_path_a precedent, and is strictly additive — returns None on
any miss so the LLM generator path is unchanged.

Correctness invariants enforced:
- golden != patched op: snapshot the ORIGINAL op output at synthesis time
  (pre-patch) to golden.pt; _ref replays it (no self-compare tautology).
- aiter worktree routing: emit AITER_META_DIR + per-worktree AITER_JIT_DIR before
  import aiter so the JIT compiles the PATCHED worktree, not the baseline source
  (the ~1.00x worktree-bypass bug).

Wiring: run_preprocess_v3 gains reference_entry_point/input_shapes/input_dtypes
kwargs (default None; env fallback GEAK_REFERENCE_ENTRY_POINT /
GEAK_INPUT_SHAPES_JSON). New _io_dtypes.py is the only new primitive (dtype map
incl. fp8 + shape parser + tensor builder). Tier-2 fidelity (reconstructed
inputs); trace-time real-input capture is the documented follow-up.

Tests: tests/preprocess_v3/test_reference_harness.py (parser incl. generic-
dispatcher rejection, dtype/shape, guards, op-agnostic GPU run on relu+gelu,
contract validation). Existing preprocess_v3 suite green.
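
The "returns None on any miss" behavior for a `module:func` launcher can be illustrated as follows. This is a sketch of the parsing step only, with a hypothetical helper name `load_launcher`; it does not reproduce the synthesizer's shape handling or golden snapshot.

```python
import importlib
from typing import Callable, Optional


def load_launcher(entry_point: Optional[str]) -> Optional[Callable]:
    """Resolve 'module:func' to a callable, or None so the caller can fall
    back to the LLM generator path unchanged."""
    if not entry_point or entry_point.count(":") != 1:
        return None
    module_name, func_name = entry_point.split(":")
    if not module_name or not func_name:
        return None
    try:
        module = importlib.import_module(module_name)
    except (ImportError, ValueError):
        return None
    func = getattr(module, func_name, None)
    return func if callable(func) else None
```

For example, `load_launcher("math:sqrt")` returns the standard-library `math.sqrt`, while a missing module, a missing attribute, or a non-callable attribute all yield `None`.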

Co-Authored-By: Claude <noreply@anthropic.com>
…gnature

Collapse to one harness-build route: the LLM orchestrator, now given the
complete trace-captured argument signature (tensors + scalar args, in call
order) via GEAK_RAW_ARG_SPEC_JSON. The orchestrator reconstructs the exact
kernel call — correct arity and scalar values (epsilon, group sizes, flags)
that a tensor-shape-only view drops.

Remove the deterministic harness synthesizer and its tensor-only fast path: it
could not satisfy ops with required scalar args (failed ck_moe with
"incompatible function arguments") and fell through to the LLM path anyway.
One route is simpler and, with the full signature in context, sufficient.

- delete reference_harness.py, _io_dtypes.py, test_reference_harness.py
- adapter.py: drop the synthesis bypass + dead kwargs; render the traced arg
  signature into the orchestrator task
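
To show why the full signature matters, here is a sketch that renders a traced argument list into task text. The JSON schema (`name`, `kind`, `shape`, `dtype`, `value`) and the function name are assumptions; the actual layout of GEAK_RAW_ARG_SPEC_JSON is not given in this PR.

```python
import json


def render_arg_signature(raw_arg_spec_json: str) -> str:
    """Render tensors and scalars, in call order, one line per argument."""
    args = json.loads(raw_arg_spec_json)
    lines = []
    for index, arg in enumerate(args):
        if arg.get("kind") == "tensor":
            desc = f"tensor shape={tuple(arg['shape'])} dtype={arg['dtype']}"
        else:
            # Scalars (epsilon, group sizes, flags) keep their exact value.
            desc = f"scalar value={arg['value']!r}"
        lines.append(f"arg{index} {arg.get('name', '?')}: {desc}")
    return "\n".join(lines)
```

A tensor-shape-only view would drop the scalar lines entirely, which is the arity problem the commit describes.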

Co-Authored-By: Claude <noreply@anthropic.com>
…hape-faithfulness rule

- baseline.py/orchestrator.py: GEAK_SKIP_PROFILE short-circuits the advisory
  profiler-mcp roofline pass (it hangs on some kernels e.g. paged-attention and
  starves the harness-init budget). Returns a profile-less ProfileResult so the
  optimizer proceeds to rounds instead of wedging.
- adapter.py: render the traced raw_arg_spec into the orchestrator task so the
  harness builder reconstructs the exact kernel call.
- harness-generator SUBAGENT.yaml: faithfulness rule - trust the authoritative
  traced shapes/args (raw_arg_spec) over a discovered test file or device symbol.
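
The GEAK_SKIP_PROFILE short-circuit might be structured like the sketch below. `ProfileResult` here is a hypothetical stand-in dataclass, and the accepted truthy values are an assumption.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, Mapping, Optional


@dataclass
class ProfileResult:
    """Stand-in for the real result type; fields are illustrative."""
    metrics: dict = field(default_factory=dict)
    skipped: bool = False


def maybe_profile(run_profiler: Callable[[], dict],
                  env: Optional[Mapping[str, str]] = None) -> ProfileResult:
    """Skip the advisory profiling pass when GEAK_SKIP_PROFILE is set."""
    env = os.environ if env is None else env
    flag = env.get("GEAK_SKIP_PROFILE", "").strip().lower()
    if flag in {"1", "true", "yes"}:
        # Profile-less result: the optimizer proceeds straight to rounds.
        return ProfileResult(skipped=True)
    return ProfileResult(metrics=run_profiler())
```

Because the profiler callable is never invoked on the skip path, a profiler that hangs cannot consume the harness-init budget.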

Co-Authored-By: Claude <noreply@anthropic.com>
…ed tools_module import)

Co-Authored-By: Claude <noreply@anthropic.com>
…no logic change)

Co-Authored-By: Claude <noreply@anthropic.com>
Commit 1cdd20e bumped the defaults (min_workers 4->8, workers_per_gpu 3->4)
but left this test asserting the old 4/3 values. Update the GPU-count and
empty-env cases to the intended new defaults.

Co-Authored-By: Claude <noreply@anthropic.com>
…stically

GEAK's internal round patch-apply (git apply) aborted on every HIP kernel
that compiles in-tree: the diff swept in the _geak_build/ build directory
(.ninja_deps, .ninja_log, build.ninja) and git apply cannot apply a binary
hunk, so the whole patch failed -> verification skipped -> the unverified
agent-reported speedup was silently promoted as the round result.

Root cause: _section_is_binary only matched "GIT binary patch" (emitted by
git diff --binary). The patches use the default marker "Binary files a/.. and
b/.. differ", which slipped through. Detect both markers so that any binary
artifact section is dropped regardless of path; this is the name-agnostic
backstop, so new build/cache dirs need no denylist entry. Also add _geak_build
to the two denylists so it is excluded at diff-capture time, consistent with
build/build_harness.
Add regression tests covering both binary markers and stripping of unknown
build dirs while keeping the real kernel source.
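
A sketch of the two-marker detection, assuming the patch is split on `diff --git` headers. The function names mirror the commit's `_section_is_binary` loosely; the splitting strategy and regexes are illustrative rather than the repository's implementation.

```python
import re
from typing import List

# Both markers git can emit for a binary file section.
_BINARY_MARKERS = (
    re.compile(r"^GIT binary patch$", re.MULTILINE),               # git diff --binary
    re.compile(r"^Binary files .* and .* differ$", re.MULTILINE),  # default git diff
)


def split_sections(patch: str) -> List[str]:
    """Split a unified diff into per-file sections at 'diff --git' headers."""
    parts = re.split(r"(?m)^(?=diff --git )", patch)
    return [part for part in parts if part]


def section_is_binary(section: str) -> bool:
    return any(marker.search(section) for marker in _BINARY_MARKERS)


def strip_binary_sections(patch: str) -> str:
    """Drop every binary section, whatever its path, and keep the rest."""
    return "".join(s for s in split_sections(patch) if not section_is_binary(s))
```

Matching on the marker rather than the path is what makes the check name-agnostic: a new build directory needs no extra entry to be filtered.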

Co-Authored-By: Claude <noreply@anthropic.com>
…g error

The Anthropic SDK refuses non-streaming requests whose estimated time
(3600 * max_tokens / 128000) exceeds 600s, and amd_claude._query_api does
not enable streaming. max_tokens=32000 -> 900s -> raises "Streaming is
required for operations that may take longer than 10 minutes", failing every
model call. 16000 -> 450s, safe.
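
The arithmetic above, written out as a small check. The formula and the 600 s limit are the ones quoted in this commit message; the function names are illustrative.

```python
NONSTREAMING_LIMIT_S = 600  # 10 minutes, per the error message


def estimated_seconds(max_tokens: int) -> float:
    """Estimated request time using the formula quoted in the commit."""
    return 3600 * max_tokens / 128_000


def needs_streaming(max_tokens: int) -> bool:
    return estimated_seconds(max_tokens) > NONSTREAMING_LIMIT_S


print(estimated_seconds(32_000), needs_streaming(32_000))  # 900.0 True
print(estimated_seconds(16_000), needs_streaming(16_000))  # 450.0 False
```

By the same formula the break-even point is 600 * 128000 / 3600, roughly 21,333 tokens, so 16000 leaves some headroom.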

Co-Authored-By: Claude <noreply@anthropic.com>
Guard the agent against emitting a bash tool call with `{}` / empty
command, which the gateway can reject as a malformed request. Add the
rule to the strategy-list system prompt and reinforce it in the bash
tool schema description.
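
The commit addresses this through prompt and schema wording. As a complementary illustration only, and not something this commit adds, a client-side check for the same condition could look like the following; the argument shape (`{"command": ...}`) is assumed from the description.

```python
from typing import Any, Mapping


def is_valid_bash_call(arguments: Any) -> bool:
    """Reject `{}` and empty or whitespace-only commands before sending."""
    if not isinstance(arguments, Mapping):
        return False
    command = arguments.get("command")
    return isinstance(command, str) and bool(command.strip())
```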

Co-Authored-By: Claude <noreply@anthropic.com>
sdubagun-amd force-pushed the chore/subagent-docs-upstream-resource branch from d44e1cb to 9e14d4a on June 25, 2026 18:15