fix(flydsl): make PyTorch→FlyDSL e2e reliable (gate, harness resolution, verify/opt wiring) by Umangatamd · Pull Request #288 · AMD-AGI/GEAK

Umangatamd · 2026-06-19T10:35:58Z

✅ Errors fixed — PyTorch→FlyDSL e2e now runs end-to-end (translation → baseline → optimization)

Validated on KernelBench/level3/1_MLP.py (claude-opus-4.8 / amd_llm gateway). Reproduced the original FAIL_PREPROCESS/no-baseline/step_limit failures, traced the full chain, and fixed each root cause.

Root causes found & fixed

1. Baseline correctness-gate false-abort: `collect_baseline` re-validated correctness on the stricter harness-generator harness and aborted on any non-zero exit, discarding kernels that translation had already validated → skip the gate only when `translation.success` is set (env-independent; user-supplied harnesses are still gated).
2. Harness couldn't resolve the kernel (the LLM guessed a wrong rel-path, e.g. a spurious `KernelBench/` segment) → deterministic `kernel_relpath` injection into the harness subagents.
3. Silent no-baseline → loud `kernel not found at X` diagnostic (`detect_kernel_resolution_failure`).
4. Spin to `step_limit` on a broken harness → fail closed once the harness-generator retry budget is exhausted.
5. Work-dir mismatch across verify/baseline/optimize → retarget preprocess (subagent sandbox + baseline `work_dir`) to the per-run `_opt_repo`, git-init it for patch capture, and fix `_copy_repo_sandbox` to copy a repo that lives under `output_dir`.
6. Verifier never confirms a working harness → deterministic `--correctness` backstop (mark `HARNESS_VERIFIED` when the harness passes `--correctness` but the LLM verifier didn't confirm).
7. `mini.py`: use the per-run `_opt_repo` as the optimization root for translation runs even when `--repo` is passed (otherwise preflight and optimization are rooted at the source repo, which has no translated kernel).
8. RAG postprocessor built a provider-less `LitellmModel` (`else: get_model()`), so `litellm.BadRequestError: LLM Provider NOT provided` crashed the optimize loop → route through the gateway model (`load_geak_model`) and make post-processing non-fatal.
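For reviewers, the harness-resolution and loud-diagnostic fixes can be pictured with the minimal sketch below. It is not the code in this PR: `resolve_kernel` and `KernelNotFoundError` are hypothetical names, and the only details taken from the PR are the `os.path.join(WORK_DIR, kernel_relpath)` join and the `kernel not found at X` message.

```python
import os


class KernelNotFoundError(FileNotFoundError):
    """Hypothetical error type: the kernel is missing from the work dir."""


def resolve_kernel(work_dir: str, kernel_relpath: str) -> str:
    """Join the work dir with the injected relative path and fail loudly if absent."""
    # The relative path is computed once by the caller and passed in,
    # so the harness never has to guess directory segments.
    kernel_path = os.path.normpath(os.path.join(work_dir, kernel_relpath))
    if not os.path.isfile(kernel_path):
        # Loud diagnostic instead of silently producing no baseline.
        raise KernelNotFoundError(f"kernel not found at {kernel_path}")
    return kernel_path
```

With a guessed path such as `KernelBench/level3/1_MLP.py` under a work dir that only contains `level3/1_MLP.py`, this sketch raises immediately and names the path it tried, rather than letting a later stage fail without explanation.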

Plus: raise the full-mode preprocess soft cap from 900s to 2400s (`geak.yaml`) so that translation, multi-round harness generation, and baseline collection all fit.

Result

Optimize loop now produces real patches with verified speedups (round-1, over the translated FlyDSL baseline):

| strategy | speedup |
| --- | --- |
| `gemm-tile-tuning-per-layer` | 1.68× (0.609 → 0.363 ms) |
| `fixed-pad-2` | 1.46× |
| `fixed-pad-3` | 1.17× |

Cross-verified speedup 1.14×. Tests: 20 new preprocess_v3 bugfix unit tests + broader suite green.
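As a quick sanity check on the reported numbers, speedup here is read as baseline latency divided by optimized latency. The helper below is a hypothetical illustration of that arithmetic, not part of the PR.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Return baseline latency divided by optimized latency."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / optimized_ms


# Latencies reported for gemm-tile-tuning-per-layer in the table above.
print(f"{speedup(0.609, 0.363):.2f}x")  # prints 1.68x
```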

Note for `main`

Fixes #7 (mini.py) and #8 (RAG postprocessor) are general bugs that exist on main too (affect any amd_llm + RAG / translation run), worth landing independently of the FlyDSL branch.
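The "non-fatal" half of fix #8 follows a common pattern, sketched below under stated assumptions. `postprocess_non_fatal` and its arguments are hypothetical names and not the actual `rag_postprocessor` API; the sketch only shows a failing post-processing step being logged and skipped instead of crashing the caller.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


def postprocess_non_fatal(
    results: List[str],
    postprocess: Callable[[List[str]], List[str]],
) -> List[str]:
    """Apply an optional post-processing step; fall back to the input on failure."""
    try:
        return postprocess(results)
    except Exception:
        # Post-processing is an enhancement, so a failure here (for example a
        # misconfigured model provider) should not abort the optimize loop.
        logger.warning(
            "post-processing failed; using unprocessed results", exc_info=True
        )
        return results
```

The trade-off is that failures become quieter, so the warning should stay visible in run logs.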

Made with Cursor

…ion, verify/opt wiring)

Fixes the chain of failures that stopped translated FlyDSL kernels from completing preprocess + optimization end-to-end:

- baseline: scope the correctness-gate skip to translation runs (translation already validates correctness + perf-regression); keep gate for user harnesses
- baseline: detect_kernel_resolution_failure -> loud "kernel not found at X"
- tools: deterministic kernel_relpath injection into harness subagents so the harness resolves os.path.join(WORK_DIR, kernel_relpath) instead of guessing
- tools: fail-closed when harness-generator retry budget is exhausted (no spin to step_limit on a broken harness)
- tools: retarget preprocess (sandbox + baseline work_dir) to the per-run _opt_repo after translation + git-init it; fix _copy_repo_sandbox to copy a repo that lives under output_dir
- tools: deterministic harness-verifier backstop (mark HARNESS_VERIFIED when the harness passes --correctness but the LLM verifier did not confirm)
- mini: use the per-run _opt_repo as the optimization root for translation runs even when --repo is passed (else preflight/opt root at the source repo)
- rag_postprocessor: route through the gateway model (load_geak_model) instead of a provider-less LitellmModel, and make post-processing non-fatal
- geak.yaml: raise full-mode preprocess soft cap 900s -> 2400s

Adds 20 preprocess_v3 bugfix unit tests.

Note: the mini.py and rag_postprocessor fixes are general bugs present on main too.

Co-authored-by: Cursor <cursoragent@cursor.com>
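The harness-verifier backstop and the fail-closed retry budget described in the commit message can be sketched together as follows. This is an illustration under assumptions, not the PR's implementation: the function names, the `max_attempts` default, and the callback shapes are hypothetical. Only the `--correctness` flag and the `HARNESS_VERIFIED` status come from the PR text.

```python
import subprocess
import sys
from typing import Callable, List, Sequence


def harness_passes_correctness(
    harness_cmd: Sequence[str], timeout_s: float = 600.0
) -> bool:
    """Run the harness with --correctness; exit code 0 counts as a pass."""
    try:
        result = subprocess.run(
            [*harness_cmd, "--correctness"],
            capture_output=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def verify_with_backstop(
    generate_harness: Callable[[int], List[str]],
    llm_confirms: Callable[[List[str]], bool],
    max_attempts: int = 3,  # assumed retry budget, not the project's value
) -> str:
    """Accept a harness if the LLM verifier or the deterministic check passes."""
    for attempt in range(max_attempts):
        harness_cmd = generate_harness(attempt)
        # Deterministic backstop: a harness that passes --correctness is
        # accepted even when the LLM verifier did not confirm it.
        if llm_confirms(harness_cmd) or harness_passes_correctness(harness_cmd):
            return "HARNESS_VERIFIED"
    # Fail closed once the budget is spent instead of looping to a step limit.
    raise RuntimeError(f"harness not verified after {max_attempts} attempts")


if __name__ == "__main__":
    passing = [sys.executable, "-c", "import sys; sys.exit(0)"]
    print(verify_with_backstop(lambda _: passing, lambda _: False))
```

Raising once the budget is exhausted gives the caller a single, explicit failure to report rather than a run that ends at an unrelated limit.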

Umangatamd requested review from amd-ethany, chao-xu-spec, iraj465, sdubagun-amd and yueliu14 as code owners June 19, 2026 10:35

Umangatamd wants to merge 1 commit into feature/flydsl_translation_optimization from fix/scoped-correctness-gate
