fix(flydsl): make PyTorch→FlyDSL e2e reliable (gate, harness resolution, verify/opt wiring)#288
Open
Umangatamd wants to merge 1 commit into
…ion, verify/opt wiring)

Fixes the chain of failures that stopped translated FlyDSL kernels from completing preprocess + optimization end-to-end:

- baseline: scope the correctness-gate skip to translation runs (translation already validates correctness + perf-regression); keep gate for user harnesses
- baseline: detect_kernel_resolution_failure -> loud "kernel not found at X"
- tools: deterministic kernel_relpath injection into harness subagents so the harness resolves os.path.join(WORK_DIR, kernel_relpath) instead of guessing
- tools: fail-closed when harness-generator retry budget is exhausted (no spin to step_limit on a broken harness)
- tools: retarget preprocess (sandbox + baseline work_dir) to the per-run _opt_repo after translation + git-init it; fix _copy_repo_sandbox to copy a repo that lives under output_dir
- tools: deterministic harness-verifier backstop (mark HARNESS_VERIFIED when the harness passes --correctness but the LLM verifier did not confirm)
- mini: use the per-run _opt_repo as the optimization root for translation runs even when --repo is passed (else preflight/opt root at the source repo)
- rag_postprocessor: route through the gateway model (load_geak_model) instead of a provider-less LitellmModel, and make post-processing non-fatal
- geak.yaml: raise full-mode preprocess soft cap 900s -> 2400s

Adds 20 preprocess_v3 bugfix unit tests.

Note: the mini.py and rag_postprocessor fixes are general bugs present on main too.

Co-authored-by: Cursor <cursoragent@cursor.com>
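The path-resolution fix described in the commit message can be illustrated with a minimal sketch. It is not the PR's code: `resolve_kernel_path` and `KernelNotFoundError` are hypothetical names, and the only details taken from the commit message are the `os.path.join(WORK_DIR, kernel_relpath)` join and the "kernel not found at X" wording.

```python
import os


class KernelNotFoundError(FileNotFoundError):
    """Hypothetical error type: the injected kernel path does not exist."""


def resolve_kernel_path(work_dir: str, kernel_relpath: str) -> str:
    """Join work_dir and kernel_relpath, failing loudly if no file is there.

    The harness is handed kernel_relpath instead of guessing a path, so a
    wrong value surfaces as one clear error rather than a later failure.
    """
    kernel_path = os.path.join(work_dir, kernel_relpath)
    if not os.path.isfile(kernel_path):
        # Message wording follows the commit message; the rest is assumed.
        raise KernelNotFoundError(f"kernel not found at {kernel_path}")
    return kernel_path
```

Raising immediately keeps a bad path from being mistaken for a correctness failure in the kernel itself.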
✅ Errors fixed — PyTorch→FlyDSL e2e now runs end-to-end (translation → baseline → optimization)
Validated on `KernelBench/level3/1_MLP.py` (claude-opus-4.8 / `amd_llm` gateway). Reproduced the original `FAIL_PREPROCESS` / no-baseline / `step_limit` failures, traced the full chain, and fixed each root cause.

Root causes found & fixed

1. `collect_baseline` re-validated correctness on the stricter harness-generator harness and aborted on any non-zero exit, discarding kernels translation already validated → scoped skip when `translation.success` (env-independent; user-supplied harnesses still gate).
2. … `KernelBench/` segment) → deterministic `kernel_relpath` injection into the harness subagents.
3. … `kernel not found at X` diagnostic (`detect_kernel_resolution_failure`).
4. … `step_limit` on a broken harness → fail-closed once the harness-generator retry budget is exhausted.
5. … `work_dir`) to the per-run `_opt_repo`, git-init it for patch capture, and fix `_copy_repo_sandbox` to copy a repo that lives under `output_dir`.
6. … `--correctness` backstop (mark `HARNESS_VERIFIED` when the harness passes `--correctness` but the LLM verifier didn't confirm).
7. `mini.py`: use the per-run `_opt_repo` as the optimization root for translation runs even when `--repo` is passed (otherwise preflight/opt root at the source repo, which has no translated kernel).
8. … `LitellmModel` (`else: get_model()`) → `litellm.BadRequestError: LLM Provider NOT provided` crashing the optimize loop → route through the gateway model (`load_geak_model`) + make post-processing non-fatal.

Plus: raise full-mode preprocess soft cap `900s → 2400s` (`geak.yaml`) so translation + multi-round harness-gen + baseline fit.

Result

Optimize loop now produces real patches with verified speedups (round-1, over the translated FlyDSL baseline):

- `gemm-tile-tuning-per-layer`
- `fixed-pad-2`
- `fixed-pad-3`

Cross-verified speedup 1.14×. Tests: 20 new `preprocess_v3` bugfix unit tests + broader suite green.

Note for `main`

Fixes #7 (`mini.py`) and #8 (RAG postprocessor) are general bugs that exist on `main` too (affect any `amd_llm` + RAG / translation run), worth landing independently of the FlyDSL branch.

Made with Cursor
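The "non-fatal post-processing" part of the RAG postprocessor fix can be sketched roughly as below. This is an assumption about the shape of the change, not the PR's code: `postprocess_non_fatal` and its callable parameter are hypothetical names, and the real postprocessor interface is not shown in this description.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def postprocess_non_fatal(postprocess: Callable[[str], str], text: str) -> str:
    """Run postprocess(text); on any exception, log it and return text as-is.

    Hypothetical wrapper: a failing post-processing step (for example a
    model call that raises) degrades to the raw input instead of
    propagating and crashing the caller's loop.
    """
    try:
        return postprocess(text)
    except Exception:
        logger.warning("post-processing failed; using raw input", exc_info=True)
        return text
```

Catching broadly is deliberate here because post-processing is treated as optional; the logged traceback keeps the failure visible.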