fix(preprocess_v3): synthesize single-metric harness for flag-less/co…#264
Open
sdubagun-amd wants to merge 4 commits into
Open
fix(preprocess_v3): synthesize single-metric harness for flag-less/co…#264sdubagun-amd wants to merge 4 commits into
sdubagun-amd wants to merge 4 commits into
Conversation
…mposite Path-A tasks (#258) A flag-less Path-A test command (no GEAK mode flag) was copied byte-identically into all four COMMANDMENT modes, silently. The correctness preflight then ran the harness's full default sweep until it timed out, producing 0% kernel gain. Two cooperating defects: the orchestrator prompt trained the LLM to mark all four modes covered, and _substitute_mode_flag silently no-ops on a flag-less command, so the renderer loop emitted no warning. Fix: - tools.py: add a deterministic flag-less backstop in commandment_from_user_command. When the command exposes no mode flag, is not a compound && shell contract, and yields no harness, refuse to render the all-modes-identical COMMANDMENT and return ok:False / PATH_A_FLAG_MISSING. This guard is independent of the modes_covered the LLM passes. - orchestrator.py: split Case A into A1 (flag-aware -> today's happy path) and A2 (flag-less/composite -> route into harness synthesis). A2 routes on whether the prompt carries shapes: with shapes, skip discovery and dispatch harness-generator with prompt shapes only; without shapes, fall through to discovery as Case C. Wire the PATH_A_FLAG_MISSING -> switch-to-A2 recovery and remove the all-four modes_covered example from the generic Case A default. - tests: cover the deterministic backstop (no silent duplicates even with all four modes covered) and the A1 happy path (detector does not mis-fire). Validated against the real k006 prompt on an AMD gfx942 GPU: the preprocessor now emits four distinct modes backed by a synthesized single-shape (983040x128 bf16), single-metric harness; correctness passes in ~66s instead of timing out at 600s. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…earing tasks to harness generator (#258 follow-up) Two broken Path-A cases remained after the flag-less backstop (PR #264): 1. Amalgamation && (a non-build command chaining two tests, e.g. the same script run twice with different settings) was split blindly left=correctness / right=performance by _try_synthesize_shell_contract_harness, silently dropping one latency number. Crucially, a *flag-bearing* amalgamation also escaped the flag-less backstop: _extract_harness_from_command returns a path for it, so original_harness_path was set and the backstop's `original_harness_path is None` guard never fired. 2. A command that carries a mode flag (or a build-bearing &&) but whose prompt lists shapes the command does not pin was classified A1 and run as-is, ignoring the shapes and triggering the harness's full default sweep -> timeout. Fix: - tools.py: add _is_amalgamation_command (gated on infer_compile_command_from_eval so the keep/refuse decision stays consistent with the compile prefix the split path re-prepends; ImportError -> treat as amalgamation, the safe route). Refuse amalgamations in commandment_from_user_command BEFORE harness extraction so the guard is flag-independent and catches the flag-bearing case. Skip synthesis for non-build && in _try_synthesize_shell_contract_harness, with comments on the hard-coded correctness/performance ordering and the delegation of all non-build && to the generator. - orchestrator.py: add a shapes pre-check ahead of the A1/A2 split so a flag-aware command whose prompt carries unpinned shapes routes to A2-with-shapes (the deterministic tool is shape-blind and cannot refuse on shape grounds). Bias to A2 when uncertain (perf cost vs. silent timeout). Enumerate the single-flag and build-bearing-&& shape cases under A2-with-shapes; never inject shapes into the command (generator owns shape handling). - tests: cover flag-less and flag-bearing amalgamation refusal, and a regression guard that build-bearing && still synthesizes. Problems #2/#3 are LLM-prose routing and not unit-testable; they are covered by manual verification. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…follow-up)
The previous amalgamation fix lived only in the preprocessor
(commandment_from_user_command), but the CLI intercepts a compound --test-command
first and did its own blind rsplit("&&", 1) into correctness/performance hints
before the preprocessor ran. So an amalgamation passed via --test-command
(the primary entry path) was split one layer up -- left=correctness,
right=performance -- silently dropping one metric, and never reached the
deterministic guard as a single string. Caught by running the real GEAK binary:
the CLI logged "Correctness command: ...--mode 1, Performance command: ...--mode 2".
Fix:
- contract_normalize.py: promote the amalgamation predicate to a shared, public
is_amalgamation_command (stdlib-only module; no circular import) so the CLI and
the preprocessor classify a compound command identically.
- tools.py: _is_amalgamation_command becomes a thin wrapper that imports the shared
predicate, preserving the safe ImportError fallback (treat as amalgamation -> A2).
- mini.py: only pre-split a compound && into correctness/performance hints when it
is build-bearing (is_amalgamation_command is False). A non-build && now passes
through whole as eval_command, so the preprocessor's guard fires
(PATH_A_FLAG_MISSING) and routes it to the harness generator (Case A2).
- tests: unit tests for is_amalgamation_command (no-&&, non-build &&, flag-bearing
&&, build-bearing && / canonical compile-chain).
Verified end-to-end with the real binary on the k006 prompt: the amalgamation
--test-command no longer triggers the CLI's "Correctness/Performance command"
split; it is passed through whole.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…finds the kernel source
End-to-end testing of the A2 (harness-generator) path showed baseline collection
failing silently: every sample logged "produced no latency (rc=0)", no
benchmark_baseline.txt was written, and preprocess aborted at the 900s soft cap
during harness-init. The synthesized harness was correct (runs fine by hand) and
routing was correct; the bug was the baseline subprocess environment.
Root cause: the v3 baseline env builder never exported GEAK_WORK_DIR, and the
collect_baseline tool defaulted work_dir to None:
- baseline._build_env set only PYTHONPATH/HIP_VISIBLE_DEVICES/PYTHONUNBUFFERED,
dropping the GEAK_WORK_DIR + GEAK_REPO_ROOT exports the legacy
run_harness._build_env sets.
- collect_baseline's work_dir is an LLM-supplied arg with no default; the subagent
often omits it, so baseline ran with work_dir=None (no PYTHONPATH, no
GEAK_WORK_DIR at all).
A contract-compliant harness derives WORK_DIR = os.environ.get("GEAK_WORK_DIR")
or its own dir; with the var unset it fell back to its own directory (which has no
kernel source), ran nothing, and emitted no GEAK_RESULT_LATENCY_MS marker.
Fix:
- baseline.py: _build_env now exports GEAK_WORK_DIR=work_dir and
setdefault GEAK_REPO_ROOT=work_dir when work_dir is given (mirrors the legacy
contract; setdefault preserves an adapter-exported source root).
- tools.py: collect_baseline falls back to agent.config.repo (the source repo)
when the subagent omits work_dir. At preprocess time no per-slot worktree exists
yet, so the source repo is the correct target; optimization-time runs still pass
their per-slot worktree explicitly via the legacy run_harness path.
- tools.py: add a comment documenting that compile_command is intentionally unset
for HIP (the harness self-builds, mtime-keyed) so the absent COMMANDMENT compile
line isn't mistaken for a bug.
- tests: _build_env exports the keys (and stays a no-op when work_dir is None);
collect_baseline defaults work_dir to the configured source repo.
Verified via the real synthesized harness run through the fixed _build_env (cold
JIT cache): both facets build and run, and extract_latency_ms returns a value
where it previously returned None.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
sdubagun-amd
added a commit
that referenced
this pull request
Jun 8, 2026
…earing tasks to harness generator (#258 follow-up) Two broken Path-A cases remained after the flag-less backstop (PR #264): 1. Amalgamation && (a non-build command chaining two tests, e.g. the same script run twice with different settings) was split blindly left=correctness / right=performance by _try_synthesize_shell_contract_harness, silently dropping one latency number. Crucially, a *flag-bearing* amalgamation also escaped the flag-less backstop: _extract_harness_from_command returns a path for it, so original_harness_path was set and the backstop's `original_harness_path is None` guard never fired. 2. A command that carries a mode flag (or a build-bearing &&) but whose prompt lists shapes the command does not pin was classified A1 and run as-is, ignoring the shapes and triggering the harness's full default sweep -> timeout. Fix: - tools.py: add _is_amalgamation_command (gated on infer_compile_command_from_eval so the keep/refuse decision stays consistent with the compile prefix the split path re-prepends; ImportError -> treat as amalgamation, the safe route). Refuse amalgamations in commandment_from_user_command BEFORE harness extraction so the guard is flag-independent and catches the flag-bearing case. Skip synthesis for non-build && in _try_synthesize_shell_contract_harness, with comments on the hard-coded correctness/performance ordering and the delegation of all non-build && to the generator. - orchestrator.py: add a shapes pre-check ahead of the A1/A2 split so a flag-aware command whose prompt carries unpinned shapes routes to A2-with-shapes (the deterministic tool is shape-blind and cannot refuse on shape grounds). Bias to A2 when uncertain (perf cost vs. silent timeout). Enumerate the single-flag and build-bearing-&& shape cases under A2-with-shapes; never inject shapes into the command (generator owns shape handling). - tests: cover flag-less and flag-bearing amalgamation refusal, and a regression guard that build-bearing && still synthesizes. Problems #2/#3 are LLM-prose routing and not unit-testable; they are covered by manual verification. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
yueliu14
approved these changes
Jun 9, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
…mposite Path-A tasks (#258)
A flag-less Path-A test command (no GEAK mode flag) was copied byte-identically into all four COMMANDMENT modes, silently. The correctness preflight then ran the harness's full default sweep until it timed out, producing 0% kernel gain.
Two cooperating defects: the orchestrator prompt trained the LLM to mark all four modes covered, and _substitute_mode_flag silently no-ops on a flag-less command, so the renderer loop emitted no warning.
Fix:
Validated against the real k006 prompt on an AMD gfx942 GPU: the preprocessor now emits four distinct modes backed by a synthesized single-shape (983040x128 bf16), single-metric harness; correctness passes in ~66s instead of timing out at 600s.
Follow-up commits (scope expansion)
Beyond the original flag-less backstop, end-to-end testing surfaced adjacent cases now also fixed on this branch:
&&— a non-buildcmd_a && cmd_b(e.g.mode 1 && mode 2) was blindly split into correctness/performance, dropping one metric. Addedis_amalgamation_command(gated oninfer_compile_command_from_eval), checked at both the CLI (mini.py, which previously did its ownrsplit("&&")before the preprocessor) and the preprocessor (flag-independent, before harness extraction). Both refuse withPATH_A_FLAG_MISSING→ harness generator.&&command whose prompt carries unpinned shapes was run as-is (default sweep → timeout). Added a shapes pre-check ahead of the A1/A2 split so these route to A2-with-shapes.GEAK_WORK_DIR— the v3 baseline subprocess never exportedGEAK_WORK_DIR/GEAK_REPO_ROOT(the legacyrun_harnessdid), so a contract-compliant harness fell back to its own dir, found no kernel source, and emitted no latency (silent "produced no latency" → soft-cap abort). Fixed in_build_env, plus awork_dirfallback to the source repo incollect_baseline.Tests added for the amalgamation predicate, the baseline env exports, and the
work_dirfallback.