fix(preprocess_v3): synthesize single-metric harness for flag-less/co… by sdubagun-amd · Pull Request #264 · AMD-AGI/GEAK

sdubagun-amd · 2026-06-04T16:18:42Z

…mposite Path-A tasks (#258)

A flag-less Path-A test command (one with no GEAK mode flag) was silently copied byte-identically into all four COMMANDMENT modes. The correctness preflight then ran the harness's full default sweep until it timed out, producing 0% kernel gain.

Two cooperating defects: the orchestrator prompt trained the LLM to mark all four modes covered, and _substitute_mode_flag silently no-ops on a flag-less command, so the renderer loop emitted no warning.
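The second defect can be reproduced in a few lines. This is only a sketch of the failure mode, not the code in tools.py: the mode names, the flag spelling, and the helper names `substitute_mode_flag` and `render_modes` are all hypothetical stand-ins.

```python
import re

# Hypothetical mode names and flag spelling; the real GEAK flags may differ.
MODES = ("correctness", "profile", "benchmark", "full-benchmark")
_MODE_FLAG = re.compile(r"--(?:correctness|profile|benchmark|full-benchmark)\b")


def substitute_mode_flag(command: str, mode: str) -> str:
    """Swap whichever mode flag is present for --<mode>.

    When the command carries no mode flag, re.sub finds nothing to replace
    and returns the command unchanged, without any warning.
    """
    return _MODE_FLAG.sub(f"--{mode}", command)


def render_modes(command: str) -> dict:
    """Render one command per mode, as a renderer loop would."""
    return {mode: substitute_mode_flag(command, mode) for mode in MODES}


flag_aware = render_modes("python test_kernel.py --correctness")
flag_less = render_modes("python test_kernel.py")

# Four distinct commands for the flag-aware input, one repeated command
# for the flag-less input.
distinct_flag_aware = len(set(flag_aware.values()))
distinct_flag_less = len(set(flag_less.values()))
```

Because the substitution is a plain no-op, nothing in the loop can tell the two inputs apart unless a separate check compares the rendered commands or inspects the input for a flag.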

Fix:

- tools.py: add a deterministic flag-less backstop in commandment_from_user_command. When the command exposes no mode flag, is not a compound && shell contract, and yields no harness, refuse to render the all-modes-identical COMMANDMENT and return ok:False / PATH_A_FLAG_MISSING. This guard is independent of the modes_covered the LLM passes.
- orchestrator.py: split Case A into A1 (flag-aware -> today's happy path) and A2 (flag-less/composite -> route into harness synthesis). A2 routes on whether the prompt carries shapes: with shapes, skip discovery and dispatch harness-generator with prompt shapes only; without shapes, fall through to discovery as Case C. Wire the PATH_A_FLAG_MISSING -> switch-to-A2 recovery and remove the all-four modes_covered example from the generic Case A default.
- tests: cover the deterministic backstop (no silent duplicates even with all four modes covered) and the A1 happy path (detector does not mis-fire).
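The three conditions of the backstop can be sketched as a single guard. This is a minimal illustration under stated assumptions, not the actual body of commandment_from_user_command: the function name `flagless_backstop`, the `error` key, and the flag regex are hypothetical; only ok:False and PATH_A_FLAG_MISSING come from the description above.

```python
import re
from typing import Optional

# Hypothetical flag spelling; the real GEAK mode flags may differ.
_MODE_FLAG = re.compile(r"--(?:correctness|profile|benchmark|full-benchmark)\b")


def flagless_backstop(
    command: str,
    harness_path: Optional[str],
    modes_covered: Optional[list] = None,
) -> dict:
    """Refuse to render a COMMANDMENT whose four modes would be identical.

    modes_covered is accepted but deliberately ignored: the guard must not
    depend on what the LLM claims is covered.
    """
    has_flag = _MODE_FLAG.search(command) is not None
    is_compound = "&&" in command
    if not has_flag and not is_compound and harness_path is None:
        return {"ok": False, "error": "PATH_A_FLAG_MISSING"}
    return {"ok": True}


refused = flagless_backstop(
    "python test_kernel.py",
    harness_path=None,
    modes_covered=["correctness", "profile", "benchmark", "full-benchmark"],
)
accepted = flagless_backstop("python test_kernel.py --correctness", harness_path=None)
```

Keeping the guard a pure function of the command and the extracted harness path is what makes it deterministic: the same input is refused no matter how the orchestrator prompt is worded.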

Validated against the real k006 prompt on an AMD gfx942 GPU: the preprocessor now emits four distinct modes backed by a synthesized single-shape (983040x128 bf16), single-metric harness; correctness passes in ~66s instead of timing out at 600s.

Follow-up commits (scope expansion)

Beyond the original flag-less backstop, end-to-end testing surfaced adjacent cases now also fixed on this branch:

- Amalgamation &&: a non-build cmd_a && cmd_b (e.g. mode 1 && mode 2) was blindly split into correctness/performance, dropping one metric. Added is_amalgamation_command (gated on infer_compile_command_from_eval), checked at both the CLI (mini.py, which previously did its own rsplit("&&") before the preprocessor) and the preprocessor (flag-independent, before harness extraction). Both refuse with PATH_A_FLAG_MISSING → harness generator.
- Shape-bearing commands: a flag-aware or build-&& command whose prompt carries unpinned shapes was run as-is (default sweep → timeout). Added a shapes pre-check ahead of the A1/A2 split so these route to A2-with-shapes.
- Baseline GEAK_WORK_DIR: the v3 baseline subprocess never exported GEAK_WORK_DIR/GEAK_REPO_ROOT (the legacy run_harness did), so a contract-compliant harness fell back to its own dir, found no kernel source, and emitted no latency (silent "produced no latency" → soft-cap abort). Fixed in _build_env, plus a work_dir fallback to the source repo in collect_baseline.

Tests added for the amalgamation predicate, the baseline env exports, and the work_dir fallback.

…mposite Path-A tasks (#258)

A flag-less Path-A test command (no GEAK mode flag) was copied byte-identically into all four COMMANDMENT modes, silently. The correctness preflight then ran the harness's full default sweep until it timed out, producing 0% kernel gain.

Two cooperating defects: the orchestrator prompt trained the LLM to mark all four modes covered, and _substitute_mode_flag silently no-ops on a flag-less command, so the renderer loop emitted no warning.

Fix:

- tools.py: add a deterministic flag-less backstop in commandment_from_user_command. When the command exposes no mode flag, is not a compound && shell contract, and yields no harness, refuse to render the all-modes-identical COMMANDMENT and return ok:False / PATH_A_FLAG_MISSING. This guard is independent of the modes_covered the LLM passes.
- orchestrator.py: split Case A into A1 (flag-aware -> today's happy path) and A2 (flag-less/composite -> route into harness synthesis). A2 routes on whether the prompt carries shapes: with shapes, skip discovery and dispatch harness-generator with prompt shapes only; without shapes, fall through to discovery as Case C. Wire the PATH_A_FLAG_MISSING -> switch-to-A2 recovery and remove the all-four modes_covered example from the generic Case A default.
- tests: cover the deterministic backstop (no silent duplicates even with all four modes covered) and the A1 happy path (detector does not mis-fire).

Validated against the real k006 prompt on an AMD gfx942 GPU: the preprocessor now emits four distinct modes backed by a synthesized single-shape (983040x128 bf16), single-metric harness; correctness passes in ~66s instead of timing out at 600s.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

…earing tasks to harness generator (#258 follow-up)

Two broken Path-A cases remained after the flag-less backstop (PR #264):

1. Amalgamation && (a non-build command chaining two tests, e.g. the same script run twice with different settings) was split blindly left=correctness / right=performance by _try_synthesize_shell_contract_harness, silently dropping one latency number. Crucially, a *flag-bearing* amalgamation also escaped the flag-less backstop: _extract_harness_from_command returns a path for it, so original_harness_path was set and the backstop's `original_harness_path is None` guard never fired.
2. A command that carries a mode flag (or a build-bearing &&) but whose prompt lists shapes the command does not pin was classified A1 and run as-is, ignoring the shapes and triggering the harness's full default sweep -> timeout.

Fix:

- tools.py: add _is_amalgamation_command (gated on infer_compile_command_from_eval so the keep/refuse decision stays consistent with the compile prefix the split path re-prepends; ImportError -> treat as amalgamation, the safe route). Refuse amalgamations in commandment_from_user_command BEFORE harness extraction so the guard is flag-independent and catches the flag-bearing case. Skip synthesis for non-build && in _try_synthesize_shell_contract_harness, with comments on the hard-coded correctness/performance ordering and the delegation of all non-build && to the generator.
- orchestrator.py: add a shapes pre-check ahead of the A1/A2 split so a flag-aware command whose prompt carries unpinned shapes routes to A2-with-shapes (the deterministic tool is shape-blind and cannot refuse on shape grounds). Bias to A2 when uncertain (perf cost vs. silent timeout). Enumerate the single-flag and build-bearing-&& shape cases under A2-with-shapes; never inject shapes into the command (generator owns shape handling).
- tests: cover flag-less and flag-bearing amalgamation refusal, and a regression guard that build-bearing && still synthesizes.

Problems #2/#3 are LLM-prose routing and not unit-testable; they are covered by manual verification.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
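The order of the routing decisions can be written down as a small decision function. This is purely illustrative: in the PR the routing lives in orchestrator prompt prose, not in code, and the names `Route` and `route_path_a` and the three boolean inputs are hypothetical.

```python
from enum import Enum


class Route(Enum):
    A1 = "run the user command as-is"
    A2_WITH_SHAPES = "skip discovery; harness generator with prompt shapes only"
    A2_DISCOVERY = "fall through to discovery (Case C)"


def route_path_a(
    flag_aware_or_build_chain: bool,
    amalgamation: bool,
    prompt_has_unpinned_shapes: bool,
) -> Route:
    """Illustrative ordering of the Path-A routing checks."""
    # The shapes pre-check runs ahead of the A1/A2 split, so even a
    # flag-aware command is diverted when the prompt lists shapes the
    # command does not pin.
    if prompt_has_unpinned_shapes:
        return Route.A2_WITH_SHAPES
    # Flag-less commands and non-build && amalgamations go to A2; without
    # shapes, A2 falls through to discovery.
    if amalgamation or not flag_aware_or_build_chain:
        return Route.A2_DISCOVERY
    return Route.A1


happy_path = route_path_a(True, False, False)
shape_bearing = route_path_a(True, False, True)
flag_bearing_amalgamation = route_path_a(True, True, False)
```

Checking shapes first reflects the stated bias toward A2 when uncertain: a wrong A2 costs some preprocessing time, while a wrong A1 ends in a silent timeout.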

…follow-up)

The previous amalgamation fix lived only in the preprocessor (commandment_from_user_command), but the CLI intercepts a compound --test-command first and did its own blind rsplit("&&", 1) into correctness/performance hints before the preprocessor ran. So an amalgamation passed via --test-command (the primary entry path) was split one layer up -- left=correctness, right=performance -- silently dropping one metric, and never reached the deterministic guard as a single string. Caught by running the real GEAK binary: the CLI logged "Correctness command: ...--mode 1, Performance command: ...--mode 2".

Fix:

- contract_normalize.py: promote the amalgamation predicate to a shared, public is_amalgamation_command (stdlib-only module; no circular import) so the CLI and the preprocessor classify a compound command identically.
- tools.py: _is_amalgamation_command becomes a thin wrapper that imports the shared predicate, preserving the safe ImportError fallback (treat as amalgamation -> A2).
- mini.py: only pre-split a compound && into correctness/performance hints when it is build-bearing (is_amalgamation_command is False). A non-build && now passes through whole as eval_command, so the preprocessor's guard fires (PATH_A_FLAG_MISSING) and routes it to the harness generator (Case A2).
- tests: unit tests for is_amalgamation_command (no-&&, non-build &&, flag-bearing &&, build-bearing && / canonical compile-chain).

Verified end-to-end with the real binary on the k006 prompt: the amalgamation --test-command no longer triggers the CLI's "Correctness/Performance command" split; it is passed through whole.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
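A rough sketch of the predicate and the guarded CLI pre-split follows. The real is_amalgamation_command is gated on infer_compile_command_from_eval, whose logic is not shown in this PR text, so the build detection here (`_BUILD_TOOLS`, `_is_build_step`) is a hypothetical stand-in, as are the function names `is_amalgamation` and `presplit_hints`.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for the real compile-command inference.
_BUILD_TOOLS = {"make", "cmake", "ninja", "hipcc"}


def _is_build_step(segment: str) -> bool:
    """Treat a segment as a build step if its first token is a build tool."""
    tokens = segment.split()
    return bool(tokens) and tokens[0] in _BUILD_TOOLS


def is_amalgamation(command: str) -> bool:
    """True for a compound && command that contains no build step."""
    if "&&" not in command:
        return False
    return not any(_is_build_step(part) for part in command.split("&&"))


def presplit_hints(command: str) -> Optional[Tuple[str, str]]:
    """Split into (correctness, performance) hints only when build-bearing.

    Returns None when the command should pass through whole, so that the
    preprocessor guard sees it as a single string.
    """
    if "&&" not in command or is_amalgamation(command):
        return None
    left, right = command.rsplit("&&", 1)
    return left.strip(), right.strip()


amalgam = "python t.py --mode 1 && python t.py --mode 2"
build_chain = "hipcc -o t t.hip && ./t --check && ./t --bench"

amalgam_hints = presplit_hints(amalgam)
build_hints = presplit_hints(build_chain)
```

Sharing one predicate between the CLI and the preprocessor matters more than how the predicate detects a build step: if the two layers classified a command differently, the CLI could still split a command that the preprocessor would have refused.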

…finds the kernel source

End-to-end testing of the A2 (harness-generator) path showed baseline collection failing silently: every sample logged "produced no latency (rc=0)", no benchmark_baseline.txt was written, and preprocess aborted at the 900s soft cap during harness-init. The synthesized harness was correct (runs fine by hand) and routing was correct; the bug was the baseline subprocess environment.

Root cause: the v3 baseline env builder never exported GEAK_WORK_DIR, and the collect_baseline tool defaulted work_dir to None:

- baseline._build_env set only PYTHONPATH/HIP_VISIBLE_DEVICES/PYTHONUNBUFFERED, dropping the GEAK_WORK_DIR + GEAK_REPO_ROOT exports the legacy run_harness._build_env sets.
- collect_baseline's work_dir is an LLM-supplied arg with no default; the subagent often omits it, so baseline ran with work_dir=None (no PYTHONPATH, no GEAK_WORK_DIR at all).

A contract-compliant harness derives WORK_DIR = os.environ.get("GEAK_WORK_DIR") or its own dir; with the var unset it fell back to its own directory (which has no kernel source), ran nothing, and emitted no GEAK_RESULT_LATENCY_MS marker.

Fix:

- baseline.py: _build_env now exports GEAK_WORK_DIR=work_dir and setdefault GEAK_REPO_ROOT=work_dir when work_dir is given (mirrors the legacy contract; setdefault preserves an adapter-exported source root).
- tools.py: collect_baseline falls back to agent.config.repo (the source repo) when the subagent omits work_dir. At preprocess time no per-slot worktree exists yet, so the source repo is the correct target; optimization-time runs still pass their per-slot worktree explicitly via the legacy run_harness path.
- tools.py: add a comment documenting that compile_command is intentionally unset for HIP (the harness self-builds, mtime-keyed) so the absent COMMANDMENT compile line isn't mistaken for a bug.
- tests: _build_env exports the keys (and stays a no-op when work_dir is None); collect_baseline defaults work_dir to the configured source repo.

Verified via the real synthesized harness run through the fixed _build_env (cold JIT cache): both facets build and run, and extract_latency_ms returns a value where it previously returned None.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
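The environment contract on both sides can be sketched as follows. This is a simplified illustration, not the code in baseline.py: `build_env` and `harness_work_dir` are hypothetical names, the `base` parameter exists only to make the sketch testable, and the PYTHONPATH and HIP_VISIBLE_DEVICES handling of the real _build_env is omitted.

```python
import os
from typing import Dict, Optional


def build_env(work_dir: Optional[str], base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build the baseline subprocess environment.

    Exports GEAK_WORK_DIR and, unless something already set it,
    GEAK_REPO_ROOT. Adds neither key when work_dir is None.
    """
    env = dict(os.environ if base is None else base)
    env["PYTHONUNBUFFERED"] = "1"
    if work_dir is not None:
        env["GEAK_WORK_DIR"] = work_dir
        # setdefault keeps a source root exported earlier, e.g. by an adapter.
        env.setdefault("GEAK_REPO_ROOT", work_dir)
    return env


def harness_work_dir(env: Dict[str, str], harness_file: str) -> str:
    """Harness side of the contract: env var first, own directory otherwise."""
    return env.get("GEAK_WORK_DIR") or os.path.dirname(os.path.abspath(harness_file))


fixed = build_env("/repo", base={})
preset_root = build_env("/repo", base={"GEAK_REPO_ROOT": "/adapter/src"})
unset = build_env(None, base={})
```

With the variable unset, `harness_work_dir` returns the harness's own directory, which is the fallback that left the synthesized harness without any kernel source to run.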


sdubagun-amd requested review from Umangatamd, amd-ethany, chao-xu-spec, iraj465, jianghui-jianghui and yueliu14 as code owners June 4, 2026 16:18

sdubagun-amd and others added 3 commits June 4, 2026 19:46

sdubagun-amd linked an issue Jun 4, 2026 that may be closed by this pull request

When the test command has no mode flags, all four modes become the same command #258

Open

sdubagun-amd mentioned this pull request Jun 4, 2026

When the test command has no mode flags, all four modes become the same command #258

Open

yueliu14 approved these changes Jun 9, 2026

sdubagun-amd wants to merge 4 commits into main from fix/preprocess-v3-path-a-synthesize-harness

2 participants
