Skip to content

fix(preprocess_v3): synthesize single-metric harness for flag-less/co…#264

Open
sdubagun-amd wants to merge 4 commits into
mainfrom
fix/preprocess-v3-path-a-synthesize-harness
Open

fix(preprocess_v3): synthesize single-metric harness for flag-less/co…#264
sdubagun-amd wants to merge 4 commits into
mainfrom
fix/preprocess-v3-path-a-synthesize-harness

Conversation

@sdubagun-amd

@sdubagun-amd sdubagun-amd commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

…mposite Path-A tasks (#258)

A flag-less Path-A test command (no GEAK mode flag) was copied byte-identically into all four COMMANDMENT modes, silently. The correctness preflight then ran the harness's full default sweep until it timed out, producing 0% kernel gain.

Two cooperating defects: the orchestrator prompt trained the LLM to mark all four modes covered, and _substitute_mode_flag silently no-ops on a flag-less command, so the renderer loop emitted no warning.

Fix:

  • tools.py: add a deterministic flag-less backstop in commandment_from_user_command. When the command exposes no mode flag, is not a compound && shell contract, and yields no harness, refuse to render the all-modes-identical COMMANDMENT and return ok:False / PATH_A_FLAG_MISSING. This guard is independent of the modes_covered the LLM passes.
  • orchestrator.py: split Case A into A1 (flag-aware -> today's happy path) and A2 (flag-less/composite -> route into harness synthesis). A2 routes on whether the prompt carries shapes: with shapes, skip discovery and dispatch harness-generator with prompt shapes only; without shapes, fall through to discovery as Case C. Wire the PATH_A_FLAG_MISSING -> switch-to-A2 recovery and remove the all-four modes_covered example from the generic Case A default.
  • tests: cover the deterministic backstop (no silent duplicates even with all four modes covered) and the A1 happy path (detector does not mis-fire).

Validated against the real k006 prompt on an AMD gfx942 GPU: the preprocessor now emits four distinct modes backed by a synthesized single-shape (983040x128 bf16), single-metric harness; correctness passes in ~66s instead of timing out at 600s.


Follow-up commits (scope expansion)

Beyond the original flag-less backstop, end-to-end testing surfaced adjacent cases now also fixed on this branch:

  • Amalgamation && — a non-build cmd_a && cmd_b (e.g. mode 1 && mode 2) was blindly split into correctness/performance, dropping one metric. Added is_amalgamation_command (gated on infer_compile_command_from_eval), checked at both the CLI (mini.py, which previously did its own rsplit("&&") before the preprocessor) and the preprocessor (flag-independent, before harness extraction). Both refuse with PATH_A_FLAG_MISSING → harness generator.
  • Shape-bearing commands — a flag-aware or build-&& command whose prompt carries unpinned shapes was run as-is (default sweep → timeout). Added a shapes pre-check ahead of the A1/A2 split so these route to A2-with-shapes.
  • Baseline GEAK_WORK_DIR — the v3 baseline subprocess never exported GEAK_WORK_DIR/GEAK_REPO_ROOT (the legacy run_harness did), so a contract-compliant harness fell back to its own dir, found no kernel source, and emitted no latency (silent "produced no latency" → soft-cap abort). Fixed in _build_env, plus a work_dir fallback to the source repo in collect_baseline.

Tests added for the amalgamation predicate, the baseline env exports, and the work_dir fallback.

…mposite Path-A tasks (#258)

A flag-less Path-A test command (no GEAK mode flag) was copied byte-identically
into all four COMMANDMENT modes, silently. The correctness preflight then ran the
harness's full default sweep until it timed out, producing 0% kernel gain.

Two cooperating defects: the orchestrator prompt trained the LLM to mark all four
modes covered, and _substitute_mode_flag silently no-ops on a flag-less command, so
the renderer loop emitted no warning.

Fix:
- tools.py: add a deterministic flag-less backstop in
  commandment_from_user_command. When the command exposes no mode flag, is not a
  compound && shell contract, and yields no harness, refuse to render the
  all-modes-identical COMMANDMENT and return ok:False / PATH_A_FLAG_MISSING. This
  guard is independent of the modes_covered the LLM passes.
- orchestrator.py: split Case A into A1 (flag-aware -> today's happy path) and A2
  (flag-less/composite -> route into harness synthesis). A2 routes on whether the
  prompt carries shapes: with shapes, skip discovery and dispatch harness-generator
  with prompt shapes only; without shapes, fall through to discovery as Case C. Wire
  the PATH_A_FLAG_MISSING -> switch-to-A2 recovery and remove the all-four
  modes_covered example from the generic Case A default.
- tests: cover the deterministic backstop (no silent duplicates even with all four
  modes covered) and the A1 happy path (detector does not mis-fire).

Validated against the real k006 prompt on an AMD gfx942 GPU: the preprocessor now
emits four distinct modes backed by a synthesized single-shape (983040x128 bf16),
single-metric harness; correctness passes in ~66s instead of timing out at 600s.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
sdubagun-amd and others added 3 commits June 4, 2026 19:46
…earing tasks to harness generator (#258 follow-up)

Two broken Path-A cases remained after the flag-less backstop (PR #264):

1. Amalgamation && (a non-build command chaining two tests, e.g. the same
   script run twice with different settings) was split blindly
   left=correctness / right=performance by _try_synthesize_shell_contract_harness,
   silently dropping one latency number. Crucially, a *flag-bearing* amalgamation
   also escaped the flag-less backstop: _extract_harness_from_command returns a
   path for it, so original_harness_path was set and the backstop's
   `original_harness_path is None` guard never fired.

2. A command that carries a mode flag (or a build-bearing &&) but whose prompt
   lists shapes the command does not pin was classified A1 and run as-is,
   ignoring the shapes and triggering the harness's full default sweep -> timeout.

Fix:
- tools.py: add _is_amalgamation_command (gated on infer_compile_command_from_eval
  so the keep/refuse decision stays consistent with the compile prefix the split
  path re-prepends; ImportError -> treat as amalgamation, the safe route). Refuse
  amalgamations in commandment_from_user_command BEFORE harness extraction so the
  guard is flag-independent and catches the flag-bearing case. Skip synthesis for
  non-build && in _try_synthesize_shell_contract_harness, with comments on the
  hard-coded correctness/performance ordering and the delegation of all non-build
  && to the generator.
- orchestrator.py: add a shapes pre-check ahead of the A1/A2 split so a flag-aware
  command whose prompt carries unpinned shapes routes to A2-with-shapes (the
  deterministic tool is shape-blind and cannot refuse on shape grounds). Bias to
  A2 when uncertain (perf cost vs. silent timeout). Enumerate the single-flag and
  build-bearing-&& shape cases under A2-with-shapes; never inject shapes into the
  command (generator owns shape handling).
- tests: cover flag-less and flag-bearing amalgamation refusal, and a regression
  guard that build-bearing && still synthesizes. Problems #2/#3 are LLM-prose
  routing and not unit-testable; they are covered by manual verification.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…follow-up)

The previous amalgamation fix lived only in the preprocessor
(commandment_from_user_command), but the CLI intercepts a compound --test-command
first and did its own blind rsplit("&&", 1) into correctness/performance hints
before the preprocessor ran. So an amalgamation passed via --test-command
(the primary entry path) was split one layer up -- left=correctness,
right=performance -- silently dropping one metric, and never reached the
deterministic guard as a single string. Caught by running the real GEAK binary:
the CLI logged "Correctness command: ...--mode 1, Performance command: ...--mode 2".

Fix:
- contract_normalize.py: promote the amalgamation predicate to a shared, public
  is_amalgamation_command (stdlib-only module; no circular import) so the CLI and
  the preprocessor classify a compound command identically.
- tools.py: _is_amalgamation_command becomes a thin wrapper that imports the shared
  predicate, preserving the safe ImportError fallback (treat as amalgamation -> A2).
- mini.py: only pre-split a compound && into correctness/performance hints when it
  is build-bearing (is_amalgamation_command is False). A non-build && now passes
  through whole as eval_command, so the preprocessor's guard fires
  (PATH_A_FLAG_MISSING) and routes it to the harness generator (Case A2).
- tests: unit tests for is_amalgamation_command (no-&&, non-build &&, flag-bearing
  &&, build-bearing && / canonical compile-chain).

Verified end-to-end with the real binary on the k006 prompt: the amalgamation
--test-command no longer triggers the CLI's "Correctness/Performance command"
split; it is passed through whole.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…finds the kernel source

End-to-end testing of the A2 (harness-generator) path showed baseline collection
failing silently: every sample logged "produced no latency (rc=0)", no
benchmark_baseline.txt was written, and preprocess aborted at the 900s soft cap
during harness-init. The synthesized harness was correct (runs fine by hand) and
routing was correct; the bug was the baseline subprocess environment.

Root cause: the v3 baseline env builder never exported GEAK_WORK_DIR, and the
collect_baseline tool defaulted work_dir to None:
- baseline._build_env set only PYTHONPATH/HIP_VISIBLE_DEVICES/PYTHONUNBUFFERED,
  dropping the GEAK_WORK_DIR + GEAK_REPO_ROOT exports the legacy
  run_harness._build_env sets.
- collect_baseline's work_dir is an LLM-supplied arg with no default; the subagent
  often omits it, so baseline ran with work_dir=None (no PYTHONPATH, no
  GEAK_WORK_DIR at all).

A contract-compliant harness derives WORK_DIR = os.environ.get("GEAK_WORK_DIR")
or its own dir; with the var unset it fell back to its own directory (which has no
kernel source), ran nothing, and emitted no GEAK_RESULT_LATENCY_MS marker.

Fix:
- baseline.py: _build_env now exports GEAK_WORK_DIR=work_dir and
  setdefault GEAK_REPO_ROOT=work_dir when work_dir is given (mirrors the legacy
  contract; setdefault preserves an adapter-exported source root).
- tools.py: collect_baseline falls back to agent.config.repo (the source repo)
  when the subagent omits work_dir. At preprocess time no per-slot worktree exists
  yet, so the source repo is the correct target; optimization-time runs still pass
  their per-slot worktree explicitly via the legacy run_harness path.
- tools.py: add a comment documenting that compile_command is intentionally unset
  for HIP (the harness self-builds, mtime-keyed) so the absent COMMANDMENT compile
  line isn't mistaken for a bug.
- tests: _build_env exports the keys (and stays a no-op when work_dir is None);
  collect_baseline defaults work_dir to the configured source repo.

Verified via the real synthesized harness run through the fixed _build_env (cold
JIT cache): both facets build and run, and extract_latency_ms returns a value
where it previously returned None.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
sdubagun-amd added a commit that referenced this pull request Jun 8, 2026
…earing tasks to harness generator (#258 follow-up)

Two broken Path-A cases remained after the flag-less backstop (PR #264):

1. Amalgamation && (a non-build command chaining two tests, e.g. the same
   script run twice with different settings) was split blindly
   left=correctness / right=performance by _try_synthesize_shell_contract_harness,
   silently dropping one latency number. Crucially, a *flag-bearing* amalgamation
   also escaped the flag-less backstop: _extract_harness_from_command returns a
   path for it, so original_harness_path was set and the backstop's
   `original_harness_path is None` guard never fired.

2. A command that carries a mode flag (or a build-bearing &&) but whose prompt
   lists shapes the command does not pin was classified A1 and run as-is,
   ignoring the shapes and triggering the harness's full default sweep -> timeout.

Fix:
- tools.py: add _is_amalgamation_command (gated on infer_compile_command_from_eval
  so the keep/refuse decision stays consistent with the compile prefix the split
  path re-prepends; ImportError -> treat as amalgamation, the safe route). Refuse
  amalgamations in commandment_from_user_command BEFORE harness extraction so the
  guard is flag-independent and catches the flag-bearing case. Skip synthesis for
  non-build && in _try_synthesize_shell_contract_harness, with comments on the
  hard-coded correctness/performance ordering and the delegation of all non-build
  && to the generator.
- orchestrator.py: add a shapes pre-check ahead of the A1/A2 split so a flag-aware
  command whose prompt carries unpinned shapes routes to A2-with-shapes (the
  deterministic tool is shape-blind and cannot refuse on shape grounds). Bias to
  A2 when uncertain (perf cost vs. silent timeout). Enumerate the single-flag and
  build-bearing-&& shape cases under A2-with-shapes; never inject shapes into the
  command (generator owns shape handling).
- tests: cover flag-less and flag-bearing amalgamation refusal, and a regression
  guard that build-bearing && still synthesizes. Problems #2/#3 are LLM-prose
  routing and not unit-testable; they are covered by manual verification.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

When the test command has no mode flags, all four modes become the same command

2 participants