Independent Patch Auditor for the e2e_workflow (gated post-accept sign-off) #292
Umangatamd wants to merge 1 commit into
Adds an independent, prompt-driven Patch Auditor that re-derives every number from raw bench_runs.jsonl and signs off every place the workflow banks a decision: the baseline, the profile/routing, each config/kernel accept, the kernel harness, and the final bundle.
- skill: perf_knowledge/expert_skills/skills/patch_auditor/ (5 scopes: patch|bundle|baseline|profile|harness; PASS|FLAG|FAIL verdicts; + proof/ backtest fixtures + index.yaml entry)
- wiring: e2e_workflow.js, additive + gated behind use_auditor (default OFF -> byte-identical when off). baseline FLAG auto-triggers a re-measure; harness FAIL loops the op-benchmarker.
- prompt-level enforcement: e2e_integrator / config_tuner / op_benchmarker / kernel_extractor treat the auditor as authoritative and fix until it would PASS.
Catches (blind backtests + first live run): orthogonal-lever (Gap 1), benchmark-gaming (Gap 5), no-evidence accepts (Gap 4), correctness=determinism on gibberish (Gap 7/8), mem-fraction same-conditions confound + misreport, and a noisy baseline.
See AUDITOR_PR.md (writeup) and AUDITOR_RUN_REPORT.md (in-run evidence).
Co-authored-by: Cursor <cursoragent@cursor.com>
Hi @Umangatamd, Regarding the improvement of global verification accuracy: from the GEAK_v4 architecture perspective, this falls under the responsibility of the top agent. It can be addressed by improving precision and enhancing the harness, so we do not need to introduce a new role/step. For global accuracy verification, I am currently working on a downstream task validation module, aligned with inferenceMax, using GSM8K. As for the harness/unittest, it has already been enhanced in the recent versions. Liuyue will also provide a new version of the UT construction soon.
Hey @zihaoanllm, thanks for the detailed review. We're aligned on the GSM8K module and the harness/UT work, and the auditor is meant to build on both rather than duplicate them. The one thing we'd surface is around relying on the top agent's prompt/precision for verification: in our experience that path is fragile, and it's precisely why a separate, independent verifier is the safer complement.
Why single-agent prompt-level verification is risky:
So a separate verifier isn't a competing approach — it's the safety net that makes the top agent, the GSM8K module, and the new harness trustworthy:
Net: keep improving the top agent, GSM8K, and the harness, and let an independent verifier re-derive from raw on top, because that independence is the one property a single agent can't give itself, no matter how good the prompt.
PR: Independent Patch Auditor for the GEAK e2e_workflow
Summary
Adds an independent, prompt-driven Patch Auditor that signs off every place the workflow banks a
decision: the baseline, the profile/routing, each config/kernel accept, the kernel harness, and the
final bundle. It re-derives every number from the raw bench_runs.jsonl (never the producing role's
reported numbers), so a win can't be unreal, unfairly measured, misattributed, or unsafe. The producing
agents are told to treat its verdict as authoritative and fix until it would pass. Everything is additive
and gated behind use_auditor (default OFF); with the flag off the workflow is byte-identical to before.
The problem (grounded in our run corpus)
Every producer persona either optimizes or grades its own work; the Integrator both builds the overlay
and gates it, and the gates were prompt-asserted prose an LLM can skip. Across real multi-model runs that
let through, repeatedly:
(slower), measured through a graph-replay path the live server never uses (Gap 5);
was perpetually deferred (Gap 7/8);
The meta-finding: v4's fixes for these were prompt-level, not enforced — non-deterministic and
re-overfittable. This PR makes the verification independent and re-derived from raw, at every gate.
What this PR adds
- perf_knowledge/expert_skills/skills/patch_auditor/skill.md: the auditor persona + 5 audit scopes: patch, bundle, baseline, profile, harness. Verdict is PASS | FLAG | FAIL with a per-finding action. (+ proof/ backtest fixtures and index.yaml entry.)
- e2e_workflow.js (+161/-13), additive + use_auditor-gated.
- Prompt-level enforcement (e2e_integrator, config_tuner, op_benchmarker, kernel_extractor): the auditor is authoritative; read its verdict, fix every reason, resubmit, and do not consider the work done until it would PASS.
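As a rough illustration of what a verdict with per-finding actions could look like as data, here is a minimal Python sketch. The class and field names (Finding, AuditReport, reason, action) are hypothetical and not taken from skill.md, and the rule that a report's verdict is its most severe finding is an assumption made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Scope(Enum):
    # The five audit scopes named in the PR description.
    PATCH = "patch"
    BUNDLE = "bundle"
    BASELINE = "baseline"
    PROFILE = "profile"
    HARNESS = "harness"


class Verdict(Enum):
    PASS = "PASS"
    FLAG = "FLAG"
    FAIL = "FAIL"


# Assumed severity ordering, used only to pick the worst finding.
_SEVERITY = {Verdict.PASS: 0, Verdict.FLAG: 1, Verdict.FAIL: 2}


@dataclass
class Finding:
    verdict: Verdict
    reason: str   # what the auditor observed in the raw artifacts
    action: str   # per-finding action for the producing agent


@dataclass
class AuditReport:
    scope: Scope
    findings: list = field(default_factory=list)

    @property
    def verdict(self):
        # Assumption: the report verdict is the most severe finding;
        # a report with no findings counts as PASS.
        return max(
            (f.verdict for f in self.findings),
            key=_SEVERITY.get,
            default=Verdict.PASS,
        )


report = AuditReport(
    Scope.BASELINE,
    [Finding(Verdict.FLAG, "baseline spread exceeds noise band", "re-measure baseline")],
)
print(report.scope.value, report.verdict.value)
```

Keeping the action next to each reason is what would let a producing agent "fix every reason" without re-reading the whole audit.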
How it works
legs' launch config from the raw server_info; it cross-checks the reported summary against the raw.
FAIL is fed back to the producing agent (with reasons) to fix and resubmit; the auditor re-audits from raw, so a superficial "fix" can't pass (independence).
glitch must not kill a result); only an explicit FAIL withholds a bank. Loops are bounded.
it: it trusts a harness the auditor has already validated.
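The two checks described above, comparing both legs' launch config from raw server_info and cross-checking the reported summary against the raw runs, can be sketched as follows. This is only an illustration: the record layout (leg, throughput, server_info.mem_fraction), the use of the median, and the 1% tolerance are assumptions, not the actual bench_runs.jsonl schema or the auditor's real thresholds.

```python
import json
from statistics import median

# Hypothetical raw records; the real bench_runs.jsonl schema may differ.
RAW_JSONL = """\
{"leg": "baseline", "throughput": 100.0, "server_info": {"mem_fraction": 0.85}}
{"leg": "baseline", "throughput": 101.0, "server_info": {"mem_fraction": 0.85}}
{"leg": "baseline", "throughput": 99.0, "server_info": {"mem_fraction": 0.85}}
{"leg": "candidate", "throughput": 104.0, "server_info": {"mem_fraction": 0.90}}
{"leg": "candidate", "throughput": 105.0, "server_info": {"mem_fraction": 0.90}}
{"leg": "candidate", "throughput": 106.0, "server_info": {"mem_fraction": 0.90}}
"""


def load_runs(jsonl_text):
    """Parse one JSON record per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def audit_patch(runs, reported_speedup, must_match=("mem_fraction",), rel_tol=0.01):
    """Return (verdict, reasons, rederived_speedup) using only the raw runs."""
    by_leg = {}
    for run in runs:
        by_leg.setdefault(run["leg"], []).append(run)
    base, cand = by_leg.get("baseline", []), by_leg.get("candidate", [])
    if not base or not cand:
        return "FAIL", ["no raw evidence for one or both legs"], None

    reasons = []

    # Same-conditions check: keys that are not the lever must match across legs.
    def launch_key(run):
        return tuple(run["server_info"].get(k) for k in must_match)

    if {launch_key(r) for r in base} != {launch_key(r) for r in cand}:
        reasons.append(f"legs launched with different {list(must_match)} (confounded)")

    # Re-derive the speedup from raw and cross-check the reported summary.
    speedup = median(r["throughput"] for r in cand) / median(r["throughput"] for r in base)
    if abs(speedup - reported_speedup) / speedup > rel_tol:
        reasons.append(f"reported {reported_speedup:.4f}x but raw gives {speedup:.4f}x")

    return ("FAIL" if reasons else "PASS"), reasons, speedup


verdict, reasons, speedup = audit_patch(load_runs(RAW_JSONL), reported_speedup=1.05)
print(verdict, reasons)
```

In this made-up data the reported number matches the raw runs, yet the verdict is still FAIL because the two legs were not launched under the same conditions.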
Evidence it works (blind tests on real recorded runs + first live run)
The auditor was run blind (skill + raw artifacts only; findings/answer-key withheld) against recorded
runs in exp/. It reasoned from primary evidence in every case; it did not pattern-match known cases:
director_e2e_validation.json misreported mem as 0.85 for both → confounded bundle, not a clean kernel win
isolated 3.0306× opbench_result.json, found a self-comparing baseline (baseline backend == winner) and rel_err 45.4 on the only faster alt → the isolated number is fabricated; the e2e win is real
--dataset random), 0/12 byte-exact → byte-parity uninformative, cannot certify
unverifiable_noncoherent_reference)
First live (auditor-enabled) run, where the in-loop baseline sign-off fired automatically:
The auditor re-derived the baseline reps [966.36, 1004.05, 1004.71], flagged spread 3.82% (>7× the
0.5% noise band) driven by a rep-0 warm-up outlier, and recommended re-measure: exactly the "too noisy
to gate a sub-4% win" blind spot, caught in-run with actionable advice. With this PR that FLAG now
auto-triggers a baseline re-measure before anything is gated on it.
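The spread figure can be re-derived from the three reps directly. The sketch below assumes spread is defined as (max - min) / median; the PR does not state the auditor's exact formula, although this definition reproduces the 3.82% figure from the reps quoted above.

```python
from statistics import median


def relative_spread(reps):
    # Assumed definition: range of the reps relative to their median.
    return (max(reps) - min(reps)) / median(reps)


reps = [966.36, 1004.05, 1004.71]   # baseline reps quoted above
noise_band = 0.005                  # the 0.5% noise band

spread = relative_spread(reps)
verdict = "FLAG" if spread > noise_band else "PASS"
print(f"spread={spread:.2%} ratio={spread / noise_band:.1f}x verdict={verdict}")
# prints: spread=3.82% ratio=7.6x verdict=FLAG

# Dropping rep 0 shows how much of the spread comes from the warm-up outlier.
spread_without_rep0 = relative_spread(reps[1:])
print(f"without rep 0: {spread_without_rep0:.2%}")
```

With rep 0 removed, the remaining two reps fall well inside the 0.5% band, which is consistent with the warm-up-outlier explanation.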
Gaps closed
profile gate re-aggregates GPU-time by op, checks fragmentation + edit-flag + skip, before routing
harness gate enforces dispatch/launch parity + fair A/B + served shapes; e2e gate backstops isolated≠e2e
baseline gate flags a noisy/handicapped reference and auto re-measures
Honest limitations (by design)
reliability at scale is not yet proven (the N×/multi-model sweep is future work).
the kernel-track nested workflow rely on the maker prompts (loop wiring there is a follow-up).
Risk / safety
use_auditor-gated (default OFF) → byte-identical to the current workflow when off (verified). All loops are bounded; a degraded auditor fails open (never blocks a real win or hangs).
exp/e2e_Qwen-Qwen3.5-27B-FP8_20260622_133415_176030_24863) is exercising it end-to-end with use_auditor=true.
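The safety properties listed above (flag-gated, bounded loops, fail-open on a degraded auditor, only an explicit FAIL withholds a bank) can be sketched in a few lines. This is a Python illustration only, not the wiring in e2e_workflow.js: gated_accept, audit, fix, and max_rounds are hypothetical names, and withholding the bank once the retries are exhausted is an assumption.

```python
def gated_accept(result, audit, fix, use_auditor=False, max_rounds=3):
    """Return (banked, result). `audit` and `fix` are hypothetical callables."""
    if not use_auditor:
        return True, result            # flag off: accept exactly as before
    for attempt in range(max_rounds):  # bounded loop, never hangs
        try:
            verdict, reasons = audit(result)
        except Exception:
            return True, result        # degraded auditor fails open
        if verdict != "FAIL":
            return True, result        # PASS and FLAG both bank
        if attempt + 1 < max_rounds:
            result = fix(result, reasons)
    return False, result               # assumed: still FAIL after retries withholds


def demo_audit(result):
    # Toy audit: the claimed number must equal the one re-derived from raw.
    if result["claimed"] != result["raw"]:
        return "FAIL", ["claimed speedup does not match raw"]
    return "PASS", []


def demo_fix(result, reasons):
    # Toy fix: the producer corrects its claim to the raw-derived value.
    return {**result, "claimed": result["raw"]}


banked, final = gated_accept({"claimed": 1.10, "raw": 1.04}, demo_audit, demo_fix, use_auditor=True)
print(banked, final)
```

With use_auditor left at its default, the function returns before audit is ever called, which mirrors the "identical when off" property in spirit.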