Independent Patch Auditor for the e2e_workflow (gated post-accept sign-off) by Umangatamd · Pull Request #292 · AMD-AGI/GEAK

Umangatamd · 2026-06-23T05:14:46Z

PR: Independent Patch Auditor for the GEAK e2e_workflow

Summary

Adds an independent, prompt-driven Patch Auditor that signs off every place the workflow banks a
decision — the baseline, the profile/routing, each config/kernel accept, the kernel harness, and the
final bundle. It re-derives every number from the raw bench_runs.jsonl (never the producing role's
reported numbers), so a win can't be unreal, unfairly measured, misattributed, or unsafe. The producing
agents are told to treat its verdict as authoritative and fix until it would pass. Everything is additive
and gated behind use_auditor (default OFF) — with the flag off the workflow is byte-identical to before.

The problem (grounded in our run corpus)

Every producer persona either optimizes or grades its own work; the Integrator both builds the overlay
and gates it, and the gates were prompt-asserted prose an LLM can skip. Across real multi-model runs that
let through, repeatedly:

an orthogonal spec-decode lever (+34.5%) reported as a kernel win (Gap 1);
a benchmark-gamed "verified 1.5×/3.0×" isolated speedup whose deployed kernel was actually ~0.985×
(slower), measured through a graph-replay path the live server never uses (Gap 5);
an accept recorded with null evidence fields (medians/non-overlap unprovable from the artifact) (Gap 4);
a +8–9% banked on a gibberish-emitting model where byte-parity is meaningless and the accuracy probe
was perpetually deferred (Gap 7/8);
a mem-fraction-handicapped / misreported baseline inflating a headline (found live, see below).

The meta-finding: v4's fixes for these were prompt-level, not enforced — non-deterministic and
re-overfittable. This PR makes the verification independent and re-derived from raw, at every gate.

What this PR adds

Skill perf_knowledge/expert_skills/skills/patch_auditor/skill.md — the auditor persona + 5 audit
scopes: patch, bundle, baseline, profile, harness. Verdict is PASS | FLAG | FAIL with a
per-finding action. (+ proof/ backtest fixtures and index.yaml entry.)
Orchestrator wiring in e2e_workflow.js (+161/-13), additive + use_auditor-gated:
- baseline sign-off after setup, with auto re-measure if the baseline is noisy/contended/wrong-invariant;
- profile/route sign-off before strategize (closes mis-attribution at the source);
- harness sign-off after the op bake-off, with an op-benchmarker redo loop on FAIL;
- accept sign-off at the config gate + all three kernel/head integrate gates;
- final bundle sign-off after validation.
Prompt-level enforcement in the producing roles (e2e_integrator, config_tuner, op_benchmarker,
kernel_extractor): the auditor is authoritative; read its verdict, fix every reason, resubmit, and do
not consider the work done until it would PASS.

How it works

Re-derive, never trust: every gate recomputes medians/min-max/non-overlap/TTFT/TPOT and diffs the two
legs' launch config from the raw server_info; it cross-checks the reported summary against the raw.
Fix-until-pass, prompt-level: an explicit FAIL is fed back to the producing agent (with reasons) to
fix and resubmit; the auditor re-audits from raw, so a superficial "fix" can't pass (independence).
Fail-OPEN on tooling failure: a degraded/no-verdict auditor never hard-blocks a real win (a parser
glitch must not kill a result); only an explicit FAIL withholds a bank. Loops are bounded.
Separation of duties: the auditor optimizes/integrates nothing; the kernel changer never consults
it — it trusts a harness the auditor has already validated.

Evidence it works (blind tests on real recorded runs + first live run)

The auditor was run blind (skill + raw artifacts only; findings/answer-key withheld) against recorded
runs in exp/. It reasoned from primary evidence in every case — it did not pattern-match known cases:

Case (real artifact)	What the auditor did	Verdict
FlyDSL GEMM "+66.9% clean win" (Qwen3.5-27B)	Re-derived raw legs: baseline ran at mem 0.7225 vs final 0.85 (handicapped) and `director_e2e_validation.json` misreported mem as 0.85 for both → confounded bundle, not a clean kernel win	FAIL / eject — and corrected a flaw the human answer key had missed
Authored GEMM integrate claiming `isolated 3.0306×`	Independently opened the kernel-layer `opbench_result.json`, found a self-comparing baseline (baseline backend == winner) and `rel_err 45.4` on the only faster alt → the isolated number is fabricated; the e2e win is real	FLAG → replace_headline_with_e2e
MoE attention CK→triton "+9%"	Inspected the parity dumps: reference decode is gibberish (`--dataset random`), 0/12 byte-exact → byte-parity uninformative, cannot certify	FAIL / eject (`unverifiable_noncoherent_reference`)
Crashed authored attention kernel (iso 6.6×)	Saw the live server crashed (HSA fault) → do-no-harm violation	FAIL / eject (agreed with the justified reject)

First live (auditor-enabled) run — in-loop baseline sign-off fired automatically:
re-derived the baseline reps [966.36, 1004.05, 1004.71], flagged spread 3.82% (>7× the 0.5% noise band)
driven by a rep-0 warm-up outlier, and recommended re-measure — exactly the "too noisy to gate a
sub-4% win" blind spot, caught in-run with actionable advice. With this PR that FLAG now auto-triggers a
baseline re-measure before anything is gated on it.

Gaps closed

Gap	How the auditor closes it
1 — orthogonal lever as kernel win	lever A/B classification at every accept; B never folded into a kernel headline
2 — profiler mis-attribution / mis-routing	`profile` gate re-aggregates GPU-time by op, checks fragmentation + edit-flag + skip, before routing
4 — accept without evidence	re-derives from raw; reported nulls are ignored; missing raw ⇒ FAIL
5 — benchmark-gaming	`harness` gate enforces dispatch/launch parity + fair A/B + served shapes; e2e gate backstops isolated≠e2e
6 (storm)	same-conditions uncontended-legs check (detection)
6 (latency)	headline-integrity flags a TTFT/TPOT regression hidden by a throughput headline
7/8 — correctness=determinism	coherence check; non-coherent reference ⇒ unverifiable ⇒ FAIL
baseline quality	`baseline` gate flags a noisy/handicapped reference and auto re-measures

Honest limitations (by design)

It catches false accepts, not missed wins / wasted runs (that's the future Researcher/PI persona).
It detects contamination; it does not prevent the process storm (the reaper does).
The interpretive judgments (lever-class, coherence) are LLM calls anchored on the objective gates; their
reliability at scale is not yet proven (the N×/multi-model sweep is future work).
The harness redo-loop is wired on the serial head path (the default); the fast-mode parallel path and
the kernel-track nested workflow rely on the maker prompts (loop wiring there is a follow-up).

Risk / safety

Purely additive + use_auditor-gated (default OFF) → byte-identical to the current workflow when off
(verified). All loops are bounded; a degraded auditor fails open (never blocks a real win or hangs).
Live e2e run (exp/e2e_Qwen-Qwen3.5-27B-FP8_20260622_133415_176030_24863) is exercising it end-to-end
with use_auditor=true.

Adds an independent, prompt-driven Patch Auditor that re-derives every number from raw bench_runs.jsonl and signs off every place the workflow banks a decision: the baseline, the profile/routing, each config/kernel accept, the kernel harness, and the final bundle. - skill: perf_knowledge/expert_skills/skills/patch_auditor/ (5 scopes: patch|bundle|baseline| profile|harness; PASS|FLAG|FAIL verdicts; + proof/ backtest fixtures + index.yaml entry) - wiring: e2e_workflow.js, additive + gated behind use_auditor (default OFF -> byte-identical when off). baseline FLAG auto-triggers a re-measure; harness FAIL loops the op-benchmarker. - prompt-level enforcement: e2e_integrator / config_tuner / op_benchmarker / kernel_extractor treat the auditor as authoritative and fix until it would PASS. Catches (blind backtests + first live run): orthogonal-lever (Gap 1), benchmark-gaming (Gap 5), no-evidence accepts (Gap 4), correctness=determinism on gibberish (Gap 7/8), mem-fraction same-conditions confound + misreport, and a noisy baseline. See AUDITOR_PR.md (writeup) and AUDITOR_RUN_REPORT.md (in-run evidence). Co-authored-by: Cursor <cursoragent@cursor.com>

zihaoanllm · 2026-06-23T10:05:06Z

Hi @Umangatamd,

Regarding the improvement of global verification accuracy: from the GEAK_v4 architecture perspective, this falls under the responsibility of the top agent. It can be addressed by improving precision and enhancing the harness, so we do not need to introduce a new role/step.

For global accuracy verification, I am currently working on a downstream task validation module, aligned with inferenceMax, using GSM8K.

As for the harness/unittest, it has already been enhanced in the recent versions. Liuyue will also provide a new version of the UT construction soon.

Umangatamd · 2026-06-23T10:46:07Z

Hey @zihaoanllm ,Thanks for the detailed review — we're aligned on the GSM8K module and the harness/UT work, and the auditor is meant to build on both rather than duplicate them.

The one thing we'd surface is around relying on the top agent's prompt/precision for verification: in our experience that path is fragile, and it's precisely why a separate, independent verifier is the safer complement.

Why single-agent prompt-level verification is risky:

Self-attestation, not verification. When the agent that produces the win also signs it off, there's an inherent conflict of interest — like a team auditing its own books. Improving that agent's prompt makes it more precise, but it can't make it independent of itself.
Blind spots can't be prompted away. On a recorded v4 run, the Director's own director_e2e_validation.json misreported the serving invariant (claimed mem 0.85 on both legs; the raw server_info was 0.7225), turning a confounded, baseline-handicapped delta into a "clean +66.9%." A more careful prompt doesn't catch this — the agent can't notice what it can't see. An independent re-derivation from the raw bench_runs.jsonl does. (Same for the fabricated isolated 3.03x from an op-bench self-comparing baseline, and the +9% banked on a gibberish MoE via a parity heuristic — all passed the existing top-agent flow.)
Prompt fixes regress. The corpus meta-finding was that v4's verification fixes were prompt-level — and prompt-level guarantees are non-deterministic and re-overfittable (e.g. the planted spec-decode bias is just text; nothing stops it coming back). The piece that doesn't drift is independent re-derivation from raw + a gate that can't narrate around a FAIL.

So a separate verifier isn't a competing approach — it's the safety net that makes the top agent, the GSM8K module, and the new harness trustworthy:

the GSM8K probe answers how to measure correctness; the issue we hit was that the probe kept getting deferred and never run — an independent gate is what forces it at each accept and won't certify without it (it calls your module, doesn't replace it);
the improved harness makes the isolated measurement faithful; "faithful by construction" still benefits from being verified per run — when the harness is good, the independent check is a cheap no-op (it PASSed our GEMM rig this run) and only bites on a regression.

Net: keep improving the top agent, GSM8K, and the harness — and let an independent verifier re-derive from raw on top, because that independence is the one property a single agent can't give itself, no matter how good the prompt.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Independent Patch Auditor for the e2e_workflow (gated post-accept sign-off)#292

Independent Patch Auditor for the e2e_workflow (gated post-accept sign-off)#292
Umangatamd wants to merge 1 commit into
GEAK_v4from
v4_auditor

Umangatamd commented Jun 23, 2026 •

edited

Loading

Uh oh!

zihaoanllm commented Jun 23, 2026

Uh oh!

Umangatamd commented Jun 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Umangatamd commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR: Independent Patch Auditor for the GEAK e2e_workflow

Summary

The problem (grounded in our run corpus)

What this PR adds

How it works

Evidence it works (blind tests on real recorded runs + first live run)

Gaps closed

Honest limitations (by design)

Risk / safety

Uh oh!

zihaoanllm commented Jun 23, 2026

Uh oh!

Umangatamd commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Umangatamd commented Jun 23, 2026 •

edited

Loading

Umangatamd commented Jun 23, 2026 •

edited

Loading