Independent Patch Auditor for the e2e_workflow (gated post-accept sign-off)#292

Open
Umangatamd wants to merge 1 commit into GEAK_v4 from v4_auditor
Umangatamd commented Jun 23, 2026

Collaborator

PR: Independent Patch Auditor for the GEAK e2e_workflow

Summary

Adds an independent, prompt-driven Patch Auditor that signs off every place the workflow banks a
decision: the baseline, the profile/routing, each config/kernel accept, the kernel harness, and the
final bundle. It re-derives every number from the raw bench_runs.jsonl (never from the producing role's
reported numbers), so a win that is unreal, unfairly measured, misattributed, or unsafe cannot be banked.
The producing agents are told to treat its verdict as authoritative and fix until it would pass.
Everything is additive and gated behind use_auditor (default OFF); with the flag off, the workflow is
byte-identical to before.

The problem (grounded in our run corpus)

Every producer persona either optimizes or grades its own work; the Integrator both builds the overlay
and gates it, and the gates were prompt-asserted prose an LLM can skip. Across real multi-model runs that
let through, repeatedly:

  • an orthogonal spec-decode lever (+34.5%) reported as a kernel win (Gap 1);
  • a benchmark-gamed "verified 1.5×/3.0×" isolated speedup whose deployed kernel was actually ~0.985×
    (slower), measured through a graph-replay path the live server never uses (Gap 5);
  • an accept recorded with null evidence fields (medians/non-overlap unprovable from the artifact) (Gap 4);
  • a +8–9% banked on a gibberish-emitting model where byte-parity is meaningless and the accuracy probe
    was perpetually deferred (Gap 7/8);
  • a mem-fraction-handicapped / misreported baseline inflating a headline (found live, see below).

The meta-finding: v4's fixes for these were prompt-level, not enforced — non-deterministic and
re-overfittable. This PR makes the verification independent and re-derived from raw, at every gate.

What this PR adds

  • Skill perf_knowledge/expert_skills/skills/patch_auditor/skill.md — the auditor persona + 5 audit
    scopes: patch, bundle, baseline, profile, harness. Verdict is PASS | FLAG | FAIL with a
    per-finding action. (+ proof/ backtest fixtures and index.yaml entry.)
  • Orchestrator wiring in e2e_workflow.js (+161/-13), additive + use_auditor-gated:
    • baseline sign-off after setup, with auto re-measure if the baseline is noisy/contended/wrong-invariant;
    • profile/route sign-off before strategize (closes mis-attribution at the source);
    • harness sign-off after the op bake-off, with an op-benchmarker redo loop on FAIL;
    • accept sign-off at the config gate + all three kernel/head integrate gates;
    • final bundle sign-off after validation.
  • Prompt-level enforcement in the producing roles (e2e_integrator, config_tuner, op_benchmarker,
    kernel_extractor): the auditor is authoritative; read its verdict, fix every reason, resubmit, and do
    not consider the work done until it would PASS.

How it works

  • Re-derive, never trust: every gate recomputes medians/min-max/non-overlap/TTFT/TPOT and diffs the two
    legs' launch config from the raw server_info; it cross-checks the reported summary against the raw.
  • Fix-until-pass, prompt-level: an explicit FAIL is fed back to the producing agent (with reasons) to
    fix and resubmit; the auditor re-audits from raw, so a superficial "fix" can't pass (independence).
  • Fail-OPEN on tooling failure: a degraded/no-verdict auditor never hard-blocks a real win (a parser
    glitch must not kill a result); only an explicit FAIL withholds a bank. Loops are bounded.
  • Separation of duties: the auditor optimizes/integrates nothing; the kernel changer never consults
    it and instead trusts a harness the auditor has already validated.
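
The "re-derive, never trust" step above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual logic: the record fields `leg` and `throughput`, the leg names, the shape of the reported summary, and the tolerance are all assumptions, since the PR does not show the bench_runs.jsonl schema. The sample numbers are made up.

```python
import json
import statistics


def rederive(jsonl_text, metric="throughput"):
    """Group raw per-rep values by leg and recompute summary stats.

    Assumes one JSON object per line with hypothetical keys
    "leg" and `metric`; the real schema may differ.
    """
    legs = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        legs.setdefault(rec["leg"], []).append(float(rec[metric]))
    return {
        leg: {"median": statistics.median(v), "min": min(v), "max": max(v)}
        for leg, v in legs.items()
    }


def non_overlap(stats, baseline="baseline", candidate="candidate"):
    """True when the slowest candidate rep beats the fastest baseline rep
    (higher is better)."""
    return stats[candidate]["min"] > stats[baseline]["max"]


def cross_check(stats, reported, rel_tol=1e-3):
    """List disagreements between the reported summary and the raw data."""
    reasons = []
    for leg, s in stats.items():
        claimed = (reported.get(leg) or {}).get("median")
        if claimed is None:
            # Reported nulls are ignored; the raw value is used instead.
            reasons.append(f"{leg}: reported median is null; using raw")
        elif abs(claimed - s["median"]) > rel_tol * s["median"]:
            reasons.append(f"{leg}: reported {claimed} != raw {s['median']}")
    return reasons


# Illustrative (made-up) raw reps, three per leg.
raw = "\n".join(
    json.dumps({"leg": leg, "throughput": value})
    for leg, value in [
        ("baseline", 100.0), ("baseline", 101.0), ("baseline", 99.5),
        ("candidate", 104.0), ("candidate", 105.0), ("candidate", 104.5),
    ]
)
stats = rederive(raw)
reported = {"baseline": {"median": 100.0}, "candidate": {"median": None}}
reasons = cross_check(stats, reported)
print(stats["candidate"]["median"], non_overlap(stats), reasons)
```

The point of the design is that the gate's inputs are the raw reps only; the reported summary is consulted solely to detect a mismatch.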

Evidence it works (blind tests on real recorded runs + first live run)

The auditor was run blind (skill + raw artifacts only; findings/answer-key withheld) against recorded
runs in exp/. It reasoned from primary evidence in every case — it did not pattern-match known cases:

| Case (real artifact) | What the auditor did | Verdict |
|---|---|---|
| FlyDSL GEMM "+66.9% clean win" (Qwen3.5-27B) | Re-derived raw legs: baseline ran at mem 0.7225 vs final 0.85 (handicapped) and director_e2e_validation.json misreported mem as 0.85 for both → confounded bundle, not a clean kernel win | FAIL / eject, and corrected a flaw the human answer key had missed |
| Authored GEMM integrate claiming isolated 3.0306× | Independently opened the kernel-layer opbench_result.json, found a self-comparing baseline (baseline backend == winner) and rel_err 45.4 on the only faster alt → the isolated number is fabricated; the e2e win is real | FLAG → replace_headline_with_e2e |
| MoE attention CK→triton "+9%" | Inspected the parity dumps: reference decode is gibberish (--dataset random), 0/12 byte-exact → byte-parity uninformative, cannot certify | FAIL / eject (unverifiable_noncoherent_reference) |
| Crashed authored attention kernel (iso 6.6×) | Saw the live server crashed (HSA fault) → do-no-harm violation | FAIL / eject (agreed with the justified reject) |
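
The same-conditions check behind the first row can be illustrated with a small launch-config diff. This is a sketch under assumptions: the key names `mem_fraction` and `tp_size` and the dict shape of server_info are hypothetical; only the 0.7225 and 0.85 values come from the case above.

```python
def diff_launch_config(baseline_info, candidate_info, allowed=()):
    """Return {key: (baseline, candidate)} for every launch setting that
    differs between the two legs and is not an intended change."""
    keys = set(baseline_info) | set(candidate_info)
    return {
        k: (baseline_info.get(k), candidate_info.get(k))
        for k in sorted(keys)
        if k not in allowed and baseline_info.get(k) != candidate_info.get(k)
    }


# Raw per-leg launch settings (key names are hypothetical).
baseline_info = {"mem_fraction": 0.7225, "tp_size": 1}
candidate_info = {"mem_fraction": 0.85, "tp_size": 1}

# What the producing role's summary claimed for the baseline leg.
reported_baseline_mem = 0.85

confounds = diff_launch_config(baseline_info, candidate_info)
misreported = reported_baseline_mem != baseline_info["mem_fraction"]
verdict = "FAIL" if confounds or misreported else "PASS"
print(confounds, misreported, verdict)
```

Any unexplained difference means the delta cannot be attributed to the kernel alone, which is why the case was ejected as a confounded bundle.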

First live (auditor-enabled) run: the in-loop baseline sign-off fired automatically. It
re-derived the baseline reps [966.36, 1004.05, 1004.71], flagged a spread of 3.82% (>7× the 0.5% noise
band) driven by a rep-0 warm-up outlier, and recommended a re-measure. This is exactly the "too noisy to
gate a sub-4% win" blind spot, caught in-run with actionable advice. With this PR, that FLAG now
auto-triggers a baseline re-measure before anything is gated on it.
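
As a worked check of the spread figure, the sketch below assumes spread is defined as (max − min) / median, which reproduces the reported 3.82%; the PR does not state the auditor's exact formula, and the re-measure rule shown is likewise an assumption.

```python
import statistics

NOISE_BAND_PCT = 0.5  # noise band quoted in the run report


def relative_spread_pct(reps):
    """Spread of the reps as a percentage of their median (assumed definition)."""
    return 100.0 * (max(reps) - min(reps)) / statistics.median(reps)


reps = [966.36, 1004.05, 1004.71]
spread = relative_spread_pct(reps)
print(f"{spread:.2f}%")  # 3.82%

# Dropping the rep-0 warm-up outlier shows where the noise comes from.
warm_spread = relative_spread_pct(reps[1:])

# Assumed rule: re-measure whenever the spread exceeds the noise band.
needs_remeasure = spread > NOISE_BAND_PCT
```

With rep 0 removed, the two remaining reps differ by well under the 0.5% band, which is consistent with the warm-up diagnosis.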

Gaps closed

| Gap | How the auditor closes it |
|---|---|
| 1 (orthogonal lever as kernel win) | lever A/B classification at every accept; B never folded into a kernel headline |
| 2 (profiler mis-attribution / mis-routing) | profile gate re-aggregates GPU-time by op, checks fragmentation + edit-flag + skip, before routing |
| 4 (accept without evidence) | re-derives from raw; reported nulls are ignored; missing raw ⇒ FAIL |
| 5 (benchmark-gaming) | harness gate enforces dispatch/launch parity + fair A/B + served shapes; e2e gate backstops isolated≠e2e |
| 6 (storm) | same-conditions uncontended-legs check (detection) |
| 6 (latency) | headline-integrity flags a TTFT/TPOT regression hidden by a throughput headline |
| 7/8 (correctness=determinism) | coherence check; non-coherent reference ⇒ unverifiable ⇒ FAIL |
| baseline quality | baseline gate flags a noisy/handicapped reference and auto re-measures |
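
The headline-integrity row (Gap 6, latency) might look something like the following. It is only a sketch: the metric keys, the 1% tolerance, and the sample numbers are illustrative assumptions, not values from the PR.

```python
def headline_integrity(baseline, candidate, tol_pct=1.0):
    """Return (metric, pct_change) for each latency metric that regressed
    by more than tol_pct, regardless of any throughput gain."""
    findings = []
    for metric in ("ttft_ms", "tpot_ms"):  # hypothetical key names
        change = 100.0 * (candidate[metric] - baseline[metric]) / baseline[metric]
        if change > tol_pct:  # lower is better for latency
            findings.append((metric, round(change, 2)))
    return findings


# Made-up medians: throughput improves while TTFT gets worse.
baseline = {"throughput": 1000.0, "ttft_ms": 200.0, "tpot_ms": 20.0}
candidate = {"throughput": 1080.0, "ttft_ms": 230.0, "tpot_ms": 20.1}

findings = headline_integrity(baseline, candidate)
verdict = "FLAG" if findings else "PASS"
print(findings, verdict)
```

Here the throughput headline would read +8%, but the TTFT regression is surfaced next to it instead of being hidden.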

Honest limitations (by design)

  • It catches false accepts, not missed wins / wasted runs (that's the future Researcher/PI persona).
  • It detects contamination; it does not prevent the process storm (the reaper does).
  • The interpretive judgments (lever-class, coherence) are LLM calls anchored on the objective gates; their
    reliability at scale is not yet proven (the N×/multi-model sweep is future work).
  • The harness redo-loop is wired on the serial head path (the default); the fast-mode parallel path and
    the kernel-track nested workflow rely on the maker prompts (loop wiring there is a follow-up).

Risk / safety

  • Purely additive + use_auditor-gated (default OFF) → byte-identical to the current workflow when off
    (verified). All loops are bounded; a degraded auditor fails open (never blocks a real win or hangs).
  • Live e2e run (exp/e2e_Qwen-Qwen3.5-27B-FP8_20260622_133415_176030_24863) is exercising it end-to-end
    with use_auditor=true.
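
The bounded, fail-open gating described above can be sketched as below. The actual wiring lives in e2e_workflow.js and is not shown in this PR text, so the function names, the callback signatures, and the round limit here are all hypothetical.

```python
def gate_with_auditor(audit, fix, artifact, max_rounds=3):
    """Return (banked, artifact, history).

    Only an explicit FAIL withholds a bank. A degraded auditor (any
    exception) fails open, and the loop never exceeds max_rounds audits.
    """
    history = []
    for round_no in range(max_rounds):
        try:
            verdict, reasons = audit(artifact)
        except Exception as exc:
            history.append(("DEGRADED", [str(exc)]))
            return True, artifact, history  # fail open on tooling failure
        history.append((verdict, reasons))
        if verdict != "FAIL":
            return True, artifact, history  # PASS or FLAG does not block
        if round_no < max_rounds - 1:
            # Feed the reasons back to the producer and re-audit from raw.
            artifact = fix(artifact, reasons)
    return False, artifact, history


# Stub auditor and producer, for illustration only.
def strict_audit(artifact):
    if not artifact.get("fixed"):
        return "FAIL", ["medians overlap"]
    return "PASS", []


def apply_fix(artifact, reasons):
    return {**artifact, "fixed": True}


def broken_audit(artifact):
    raise ValueError("could not parse verdict")


banked, _, history = gate_with_auditor(strict_audit, apply_fix, {"fixed": False})
open_banked, _, open_history = gate_with_auditor(broken_audit, apply_fix, {})
stuck_banked, _, stuck_history = gate_with_auditor(
    lambda a: ("FAIL", ["still failing"]), apply_fix, {}
)
```

The trade-off is deliberate: failing open accepts a small risk of an unaudited bank in exchange for never losing a real win to a parser glitch.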

Adds an independent, prompt-driven Patch Auditor that re-derives every number from raw
bench_runs.jsonl and signs off every place the workflow banks a decision: the baseline,
the profile/routing, each config/kernel accept, the kernel harness, and the final bundle.

- skill: perf_knowledge/expert_skills/skills/patch_auditor/ (5 scopes: patch|bundle|baseline|
  profile|harness; PASS|FLAG|FAIL verdicts; + proof/ backtest fixtures + index.yaml entry)
- wiring: e2e_workflow.js, additive + gated behind use_auditor (default OFF -> byte-identical
  when off). baseline FLAG auto-triggers a re-measure; harness FAIL loops the op-benchmarker.
- prompt-level enforcement: e2e_integrator / config_tuner / op_benchmarker / kernel_extractor
  treat the auditor as authoritative and fix until it would PASS.

Catches (blind backtests + first live run): orthogonal-lever (Gap 1), benchmark-gaming (Gap 5),
no-evidence accepts (Gap 4), correctness=determinism on gibberish (Gap 7/8), mem-fraction
same-conditions confound + misreport, and a noisy baseline. See AUDITOR_PR.md (writeup) and
AUDITOR_RUN_REPORT.md (in-run evidence).

Co-authored-by: Cursor <cursoragent@cursor.com>
@zihaoanllm

Collaborator

Hi @Umangatamd,

Regarding the improvement of global verification accuracy: from the GEAK_v4 architecture perspective, this falls under the responsibility of the top agent. It can be addressed by improving precision and enhancing the harness, so we do not need to introduce a new role/step.

For global accuracy verification, I am currently working on a downstream task validation module, aligned with inferenceMax, using GSM8K.

As for the harness/unittest, it has already been enhanced in the recent versions. Liuyue will also provide a new version of the UT construction soon.

Umangatamd commented Jun 23, 2026

Collaborator Author

Hey @zihaoanllm, thanks for the detailed review. We're aligned on the GSM8K module and the harness/UT work, and the auditor is meant to build on both rather than duplicate them.

The one thing we'd surface is around relying on the top agent's prompt/precision for verification: in our experience that path is fragile, and it's precisely why a separate, independent verifier is the safer complement.

Why single-agent prompt-level verification is risky:

  • Self-attestation, not verification. When the agent that produces the win also signs it off, there's an inherent conflict of interest — like a team auditing its own books. Improving that agent's prompt makes it more precise, but it can't make it independent of itself.
  • Blind spots can't be prompted away. On a recorded v4 run, the Director's own director_e2e_validation.json misreported the serving invariant (claimed mem 0.85 on both legs; the raw server_info was 0.7225), turning a confounded, baseline-handicapped delta into a "clean +66.9%." A more careful prompt doesn't catch this — the agent can't notice what it can't see. An independent re-derivation from the raw bench_runs.jsonl does. (Same for the fabricated isolated 3.03x from an op-bench self-comparing baseline, and the +9% banked on a gibberish MoE via a parity heuristic — all passed the existing top-agent flow.)
  • Prompt fixes regress. The corpus meta-finding was that v4's verification fixes were prompt-level — and prompt-level guarantees are non-deterministic and re-overfittable (e.g. the planted spec-decode bias is just text; nothing stops it coming back). The piece that doesn't drift is independent re-derivation from raw + a gate that can't narrate around a FAIL.

So a separate verifier isn't a competing approach — it's the safety net that makes the top agent, the GSM8K module, and the new harness trustworthy:

  • the GSM8K probe answers how to measure correctness; the issue we hit was that the probe kept getting deferred and never run — an independent gate is what forces it at each accept and won't certify without it (it calls your module, doesn't replace it);
  • the improved harness makes the isolated measurement faithful; "faithful by construction" still benefits from being verified per run — when the harness is good, the independent check is a cheap no-op (it PASSed our GEMM rig this run) and only bites on a regression.

Net: keep improving the top agent, GSM8K, and the harness — and let an independent verifier re-derive from raw on top, because that independence is the one property a single agent can't give itself, no matter how good the prompt.
