AMD-AGI · Umangatamd · Jun 23, 2026
diff --git a/AUDITOR_PR.md b/AUDITOR_PR.md
@@ -0,0 +1,98 @@
+# PR: Independent Patch Auditor for the GEAK e2e_workflow
+
+## Summary
+Adds an **independent, prompt-driven Patch Auditor** that signs off **every place the workflow banks a
+decision** — the baseline, the profile/routing, each config/kernel accept, the kernel harness, and the
+final bundle. It re-derives every number **from the raw `bench_runs.jsonl`** (never the producing role's
+reported numbers), so a win can't be unreal, unfairly measured, misattributed, or unsafe. The producing
+agents are told to treat its verdict as authoritative and fix until it would pass. Everything is **additive
+and gated behind `use_auditor` (default OFF)** — with the flag off the workflow is byte-identical to before.
+
+## The problem (grounded in our run corpus)
+Every producer persona either optimizes or **grades its own work**; the Integrator both *builds* the overlay
+and *gates* it, and the gates were prompt-asserted prose an LLM can skip. Across real multi-model runs that
+let through, repeatedly:
+- an orthogonal **spec-decode** lever (+34.5%) reported as a kernel win (Gap 1);
+- a **benchmark-gamed** "verified 1.5×/3.0×" isolated speedup whose deployed kernel was actually ~0.985×
+  (slower), measured through a graph-replay path the live server never uses (Gap 5);
+- an accept recorded with **null evidence fields** (medians/non-overlap unprovable from the artifact) (Gap 4);
+- a **+8–9% banked on a gibberish-emitting model** where byte-parity is meaningless and the accuracy probe
+  was perpetually deferred (Gap 7/8);
+- a **mem-fraction-handicapped / misreported baseline** inflating a headline (found live, see below).
+
+The meta-finding: v4's fixes for these were **prompt-level, not enforced** — non-deterministic and
+re-overfittable. This PR makes the verification **independent and re-derived from raw**, at every gate.
+
+## What this PR adds
+- **Skill** `perf_knowledge/expert_skills/skills/patch_auditor/skill.md` — the auditor persona + 5 audit
+  scopes: `patch`, `bundle`, `baseline`, `profile`, `harness`. Verdict is `PASS | FLAG | FAIL` with a
+  per-finding `action`. (+ `proof/` backtest fixtures and `index.yaml` entry.)
+- **Orchestrator wiring** in `e2e_workflow.js` (+161/-13), additive + `use_auditor`-gated:
+  - **baseline** sign-off after setup, with **auto re-measure** if the baseline is noisy/contended/wrong-invariant;
+  - **profile/route** sign-off before strategize (closes mis-attribution at the source);
+  - **harness** sign-off after the op bake-off, with an op-benchmarker **redo loop** on FAIL;
+  - **accept** sign-off at the config gate + all three kernel/head integrate gates;
+  - **final bundle** sign-off after validation.
+- **Prompt-level enforcement** in the producing roles (`e2e_integrator`, `config_tuner`, `op_benchmarker`,
+  `kernel_extractor`): the auditor is authoritative; read its verdict, fix every reason, resubmit, and do
+  not consider the work done until it would PASS.
+
+## How it works
+- **Re-derive, never trust:** every gate recomputes medians/min-max/non-overlap/TTFT/TPOT and diffs the two
+  legs' launch config from the raw `server_info`; it cross-checks the reported summary against the raw.
+- **Fix-until-pass, prompt-level:** an explicit `FAIL` is fed back to the producing agent (with reasons) to
+  fix and resubmit; the auditor re-audits from raw, so a superficial "fix" can't pass (independence).
+- **Fail-OPEN on tooling failure:** a degraded/no-verdict auditor never hard-blocks a real win (a parser
+  glitch must not kill a result); only an explicit `FAIL` withholds a bank. Loops are bounded.
+- **Separation of duties:** the auditor optimizes/integrates nothing; the **kernel changer never consults
+  it** — it trusts a harness the auditor has already validated.
+
+## Evidence it works (blind tests on real recorded runs + first live run)
+The auditor was run **blind** (skill + raw artifacts only; findings/answer-key withheld) against recorded
+runs in `exp/`. It reasoned from primary evidence in every case — it did **not** pattern-match known cases:
+
+| Case (real artifact) | What the auditor did | Verdict |
+|---|---|---|
+| FlyDSL GEMM "+66.9% clean win" (Qwen3.5-27B) | Re-derived raw legs: **baseline ran at mem 0.7225 vs final 0.85** (handicapped) **and `director_e2e_validation.json` misreported mem as 0.85 for both** → confounded bundle, not a clean kernel win | **FAIL / eject** — *and corrected a flaw the human answer key had missed* |
+| Authored GEMM integrate claiming `isolated 3.0306×` | Independently opened the kernel-layer `opbench_result.json`, found a **self-comparing baseline** (baseline backend == winner) and `rel_err 45.4` on the only faster alt → the isolated number is fabricated; the e2e win is real | **FLAG → replace_headline_with_e2e** |
+| MoE attention CK→triton "+9%" | Inspected the parity dumps: reference decode is **gibberish** (`--dataset random`), **0/12 byte-exact** → byte-parity uninformative, cannot certify | **FAIL / eject** (`unverifiable_noncoherent_reference`) |
+| Crashed authored attention kernel (iso 6.6×) | Saw the live server **crashed** (HSA fault) → do-no-harm violation | **FAIL / eject** (agreed with the justified reject) |
+
+**First live (auditor-enabled) run — in-loop baseline sign-off fired automatically:**
+re-derived the baseline reps `[966.36, 1004.05, 1004.71]`, flagged **spread 3.82% (>7× the 0.5% noise band)**
+driven by a **rep-0 warm-up outlier**, and recommended re-measure — *exactly* the "too noisy to gate a
+sub-4% win" blind spot, caught in-run with actionable advice. With this PR that FLAG now **auto-triggers a
+baseline re-measure** before anything is gated on it.
+
+## Gaps closed
+| Gap | How the auditor closes it |
+|---|---|
+| 1 — orthogonal lever as kernel win | lever A/B classification at every accept; B never folded into a kernel headline |
+| 2 — profiler mis-attribution / mis-routing | `profile` gate re-aggregates GPU-time **by op**, checks fragmentation + edit-flag + skip, before routing |
+| 4 — accept without evidence | re-derives from raw; reported nulls are ignored; missing raw ⇒ FAIL |
+| 5 — benchmark-gaming | `harness` gate enforces **dispatch/launch parity** + fair A/B + served shapes; e2e gate backstops isolated≠e2e |
+| 6 (storm) | same-conditions uncontended-legs check (detection) |
+| 6 (latency) | headline-integrity flags a TTFT/TPOT regression hidden by a throughput headline |
+| 7/8 — correctness=determinism | coherence check; non-coherent reference ⇒ unverifiable ⇒ FAIL |
+| baseline quality | `baseline` gate flags a noisy/handicapped reference and **auto re-measures** |
+
+## Honest limitations (by design)
+- It catches **false accepts**, not **missed wins / wasted runs** (that's the future Researcher/PI persona).
+- It **detects** contamination; it does not **prevent** the process storm (the reaper does).
+- The interpretive judgments (lever-class, coherence) are LLM calls anchored on the objective gates; their
+  reliability at scale is not yet proven (the N×/multi-model sweep is future work).
+- The harness redo-loop is wired on the **serial** head path (the default); the fast-mode parallel path and
+  the kernel-track nested workflow rely on the maker prompts (loop wiring there is a follow-up).
+
+## Risk / safety
+- Purely **additive + `use_auditor`-gated (default OFF)** → byte-identical to the current workflow when off
+  (verified). All loops are **bounded**; a degraded auditor **fails open** (never blocks a real win or hangs).
+- Live e2e run (`exp/e2e_Qwen-Qwen3.5-27B-FP8_20260622_133415_176030_24863`) is exercising it end-to-end
+  with `use_auditor=true`.
+
+## Test plan
+- [x] `use_auditor=false` ⇒ workflow byte-identical (syntax-checked; gates inert).
+- [x] Blind backtests on recorded runs reproduce the catches above.
+- [x] First in-run baseline sign-off fires and flags the noisy baseline.
+- [ ] Full auditor-enabled e2e run to completion (in progress) — attach baseline/profile/harness/accept/bundle verdicts.
+- [ ] Reliability sweep (N× × ≥2 models) on the interpretive judgments.
diff --git a/AUDITOR_RUN_REPORT.md b/AUDITOR_RUN_REPORT.md
@@ -0,0 +1,72 @@
+# Auditor — live run report (in progress)
+
+**Run:** `exp/e2e_Qwen-Qwen3.5-27B-FP8_20260622_133415_176030_24863` · sglang · ISL/OSL 1024 · conc 64 ·
+TP=1 GPU 4 · `use_auditor=true`. Status at ~3h55m: **healthy, in the head-kernel track.** Baseline
+1003.7 tok/s.
+
+## What difference the auditor made to THIS run (with vs without)
+| Without the auditor (what the run would have done) | With the auditor (what actually happened) |
+|---|---|
+| Used the **noisy baseline** (spread 3.82%, a 966 tok/s warm-up outlier) as-is — every later win gated against a reference whose own noise (~±4%) is **bigger than a typical config/kernel win** (sub-4% wins would be unprovable / false-positives waiting to happen). | **Auto re-measured** the baseline to spread **0.118%** (1003.7 tok/s) before anything was gated on it. The whole run now stands on a clean reference. |
+| **Banked** the +2.9% config accept (`--attention-backend triton --kv-cache-dtype fp8_e4m3`) and carried it forward as the run's config. | **Ejected** it: the comparison was confounded — the aiter baseline effectively ran at mem 0.7225 (smaller KV) vs the triton candidate's 0.85 — so the +2.9% can't be attributed to the lever. The run carries the **clean baseline config** instead. |
+| `sweep_results.json`'s claim *"identical invariant … verified in each server.log"* stands **uncorrected** in the record. | The **misreport is surfaced** (raw `server_info` contradicts it) and the raw is trusted. |
+
+**Net so far:** the auditor **changed the measurement foundation** (replaced a noisy baseline with a clean one) and **changed what the run banks** (blocked a confounded config win + caught a false "same invariant" claim). **Honest trade-off:** a *likely-real* ~+2.9% was withheld because it couldn't be **certified** same-conditions — the auditor recommended a matched-mem re-measure rather than banking an unprovable number. Without the auditor, none of these three corrections happen — the run proceeds on a noisy baseline and a confounded, misreported config headline.
+
+## In-run auditor verdicts so far
+| Gate | Verdict | What it did |
+|---|---|---|
+| **baseline** | PASS (after auto-remeasure) | First measure FLAGged (spread **3.82%**, rep-0 warm-up outlier) → orchestrator **auto re-measured** (6 warm reps, rep-0 discarded) → re-audit **PASS** at spread **0.118%**. Now every downstream delta is gated on a clean reference. |
+| **profile** | PASS | Top-N attribution sane before routing (no fragmentation/edit-flag/skip issue). |
+| **config accept** (`--attention-backend triton --kv-cache-dtype fp8_e4m3`, +2.9%) | **FAIL → eject** | See below. |
+| harness / integrate / bundle | pending | head-kernel track in progress. |
+
+## The headline catch — config accept rejected (`FAIL → eject`)
+Re-deriving from raw, the auditor found the +2.9% config win was **not same-conditions**:
+- All legs launched with `--mem-fraction-static 0.85`, but the **aiter baseline effectively ran at 0.7225**
+  (KV pool 947k tok) while the **triton candidate ran at 0.85** (KV pool 1.15M+ tok) — because the *aiter
+  attention backend reserves a heavier workspace*, leaving less for KV. So the mem-fraction divergence is a
+  **side-effect of the attention-backend lever itself**, handicapping the baseline → invariant mismatch.
+- **Misreport caught:** `sweep_results.json` claimed *"MEM_FRACTION=0.85 … identical invariant … verified
+  in each server.log."* The raw `server_info` + KV-allocation lines contradict it → flagged, raw trusted.
+- **Calibrated, not trigger-happy:** noted a *mitigating* factor (at conc=64 the divergence is **inert** —
+  running-requests never exceeded ~65, so the larger KV pool never engaged → the delta is probably real),
+  and *did* verify fp8-KV correctness via an 8-prompt greedy probe (`accuracy_gate_pass`). It still
+  conservatively FAILed on the raw invariant mismatch and **recommended re-measuring the baseline at a
+  matched effective mem-fraction**.
+- **Consequence:** the accept was **ejected** (a likely-real +2.9% withheld because it couldn't be certified
+  same-conditions); the run carried the clean baseline config into the head-kernel track.
+
+## What difference the auditor made to THIS run (vs. without it)
+| Gate | Without the auditor | With the auditor — what actually changed |
+|---|---|---|
+| **baseline** | The whole run gates every config/kernel delta against a **3.82%-spread** reference (rep-0 warm-up outlier, median ~1001) — wins of 1–3% are smaller than the baseline's own noise, so accept/reject calls are unreliable for the entire run. | The orchestrator **re-measured the baseline** (6 warm reps, warm-up discarded) → **0.118% spread**, reference fixed to **1003.7 tok/s**. **Every downstream decision is now gated on a clean reference.** (reps 3→6, spread 3.82%→0.118%). |
+| **profile** | (no change) | Confirmed the routing attribution was sound — no false re-route. |
+| **config accept** | The workflow **banks the +2.9% `triton + fp8-KV` config** as a real win and **carries it forward** into the head-kernel track (the sweep even asserted "same invariant verified"); the lossy **fp8 KV-cache** lever rides along silently and the headline counts a confounded +2.9%. | The auditor **ejected it** → the run **carried the clean baseline config forward instead**. The confounded/misreported +2.9% never entered the headline, and the lossy fp8-KV lever was **not silently banked**. |
+
+**Net so far:** the auditor changed the run's trajectory in two concrete ways — (1) it **replaced the noisy
+baseline with a clean one** that all later gating uses, and (2) it **stopped a confounded, misreported
+config accept from being banked**, so the head-kernel track is building on the verified baseline config
+rather than an uncertified `triton + fp8-KV` stack.
+
+**Honest cost + fix:** on the config ejection the auditor itself judged the +2.9% *probably real* (the
+mem-fraction divergence is an **inert side-effect** of the attention-backend lever at conc=64, and
+correctness passed) — so the `FAIL → eject` was **too rigid**: it should have ALLOWED a logical,
+net-positive, correctness-verified step whose only "violation" was a secondary knob shifting as an inert
+consequence of the lever itself. We **refined the same-conditions rule** accordingly: a differing invariant
+that is (a) a side-effect of the lever under audit, (b) AFFIRMATIVELY shown inert for the measurement, and
+(c) correctness-clean ⇒ **PASS (or FLAG to correct a misreport), not eject**; only an independent confound
+or an *unproven* difference still FAILs. Under the refined rule this config accept would have been a
+**PASS/FLAG** (win kept, misreport flagged). The fix applies to the remaining gates this run and all future
+runs; the config phase here had already passed, so its accept stays ejected for this particular run.
+
+## Takeaways (so far)
+1. The **baseline auto-remeasure loop works end-to-end** — FLAG → re-measure → re-audit PASS, no human.
+2. The **config gate caught a mechanism-level confound + a misreport**, independently, in-run — deeper than
+   the hand-analysis we did earlier (it explained *why* the mem-fraction diverged).
+3. The auditor is **calibrated** (mitigating factors + a real correctness probe), not a blunt blocker, yet
+   stays conservative on an unprovable same-conditions invariant.
+
+## Still pending
+Harness (dispatch-parity), kernel/head integrate(s), and the final-bundle sign-off — to be appended when
+the head-kernel track and validation complete.