feat(benchmarks): LongMemEval failure-mode diagnosis harness + judge quota preflight #183
Merged
Pull request overview
Adds a LongMemEval “failed-but-retrieved” diagnosis harness (for #158/#159) plus a lightweight judge quota/auth preflight to prevent wasted benchmark runs when the pinned judge is unavailable.
Changes:
- Record `retrieved_session_ids_full` (rank-ordered session_id + score for all recalled memories) in LongMemEval result artifacts to remove the rank-6–10 visibility gap.
- Introduce `tests/benchmarks/longmemeval/diagnose_failures.py` (stage-1 deterministic evidence + optional stage-2 judge labeling with an agreement matrix) and comprehensive unit tests.
- Add `tests/benchmarks/judge_preflight.py` and wire it into `test-longmemeval-benchmark.sh` when `--llm-eval` is enabled.
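As a rough illustration of what the `retrieved_session_ids_full` instrumentation records, the sketch below keeps every recalled memory in rank order as a session id and score pair, next to a top-5-unique list. The memory dict shape (`session_id` and `score` keys) and both function names are assumptions made for this example, not the harness's actual code.

```python
from typing import Any, Optional


def session_ids_with_scores(memories: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep every recalled memory, in rank order, as {session_id, score}."""
    return [
        {"session_id": m.get("session_id"), "score": m.get("score")}
        for m in memories
    ]


def top_unique_session_ids(memories: list[dict[str, Any]], k: int = 5) -> list[str]:
    """First k distinct session ids in rank order (duplicates collapsed)."""
    unique: list[str] = []
    for m in memories:
        sid: Optional[str] = m.get("session_id")
        if sid is not None and sid not in unique:
            unique.append(sid)
        if len(unique) == k:
            break
    return unique


# Hypothetical recall output, assumed to be sorted best-first already.
recalled = [
    {"session_id": "s1", "score": 0.91},
    {"session_id": "s1", "score": 0.88},
    {"session_id": "s2", "score": 0.70},
]
full = session_ids_with_scores(recalled)
```

Keeping duplicates and scores in the full list is what makes ranks 6 to 10 inspectable later; the unique list alone cannot show where a repeated session sat in the ranking.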
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/benchmarks/longmemeval/test_longmemeval.py | Adds retrieved_session_ids_full to persisted results and implements helper to capture session IDs with scores. |
| tests/benchmarks/longmemeval/diagnose_failures.py | New two-stage failure-mode diagnosis CLI (deterministic evidence + optional LLM labeling). |
| tests/benchmarks/longmemeval/test_diagnose_failures.py | New unit tests covering stage-1 evidence/labeling, stage-2 parsing/errors, and judge preflight behavior. |
| tests/benchmarks/judge_preflight.py | New minimal judge call preflight with classified exit codes and actionable messages. |
| test-longmemeval-benchmark.sh | Runs judge preflight before benchmark ingestion/question loop when --llm-eval is requested. |
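The "classified exit codes" mentioned for `judge_preflight.py` can be sketched roughly as follows. The code values (0 ok, 2 quota/429, 3 auth, 1 other) are taken from the commit message in this PR; the function names, the use of HTTP status codes 401/403 for auth failures, and the message wording are assumptions for illustration only.

```python
from typing import Optional

EXIT_OK = 0
EXIT_OTHER = 1
EXIT_QUOTA = 2
EXIT_AUTH = 3


def classify_preflight(status_code: Optional[int], message: str = "") -> int:
    """Map the outcome of one minimal judge call to a process exit code."""
    text = message.lower()
    if status_code == 200:
        return EXIT_OK
    if status_code == 429 or "insufficient_quota" in text:
        return EXIT_QUOTA
    if status_code in (401, 403):  # assumed auth-failure statuses
        return EXIT_AUTH
    return EXIT_OTHER


def preflight_message(exit_code: int) -> str:
    """One-line hint per exit code (hypothetical wording)."""
    return {
        EXIT_OK: "judge reachable",
        EXIT_QUOTA: "judge quota exhausted; top up or wait before a judged run",
        EXIT_AUTH: "judge credentials rejected; check the API key",
    }.get(exit_code, "judge preflight failed for another reason")
```

Distinct codes let the calling shell script tell a quota problem from an auth problem without parsing output, which is also why the script should propagate the preflight's own exit status.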
Comment on lines +276 to +279

    if ! (cd "$SCRIPT_DIR" && "${PREFLIGHT_CMD[@]}"); then
      echo -e "${RED}Judge preflight failed — aborting before the question loop${NC}"
      exit 1
    fi
Preserved the judge preflight exit status by capturing $? in the failure branch and exiting with it — test-longmemeval-benchmark.sh:276. GitHub marked the original line outdated after the push.
jack-arturo added a commit that referenced this pull request on Jun 11, 2026
…#187)

## Summary

Stacked on #186. Driven by the failure-mode diagnosis in #183 (58 failed-but-retrieved LongMemEval questions: answer-construction 42, missing-date-use 7, ranking 4).

Server (production + benchmark):
- **Timestamp tiebreak** (always-on): exact score ties now order newest-first deterministically (`_score_sort_key`).
- **`recency_bias=auto|on|off`** (ships `off` via `RECALL_RECENCY_BIAS`): after dedup/state-filter and before the adaptive floor, candidate timestamps are min-max normalized and `SEARCH_WEIGHT_TEMPORAL` (default 0.1) × relative recency is added, so the newest version of a conflicting fact can outrank an older, heavier one. `auto` triggers on temporal intent ("latest", "current", "what changed", …; word-boundaried, "currency"/"nowhere" safe).
- **Supersession chain-walk**: `current` state mode now resolves INVALIDATED_BY/EVOLVED_INTO chains to their head (A→B→C surfaces C, provenance still points at A; depth-bounded at 5, cycle-safe, batched). *Honesty note: benchmark corpora carry no supersession edges; this is a production-correctness fix, not a score mover.*

Harness (benchmark-only, flag-gated for methodology reproducibility):
- **`temporal_answer_hint`** config flag (default off; new `temporal-answer` preset for A/B): chronological memory rendering with scores + conflict-recency guidance + anti-overabstention guidance (27 of the 58 failures were literal "I don't know" with the answer retrieved; abstention remains possible). Flag-off prompt is byte-identical to the canonical methodology (equality-tested).

## Testing

40 new tests (tiebreak, bias flip with the shipped default weight, auto-detection, chain-walk depth/cycle/query-count guards, #158 preference latest-wins acceptance, prompt byte-identity). Full suite 563 passed, 12 skipped; black + flake8 clean.
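The timestamp tiebreak and recency bias described in the summary above can be sketched as below. This is a minimal illustration under stated assumptions: the candidate dict shape, the function names, and the `now` cue in the intent pattern (inferred from the "nowhere"-safe remark) are hypothetical, and the real server applies the bias at a specific pipeline stage that this sketch does not model.

```python
import re
from datetime import datetime, timezone
from typing import Any

# "latest", "current" and "what changed" come from the commit message;
# "now" is an assumed cue. \b keeps "nowhere" and "currency" from matching.
_TEMPORAL_INTENT_RE = re.compile(
    r"\b(?:latest|current|now|what changed)\b", re.IGNORECASE
)


def has_temporal_intent(query: str) -> bool:
    """Rough stand-in for the recency_bias=auto trigger."""
    return bool(_TEMPORAL_INTENT_RE.search(query))


def rank_with_recency(
    candidates: list[dict[str, Any]], weight: float = 0.1, enabled: bool = True
) -> list[dict[str, Any]]:
    """Add weight * min-max-normalized recency, then sort with a newest-first tiebreak."""
    if not candidates:
        return []
    stamps = [c["timestamp"].timestamp() for c in candidates]
    oldest = min(stamps)
    span = max(stamps) - oldest
    ranked = []
    for cand, stamp in zip(candidates, stamps):
        # Relative recency in [0, 1]; zero when disabled or all timestamps match.
        recency = (stamp - oldest) / span if (enabled and span > 0) else 0.0
        ranked.append({**cand, "score": cand["score"] + weight * recency})
    # Higher score first; on exact ties the newer timestamp wins.
    ranked.sort(key=lambda c: (-c["score"], -c["timestamp"].timestamp()))
    return ranked


old = {"id": "old", "score": 0.55, "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc)}
new = {"id": "new", "score": 0.50, "timestamp": datetime(2024, 6, 1, tzinfo=timezone.utc)}
```

With the default weight of 0.1, the newer candidate gains the full 0.1 and overtakes the older one despite starting 0.05 behind; with the bias disabled, the original score order is kept and only exact ties fall back to newest-first.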
## Validation plan before enabling anything

Lab IR A/B → automem-evals 22-probe zero-delta gate (recency_bias=off default keeps it zero-delta) → LongMemEval mini judge-off (recall@5 ≥ 97.2% floor) → judged mini → full run; `temporal-answer` preset A/B for the harness flag, with server-vs-harness deltas reported separately in EXPERIMENT_LOG.

Refs #158, #159

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
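The supersession chain-walk from that commit (resolve a memory to the head of its INVALIDATED_BY/EVOLVED_INTO chain, depth-bounded at 5 and cycle-safe) might look roughly like this. The `successor` mapping and the function name are assumptions for illustration; the commit describes a batched implementation, which this per-id sketch does not attempt.

```python
from typing import Optional


def resolve_chain_head(
    memory_id: str, successor: dict[str, Optional[str]], max_depth: int = 5
) -> str:
    """Follow 'superseded by' links from memory_id to the newest reachable version.

    successor maps a memory id to the id that supersedes it (hypothetical shape).
    The walk stops at the chain head, at max_depth hops, or when a cycle is seen.
    """
    current = memory_id
    visited = {current}
    for _ in range(max_depth):
        nxt = successor.get(current)
        if nxt is None or nxt in visited:
            break
        visited.add(nxt)
        current = nxt
    return current


# A -> B -> C: recalling A should surface C.
edges = {"A": "B", "B": "C"}
head = resolve_chain_head("A", edges)
```

The visited set is what makes the walk cycle-safe, and the hop bound keeps a pathological chain from turning one recall into an unbounded graph traversal.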
…judge preflight

Implements PR-2 of the LongMemEval failure-diagnosis plan (issues #158/#159):
- `tests/benchmarks/longmemeval/diagnose_failures.py`: two-stage CLI that classifies failed-but-retrieved questions (is_correct=false AND recall_hit_at_5=true). Stage 1 is pure code: joins failures to dataset haystack sessions/dates and emits per-question evidence (answer_rank, abstained_despite_hit, stale_candidate_above_answer, noise_ratio, date_arithmetic_needed, answer_coverage_top5) plus a deterministic suggested_mode documented in the module docstring. Stage 2 (`--llm`) asks the pinned benchmark judge for an independent label and records the stage1-vs-stage2 agreement matrix. Default type filter is all types (58 questions on the canonical full run; the four weak categories cover 54 of them).
- `tests/benchmarks/judge_preflight.py`: one minimal completion against the pinned judge model before benchmark runs; exit 0 ok, 2 quota/429, 3 auth, 1 other, with actionable one-line messages. Wired into `test-longmemeval-benchmark.sh` before the question loop when `--llm-eval` is set, so quota exhaustion aborts before any ingestion work.
- `test_longmemeval.py` instrumentation: additive details key `retrieved_session_ids_full` with ALL recalled memories' session ids in rank order and per-memory score (existing top-5-unique key unchanged).
- `tests/benchmarks/longmemeval/test_diagnose_failures.py`: synthetic-only unit tests for evidence extraction, the suggested_mode heuristic, the type filter, stage-2 with a mocked client, and preflight error classification. No network calls in tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
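To make the stage-1 selection concrete, here is a small sketch of the failed-but-retrieved filter and two of the evidence fields named above (`answer_rank`, `abstained_despite_hit`). The record layout, the function signatures, and the literal "I don't know" substring check are assumptions for illustration; the actual heuristics are documented in the module docstring, not reproduced here.

```python
from typing import Any, Optional


def failed_but_retrieved(results: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep questions answered incorrectly even though recall@5 hit."""
    return [
        r for r in results
        if r.get("is_correct") is False and r.get("recall_hit_at_5") is True
    ]


def answer_rank(retrieved_session_ids: list[str], answer_session_ids: set[str]) -> Optional[int]:
    """1-based rank of the first retrieved session that holds the answer, else None."""
    for rank, sid in enumerate(retrieved_session_ids, start=1):
        if sid in answer_session_ids:
            return rank
    return None


def abstained_despite_hit(response: str, rank: Optional[int]) -> bool:
    """True when the answer session was retrieved but the model still declined."""
    # Assumed abstention check: a plain substring match on "i don't know".
    return rank is not None and "i don't know" in response.lower()


# Hypothetical result records.
results = [
    {"qid": "q1", "is_correct": False, "recall_hit_at_5": True},
    {"qid": "q2", "is_correct": True, "recall_hit_at_5": True},
    {"qid": "q3", "is_correct": False, "recall_hit_at_5": False},
]
selected = failed_but_retrieved(results)
```

Only `q1` is selected: `q2` was answered correctly, and `q3` failed because retrieval missed, which is a different failure class from the one this harness diagnoses.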
Review fixes for the failure-diagnosis harness:
- `run_stage2`: records with `llm_error` set (transport/parse failures) no longer pollute the agreement matrix or `exact_agreement`; they are counted in a new `agreement.llm_errors` field and reported in the stdout summary. Per-record `llm_error` is unchanged.
- Import `_result_details` from `analyze_results` instead of duplicating it (module is side-effect-free at import).
- Comment the deliberate over-trigger in `_DATE_MATH_RE` (stage 2 cross-checks it).
- `main()`: only print the exact-agreement line when a rate exists (no stray blank line).
- `_session_ids_with_scores` docstring: note the full ranked pool is recorded for future rank-6-10 depth analysis, not yet consumed by `diagnose_failures`.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
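The agreement fix in the first bullet can be illustrated with a short sketch: records carrying `llm_error` are counted separately and left out of both the matrix and the exact-agreement rate. The `suggested_mode`, `llm_error`, `llm_errors`, and `exact_agreement` names follow the commit message; the `llm_mode` key and the function name are assumptions for this example.

```python
from collections import Counter
from typing import Any


def agreement_summary(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Cross-tabulate stage-1 vs stage-2 labels, skipping failed judge calls."""
    matrix: Counter = Counter()
    llm_errors = 0
    for rec in records:
        if rec.get("llm_error"):
            llm_errors += 1  # counted, but excluded from the matrix
            continue
        matrix[(rec["suggested_mode"], rec["llm_mode"])] += 1
    labeled = sum(matrix.values())
    agreed = sum(n for (stage1, stage2), n in matrix.items() if stage1 == stage2)
    return {
        "matrix": dict(matrix),
        "llm_errors": llm_errors,
        # None when nothing was labeled, so callers can skip printing a rate.
        "exact_agreement": agreed / labeled if labeled else None,
    }


records = [
    {"suggested_mode": "ranking", "llm_mode": "ranking"},
    {"suggested_mode": "ranking", "llm_mode": "answer-construction"},
    {"suggested_mode": "ranking", "llm_error": "timeout"},
]
summary = agreement_summary(records)
```

Here the rate is computed over the two labeled records only; folding the errored record into the denominator would understate agreement for reasons unrelated to labeling quality.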
c526bd4 to bbabb3b
This was referenced Jun 12, 2026
## Summary

- `tests/benchmarks/longmemeval/diagnose_failures.py`: two-stage classifier for failed-but-retrieved questions (is_correct=false AND recall_hit_at_5=true). Stage 1 is pure code (answer rank, abstention-despite-hit, stale-candidate-above-answer, noise ratio, date-math detection → documented 8-rule mode ladder); stage 2 (`--llm`) labels each failure with the pinned judge and reports a stage-1/stage-2 agreement matrix (transport errors counted and excluded).
- `tests/benchmarks/judge_preflight.py`: one minimal judge call before any judged run; exits non-zero with an actionable message on 429/insufficient_quota or auth failure. Wired into `test-longmemeval-benchmark.sh` (only when `--llm-eval`). Would have caught the June 6 quota-compromised runs before they started.
- `retrieved_session_ids_full` (all 10 recalled memories + scores) recorded per question; closes the rank-6-10 blind spot for future runs.

## Findings on the canonical run (87.0% accuracy, recall@5 97.2%)

58 of 69 failures had the answer retrieved in the top 5. LLM labels (judge `gpt-5.4-mini-2026-03-17`): answer-construction 42 (27 of them literal "I don't know" abstentions with the answer in context), missing-date-use 7, ranking 4, conflict-resolution 2, retrieval-gap 2, outdated-fact 1. This reprioritized the release: the harness answer-assembly path (see `feat/date-aware-ranking`) is the biggest benchmark lever; pure ranking fixes are the production-quality lever.

Report artifact: `benchmarks/results/failure_modes_canonical_llm_20260611.json` (local, gitignored).

## Testing

31 new tests (synthetic fixtures, mocked clients, no network); full suite 518 passed, 12 skipped; black + flake8 clean.

Refs #158, #159 (both issues require this failure-mode classification as their first acceptance criterion).
🤖 Generated with Claude Code