feat(benchmarks): LongMemEval failure-mode diagnosis harness + judge quota preflight by jack-arturo · Pull Request #183 · verygoodplugins/automem

jack-arturo · 2026-06-11T01:45:10Z

Summary

- tests/benchmarks/longmemeval/diagnose_failures.py: two-stage classifier for failed-but-retrieved questions (is_correct=false AND recall_hit_at_5=true). Stage 1 is pure code (answer rank, abstention-despite-hit, stale-candidate-above-answer, noise ratio, date-math detection → documented 8-rule mode ladder); stage 2 (--llm) labels each failure with the pinned judge and reports a stage-1/stage-2 agreement matrix (transport errors are counted and excluded).
- tests/benchmarks/judge_preflight.py: one minimal judge call before any judged run; exits non-zero with an actionable message on 429/insufficient_quota or auth failure. Wired into test-longmemeval-benchmark.sh (only when --llm-eval is set). It would have caught the June 6 quota-compromised runs before they started.
- Harness instrumentation: retrieved_session_ids_full (all 10 recalled memories + scores) is recorded per question, which closes the rank-6-10 blind spot for future runs.
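
As a rough illustration of the stage-1 selection described above, the sketch below filters a result list down to failed-but-retrieved questions and computes an answer rank from the full retrieval list. The field names `is_correct`, `recall_hit_at_5`, and `retrieved_session_ids_full` come from this PR; the flat record layout, the `answer_session_ids` field, and both function names are assumptions made for the example, not the harness's actual code.

```python
from typing import Any, Optional


def failed_but_retrieved(results: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep questions answered incorrectly even though recall@5 was a hit."""
    return [
        r
        for r in results
        if r.get("is_correct") is False and r.get("recall_hit_at_5") is True
    ]


def answer_rank(
    retrieved_full: list[dict[str, Any]], answer_session_ids: set[str]
) -> Optional[int]:
    """1-based rank of the first recalled memory from an answer session."""
    for rank, entry in enumerate(retrieved_full, start=1):
        if entry["session_id"] in answer_session_ids:
            return rank
    return None  # the answer session was not recalled at all


# Synthetic records; the layout is hypothetical.
results = [
    {"question_id": "q1", "is_correct": True, "recall_hit_at_5": True},
    {
        "question_id": "q2",
        "is_correct": False,
        "recall_hit_at_5": True,
        "answer_session_ids": ["s7"],
        "retrieved_session_ids_full": [
            {"session_id": "s3", "score": 0.91},
            {"session_id": "s7", "score": 0.88},
        ],
    },
    {"question_id": "q3", "is_correct": False, "recall_hit_at_5": False},
]

targets = failed_but_retrieved(results)
ranks = {
    r["question_id"]: answer_rank(
        r["retrieved_session_ids_full"], set(r["answer_session_ids"])
    )
    for r in targets
}
print(ranks)  # {'q2': 2}
```

Only `q2` qualifies here: `q1` was answered correctly and `q3` missed retrieval, so neither belongs to the failed-but-retrieved set.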

Findings on the canonical run (87.0% accuracy, recall@5 97.2%)

58 of 69 failures had the answer retrieved in the top 5. LLM labels (judge gpt-5.4-mini-2026-03-17): answer-construction 42 (27 of them literal "I don't know" abstentions with the answer in context), missing-date-use 7, ranking 4, conflict-resolution 2, retrieval-gap 2, outdated-fact 1. This reprioritized the release: the harness answer-assembly path (see feat/date-aware-ranking) is the biggest benchmark lever; pure ranking fixes are the production-quality lever.
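
The label counts above can be cross-checked with a few lines of arithmetic. This sketch only restates the numbers quoted in the paragraph and derives shares from them; the variable names are illustrative.

```python
# Counts quoted in the findings paragraph.
failures_total = 69
retrieved_in_top5 = 58
abstentions = 27  # literal "I don't know" answers within answer-construction

llm_labels = {
    "answer-construction": 42,
    "missing-date-use": 7,
    "ranking": 4,
    "conflict-resolution": 2,
    "retrieval-gap": 2,
    "outdated-fact": 1,
}

# The six labels should account for every failed-but-retrieved question.
label_total = sum(llm_labels.values())
assert label_total == retrieved_in_top5  # 42 + 7 + 4 + 2 + 2 + 1 = 58

# Share of each failure mode among the failed-but-retrieved questions.
share = {mode: count / label_total for mode, count in llm_labels.items()}
for mode, fraction in sorted(share.items(), key=lambda kv: -kv[1]):
    print(f"{mode:20s} {llm_labels[mode]:3d}  {fraction:6.1%}")

abstention_share = abstentions / llm_labels["answer-construction"]
print(f"abstentions within answer-construction: {abstention_share:.1%}")
```

The six labels sum to 58, matching the failed-but-retrieved count, and answer-construction accounts for roughly 72% of them, which is the basis for the reprioritization stated above.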

Report artifact: benchmarks/results/failure_modes_canonical_llm_20260611.json (local, gitignored).

Testing

31 new tests (synthetic fixtures, mocked clients, no network); full suite 518 passed, 12 skipped; black + flake8 clean.

Refs #158, #159 (both issues require this failure-mode classification as their first acceptance criterion).

🤖 Generated with Claude Code

Copilot

Pull request overview

Adds a LongMemEval “failed-but-retrieved” diagnosis harness (for #158/#159) plus a lightweight judge quota/auth preflight to prevent wasted benchmark runs when the pinned judge is unavailable.

Changes:

Record retrieved_session_ids_full (rank-ordered session_id + score for all recalled memories) in LongMemEval result artifacts to remove the rank-6–10 visibility gap.
Introduce tests/benchmarks/longmemeval/diagnose_failures.py (stage-1 deterministic evidence + optional stage-2 judge labeling with an agreement matrix) and comprehensive unit tests.
Add tests/benchmarks/judge_preflight.py and wire it into test-longmemeval-benchmark.sh when --llm-eval is enabled.
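
To make the instrumentation change concrete, here is a minimal sketch of how a full rank-ordered list with scores differs from a top-5-unique list. The commit notes later on this page name a `_session_ids_with_scores` helper, but its signature is not shown, so the function names and the flat memory layout (`session_id`, `score`) below are assumptions.

```python
from typing import Any


def session_ids_with_scores(memories: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Every recalled memory in rank order, duplicates kept, with its score."""
    return [{"session_id": m["session_id"], "score": m["score"]} for m in memories]


def top_unique_session_ids(memories: list[dict[str, Any]], k: int = 5) -> list[str]:
    """First k distinct session ids in rank order (the older, narrower view)."""
    seen: list[str] = []
    for m in memories:
        if m["session_id"] not in seen:
            seen.append(m["session_id"])
        if len(seen) == k:
            break
    return seen


# Synthetic recall output, already sorted by rank.
memories = [
    {"session_id": "s1", "score": 0.95},
    {"session_id": "s1", "score": 0.93},
    {"session_id": "s2", "score": 0.90},
    {"session_id": "s3", "score": 0.85},
    {"session_id": "s4", "score": 0.80},
    {"session_id": "s5", "score": 0.70},
    {"session_id": "s6", "score": 0.60},
]

full = session_ids_with_scores(memories)
top5 = top_unique_session_ids(memories)
print(len(full), top5)
```

In this example `s6` is invisible in the top-5-unique view but present in the full list, which is the kind of lower-rank visibility the new key is meant to provide.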

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/benchmarks/longmemeval/test_longmemeval.py | Adds `retrieved_session_ids_full` to persisted results and implements helper to capture session IDs with scores. |
| tests/benchmarks/longmemeval/diagnose_failures.py | New two-stage failure-mode diagnosis CLI (deterministic evidence + optional LLM labeling). |
| tests/benchmarks/longmemeval/test_diagnose_failures.py | New unit tests covering stage-1 evidence/labeling, stage-2 parsing/errors, and judge preflight behavior. |
| tests/benchmarks/judge_preflight.py | New minimal judge call preflight with classified exit codes and actionable messages. |
| test-longmemeval-benchmark.sh | Runs judge preflight before benchmark ingestion/question loop when `--llm-eval` is requested. |

jack-arturo · 2026-06-11T16:09:10Z

+    if ! (cd "$SCRIPT_DIR" && "${PREFLIGHT_CMD[@]}"); then
+        echo -e "${RED}Judge preflight failed — aborting before the question loop${NC}"
+        exit 1
+    fi


Preserved the judge preflight exit status by capturing $? in the failure branch and exiting with it — test-longmemeval-benchmark.sh:276. GitHub marked the original line outdated after the push.

…#187)

## Summary

Stacked on #186. Driven by the failure-mode diagnosis in #183 (58 failed-but-retrieved LongMemEval questions: answer-construction 42, missing-date-use 7, ranking 4).

Server (production + benchmark):

- **Timestamp tiebreak** (always-on): exact score ties now order newest-first deterministically (`_score_sort_key`).
- **`recency_bias=auto|on|off`** (ships `off` via `RECALL_RECENCY_BIAS`): after dedup/state-filter and before the adaptive floor, candidate timestamps are min-max normalized and `SEARCH_WEIGHT_TEMPORAL` (default 0.1) × relative recency is added, so the newest version of a conflicting fact can outrank an older, heavier one. `auto` triggers on temporal intent ("latest", "current", "what changed", …; word-boundaried, "currency"/"nowhere" safe).
- **Supersession chain-walk**: `current` state mode now resolves INVALIDATED_BY/EVOLVED_INTO chains to their head (A→B→C surfaces C, provenance still points at A; depth-bounded at 5, cycle-safe, batched). *Honesty note: benchmark corpora carry no supersession edges; this is a production-correctness fix, not a score mover.*

Harness (benchmark-only, flag-gated for methodology reproducibility):

- **`temporal_answer_hint`** config flag (default off; new `temporal-answer` preset for A/B): chronological memory rendering with scores + conflict-recency guidance + anti-overabstention guidance (27 of the 58 failures were literal "I don't know" with the answer retrieved; abstention remains possible). Flag-off prompt is byte-identical to the canonical methodology (equality-tested).

## Testing

40 new tests (tiebreak, bias flip with the shipped default weight, auto-detection, chain-walk depth/cycle/query-count guards, #158 preference latest-wins acceptance, prompt byte-identity). Full suite 563 passed, 12 skipped; black + flake8 clean.
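
The server-side ranking changes in the summary above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the shipped code: the candidate layout (`score`, numeric `timestamp`), the function names, and the inclusion of "now" in the intent pattern are all hypothetical. Only the 0.1 default weight, the min-max normalization, the newest-first tiebreak, and the terms "latest", "current", and "what changed" come from the text.

```python
import re
from typing import Any

SEARCH_WEIGHT_TEMPORAL = 0.1  # default weight quoted in the summary

# Illustrative pattern; "now" is an assumption inferred from the "nowhere" remark.
_TEMPORAL_INTENT_RE = re.compile(
    r"\b(?:latest|current|what changed|now)\b", re.IGNORECASE
)


def has_temporal_intent(query: str) -> bool:
    """Word-boundaried check, so "currency" and "nowhere" do not trigger."""
    return _TEMPORAL_INTENT_RE.search(query) is not None


def apply_recency_bias(
    candidates: list[dict[str, Any]], weight: float = SEARCH_WEIGHT_TEMPORAL
) -> list[dict[str, Any]]:
    """Add weight * min-max-normalized recency to each candidate's score."""
    if not candidates:
        return []
    stamps = [c["timestamp"] for c in candidates]
    oldest, newest = min(stamps), max(stamps)
    span = newest - oldest
    biased = []
    for c in candidates:
        # With identical timestamps there is no relative recency to add.
        relative = (c["timestamp"] - oldest) / span if span else 0.0
        biased.append({**c, "score": c["score"] + weight * relative})
    return biased


def score_sort_key(candidate: dict[str, Any]) -> tuple[float, float]:
    """Higher score first; exact ties fall back to newest first."""
    return (-candidate["score"], -candidate["timestamp"])


candidates = [
    {"id": "old", "score": 0.80, "timestamp": 1_700_000_000},
    {"id": "new", "score": 0.75, "timestamp": 1_750_000_000},
]
ranked_off = sorted(candidates, key=score_sort_key)
ranked_on = sorted(apply_recency_bias(candidates), key=score_sort_key)
print(ranked_off[0]["id"], ranked_on[0]["id"])  # old new
```

With the bias off, the older and heavier fact wins; with it on, the newer fact gains 0.1 and moves ahead, which is the flip the summary describes.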
## Validation plan before enabling anything

Lab IR A/B → automem-evals 22-probe zero-delta gate (recency_bias=off default keeps it zero-delta) → LongMemEval mini judge-off (recall@5 ≥ 97.2% floor) → judged mini → full run; `temporal-answer` preset A/B for the harness flag, with server-vs-harness deltas reported separately in EXPERIMENT_LOG.

Refs #158, #159

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
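
The supersession chain-walk mentioned in the #187 summary (A→B→C surfaces C, provenance still points at A, depth-bounded at 5, cycle-safe) can be illustrated with a small in-memory sketch. The real implementation resolves INVALIDATED_BY/EVOLVED_INTO edges in batches against the store, which is not shown on this page; the `superseded_by` dictionary and the `resolve_head` name are stand-ins invented for this example.

```python
from typing import Optional

MAX_CHAIN_DEPTH = 5  # depth bound quoted in the summary


def resolve_head(
    start: str,
    superseded_by: dict[str, str],
    max_depth: int = MAX_CHAIN_DEPTH,
) -> dict[str, str]:
    """Follow supersession edges from `start` to the newest reachable memory.

    `superseded_by` maps a memory id to the id that replaced it. The walk
    stops at the chain head, at a cycle, or after `max_depth` hops.
    """
    seen = {start}
    current = start
    for _ in range(max_depth):
        nxt: Optional[str] = superseded_by.get(current)
        if nxt is None or nxt in seen:
            break  # reached the head, or the edge would revisit a node
        seen.add(nxt)
        current = nxt
    # Keep the original id so provenance still points at the starting memory.
    return {"head": current, "origin": start}


chain = {"A": "B", "B": "C"}
cycle = {"X": "Y", "Y": "X"}
long_chain = {f"m{i}": f"m{i + 1}" for i in range(10)}

print(resolve_head("A", chain))       # {'head': 'C', 'origin': 'A'}
print(resolve_head("X", cycle))       # {'head': 'Y', 'origin': 'X'}
print(resolve_head("m0", long_chain)) # {'head': 'm5', 'origin': 'm0'}
```

The `seen` set is what makes the walk cycle-safe, and the hop counter caps work on pathological chains even when no cycle exists.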

…judge preflight

Implements PR-2 of the LongMemEval failure-diagnosis plan (issues #158/#159):

- tests/benchmarks/longmemeval/diagnose_failures.py: two-stage CLI that classifies failed-but-retrieved questions (is_correct=false AND recall_hit_at_5=true). Stage 1 is pure code: joins failures to dataset haystack sessions/dates and emits per-question evidence (answer_rank, abstained_despite_hit, stale_candidate_above_answer, noise_ratio, date_arithmetic_needed, answer_coverage_top5) plus a deterministic suggested_mode documented in the module docstring. Stage 2 (--llm) asks the pinned benchmark judge for an independent label and records the stage1-vs-stage2 agreement matrix. Default type filter is all types (58 questions on the canonical full run; the four weak categories cover 54 of them).
- tests/benchmarks/judge_preflight.py: one minimal completion against the pinned judge model before benchmark runs; exit 0 ok, 2 quota/429, 3 auth, 1 other, with actionable one-line messages. Wired into test-longmemeval-benchmark.sh before the question loop when --llm-eval is set, so quota exhaustion aborts before any ingestion work.
- test_longmemeval.py instrumentation: additive details key retrieved_session_ids_full with ALL recalled memories' session ids in rank order and per-memory score (existing top-5-unique key unchanged).
- tests/benchmarks/longmemeval/test_diagnose_failures.py: synthetic-only unit tests for evidence extraction, the suggested_mode heuristic, the type filter, stage-2 with a mocked client, and preflight error classification. No network calls in tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
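
The preflight exit-code contract described above (0 ok, 2 quota/429, 3 auth, 1 other) could be implemented along these lines. This is a sketch with a mocked judge call rather than a real client: `JudgeCallError`, `classify`, `run_preflight`, and `failing` are hypothetical names, and treating HTTP 401/403 as the auth case is an assumption, since the commit message only says "auth".

```python
import sys
from typing import Callable, Optional

EXIT_OK, EXIT_OTHER, EXIT_QUOTA, EXIT_AUTH = 0, 1, 2, 3


class JudgeCallError(Exception):
    """Hypothetical error carrying an HTTP status and a provider error code."""

    def __init__(self, message: str, status_code: Optional[int] = None, code: str = ""):
        super().__init__(message)
        self.status_code = status_code
        self.code = code


def classify(err: JudgeCallError) -> int:
    """Map a failed judge call to the exit code listed in the commit message."""
    if err.status_code == 429 or err.code == "insufficient_quota":
        return EXIT_QUOTA
    if err.status_code in (401, 403):  # assumed auth statuses
        return EXIT_AUTH
    return EXIT_OTHER


def run_preflight(call_judge: Callable[[], object]) -> int:
    """Make one minimal judge call and return the exit code to use."""
    try:
        call_judge()
    except JudgeCallError as err:
        exit_code = classify(err)
        print(f"judge preflight failed (exit {exit_code}): {err}", file=sys.stderr)
        return exit_code
    except Exception as err:  # anything unclassified is "other"
        print(f"judge preflight failed (exit {EXIT_OTHER}): {err}", file=sys.stderr)
        return EXIT_OTHER
    return EXIT_OK


def failing(err: Exception) -> Callable[[], object]:
    """Build a fake judge call that raises `err` (test double, no network)."""
    def _call() -> object:
        raise err
    return _call


quota_call = failing(JudgeCallError("quota exhausted", 429, "insufficient_quota"))
auth_call = failing(JudgeCallError("invalid API key", 401))
print(run_preflight(lambda: "ok"), run_preflight(quota_call), run_preflight(auth_call))
```

A wrapper script would pass the returned value to `sys.exit`, so the calling shell script can distinguish quota exhaustion from an auth problem.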

Review fixes for the failure-diagnosis harness:

- run_stage2: records with llm_error set (transport/parse failures) no longer pollute the agreement matrix or exact_agreement; they are counted in a new agreement.llm_errors field and reported in the stdout summary. Per-record llm_error is unchanged.
- Import _result_details from analyze_results instead of duplicating it (module is side-effect-free at import).
- Comment the deliberate over-trigger in _DATE_MATH_RE (stage 2 cross-checks it).
- main(): only print the exact-agreement line when a rate exists (no stray blank line).
- _session_ids_with_scores docstring: note the full ranked pool is recorded for future rank-6-10 depth analysis, not yet consumed by diagnose_failures.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
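
The agreement fix described in the first bullet above can be sketched as follows. The field names `llm_error`, `suggested_mode`, `llm_errors`, and `exact_agreement` appear in the commit messages; the `llm_mode` field, the `build_agreement` name, and the output layout are assumptions for the example.

```python
from collections import Counter
from typing import Any


def build_agreement(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Stage-1 vs stage-2 agreement, with errored records counted separately."""
    matrix: Counter = Counter()
    llm_errors = 0
    for record in records:
        if record.get("llm_error"):
            llm_errors += 1  # counted, but kept out of the matrix
            continue
        matrix[(record["suggested_mode"], record["llm_mode"])] += 1

    labeled = sum(matrix.values())
    agreed = sum(n for (stage1, stage2), n in matrix.items() if stage1 == stage2)
    return {
        "matrix": dict(matrix),
        "llm_errors": llm_errors,
        # No rate when nothing was labeled, so callers can skip printing it.
        "exact_agreement": agreed / labeled if labeled else None,
    }


records = [
    {"suggested_mode": "ranking", "llm_mode": "ranking"},
    {"suggested_mode": "ranking", "llm_mode": "answer-construction"},
    {"suggested_mode": "answer-construction", "llm_mode": "answer-construction"},
    {"suggested_mode": "ranking", "llm_error": "timeout"},
]
agreement = build_agreement(records)
print(agreement["llm_errors"], agreement["exact_agreement"])
```

Here the timed-out record is reported in `llm_errors` and the rate is computed over the three labeled records only, so a transport failure cannot lower the agreement figure.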

Copilot AI review requested due to automatic review settings June 11, 2026 01:45

Copilot started reviewing on behalf of jack-arturo June 11, 2026 01:45

jack-arturo mentioned this pull request Jun 11, 2026

feat(recall): date-aware ranking + latest-fact selection (#158, #159) #187

Merged

Copilot AI reviewed Jun 11, 2026

jack-arturo and others added 3 commits June 11, 2026 21:05

fix(benchmarks): preserve judge preflight exit status

bbabb3b

jack-arturo force-pushed the feat/longmemeval-failure-diagnosis branch from c526bd4 to bbabb3b June 11, 2026 19:06

jack-arturo changed the base branch from main to develop June 11, 2026 19:06

jack-arturo merged commit f99bece into develop Jun 11, 2026
5 checks passed

jack-arturo deleted the feat/longmemeval-failure-diagnosis branch June 11, 2026 19:06
