Skip to content

feat(benchmarks): LongMemEval failure-mode diagnosis harness + judge quota preflight#183

Merged
jack-arturo merged 3 commits into
developfrom
feat/longmemeval-failure-diagnosis
Jun 11, 2026
Merged

feat(benchmarks): LongMemEval failure-mode diagnosis harness + judge quota preflight#183
jack-arturo merged 3 commits into
developfrom
feat/longmemeval-failure-diagnosis

Conversation

@jack-arturo

Copy link
Copy Markdown
Member

Summary

  • tests/benchmarks/longmemeval/diagnose_failures.py: two-stage classifier for failed-but-retrieved questions (is_correct=false AND recall_hit_at_5=true). Stage 1 is pure code (answer rank, abstention-despite-hit, stale-candidate-above-answer, noise ratio, date-math detection → documented 8-rule mode ladder); stage 2 (--llm) labels each failure with the pinned judge and reports a stage-1/stage-2 agreement matrix (transport errors counted and excluded).
  • tests/benchmarks/judge_preflight.py: one minimal judge call before any judged run; exits non-zero with an actionable message on 429/insufficient_quota or auth failure. Wired into test-longmemeval-benchmark.sh (only when --llm-eval). Would have caught the June 6 quota-compromised runs before they started.
  • Harness instrumentation: retrieved_session_ids_full (all 10 recalled memories + scores) recorded per question — closes the rank-6-10 blind spot for future runs.

Findings on the canonical run (87.0% accuracy, recall@5 97.2%)

58 of 69 failures had the answer retrieved in the top 5. LLM labels (judge gpt-5.4-mini-2026-03-17): answer-construction 42 (27 of them literal "I don't know" abstentions with the answer in context), missing-date-use 7, ranking 4, conflict-resolution 2, retrieval-gap 2, outdated-fact 1. This reprioritized the release: the harness answer-assembly path (see feat/date-aware-ranking) is the biggest benchmark lever; pure ranking fixes are the production-quality lever.

Report artifact: benchmarks/results/failure_modes_canonical_llm_20260611.json (local, gitignored).

Testing

31 new tests (synthetic fixtures, mocked clients, no network); full suite 518 passed, 12 skipped; black + flake8 clean.

Refs #158, #159 (both issues require this failure-mode classification as their first acceptance criterion).

🤖 Generated with Claude Code

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a LongMemEval “failed-but-retrieved” diagnosis harness (for #158/#159) plus a lightweight judge quota/auth preflight to prevent wasted benchmark runs when the pinned judge is unavailable.

Changes:

  • Record retrieved_session_ids_full (rank-ordered session_id + score for all recalled memories) in LongMemEval result artifacts to remove the rank-6–10 visibility gap.
  • Introduce tests/benchmarks/longmemeval/diagnose_failures.py (stage-1 deterministic evidence + optional stage-2 judge labeling with an agreement matrix) and comprehensive unit tests.
  • Add tests/benchmarks/judge_preflight.py and wire it into test-longmemeval-benchmark.sh when --llm-eval is enabled.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/benchmarks/longmemeval/test_longmemeval.py Adds retrieved_session_ids_full to persisted results and implements helper to capture session IDs with scores.
tests/benchmarks/longmemeval/diagnose_failures.py New two-stage failure-mode diagnosis CLI (deterministic evidence + optional LLM labeling).
tests/benchmarks/longmemeval/test_diagnose_failures.py New unit tests covering stage-1 evidence/labeling, stage-2 parsing/errors, and judge preflight behavior.
tests/benchmarks/judge_preflight.py New minimal judge call preflight with classified exit codes and actionable messages.
test-longmemeval-benchmark.sh Runs judge preflight before benchmark ingestion/question loop when --llm-eval is requested.

Comment thread test-longmemeval-benchmark.sh Outdated
Comment on lines +276 to +279
if ! (cd "$SCRIPT_DIR" && "${PREFLIGHT_CMD[@]}"); then
echo -e "${RED}Judge preflight failed — aborting before the question loop${NC}"
exit 1
fi

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preserved the judge preflight exit status by capturing $? in the failure branch and exiting with it — test-longmemeval-benchmark.sh:276. GitHub marked the original line outdated after the push.

jack-arturo added a commit that referenced this pull request Jun 11, 2026
…#187)

## Summary
Stacked on #186. Driven by the failure-mode diagnosis in #183 (58
failed-but-retrieved LongMemEval questions: answer-construction 42,
missing-date-use 7, ranking 4).

Server (production + benchmark):
- **Timestamp tiebreak** (always-on): exact score ties now order
newest-first deterministically (`_score_sort_key`).
- **`recency_bias=auto|on|off`** (ships `off` via
`RECALL_RECENCY_BIAS`): after dedup/state-filter and before the adaptive
floor, candidate timestamps are min-max normalized and
`SEARCH_WEIGHT_TEMPORAL` (default 0.1) × relative recency is added — so
the newest version of a conflicting fact can outrank an older, heavier
one. `auto` triggers on temporal intent ("latest", "current", "what
changed", …; word-boundaried, "currency"/"nowhere" safe).
- **Supersession chain-walk**: `current` state mode now resolves
INVALIDATED_BY/EVOLVED_INTO chains to their head (A→B→C surfaces C,
provenance still points at A; depth-bounded at 5, cycle-safe, batched).
*Honesty note: benchmark corpora carry no supersession edges — this is a
production-correctness fix, not a score mover.*

Harness (benchmark-only, flag-gated for methodology reproducibility):
- **`temporal_answer_hint`** config flag (default off; new
`temporal-answer` preset for A/B): chronological memory rendering with
scores + conflict-recency guidance + anti-overabstention guidance (27 of
the 58 failures were literal "I don't know" with the answer retrieved;
abstention remains possible). Flag-off prompt is byte-identical to the
canonical methodology (equality-tested).

## Testing
40 new tests (tiebreak, bias flip with the shipped default weight,
auto-detection, chain-walk depth/cycle/query-count guards, #158
preference latest-wins acceptance, prompt byte-identity). Full suite 563
passed, 12 skipped; black + flake8 clean.

## Validation plan before enabling anything
Lab IR A/B → automem-evals 22-probe zero-delta gate (recency_bias=off
default keeps it zero-delta) → LongMemEval mini judge-off (recall@5 ≥
97.2% floor) → judged mini → full run; `temporal-answer` preset A/B for
the harness flag, with server-vs-harness deltas reported separately in
EXPERIMENT_LOG.

Refs #158, #159

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
jack-arturo and others added 3 commits June 11, 2026 21:05
…judge preflight

Implements PR-2 of the LongMemEval failure-diagnosis plan (issues #158/#159):

- tests/benchmarks/longmemeval/diagnose_failures.py: two-stage CLI that
  classifies failed-but-retrieved questions (is_correct=false AND
  recall_hit_at_5=true). Stage 1 is pure code: joins failures to dataset
  haystack sessions/dates and emits per-question evidence (answer_rank,
  abstained_despite_hit, stale_candidate_above_answer, noise_ratio,
  date_arithmetic_needed, answer_coverage_top5) plus a deterministic
  suggested_mode documented in the module docstring. Stage 2 (--llm) asks
  the pinned benchmark judge for an independent label and records the
  stage1-vs-stage2 agreement matrix. Default type filter is all types
  (58 questions on the canonical full run; the four weak categories
  cover 54 of them).

- tests/benchmarks/judge_preflight.py: one minimal completion against the
  pinned judge model before benchmark runs; exit 0 ok, 2 quota/429,
  3 auth, 1 other, with actionable one-line messages. Wired into
  test-longmemeval-benchmark.sh before the question loop when --llm-eval
  is set, so quota exhaustion aborts before any ingestion work.

- test_longmemeval.py instrumentation: additive details key
  retrieved_session_ids_full with ALL recalled memories' session ids in
  rank order and per-memory score (existing top-5-unique key unchanged).

- tests/benchmarks/longmemeval/test_diagnose_failures.py: synthetic-only
  unit tests for evidence extraction, the suggested_mode heuristic, the
  type filter, stage-2 with a mocked client, and preflight error
  classification. No network calls in tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review fixes for the failure-diagnosis harness:

- run_stage2: records with llm_error set (transport/parse failures) no
  longer pollute the agreement matrix or exact_agreement; they are
  counted in a new agreement.llm_errors field and reported in the
  stdout summary. Per-record llm_error is unchanged.
- Import _result_details from analyze_results instead of duplicating it
  (module is side-effect-free at import).
- Comment the deliberate over-trigger in _DATE_MATH_RE (stage 2
  cross-checks it).
- main(): only print the exact-agreement line when a rate exists (no
  stray blank line).
- _session_ids_with_scores docstring: note the full ranked pool is
  recorded for future rank-6-10 depth analysis, not yet consumed by
  diagnose_failures.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jack-arturo jack-arturo force-pushed the feat/longmemeval-failure-diagnosis branch from c526bd4 to bbabb3b Compare June 11, 2026 19:06
@jack-arturo jack-arturo changed the base branch from main to develop June 11, 2026 19:06
@jack-arturo jack-arturo merged commit f99bece into develop Jun 11, 2026
5 checks passed
@jack-arturo jack-arturo deleted the feat/longmemeval-failure-diagnosis branch June 11, 2026 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants