Evidence
The canonical LongMemEval full run (benchmarks/results/longmemeval_full_gpt5mini_20260425_231308.json) scored:
- multi-session: 81.2% (108/133)
- temporal-reasoning: 88.0% (117/133)
- overall recall@5: 97.2% (486/500)
Problem
The system usually retrieves relevant evidence, but still misses questions requiring chronology, session ordering, or resolving changed facts across sessions. This points to ranking, metadata use, conflict resolution, or answer construction rather than broad retrieval failure.
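To make the failure mode concrete, here is a minimal sketch of how a question can count as a recall hit and still be answered from a stale fact. Everything in it is hypothetical: the `Memory` class, its field names, the scores, and the sample data are made up for illustration and are not taken from the server code or from the benchmark results.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Memory:
    """Hypothetical retrieved memory; field names are illustrative only."""
    subject: str
    text: str
    session_date: date
    score: float  # similarity score assigned by retrieval


def top_by_score(memories):
    """Pick the memory with the highest similarity score, ignoring dates."""
    return max(memories, key=lambda m: m.score)


def latest_for_subject(memories, subject):
    """Among memories about one subject, prefer the most recent session.

    The similarity score is used only to break ties on the same date.
    """
    candidates = [m for m in memories if m.subject == subject]
    return max(candidates, key=lambda m: (m.session_date, m.score))


# Made-up example: both facts are retrieved (a recall hit), but the older
# one scores higher on similarity.
retrieved = [
    Memory("home_city", "User lives in Boston.", date(2023, 1, 10), 0.91),
    Memory("home_city", "User moved to Denver.", date(2023, 6, 2), 0.84),
]

stale = top_by_score(retrieved)
current = latest_for_subject(retrieved, "home_city")
```

In this toy case score-only ranking surfaces the Boston memory, while date-aware selection surfaces the Denver one. Whether the real failures look like this is exactly what the analysis below should establish.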
Acceptance Criteria
- Analyze failed multi-session, temporal, and knowledge-update questions where recall_hit_at_5=true.
- Identify whether failures come from ranking, missing date/session metadata, outdated facts, graph expansion noise, answer construction, or another concrete category discovered during analysis.
- Add focused server tests for date-aware recall, latest-fact selection, and conflicting memories across sessions.
- Re-run the representative mini (--per-type 5) and full LongMemEval before claiming improvement.
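As a starting point for the first criterion, the triage step might look like the sketch below. It assumes each per-question record is a dict with `question_type`, `recall_hit_at_5`, and `correct` keys. Only `recall_hit_at_5` is named in this issue; the other two key names, the exact type labels, and the sample records are assumptions and should be adjusted to the actual layout of the results JSON.

```python
from collections import Counter

# Question types named in this issue; the exact label strings are assumed.
TARGET_TYPES = {"multi-session", "temporal-reasoning", "knowledge-update"}


def failed_despite_recall(records):
    """Return records of a target type that were retrieved but answered wrong.

    Assumes hypothetical keys `question_type` and `correct` alongside
    `recall_hit_at_5`.
    """
    return [
        r for r in records
        if r.get("question_type") in TARGET_TYPES
        and r.get("recall_hit_at_5") is True
        and not r.get("correct", False)
    ]


def count_by_type(records):
    """Count the retrieved-but-wrong failures per question type."""
    return Counter(r["question_type"] for r in failed_despite_recall(records))


# Made-up records for illustration; real ones would come from json.load()
# on the results file.
sample = [
    {"question_type": "multi-session", "recall_hit_at_5": True, "correct": False},
    {"question_type": "multi-session", "recall_hit_at_5": True, "correct": True},
    {"question_type": "temporal-reasoning", "recall_hit_at_5": False, "correct": False},
    {"question_type": "knowledge-update", "recall_hit_at_5": True, "correct": False},
    {"question_type": "single-session-user", "recall_hit_at_5": True, "correct": False},
]

counts = count_by_type(sample)
```

Each record returned by `failed_despite_recall` can then be tagged by hand with one of the failure categories listed above (ranking, missing date/session metadata, outdated facts, graph expansion noise, answer construction, or a new category).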
Related Work
Related to existing recall/graph/metadata issues: #74, #100, #110, and #111.