Evidence
The canonical LongMemEval full run (benchmarks/results/longmemeval_full_gpt5mini_20260425_231308.json) scored single-session-preference at 60.0% (18/30) with recall@5 90.0% (27/30).
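A quick bound follows from these counts. The sketch below assumes recall@5 is counted per question, meaning 27 of the 30 questions had their evidence in the top 5 retrieved items; the variable names are illustrative only.

```python
# Counts reported for single-session-preference in the full run.
total = 30
correct = 18
recalled_at_5 = 27

failures = total - correct                # wrong answers
retrieval_misses = total - recalled_at_5  # questions without evidence in the top 5

# Even if every retrieval miss also produced a wrong answer, the remaining
# failures happened with the evidence already retrieved.
min_failures_with_evidence = max(0, failures - retrieval_misses)

print(failures, retrieval_misses, min_failures_with_evidence)  # 12 3 9
```

Under that assumption, at least 9 of the 12 failures cannot be explained by a retrieval miss alone.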
Problem
Preference questions are the clearest weak category in the full run. Because recall is high relative to answer accuracy, failures may involve stale preferences, preference overwrite/update semantics, missing metadata, or answer synthesis over conflicting preference evidence rather than simple retrieval misses.
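To make the stale-preference hypothesis concrete, here is a minimal sketch of "latest preference wins" resolution. The `PreferenceMemory` type, its fields, and the sample data are all hypothetical and do not reflect the server's actual storage model; the point is only that two retrieved memories about the same topic need a timestamp (or equivalent metadata) to be resolved correctly.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PreferenceMemory:
    key: str             # hypothetical preference topic, e.g. "coffee"
    value: str           # the stated preference
    stated_at: datetime  # when the user stated it


def latest_preference(memories, key):
    """Return the most recently stated value for `key`, or None if absent."""
    matching = [m for m in memories if m.key == key]
    if not matching:
        return None
    return max(matching, key=lambda m: m.stated_at).value


# Made-up example: the user changes their coffee preference over time.
memories = [
    PreferenceMemory("coffee", "dark roast", datetime(2024, 1, 5)),
    PreferenceMemory("coffee", "decaf", datetime(2024, 3, 2)),
    PreferenceMemory("editor", "vim", datetime(2024, 2, 1)),
]

print(latest_preference(memories, "coffee"))  # decaf
```

If both coffee memories are retrieved without `stated_at`, the answer model has no principled way to choose between them, which would show up as high recall with low answer accuracy.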
Acceptance Criteria
- Analyze failed preference questions from benchmarks/results/longmemeval_full_gpt5mini_20260425_231308.json.
- Classify failures as retrieval misses, stale preference conflicts, missing metadata, answer synthesis errors, or another concrete category discovered during analysis.
- Add focused server tests for preference creation, preference update, and conflicting/latest preference retrieval.
- Re-run the representative mini (--per-type 5) and full LongMemEval before claiming improvement.
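The classification step could be tallied with something like the sketch below. It is a starting point under stated assumptions, not the benchmark's tooling: the record fields (`correct`, `evidence_in_top_k`, `retrieved_conflicting_preferences`, `missing_metadata`) are hypothetical, are not taken from the results JSON schema, and the last two would most likely come from manual annotation of each failed question.

```python
from collections import Counter


def classify_failure(record):
    """Assign one coarse label to a failed question record.

    All field names are hypothetical placeholders, not the real schema.
    """
    if not record.get("evidence_in_top_k", False):
        return "retrieval_miss"
    if record.get("retrieved_conflicting_preferences", False):
        return "stale_preference_conflict"
    if record.get("missing_metadata", False):
        return "missing_metadata"
    # Evidence was retrieved and nothing else was flagged.
    return "answer_synthesis_error"


def summarize(records):
    """Count failure labels over the records that were answered incorrectly."""
    failed = [r for r in records if not r.get("correct", False)]
    return Counter(classify_failure(r) for r in failed)


# Made-up records for illustration only.
sample = [
    {"correct": True, "evidence_in_top_k": True},
    {"correct": False, "evidence_in_top_k": False},
    {"correct": False, "evidence_in_top_k": True,
     "retrieved_conflicting_preferences": True},
    {"correct": False, "evidence_in_top_k": True},
]

print(summarize(sample))
```

The checks run in priority order, so each failure gets exactly one label. The fall-through label is a catch-all and should be split whenever review turns up a more concrete category.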
Related Work
Related to existing metadata/recall work: #110 and #111.