LongMemEval: improve preference memory update and recall semantics #158

@jack-arturo

Evidence

The canonical LongMemEval full run (benchmarks/results/longmemeval_full_gpt5mini_20260425_231308.json) scored 60.0% (18/30) on single-session-preference, with recall@5 at 90.0% (27/30).

Problem

Preference questions are the clearest weak category in the full run. Recall@5 (90.0%) is well above answer accuracy (60.0%), so most failures likely occur after retrieval: stale preferences, preference overwrite/update semantics, missing metadata, or answer synthesis over conflicting preference evidence, rather than simple retrieval misses.
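The size of that gap can be bounded from the two reported counts alone. The sketch below is plain arithmetic on the numbers in Evidence, assuming both metrics were computed over the same 30 questions. The function name is illustrative and not part of the benchmark code.

```python
def min_failures_with_recall_hit(total: int, correct: int, recall_hits: int) -> int:
    """Lower bound on wrong answers whose evidence was still retrieved in the top 5."""
    failures = total - correct          # questions answered incorrectly
    recall_misses = total - recall_hits  # questions with no relevant hit in the top 5
    # Worst case for this bound: every recall miss is also a wrong answer.
    return max(failures - recall_misses, 0)


# Counts from the full run: 18/30 correct, 27/30 recall@5 hits.
print(min_failures_with_recall_hit(total=30, correct=18, recall_hits=27))  # 9
```

Under that assumption, at least 9 of the 12 failed preference questions had relevant evidence in the top 5, so retrieval misses can explain at most 3 of them.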

Acceptance Criteria

  • Analyze failed preference questions from benchmarks/results/longmemeval_full_gpt5mini_20260425_231308.json.
  • Classify failures as retrieval misses, stale preference conflicts, missing metadata, answer synthesis errors, or another concrete category discovered during analysis.
  • Add focused server tests for preference creation, preference update, and conflicting/latest preference retrieval.
  • Re-run the representative mini (--per-type 5) and full LongMemEval before claiming improvement.
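As a starting point for the focused tests, here is a minimal sketch of the three cases (creation, update, conflicting/latest retrieval). `PreferenceStore` and its methods are a hypothetical in-memory stand-in, not the server's actual API, and "latest timestamp wins" is an assumed update rule that the failure analysis may revise.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class PreferenceStore:
    """Hypothetical in-memory stand-in for the server's memory API."""

    # topic -> list of (timestamp, value) pairs, in insertion order
    _items: dict = field(default_factory=dict)

    def record(self, topic: str, value: str, at: datetime) -> None:
        self._items.setdefault(topic, []).append((at, value))

    def history(self, topic: str) -> list:
        # Sort by timestamp so ingestion order does not matter.
        return sorted(self._items.get(topic, []))

    def latest(self, topic: str):
        entries = self.history(topic)
        return entries[-1][1] if entries else None


def test_preference_creation():
    store = PreferenceStore()
    store.record("coffee", "espresso", datetime(2024, 1, 1))
    assert store.latest("coffee") == "espresso"


def test_preference_update_latest_wins():
    store = PreferenceStore()
    # Newer preference is ingested first; recency should still decide.
    store.record("coffee", "decaf", datetime(2024, 3, 1))
    store.record("coffee", "espresso", datetime(2024, 1, 1))
    assert store.latest("coffee") == "decaf"


def test_conflicting_preferences_are_kept_in_history():
    store = PreferenceStore()
    store.record("coffee", "espresso", datetime(2024, 1, 1))
    store.record("coffee", "decaf", datetime(2024, 3, 1))
    # Both values stay visible so answer synthesis can see the conflict.
    assert [value for _, value in store.history("coffee")] == ["espresso", "decaf"]
    assert store.latest("tea") is None
```

The real tests would exercise the server endpoints instead of this stand-in. The shape to preserve is that each case asserts on which preference is returned, not only on whether something is retrieved.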

Related Work

Related to existing metadata/recall work: #110 and #111.

Metadata

Labels: enhancement
