LongMemEval: improve multi-session and temporal conflict resolution #159

@jack-arturo

Evidence

The canonical LongMemEval full run (benchmarks/results/longmemeval_full_gpt5mini_20260425_231308.json) scored:

  • multi-session: 81.2% (108/133)
  • temporal-reasoning: 88.0% (117/133)
  • overall recall@5: 97.2% (486/500)

Problem

With overall recall@5 at 97.2%, the system usually retrieves the relevant evidence, yet it still misses questions that require chronology, session ordering, or resolving facts that changed across sessions. This points to ranking, metadata use, conflict resolution, or answer construction rather than a broad retrieval failure.
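To make "resolving changed facts across sessions" concrete, here is a minimal sketch of latest-fact selection. It is not the server's actual logic: the `Memory` record, its `key` field, and the `observed_at` timestamp are hypothetical names introduced only for illustration, and it assumes each memory carries a usable date.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Memory:
    # Hypothetical record shape, not the project's real schema.
    key: str               # what the fact is about, e.g. "user.city"
    value: str             # the stated fact
    session_id: str        # session the fact was observed in
    observed_at: datetime  # when the fact was stated


def latest_facts(memories):
    """Keep only the most recently observed memory for each key."""
    latest = {}
    for m in memories:
        current = latest.get(m.key)
        if current is None or m.observed_at > current.observed_at:
            latest[m.key] = m
    return latest


def conflicting_keys(memories):
    """Return keys that were given more than one distinct value."""
    values = {}
    for m in memories:
        values.setdefault(m.key, set()).add(m.value)
    return {k for k, v in values.items() if len(v) > 1}
```

If a failed question has the newer memory in the top 5 but the answer still uses the older value, that would suggest a conflict-resolution or answer-construction problem rather than a ranking one.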

Acceptance Criteria

  • Analyze failed multi-session, temporal, and knowledge-update questions where recall_hit_at_5=true.
  • Identify whether failures come from ranking, missing date/session metadata, outdated facts, graph expansion noise, answer construction, or another concrete category discovered during analysis.
  • Add focused server tests for date-aware recall, latest-fact selection, and conflicting memories across sessions.
  • Re-run both the representative mini run (--per-type 5) and the full LongMemEval run before claiming improvement.
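The first criterion could start from a filter like the one below. This is a sketch under stated assumptions: only `recall_hit_at_5` appears in this issue, while the `question_type` and `correct` field names, and the idea that the results file yields one record per question, are guesses about the schema that should be checked against the actual JSON.

```python
from collections import Counter

# Question types named in the acceptance criteria.
TARGET_TYPES = {"multi-session", "temporal-reasoning", "knowledge-update"}


def retrieved_but_wrong(records):
    """Select target-type questions that hit recall@5 but were answered wrong.

    `question_type` and `correct` are assumed field names; only
    `recall_hit_at_5` is taken from the issue text.
    """
    return [
        r
        for r in records
        if r.get("question_type") in TARGET_TYPES
        and r.get("recall_hit_at_5") is True
        and not r.get("correct", False)
    ]


def count_by_type(records):
    """Count retrieved-but-wrong questions per question type."""
    return Counter(r["question_type"] for r in retrieved_but_wrong(records))
```

Each selected record can then be tagged by hand with one of the failure categories from the second criterion, so the counts show where a fix would pay off first.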

Related Work

Related to existing recall/graph/metadata issues: #74, #100, #110, and #111.

Metadata

Labels: enhancement (New feature or request)
