LongMemEval: improve multi-session and temporal conflict resolution #159

## Evidence

The canonical LongMemEval full run (`benchmarks/results/longmemeval_full_gpt5mini_20260425_231308.json`) scored:

- `multi-session`: **81.2% (108/133)**
- `temporal-reasoning`: **88.0% (117/133)**
- overall recall@5: **97.2% (486/500)**
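As a quick sanity check on these figures, the sketch below derives the number of wrong answers per type and a lower bound on how many of them still had a recall@5 hit. It assumes the per-type scores are answer accuracy and that recall@5 is counted over all 500 questions; the issue does not state either explicitly.

```python
# Counts copied from the Evidence list; the interpretation is an assumption.
per_type = {
    "multi-session": (108, 133),
    "temporal-reasoning": (117, 133),
}
recall_hits, total_questions = 486, 500

# Wrong answers per question type.
wrong_by_type = {name: n - correct for name, (correct, n) in per_type.items()}
wrong_total = sum(wrong_by_type.values())

# Questions where the evidence was not in the top 5, across all types.
recall_misses = total_questions - recall_hits

# Even if every recall miss fell in these two types, at least this many
# wrong answers happened despite the evidence being retrieved.
min_wrong_with_hit = max(0, wrong_total - recall_misses)

for name, (correct, n) in per_type.items():
    print(f"{name}: {correct / n:.1%} correct, {wrong_by_type[name]} wrong")
print(f"recall@5 misses overall: {recall_misses}")
print(f"wrong answers with a recall hit (lower bound): {min_wrong_with_hit}")
```

Under those assumptions, most of the wrong answers in these two types cannot be explained by retrieval misses alone.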

## Problem

The system usually retrieves relevant evidence (recall@5 is 97.2%), but it still misses questions that require chronology, session ordering, or resolving facts that changed across sessions. This points to ranking, metadata use, conflict resolution, or answer construction rather than a broad retrieval failure.
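To make "resolving changed facts across sessions" concrete, here is a minimal sketch of latest-fact selection. The `Memory` type, its fields, and the `latest_facts` helper are hypothetical illustrations, not the project's data model, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Memory:
    # Hypothetical shape: one stored fact plus the session it came from.
    session_id: str
    session_date: date
    subject: str
    value: str


def latest_facts(memories):
    """Keep, for each subject, the memory from the most recent session."""
    latest = {}
    for m in memories:
        current = latest.get(m.subject)
        if current is None or m.session_date > current.session_date:
            latest[m.subject] = m
    return latest


# Illustrative data: the city changes between session s1 and session s3.
memories = [
    Memory("s1", date(2023, 1, 10), "user.city", "Boston"),
    Memory("s3", date(2023, 6, 2), "user.city", "Denver"),
    Memory("s2", date(2023, 3, 5), "user.employer", "Acme"),
]

resolved = latest_facts(memories)
print(resolved["user.city"].value)
```

This only works when every memory carries a usable session date, which is why missing date/session metadata is one of the failure categories worth checking.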

## Acceptance Criteria

- Analyze failed multi-session, temporal, and knowledge-update questions where `recall_hit_at_5=true`.
- Identify whether failures come from ranking, missing date/session metadata, outdated facts, graph expansion noise, answer construction, or another concrete category discovered during analysis.
- Add focused server tests for date-aware recall, latest-fact selection, and conflicting memories across sessions.
- Re-run both the representative mini run (`--per-type 5`) and the full LongMemEval run before claiming improvement.
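The first criterion could start from a filter like the sketch below. Only `recall_hit_at_5` is named in this issue; the `question_type` and `correct` keys and the per-question record layout are assumptions about the results JSON, so adjust them to the real schema. The sample records are invented for illustration.

```python
from collections import Counter

# Question types named in the acceptance criteria.
TARGET_TYPES = {"multi-session", "temporal-reasoning", "knowledge-update"}


def failed_with_recall_hit(records):
    """Return wrong answers in the target types where evidence was retrieved."""
    return [
        r
        for r in records
        if r.get("question_type") in TARGET_TYPES
        and r.get("recall_hit_at_5") is True
        and r.get("correct") is False
    ]


# Invented sample records; real ones would be loaded from the results file.
records = [
    {"question_type": "multi-session", "recall_hit_at_5": True, "correct": False},
    {"question_type": "multi-session", "recall_hit_at_5": False, "correct": False},
    {"question_type": "temporal-reasoning", "recall_hit_at_5": True, "correct": True},
    {"question_type": "knowledge-update", "recall_hit_at_5": True, "correct": False},
    {"question_type": "single-session-user", "recall_hit_at_5": True, "correct": False},
]

failures = failed_with_recall_hit(records)
by_type = Counter(r["question_type"] for r in failures)
print(dict(by_type))
```

Each record that survives the filter can then be labeled by hand with one of the failure categories from the second criterion.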

## Related Work

Related to existing recall/graph/metadata issues: #74, #100, #110, and #111.
