LongMemEval: improve preference memory update and recall semantics

## Evidence

The canonical LongMemEval full run (`benchmarks/results/longmemeval_full_gpt5mini_20260425_231308.json`) scored `single-session-preference` at **60.0% (18/30)** with recall@5 **90.0% (27/30)**.

## Problem

Preference questions are the clearest weak category in the full run. Recall@5 missed on only 3 of 30 questions while 12 of 30 were answered incorrectly, so at least 9 failures occurred even though relevant evidence was retrieved. Those failures may involve stale preferences, preference overwrite/update semantics, missing metadata, or answer synthesis over conflicting preference evidence rather than simple retrieval misses.
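
To make the "stale preference" hypothesis concrete, here is a minimal sketch of latest-wins resolution. It assumes each stored preference carries a topic key and a timestamp; `Preference`, `latest_preferences`, and all field names are hypothetical and are not taken from the server code.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Preference:
    # Hypothetical record shape; the real memory schema may differ.
    topic: str
    value: str
    stated_at: datetime


def latest_preferences(retrieved: list[Preference]) -> dict[str, Preference]:
    """Keep only the most recently stated preference per topic."""
    latest: dict[str, Preference] = {}
    for pref in retrieved:
        current = latest.get(pref.topic)
        if current is None or pref.stated_at > current.stated_at:
            latest[pref.topic] = pref
    return latest


# Both "coffee" memories could land in the top 5, so recall counts as a hit,
# yet an answer built from the older one would still be wrong.
retrieved = [
    Preference("coffee", "dark roast", datetime(2023, 1, 5)),
    Preference("coffee", "decaf", datetime(2023, 3, 2)),
    Preference("editor", "vim", datetime(2023, 2, 1)),
]
resolved = latest_preferences(retrieved)
```

Whether latest-wins is the right policy for every preference question is itself something the failure analysis should confirm; the sketch only illustrates how high recall and a wrong answer can coexist.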

## Acceptance Criteria

- Analyze failed preference questions from `benchmarks/results/longmemeval_full_gpt5mini_20260425_231308.json`.
- Classify failures as retrieval misses, stale preference conflicts, missing metadata, answer synthesis errors, or another concrete category discovered during analysis.
- Add focused server tests for preference creation, preference update, and conflicting/latest preference retrieval.
- Re-run both the representative mini run (`--per-type 5`) and the full LongMemEval run before claiming improvement.
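
The classification step above could be sketched roughly as follows. This is an assumption-heavy illustration: the per-question keys (`question_type`, `correct`, `recall_hit`, `has_newer_conflicting_memory`, `has_preference_metadata`) are hypothetical and are not read from the actual result file, whose schema should be checked first. The last two flags would most likely come from manual annotation during analysis.

```python
from collections import Counter


def classify_failure(record: dict) -> str:
    """Assign one failure category to an incorrectly answered question."""
    # All keys are hypothetical; adapt them to the actual result schema.
    if not record.get("recall_hit", False):
        return "retrieval_miss"
    if record.get("has_newer_conflicting_memory", False):
        return "stale_preference_conflict"
    if not record.get("has_preference_metadata", True):
        return "missing_metadata"
    return "answer_synthesis_error"


def triage(records: list[dict],
           question_type: str = "single-session-preference") -> Counter:
    """Count failure categories for one question type."""
    failures = [
        r for r in records
        if r.get("question_type") == question_type
        and not r.get("correct", False)
    ]
    return Counter(classify_failure(r) for r in failures)


# Tiny hand-written sample, not real benchmark data.
qt = "single-session-preference"
sample = [
    {"question_type": qt, "correct": True, "recall_hit": True},
    {"question_type": qt, "correct": False, "recall_hit": False},
    {"question_type": qt, "correct": False, "recall_hit": True,
     "has_newer_conflicting_memory": True},
    {"question_type": qt, "correct": False, "recall_hit": True,
     "has_preference_metadata": False},
    {"question_type": qt, "correct": False, "recall_hit": True},
    {"question_type": "multi-session", "correct": False, "recall_hit": False},
]
counts = triage(sample)
```

Checking retrieval first keeps the categories mutually exclusive; any new category discovered during analysis can be added as another branch before the final fallback.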

## Related Work

Related to existing metadata/recall work: #110 and #111.

LongMemEval: improve preference memory update and recall semantics #158

Evidence

Problem

Acceptance Criteria

Related Work

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

LongMemEval: improve preference memory update and recall semantics #158

Description

Evidence

Problem

Acceptance Criteria

Related Work

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions