feat(recall): date-aware ranking + latest-fact selection (#158, #159) #187
Pull request overview
This PR improves recall behavior for time-sensitive questions by adding deterministic score tie-breaking, an optional timestamp-based recency re-rank, and deeper “current state” supersession resolution; it also adds a benchmark-only prompt variant to reduce LongMemEval answer-construction failures when the answer is retrieved but not used.
Changes:
- Add newest-first timestamp tiebreak for exact score ties (`_score_sort_key`) and expand “current state” filtering to walk INVALIDATED_BY/EVOLVED_INTO chains to their head (depth/cycle bounded).
- Introduce `recency_bias=auto|on|off` (env default `RECALL_RECENCY_BIAS=off`) plus `SEARCH_WEIGHT_TEMPORAL` to add a min-max-normalized recency bonus for score ordering when activated.
- Add LongMemEval harness flag `temporal_answer_hint` (and preset) to render memories chronologically with scores + temporal guidance while keeping the default prompt byte-identical.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| automem/api/recall.py | Implements score tie-break by timestamp, recency-bias re-rank, and supersession chain-walk for current-state filtering. |
| automem/config.py | Adds SEARCH_WEIGHT_TEMPORAL parsing and RECALL_RECENCY_BIAS env default handling. |
| automem/utils/time.py | Adds query_has_temporal_intent() heuristic for recency_bias=auto. |
| tests/test_api_endpoints.py | Adds endpoint/unit-style tests for tie-break, recency bias modes, temporal intent detection, and chain-walk semantics/guards. |
| tests/benchmarks/longmemeval/test_longmemeval.py | Adds harness prompt builder toggle (temporal_answer_hint) and preset plumbing/logging. |
| tests/benchmarks/longmemeval/test_answer_prompt.py | Adds byte-identity tests for prompt control + coverage for temporal-answer prompt behavior. |
| tests/benchmarks/longmemeval/configs.py | Adds temporal_answer_hint config field and new temporal-answer preset. |
| docs/ENVIRONMENT_VARIABLES.md | Documents SEARCH_WEIGHT_TEMPORAL and RECALL_RECENCY_BIAS. |
| docs/API.md | Documents newest-first tie-breaking, recency_bias, and chain-head supersession behavior in recall. |
| CLAUDE.md | Updates example env configuration with the new variables. |

    for res in results:
        parsed_ts = _parse_iso_datetime((res.get("memory") or {}).get("timestamp"))
        epochs.append(parsed_ts.timestamp() if parsed_ts is not None else None)

Reused a defensive timestamp conversion helper for recency_bias, skipping conversion failures instead of 500ing — automem/api/recall.py:395. Added a regression test for timestamp conversion errors — tests/test_api_endpoints.py:4014. GitHub marked the original line outdated after the push.
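For readers following the thread, here is a minimal sketch of what a defensive timestamp conversion of this kind might look like. `safe_epoch` is a hypothetical name used for illustration only; it is not the helper at automem/api/recall.py:395, and the exact exception set the real code guards against is an assumption.

```python
from datetime import datetime
from typing import Optional


def safe_epoch(raw: object) -> Optional[float]:
    """Hypothetical helper: ISO-8601 string -> epoch seconds, or None on any failure."""
    if not isinstance(raw, str) or not raw.strip():
        return None
    try:
        # Older Python versions reject a trailing "Z", so normalize it first.
        parsed = datetime.fromisoformat(raw.strip().replace("Z", "+00:00"))
        return parsed.timestamp()
    except (ValueError, OverflowError, OSError):
        # ValueError: malformed string. OverflowError/OSError: dates the
        # platform cannot convert to an epoch value.
        return None


# Illustrative candidate set: one good timestamp, one bad, one missing memory.
results = [
    {"memory": {"timestamp": "2026-06-11T12:00:00Z"}},
    {"memory": {"timestamp": "not-a-date"}},
    {"memory": None},
]
epochs = [safe_epoch((res.get("memory") or {}).get("timestamp")) for res in results]
```

Entries that fail conversion come back as `None`, so a caller can skip them rather than let the exception surface as a 500.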
7dac6d1 to fe62186
ae34c42 to 5a6fa23
…nge (#191)

Fixes #190.

## Problem

`_graph_keyword_search` returned the **raw Cypher additive score** (+2 per keyword contained in content, +1 per keyword in any tag, summed over all extracted keywords, plus a +2/+1 whole-phrase bonus), so a K-keyword query can score up to **3K+3** while every other channel (vector cosine, metadata, trending importance) lives in 0–1. Observed during the 2026-06-11 production forensics: a tag-scoped exact-content match returned `keyword=11.0`, `final_score=4.03`.

Consequences:
- `SEARCH_WEIGHT_KEYWORD (0.35) × 11 = 3.85`: a keyword hit trumps any vector/metadata/importance combination, scaling with *query length* rather than match quality.
- Defeats `RECALL_RELEVANCE_GATE` semantics (PR #186): `evidence = max(vector, keyword, metadata, exact)` assumes 0–1 components; `evidence = 11` sails past any gate.

## Fix

1. **Producer**: normalize the raw score by its per-query maximum (`3·len(keywords) + 3` when a phrase is present; `3` in the phrase-only branch) before it leaves `_graph_keyword_search`. This is a monotone per-query transform: within-channel ordering (and the Cypher `ORDER BY`) is unchanged; only cross-channel blending changes, which is the point.
2. **Consumer**: defensively clamp the keyword component to `min(1.0, …)` in `_compute_metadata_score`, so no future producer can break the 0–1 contract or the gate again.

## Verification

- 4 new tests in `tests/test_keyword_score_normalization.py` (TDD'd against the bug, including the literal `keyword=11.0` repro). Full suite: **503 passed**.
- **Production-corpus lab A/B** (10,142-memory snapshot, 200 queries, vs the pooled 3-run parity baseline from the 2026-06-11 release sweep):
  - Recall@5 −0.2pp, Recall@10 −0.7pp, MRR −0.008, NDCG@10 −0.007 (all within baseline run-to-run variance; paired t-test p=0.32)
  - Per-query: **196/197 unchanged**, 0 improved, 1 regressed
  - The single flip is the intended behavior change made visible: the expected memory held rank 1 *only* via the inflated keyword score (ranks 2–5 identical before/after). It's in the fallback-typed `Memory` cohort (MRR 0.15 baseline, the known data-quality cohort from #188's classification incident).

## Notes

- This lands on main independently of the #182–#187 chain; #186's gate evidence check is the main beneficiary once the chain rebases over it.
- Trending (`importance`), metadata (capped), and vector (cosine) channels were verified already bounded; the graph keyword channel was the only unbounded producer.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
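A rough sketch of the producer-side normalization plus the consumer-side clamp described in this commit message. The function name and signature are hypothetical, and the no-phrase case (maximum `3·len(keywords)`) is an assumption, since the message only spells out the phrase-present and phrase-only branches. The three-keyword query in the example is also made up; the message reports `keyword=11.0` but not the keyword count.

```python
SEARCH_WEIGHT_KEYWORD = 0.35  # weight quoted in the commit message


def normalize_keyword_score(raw: float, n_keywords: int, has_phrase: bool) -> float:
    """Scale the additive keyword score into 0-1 by its per-query maximum."""
    # Each keyword can add +2 (content) and +1 (tag); the phrase can add +2/+1.
    # Assumption: with no phrase the maximum is simply 3 * n_keywords.
    max_score = 3 * n_keywords + (3 if has_phrase else 0)
    if max_score <= 0:
        return 0.0
    # Clamp too, mirroring the consumer-side min(1.0, ...) guard.
    return max(0.0, min(1.0, raw / max_score))


# Hypothetical 3-keyword query with a phrase: per-query maximum is 12.
raw_score = 11.0
before = SEARCH_WEIGHT_KEYWORD * raw_score
after = SEARCH_WEIGHT_KEYWORD * normalize_keyword_score(raw_score, 3, True)
```

Because every score for a given query is divided by the same constant, ordering within the keyword channel is unchanged; only the blend with the other 0–1 channels differs.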
fe62186 to 3693734
Three server-side changes for date-aware recall:
1. Timestamp tiebreak (always-on). The two score-sort lambdas are
extracted into _score_sort_key, which appends the memory timestamp
(defensive ISO parse, epoch fallback) as the final descending key —
exact score ties now order newest-first deterministically. Existing
keys and their order are unchanged.
2. recency_bias relative-recency re-rank (param-gated, ships off).
New request param recency_bias ∈ {auto,on,off}; default from
RECALL_RECENCY_BIAS (default "off", invalid → "off"). When active,
candidate timestamps are min-max normalized across the current
candidate set after dedup/expansion/state filtering and before the
adaptive floor, and SEARCH_WEIGHT_TEMPORAL (default 0.1, negatives
clamp to 0) × relative_recency is added to each final score with a
"temporal" score component recorded. "auto" activates only when the
query matches the temporal-intent detector (query_has_temporal_intent
in automem/utils/time.py — pure, word-boundaried keyword regex:
latest / most recent / current(ly) / now / today / changed / updated
/ last time / newest / ...). The response echoes "recency_bias": "on"
only when the re-rank actually ran. Degenerate candidate sets
(single candidate, all-same-timestamp, unparseable timestamps)
contribute nothing — no div-by-zero, scores untouched.
3. Supersession chain-walk to head. _active_replacements_for_memories
previously resolved ONE hop of INVALIDATED_BY/EVOLVED_INTO; it now
iteratively re-queries whether each replacement is itself superseded
(batched per round with a per-head cache, bounded at
STATE_REPLACEMENT_MAX_DEPTH=5 hops, cycle-safe via per-source visited
sets) and surfaces the chain HEAD. Provenance contract preserved: the
head carries the first-hop _state_relation, so state_replaces and
relations[].from keep pointing at the originally suppressed memory.
Extra queries fire only when a first-hop replacement was found.
Honesty note: benchmark corpora carry no supersession edges, so the
chain-walk is production-correctness work, not a benchmark mover.
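To make item 3 concrete, here is a simplified sketch of a depth-bounded, cycle-safe walk to the chain head. It assumes an in-memory dict in place of the batched graph queries, and `resolve_head` is a hypothetical name, not `_active_replacements_for_memories`. Unlike the real code path, it returns the start node itself when nothing supersedes it.

```python
from typing import Dict

MAX_DEPTH = 5  # mirrors STATE_REPLACEMENT_MAX_DEPTH from the commit message


def resolve_head(start: str, superseded_by: Dict[str, str], max_depth: int = MAX_DEPTH) -> str:
    """Follow supersession edges from `start` until the head, a cycle, or the depth bound."""
    current = start
    visited = {start}  # per-source visited set makes the walk cycle-safe
    for _ in range(max_depth):
        nxt = superseded_by.get(current)
        if nxt is None or nxt in visited:
            break  # reached the head, or the chain loops back on itself
        visited.add(nxt)
        current = nxt
    return current


chain = {"A": "B", "B": "C"}            # A -> B -> C
cycle = {"A": "B", "B": "A"}            # A -> B -> A
long_chain = {f"n{i}": f"n{i + 1}" for i in range(9)}  # n0 -> ... -> n9

head = resolve_head("A", chain)
```

In this sketch the walk stops after five hops even if the chain continues, which trades completeness on pathological chains for a fixed upper bound on work.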
Defaults preserve behavior: RECALL_RECENCY_BIAS=off ⇒ identical scores;
the tiebreak only reorders exact ties and the chain-walk only changes
multi-hop supersession chains (both intended).
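The always-on tiebreak from item 1 can be pictured as a sort key whose final component is the negated timestamp. The sketch below is illustrative only: the real `_score_sort_key` keeps its existing keys ahead of the timestamp, whereas this version assumes a single `final_score` key, and `parse_epoch` is a hypothetical stand-in for the defensive ISO parse.

```python
from datetime import datetime
from typing import Any, Dict, Tuple


def parse_epoch(raw: Any) -> float:
    """ISO-8601 string -> epoch seconds; falls back to 0.0 (the epoch) on failure."""
    try:
        return datetime.fromisoformat(str(raw).replace("Z", "+00:00")).timestamp()
    except (ValueError, OverflowError, OSError):
        return 0.0


def score_sort_key(result: Dict[str, Any]) -> Tuple[float, float]:
    """Sort by score descending, then by timestamp descending (newest first)."""
    memory = result.get("memory") or {}
    return (-float(result.get("final_score", 0.0)), -parse_epoch(memory.get("timestamp")))


results = [
    {"id": "old", "final_score": 0.8, "memory": {"timestamp": "2026-01-01T00:00:00Z"}},
    {"id": "new", "final_score": 0.8, "memory": {"timestamp": "2026-06-01T00:00:00Z"}},
    {"id": "top", "final_score": 0.9, "memory": {"timestamp": "2025-01-01T00:00:00Z"}},
    {"id": "bad", "final_score": 0.8, "memory": {"timestamp": "garbage"}},
]
ranked_ids = [r["id"] for r in sorted(results, key=score_sort_key)]
```

The higher-scored `top` stays first regardless of age. The timestamp only decides among the three exact 0.8 ties, and the unparseable one sorts last because of the epoch fallback.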
Docs: API.md (recency_bias + chain-walk), ENVIRONMENT_VARIABLES.md and
CLAUDE.md (SEARCH_WEIGHT_TEMPORAL, RECALL_RECENCY_BIAS).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
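As a worked illustration of the `recency_bias` re-rank in item 2 above, the following sketch min-max normalizes candidate timestamps and adds a weighted bonus. Function names are hypothetical, and giving a zero bonus to an individual candidate with an unparseable timestamp (represented here as `None`) is an assumption about the real behavior.

```python
from typing import List, Optional

SEARCH_WEIGHT_TEMPORAL = 0.1  # default quoted in the commit message


def relative_recency(epochs: List[Optional[float]]) -> List[float]:
    """Min-max normalize timestamps to 0-1; degenerate sets contribute nothing."""
    valid = [e for e in epochs if e is not None]
    if len(valid) < 2:
        return [0.0] * len(epochs)
    lo, hi = min(valid), max(valid)
    if hi == lo:
        return [0.0] * len(epochs)  # all-same-timestamp: avoid dividing by zero
    return [0.0 if e is None else (e - lo) / (hi - lo) for e in epochs]


def apply_recency_bias(
    scores: List[float],
    epochs: List[Optional[float]],
    weight: float = SEARCH_WEIGHT_TEMPORAL,
) -> List[float]:
    """Add weight x relative recency to each score; negative weights clamp to 0."""
    weight = max(0.0, weight)
    return [s + weight * r for s, r in zip(scores, relative_recency(epochs))]


# An older, heavier fact (0.62) versus a newer, lighter one (0.58).
biased = apply_recency_bias([0.62, 0.58], [100.0, 200.0])
```

With the default weight the newer candidate gains the full 0.1 bonus and overtakes the older one, which is the conflicting-fact case the feature targets.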
…emEval
Flag-gated prompt change informed by the LongMemEval failure diagnosis
(58 failed-but-retrieved questions: 42 answer-construction incl. 27
abstentions-despite-hit, 7 missing-date-use, 4 ranking, 2
conflict-resolution):
- New config flag temporal_answer_hint (default False), plumbed like
use_temporal_hints, with a "temporal-answer" preset (baseline + flag)
and CLI choice for reproducible A/B runs. Recorded in the results
config block and run banner.
- generate_answer's prompt construction is extracted into
_build_answer_prompt. With the flag on: memories render in
chronological order with retrieval scores noted, plus conflict-recency
guidance ("when memories conflict, prefer the most recent one unless
the question asks about an earlier time") and anti-overabstention
guidance ("before answering 'I don't know', re-check each memory…").
Abstention remains possible — some questions require it.
- With the flag off the prompt is byte-identical to the historical one
(guarded by equality tests against a verbatim legacy copy, both
chain-of-note and plain variants).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
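A simplified sketch of the flag-gated rendering this commit describes. It is not `_build_answer_prompt`: the function name, the memory dict shape, the line format, and the sample dates are all hypothetical, and the flag-off branch here is a placeholder rather than the byte-identical legacy prompt. Only the conflict-recency guidance is included, because the commit message quotes the anti-overabstention text only partially.

```python
from typing import Any, Dict, List

CONFLICT_GUIDANCE = (
    "When memories conflict, prefer the most recent one unless "
    "the question asks about an earlier time."
)


def render_memories(memories: List[Dict[str, Any]], temporal_answer_hint: bool = False) -> str:
    """Render retrieved memories for the answer prompt."""
    if not temporal_answer_hint:
        # Placeholder for the legacy path: retrieval order, content only.
        return "\n".join(m["content"] for m in memories)
    # Zero-padded date strings sort chronologically under plain string comparison.
    ordered = sorted(memories, key=lambda m: m["session_date"])
    lines = [
        f"[{m['session_date']}] (score {m['score']:.2f}) {m['content']}"
        for m in ordered
    ]
    return "\n".join(lines + ["", CONFLICT_GUIDANCE])


memories = [
    {"session_date": "2023/05/20", "score": 0.71, "content": "Moved to Berlin."},
    {"session_date": "2023/01/05", "score": 0.84, "content": "Lives in Paris."},
]
hinted = render_memories(memories, temporal_answer_hint=True)
plain = render_memories(memories)
```

Keeping the flag-off branch untouched is what lets an A/B run attribute any delta to the hint alone.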
- recall.py: correct the recency_bias echo comment: the flag is echoed whenever the mode activates, even when the re-rank is a no-op
- docs/API.md: move recency_bias out of the sort-value enumeration into its own bullet and document echo vs temporal-component semantics
- test_api_endpoints.py: rename single-hop chain test to reflect that one extra chain round fires, and document the +1-query cost
- test_longmemeval.py: note why lexicographic session_date sort is chronological (zero-padded LongMemEval date format)
- test_answer_prompt.py: document why byte identity of the legacy chain-of-note prompt is load-bearing (benchmark comparability)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
b75f818 to fee64b3
…nce gate, date-aware ranking (#182, #193, #186, #187, #183, #184, #188) (#194)

## Release: ranking & recall series (develop → main)

⚠️ **Merge with a MERGE COMMIT; do not squash.** release-please needs the individual conventional commits below to compute the version and changelog for PR #154.

### What's in this release

| PR | Change | Default behavior |
|---|---|---|
| #182 | `feat(recall)`: configurable recency decay window/curve | unchanged (env-gated) |
| #193 (replaces #185) | `feat(recall)`: tag-score denominator cap fixes query-length bias | unchanged (`SEARCH_TAG_SCORE_TOKEN_CAP=0`) |
| #186 | `fix(recall)`: relevance gate, query-independent scoring gated on topical evidence (#130) | unchanged (gate off) |
| #187 | `feat(recall)`: date-aware ranking, `recency_bias=off\|on\|auto`, latest-fact selection (#158, #159) | `RECALL_RECENCY_BIAS=off`; adds deterministic timestamp tiebreak for near-ties |
| #183 | `feat(benchmarks)`: failure-mode diagnosis harness + judge quota preflight | tooling only |
| #184 | `fix(mcp)`: surface stored metadata + `updated_at` in detailed recall format (#111) | additive |
| #188 | `feat(enrichment)`: classification fallback-rate metrics in `/enrichment/status` | additive |

Plus: CI now runs on `develop` pushes/PRs; benchmark experiment log + README contribution-policy note.

### Verification evidence

- **Unit/lint/npm**: 625 pytest + 16 mcp-sse-server tests green on develop head; CI green.
- **Default-preserve**: recall-lab baseline on the 10k-memory production snapshot: develop defaults vs main pooled baseline identical aggregates (R@5 0.655 / R@10 0.710 / MRR 0.434 / NDCG@10 0.501). Two-stack probe run (main vs develop, defaults): 11/12 preserve-exact, remaining diffs are near-tie reorders (top-1 score deltas ≤ 5.4e-5, the #187 timestamp tiebreak).
- **Full judged 500q LongMemEval** (ship config: `RECALL_RECENCY_BIAS=auto` + `temporal-answer` harness): recall@5 96.6% (483/500), accuracy 86.0% (430/500), `judge_errors=0`, `memory_ingest_failures=0`.
- **Churn attribution** (targeted re-runs of all 17 churned questions on current-main-at-defaults and develop-at-defaults): 15/17 moved with #191 (already on main); the April canonical 97.2% floor is stale, and current main measures ~97.0%. Develop-at-defaults differs from current main by **1 question in 500** (a near-tie rank-5/6 flip from #187's deterministic tiebreak). Accuracy is within answerer replicate noise (identical-config reference runs flip 28/500 answers).
- Full detail: `benchmarks/EXPERIMENT_LOG.md` (2026-06-11 entry) and `benchmarks/results/lme_churn17_*` + `analyze_churn17.py`.

### Opt-in features shipped OFF

`RECALL_RELEVANCE_GATE` (validated at 0.40 on lab corpus; improves negative-probe precision) and `RECALL_RECENCY_BIAS=auto` (current-state query re-ranking). Neither affects default behavior; see `docs/ENVIRONMENT_VARIABLES.md`.

### After merging

release-please will update PR #154 (v0.16.0); merging *that* cuts the tag and publishes the `:stable` image, the actual user-facing deploy event for Railway template users.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Summary
Stacked on #186. Driven by the failure-mode diagnosis in #183 (58 failed-but-retrieved LongMemEval questions: answer-construction 42, missing-date-use 7, ranking 4).
Server (production + benchmark):
- Newest-first timestamp tiebreak for exact score ties (`_score_sort_key`).
- `recency_bias=auto|on|off` (ships `off` via `RECALL_RECENCY_BIAS`): after dedup/state-filter and before the adaptive floor, candidate timestamps are min-max normalized and `SEARCH_WEIGHT_TEMPORAL` (default 0.1) × relative recency is added, so the newest version of a conflicting fact can outrank an older, heavier one. `auto` triggers on temporal intent ("latest", "current", "what changed", …; word-boundaried, "currency"/"nowhere" safe).
- `current` state mode now resolves INVALIDATED_BY/EVOLVED_INTO chains to their head (A→B→C surfaces C, provenance still points at A; depth-bounded at 5, cycle-safe, batched). Honesty note: benchmark corpora carry no supersession edges, so this is a production-correctness fix, not a score mover.

Harness (benchmark-only, flag-gated for methodology reproducibility):
- `temporal_answer_hint` config flag (default off; new `temporal-answer` preset for A/B): chronological memory rendering with scores + conflict-recency guidance + anti-overabstention guidance (27 of the 58 failures were literal "I don't know" with the answer retrieved; abstention remains possible). Flag-off prompt is byte-identical to the canonical methodology (equality-tested).

Testing
40 new tests (tiebreak, bias flip with the shipped default weight, auto-detection, chain-walk depth/cycle/query-count guards, #158 preference latest-wins acceptance, prompt byte-identity). Full suite 563 passed, 12 skipped; black + flake8 clean.
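For reference, the word-boundaried `auto` trigger described in the summary can be approximated as below. This is a sketch, not `query_has_temporal_intent`: the keyword list contains only the terms named in this PR (the real list is longer), and the function name is hypothetical.

```python
import re

# Only the keywords named in the PR text; the actual detector covers more.
_TEMPORAL_INTENT = re.compile(
    r"\b(?:latest|most recent|currently|current|now|today|"
    r"changed|updated|last time|newest)\b",
    re.IGNORECASE,
)


def has_temporal_intent(query: str) -> bool:
    """Return True when the query contains a whole-word temporal keyword."""
    return bool(_TEMPORAL_INTENT.search(query or ""))


examples = {
    "What is my latest address?": has_temporal_intent("What is my latest address?"),
    "Which currency did I use?": has_temporal_intent("Which currency did I use?"),
    "It was nowhere to be found": has_temporal_intent("It was nowhere to be found"),
}
```

The `\b` anchors are what keep "nowhere" from matching on "now"; "currency" simply never contains "current".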
Validation plan before enabling anything
Lab IR A/B → automem-evals 22-probe zero-delta gate (`recency_bias=off` default keeps it zero-delta) → LongMemEval mini judge-off (recall@5 ≥ 97.2% floor) → judged mini → full run; `temporal-answer` preset A/B for the harness flag, with server-vs-harness deltas reported separately in EXPERIMENT_LOG.

Refs #158, #159
🤖 Generated with Claude Code