feat(recall): date-aware ranking + latest-fact selection (#158, #159) by jack-arturo · Pull Request #187 · verygoodplugins/automem

jack-arturo · 2026-06-11T01:46:05Z

Summary

Stacked on #186. Driven by the failure-mode diagnosis in #183 (58 failed-but-retrieved LongMemEval questions: answer-construction 42, missing-date-use 7, ranking 4).

Server (production + benchmark):

- Timestamp tiebreak (always-on): exact score ties now order newest-first deterministically (`_score_sort_key`).
- `recency_bias=auto|on|off` (ships off via `RECALL_RECENCY_BIAS`): after dedup/state-filter and before the adaptive floor, candidate timestamps are min-max normalized and `SEARCH_WEIGHT_TEMPORAL` (default 0.1) × relative recency is added, so the newest version of a conflicting fact can outrank an older, higher-scoring one. `auto` triggers on temporal intent ("latest", "current", "what changed", …; word-boundaried, "currency"/"nowhere" safe).
- Supersession chain-walk: current state mode now resolves `INVALIDATED_BY`/`EVOLVED_INTO` chains to their head (A→B→C surfaces C, provenance still points at A; depth-bounded at 5, cycle-safe, batched). Honesty note: benchmark corpora carry no supersession edges, so this is a production-correctness fix, not a score mover.
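
For readers who want to see the tiebreak concretely, here is a minimal sketch of the idea. It is not the actual `_score_sort_key` from `automem/api/recall.py`: the result shape (`final_score`, `memory.timestamp`) and the helper names `parse_ts` and `score_sort_key` are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


def parse_ts(value: Any) -> float:
    """Return epoch seconds, or 0.0 when the timestamp is missing or unparseable."""
    if not isinstance(value, str):
        return 0.0
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return 0.0
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def score_sort_key(result: Dict[str, Any]) -> Tuple[float, float]:
    """Score first, timestamp last; negation makes both keys descending."""
    score = float(result.get("final_score") or 0.0)
    ts = parse_ts((result.get("memory") or {}).get("timestamp"))
    return (-score, -ts)


results = [
    {"id": "old", "final_score": 0.8, "memory": {"timestamp": "2026-01-01T00:00:00Z"}},
    {"id": "new", "final_score": 0.8, "memory": {"timestamp": "2026-06-01T00:00:00Z"}},
    {"id": "top", "final_score": 0.9, "memory": {"timestamp": "2025-01-01T00:00:00Z"}},
]
ordered = [r["id"] for r in sorted(results, key=score_sort_key)]
print(ordered)
```

Because the timestamp is the last key, it only matters when every earlier key compares equal, which is why non-tied results keep their existing order.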

Harness (benchmark-only, flag-gated for methodology reproducibility):

temporal_answer_hint config flag (default off; new temporal-answer preset for A/B): chronological memory rendering with scores + conflict-recency guidance + anti-overabstention guidance (27 of the 58 failures were literal "I don't know" with the answer retrieved; abstention remains possible). Flag-off prompt is byte-identical to the canonical methodology (equality-tested).

Testing

40 new tests (tiebreak, bias flip with the shipped default weight, auto-detection, chain-walk depth/cycle/query-count guards, #158 preference latest-wins acceptance, prompt byte-identity). Full suite 563 passed, 12 skipped; black + flake8 clean.

Validation plan before enabling anything

Lab IR A/B → automem-evals 22-probe zero-delta gate (recency_bias=off default keeps it zero-delta) → LongMemEval mini judge-off (recall@5 ≥ 97.2% floor) → judged mini → full run; temporal-answer preset A/B for the harness flag, with server-vs-harness deltas reported separately in EXPERIMENT_LOG.

Refs #158, #159

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR improves recall behavior for time-sensitive questions by adding deterministic score tie-breaking, an optional timestamp-based recency re-rank, and deeper “current state” supersession resolution; it also adds a benchmark-only prompt variant to reduce LongMemEval answer-construction failures when the answer is retrieved but not used.

Changes:

- Add newest-first timestamp tiebreak for exact score ties (`_score_sort_key`) and expand “current state” filtering to walk `INVALIDATED_BY`/`EVOLVED_INTO` chains to their head (depth/cycle bounded).
- Introduce `recency_bias=auto|on|off` (env default `RECALL_RECENCY_BIAS=off`) plus `SEARCH_WEIGHT_TEMPORAL` to add a min-max-normalized recency bonus for score ordering when activated.
- Add LongMemEval harness flag `temporal_answer_hint` (and preset) to render memories chronologically with scores + temporal guidance while keeping the default prompt byte-identical.
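
The recency bonus described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `recall.py`: the function name `recency_bonuses` and the input shape (a list of epoch seconds with `None` for unparseable timestamps) are hypothetical; only the 0.1 default weight, the min-max normalization, and the negative-weight clamp come from the PR text.

```python
from typing import List, Optional

SEARCH_WEIGHT_TEMPORAL = 0.1  # default weight quoted in the PR


def recency_bonuses(
    epochs: List[Optional[float]], weight: float = SEARCH_WEIGHT_TEMPORAL
) -> List[float]:
    """Return weight × min-max-normalized recency per candidate (0.0 if unusable)."""
    weight = max(0.0, weight)  # negative weights clamp to 0
    known = [e for e in epochs if e is not None]
    if len(known) < 2:
        # Single candidate or no parseable timestamps: nothing to normalize.
        return [0.0] * len(epochs)
    lo, hi = min(known), max(known)
    span = hi - lo
    if span <= 0:
        # All timestamps identical: avoid dividing by zero, leave scores alone.
        return [0.0] * len(epochs)
    return [0.0 if e is None else weight * (e - lo) / span for e in epochs]


# An older memory with a higher base score versus a newer, lower-scored one.
scores = [0.62, 0.55]
epochs = [1_700_000_000.0, 1_750_000_000.0]
biased = [s + b for s, b in zip(scores, recency_bonuses(epochs))]
print(biased)
```

In this toy case the newer candidate gets the full bonus and overtakes the older one. With the default weight, a flip like this is only possible when the base scores are within 0.1 of each other.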

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file

| File | Description |
|---|---|
| `automem/api/recall.py` | Implements score tie-break by timestamp, recency-bias re-rank, and supersession chain-walk for current-state filtering. |
| `automem/config.py` | Adds `SEARCH_WEIGHT_TEMPORAL` parsing and `RECALL_RECENCY_BIAS` env default handling. |
| `automem/utils/time.py` | Adds `query_has_temporal_intent()` heuristic for `recency_bias=auto`. |
| `tests/test_api_endpoints.py` | Adds endpoint/unit-style tests for tie-break, recency bias modes, temporal intent detection, and chain-walk semantics/guards. |
| `tests/benchmarks/longmemeval/test_longmemeval.py` | Adds harness prompt builder toggle (`temporal_answer_hint`) and preset plumbing/logging. |
| `tests/benchmarks/longmemeval/test_answer_prompt.py` | Adds byte-identity tests for prompt control + coverage for temporal-answer prompt behavior. |
| `tests/benchmarks/longmemeval/configs.py` | Adds `temporal_answer_hint` config field and new `temporal-answer` preset. |
| `docs/ENVIRONMENT_VARIABLES.md` | Documents `SEARCH_WEIGHT_TEMPORAL` and `RECALL_RECENCY_BIAS`. |
| `docs/API.md` | Documents newest-first tie-breaking, `recency_bias`, and chain-head supersession behavior in recall. |
| `CLAUDE.md` | Updates example env configuration with the new variables. |
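
To illustrate what a word-boundaried temporal-intent heuristic looks like, here is a small sketch. It is not the real `query_has_temporal_intent()` in `automem/utils/time.py`: the function name `has_temporal_intent` is hypothetical, and the keyword list is only the subset quoted in this PR (the full list is elided there).

```python
import re

# Partial keyword list taken from the PR text; the real detector covers more terms.
_TEMPORAL_INTENT = re.compile(
    r"\b(latest|most recent|current|currently|now|today|changed|updated|"
    r"last time|newest)\b",
    re.IGNORECASE,
)


def has_temporal_intent(query: str) -> bool:
    """True when the query contains a temporal keyword as a whole word or phrase."""
    return bool(_TEMPORAL_INTENT.search(query or ""))


for q in ["What is my latest address?", "currency exchange rates", "nowhere to go"]:
    print(q, "->", has_temporal_intent(q))
```

The `\b` anchors are what keep "currency" and "nowhere" from matching "current" and "now".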

jack-arturo · 2026-06-11T16:09:13Z

+        for res in results:
+            parsed_ts = _parse_iso_datetime((res.get("memory") or {}).get("timestamp"))
+            epochs.append(parsed_ts.timestamp() if parsed_ts is not None else None)


`recency_bias` now reuses a defensive timestamp conversion helper and skips conversion failures instead of returning a 500 (automem/api/recall.py:395). Added a regression test for timestamp conversion errors (tests/test_api_endpoints.py:4014). GitHub marked the original line outdated after the push.
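
A sketch of what "skipping conversion failures" can mean in practice. The helper name `safe_epoch` is hypothetical and this is not the helper the PR reuses. It assumes ISO 8601 string timestamps and maps anything that cannot be parsed or converted to `None`, so one bad row is left out of the normalization instead of failing the request.

```python
from datetime import datetime
from typing import Any, Optional


def safe_epoch(value: Any) -> Optional[float]:
    """Convert an ISO 8601 string to epoch seconds, or None on any failure."""
    if not isinstance(value, str) or not value:
        return None
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        # timestamp() is documented to raise OverflowError or OSError for
        # values outside the platform's supported range.
        return parsed.timestamp()
    except (ValueError, OverflowError, OSError):
        return None


rows = [
    {"memory": {"timestamp": "1970-01-01T00:00:00Z"}},
    {"memory": {"timestamp": "garbage"}},
    {"memory": None},
]
epochs = [safe_epoch((r.get("memory") or {}).get("timestamp")) for r in rows]
print(epochs)
```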

…nge (#191)

Fixes #190.

## Problem

`_graph_keyword_search` returned the **raw Cypher additive score** (+2 per keyword contained in content, +1 per keyword in any tag, summed over all extracted keywords, plus a +2/+1 whole-phrase bonus), so a K-keyword query can score up to **3K+3** while every other channel (vector cosine, metadata, trending importance) lives in 0–1. Observed during the 2026-06-11 production forensics: a tag-scoped exact-content match returned `keyword=11.0`, `final_score=4.03`.

Consequences:

- `SEARCH_WEIGHT_KEYWORD (0.35) × 11 = 3.85`: a keyword hit trumps any vector/metadata/importance combination, scaling with *query length* rather than match quality.
- Defeats `RECALL_RELEVANCE_GATE` semantics (PR #186): `evidence = max(vector, keyword, metadata, exact)` assumes 0–1 components; `evidence = 11` sails past any gate.

## Fix

1. **Producer**: normalize the raw score by its per-query maximum (`3·len(keywords) + 3` when a phrase is present; `3` in the phrase-only branch) before it leaves `_graph_keyword_search`. This is a monotone per-query transform: within-channel ordering (and the Cypher `ORDER BY`) is unchanged; only cross-channel blending changes, which is the point.
2. **Consumer**: defensively clamp the keyword component to `min(1.0, …)` in `_compute_metadata_score`, so no future producer can break the 0–1 contract or the gate again.

## Verification

- 4 new tests in `tests/test_keyword_score_normalization.py` (TDD'd against the bug, including the literal `keyword=11.0` repro). Full suite: **503 passed**.
- **Production-corpus lab A/B** (10,142-memory snapshot, 200 queries, vs the pooled 3-run parity baseline from the 2026-06-11 release sweep):
  - Recall@5 −0.2pp, Recall@10 −0.7pp, MRR −0.008, NDCG@10 −0.007 (all within baseline run-to-run variance; paired t-test p=0.32)
  - Per-query: **196/197 unchanged**, 0 improved, 1 regressed
  - The single flip is the intended behavior change made visible: the expected memory held rank 1 *only* via the inflated keyword score (ranks 2–5 identical before/after). It's in the fallback-typed `Memory` cohort (MRR 0.15 baseline, the known data-quality cohort from #188's classification incident).

## Notes

- This lands on main independently of the #182–#187 chain; #186's gate evidence check is the main beneficiary once the chain rebases over it.
- Trending (`importance`), metadata (capped), and vector (cosine) channels were verified already bounded; the graph keyword channel was the only unbounded producer.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
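
The producer/consumer fix can be sketched as a single function. This is an illustration, not the code in `_graph_keyword_search`: the name `normalize_keyword_score` and its parameters are hypothetical, and treating the keywords-without-phrase maximum as `3·K` is an assumption (the description only spells out the phrase-present and phrase-only maxima). The 3-keyword example is also hypothetical, since the source does not say how many keywords the `keyword=11.0` query had.

```python
def normalize_keyword_score(raw: float, keyword_count: int, has_phrase: bool) -> float:
    """Scale the additive raw score into 0-1 by its per-query maximum."""
    # Each keyword can add at most 2 (content) + 1 (tag) = 3;
    # the whole-phrase bonus adds at most another 2 + 1 = 3.
    max_score = 3 * keyword_count + (3 if has_phrase else 0)
    if max_score <= 0:
        return 0.0
    # The clamp mirrors the consumer-side min(1.0, ...) guard.
    return min(1.0, max(0.0, raw / max_score))


SEARCH_WEIGHT_KEYWORD = 0.35  # weight quoted in the PR description

# Hypothetical query: 3 keywords plus a phrase, raw score 11 out of a possible 12.
component = normalize_keyword_score(11.0, keyword_count=3, has_phrase=True)
print(round(component, 4), round(SEARCH_WEIGHT_KEYWORD * component, 4))
```

Dividing by a constant that depends only on the query keeps the within-channel ranking intact, which matches the "monotone per-query transform" claim above.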

Three server-side changes for date-aware recall:

1. Timestamp tiebreak (always-on). The two score-sort lambdas are extracted into _score_sort_key, which appends the memory timestamp (defensive ISO parse, epoch fallback) as the final descending key, so exact score ties now order newest-first deterministically. Existing keys and their order are unchanged.

2. recency_bias relative-recency re-rank (param-gated, ships off). New request param recency_bias ∈ {auto,on,off}; default from RECALL_RECENCY_BIAS (default "off", invalid → "off"). When active, candidate timestamps are min-max normalized across the current candidate set after dedup/expansion/state filtering and before the adaptive floor, and SEARCH_WEIGHT_TEMPORAL (default 0.1, negatives clamp to 0) × relative_recency is added to each final score with a "temporal" score component recorded. "auto" activates only when the query matches the temporal-intent detector (query_has_temporal_intent in automem/utils/time.py, a pure, word-boundaried keyword regex: latest / most recent / current(ly) / now / today / changed / updated / last time / newest / ...). The response echoes "recency_bias": "on" only when the re-rank actually ran. Degenerate candidate sets (single candidate, all-same-timestamp, unparseable timestamps) contribute nothing: no div-by-zero, scores untouched.

3. Supersession chain-walk to head. _active_replacements_for_memories previously resolved ONE hop of INVALIDATED_BY/EVOLVED_INTO; it now iteratively re-queries whether each replacement is itself superseded (batched per round with a per-head cache, bounded at STATE_REPLACEMENT_MAX_DEPTH=5 hops, cycle-safe via per-source visited sets) and surfaces the chain HEAD. Provenance contract preserved: the head carries the first-hop _state_relation, so state_replaces and relations[].from keep pointing at the originally suppressed memory. Extra queries fire only when a first-hop replacement was found.

Honesty note: benchmark corpora carry no supersession edges, so the chain-walk is production-correctness work, not a benchmark mover.

Defaults preserve behavior: RECALL_RECENCY_BIAS=off ⇒ identical scores; the tiebreak only reorders exact ties and the chain-walk only changes multi-hop supersession chains (both intended).

Docs: API.md (recency_bias + chain-walk), ENVIRONMENT_VARIABLES.md and CLAUDE.md (SEARCH_WEIGHT_TEMPORAL, RECALL_RECENCY_BIAS).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
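
The chain-walk in item 3 can be sketched in memory as below. This is not `_active_replacements_for_memories`: the function name `resolve_heads` is hypothetical, and a plain dict stands in for the graph, where each round would instead be one batched query over INVALIDATED_BY/EVOLVED_INTO edges. Counting the first hop toward the 5-hop bound is also an assumption.

```python
from typing import Dict, List, Set

MAX_DEPTH = 5  # mirrors STATE_REPLACEMENT_MAX_DEPTH from the commit message


def resolve_heads(
    sources: List[str], replaced_by: Dict[str, str], max_depth: int = MAX_DEPTH
) -> Dict[str, str]:
    """Map each superseded source id to the head of its supersession chain."""
    heads: Dict[str, str] = {}
    visited: Dict[str, Set[str]] = {}
    # First hop: only sources that actually have a replacement are tracked,
    # so later rounds do no work when nothing was superseded.
    for src in sources:
        if src in replaced_by:
            heads[src] = replaced_by[src]
            visited[src] = {src, replaced_by[src]}
    # Remaining hops, one round per hop (a batched graph query in a real system).
    for _ in range(max_depth - 1):
        advanced = False
        for src, head in list(heads.items()):
            nxt = replaced_by.get(head)
            if nxt is None or nxt in visited[src]:
                continue  # reached the head, or hit a cycle
            heads[src] = nxt
            visited[src].add(nxt)
            advanced = True
        if not advanced:
            break
    return heads


print(resolve_heads(["A", "X"], {"A": "B", "B": "C"}))
```

The result is keyed by the originally suppressed id, so provenance can keep pointing at A while C is what gets surfaced.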

…emEval

Flag-gated prompt change informed by the LongMemEval failure diagnosis (58 failed-but-retrieved questions: 42 answer-construction incl. 27 abstentions-despite-hit, 7 missing-date-use, 4 ranking, 2 conflict-resolution):

- New config flag temporal_answer_hint (default False), plumbed like use_temporal_hints, with a "temporal-answer" preset (baseline + flag) and CLI choice for reproducible A/B runs. Recorded in the results config block and run banner.
- generate_answer's prompt construction is extracted into _build_answer_prompt. With the flag on: memories render in chronological order with retrieval scores noted, plus conflict-recency guidance ("when memories conflict, prefer the most recent one unless the question asks about an earlier time") and anti-overabstention guidance ("before answering 'I don't know', re-check each memory…"). Abstention remains possible, since some questions require it.
- With the flag off the prompt is byte-identical to the historical one (guarded by equality tests against a verbatim legacy copy, both chain-of-note and plain variants).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
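
A rough sketch of how a flag-gated prompt builder can keep the control path byte-identical. None of this is the harness code: `legacy_prompt`, `build_answer_prompt`, the memory fields, the date format, and all prompt wording except the quoted conflict guidance are placeholders, because the real prompt text is not shown in this PR.

```python
from typing import Dict, List

CONFLICT_GUIDANCE = (
    "When memories conflict, prefer the most recent one unless the question "
    "asks about an earlier time."
)


def legacy_prompt(question: str, memories: List[Dict]) -> str:
    """Stand-in for the historical prompt; the real wording is an unknown."""
    body = "\n".join(m["content"] for m in memories)
    return f"Memories:\n{body}\n\nQuestion: {question}\nAnswer:"


def build_answer_prompt(
    question: str, memories: List[Dict], temporal_answer_hint: bool = False
) -> str:
    if not temporal_answer_hint:
        # Flag off: delegate untouched so the output stays byte-identical.
        return legacy_prompt(question, memories)
    # Zero-padded dates sort chronologically as plain strings.
    ordered = sorted(memories, key=lambda m: m["session_date"])
    body = "\n".join(
        f"[{m['session_date']}] (score {m['score']:.2f}) {m['content']}"
        for m in ordered
    )
    return (
        f"Memories (oldest first):\n{body}\n\n{CONFLICT_GUIDANCE}\n\n"
        f"Question: {question}\nAnswer:"
    )


memories = [
    {"session_date": "2023/05/20", "score": 0.91, "content": "I moved to Denver."},
    {"session_date": "2023/02/03", "score": 0.95, "content": "I live in Austin."},
]
control = build_answer_prompt("Where do I live?", memories)
hinted = build_answer_prompt("Where do I live?", memories, temporal_answer_hint=True)
print(hinted)
```

Routing the flag-off path through the unchanged legacy function is one simple way to make an equality test against a verbatim copy meaningful.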

- recall.py: correct the recency_bias echo comment: the flag is echoed whenever the mode activates, even when the re-rank is a no-op
- docs/API.md: move recency_bias out of the sort-value enumeration into its own bullet and document echo vs temporal-component semantics
- test_api_endpoints.py: rename single-hop chain test to reflect that one extra chain round fires, and document the +1-query cost
- test_longmemeval.py: note why lexicographic session_date sort is chronological (zero-padded LongMemEval date format)
- test_answer_prompt.py: document why byte identity of the legacy chain-of-note prompt is load-bearing (benchmark comparability)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…nce gate, date-aware ranking (#182, #193, #186, #187, #183, #184, #188) (#194)

## Release: ranking & recall series (develop → main)

⚠️ **Merge with a MERGE COMMIT; do not squash.** release-please needs the individual conventional commits below to compute the version and changelog for PR #154.

### What's in this release

| PR | Change | Default behavior |
|---|---|---|
| #182 | `feat(recall)`: configurable recency decay window/curve | unchanged (env-gated) |
| #193 (replaces #185) | `feat(recall)`: tag-score denominator cap fixes query-length bias | unchanged (`SEARCH_TAG_SCORE_TOKEN_CAP=0`) |
| #186 | `fix(recall)`: relevance gate, query-independent scoring gated on topical evidence (#130) | unchanged (gate off) |
| #187 | `feat(recall)`: date-aware ranking, `recency_bias=off\|on\|auto`, latest-fact selection (#158, #159) | `RECALL_RECENCY_BIAS=off`; adds deterministic timestamp tiebreak for near-ties |
| #183 | `feat(benchmarks)`: failure-mode diagnosis harness + judge quota preflight | tooling only |
| #184 | `fix(mcp)`: surface stored metadata + `updated_at` in detailed recall format (#111) | additive |
| #188 | `feat(enrichment)`: classification fallback-rate metrics in `/enrichment/status` | additive |

Plus: CI now runs on `develop` pushes/PRs; benchmark experiment log + README contribution-policy note.

### Verification evidence

- **Unit/lint/npm**: 625 pytest + 16 mcp-sse-server tests green on develop head; CI green.
- **Default-preserve**: recall-lab baseline on the 10k-memory production snapshot: develop defaults vs main pooled baseline identical aggregates (R@5 0.655 / R@10 0.710 / MRR 0.434 / NDCG@10 0.501). Two-stack probe run (main vs develop, defaults): 11/12 preserve-exact, remaining diffs are near-tie reorders (top-1 score deltas ≤ 5.4e-5, the #187 timestamp tiebreak).
- **Full judged 500q LongMemEval** (ship config: `RECALL_RECENCY_BIAS=auto` + `temporal-answer` harness): recall@5 96.6% (483/500), accuracy 86.0% (430/500), `judge_errors=0`, `memory_ingest_failures=0`.
- **Churn attribution** (targeted re-runs of all 17 churned questions on current-main-at-defaults and develop-at-defaults): 15/17 moved with #191 (already on main), so the April canonical 97.2% floor is stale; current main measures ~97.0%. Develop-at-defaults differs from current main by **1 question in 500** (a near-tie rank-5/6 flip from #187's deterministic tiebreak). Accuracy is within answerer replicate noise (identical-config reference runs flip 28/500 answers).
- Full detail: `benchmarks/EXPERIMENT_LOG.md` (2026-06-11 entry) and `benchmarks/results/lme_churn17_*` + `analyze_churn17.py`.

### Opt-in features shipped OFF

`RECALL_RELEVANCE_GATE` (validated at 0.40 on lab corpus; improves negative-probe precision) and `RECALL_RECENCY_BIAS=auto` (current-state query re-ranking). Neither affects default behavior; see `docs/ENVIRONMENT_VARIABLES.md`.

### After merging

release-please will update PR #154 (v0.16.0); merging *that* cuts the tag and publishes the `:stable` image, which is the actual user-facing deploy event for Railway template users.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Copilot AI review requested due to automatic review settings June 11, 2026 01:46

Copilot started reviewing on behalf of jack-arturo June 11, 2026 01:46 View session

Copilot AI reviewed Jun 11, 2026

View reviewed changes

jack-arturo force-pushed the fix/130-relevance-gate branch from 7dac6d1 to fe62186 Compare June 11, 2026 02:40

jack-arturo force-pushed the feat/date-aware-ranking branch from ae34c42 to 5a6fa23 Compare June 11, 2026 02:41

jack-arturo mentioned this pull request Jun 11, 2026

fix(recall): normalize graph keyword scores into the 0-1 component range #191

Merged

jack-arturo force-pushed the fix/130-relevance-gate branch from fe62186 to 3693734 Compare June 11, 2026 19:03

Base automatically changed from fix/130-relevance-gate to develop June 11, 2026 19:03

jack-arturo and others added 4 commits June 11, 2026 21:04

fix(recall): guard recency bias timestamp conversion

fee64b3

jack-arturo force-pushed the feat/date-aware-ranking branch from b75f818 to fee64b3 Compare June 11, 2026 19:04

jack-arturo merged commit a6ed945 into develop Jun 11, 2026
4 checks passed

jack-arturo deleted the feat/date-aware-ranking branch June 11, 2026 19:04

jack-arturo mentioned this pull request Jun 11, 2026

chore(main): release 0.16.0 #154

Open

jack-arturo merged 4 commits into develop from feat/date-aware-ranking
