
feat(recall): date-aware ranking + latest-fact selection (#158, #159)#187

Merged
jack-arturo merged 4 commits into develop from feat/date-aware-ranking
Jun 11, 2026

@jack-arturo

Member

Summary

Stacked on #186. Driven by the failure-mode diagnosis in #183 (58 failed-but-retrieved LongMemEval questions: answer-construction 42, missing-date-use 7, ranking 4).

Server (production + benchmark):

  • Timestamp tiebreak (always-on): exact score ties now order newest-first deterministically (_score_sort_key).
  • recency_bias=auto|on|off (ships off via RECALL_RECENCY_BIAS): after dedup/state-filter and before the adaptive floor, candidate timestamps are min-max normalized and SEARCH_WEIGHT_TEMPORAL (default 0.1) × relative recency is added — so the newest version of a conflicting fact can outrank an older, heavier one. auto triggers on temporal intent ("latest", "current", "what changed", …; word-boundaried, "currency"/"nowhere" safe).
  • Supersession chain-walk: current state mode now resolves INVALIDATED_BY/EVOLVED_INTO chains to their head (A→B→C surfaces C, provenance still points at A; depth-bounded at 5, cycle-safe, batched). Honesty note: benchmark corpora carry no supersession edges — this is a production-correctness fix, not a score mover.
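The word-boundary behaviour behind `recency_bias=auto` can be sketched as follows. This is a minimal illustration, not the project's detector: the function name `has_temporal_intent` and the exact keyword list are assumptions, and the real `query_has_temporal_intent` may cover more phrases.

```python
import re

# Hypothetical keyword list: the PR names only some of the real triggers.
_TEMPORAL_PATTERN = re.compile(
    r"\b(?:latest|newest|most\s+recent|current|currently|now|today|"
    r"changed|updated|last\s+time)\b",
    re.IGNORECASE,
)


def has_temporal_intent(query: str) -> bool:
    """Return True when the query contains a whole-word temporal keyword."""
    return bool(_TEMPORAL_PATTERN.search(query or ""))
```

Because every alternative is wrapped in `\b`, a query containing "nowhere" does not match on "now", and "currency" does not match on "current".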

Harness (benchmark-only, flag-gated for methodology reproducibility):

  • temporal_answer_hint config flag (default off; new temporal-answer preset for A/B): chronological memory rendering with scores + conflict-recency guidance + anti-overabstention guidance (27 of the 58 failures were literal "I don't know" with the answer retrieved; abstention remains possible). Flag-off prompt is byte-identical to the canonical methodology (equality-tested).

Testing

40 new tests (tiebreak, bias flip with the shipped default weight, auto-detection, chain-walk depth/cycle/query-count guards, #158 preference latest-wins acceptance, prompt byte-identity). Full suite 563 passed, 12 skipped; black + flake8 clean.

Validation plan before enabling anything

Lab IR A/B → automem-evals 22-probe zero-delta gate (recency_bias=off default keeps it zero-delta) → LongMemEval mini judge-off (recall@5 ≥ 97.2% floor) → judged mini → full run; temporal-answer preset A/B for the harness flag, with server-vs-harness deltas reported separately in EXPERIMENT_LOG.

Refs #158, #159

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings June 11, 2026 01:46

Copilot AI left a comment

Contributor

Pull request overview

This PR improves recall behavior for time-sensitive questions by adding deterministic score tie-breaking, an optional timestamp-based recency re-rank, and deeper “current state” supersession resolution; it also adds a benchmark-only prompt variant to reduce LongMemEval answer-construction failures when the answer is retrieved but not used.

Changes:

  • Add newest-first timestamp tiebreak for exact score ties (_score_sort_key) and expand “current state” filtering to walk INVALIDATED_BY/EVOLVED_INTO chains to their head (depth/cycle bounded).
  • Introduce recency_bias=auto|on|off (env default RECALL_RECENCY_BIAS=off) plus SEARCH_WEIGHT_TEMPORAL to add a min-max-normalized recency bonus for score ordering when activated.
  • Add LongMemEval harness flag temporal_answer_hint (and preset) to render memories chronologically with scores + temporal guidance while keeping the default prompt byte-identical.
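A rough sketch of the min-max-normalized recency bonus described above, assuming scores and timestamps arrive as parallel lists. The function name `apply_recency_bonus` is hypothetical, and giving a zero bonus to a single candidate with an unparseable timestamp is an assumption; the PR only states that degenerate candidate sets leave scores untouched.

```python
from typing import List, Optional


def apply_recency_bonus(
    scores: List[float],
    epochs: List[Optional[float]],
    weight: float = 0.1,  # mirrors the documented SEARCH_WEIGHT_TEMPORAL default
) -> List[float]:
    """Add weight * relative_recency to each score; degenerate sets pass through."""
    weight = max(0.0, weight)  # negative weights clamp to 0
    known = [e for e in epochs if e is not None]
    if len(known) < 2:
        return list(scores)  # single candidate or no usable timestamps
    oldest, newest = min(known), max(known)
    span = newest - oldest
    if span <= 0:
        return list(scores)  # all timestamps identical: avoid dividing by zero
    boosted = []
    for score, epoch in zip(scores, epochs):
        # Assumption: a candidate without a usable timestamp gets no bonus.
        recency = 0.0 if epoch is None else (epoch - oldest) / span
        boosted.append(score + weight * recency)
    return boosted
```

With the default weight, an older memory scoring 0.80 and a newer one scoring 0.75 become 0.80 and 0.85, so the newer version of a conflicting fact moves ahead.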

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| automem/api/recall.py | Implements score tie-break by timestamp, recency-bias re-rank, and supersession chain-walk for current-state filtering. |
| automem/config.py | Adds SEARCH_WEIGHT_TEMPORAL parsing and RECALL_RECENCY_BIAS env default handling. |
| automem/utils/time.py | Adds query_has_temporal_intent() heuristic for recency_bias=auto. |
| tests/test_api_endpoints.py | Adds endpoint/unit-style tests for tie-break, recency bias modes, temporal intent detection, and chain-walk semantics/guards. |
| tests/benchmarks/longmemeval/test_longmemeval.py | Adds harness prompt builder toggle (temporal_answer_hint) and preset plumbing/logging. |
| tests/benchmarks/longmemeval/test_answer_prompt.py | Adds byte-identity tests for prompt control + coverage for temporal-answer prompt behavior. |
| tests/benchmarks/longmemeval/configs.py | Adds temporal_answer_hint config field and new temporal-answer preset. |
| docs/ENVIRONMENT_VARIABLES.md | Documents SEARCH_WEIGHT_TEMPORAL and RECALL_RECENCY_BIAS. |
| docs/API.md | Documents newest-first tie-breaking, recency_bias, and chain-head supersession behavior in recall. |
| CLAUDE.md | Updates example env configuration with the new variables. |

Comment thread automem/api/recall.py Outdated
Comment on lines +2223 to +2225
    for res in results:
        parsed_ts = _parse_iso_datetime((res.get("memory") or {}).get("timestamp"))
        epochs.append(parsed_ts.timestamp() if parsed_ts is not None else None)

Member Author

Reused a defensive timestamp conversion helper for recency_bias, skipping conversion failures instead of 500ing — automem/api/recall.py:395. Added a regression test for timestamp conversion errors — tests/test_api_endpoints.py:4014. GitHub marked the original line outdated after the push.
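For readers without the diff at hand, a defensive conversion of this kind might look like the sketch below. The helper name `safe_epoch` is hypothetical and this is not the code at `automem/api/recall.py:395`; it only illustrates returning `None` on any parse or conversion failure so that one bad timestamp cannot fail the whole request.

```python
from datetime import datetime, timezone
from typing import Optional


def safe_epoch(raw: object) -> Optional[float]:
    """Convert an ISO-8601 string to epoch seconds, or None if that fails."""
    if not isinstance(raw, str) or not raw.strip():
        return None
    try:
        text = raw.strip()
        if text.endswith("Z"):
            # Older Python versions do not accept a trailing "Z".
            text = text[:-1] + "+00:00"
        parsed = datetime.fromisoformat(text)
        if parsed.tzinfo is None:
            # Assumption: naive timestamps are treated as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.timestamp()
    except (ValueError, OverflowError, OSError):
        # Malformed or out-of-range values are skipped rather than raised.
        return None
```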

@jack-arturo jack-arturo force-pushed the fix/130-relevance-gate branch from 7dac6d1 to fe62186 Compare June 11, 2026 02:40
@jack-arturo jack-arturo force-pushed the feat/date-aware-ranking branch from ae34c42 to 5a6fa23 Compare June 11, 2026 02:41
jack-arturo added a commit that referenced this pull request Jun 11, 2026
…nge (#191)

Fixes #190.

## Problem

`_graph_keyword_search` returned the **raw Cypher additive score** — +2
per keyword contained in content, +1 per keyword in any tag, summed over
all extracted keywords, plus a +2/+1 whole-phrase bonus — so a K-keyword
query can score up to **3K+3** while every other channel (vector cosine,
metadata, trending importance) lives in 0–1.

Observed during the 2026-06-11 production forensics: a tag-scoped
exact-content match returned `keyword=11.0`, `final_score=4.03`.
Consequences:

- `SEARCH_WEIGHT_KEYWORD (0.35) × 11 = 3.85` — a keyword hit trumps any
vector/metadata/importance combination, scaling with *query length*
rather than match quality.
- Defeats `RECALL_RELEVANCE_GATE` semantics (PR #186): `evidence =
max(vector, keyword, metadata, exact)` assumes 0–1 components; `evidence
= 11` sails past any gate.

## Fix

1. **Producer**: normalize the raw score by its per-query maximum
(`3·len(keywords) + 3` when a phrase is present; `3` in the phrase-only
branch) before it leaves `_graph_keyword_search`. This is a monotone
per-query transform — within-channel ordering (and the Cypher `ORDER
BY`) is unchanged; only cross-channel blending changes, which is the
point.
2. **Consumer**: defensively clamp the keyword component to `min(1.0,
…)` in `_compute_metadata_score`, so no future producer can break the
0–1 contract or the gate again.
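The two-step fix can be illustrated with a small sketch. The function names are hypothetical, and the ceiling for a keywords-only query (no phrase bonus) is an assumption; the description above states only the `3·len(keywords) + 3` and phrase-only `3` cases.

```python
def max_keyword_score(num_keywords: int, has_phrase: bool) -> int:
    """Per-query ceiling: +2 content and +1 tag per keyword, plus +2/+1 for the phrase."""
    # Assumption: with no phrase the ceiling is simply 3 * num_keywords.
    return 3 * num_keywords + (3 if has_phrase else 0)


def normalize_keyword_score(raw: float, num_keywords: int, has_phrase: bool) -> float:
    """Producer side: scale the raw additive score into the 0-1 range."""
    ceiling = max_keyword_score(num_keywords, has_phrase)
    if ceiling <= 0:
        return 0.0
    return raw / ceiling


def clamp_keyword_component(value: float) -> float:
    """Consumer side: never let the keyword component exceed 1.0."""
    return min(1.0, max(0.0, value))
```

Taking the observed raw score of 11.0 and assuming, purely for illustration, a three-keyword query with a phrase (ceiling 12), the normalized component is about 0.92, so the weighted contribution drops from 0.35 × 11 = 3.85 to roughly 0.35 × 0.92 ≈ 0.32.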

## Verification

- 4 new tests in `tests/test_keyword_score_normalization.py` (TDD'd
against the bug, including the literal `keyword=11.0` repro). Full
suite: **503 passed**.
- **Production-corpus lab A/B** (10,142-memory snapshot, 200 queries, vs
the pooled 3-run parity baseline from the 2026-06-11 release sweep):
- Recall@5 −0.2pp, Recall@10 −0.7pp, MRR −0.008, NDCG@10 −0.007 (all
within baseline run-to-run variance; paired t-test p=0.32)
  - Per-query: **196/197 unchanged**, 0 improved, 1 regressed
- The single flip is the intended behavior change made visible: the
expected memory held rank 1 *only* via the inflated keyword score (ranks
2–5 identical before/after). It's in the fallback-typed `Memory` cohort
(MRR 0.15 baseline — the known data-quality cohort from #188's
classification incident).

## Notes

- This lands on main independently of the #182–#187 chain; #186's gate
evidence check is the main beneficiary once the chain rebases over it.
- Trending (`importance`), metadata (capped), and vector (cosine)
channels were verified already bounded; the graph keyword channel was
the only unbounded producer.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
@jack-arturo jack-arturo force-pushed the fix/130-relevance-gate branch from fe62186 to 3693734 Compare June 11, 2026 19:03
Base automatically changed from fix/130-relevance-gate to develop June 11, 2026 19:03
jack-arturo and others added 4 commits June 11, 2026 21:04
Three server-side changes for date-aware recall:

1. Timestamp tiebreak (always-on). The two score-sort lambdas are
   extracted into _score_sort_key, which appends the memory timestamp
   (defensive ISO parse, epoch fallback) as the final descending key —
   exact score ties now order newest-first deterministically. Existing
   keys and their order are unchanged.

2. recency_bias relative-recency re-rank (param-gated, ships off).
   New request param recency_bias ∈ {auto,on,off}; default from
   RECALL_RECENCY_BIAS (default "off", invalid → "off"). When active,
   candidate timestamps are min-max normalized across the current
   candidate set after dedup/expansion/state filtering and before the
   adaptive floor, and SEARCH_WEIGHT_TEMPORAL (default 0.1, negatives
   clamp to 0) × relative_recency is added to each final score with a
   "temporal" score component recorded. "auto" activates only when the
   query matches the temporal-intent detector (query_has_temporal_intent
   in automem/utils/time.py — pure, word-boundaried keyword regex:
   latest / most recent / current(ly) / now / today / changed / updated
   / last time / newest / ...). The response echoes "recency_bias": "on"
   only when the re-rank actually ran. Degenerate candidate sets
   (single candidate, all-same-timestamp, unparseable timestamps)
   contribute nothing — no div-by-zero, scores untouched.

3. Supersession chain-walk to head. _active_replacements_for_memories
   previously resolved ONE hop of INVALIDATED_BY/EVOLVED_INTO; it now
   iteratively re-queries whether each replacement is itself superseded
   (batched per round with a per-head cache, bounded at
   STATE_REPLACEMENT_MAX_DEPTH=5 hops, cycle-safe via per-source visited
   sets) and surfaces the chain HEAD. Provenance contract preserved: the
   head carries the first-hop _state_relation, so state_replaces and
   relations[].from keep pointing at the originally suppressed memory.
   Extra queries fire only when a first-hop replacement was found.
   Honesty note: benchmark corpora carry no supersession edges, so the
   chain-walk is production-correctness work, not a benchmark mover.
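The chain-walk in item 3 can be pictured with the in-memory sketch below. It is a simplification under stated assumptions: a plain dict stands in for the INVALIDATED_BY/EVOLVED_INTO edges, the function name `resolve_heads` is hypothetical, and the batching and per-head cache of the real implementation are omitted.

```python
from typing import Dict, Iterable

MAX_DEPTH = 5  # mirrors STATE_REPLACEMENT_MAX_DEPTH from the commit message


def resolve_heads(
    sources: Iterable[str],
    successor: Dict[str, str],
    max_depth: int = MAX_DEPTH,
) -> Dict[str, str]:
    """Map each superseded memory id to the head of its supersession chain."""
    heads: Dict[str, str] = {}
    for source in sources:
        current = source
        visited = {source}  # per-source visited set makes the walk cycle-safe
        for _ in range(max_depth):
            nxt = successor.get(current)
            if nxt is None or nxt in visited:
                break  # reached the head, or detected a cycle
            visited.add(nxt)
            current = nxt
        if current != source:
            heads[source] = current  # provenance still keys on the original id
    return heads
```

For a chain A→B→C, `resolve_heads(["A"], {"A": "B", "B": "C"})` maps A to C, while the dict key keeps pointing at the originally suppressed memory.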

Defaults preserve behavior: RECALL_RECENCY_BIAS=off ⇒ identical scores;
the tiebreak only reorders exact ties and the chain-walk only changes
multi-hop supersession chains (both intended).
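As an illustration of why the tiebreak only reorders exact ties, here is a simplified sort key. It is a sketch, not the project's `_score_sort_key`: the real key carries additional existing components, and the `final_score` and `memory.timestamp` field layout is assumed from the snippets quoted in this thread.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


def score_sort_key(result: Dict[str, Any]) -> Tuple[float, float]:
    """Sort by score descending, then by timestamp descending (newest first)."""
    score = float(result.get("final_score") or 0.0)
    raw = (result.get("memory") or {}).get("timestamp")
    try:
        parsed = datetime.fromisoformat(str(raw).replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        epoch = parsed.timestamp()
    except (ValueError, OverflowError, OSError):
        epoch = 0.0  # epoch fallback for missing or unparseable timestamps
    # Negating both values gives descending order under an ascending sort.
    return (-score, -epoch)
```

Because the timestamp is the last element of the tuple, it is consulted only when every earlier element compares equal, so results with different scores keep their existing order.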

Docs: API.md (recency_bias + chain-walk), ENVIRONMENT_VARIABLES.md and
CLAUDE.md (SEARCH_WEIGHT_TEMPORAL, RECALL_RECENCY_BIAS).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…emEval

Flag-gated prompt change informed by the LongMemEval failure diagnosis
(58 failed-but-retrieved questions: 42 answer-construction incl. 27
abstentions-despite-hit, 7 missing-date-use, 4 ranking, 2
conflict-resolution):

- New config flag temporal_answer_hint (default False), plumbed like
  use_temporal_hints, with a "temporal-answer" preset (baseline + flag)
  and CLI choice for reproducible A/B runs. Recorded in the results
  config block and run banner.
- generate_answer's prompt construction is extracted into
  _build_answer_prompt. With the flag on: memories render in
  chronological order with retrieval scores noted, plus conflict-recency
  guidance ("when memories conflict, prefer the most recent one unless
  the question asks about an earlier time") and anti-overabstention
  guidance ("before answering 'I don't know', re-check each memory…").
  Abstention remains possible — some questions require it.
- With the flag off the prompt is byte-identical to the historical one
  (guarded by equality tests against a verbatim legacy copy, both
  chain-of-note and plain variants).
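The flag-gating pattern described above can be sketched as follows. Everything here is hypothetical: the prompt wording, the memory field names (`content`, `date`, `score`), and the helper `legacy_prompt` are stand-ins, and the second guidance sentence is truncated in the source, so only its quoted opening is used.

```python
from typing import Dict, List, Union

Memory = Dict[str, Union[str, float]]


def legacy_prompt(question: str, memories: List[Memory]) -> str:
    """Stand-in for the historical prompt; the real wording is not shown here."""
    body = "\n".join(f"- {m['content']}" for m in memories)
    return f"Memories:\n{body}\n\nQuestion: {question}\nAnswer:"


def build_answer_prompt(
    question: str,
    memories: List[Memory],
    temporal_answer_hint: bool = False,
) -> str:
    """Return the legacy prompt unchanged unless the flag is on."""
    if not temporal_answer_hint:
        return legacy_prompt(question, memories)  # byte-identical path
    ordered = sorted(memories, key=lambda m: str(m["date"]))  # oldest first
    body = "\n".join(
        f"- [{m['date']}] (score {float(m['score']):.2f}) {m['content']}"
        for m in ordered
    )
    guidance = (
        "When memories conflict, prefer the most recent one unless the "
        "question asks about an earlier time.\n"
        "Before answering 'I don't know', re-check each memory."
    )
    return (
        f"Memories (oldest first):\n{body}\n\n{guidance}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Routing the flag-off branch through the untouched legacy function is one way to make an equality test against a verbatim copy meaningful: any accidental edit to the default path fails the comparison.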

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- recall.py: correct the recency_bias echo comment — the flag is echoed
  whenever the mode activates, even when the re-rank is a no-op
- docs/API.md: move recency_bias out of the sort-value enumeration into
  its own bullet and document echo vs temporal-component semantics
- test_api_endpoints.py: rename single-hop chain test to reflect that
  one extra chain round fires, and document the +1-query cost
- test_longmemeval.py: note why lexicographic session_date sort is
  chronological (zero-padded LongMemEval date format)
- test_answer_prompt.py: document why byte identity of the legacy
  chain-of-note prompt is load-bearing (benchmark comparability)
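The point about lexicographic sorting can be checked with a short example. The date strings below assume a zero-padded `YYYY/MM/DD (Day) HH:MM` layout for LongMemEval session dates; treat that format as an assumption rather than something this thread confirms.

```python
from datetime import datetime

# Assumed LongMemEval-style session dates (zero-padded fields).
dates = [
    "2023/05/30 (Tue) 23:40",
    "2023/05/04 (Thu) 09:15",
    "2022/12/31 (Sat) 08:00",
]


def parse(session_date: str) -> datetime:
    """Parse the date prefix and time suffix, ignoring the weekday label."""
    return datetime.strptime(session_date[:10] + " " + session_date[-5:], "%Y/%m/%d %H:%M")


lexicographic = sorted(dates)
chronological = sorted(dates, key=parse)

# Without zero padding, string order and time order diverge.
unpadded = sorted(["2023/5/4", "2023/12/1"])
```

Here `lexicographic` and `chronological` are the same list, whereas `unpadded` puts the December date before the May one because `"1"` sorts before `"5"`.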

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jack-arturo jack-arturo force-pushed the feat/date-aware-ranking branch from b75f818 to fee64b3 Compare June 11, 2026 19:04
@jack-arturo jack-arturo merged commit a6ed945 into develop Jun 11, 2026
4 checks passed
@jack-arturo jack-arturo deleted the feat/date-aware-ranking branch June 11, 2026 19:04
jack-arturo added a commit that referenced this pull request Jun 12, 2026
…nce gate, date-aware ranking (#182, #193, #186, #187, #183, #184, #188) (#194)

## Release: ranking & recall series (develop → main)

⚠️ **Merge with a MERGE COMMIT — do not squash.** release-please needs
the individual conventional commits below to compute the version and
changelog for PR #154.

### What's in this release

| PR | Change | Default behavior |
|---|---|---|
| #182 | `feat(recall)`: configurable recency decay window/curve | unchanged (env-gated) |
| #193 (replaces #185) | `feat(recall)`: tag-score denominator cap fixes query-length bias | unchanged (`SEARCH_TAG_SCORE_TOKEN_CAP=0`) |
| #186 | `fix(recall)`: relevance gate, query-independent scoring gated on topical evidence (#130) | unchanged (gate off) |
| #187 | `feat(recall)`: date-aware ranking, `recency_bias=off\|on\|auto`, latest-fact selection (#158, #159) | `RECALL_RECENCY_BIAS=off`; adds deterministic timestamp tiebreak for near-ties |
| #183 | `feat(benchmarks)`: failure-mode diagnosis harness + judge quota preflight | tooling only |
| #184 | `fix(mcp)`: surface stored metadata + `updated_at` in detailed recall format (#111) | additive |
| #188 | `feat(enrichment)`: classification fallback-rate metrics in `/enrichment/status` | additive |

Plus: CI now runs on `develop` pushes/PRs; benchmark experiment log +
README contribution-policy note.

### Verification evidence

- **Unit/lint/npm**: 625 pytest + 16 mcp-sse-server tests green on
develop head; CI green.
- **Default-preserve**: recall-lab baseline on the 10k-memory production
snapshot — develop defaults vs main pooled baseline identical aggregates
(R@5 0.655 / R@10 0.710 / MRR 0.434 / NDCG@10 0.501). Two-stack probe
run (main vs develop, defaults): 11/12 preserve-exact, remaining diffs
are near-tie reorders (top-1 score deltas ≤ 5.4e-5, the #187 timestamp
tiebreak).
- **Full judged 500q LongMemEval** (ship config:
`RECALL_RECENCY_BIAS=auto` + `temporal-answer` harness): recall@5 96.6%
(483/500), accuracy 86.0% (430/500), `judge_errors=0`,
`memory_ingest_failures=0`.
- **Churn attribution** (targeted re-runs of all 17 churned questions on
current-main-at-defaults and develop-at-defaults): 15/17 moved with #191
(already on main) — the April canonical 97.2% floor is stale; current
main measures ~97.0%. Develop-at-defaults differs from current main by
**1 question in 500** (a near-tie rank-5/6 flip from #187's
deterministic tiebreak). Accuracy is within answerer replicate noise
(identical-config reference runs flip 28/500 answers).
- Full detail: `benchmarks/EXPERIMENT_LOG.md` (2026-06-11 entry) and
`benchmarks/results/lme_churn17_*` + `analyze_churn17.py`.

### Opt-in features shipped OFF

`RECALL_RELEVANCE_GATE` (validated at 0.40 on lab corpus; improves
negative-probe precision) and `RECALL_RECENCY_BIAS=auto` (current-state
query re-ranking). Neither affects default behavior; see
`docs/ENVIRONMENT_VARIABLES.md`.

### After merging

release-please will update PR #154 (v0.16.0); merging *that* cuts the
tag and publishes the `:stable` image — the actual user-facing deploy
event for Railway template users.

🤖 Generated with [Claude Code](https://claude.com/claude-code)