feat(recall): ranking release — recency config, tag-score cap, relevance gate, date-aware ranking (#182, #193, #186, #187, #183, #184, #188) by jack-arturo · Pull Request #194 · verygoodplugins/automem

jack-arturo · 2026-06-12T14:35:53Z

Release: ranking & recall series (develop → main)

⚠️ Merge with a MERGE COMMIT — do not squash. release-please needs the individual conventional commits below to compute the version and changelog for PR #154.

What's in this release

| PR | Change | Default behavior |
| --- | --- | --- |
| #182 | `feat(recall)`: configurable recency decay window/curve | unchanged (env-gated) |
| #193 (replaces #185) | `feat(recall)`: tag-score denominator cap fixes query-length bias | unchanged (`SEARCH_TAG_SCORE_TOKEN_CAP=0`) |
| #186 | `fix(recall)`: relevance gate, query-independent scoring gated on topical evidence (#130) | unchanged (gate off) |
| #187 | `feat(recall)`: date-aware ranking, `recency_bias=off\|on\|auto`, latest-fact selection (#158, #159) | `RECALL_RECENCY_BIAS=off`; adds deterministic timestamp tiebreak for near-ties |
| #183 | `feat(benchmarks)`: failure-mode diagnosis harness + judge quota preflight | tooling only |
| #184 | `fix(mcp)`: surface stored metadata + `updated_at` in detailed recall format (#111) | additive |
| #188 | `feat(enrichment)`: classification fallback-rate metrics in `/enrichment/status` | additive |

Plus: CI now runs on develop pushes/PRs; benchmark experiment log + README contribution-policy note.

Verification evidence

Unit/lint/npm: 625 pytest + 16 mcp-sse-server tests green on develop head; CI green.
Default-preserve: on the recall-lab baseline over the 10k-memory production snapshot, develop defaults and the main pooled baseline produce identical aggregates (R@5 0.655 / R@10 0.710 / MRR 0.434 / NDCG@10 0.501). Two-stack probe run (main vs develop, defaults): 11/12 preserve-exact; the remaining diffs are near-tie reorders (top-1 score deltas ≤ 5.4e-5, from the #187 timestamp tiebreak).
Full judged 500q LongMemEval (ship config: RECALL_RECENCY_BIAS=auto + temporal-answer harness): recall@5 96.6% (483/500), accuracy 86.0% (430/500), judge_errors=0, memory_ingest_failures=0.
Churn attribution (targeted re-runs of all 17 churned questions on current-main-at-defaults and develop-at-defaults): 15/17 moved with #191 (fix(recall): normalize graph keyword scores into the 0-1 component range; already on main), so the April canonical 97.2% floor is stale; current main measures ~97.0%. Develop-at-defaults differs from current main by 1 question in 500 (a near-tie rank-5/6 flip from #187's deterministic tiebreak). Accuracy is within answerer replicate noise (identical-config reference runs flip 28/500 answers).
Full detail: benchmarks/EXPERIMENT_LOG.md (2026-06-11 entry) and benchmarks/results/lme_churn17_* + analyze_churn17.py.

Opt-in features shipped OFF

RECALL_RELEVANCE_GATE (validated at 0.40 on lab corpus; improves negative-probe precision) and RECALL_RECENCY_BIAS=auto (current-state query re-ranking). Neither affects default behavior; see docs/ENVIRONMENT_VARIABLES.md.

After merging

release-please will update PR #154 (v0.16.0); merging that cuts the tag and publishes the :stable image — the actual user-facing deploy event for Railway template users.

🤖 Generated with Claude Code

Pools prod_parity runs as the baseline and reports per-config metric deltas with a paired t-test on per-query recall@5, plus per-category R@10 deltas. Companion to scripts/lab/run_pipeline_20260611.sh. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

develop is the new integration branch: feature PRs target develop and main only moves via validated release merges, so Railway users tracking main see one deploy event per release instead of one per PR. docker-build and release-please intentionally stay main-only. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

## Summary

- `SEARCH_RECENCY_WINDOW_DAYS` (default 180) and `SEARCH_RECENCY_CURVE` (`linear`|`exp`, default `linear`) replace the hardcoded 180-day linear decay in `_compute_recency_score()`. Defaults are bit-for-bit identical to current behavior; `exp` treats the window as a half-life.
- Window values ≤ 0 fall back to 180 (a `0` would otherwise 500 every recall with `ZeroDivisionError`; negatives produce unbounded scores).
- `scripts/browse_memories.py diagnose` now simulates recency with the same env config instead of its own hardcoded copy.
- Lab harness fix: `run_recall_test.py` restarted the API container only for `SEARCH_WEIGHT_*` keys, so sweeps of any other `SEARCH_*` var silently tested the baseline; prefix widened to `SEARCH_`.

## Why

Recency was dead for any memory older than 180 days (and for entire benchmark corpora, which are dated 2023): zero temporal discrimination among conflicting facts. This is the substrate for the date-aware ranking work (#158/#159) and makes the window/curve sweepable via `make lab-sweep`.

## Testing

- 10 new unit tests (defaults, linear/exp curves, guards, monkeypatched windows, config domain validation)
- Full suite: 497 passed, 12 skipped; black + flake8 clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
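To make the window/curve behavior concrete, here is a minimal sketch of a recency score with the two curves and the ≤ 0 guard described above. It is not the repository's `_compute_recency_score()`; the function name `recency_score`, its signature, and the exact linear formula (`1 - age / window`, clamped at 0) are assumptions chosen to match the description (zero score past the window, half-life semantics for `exp`).

```python
DEFAULT_WINDOW_DAYS = 180.0  # default stated in the PR description


def recency_score(
    age_days: float,
    window_days: float = DEFAULT_WINDOW_DAYS,
    curve: str = "linear",
) -> float:
    """Hypothetical recency score in [0, 1]; newer memories score higher."""
    # Guard: a window of 0 would divide by zero, and a negative window
    # would let the linear curve grow without bound, so fall back to 180.
    if window_days <= 0:
        window_days = DEFAULT_WINDOW_DAYS

    age_days = max(age_days, 0.0)  # treat future timestamps as "just now"

    if curve == "exp":
        # The window acts as a half-life: the score halves every window_days.
        return 0.5 ** (age_days / window_days)

    # Linear decay: reaches 0 at the window edge and stays there.
    return max(0.0, 1.0 - age_days / window_days)
```

Under these assumptions a 400-day-old memory scores 0 on the linear curve but stays above 0 on the `exp` curve, which is the temporal discrimination the "Why" section says was missing.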

Continuation of #185, which GitHub auto-closed (and refused to reopen) when its stacked base branch `feat/tunable-recency` was deleted on #182's merge. Identical branch and content, now rebased onto `develop` with #182 included. See #185 for the full description, validation evidence (production-corpus lab A/B that flipped the default to `SEARCH_TAG_SCORE_TOKEN_CAP=0`), and review history. Part of the develop-branch integration series for the ranking release.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

…n tag scope (#130) (#186)

## Summary

Stacked on #185.

**Corrected root cause for #130.** The issue hypothesized tags act as score boosters that outrank scoped results. Code inspection shows user-passed `tags` are already a hard gate on every base path (Qdrant must-filter, graph WHERE, metadata sidecar, tag-only fallback, final post-filter). What actually happened in the repro: the gate constrained the pool to flint-tagged memories, the Forge memories were *excluded*, and within the surviving pool query-independent components (importance 0.9 × 0.1 weight + recency + tag crumbs) dominated near-zero topical evidence: confident-looking garbage with no signal the pool was gated.

Fixes, with tag semantics untouched:

- **Relevance gate**: `evidence = max(vector, keyword, metadata, exact)`; when query tokens exist and `evidence < RECALL_RELEVANCE_GATE`, importance/confidence/recency/relevance/tag components are scaled by `evidence/gate` (linear ramp). **Ships at 0.0 = exactly current behavior**; the enabled value comes from the eval funnel (lab sweep grid 0.10–0.25 + 22-probe + negative-control gates in automem-evals). `components` now carry `evidence` + `relevance_gated`.
- **`tag_scope` response diagnostics** when tags are passed: `{filtered, pool_size_hint, gated_low_evidence}`. Recall is no longer silent about gating.
- **Opt-in `scope_fallback=true`**: fills remaining slots with unscoped vector results flagged `outside_tag_scope: true`, appended after scoped results, with full filter parity (min_score, time, exclude_tags, edge-based current-state suppression); only the tag scope is lifted. In-scope memories can never reappear as "outside scope" fills (guards are mutation-tested).
- **Doc corrections**: MCP `tag_match` description claimed default "exact"; the API default is `prefix`. `tags` documented as a hard scope filter (use `context_tags` to boost). Per-path semantics table in `docs/API.md`.

## Testing

20 new tests including a gate-0 bit-identity matrix, linear-ramp midpoint, route-level rerank + diagnostics, fill-resurrection mutation kills, and MCP rendering of the fill flag. Full suite 523 passed, 12 skipped; Node 14/14; black + flake8 clean.

Refs #130 (will update the issue with this corrected analysis).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
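The relevance gate described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the code in `automem/utils/scoring.py`: the function name `apply_relevance_gate`, the dictionary layout, and the component key names are hypothetical; only the `max(...)` evidence rule, the `evidence/gate` linear ramp, and the `evidence` / `relevance_gated` outputs come from the description.

```python
EVIDENCE_KEYS = ("vector", "keyword", "metadata", "exact")
# Query-independent components that the gate scales down (assumed key names).
GATED_KEYS = ("importance", "confidence", "recency", "relevance", "tag")


def apply_relevance_gate(
    components: dict,
    gate: float,
    has_query_tokens: bool = True,
) -> dict:
    """Return a copy of `components` with the gate applied."""
    out = dict(components)
    evidence = max(out.get(key, 0.0) for key in EVIDENCE_KEYS)
    out["evidence"] = evidence

    # gate == 0.0 (the shipped default) leaves every component untouched.
    gated = has_query_tokens and gate > 0 and evidence < gate
    out["relevance_gated"] = gated

    if gated:
        scale = evidence / gate  # linear ramp: 0 at no evidence, 1 at the gate
        for key in GATED_KEYS:
            if key in out:
                out[key] = out[key] * scale
    return out
```

With `gate=0.4` and a best topical signal of 0.2, the ramp sits at its midpoint and an importance of 0.9 is halved to 0.45, so a high-importance but off-topic memory no longer looks confident.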

…#187)

## Summary

Stacked on #186. Driven by the failure-mode diagnosis in #183 (58 failed-but-retrieved LongMemEval questions: answer-construction 42, missing-date-use 7, ranking 4).

Server (production + benchmark):

- **Timestamp tiebreak** (always-on): exact score ties now order newest-first deterministically (`_score_sort_key`).
- **`recency_bias=auto|on|off`** (ships `off` via `RECALL_RECENCY_BIAS`): after dedup/state-filter and before the adaptive floor, candidate timestamps are min-max normalized and `SEARCH_WEIGHT_TEMPORAL` (default 0.1) × relative recency is added, so the newest version of a conflicting fact can outrank an older, heavier one. `auto` triggers on temporal intent ("latest", "current", "what changed", …; word-boundaried, "currency"/"nowhere" safe).
- **Supersession chain-walk**: `current` state mode now resolves INVALIDATED_BY/EVOLVED_INTO chains to their head (A→B→C surfaces C, provenance still points at A; depth-bounded at 5, cycle-safe, batched). *Honesty note: benchmark corpora carry no supersession edges; this is a production-correctness fix, not a score mover.*

Harness (benchmark-only, flag-gated for methodology reproducibility):

- **`temporal_answer_hint`** config flag (default off; new `temporal-answer` preset for A/B): chronological memory rendering with scores + conflict-recency guidance + anti-overabstention guidance (27 of the 58 failures were literal "I don't know" with the answer retrieved; abstention remains possible). Flag-off prompt is byte-identical to the canonical methodology (equality-tested).

## Testing

40 new tests (tiebreak, bias flip with the shipped default weight, auto-detection, chain-walk depth/cycle/query-count guards, #158 preference latest-wins acceptance, prompt byte-identity). Full suite 563 passed, 12 skipped; black + flake8 clean.
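As an illustration of the recency-bias step and the timestamp tiebreak, the sketch below min-max normalizes candidate timestamps, adds `weight × relative recency` to each score, and sorts newest-first on ties. The `Candidate` type and `rerank_with_recency_bias` function are hypothetical names; treating a zero timestamp span as "no bias" is also an assumption, since the description does not say how that case is handled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    memory_id: str
    score: float      # base recall score before the temporal term
    timestamp: float  # epoch seconds


def rerank_with_recency_bias(
    candidates: list[Candidate],
    temporal_weight: float = 0.1,  # default of SEARCH_WEIGHT_TEMPORAL per the PR
) -> list[Candidate]:
    """Add weight * min-max-normalized recency, then sort best-first."""
    if not candidates:
        return []

    oldest = min(c.timestamp for c in candidates)
    newest = max(c.timestamp for c in candidates)
    span = newest - oldest

    def biased_score(c: Candidate) -> float:
        # Assumption: if every candidate shares one timestamp, add nothing.
        relative = (c.timestamp - oldest) / span if span > 0 else 0.0
        return c.score + temporal_weight * relative

    # Primary key: biased score. Secondary key: timestamp, so exact ties
    # resolve newest-first and the order is deterministic.
    return sorted(
        candidates,
        key=lambda c: (biased_score(c), c.timestamp),
        reverse=True,
    )
```

With the default weight, a newer fact scored 0.75 overtakes an older one scored 0.80, which is the "newest version can outrank an older, heavier one" behavior; with `temporal_weight=0` only the tiebreak remains.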
## Validation plan before enabling anything

Lab IR A/B → automem-evals 22-probe zero-delta gate (recency_bias=off default keeps it zero-delta) → LongMemEval mini judge-off (recall@5 ≥ 97.2% floor) → judged mini → full run; `temporal-answer` preset A/B for the harness flag, with server-vs-harness deltas reported separately in EXPERIMENT_LOG.

Refs #158, #159

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
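The supersession chain-walk from this commit's summary can be pictured with a small in-memory sketch. The real implementation resolves graph edges in batches; here a plain dictionary mapping each memory to its successor stands in for the INVALIDATED_BY/EVOLVED_INTO edges, and the function name `resolve_head` is hypothetical. Only the depth bound of 5 and the cycle safety come from the description.

```python
from typing import Mapping


def resolve_head(
    start_id: str,
    successor: Mapping[str, str],
    max_depth: int = 5,
) -> str:
    """Follow successor links from start_id to the newest reachable version."""
    current = start_id
    seen = {start_id}
    for _ in range(max_depth):  # depth bound: at most max_depth hops
        nxt = successor.get(current)
        if nxt is None or nxt in seen:  # end of chain, or a cycle
            break
        seen.add(nxt)
        current = nxt
    return current
```

For the chain A→B→C this returns C, while a caller can keep `start_id` as the provenance pointer. A two-node cycle stops after one hop instead of looping.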

…quota preflight (#183)

## Summary

- `tests/benchmarks/longmemeval/diagnose_failures.py`: two-stage classifier for failed-but-retrieved questions (`is_correct=false AND recall_hit_at_5=true`). Stage 1 is pure code (answer rank, abstention-despite-hit, stale-candidate-above-answer, noise ratio, date-math detection → documented 8-rule mode ladder); stage 2 (`--llm`) labels each failure with the pinned judge and reports a stage-1/stage-2 agreement matrix (transport errors counted and excluded).
- `tests/benchmarks/judge_preflight.py`: one minimal judge call before any judged run; exits non-zero with an actionable message on 429/insufficient_quota or auth failure. Wired into `test-longmemeval-benchmark.sh` (only when `--llm-eval`). Would have caught the June 6 quota-compromised runs before they started.
- Harness instrumentation: `retrieved_session_ids_full` (all 10 recalled memories + scores) recorded per question; this closes the rank-6-10 blind spot for future runs.

## Findings on the canonical run (87.0% accuracy, recall@5 97.2%)

58 of 69 failures had the answer retrieved in the top 5. LLM labels (judge `gpt-5.4-mini-2026-03-17`): **answer-construction 42** (27 of them literal "I don't know" abstentions with the answer in context), missing-date-use 7, ranking 4, conflict-resolution 2, retrieval-gap 2, outdated-fact 1. This reprioritized the release: the harness answer-assembly path (see `feat/date-aware-ranking`) is the biggest benchmark lever; pure ranking fixes are the production-quality lever.

Report artifact: `benchmarks/results/failure_modes_canonical_llm_20260611.json` (local, gitignored).

## Testing

31 new tests (synthetic fixtures, mocked clients, no network); full suite 518 passed, 12 skipped; black + flake8 clean.

Refs #158, #159 (both issues require this failure-mode classification as their first acceptance criterion).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
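The preflight idea (one cheap judge call, then a non-zero exit with an actionable message on quota or auth failure) might look roughly like this. Everything here is hypothetical scaffolding rather than the contents of `judge_preflight.py`: the `JudgeCallError` type, the injected `call_judge` callable, and the message wording are assumptions that keep the core testable without a network.

```python
from typing import Callable


class JudgeCallError(Exception):
    """Hypothetical error carrying an HTTP-like status and provider code."""

    def __init__(self, status: int, code: str = "") -> None:
        super().__init__(f"judge call failed: status={status} code={code}")
        self.status = status
        self.code = code


def judge_preflight(call_judge: Callable[[], str]) -> tuple[int, str]:
    """Make one minimal judge call; return (exit_code, message)."""
    try:
        call_judge()
    except JudgeCallError as exc:
        if exc.status == 429 or exc.code == "insufficient_quota":
            return 1, "Judge quota exhausted: restore quota before starting a judged run."
        if exc.status in (401, 403):
            return 1, "Judge authentication failed: check the judge API key."
        return 1, f"Judge preflight failed: {exc}"
    return 0, "Judge preflight OK."
```

A CLI wrapper would print the message and pass the first element to `sys.exit`, so a shell script can abort before spending time on a run whose judged results would be unusable.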

…ormat (#184)

## Summary

The REST `/recall` API already returns `metadata`, `updated_at`, and `last_accessed`; the gap was the MCP server's `detailed` format, which omitted them (making custom metadata effectively write-only for MCP agents, the actual complaint in #111).

- `formatRecallAsItems()` detailed branch now renders an `Updated:` line and a **size-capped** single-line `Metadata:` JSON (300 chars + ellipsis; omitted when empty), capped because a prior raw-dump attempt was rejected for verbosity.
- `json` format was already a raw passthrough; now pinned by a transport-level test.
- `text`/`items` formats unchanged.
- New REST contract test (`test_recall_metadata_roundtrip`) locks the server-side behavior; `docs/METADATA_BEHAVIOR.md` corrected (it over-claimed that detailed already exposed metadata).

## Testing

- Node: 15/15 (`npm test`, includes truncation-boundary and empty-metadata cases)
- Python: 488 passed, 12 skipped; black + flake8 clean

Closes #111

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
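The size-capped metadata line can be illustrated with a short sketch. The actual change lives in the JavaScript `formatRecallAsItems()`; the Python below is only a restatement of the described rule (single-line JSON, cap plus ellipsis, nothing when empty), and the function name `metadata_preview` and its `limit` parameter are assumptions.

```python
import json
from typing import Optional


def metadata_preview(metadata: Optional[dict], limit: int = 300) -> Optional[str]:
    """Render a one-line, size-capped 'Metadata:' entry, or None if empty."""
    if not metadata:
        return None  # omit the line entirely for missing or empty metadata

    # json.dumps without `indent` always produces a single line.
    text = json.dumps(metadata, ensure_ascii=False)
    if len(text) > limit:
        text = text[:limit] + "…"  # cap the payload and mark the truncation
    return f"Metadata: {text}"
```

Keeping the cap as a parameter makes the truncation boundary easy to test at exactly `limit` and `limit + 1` characters.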

…ichment/status (#188)

## Summary

On June 10, OpenAI quota exhaustion silently degraded every store-time classification to the fallback `("Memory", 0.3)` path: 78% of recent writes, invisible to every health surface (the classifier only logged; `/enrichment/status` counts only enrichment-worker activity, and classification runs in the API store path). 31.1% of the production corpus is now fallback-typed.

- New `ClassificationStats` on `ServiceState` (mirrors `EnrichmentStats` style): `llm_attempts`, `llm_successes`, `fallbacks`, `pattern_classifications`, `last_error`, `last_error_at`.
- `MemoryClassifier` threads an optional stats object (backward compatible, `None`-guarded); the inner LLM except stashes the real error (reset per call) so the fallback records *why*.
- `/enrichment/status` gains a `classification` block: no new endpoint, no new MCP tool. `health_monitor`/alerting can now watch the fallback rate.

## Testing

9 unit tests + an end-to-end route test driving a 429-raising then succeeding stub through the blueprint-shared state. Full suite 497 passed, 12 skipped; black + flake8 clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
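A rough sketch of what such a stats object could look like follows. The six field names are taken from the description above; the lock, the `record_*` method names, the `to_dict()` shape, and the `fallback_rate` definition (fallbacks divided by all completed classifications) are assumptions, not the code in `automem/service_state.py`.

```python
import threading
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ClassificationStats:
    llm_attempts: int = 0
    llm_successes: int = 0
    fallbacks: int = 0
    pattern_classifications: int = 0
    last_error: Optional[str] = None
    last_error_at: Optional[str] = None
    _lock: threading.Lock = field(
        default_factory=threading.Lock, init=False, repr=False, compare=False
    )

    def record_llm_attempt(self) -> None:
        with self._lock:
            self.llm_attempts += 1

    def record_llm_success(self) -> None:
        with self._lock:
            self.llm_successes += 1

    def record_pattern(self) -> None:
        with self._lock:
            self.pattern_classifications += 1

    def record_fallback(self, error: Optional[str] = None) -> None:
        with self._lock:
            self.fallbacks += 1
            if error is not None:  # remember why the fallback happened
                self.last_error = error
                self.last_error_at = datetime.now(timezone.utc).isoformat()

    def to_dict(self) -> dict:
        with self._lock:
            done = self.llm_successes + self.fallbacks + self.pattern_classifications
            return {
                "llm_attempts": self.llm_attempts,
                "llm_successes": self.llm_successes,
                "fallbacks": self.fallbacks,
                "pattern_classifications": self.pattern_classifications,
                "last_error": self.last_error,
                "last_error_at": self.last_error_at,
                # Assumed definition: share of completed classifications
                # that ended on the fallback path.
                "fallback_rate": self.fallbacks / done if done else 0.0,
            }
```

A status route could embed `stats.to_dict()` under a `classification` key, which gives alerting a single ratio to watch instead of log lines.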


Copilot

Pull request overview

This release PR merges the “ranking & recall” series from develop into main, introducing new recall-scoring controls (recency tuning, relevance gating, optional scope fallback, and date-aware re-ranking), improved metadata surfacing for MCP consumers, and new benchmark tooling for LongMemEval diagnosis and quota preflight.

Changes:

Extend recall ranking/scoring with configurable recency decay, tag-score denominator cap, an optional relevance gate, deterministic timestamp tiebreaking, optional tag-scope fallback fills, and recency-bias re-ranking (plus associated API/docs/test updates).
Add observability and safety improvements: classification fallback-rate metrics in /enrichment/status and a judge quota/auth preflight for benchmark runs.
Add/expand benchmark harness capabilities (failure-mode diagnosis, prompt flagging, richer artifacts) and update CI to run on develop as well as main.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| tests/test_classification_stats.py | New unit coverage for classification stats instrumentation + thread-safety checks |
| tests/test_api_endpoints.py | Large expansion of API/recall/enrichment behavior tests (tag score cap, relevance gate, recency bias, scope fallback, metadata roundtrip, etc.) |
| tests/benchmarks/longmemeval/test_longmemeval.py | Add richer result artifacts + temporal answer-hint prompt option |
| tests/benchmarks/longmemeval/test_diagnose_failures.py | Unit tests for the new failure-mode diagnosis harness and judge preflight |
| tests/benchmarks/longmemeval/test_answer_prompt.py | Tests to ensure prompt byte-identity when flag-off + expected behavior when flag-on |
| tests/benchmarks/longmemeval/diagnose_failures.py | New failure-mode diagnosis CLI/harness for LongMemEval |
| tests/benchmarks/longmemeval/configs.py | Add `temporal-answer` preset and `temporal_answer_hint` flag wiring |
| tests/benchmarks/judge_preflight.py | New judge quota/auth preflight helper (CLI + testable core) |
| test-longmemeval-benchmark.sh | Run judge preflight before judged LongMemEval runs |
| scripts/lab/summarize_pipeline_20260611.py | New lab summarization script for release verification sweeps |
| scripts/lab/run_recall_test.py | Broaden restart detection for `SEARCH_*` env changes in lab harness |
| scripts/browse_memories.py | Align diagnose recency simulation with configurable recency window/curve |
| README.md | Note contribution policy: feature PRs target `develop` |
| mcp-sse-server/test/server.test.js | Expand MCP server tests for new recall params and metadata/updated_at rendering |
| mcp-sse-server/server.js | MCP client: pass through `scope_fallback`; detailed formatting includes updated_at/metadata; surface outside-tag-scope fills |
| docs/METADATA_BEHAVIOR.md | Document metadata + timestamps surfacing across recall/MCP formats |
| docs/ENVIRONMENT_VARIABLES.md | Document new env vars and new recall behaviors (tag cap, recency tuning, gate, recency bias) |
| docs/COMPARISON.md | Update recency weight description to reflect configurable decay |
| docs/API.md | Document tag semantics, scope fallback, recency bias, timestamp tiebreak, chain-walk behavior, and new enrichment status block |
| docker-compose.yml | Add optional `.env.bench` env_file for lab override propagation |
| CLAUDE.md | Update documented recency/temporal-related knobs and browse_memories diagnose description |
| benchmarks/EXPERIMENT_LOG.md | Add 2026-06-11 ranking-release verification entry |
| automem/utils/time.py | Add temporal-intent detection helper for `recency_bias=auto` |
| automem/utils/scoring.py | Configurable recency curve/window; tag-score cap; relevance evidence + gate; richer score components |
| automem/service_state.py | Add `ClassificationStats` and attach it to `ServiceState` |
| automem/config.py | Add env parsing/guards for recency tuning, tag cap, relevance gate, temporal weight, and default recency-bias mode |
| automem/classification/memory_classifier.py | Thread optional stats through classifier; record attempts/success/fallback/error |
| automem/api/recall.py | Add score timestamp tiebreak; recency-bias rerank; scope fallback fills; tag scope diagnostics; supersession chain-walk |
| automem/api/enrichment.py | Add `classification` block to `/enrichment/status` response |
| app.py | Wire classifier stats from global ServiceState into MemoryClassifier |
| .github/workflows/ci.yml | Run CI for both `main` and `develop` pushes/PRs |

- classifier: set descriptive llm_error when LLM returns no usable result so /enrichment/status last_error populates
- mcp-sse-server: truncate metadata previews in detailed text format at 1500 chars (json format unchanged) + test
- lab summarizer: label paired_t as the z-approximation it is (renamed paired_z)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jack-arturo and others added 11 commits June 11, 2026 20:45

docs(bench): log full judged 500q LongMemEval ship-config run with ch…

41bf8d0

…urn attribution Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

docs: note develop-branch contribution policy in README

ccf02dd

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 12, 2026 14:35

Copilot started reviewing on behalf of jack-arturo June 12, 2026 14:36

Copilot AI reviewed Jun 12, 2026


Comment thread automem/classification/memory_classifier.py

Comment thread mcp-sse-server/server.js

Comment thread scripts/lab/summarize_pipeline_20260611.py

jack-arturo changed the title ~~release: recall ranking series (#182, #193, #186, #187, #183, #184, #188)~~ feat(recall): ranking release — recency config, tag-score cap, relevance gate, date-aware ranking (#182, #193, #186, #187, #183, #184, #188) Jun 12, 2026

jack-arturo merged commit 337fe98 into main Jun 12, 2026
13 checks passed

jack-arturo deleted the develop branch June 12, 2026 15:32

jack-arturo mentioned this pull request Jun 11, 2026

chore(main): release 0.16.0 #154

Open

jack-arturo merged 12 commits into main from develop
