feat(recall): ranking release — recency config, tag-score cap, relevance gate, date-aware ranking (#182, #193, #186, #187, #183, #184, #188)#194
Merged
Pools prod_parity runs as the baseline and reports per-config metric deltas with a paired t-test on per-query recall@5, plus per-category R@10 deltas. Companion to scripts/lab/run_pipeline_20260611.sh. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
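The paired comparison on per-query recall@5 can be sketched as below. This is a minimal illustration rather than the summarizer's actual code: the function name `paired_z` is borrowed from the later rename in this PR, while the input shape (two equal-length lists of per-query scores) and the normal-approximation p-value are assumptions.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_z(baseline: Sequence[float], candidate: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean delta, z statistic, two-sided p-value) for paired per-query scores.

    The p-value uses the normal approximation, so it is only reasonable
    when the number of queries is large.
    """
    if len(baseline) != len(candidate):
        raise ValueError("paired samples must have the same length")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired queries")
    delta = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the per-query differences
    if sd == 0:
        # Every query moved by the same amount; the statistic is undefined.
        return delta, float("nan"), float("nan")
    z = delta / (sd / math.sqrt(n))
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return delta, z, p_two_sided
```

Pairing by query removes between-query variance, which is why the per-query differences, not the two pooled means, feed the statistic.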
develop is the new integration branch: feature PRs target develop and main only moves via validated release merges, so Railway users tracking main see one deploy event per release instead of one per PR. docker-build and release-please intentionally stay main-only. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
## Summary

- `SEARCH_RECENCY_WINDOW_DAYS` (default 180) and `SEARCH_RECENCY_CURVE` (`linear`|`exp`, default `linear`) replace the hardcoded 180-day linear decay in `_compute_recency_score()`. Defaults are bit-for-bit identical to current behavior; `exp` treats the window as a half-life.
- Window values ≤ 0 fall back to 180 (a `0` would otherwise 500 every recall with `ZeroDivisionError`; negatives produce unbounded scores).
- `scripts/browse_memories.py diagnose` now simulates recency with the same env config instead of its own hardcoded copy.
- Lab harness fix: `run_recall_test.py` restarted the API container only for `SEARCH_WEIGHT_*` keys, so sweeps of any other `SEARCH_*` var silently tested the baseline; the prefix is now widened to `SEARCH_`.

## Why

Recency was dead for any memory older than 180 days (and for entire benchmark corpora, which are dated 2023), which meant zero temporal discrimination among conflicting facts. This is the substrate for the date-aware ranking work (#158/#159) and makes the window/curve sweepable via `make lab-sweep`.

## Testing

- 10 new unit tests (defaults, linear/exp curves, guards, monkeypatched windows, config domain validation)
- Full suite: 497 passed, 12 skipped; black + flake8 clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
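The two curves and the window guard described above can be sketched as follows. This is an illustration under assumptions, not the body of `_compute_recency_score()`: the function name, the clamping of negative ages, and the exact linear form `1 - age/window` are guesses consistent with the summary.

```python
DEFAULT_WINDOW_DAYS = 180.0


def recency_score(age_days: float, window_days: float = DEFAULT_WINDOW_DAYS, curve: str = "linear") -> float:
    """Score in [0, 1] for a memory that is `age_days` old (hypothetical helper)."""
    # Guard: a zero window would divide by zero, and a negative one would
    # produce unbounded scores, so both fall back to the default.
    if window_days <= 0:
        window_days = DEFAULT_WINDOW_DAYS
    age_days = max(age_days, 0.0)  # assumption: future-dated memories count as brand new
    if curve == "exp":
        # The window acts as a half-life: the score halves every `window_days`.
        return 0.5 ** (age_days / window_days)
    # Linear decay reaches zero at the window edge and stays there.
    return max(0.0, 1.0 - age_days / window_days)
```

With the linear curve, any memory older than the window scores 0, which is the "dead recency" case from the Why section; the `exp` curve keeps older memories distinguishable (a 360-day-old memory still scores 0.25 with a 180-day half-life).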
Continuation of #185, which GitHub auto-closed (and refused to reopen) when its stacked base branch `feat/tunable-recency` was deleted on #182's merge. Identical branch and content, now rebased onto `develop` with #182 included.

See #185 for the full description, validation evidence (production-corpus lab A/B that flipped the default to `SEARCH_TAG_SCORE_TOKEN_CAP=0`), and review history.

Part of the develop-branch integration series for the ranking release.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…n tag scope (#130) (#186)

## Summary

Stacked on #185.

**Corrected root cause for #130.** The issue hypothesized tags act as score boosters that outrank scoped results. Code inspection shows user-passed `tags` are already a hard gate on every base path (Qdrant must-filter, graph WHERE, metadata sidecar, tag-only fallback, final post-filter). What actually happened in the repro: the gate constrained the pool to flint-tagged memories, the Forge memories were *excluded*, and within the surviving pool query-independent components (importance 0.9 × 0.1 weight + recency + tag crumbs) dominated near-zero topical evidence. The result was confident-looking garbage with no signal that the pool was gated.

Fixes, with tag semantics untouched:

- **Relevance gate**: `evidence = max(vector, keyword, metadata, exact)`; when query tokens exist and `evidence < RECALL_RELEVANCE_GATE`, importance/confidence/recency/relevance/tag components are scaled by `evidence/gate` (linear ramp). **Ships at 0.0 = exactly current behavior**; the enabled value comes from the eval funnel (lab sweep grid 0.10–0.25 + 22-probe + negative-control gates in automem-evals). `components` now carry `evidence` + `relevance_gated`.
- **`tag_scope` response diagnostics** when tags are passed: `{filtered, pool_size_hint, gated_low_evidence}`; recall is no longer silent about gating.
- **Opt-in `scope_fallback=true`**: fills remaining slots with unscoped vector results flagged `outside_tag_scope: true`, appended after scoped results, with full filter parity (min_score, time, exclude_tags, edge-based current-state suppression); only the tag scope is lifted. In-scope memories can never reappear as "outside scope" fills (guards are mutation-tested).
- **Doc corrections**: MCP `tag_match` description claimed default "exact"; the API default is `prefix`. `tags` documented as a hard scope filter (use `context_tags` to boost). Per-path semantics table in `docs/API.md`.

## Testing

20 new tests including a gate-0 bit-identity matrix, linear-ramp midpoint, route-level rerank + diagnostics, fill-resurrection mutation kills, and MCP rendering of the fill flag. Full suite 523 passed, 12 skipped; Node 14/14; black + flake8 clean.

Refs #130 (will update the issue with this corrected analysis).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…#187)

## Summary

Stacked on #186. Driven by the failure-mode diagnosis in #183 (58 failed-but-retrieved LongMemEval questions: answer-construction 42, missing-date-use 7, ranking 4).

Server (production + benchmark):

- **Timestamp tiebreak** (always-on): exact score ties now order newest-first deterministically (`_score_sort_key`).
- **`recency_bias=auto|on|off`** (ships `off` via `RECALL_RECENCY_BIAS`): after dedup/state-filter and before the adaptive floor, candidate timestamps are min-max normalized and `SEARCH_WEIGHT_TEMPORAL` (default 0.1) × relative recency is added, so the newest version of a conflicting fact can outrank an older, heavier one. `auto` triggers on temporal intent ("latest", "current", "what changed", …; word-boundaried, "currency"/"nowhere" safe).
- **Supersession chain-walk**: `current` state mode now resolves INVALIDATED_BY/EVOLVED_INTO chains to their head (A→B→C surfaces C, provenance still points at A; depth-bounded at 5, cycle-safe, batched). *Honesty note: benchmark corpora carry no supersession edges; this is a production-correctness fix, not a score mover.*

Harness (benchmark-only, flag-gated for methodology reproducibility):

- **`temporal_answer_hint`** config flag (default off; new `temporal-answer` preset for A/B): chronological memory rendering with scores + conflict-recency guidance + anti-overabstention guidance (27 of the 58 failures were literal "I don't know" with the answer retrieved; abstention remains possible). Flag-off prompt is byte-identical to the canonical methodology (equality-tested).

## Testing

40 new tests (tiebreak, bias flip with the shipped default weight, auto-detection, chain-walk depth/cycle/query-count guards, #158 preference latest-wins acceptance, prompt byte-identity). Full suite 563 passed, 12 skipped; black + flake8 clean.
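The recency-bias rerank and the newest-first tiebreak described in the Summary can be sketched together. This is a simplified illustration: the tuple shape and the function name are hypothetical, and the real code runs this step between dedup/state-filtering and the adaptive floor, which is not modeled here.

```python
from typing import List, Tuple

# (memory_id, score, unix_timestamp); a hypothetical candidate shape
Candidate = Tuple[str, float, float]


def rerank_with_recency_bias(candidates: List[Candidate], temporal_weight: float = 0.1) -> List[Candidate]:
    """Add weight * min-max-normalized recency, then sort by score with newest-first ties."""
    if not candidates:
        return []
    timestamps = [ts for _, _, ts in candidates]
    oldest, newest = min(timestamps), max(timestamps)
    span = newest - oldest
    boosted = []
    for memory_id, score, ts in candidates:
        # Relative recency is 0.0 for the oldest candidate and 1.0 for the newest.
        relative_recency = (ts - oldest) / span if span > 0 else 0.0
        boosted.append((memory_id, score + temporal_weight * relative_recency, ts))
    # Sort key: higher score first; on exact ties, the newer timestamp wins.
    return sorted(boosted, key=lambda c: (-c[1], -c[2]))
```

Because the normalization is relative to the candidate set, the boost separates conflicting versions of a fact even when every candidate is years old, which an absolute decay window cannot do.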
## Validation plan before enabling anything

Lab IR A/B → automem-evals 22-probe zero-delta gate (recency_bias=off default keeps it zero-delta) → LongMemEval mini judge-off (recall@5 ≥ 97.2% floor) → judged mini → full run; `temporal-answer` preset A/B for the harness flag, with server-vs-harness deltas reported separately in EXPERIMENT_LOG.

Refs #158, #159

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
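The supersession chain-walk from this commit (A→B→C surfaces C, depth-bounded at 5, cycle-safe) can be sketched as below. The in-memory `successor` mapping stands in for the INVALIDATED_BY/EVOLVED_INTO graph edges and is an assumption; the batched graph queries of the real implementation are not shown.

```python
from typing import Dict


def resolve_chain_head(start: str, successor: Dict[str, str], max_depth: int = 5) -> str:
    """Follow successor edges from `start` and return the last reachable memory id."""
    seen = {start}
    current = start
    for _ in range(max_depth):  # depth bound: never follow more than max_depth hops
        nxt = successor.get(current)
        if nxt is None or nxt in seen:
            break  # reached the head, or detected a cycle
        seen.add(nxt)
        current = nxt
    return current
```

In this sketch the caller keeps `start` as provenance and surfaces the returned head, mirroring the "provenance still points at A" behavior.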
…quota preflight (#183)

## Summary

- `tests/benchmarks/longmemeval/diagnose_failures.py`: two-stage classifier for failed-but-retrieved questions (`is_correct=false AND recall_hit_at_5=true`). Stage 1 is pure code (answer rank, abstention-despite-hit, stale-candidate-above-answer, noise ratio, date-math detection → documented 8-rule mode ladder); stage 2 (`--llm`) labels each failure with the pinned judge and reports a stage-1/stage-2 agreement matrix (transport errors counted and excluded).
- `tests/benchmarks/judge_preflight.py`: one minimal judge call before any judged run; exits non-zero with an actionable message on 429/insufficient_quota or auth failure. Wired into `test-longmemeval-benchmark.sh` (only when `--llm-eval`). Would have caught the June 6 quota-compromised runs before they started.
- Harness instrumentation: `retrieved_session_ids_full` (all 10 recalled memories + scores) recorded per question, which closes the rank-6-10 blind spot for future runs.

## Findings on the canonical run (87.0% accuracy, recall@5 97.2%)

58 of 69 failures had the answer retrieved in the top 5. LLM labels (judge `gpt-5.4-mini-2026-03-17`): **answer-construction 42** (27 of them literal "I don't know" abstentions with the answer in context), missing-date-use 7, ranking 4, conflict-resolution 2, retrieval-gap 2, outdated-fact 1. This reprioritized the release: the harness answer-assembly path (see `feat/date-aware-ranking`) is the biggest benchmark lever; pure ranking fixes are the production-quality lever.

Report artifact: `benchmarks/results/failure_modes_canonical_llm_20260611.json` (local, gitignored).

## Testing

31 new tests (synthetic fixtures, mocked clients, no network); full suite 518 passed, 12 skipped; black + flake8 clean.

Refs #158, #159 (both issues require this failure-mode classification as their first acceptance criterion).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
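The selection rule and the agreement matrix can be sketched in a few lines. The `is_correct` and `recall_hit_at_5` fields come from the summary above; the `stage1_mode` / `stage2_mode` field names, and the use of `None` to mark a transport error, are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def failed_but_retrieved(results: Iterable[Dict]) -> List[Dict]:
    """Keep questions answered wrongly even though the answer was recalled in the top 5."""
    return [r for r in results if not r["is_correct"] and r["recall_hit_at_5"]]


def agreement_matrix(failures: Iterable[Dict]) -> Tuple[Counter, int]:
    """Count (stage-1 mode, stage-2 mode) pairs; transport errors are counted but excluded."""
    matrix: Counter = Counter()
    transport_errors = 0
    for failure in failures:
        stage2 = failure.get("stage2_mode")
        if stage2 is None:  # assumption: None marks a failed judge call
            transport_errors += 1
            continue
        matrix[(failure["stage1_mode"], stage2)] += 1
    return matrix, transport_errors
```

Diagonal cells of the matrix are cases where the code-only ladder and the judge agree; off-diagonal cells show where the stage-1 rules may need tuning.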
…ormat (#184)

## Summary

The REST `/recall` API already returns `metadata`, `updated_at`, and `last_accessed`. The gap was the MCP server's `detailed` format, which omitted them (making custom metadata effectively write-only for MCP agents, the actual complaint in #111).

- `formatRecallAsItems()` detailed branch now renders an `Updated:` line and a **size-capped** single-line `Metadata:` JSON (300 chars + ellipsis; omitted when empty). It is capped because a prior raw-dump attempt was rejected for verbosity.
- `json` format was already a raw passthrough; now pinned by a transport-level test.
- `text`/`items` formats unchanged.
- New REST contract test (`test_recall_metadata_roundtrip`) locks the server-side behavior; `docs/METADATA_BEHAVIOR.md` corrected (it over-claimed that detailed already exposed metadata).

## Testing

- Node: 15/15 (`npm test`, includes truncation-boundary and empty-metadata cases)
- Python: 488 passed, 12 skipped; black + flake8 clean

Closes #111

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
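The size-capped `Metadata:` line can be illustrated as follows. The real logic lives in the Node `formatRecallAsItems()` function; this Python version is only a sketch of the same idea, and the helper name, the `…` ellipsis character, and the cap-as-parameter design are assumptions.

```python
import json
from typing import Optional


def render_metadata_line(metadata: Optional[dict], cap: int = 300) -> Optional[str]:
    """Return a single-line, length-capped `Metadata:` string, or None when empty."""
    if not metadata:
        return None  # empty or missing metadata: the line is omitted entirely
    text = json.dumps(metadata, ensure_ascii=False)  # no indent, so one line
    if len(text) > cap:
        text = text[:cap] + "…"
    return f"Metadata: {text}"
```

Truncating the serialized string rather than the dict keeps the cap exact, at the cost of a preview that may end mid-value; agents that need the full object can use the `json` format, which is a raw passthrough.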
…ichment/status (#188)

## Summary

On June 10, OpenAI quota exhaustion silently degraded every store-time classification to the fallback `("Memory", 0.3)` path: 78% of recent writes, invisible to every health surface (the classifier only logged; `/enrichment/status` counts only enrichment-worker activity, and classification runs in the API store path). 31.1% of the production corpus is now fallback-typed.

- New `ClassificationStats` on `ServiceState` (mirrors `EnrichmentStats` style): `llm_attempts`, `llm_successes`, `fallbacks`, `pattern_classifications`, `last_error`, `last_error_at`.
- `MemoryClassifier` threads an optional stats object (backward compatible, `None`-guarded); the inner LLM except stashes the real error (reset per call) so the fallback records *why*.
- `/enrichment/status` gains a `classification` block: no new endpoint, no new MCP tool. `health_monitor`/alerting can now watch the fallback rate.

## Testing

9 unit tests + an end-to-end route test driving a 429-raising then succeeding stub through the blueprint-shared state. Full suite 497 passed, 12 skipped; black + flake8 clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
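A counter object of the kind described above might look roughly like this. The six field names come from the summary; the lock, the `record_*` methods, and the `fallback_rate()` definition (fallbacks over all completed classifications) are assumptions, not the repository's implementation.

```python
import threading
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ClassificationStats:
    llm_attempts: int = 0
    llm_successes: int = 0
    fallbacks: int = 0
    pattern_classifications: int = 0
    last_error: Optional[str] = None
    last_error_at: Optional[str] = None
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False, compare=False)

    def record_llm_attempt(self) -> None:
        with self._lock:
            self.llm_attempts += 1

    def record_llm_success(self) -> None:
        with self._lock:
            self.llm_successes += 1

    def record_pattern(self) -> None:
        with self._lock:
            self.pattern_classifications += 1

    def record_fallback(self, error: Optional[str] = None) -> None:
        """Count a fallback and, when known, remember why it happened."""
        with self._lock:
            self.fallbacks += 1
            if error is not None:
                self.last_error = error
                self.last_error_at = datetime.now(timezone.utc).isoformat()

    def fallback_rate(self) -> float:
        """Share of completed classifications that used the fallback (hypothetical metric)."""
        with self._lock:
            total = self.llm_successes + self.pattern_classifications + self.fallbacks
            return self.fallbacks / total if total else 0.0
```

An alerting rule could then compare `fallback_rate()` against a threshold, so a quota outage shows up as a metric instead of only as log lines.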
…urn attribution Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This was referenced Jun 12, 2026
Pull request overview
This release PR merges the “ranking & recall” series from develop into main, introducing new recall-scoring controls (recency tuning, relevance gating, optional scope fallback, and date-aware re-ranking), improved metadata surfacing for MCP consumers, and new benchmark tooling for LongMemEval diagnosis and quota preflight.
Changes:
- Extend recall ranking/scoring with configurable recency decay, tag-score denominator cap, an optional relevance gate, deterministic timestamp tiebreaking, optional tag-scope fallback fills, and recency-bias re-ranking (plus associated API/docs/test updates).
- Add observability and safety improvements: classification fallback-rate metrics in `/enrichment/status` and a judge quota/auth preflight for benchmark runs.
- Add/expand benchmark harness capabilities (failure-mode diagnosis, prompt flagging, richer artifacts) and update CI to run on `develop` as well as `main`.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_classification_stats.py | New unit coverage for classification stats instrumentation + thread-safety checks |
| tests/test_api_endpoints.py | Large expansion of API/recall/enrichment behavior tests (tag score cap, relevance gate, recency bias, scope fallback, metadata roundtrip, etc.) |
| tests/benchmarks/longmemeval/test_longmemeval.py | Add richer result artifacts + temporal answer-hint prompt option |
| tests/benchmarks/longmemeval/test_diagnose_failures.py | Unit tests for the new failure-mode diagnosis harness and judge preflight |
| tests/benchmarks/longmemeval/test_answer_prompt.py | Tests to ensure prompt byte-identity when flag-off + expected behavior when flag-on |
| tests/benchmarks/longmemeval/diagnose_failures.py | New failure-mode diagnosis CLI/harness for LongMemEval |
| tests/benchmarks/longmemeval/configs.py | Add temporal-answer preset and temporal_answer_hint flag wiring |
| tests/benchmarks/judge_preflight.py | New judge quota/auth preflight helper (CLI + testable core) |
| test-longmemeval-benchmark.sh | Run judge preflight before judged LongMemEval runs |
| scripts/lab/summarize_pipeline_20260611.py | New lab summarization script for release verification sweeps |
| scripts/lab/run_recall_test.py | Broaden restart detection for SEARCH_* env changes in lab harness |
| scripts/browse_memories.py | Align diagnose recency simulation with configurable recency window/curve |
| README.md | Note contribution policy: feature PRs target develop |
| mcp-sse-server/test/server.test.js | Expand MCP server tests for new recall params and metadata/updated_at rendering |
| mcp-sse-server/server.js | MCP client: pass through scope_fallback; detailed formatting includes updated_at/metadata; surface outside-tag-scope fills |
| docs/METADATA_BEHAVIOR.md | Document metadata + timestamps surfacing across recall/MCP formats |
| docs/ENVIRONMENT_VARIABLES.md | Document new env vars and new recall behaviors (tag cap, recency tuning, gate, recency bias) |
| docs/COMPARISON.md | Update recency weight description to reflect configurable decay |
| docs/API.md | Document tag semantics, scope fallback, recency bias, timestamp tiebreak, chain-walk behavior, and new enrichment status block |
| docker-compose.yml | Add optional .env.bench env_file for lab override propagation |
| CLAUDE.md | Update documented recency/temporal-related knobs and browse_memories diagnose description |
| benchmarks/EXPERIMENT_LOG.md | Add 2026-06-11 ranking-release verification entry |
| automem/utils/time.py | Add temporal-intent detection helper for recency_bias=auto |
| automem/utils/scoring.py | Configurable recency curve/window; tag-score cap; relevance evidence + gate; richer score components |
| automem/service_state.py | Add ClassificationStats and attach it to ServiceState |
| automem/config.py | Add env parsing/guards for recency tuning, tag cap, relevance gate, temporal weight, and default recency-bias mode |
| automem/classification/memory_classifier.py | Thread optional stats through classifier; record attempts/success/fallback/error |
| automem/api/recall.py | Add score timestamp tiebreak; recency-bias rerank; scope fallback fills; tag scope diagnostics; supersession chain-walk |
| automem/api/enrichment.py | Add classification block to /enrichment/status response |
| app.py | Wire classifier stats from global ServiceState into MemoryClassifier |
| .github/workflows/ci.yml | Run CI for both main and develop pushes/PRs |
- classifier: set descriptive llm_error when LLM returns no usable result so /enrichment/status last_error populates
- mcp-sse-server: truncate metadata previews in detailed text format at 1500 chars (json format unchanged) + test
- lab summarizer: label paired_t as the z-approximation it is (renamed paired_z)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Release: ranking & recall series (develop → main)
What's in this release
- feat(recall): configurable recency decay window/curve
- feat(recall): tag-score denominator cap fixes query-length bias (`SEARCH_TAG_SCORE_TOKEN_CAP=0`)
- fix(recall): relevance gate, query-independent scoring gated on topical evidence (#130)
- feat(recall): date-aware ranking, `recency_bias=off|on|auto`, latest-fact selection (#158, #159). `RECALL_RECENCY_BIAS=off`; adds deterministic timestamp tiebreak for near-ties
- feat(benchmarks): failure-mode diagnosis harness + judge quota preflight
- fix(mcp): surface stored metadata + `updated_at` in detailed recall format (#111)
- feat(enrichment): classification fallback-rate metrics in `/enrichment/status`

Plus: CI now runs on `develop` pushes/PRs; benchmark experiment log + README contribution-policy note.

Verification evidence

- (`RECALL_RECENCY_BIAS=auto` + `temporal-answer` harness): recall@5 96.6% (483/500), accuracy 86.0% (430/500), `judge_errors=0`, `memory_ingest_failures=0`.
- `benchmarks/EXPERIMENT_LOG.md` (2026-06-11 entry) and `benchmarks/results/lme_churn17_*` + `analyze_churn17.py`.

Opt-in features shipped OFF

`RECALL_RELEVANCE_GATE` (validated at 0.40 on lab corpus; improves negative-probe precision) and `RECALL_RECENCY_BIAS=auto` (current-state query re-ranking). Neither affects default behavior; see `docs/ENVIRONMENT_VARIABLES.md`.

After merging

release-please will update PR #154 (v0.16.0); merging that cuts the tag and publishes the `:stable` image, which is the actual user-facing deploy event for Railway template users.

🤖 Generated with Claude Code