
feat(recall): ranking release — recency config, tag-score cap, relevance gate, date-aware ranking (#182, #193, #186, #187, #183, #184, #188)#194

Merged: jack-arturo merged 12 commits into main from develop on Jun 12, 2026

Conversation

@jack-arturo

Member

Release: ranking & recall series (develop → main)

⚠️ Merge with a MERGE COMMIT — do not squash. release-please needs the individual conventional commits below to compute the version and changelog for PR #154.

What's in this release

| PR | Change | Default behavior |
|---|---|---|
| #182 | feat(recall): configurable recency decay window/curve | unchanged (env-gated) |
| #193 (replaces #185) | feat(recall): tag-score denominator cap fixes query-length bias | unchanged (`SEARCH_TAG_SCORE_TOKEN_CAP=0`) |
| #186 | fix(recall): relevance gate — query-independent scoring gated on topical evidence (#130) | unchanged (gate off) |
| #187 | feat(recall): date-aware ranking, `recency_bias=off\|on\|auto`, latest-fact selection (#158, #159) | `RECALL_RECENCY_BIAS=off`; adds deterministic timestamp tiebreak for near-ties |
| #183 | feat(benchmarks): failure-mode diagnosis harness + judge quota preflight | tooling only |
| #184 | fix(mcp): surface stored metadata + updated_at in detailed recall format (#111) | additive |
| #188 | feat(enrichment): classification fallback-rate metrics in /enrichment/status | additive |

Plus: CI now runs on develop pushes/PRs; benchmark experiment log + README contribution-policy note.

Verification evidence

  • Unit/lint/npm: 625 pytest + 16 mcp-sse-server tests green on develop head; CI green.
  • Default-preserve: on the recall-lab baseline (10k-memory production snapshot), develop defaults and the main pooled baseline produce identical aggregates (R@5 0.655 / R@10 0.710 / MRR 0.434 / NDCG@10 0.501). Two-stack probe run (main vs develop, defaults): 11/12 preserve-exact; the remaining diffs are near-tie reorders (top-1 score deltas ≤ 5.4e-5) caused by the #187 timestamp tiebreak.
  • Full judged 500q LongMemEval (ship config: RECALL_RECENCY_BIAS=auto + temporal-answer harness): recall@5 96.6% (483/500), accuracy 86.0% (430/500), judge_errors=0, memory_ingest_failures=0.
  • Churn attribution (targeted re-runs of all 17 churned questions on current-main-at-defaults and develop-at-defaults): 15/17 moved with #191 (fix(recall): normalize graph keyword scores into the 0-1 component range), which is already on main, so the April canonical 97.2% floor is stale; current main measures ~97.0%. Develop-at-defaults differs from current main by 1 question in 500 (a near-tie rank-5/6 flip from #187's deterministic tiebreak). Accuracy is within answerer replicate noise (identical-config reference runs flip 28/500 answers).
  • Full detail: benchmarks/EXPERIMENT_LOG.md (2026-06-11 entry) and benchmarks/results/lme_churn17_* + analyze_churn17.py.

Opt-in features shipped OFF

RECALL_RELEVANCE_GATE (validated at 0.40 on lab corpus; improves negative-probe precision) and RECALL_RECENCY_BIAS=auto (current-state query re-ranking). Neither affects default behavior; see docs/ENVIRONMENT_VARIABLES.md.

After merging

release-please will update PR #154 (v0.16.0); merging that cuts the tag and publishes the :stable image — the actual user-facing deploy event for Railway template users.

🤖 Generated with Claude Code

jack-arturo and others added 11 commits June 11, 2026 20:45
Pools prod_parity runs as the baseline and reports per-config metric
deltas with a paired t-test on per-query recall@5, plus per-category
R@10 deltas. Companion to scripts/lab/run_pipeline_20260611.sh.
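
For readers who want to see what a paired significance statistic on per-query recall@5 looks like, here is a minimal sketch. It is not the summarizer script itself: the function name `paired_z` and the input shape (two equal-length lists of per-query scores) are assumptions made for illustration. The statistic is the usual mean-of-differences over its standard error; a later review commit in this PR notes that the script treats it as a z-approximation rather than a true t-test.

```python
import math
import statistics


def paired_z(baseline, candidate):
    """Return (mean_delta, z) for paired per-query scores.

    Hypothetical helper: z = mean(d) / (stdev(d) / sqrt(n)), where d is the
    per-query difference candidate - baseline.
    """
    if len(baseline) != len(candidate):
        raise ValueError("paired samples must have the same length")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    if n < 2:
        raise ValueError("need at least two paired queries")
    mean_delta = statistics.fmean(deltas)
    sd = statistics.stdev(deltas)  # sample standard deviation (n - 1)
    if sd == 0.0:
        # Every query moved by the same amount: report 0 for "no change",
        # otherwise an infinite statistic with the sign of the shift.
        z = 0.0 if mean_delta == 0.0 else math.copysign(math.inf, mean_delta)
        return mean_delta, z
    return mean_delta, mean_delta / (sd / math.sqrt(n))
```

With binary recall@5 hits, a config that flips one of four queries from miss to hit yields a mean delta of 0.25; real runs would of course use the full query set.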

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
develop is the new integration branch: feature PRs target develop and
main only moves via validated release merges, so Railway users tracking
main see one deploy event per release instead of one per PR.
docker-build and release-please intentionally stay main-only.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
## Summary
- `SEARCH_RECENCY_WINDOW_DAYS` (default 180) and `SEARCH_RECENCY_CURVE`
(`linear`|`exp`, default `linear`) replace the hardcoded 180-day linear
decay in `_compute_recency_score()`. Defaults are bit-for-bit identical
to current behavior; `exp` treats the window as a half-life.
- Window values ≤ 0 fall back to 180 (a `0` would otherwise 500 every
recall with `ZeroDivisionError`; negatives produce unbounded scores).
- `scripts/browse_memories.py diagnose` now simulates recency with the
same env config instead of its own hardcoded copy.
- Lab harness fix: `run_recall_test.py` restarted the API container only
for `SEARCH_WEIGHT_*` keys, so sweeps of any other `SEARCH_*` var
silently tested the baseline — prefix widened to `SEARCH_`.
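
The decay behavior described in the first two bullets can be sketched as follows. This is an illustration under stated assumptions, not the body of `_compute_recency_score()`: the function name `recency_score`, its signature, and the exact linear formula (`1 - age/window`, clamped at 0) are guesses consistent with "180-day linear decay"; the half-life reading of `exp` and the fallback for windows ≤ 0 come from the description above.

```python
DEFAULT_WINDOW_DAYS = 180.0  # documented default for SEARCH_RECENCY_WINDOW_DAYS


def recency_score(age_days, window_days=DEFAULT_WINDOW_DAYS, curve="linear"):
    """Hypothetical stand-in for the configurable recency component."""
    # Guard: a zero window would divide by zero and a negative one would
    # produce unbounded scores, so both fall back to the default.
    if window_days <= 0:
        window_days = DEFAULT_WINDOW_DAYS
    age = max(age_days, 0.0)  # treat future-dated memories as brand new
    if curve == "exp":
        # The window acts as a half-life: the score halves every window_days.
        return 0.5 ** (age / window_days)
    # Linear (assumed form): full score at age 0, zero at the window edge.
    return max(0.0, 1.0 - age / window_days)
```

Under these assumptions a 360-day-old memory scores 0.0 on the linear curve but 0.25 on `exp`, which is the temporal discrimination the "Why" section below is after.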

## Why
Recency was dead for any memory older than 180 days (and for entire
benchmark corpora, which are dated 2023) — zero temporal discrimination
among conflicting facts. This is the substrate for the date-aware
ranking work (#158/#159) and makes the window/curve sweepable via `make
lab-sweep`.

## Testing
- 10 new unit tests (defaults, linear/exp curves, guards, monkeypatched
windows, config domain validation)
- Full suite: 497 passed, 12 skipped; black + flake8 clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Continuation of #185, which GitHub auto-closed (and refused to reopen)
when its stacked base branch `feat/tunable-recency` was deleted on
#182's merge. Identical branch and content, now rebased onto `develop`
with #182 included.

See #185 for the full description, validation evidence
(production-corpus lab A/B that flipped the default to
`SEARCH_TAG_SCORE_TOKEN_CAP=0`), and review history.

Part of the develop-branch integration series for the ranking release.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…n tag scope (#130) (#186)

## Summary
Stacked on #185.

**Corrected root cause for #130.** The issue hypothesized tags act as
score boosters that outrank scoped results. Code inspection shows
user-passed `tags` are already a hard gate on every base path (Qdrant
must-filter, graph WHERE, metadata sidecar, tag-only fallback, final
post-filter). What actually happened in the repro: the gate constrained
the pool to flint-tagged memories, the Forge memories were *excluded*,
and within the surviving pool query-independent components (importance
0.9 × 0.1 weight + recency + tag crumbs) dominated near-zero topical
evidence — confident-looking garbage with no signal the pool was gated.

Fixes, with tag semantics untouched:
- **Relevance gate**: `evidence = max(vector, keyword, metadata,
exact)`; when query tokens exist and `evidence < RECALL_RELEVANCE_GATE`,
importance/confidence/recency/relevance/tag components are scaled by
`evidence/gate` (linear ramp). **Ships at 0.0 = exactly current
behavior**; the enabled value comes from the eval funnel (lab sweep grid
0.10–0.25 + 22-probe + negative-control gates in automem-evals).
`components` now carry `evidence` + `relevance_gated`.
- **`tag_scope` response diagnostics** when tags are passed: `{filtered,
pool_size_hint, gated_low_evidence}` — recall is no longer silent about
gating.
- **Opt-in `scope_fallback=true`**: fills remaining slots with unscoped
vector results flagged `outside_tag_scope: true`, appended after scoped
results, with full filter parity (min_score, time, exclude_tags,
edge-based current-state suppression) — only the tag scope is lifted.
In-scope memories can never reappear as "outside scope" fills (guards
are mutation-tested).
- **Doc corrections**: MCP `tag_match` description claimed default
"exact"; the API default is `prefix`. `tags` documented as a hard scope
filter (use `context_tags` to boost). Per-path semantics table in
`docs/API.md`.
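
A minimal sketch of the relevance gate from the first bullet, assuming score components arrive as a plain dict keyed by component name. The function name `apply_relevance_gate` and the dict layout are hypothetical; the evidence formula, the linear ramp, the gate-0 no-op, and the `evidence` / `relevance_gated` outputs follow the description above.

```python
TOPICAL = ("vector", "keyword", "metadata", "exact")
QUERY_INDEPENDENT = ("importance", "confidence", "recency", "relevance", "tag")


def apply_relevance_gate(components, gate, has_query_tokens=True):
    """Return a copy of components with query-independent parts ramped down.

    Hypothetical helper: when topical evidence is below the gate, the
    query-independent components are scaled by evidence / gate.
    """
    evidence = max(components.get(name, 0.0) for name in TOPICAL)
    out = dict(components)
    out["evidence"] = evidence
    gated = gate > 0.0 and has_query_tokens and evidence < gate
    out["relevance_gated"] = gated
    if gated:
        scale = evidence / gate  # linear ramp: 0 at no evidence, 1 at the gate
        for name in QUERY_INDEPENDENT:
            if name in out:
                out[name] = out[name] * scale
    return out
```

With `gate=0.0` the function only annotates the components and changes no scores, which mirrors the "ships at 0.0 = exactly current behavior" guarantee.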

## Testing
20 new tests including a gate-0 bit-identity matrix, linear-ramp
midpoint, route-level rerank + diagnostics, fill-resurrection mutation
kills, and MCP rendering of the fill flag. Full suite 523 passed, 12
skipped; Node 14/14; black + flake8 clean.
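
The "fill-resurrection" property these tests target (an in-scope memory must never come back as an outside-scope fill) can be pictured with a small sketch. The function name, the `id` key, and the assumption that `unscoped` has already passed the other filters (min_score, time, exclude_tags, current-state suppression) are all illustrative choices, not the actual route code.

```python
def fill_with_scope_fallback(scoped, unscoped, limit):
    """Append unscoped results after scoped ones until `limit` is reached.

    Hypothetical helper: `scoped` and `unscoped` are ranked lists of dicts
    with an "id" key; `unscoped` is assumed to be pre-filtered already.
    """
    results = list(scoped[:limit])
    # Track every in-scope id, including ones cut off by the limit, so an
    # in-scope memory can never be relabelled as an outside-scope fill.
    seen = {memory["id"] for memory in scoped}
    for memory in unscoped:
        if len(results) >= limit:
            break
        if memory["id"] in seen:
            continue
        seen.add(memory["id"])
        results.append({**memory, "outside_tag_scope": True})
    return results
```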

Refs #130 (will update the issue with this corrected analysis).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…#187)

## Summary
Stacked on #186. Driven by the failure-mode diagnosis in #183 (58
failed-but-retrieved LongMemEval questions: answer-construction 42,
missing-date-use 7, ranking 4).

Server (production + benchmark):
- **Timestamp tiebreak** (always-on): exact score ties now order
newest-first deterministically (`_score_sort_key`).
- **`recency_bias=auto|on|off`** (ships `off` via
`RECALL_RECENCY_BIAS`): after dedup/state-filter and before the adaptive
floor, candidate timestamps are min-max normalized and
`SEARCH_WEIGHT_TEMPORAL` (default 0.1) × relative recency is added — so
the newest version of a conflicting fact can outrank an older, heavier
one. `auto` triggers on temporal intent ("latest", "current", "what
changed", …; word-boundaried, "currency"/"nowhere" safe).
- **Supersession chain-walk**: `current` state mode now resolves
INVALIDATED_BY/EVOLVED_INTO chains to their head (A→B→C surfaces C,
provenance still points at A; depth-bounded at 5, cycle-safe, batched).
*Honesty note: benchmark corpora carry no supersession edges — this is a
production-correctness fix, not a score mover.*
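
The tiebreak and recency-bias bullets can be sketched together. Everything here is a simplified illustration: the candidate shape (dicts with `score` and an epoch-seconds `timestamp`), the function names, and the intent word list beyond "latest", "current", and "what changed" are assumptions ("now" is inferred from the "nowhere" safety note). The default weight of 0.1 is the documented `SEARCH_WEIGHT_TEMPORAL` default.

```python
import re

# Word-boundaried so "currency" and "nowhere" do not trigger the bias.
_TEMPORAL_INTENT = re.compile(r"\b(latest|current|now|what changed)\b", re.IGNORECASE)


def has_temporal_intent(query):
    """Hypothetical detector for the `auto` mode."""
    return bool(_TEMPORAL_INTENT.search(query))


def score_sort_key(candidate):
    # Highest score first; exact ties fall back to newest first.
    return (-candidate["score"], -candidate["timestamp"])


def apply_recency_bias(candidates, weight=0.1):
    """Add weight x min-max-normalized recency to each score, then re-sort."""
    if not candidates:
        return []
    stamps = [c["timestamp"] for c in candidates]
    oldest, newest = min(stamps), max(stamps)
    span = newest - oldest
    boosted = []
    for c in candidates:
        relative = (c["timestamp"] - oldest) / span if span > 0 else 0.0
        boosted.append({**c, "score": c["score"] + weight * relative})
    return sorted(boosted, key=score_sort_key)


def rank(query, candidates, mode="off", weight=0.1):
    """mode is one of "off", "on", "auto" (mirrors recency_bias)."""
    if mode == "on" or (mode == "auto" and has_temporal_intent(query)):
        return apply_recency_bias(candidates, weight)
    return sorted(candidates, key=score_sort_key)
```

For two conflicting facts scored 0.80 (older) and 0.75 (newer), `mode="off"` keeps the older one on top, while `mode="on"` lifts the newer one to roughly 0.85 and ranks it first.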

Harness (benchmark-only, flag-gated for methodology reproducibility):
- **`temporal_answer_hint`** config flag (default off; new
`temporal-answer` preset for A/B): chronological memory rendering with
scores + conflict-recency guidance + anti-overabstention guidance (27 of
the 58 failures were literal "I don't know" with the answer retrieved;
abstention remains possible). Flag-off prompt is byte-identical to the
canonical methodology (equality-tested).

## Testing
40 new tests (tiebreak, bias flip with the shipped default weight,
auto-detection, chain-walk depth/cycle/query-count guards, #158
preference latest-wins acceptance, prompt byte-identity). Full suite 563
passed, 12 skipped; black + flake8 clean.
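
The depth and cycle guards on the supersession chain-walk that these tests exercise can be illustrated with an in-memory sketch. The real implementation resolves INVALIDATED_BY/EVOLVED_INTO edges in the graph and is batched; here a plain dict `successor` (old id to newer id) stands in for those edges, and the function name `resolve_head` is hypothetical.

```python
MAX_DEPTH = 5  # documented depth bound for the chain-walk


def resolve_head(memory_id, successor, max_depth=MAX_DEPTH):
    """Follow supersession links from memory_id to the newest version.

    Hypothetical helper: stops at the chain head, after max_depth hops,
    or as soon as a link would revisit a node (cycle safety).
    """
    current = memory_id
    visited = {current}
    for _ in range(max_depth):
        nxt = successor.get(current)
        if nxt is None or nxt in visited:
            break
        visited.add(nxt)
        current = nxt
    return current
```

For a chain A→B→C this returns C; a caller that needs provenance would keep the original id alongside the resolved head.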

## Validation plan before enabling anything
Lab IR A/B → automem-evals 22-probe zero-delta gate (recency_bias=off
default keeps it zero-delta) → LongMemEval mini judge-off (recall@5 ≥
97.2% floor) → judged mini → full run; `temporal-answer` preset A/B for
the harness flag, with server-vs-harness deltas reported separately in
EXPERIMENT_LOG.

Refs #158, #159

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…quota preflight (#183)

## Summary
- `tests/benchmarks/longmemeval/diagnose_failures.py`: two-stage
classifier for failed-but-retrieved questions (`is_correct=false AND
recall_hit_at_5=true`). Stage 1 is pure code (answer rank,
abstention-despite-hit, stale-candidate-above-answer, noise ratio,
date-math detection → documented 8-rule mode ladder); stage 2 (`--llm`)
labels each failure with the pinned judge and reports a stage-1/stage-2
agreement matrix (transport errors counted and excluded).
- `tests/benchmarks/judge_preflight.py`: one minimal judge call before
any judged run; exits non-zero with an actionable message on
429/insufficient_quota or auth failure. Wired into
`test-longmemeval-benchmark.sh` (only when `--llm-eval`). Would have
caught the June 6 quota-compromised runs before they started.
- Harness instrumentation: `retrieved_session_ids_full` (all 10 recalled
memories + scores) recorded per question — closes the rank-6-10 blind
spot for future runs.
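
The preflight idea in the second bullet (one cheap judge call, fail fast on quota or auth problems) can be sketched without any network dependency. `JudgeError`, `judge_preflight`, and the injected `call_judge` callable are hypothetical names for illustration; the real `judge_preflight.py` talks to the pinned judge model and its error types are not shown in this PR.

```python
class JudgeError(Exception):
    """Hypothetical error carrying an HTTP status and provider error code."""

    def __init__(self, status, code=""):
        super().__init__(f"judge call failed: status={status} code={code}")
        self.status = status
        self.code = code


def judge_preflight(call_judge):
    """Make one minimal judge call and return (exit_code, message).

    `call_judge` is any zero-argument callable that raises JudgeError on
    failure; injecting it keeps the core testable with stubs.
    """
    try:
        call_judge()
    except JudgeError as exc:
        if exc.status == 429 or exc.code == "insufficient_quota":
            return 1, "Judge quota exhausted: fix billing or keys before a judged run."
        if exc.status in (401, 403):
            return 1, "Judge authentication failed: check the API key."
        return 1, f"Judge preflight failed: {exc}"
    return 0, "Judge preflight OK."
```

A wrapper script would pass the exit code to `sys.exit()` so that a shell harness stops before spending time on a run whose judged results would be compromised.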

## Findings on the canonical run (87.0% accuracy, recall@5 97.2%)
58 of 69 failures had the answer retrieved in the top 5. LLM labels
(judge `gpt-5.4-mini-2026-03-17`): **answer-construction 42** (27 of
them literal "I don't know" abstentions with the answer in context),
missing-date-use 7, ranking 4, conflict-resolution 2, retrieval-gap 2,
outdated-fact 1. This reprioritized the release: the harness
answer-assembly path (see `feat/date-aware-ranking`) is the biggest
benchmark lever; pure ranking fixes are the production-quality lever.

Report artifact:
`benchmarks/results/failure_modes_canonical_llm_20260611.json` (local,
gitignored).

## Testing
31 new tests (synthetic fixtures, mocked clients, no network); full
suite 518 passed, 12 skipped; black + flake8 clean.

Refs #158, #159 (both issues require this failure-mode classification as
their first acceptance criterion).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…ormat (#184)

## Summary
The REST `/recall` API already returns `metadata`, `updated_at`, and
`last_accessed` — the gap was the MCP server's `detailed` format, which
omitted them (making custom metadata effectively write-only for MCP
agents, the actual complaint in #111).

- `formatRecallAsItems()` detailed branch now renders an `Updated:` line
and a **size-capped** single-line `Metadata:` JSON (300 chars +
ellipsis; omitted when empty) — capped because a prior raw-dump attempt
was rejected for verbosity.
- `json` format was already a raw passthrough; now pinned by a
transport-level test.
- `text`/`items` formats unchanged.
- New REST contract test (`test_recall_metadata_roundtrip`) locks the
server-side behavior; `docs/METADATA_BEHAVIOR.md` corrected (it
over-claimed that detailed already exposed metadata).
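
The size-capped `Metadata:` line from the first bullet can be sketched as below. The actual change lives in the JavaScript `formatRecallAsItems()`; this is a Python rendering of the same idea for consistency with the other examples, with a hypothetical function name and a three-dot ASCII ellipsis as an assumption. The cap is a parameter because the description above says 300 characters while a later review commit in this PR mentions 1500.

```python
import json

METADATA_CAP = 300  # characters kept before the ellipsis


def render_metadata_line(metadata, cap=METADATA_CAP):
    """Return a single-line 'Metadata: ...' string, or None when empty."""
    if not metadata:
        return None  # omit the line entirely for missing or empty metadata
    # json.dumps without indent stays on one line; newlines inside values
    # are escaped, so the output never spans multiple lines.
    text = json.dumps(metadata, ensure_ascii=False)
    if len(text) > cap:
        text = text[:cap] + "..."
    return f"Metadata: {text}"
```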

## Testing
- Node: 15/15 (`npm test`, includes truncation-boundary and
empty-metadata cases)
- Python: 488 passed, 12 skipped; black + flake8 clean

Closes #111

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…ichment/status (#188)

## Summary
On June 10, OpenAI quota exhaustion silently degraded every store-time
classification to the fallback `("Memory", 0.3)` path — 78% of recent
writes, invisible to every health surface (the classifier only logged;
`/enrichment/status` counts only enrichment-worker activity, and
classification runs in the API store path). 31.1% of the production
corpus is now fallback-typed.

- New `ClassificationStats` on `ServiceState` (mirrors `EnrichmentStats`
style): `llm_attempts`, `llm_successes`, `fallbacks`,
`pattern_classifications`, `last_error`, `last_error_at`.
- `MemoryClassifier` threads an optional stats object (backward
compatible, `None`-guarded); the inner LLM except stashes the real error
(reset per call) so the fallback records *why*.
- `/enrichment/status` gains a `classification` block — no new endpoint,
no new MCP tool. `health_monitor`/alerting can now watch the fallback
rate.
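
A minimal sketch of what a stats object with these fields might look like. The field names come from the first bullet; the method names, the lock, the ISO-8601 timestamp format, and the `fallback_rate()` definition (fallbacks over all completed classifications) are assumptions added for illustration and may differ from the real `ClassificationStats`.

```python
import threading
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ClassificationStats:
    llm_attempts: int = 0
    llm_successes: int = 0
    fallbacks: int = 0
    pattern_classifications: int = 0
    last_error: Optional[str] = None
    last_error_at: Optional[str] = None
    _lock: threading.Lock = field(
        default_factory=threading.Lock, init=False, repr=False, compare=False
    )

    def record_pattern(self):
        with self._lock:
            self.pattern_classifications += 1

    def record_llm_attempt(self):
        with self._lock:
            self.llm_attempts += 1

    def record_llm_success(self):
        with self._lock:
            self.llm_successes += 1

    def record_fallback(self, error=None):
        # Keep the reason so a status endpoint can show why fallbacks happen.
        with self._lock:
            self.fallbacks += 1
            if error is not None:
                self.last_error = error
                self.last_error_at = datetime.now(timezone.utc).isoformat()

    def fallback_rate(self):
        # Assumed definition: share of classifications that hit the fallback.
        with self._lock:
            total = self.pattern_classifications + self.llm_successes + self.fallbacks
            return self.fallbacks / total if total else 0.0
```

An alerting job could poll the rate and page when it crosses a threshold, which is the kind of signal that was missing during the June 10 incident described above.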

## Testing
9 unit tests + an end-to-end route test driving a 429-raising then
succeeding stub through the blueprint-shared state. Full suite 497
passed, 12 skipped; black + flake8 clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…urn attribution

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot AI left a comment

Contributor


Pull request overview

This release PR merges the “ranking & recall” series from develop into main, introducing new recall-scoring controls (recency tuning, relevance gating, optional scope fallback, and date-aware re-ranking), improved metadata surfacing for MCP consumers, and new benchmark tooling for LongMemEval diagnosis and quota preflight.

Changes:

  • Extend recall ranking/scoring with configurable recency decay, tag-score denominator cap, an optional relevance gate, deterministic timestamp tiebreaking, optional tag-scope fallback fills, and recency-bias re-ranking (plus associated API/docs/test updates).
  • Add observability and safety improvements: classification fallback-rate metrics in /enrichment/status and a judge quota/auth preflight for benchmark runs.
  • Add/expand benchmark harness capabilities (failure-mode diagnosis, prompt flagging, richer artifacts) and update CI to run on develop as well as main.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| `tests/test_classification_stats.py` | New unit coverage for classification stats instrumentation + thread-safety checks |
| `tests/test_api_endpoints.py` | Large expansion of API/recall/enrichment behavior tests (tag score cap, relevance gate, recency bias, scope fallback, metadata roundtrip, etc.) |
| `tests/benchmarks/longmemeval/test_longmemeval.py` | Add richer result artifacts + temporal answer-hint prompt option |
| `tests/benchmarks/longmemeval/test_diagnose_failures.py` | Unit tests for the new failure-mode diagnosis harness and judge preflight |
| `tests/benchmarks/longmemeval/test_answer_prompt.py` | Tests to ensure prompt byte-identity when flag-off + expected behavior when flag-on |
| `tests/benchmarks/longmemeval/diagnose_failures.py` | New failure-mode diagnosis CLI/harness for LongMemEval |
| `tests/benchmarks/longmemeval/configs.py` | Add temporal-answer preset and temporal_answer_hint flag wiring |
| `tests/benchmarks/judge_preflight.py` | New judge quota/auth preflight helper (CLI + testable core) |
| `test-longmemeval-benchmark.sh` | Run judge preflight before judged LongMemEval runs |
| `scripts/lab/summarize_pipeline_20260611.py` | New lab summarization script for release verification sweeps |
| `scripts/lab/run_recall_test.py` | Broaden restart detection for SEARCH_* env changes in lab harness |
| `scripts/browse_memories.py` | Align diagnose recency simulation with configurable recency window/curve |
| `README.md` | Note contribution policy: feature PRs target develop |
| `mcp-sse-server/test/server.test.js` | Expand MCP server tests for new recall params and metadata/updated_at rendering |
| `mcp-sse-server/server.js` | MCP client: pass through scope_fallback; detailed formatting includes updated_at/metadata; surface outside-tag-scope fills |
| `docs/METADATA_BEHAVIOR.md` | Document metadata + timestamps surfacing across recall/MCP formats |
| `docs/ENVIRONMENT_VARIABLES.md` | Document new env vars and new recall behaviors (tag cap, recency tuning, gate, recency bias) |
| `docs/COMPARISON.md` | Update recency weight description to reflect configurable decay |
| `docs/API.md` | Document tag semantics, scope fallback, recency bias, timestamp tiebreak, chain-walk behavior, and new enrichment status block |
| `docker-compose.yml` | Add optional .env.bench env_file for lab override propagation |
| `CLAUDE.md` | Update documented recency/temporal-related knobs and browse_memories diagnose description |
| `benchmarks/EXPERIMENT_LOG.md` | Add 2026-06-11 ranking-release verification entry |
| `automem/utils/time.py` | Add temporal-intent detection helper for recency_bias=auto |
| `automem/utils/scoring.py` | Configurable recency curve/window; tag-score cap; relevance evidence + gate; richer score components |
| `automem/service_state.py` | Add ClassificationStats and attach it to ServiceState |
| `automem/config.py` | Add env parsing/guards for recency tuning, tag cap, relevance gate, temporal weight, and default recency-bias mode |
| `automem/classification/memory_classifier.py` | Thread optional stats through classifier; record attempts/success/fallback/error |
| `automem/api/recall.py` | Add score timestamp tiebreak; recency-bias rerank; scope fallback fills; tag scope diagnostics; supersession chain-walk |
| `automem/api/enrichment.py` | Add classification block to /enrichment/status response |
| `app.py` | Wire classifier stats from global ServiceState into MemoryClassifier |
| `.github/workflows/ci.yml` | Run CI for both main and develop pushes/PRs |

Comment thread automem/classification/memory_classifier.py
Comment thread mcp-sse-server/server.js
Comment thread scripts/lab/summarize_pipeline_20260611.py
@jack-arturo changed the title from "release: recall ranking series (#182, #193, #186, #187, #183, #184, #188)" to "feat(recall): ranking release — recency config, tag-score cap, relevance gate, date-aware ranking (#182, #193, #186, #187, #183, #184, #188)" on Jun 12, 2026
- classifier: set descriptive llm_error when LLM returns no usable result so /enrichment/status last_error populates
- mcp-sse-server: truncate metadata previews in detailed text format at 1500 chars (json format unchanged) + test
- lab summarizer: label paired_t as the z-approximation it is (renamed paired_z)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jack-arturo merged commit 337fe98 into main on Jun 12, 2026 (13 checks passed)
@jack-arturo deleted the develop branch on June 12, 2026 at 15:32
