Problem
Two issues in recall scoring compound to significantly reduce result quality, especially for proper-noun and entity searches.
1. Keyword component returns 0.0 for all vector-sourced results
In automem/utils/scoring.py, _compute_metadata_score() only assigns a keyword score when match_type is "keyword" or "trending":
keyword_component = (
    result.get("match_score", 0.0)
    if result.get("match_type") in {"keyword", "trending"}
    else 0.0
)
However, in the recall flow (automem/api/recall.py), vector search runs first and fills all available slots. Graph keyword search only runs if there are remaining slots. In practice, most or all results arrive with match_type="vector", so keyword_component is 0.0 for nearly every result.
Impact: SEARCH_WEIGHT_KEYWORD (default 0.35) — 35% of the scoring formula — is dead weight. Memories containing the exact query terms get no keyword boost.
2. Adaptive floor cuts too many valid results
The adaptive floor (added in #73 / PR #101) finds the largest score gap in the top half of results and cuts everything below if the gap exceeds 15% of the max score. For entity searches (e.g., "AutoJack"), this is too aggressive:
- Qdrant contains 94 memories matching "AutoJack"
- Recall returned 30 before floor filtering
- Adaptive floor cut 22 of 30 (73%), leaving only 8
The gap detection triggers because there's a natural score cluster at 0.72-0.76 (memories with both exact match + tag hits) followed by a gap to 0.55 (memories with content match but no exact metadata hit). The 0.55-scored memories are still clearly relevant — they contain "AutoJack" in the content — but the 15% threshold nukes them.
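The arithmetic behind that trigger can be checked with a small sketch. The helper name `gap_exceeds_threshold` is hypothetical, and the scores are the approximate cluster boundaries quoted above, not exact values from the run:

```python
def gap_exceeds_threshold(upper: float, lower: float, top_score: float, ratio: float) -> bool:
    """Return True when the drop from `upper` to `lower` is larger than
    `ratio` times the top score (the condition the adaptive floor tests)."""
    return (upper - lower) > ratio * top_score


# Gap from the bottom of the 0.72-0.76 cluster down to 0.55 is about 0.17.
# 15% of the top score (0.76) is about 0.114, so the floor triggers.
print(gap_exceeds_threshold(upper=0.72, lower=0.55, top_score=0.76, ratio=0.15))  # → True
```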
Reproduction
# Query Qdrant directly — 94 memories contain "AutoJack"
curl -s 'http://127.0.0.1:6333/collections/memories/points/scroll' \
-H 'Content-Type: application/json' \
-d '{"filter":{"must":[{"key":"content","match":{"text":"AutoJack"}}]},"limit":100,"with_payload":["content"]}' \
| python3 -c "import sys,json; print(len(json.load(sys.stdin)['result']['points']))"
# → 94
# Recall API returns only 8 (with keyword=0.0 across the board)
curl -s 'http://127.0.0.1:8001/recall?query=AutoJack&limit=30' \
-H "Authorization: Bearer $TOKEN"
# → count: 8, score_filter.adaptive_floor: 0.55, score_filter.filtered_count: 22
# → All results show keyword: 0.0 in score_components
Fix Applied Locally
We applied two changes on a local branch (fix/recall-keyword-scoring-and-adaptive-floor) and confirmed them against 43.5k production memories:
Fix 1: Content-based keyword scoring for all result types
In _compute_metadata_score(), when match_type is neither "keyword" nor "trending", check the memory content for query token presence:
keyword_component = 0.0
if result.get("match_type") in {"keyword", "trending"}:
    keyword_component = result.get("match_score", 0.0)
elif tokens:
    content_lower = (memory.get("content") or "").lower()
    if content_lower:
        content_hits = sum(1 for t in tokens if t in content_lower)
        keyword_component = content_hits / len(tokens)
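For experimenting with the fallback outside the scorer, here is a minimal standalone sketch of the same idea. The function name `content_keyword_score` is hypothetical, and it lowercases each token itself, whereas the snippet above appears to assume `tokens` are already lowercase:

```python
from typing import Optional, Sequence


def content_keyword_score(content: Optional[str], tokens: Sequence[str]) -> float:
    """Fraction of query tokens that occur as substrings of the content."""
    if not tokens:
        return 0.0
    content_lower = (content or "").lower()
    if not content_lower:
        return 0.0
    # Assumption: tokens may be mixed case, so normalise them here.
    hits = sum(1 for t in tokens if t.lower() in content_lower)
    return hits / len(tokens)


print(content_keyword_score("AutoJack handles deploys", ["autojack"]))             # → 1.0
print(content_keyword_score("AutoJack handles deploys", ["autojack", "billing"]))  # → 0.5
print(content_keyword_score(None, ["autojack"]))                                   # → 0.0
```

Note that this is substring matching, not word matching: a token such as "auto" would also count as a hit against "AutoJack". Whether that is acceptable depends on how the query is tokenized upstream.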
Fix 2: Adaptive floor guardrails
- Raised gap threshold from 15% → 25% of max score
- Added guardrail: floor cannot cut more than 50% of results
if max_gap > 0.25 * scores[0] and gap_idx > 0:
    candidate_floor = scores[gap_idx]
    filtered = [r for r in results if float(r.get("final_score", 0.0)) >= candidate_floor]
    if len(filtered) >= len(results) // 2:
        score_floor_applied = candidate_floor
        results = filtered
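The snippet above does not show how `max_gap` and `gap_idx` are computed, so the following is only a self-contained sketch for reasoning about the guardrail. It assumes the gap search covers the top half of the sorted scores and that `gap_idx` is the index of the last score above the largest gap. The function name `apply_adaptive_floor` and the `gap_ratio` parameter are hypothetical:

```python
from typing import Any, Dict, List, Optional, Tuple


def apply_adaptive_floor(
    results: List[Dict[str, Any]],
    gap_ratio: float = 0.25,
) -> Tuple[List[Dict[str, Any]], Optional[float]]:
    """Return (possibly filtered results, floor applied or None)."""
    scores = sorted(
        (float(r.get("final_score", 0.0)) for r in results), reverse=True
    )
    if len(scores) < 2 or scores[0] <= 0:
        return results, None

    # Assumption: look for the largest drop in the top half only, and let
    # gap_idx be the index of the last score above that drop.
    half = max(len(scores) // 2, 1)
    max_gap, gap_idx = 0.0, 0
    for i in range(half):
        gap = scores[i] - scores[i + 1]
        if gap > max_gap:
            max_gap, gap_idx = gap, i

    if max_gap > gap_ratio * scores[0] and gap_idx > 0:
        candidate_floor = scores[gap_idx]
        filtered = [
            r for r in results
            if float(r.get("final_score", 0.0)) >= candidate_floor
        ]
        # Guardrail: only apply the floor if at least half the results survive.
        if len(filtered) >= len(results) // 2:
            return filtered, candidate_floor
    return results, None


def make_results(values: List[float]) -> List[Dict[str, Any]]:
    """Build minimal result dicts for the examples below."""
    return [{"final_score": v} for v in values]


# Gap sits at the midpoint: the floor is applied and 3 of 6 results remain.
kept, floor = apply_adaptive_floor(make_results([0.9, 0.88, 0.86, 0.4, 0.38, 0.36]))
print(len(kept), floor)  # → 3 0.86

# Gap sits near the top: the floor would keep only 2 of 8, so the guardrail blocks it.
kept, floor = apply_adaptive_floor(make_results([0.9, 0.88, 0.4, 0.39, 0.38, 0.37, 0.36, 0.35]))
print(len(kept), floor)  # → 8 None
```

Under this reading, because the gap is searched only in the top half, the guardrail lets the floor through only when the cut lands at the midpoint. If the real `gap_idx` is defined differently, the behavior will differ.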
Results
| Metric | Before | After |
| --- | --- | --- |
| Results for "AutoJack" (limit 30) | 8 | 30 |
| Top score | 0.761 | 1.111 |
| Keyword component | 0.0 | 1.0 |
| Adaptive floor cut | 22 results | 0 results |
All 191 unit tests pass (2 pre-existing failures in test_content_size.py unrelated to changes).
Related
Filed by Flint (@flintfromthebasement) after debugging with Jason Coleman.