TL;DR
While validating AutoMem recall behavior for Opus 4.7 / 1M context across ~15 experiments against production (https://automem.up.railway.app), three items surfaced:
- Bug: A store_memory call silently failed to persist; the memory never appeared in any subsequent recall (content search, tag exact match, tag prefix, vector similarity).
- Bug: context_tags + small limit drops higher-scoring boosted results in favor of lower-raw-score vector matches.
- Recommendation: Template updates for 1M-context sessions, with measurements showing which knobs actually move the needle.
A docs inconsistency is also worth flagging (item C below).
Context
Validation was done end-to-end via the MCP client and via direct HTTP GET /recall against production. Corpus size during testing: 9,401 memories, with both Qdrant and FalkorDB reporting healthy and synced. All tests were read-only.
A. Silent store failure — pref/test memory never persisted
Severity: Medium. Data integrity. Store calls returning success while not persisting is the worst failure mode.
What happened:
A store_memory call was made yesterday with:
content: "Test pref: multi-phase recall runs on restart"
type: "Preference"
tags: ["pref/test", "project/mcp-automem", "source/manual"]
importance: 0.5
MCP call returned success. Memory cannot be found through any recall path:
# Expected at least 1 result. Got 0:
curl -G "$ENDPOINT/recall" -H "X-Api-Key: $KEY" \
--data-urlencode 'tags=pref/test' \
--data-urlencode 'tag_match=exact' \
--data-urlencode 'limit=10'
# → {"count": 0, ...}
# Content search — exact phrase from the memory content:
curl -G "$ENDPOINT/recall" -H "X-Api-Key: $KEY" \
--data-urlencode 'query=multi-phase recall runs on restart' \
--data-urlencode 'limit=15'
# → returns 15 results, none are the stored memory
# (top results are all unrelated "restart" voice-conversation extractions)
# Confirmed zero pref/* memories of any kind:
curl -G "$ENDPOINT/recall" -H "X-Api-Key: $KEY" \
--data-urlencode 'tags=pref' \
--data-urlencode 'tag_match=prefix' \
--data-urlencode 'limit=50'
# → {"count": 0, ...}
Action needed: Check server logs for store operations on 2026-04-16 / 2026-04-17 with that content. Determine whether the store failed at validation, embedding, or persistence, and whether the client was informed.
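Until the root cause is found, a client can guard against this failure mode with a read-after-write check. The sketch below is a minimal illustration, not AutoMem's API: `store` and `recall` are hypothetical callables supplied by the caller, and `recall` is assumed to return a list of memory dicts with a `content` field. Adapt both to the real client and response shape.

```python
from typing import Callable, Iterable


def verify_persisted(
    recall: Callable[[dict], Iterable[dict]],
    content: str,
    tags: list,
) -> bool:
    """Return True if a memory with exactly `content` is visible via exact-tag recall."""
    for tag in tags:
        # Same parameters as the exact-match curl repro above.
        results = recall({"tags": tag, "tag_match": "exact", "limit": 10})
        if any(r.get("content") == content for r in results):
            return True
    return False


def store_and_verify(store: Callable[[dict], object],
                     recall: Callable[[dict], Iterable[dict]],
                     memory: dict) -> bool:
    """Store a memory, then confirm it can be recalled by at least one of its tags."""
    store(memory)  # may report success even if nothing was persisted
    return verify_persisted(recall, memory["content"], memory["tags"])
```

If indexing is asynchronous on the server, a short retry loop around `verify_persisted` would be needed before treating a miss as a failure; that is left out here.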
B. context_tags + low limit drops boosted results
Severity: Medium. Silent ranking corruption.
Same semantic query + context_tags boost, varying limit:
| limit | Count | Top 3 raw scores | Top 3 contents (abbreviated) |
|---|---|---|---|
| 5 | 3 | 1.057, 1.007, 0.465 | Wan 2.2 q4, Wan 2.2 perf, M5 Max Ollama specs (no project/video tag) |
| 10 | 4 | 1.057, 1.007, 0.909, 0.515 | Wan 2.2 q4, Wan 2.2 perf, Wan 2.2 setup (project/video tag), Video gen plan |
At limit=5, the server returned the M5 Max Ollama memory at raw 0.465 INSTEAD OF the Wan 2.2 setup memory at raw 0.459 boosted to 0.909. The boost wasn't applied before the top-N cut — looks like tag-boosted and vector-only result streams are being merged or truncated out of order.
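One ordering that would produce this symptom is truncating by raw score before the boost is applied. The sketch below illustrates that hypothesis only; it is not taken from the server code. The memory ids are made up, and the additive boost of 0.45 is inferred from the 0.459 → 0.909 jump reported above.

```python
def rank_boost_then_cut(candidates, context_tags, boost, limit):
    """Apply the context-tag boost to every candidate, then keep the top `limit`."""
    wanted = set(context_tags)
    scored = [
        (c["raw"] + (boost if wanted & set(c["tags"]) else 0.0), c["id"])
        for c in candidates
    ]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:limit]]


def rank_cut_then_boost(candidates, context_tags, boost, limit):
    """Suspected buggy order: truncate by raw score first, boost the survivors after."""
    shortlist = sorted(candidates, key=lambda c: c["raw"], reverse=True)[:limit]
    return rank_boost_then_cut(shortlist, context_tags, boost, limit)


# Two candidates competing for the last slot (ids are hypothetical).
candidates = [
    {"id": "ollama-specs", "raw": 0.465, "tags": []},
    {"id": "wan-setup", "raw": 0.459, "tags": ["project/video"]},
]

correct = rank_boost_then_cut(candidates, ["project/video"], 0.45, 1)
suspect = rank_cut_then_boost(candidates, ["project/video"], 0.45, 1)
```

With these inputs, `correct` keeps the boosted `wan-setup` entry, while `suspect` keeps `ollama-specs`, matching the limit=5 behavior in the table.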
Repro:
Q="wan 2.2 benchmark M5 Max"
# Missing a boosted result:
curl -G "$ENDPOINT/recall" -H "X-Api-Key: $KEY" \
--data-urlencode "query=$Q" \
--data-urlencode 'context_tags=project/video' \
--data-urlencode 'limit=5'
# Complete result:
curl -G "$ENDPOINT/recall" -H "X-Api-Key: $KEY" \
--data-urlencode "query=$Q" \
--data-urlencode 'context_tags=project/video' \
--data-urlencode 'limit=10'
Workaround for clients: always use limit >= 20 when combining context_tags with a semantic query. But the real fix belongs in the server's merge/sort logic.
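The client-side workaround can be wrapped in a small helper that over-fetches and trims locally. This is a sketch under assumptions: `recall` is a hypothetical callable that takes a parameter dict and returns a list of result dicts, and each result is assumed to carry the `final_score` field seen in the responses.

```python
MIN_FETCH = 20  # floor suggested by the workaround above


def recall_with_overfetch(recall, params, limit):
    """Request at least MIN_FETCH results, then re-sort and trim on the client."""
    fetch_params = dict(params, limit=max(limit, MIN_FETCH))
    results = list(recall(fetch_params))
    # Re-sort locally so the final cut does not depend on server-side ordering.
    results.sort(key=lambda r: r["final_score"], reverse=True)
    return results[:limit]
```

The cost is a larger response per call, which should be acceptable for session-start recalls but is worth measuring if this is used in a hot path.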
C. context_tags literal-match vs prefix-index asymmetry
Not technically a bug, but an internal inconsistency that surprised us during the prefix/no-prefix tag-scheme evaluation.
tags + tag_match=prefix matches against the tag_prefixes index. Both / and : separators get split into namespace segments (e.g., project/foo indexes as project, project:foo).
context_tags does a literal string compare against the raw tags array. context_tags=["project:foo"] will NOT match a memory tagged project/foo even though they're equivalent via the prefix index.
Measured:
# Same query, same data:
curl ... --data-urlencode 'context_tags=project/mcp-automem'
# → 3 results, top final_score = 1.001 (boost applied)
curl ... --data-urlencode 'context_tags=project:mcp-automem'
# → 24 results, top final_score = 0.602 (NO boost applied)
Result sets are disjoint. One fix option: have context_tags consult the prefix index too, so / and : are interchangeable in both filter paths.
At minimum: document that tag_prefixes stores with : separator while input uses either / or :.
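The fix option above amounts to canonicalizing separators before comparing. The sketch below shows one way to do that; the function names are hypothetical and the prefix expansion simply mirrors the `project/foo` → `project`, `project:foo` behavior described for the tag_prefixes index, not the server's actual implementation.

```python
import re

_SEPARATORS = re.compile(r"[/:]")


def _segments(tag):
    """Split a tag on either separator, dropping empty segments."""
    return [part for part in _SEPARATORS.split(tag) if part]


def tag_prefixes(tag):
    """Expand a tag into cumulative ':'-joined prefixes."""
    parts = _segments(tag)
    return [":".join(parts[: i + 1]) for i in range(len(parts))]


def canonical(tag):
    """Rewrite a tag with ':' as the only separator."""
    return ":".join(_segments(tag))


def context_tag_matches(context_tags, memory_tags):
    """True if any context tag equals any memory tag after canonicalization."""
    wanted = {canonical(t) for t in context_tags}
    return any(canonical(t) in wanted for t in memory_tags)
```

With this comparison, `context_tags=["project:mcp-automem"]` and `context_tags=["project/mcp-automem"]` would boost the same memories.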
D. Template refresh for Opus 4.7 / 1M context
Context
Tested the current templates/CLAUDE_MD_MEMORY_RULES.md two-phase pattern against a proposed 4-phase namespace-prefixed scheme. Two-phase (production) wins for Opus 4.7, with three small parameter changes validated by measurement.
Validated: the existing "bare tag" convention is the right one
Namespace-prefixed tags (project/<slug>, pref/<scope>, source/<kind>, lang/<lang>) were explored but are NOT recommended. Reasons, in order of weight:
- The existing 9,401-memory corpus uses bare tags exclusively. Introducing a prefix scheme creates a bifurcated corpus where old memories are invisible to new exact-match queries.
- The context_tags literal-match asymmetry (item C above) means the migration costs real ranking quality during the transition.
- The prefix doesn't solve the "hard-gate bite": a memory that lacks tag project/video also typically lacks bare tag video. The fix is tagging discipline, not syntactic namespaces.
Only risk with bare tags: slug collision with topic words. Empirically:
| Gate | Behavior |
|---|---|
| tags:["streamdeck-mcp"] | clean: slug is unique |
| tags:["mcp-automem"] | clean: slug is unique |
| tags:["video"] | bad: pulls unrelated "video content strategy" memories |
Guidance to add to template: "Use a project slug that doesn't collide with common topic words (streamdeck-mcp ✓, video ✗). For short/generic names, either prefix at store time (video-gen-project) or omit the tag gate on that project."
Parameter bumps for 1M context
All measured on the production Phase 2 pattern over the mcp-automem slug:
| Config | Count | Results scoring ≥0.4 |
|---|---|---|
| Current default: limit=10, time_query="last 30 days" | 2 | 2 |
| limit=30, time_query="last 30 days" | 3 | 3 |
| limit=10, time_query="last 90 days" | 2 | 2 |
| limit=30, time_query="last 90 days" (proposed) | 5 | 5 |
Both bumps compound. 2.5× useful results, zero score quality loss. All 5 returned results scored ≥1.27.
Phase 1 (tags:["preference"], limit:20) validated with 14 high-signal results from the corpus (Jack's PR-merge preference, no-markdown-memory preference, etc.). No dilution.
Validated: auto_decompose=true is safe but low-impact
Same query with auto_decompose=true vs false: 16 results either way, 1 add and 1 drop. Keep it on; it may help for richer multi-topic queries but doesn't move the needle for focused ones.
Validated: expand_relations is a no-op right now
Expansion with current (relation_limit:8, expansion_limit:60) vs proposed (20/150) both added zero memories on the test query. The corpus is sparse in associations (few explicit associate_memories calls). Keep expansion on, don't bother bumping the limits until association discipline improves. Re-measure in ~30 days.
Proposed template replacement
# Phase 1 — Preferences (tag-only, no time filter, no semantic query)
recall_memory({ tags: ["preference"], limit: 20 })
# Phase 2 — Task context (semantic + time-limited + project-gated)
recall_memory({
queries: [<task topic>, "user corrections", "recent decisions"],
tags: ["<project-slug>"],
auto_decompose: true,
time_query: "last 90 days",
limit: 30
})
# Optional Phase 3 (on-demand, NOT at every session start) — debugging
recall_memory({
query: "<error symptom>",
tags: ["bugfix", "solution"],
limit: 20
})
Deltas from current template: limit 10 → 20/30, time_query 30 days → 90 days, explicit guidance that Phase 3 is on-demand.
Appendix: Measurement methodology
- Direct HTTP GET /recall on production endpoint with X-Api-Key auth
- MCP client (mcp__memory__recall_memory) cross-checked for parity; parity held in all tests except item B above, which reproduces on both paths
- Corpus state at test time: 9,401 memories, 9,401 vectors, FalkorDB + Qdrant both connected and synced
- All HTTP calls read-only; no stores or modifications during validation