
Template refresh for Opus 4.7 / 1M context + 2 server bugs surfaced during validation #97

@jack-arturo

TL;DR

While validating AutoMem recall behavior for Opus 4.7 / 1M context across ~15 experiments against production (https://automem.up.railway.app), three items surfaced:

  1. Bug: A store_memory call silently failed to persist — memory never appeared in any subsequent recall (content search, tag exact match, tag prefix, vector similarity).
  2. Bug: context_tags + small limit drops boosted results with a higher final score in favor of unboosted vector matches with a lower final score.
  3. Recommendation: Template updates for 1M-context sessions, with measurements showing which knobs actually move the needle.

Also a docs inconsistency worth flagging (item C below).


Context

Validation was done end-to-end via the MCP client and via direct HTTP GET /recall against production. Corpus size during testing: 9,401 memories, with both Qdrant and FalkorDB reporting healthy and in sync. All tests read-only.


A. Silent store failure — pref/test memory never persisted

Severity: Medium. Data integrity. A store call that reports success without persisting is the worst failure mode.

What happened:
A store_memory call was made yesterday with:

content: "Test pref: multi-phase recall runs on restart"
type:    "Preference"
tags:    ["pref/test", "project/mcp-automem", "source/manual"]
importance: 0.5

MCP call returned success. Memory cannot be found through any recall path:

# Expected at least 1 result. Got 0:
curl -G "$ENDPOINT/recall" -H "X-Api-Key: $KEY" \
  --data-urlencode 'tags=pref/test' \
  --data-urlencode 'tag_match=exact' \
  --data-urlencode 'limit=10'
# → {"count": 0, ...}

# Content search — exact phrase from the memory content:
curl -G "$ENDPOINT/recall" -H "X-Api-Key: $KEY" \
  --data-urlencode 'query=multi-phase recall runs on restart' \
  --data-urlencode 'limit=15'
# → returns 15 results, none are the stored memory
# (top results are all unrelated "restart" voice-conversation extractions)

# Confirmed zero pref/* memories of any kind:
curl -G "$ENDPOINT/recall" -H "X-Api-Key: $KEY" \
  --data-urlencode 'tags=pref' \
  --data-urlencode 'tag_match=prefix' \
  --data-urlencode 'limit=50'
# → {"count": 0, ...}

Action needed: Check server logs for store operations on 2026-04-16 / 2026-04-17 with that content. Determine whether the store failed at validation, embedding, or persistence, and whether the client was informed.
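
Until the root cause is known, a client could guard against this with a read-after-write check. The following is only a minimal sketch: `is_persisted` and the injected `recall` callable are hypothetical names, and the only API details assumed are the ones shown in the repro above (the `tags`, `tag_match`, and `limit` query parameters and the `count` field in the response).

```python
import time
from typing import Any, Callable, Mapping

# Hypothetical transport hook: takes /recall query parameters and returns the
# parsed JSON body. Wire it to curl, an HTTP library, or the MCP client.
RecallFn = Callable[[Mapping[str, str]], Mapping[str, Any]]


def is_persisted(recall: RecallFn, tag: str, attempts: int = 3,
                 delay_s: float = 1.0) -> bool:
    """Return True once an exact-match tag recall reports count >= 1.

    Retries a few times in case indexing lags behind the store call
    (whether it does is an assumption, not something the issue confirms).
    """
    for attempt in range(attempts):
        body = recall({"tags": tag, "tag_match": "exact", "limit": "1"})
        if int(body.get("count", 0)) >= 1:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Calling this right after a store with a tag unique to the new memory would turn a silent loss into an explicit client-side failure.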


B. context_tags + low limit drops boosted results

Severity: Medium. Silent ranking corruption.

Same semantic query + context_tags boost, varying limit:

| limit | Count | Top scores | Top contents (abbreviated) |
|-------|-------|------------|----------------------------|
| 5 | 3 | 1.057, 1.007, 0.465 | Wan 2.2 q4, Wan 2.2 perf, M5 Max Ollama specs (no project/video tag) |
| 10 | 4 | 1.057, 1.007, 0.909, 0.515 | Wan 2.2 q4, Wan 2.2 perf, Wan 2.2 setup (project/video tag), Video gen plan |

At limit=5, the server returned the M5 Max Ollama memory (raw 0.465, no boost) instead of the Wan 2.2 setup memory (raw 0.459, boosted to 0.909). The boost wasn't applied before the top-N cut: it looks like the tag-boosted and vector-only result streams are being merged or truncated out of order. The raw ordering (0.465 > 0.459) is consistent with the cut being made on raw score.

Repro:

Q="wan 2.2 benchmark M5 Max"

# Missing a boosted result:
curl -G "$ENDPOINT/recall" -H "X-Api-Key: $KEY" \
  --data-urlencode "query=$Q" \
  --data-urlencode 'context_tags=project/video' \
  --data-urlencode 'limit=5'

# Complete result:
curl -G "$ENDPOINT/recall" -H "X-Api-Key: $KEY" \
  --data-urlencode "query=$Q" \
  --data-urlencode 'context_tags=project/video' \
  --data-urlencode 'limit=10'

Workaround for clients: always use limit >= 20 when combining context_tags with a semantic query. But the real fix belongs in the server's merge/sort logic.
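
To make the suspected ordering problem concrete, here is a small sketch contrasting "cut on raw score, then boost" with "boost, then cut". It is not the server's code. Assumptions: the boost is modeled as additive (0.909 - 0.459 = 0.45), the top two memories are entered with their returned scores because the issue does not report their raw/boost split, and a cut of 3 is used only because three results came back at limit=5. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str
    raw: float    # vector similarity before any tag boost
    boost: float  # assumed additive context_tags boost (0.0 if tag absent)

    @property
    def final(self) -> float:
        return self.raw + self.boost


def truncate_then_boost(cands, k):
    """Suspected buggy order: cut on raw score, then rank survivors by final."""
    survivors = sorted(cands, key=lambda c: c.raw, reverse=True)[:k]
    return sorted(survivors, key=lambda c: c.final, reverse=True)


def boost_then_truncate(cands, k):
    """Expected order: rank everything by final score, then cut."""
    return sorted(cands, key=lambda c: c.final, reverse=True)[:k]


CANDIDATES = [
    Candidate("Wan 2.2 q4", 1.057, 0.0),    # split not reported; entered as returned
    Candidate("Wan 2.2 perf", 1.007, 0.0),  # same caveat
    Candidate("M5 Max Ollama specs", 0.465, 0.0),
    Candidate("Wan 2.2 setup", 0.459, 0.45),  # raw 0.459, about 0.909 after boost
]

if __name__ == "__main__":
    for rank in (truncate_then_boost, boost_then_truncate):
        print(rank.__name__, [c.label for c in rank(CANDIDATES, 3)])
```

Under these assumptions the first ordering reproduces the observed limit=5 output (M5 Max Ollama specs in third place), while the second keeps the boosted Wan 2.2 setup memory.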


C. context_tags literal-match vs prefix-index asymmetry

Not technically a bug, but an internal inconsistency that surprised us during the prefix/no-prefix tag-scheme evaluation.

  • tags + tag_match=prefix matches against the tag_prefixes index. Both / and : separators get split into namespace segments (e.g., project/foo indexes as project, project:foo).
  • context_tags does a literal string compare against the raw tags array. context_tags=["project:foo"] will NOT match a memory tagged project/foo even though they're equivalent via the prefix index.

Measured:

# Same query, same data:
curl ... --data-urlencode 'context_tags=project/mcp-automem'
# → 3 results, top final_score = 1.001 (boost applied)

curl ... --data-urlencode 'context_tags=project:mcp-automem'  
# → 24 results, top final_score = 0.602 (NO boost applied)

Result sets are disjoint. One fix option: have context_tags consult the prefix index too, so / and : are interchangeable in both filter paths.

At minimum: document that tag_prefixes stores with : separator while input uses either / or :.
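
The separator-insensitive comparison suggested above could look roughly like this. It is a sketch of the idea only, not the server's indexing code: the function names are hypothetical, and the handling of tags with more than two segments is an assumption extrapolated from the project/foo example.

```python
import re


def split_tag(tag: str) -> list[str]:
    """Split a tag into namespace segments on either '/' or ':'."""
    return [seg for seg in re.split(r"[/:]", tag.strip()) if seg]


def tag_prefixes(tag: str) -> list[str]:
    """Prefix-index entries for a tag, stored with ':' as the separator."""
    segs = split_tag(tag)
    return [":".join(segs[:i]) for i in range(1, len(segs) + 1)]


def canonical(tag: str) -> str:
    """Canonical form of a tag, so '/' and ':' spellings compare equal."""
    return ":".join(split_tag(tag))


def context_tag_matches(context_tag: str, memory_tags: list[str]) -> bool:
    """Separator-insensitive replacement for a literal string compare."""
    wanted = canonical(context_tag)
    return any(canonical(t) == wanted for t in memory_tags)


if __name__ == "__main__":
    print(tag_prefixes("project/foo"))
    print(context_tag_matches("project:foo", ["project/foo"]))
```

With a comparison like this, context_tags=project:mcp-automem and context_tags=project/mcp-automem would boost the same memories.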


D. Template refresh for Opus 4.7 / 1M context

Context

Tested the current templates/CLAUDE_MD_MEMORY_RULES.md two-phase pattern against a proposed 4-phase namespace-prefixed scheme. Two-phase (production) wins for Opus 4.7, with three small parameter changes validated by measurement.

Validated: the existing "bare tag" convention is the right one

Namespace-prefixed tags (project/<slug>, pref/<scope>, source/<kind>, lang/<lang>) were explored but are NOT recommended. Reasons, in order of weight:

  1. The existing 9,401-memory corpus uses bare tags exclusively. Introducing a prefix scheme creates a bifurcated corpus where old memories are invisible to new exact-match queries.
  2. context_tags literal-match asymmetry (item C above) means the migration costs real ranking quality during the transition.
  3. The prefix doesn't solve the "hard-gate bite" — a memory that lacks tag project/video also typically lacks bare tag video. The fix is tagging discipline, not syntactic namespaces.

Only risk with bare tags: slug collision with topic words. Empirically:

| Gate | Behavior |
|------|----------|
| tags:["streamdeck-mcp"] | clean: slug is unique |
| tags:["mcp-automem"] | clean: slug is unique |
| tags:["video"] | bad: pulls unrelated "video content strategy" memories |

Guidance to add to template: "Use a project slug that doesn't collide with common topic words (streamdeck-mcp ✓, video ✗). For short/generic names, either prefix at store time (video-gen-project) or omit the tag gate on that project."

Parameter bumps for 1M context

All measured on the production Phase 2 pattern over the mcp-automem slug:

| Config | Count | Results scoring ≥0.4 |
|--------|-------|----------------------|
| Current default: limit=10, time_query="last 30 days" | 2 | 2 |
| limit=30, time_query="last 30 days" | 3 | 3 |
| limit=10, time_query="last 90 days" | 2 | 2 |
| limit=30, time_query="last 90 days" (proposed) | 5 | 5 |

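The two bumps compound: limit=30 alone adds one result, the 90-day window alone adds none, and together they return 5 results, 2.5× the current default. There is no loss in score quality: all 5 returned results scored ≥1.27.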

Phase 1 (tags:["preference"], limit:20) validated with 14 high-signal results from the corpus (Jack's PR-merge preference, no-markdown-memory preference, etc.). No dilution.

Validated: auto_decompose=true is safe but low-impact

Same query with auto_decompose=true vs false: 16 results either way, with one result added and one dropped. Keep it on; it may help for richer multi-topic queries but doesn't move the needle for focused ones.

Validated: expand_relations is a no-op right now

Expansion with the current limits (relation_limit:8, expansion_limit:60) and with the proposed limits (20/150) both added zero memories on the test query. The corpus is sparse in associations (few explicit associate_memories calls). Keep expansion on, but don't bother bumping the limits until association discipline improves. Re-measure in ~30 days.

Proposed template replacement

# Phase 1 — Preferences (tag-only, no time filter, no semantic query)
recall_memory({ tags: ["preference"], limit: 20 })

# Phase 2 — Task context (semantic + time-limited + project-gated)
recall_memory({
  queries: ["<task topic>", "user corrections", "recent decisions"],
  tags: ["<project-slug>"],
  auto_decompose: true,
  time_query: "last 90 days",
  limit: 30
})

# Optional Phase 3 (on-demand, NOT at every session start) — debugging
recall_memory({
  query: "<error symptom>",
  tags: ["bugfix", "solution"],
  limit: 20
})

Deltas from current template: limit 10 → 20/30, time_query 30 days → 90 days, explicit guidance that Phase 3 is on-demand.
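
For clients that assemble these calls programmatically, the two session-start phases can be sketched as plain argument dictionaries. The helper name is hypothetical and the sketch only builds the arguments for recall_memory as written in the template above; it does not call any API. Leaving out the slug drops the tag gate, following the guidance for generic project names.

```python
from typing import Any, Optional


def session_start_recalls(task_topic: str,
                          project_slug: Optional[str] = None) -> list[dict[str, Any]]:
    """Build recall_memory arguments for Phase 1 and Phase 2.

    Pass project_slug=None for projects whose slug collides with common
    topic words; Phase 2 then runs without a tag gate.
    """
    # Phase 1: preferences (tag-only, no time filter, no semantic query)
    phase1: dict[str, Any] = {"tags": ["preference"], "limit": 20}

    # Phase 2: task context (semantic + time-limited, optionally project-gated)
    phase2: dict[str, Any] = {
        "queries": [task_topic, "user corrections", "recent decisions"],
        "auto_decompose": True,
        "time_query": "last 90 days",
        "limit": 30,
    }
    if project_slug:
        phase2["tags"] = [project_slug]

    return [phase1, phase2]


if __name__ == "__main__":
    for args in session_start_recalls("recall tuning", "mcp-automem"):
        print(args)
```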


Appendix: Measurement methodology

  • Direct HTTP GET /recall on production endpoint with X-Api-Key auth
  • MCP client (mcp__memory__recall_memory) cross-checked for parity: parity held in all tests, including item B above, which reproduces on both paths
  • Corpus state at test time: 9,401 memories, 9,401 vectors, FalkorDB + Qdrant both connected and synced
  • All HTTP calls read-only; no stores or modifications during validation
