[BOUNTY #639] Memanto vs Mem0 — Shifting Persona Benchmark#705
minhthai1995 wants to merge 1 commit into
…h-ai#639)

Implements Scenario B from the Agentic Memory Showdown bounty challenge. Compares Memanto (moorcheh-sdk direct upsert) against Mem0 (LLM-extraction pipeline) on token overhead, p95 latency, and temporal preference accuracy.

Dataset: 5-session "Evolving Film Enthusiast", in which preferences evolve from action → sci-fi → documentary → thriller → horror. Tests whether each system correctly surfaces the CURRENT preference without stale-history pollution.

Metrics measured per system:
- Total tokens written during ingestion (write overhead)
- Total tokens retrieved across 5 evaluation queries
- p95 write and read latency (seconds)
- Accuracy score 0-3 per query via LLM-as-judge (Claude Haiku)

Scientific controls: same dataset, same judge model, same judge prompt, isolated namespaces (UUID per run), same top_k=5, tiktoken cl100k_base for token counting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 Walkthrough
This PR adds a complete benchmark suite (
Changes
Benchmark Suite: Memanto vs Mem0
Actionable comments posted: 9
🧹 Nitpick comments (3)
examples/benchmarks/README.md (1)
49-53: ⚡ Quick win
Add language specifiers to fenced code blocks.
Three code blocks are missing language specifiers, which reduces readability and prevents proper syntax highlighting. The static analysis tool correctly flagged these at lines 49, 61, and 139.
📝 Suggested fix
For the architecture diagrams (lines 49 and 61):
    -```
    +```text
     User message → MoorchehClient.documents.upsert() → Moorcheh serverless index

    -```
    +```text
     User messages → Mem0 extraction LLM (Claude Haiku) → Vectorized memory facts
For the example output (line 139):
    -```
    +```text
     🏆 The Great Agentic Memory Showdown
Also applies to: 61-65, 139-166
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/README.md` around lines 49 - 53, Add language specifiers ("text") to the three fenced code blocks that lack them: the architecture diagram containing "User message → MoorchehClient.documents.upsert()", the diagram containing "User messages → Mem0 extraction LLM (Claude Haiku) → Vectorized memory facts", and the example output block that starts with "🏆 The Great Agentic Memory Showdown"; update those triple-backtick fences to use ```text so the blocks get proper syntax highlighting and readability.
examples/benchmarks/metrics/accuracy_judge.py (1)
64-64: Make the judge model configurable (avoid hardcoded dated Claude ID)
`examples/benchmarks/metrics/accuracy_judge.py` hardcodes `claude-haiku-4-5-20251001` for the LLM-as-judge. Even though this snapshot is currently valid, dated model IDs can be deprecated; use an env override with a default for benchmark stability.
♻️ Proposed refactor
    +_JUDGE_MODEL = os.getenv("ANTHROPIC_JUDGE_MODEL", "claude-haiku-4-5-20251001")
     ...
    -        model="claude-haiku-4-5-20251001",
    +        model=_JUDGE_MODEL,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/metrics/accuracy_judge.py` at line 64, Replace the hardcoded dated Claude model string used as the LLM-as-judge with a configurable environment variable: introduce reading os.environ.get("ACCURACY_JUDGE_MODEL", "claude-haiku-4-5-20251001") (or similar JUDGE_MODEL name) and pass that variable into the call/site where model="claude-haiku-4-5-20251001" is used (the function/constructor that accepts the model parameter in accuracy_judge.py). Ensure the new env var is documented or has a clear default so benchmark behavior stays stable while allowing overrides.
examples/benchmarks/adapters/memanto_adapter.py (1)
22-22: ⚡ Quick win
Align class name with product naming (`MemantoAdapter`).
`MenantoAdapter` looks like a typo and leaks into public imports; renaming it avoids API/docs drift.
Proposed rename
    -class MenantoAdapter:
    +class MemantoAdapter:

    -    from adapters.memanto_adapter import MenantoAdapter
    +    from adapters.memanto_adapter import MemantoAdapter
     ...
    -    adapter = MenantoAdapter(namespace=namespace)
    +    adapter = MemantoAdapter(namespace=namespace)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/adapters/memanto_adapter.py` at line 22, The class is misspelled as MenantoAdapter; rename the class to MemantoAdapter and update every reference/import/export that uses MenantoAdapter (including any __all__ entries, type hints, tests, and callers) to the new MemantoAdapter identifier so public API and docs no longer leak the typo; ensure constructor and any usages match the new class name.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/benchmarks/adapters/mem0_adapter.py`:
- Around line 167-172: _p95 currently calculates the 95th percentile using a
biased index (int(len * 0.95) - 1) which understates p95 for small samples;
change the index computation in function _p95 to use a ceiling-based rank:
compute idx = max(0, int(math.ceil(0.95 * len(sorted_vals))) - 1) (import math),
keep the empty-list guard and rounding, and clamp idx to the last element if
needed so sorted_vals[idx] is always valid.
- Around line 149-151: The token accounting currently builds retrieved_text and
calls count(retrieved_text), which also counts the "\n---\n" separator tokens
added by the join; update the logic in the retrieval flow (where retrieved_text,
tokens_retrieved and self.total_tokens_retrieved are set) to use the same method
as Memanto by calling count_results(raw_list) (or the equivalent list of
retrieved results, if the variable is named differently) instead of
count(retrieved_text), and add that returned value to
self.total_tokens_retrieved so totals are comparable across systems.
- Around line 100-111: The benchmark currently only adds raw_tokens to the
running total (total_tokens_written) and therefore undercounts ingestion
overhead when the adapter returns LLM token usage; update the write path that
builds the result dict (where result is inspected and llm_tokens is derived) to
also add llm_tokens to total_tokens_written when present and numeric, ensuring
you check isinstance(result, dict) and that "token_count" yields an int/float
before accumulation so the returned tokens_written and the aggregate
total_tokens_written both reflect raw + llm extraction tokens.
In `@examples/benchmarks/adapters/memanto_adapter.py`:
- Around line 93-98: The _p95 function currently computes the index using
int(len(sorted_vals) * 0.95) - 1 which biases the percentile low; change the
index calculation to use ceiling so the 95th percentile picks the correct
element (e.g., idx = min(len(sorted_vals)-1, math.ceil(0.95 * len(sorted_vals))
- 1)) and keep the rest of the logic (sorted_vals, rounding) the same; update
imports to include math if needed and ensure the function returns
round(sorted_vals[idx], 4).
In `@examples/benchmarks/dataset.py`:
- Line 84: Update the q4 golden_answer to exactly match the dataset mentions
(use the exact titles from the session text) so judge scoring isn't penalized;
locate the q4 entry and its "golden_answer" string in
examples/benchmarks/dataset.py and replace non-exact items (e.g., "Planet
Earth") with their exact dataset forms (e.g., "Planet Earth II") and ensure all
other listed entities in q4's "golden_answer" exactly match the dataset wording
and punctuation.
In `@examples/benchmarks/metrics/accuracy_judge.py`:
- Around line 63-83: The call to client.messages.create and subsequent accesses
(msg.content[0].text, msg.usage.input_tokens, msg.usage.output_tokens) are not
guarded and can raise network/auth/structure errors causing the benchmark to
abort; wrap the API call and the parsing logic in a try/except that catches
general exceptions (e.g., Exception) and structural issues (ValueError,
IndexError, AttributeError), validate that msg and msg.content[0].text exist
before using them, attempt the score parsing as currently done but on any
failure return a safe fallback payload (score 0, explanation set to the raw or
an error message, input_tokens/output_tokens set to 0 or available defaults),
and ensure the function returns this safe payload instead of raising so the
benchmark continues.
In `@examples/benchmarks/README.md`:
- Line 59: Update the inconsistent Mem0 version info by either changing the
README line "Mem0 (via `mem0ai` v2.0.4)" to reflect the loose constraint used in
requirements.txt (e.g., "mem0ai >=2.0.0") or pin the dependency in
requirements.txt to 2.0.4 (replace the existing "mem0ai>=2.0.0" entry) so the
documentation and the dependency spec (the README entry and the requirements.txt
mem0ai line) match.
In `@examples/benchmarks/run_benchmark.py`:
- Around line 43-55: Modify _check_env and the judge/Mem0 import sites to
respect the CLI flags --skip-judge and --skip-mem0: change _check_env to accept
the parsed args (or read a shared args object) and only require
ANTHROPIC_API_KEY when the judge or Mem0 functionality will actually be used
(i.e., skip the Anthropic check only when both args.skip_judge and
args.skip_mem0 are set); similarly guard imports/usages of the judge module and
Mem0-related modules (references: _check_env, the judge import site, and any
Mem0 import/initialization) so they are only imported/initialized when their
corresponding flags are not set. Ensure all checks/imports mentioned around the
regions noted (including the blocks at the other referenced locations) use the
same flag logic.
- Around line 200-207: The winner calculation skips integer metrics because it
only checks isinstance(..., float); update the logic in run_benchmark.py where
mv, ov, and label are evaluated (the block referencing mv, ov, label that sets
winner) to accept integers as well — e.g., treat numeric types by checking
isinstance(mv, (int, float)) and isinstance(ov, (int, float)) or use
numbers.Real, or coerce mv/ov to float before comparison; keep the existing
accuracy-vs-lower-is-better branching and set winner the same way once mv/ov are
treated as numeric so integer rows like "Total tokens" no longer produce "─".
---
Nitpick comments:
In `@examples/benchmarks/adapters/memanto_adapter.py`:
- Line 22: The class is misspelled as MenantoAdapter; rename the class to
MemantoAdapter and update every reference/import/export that uses MenantoAdapter
(including any __all__ entries, type hints, tests, and callers) to the new
MemantoAdapter identifier so public API and docs no longer leak the typo; ensure
constructor and any usages match the new class name.
In `@examples/benchmarks/metrics/accuracy_judge.py`:
- Line 64: Replace the hardcoded dated Claude model string used as the
LLM-as-judge with a configurable environment variable: introduce reading
os.environ.get("ACCURACY_JUDGE_MODEL", "claude-haiku-4-5-20251001") (or similar
JUDGE_MODEL name) and pass that variable into the call/site where
model="claude-haiku-4-5-20251001" is used (the function/constructor that accepts
the model parameter in accuracy_judge.py). Ensure the new env var is documented
or has a clear default so benchmark behavior stays stable while allowing
overrides.
In `@examples/benchmarks/README.md`:
- Around line 49-53: Add language specifiers ("text") to the three fenced code
blocks that lack them: the architecture diagram containing "User message →
MoorchehClient.documents.upsert()", the diagram containing "User messages → Mem0
extraction LLM (Claude Haiku) → Vectorized memory facts", and the example output
block that starts with "🏆 The Great Agentic Memory Showdown"; update those
triple-backtick fences to use ```text so the blocks get proper syntax
highlighting and readability.
📒 Files selected for processing (12)
- examples/benchmarks/.env.example
- examples/benchmarks/README.md
- examples/benchmarks/adapters/__init__.py
- examples/benchmarks/adapters/mem0_adapter.py
- examples/benchmarks/adapters/memanto_adapter.py
- examples/benchmarks/dataset.py
- examples/benchmarks/metrics/__init__.py
- examples/benchmarks/metrics/accuracy_judge.py
- examples/benchmarks/metrics/token_counter.py
- examples/benchmarks/requirements.txt
- examples/benchmarks/results/.gitkeep
- examples/benchmarks/run_benchmark.py
    self.total_tokens_written += raw_tokens

    # Try to extract LLM token usage from the result if available
    llm_tokens = 0
    if isinstance(result, dict) and "token_count" in result:
        llm_tokens = result["token_count"]

    return {
        "tokens_written": raw_tokens,
        "write_latency_s": round(latency, 4),
        "llm_extraction_tokens": llm_tokens,
    }
Mem0 write-token summary undercounts ingestion overhead.
total_tokens_written only accumulates raw message tokens, while extracted LLM token usage is returned but not included in the aggregate used by the benchmark summary.
Proposed fix
         latency = time.perf_counter() - start
         self._write_latencies.append(latency)
    -    self.total_tokens_written += raw_tokens
         # Try to extract LLM token usage from the result if available
         llm_tokens = 0
         if isinstance(result, dict) and "token_count" in result:
             llm_tokens = result["token_count"]
    +    total_written = raw_tokens + llm_tokens
    +    self.total_tokens_written += total_written
         return {
    -        "tokens_written": raw_tokens,
    +        "tokens_written": total_written,
    +        "raw_tokens_written": raw_tokens,
             "write_latency_s": round(latency, 4),
             "llm_extraction_tokens": llm_tokens,
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/adapters/mem0_adapter.py` around lines 100 - 111, The
benchmark currently only adds raw_tokens to the running total
(total_tokens_written) and therefore undercounts ingestion overhead when the
adapter returns LLM token usage; update the write path that builds the result
dict (where result is inspected and llm_tokens is derived) to also add
llm_tokens to total_tokens_written when present and numeric, ensuring you check
isinstance(result, dict) and that "token_count" yields an int/float before
accumulation so the returned tokens_written and the aggregate
total_tokens_written both reflect raw + llm extraction tokens.
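A minimal sketch of the accounting described above, assuming the backend's write call may return a dict with a numeric "token_count" entry. That key mirrors the adapter code under review and is not a documented Mem0 field; `written_tokens` is a hypothetical helper name, not part of the PR.

```python
from numbers import Real


def written_tokens(raw_tokens: int, result: object) -> dict:
    """Combine raw message tokens with any LLM extraction tokens.

    `result` stands in for whatever the memory backend's write call
    returned. The "token_count" key is an assumption taken from the
    adapter snippet in this review.
    """
    llm_tokens = 0
    if isinstance(result, dict):
        candidate = result.get("token_count")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(candidate, Real) and not isinstance(candidate, bool):
            llm_tokens = int(candidate)
    return {
        "tokens_written": raw_tokens + llm_tokens,
        "raw_tokens_written": raw_tokens,
        "llm_extraction_tokens": llm_tokens,
    }


if __name__ == "__main__":
    print(written_tokens(120, {"token_count": 340}))
    print(written_tokens(120, {"token_count": "n/a"}))
    print(written_tokens(120, None))
```

The adapter would then add `tokens_written` to its running total, so the per-call value and the aggregate stay consistent even when the backend reports no usage.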
    retrieved_text = "\n---\n".join(texts)
    tokens_retrieved = count(retrieved_text)
    self.total_tokens_retrieved += tokens_retrieved
Use the same retrieval-token accounting method as Memanto.
Counting retrieved_text after joining adds separator tokens and makes cross-system token totals non-comparable. Use count_results(raw_list) here too.
Proposed fix
    -    tokens_retrieved = count(retrieved_text)
    +    tokens_retrieved = count_results(raw_list)
         self.total_tokens_retrieved += tokens_retrieved
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/adapters/mem0_adapter.py` around lines 149 - 151, The
token accounting currently builds retrieved_text and calls
count(retrieved_text), which also counts the "\n---\n" separator tokens added by
the join; update the logic in the retrieval flow (where retrieved_text,
tokens_retrieved and self.total_tokens_retrieved are set) to use the same method
as Memanto by calling count_results(raw_list) (or the equivalent list of
retrieved results, if the variable is named differently) instead of
count(retrieved_text), and add that returned value to
self.total_tokens_retrieved so totals are comparable across systems.
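To illustrate why the two counting methods diverge, here is a small sketch. It uses a toy regex tokenizer as a stand-in for tiktoken's cl100k_base, so the absolute numbers are illustrative only; `count_per_item` and `count_joined` are hypothetical names, not the PR's `count_results` helper.

```python
import re


def count(text: str) -> int:
    """Toy tokenizer: one token per word or punctuation character."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def count_per_item(texts: list[str]) -> int:
    """Sum token counts over each retrieved snippet separately."""
    return sum(count(t) for t in texts)


def count_joined(texts: list[str], sep: str = "\n---\n") -> int:
    """Count tokens after joining, which also counts the separators."""
    return count(sep.join(texts))


if __name__ == "__main__":
    retrieved = [
        "likes horror films",
        "prefers Ari Aster",
        "rewatched Hereditary",
    ]
    per_item = count_per_item(retrieved)
    joined = count_joined(retrieved)
    print(f"per-item: {per_item}, joined: {joined}, overhead: {joined - per_item}")
```

The overhead grows with the number of retrieved snippets, so a system that returns more, shorter results is penalized more by the joined count.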
    def _p95(values: list[float]) -> float:
        if not values:
            return 0.0
        sorted_vals = sorted(values)
        idx = max(0, int(len(sorted_vals) * 0.95) - 1)
        return round(sorted_vals[idx], 4)
p95 latency index calculation is incorrect for small N.
This uses the same biased formula as the Memanto adapter and understates p95 in this benchmark’s sample sizes.
Proposed fix
    +import math
     ...
     def _p95(values: list[float]) -> float:
         if not values:
             return 0.0
         sorted_vals = sorted(values)
    -    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
    +    idx = max(0, math.ceil(len(sorted_vals) * 0.95) - 1)
         return round(sorted_vals[idx], 4)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/adapters/mem0_adapter.py` around lines 167 - 172, _p95
currently calculates the 95th percentile using a biased index (int(len * 0.95) -
1) which understates p95 for small samples; change the index computation in
function _p95 to use a ceiling-based rank: compute idx = max(0,
int(math.ceil(0.95 * len(sorted_vals))) - 1) (import math), keep the empty-list
guard and rounding, and clamp idx to the last element if needed so
sorted_vals[idx] is always valid.
    def _p95(values: list[float]) -> float:
        if not values:
            return 0.0
        sorted_vals = sorted(values)
        idx = max(0, int(len(sorted_vals) * 0.95) - 1)
        return round(sorted_vals[idx], 4)
Fix p95 computation (currently biased low for benchmark-sized samples).
Current index math reports a lower percentile (e.g., with 5 samples it picks index 3 instead of index 4), so p95 latency outputs are systematically understated.
Proposed fix
    +import math
     ...
     def _p95(values: list[float]) -> float:
         if not values:
             return 0.0
         sorted_vals = sorted(values)
    -    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
    +    idx = max(0, math.ceil(len(sorted_vals) * 0.95) - 1)
         return round(sorted_vals[idx], 4)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/adapters/memanto_adapter.py` around lines 93 - 98, The
_p95 function currently computes the index using int(len(sorted_vals) * 0.95) -
1 which biases the percentile low; change the index calculation to use ceiling
so the 95th percentile picks the correct element (e.g., idx =
min(len(sorted_vals)-1, math.ceil(0.95 * len(sorted_vals)) - 1)) and keep the
rest of the logic (sorted_vals, rounding) the same; update imports to include
math if needed and ensure the function returns round(sorted_vals[idx], 4).
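The difference between the two index formulas can be checked with a short sketch. The latency values below are made up for illustration; `p95_truncating` reproduces the formula currently in the adapters and `p95_nearest_rank` the ceiling-based one proposed in this review.

```python
import math


def p95_truncating(values: list[float]) -> float:
    """Index formula currently used by the adapters (truncates the rank)."""
    if not values:
        return 0.0
    sorted_vals = sorted(values)
    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
    return round(sorted_vals[idx], 4)


def p95_nearest_rank(values: list[float]) -> float:
    """Nearest-rank percentile: rank = ceil(0.95 * n), then 0-based index."""
    if not values:
        return 0.0
    sorted_vals = sorted(values)
    idx = math.ceil(0.95 * len(sorted_vals)) - 1
    # Clamp so the index is always valid, even for a single sample.
    idx = min(len(sorted_vals) - 1, max(0, idx))
    return round(sorted_vals[idx], 4)


if __name__ == "__main__":
    # Five hypothetical latencies (seconds) with one slow outlier.
    latencies = [0.10, 0.12, 0.11, 0.95, 0.13]
    print("truncating:  ", p95_truncating(latencies))
    print("nearest-rank:", p95_nearest_rank(latencies))
```

With five samples the truncating formula selects the fourth-smallest value and hides the outlier, while the nearest-rank formula selects the largest. At this sample size p95 is effectively the maximum, which is worth stating next to the reported number.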
    {
        "query_id": "q4",
        "question": "Which specific film directors or films has the user mentioned?",
        "golden_answer": "John Wick, The Dark Knight, Christopher Nolan, Dune, Interstellar, Planet Earth, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",
Align q4 golden entities with exact dataset mentions.
Line 84 currently mixes non-exact entities (e.g., "Planet Earth" vs session text "Planet Earth II"), which can unfairly lower judge scores for otherwise correct retrievals and skew benchmark accuracy.
🎯 Proposed fix
    -        "golden_answer": "John Wick, The Dark Knight, Christopher Nolan, Dune, Interstellar, Planet Earth, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",
    +        "golden_answer": "John Wick, The Dark Knight, Nolan, Dune Part 2, Interstellar, Planet Earth II, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/dataset.py` at line 84, Update the q4 golden_answer to
exactly match the dataset mentions (use the exact titles from the session text)
so judge scoring isn't penalized; locate the q4 entry and its "golden_answer"
string in examples/benchmarks/dataset.py and replace non-exact items (e.g.,
"Planet Earth") with their exact dataset forms (e.g., "Planet Earth II") and
ensure all other listed entities in q4's "golden_answer" exactly match the
dataset wording and punctuation.
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=80,
        system=_JUDGE_SYSTEM,
        messages=[{"role": "user", "content": prompt}],
    )

    raw = msg.content[0].text.strip()
    try:
        score_str, explanation = raw.split("|", 1)
        score = int(score_str.strip())
    except (ValueError, IndexError):
        score = 0
        explanation = raw

    return {
        "score": max(0, min(3, score)),
        "explanation": explanation.strip(),
        "input_tokens": msg.usage.input_tokens,
        "output_tokens": msg.usage.output_tokens,
    }
🧩 Analysis chain
🏁 Script executed:
    # First, locate and examine the file
    find . -type f -name "accuracy_judge.py" | head -20
Repository: moorcheh-ai/memanto
Length of output: 111
🏁 Script executed:
    # Read the file to see the full function context
    wc -l examples/benchmarks/metrics/accuracy_judge.py
Repository: moorcheh-ai/memanto
Length of output: 112
🏁 Script executed:
    # Read the relevant section with surrounding context
    sed -n '50,90p' examples/benchmarks/metrics/accuracy_judge.py
Repository: moorcheh-ai/memanto
Length of output: 1230
🏁 Script executed:
    # Read the full file to understand context
    cat -n examples/benchmarks/metrics/accuracy_judge.py
Repository: moorcheh-ai/memanto
Length of output: 3273
🏁 Script executed:
    # Search for calls to the judge function
    rg "judge(" . -B 2 -A 2
Repository: moorcheh-ai/memanto
Length of output: 126
🏁 Script executed:
    # Search for calls to the judge function (escape parenthesis)
    rg "judge\(" . -B 2 -A 2
Repository: moorcheh-ai/memanto
Length of output: 1640
🏁 Script executed:
    # Look for benchmark files that might call judge
    find . -type f -name "*benchmark*" -o -name "*test*" | grep -E "\.(py|sh)$" | head -10
Repository: moorcheh-ai/memanto
Length of output: 392
🏁 Script executed:
    # Check the context around judge() calls in run_benchmark.py
    rg "judge\(" ./examples/benchmarks/run_benchmark.py -B 5 -A 5
Repository: moorcheh-ai/memanto
Length of output: 1142
🏁 Script executed:
    # Check if there's any try-except wrapping around benchmark execution
    rg "try:|except" ./examples/benchmarks/run_benchmark.py -B 2 -A 2
Repository: moorcheh-ai/memanto
Length of output: 233
🏁 Script executed:
    # Check the full context around the benchmark loops in run_benchmark.py
    sed -n '1,50p' ./examples/benchmarks/run_benchmark.py
Repository: moorcheh-ai/memanto
Length of output: 1567
🏁 Script executed:
    # Get the lines around the judge calls to see if there's any outer error handling
    sed -n '50,150p' ./examples/benchmarks/run_benchmark.py
Repository: moorcheh-ai/memanto
Length of output: 3854
🏁 Script executed:
    # Check the rest of run_benchmark.py for main function and outer error handling
    tail -50 ./examples/benchmarks/run_benchmark.py
Repository: moorcheh-ai/memanto
Length of output: 1862
🏁 Script executed:
    # Find the main() function definition
    rg "^def main" ./examples/benchmarks/run_benchmark.py -A 30
Repository: moorcheh-ai/memanto
Length of output: 1702
Guard Anthropic API calls in judge function to prevent benchmark abortion on transient failures.
If client.messages.create() fails (rate limit, auth, network error) or returns unexpected content structure, the benchmark aborts. Wrap the API call and content access defensively to return a safe fallback payload so benchmark execution continues.
Proposed fix
    -    msg = client.messages.create(
    -        model="claude-haiku-4-5-20251001",
    -        max_tokens=80,
    -        system=_JUDGE_SYSTEM,
    -        messages=[{"role": "user", "content": prompt}],
    -    )
    -
    -    raw = msg.content[0].text.strip()
    +    try:
    +        msg = client.messages.create(
    +            model="claude-haiku-4-5-20251001",
    +            max_tokens=80,
    +            system=_JUDGE_SYSTEM,
    +            messages=[{"role": "user", "content": prompt}],
    +        )
    +    except Exception as e:
    +        return {
    +            "score": 0,
    +            "explanation": f"Judge API error: {e}",
    +            "input_tokens": 0,
    +            "output_tokens": 0,
    +        }
    +
    +    blocks = getattr(msg, "content", []) or []
    +    first_text = blocks[0].text if blocks and hasattr(blocks[0], "text") else ""
    +    raw = first_text.strip()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/metrics/accuracy_judge.py` around lines 63 - 83, The call
to client.messages.create and subsequent accesses (msg.content[0].text,
msg.usage.input_tokens, msg.usage.output_tokens) are not guarded and can raise
network/auth/structure errors causing the benchmark to abort; wrap the API call
and the parsing logic in a try/except that catches general exceptions (e.g.,
Exception) and structural issues (ValueError, IndexError, AttributeError),
validate that msg and msg.content[0].text exist before using them, attempt the
score parsing as currently done but on any failure return a safe fallback
payload (score 0, explanation set to the raw or an error message,
input_tokens/output_tokens set to 0 or available defaults), and ensure the
function returns this safe payload instead of raising so the benchmark
continues.
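A minimal sketch of the fallback behaviour described above, kept independent of the Anthropic SDK so it can run offline. `call_model` is a hypothetical callable standing in for the real API call plus text extraction; the "score|explanation" reply format is taken from the parsing code under review.

```python
from typing import Callable


def parse_judge_reply(raw: str) -> tuple[int, str]:
    """Parse a 'score|explanation' reply; fall back to score 0 on bad input."""
    try:
        score_str, explanation = raw.split("|", 1)
        score = int(score_str.strip())
    except ValueError:
        # Raised both when "|" is missing (unpacking fails) and when the
        # score part is not an integer.
        return 0, raw.strip()
    # Clamp to the 0-3 scale used by the benchmark.
    return max(0, min(3, score)), explanation.strip()


def safe_judge(call_model: Callable[[str], str], prompt: str) -> dict:
    """Run the judge and never raise, so one failure cannot abort a run."""
    try:
        raw = call_model(prompt)
    except Exception as exc:  # deliberately broad: network, auth, rate limits
        return {"score": 0, "explanation": f"Judge error: {exc}", "failed": True}
    score, explanation = parse_judge_reply(raw)
    return {"score": score, "explanation": explanation, "failed": False}


if __name__ == "__main__":
    print(safe_judge(lambda p: "3|Matches the current preference", "demo"))
    print(safe_judge(lambda p: "unparseable reply", "demo"))
    print(safe_judge(lambda p: 1 / 0, "demo"))
```

The extra `failed` flag is a design suggestion rather than part of the proposed fix: it lets the summary distinguish a judged score of 0 from a judge call that never completed, which would otherwise depress the accuracy average silently.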
| - **Read cost**: Semantic search on Moorcheh's index — returns relevant snippets | ||
| - **Temporal tracking**: Relies on recency-weighted retrieval and tags | ||
|
|
||
| ### Mem0 (via `mem0ai` v2.0.4) |
Inconsistent version specification for Mem0.
Line 59 states "via mem0ai v2.0.4" but requirements.txt specifies mem0ai>=2.0.0 as a minimum constraint, not a pinned version. This creates an inconsistency between the documentation and the actual dependency specification.
📝 Suggested fix
    -### Mem0 (via `mem0ai` v2.0.4)
    +### Mem0 (via `mem0ai` ≥2.0.0)
Alternatively, if version 2.0.4 is specifically tested and recommended, consider pinning it in requirements.txt:
    -mem0ai>=2.0.0
    +mem0ai==2.0.4
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/README.md` at line 59, Update the inconsistent Mem0
version info by either changing the README line "Mem0 (via `mem0ai` v2.0.4)" to
reflect the loose constraint used in requirements.txt (e.g., "mem0ai >=2.0.0")
or pin the dependency in requirements.txt to 2.0.4 (replace the existing
"mem0ai>=2.0.0" entry) so the documentation and the dependency spec (the README
entry and the requirements.txt mem0ai line) match.
    def _check_env() -> None:
        missing = []
        if not os.environ.get("MOORCHEH_API_KEY"):
            missing.append("MOORCHEH_API_KEY")
        if not os.environ.get("ANTHROPIC_API_KEY"):
            missing.append("ANTHROPIC_API_KEY (needed for accuracy judge)")
        if missing:
            print("❌ Missing environment variables:")
            for m in missing:
                print(f" {m}")
            print("\n See .env.example for setup instructions.")
            sys.exit(1)
Honor --skip-judge / --skip-mem0 in env validation and imports.
Current flow always requires Anthropic credentials and always imports judge, even when judge and Mem0 are both skipped.
Proposed fix
    -def _check_env() -> None:
    +def _check_env(skip_mem0: bool, skip_judge: bool) -> None:
         missing = []
         if not os.environ.get("MOORCHEH_API_KEY"):
             missing.append("MOORCHEH_API_KEY")
    -    if not os.environ.get("ANTHROPIC_API_KEY"):
    +    needs_anthropic = (not skip_judge) or (not skip_mem0)
    +    if needs_anthropic and not os.environ.get("ANTHROPIC_API_KEY"):
             missing.append("ANTHROPIC_API_KEY (needed for accuracy judge)")
     ...
     def run_memanto_phase(namespace: str, skip_judge: bool) -> dict:
         from adapters.memanto_adapter import MenantoAdapter
    -    from metrics.accuracy_judge import judge
    +    judge = None
    +    if not skip_judge:
    +        from metrics.accuracy_judge import judge
     ...
     def run_mem0_phase(skip_judge: bool) -> dict:
         from adapters.mem0_adapter import Mem0Adapter
    -    from metrics.accuracy_judge import judge
    +    judge = None
    +    if not skip_judge:
    +        from metrics.accuracy_judge import judge
     ...
    -    _check_env()
    +    _check_env(skip_mem0=args.skip_mem0, skip_judge=args.skip_judge)
Also applies to: 64-67, 120-123, 233-233
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/run_benchmark.py` around lines 43 - 55, Modify _check_env
and the top-level imports to respect the CLI flags --skip-judge and --skip-mem0:
change _check_env to accept the parsed args (or read a shared args object) and
only require ANTHROPIC_API_KEY when the judge or Mem0 functionality will
actually be used (i.e., skip the Anthropic check only when both args.skip_judge
and args.skip_mem0 are set); similarly guard imports/usages of the judge module
and Mem0-related modules (references: _check_env, the judge import site, and any
Mem0 import/initialization) so they are only imported/initialized when their
corresponding flags are not set. Ensure all checks/imports mentioned around the
regions noted (including the blocks at the other referenced locations) use the
same flag logic.
    elif isinstance(mv, float) and isinstance(ov, float):
        # Lower is better for tokens & latency; higher is better for accuracy
        if "accuracy" in label.lower():
            winner = "Memanto ✓" if mv >= ov else "Mem0 ✓"
        else:
            winner = "Memanto ✓" if mv <= ov else "Mem0 ✓"
    else:
        winner = "─"
Winner column skips integer metrics.
Total tokens rows are integers, but winner logic only handles floats, so those rows always show "─".
Proposed fix
-    elif isinstance(mv, float) and isinstance(ov, float):
+    elif isinstance(mv, (int, float)) and isinstance(ov, (int, float)):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/run_benchmark.py` around lines 200 - 207, The winner
calculation skips integer metrics because it only checks isinstance(..., float);
update the logic in run_benchmark.py where mv, ov, and label are evaluated (the
block referencing mv, ov, label that sets winner) to accept integers as well —
e.g., treat numeric types by checking isinstance(mv, (int, float)) and
isinstance(ov, (int, float)) or use numbers.Real, or coerce mv/ov to float
before comparison; keep the existing accuracy-vs-lower-is-better branching and
set winner the same way once mv/ov are treated as numeric so integer rows like
"Total tokens" no longer produce "─".
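One way the numeric check could look as a standalone function, using `numbers.Real` as the comment suggests. The function name `pick_winner` is hypothetical; the branch logic and the output strings follow the snippet quoted in the review. Excluding `bool` is an extra assumption, made because `bool` is a subclass of `int` in Python.

```python
from numbers import Real


def _is_number(value: object) -> bool:
    # bool is a subclass of int, so rule it out explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def pick_winner(label: str, mv: object, ov: object) -> str:
    """Return the winner cell for one metric row (mv = Memanto, ov = Mem0)."""
    if not (_is_number(mv) and _is_number(ov)):
        return "─"
    # Higher is better for accuracy; lower is better for tokens and latency.
    if "accuracy" in label.lower():
        return "Memanto ✓" if mv >= ov else "Mem0 ✓"
    return "Memanto ✓" if mv <= ov else "Mem0 ✓"
```

As in the original snippet, a tie is credited to Memanto in both branches; a neutral benchmark might prefer to report ties explicitly.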
Note on r/AgenticMemory
r/AgenticMemory has been banned by Reddit as of June 7, 2026 ("This community has been banned for violating the Reddit rules."). The social amplification component for this PR cannot be fulfilled via r/AgenticMemory. I can immediately post to alternative communities (r/LocalLLaMA, r/ClaudeAI, Hacker News) if the maintainers confirm these count for the virality scoring. Let me know how to proceed!
Summary
Implements Scenario B: Shifting Persona & Temporal Tracking Test from the Agentic Memory Showdown challenge.
Live demo link (Reddit/X): will be added after benchmark run — see below
What's in examples/benchmarks/
- run_benchmark.py
- dataset.py
- adapters/memanto_adapter.py (moorcheh-sdk)
- adapters/mem0_adapter.py (mem0ai v2.0.4)
- metrics/token_counter.py (cl100k_base counting)
- metrics/accuracy_judge.py
- requirements.txt
- .env.example
Benchmark Design
Dataset: 5 sessions where a user's movie genre preferences evolve:
Action → Sci-Fi → Documentary → Thriller → Horror (current)
Core question: Does the system surface the current preference (Horror) without being polluted by the 4 previous preferences?
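As a rough illustration of what "pollution" means here, the sketch below checks retrieved memory text for the current genre and for stale ones. It is a keyword heuristic, not the PR's scoring method (the PR uses an LLM-as-judge); `GENRE_HISTORY` and `pollution_report` are hypothetical names, and plain substring matching is an assumption that would miss paraphrases and could match inside longer words.

```python
GENRE_HISTORY = ["action", "sci-fi", "documentary", "thriller", "horror"]


def pollution_report(retrieved: list[str], history: list[str] = GENRE_HISTORY) -> dict:
    """Report whether the latest genre is present and which older ones leak in."""
    current, stale = history[-1], history[:-1]
    # Case-insensitive substring search over all retrieved memories.
    text = " ".join(retrieved).lower()
    return {
        "current_found": current in text,
        "stale_mentions": [genre for genre in stale if genre in text],
    }
```

For example, `pollution_report(["User now prefers horror films", "Earlier the user liked thriller movies"])` reports the current genre as found and lists `thriller` as a stale mention.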
Scientific controls:
- top_k=5 for retrieval
- cl100k_base for token counting
Metrics measured:
How to Run
Expected Key Finding
Memanto's direct-upsert architecture eliminates per-write LLM inference overhead entirely. Mem0's extraction pipeline (Claude Haiku + qdrant) adds significant token cost and latency at write time — costs that compound across long-lived agents.
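To make the latency claim checkable, per-call timings can be collected and summarized as a p95. The sketch below uses the nearest-rank definition of the percentile, which is an assumption; the PR's own p95 calculation is not shown here and may interpolate differently. The helper names `timed` and `p95` are hypothetical.

```python
import time
from typing import Any, Callable, Sequence


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile: the smallest value with at least 95% of samples at or below it."""
    if not samples:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

With only a handful of samples (for instance five writes or five queries), the nearest-rank p95 is simply the slowest call, so small runs should be read as worst-case timings rather than as a stable percentile.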
🤖 Generated with Claude Code