[BOUNTY #639] Memanto vs Mem0 — Shifting Persona Benchmark by minhthai1995 · Pull Request #705 · moorcheh-ai/memanto

minhthai1995 · 2026-06-08T15:03:54Z

Summary

Implements Scenario B: Shifting Persona & Temporal Tracking Test from the Agentic Memory Showdown challenge.

Live demo link (Reddit/X): will be added after benchmark run — see below

What's in `examples/benchmarks/`

| File | Purpose |
| --- | --- |
| `run_benchmark.py` | Main runner; outputs comparison table and saves JSON |
| `dataset.py` | 5-session "Evolving Film Enthusiast" + 5 golden-answer queries |
| `adapters/memanto_adapter.py` | Direct upsert via `moorcheh-sdk` |
| `adapters/mem0_adapter.py` | LLM-extraction pipeline via `mem0ai` v2.0.4 |
| `metrics/token_counter.py` | tiktoken `cl100k_base` counting |
| `metrics/accuracy_judge.py` | Claude Haiku LLM-as-judge (score 0–3 per query) |
| `requirements.txt` | All deps with pinned minimums |
| `.env.example` | Required env vars template |

Benchmark Design

Dataset: 5 sessions where a user's movie genre preferences evolve:
Action → Sci-Fi → Documentary → Thriller → Horror (current)

Core question: Does the system surface the current preference (Horror) without being polluted by the 4 previous preferences?

Scientific controls:

Same dataset fed to both systems
Same LLM judge (Claude Haiku) with identical prompt
Isolated namespaces per run (UUID-based)
Same top_k=5 for retrieval
tiktoken cl100k_base for token counting
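
The namespace-isolation control can be sketched as follows. This is a minimal illustration only: the PR states that namespaces are UUID-based, but the helper name `make_namespace` and the prefix format are assumptions, not the code in `run_benchmark.py`.

```python
import uuid


def make_namespace(prefix: str = "bench") -> str:
    """Return a fresh namespace name so runs never share stored memories.

    The helper name and the "<prefix>-<hex>" format are assumptions
    made for illustration.
    """
    return f"{prefix}-{uuid.uuid4().hex}"


if __name__ == "__main__":
    # Each system gets its own namespace, and each run gets new ones.
    print(make_namespace("memanto"))
    print(make_namespace("mem0"))
```

Generating the namespace per run, rather than hardcoding it, keeps memories from an earlier run out of the current run's retrieval results.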

Metrics measured:

Total tokens written (ingestion overhead)
Total tokens retrieved (all 5 queries)
p95 write latency (s)
p95 read latency (s)
Avg accuracy score 0–3 (LLM-as-judge)

How to Run

cd examples/benchmarks/
pip install -r requirements.txt
export MOORCHEH_API_KEY=... ANTHROPIC_API_KEY=...
python3 run_benchmark.py
# Fast run (Memanto only, no HuggingFace download):
python3 run_benchmark.py --skip-mem0

Expected Key Finding

Memanto's direct-upsert architecture eliminates per-write LLM inference overhead entirely. Mem0's extraction pipeline (Claude Haiku + qdrant) adds significant token cost and latency at write time — costs that compound across long-lived agents.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added a comprehensive benchmark suite for comparing memory systems with performance metrics (token usage, p95 latencies) and LLM-based accuracy evaluation capabilities
Documentation
- Added detailed benchmark documentation including setup instructions, evaluation scenarios, dataset specifications, and results interpretation guidance

…h-ai#639)

Implements Scenario B from the Agentic Memory Showdown bounty challenge. Compares Memanto (moorcheh-sdk direct upsert) against Mem0 (LLM-extraction pipeline) on token overhead, p95 latency, and temporal preference accuracy.

Dataset: 5-session "Evolving Film Enthusiast"; preferences evolve from action → sci-fi → documentary → thriller → horror. Tests whether each system correctly surfaces the CURRENT preference without stale-history pollution.

Metrics measured per system:
- Total tokens written during ingestion (write overhead)
- Total tokens retrieved across 5 evaluation queries
- p95 write and read latency (seconds)
- Accuracy score 0-3 per query via LLM-as-judge (Claude Haiku)

Scientific controls: same dataset, same judge model, same judge prompt, isolated namespaces (UUID per run), same top_k=5, tiktoken cl100k_base for token counting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-06-08T15:04:15Z

📝 Walkthrough

This PR adds a complete benchmark suite (examples/benchmarks/) comparing Memanto and Mem0 memory systems. The suite includes a shifting-persona dataset, metrics infrastructure (token counting and LLM-as-judge accuracy), system-specific adapters, and an orchestrator script that runs both systems in parallel and outputs comparative results.

Changes

Benchmark Suite: Memanto vs Mem0

| Layer / File(s) | Summary |
| --- | --- |
| Environment Setup & Dependencies `examples/benchmarks/.env.example`, `examples/benchmarks/requirements.txt` | Placeholder environment variables for Moorcheh and Anthropic API keys; dependencies include memanto, moorcheh-sdk, mem0ai, tiktoken, anthropic, sentence-transformers, and qdrant-client. |
| Benchmark Framework Documentation `examples/benchmarks/README.md` | Comprehensive documentation of Scenario B (5-session shifting persona), evaluation metrics (tokens, p95 latencies, LLM-judge accuracy), dataset details, expected architecture behavior, setup/run instructions, configuration parameters, controlled/uncontrolled variables, output format, and accuracy rubric. |
| Benchmark Dataset `examples/benchmarks/dataset.py` | Shifting persona dataset with five sessions where movie preferences evolve and five temporal-tracking queries with golden answers specifying whether to retrieve current vs. historical preferences and breadth of recall. |
| Metrics & Evaluation Infrastructure `examples/benchmarks/metrics/token_counter.py`, `examples/benchmarks/metrics/accuracy_judge.py` | Token counting helpers (text, message, and result aggregation) and LLM-as-judge that uses Anthropic Claude Haiku to score retrieval accuracy on a 0–3 scale. |
| Memanto Adapter `examples/benchmarks/adapters/memanto_adapter.py` | Wraps Memanto's client with per-message document ingest, similarity search, token tracking via stored text, and p95 latency computation. |
| Mem0 Adapter `examples/benchmarks/adapters/mem0_adapter.py` | Wraps Mem0's memory extraction and search using Anthropic Claude Haiku, HuggingFace embeddings, and in-memory Qdrant storage; tracks ingest/query latencies and estimates token usage from raw conversation and retrieved text. |
| Benchmark Runner & Orchestration `examples/benchmarks/run_benchmark.py` | Main script that validates environment variables, runs Memanto and Mem0 ingestion/retrieval phases, optionally scores results with LLM-as-judge, prints comparison tables, and saves timestamped JSON results. Supports `--skip-mem0`, `--skip-judge`, and `--output` flags. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

[BOUNTY $100] 🐜 The Great Agentic Memory Showdown: Memanto Benchmarking & Evaluation Challenge #639: This PR implements "The Great Agentic Memory Showdown" benchmark directly, providing the complete test harness, adapters, and evaluation infrastructure described in the challenge.

Suggested reviewers

Xenogents
het0814
Neelpatel1604

Poem

🐰 Hops through benchmarks, memory in sight,
Memanto meets Mem0 in comparative flight,
Token counts whisper, latencies trace,
The shifting personas test temporal grace,
LLM judges decree who wins the race!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.08% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately and specifically describes the main change: adding a benchmark suite comparing Memanto vs Mem0 on the Shifting Persona scenario, which is the core focus of all changes in the PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 9

🧹 Nitpick comments (3)

examples/benchmarks/README.md (1)

49-53: ⚡ Quick win

Add language specifiers to fenced code blocks.

Three code blocks are missing language specifiers, which reduces readability and prevents proper syntax highlighting. The static analysis tool correctly flagged these at lines 49, 61, and 139.
📝 Suggested fix

For the architecture diagrams (lines 49 and 61):
-```
+```text
 User message → MoorchehClient.documents.upsert() → Moorcheh serverless index
-```
+```text
 User messages → Mem0 extraction LLM (Claude Haiku) → Vectorized memory facts
For the example output (line 139):
-```
+```text
 🏆 The Great Agentic Memory Showdown
Also applies to: 61-65, 139-166
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/README.md` around lines 49 - 53, Add language specifiers
("text") to the three fenced code blocks that lack them: the architecture
diagram containing "User message → MoorchehClient.documents.upsert()", the
diagram containing "User messages → Mem0 extraction LLM (Claude Haiku) →
Vectorized memory facts", and the example output block that starts with "🏆 The
Great Agentic Memory Showdown"; update those triple-backtick fences to use
```text so the blocks get proper syntax highlighting and readability.

examples/benchmarks/metrics/accuracy_judge.py (1)

64-64: Make the judge model configurable (avoid hardcoded dated Claude ID)

examples/benchmarks/metrics/accuracy_judge.py hardcodes claude-haiku-4-5-20251001 for the LLM-as-judge. Even though this snapshot is currently valid, dated model IDs can be deprecated; use an env override with a default for benchmark stability.

♻️ Proposed refactor

+_JUDGE_MODEL = os.getenv("ANTHROPIC_JUDGE_MODEL", "claude-haiku-4-5-20251001")
...
-        model="claude-haiku-4-5-20251001",
+        model=_JUDGE_MODEL,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/metrics/accuracy_judge.py` at line 64, Replace the
hardcoded dated Claude model string used as the LLM-as-judge with a configurable
environment variable: introduce reading os.environ.get("ACCURACY_JUDGE_MODEL",
"claude-haiku-4-5-20251001") (or similar JUDGE_MODEL name) and pass that
variable into the call/site where model="claude-haiku-4-5-20251001" is used (the
function/constructor that accepts the model parameter in accuracy_judge.py).
Ensure the new env var is documented or has a clear default so benchmark
behavior stays stable while allowing overrides.
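
A minimal sketch of the override described above, assuming the variable name `ANTHROPIC_JUDGE_MODEL` from the proposed refactor (the agent prompt suggests `ACCURACY_JUDGE_MODEL` instead, so whichever name is chosen should be documented). The helper `resolve_judge_model` is hypothetical and not part of `accuracy_judge.py`.

```python
import os

# Snapshot ID quoted in the review comment; used as the stable default.
DEFAULT_JUDGE_MODEL = "claude-haiku-4-5-20251001"


def resolve_judge_model(env=None) -> str:
    """Return the judge model ID, preferring an environment override.

    `env` is injectable so the lookup can be exercised without touching
    the real process environment. Blank overrides fall back to the default.
    """
    env = os.environ if env is None else env
    override = env.get("ANTHROPIC_JUDGE_MODEL", "").strip()
    return override or DEFAULT_JUDGE_MODEL


if __name__ == "__main__":
    print(resolve_judge_model())
```

Treating an empty or whitespace-only value as "unset" avoids sending a blank model name to the API when the variable is exported but left empty.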

examples/benchmarks/adapters/memanto_adapter.py (1)

22-22: ⚡ Quick win

Align class name with product naming (MemantoAdapter).

MenantoAdapter looks like a typo, and it leaks into public imports; renaming it prevents needless API/docs drift.

Proposed rename

-class MenantoAdapter:
+class MemantoAdapter:

-    from adapters.memanto_adapter import MenantoAdapter
+    from adapters.memanto_adapter import MemantoAdapter
...
-    adapter = MenantoAdapter(namespace=namespace)
+    adapter = MemantoAdapter(namespace=namespace)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/adapters/memanto_adapter.py` at line 22, The class is
misspelled as MenantoAdapter; rename the class to MemantoAdapter and update
every reference/import/export that uses MenantoAdapter (including any __all__
entries, type hints, tests, and callers) to the new MemantoAdapter identifier so
public API and docs no longer leak the typo; ensure constructor and any usages
match the new class name.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/adapters/mem0_adapter.py`:
- Around line 167-172: _p95 currently calculates the 95th percentile using a
biased index (int(len * 0.95) - 1) which understates p95 for small samples;
change the index computation in function _p95 to use a ceiling-based rank:
compute idx = max(0, int(math.ceil(0.95 * len(sorted_vals))) - 1) (import math),
keep the empty-list guard and rounding, and clamp idx to the last element if
needed so sorted_vals[idx] is always valid.
- Around line 149-151: The token accounting currently builds retrieved_text and
calls count(retrieved_text), which double-counts separator tokens; update the
logic in the retrieval flow (where retrieved_text, tokens_retrieved and
self.total_tokens_retrieved are set) to use the same method as Memanto by
calling count_results(texts) (or count_results(raw_list) if variable named
differently) instead of count(retrieved_text), and add that returned value to
self.total_tokens_retrieved so totals are comparable across systems.
- Around line 100-111: The benchmark currently only adds raw_tokens to the
running total (total_tokens_written) and therefore undercounts ingestion
overhead when the adapter returns LLM token usage; update the write path that
builds the result dict (where result is inspected and llm_tokens is derived) to
also add llm_tokens to total_tokens_written when present and numeric, ensuring
you check isinstance(result, dict) and that "token_count" yields an int/float
before accumulation so the returned tokens_written and the aggregate
total_tokens_written both reflect raw + llm extraction tokens.

In `@examples/benchmarks/adapters/memanto_adapter.py`:
- Around line 93-98: The _p95 function currently computes the index using
int(len(sorted_vals) * 0.95) - 1 which biases the percentile low; change the
index calculation to use ceiling so the 95th percentile picks the correct
element (e.g., idx = min(len(sorted_vals)-1, math.ceil(0.95 * len(sorted_vals))
- 1)) and keep the rest of the logic (sorted_vals, rounding) the same; update
imports to include math if needed and ensure the function returns
round(sorted_vals[idx], 4).

In `@examples/benchmarks/dataset.py`:
- Line 84: Update the q4 golden_answer to exactly match the dataset mentions
(use the exact titles from the session text) so judge scoring isn't penalized;
locate the q4 entry and its "golden_answer" string in
examples/benchmarks/dataset.py and replace non-exact items (e.g., "Planet
Earth") with their exact dataset forms (e.g., "Planet Earth II") and ensure all
other listed entities in q4's "golden_answer" exactly match the dataset wording
and punctuation.

In `@examples/benchmarks/metrics/accuracy_judge.py`:
- Around line 63-83: The call to client.messages.create and subsequent accesses
(msg.content[0].text, msg.usage.input_tokens, msg.usage.output_tokens) are not
guarded and can raise network/auth/structure errors causing the benchmark to
abort; wrap the API call and the parsing logic in a try/except that catches
general exceptions (e.g., Exception) and structural issues (ValueError,
IndexError, AttributeError), validate that msg and msg.content[0].text exist
before using them, attempt the score parsing as currently done but on any
failure return a safe fallback payload (score 0, explanation set to the raw or
an error message, input_tokens/output_tokens set to 0 or available defaults),
and ensure the function returns this safe payload instead of raising so the
benchmark continues.
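
The fallback behaviour requested for the judge can be sketched without any network call. This is an illustration under assumptions: `parse_judge_reply` and `safe_judge` are hypothetical names, `call_model` stands in for whatever function wraps `client.messages.create`, and token usage fields are omitted for brevity.

```python
def parse_judge_reply(raw) -> dict:
    """Parse a 'score|explanation' reply; fall back to score 0 on bad input."""
    text = (raw or "").strip()
    try:
        # Unpacking raises ValueError when the '|' delimiter is missing,
        # and int() raises ValueError when the score is not a number.
        score_str, explanation = text.split("|", 1)
        score = int(score_str.strip())
    except ValueError:
        score, explanation = 0, text
    # Clamp to the 0-3 rubric used by the benchmark.
    return {"score": max(0, min(3, score)), "explanation": explanation.strip()}


def safe_judge(call_model) -> dict:
    """Run call_model() (hypothetical zero-argument callable returning the
    reply text) and return a safe payload instead of raising."""
    try:
        raw = call_model()
    except Exception as exc:  # network, auth, rate limit, malformed response
        return {"score": 0, "explanation": f"Judge API error: {exc}"}
    return parse_judge_reply(raw)


if __name__ == "__main__":
    print(safe_judge(lambda: "3 | Current preference retrieved"))
```

One design consequence worth noting: a failed judge call scores 0, which is indistinguishable from a genuinely wrong retrieval unless the explanation text is inspected, so a real run may want to flag such rows separately.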

In `@examples/benchmarks/README.md`:
- Line 59: Update the inconsistent Mem0 version info by either changing the
README line "Mem0 (via `mem0ai` v2.0.4)" to reflect the loose constraint used in
requirements.txt (e.g., "mem0ai >=2.0.0") or pin the dependency in
requirements.txt to 2.0.4 (replace the existing "mem0ai>=2.0.0" entry) so the
documentation and the dependency spec (the README entry and the requirements.txt
mem0ai line) match.

In `@examples/benchmarks/run_benchmark.py`:
- Around line 43-55: Modify _check_env and the top-level imports to respect the
CLI flags --skip-judge and --skip-mem0: change _check_env to accept the parsed
args (or read a shared args object) and only require ANTHROPIC_API_KEY when the
judge or Mem0 functionality will actually be used (i.e., skip the Anthropic
check only when both args.skip_judge and args.skip_mem0 are set); similarly
guard imports/usages of the judge module and Mem0-related modules (references:
_check_env, the judge import site, and any Mem0 import/initialization) so they
are only imported/initialized when their corresponding flags are not set. Ensure
all checks/imports mentioned around the regions noted (including the blocks at
the other referenced locations) use the same flag logic.
- Around line 200-207: The winner calculation skips integer metrics because it
only checks isinstance(..., float); update the logic in run_benchmark.py where
mv, ov, and label are evaluated (the block referencing mv, ov, label that sets
winner) to accept integers as well — e.g., treat numeric types by checking
isinstance(mv, (int, float)) and isinstance(ov, (int, float)) or use
numbers.Real, or coerce mv/ov to float before comparison; keep the existing
accuracy-vs-lower-is-better branching and set winner the same way once mv/ov are
treated as numeric so integer rows like "Total tokens" no longer produce "─".
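
The two `run_benchmark.py` findings above can be sketched together. Both helpers are hypothetical: the real `_check_env` and winner logic may be structured differently, and treating `MOORCHEH_API_KEY` as always required is an assumption based on the "How to Run" instructions.

```python
from numbers import Real


def required_env_vars(skip_judge: bool, skip_mem0: bool) -> list[str]:
    """List the env vars a run needs, given the CLI flags."""
    required = ["MOORCHEH_API_KEY"]  # assumption: Memanto always runs
    # Anthropic is used by the judge and by Mem0's extraction LLM, so the
    # key is unnecessary only when both are skipped.
    if not (skip_judge and skip_mem0):
        required.append("ANTHROPIC_API_KEY")
    return required


def pick_winner(memanto_value, mem0_value, higher_is_better=False, tie="─"):
    """Compare two metric values, accepting ints as well as floats."""
    values = (memanto_value, mem0_value)
    # bool is a subclass of int, so it is excluded explicitly.
    if not all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return tie
    if memanto_value == mem0_value:
        return tie
    memanto_wins = (memanto_value > mem0_value) == higher_is_better
    return "Memanto" if memanto_wins else "Mem0"


if __name__ == "__main__":
    print(required_env_vars(skip_judge=True, skip_mem0=True))
    print(pick_winner(1200, 5400))                      # integer token totals
    print(pick_winner(2.4, 2.8, higher_is_better=True))  # accuracy
```

Passing `higher_is_better` explicitly, rather than inferring it from a row label, is a design choice made here to avoid guessing the label strings used in the comparison table.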

📥 Commits

Reviewing files that changed from the base of the PR and between 6b495ea and e19a85c.

📒 Files selected for processing (12)

examples/benchmarks/.env.example
examples/benchmarks/README.md
examples/benchmarks/adapters/__init__.py
examples/benchmarks/adapters/mem0_adapter.py
examples/benchmarks/adapters/memanto_adapter.py
examples/benchmarks/dataset.py
examples/benchmarks/metrics/__init__.py
examples/benchmarks/metrics/accuracy_judge.py
examples/benchmarks/metrics/token_counter.py
examples/benchmarks/requirements.txt
examples/benchmarks/results/.gitkeep
examples/benchmarks/run_benchmark.py

coderabbitai · 2026-06-08T15:14:57Z

+        self.total_tokens_written += raw_tokens
+
+        # Try to extract LLM token usage from the result if available
+        llm_tokens = 0
+        if isinstance(result, dict) and "token_count" in result:
+            llm_tokens = result["token_count"]
+
+        return {
+            "tokens_written": raw_tokens,
+            "write_latency_s": round(latency, 4),
+            "llm_extraction_tokens": llm_tokens,
+        }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mem0 write-token summary undercounts ingestion overhead.

total_tokens_written only accumulates raw message tokens, while extracted LLM token usage is returned but not included in the aggregate used by the benchmark summary.

Proposed fix

         latency = time.perf_counter() - start
         self._write_latencies.append(latency)
-        self.total_tokens_written += raw_tokens

         # Try to extract LLM token usage from the result if available
         llm_tokens = 0
         if isinstance(result, dict) and "token_count" in result:
             llm_tokens = result["token_count"]

+        total_written = raw_tokens + llm_tokens
+        self.total_tokens_written += total_written
         return {
-            "tokens_written": raw_tokens,
+            "tokens_written": total_written,
+            "raw_tokens_written": raw_tokens,
             "write_latency_s": round(latency, 4),
             "llm_extraction_tokens": llm_tokens,
         }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/adapters/mem0_adapter.py` around lines 100 - 111, The benchmark currently only adds raw_tokens to the running total (total_tokens_written) and therefore undercounts ingestion overhead when the adapter returns LLM token usage; update the write path that builds the result dict (where result is inspected and llm_tokens is derived) to also add llm_tokens to total_tokens_written when present and numeric, ensuring you check isinstance(result, dict) and that "token_count" yields an int/float before accumulation so the returned tokens_written and the aggregate total_tokens_written both reflect raw + llm extraction tokens.
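
A minimal sketch of the accounting described in this comment, including the numeric check the prompt asks for. The function `account_write_tokens` is hypothetical, and whether Mem0's result actually carries a `"token_count"` key is an assumption inherited from the code under review.

```python
import math


def account_write_tokens(raw_tokens: int, result) -> dict:
    """Combine raw message tokens with any reported LLM extraction tokens."""
    llm_tokens = 0
    if isinstance(result, dict):
        reported = result.get("token_count")
        # Accept only finite, non-negative numbers. bool is excluded
        # because it is a subclass of int.
        if (
            isinstance(reported, (int, float))
            and not isinstance(reported, bool)
            and math.isfinite(reported)
            and reported >= 0
        ):
            llm_tokens = int(reported)
    return {
        "tokens_written": raw_tokens + llm_tokens,
        "raw_tokens_written": raw_tokens,
        "llm_extraction_tokens": llm_tokens,
    }


if __name__ == "__main__":
    print(account_write_tokens(40, {"token_count": 120}))
```

The caller would then add `tokens_written` to its running total, so the per-write value and the aggregate stay consistent.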

coderabbitai · 2026-06-08T15:14:57Z

+        retrieved_text = "\n---\n".join(texts)
+        tokens_retrieved = count(retrieved_text)
+        self.total_tokens_retrieved += tokens_retrieved


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the same retrieval-token accounting method as Memanto.

Counting retrieved_text after joining adds separator tokens and makes cross-system token totals non-comparable. Use count_results(raw_list) here too.

Proposed fix

-        tokens_retrieved = count(retrieved_text)
+        tokens_retrieved = count_results(raw_list)
         self.total_tokens_retrieved += tokens_retrieved

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/adapters/mem0_adapter.py` around lines 149 - 151, The token accounting currently builds retrieved_text and calls count(retrieved_text), which double-counts separator tokens; update the logic in the retrieval flow (where retrieved_text, tokens_retrieved and self.total_tokens_retrieved are set) to use the same method as Memanto by calling count_results(texts) (or count_results(raw_list) if variable named differently) instead of count(retrieved_text), and add that returned value to self.total_tokens_retrieved so totals are comparable across systems.
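
The effect of the separator can be shown with a small sketch. Note the assumptions: `count` here is a stand-in that counts whitespace-separated chunks rather than tiktoken tokens, `count_results` is given a plain list of strings (the real helper's signature may differ), and the sample texts are invented.

```python
import re


def count(text: str) -> int:
    """Stand-in tokenizer: whitespace-separated chunks (NOT tiktoken)."""
    return len(re.findall(r"\S+", text))


def count_results(texts: list[str]) -> int:
    """Sum counts per retrieved item so separators never contribute."""
    return sum(count(t) for t in texts)


if __name__ == "__main__":
    texts = ["User now prefers horror films", "User mentioned Hereditary"]
    joined = "\n---\n".join(texts)
    print("per item:", count_results(texts))
    print("joined:  ", count(joined))
```

With this stand-in, the joined count exceeds the per-item sum by one per separator; a real tokenizer would add a different but still non-zero amount, which is why both adapters should count the same way.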

coderabbitai · 2026-06-08T15:14:57Z

+def _p95(values: list[float]) -> float:
+    if not values:
+        return 0.0
+    sorted_vals = sorted(values)
+    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
+    return round(sorted_vals[idx], 4)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

p95 latency index calculation is incorrect for small N.

This uses the same biased formula as the Memanto adapter and understates p95 at this benchmark’s sample sizes.

Proposed fix

+import math
 ...
 def _p95(values: list[float]) -> float:
     if not values:
         return 0.0
     sorted_vals = sorted(values)
-    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
+    idx = max(0, math.ceil(len(sorted_vals) * 0.95) - 1)
     return round(sorted_vals[idx], 4)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/adapters/mem0_adapter.py` around lines 167 - 172, _p95 currently calculates the 95th percentile using a biased index (int(len * 0.95) - 1) which understates p95 for small samples; change the index computation in function _p95 to use a ceiling-based rank: compute idx = max(0, int(math.ceil(0.95 * len(sorted_vals))) - 1) (import math), keep the empty-list guard and rounding, and clamp idx to the last element if needed so sorted_vals[idx] is always valid.

coderabbitai · 2026-06-08T15:14:57Z

+def _p95(values: list[float]) -> float:
+    if not values:
+        return 0.0
+    sorted_vals = sorted(values)
+    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
+    return round(sorted_vals[idx], 4)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix p95 computation (currently biased low for benchmark-sized samples).

Current index math reports a lower percentile (e.g., with 5 samples it picks index 3 instead of index 4), so p95 latency outputs are systematically understated.

Proposed fix

+import math
 ...
 def _p95(values: list[float]) -> float:
     if not values:
         return 0.0
     sorted_vals = sorted(values)
-    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
+    idx = max(0, math.ceil(len(sorted_vals) * 0.95) - 1)
     return round(sorted_vals[idx], 4)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/adapters/memanto_adapter.py` around lines 93 - 98, The _p95 function currently computes the index using int(len(sorted_vals) * 0.95) - 1 which biases the percentile low; change the index calculation to use ceiling so the 95th percentile picks the correct element (e.g., idx = min(len(sorted_vals)-1, math.ceil(0.95 * len(sorted_vals)) - 1)) and keep the rest of the logic (sorted_vals, rounding) the same; update imports to include math if needed and ensure the function returns round(sorted_vals[idx], 4).
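
To make the difference concrete, the two index rules can be compared side by side on five samples. The function names and the latency values are illustrative assumptions; only the two index formulas come from the review.

```python
import math


def p95_floor(values: list[float]) -> float:
    """Index rule from the code under review: int(n * 0.95) - 1."""
    if not values:
        return 0.0
    sorted_vals = sorted(values)
    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
    return round(sorted_vals[idx], 4)


def p95_nearest_rank(values: list[float]) -> float:
    """Nearest-rank rule: ceil(n * 0.95) - 1, clamped to a valid index."""
    if not values:
        return 0.0
    sorted_vals = sorted(values)
    idx = min(len(sorted_vals) - 1, max(0, math.ceil(len(sorted_vals) * 0.95) - 1))
    return round(sorted_vals[idx], 4)


if __name__ == "__main__":
    latencies = [0.10, 0.12, 0.11, 0.13, 0.90]  # invented values, one slow call
    print(p95_floor(latencies))         # 0.13 (index 3)
    print(p95_nearest_rank(latencies))  # 0.9  (index 4)
```

With only five samples per phase, the nearest-rank p95 is simply the maximum, so a single slow call determines the reported value; that is expected at this sample size rather than a defect in the formula.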

coderabbitai · 2026-06-08T15:14:57Z

+    {
+        "query_id": "q4",
+        "question": "Which specific film directors or films has the user mentioned?",
+        "golden_answer": "John Wick, The Dark Knight, Christopher Nolan, Dune, Interstellar, Planet Earth, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align q4 golden entities with exact dataset mentions.

Line 84 currently mixes non-exact entities (e.g., "Planet Earth" vs session text "Planet Earth II"), which can unfairly lower judge scores for otherwise correct retrievals and skew benchmark accuracy.

🎯 Proposed fix

-        "golden_answer": "John Wick, The Dark Knight, Christopher Nolan, Dune, Interstellar, Planet Earth, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",
+        "golden_answer": "John Wick, The Dark Knight, Nolan, Dune Part 2, Interstellar, Planet Earth II, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/dataset.py` at line 84, Update the q4 golden_answer to exactly match the dataset mentions (use the exact titles from the session text) so judge scoring isn't penalized; locate the q4 entry and its "golden_answer" string in examples/benchmarks/dataset.py and replace non-exact items (e.g., "Planet Earth") with their exact dataset forms (e.g., "Planet Earth II") and ensure all other listed entities in q4's "golden_answer" exactly match the dataset wording and punctuation.
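
A check like the one below could catch this kind of mismatch before a benchmark run. It is a sketch under assumptions: the helper names are hypothetical, and `mentioned` is a hand-written, partial list of titles (only "Planet Earth II" and "Dune Part 2" are taken from this review; the session text itself is not shown here).

```python
def split_entities(answer: str) -> list[str]:
    """Split a comma-separated golden answer into trimmed entity names."""
    return [part.strip() for part in answer.split(",") if part.strip()]


def inexact_entities(golden_answer: str, mentioned: list[str]) -> list[str]:
    """Return golden entities with no exact (case-insensitive) match.

    Exact comparison is used on purpose: a substring test would accept
    "Planet Earth" because it is a prefix of "Planet Earth II".
    """
    known = {m.casefold() for m in mentioned}
    return [e for e in split_entities(golden_answer) if e.casefold() not in known]


if __name__ == "__main__":
    mentioned = ["Planet Earth II", "Dune Part 2", "Hereditary"]  # illustrative subset
    golden = "Planet Earth, Dune, Hereditary"
    print(inexact_entities(golden, mentioned))  # ['Planet Earth', 'Dune']
```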

coderabbitai · 2026-06-08T15:14:57Z

+    msg = client.messages.create(
+        model="claude-haiku-4-5-20251001",
+        max_tokens=80,
+        system=_JUDGE_SYSTEM,
+        messages=[{"role": "user", "content": prompt}],
+    )
+
+    raw = msg.content[0].text.strip()
+    try:
+        score_str, explanation = raw.split("|", 1)
+        score = int(score_str.strip())
+    except (ValueError, IndexError):
+        score = 0
+        explanation = raw
+
+    return {
+        "score": max(0, min(3, score)),
+        "explanation": explanation.strip(),
+        "input_tokens": msg.usage.input_tokens,
+        "output_tokens": msg.usage.output_tokens,
+    }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file find . -type f -name "accuracy_judge.py" | head -20

Repository: moorcheh-ai/memanto

Length of output: 111

🏁 Script executed:

# Read the file to see the full function context wc -l examples/benchmarks/metrics/accuracy_judge.py

Repository: moorcheh-ai/memanto

Length of output: 112

🏁 Script executed:

# Read the relevant section with surrounding context sed -n '50,90p' examples/benchmarks/metrics/accuracy_judge.py

Repository: moorcheh-ai/memanto

Length of output: 1230

🏁 Script executed:

# Read the full file to understand context cat -n examples/benchmarks/metrics/accuracy_judge.py

Repository: moorcheh-ai/memanto

Length of output: 3273

🏁 Script executed:

# Search for calls to the judge function rg "judge(" . -B 2 -A 2

Repository: moorcheh-ai/memanto

Length of output: 126

🏁 Script executed:

# Search for calls to the judge function (escape parenthesis) rg "judge\(" . -B 2 -A 2

Repository: moorcheh-ai/memanto

Length of output: 1640

🏁 Script executed:

# Look for benchmark files that might call judge find . -type f -name "*benchmark*" -o -name "*test*" | grep -E "\.(py|sh)$" | head -10

Repository: moorcheh-ai/memanto

Length of output: 392

🏁 Script executed:

# Check the context around judge() calls in run_benchmark.py
rg "judge\(" ./examples/benchmarks/run_benchmark.py -B 5 -A 5

Repository: moorcheh-ai/memanto

Length of output: 1142

🏁 Script executed:

# Check if there's any try-except wrapping around benchmark execution
rg "try:|except" ./examples/benchmarks/run_benchmark.py -B 2 -A 2

Repository: moorcheh-ai/memanto

Length of output: 233

🏁 Script executed:

# Check the full context around the benchmark loops in run_benchmark.py
sed -n '1,50p' ./examples/benchmarks/run_benchmark.py

Repository: moorcheh-ai/memanto

Length of output: 1567

🏁 Script executed:

# Get the lines around the judge calls to see if there's any outer error handling
sed -n '50,150p' ./examples/benchmarks/run_benchmark.py

Repository: moorcheh-ai/memanto

Length of output: 3854

🏁 Script executed:

# Check the rest of run_benchmark.py for main function and outer error handling
tail -50 ./examples/benchmarks/run_benchmark.py

Repository: moorcheh-ai/memanto

Length of output: 1862

🏁 Script executed:

# Find the main() function definition
rg "^def main" ./examples/benchmarks/run_benchmark.py -A 30

Repository: moorcheh-ai/memanto

Length of output: 1702

Guard the Anthropic API call in the judge function so that API failures do not abort the benchmark.

If client.messages.create() fails (rate limit, auth, network error) or returns unexpected content structure, the benchmark aborts. Wrap the API call and content access defensively to return a safe fallback payload so benchmark execution continues.

Proposed fix

-    msg = client.messages.create(
-        model="claude-haiku-4-5-20251001",
-        max_tokens=80,
-        system=_JUDGE_SYSTEM,
-        messages=[{"role": "user", "content": prompt}],
-    )
-
-    raw = msg.content[0].text.strip()
+    try:
+        msg = client.messages.create(
+            model="claude-haiku-4-5-20251001",
+            max_tokens=80,
+            system=_JUDGE_SYSTEM,
+            messages=[{"role": "user", "content": prompt}],
+        )
+    except Exception as e:
+        return {
+            "score": 0,
+            "explanation": f"Judge API error: {e}",
+            "input_tokens": 0,
+            "output_tokens": 0,
+        }
+
+    blocks = getattr(msg, "content", []) or []
+    first_text = blocks[0].text if blocks and hasattr(blocks[0], "text") else ""
+    raw = first_text.strip()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/metrics/accuracy_judge.py` around lines 63 - 83, The call to client.messages.create and subsequent accesses (msg.content[0].text, msg.usage.input_tokens, msg.usage.output_tokens) are not guarded and can raise network/auth/structure errors causing the benchmark to abort; wrap the API call and the parsing logic in a try/except that catches general exceptions (e.g., Exception) and structural issues (ValueError, IndexError, AttributeError), validate that msg and msg.content[0].text exist before using them, attempt the score parsing as currently done but on any failure return a safe fallback payload (score 0, explanation set to the raw or an error message, input_tokens/output_tokens set to 0 or available defaults), and ensure the function returns this safe payload instead of raising so the benchmark continues.
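
For illustration, the guarded flow described above can be sketched without the Anthropic SDK. This is a minimal sketch, not the PR's code: `safe_judge` and `parse_judge_reply` are hypothetical names, and `call_model` is an assumed zero-argument callable standing in for `client.messages.create(...)` plus the text extraction. Token counts are left out to keep the example short.

```python
from typing import Any, Callable, Dict


def parse_judge_reply(raw: str) -> Dict[str, Any]:
    """Parse a 'score | explanation' reply; fall back to score 0 on bad input."""
    try:
        score_str, explanation = raw.split("|", 1)
        score = int(score_str.strip())
    except ValueError:
        # Raised when '|' is missing (tuple unpacking) or the score is not an int.
        score, explanation = 0, raw
    # Clamp to the 0-3 range used by the judge.
    return {"score": max(0, min(3, score)), "explanation": explanation.strip()}


def safe_judge(call_model: Callable[[], str]) -> Dict[str, Any]:
    """Run a (hypothetical) model call and return a fallback instead of raising."""
    try:
        raw = call_model().strip()
    except Exception as exc:  # broad on purpose: rate limit, auth, network, bad shape
        return {"score": 0, "explanation": f"Judge API error: {exc}"}
    return parse_judge_reply(raw)
```

One trade-off of this design is that a failed judge call is recorded as a score of 0, which lowers the average accuracy. A caller that wants to tell "wrong answer" apart from "judge unavailable" would need an extra field or a sentinel value.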

coderabbitai · 2026-06-08T15:14:57Z

+- **Read cost**: Semantic search on Moorcheh's index — returns relevant snippets
+- **Temporal tracking**: Relies on recency-weighted retrieval and tags
+
+### Mem0 (via `mem0ai` v2.0.4)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent version specification for Mem0.

Line 59 states "via mem0ai v2.0.4" but requirements.txt specifies mem0ai>=2.0.0 as a minimum constraint, not a pinned version. This creates an inconsistency between the documentation and the actual dependency specification.

📝 Suggested fix

-### Mem0 (via `mem0ai` v2.0.4)
+### Mem0 (via `mem0ai` ≥2.0.0)

Alternatively, if version 2.0.4 is specifically tested and recommended, consider pinning it in requirements.txt:

-mem0ai>=2.0.0
+mem0ai==2.0.4

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

### Mem0 (via `mem0ai` v2.0.4)

### Mem0 (via `mem0ai` ≥2.0.0)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/README.md` at line 59, Update the inconsistent Mem0 version info by either changing the README line "Mem0 (via `mem0ai` v2.0.4)" to reflect the loose constraint used in requirements.txt (e.g., "mem0ai >=2.0.0") or pin the dependency in requirements.txt to 2.0.4 (replace the existing "mem0ai>=2.0.0" entry) so the documentation and the dependency spec (the README entry and the requirements.txt mem0ai line) match.
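
To make the mismatch concrete, here is a small sketch of why a `>=2.0.0` constraint does not guarantee the `v2.0.4` named in the README. The helper names are hypothetical, and the parser assumes plain dotted numeric versions only; a real check would more likely rely on `packaging.version.Version`.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.0.4' into (2, 0, 4). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(installed: str, minimum: str) -> bool:
    """True when `installed` meets a '>=minimum' style constraint."""
    return parse_version(installed) >= parse_version(minimum)


def matches_pin(installed: str, pinned: str) -> bool:
    """True only when `installed` equals an '==pinned' style constraint."""
    return parse_version(installed) == parse_version(pinned)
```

Under the minimum constraint, any later release is accepted, so the benchmark may run against a version other than the one the README names. Pinning removes that ambiguity at the cost of manual upgrades.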

coderabbitai · 2026-06-08T15:14:57Z

+def _check_env() -> None:
+    missing = []
+    if not os.environ.get("MOORCHEH_API_KEY"):
+        missing.append("MOORCHEH_API_KEY")
+    if not os.environ.get("ANTHROPIC_API_KEY"):
+        missing.append("ANTHROPIC_API_KEY  (needed for accuracy judge)")
+    if missing:
+        print("❌  Missing environment variables:")
+        for m in missing:
+            print(f"    {m}")
+        print("\n   See .env.example for setup instructions.")
+        sys.exit(1)
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor --skip-judge / --skip-mem0 in env validation and imports.

The current flow always requires Anthropic credentials and always imports `judge`, even when the judge and Mem0 are both skipped.

Proposed fix

-def _check_env() -> None:
+def _check_env(skip_mem0: bool, skip_judge: bool) -> None:
     missing = []
     if not os.environ.get("MOORCHEH_API_KEY"):
         missing.append("MOORCHEH_API_KEY")
-    if not os.environ.get("ANTHROPIC_API_KEY"):
+    needs_anthropic = (not skip_judge) or (not skip_mem0)
+    if needs_anthropic and not os.environ.get("ANTHROPIC_API_KEY"):
         missing.append("ANTHROPIC_API_KEY  (needed for accuracy judge)")
 ...
 def run_memanto_phase(namespace: str, skip_judge: bool) -> dict:
     from adapters.memanto_adapter import MenantoAdapter
-    from metrics.accuracy_judge import judge
+    judge = None
+    if not skip_judge:
+        from metrics.accuracy_judge import judge
 ...
 def run_mem0_phase(skip_judge: bool) -> dict:
     from adapters.mem0_adapter import Mem0Adapter
-    from metrics.accuracy_judge import judge
+    judge = None
+    if not skip_judge:
+        from metrics.accuracy_judge import judge
 ...
-    _check_env()
+    _check_env(skip_mem0=args.skip_mem0, skip_judge=args.skip_judge)

Also applies to: 64-67, 120-123, 233-233

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/run_benchmark.py` around lines 43 - 55, Modify _check_env and the judge imports to respect the CLI flags --skip-judge and --skip-mem0: change _check_env to accept the parsed args (or read a shared args object) and only require ANTHROPIC_API_KEY when the judge or Mem0 functionality will actually be used (i.e., skip the Anthropic check only when both args.skip_judge and args.skip_mem0 are set); similarly guard imports/usages of the judge module and Mem0-related modules (references: _check_env, the judge import site, and any Mem0 import/initialization) so they are only imported/initialized when their corresponding flags are not set. Ensure all checks/imports mentioned around the regions noted (including the blocks at the other referenced locations) use the same flag logic.
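
The flag logic above can be sketched as a pure function that is easy to test in isolation. This is an illustrative sketch only: `missing_env_vars` is a hypothetical helper, and it takes the environment as a mapping argument (an assumption made for testability) rather than reading `os.environ` and calling `sys.exit` directly as `_check_env` does.

```python
from typing import List, Mapping


def missing_env_vars(env: Mapping[str, str], skip_mem0: bool, skip_judge: bool) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    missing = []
    if not env.get("MOORCHEH_API_KEY"):
        missing.append("MOORCHEH_API_KEY")
    # Anthropic is needed by the judge and by the Mem0 extraction pipeline,
    # so the key is optional only when both are skipped.
    needs_anthropic = (not skip_judge) or (not skip_mem0)
    if needs_anthropic and not env.get("ANTHROPIC_API_KEY"):
        missing.append("ANTHROPIC_API_KEY")
    return missing
```

A caller could pass `os.environ` together with the parsed `--skip-mem0` and `--skip-judge` flags, then print the returned names and exit when the list is non-empty.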

coderabbitai · 2026-06-08T15:14:57Z

+        elif isinstance(mv, float) and isinstance(ov, float):
+            # Lower is better for tokens & latency; higher is better for accuracy
+            if "accuracy" in label.lower():
+                winner = "Memanto ✓" if mv >= ov else "Mem0    ✓"
+            else:
+                winner = "Memanto ✓" if mv <= ov else "Mem0    ✓"
+        else:
+            winner = "─"


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Winner column skips integer metrics.

Total tokens rows are integers, but winner logic only handles floats, so those rows always show "─".

Proposed fix

-        elif isinstance(mv, float) and isinstance(ov, float):
+        elif isinstance(mv, (int, float)) and isinstance(ov, (int, float)):

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/run_benchmark.py` around lines 200 - 207, The winner calculation skips integer metrics because it only checks isinstance(..., float); update the logic in run_benchmark.py where mv, ov, and label are evaluated (the block referencing mv, ov, label that sets winner) to accept integers as well — e.g., treat numeric types by checking isinstance(mv, (int, float)) and isinstance(ov, (int, float)) or use numbers.Real, or coerce mv/ov to float before comparison; keep the existing accuracy-vs-lower-is-better branching and set winner the same way once mv/ov are treated as numeric so integer rows like "Total tokens" no longer produce "─".
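
As a sketch of the `numbers.Real` variant mentioned above, the winner selection could be pulled into a standalone function. `pick_winner` is a hypothetical name, and excluding `bool` is an added assumption (in Python `bool` is a subclass of `int`, so it would otherwise pass the numeric check). The comparison rules and output strings mirror the snippet under review, including ties going to Memanto.

```python
from numbers import Real


def pick_winner(label: str, mv: object, ov: object) -> str:
    """Pick the winner label for one metric row; '─' when values are not comparable."""
    numeric = all(isinstance(v, Real) and not isinstance(v, bool) for v in (mv, ov))
    if not numeric:
        return "─"
    if "accuracy" in label.lower():
        # Higher is better for accuracy.
        return "Memanto ✓" if mv >= ov else "Mem0    ✓"
    # Lower is better for tokens and latency.
    return "Memanto ✓" if mv <= ov else "Mem0    ✓"
```

With this check, integer token totals and float latencies take the same path, and a missing value (for example `None` when Mem0 is skipped) still falls through to "─".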

minhthai1995 · 2026-06-09T03:42:46Z

Note on r/AgenticMemory

r/AgenticMemory has been banned by Reddit as of June 7, 2026 ("This community has been banned for violating the Reddit rules.").

The social amplification component for this PR cannot be fulfilled via r/AgenticMemory. I can immediately post to alternative communities (r/LocalLLaMA, r/ClaudeAI, Hacker News) if the maintainers confirm these count for the virality scoring.

Let me know how to proceed!

minhthai1995 mentioned this pull request Jun 8, 2026

[BOUNTY $100] 🐜 The Great Agentic Memory Showdown: Memanto Benchmarking & Evaluation Challenge #639

Open

coderabbitai Bot reviewed Jun 8, 2026

	"golden_answer": "John Wick, The Dark Knight, Christopher Nolan, Dune, Interstellar, Planet Earth, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",
	"golden_answer": "John Wick, The Dark Knight, Nolan, Dune Part 2, Interstellar, Planet Earth II, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",
