
[BOUNTY #639] Memanto vs Mem0 — Shifting Persona Benchmark #705

Open
minhthai1995 wants to merge 1 commit into moorcheh-ai:main from minhthai1995:feat/benchmarks-shifting-persona

Conversation

@minhthai1995 minhthai1995 commented Jun 8, 2026


Summary

Implements Scenario B: Shifting Persona & Temporal Tracking Test from the Agentic Memory Showdown challenge.

Live demo link (Reddit/X): will be added after benchmark run — see below


What's in examples/benchmarks/

| File | Purpose |
| --- | --- |
| run_benchmark.py | Main runner: outputs comparison table and saves JSON |
| dataset.py | 5-session "Evolving Film Enthusiast" + 5 golden-answer queries |
| adapters/memanto_adapter.py | Direct upsert via moorcheh-sdk |
| adapters/mem0_adapter.py | LLM-extraction pipeline via mem0ai v2.0.4 |
| metrics/token_counter.py | tiktoken cl100k_base counting |
| metrics/accuracy_judge.py | Claude Haiku LLM-as-judge (score 0–3 per query) |
| requirements.txt | All deps with pinned minimums |
| .env.example | Required env vars template |

Benchmark Design

Dataset: 5 sessions where a user's movie genre preferences evolve:
Action → Sci-Fi → Documentary → Thriller → Horror (current)

Core question: Does the system surface the current preference (Horror) without being polluted by the 4 previous preferences?

Scientific controls:

  • Same dataset fed to both systems
  • Same LLM judge (Claude Haiku) with identical prompt
  • Isolated namespaces per run (UUID-based)
  • Same top_k=5 for retrieval
  • tiktoken cl100k_base for token counting
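
The namespace-isolation control listed above can be sketched as follows. This is a minimal illustration, not the PR's code: the helper name `make_namespace` and the `bench` prefix are hypothetical and chosen only for this example.

```python
import uuid


def make_namespace(prefix: str = "bench") -> str:
    """Return a fresh namespace name so that runs never share stored memories.

    Both the helper name and the default prefix are hypothetical.
    """
    # uuid4() is random, so two calls practically never collide.
    return f"{prefix}-{uuid.uuid4().hex}"


if __name__ == "__main__":
    first = make_namespace()
    second = make_namespace()
    print(first)
    print(second)  # differs from the first name on every run
```

Creating one such name per system per run would keep memories written by an earlier run from leaking into the retrieval results of a later one.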

Metrics measured:

  • Total tokens written (ingestion overhead)
  • Total tokens retrieved (all 5 queries)
  • p95 write latency (s)
  • p95 read latency (s)
  • Avg accuracy score 0–3 (LLM-as-judge)

How to Run

cd examples/benchmarks/
pip install -r requirements.txt
export MOORCHEH_API_KEY=... ANTHROPIC_API_KEY=...
python3 run_benchmark.py
# Fast run (Memanto only, no HuggingFace download):
python3 run_benchmark.py --skip-mem0

Expected Key Finding

Memanto's direct-upsert architecture eliminates per-write LLM inference overhead entirely. Mem0's extraction pipeline (Claude Haiku + Qdrant) adds significant token cost and latency at write time, and those costs compound across long-lived agents.


🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a comprehensive benchmark suite for comparing memory systems with performance metrics (token usage, p95 latencies) and LLM-based accuracy evaluation capabilities
  • Documentation

    • Added detailed benchmark documentation including setup instructions, evaluation scenarios, dataset specifications, and results interpretation guidance

…h-ai#639)

Implements Scenario B from the Agentic Memory Showdown bounty challenge.
Compares Memanto (moorcheh-sdk direct upsert) against Mem0 (LLM-extraction
pipeline) on token overhead, p95 latency, and temporal preference accuracy.

Dataset: 5-session "Evolving Film Enthusiast" — preferences evolve from
action → sci-fi → documentary → thriller → horror. Tests whether each
system correctly surfaces the CURRENT preference without stale-history
pollution.

Metrics measured per system:
- Total tokens written during ingestion (write overhead)
- Total tokens retrieved across 5 evaluation queries
- p95 write and read latency (seconds)
- Accuracy score 0-3 per query via LLM-as-judge (Claude Haiku)

Scientific controls: same dataset, same judge model, same judge prompt,
isolated namespaces (UUID per run), same top_k=5, tiktoken cl100k_base
for token counting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai Bot commented Jun 8, 2026


Review Change Stack

📝 Walkthrough


This PR adds a complete benchmark suite (examples/benchmarks/) comparing Memanto and Mem0 memory systems. The suite includes a shifting-persona dataset, metrics infrastructure (token counting and LLM-as-judge accuracy), system-specific adapters, and an orchestrator script that runs both systems in parallel and outputs comparative results.

Changes

Benchmark Suite: Memanto vs Mem0

| Layer | File(s) | Summary |
| --- | --- | --- |
| Environment Setup & Dependencies | examples/benchmarks/.env.example, examples/benchmarks/requirements.txt | Placeholder environment variables for Moorcheh and Anthropic API keys; dependencies include memanto, moorcheh-sdk, mem0ai, tiktoken, anthropic, sentence-transformers, and qdrant-client. |
| Benchmark Framework Documentation | examples/benchmarks/README.md | Comprehensive documentation of Scenario B (5-session shifting persona), evaluation metrics (tokens, p95 latencies, LLM-judge accuracy), dataset details, expected architecture behavior, setup/run instructions, configuration parameters, controlled/uncontrolled variables, output format, and accuracy rubric. |
| Benchmark Dataset | examples/benchmarks/dataset.py | Shifting persona dataset with five sessions where movie preferences evolve and five temporal-tracking queries with golden answers specifying whether to retrieve current vs. historical preferences and breadth of recall. |
| Metrics & Evaluation Infrastructure | examples/benchmarks/metrics/token_counter.py, examples/benchmarks/metrics/accuracy_judge.py | Token counting helpers (text, message, and result aggregation) and LLM-as-judge that uses Anthropic Claude Haiku to score retrieval accuracy on a 0–3 scale. |
| Memanto Adapter | examples/benchmarks/adapters/memanto_adapter.py | Wraps Memanto's client with per-message document ingest, similarity search, token tracking via stored text, and p95 latency computation. |
| Mem0 Adapter | examples/benchmarks/adapters/mem0_adapter.py | Wraps Mem0's memory extraction and search using Anthropic Claude Haiku, HuggingFace embeddings, and in-memory Qdrant storage; tracks ingest/query latencies and estimates token usage from raw conversation and retrieved text. |
| Benchmark Runner & Orchestration | examples/benchmarks/run_benchmark.py | Main script that validates environment variables, runs Memanto and Mem0 ingestion/retrieval phases, optionally scores results with LLM-as-judge, prints comparison tables, and saves timestamped JSON results. Supports --skip-mem0, --skip-judge, and --output flags. |
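
The flag-aware environment validation summarized for the runner might look roughly like the sketch below. It assumes only what this page states: the three flags `--skip-mem0`, `--skip-judge`, and `--output`, and the two variables `MOORCHEH_API_KEY` and `ANTHROPIC_API_KEY`. The function names and the meaning of `--output` (taken here to be a results path) are assumptions, not the PR's implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with the three flags named in the PR summary."""
    parser = argparse.ArgumentParser(description="Memanto vs Mem0 benchmark")
    parser.add_argument("--skip-mem0", action="store_true")
    parser.add_argument("--skip-judge", action="store_true")
    # Assumption: --output takes a path for the JSON results file.
    parser.add_argument("--output", default=None)
    return parser


def required_env_vars(args: argparse.Namespace) -> list[str]:
    """Return the env vars this particular run actually needs."""
    required = ["MOORCHEH_API_KEY"]
    # Both the Mem0 extraction step and the judge call Claude Haiku, so the
    # Anthropic key is unnecessary only when both are skipped.
    if not (args.skip_mem0 and args.skip_judge):
        required.append("ANTHROPIC_API_KEY")
    return required


def missing_env_vars(args: argparse.Namespace, environ: dict) -> list[str]:
    """List required variables that are unset or empty in `environ`."""
    return [name for name in required_env_vars(args) if not environ.get(name)]


if __name__ == "__main__":
    args = build_parser().parse_args(["--skip-mem0", "--skip-judge"])
    print(required_env_vars(args))  # ['MOORCHEH_API_KEY']
```

Passing the parsed arguments into the check keeps a Memanto-only run from failing on a key it never uses.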

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • Xenogents
  • het0814
  • Neelpatel1604

Poem

🐰 Hops through benchmarks, memory in sight,
Memanto meets Mem0 in comparative flight,
Token counts whisper, latencies trace,
The shifting personas test temporal grace,
LLM judges decree who wins the race!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.08% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately and specifically describes the main change: adding a benchmark suite comparing Memanto vs Mem0 on the Shifting Persona scenario, which is the core focus of all changes in the PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (3)
examples/benchmarks/README.md (1)

49-53: ⚡ Quick win

Add language specifiers to fenced code blocks.

Three code blocks are missing language specifiers, which reduces readability and prevents proper syntax highlighting. The static analysis tool correctly flagged these at lines 49, 61, and 139.

📝 Suggested fix

For the architecture diagrams (lines 49 and 61):

-```
+```text
 User message → MoorchehClient.documents.upsert() → Moorcheh serverless index
-```
+```text
 User messages → Mem0 extraction LLM (Claude Haiku) → Vectorized memory facts

For the example output (line 139):

-```
+```text
 🏆 The Great Agentic Memory Showdown

Also applies to: 61-65, 139-166

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/README.md` around lines 49 - 53, Add language specifiers
("text") to the three fenced code blocks that lack them: the architecture
diagram containing "User message → MoorchehClient.documents.upsert()", the
diagram containing "User messages → Mem0 extraction LLM (Claude Haiku) →
Vectorized memory facts", and the example output block that starts with "🏆 The
Great Agentic Memory Showdown"; update those triple-backtick fences to use
```text so the blocks get proper syntax highlighting and readability.
examples/benchmarks/metrics/accuracy_judge.py (1)

64-64: Make the judge model configurable (avoid hardcoded dated Claude ID)

examples/benchmarks/metrics/accuracy_judge.py hardcodes claude-haiku-4-5-20251001 for the LLM-as-judge. Even though this snapshot is currently valid, dated model IDs can be deprecated; use an env override with a default for benchmark stability.

♻️ Proposed refactor
+_JUDGE_MODEL = os.getenv("ANTHROPIC_JUDGE_MODEL", "claude-haiku-4-5-20251001")
...
-        model="claude-haiku-4-5-20251001",
+        model=_JUDGE_MODEL,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/metrics/accuracy_judge.py` at line 64, Replace the
hardcoded dated Claude model string used as the LLM-as-judge with a configurable
environment variable: introduce reading os.environ.get("ACCURACY_JUDGE_MODEL",
"claude-haiku-4-5-20251001") (or similar JUDGE_MODEL name) and pass that
variable into the call/site where model="claude-haiku-4-5-20251001" is used (the
function/constructor that accepts the model parameter in accuracy_judge.py).
Ensure the new env var is documented or has a clear default so benchmark
behavior stays stable while allowing overrides.
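
A minimal sketch of the env-override pattern this comment asks for is shown below. The review text mentions two candidate variable names (`ANTHROPIC_JUDGE_MODEL` and `ACCURACY_JUDGE_MODEL`); the sketch arbitrarily uses the first, and the helper name `resolve_judge_model` is hypothetical.

```python
import os

# Default taken from the model ID quoted in the review comment.
DEFAULT_JUDGE_MODEL = "claude-haiku-4-5-20251001"


def resolve_judge_model(environ=None) -> str:
    """Return the judge model ID, preferring an env override when set.

    `environ` is injectable so the lookup can be exercised without
    mutating the real process environment.
    """
    env = os.environ if environ is None else environ
    # An empty string counts as "not set" and falls back to the default.
    return env.get("ANTHROPIC_JUDGE_MODEL") or DEFAULT_JUDGE_MODEL


if __name__ == "__main__":
    print(resolve_judge_model({}))  # claude-haiku-4-5-20251001
```

Keeping the dated ID as the default leaves benchmark runs reproducible while still allowing a swap if that snapshot is retired.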
examples/benchmarks/adapters/memanto_adapter.py (1)

22-22: ⚡ Quick win

Align class name with product naming (MemantoAdapter).

MenantoAdapter looks like a typo and leaks into public imports; renaming it prevents API/docs drift.

Proposed rename
-class MenantoAdapter:
+class MemantoAdapter:
-    from adapters.memanto_adapter import MenantoAdapter
+    from adapters.memanto_adapter import MemantoAdapter
...
-    adapter = MenantoAdapter(namespace=namespace)
+    adapter = MemantoAdapter(namespace=namespace)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/adapters/memanto_adapter.py` at line 22, The class is
misspelled as MenantoAdapter; rename the class to MemantoAdapter and update
every reference/import/export that uses MenantoAdapter (including any __all__
entries, type hints, tests, and callers) to the new MemantoAdapter identifier so
public API and docs no longer leak the typo; ensure constructor and any usages
match the new class name.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/adapters/mem0_adapter.py`:
- Around line 167-172: _p95 currently calculates the 95th percentile using a
biased index (int(len * 0.95) - 1) which understates p95 for small samples;
change the index computation in function _p95 to use a ceiling-based rank:
compute idx = max(0, int(math.ceil(0.95 * len(sorted_vals))) - 1) (import math),
keep the empty-list guard and rounding, and clamp idx to the last element if
needed so sorted_vals[idx] is always valid.
- Around line 149-151: The token accounting currently builds retrieved_text and
calls count(retrieved_text), which also counts the "\n---\n" separator tokens;
update the logic in the retrieval flow (where retrieved_text, tokens_retrieved
and self.total_tokens_retrieved are set) to use the same method as Memanto by
calling count_results(texts) (or count_results(raw_list) if variable named
differently) instead of count(retrieved_text), and add that returned value to
self.total_tokens_retrieved so totals are comparable across systems.
- Around line 100-111: The benchmark currently only adds raw_tokens to the
running total (total_tokens_written) and therefore undercounts ingestion
overhead when the adapter returns LLM token usage; update the write path that
builds the result dict (where result is inspected and llm_tokens is derived) to
also add llm_tokens to total_tokens_written when present and numeric, ensuring
you check isinstance(result, dict) and that "token_count" yields an int/float
before accumulation so the returned tokens_written and the aggregate
total_tokens_written both reflect raw + llm extraction tokens.

In `@examples/benchmarks/adapters/memanto_adapter.py`:
- Around line 93-98: The _p95 function currently computes the index using
int(len(sorted_vals) * 0.95) - 1 which biases the percentile low; change the
index calculation to use ceiling so the 95th percentile picks the correct
element (e.g., idx = min(len(sorted_vals)-1, math.ceil(0.95 * len(sorted_vals))
- 1)) and keep the rest of the logic (sorted_vals, rounding) the same; update
imports to include math if needed and ensure the function returns
round(sorted_vals[idx], 4).

In `@examples/benchmarks/dataset.py`:
- Line 84: Update the q4 golden_answer to exactly match the dataset mentions
(use the exact titles from the session text) so judge scoring isn't penalized;
locate the q4 entry and its "golden_answer" string in
examples/benchmarks/dataset.py and replace non-exact items (e.g., "Planet
Earth") with their exact dataset forms (e.g., "Planet Earth II") and ensure all
other listed entities in q4's "golden_answer" exactly match the dataset wording
and punctuation.

In `@examples/benchmarks/metrics/accuracy_judge.py`:
- Around line 63-83: The call to client.messages.create and subsequent accesses
(msg.content[0].text, msg.usage.input_tokens, msg.usage.output_tokens) are not
guarded and can raise network/auth/structure errors causing the benchmark to
abort; wrap the API call and the parsing logic in a try/except that catches
general exceptions (e.g., Exception) and structural issues (ValueError,
IndexError, AttributeError), validate that msg and msg.content[0].text exist
before using them, attempt the score parsing as currently done but on any
failure return a safe fallback payload (score 0, explanation set to the raw or
an error message, input_tokens/output_tokens set to 0 or available defaults),
and ensure the function returns this safe payload instead of raising so the
benchmark continues.

In `@examples/benchmarks/README.md`:
- Line 59: Update the inconsistent Mem0 version info by either changing the
README line "Mem0 (via `mem0ai` v2.0.4)" to reflect the loose constraint used in
requirements.txt (e.g., "mem0ai >=2.0.0") or pin the dependency in
requirements.txt to 2.0.4 (replace the existing "mem0ai>=2.0.0" entry) so the
documentation and the dependency spec (the README entry and the requirements.txt
mem0ai line) match.

In `@examples/benchmarks/run_benchmark.py`:
- Around line 43-55: Modify _check_env and the top-level imports to respect the
CLI flags --skip-judge and --skip-mem0: change _check_env to accept the parsed
args (or read a shared args object) and require ANTHROPIC_API_KEY only when the
judge or Mem0 functionality will actually be used (i.e., skip the Anthropic
check only when both args.skip_judge and args.skip_mem0 are set); similarly
guard imports/usages of the judge module and Mem0-related modules (references:
_check_env, the judge import site, and any Mem0 import/initialization) so they
are imported/initialized only when their corresponding flags are not set. Ensure
all checks/imports around the regions noted (including the blocks at the other
referenced locations) use the same flag logic.
- Around line 200-207: The winner calculation skips integer metrics because it
only checks isinstance(..., float); update the logic in run_benchmark.py where
mv, ov, and label are evaluated (the block referencing mv, ov, label that sets
winner) to accept integers as well — e.g., treat numeric types by checking
isinstance(mv, (int, float)) and isinstance(ov, (int, float)) or use
numbers.Real, or coerce mv/ov to float before comparison; keep the existing
accuracy-vs-lower-is-better branching and set winner the same way once mv/ov are
treated as numeric so integer rows like "Total tokens" no longer produce "─".
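
The numeric-type fix described for the winner column could be sketched as below. The function name, the `"accuracy" in label` test, and the tie handling are assumptions made for illustration; only the `isinstance(..., (int, float))` idea, the higher-is-better treatment of accuracy, and the "─" placeholder come from the review text.

```python
from numbers import Real

NO_WINNER = "─"  # placeholder shown when no winner can be declared


def pick_winner(label: str, memanto_value, mem0_value) -> str:
    """Pick the better system for one row of the comparison table."""
    numeric = all(
        # Real covers both int and float; bool is excluded on purpose.
        isinstance(v, Real) and not isinstance(v, bool)
        for v in (memanto_value, mem0_value)
    )
    # Assumption: ties and non-numeric cells get the placeholder.
    if not numeric or memanto_value == mem0_value:
        return NO_WINNER
    # Accuracy is higher-is-better; tokens and latency are lower-is-better.
    higher_is_better = "accuracy" in label.lower()
    memanto_wins = (memanto_value > mem0_value) == higher_is_better
    return "Memanto" if memanto_wins else "Mem0"


if __name__ == "__main__":
    # Hypothetical numbers, not benchmark results.
    print(pick_winner("Total tokens written", 120, 900))  # Memanto
    print(pick_winner("Avg accuracy (0-3)", 2.4, 2.8))    # Mem0
```

With this check, integer rows such as token totals are compared the same way as float rows instead of falling through to the placeholder.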


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: db9815ed-ffd8-4cd7-a7d1-7cc53474439d

📥 Commits

Reviewing files that changed from the base of the PR and between 6b495ea and e19a85c.

📒 Files selected for processing (12)
  • examples/benchmarks/.env.example
  • examples/benchmarks/README.md
  • examples/benchmarks/adapters/__init__.py
  • examples/benchmarks/adapters/mem0_adapter.py
  • examples/benchmarks/adapters/memanto_adapter.py
  • examples/benchmarks/dataset.py
  • examples/benchmarks/metrics/__init__.py
  • examples/benchmarks/metrics/accuracy_judge.py
  • examples/benchmarks/metrics/token_counter.py
  • examples/benchmarks/requirements.txt
  • examples/benchmarks/results/.gitkeep
  • examples/benchmarks/run_benchmark.py

Comment on lines +100 to +111
        self.total_tokens_written += raw_tokens

        # Try to extract LLM token usage from the result if available
        llm_tokens = 0
        if isinstance(result, dict) and "token_count" in result:
            llm_tokens = result["token_count"]

        return {
            "tokens_written": raw_tokens,
            "write_latency_s": round(latency, 4),
            "llm_extraction_tokens": llm_tokens,
        }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mem0 write-token summary undercounts ingestion overhead.

total_tokens_written only accumulates raw message tokens, while extracted LLM token usage is returned but not included in the aggregate used by the benchmark summary.

Proposed fix
         latency = time.perf_counter() - start
         self._write_latencies.append(latency)
-        self.total_tokens_written += raw_tokens

         # Try to extract LLM token usage from the result if available
         llm_tokens = 0
         if isinstance(result, dict) and "token_count" in result:
             llm_tokens = result["token_count"]
+        total_written = raw_tokens + llm_tokens
+        self.total_tokens_written += total_written

         return {
-            "tokens_written": raw_tokens,
+            "tokens_written": total_written,
+            "raw_tokens_written": raw_tokens,
             "write_latency_s": round(latency, 4),
             "llm_extraction_tokens": llm_tokens,
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        self.total_tokens_written += raw_tokens
-        # Try to extract LLM token usage from the result if available
-        llm_tokens = 0
-        if isinstance(result, dict) and "token_count" in result:
-            llm_tokens = result["token_count"]
-        return {
-            "tokens_written": raw_tokens,
-            "write_latency_s": round(latency, 4),
-            "llm_extraction_tokens": llm_tokens,
-        }
+        # Try to extract LLM token usage from the result if available
+        llm_tokens = 0
+        if isinstance(result, dict) and "token_count" in result:
+            llm_tokens = result["token_count"]
+        total_written = raw_tokens + llm_tokens
+        self.total_tokens_written += total_written
+        return {
+            "tokens_written": total_written,
+            "raw_tokens_written": raw_tokens,
+            "write_latency_s": round(latency, 4),
+            "llm_extraction_tokens": llm_tokens,
+        }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/adapters/mem0_adapter.py` around lines 100 - 111, The
benchmark currently only adds raw_tokens to the running total
(total_tokens_written) and therefore undercounts ingestion overhead when the
adapter returns LLM token usage; update the write path that builds the result
dict (where result is inspected and llm_tokens is derived) to also add
llm_tokens to total_tokens_written when present and numeric, ensuring you check
isinstance(result, dict) and that "token_count" yields an int/float before
accumulation so the returned tokens_written and the aggregate
total_tokens_written both reflect raw + llm extraction tokens.
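
The accounting rule in this comment can be illustrated in isolation. The sketch below is a standalone function rather than the adapter method, and the name `account_write` is hypothetical. Whether Mem0's result actually carries a `token_count` field is not confirmed anywhere on this page, so the code treats it as optional and validates its type before adding it.

```python
def account_write(raw_tokens: int, result: object) -> dict:
    """Combine raw message tokens with any reported LLM extraction tokens."""
    llm_tokens = 0
    if isinstance(result, dict):
        candidate = result.get("token_count")
        # Accept only real numbers; bool is a subclass of int, so reject it.
        if isinstance(candidate, (int, float)) and not isinstance(candidate, bool):
            llm_tokens = int(candidate)
    total_written = raw_tokens + llm_tokens
    return {
        "tokens_written": total_written,
        "raw_tokens_written": raw_tokens,
        "llm_extraction_tokens": llm_tokens,
    }


if __name__ == "__main__":
    # Hypothetical numbers, not benchmark results.
    print(account_write(40, {"token_count": 310})["tokens_written"])  # 350
    print(account_write(40, {"results": []})["tokens_written"])       # 40
```

The caller would then add `tokens_written` to its running total, so the per-write value and the aggregate stay consistent.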

Comment on lines +149 to +151
retrieved_text = "\n---\n".join(texts)
tokens_retrieved = count(retrieved_text)
self.total_tokens_retrieved += tokens_retrieved


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the same retrieval-token accounting method as Memanto.

Counting retrieved_text after joining adds separator tokens and makes cross-system token totals non-comparable. Use count_results(raw_list) here too.

Proposed fix
-        tokens_retrieved = count(retrieved_text)
+        tokens_retrieved = count_results(raw_list)
         self.total_tokens_retrieved += tokens_retrieved
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/adapters/mem0_adapter.py` around lines 149 - 151, The
token accounting currently builds retrieved_text and calls
count(retrieved_text), which also counts the "\n---\n" separator tokens; update
the logic in the retrieval flow (where retrieved_text, tokens_retrieved and
self.total_tokens_retrieved are set) to use the same method as Memanto by
calling count_results(texts) (or count_results(raw_list) if variable named
differently) instead of count(retrieved_text), and add that returned value to
self.total_tokens_retrieved so totals are comparable across systems.
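
The effect of the separator on the total can be seen with a toy tokenizer. In the sketch below, `count` and `count_results` are stand-ins that split on whitespace; the PR's real helpers use tiktoken and their exact signatures are not shown on this page. The three memory strings are hypothetical.

```python
SEPARATOR = "\n---\n"  # same separator as in the quoted adapter code


def count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated pieces."""
    return len(text.split())


def count_results(texts: list[str]) -> int:
    """Sum per-item counts so that separators never enter the total."""
    return sum(count(t) for t in texts)


# Hypothetical retrieved memories.
texts = ["likes horror films", "used to like thrillers", "watched Hereditary"]

joined_total = count(SEPARATOR.join(texts))
per_item_total = count_results(texts)

if __name__ == "__main__":
    print(joined_total)    # 11
    print(per_item_total)  # 9
```

With this stand-in, each separator contributes one extra piece, so the joined count exceeds the per-item count by `len(texts) - 1`. A real tokenizer may split the separator differently, but the joined total is still inflated relative to the per-item sum.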

Comment on lines +167 to +172
def _p95(values: list[float]) -> float:
    if not values:
        return 0.0
    sorted_vals = sorted(values)
    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
    return round(sorted_vals[idx], 4)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

p95 latency index calculation is incorrect for small N.

This uses the same biased formula as the Memanto adapter and understates p95 in this benchmark’s sample sizes.

Proposed fix
+import math
...
 def _p95(values: list[float]) -> float:
     if not values:
         return 0.0
     sorted_vals = sorted(values)
-    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
+    idx = max(0, math.ceil(len(sorted_vals) * 0.95) - 1)
     return round(sorted_vals[idx], 4)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/adapters/mem0_adapter.py` around lines 167 - 172, _p95
currently calculates the 95th percentile using a biased index (int(len * 0.95) -
1) which understates p95 for small samples; change the index computation in
function _p95 to use a ceiling-based rank: compute idx = max(0,
int(math.ceil(0.95 * len(sorted_vals))) - 1) (import math), keep the empty-list
guard and rounding, and clamp idx to the last element if needed so
sorted_vals[idx] is always valid.

Comment on lines +93 to +98
def _p95(values: list[float]) -> float:
    if not values:
        return 0.0
    sorted_vals = sorted(values)
    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
    return round(sorted_vals[idx], 4)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix p95 computation (currently biased low for benchmark-sized samples).

Current index math reports a lower percentile (e.g., with 5 samples it picks index 3 instead of index 4), so p95 latency outputs are systematically understated.

Proposed fix
+import math
...
 def _p95(values: list[float]) -> float:
     if not values:
         return 0.0
     sorted_vals = sorted(values)
-    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
+    idx = max(0, math.ceil(len(sorted_vals) * 0.95) - 1)
     return round(sorted_vals[idx], 4)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-def _p95(values: list[float]) -> float:
-    if not values:
-        return 0.0
-    sorted_vals = sorted(values)
-    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
-    return round(sorted_vals[idx], 4)
+import math
+
+def _p95(values: list[float]) -> float:
+    if not values:
+        return 0.0
+    sorted_vals = sorted(values)
+    idx = max(0, math.ceil(len(sorted_vals) * 0.95) - 1)
+    return round(sorted_vals[idx], 4)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/adapters/memanto_adapter.py` around lines 93 - 98, The
_p95 function currently computes the index using int(len(sorted_vals) * 0.95) -
1 which biases the percentile low; change the index calculation to use ceiling
so the 95th percentile picks the correct element (e.g., idx =
min(len(sorted_vals)-1, math.ceil(0.95 * len(sorted_vals)) - 1)) and keep the
rest of the logic (sorted_vals, rounding) the same; update imports to include
math if needed and ensure the function returns round(sorted_vals[idx], 4).
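
The index arithmetic in this comment can be checked with a small side-by-side sketch. Both functions below are standalone copies written for illustration: `p95_truncating` mirrors the quoted `_p95`, and `p95_nearest_rank` applies the ceiling-based index from the proposed fix. The latency values are hypothetical.

```python
import math


def p95_truncating(values: list[float]) -> float:
    """Mirror of the quoted _p95: int() truncates the rank before the -1."""
    if not values:
        return 0.0
    ordered = sorted(values)
    idx = max(0, int(len(ordered) * 0.95) - 1)
    return round(ordered[idx], 4)


def p95_nearest_rank(values: list[float]) -> float:
    """Nearest-rank p95: ceil(0.95 * n) as a 1-based rank, clamped to range."""
    if not values:
        return 0.0
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, math.ceil(len(ordered) * 0.95) - 1))
    return round(ordered[idx], 4)


# Hypothetical latencies (seconds) for five writes, one of them slow.
latencies = [0.12, 0.15, 0.11, 0.90, 0.14]

if __name__ == "__main__":
    print(p95_truncating(latencies))    # 0.15
    print(p95_nearest_rank(latencies))  # 0.9
```

With five samples, `5 * 0.95 = 4.75`: truncation gives index 3 and hides the slow write, while the ceiling gives index 4 and reports it.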

{
"query_id": "q4",
"question": "Which specific film directors or films has the user mentioned?",
"golden_answer": "John Wick, The Dark Knight, Christopher Nolan, Dune, Interstellar, Planet Earth, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align q4 golden entities with exact dataset mentions.

Line 84 currently mixes non-exact entities (e.g., "Planet Earth" vs session text "Planet Earth II"), which can unfairly lower judge scores for otherwise correct retrievals and skew benchmark accuracy.

🎯 Proposed fix
-        "golden_answer": "John Wick, The Dark Knight, Christopher Nolan, Dune, Interstellar, Planet Earth, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",
+        "golden_answer": "John Wick, The Dark Knight, Nolan, Dune Part 2, Interstellar, Planet Earth II, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        "golden_answer": "John Wick, The Dark Knight, Christopher Nolan, Dune, Interstellar, Planet Earth, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",
+        "golden_answer": "John Wick, The Dark Knight, Nolan, Dune Part 2, Interstellar, Planet Earth II, The Social Dilemma, Gone Girl, Parasite, Knives Out, Hereditary, Midsommar, Ari Aster",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/dataset.py` at line 84, Update the q4 golden_answer to
exactly match the dataset mentions (use the exact titles from the session text)
so judge scoring isn't penalized; locate the q4 entry and its "golden_answer"
string in examples/benchmarks/dataset.py and replace non-exact items (e.g.,
"Planet Earth") with their exact dataset forms (e.g., "Planet Earth II") and
ensure all other listed entities in q4's "golden_answer" exactly match the
dataset wording and punctuation.

Comment on lines +63 to +83
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=80,
        system=_JUDGE_SYSTEM,
        messages=[{"role": "user", "content": prompt}],
    )

    raw = msg.content[0].text.strip()
    try:
        score_str, explanation = raw.split("|", 1)
        score = int(score_str.strip())
    except (ValueError, IndexError):
        score = 0
        explanation = raw

    return {
        "score": max(0, min(3, score)),
        "explanation": explanation.strip(),
        "input_tokens": msg.usage.input_tokens,
        "output_tokens": msg.usage.output_tokens,
    }
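
The parsing half of the quoted code can be hardened on its own, without touching the API call. The sketch below is a hypothetical helper (`parse_judge_reply` is not a name from the PR) that accepts whatever text the judge returned, applies the same `score|explanation` split and 0–3 clamp as the excerpt, and falls back to a score of 0 instead of raising.

```python
def parse_judge_reply(raw: object) -> dict:
    """Parse a 'score|explanation' judge reply without ever raising."""
    # Anything that is not a string (e.g. None from a failed call) becomes "".
    text = raw.strip() if isinstance(raw, str) else ""
    try:
        # Unpacking fails with ValueError when the "|" delimiter is missing;
        # int() fails with ValueError when the score is not an integer.
        score_str, explanation = text.split("|", 1)
        score = int(score_str.strip())
    except ValueError:
        score = 0
        explanation = text or "judge reply missing or malformed"
    return {
        "score": max(0, min(3, score)),  # clamp to the 0-3 rubric
        "explanation": explanation.strip(),
    }


if __name__ == "__main__":
    print(parse_judge_reply("2| Mentions horror as current"))
    print(parse_judge_reply(None))
```

A caller could wrap the network request in its own `try`/`except`, pass `None` here on failure, and report zero token usage for that query so one bad response does not abort the whole run.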


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file
find . -type f -name "accuracy_judge.py" | head -20

Repository: moorcheh-ai/memanto

Length of output: 111


🏁 Script executed:

# Read the file to see the full function context
wc -l examples/benchmarks/metrics/accuracy_judge.py

Repository: moorcheh-ai/memanto

Length of output: 112


🏁 Script executed:

# Read the relevant section with surrounding context
sed -n '50,90p' examples/benchmarks/metrics/accuracy_judge.py

Repository: moorcheh-ai/memanto

Length of output: 1230


🏁 Script executed:

# Read the full file to understand context
cat -n examples/benchmarks/metrics/accuracy_judge.py

Repository: moorcheh-ai/memanto

Length of output: 3273


🏁 Script executed:

# Search for calls to the judge function
rg "judge(" . -B 2 -A 2

Repository: moorcheh-ai/memanto

Length of output: 126


🏁 Script executed:

# Search for calls to the judge function (escape parenthesis)
rg "judge\(" . -B 2 -A 2

Repository: moorcheh-ai/memanto

Length of output: 1640


🏁 Script executed:

# Look for benchmark files that might call judge
find . -type f -name "*benchmark*" -o -name "*test*" | grep -E "\.(py|sh)$" | head -10

Repository: moorcheh-ai/memanto

Length of output: 392


🏁 Script executed:

# Check the context around judge() calls in run_benchmark.py
rg "judge\(" ./examples/benchmarks/run_benchmark.py -B 5 -A 5

Repository: moorcheh-ai/memanto

Length of output: 1142


🏁 Script executed:

# Check if there's any try-except wrapping around benchmark execution
rg "try:|except" ./examples/benchmarks/run_benchmark.py -B 2 -A 2

Repository: moorcheh-ai/memanto

Length of output: 233


🏁 Script executed:

# Check the full context around the benchmark loops in run_benchmark.py
sed -n '1,50p' ./examples/benchmarks/run_benchmark.py

Repository: moorcheh-ai/memanto

Length of output: 1567


🏁 Script executed:

# Get the lines around the judge calls to see if there's any outer error handling
sed -n '50,150p' ./examples/benchmarks/run_benchmark.py

Repository: moorcheh-ai/memanto

Length of output: 3854


🏁 Script executed:

# Check the rest of run_benchmark.py for main function and outer error handling
tail -50 ./examples/benchmarks/run_benchmark.py

Repository: moorcheh-ai/memanto

Length of output: 1862


🏁 Script executed:

# Find the main() function definition
rg "^def main" ./examples/benchmarks/run_benchmark.py -A 30

Repository: moorcheh-ai/memanto

Length of output: 1702


Guard Anthropic API calls in the judge function so transient failures do not abort the benchmark.

If client.messages.create() fails (rate limit, auth, or network error) or returns an unexpected content structure, the exception propagates and the whole benchmark run aborts. Wrap the API call and the content access defensively and return a safe fallback payload so the run continues.

Proposed fix
-    msg = client.messages.create(
-        model="claude-haiku-4-5-20251001",
-        max_tokens=80,
-        system=_JUDGE_SYSTEM,
-        messages=[{"role": "user", "content": prompt}],
-    )
-
-    raw = msg.content[0].text.strip()
+    try:
+        msg = client.messages.create(
+            model="claude-haiku-4-5-20251001",
+            max_tokens=80,
+            system=_JUDGE_SYSTEM,
+            messages=[{"role": "user", "content": prompt}],
+        )
+    except Exception as e:
+        return {
+            "score": 0,
+            "explanation": f"Judge API error: {e}",
+            "input_tokens": 0,
+            "output_tokens": 0,
+        }
+
+    blocks = getattr(msg, "content", []) or []
+    first_text = blocks[0].text if blocks and hasattr(blocks[0], "text") else ""
+    raw = first_text.strip()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/metrics/accuracy_judge.py` around lines 63 - 83, The call
to client.messages.create and subsequent accesses (msg.content[0].text,
msg.usage.input_tokens, msg.usage.output_tokens) are not guarded and can raise
network/auth/structure errors causing the benchmark to abort; wrap the API call
and the parsing logic in a try/except that catches general exceptions (e.g.,
Exception) and structural issues (ValueError, IndexError, AttributeError),
validate that msg and msg.content[0].text exist before using them, attempt the
score parsing as currently done but on any failure return a safe fallback
payload (score 0, explanation set to the raw or an error message,
input_tokens/output_tokens set to 0 or available defaults), and ensure the
function returns this safe payload instead of raising so the benchmark
continues.
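
The same idea can be sketched without the SDK, which makes the fallback path easy to test in isolation. The names here are hypothetical: `parse_judge_reply` mirrors the existing `score|explanation` parsing, and `call_model` stands in for whatever function performs the real API request; token counts are left out for brevity.

```python
from typing import Callable, Tuple


def parse_judge_reply(raw: str) -> Tuple[int, str]:
    """Parse 'score|explanation'; fall back to score 0 on a malformed reply."""
    try:
        score_str, explanation = raw.split("|", 1)
        score = int(score_str.strip())
    except ValueError:  # missing delimiter or non-integer score
        return 0, raw.strip()
    return max(0, min(3, score)), explanation.strip()


def safe_judge(call_model: Callable[[], str]) -> dict:
    """Run one judge call without raising; call_model is a hypothetical stand-in."""
    try:
        raw = call_model()
    except Exception as exc:  # broad on purpose: one failed query should not end the run
        return {"score": 0, "explanation": f"Judge API error: {exc}"}
    score, explanation = parse_judge_reply(raw)
    return {"score": score, "explanation": explanation}


print(safe_judge(lambda: "2 | mentions Horror but also older genres"))
```

One design point worth weighing: scoring a failed call as 0 lowers the average accuracy for that system, so it may be preferable to also flag the error and let the runner exclude or retry such queries.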

- **Read cost**: Semantic search on Moorcheh's index — returns relevant snippets
- **Temporal tracking**: Relies on recency-weighted retrieval and tags

### Mem0 (via `mem0ai` v2.0.4)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent version specification for Mem0.

Line 59 states "via mem0ai v2.0.4" but requirements.txt specifies mem0ai>=2.0.0 as a minimum constraint, not a pinned version. This creates an inconsistency between the documentation and the actual dependency specification.

📝 Suggested fix
-### Mem0 (via `mem0ai` v2.0.4)
+### Mem0 (via `mem0ai` ≥2.0.0)

Alternatively, if version 2.0.4 is specifically tested and recommended, consider pinning it in requirements.txt:

-mem0ai>=2.0.0
+mem0ai==2.0.4
📝 Committable suggestion

Suggested change
-### Mem0 (via `mem0ai` v2.0.4)
+### Mem0 (via `mem0ai` ≥2.0.0)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/README.md` at line 59, Update the inconsistent Mem0
version info by either changing the README line "Mem0 (via `mem0ai` v2.0.4)" to
reflect the loose constraint used in requirements.txt (e.g., "mem0ai >=2.0.0")
or pin the dependency in requirements.txt to 2.0.4 (replace the existing
"mem0ai>=2.0.0" entry) so the documentation and the dependency spec (the README
entry and the requirements.txt mem0ai line) match.
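
Independently of which option is chosen, the runner could record the version that was actually installed, so published results are unambiguous. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and where the value gets stored (for example in the results JSON) is left open.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed mem0ai version, or None when the package is absent.
print(installed_version("mem0ai"))
```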

Comment on lines +43 to +55
def _check_env() -> None:
    missing = []
    if not os.environ.get("MOORCHEH_API_KEY"):
        missing.append("MOORCHEH_API_KEY")
    if not os.environ.get("ANTHROPIC_API_KEY"):
        missing.append("ANTHROPIC_API_KEY (needed for accuracy judge)")
    if missing:
        print("❌ Missing environment variables:")
        for m in missing:
            print(f" {m}")
        print("\n See .env.example for setup instructions.")
        sys.exit(1)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor --skip-judge / --skip-mem0 in env validation and imports.

Current flow always requires Anthropic credentials and always imports judge, even when judge and Mem0 are both skipped.

Proposed fix
-def _check_env() -> None:
+def _check_env(skip_mem0: bool, skip_judge: bool) -> None:
     missing = []
     if not os.environ.get("MOORCHEH_API_KEY"):
         missing.append("MOORCHEH_API_KEY")
-    if not os.environ.get("ANTHROPIC_API_KEY"):
+    needs_anthropic = (not skip_judge) or (not skip_mem0)
+    if needs_anthropic and not os.environ.get("ANTHROPIC_API_KEY"):
         missing.append("ANTHROPIC_API_KEY  (needed for accuracy judge)")
...
 def run_memanto_phase(namespace: str, skip_judge: bool) -> dict:
     from adapters.memanto_adapter import MenantoAdapter
-    from metrics.accuracy_judge import judge
+    judge = None
+    if not skip_judge:
+        from metrics.accuracy_judge import judge
...
 def run_mem0_phase(skip_judge: bool) -> dict:
     from adapters.mem0_adapter import Mem0Adapter
-    from metrics.accuracy_judge import judge
+    judge = None
+    if not skip_judge:
+        from metrics.accuracy_judge import judge
...
-    _check_env()
+    _check_env(skip_mem0=args.skip_mem0, skip_judge=args.skip_judge)

Also applies to: 64-67, 120-123, 233-233

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/run_benchmark.py` around lines 43 - 55, Modify _check_env
and the top-level imports to respect the CLI flags --skip-judge and --skip-mem0:
change _check_env to accept the parsed args (or read a shared args object) and
only require ANTHROPIC_API_KEY when the judge or Mem0 functionality will
actually be used (i.e., skip the Anthropic check only when both args.skip_judge
and args.skip_mem0 are set); similarly guard imports/usages of the judge module
and Mem0-related modules (references: _check_env, the judge import site, and any
Mem0 import/initialization) so they are only imported/initialized when their
corresponding flags are not set. Ensure all checks/imports mentioned around the
regions noted (including the blocks at the other referenced locations) use the
same flag logic.
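
The flag logic can be isolated into a pure function that takes the environment as an argument, which keeps it testable without touching `os.environ` or calling `sys.exit`. This is a sketch, not the PR's code: `missing_env` is a hypothetical name, and it assumes (as the proposed fix does) that the Mem0 adapter also needs the Anthropic key.

```python
from typing import List, Mapping


def missing_env(env: Mapping[str, str], skip_mem0: bool, skip_judge: bool) -> List[str]:
    """List the required variables that are unset, honoring the skip flags."""
    missing = []
    if not env.get("MOORCHEH_API_KEY"):
        missing.append("MOORCHEH_API_KEY")
    # Assumption: the key is needed by the judge and by the Mem0 adapter.
    needs_anthropic = (not skip_judge) or (not skip_mem0)
    if needs_anthropic and not env.get("ANTHROPIC_API_KEY"):
        missing.append("ANTHROPIC_API_KEY")
    return missing


# With both phases skipped, only the Moorcheh key is required.
print(missing_env({"MOORCHEH_API_KEY": "x"}, skip_mem0=True, skip_judge=True))
```

In the runner, `_check_env` would then call this with `os.environ` and the parsed flags, and exit only if the returned list is non-empty.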

Comment on lines +200 to +207
        elif isinstance(mv, float) and isinstance(ov, float):
            # Lower is better for tokens & latency; higher is better for accuracy
            if "accuracy" in label.lower():
                winner = "Memanto ✓" if mv >= ov else "Mem0 ✓"
            else:
                winner = "Memanto ✓" if mv <= ov else "Mem0 ✓"
        else:
            winner = "─"

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Winner column skips integer metrics.

Total tokens rows are integers, but winner logic only handles floats, so those rows always show "─".

Proposed fix
-        elif isinstance(mv, float) and isinstance(ov, float):
+        elif isinstance(mv, (int, float)) and isinstance(ov, (int, float)):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/run_benchmark.py` around lines 200 - 207, The winner
calculation skips integer metrics because it only checks isinstance(..., float);
update the logic in run_benchmark.py where mv, ov, and label are evaluated (the
block referencing mv, ov, label that sets winner) to accept integers as well —
e.g., treat numeric types by checking isinstance(mv, (int, float)) and
isinstance(ov, (int, float)) or use numbers.Real, or coerce mv/ov to float
before comparison; keep the existing accuracy-vs-lower-is-better branching and
set winner the same way once mv/ov are treated as numeric so integer rows like
"Total tokens" no longer produce "─".

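As a standalone illustration of the numeric check, the winner logic could be written along these lines. `pick_winner` is a hypothetical helper rather than a function in run_benchmark.py; it uses `numbers.Real` to accept both ints and floats, and excludes `bool` explicitly because `bool` is a subclass of `int` in Python.

```python
from numbers import Real


def _is_number(value: object) -> bool:
    """True for ints and floats, but not for bools."""
    return isinstance(value, Real) and not isinstance(value, bool)


def pick_winner(label: str, mv: object, ov: object) -> str:
    """Pick the winning system for one metric row; ties go to Memanto, as in the original."""
    if not (_is_number(mv) and _is_number(ov)):
        return "─"
    # Higher is better for accuracy; lower is better for tokens and latency.
    if "accuracy" in label.lower():
        return "Memanto ✓" if mv >= ov else "Mem0 ✓"
    return "Memanto ✓" if mv <= ov else "Mem0 ✓"


print(pick_winner("Total tokens written", 120, 450))
```

Because ties resolve to Memanto in both branches, a benchmark that wants to appear neutral might prefer to report ties explicitly.
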
@minhthai1995

Author

Note on r/AgenticMemory

r/AgenticMemory has been banned by Reddit as of June 7, 2026 ("This community has been banned for violating the Reddit rules.").

The social amplification component for this PR cannot be fulfilled via r/AgenticMemory. I can immediately post to alternative communities (r/LocalLLaMA, r/ClaudeAI, Hacker News) if the maintainers confirm these count for the virality scoring.

Let me know how to proceed!
