feat: Add Memanto vs Mem0 benchmarking suite for Shifting Persona scenario #746

Open
zaid1234-11 wants to merge 7 commits into moorcheh-ai:main from zaid1234-11:memanto-vs-mem0-benchmark

@zaid1234-11 zaid1234-11 commented Jun 15, 2026

Description

This PR introduces a reproducible benchmarking suite that evaluates the Accuracy vs. Resource Footprint of two agentic memory frameworks: Memanto and Mem0.

It implements a Shifting Persona scenario, which stress-tests how each framework adapts when a user’s preferences change dynamically over multiple sessions (e.g., transitioning from preferring action movies to getting burnt out and preferring slow, relaxing dramas).

The benchmark measures:

  1. Accuracy Score (0-100): Evaluated by an LLM-as-a-judge (Llama-3.3-70b) comparing retrieved context against the true current state.
  2. p95 Latency (s): End-to-end lookup and retrieval time.
  3. Token Efficiency: Ingested vs. retrieved token overhead.
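
As a rough illustration of the latency metric, a p95 value can only be computed from individual per-operation samples, not from a summed total. The sketch below uses the nearest-rank definition, which is one of several percentile conventions and may differ slightly from interpolating methods such as `statistics.quantiles`. The function name `p95` and the sample values are hypothetical and are not taken from the PR's code.

```python
def p95(samples):
    """Nearest-rank 95th percentile of per-operation latencies (seconds)."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # 1-based nearest-rank index: ceil(0.95 * n), computed with integers
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical per-call latencies; one slow outlier dominates the tail.
latencies = [0.21, 0.19, 0.25, 0.22, 4.11]
print(f"p95 latency: {p95(latencies):.2f}s")
```

With only a handful of samples, p95 is effectively the slowest call, so more samples give a more meaningful tail estimate.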

What was added

All code is located in the new folder examples/benchmarks/memanto_vs_mem0_persona_shift/:

  • benchmark.py: Main orchestrator running the multi-session interaction dataset, measuring metrics, and outputting the final comparison table.
  • dataset.py: The shifting persona dataset containing evolving user dialogues and the expected final state.
  • memory_layers.py: Unified adapters for both MemantoLayer (interfacing with Moorcheh SDK/API) and Mem0Layer (interfacing with local Qdrant/HF embeddings) for a 1:1 comparison.
  • judge.py: Robust LLM judge class utilizing JSON formatting to evaluate retrieved context accuracy.
  • README.md & requirements.txt: Clear instructions for installing dependencies and reproducing the benchmark.
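
To make the "unified adapter" idea concrete, here is a minimal sketch of what such an interface might look like. The names `BaseMemoryLayer`, `add_memory`, and `retrieve_memory` appear in the review walkthrough; the method signatures, the returned metric keys, and the `InMemoryLayer` backend are assumptions made for illustration. Whitespace splitting stands in for the tiktoken-based counting the PR uses.

```python
import time
from abc import ABC, abstractmethod


class BaseMemoryLayer(ABC):
    """Assumed contract: every backend reports latency and token counts."""

    @abstractmethod
    def add_memory(self, user_id: str, text: str) -> dict: ...

    @abstractmethod
    def retrieve_memory(self, user_id: str, query: str) -> dict: ...


class InMemoryLayer(BaseMemoryLayer):
    """Hypothetical stand-in backend; not Memanto or Mem0."""

    def __init__(self):
        self._store = {}

    def add_memory(self, user_id, text):
        start = time.perf_counter()
        self._store.setdefault(user_id, []).append(text)
        latency = time.perf_counter() - start
        # Whitespace split is a placeholder for a real tokenizer.
        return {"latency": latency, "tokens": len(text.split())}

    def retrieve_memory(self, user_id, query):
        start = time.perf_counter()
        context = "\n".join(self._store.get(user_id, []))
        latency = time.perf_counter() - start
        return {"context": context, "latency": latency,
                "tokens": len(context.split())}


layer = InMemoryLayer()
layer.add_memory("user-1", "I love action movies")
print(layer.retrieve_memory("user-1", "What do I like?")["context"])
```

Keeping both backends behind one interface is what allows the orchestrator to time and score them with identical code paths.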

Benchmark Run Summary

Evaluating both layers using a Groq llama-3.3-70b-versatile judge yielded the following result:

| Framework | Accuracy Score | p95 Latency (s) | Total Tokens Ingested |
| --- | --- | --- | --- |
| Memanto | 80/100 | 4.11 | 114 |
| Mem0 | 40/100 | 59.55 | 114 |

Observations:

  • Memanto successfully adapted to the new user state, returning context relevant to the active preference.
  • Mem0 suffered from embedding overlap, returning contradictory statements (both the old and new preferences), leading to a lower accuracy score.
  • Memanto achieved roughly 14x lower p95 retrieval latency (4.11 s vs. 59.55 s) by avoiding the heavy background LLM updates required by Mem0's vector pipeline.

Verification Plan

Automated Verification

Run the benchmark locally inside a Python virtual environment:

cd examples/benchmarks/memanto_vs_mem0_persona_shift
pip install -r requirements.txt
python benchmark.py

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added a reproducible benchmark suite comparing memory systems for “shifting persona & temporal tracking,” including token counting, latency tracking, and AI-based retrieval accuracy scoring.

* **Documentation**
  * Added an end-to-end setup and run guide with prerequisites, configuration steps, and expected output.

* **Chores**
  * Updated the benchmark environment template to include required API key placeholders (including an additional Groq key).
  * Added/updated benchmark dependency listings.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

coderabbitai Bot commented Jun 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

A new self-contained benchmark example is added under examples/benchmarks/memanto_vs_mem0_persona_shift/. It includes a shifting-persona dataset, abstract and concrete memory layer wrappers for Memanto and Mem0 backends, a Groq-powered LLM judge, an orchestration script that produces a Rich metrics table, dependency and environment configuration files, and a README.

Changes

Memanto vs Mem0 Persona-Shift Benchmark Suite

| Layer / File(s) | Summary |
| --- | --- |
| Dataset, env config, and dependencies: examples/benchmarks/memanto_vs_mem0_persona_shift/dataset.py, .env.example, requirements.txt | Defines SHIFTING_PERSONA_DATASET and EXPECTED_STATE constants, API key placeholders for MOORCHEH_API_KEY, OPENAI_API_KEY, and GROQ_API_KEY, and eight Python package dependencies including memanto, mem0ai, groq, and tiktoken. |
| Memory layer abstraction: examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py (BaseMemoryLayer) | Introduces BaseMemoryLayer abstract interface with add_memory() and retrieve_memory() methods, establishing the contract for latency and token metric tracking. |
| Memory backend implementations: examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py (MemantoLayer, Mem0Layer) | MemantoLayer integrates moorcheh_sdk.MoorchehClient with lazy per-user namespace creation; Mem0Layer wraps mem0.Memory with hardcoded LLM/embedder/Qdrant configuration; both track latency and token counts via tiktoken. |
| LLM judge for accuracy scoring: examples/benchmarks/memanto_vs_mem0_persona_shift/judge.py | LLMJudge.evaluate() sends expected state and retrieved context to Groq with JSON-only output format and zero temperature, returning a numeric score and reasoning, with a fallback dict on any exception. |
| Benchmark orchestration and Rich output: examples/benchmarks/memanto_vs_mem0_persona_shift/benchmark.py | Loads environment variables, run_evaluation() ingests all dataset messages except the last, retrieves memory for the final query, and calls LLMJudge; main() runs both memory layers and renders a Rich table showing tokens ingested, tokens retrieved, p95 latency, and accuracy per layer. |
| README documentation: examples/benchmarks/memanto_vs_mem0_persona_shift/README.md | Documents prerequisites, virtualenv setup, how to run benchmark.py, what each module does, and the expected terminal output fields. |
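
The orchestration flow summarized above (ingest every message except the last, query with the last, then score the retrieved context) could be sketched roughly as follows. Only the name `run_evaluation` comes from the walkthrough; its signature, the callables passed in, the keyword-matching judge, and the sample messages are all hypothetical stand-ins, not the PR's actual implementation.

```python
import time


def run_evaluation(add_memory, retrieve_memory, messages, judge):
    """Ingest all but the final message, then retrieve for the final one."""
    *history, final_query = messages
    ingest_latencies = [add_memory(text)["latency"] for text in history]
    retrieved = retrieve_memory(final_query)
    return {
        "ingest_latencies": ingest_latencies,  # kept per call, not summed
        "retrieval_latency": retrieved["latency"],
        "score": judge(retrieved["context"]),
    }


# Hypothetical stand-ins so the sketch runs without any external service.
store = []


def add_memory(text):
    start = time.perf_counter()
    store.append(text)
    return {"latency": time.perf_counter() - start}


def retrieve_memory(query):
    start = time.perf_counter()
    context = store[-1] if store else ""  # naive "most recent wins" recall
    return {"context": context, "latency": time.perf_counter() - start}


def judge(context):
    # Toy judge: a real run would call an LLM instead of keyword matching.
    return 100 if "dramas" in context else 0


messages = [
    "I love action movies.",
    "I'm burnt out; I prefer slow dramas now.",
    "What should I watch tonight?",
]
report = run_evaluation(add_memory, retrieve_memory, messages, judge)
print(report["score"], len(report["ingest_latencies"]))
```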

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Suggested reviewers

  • Xenogents
  • Neelpatel1604

Poem

🐇 Hop hop, the benchmark's set,
Two memory layers in friendly bet!
Persona shifts from dawn to dusk,
The judge scores each with JSON's husk.
A Rich table blooms—who tracks it best?
Memanto hops ahead of the rest! 🏆

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 21.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a benchmarking suite that compares Memanto and Mem0 frameworks for a Shifting Persona scenario. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (1)
examples/benchmarks/memanto_vs_mem0_persona_shift/requirements.txt (1)

1-8: Pin all dependency versions in requirements.txt for reproducible benchmark results.

All packages in this requirements file are unpinned. Without version pins, benchmark accuracy and latency results will vary across runs and machines. Specify exact versions using == for each dependency.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/memanto_vs_mem0_persona_shift/requirements.txt` around
lines 1 - 8, Add version pins to all eight dependencies in the requirements.txt
file to ensure reproducible benchmark results. For each unpinned package
(memanto, mem0ai, openai, python-dotenv, tiktoken, rich, groq, and
sentence-transformers), append `==<version>` where version should be the current
or desired stable version of that package. Use the exact version format operator
to lock each dependency to a specific release.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/.env.example`:
- Around line 1-2: The .env.example file is missing the GROQ_API_KEY environment
variable template entry that the benchmark code actually requires. Add
GROQ_API_KEY with a placeholder value (similar to the existing MOORCHEH_API_KEY
and OPENAI_API_KEY entries) to align the template with the actual runtime
configuration used by the judge and Mem0 config components in the benchmark
code.

In `@examples/benchmarks/memanto_vs_mem0_persona_shift/benchmark.py`:
- Around line 18-26: The current implementation accumulates latencies into a
single total_ingest_latency variable, which prevents proper percentile
computation. Instead of summing latencies into one aggregate value, collect each
individual operation latency measurement from metrics["latency"] into a list
(e.g., ingest_latencies) during the loop over dataset messages. Then compute the
p95 percentile from this complete series of individual latency measurements
using a percentile function (such as numpy.percentile or statistics.quantiles),
rather than attempting to derive the percentile from aggregated totals. Apply
the same fix to the retrieval latency collection at line 42.

In `@examples/benchmarks/memanto_vs_mem0_persona_shift/judge.py`:
- Around line 7-8: The judge is currently treating all runtime failures
(infrastructure issues like missing API keys, transport errors, or malformed
responses) as score=0, which incorrectly penalizes model performance metrics.
Instead of returning a score of 0 on failure, implement proper error handling in
the Groq client initialization (around the client and model setup) and in the
judge evaluation method (lines 37-50) to distinguish between actual judge
execution failures and model performance issues. Return a special failure status
or None value when infrastructure failures occur, rather than defaulting to a
numeric score, so that only actual model evaluation results contribute to
accuracy metrics.

In `@examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py`:
- Around line 20-26: The try-except block that catches ImportError for the
MoorchehClient initialization silently falls back to mock memory by setting
has_real_client to False, which invalidates benchmarks by allowing them to run
with mock data instead of failing when the real integration is unavailable.
Instead of silently catching the ImportError and continuing, allow the
ImportError to propagate or raise an explicit error to fail the benchmark run
immediately and prevent invalid comparisons. Apply this same fix at all affected
locations where similar import fallback patterns exist (the sibling locations at
lines 39-47 and 57-65 likely have the same issue with their respective client
initializations).

In `@examples/benchmarks/memanto_vs_mem0_persona_shift/README.md`:
- Line 50: Update the README documentation at the specified line to accurately
reflect the actual judge model implementation. Replace the reference to `gpt-4o`
with `llama-3.3-70b-versatile` and add clarification that this model is
configured via Groq, ensuring the documentation matches the actual codebase
configuration.

---

Nitpick comments:
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/requirements.txt`:
- Around line 1-8: Add version pins to all eight dependencies in the
requirements.txt file to ensure reproducible benchmark results. For each
unpinned package (memanto, mem0ai, openai, python-dotenv, tiktoken, rich, groq,
and sentence-transformers), append `==<version>` where version should be the
current or desired stable version of that package. Use the exact version format
operator to lock each dependency to a specific release.
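
The judge.py finding above asks that infrastructure failures not be recorded as a score of 0. One possible shape for that, sketched under assumptions: the `JudgeResult` dataclass, the `evaluate` wrapper, and `mean_score` are hypothetical names, and `call_judge` stands in for whatever function performs the actual Groq request and returns the raw JSON string.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JudgeResult:
    score: Optional[float]        # None means "judge failed", not "scored 0"
    reasoning: str
    error: Optional[str] = None


def evaluate(call_judge: Callable[[], str]) -> JudgeResult:
    """Run the judge call and separate failures from genuine low scores."""
    try:
        payload = json.loads(call_judge())
        return JudgeResult(score=float(payload["score"]),
                           reasoning=str(payload.get("reasoning", "")))
    except Exception as exc:  # transport, auth, or malformed-response errors
        return JudgeResult(score=None, reasoning="",
                           error=f"{type(exc).__name__}: {exc}")


def mean_score(results):
    """Average only the runs that the judge actually scored."""
    scored = [r.score for r in results if r.score is not None]
    return sum(scored) / len(scored) if scored else None


ok = evaluate(lambda: '{"score": 80, "reasoning": "matches current state"}')
failed = evaluate(lambda: "not json")
print(mean_score([ok, failed]), failed.error is not None)
```

Excluding failed runs from the average keeps a missing API key or a timeout from looking like a poor memory framework.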

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c264c3eb-aecc-420a-ae59-af7ae4f01223

📥 Commits

Reviewing files that changed from the base of the PR and between 918c68f and 2e7d2d9.

📒 Files selected for processing (7)
  • examples/benchmarks/memanto_vs_mem0_persona_shift/.env.example
  • examples/benchmarks/memanto_vs_mem0_persona_shift/README.md
  • examples/benchmarks/memanto_vs_mem0_persona_shift/benchmark.py
  • examples/benchmarks/memanto_vs_mem0_persona_shift/dataset.py
  • examples/benchmarks/memanto_vs_mem0_persona_shift/judge.py
  • examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py
  • examples/benchmarks/memanto_vs_mem0_persona_shift/requirements.txt

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py`:
- Around line 49-50: The code on line 50 contains unnecessary defensive
programming that silently masks integration failures by converting missing
answer fields to empty strings instead of surfacing real errors. According to
the moorcheh-sdk contract, the answer.generate() method (called on line 49)
always returns a dictionary with an answer string field, so the isinstance check
and getattr fallback are hiding problems rather than handling valid edge cases.
Replace the current defensive code with explicit validation that raises an error
if the response is not a dictionary or lacks the answer key, allowing real
integration failures to propagate instead of producing misleading benchmark
results with empty string defaults.
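
A minimal sketch of the explicit validation this comment describes, assuming the response shape stated above (a dictionary with a string `answer` field). The helper name `extract_answer` is hypothetical, and the sketch does not call the moorcheh-sdk itself; it only checks a response object that has already been returned.

```python
def extract_answer(response):
    """Return the answer string, or raise if the response shape is wrong."""
    if not isinstance(response, dict):
        raise TypeError(
            f"expected dict response, got {type(response).__name__}")
    if "answer" not in response:
        raise KeyError("response is missing the 'answer' field")
    answer = response["answer"]
    if not isinstance(answer, str):
        raise TypeError(
            f"expected 'answer' to be str, got {type(answer).__name__}")
    return answer


print(extract_answer({"answer": "Slow, relaxing dramas."}))
```

Raising here stops the run at the point of failure, so an empty context can never be scored as if it were a real retrieval.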

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8ffb386c-c2ff-4164-bd0c-aaedc34e9b03

📥 Commits

Reviewing files that changed from the base of the PR and between 339eba5 and d243827.

📒 Files selected for processing (1)
  • examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py
