feat: Add Memanto vs Mem0 benchmarking suite for Shifting Persona scenario by zaid1234-11 · Pull Request #746 · moorcheh-ai/memanto

zaid1234-11 · 2026-06-15T21:16:27Z

Description

This PR introduces a reproducible benchmarking suite that evaluates the Accuracy vs. Resource Footprint of two agentic memory frameworks: Memanto and Mem0.

It implements a Shifting Persona scenario, which stress-tests how each framework adapts when a user’s preferences change dynamically over multiple sessions (e.g., transitioning from preferring action movies to getting burnt out and preferring slow, relaxing dramas).

The benchmark measures:

Accuracy Score (0-100): Scored by an LLM-as-a-judge (llama-3.3-70b-versatile via Groq) that compares the retrieved context against the user's true current state.
p95 Latency (s): 95th-percentile end-to-end time for lookup and retrieval.
Token Efficiency: Tokens ingested versus tokens retrieved.
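
As a rough illustration of the p95 metric, the sketch below computes a nearest-rank 95th percentile from a list of per-operation latencies. The helper name `p95` and the sample values are made up for this example and are not taken from `benchmark.py`; the point is only that a percentile needs the individual measurements, not a summed total.

```python
def p95(latencies):
    """Nearest-rank 95th percentile of per-operation latencies, in seconds."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    # 1-based nearest-rank index: ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical per-call latencies; one slow outlier dominates the tail.
samples = [0.8, 1.1, 0.9, 4.0, 1.0]
print(p95(samples))
```

With only a handful of samples the p95 is effectively the slowest call, so a benchmark with few operations per layer should be read with that in mind.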

What was added

All code is located in the new folder examples/benchmarks/memanto_vs_mem0_persona_shift/:

benchmark.py: Main orchestrator running the multi-session interaction dataset, measuring metrics, and outputting the final comparison table.
dataset.py: The shifting persona dataset containing evolving user dialogues and the expected final state.
memory_layers.py: Unified adapters for both MemantoLayer (interfacing with Moorcheh SDK/API) and Mem0Layer (interfacing with local Qdrant/HF embeddings) for a 1:1 comparison.
judge.py: LLM judge class that requests JSON-formatted output to score the accuracy of the retrieved context.
README.md & requirements.txt: Clear instructions for installing dependencies and reproducing the benchmark.
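
To make the "unified adapter" idea concrete, here is a minimal sketch of what a shared interface with per-call latency tracking could look like. The method names `add_memory` and `retrieve_memory` come from the review walkthrough; the argument lists, the returned dict shape, and the `InMemoryLayer` class are assumptions for illustration and are not the code in `memory_layers.py`.

```python
import time
from abc import ABC, abstractmethod


class BaseMemoryLayer(ABC):
    """Contract both adapters would satisfy (signatures are assumed)."""

    @abstractmethod
    def add_memory(self, user_id: str, text: str) -> dict:
        """Store one message and return metrics such as {'latency': seconds}."""

    @abstractmethod
    def retrieve_memory(self, user_id: str, query: str) -> dict:
        """Return {'context': str, 'latency': seconds} for the query."""


class InMemoryLayer(BaseMemoryLayer):
    """Toy stand-in (neither Memanto nor Mem0) that only exercises the interface."""

    def __init__(self):
        self._store = {}

    def add_memory(self, user_id, text):
        start = time.perf_counter()
        self._store.setdefault(user_id, []).append(text)
        return {"latency": time.perf_counter() - start}

    def retrieve_memory(self, user_id, query):
        start = time.perf_counter()
        # No ranking here: the toy layer returns everything it has stored.
        context = "\n".join(self._store.get(user_id, []))
        return {"context": context, "latency": time.perf_counter() - start}


layer = InMemoryLayer()
layer.add_memory("u1", "I love action movies.")
layer.add_memory("u1", "Lately I prefer slow, relaxing dramas.")
result = layer.retrieve_memory("u1", "What should I watch tonight?")
print(result["context"])
```

Because the orchestrator only talks to the abstract interface, each backend can be swapped in without changing how latency is measured.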

Benchmark Run Summary

Evaluating both layers using a Groq llama-3.3-70b-versatile judge yielded the following result:

| Framework | Accuracy Score (0-100) | p95 Latency (s) | Total Tokens Ingested |
| --- | --- | --- | --- |
| Memanto | 80 | 4.11 | 114 |
| Mem0 | 40 | 59.55 | 114 |

Observations:

Memanto successfully adapted to the new user state, returning context relevant to the active preference.
Mem0 suffered from embedding overlap, returning contradictory statements (both the old and new preferences), leading to a lower accuracy score.
Memanto's p95 retrieval latency was roughly 14x lower (4.11 s vs. 59.55 s) because it avoids the heavy background LLM updates required by Mem0's vector pipeline.

Verification Plan

Automated Verification

Run the benchmark locally inside a Python virtual environment:

cd examples/benchmarks/memanto_vs_mem0_persona_shift
pip install -r requirements.txt
python benchmark.py

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added a reproducible benchmark suite comparing memory systems for “shifting persona & temporal tracking,” including token counting, latency tracking, and AI-based retrieval accuracy scoring.

* **Documentation**
  * Added an end-to-end setup and run guide with prerequisites, configuration steps, and expected output.

* **Chores**
  * Updated the benchmark environment template to include required API key placeholders (including an additional Groq key).
  * Added/updated benchmark dependency listings.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

coderabbitai · 2026-06-15T21:16:40Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

A new self-contained benchmark example is added under examples/benchmarks/memanto_vs_mem0_persona_shift/. It includes a shifting-persona dataset, abstract and concrete memory layer wrappers for Memanto and Mem0 backends, a Groq-powered LLM judge, an orchestration script that produces a Rich metrics table, dependency and environment configuration files, and a README.

Changes

Memanto vs Mem0 Persona-Shift Benchmark Suite

| Layer | File(s) | Summary |
| --- | --- | --- |
| Dataset, env config, and dependencies | `examples/benchmarks/memanto_vs_mem0_persona_shift/dataset.py`, `.env.example`, `requirements.txt` | Defines `SHIFTING_PERSONA_DATASET` and `EXPECTED_STATE` constants, API key placeholders for `MOORCHEH_API_KEY`, `OPENAI_API_KEY`, and `GROQ_API_KEY`, and eight Python package dependencies including `memanto`, `mem0ai`, `groq`, and `tiktoken`. |
| Memory layer abstraction | `examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py` (BaseMemoryLayer) | Introduces `BaseMemoryLayer` abstract interface with `add_memory()` and `retrieve_memory()` methods, establishing the contract for latency and token metric tracking. |
| Memory backend implementations | `examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py` (MemantoLayer, Mem0Layer) | `MemantoLayer` integrates `moorcheh_sdk.MoorchehClient` with lazy per-user namespace creation; `Mem0Layer` wraps `mem0.Memory` with hardcoded LLM/embedder/Qdrant configuration; both track latency and token counts via `tiktoken`. |
| LLM judge for accuracy scoring | `examples/benchmarks/memanto_vs_mem0_persona_shift/judge.py` | `LLMJudge.evaluate()` sends expected state and retrieved context to Groq with JSON-only output format and zero temperature, returning a numeric score and reasoning, with a fallback dict on any exception. |
| Benchmark orchestration and Rich output | `examples/benchmarks/memanto_vs_mem0_persona_shift/benchmark.py` | Loads environment variables; `run_evaluation()` ingests all dataset messages except the last, retrieves memory for the final query, and calls `LLMJudge`; `main()` runs both memory layers and renders a Rich table showing tokens ingested, tokens retrieved, p95 latency, and accuracy per layer. |
| README documentation | `examples/benchmarks/memanto_vs_mem0_persona_shift/README.md` | Documents prerequisites, virtualenv setup, how to run `benchmark.py`, what each module does, and the expected terminal output fields. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

[BOUNTY $100] 🐜 The Great Agentic Memory Showdown: Memanto Benchmarking & Evaluation Challenge #639: This PR directly implements Scenario B (Shifting Persona & Temporal Tracking) with all specified components: Memanto vs Mem0 comparison, LLM-judge accuracy scoring, token/latency metrics, reproducible .env setup, and delivery under /examples/benchmarks/.

Suggested reviewers

Xenogents
Neelpatel1604

Poem

🐇 Hop hop, the benchmark's set,
Two memory layers in friendly bet!
Persona shifts from dawn to dusk,
The judge scores each with JSON's husk.
A Rich table blooms—who tracks it best?
Memanto hops ahead of the rest! 🏆

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 21.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a benchmarking suite that compares Memanto and Mem0 frameworks for a Shifting Persona scenario. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

examples/benchmarks/memanto_vs_mem0_persona_shift/requirements.txt (1)
1-8: Pin all dependency versions in requirements.txt for reproducible benchmark results.

All packages in this requirements file are unpinned. Without version pins, benchmark accuracy and latency results will vary across runs and machines. Specify exact versions using == for each dependency.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/memanto_vs_mem0_persona_shift/requirements.txt` around
lines 1 - 8, Add version pins to all eight dependencies in the requirements.txt
file to ensure reproducible benchmark results. For each unpinned package
(memanto, mem0ai, openai, python-dotenv, tiktoken, rich, groq, and
sentence-transformers), append `==<version>` where version should be the current
or desired stable version of that package. Use the exact version format operator
to lock each dependency to a specific release.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/.env.example`:
- Around line 1-2: The .env.example file is missing the GROQ_API_KEY environment
variable template entry that the benchmark code actually requires. Add
GROQ_API_KEY with a placeholder value (similar to the existing MOORCHEH_API_KEY
and OPENAI_API_KEY entries) to align the template with the actual runtime
configuration used by the judge and Mem0 config components in the benchmark
code.

In `@examples/benchmarks/memanto_vs_mem0_persona_shift/benchmark.py`:
- Around line 18-26: The current implementation accumulates latencies into a
single total_ingest_latency variable, which prevents proper percentile
computation. Instead of summing latencies into one aggregate value, collect each
individual operation latency measurement from metrics["latency"] into a list
(e.g., ingest_latencies) during the loop over dataset messages. Then compute the
p95 percentile from this complete series of individual latency measurements
using a percentile function (such as numpy.percentile or statistics.quantiles),
rather than attempting to derive the percentile from aggregated totals. Apply
the same fix to the retrieval latency collection at line 42.

In `@examples/benchmarks/memanto_vs_mem0_persona_shift/judge.py`:
- Around line 7-8: The judge is currently treating all runtime failures
(infrastructure issues like missing API keys, transport errors, or malformed
responses) as score=0, which incorrectly penalizes model performance metrics.
Instead of returning a score of 0 on failure, implement proper error handling in
the Groq client initialization (around the client and model setup) and in the
judge evaluation method (lines 37-50) to distinguish between actual judge
execution failures and model performance issues. Return a special failure status
or None value when infrastructure failures occur, rather than defaulting to a
numeric score, so that only actual model evaluation results contribute to
accuracy metrics.

In `@examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py`:
- Around line 20-26: The try-except block that catches ImportError for the
MoorchehClient initialization silently falls back to mock memory by setting
has_real_client to False, which invalidates benchmarks by allowing them to run
with mock data instead of failing when the real integration is unavailable.
Instead of silently catching the ImportError and continuing, allow the
ImportError to propagate or raise an explicit error to fail the benchmark run
immediately and prevent invalid comparisons. Apply this same fix at all affected
locations where similar import fallback patterns exist (the sibling locations at
lines 39-47 and 57-65 likely have the same issue with their respective client
initializations).

In `@examples/benchmarks/memanto_vs_mem0_persona_shift/README.md`:
- Line 50: Update the README documentation at the specified line to accurately
reflect the actual judge model implementation. Replace the reference to `gpt-4o`
with `llama-3.3-70b-versatile` and add clarification that this model is
configured via Groq, ensuring the documentation matches the actual codebase
configuration.

---

Nitpick comments:
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/requirements.txt`:
- Around line 1-8: Add version pins to all eight dependencies in the
requirements.txt file to ensure reproducible benchmark results. For each
unpinned package (memanto, mem0ai, openai, python-dotenv, tiktoken, rich, groq,
and sentence-transformers), append `==<version>` where version should be the
current or desired stable version of that package. Use the exact version format
operator to lock each dependency to a specific release.
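
For the `judge.py` finding above, one possible shape of the fix is sketched below: infrastructure failures produce a status of `"error"` with a score of `None`, and only successful evaluations feed the accuracy figure. The function names, the injected `call_judge` callable, and the result dict layout are hypothetical and chosen for this illustration; no real Groq client call is shown.

```python
import json
from statistics import mean


def evaluate(call_judge, expected_state, retrieved_context):
    """Run the judge and keep infrastructure failures separate from scores."""
    try:
        raw = call_judge(expected_state, retrieved_context)
        parsed = json.loads(raw)
        score = float(parsed["score"])
    except Exception as exc:  # transport errors, bad JSON, missing fields
        return {"status": "error", "score": None, "reasoning": str(exc)}
    return {"status": "ok", "score": score, "reasoning": parsed.get("reasoning", "")}


def accuracy(results):
    """Average only the runs where the judge actually produced a score."""
    scores = [r["score"] for r in results if r["status"] == "ok"]
    return mean(scores) if scores else None


# Hypothetical judges standing in for the real API call.
def good_judge(expected, context):
    return '{"score": 80, "reasoning": "matches current preference"}'


def broken_judge(expected, context):
    raise ConnectionError("transport failure")


results = [
    evaluate(good_judge, "prefers dramas", "prefers dramas"),
    evaluate(broken_judge, "prefers dramas", ""),
]
print(accuracy(results))
```

Returning `None` instead of 0 means a missing API key or a timeout shows up as a failed run rather than as a low accuracy score for the memory layer.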

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c264c3eb-aecc-420a-ae59-af7ae4f01223

📥 Commits

Reviewing files that changed from the base of the PR and between 918c68f and 2e7d2d9.

📒 Files selected for processing (7)

examples/benchmarks/memanto_vs_mem0_persona_shift/.env.example
examples/benchmarks/memanto_vs_mem0_persona_shift/README.md
examples/benchmarks/memanto_vs_mem0_persona_shift/benchmark.py
examples/benchmarks/memanto_vs_mem0_persona_shift/dataset.py
examples/benchmarks/memanto_vs_mem0_persona_shift/judge.py
examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py
examples/benchmarks/memanto_vs_mem0_persona_shift/requirements.txt

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py`:
- Around line 49-50: The code on line 50 contains unnecessary defensive
programming that silently masks integration failures by converting missing
answer fields to empty strings instead of surfacing real errors. According to
the moorcheh-sdk contract, the answer.generate() method (called on line 49)
always returns a dictionary with an answer string field, so the isinstance check
and getattr fallback are hiding problems rather than handling valid edge cases.
Replace the current defensive code with explicit validation that raises an error
if the response is not a dictionary or lacks the answer key, allowing real
integration failures to propagate instead of producing misleading benchmark
results with empty string defaults.
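
The explicit validation requested above could look roughly like the following. The helper name `extract_answer` is hypothetical, and the assumption that the response is a dict carrying a string under `answer` is taken from the review comment rather than verified against the SDK.

```python
def extract_answer(response):
    """Return the answer string, raising if the response shape is unexpected."""
    if not isinstance(response, dict):
        raise TypeError(f"expected dict response, got {type(response).__name__}")
    if "answer" not in response:
        raise KeyError("response is missing the 'answer' key")
    answer = response["answer"]
    if not isinstance(answer, str):
        raise TypeError(f"expected 'answer' to be str, got {type(answer).__name__}")
    return answer


# Made-up response used only to show the happy path.
print(extract_answer({"answer": "The user currently prefers slow dramas."}))
```

Raising here lets an integration failure stop the benchmark instead of being scored as an empty retrieval.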

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8ffb386c-c2ff-4164-bd0c-aaedc34e9b03

📥 Commits

Reviewing files that changed from the base of the PR and between 339eba5 and d243827.

📒 Files selected for processing (1)

examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py

feat: Add Memanto vs Mem0 benchmarking suite for Shifting Persona scenario

2e7d2d9

coderabbitai Bot reviewed Jun 15, 2026

zaid1234-11 added 5 commits June 16, 2026 02:56

fix: Add GROQ_API_KEY to .env.example

ae6bf16

fix: Collect individual operation latencies to compute correct p95 percentile

a953a6b

fix: Distinguish judge runtime failures from actual score evaluation

339eba5

fix: Do not silently fall back to mock memory in benchmark mode and auto-create namespaces on demand

d243827

docs: Update README.md to match Groq judge configuration

85da7d4

coderabbitai Bot reviewed Jun 15, 2026

Comment thread examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py Outdated

fix: Validate Moorcheh SDK answer response structure and handle Windows console print encodings

a6bb7da

zaid1234-11 wants to merge 7 commits into moorcheh-ai:main from zaid1234-11:memanto-vs-mem0-benchmark
