feat: Add Memanto vs Mem0 benchmarking suite for Shifting Persona scenario #746
zaid1234-11 wants to merge 7 commits into
📝 Walkthrough
A new self-contained benchmark example is added under
Changes
Memanto vs Mem0 Persona-Shift Benchmark Suite
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🧹 Nitpick comments (1)
examples/benchmarks/memanto_vs_mem0_persona_shift/requirements.txt (1)
1-8: Pin all dependency versions in requirements.txt for reproducible benchmark results.
All packages in this requirements file are unpinned. Without version pins, benchmark accuracy and latency results can vary across runs and machines. Specify exact versions using `==` for each dependency.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/memanto_vs_mem0_persona_shift/requirements.txt` around lines 1 - 8, Add version pins to all eight dependencies in the requirements.txt file to ensure reproducible benchmark results. For each unpinned package (memanto, mem0ai, openai, python-dotenv, tiktoken, rich, groq, and sentence-transformers), append `==<version>` where version should be the current or desired stable version of that package. Use the exact version format operator to lock each dependency to a specific release.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/.env.example`:
- Around line 1-2: The .env.example file is missing the GROQ_API_KEY environment
variable template entry that the benchmark code actually requires. Add
GROQ_API_KEY with a placeholder value (similar to the existing MOORCHEH_API_KEY
and OPENAI_API_KEY entries) to align the template with the actual runtime
configuration used by the judge and Mem0 config components in the benchmark
code.
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/benchmark.py`:
- Around line 18-26: The current implementation accumulates latencies into a
single total_ingest_latency variable, which prevents proper percentile
computation. Instead of summing latencies into one aggregate value, collect each
individual operation latency measurement from metrics["latency"] into a list
(e.g., ingest_latencies) during the loop over dataset messages. Then compute the
p95 percentile from this complete series of individual latency measurements
using a percentile function (such as numpy.percentile or statistics.quantiles),
rather than attempting to derive the percentile from aggregated totals. Apply
the same fix to the retrieval latency collection at line 42.
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/judge.py`:
- Around line 7-8: The judge is currently treating all runtime failures
(infrastructure issues like missing API keys, transport errors, or malformed
responses) as score=0, which incorrectly penalizes model performance metrics.
Instead of returning a score of 0 on failure, implement proper error handling in
the Groq client initialization (around the client and model setup) and in the
judge evaluation method (lines 37-50) to distinguish between actual judge
execution failures and model performance issues. Return a special failure status
or None value when infrastructure failures occur, rather than defaulting to a
numeric score, so that only actual model evaluation results contribute to
accuracy metrics.
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py`:
- Around line 20-26: The try-except block that catches ImportError for the
MoorchehClient initialization silently falls back to mock memory by setting
has_real_client to False, which invalidates benchmarks by allowing them to run
with mock data instead of failing when the real integration is unavailable.
Instead of silently catching the ImportError and continuing, allow the
ImportError to propagate or raise an explicit error to fail the benchmark run
immediately and prevent invalid comparisons. Apply this same fix at all affected
locations where similar import fallback patterns exist (the sibling locations at
lines 39-47 and 57-65 likely have the same issue with their respective client
initializations).
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/README.md`:
- Line 50: Update the README documentation at the specified line to accurately
reflect the actual judge model implementation. Replace the reference to `gpt-4o`
with `llama-3.3-70b-versatile` and add clarification that this model is
configured via Groq, ensuring the documentation matches the actual codebase
configuration.
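The benchmark.py finding above can be sketched as follows. This is a minimal illustration, not the PR's code: `sample_metrics` is a hypothetical stand-in for the per-message `metrics["latency"]` values, and `p95` is a helper name assumed here. It keeps every sample in a list and takes the percentile from that list using the standard-library `statistics.quantiles`.

```python
import statistics


def p95(latencies):
    """Return the 95th percentile of per-operation latencies (seconds)."""
    if not latencies:
        raise ValueError("no latency samples collected")
    if len(latencies) == 1:
        # quantiles() needs at least two points; one sample is its own p95.
        return latencies[0]
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(latencies, n=100, method="inclusive")[94]


# Hypothetical per-message metrics standing in for the real benchmark output.
sample_metrics = [
    {"latency": 0.10},
    {"latency": 0.12},
    {"latency": 0.11},
    {"latency": 0.95},
]

ingest_latencies = []
for metrics in sample_metrics:
    # Keep each individual measurement instead of adding it to a running total.
    ingest_latencies.append(metrics["latency"])

ingest_p95 = p95(ingest_latencies)
print(f"ingest p95: {ingest_p95:.3f}s over {len(ingest_latencies)} operations")
```

The same list-then-percentile pattern would apply to the retrieval latencies. A total can still be reported with `sum(ingest_latencies)` if the comparison table needs it.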
---
Nitpick comments:
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/requirements.txt`:
- Around line 1-8: Add version pins to all eight dependencies in the
requirements.txt file to ensure reproducible benchmark results. For each
unpinned package (memanto, mem0ai, openai, python-dotenv, tiktoken, rich, groq,
and sentence-transformers), append `==<version>` where version should be the
current or desired stable version of that package. Use the exact version format
operator to lock each dependency to a specific release.
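For the judge.py finding above, one possible shape for separating infrastructure failures from model scores is sketched below. It makes no real Groq call: `safe_judge`, `accuracy`, and `failing_call` are hypothetical names, and the judge call is passed in as a plain callable so the example runs on its own.

```python
from statistics import mean
from typing import Callable, List, Optional


def safe_judge(call_judge: Callable[[], float]) -> Optional[float]:
    """Run one judge call; return None (not 0) when the call itself fails."""
    try:
        score = call_judge()
    except Exception as exc:  # e.g. missing key, transport error, bad JSON
        print(f"judge unavailable, sample skipped: {exc!r}")
        return None
    return float(score)


def accuracy(scores: List[Optional[float]]) -> Optional[float]:
    """Average only the samples the judge actually scored."""
    valid = [s for s in scores if s is not None]
    if not valid:
        return None  # nothing was evaluated, so there is no accuracy to report
    return mean(valid)


def failing_call() -> float:
    # Simulates an infrastructure failure rather than a wrong answer.
    raise ConnectionError("simulated transport failure")


scores = [safe_judge(lambda: 1.0), safe_judge(failing_call), safe_judge(lambda: 0.0)]
result = accuracy(scores)
print(f"accuracy over scored samples: {result}")
```

Reporting the number of skipped samples next to the accuracy figure would make it visible when a run was partly unjudged.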
ℹ️ Review info
📒 Files selected for processing (7)
- examples/benchmarks/memanto_vs_mem0_persona_shift/.env.example
- examples/benchmarks/memanto_vs_mem0_persona_shift/README.md
- examples/benchmarks/memanto_vs_mem0_persona_shift/benchmark.py
- examples/benchmarks/memanto_vs_mem0_persona_shift/dataset.py
- examples/benchmarks/memanto_vs_mem0_persona_shift/judge.py
- examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py
- examples/benchmarks/memanto_vs_mem0_persona_shift/requirements.txt
…uto-create namespaces on demand
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py`:
- Around line 49-50: The code on line 50 contains unnecessary defensive
programming that silently masks integration failures by converting missing
answer fields to empty strings instead of surfacing real errors. According to
the moorcheh-sdk contract, the answer.generate() method (called on line 49)
always returns a dictionary with an answer string field, so the isinstance check
and getattr fallback are hiding problems rather than handling valid edge cases.
Replace the current defensive code with explicit validation that raises an error
if the response is not a dictionary or lacks the answer key, allowing real
integration failures to propagate instead of producing misleading benchmark
results with empty string defaults.
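The explicit validation requested above might look like the sketch below. It assumes only what the comment states, namely that the response is a dictionary with an `answer` string field. `extract_answer` is a hypothetical helper name, and no moorcheh-sdk call is made here.

```python
def extract_answer(response):
    """Return the answer string, raising instead of defaulting to ''."""
    if not isinstance(response, dict):
        raise TypeError(f"expected dict response, got {type(response).__name__}")
    if "answer" not in response:
        raise KeyError("response has no 'answer' field")
    answer = response["answer"]
    if not isinstance(answer, str):
        raise TypeError(f"expected 'answer' to be str, got {type(answer).__name__}")
    return answer


# A well-formed response passes through unchanged.
print(extract_answer({"answer": "slow, relaxing dramas"}))

# A malformed response now surfaces as an error instead of an empty string.
try:
    extract_answer({"text": "no answer key"})
except KeyError as exc:
    print(f"integration failure surfaced: {exc}")
```

With this in place, a broken integration stops the run rather than feeding empty context to the judge.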
ℹ️ Review info
📒 Files selected for processing (1)
examples/benchmarks/memanto_vs_mem0_persona_shift/memory_layers.py
…ws console print encodings
Description
This PR introduces a reproducible benchmarking suite that evaluates the Accuracy vs. Resource Footprint of two agentic memory frameworks: Memanto and Mem0.
It implements a Shifting Persona scenario, which stress-tests how each framework adapts when a user’s preferences change dynamically over multiple sessions (e.g., transitioning from preferring action movies to getting burnt out and preferring slow, relaxing dramas).
The benchmark measures:
Llama-3.3-70b) comparing retrieved context against the true current state.
What was added
All code is located in the new folder `examples/benchmarks/memanto_vs_mem0_persona_shift/`:
- `benchmark.py`: Main orchestrator running the multi-session interaction dataset, measuring metrics, and outputting the final comparison table.
- `dataset.py`: The shifting persona dataset containing evolving user dialogues and the expected final state.
- `memory_layers.py`: Unified adapters for both `MemantoLayer` (interfacing with Moorcheh SDK/API) and `Mem0Layer` (interfacing with local Qdrant/HF embeddings) for a 1:1 comparison.
- `judge.py`: Robust LLM judge class utilizing JSON formatting to evaluate retrieved context accuracy.
- `README.md` & `requirements.txt`: Clear instructions for installing dependencies and reproducing the benchmark.
Benchmark Run Summary
Evaluating both layers using a Groq `llama-3.3-70b-versatile` judge yielded the following result:
Observations:
Verification Plan
Automated Verification
Run the benchmark locally using the Python virtual environment: