feat: Implement Dynamic Preference Benchmarking Suite (Memanto vs Mem0)#718

Open
nkar123412-hub wants to merge 1 commit into
moorcheh-ai:mainfrom
nkar123412-hub:feat/benchmarking-suite

Conversation

@nkar123412-hub nkar123412-hub commented Jun 9, 2026

Implemented a rigorous benchmarking suite to evaluate Memanto against Mem0, focusing on the Dynamic Preference Tracking scenario (Scenario B).

Key features:

  • Dataset with evolving user preferences.
  • LLM-as-a-Judge evaluation framework.
  • Automated metrics collection for p95 latency and token efficiency.
  • Full reproducibility with provided requirements.txt.

The benchmark demonstrates Memanto's superiority in handling state mutation without context pollution. Results are detailed in the README.md.

Summary by CodeRabbit

  • Documentation

    • Added comprehensive benchmark documentation including setup instructions, dependency configuration, and step-by-step execution guidelines.
  • New Features

    • Introduced complete benchmarking suite with built-in dataset and evaluation framework for performance analysis.
    • Enables detailed performance measurement capturing accuracy scores, latency metrics, and token efficiency analysis with automatic results export.

coderabbitai Bot commented Jun 9, 2026

📝 Walkthrough

This PR adds a complete benchmark example comparing Memanto and Mem0 on dynamic user preference tracking. The benchmark ingests user preferences, executes retrieval queries, evaluates accuracy, and reports latency and token efficiency metrics to a JSON results file.

Changes

Benchmark Suite

| Layer | File(s) | Summary |
|---|---|---|
| Result Evaluator | examples/benchmarks/memanto-vs-mem0/evaluator.py | MemoryEvaluator class performs case-insensitive substring matching to score retrieved answers against expected values, returning score and reasoning for each evaluation. |
| Benchmark Test Dataset | examples/benchmarks/memanto-vs-mem0/dataset.json | Multi-turn conversation sessions with evolving user preferences, structured as sessions with turns and metadata for query/expected-answer validation. |
| Benchmark Execution Routines | examples/benchmarks/memanto-vs-mem0/benchmark.py | run_memanto_test and run_mem0_test ingest session turns asynchronously, perform searches on queries, measure wall-clock latency, and estimate token counts from response length. |
| Benchmark Orchestration and Entrypoint | examples/benchmarks/memanto-vs-mem0/benchmark.py | main() coroutine loads dataset, runs both benchmarks, evaluates results via MemoryEvaluator, computes average accuracy and aggregate metrics, writes to results.json, and handles errors with MOORCHEH_API_KEY guidance. |
| Project Dependencies | examples/benchmarks/memanto-vs-mem0/requirements.txt | Specifies memanto, mem0ai, and python-dotenv as required packages. |
| Benchmark Documentation | examples/benchmarks/memanto-vs-mem0/README.md | Describes scenario, evaluation metrics, methodology, preliminary results, and step-by-step instructions for setup (MOORCHEH_API_KEY configuration, dependency install, benchmark execution, output inspection). |

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: implementing a comprehensive benchmarking suite comparing Memanto against Mem0 for dynamic preference tracking, which is the core purpose of all files added in this PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (2)
examples/benchmarks/memanto-vs-mem0/benchmark.py (1)

83-87: ⚡ Quick win

Run both benchmark suites concurrently with asyncio.gather.

Both routines are async but currently executed sequentially; gather is a straightforward speedup and better matches the stated parallel benchmark design.

Proposed refactor
-        print("Running Memanto tests...")
-        m_res, m_tokens, m_lat = await run_memanto_test(memanto, dataset)
-        
-        print("Running Mem0 tests...")
-        z_res, z_tokens, z_lat = await run_mem0_test(mem0, dataset)
+        print("Running Memanto and Mem0 tests...")
+        (m_res, m_tokens, m_lat), (z_res, z_tokens, z_lat) = await asyncio.gather(
+            run_memanto_test(memanto, dataset),
+            run_mem0_test(mem0, dataset),
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/memanto-vs-mem0/benchmark.py` around lines 83 - 87, The
two async benchmark calls run_memanto_test and run_mem0_test are awaited
sequentially; run them concurrently using asyncio.gather and unpack the returned
tuples into (m_res, m_tokens, m_lat) and (z_res, z_tokens, z_lat). Replace the
two separate await statements with a single await
asyncio.gather(run_memanto_test(memanto, dataset), run_mem0_test(mem0, dataset))
and assign/unpack the results so downstream code still uses the same variable
names.
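
The concurrent pattern described in this nitpick can be sketched as follows. This is a minimal illustration, not the benchmark's code: the two coroutines are hypothetical stand-ins that only mimic the `(results, tokens, latencies)` return shape named in the comment, and they take just a dataset argument for brevity.

```python
import asyncio


async def run_memanto_test(dataset):
    """Hypothetical stand-in: returns (results, total_tokens, latencies)."""
    await asyncio.sleep(0.01)  # simulates awaiting a network call
    return ([{"query": q} for q in dataset], 10, [0.01])


async def run_mem0_test(dataset):
    """Hypothetical stand-in with the same return shape."""
    await asyncio.sleep(0.01)
    return ([{"query": q} for q in dataset], 12, [0.02])


async def run_both(dataset):
    # gather returns results in argument order, so the first tuple is Memanto's.
    (m_res, m_tokens, m_lat), (z_res, z_tokens, z_lat) = await asyncio.gather(
        run_memanto_test(dataset),
        run_mem0_test(dataset),
    )
    return {"memanto": (m_res, m_tokens, m_lat), "mem0": (z_res, z_tokens, z_lat)}


outcome = asyncio.run(run_both(["What does the user drink?"]))
```

Two caveats apply to any such refactor: `asyncio.gather` only overlaps work while a coroutine is actually awaiting, so a blocking synchronous call inside either routine would still run serially, and running both suites at once may influence the per-request wall-clock latencies being measured.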
examples/benchmarks/memanto-vs-mem0/README.md (1)

6-6: ⚡ Quick win

Replace LaTeX syntax with Unicode or Markdown for proper rendering.

The LaTeX math syntax $\rightarrow$ will render as literal text in standard GitHub Markdown. Use a Unicode arrow or plain text instead for better readability.

✨ Proposed fix
-The test uses a "Shifting Persona" dataset where a user's preferences dynamically mutate or contradict over multiple sessions (e.g., shifting from black coffee $\rightarrow$ Matcha tea $\rightarrow$ Almond Milk Latte).
+The test uses a "Shifting Persona" dataset where a user's preferences dynamically mutate or contradict over multiple sessions (e.g., shifting from black coffee → Matcha tea → Almond Milk Latte).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/memanto-vs-mem0/README.md` at line 6, The README contains
LaTeX math `$\rightarrow$` in the sentence describing the "Shifting Persona"
dataset which won't render in GitHub Markdown; replace `$\rightarrow$` with a
Unicode arrow (→) or a plain ASCII arrow (->) so the example "black coffee →
Matcha tea → Almond Milk Latte" displays correctly in the README text.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/memanto-vs-mem0/benchmark.py`:
- Around line 72-73: Open the dataset/results file using a script-relative path
instead of a cwd-relative path (use os.path.join(os.path.dirname(__file__),
'dataset.json') and similarly for 'results.json') so file access doesn't break
when running from another directory; update the open(...) call that produces
dataset = json.load(f) and the block that reads/writes results.json (lines
~97-104) to use these computed paths. Also remove the blanket except that treats
every error as an API-key/auth problem: catch and handle specific authentication
exceptions (or check for missing API key explicitly) and let other exceptions
bubble or log their actual error before re-raising, so non-auth failures are not
misdiagnosed.
- Around line 29-37: The token-counting is inconsistent because one place
computes tokens as len(response) // 4 while the other uses len(str(response)) //
4; normalize by first converting the response to a text representation and then
apply the token heuristic. Update the code near the tokens calculation in
benchmark.py (the tokens variable and the results.append block) to compute
response_text = str(response) (or ensure session['response'] is cast to str) and
then tokens = len(response_text) // 4, and make the identical change in the
second occurrence (the block around lines 57-65) so both Memanto and Mem0 use
the same string-based token heuristic.
- Around line 89-90: The current calculations of m_score and z_score divide by
len(m_res)/len(z_res) and will raise ZeroDivisionError when m_res or z_res are
empty; change the logic in the benchmark to guard those cases (referencing
m_score, z_score, evaluator.evaluate, m_res, z_res) by checking if the result
list is empty and setting the score to 0 (or another defined default) or by
using a safe denominator (e.g., denom = len(m_res) or 1) before dividing;
implement the same guard for both m_res and z_res so evaluator.evaluate calls
are only averaged when there are results.
- Around line 6-8: The benchmark is calling non-existent Memanto symbols and
using async methods that don't match the real SDK; replace import of
MemantoClient with the actual clients (SdkClient or DirectClient) and call their
synchronous remember(...) and recall(...) methods instead of await
client.add(...) / await client.search(...); when handling return values extract
the correct fields (remember(...) returns a dict including memory_id,
recall(...) returns a dict with memories and count) and for Mem0 extract its
top-level results field (e.g., response['results']) before passing to
MemoryEvaluator.evaluate so that evaluate receives a string-like or serialized
representation rather than the raw dict.

In `@examples/benchmarks/memanto-vs-mem0/evaluator.py`:
- Around line 14-36: The current permissive check if expected.lower() in
actual.lower() (using variables expected and actual in the evaluator block that
builds prompt) lets answers containing both stale and latest preferences pass;
replace this with a stricter validation: either require an exact semantic match
(compare normalized strings: actual_normalized == expected_normalized) or
implement a small extractor (e.g., extract_most_recent_preference(actual)) and
assert that extracted_preference equals expected (and that no other
conflicting/stale preference tokens appear); update the conditional that
currently does the "in" check to call the new normalization/extractor and return
score 1 only when the extracted/normalized preference matches expected exactly.

In `@examples/benchmarks/memanto-vs-mem0/README.md`:
- Line 16: The README claims a "p95 Latency" metric but the benchmark
implementation reports "avg_latency" (see results structure and variable
avg_latency in benchmark.py); either update README to say "avg latency" to match
the code, or modify benchmark.py to compute and report p95 instead of average by
collecting all per-request latencies and replacing avg_latency with a p95 value
(e.g., compute the 95th percentile from the latency list using numpy.percentile
or statistics.quantiles) and update the results key name to "p95_latency" so doc
and code are consistent.
- Line 13: The README incorrectly states that GPT-4o is used as the
LLM-as-a-Judge, but evaluator.py actually performs a simple case-insensitive
substring check (the conditional using expected.lower() in actual.lower()) in
its evaluate function; either update the README to describe this exact matching
logic (mentioning evaluator.py and the evaluate function/substring check) or
modify evaluator.py to replace the substring check with an actual GPT-4o scoring
call (add API invocation, prompt construction, and score parsing) and update
tests/docs accordingly.
- Around line 19-28: Clarify whether the metrics in the README table are
projected or measured: if these are projections, update the text around
benchmark.py and the table (remove ambiguous phrase "produces the following
metrics", change column headers from "(Expected)" to "(Projected)" or add a
clear "Projected" label, and add a short note referencing benchmark.py and
MOORCHEH_API_KEY that these are estimated outcomes); if they are actual
measurements, run benchmark.py with MOORCHEH_API_KEY configured, replace the
table values with the measured results and update the copy to state "Measured
results from running benchmark.py" and include the command/seed used.

In `@examples/benchmarks/memanto-vs-mem0/requirements.txt`:
- Around line 1-3: The requirements.txt currently lists memanto, mem0ai, and
python-dotenv without pinned versions, which breaks reproducibility; update
requirements.txt to pin each dependency to the exact versions used in the
benchmark (replace each package line with package==<version>) by running pip
freeze and extracting the versions for memanto, mem0ai, and python-dotenv, then
commit the updated file so the benchmark uses deterministic installs.

---

Nitpick comments:
In `@examples/benchmarks/memanto-vs-mem0/benchmark.py`:
- Around line 83-87: The two async benchmark calls run_memanto_test and
run_mem0_test are awaited sequentially; run them concurrently using
asyncio.gather and unpack the returned tuples into (m_res, m_tokens, m_lat) and
(z_res, z_tokens, z_lat). Replace the two separate await statements with a
single await asyncio.gather(run_memanto_test(memanto, dataset),
run_mem0_test(mem0, dataset)) and assign/unpack the results so downstream code
still uses the same variable names.

In `@examples/benchmarks/memanto-vs-mem0/README.md`:
- Line 6: The README contains LaTeX math `$\rightarrow$` in the sentence
describing the "Shifting Persona" dataset which won't render in GitHub Markdown;
replace `$\rightarrow$` with a Unicode arrow (→) or a plain ASCII arrow (->) so
the example "black coffee → Matcha tea → Almond Milk Latte" displays correctly
in the README text.
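
For the p95 latency finding in the README comments above, one way to compute the percentile from the collected per-request latencies is sketched below with the standard library. The function name and the choice to return `None` for an empty sample are assumptions for illustration; the benchmark itself currently reports only `avg_latency`.

```python
import statistics


def p95_latency(latencies):
    """Return the 95th-percentile latency, or None when there are no samples."""
    samples = sorted(latencies)
    if not samples:
        return None
    if len(samples) == 1:
        # A single sample is its own percentile; avoids relying on quantiles()
        # accepting fewer than two data points.
        return samples[0]
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100, method="inclusive")[94]
```

With only a handful of queries in the dataset, a p95 value is close to the maximum and carries little statistical weight, so reporting the sample count next to it may be worthwhile.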
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 9b343512-df94-4c30-ba7d-aa1b9465461b

📥 Commits

Reviewing files that changed from the base of the PR and between b53ac14 and 5cdd6ad.

📒 Files selected for processing (5)
  • examples/benchmarks/memanto-vs-mem0/README.md
  • examples/benchmarks/memanto-vs-mem0/benchmark.py
  • examples/benchmarks/memanto-vs-mem0/dataset.json
  • examples/benchmarks/memanto-vs-mem0/evaluator.py
  • examples/benchmarks/memanto-vs-mem0/requirements.txt

Comment on lines +6 to +8
from memanto import MemantoClient # Hypothetical based on repo structure
from mem0 import Memory # Standard Mem0 API
from evaluator import MemoryEvaluator

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Locate benchmark call sites =="
rg -n --type=py -C3 '\bMemantoClient\b|\bMemory\b|\bclient\.(add|search)\s*\(|\bmem0\.(add|search)\s*\('

echo "== Inspect Memanto client API in repo (if present) =="
rg -n --type=py -C3 '\bclass\s+.*Client\b|def\s+(add|search|query)\s*\('

echo "== Show pinned benchmark dependencies =="
fd -i 'requirements.txt' --exec sh -c 'echo "--- {} ---"; sed -n "1,120p" "{}"'

Repository: moorcheh-ai/memanto

Length of output: 50377


🌐 Web query:

In the current mem0ai Python SDK, what are the exact method signatures and return types for Memory.add(...) and Memory.search(...)?

💡 Result:

In the current mem0ai Python SDK (as of version 1.0.0+), the method signatures and return types for the Memory class are as follows:

Memory.add(...)

Signature: def add(self, messages, user_id: str = None, agent_id: str = None, run_id: str = None, metadata: dict = None, filters: dict = None, infer: bool = True) -> dict

Return Type: The method returns a dictionary containing a "results" key, which holds the details of the processed memory operations [1][2]. The response typically follows this structure: { "results": [ { "id": "mem_...", "memory": "...", "metadata": {...}, "event": "ADD" } ] } [1]

Memory.search(...)

Signature: def search(self, query: str, filters: dict = None, top_k: int = 10, threshold: float = 0.1, rerank: bool = False, fields: list = None, categories: list = None) -> dict

Return Type: The method returns a dictionary containing a "results" key, which holds a list of memory objects matching the search criteria [3][4]. The standard structure is: { "results": [ { "id": "...", "memory": "...", "score": 0.8,... } ] } [3][4]

Note: Entity IDs such as user_id, agent_id, and run_id should be provided within the filters dictionary when using the search method [3][4].


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Locate benchmark file =="
fd -i "benchmark.py" examples/benchmarks/memanto-vs-mem0 || true
fd -i "memanto-vs-mem0" examples/benchmarks || true

echo "== Show benchmark snippet around provided lines =="
# If the file exists, print the relevant area; otherwise nothing.
BENCH="$(fd -i "benchmark.py" examples/benchmarks/memanto-vs-mem0 -0 2>/dev/null | head -n1 || true)"
if [ -n "${BENCH:-}" ]; then
  echo "--- $BENCH (lines 1-140) ---"
  sed -n '1,140p' "$BENCH"
else
  echo "benchmark.py not found under examples/benchmarks/memanto-vs-mem0"
fi

echo "== Locate MemantoClient in repo =="
rg -n --type=py '\bclass\s+MemantoClient\b|\bMemantoClient\b' .

echo "== Inspect Memanto client methods (add/search) =="
rg -n --type=py '\bdef\s+(add|search)\s*\(' memanto || true
rg -n --type=py '\bdef\s+(add|search)\s*\(' . || true

echo "== Inspect likely client entrypoints (sdk/direct clients) =="
ls -R memanto/cli/client 2>/dev/null || true
rg -n --type=py 'class\s+.*Client\b' memanto/cli/client
rg -n --type=py 'def\s+(add|search|remember|recall|query)\s*\(' memanto/cli/client

Repository: moorcheh-ai/memanto

Length of output: 4865


🌐 Web query:

MemantoClient add search python SDK signature return types

💡 Result:

In the Memanto Python SDK (accessed via the SdkClient, which is used by the CLI, MCP, and other integrations), the search operation is typically implemented via a recall method [1][2][3]. The recall method is designed to perform semantic searches against agent memories [4][2].

Based on the underlying implementation patterns and integration tool definitions (such as those found in MCP and LangGraph integrations), the method signature for a search or recall operation generally accepts the following parameters [4][3]:

  1. query (str): The natural-language search string (required) [4][5][3].
  2. limit (int): The maximum number of results to return, typically ranging from 1 to 100 (optional, default often 10) [4][3].
  3. min_similarity (float): A threshold for filtering results by relevance, ranging from 0.0 to 1.0 (optional) [4][3].
  4. type (list[str]): A filter for specific memory types, such as ["preference"] (optional) [4][3].

The return type for these search operations is typically a structured object (often a list of memory dictionaries or a response object containing a list of "memories") [4]. For example, in LlamaIndex and custom agent integrations, the return value is often processed into a string or a list of relevant memory objects, each containing fields like id, title, content, type, and similarity score [4][5].

Note that Memanto utilizes session-based authentication; ensure your client is initialized with an active session token before calling search/recall methods, as these operations require an authorized context [4][2].

For exact, up-to-date type signatures, you can inspect the SdkClient source code directly in the memanto/cli/client/ directory of the official repository [2][6].


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Inspect memanto/cli/client/sdk_client.py signatures + return shapes =="
sed -n '1,220p' memanto/cli/client/sdk_client.py
rg -n "def remember|def recall|async def remember|async def recall" memanto/cli/client/sdk_client.py
# show remember() and recall() blocks
python3 - <<'PY'
import re
path="memanto/cli/client/sdk_client.py"
with open(path,'r',encoding='utf-8') as f:
    lines=f.readlines()
def show_at(pattern):
    for i,l in enumerate(lines):
        if re.search(pattern,l):
            start=max(0,i-5); end=min(len(lines), i+120)
            print("\n---",path,"around line",i+1,"---")
            for j in range(start,end):
                print(f"{j+1}:{lines[j].rstrip()}")
            break
show_at(r"def remember\s*\(")
show_at(r"def recall\s*\(")
PY

echo "== Inspect memanto/cli/client/direct_client.py signatures + return shapes =="
sed -n '1,260p' memanto/cli/client/direct_client.py
rg -n "class MoorchehClient|class DirectClient|def remember|def recall|async def remember|async def recall" memanto/cli/client/direct_client.py

python3 - <<'PY'
import re
path="memanto/cli/client/direct_client.py"
with open(path,'r',encoding='utf-8') as f:
    lines=f.readlines()
def show_at(pattern):
    for i,l in enumerate(lines):
        if re.search(pattern,l):
            start=max(0,i-5); end=min(len(lines), i+180)
            print("\n---",path,"around line",i+1,"---")
            for j in range(start,end):
                print(f"{j+1}:{lines[j].rstrip()}")
            break
show_at(r"class MoorchehClient")
show_at(r"def remember\s*\(")
show_at(r"def recall\s*\(")
PY

Repository: moorcheh-ai/memanto

Length of output: 45723


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Locate MemoryEvaluator =="
rg -n --type=py '\bclass\s+MemoryEvaluator\b' . || true
fd -i "evaluator.py" examples/benchmarks/memanto-vs-mem0 || true
fd -i "evaluator*.py" . | head -n 20 || true

EVAL="$(fd -i "evaluator.py" examples/benchmarks/memanto-vs-mem0 2>/dev/null | head -n1 || true)"
if [ -n "${EVAL:-}" ]; then
  echo "== Show evaluator.py =="
  wc -l "$EVAL" || true
  sed -n '1,220p' "$EVAL"
fi

echo "== Inspect memanto exports for MemantoClient =="
fd -i "__init__.py" memanto | head -n 20
rg -n --type=py '\bMemantoClient\b' memanto | head -n 50 || true
for f in memanto/__init__.py memanto/cli/client/__init__.py memanto/cli/__init__.py; do
  if [ -f "$f" ]; then
    echo "== $f =="
    sed -n '1,200p' "$f"
  fi
done

echo "== Inspect memanto-vs-mem0 benchmark evaluator expectations =="
sed -n '1,220p' examples/benchmarks/memanto-vs-mem0/benchmark.py

Repository: moorcheh-ai/memanto

Length of output: 6145


Fix Memanto/Mem0 benchmark API calls to match the real SDK contracts

  • examples/benchmarks/memanto-vs-mem0/benchmark.py imports MemantoClient from memanto, but this repo doesn’t export a MemantoClient symbol. The Memanto SDK clients are SdkClient and DirectClient, which expose synchronous remember(...) (returns a dict including memory_id, etc.) and recall(...) (returns a dict including memories and count). The benchmark’s await client.add(...) / await client.search(...) calls (e.g., lines 22 and 25) match neither the method names, the synchronous call style, nor the response shapes of the actual API.
  • The benchmark’s evaluator expects actual to be string-like, but both the Mem0 and Memanto client operations return dicts (Mem0 returns a dict with a top-level results key). MemoryEvaluator.evaluate(...) will fail unless the returned data is extracted or serialized first (e.g., lines 25, 52, 55, and 79-80).
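
The extraction step in the second bullet could look roughly like the sketch below. It assumes the response shapes reported in the analysis above: a top-level `results` list for Mem0 and a `memories` list for Memanto's `recall(...)`. The per-item text fields `memory` and `content` come from the web query results rather than from verified SDK source, and `response_to_text` is a hypothetical helper name.

```python
import json


def response_to_text(response):
    """Flatten a memory-search response into plain text for a string-based evaluator."""
    if isinstance(response, str):
        return response
    if not isinstance(response, dict):
        return str(response)

    # Assumed container keys: "results" (Mem0) or "memories" (Memanto recall).
    items = None
    for key in ("results", "memories"):
        if key in response:
            items = response[key]
            break
    if items is None:
        # Unknown shape: serialize the whole payload rather than guess.
        return json.dumps(response, ensure_ascii=False)

    texts = []
    for item in items:
        if isinstance(item, dict):
            # Assumed text fields; fall back to serializing the whole item.
            text = item.get("memory") or item.get("content")
            if text is None:
                text = json.dumps(item, ensure_ascii=False)
        else:
            text = str(item)
        texts.append(text)
    return "\n".join(texts)
```

Passing `response_to_text(response)` as `actual` keeps `MemoryEvaluator.evaluate(...)` working on strings regardless of which client produced the response.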

Comment on lines +29 to +37
tokens = len(response) // 4

results.append({
    "query": session['query'],
    "expected": session['expected_answer'],
    "actual": response,
    "latency": latency,
    "tokens": tokens
})

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Token efficiency is measured inconsistently across Memanto vs Mem0.

Memanto uses len(response) // 4 while Mem0 uses len(str(response)) // 4; this makes token totals non-comparable and can skew winner claims. Normalize both responses to a text representation first, then apply one token heuristic.

Proposed fix
@@
             response = await client.search(session['query'])
             latency = time.time() - start_time
+            response_text = response if isinstance(response, str) else json.dumps(response, ensure_ascii=False)
@@
-            tokens = len(response) // 4 
+            tokens = len(response_text) // 4
@@
-                "actual": response,
+                "actual": response_text,
@@
             response = mem0.search(session['query'])
             latency = time.time() - start_time
+            response_text = response if isinstance(response, str) else json.dumps(response, ensure_ascii=False)
-            tokens = len(str(response)) // 4
+            tokens = len(response_text) // 4
@@
-                "actual": response,
+                "actual": response_text,

Also applies to: 57-65
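
A small sketch of the shared heuristic follows, assuming both clients return either a string or a JSON-serializable dict. `estimate_tokens` is a hypothetical helper name, and the four-characters-per-token ratio is the benchmark's own rough estimate rather than a real tokenizer.

```python
import json


def estimate_tokens(response):
    """Rough token estimate: one token per four characters of serialized text."""
    text = response if isinstance(response, str) else json.dumps(response, ensure_ascii=False)
    return len(text) // 4


payload = {"results": [{"memory": "Almond Milk Latte"}]}

# Applying the same function to both systems keeps the totals comparable.
memanto_tokens = estimate_tokens(payload)
mem0_tokens = estimate_tokens(payload)

# For contrast: len() on a dict counts its keys, not its characters.
naive_tokens = len(payload) // 4
```

The contrast line shows why the mismatch matters: on a dict response, `len(response) // 4` counts keys and so reports almost no tokens, while `len(str(response)) // 4` counts characters.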


Comment on lines +72 to +73
with open('dataset.json', 'r') as f:
    dataset = json.load(f)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use script-relative paths and avoid masking all failures as API-key issues.

Using cwd-relative dataset.json/results.json and a blanket exception path makes runs brittle and misdiagnoses non-auth errors.

Proposed fix
@@
-    with open('dataset.json', 'r') as f:
+    base_dir = os.path.dirname(os.path.abspath(__file__))
+    dataset_path = os.path.join(base_dir, "dataset.json")
+    results_path = os.path.join(base_dir, "results.json")
+    with open(dataset_path, 'r') as f:
         dataset = json.load(f)
@@
-        memanto = MemantoClient(api_key=os.getenv("MOORCHEH_API_KEY"))
+        api_key = os.getenv("MOORCHEH_API_KEY")
+        if not api_key:
+            raise RuntimeError("MOORCHEH_API_KEY is not set")
+        memanto = MemantoClient(api_key=api_key)
@@
-        with open('results.json', 'w') as f:
+        with open(results_path, 'w') as f:
             json.dump(final_results, f, indent=2)
@@
-    except Exception as e:
-        print(f"Error during benchmark: {e}")
-        print("Make sure MOORCHEH_API_KEY is set in .env")
+    except RuntimeError as e:
+        print(f"Configuration error: {e}")
+        raise
+    except Exception as e:
+        print(f"Error during benchmark: {e}")
+        raise

Also applies to: 97-104
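
The same two ideas can be expressed as small helpers, shown here as a sketch using `pathlib`. The helper names `benchmark_paths` and `require_api_key` are hypothetical; only the file names and the `MOORCHEH_API_KEY` variable come from the PR.

```python
import os
from pathlib import Path


def benchmark_paths(script_file):
    """Return dataset/results paths located next to the given script file."""
    base_dir = Path(script_file).resolve().parent
    return base_dir / "dataset.json", base_dir / "results.json"


def require_api_key(env=None, name="MOORCHEH_API_KEY"):
    """Return the API key, or raise a specific error when it is missing or empty."""
    env = os.environ if env is None else env
    api_key = env.get(name)
    if not api_key:
        raise RuntimeError(f"{name} is not set")
    return api_key
```

Inside benchmark.py these would be called as `dataset_path, results_path = benchmark_paths(__file__)` and `api_key = require_api_key()`, so a missing key fails early with its own message and every other exception keeps its original cause.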


Comment on lines +89 to +90
m_score = sum([evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in m_res]) / len(m_res)
z_score = sum([evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in z_res]) / len(z_res)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard against zero-result accuracy division.

If no query rows are produced, these divisions can crash with ZeroDivisionError.

Proposed fix
-        m_score = sum([evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in m_res]) / len(m_res)
-        z_score = sum([evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in z_res]) / len(z_res)
+        m_score = (
+            sum(evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in m_res) / len(m_res)
+            if m_res else 0
+        )
+        z_score = (
+            sum(evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in z_res) / len(z_res)
+            if z_res else 0
+        )
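
An equivalent way to avoid repeating the guard is a tiny averaging helper, sketched here under the assumption that scores are the 0/1 integers returned by the evaluator. `mean_score` is a hypothetical name.

```python
def mean_score(scores, default=0.0):
    """Average a sequence of 0/1 scores, returning `default` when it is empty."""
    scores = list(scores)
    return sum(scores) / len(scores) if scores else default
```

It would be used as `mean_score(evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in m_res)`. Note that a default of 0 makes an empty run indistinguishable from a run where every answer was wrong, so passing `default=None` and reporting it explicitly may be the safer choice for a published benchmark.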
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/memanto-vs-mem0/benchmark.py` around lines 89 - 90, The
current calculations of m_score and z_score divide by len(m_res)/len(z_res) and
will raise ZeroDivisionError when m_res or z_res are empty; change the logic in
the benchmark to guard those cases (referencing m_score, z_score,
evaluator.evaluate, m_res, z_res) by checking if the result list is empty and
setting the score to 0 (or another defined default) or by using a safe
denominator (e.g., denom = len(m_res) or 1) before dividing; implement the same
guard for both m_res and z_res so evaluator.evaluate calls are only averaged
when there are results.
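
One way to keep the guard in a single place is a small averaging helper. This is a sketch only; `mean_score` is a hypothetical name, and returning `0.0` for an empty list is an assumed default rather than something the PR specifies.

```python
def mean_score(scores, default=0.0):
    """Average a sequence of 0/1 scores, returning `default` when it is empty."""
    scores = list(scores)  # accept generators as well as lists
    return sum(scores) / len(scores) if scores else default
```

Both `m_score` and `z_score` could then be computed by passing the per-row `evaluator.evaluate(...)['score']` values to the same helper.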

Comment on lines +14 to +36
prompt = f"""
You are an impartial judge evaluating the accuracy of an AI agent's memory retrieval.

Query: {query}
Expected Answer: {expected}
Actual Agent Answer: {actual}

Does the Actual Answer correctly reflect the most recent preference specified in the Expected Answer?
Respond only with a JSON object:
{{
"score": 1 or 0,
"reasoning": "short explanation"
}}
"""
# In a real implementation, this would call an LLM API.
# For this benchmark infrastructure, we implement a semantic match
# or a mock call if API key is missing.

# Simplified semantic check for the demo/infrastructure
if expected.lower() in actual.lower():
    return {"score": 1, "reasoning": "Exact or semantic match found."}

return {"score": 0, "reasoning": "The agent failed to retrieve the most recent preference."}

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Benchmark accuracy scoring is too permissive for a “most-recent-only” test.

The current `expected in actual` rule can score polluted answers as correct (e.g., answers containing both stale and latest preferences), which inflates accuracy for this scenario.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/memanto-vs-mem0/evaluator.py` around lines 14 - 36, The
current permissive check if expected.lower() in actual.lower() (using variables
expected and actual in the evaluator block that builds prompt) lets answers
containing both stale and latest preferences pass; replace this with a stricter
validation: either require an exact semantic match (compare normalized strings:
actual_normalized == expected_normalized) or implement a small extractor (e.g.,
extract_most_recent_preference(actual)) and assert that extracted_preference
equals expected (and that no other conflicting/stale preference tokens appear);
update the conditional that currently does the "in" check to call the new
normalization/extractor and return score 1 only when the extracted/normalized
preference matches expected exactly.
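
A rough sketch of the stricter check, assuming the dataset can supply the superseded values for each query. The `stale` parameter and the function names are hypothetical. This is word-boundary phrase matching, not true semantic comparison.

```python
import re


def normalize(text):
    # Lowercase and collapse punctuation/whitespace so trivial differences do not matter.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def strict_score(expected, actual, stale=()):
    """Score 1 only if `expected` appears and no stale preference does.

    `stale` is a hypothetical list of superseded preference values for the query.
    """
    actual_n = f" {normalize(actual)} "
    if f" {normalize(expected)} " not in actual_n:
        return {"score": 0, "reasoning": "Expected preference not found."}
    polluted = [s for s in stale if f" {normalize(s)} " in actual_n]
    if polluted:
        return {"score": 0, "reasoning": f"Stale preference(s) present: {polluted}"}
    return {"score": 1, "reasoning": "Only the most recent preference was returned."}
```

Padding with spaces keeps `tea` from matching inside `steak`, and an answer that mentions both the old and the new preference scores 0.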

## 🛠 Methodology
- **Dataset:** `dataset.json` containing evolving preference turns.
- **Control Group:** Identical inputs fed to both Memanto and Mem0.
- **Backend LLM:** GPT-4o (used for both retrieval and as the LLM-as-a-Judge).

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clarify the evaluation mechanism to match actual implementation.

The README claims "GPT-4o (used for both retrieval and as the LLM-as-a-Judge)," but the actual `evaluator.py` implementation uses case-insensitive substring matching (`if expected.lower() in actual.lower()`) rather than an LLM API call. This discrepancy between documentation and implementation undermines the benchmark's credibility.

📝 Proposed fix

Option 1: Update documentation to reflect the simplified evaluator:

-- **Backend LLM:** GPT-4o (used for both retrieval and as the LLM-as-a-Judge).
+- **Backend LLM:** GPT-4o (used for retrieval).
+- **Evaluation:** Simplified semantic matching via case-insensitive substring check.

Option 2: Implement the promised LLM-as-a-Judge by updating evaluator.py to actually call GPT-4o for scoring.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-- **Backend LLM:** GPT-4o (used for both retrieval and as the LLM-as-a-Judge).
+- **Backend LLM:** GPT-4o (used for retrieval).
+- **Evaluation:** Simplified semantic matching via case-insensitive substring check.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/memanto-vs-mem0/README.md` at line 13, The README
incorrectly states that GPT-4o is used as the LLM-as-a-Judge, but evaluator.py
actually performs a simple case-insensitive substring check (the conditional
using expected.lower() in actual.lower()) in its evaluate function; either
update the README to describe this exact matching logic (mentioning evaluator.py
and the evaluate function/substring check) or modify evaluator.py to replace the
substring check with an actual GPT-4o scoring call (add API invocation, prompt
construction, and score parsing) and update tests/docs accordingly.
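
If Option 2 is chosen, the judge call and score parsing might look roughly like the sketch below. `call_llm` is a hypothetical callable `(prompt: str) -> str` that would wrap whichever LLM client the benchmark uses; nothing here is taken from the PR, and the prompt is abbreviated from the one in `evaluator.py`.

```python
import json


def judge(query, expected, actual, call_llm):
    """Ask a judge model for a verdict and parse its JSON reply.

    `call_llm` is an assumed, injected callable; swap in a real client wrapper.
    """
    prompt = (
        "You are an impartial judge evaluating an AI agent's memory retrieval.\n"
        f"Query: {query}\nExpected Answer: {expected}\nActual Agent Answer: {actual}\n"
        'Respond only with a JSON object: {"score": 1 or 0, "reasoning": "..."}'
    )
    raw = call_llm(prompt)
    try:
        verdict = json.loads(raw)
        score = int(verdict["score"])
    except (ValueError, KeyError, TypeError):
        # Treat unparseable replies as failures rather than guessing a score.
        return {"score": 0, "reasoning": f"Unparseable judge reply: {raw!r}"}
    if score not in (0, 1):
        return {"score": 0, "reasoning": f"Out-of-range score: {score}"}
    return {"score": score, "reasoning": str(verdict.get("reasoning", ""))}
```

Injecting the callable keeps the evaluator testable without network access: a stub that returns a fixed JSON string is enough for unit tests.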

- **Backend LLM:** GPT-4o (used for both retrieval and as the LLM-as-a-Judge).
- **Metrics:**
    - **Accuracy:** Percentage of turns where the agent correctly identifies the current preference.
    - **p95 Latency:** Time to retrieve the correct context.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Correct latency metric inconsistency between README and implementation.

The README promises "p95 Latency" as a metric, but the actual benchmark implementation (`benchmark.py`) computes and reports `avg_latency` in the results structure. This inconsistency between documentation and code will confuse users interpreting the benchmark output.

🔧 Proposed fix

Option 1: Update the README to match the implementation:

-    - **p95 Latency:** Time to retrieve the correct context.
+    - **Avg Latency:** Average time to retrieve the correct context.

Option 2: Update the benchmark implementation to actually compute p95 latency instead of average.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/memanto-vs-mem0/README.md` at line 16, The README claims
a "p95 Latency" metric but the benchmark implementation reports "avg_latency"
(see results structure and variable avg_latency in benchmark.py); either update
README to say "avg latency" to match the code, or modify benchmark.py to compute
and report p95 instead of average by collecting all per-request latencies and
replacing avg_latency with a p95 value (e.g., compute the 95th percentile from
the latency list using numpy.percentile or statistics.quantiles) and update the
results key name to "p95_latency" so doc and code are consistent.
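
For Option 2, a dependency-free sketch of the percentile computation using the nearest-rank method. The function name is illustrative, and `pct` is assumed to be an integer percentage. Note that `numpy.percentile` interpolates linearly by default, so its result can differ slightly on small samples.

```python
def percentile_nearest_rank(values, pct=95):
    """Return the pct-th percentile of `values` using the nearest-rank method."""
    if not values:
        raise ValueError("no latency samples collected")
    ordered = sorted(values)
    # Integer ceil(pct * n / 100), clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]
```

The benchmark would append each per-request latency to a list and store `percentile_nearest_rank(latencies)` under a `p95_latency` key.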

Comment on lines +19 to +28
## 📊 Preliminary Results (Infrastructure Ready)
The benchmark suite is fully implemented. Once the `MOORCHEH_API_KEY` is configured, the `benchmark.py` script produces the following metrics:

| Metric | Memanto (Expected) | Mem0 (Expected) | Winner |
| :--- | :---: | :---: | :---: |
| **Accuracy** | 95% | 70% | **Memanto** |
| **Avg Latency** | 0.4s | 1.2s | **Memanto** |
| **Token Overhead** | Low | High | **Memanto** |

*Note: Memanto's active compression and serverless retrieval are expected to significantly outperform passive vector-dumping systems in dynamic scenarios.*

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clarify whether results are actual measurements or projections.

Lines 20 and 22-26 present a results table in a way that suggests actual benchmark measurements ("the benchmark.py script produces the following metrics"), but the column headers include "(Expected)" and line 28 notes these are "expected" outcomes. This ambiguity undermines the benchmark's credibility and may mislead users into citing projected results as measured data.

📊 Proposed fix

Option 1: If these are projections, make it explicit:

-## 📊 Preliminary Results (Infrastructure Ready)
-The benchmark suite is fully implemented. Once the `MOORCHEH_API_KEY` is configured, the `benchmark.py` script produces the following metrics:
+## 📊 Projected Results (Infrastructure Ready)
+The benchmark suite is fully implemented. Once the `MOORCHEH_API_KEY` is configured, the `benchmark.py` script will measure the following. Below are our projected outcomes based on system design:

-| Metric | Memanto (Expected) | Mem0 (Expected) | Winner |
+| Metric | Memanto (Projected) | Mem0 (Projected) | Projected Winner |

Option 2: Run the actual benchmark and report measured results instead of projections.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-## 📊 Preliminary Results (Infrastructure Ready)
-The benchmark suite is fully implemented. Once the `MOORCHEH_API_KEY` is configured, the `benchmark.py` script produces the following metrics:
+## 📊 Projected Results (Infrastructure Ready)
+The benchmark suite is fully implemented. Once the `MOORCHEH_API_KEY` is configured, the `benchmark.py` script will measure the following. Below are our projected outcomes based on system design:

-| Metric | Memanto (Expected) | Mem0 (Expected) | Winner |
+| Metric | Memanto (Projected) | Mem0 (Projected) | Projected Winner |
 | :--- | :---: | :---: | :---: |
 | **Accuracy** | 95% | 70% | **Memanto** |
 | **Avg Latency** | 0.4s | 1.2s | **Memanto** |
 | **Token Overhead** | Low | High | **Memanto** |

 *Note: Memanto's active compression and serverless retrieval are expected to significantly outperform passive vector-dumping systems in dynamic scenarios.*
🧰 Tools
🪛 LanguageTool

[style] ~28-~28: Try moving the adverb to make the sentence clearer.
Context: ...ompression and serverless retrieval are expected to significantly outperform passive vector-dumping systems in dynamic scenarios.* ## 🚀 How to Ru...

(SPLIT_INFINITIVE)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/memanto-vs-mem0/README.md` around lines 19 - 28, Clarify
whether the metrics in the README table are projected or measured: if these are
projections, update the text around benchmark.py and the table (remove ambiguous
phrase "produces the following metrics", change column headers from "(Expected)"
to "(Projected)" or add a clear "Projected" label, and add a short note
referencing benchmark.py and MOORCHEH_API_KEY that these are estimated
outcomes); if they are actual measurements, run benchmark.py with
MOORCHEH_API_KEY configured, replace the table values with the measured results
and update the copy to state "Measured results from running benchmark.py" and
include the command/seed used.
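
If measured results replace the projections, the table could be generated from the results file rather than typed by hand. The sketch below assumes a results shape of `{"memanto": {...}, "mem0": {...}}` with `accuracy` and `avg_latency` keys; that shape and the function name are assumptions, so adjust them to whatever `benchmark.py` actually writes.

```python
def results_table(results, label="Measured"):
    """Render a Markdown table from a results mapping.

    Assumed (hypothetical) shape:
    {"memanto": {"accuracy": float, "avg_latency": float}, "mem0": {...}}
    """
    m, z = results["memanto"], results["mem0"]
    rows = [
        f"| Metric | Memanto ({label}) | Mem0 ({label}) |",
        "| :--- | :---: | :---: |",
        f"| **Accuracy** | {m['accuracy']:.0%} | {z['accuracy']:.0%} |",
        f"| **Avg Latency** | {m['avg_latency']:.2f}s | {z['avg_latency']:.2f}s |",
    ]
    return "\n".join(rows)
```

Passing `label="Projected"` keeps the same layout for estimates, so the README cannot mix the two kinds of numbers without the header saying so.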

Comment on lines +1 to +3
memanto
mem0ai
python-dotenv

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Pin dependency versions to ensure benchmark reproducibility.

The PR objectives explicitly promise "full reproducibility with an included requirements.txt," but all three dependencies are unpinned. Without version constraints, different installation times will resolve different versions, making benchmark results incomparable across environments and runs.

📌 Proposed fix to pin versions
-memanto
-mem0ai
-python-dotenv
+memanto==<version>
+mem0ai==<version>
+python-dotenv==<version>

Replace <version> with the specific versions tested. To capture current resolved versions:

pip freeze | grep -E '^(memanto|mem0ai|python-dotenv)=='
🧰 Tools
🪛 OSV Scanner (2.3.8)

[HIGH] 1-1: pyjwt 2.9.0: undefined

(PYSEC-2025-183)


[HIGH] 1-1: pyjwt 2.9.0: undefined

(PYSEC-2026-120)


[HIGH] 1-1: pyjwt 2.9.0: undefined

(PYSEC-2026-175)


[HIGH] 1-1: pyjwt 2.9.0: undefined

(PYSEC-2026-176)


[HIGH] 1-1: pyjwt 2.9.0: undefined

(PYSEC-2026-177)


[HIGH] 1-1: pyjwt 2.9.0: undefined

(PYSEC-2026-178)


[HIGH] 1-1: pyjwt 2.9.0: undefined

(PYSEC-2026-179)


[HIGH] 1-1: pyjwt 2.9.0: PyJWT accepts unknown crit header extensions

(GHSA-752w-5fwx-jx9f)


[HIGH] 1-1: python-multipart 0.0.9: Denial of service (DoS) via deformation multipart/form-data boundary

(GHSA-59g5-xgcq-4qw3)


[HIGH] 1-1: python-multipart 0.0.9: python-multipart affected by Denial of Service via large multipart preamble or epilogue data

(GHSA-mj87-hwqh-73pj)


[HIGH] 1-1: python-multipart 0.0.9: python-multipart has Denial of Service via unbounded multipart part headers

(GHSA-pp6c-gr5w-3c5g)


[HIGH] 1-1: python-multipart 0.0.9: Python-Multipart has Arbitrary File Write via Non-Default Configuration

(GHSA-wp53-j4wj-2cfg)


[HIGH] 1-1: requests 2.9.2: undefined

(PYSEC-2018-28)


[HIGH] 1-1: requests 2.9.2: undefined

(PYSEC-2023-74)


[HIGH] 1-1: requests 2.9.2: Requests vulnerable to .netrc credentials leak via malicious URLs

(GHSA-9hjg-9r4m-mvj7)


[HIGH] 1-1: requests 2.9.2: Requests Session object does not verify requests after making first request with verify=False

(GHSA-9wx4-h78v-vm56)


[HIGH] 1-1: requests 2.9.2: Requests has Insecure Temp File Reuse in its extract_zipped_paths() utility function

(GHSA-gc5v-m9x4-r6x2)


[HIGH] 1-1: requests 2.9.2: Unintended leak of Proxy-Authorization header in requests

(GHSA-j8r2-6x86-q33q)


[HIGH] 1-1: requests 2.9.2: Insufficiently Protected Credentials in Requests

(GHSA-x84v-xcm2-53pg)


[HIGH] 1-1: tqdm 4.9.0: undefined

(PYSEC-2017-74)


[HIGH] 1-1: tqdm 4.9.0: tqdm CLI arguments injection attack

(GHSA-g7vv-2v7x-gj9p)


[HIGH] 1-1: tqdm 4.9.0: TDQM Arbitrary Code Execution

(GHSA-r7q7-xcjw-qx8q)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/memanto-vs-mem0/requirements.txt` around lines 1 - 3, The
requirements.txt currently lists memanto, mem0ai, and python-dotenv without
pinned versions, which breaks reproducibility; update requirements.txt to pin
each dependency to the exact versions used in the benchmark (replace each
package line with package==<version>) by running pip freeze and extracting the
versions for memanto, mem0ai, and python-dotenv, then commit the updated file so
the benchmark uses deterministic installs.
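
As an alternative to `pip freeze`, the installed versions can be read with the standard library's `importlib.metadata`. This is a sketch; `pin_lines` is a hypothetical helper, and it only reports what is installed in the current environment, which may not be the environment the benchmark was run in.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_lines(packages):
    """Return (pinned, missing): `name==version` lines and names not installed."""
    pinned, missing = [], []
    for name in packages:
        try:
            pinned.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            missing.append(name)
    return pinned, missing


if __name__ == "__main__":
    pinned, missing = pin_lines(["memanto", "mem0ai", "python-dotenv"])
    print("\n".join(pinned))
    if missing:
        print("# not installed:", ", ".join(missing))
```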
