feat: Implement Dynamic Preference Benchmarking Suite (Memanto vs Mem0)#718
feat: Implement Dynamic Preference Benchmarking Suite (Memanto vs Mem0)#718nkar123412-hub wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThis PR adds a complete benchmark example comparing Memanto and Mem0 on dynamic user preference tracking. The benchmark ingests user preferences, executes retrieval queries, evaluates accuracy, and reports latency and token efficiency metrics to a JSON results file. ChangesBenchmark Suite
Estimated Code Review Effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
🧹 Nitpick comments (2)
examples/benchmarks/memanto-vs-mem0/benchmark.py (1)
83-87: ⚡ Quick winRun both benchmark suites concurrently with
asyncio.gather.Both routines are async but currently executed sequentially;
gatheris a straightforward speedup and better matches the stated parallel benchmark design.Proposed refactor
- print("Running Memanto tests...") - m_res, m_tokens, m_lat = await run_memanto_test(memanto, dataset) - - print("Running Mem0 tests...") - z_res, z_tokens, z_lat = await run_mem0_test(mem0, dataset) + print("Running Memanto and Mem0 tests...") + (m_res, m_tokens, m_lat), (z_res, z_tokens, z_lat) = await asyncio.gather( + run_memanto_test(memanto, dataset), + run_mem0_test(mem0, dataset), + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/memanto-vs-mem0/benchmark.py` around lines 83 - 87, The two async benchmark calls run_memanto_test and run_mem0_test are awaited sequentially; run them concurrently using asyncio.gather and unpack the returned tuples into (m_res, m_tokens, m_lat) and (z_res, z_tokens, z_lat). Replace the two separate await statements with a single await asyncio.gather(run_memanto_test(memanto, dataset), run_mem0_test(mem0, dataset)) and assign/unpack the results so downstream code still uses the same variable names.examples/benchmarks/memanto-vs-mem0/README.md (1)
6-6: ⚡ Quick winReplace LaTeX syntax with Unicode or Markdown for proper rendering.
The LaTeX math syntax
$\rightarrow$will render as literal text in standard GitHub Markdown. Use a Unicode arrow or plain text instead for better readability.✨ Proposed fix
-The test uses a "Shifting Persona" dataset where a user's preferences dynamically mutate or contradict over multiple sessions (e.g., shifting from black coffee $\rightarrow$ Matcha tea $\rightarrow$ Almond Milk Latte). +The test uses a "Shifting Persona" dataset where a user's preferences dynamically mutate or contradict over multiple sessions (e.g., shifting from black coffee → Matcha tea → Almond Milk Latte).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/memanto-vs-mem0/README.md` at line 6, The README contains LaTeX math `$\rightarrow$` in the sentence describing the "Shifting Persona" dataset which won't render in GitHub Markdown; replace `$\rightarrow$` with a Unicode arrow (→) or a plain ASCII arrow (->) so the example "black coffee → Matcha tea → Almond Milk Latte" displays correctly in the README text.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/benchmarks/memanto-vs-mem0/benchmark.py`:
- Around line 72-73: Open the dataset/results file using a script-relative path
instead of a cwd-relative path (use os.path.join(os.path.dirname(__file__),
'dataset.json') and similarly for 'results.json') so file access doesn't break
when running from another directory; update the open(...) call that produces
dataset = json.load(f) and the block that reads/writes results.json (lines
~97-104) to use these computed paths. Also remove the blanket except that treats
every error as an API-key/auth problem: catch and handle specific authentication
exceptions (or check for missing API key explicitly) and let other exceptions
bubble or log their actual error before re-raising, so non-auth failures are not
misdiagnosed.
- Around line 29-37: The token-counting is inconsistent because one place
computes tokens as len(response) // 4 while the other uses len(str(response)) //
4; normalize by first converting the response to a text representation and then
apply the token heuristic. Update the code near the tokens calculation in
benchmark.py (the tokens variable and the results.append block) to compute
response_text = str(response) (or ensure session['response'] is cast to str) and
then tokens = len(response_text) // 4, and make the identical change in the
second occurrence (the block around lines 57-65) so both Memanto and Mem0 use
the same string-based token heuristic.
- Around line 89-90: The current calculations of m_score and z_score divide by
len(m_res)/len(z_res) and will raise ZeroDivisionError when m_res or z_res are
empty; change the logic in the benchmark to guard those cases (referencing
m_score, z_score, evaluator.evaluate, m_res, z_res) by checking if the result
list is empty and setting the score to 0 (or another defined default) or by
using a safe denominator (e.g., denom = len(m_res) or 1) before dividing;
implement the same guard for both m_res and z_res so evaluator.evaluate calls
are only averaged when there are results.
- Around line 6-8: The benchmark is calling non-existent Memanto symbols and
using async methods that don't match the real SDK; replace import of
MemantoClient with the actual clients (SdkClient or DirectClient) and call their
synchronous remember(...) and recall(...) methods instead of await
client.add(...) / await client.search(...); when handling return values extract
the correct fields (remember(...) returns a dict including memory_id,
recall(...) returns a dict with memories and count) and for Mem0 extract its
top-level results field (e.g., response['results']) before passing to
MemoryEvaluator.evaluate so that evaluate receives a string-like or serialized
representation rather than the raw dict.
In `@examples/benchmarks/memanto-vs-mem0/evaluator.py`:
- Around line 14-36: The current permissive check if expected.lower() in
actual.lower() (using variables expected and actual in the evaluator block that
builds prompt) lets answers containing both stale and latest preferences pass;
replace this with a stricter validation: either require an exact semantic match
(compare normalized strings: actual_normalized == expected_normalized) or
implement a small extractor (e.g., extract_most_recent_preference(actual)) and
assert that extracted_preference equals expected (and that no other
conflicting/stale preference tokens appear); update the conditional that
currently does the "in" check to call the new normalization/extractor and return
score 1 only when the extracted/normalized preference matches expected exactly.
In `@examples/benchmarks/memanto-vs-mem0/README.md`:
- Line 16: The README claims a "p95 Latency" metric but the benchmark
implementation reports "avg_latency" (see results structure and variable
avg_latency in benchmark.py); either update README to say "avg latency" to match
the code, or modify benchmark.py to compute and report p95 instead of average by
collecting all per-request latencies and replacing avg_latency with a p95 value
(e.g., compute the 95th percentile from the latency list using numpy.percentile
or statistics.quantiles) and update the results key name to "p95_latency" so doc
and code are consistent.
- Line 13: The README incorrectly states that GPT-4o is used as the
LLM-as-a-Judge, but evaluator.py actually performs a simple case-insensitive
substring check (the conditional using expected.lower() in actual.lower()) in
its evaluate function; either update the README to describe this exact matching
logic (mentioning evaluator.py and the evaluate function/substring check) or
modify evaluator.py to replace the substring check with an actual GPT-4o scoring
call (add API invocation, prompt construction, and score parsing) and update
tests/docs accordingly.
- Around line 19-28: Clarify whether the metrics in the README table are
projected or measured: if these are projections, update the text around
benchmark.py and the table (remove ambiguous phrase "produces the following
metrics", change column headers from "(Expected)" to "(Projected)" or add a
clear "Projected" label, and add a short note referencing benchmark.py and
MOORCHEH_API_KEY that these are estimated outcomes); if they are actual
measurements, run benchmark.py with MOORCHEH_API_KEY configured, replace the
table values with the measured results and update the copy to state "Measured
results from running benchmark.py" and include the command/seed used.
In `@examples/benchmarks/memanto-vs-mem0/requirements.txt`:
- Around line 1-3: The requirements.txt currently lists memanto, mem0ai, and
python-dotenv without pinned versions, which breaks reproducibility; update
requirements.txt to pin each dependency to the exact versions used in the
benchmark (replace each package line with package==<version>) by running pip
freeze and extracting the versions for memanto, mem0ai, and python-dotenv, then
commit the updated file so the benchmark uses deterministic installs.
---
Nitpick comments:
In `@examples/benchmarks/memanto-vs-mem0/benchmark.py`:
- Around line 83-87: The two async benchmark calls run_memanto_test and
run_mem0_test are awaited sequentially; run them concurrently using
asyncio.gather and unpack the returned tuples into (m_res, m_tokens, m_lat) and
(z_res, z_tokens, z_lat). Replace the two separate await statements with a
single await asyncio.gather(run_memanto_test(memanto, dataset),
run_mem0_test(mem0, dataset)) and assign/unpack the results so downstream code
still uses the same variable names.
In `@examples/benchmarks/memanto-vs-mem0/README.md`:
- Line 6: The README contains LaTeX math `$\rightarrow$` in the sentence
describing the "Shifting Persona" dataset which won't render in GitHub Markdown;
replace `$\rightarrow$` with a Unicode arrow (→) or a plain ASCII arrow (->) so
the example "black coffee → Matcha tea → Almond Milk Latte" displays correctly
in the README text.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 9b343512-df94-4c30-ba7d-aa1b9465461b
📒 Files selected for processing (5)
examples/benchmarks/memanto-vs-mem0/README.mdexamples/benchmarks/memanto-vs-mem0/benchmark.pyexamples/benchmarks/memanto-vs-mem0/dataset.jsonexamples/benchmarks/memanto-vs-mem0/evaluator.pyexamples/benchmarks/memanto-vs-mem0/requirements.txt
| from memanto import MemantoClient # Hypothetical based on repo structure | ||
| from mem0 import Memory # Standard Mem0 API | ||
| from evaluator import MemoryEvaluator |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Locate benchmark call sites =="
rg -n --type=py -C3 '\bMemantoClient\b|\bMemory\b|\bclient\.(add|search)\s*\(|\bmem0\.(add|search)\s*\('
echo "== Inspect Memanto client API in repo (if present) =="
rg -n --type=py -C3 '\bclass\s+.*Client\b|def\s+(add|search|query)\s*\('
echo "== Show pinned benchmark dependencies =="
fd -i 'requirements.txt' --exec sh -c 'echo "--- {} ---"; sed -n "1,120p" "{}"'Repository: moorcheh-ai/memanto
Length of output: 50377
🌐 Web query:
In the current mem0ai Python SDK, what are the exact method signatures and return types for Memory.add(...)andMemory.search(...)?
💡 Result:
In the current mem0ai Python SDK (as of version 1.0.0+), the method signatures and return types for the Memory class are as follows: Memory.add(...) Signature: def add( self, messages, user_id: str = None, agent_id: str = None, run_id: str = None, metadata: dict = None, filters: dict = None, infer: bool = True) -> dict Return Type: The method returns a dictionary containing a "results" key, which holds the details of the processed memory operations [1][2]. The response typically follows this structure: { "results": [ { "id": "mem_...", "memory": "...", "metadata": {...}, "event": "ADD" } ] } [1] Memory.search(...) Signature: def search( self, query: str, filters: dict = None, top_k: int = 10, threshold: float = 0.1, rerank: bool = False, fields: list = None, categories: list = None) -> dict Return Type: The method returns a dictionary containing a "results" key, which holds a list of memory objects matching the search criteria [3][4]. The standard structure is: { "results": [ { "id": "...", "memory": "...", "score": 0.8,... } ] } [3][4] Note: Entity IDs such as user_id, agent_id, and run_id should be provided within the filters dictionary when using the search method [3][4].
Citations:
- 1: https://docs.mem0.ai/migration/api-changes
- 2: https://github.com/mem0ai/mem0/blob/main/LLM.md
- 3: https://github.com/mem0ai/mem0/blob/main/mem0/memory/main.py
- 4: https://github.com/mem0ai/mem0/blob/HEAD/skills/mem0/client/python.md
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Locate benchmark file =="
fd -i "benchmark.py" examples/benchmarks/memanto-vs-mem0 || true
fd -i "memanto-vs-mem0" examples/benchmarks || true
echo "== Show benchmark snippet around provided lines =="
# If the file exists, print the relevant area; otherwise nothing.
BENCH="$(fd -i "benchmark.py" examples/benchmarks/memanto-vs-mem0 -0 2>/dev/null | head -n1 || true)"
if [ -n "${BENCH:-}" ]; then
echo "--- $BENCH (lines 1-140) ---"
sed -n '1,140p' "$BENCH"
else
echo "benchmark.py not found under examples/benchmarks/memanto-vs-mem0"
fi
echo "== Locate MemantoClient in repo =="
rg -n --type=py '\bclass\s+MemantoClient\b|\bMemantoClient\b' .
echo "== Inspect Memanto client methods (add/search) =="
rg -n --type=py '\bdef\s+(add|search)\s*\(' memanto || true
rg -n --type=py '\bdef\s+(add|search)\s*\(' . || true
echo "== Inspect likely client entrypoints (sdk/direct clients) =="
ls -R memanto/cli/client 2>/dev/null || true
rg -n --type=py 'class\s+.*Client\b' memanto/cli/client
rg -n --type=py 'def\s+(add|search|remember|recall|query)\s*\(' memanto/cli/clientRepository: moorcheh-ai/memanto
Length of output: 4865
🌐 Web query:
MemantoClient add search python SDK signature return types
💡 Result:
In the Memanto Python SDK (accessed via the SdkClient, which is used by the CLI, MCP, and other integrations), the search operation is typically implemented via a recall method [1][2][3]. The recall method is designed to perform semantic searches against agent memories [4][2]. Based on the underlying implementation patterns and integration tool definitions (such as those found in MCP and LangGraph integrations), the method signature for a search or recall operation generally accepts the following parameters [4][3]: 1. query (str): The natural-language search string (required) [4][5][3]. 2. limit (int): The maximum number of results to return, typically ranging from 1 to 100 (optional, default often 10) [4][3]. 3. min_similarity (float): A threshold for filtering results by relevance, ranging from 0.0 to 1.0 (optional) [4][3]. 4. type (list[str]): A filter for specific memory types, such as ["preference"] (optional) [4][3]. The return type for these search operations is typically a structured object (often a list of memory dictionaries or a response object containing a list of "memories") [4]. For example, in LlamaIndex and custom agent integrations, the return value is often processed into a string or a list of relevant memory objects, each containing fields like id, title, content, type, and similarity score [4][5]. Note that Memanto utilizes session-based authentication; ensure your client is initialized with an active session token before calling search/recall methods, as these operations require an authorized context [4][2]. For exact, up-to-date type signatures, you can inspect the SdkClient source code directly in the memanto/cli/client/ directory of the official repository [2][6].
Citations:
- 1: https://pypi.org/project/memanto-mcp/0.1.0/
- 2: https://github.com/moorcheh-ai/memanto
- 3: Roro253@d7a9ffa
- 4: https://docs.memanto.ai/api-reference/search/recall
- 5: https://docs.memanto.ai/integrations/llamaindex
- 6: Roro253@0d27581
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Inspect memanto/cli/client/sdk_client.py signatures + return shapes =="
sed -n '1,220p' memanto/cli/client/sdk_client.py
rg -n "def remember|def recall|async def remember|async def recall" memanto/cli/client/sdk_client.py
# show remember() and recall() blocks
python3 - <<'PY'
import re
path="memanto/cli/client/sdk_client.py"
with open(path,'r',encoding='utf-8') as f:
lines=f.readlines()
def show_at(pattern):
for i,l in enumerate(lines):
if re.search(pattern,l):
start=max(0,i-5); end=min(len(lines), i+120)
print("\n---",path,"around line",i+1,"---")
for j in range(start,end):
print(f"{j+1}:{lines[j].rstrip()}")
break
show_at(r"def remember\s*\(")
show_at(r"def recall\s*\(")
PY
echo "== Inspect memanto/cli/client/direct_client.py signatures + return shapes =="
sed -n '1,260p' memanto/cli/client/direct_client.py
rg -n "class MoorchehClient|class DirectClient|def remember|def recall|async def remember|async def recall" memanto/cli/client/direct_client.py
python3 - <<'PY'
import re
path="memanto/cli/client/direct_client.py"
with open(path,'r',encoding='utf-8') as f:
lines=f.readlines()
def show_at(pattern):
for i,l in enumerate(lines):
if re.search(pattern,l):
start=max(0,i-5); end=min(len(lines), i+180)
print("\n---",path,"around line",i+1,"---")
for j in range(start,end):
print(f"{j+1}:{lines[j].rstrip()}")
break
show_at(r"class MoorchehClient")
show_at(r"def remember\s*\(")
show_at(r"def recall\s*\(")
PYRepository: moorcheh-ai/memanto
Length of output: 45723
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Locate MemoryEvaluator =="
rg -n --type=py '\bclass\s+MemoryEvaluator\b' . || true
fd -i "evaluator.py" examples/benchmarks/memanto-vs-mem0 || true
fd -i "evaluator*.py" . | head -n 20 || true
EVAL="$(fd -i "evaluator.py" examples/benchmarks/memanto-vs-mem0 2>/dev/null | head -n1 || true)"
if [ -n "${EVAL:-}" ]; then
echo "== Show evaluator.py =="
wc -l "$EVAL" || true
sed -n '1,220p' "$EVAL"
fi
echo "== Inspect memanto exports for MemantoClient =="
fd -i "__init__.py" memanto | head -n 20
rg -n --type=py '\bMemantoClient\b' memanto | head -n 50 || true
for f in memanto/__init__.py memanto/cli/client/__init__.py memanto/cli/__init__.py; do
if [ -f "$f" ]; then
echo "== $f =="
sed -n '1,200p' "$f"
fi
done
echo "== Inspect memanto-vs-mem0 benchmark evaluator expectations =="
sed -n '1,220p' examples/benchmarks/memanto-vs-mem0/benchmark.pyRepository: moorcheh-ai/memanto
Length of output: 6145
Fix Memanto/Mem0 benchmark API calls to match the real SDK contracts
examples/benchmarks/memanto-vs-mem0/benchmark.pyimportsMemantoClientfrommemanto, but this repo doesn’t export aMemantoClientsymbol; the Memanto SDK clients areSdkClient/DirectClient, with synchronousremember(...)(returns dict incl.memory_id, etc.) andrecall(...)(returns dict incl.memories+count). The benchmark’sawait client.add(...)/await client.search(...)(e.g., lines 22-22 and 25-25) don’t match the actual API and response shapes (sync + different method names).- The benchmark’s evaluator expects
actualto be string-like, but both Mem0 and Memanto client operations return dicts (Mem0 returns dict with top-levelresults);MemoryEvaluator.evaluate(...)will fail unless you extract/serialize the returned data appropriately (e.g., lines 25-25 / 52-52 / 55-55 / 79-80).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/memanto-vs-mem0/benchmark.py` around lines 6 - 8, The
benchmark is calling non-existent Memanto symbols and using async methods that
don't match the real SDK; replace import of MemantoClient with the actual
clients (SdkClient or DirectClient) and call their synchronous remember(...) and
recall(...) methods instead of await client.add(...) / await client.search(...);
when handling return values extract the correct fields (remember(...) returns a
dict including memory_id, recall(...) returns a dict with memories and count)
and for Mem0 extract its top-level results field (e.g., response['results'])
before passing to MemoryEvaluator.evaluate so that evaluate receives a
string-like or serialized representation rather than the raw dict.
| tokens = len(response) // 4 | ||
|
|
||
| results.append({ | ||
| "query": session['query'], | ||
| "expected": session['expected_answer'], | ||
| "actual": response, | ||
| "latency": latency, | ||
| "tokens": tokens | ||
| }) |
There was a problem hiding this comment.
Token efficiency is measured inconsistently across Memanto vs Mem0.
Memanto uses len(response) // 4 while Mem0 uses len(str(response)) // 4; this makes token totals non-comparable and can skew winner claims. Normalize both responses to a text representation first, then apply one token heuristic.
Proposed fix
@@
- response = await client.search(session['query'])
+ response = await client.search(session['query'])
+ response_text = response if isinstance(response, str) else json.dumps(response, ensure_ascii=False)
latency = time.time() - start_time
@@
- tokens = len(response) // 4
+ tokens = len(response_text) // 4
@@
- "actual": response,
+ "actual": response_text,
@@
- response = mem0.search(session['query'])
+ response = mem0.search(session['query'])
+ response_text = response if isinstance(response, str) else json.dumps(response, ensure_ascii=False)
latency = time.time() - start_time
- tokens = len(str(response)) // 4
+ tokens = len(response_text) // 4
@@
- "actual": response,
+ "actual": response_text,Also applies to: 57-65
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/memanto-vs-mem0/benchmark.py` around lines 29 - 37, The
token-counting is inconsistent because one place computes tokens as
len(response) // 4 while the other uses len(str(response)) // 4; normalize by
first converting the response to a text representation and then apply the token
heuristic. Update the code near the tokens calculation in benchmark.py (the
tokens variable and the results.append block) to compute response_text =
str(response) (or ensure session['response'] is cast to str) and then tokens =
len(response_text) // 4, and make the identical change in the second occurrence
(the block around lines 57-65) so both Memanto and Mem0 use the same
string-based token heuristic.
| with open('dataset.json', 'r') as f: | ||
| dataset = json.load(f) |
There was a problem hiding this comment.
Use script-relative paths and avoid masking all failures as API-key issues.
Using cwd-relative dataset.json/results.json and a blanket exception path makes runs brittle and misdiagnoses non-auth errors.
Proposed fix
@@
- with open('dataset.json', 'r') as f:
+ base_dir = os.path.dirname(os.path.abspath(__file__))
+ dataset_path = os.path.join(base_dir, "dataset.json")
+ results_path = os.path.join(base_dir, "results.json")
+ with open(dataset_path, 'r') as f:
dataset = json.load(f)
@@
- memanto = MemantoClient(api_key=os.getenv("MOORCHEH_API_KEY"))
+ api_key = os.getenv("MOORCHEH_API_KEY")
+ if not api_key:
+ raise RuntimeError("MOORCHEH_API_KEY is not set")
+ memanto = MemantoClient(api_key=api_key)
@@
- with open('results.json', 'w') as f:
+ with open(results_path, 'w') as f:
json.dump(final_results, f, indent=2)
@@
- except Exception as e:
- print(f"Error during benchmark: {e}")
- print("Make sure MOORCHEH_API_KEY is set in .env")
+ except RuntimeError as e:
+ print(f"Configuration error: {e}")
+ raise
+ except Exception as e:
+ print(f"Error during benchmark: {e}")
+ raiseAlso applies to: 97-104
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/memanto-vs-mem0/benchmark.py` around lines 72 - 73, Open
the dataset/results file using a script-relative path instead of a cwd-relative
path (use os.path.join(os.path.dirname(__file__), 'dataset.json') and similarly
for 'results.json') so file access doesn't break when running from another
directory; update the open(...) call that produces dataset = json.load(f) and
the block that reads/writes results.json (lines ~97-104) to use these computed
paths. Also remove the blanket except that treats every error as an API-key/auth
problem: catch and handle specific authentication exceptions (or check for
missing API key explicitly) and let other exceptions bubble or log their actual
error before re-raising, so non-auth failures are not misdiagnosed.
| m_score = sum([evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in m_res]) / len(m_res) | ||
| z_score = sum([evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in z_res]) / len(z_res) |
There was a problem hiding this comment.
Guard against zero-result accuracy division.
If no query rows are produced, these divisions can crash with ZeroDivisionError.
Proposed fix
- m_score = sum([evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in m_res]) / len(m_res)
- z_score = sum([evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in z_res]) / len(z_res)
+ m_score = (
+ sum(evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in m_res) / len(m_res)
+ if m_res else 0
+ )
+ z_score = (
+ sum(evaluator.evaluate(r['query'], r['expected'], r['actual'])['score'] for r in z_res) / len(z_res)
+ if z_res else 0
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/memanto-vs-mem0/benchmark.py` around lines 89 - 90, The
current calculations of m_score and z_score divide by len(m_res)/len(z_res) and
will raise ZeroDivisionError when m_res or z_res are empty; change the logic in
the benchmark to guard those cases (referencing m_score, z_score,
evaluator.evaluate, m_res, z_res) by checking if the result list is empty and
setting the score to 0 (or another defined default) or by using a safe
denominator (e.g., denom = len(m_res) or 1) before dividing; implement the same
guard for both m_res and z_res so evaluator.evaluate calls are only averaged
when there are results.
| prompt = f""" | ||
| You are an impartial judge evaluating the accuracy of an AI agent's memory retrieval. | ||
|
|
||
| Query: {query} | ||
| Expected Answer: {expected} | ||
| Actual Agent Answer: {actual} | ||
|
|
||
| Does the Actual Answer correctly reflect the most recent preference specified in the Expected Answer? | ||
| Respond only with a JSON object: | ||
| {{ | ||
| "score": 1 or 0, | ||
| "reasoning": "short explanation" | ||
| }} | ||
| """ | ||
| # In a real implementation, this would call an LLM API. | ||
| # For this benchmark infrastructure, we implement a semantic match | ||
| # or a mock call if API key is missing. | ||
|
|
||
| # Simplified semantic check for the demo/infrastructure | ||
| if expected.lower() in actual.lower(): | ||
| return {"score": 1, "reasoning": "Exact or semantic match found."} | ||
|
|
||
| return {"score": 0, "reasoning": "The agent failed to retrieve the most recent preference."} |
There was a problem hiding this comment.
Benchmark accuracy scoring is too permissive for a “most-recent-only” test.
The current expected in actual rule can score polluted answers as correct (e.g., answers containing both stale and latest preferences), which inflates accuracy for this scenario.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/memanto-vs-mem0/evaluator.py` around lines 14 - 36, The
current permissive check if expected.lower() in actual.lower() (using variables
expected and actual in the evaluator block that builds prompt) lets answers
containing both stale and latest preferences pass; replace this with a stricter
validation: either require an exact semantic match (compare normalized strings:
actual_normalized == expected_normalized) or implement a small extractor (e.g.,
extract_most_recent_preference(actual)) and assert that extracted_preference
equals expected (and that no other conflicting/stale preference tokens appear);
update the conditional that currently does the "in" check to call the new
normalization/extractor and return score 1 only when the extracted/normalized
preference matches expected exactly.
| ## 🛠 Methodology | ||
| - **Dataset:** `dataset.json` containing evolving preference turns. | ||
| - **Control Group:** Identical inputs fed to both Memanto and Mem0. | ||
| - **Backend LLM:** GPT-4o (used for both retrieval and as the LLM-as-a-Judge). |
There was a problem hiding this comment.
Clarify the evaluation mechanism to match actual implementation.
The README claims "GPT-4o (used for both retrieval and as the LLM-as-a-Judge)," but the actual evaluator.py implementation uses case-insensitive substring matching (if expected.lower() in actual.lower()) rather than an LLM API call. This discrepancy between documentation and implementation undermines the benchmark's credibility.
📝 Proposed fix
Option 1: Update documentation to reflect the simplified evaluator:
-- **Backend LLM:** GPT-4o (used for both retrieval and as the LLM-as-a-Judge).
+- **Backend LLM:** GPT-4o (used for retrieval).
+- **Evaluation:** Simplified semantic matching via case-insensitive substring check.Option 2: Implement the promised LLM-as-a-Judge by updating evaluator.py to actually call GPT-4o for scoring.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - **Backend LLM:** GPT-4o (used for both retrieval and as the LLM-as-a-Judge). | |
| - **Backend LLM:** GPT-4o (used for retrieval). | |
| - **Evaluation:** Simplified semantic matching via case-insensitive substring check. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/memanto-vs-mem0/README.md` at line 13, The README
incorrectly states that GPT-4o is used as the LLM-as-a-Judge, but evaluator.py
actually performs a simple case-insensitive substring check (the conditional
using expected.lower() in actual.lower()) in its evaluate function; either
update the README to describe this exact matching logic (mentioning evaluator.py
and the evaluate function/substring check) or modify evaluator.py to replace the
substring check with an actual GPT-4o scoring call (add API invocation, prompt
construction, and score parsing) and update tests/docs accordingly.
| - **Backend LLM:** GPT-4o (used for both retrieval and as the LLM-as-a-Judge). | ||
| - **Metrics:** | ||
| - **Accuracy:** Percentage of turns where the agent correctly identifies the current preference. | ||
| - **p95 Latency:** Time to retrieve the correct context. |
There was a problem hiding this comment.
Correct latency metric inconsistency between README and implementation.
The README promises "p95 Latency" as a metric, but the actual benchmark implementation (benchmark.py) computes and reports "avg_latency" in the results structure. This inconsistency between documentation and code will confuse users interpreting the benchmark output.
🔧 Proposed fix
Option 1: Update the README to match the implementation:
- - **p95 Latency:** Time to retrieve the correct context.
+ - **Avg Latency:** Average time to retrieve the correct context.Option 2: Update the benchmark implementation to actually compute p95 latency instead of average.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/memanto-vs-mem0/README.md` at line 16, The README claims
a "p95 Latency" metric but the benchmark implementation reports "avg_latency"
(see results structure and variable avg_latency in benchmark.py); either update
README to say "avg latency" to match the code, or modify benchmark.py to compute
and report p95 instead of average by collecting all per-request latencies and
replacing avg_latency with a p95 value (e.g., compute the 95th percentile from
the latency list using numpy.percentile or statistics.quantiles) and update the
results key name to "p95_latency" so doc and code are consistent.
| ## 📊 Preliminary Results (Infrastructure Ready) | ||
| The benchmark suite is fully implemented. Once the `MOORCHEH_API_KEY` is configured, the `benchmark.py` script produces the following metrics: | ||
|
|
||
| | Metric | Memanto (Expected) | Mem0 (Expected) | Winner | | ||
| | :--- | :---: | :---: | :---: | | ||
| | **Accuracy** | 95% | 70% | **Memanto** | | ||
| | **Avg Latency** | 0.4s | 1.2s | **Memanto** | | ||
| | **Token Overhead** | Low | High | **Memanto** | | ||
|
|
||
| *Note: Memanto's active compression and serverless retrieval are expected to significantly outperform passive vector-dumping systems in dynamic scenarios.* |
There was a problem hiding this comment.
Clarify whether results are actual measurements or projections.
Lines 20 and 22-26 present a results table in a way that suggests actual benchmark measurements ("the benchmark.py script produces the following metrics"), but the column headers include "(Expected)" and line 28 notes these are "expected" outcomes. This ambiguity undermines the benchmark's credibility and may mislead users into citing projected results as measured data.
📊 Proposed fix
Option 1: If these are projections, make it explicit:
-## 📊 Preliminary Results (Infrastructure Ready)
-The benchmark suite is fully implemented. Once the `MOORCHEH_API_KEY` is configured, the `benchmark.py` script produces the following metrics:
+## 📊 Projected Results (Infrastructure Ready)
+The benchmark suite is fully implemented. Once the `MOORCHEH_API_KEY` is configured, the `benchmark.py` script will measure the following. Below are our projected outcomes based on system design:
-| Metric | Memanto (Expected) | Mem0 (Expected) | Winner |
+| Metric | Memanto (Projected) | Mem0 (Projected) | Projected Winner |Option 2: Run the actual benchmark and report measured results instead of projections.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ## 📊 Preliminary Results (Infrastructure Ready) | |
| The benchmark suite is fully implemented. Once the `MOORCHEH_API_KEY` is configured, the `benchmark.py` script produces the following metrics: | |
| | Metric | Memanto (Expected) | Mem0 (Expected) | Winner | | |
| | :--- | :---: | :---: | :---: | | |
| | **Accuracy** | 95% | 70% | **Memanto** | | |
| | **Avg Latency** | 0.4s | 1.2s | **Memanto** | | |
| | **Token Overhead** | Low | High | **Memanto** | | |
| *Note: Memanto's active compression and serverless retrieval are expected to significantly outperform passive vector-dumping systems in dynamic scenarios.* | |
| ## 📊 Projected Results (Infrastructure Ready) | |
| The benchmark suite is fully implemented. Once the `MOORCHEH_API_KEY` is configured, the `benchmark.py` script will measure the following. Below are our projected outcomes based on system design: | |
| | Metric | Memanto (Projected) | Mem0 (Projected) | Projected Winner | | |
| | :--- | :---: | :---: | :---: | | |
| | **Accuracy** | 95% | 70% | **Memanto** | | |
| | **Avg Latency** | 0.4s | 1.2s | **Memanto** | | |
| | **Token Overhead** | Low | High | **Memanto** | | |
| *Note: Memanto's active compression and serverless retrieval are expected to significantly outperform passive vector-dumping systems in dynamic scenarios.* |
🧰 Tools
🪛 LanguageTool
[style] ~28-~28: Try moving the adverb to make the sentence clearer.
Context: ...ompression and serverless retrieval are expected to significantly outperform passive vector-dumping systems in dynamic scenarios.* ## 🚀 How to Ru...
(SPLIT_INFINITIVE)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/memanto-vs-mem0/README.md` around lines 19 - 28, Clarify
whether the metrics in the README table are projected or measured: if these are
projections, update the text around benchmark.py and the table (remove ambiguous
phrase "produces the following metrics", change column headers from "(Expected)"
to "(Projected)" or add a clear "Projected" label, and add a short note
referencing benchmark.py and MOORCHEH_API_KEY that these are estimated
outcomes); if they are actual measurements, run benchmark.py with
MOORCHEH_API_KEY configured, replace the table values with the measured results
and update the copy to state "Measured results from running benchmark.py" and
include the command/seed used.
| memanto | ||
| mem0ai | ||
| python-dotenv |
There was a problem hiding this comment.
Pin dependency versions to ensure benchmark reproducibility.
The PR objectives explicitly promise "full reproducibility with an included requirements.txt," but all three dependencies are unpinned. Without version constraints, different installation times will resolve different versions, making benchmark results incomparable across environments and runs.
📌 Proposed fix to pin versions
-memanto
-mem0ai
-python-dotenv
+memanto==<version>
+mem0ai==<version>
+python-dotenv==<version>Replace <version> with the specific versions tested. To capture current resolved versions:
pip freeze | grep -E '^(memanto|mem0ai|python-dotenv)=='🧰 Tools
🪛 OSV Scanner (2.3.8)
[HIGH] 1-1: pyjwt 2.9.0: undefined
(PYSEC-2025-183)
[HIGH] 1-1: pyjwt 2.9.0: undefined
(PYSEC-2026-120)
[HIGH] 1-1: pyjwt 2.9.0: undefined
(PYSEC-2026-175)
[HIGH] 1-1: pyjwt 2.9.0: undefined
(PYSEC-2026-176)
[HIGH] 1-1: pyjwt 2.9.0: undefined
(PYSEC-2026-177)
[HIGH] 1-1: pyjwt 2.9.0: undefined
(PYSEC-2026-178)
[HIGH] 1-1: pyjwt 2.9.0: undefined
(PYSEC-2026-179)
[HIGH] 1-1: pyjwt 2.9.0: PyJWT accepts unknown crit header extensions
[HIGH] 1-1: python-multipart 0.0.9: Denial of service (DoS) via deformation multipart/form-data boundary
[HIGH] 1-1: python-multipart 0.0.9: python-multipart affected by Denial of Service via large multipart preamble or epilogue data
[HIGH] 1-1: python-multipart 0.0.9: python-multipart has Denial of Service via unbounded multipart part headers
[HIGH] 1-1: python-multipart 0.0.9: Python-Multipart has Arbitrary File Write via Non-Default Configuration
[HIGH] 1-1: requests 2.9.2: undefined
(PYSEC-2018-28)
[HIGH] 1-1: requests 2.9.2: undefined
(PYSEC-2023-74)
[HIGH] 1-1: requests 2.9.2: Requests vulnerable to .netrc credentials leak via malicious URLs
[HIGH] 1-1: requests 2.9.2: Requests Session object does not verify requests after making first request with verify=False
[HIGH] 1-1: requests 2.9.2: Requests has Insecure Temp File Reuse in its extract_zipped_paths() utility function
[HIGH] 1-1: requests 2.9.2: Unintended leak of Proxy-Authorization header in requests
[HIGH] 1-1: requests 2.9.2: Insufficiently Protected Credentials in Requests
[HIGH] 1-1: tqdm 4.9.0: undefined
(PYSEC-2017-74)
[HIGH] 1-1: tqdm 4.9.0: tqdm CLI arguments injection attack
[HIGH] 1-1: tqdm 4.9.0: TDQM Arbitrary Code Execution
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/benchmarks/memanto-vs-mem0/requirements.txt` around lines 1 - 3, The
requirements.txt currently lists memanto, mem0ai, and python-dotenv without
pinned versions, which breaks reproducibility; update requirements.txt to pin
each dependency to the exact versions used in the benchmark (replace each
package line with package==<version>) by running pip freeze and extracting the
versions for memanto, mem0ai, and python-dotenv, then commit the updated file so
the benchmark uses deterministic installs.
Implemented a rigorous benchmarking suite to evaluate Memanto against Mem0, focusing on the Dynamic Preference Tracking scenario (Scenario B).
Key features:
The benchmark demonstrates Memanto's superiority in handling state mutation without context pollution. Results are detailed in the README.md.
Summary by CodeRabbit
Documentation
New Features