feat: add reproducible benchmark suite for Memanto vs Mem0 #639 by song11071696 · Pull Request #720 · moorcheh-ai/memanto

song11071696 · 2026-06-10T07:59:04Z

Closes #639

Changes

Built comprehensive benchmark suite comparing Memanto vs Mem0
Two scenarios: Context-Overhead Latency Sprint + Shifting Persona Tracking
LLM-as-a-Judge evaluator for retrieval accuracy
15 unit tests all passing
Dry-run mode works without API keys
HTML/JSON report generation with Chart.js visualizations

How to run

pip install -r requirements.txt
python run_benchmark.py --dry-run

Summary by CodeRabbit

New Features
- Added a memory-benchmark suite with two scenarios, two adapter backends, an evaluator, dry-run mode, and package exports.
- CLI runner producing console, JSON, and HTML reports with charts.
Documentation
- Replaced README with benchmark quick-start, scenarios, metrics, reproducibility notes; updated environment template with new API/judge/benchmark controls.
Tests
- Added unit and dry-run integration tests covering adapters, evaluator, and scenarios.
Chores
- Added/updated runtime and test dependencies.

coderabbitai · 2026-06-10T07:59:13Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a benchmarking suite comparing Memanto and Mem0: package exports and base types, Memanto/Mem0 adapters, an LLM evaluator with fallback, two scenario runners with datasets, an orchestrator producing console/JSON/HTML reports, README/.env.example updates, pinned deps, and pytest tests.

Changes

Benchmarking Suite Implementation

Layer / File(s)	Summary
Environment and Documentation Setup `.env.example`, `README.md`	Replaced project docs with a benchmark-focused README and `.env.example` including `MEM0_API_KEY`, `OPENAI_API_KEY`, `JUDGE_MODEL`, `BENCHMARK_RUNS`, and `RANDOM_SEED`.
Benchmark framework base types `benchmarks/__init__.py`, `benchmarks/base.py`	Adds `MemoryResult`, `BenchmarkMetric` (p95/mean helpers, `to_dict`), `MemoryAdapter` ABC, and package exports.
LLM-Based Retrieval Evaluation `benchmarks/evaluator.py`	Adds `LLMEvaluator` using OpenAI chat completions with JSON output and a deterministic keyword-overlap fallback.
Mem0 adapter `benchmarks/mem0_adapter.py`	Implements `Mem0Adapter` with setup requiring `MEM0_API_KEY`, and CRUD methods returning `MemoryResult` with token estimates.
Memanto adapter `benchmarks/memanto_adapter.py`	Implements `MemantoAdapter` requiring `MOORCHEH_API_KEY`, adding `_ensure_setup()` and wrapping Memanto client CRUD into `MemoryResult` with token estimation and cleanup handling.
Benchmark scenarios and datasets `benchmarks/scenario_a.py`, `benchmarks/scenario_b.py`, `datasets/technical_logs.json`, `datasets/persona_evolution.json`	Adds Scenario A (technical logs) and Scenario B (persona evolution) runners with dry-run support and two datasets for reproducible evaluation.
Benchmark orchestration and reporting `run_benchmark.py`	Adds CLI runner that executes runs across adapters/scenarios, aggregates metrics, and emits console, JSON, and HTML (Chart.js via Jinja2) reports.
Dependencies and test coverage `requirements.txt`, `tests/test_adapters.py`	Pins runtime/test deps and adds pytest tests for base types, evaluator fallback, and dry-run scenario executions.

Sequence Diagram(s)

sequenceDiagram
  participant Runner as run_benchmark.py
  participant Scenario as run_scenario_{a,b}
  participant Adapter as MemoryAdapter
  participant MemClient as MemantoClient/Mem0Client
  participant Evaluator as LLMEvaluator
  participant OpenAI as OpenAI API

  Runner->>Scenario: invoke scenario with adapter & evaluator
  Scenario->>Adapter: adapter.setup(user_id)
  loop ingest entries
    Scenario->>Adapter: store(content, metadata)
    Adapter->>MemClient: add/store request
    MemClient-->>Adapter: store response
    Adapter-->>Scenario: MemoryResult (latency,tokens)
  end
  loop retrievals
    Scenario->>Adapter: retrieve(query, limit)
    Adapter->>MemClient: search request
    MemClient-->>Adapter: search results
    Adapter-->>Scenario: MemoryResult (data)
    Scenario->>Evaluator: score_retrieval(query,golden,retrieved)
    Evaluator->>OpenAI: chat.completions.create(...)
    OpenAI-->>Evaluator: JSON score/reasoning
    Evaluator-->>Scenario: (score,reasoning) or fallback score
  end
  Scenario->>Adapter: cleanup()
  Runner->>Runner: aggregate results & generate reports (console, JSON, HTML)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

Neelpatel1604
Xenogents

"🐰
I hopped through logs and persona tales,
Benchmarks counted tokens on windy trails,
Memanto and Mem0 in careful plight,
Charts glow, judge scores hum through the night,
A tiny rabbit cheers — benchmarks take flight!"

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly identifies the main change: adding a reproducible benchmark suite for comparing Memanto vs Mem0, directly addressing the core objective of issue `#639`.
Linked Issues check	✅ Passed	The PR comprehensively addresses all core coding requirements from `#639`: reproducible suite comparing Memanto vs Mem0, two flagship scenarios (A and B), quantifiable metrics (tokens, p95 latency, retrieval accuracy), LLM-as-judge evaluator, datasets, requirements.txt, dry-run mode, and HTML/JSON reports.
Out of Scope Changes check	✅ Passed	All changes directly support the benchmark suite objectives: base types (MemoryResult, BenchmarkMetric, MemoryAdapter), adapters (Memanto/Mem0), scenarios A/B, evaluator, datasets, runner, tests, and configuration files are all essential to the benchmarking suite.
Docstring Coverage	✅ Passed	Docstring coverage is 92.11% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.env.example (1)
16-21: ⚠️ Potential issue | 🟠 Major

Wire up (or remove) documented BENCHMARK_RUNS/RANDOM_SEED env vars for the benchmark runner

run_benchmark.py uses CLI --runs for run count, but there are no reads of BENCHMARK_RUNS anywhere in the repo.

There is no RANDOM_SEED usage and no RNG seeding logic found in benchmark-related code; only README mentions “Random seeds set for deterministic evaluation”.

Either consume these env vars in the benchmark runner (and implement seeding) or drop them from .env.example to avoid a reproducibility/documentation mismatch.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.env.example around lines 16 - 21, The repo documents BENCHMARK_RUNS and
RANDOM_SEED but run_benchmark.py only uses CLI --runs and there is no RNG
seeding; update run_benchmark.py to read BENCHMARK_RUNS and RANDOM_SEED from the
environment as fallbacks for the CLI (use os.getenv('BENCHMARK_RUNS') to set
default for the --runs value) and implement deterministic seeding in the
benchmark startup (e.g., a seed_randomness(seed) helper called from main() that
seeds Python random, numpy.random, and any other RNGs) so env vars are actually
consumed; alternatively remove BENCHMARK_RUNS and RANDOM_SEED from .env.example
if you prefer not to support env-based configuration.

🧹 Nitpick comments (12)

README.md (2)

54-76: 💤 Low value

Add language identifier to fenced code block.

The fenced code block displaying the project structure should specify a language (e.g., text or plaintext) for better Markdown compliance.

📝 Proposed fix

-```
+```text
 memanto-benchmark/
 ├── README.md                 # This file

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 54 - 76, Summary: The README's fenced code block
showing the project tree lacks a language identifier. Fix: open the README.md,
locate the fenced block containing the ASCII project structure (the block that
begins with "memanto-benchmark/" and the triple backticks), and add a language
tag such as text or plaintext after the opening backticks (e.g., ```text) so the
code fence is Markdown-compliant; ensure you update only that fence and preserve
the existing ASCII tree content.

Source: Linters/SAST tools

78-92: ⚡ Quick win

Clarify that benchmark runs are configurable.

Line 85 states "3 runs per scenario" as if it's fixed, but .env.example defines BENCHMARK_RUNS as a configurable parameter (default: 3). Similarly, line 90 references "Random seeds set" which appears to rely on the RANDOM_SEED env var. Consider rephrasing to indicate these are configurable defaults rather than fixed values.

📝 Suggested clarification

-- **3 runs per scenario** with statistical aggregation
+- **Configurable runs per scenario** (default: 3) with statistical aggregation

And for line 90:

-- Random seeds set for deterministic evaluation
+- Configurable random seed for deterministic evaluation (default: 42)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 78 - 92, Update the "🔬 Experimental Design" text to
show that run counts and seeds are configurable: change the phrasing that
currently reads "3 runs per scenario" to indicate it is the default controlled
by the BENCHMARK_RUNS environment variable (default: 3), and change "Random
seeds set for deterministic evaluation" to note the RANDOM_SEED env var is used
to set the seed; reference these env vars (BENCHMARK_RUNS, RANDOM_SEED) in the
Reproducibility or Isolation & Controls subsection and say they are configurable
defaults rather than fixed values.

benchmarks/mem0_adapter.py (2)

35-35: 💤 Low value

Token estimation heuristic lacks justification.

The formula len(content.split()) * 2 is used across multiple methods to estimate token usage, but this multiplier (×2) appears arbitrary and may not reflect actual token consumption by the Mem0 API.

Real token counts depend on the tokenizer used by Mem0's backend (likely tiktoken for GPT models), where tokens ≠ words. A more accurate approach would be to check if the API response includes actual token usage or document the rationale for this heuristic.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/mem0_adapter.py` at line 35, The current heuristic estimating
tokens via tokens = len(content.split()) * 2 is arbitrary; replace it by using
the actual token-count returned by Mem0 API when available (use the response's
usage field) or, if the API doesn't provide usage, compute tokens via a proper
tokenizer (e.g., tiktoken encoding relevant to the model) instead of
content.split(); update every place that uses the tokens variable (the tokens
assignment and any methods in this file that reuse that heuristic) to either
read response.usage.total_tokens or call a tokenizer helper, and add a brief
comment documenting the fallback behavior.
31-43: ⚡ Quick win

Latency measurement pattern requires external timing.

All adapter methods return latency_ms=0, which means the actual latency measurement relies entirely on the caller using adapter.timed_call(). The comment at line 38 in memanto_adapter.py makes this explicit, but this file lacks such documentation.

While this pattern works correctly when scenarios call adapter.timed_call(adapter.store, ...), it creates a fragile contract where forgetting timed_call silently produces zero latencies in metrics.
📝 Consider adding a clarifying comment
     def store(self, content: str, metadata: dict | None = None) -> MemoryResult:
         try:
             meta = metadata or {}
             result = self._client.add(content, user_id=self._user_id, metadata=meta)
             tokens = len(content.split()) * 2
             return MemoryResult(
                 success=True,
-                latency_ms=0,
+                latency_ms=0,  # measured externally via timed_call
                 tokens_used=tokens,
                 data=result,
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/mem0_adapter.py` around lines 31 - 43, The store method in
benchmarks/mem0_adapter.py currently returns latency_ms=0 which relies on
callers wrapping calls with adapter.timed_call; add a concise clarifying comment
above the store method (similar to the note in memanto_adapter.py) stating that
latency_ms is intentionally 0 and callers must call
adapter.timed_call(self.store, ...) (or equivalent) to capture real latency
metrics, and reference MemoryResult and the _client.add call in the comment so
future maintainers know this is by design rather than a bug.

benchmarks/memanto_adapter.py (2)

49-51: 💤 Low value

Inefficient token estimation in retrieve().

The token calculation iterates through all retrieved memories and converts each to a string, even when memories are already strings. The nested str(m) call on line 50 is redundant when m is already a string, and the estimation could be optimized.

♻️ Minor optimization

             memories = result if isinstance(result, list) else [result]
             total_tokens = sum(
-                len(str(m).split()) * 2 for m in memories
+                len(str(m).split()) * 2 if not isinstance(m, str) 
+                else len(m.split()) * 2 
+                for m in memories
             )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/memanto_adapter.py` around lines 49 - 51, The token estimation in
retrieve() is inefficient because it calls str(m) for every memory even when m
is already a string; update the computation of total_tokens (and any helper used
there) to avoid redundant str() by checking isinstance(m, str) or converting
once (e.g., text = m if isinstance(m, str) else str(m)) and then counting tokens
(len(text.split()) * 2) for each item in memories; modify the total_tokens
generator/loop accordingly to prevent repeated conversions while keeping the
same token formula.

35-35: 💤 Low value

Token estimation fallback may mask API changes.

The getattr(result, "tokens_used", len(content.split()) * 2) pattern provides resilience when the Memanto API doesn't return tokens_used, but silently falling back to a heuristic could mask:

API contract breaks where tokens_used is expected but missing
Inaccurate token accounting when the fallback is used

Consider logging when the fallback is triggered to aid debugging.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/memanto_adapter.py` at line 35, The token fallback silently masks
missing API fields; change the single-line fallback assignment tokens =
getattr(result, "tokens_used", len(content.split()) * 2) to explicitly detect
absence (use hasattr(result, "tokens_used") or
result.__dict__.get("tokens_used") is None), compute the heuristic only in that
branch, and emit a warning-level log that includes context (the result object or
its keys, the content length, and the heuristic value) so you can detect API
changes or when the heuristic is used; keep the original value name tokens and
the same heuristic calculation but only assign it after logging the fallback.

benchmarks/scenario_a.py (2)

21-24: 💤 Low value

Dataset loading lacks error handling.

load_dataset() will raise FileNotFoundError or json.JSONDecodeError if the dataset file is missing or malformed, but these exceptions are not caught. While Python's stack traces are informative, a benchmark runner would benefit from a clearer error message.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/scenario_a.py` around lines 21 - 24, The load_dataset() function
currently opens DATASET_PATH and calls json.load without handling errors; wrap
the file open and json.load call in a try/except that catches FileNotFoundError
and json.JSONDecodeError, include the original exception details in a clearer
error message (e.g., via logging or raising a new RuntimeError) and re-raise or
exit so the benchmark runner gets a readable, actionable message; update
references to DATASET_PATH and load_dataset() only (no other logic changes).
78-83: 💤 Low value

Query extraction fallback may produce low-quality queries.

When retrieval_queries is missing or empty, the fallback generates queries from the first 100 characters of the first 5 log entries (lines 82-83). These truncated content strings may not be meaningful queries, especially if the logs start with timestamps or metadata rather than semantic content.

This could produce artificially low retrieval accuracy scores.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/scenario_a.py` around lines 78 - 83, The current fallback that
builds queries when retrieval_queries is missing uses truncated prefixes
(queries variable built from dataset[:5] with entry["content"][:100]), which can
yield poor, non-semantic queries; update the fallback in the queries
construction to produce higher-quality queries by (1) selecting entries with
substantive content (e.g., filter out lines that look like timestamps/metadata),
(2) prefer full sentences or the longest N entries rather than the first 100
chars, and (3) optionally extract a question-like phrase or heuristically trim
to the first meaningful sentence; change the logic around the queries variable
and the dataset[:5] fallback to implement these heuristics (e.g., filter entries
by content length and regex for timestamps, then extract first sentence or
longest content) so generated queries are semantic and representative.

tests/test_adapters.py (2)

128-128: 💤 Low value

Remove unused variable monkeypatch_env.

Line 128 defines monkeypatch_env = {"OPENAI_API_KEY": ""} but never uses it. The test correctly instantiates LLMEvaluator(api_key="") to bypass environment-based API key loading, making this variable unnecessary.

🧹 Remove unused variable

         from benchmarks.scenario_a import run_scenario_a
         from benchmarks.evaluator import LLMEvaluator
-        import os
-        monkeypatch_env = {"OPENAI_API_KEY": ""}
         evaluator = LLMEvaluator(api_key="")
         adapter = MockAdapter()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_adapters.py` at line 128, Remove the unused variable
monkeypatch_env from the test file: locate the assignment monkeypatch_env =
{"OPENAI_API_KEY": ""} in tests/test_adapters.py and delete it; the test already
constructs LLMEvaluator(api_key="") so no environment-mocking is needed and
removing the unused monkeypatch_env variable will clean up the test.

107-135: ⚡ Quick win

Extract MockAdapter to eliminate duplication across test scenarios.

The MockAdapter implementation in TestDryRunScenarioA (lines 109-123) is duplicated identically in TestDryRunScenarioB (lines 149-163). Extracting the mock to a shared test fixture or module-level helper reduces maintenance burden and ensures consistent adapter behavior across integration tests.

♻️ Extract MockAdapter to module-level fixture

Add before the test classes (after imports):

class MockAdapter(MemoryAdapter):
    """Shared mock adapter for dry-run integration tests."""
    `@property`
    def name(self):
        return "Mock"
    
    def setup(self, uid):
        pass
    
    def store(self, content, metadata=None):
        return MemoryResult(
            success=True,
            latency_ms=5.0,
            tokens_used=len(content.split()) * 2
        )
    
    def retrieve(self, query, limit=5):
        return MemoryResult(
            success=True,
            latency_ms=3.0,
            tokens_used=100,
            data=["mock memory"]
        )
    
    def update(self, mid, content):
        return MemoryResult(success=True, latency_ms=5.0)
    
    def delete(self, mid):
        return MemoryResult(success=True, latency_ms=2.0)
    
    def get_all(self):
        return MemoryResult(success=True, latency_ms=1.0, data=[])
    
    def cleanup(self):
        pass

Then remove the inline MockAdapter definitions from both test methods and instantiate the shared class.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_adapters.py` around lines 107 - 135, The MockAdapter class is
duplicated between TestDryRunScenarioA and TestDryRunScenarioB; extract a single
shared MockAdapter (subclassing MemoryAdapter and returning MemoryResult
instances) at module scope or as a fixture, then remove the inline class
definitions from both tests and instantiate the shared MockAdapter in each test;
ensure the shared class implements the same methods used in the diff (name
property, setup, store, retrieve, update, delete, get_all, cleanup) and returns
the same MemoryResult shapes so run_scenario_a/run_scenario_b and
LLMEvaluator-based assertions remain unchanged.

run_benchmark.py (2)

362-364: 💤 Low value

Timestamp inconsistency between benchmark results and report filenames.

Line 362 generates a new timestamp for report filenames, but run_benchmark already assigned a timestamp to each result dict at line 93. If the benchmark run spans a second boundary, the filenames will use a different timestamp than the one embedded in results[0]["timestamp"], creating a minor inconsistency in reports.

🔧 Use the benchmark session timestamp from results

-    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
+    timestamp = results[0]["timestamp"] if results else datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
     json_path = REPORTS_DIR / f"benchmark_results_{timestamp}.json"
     html_path = REPORTS_DIR / f"benchmark_report_{timestamp}.html"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@run_benchmark.py` around lines 362 - 364, The filenames are using a fresh
datetime instead of the benchmark session timestamp stored in the results,
causing possible mismatch; change the logic that sets json_path and html_path to
derive the timestamp from results[0]["timestamp"] (from the run_benchmark
results list) rather than calling datetime.now(), validating that results is
non-empty and that the timestamp key exists and falling back to
datetime.now(timezone.utc).strftime(...) only if results are missing or
malformed; update the variables json_path and html_path to use that session
timestamp so the filenames match the embedded result timestamp.

99-150: ⚖️ Poor tradeoff

Extract shared aggregation logic to eliminate duplication.

The aggregation logic in generate_console_report (lines 104-139) and generate_html_report (lines 265-308) is duplicated: both group results by (framework, scenario) and compute means/sums over the same metric fields using identical patterns. This violates the DRY principle and increases maintenance burden if aggregation logic needs to change.

♻️ Refactor to extract shared aggregation helper

Add a helper function before generate_console_report:

def _aggregate_results(results: list[dict]) -> dict:
    """Aggregate results by (framework, scenario) with computed means/sums."""
    import numpy as np
    aggregated = {}
    for r in results:
        key = (r["framework"], r["scenario"])
        if key not in aggregated:
            aggregated[key] = {
                "framework": r["framework"],
                "scenario": r["scenario"],
                "store_tokens": [],
                "retrieve_tokens": [],
                "store_p95": [],
                "retrieve_p95": [],
                "accuracy": [],
                "errors": [],
            }
        agg = aggregated[key]
        agg["store_tokens"].append(r["total_store_tokens"])
        agg["retrieve_tokens"].append(r["total_retrieve_tokens"])
        agg["store_p95"].append(r["store_p95_latency_ms"])
        agg["retrieve_p95"].append(r["retrieve_p95_latency_ms"])
        agg["accuracy"].append(r["retrieval_accuracy"])
        agg["errors"].append(r["errors"])
    
    # Compute means/sums
    for key, agg in aggregated.items():
        agg["mean_store_tokens"] = int(np.mean(agg["store_tokens"]))
        agg["mean_retrieve_tokens"] = int(np.mean(agg["retrieve_tokens"]))
        agg["mean_store_p95"] = round(np.mean(agg["store_p95"]), 1)
        agg["mean_retrieve_p95"] = round(np.mean(agg["retrieve_p95"]), 1)
        agg["mean_accuracy"] = round(np.mean(agg["accuracy"]), 3)
        agg["total_errors"] = sum(agg["errors"])
    
    return aggregated

Then refactor both report functions to call _aggregate_results(results) and consume the computed means.

Also applies to: 170-320

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@run_benchmark.py` around lines 99 - 150, The aggregation block duplicated in
generate_console_report and generate_html_report should be extracted into a
shared helper _aggregate_results(results: list[dict]) that groups by
(r["framework"], r["scenario"]), accumulates lists for
"store_tokens","retrieve_tokens","store_p95","retrieve_p95","accuracy","errors",
then computes and stores aggregated values (e.g. mean_store_tokens,
mean_retrieve_tokens, mean_store_p95, mean_retrieve_p95, mean_accuracy,
total_errors) using numpy and returns the aggregated dict; update
generate_console_report and generate_html_report to call
_aggregate_results(results) and read the precomputed mean_* and total_errors
fields instead of repeating the accumulation logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/base.py`:
- Around line 37-50: The p95 calculation in store_p95_latency and
retrieve_p95_latency uses idx = int(len(sorted_l) * 0.95) which can select the
maximum element for many n (off-by-one); change the index calculation to use the
1-based percentile mapping: idx = max(0, math.ceil(len(sorted_l) * 0.95) - 1)
(add import math if missing) so the 95th percentile selects the correct element
from sorted_l in both store_p95_latency and retrieve_p95_latency.

In `@benchmarks/evaluator.py`:
- Line 68: The returned judge score should be normalized to the [0.0, 1.0]
contract before being used; replace the direct return of
float(parsed.get("score", 0.0)) with a safe conversion that handles non-numeric
values (fall back to 0.0), then clamp the resulting value to the range 0.0..1.0
(e.g., max(0.0, min(1.0, score))), and return that normalized score alongside
parsed.get("reasoning",""); update the return site where the tuple is produced
(the line using parsed.get("score", 0.0) and parsed.get("reasoning")) so
downstream metrics like retrieval_accuracy remain valid and comparable to
fallback runs.

In `@benchmarks/mem0_adapter.py`:
- Around line 85-89: The cleanup() method currently swallows all exceptions from
self._client.delete_all(user_id=self._user_id), which hides failures; update
cleanup() to catch exceptions but log the error (including exception details and
context like the user id and operation) via the project logger or raise a
wrapped exception instead of silently passing so cleanup failures are
observable; locate the cleanup method and modify the except block around
self._client.delete_all to record the exception (or re-raise) while preserving
original exception information.
- Around line 23-29: The setup method initializes MemoryClient correctly but
doesn't validate the incoming user_id; update the setup function to check that
user_id is non-empty (e.g., raise ValueError("user_id is required") when blank)
before assigning self._user_id, and keep the existing
MemoryClient(api_key=api_key) initialization using MemoryClient to preserve SDK
usage; refer to setup, self._client, and self._user_id when making the change.

In `@benchmarks/scenario_a.py`:
- Around line 86-91: The code reads query_item["query"] (and ["golden_answer"])
without validation which can raise KeyError for malformed dataset entries;
update the parsing in both scenario_a and scenario_b to use
query_item.get("query") and query_item.get("golden_answer", "") (or similar .get
defaults), validate that query is truthy before using it, and on missing/invalid
entries log the error and increment the existing error/metrics counter (e.g.,
error metric or errors.increment) and skip that item instead of letting it crash
the benchmark.

In `@README.md`:
- Around line 43-51: The environment variables table under "## ⚙️ Environment
Variables" is missing BENCHMARK_RUNS and RANDOM_SEED; add two rows for
`BENCHMARK_RUNS` (describe as number of times to run benchmark/default value or
optional) and `RANDOM_SEED` (describe as deterministic seed for
benchmark/randomness/default value or optional) to document their purpose and
whether they are required, referencing the variables by name so users can find
them easily.

In `@requirements.txt`:
- Line 2: requirements.txt currently allows mem0ai>=0.1.0 which is too
permissive given advisories; tighten the version specifier (e.g., set a safe
minimum like mem0ai>=2.0.4 or another non-vulnerable minimum) to exclude the
vulnerable ranges (<=1.0.0 and <2.0.0b2), and ensure benchmark code actually
uses authenticated endpoints by verifying the MemoryClient construction in
benchmarks/mem0_adapter.py and memanto/cli/analyze/mem0_export.py requires a
non-empty MEM0_API_KEY (fail fast if missing) and does not default to any
open/demo server URL so the tests target an endpoint that enforces
auth/authorization.

---

Outside diff comments:
In @.env.example:
- Around line 16-21: The repo documents BENCHMARK_RUNS and RANDOM_SEED but
run_benchmark.py only uses CLI --runs and there is no RNG seeding; update
run_benchmark.py to read BENCHMARK_RUNS and RANDOM_SEED from the environment as
fallbacks for the CLI (use os.getenv('BENCHMARK_RUNS') to set default for the
--runs value) and implement deterministic seeding in the benchmark startup
(e.g., a seed_randomness(seed) helper called from main() that seeds Python
random, numpy.random, and any other RNGs) so env vars are actually consumed;
alternatively remove BENCHMARK_RUNS and RANDOM_SEED from .env.example if you
prefer not to support env-based configuration.

---

Nitpick comments:
In `@benchmarks/mem0_adapter.py`:
- Line 35: The current heuristic estimating tokens via tokens =
len(content.split()) * 2 is arbitrary; replace it by using the actual
token-count returned by Mem0 API when available (use the response's usage field)
or, if the API doesn't provide usage, compute tokens via a proper tokenizer
(e.g., tiktoken encoding relevant to the model) instead of content.split();
update every place that uses the tokens variable (the tokens assignment and any
methods in this file that reuse that heuristic) to either read
response.usage.total_tokens or call a tokenizer helper, and add a brief comment
documenting the fallback behavior.
- Around line 31-43: The store method in benchmarks/mem0_adapter.py currently
returns latency_ms=0 which relies on callers wrapping calls with
adapter.timed_call; add a concise clarifying comment above the store method
(similar to the note in memanto_adapter.py) stating that latency_ms is
intentionally 0 and callers must call adapter.timed_call(self.store, ...) (or
equivalent) to capture real latency metrics, and reference MemoryResult and the
_client.add call in the comment so future maintainers know this is by design
rather than a bug.

In `@benchmarks/memanto_adapter.py`:
- Around line 49-51: The token estimation in retrieve() is inefficient because
it calls str(m) for every memory even when m is already a string; update the
computation of total_tokens (and any helper used there) to avoid redundant str()
by checking isinstance(m, str) or converting once (e.g., text = m if
isinstance(m, str) else str(m)) and then counting tokens (len(text.split()) * 2)
for each item in memories; modify the total_tokens generator/loop accordingly to
prevent repeated conversions while keeping the same token formula.
- Line 35: The token fallback silently masks missing API fields; change the
single-line fallback assignment tokens = getattr(result, "tokens_used",
len(content.split()) * 2) to explicitly detect absence (use hasattr(result,
"tokens_used") or result.__dict__.get("tokens_used") is None), compute the
heuristic only in that branch, and emit a warning-level log that includes
context (the result object or its keys, the content length, and the heuristic
value) so you can detect API changes or when the heuristic is used; keep the
original value name tokens and the same heuristic calculation but only assign it
after logging the fallback.

In `@benchmarks/scenario_a.py`:
- Around line 21-24: The load_dataset() function currently opens DATASET_PATH
and calls json.load without handling errors; wrap the file open and json.load
call in a try/except that catches FileNotFoundError and json.JSONDecodeError,
include the original exception details in a clearer error message (e.g., via
logging or raising a new RuntimeError) and re-raise or exit so the benchmark
runner gets a readable, actionable message; update references to DATASET_PATH
and load_dataset() only (no other logic changes).
- Around line 78-83: The current fallback that builds queries when
retrieval_queries is missing uses truncated prefixes (queries variable built
from dataset[:5] with entry["content"][:100]), which can yield poor,
non-semantic queries; update the fallback in the queries construction to produce
higher-quality queries by (1) selecting entries with substantive content (e.g.,
filter out lines that look like timestamps/metadata), (2) prefer full sentences
or the longest N entries rather than the first 100 chars, and (3) optionally
extract a question-like phrase or heuristically trim to the first meaningful
sentence; change the logic around the queries variable and the dataset[:5]
fallback to implement these heuristics (e.g., filter entries by content length
and regex for timestamps, then extract first sentence or longest content) so
generated queries are semantic and representative.

In `@README.md`:
- Around line 54-76: Summary: The README's fenced code block showing the project
tree lacks a language identifier. Fix: open the README.md, locate the fenced
block containing the ASCII project structure (the block that begins with
"memanto-benchmark/" and the triple backticks), and add a language tag such as
text or plaintext after the opening backticks (e.g., ```text) so the code fence
is Markdown-compliant; ensure you update only that fence and preserve the
existing ASCII tree content.
- Around line 78-92: Update the "🔬 Experimental Design" text to show that run
counts and seeds are configurable: change the phrasing that currently reads "3
runs per scenario" to indicate it is the default controlled by the
BENCHMARK_RUNS environment variable (default: 3), and change "Random seeds set
for deterministic evaluation" to note the RANDOM_SEED env var is used to set the
seed; reference these env vars (BENCHMARK_RUNS, RANDOM_SEED) in the
Reproducibility or Isolation & Controls subsection and say they are configurable
defaults rather than fixed values.

In `@run_benchmark.py`:
- Around line 362-364: The filenames are using a fresh datetime instead of the
benchmark session timestamp stored in the results, causing possible mismatch;
change the logic that sets json_path and html_path to derive the timestamp from
results[0]["timestamp"] (from the run_benchmark results list) rather than
calling datetime.now(), validating that results is non-empty and that the
timestamp key exists and falling back to
datetime.now(timezone.utc).strftime(...) only if results are missing or
malformed; update the variables json_path and html_path to use that session
timestamp so the filenames match the embedded result timestamp.
- Around line 99-150: The aggregation block duplicated in
generate_console_report and generate_html_report should be extracted into a
shared helper _aggregate_results(results: list[dict]) that groups by
(r["framework"], r["scenario"]), accumulates lists for
"store_tokens","retrieve_tokens","store_p95","retrieve_p95","accuracy","errors",
then computes and stores aggregated values (e.g. mean_store_tokens,
mean_retrieve_tokens, mean_store_p95, mean_retrieve_p95, mean_accuracy,
total_errors) using numpy and returns the aggregated dict; update
generate_console_report and generate_html_report to call
_aggregate_results(results) and read the precomputed mean_* and total_errors
fields instead of repeating the accumulation logic.

In `@tests/test_adapters.py`:
- Line 128: Remove the unused variable monkeypatch_env from the test file:
locate the assignment monkeypatch_env = {"OPENAI_API_KEY": ""} in
tests/test_adapters.py and delete it; the test already constructs
LLMEvaluator(api_key="") so no environment-mocking is needed and removing the
unused monkeypatch_env variable will clean up the test.
- Around line 107-135: The MockAdapter class is duplicated between
TestDryRunScenarioA and TestDryRunScenarioB; extract a single shared MockAdapter
(subclassing MemoryAdapter and returning MemoryResult instances) at module scope
or as a fixture, then remove the inline class definitions from both tests and
instantiate the shared MockAdapter in each test; ensure the shared class
implements the same methods used in the diff (name property, setup, store,
retrieve, update, delete, get_all, cleanup) and returns the same MemoryResult
shapes so run_scenario_a/run_scenario_b and LLMEvaluator-based assertions remain
unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 720ede40-17dc-413a-ad8e-abebb2706f94

📥 Commits

Reviewing files that changed from the base of the PR and between b53ac14 and 9b55388.

📒 Files selected for processing (14)

.env.example
README.md
benchmarks/__init__.py
benchmarks/base.py
benchmarks/evaluator.py
benchmarks/mem0_adapter.py
benchmarks/memanto_adapter.py
benchmarks/scenario_a.py
benchmarks/scenario_b.py
datasets/persona_evolution.json
datasets/technical_logs.json
requirements.txt
run_benchmark.py
tests/test_adapters.py

coderabbitai · 2026-06-10T09:07:24Z

+    def store_p95_latency(self) -> float:
+        if not self.store_latencies:
+            return 0.0
+        sorted_l = sorted(self.store_latencies)
+        idx = int(len(sorted_l) * 0.95)
+        return sorted_l[min(idx, len(sorted_l) - 1)]
+
+    @property
+    def retrieve_p95_latency(self) -> float:
+        if not self.retrieve_latencies:
+            return 0.0
+        sorted_l = sorted(self.retrieve_latencies)
+        idx = int(len(sorted_l) * 0.95)
+        return sorted_l[min(idx, len(sorted_l) - 1)]


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix off-by-one in p95 latency calculation.

Line 41 and Line 49 currently compute int(n * 0.95), which maps many sample sizes to the max element (e.g., n=20 → index 19). That reports p100 instead of p95 and skews published benchmark metrics.

Suggested fix

`@property` def store_p95_latency(self) -> float: if not self.store_latencies: return 0.0 sorted_l = sorted(self.store_latencies) - idx = int(len(sorted_l) * 0.95) + idx = int((len(sorted_l) - 1) * 0.95) return sorted_l[min(idx, len(sorted_l) - 1)] `@property` def retrieve_p95_latency(self) -> float: if not self.retrieve_latencies: return 0.0 sorted_l = sorted(self.retrieve_latencies) - idx = int(len(sorted_l) * 0.95) + idx = int((len(sorted_l) - 1) * 0.95) return sorted_l[min(idx, len(sorted_l) - 1)]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def store_p95_latency(self) -> float:

if not self.store_latencies:

return 0.0

sorted_l = sorted(self.store_latencies)

idx = int(len(sorted_l) * 0.95)

return sorted_l[min(idx, len(sorted_l) - 1)]

@property

def retrieve_p95_latency(self) -> float:

if not self.retrieve_latencies:

return 0.0

sorted_l = sorted(self.retrieve_latencies)

idx = int(len(sorted_l) * 0.95)

return sorted_l[min(idx, len(sorted_l) - 1)]

def store_p95_latency(self) -> float:

if not self.store_latencies:

return 0.0

sorted_l = sorted(self.store_latencies)

idx = int((len(sorted_l) - 1) * 0.95)

return sorted_l[min(idx, len(sorted_l) - 1)]

`@property`

def retrieve_p95_latency(self) -> float:

if not self.retrieve_latencies:

return 0.0

sorted_l = sorted(self.retrieve_latencies)

idx = int((len(sorted_l) - 1) * 0.95)

return sorted_l[min(idx, len(sorted_l) - 1)]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/base.py` around lines 37 - 50, The p95 calculation in store_p95_latency and retrieve_p95_latency uses idx = int(len(sorted_l) * 0.95) which can select the maximum element for many n (off-by-one); change the index calculation to use the 1-based percentile mapping: idx = max(0, math.ceil(len(sorted_l) * 0.95) - 1) (add import math if missing) so the 95th percentile selects the correct element from sorted_l in both store_p95_latency and retrieve_p95_latency.

coderabbitai · 2026-06-10T09:07:24Z

+            import json
+            content = response.choices[0].message.content
+            parsed = json.loads(content)
+            return float(parsed.get("score", 0.0)), parsed.get("reasoning", "")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize judge scores to the same [0.0, 1.0] contract as fallback.

Line 68 trusts model output verbatim; if the judge returns out-of-range values, retrieval_accuracy becomes invalid and incomparable to fallback runs.

Suggested fix

- return float(parsed.get("score", 0.0)), parsed.get("reasoning", "") + raw_score = float(parsed.get("score", 0.0)) + score = max(0.0, min(1.0, raw_score)) + return score, parsed.get("reasoning", "")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

return float(parsed.get("score", 0.0)), parsed.get("reasoning", "")

raw_score = float(parsed.get("score", 0.0))

score = max(0.0, min(1.0, raw_score))

return score, parsed.get("reasoning", "")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/evaluator.py` at line 68, The returned judge score should be normalized to the [0.0, 1.0] contract before being used; replace the direct return of float(parsed.get("score", 0.0)) with a safe conversion that handles non-numeric values (fall back to 0.0), then clamp the resulting value to the range 0.0..1.0 (e.g., max(0.0, min(1.0, score))), and return that normalized score alongside parsed.get("reasoning",""); update the return site where the tuple is produced (the line using parsed.get("score", 0.0) and parsed.get("reasoning")) so downstream metrics like retrieval_accuracy remain valid and comparable to fallback runs.

coderabbitai · 2026-06-10T09:07:24Z

+    def setup(self, user_id: str) -> None:
+        api_key = os.environ.get("MEM0_API_KEY", "")
+        if not api_key:
+            raise ValueError("MEM0_API_KEY environment variable is required")
+        from mem0 import MemoryClient
+        self._client = MemoryClient(api_key=api_key)
+        self._user_id = user_id


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Verify Mem0 client initialization requirements # Check if there are any examples or documentation about MemoryClient usage rg -nP -A5 -B5 'MemoryClient\s*\(' --type=py # Search for user_id parameter usage in mem0 contexts rg -nP 'user_id' --type=py -g '*mem0*'

Repository: moorcheh-ai/memanto

Length of output: 2287

🌐 Web query:

Mem0 Python SDK MemoryClient initialization and user_id configuration

💡 Result:

To initialize the Mem0 Python SDK and configure user-specific memory, follow these steps: Initialization To use the Mem0 Platform API, you initialize the MemoryClient class [1][2]. The api_key parameter is required; it can be passed directly or set via the MEM0_API_KEY environment variable [3][4][2]. import {mem0} from mem0 # Initialize client client = MemoryClient(api_key="your-api-key") Optional configuration parameters for the constructor include host, org_id, and project_id [1][2]. User ID Configuration In Mem0, user_id is not a global configuration setting for the client instance itself [5][2]. Instead, it is passed as a mandatory or contextual parameter to individual memory operations (such as add, search, get_all, or delete) to scope the operation to a specific user [3][6][7][2]. Adding memory for a specific user: client.add( messages=[{"role": "user", "content": "I like dark chocolate."}], user_id="user123") Searching memory for a specific user: results = client.search("What do I like?", user_id="user123") Scope and Entity Configuration Mem0 uses a multi-dimensional scoping system where you can define boundaries using user_id, agent_id, app_id, and run_id [7]. When performing operations, you specify these identifiers to maintain data privacy and context [5][7]. Note that memory records are scoped to these primary entities; for example, if you include both a user_id and an agent_id in an operation, Mem0 treats them as specific entity scopes [7]. Top results: [1][3][6][7][2]

Citations:

1: https://github.com/mem0ai/mem0/blob/HEAD/skills/mem0/references/sdk-guide.md

2: https://github.com/mem0ai/mem0/blob/ece7ff6b/LLM.md

3: https://github.com/mem0ai/mem0/blob/HEAD/skills/mem0/references/quickstart.md

4: https://docs.mem0.ai/platform/quickstart

5: https://github.com/mem0ai/mem0/blob/main/mem0/memory/main.py

6: https://docs.mem0.ai/core-concepts/memory-operations/add

7: https://docs.mem0.ai/platform/features/entity-scoped-memory

Review Mem0 client initialization and user_id scoping in benchmarks/mem0_adapter.py

Mem0 docs state MemoryClient initialization requires api_key; other constructor options like host/org_id/project_id are optional, so MemoryClient(api_key=api_key) is consistent with the SDK’s expected initialization pattern.

user_id is not configured at the client level in Mem0; it’s provided to each memory operation for scoping (and this adapter passes user_id for add, search, get_all, and delete_all). setup() therefore doesn’t need to “register” user_id; adding a basic guard (e.g., non-empty user_id) before storing/using it would be a maintainability improvement.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/mem0_adapter.py` around lines 23 - 29, The setup method initializes MemoryClient correctly but doesn't validate the incoming user_id; update the setup function to check that user_id is non-empty (e.g., raise ValueError("user_id is required") when blank) before assigning self._user_id, and keep the existing MemoryClient(api_key=api_key) initialization using MemoryClient to preserve SDK usage; refer to setup, self._client, and self._user_id when making the change.

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (2)

requirements.txt (1)
2-2: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten mem0ai version constraint to exclude vulnerable versions.

The current specification mem0ai>=0.1.0 allows vulnerable versions. Security advisories indicate issues in versions ≤1.0.0 and <2.0.0b2 (authentication/authorization bypass, improper input validation). The latest version on PyPI is 2.0.4. The benchmark adapters already pass API keys, but you should ensure the specific mem0 endpoints used enforce authentication.
🔒 Proposed fix
-mem0ai>=0.1.0
+mem0ai>=2.0.4
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@requirements.txt` at line 2, Update the mem0ai requirement to avoid known
vulnerable releases by tightening the constraint from "mem0ai>=0.1.0" to a safe
range (e.g., "mem0ai>=2.0.4,<3.0.0") in requirements.txt; keep the benchmark
adapters that call mem0 endpoints (the existing adapter code) using explicit API
key parameters and verify those endpoint calls enforce authentication by
checking the adapter functions that pass the API keys.
benchmarks/base.py (1)
37-50: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix off-by-one error in p95 latency calculation.

Both store_p95_latency and retrieve_p95_latency use idx = int(len(sorted_l) * 0.95), which incorrectly returns the maximum element for many sample sizes. For example, with n=10 items, this computes idx=9 (the last element), reporting the 100th percentile instead of the 95th percentile. This corrupts published benchmark metrics and misrepresents the performance comparison between Memanto and Mem0.
🐛 Proposed fix
     `@property`
     def store_p95_latency(self) -> float:
         if not self.store_latencies:
             return 0.0
         sorted_l = sorted(self.store_latencies)
-        idx = int(len(sorted_l) * 0.95)
+        idx = int((len(sorted_l) - 1) * 0.95)
         return sorted_l[min(idx, len(sorted_l) - 1)]

     `@property`
     def retrieve_p95_latency(self) -> float:
         if not self.retrieve_latencies:
             return 0.0
         sorted_l = sorted(self.retrieve_latencies)
-        idx = int(len(sorted_l) * 0.95)
+        idx = int((len(sorted_l) - 1) * 0.95)
         return sorted_l[min(idx, len(sorted_l) - 1)]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/base.py` around lines 37 - 50, The p95 calculation in
store_p95_latency and retrieve_p95_latency uses idx = int(len(sorted_l) * 0.95)
which can pick the last element (100th percentile) for many n; change the index
computation to compute the 95th percentile rank properly (e.g., idx =
math.ceil(0.95 * len(sorted_l)) - 1) and then clamp with min/max to the valid
range; update both functions and add the necessary math import if not present.

🧹 Nitpick comments (4)

tests/test_adapters.py (3)

109-123: 💤 Low value

Consider extracting MockAdapter to a shared pytest fixture.

The MockAdapter implementation is duplicated in both TestDryRunScenarioA (lines 109-123) and TestDryRunScenarioB (lines 149-163) with nearly identical code. Extracting to a pytest fixture would reduce duplication and improve maintainability.

♻️ Example refactor

`@pytest.fixture`
def mock_adapter():
    class MockAdapter(MemoryAdapter):
        `@property`
        def name(self): return "Mock"
        def setup(self, uid): pass
        def store(self, content, metadata=None):
            return MemoryResult(True, 5.0, len(content.split())*2)
        def retrieve(self, query, limit=5):
            return MemoryResult(True, 3.0, 100, ["mock memory"])
        def update(self, mid, content):
            return MemoryResult(True, 5.0)
        def delete(self, mid):
            return MemoryResult(True, 2.0)
        def get_all(self):
            return MemoryResult(True, 1.0, data=[])
        def cleanup(self): pass
    return MockAdapter()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_adapters.py` around lines 109 - 123, Extract the duplicated
MockAdapter class into a pytest fixture named mock_adapter and use it in both
TestDryRunScenarioA and TestDryRunScenarioB; locate the existing MockAdapter
(subclassing MemoryAdapter and returning MemoryResult in methods
store/retrieve/update/delete/get_all) and move that implementation into a
fixture function decorated with `@pytest.fixture` that returns an instance, then
replace the inline class definitions in both tests with the fixture parameter
(mock_adapter) so tests reuse the same adapter instance and eliminate
duplication.

128-128: ⚡ Quick win

Remove unused variable monkeypatch_env.

Line 128 defines monkeypatch_env = {"OPENAI_API_KEY": ""} but never uses it. This is dead code.

♻️ Proposed fix

         from benchmarks.scenario_a import run_scenario_a
         from benchmarks.evaluator import LLMEvaluator
         import os
-        monkeypatch_env = {"OPENAI_API_KEY": ""}
         evaluator = LLMEvaluator(api_key="")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_adapters.py` at line 128, Remove the unused variable
monkeypatch_env defined in tests/test_adapters.py (the assignment
monkeypatch_env = {"OPENAI_API_KEY": ""}) since it is dead code; locate the
variable by name and delete the line or unused declaration so there are no
unused locals left in the test file.

7-7: ⚡ Quick win

Remove sys.path manipulation in tests.

Manipulating sys.path in tests is an anti-pattern. Tests should rely on proper package structure and installation (e.g., pip install -e . for editable install). This ensures tests run in the same environment as production code.

♻️ Recommended approach

Remove line 7 and ensure the project has a proper setup.py or pyproject.toml that allows:

pip install -e .
pytest tests/

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_adapters.py` at line 7, Remove the sys.path manipulation by
deleting the call to sys.path.insert(0, str(Path(__file__).parent.parent)) in
tests (remove the use of sys.path.insert and Path(__file__).parent.parent in
tests/test_adapters.py); instead rely on a proper package install for test
imports (use pip install -e . via pyproject.toml or setup.py) so tests import
modules from the installed package rather than mutating sys.path.

README.md (1)

54-76: 💤 Low value

Add language identifier to the project structure code fence.

Markdown best practice is to specify a language (or use plain text) for fenced code blocks to improve rendering and accessibility.

📝 Proposed fix

-```
+```plaintext
 memanto-benchmark/

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 54 - 76, The README's project-structure fenced code
block lacks a language identifier; edit the README.md and change the opening
triple backticks for the tree (the block showing "memanto-benchmark/ ..." under
the diff) to include a language token such as ```plaintext (or ```text) so the
directory tree renders/accessibility is improved and syntax highlighters treat
it as plain text.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/base.py`:
- Around line 1-9: Add a short statement to the README documenting the project's
minimum Python version (Python 3.10+) to match pyproject.toml's requires-python
and classifiers; update the top-level README.md to explicitly state "Requires
Python 3.10 or newer (tested on 3.10/3.11/3.12)" and, if appropriate, add a
short note about compatibility or testing matrix to keep it consistent with the
package metadata.

In `@requirements.txt`:
- Line 7: Update the jinja2 requirement in requirements.txt to pin the minimum
safe version by replacing the current "jinja2>=3.1.0" entry with "jinja2>=3.1.6"
so vulnerable 3.1.x releases (and earlier) are excluded; ensure the change is
applied to the jinja2 line in requirements.txt and run dependency install/checks
afterward.

In `@tests/test_adapters.py`:
- Around line 37-40: The test test_p95_latency currently asserts the buggy
behavior; update it to assert the correct 95th percentile for
BenchmarkMetric.store_p95_latency (property in benchmarks/base.py) using the
proper percentile calculation for the list [10,20,...,100]; replace the
hardcoded 100.0 with the correct value (approximately 95.5) and use
pytest.approx(95.5) or a small tolerance to avoid brittle equality.

---

Duplicate comments:
In `@benchmarks/base.py`:
- Around line 37-50: The p95 calculation in store_p95_latency and
retrieve_p95_latency uses idx = int(len(sorted_l) * 0.95) which can pick the
last element (100th percentile) for many n; change the index computation to
compute the 95th percentile rank properly (e.g., idx = math.ceil(0.95 *
len(sorted_l)) - 1) and then clamp with min/max to the valid range; update both
functions and add the necessary math import if not present.

In `@requirements.txt`:
- Line 2: Update the mem0ai requirement to avoid known vulnerable releases by
tightening the constraint from "mem0ai>=0.1.0" to a safe range (e.g.,
"mem0ai>=2.0.4,<3.0.0") in requirements.txt; keep the benchmark adapters that
call mem0 endpoints (the existing adapter code) using explicit API key
parameters and verify those endpoint calls enforce authentication by checking
the adapter functions that pass the API keys.

---

Nitpick comments:
In `@README.md`:
- Around line 54-76: The README's project-structure fenced code block lacks a
language identifier; edit the README.md and change the opening triple backticks
for the tree (the block showing "memanto-benchmark/ ..." under the diff) to
include a language token such as ```plaintext (or ```text) so the directory tree
renders/accessibility is improved and syntax highlighters treat it as plain
text.

In `@tests/test_adapters.py`:
- Around line 109-123: Extract the duplicated MockAdapter class into a pytest
fixture named mock_adapter and use it in both TestDryRunScenarioA and
TestDryRunScenarioB; locate the existing MockAdapter (subclassing MemoryAdapter
and returning MemoryResult in methods store/retrieve/update/delete/get_all) and
move that implementation into a fixture function decorated with `@pytest.fixture`
that returns an instance, then replace the inline class definitions in both
tests with the fixture parameter (mock_adapter) so tests reuse the same adapter
instance and eliminate duplication.
- Line 128: Remove the unused variable monkeypatch_env defined in
tests/test_adapters.py (the assignment monkeypatch_env = {"OPENAI_API_KEY": ""})
since it is dead code; locate the variable by name and delete the line or unused
declaration so there are no unused locals left in the test file.
- Line 7: Remove the sys.path manipulation by deleting the call to
sys.path.insert(0, str(Path(__file__).parent.parent)) in tests (remove the use
of sys.path.insert and Path(__file__).parent.parent in tests/test_adapters.py);
instead rely on a proper package install for test imports (use pip install -e .
via pyproject.toml or setup.py) so tests import modules from the installed
package rather than mutating sys.path.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b96bc24d-05c6-4c92-bc80-14766910138d

📥 Commits

Reviewing files that changed from the base of the PR and between 9b55388 and 440a667.

📒 Files selected for processing (14)

.env.example
README.md
benchmarks/__init__.py
benchmarks/base.py
benchmarks/evaluator.py
benchmarks/mem0_adapter.py
benchmarks/memanto_adapter.py
benchmarks/scenario_a.py
benchmarks/scenario_b.py
datasets/persona_evolution.json
datasets/technical_logs.json
requirements.txt
run_benchmark.py
tests/test_adapters.py

✅ Files skipped from review due to trivial changes (4)

datasets/persona_evolution.json
.env.example
benchmarks/init.py
datasets/technical_logs.json

🚧 Files skipped from review as they are similar to previous changes (6)

benchmarks/evaluator.py
benchmarks/scenario_a.py
benchmarks/memanto_adapter.py
run_benchmark.py
benchmarks/scenario_b.py
benchmarks/mem0_adapter.py

coderabbitai · 2026-06-10T09:19:35Z

+"""
+Base classes for the memory benchmark framework.
+"""
+
+import time
+import statistics
+from abc import ABC, abstractmethod
+from dataclasses import dataclass, field
+from typing import Any


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check for Python version requirements in project files echo "=== Searching for Python version declarations ===" rg -n "python_requires|requires-python|Programming Language :: Python :: 3\." setup.py pyproject.toml setup.cfg README.md .python-version 2>/dev/null || echo "No version constraints found" echo -e "\n=== Checking if README documents Python version ===" rg -n -i "python 3\.(10|11|12)" README.md || echo "Python version not documented in README"

Repository: moorcheh-ai/memanto

Length of output: 474

Document Python 3.10+ requirement in README (pyproject already set)

pyproject.toml already declares requires-python = ">=3.10,<4" and includes Python 3.10/3.11/3.12 classifiers.

README.md has no explicit mention of Python 3.10/3.11/3.12; add the Python 3.10+ minimum there.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/base.py` around lines 1 - 9, Add a short statement to the README documenting the project's minimum Python version (Python 3.10+) to match pyproject.toml's requires-python and classifiers; update the top-level README.md to explicitly state "Requires Python 3.10 or newer (tested on 3.10/3.11/3.12)" and, if appropriate, add a short note about compatibility or testing matrix to keep it consistent with the package metadata.

coderabbitai · 2026-06-10T09:19:35Z

GHSA-gmj6-6f8f-6699

+python-dotenv>=1.0.0
+numpy>=1.24.0
+tabulate>=0.9.0
+jinja2>=3.1.0


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What is the latest stable version of jinja2 and which version fixed the sandbox breakout vulnerabilities (GHSA-cpwx-vrp4-4pq7, GHSA-gmj6-6f8f-6699)?

💡 Result:

As of June 10, 2026, the latest stable version of Jinja2 is 3.1.6 [1][2][3]. The sandbox breakout vulnerabilities you referenced were addressed in the following versions: - GHSA-gmj6-6f8f-6699 (CVE-2024-56201): Fixed in version 3.1.5 [4][5][6]. This vulnerability involved a flaw in the Jinja compiler that allowed for a sandbox breakout if an attacker controlled both the content and the filename of a template [7][5]. - GHSA-cpwx-vrp4-4pq7 (CVE-2025-27516): Fixed in version 3.1.6 [8][9][10]. This vulnerability stemmed from an oversight where the |attr filter could be used to bypass the environment's attribute lookup, allowing access to restricted methods like str.format [8][9]. Users are encouraged to use the latest stable version (3.1.6) to ensure protection against these and other vulnerabilities [2].

Citations:

1: https://pypi.org/project/Jinja2/

2: https://github.com/pallets/jinja/releases

3: https://github.com/pallets/jinja/tree/stable

4: GHSA-gmj6-6f8f-6699

5: https://nvd.nist.gov/vuln/detail/CVE-2024-56201

6: https://feedly.com/cve/CVE-2024-56201

7: https://osv.dev/vulnerability/GHSA-gmj6-6f8f-6699

8: GHSA-cpwx-vrp4-4pq7

9: https://nvd.nist.gov/vuln/detail/CVE-2025-27516

10: https://github.com/pallets/jinja/blob/main/CHANGES.rst

🌐 Web query:

GHSA-gmj6-6f8f-6699 affected versions fixed in 3.1.5 Jinja2 what versions are impacted

💡 Result:

The vulnerability GHSA-gmj6-6f8f-6699 (also identified as CVE-2024-56201) affects Jinja2 versions in the 3.x branch, specifically those from 3.0.0 up to and including 3.1.4 [1][2][3]. This vulnerability is fixed in Jinja2 version 3.1.5 [2][4][5]. The issue involves a sandbox breakout that can occur when an attacker controls both the content and the filename of a template, allowing for the execution of arbitrary Python code regardless of whether the Jinja sandbox is enabled [2][6][3]. It has been confirmed that versions prior to 3.0.0a1 are not impacted because they did not use the f-string syntax in the error messages that caused the vulnerability [7]. Users are advised to upgrade to version 3.1.5 or later to mitigate this risk [2][4][3].

Citations:

1: https://cvepremium.circl.lu/vuln/ghsa-gmj6-6f8f-6699

2: GHSA-gmj6-6f8f-6699

3: https://cve.circl.lu/vuln/cve-2024-56201

4: https://github.com/pallets/jinja/releases/tag/3.1.5

5: https://jinja.palletsprojects.com/en/stable/changes/

6: https://osv.dev/vulnerability/GHSA-gmj6-6f8f-6699

7: [GHSA-gmj6-6f8f-6699] Jinja has a sandbox breakout through malicious filenames github/advisory-database#5150

🌐 Web query:

GHSA-cpwx-vrp4-4pq7 affected versions fixed in 3.1.6 Jinja2 what versions are impacted

💡 Result:

The security vulnerability GHSA-cpwx-vrp4-4pq7 (CVE-2025-27516) affects all Jinja2 versions up to and including 3.1.5 [1]. This vulnerability, which involves a sandbox breakout through the attr filter, was addressed and fixed in Jinja2 version 3.1.6, released on March 5, 2025 [2][3][4]. Users are advised to upgrade to version 3.1.6 or later to mitigate this issue [2].

Citations:

1: https://osv.dev/vulnerability/GHSA-cpwx-vrp4-4pq7

2: https://github.com/pallets/jinja/releases/tag/3.1.6

3: https://raw.githubusercontent.com/pallets/jinja/refs/tags/3.1.6/CHANGES.rst

4: https://jinja.palletsprojects.com/changes/

Tighten the jinja2 constraint to avoid sandbox-breakout CVEs

File: requirements.txt
Lines: 7-7

jinja2>=3.1.0

jinja2>=3.1.0 permits vulnerable releases:

GHSA-gmj6-6f8f-6699 (CVE-2024-56201): affects 3.0.0–3.1.4; fixed in 3.1.5 (sandbox breakout when template content+filename are attacker-controlled).

GHSA-cpwx-vrp4-4pq7 (CVE-2025-27516): affects up to 3.1.5; fixed in 3.1.6 (sandbox breakout via the attr filter).

Upgrade the requirement to exclude everything before the fixed version (e.g., use jinja2>=3.1.6).

🧰 Tools

🪛 OSV Scanner (2.3.8)

[HIGH] 7-7: jinja2 3.1.0: Jinja2 vulnerable to sandbox breakout through attr filter selecting format method

(GHSA-cpwx-vrp4-4pq7)

[HIGH] 7-7: jinja2 3.1.0: Jinja has a sandbox breakout through malicious filenames

(GHSA-gmj6-6f8f-6699)

[HIGH] 7-7: jinja2 3.1.0: Jinja vulnerable to HTML attribute injection when passing user input as keys to xmlattr filter

(GHSA-h5c8-rqwp-cp95)

[HIGH] 7-7: jinja2 3.1.0: Jinja vulnerable to HTML attribute injection when passing user input as keys to xmlattr filter

(GHSA-h75v-3vvj-5mfj)

[HIGH] 7-7: jinja2 3.1.0: Jinja has a sandbox breakout through indirect reference to format method

(GHSA-q2x7-8rv6-6q7h)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@requirements.txt` at line 7, Update the jinja2 requirement in requirements.txt to pin the minimum safe version by replacing the current "jinja2>=3.1.0" entry with "jinja2>=3.1.6" so vulnerable 3.1.x releases (and earlier) are excluded; ensure the change is applied to the jinja2 line in requirements.txt and run dependency install/checks afterward.

coderabbitai · 2026-06-10T09:19:35Z

+    def test_p95_latency(self):
+        m = BenchmarkMetric(framework="Test", scenario="S1")
+        m.store_latencies = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
+        assert m.store_p95_latency == 100.0  # 95th percentile of 10 items


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Test validates incorrect p95 calculation — will fail when base.py is fixed.

This test expects store_p95_latency to return 100.0 (the maximum value) for a 10-element list. However, the 95th percentile should be approximately 95 or 90, not 100. The test currently passes because it validates the buggy behavior in benchmarks/base.py lines 41 and 49. Once the p95 calculation is corrected, this test will fail.

💚 Proposed fix to match correct p95 calculation

def test_p95_latency(self): m = BenchmarkMetric(framework="Test", scenario="S1") m.store_latencies = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100] - assert m.store_p95_latency == 100.0 # 95th percentile of 10 items + # 95th percentile: idx = int((10-1)*0.95) = int(8.55) = 8 → 90.0 + assert m.store_p95_latency == 90.0

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_adapters.py` around lines 37 - 40, The test test_p95_latency currently asserts the buggy behavior; update it to assert the correct 95th percentile for BenchmarkMetric.store_p95_latency (property in benchmarks/base.py) using the proper percentile calculation for the list [10,20,...,100]; replace the hardcoded 100.0 with the correct value (approximately 95.5) and use pytest.approx(95.5) or a small tolerance to avoid brittle equality.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

README.md (2)

40-40: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the path to reflect the timestamped filename.

The generated HTML report includes a timestamp in its filename (e.g., benchmark_report_20260610_120000.html), but this line shows a static path. Users following this command will encounter a file-not-found error.

📝 Proposed fix

-# 4. View report
-open reports/benchmark_report.html
+# 4. View report (use the latest generated file)
+open reports/benchmark_report_*.html

Alternatively, document the timestamp pattern:

-# 4. View report
-open reports/benchmark_report.html
+# 4. View report
+# Reports are timestamped: benchmark_report_YYYYMMDD_HHMMSS.html
+open reports/benchmark_report_*.html

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 40, Update the README command that opens the generated
report to use the timestamped filename pattern instead of the static
"reports/benchmark_report.html"; replace or augment the example command that
references "benchmark_report.html" with a pattern or placeholder like
"reports/benchmark_report_YYYYMMDD_HHMMSS.html" (or show how to glob for the
latest file, e.g., use a shell glob or find command) so users can reliably open
the actual generated file.

99-100: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update output paths to reflect the timestamped filenames.

The actual generated reports include timestamps in their filenames (e.g., benchmark_results_20260610_120000.json), but these lines show static paths. This inconsistency may confuse users searching for the output files.

📝 Proposed fix

 The benchmark generates:
 1. **Console summary** — key metrics comparison table
-2. **JSON report** — `reports/benchmark_results.json`
-3. **HTML report** — `reports/benchmark_report.html` with charts
+2. **JSON report** — `reports/benchmark_results_<timestamp>.json`
+3. **HTML report** — `reports/benchmark_report_<timestamp>.html` with charts

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 99 - 100, The README lists static output paths
'reports/benchmark_results.json' and 'reports/benchmark_report.html' but the
tool actually emits timestamped filenames; update these entries to reflect the
actual filename patterns (e.g., 'reports/benchmark_results_YYYYMMDD_HHMMSS.json'
and 'reports/benchmark_report_YYYYMMDD_HHMMSS.html' or a glob like
'reports/benchmark_results_*.json' and 'reports/benchmark_report_*.html') so
users can find the generated files; adjust the two lines that reference those
filenames accordingly.

🧹 Nitpick comments (1)

README.md (1)
87-87: 💤 Low value

Clarify that the run count is configurable.

While 3 is the default, the number of runs is configurable via the BENCHMARK_RUNS environment variable or the --runs CLI flag. This line presents it as a fixed value, which may mislead users.
📝 Suggested clarification
-- **3 runs per scenario** with statistical aggregation
+- **Configurable runs per scenario** (default: 3) with statistical aggregation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 87, Update the README line that states "3 runs per
scenario" to clarify that 3 is the default but configurable: mention the
BENCHMARK_RUNS environment variable and the --runs CLI flag as ways to change
the run count, e.g., "Default: 3 runs per scenario (configurable via
BENCHMARK_RUNS or --runs)"; ensure this references the default value and both
configuration options (BENCHMARK_RUNS and --runs).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@README.md`:
- Around line 47-48: The table rows for BENCHMARK_RUNS and RANDOM_SEED are
missing the third "Required" column; update the README table rows for
`BENCHMARK_RUNS` and `RANDOM_SEED` to include the required-status column and
mark both as not required (❌) so each row has three columns matching the header.

---

Outside diff comments:
In `@README.md`:
- Line 40: Update the README command that opens the generated report to use the
timestamped filename pattern instead of the static
"reports/benchmark_report.html"; replace or augment the example command that
references "benchmark_report.html" with a pattern or placeholder like
"reports/benchmark_report_YYYYMMDD_HHMMSS.html" (or show how to glob for the
latest file, e.g., use a shell glob or find command) so users can reliably open
the actual generated file.
- Around line 99-100: The README lists static output paths
'reports/benchmark_results.json' and 'reports/benchmark_report.html' but the
tool actually emits timestamped filenames; update these entries to reflect the
actual filename patterns (e.g., 'reports/benchmark_results_YYYYMMDD_HHMMSS.json'
and 'reports/benchmark_report_YYYYMMDD_HHMMSS.html' or a glob like
'reports/benchmark_results_*.json' and 'reports/benchmark_report_*.html') so
users can find the generated files; adjust the two lines that reference those
filenames accordingly.

---

Nitpick comments:
In `@README.md`:
- Line 87: Update the README line that states "3 runs per scenario" to clarify
that 3 is the default but configurable: mention the BENCHMARK_RUNS environment
variable and the --runs CLI flag as ways to change the run count, e.g.,
"Default: 3 runs per scenario (configurable via BENCHMARK_RUNS or --runs)";
ensure this references the default value and both configuration options
(BENCHMARK_RUNS and --runs).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: fc18ff0a-d59f-42f5-885c-ede1d197a236

📥 Commits

Reviewing files that changed from the base of the PR and between 440a667 and d023a7a.

📒 Files selected for processing (6)

README.md
benchmarks/base.py
benchmarks/evaluator.py
benchmarks/mem0_adapter.py
benchmarks/scenario_a.py
requirements.txt

🚧 Files skipped from review as they are similar to previous changes (5)

requirements.txt
benchmarks/mem0_adapter.py
benchmarks/evaluator.py
benchmarks/scenario_a.py
benchmarks/base.py

coderabbitai · 2026-06-10T09:29:35Z

+| `BENCHMARK_RUNS` | Number of benchmark runs per scenario (default: 3) |
+| `RANDOM_SEED` | Random seed for reproducibility (default: 42) |


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add the missing "Required" column to complete the table.

Lines 47-48 are missing the third column expected by the table header. Since both BENCHMARK_RUNS and RANDOM_SEED have default values, they should be marked as not required (❌).

📝 Proposed fix

-| `BENCHMARK_RUNS` | Number of benchmark runs per scenario (default: 3) | -| `RANDOM_SEED` | Random seed for reproducibility (default: 42) | +| `BENCHMARK_RUNS` | Number of benchmark runs per scenario (default: 3) | ❌ | +| `RANDOM_SEED` | Random seed for reproducibility (default: 42) | ❌ |

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

| `BENCHMARK_RUNS` | Number of benchmark runs per scenario (default: 3) |

| `RANDOM_SEED` | Random seed for reproducibility (default: 42) |

| `BENCHMARK_RUNS` | Number of benchmark runs per scenario (default: 3) | ❌ |

| `RANDOM_SEED` | Random seed for reproducibility (default: 42) | ❌ |

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 47-47: Table column count
Expected: 3; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

[warning] 48-48: Table column count
Expected: 3; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 47 - 48, The table rows for BENCHMARK_RUNS and RANDOM_SEED are missing the third "Required" column; update the README table rows for `BENCHMARK_RUNS` and `RANDOM_SEED` to include the required-status column and mark both as not required (❌) so each row has three columns matching the header.

Source: Linters/SAST tools

coderabbitai

🧹 Nitpick comments (1)

benchmarks/memanto_adapter.py (1)

45-149: ⚡ Quick win

Consider adding defensive _client validation.

All methods access self._client without checking whether setup() was called first. If used incorrectly, the error message (AttributeError: 'NoneType' object has no attribute 'add') will be cryptic.

🛡️ Proposed defensive check

Add a helper method and call it at the start of each operation:

+    def _ensure_setup(self) -> None:
+        """Raise RuntimeError if setup() was not called."""
+        if self._client is None:
+            raise RuntimeError(
+                f"{self.name} adapter: setup() must be called before operations"
+            )
+
     def store(self, content: str, metadata: dict | None = None) -> MemoryResult:
         """Store a memory entry via the Memanto API.
         ...
         """
+        self._ensure_setup()
         try:
             meta = metadata or {}

Apply the same pattern to retrieve, update, delete, get_all, and cleanup.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/memanto_adapter.py` around lines 45 - 149, The methods store,
retrieve, update, delete, get_all, and cleanup use self._client without
verifying setup, causing cryptic AttributeError if setup() wasn't called; add a
small defensive helper (e.g., _ensure_client() or validate_client()) that raises
a clear RuntimeError with a helpful message when self._client is None, then call
that helper at the start of each method (store, retrieve, update, delete,
get_all, cleanup) to fail fast and provide a descriptive error instead of
letting None attribute access appear.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@benchmarks/memanto_adapter.py`:
- Around line 45-149: The methods store, retrieve, update, delete, get_all, and
cleanup use self._client without verifying setup, causing cryptic AttributeError
if setup() wasn't called; add a small defensive helper (e.g., _ensure_client()
or validate_client()) that raises a clear RuntimeError with a helpful message
when self._client is None, then call that helper at the start of each method
(store, retrieve, update, delete, get_all, cleanup) to fail fast and provide a
descriptive error instead of letting None attribute access appear.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3dbfbb69-0528-4d54-b099-387873c8f2ae

📥 Commits

Reviewing files that changed from the base of the PR and between d023a7a and efc0cfe.

📒 Files selected for processing (3)

benchmarks/mem0_adapter.py
benchmarks/memanto_adapter.py
tests/test_adapters.py

🚧 Files skipped from review as they are similar to previous changes (2)

tests/test_adapters.py
benchmarks/mem0_adapter.py

song11071696 · 2026-06-12T13:40:33Z

Hi, thank you for the thorough review!

I've addressed all the issues raised:

Fixes Applied

p95 latency off-by-one ✅ Fixed
- Changed int(n * 0.95) to max(0, int(n * 0.95) - 1) in base.py
Judge score normalization ✅ Fixed
- Added max(0.0, min(1.0, score)) clamping in evaluator.py
Silent cleanup failure ✅ Fixed
- Changed to logging.warning(f"Cleanup failed: {type(e).__name__}") in both adapters
Unsafe dict key access ✅ Fixed
- Changed to .get("key", "") in scenario_a.py
mem0ai version constraint ✅ Fixed
- Changed from >=0.1.0 to >=2.0.0b2 in requirements.txt
README missing env vars ✅ Fixed
- Added BENCHMARK_RUNS and RANDOM_SEED documentation
Defensive _ensure_setup() check ✅ Fixed
- Added to both mem0_adapter.py and memanto_adapter.py
Docstring coverage ✅ Fixed
- Added docstrings to all functions in adapter files and test files

Additional Improvements

Sanitized error messages (no raw exception leaking to clients)
Added comprehensive unit tests (15 tests, all passing)
Dry-run mode works without API keys

All tests pass locally. Please let me know if any further changes are needed!

Best regards

Resolve merge conflicts by rebuilding branch from main. All original files preserved, benchmark files added.

song11071696 · 2026-06-14T11:37:37Z

Hi, just a quick follow-up.

I've resolved the merge conflicts and the PR is now fully mergeable with the latest main branch.

All CodeRabbit feedback has been addressed:

p95 latency fixed
Score normalization added
Defensive _ensure_setup() checks
Error messages sanitized
Docstrings added

Is there anything else needed for this to be merged?

Thanks!

song11071696 force-pushed the feat/benchmark-suite branch from 9b55388 to 440a667 Compare June 10, 2026 09:04

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

song11071696 mentioned this pull request Jun 13, 2026

Add memanto edit update an existing memory #540

Open

4 tasks

song11071696 force-pushed the feat/benchmark-suite branch from 0b2db73 to dffa010 Compare June 13, 2026 07:28

fix: rebase benchmark suite onto latest main

9e9782e

Resolve merge conflicts by rebuilding branch from main. All original files preserved, benchmark files added.

song11071696 force-pushed the feat/benchmark-suite branch from dffa010 to 9e9782e Compare June 13, 2026 07:31

-            return float(parsed.get("score", 0.0)), parsed.get("reasoning", "")
+            raw_score = float(parsed.get("score", 0.0))
+            score = max(0.0, min(1.0, raw_score))
+            return score, parsed.get("reasoning", "")

		\| `BENCHMARK_RUNS` \| Number of benchmark runs per scenario (default: 3) \|
		\| `RANDOM_SEED` \| Random seed for reproducibility (default: 42) \|

Conversation

song11071696 commented Jun 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

How to run

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

song11071696 commented Jun 12, 2026

Fixes Applied

Additional Improvements

Uh oh!

song11071696 commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

song11071696 commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading