feat: add reproducible benchmark suite for Memanto vs Mem0 #639 #720
song11071696 wants to merge 1 commit into
📝 Walkthrough

Adds a benchmarking suite comparing Memanto and Mem0: package exports and base types, Memanto/Mem0 adapters, an LLM evaluator with fallback, two scenario runners with datasets, an orchestrator producing console/JSON/HTML reports, README/.env.example updates, pinned deps, and pytest tests.

Changes
Benchmarking Suite Implementation
Sequence Diagram(s)
sequenceDiagram
participant Runner as run_benchmark.py
participant Scenario as run_scenario_{a,b}
participant Adapter as MemoryAdapter
participant MemClient as MemantoClient/Mem0Client
participant Evaluator as LLMEvaluator
participant OpenAI as OpenAI API
Runner->>Scenario: invoke scenario with adapter & evaluator
Scenario->>Adapter: adapter.setup(user_id)
loop ingest entries
Scenario->>Adapter: store(content, metadata)
Adapter->>MemClient: add/store request
MemClient-->>Adapter: store response
Adapter-->>Scenario: MemoryResult (latency,tokens)
end
loop retrievals
Scenario->>Adapter: retrieve(query, limit)
Adapter->>MemClient: search request
MemClient-->>Adapter: search results
Adapter-->>Scenario: MemoryResult (data)
Scenario->>Evaluator: score_retrieval(query,golden,retrieved)
Evaluator->>OpenAI: chat.completions.create(...)
OpenAI-->>Evaluator: JSON score/reasoning
Evaluator-->>Scenario: (score,reasoning) or fallback score
end
Scenario->>Adapter: cleanup()
Runner->>Runner: aggregate results & generate reports (console, JSON, HTML)
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
9b55388 to 440a667
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.env.example (1)
16-21: ⚠️ Potential issue | 🟠 Major

Wire up (or remove) documented BENCHMARK_RUNS/RANDOM_SEED env vars for the benchmark runner

- `run_benchmark.py` uses CLI `--runs` for run count, but there are no reads of `BENCHMARK_RUNS` anywhere in the repo.
- There is no `RANDOM_SEED` usage and no RNG seeding logic found in benchmark-related code; only README mentions “Random seeds set for deterministic evaluation”.
- Either consume these env vars in the benchmark runner (and implement seeding) or drop them from `.env.example` to avoid a reproducibility/documentation mismatch.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.env.example around lines 16 - 21, The repo documents BENCHMARK_RUNS and RANDOM_SEED but run_benchmark.py only uses CLI --runs and there is no RNG seeding; update run_benchmark.py to read BENCHMARK_RUNS and RANDOM_SEED from the environment as fallbacks for the CLI (use os.getenv('BENCHMARK_RUNS') to set default for the --runs value) and implement deterministic seeding in the benchmark startup (e.g., a seed_randomness(seed) helper called from main() that seeds Python random, numpy.random, and any other RNGs) so env vars are actually consumed; alternatively remove BENCHMARK_RUNS and RANDOM_SEED from .env.example if you prefer not to support env-based configuration.
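A minimal sketch of the env-var wiring described in this comment, assuming the runner parses `--runs` with `argparse`. The `seed_randomness` helper name comes from the prompt; the `--seed` flag, `parse_args`, and the defaults (3 runs, seed 42, taken from the README suggestions in this review) are illustrative assumptions, not the PR's code. NumPy is seeded only if it is installed.

```python
import argparse
import os
import random
from typing import Optional, Sequence


def seed_randomness(seed: int) -> None:
    """Seed the RNGs the benchmark may touch (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np  # optional: seeded only when NumPy is available
    except ImportError:
        return
    np.random.seed(seed)


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Memanto vs Mem0 benchmark")
    # Environment variables supply the defaults; explicit CLI flags still win.
    parser.add_argument(
        "--runs", type=int, default=int(os.getenv("BENCHMARK_RUNS", "3"))
    )
    parser.add_argument(
        "--seed", type=int, default=int(os.getenv("RANDOM_SEED", "42"))
    )
    return parser.parse_args(argv)
```

Calling `seed_randomness(args.seed)` once at the start of `main()` would make the env vars in `.env.example` actually take effect; whether the scenarios use any randomness at all is not visible in this review.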
🧹 Nitpick comments (12)
README.md (2)
54-76: 💤 Low value

Add language identifier to fenced code block.

The fenced code block displaying the project structure should specify a language (e.g., `text` or `plaintext`) for better Markdown compliance.

📝 Proposed fix

    -```
    +```text
     memanto-benchmark/
     ├── README.md                    # This file

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 54 - 76, Summary: The README's fenced code block showing the project tree lacks a language identifier. Fix: open the README.md, locate the fenced block containing the ASCII project structure (the block that begins with "memanto-benchmark/" and the triple backticks), and add a language tag such as text or plaintext after the opening backticks (e.g., ```text) so the code fence is Markdown-compliant; ensure you update only that fence and preserve the existing ASCII tree content.

Source: Linters/SAST tools
78-92: ⚡ Quick win

Clarify that benchmark runs are configurable.

Line 85 states "3 runs per scenario" as if it's fixed, but `.env.example` defines `BENCHMARK_RUNS` as a configurable parameter (default: 3). Similarly, line 90 references "Random seeds set", which appears to rely on the `RANDOM_SEED` env var. Consider rephrasing to indicate these are configurable defaults rather than fixed values.

📝 Suggested clarification

    -- **3 runs per scenario** with statistical aggregation
    +- **Configurable runs per scenario** (default: 3) with statistical aggregation

And for line 90:

    -- Random seeds set for deterministic evaluation
    +- Configurable random seed for deterministic evaluation (default: 42)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 78 - 92, Update the "🔬 Experimental Design" text to show that run counts and seeds are configurable: change the phrasing that currently reads "3 runs per scenario" to indicate it is the default controlled by the BENCHMARK_RUNS environment variable (default: 3), and change "Random seeds set for deterministic evaluation" to note the RANDOM_SEED env var is used to set the seed; reference these env vars (BENCHMARK_RUNS, RANDOM_SEED) in the Reproducibility or Isolation & Controls subsection and say they are configurable defaults rather than fixed values.

benchmarks/mem0_adapter.py (2)
35-35: 💤 Low value

Token estimation heuristic lacks justification.

The formula `len(content.split()) * 2` is used across multiple methods to estimate token usage, but this multiplier (×2) appears arbitrary and may not reflect actual token consumption by the Mem0 API.

Real token counts depend on the tokenizer used by Mem0's backend (likely tiktoken for GPT models), where tokens ≠ words. A more accurate approach would be to check if the API response includes actual token usage or document the rationale for this heuristic.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/mem0_adapter.py` at line 35, The current heuristic estimating tokens via tokens = len(content.split()) * 2 is arbitrary; replace it by using the actual token-count returned by Mem0 API when available (use the response's usage field) or, if the API doesn't provide usage, compute tokens via a proper tokenizer (e.g., tiktoken encoding relevant to the model) instead of content.split(); update every place that uses the tokens variable (the tokens assignment and any methods in this file that reuse that heuristic) to either read response.usage.total_tokens or call a tokenizer helper, and add a brief comment documenting the fallback behavior.
31-43: ⚡ Quick win

Latency measurement pattern requires external timing.

All adapter methods return `latency_ms=0`, which means the actual latency measurement relies entirely on the caller using `adapter.timed_call()`. The comment at line 38 in `memanto_adapter.py` makes this explicit, but this file lacks such documentation.

While this pattern works correctly when scenarios call `adapter.timed_call(adapter.store, ...)`, it creates a fragile contract where forgetting `timed_call` silently produces zero latencies in metrics.

📝 Consider adding a clarifying comment

     def store(self, content: str, metadata: dict | None = None) -> MemoryResult:
         try:
             meta = metadata or {}
             result = self._client.add(content, user_id=self._user_id, metadata=meta)
             tokens = len(content.split()) * 2
             return MemoryResult(
                 success=True,
    -            latency_ms=0,
    +            latency_ms=0,  # measured externally via timed_call
                 tokens_used=tokens,
                 data=result,
             )

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/mem0_adapter.py` around lines 31 - 43, The store method in benchmarks/mem0_adapter.py currently returns latency_ms=0 which relies on callers wrapping calls with adapter.timed_call; add a concise clarifying comment above the store method (similar to the note in memanto_adapter.py) stating that latency_ms is intentionally 0 and callers must call adapter.timed_call(self.store, ...) (or equivalent) to capture real latency metrics, and reference MemoryResult and the _client.add call in the comment so future maintainers know this is by design rather than a bug.

benchmarks/memanto_adapter.py (2)
49-51: 💤 Low value

Inefficient token estimation in retrieve().

The token calculation iterates through all retrieved memories and converts each to a string, even when memories are already strings. The nested `str(m)` call on line 50 is redundant when `m` is already a string, and the estimation could be optimized.

♻️ Minor optimization

     memories = result if isinstance(result, list) else [result]
     total_tokens = sum(
    -    len(str(m).split()) * 2 for m in memories
    +    len(str(m).split()) * 2 if not isinstance(m, str)
    +    else len(m.split()) * 2
    +    for m in memories
     )

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/memanto_adapter.py` around lines 49 - 51, The token estimation in retrieve() is inefficient because it calls str(m) for every memory even when m is already a string; update the computation of total_tokens (and any helper used there) to avoid redundant str() by checking isinstance(m, str) or converting once (e.g., text = m if isinstance(m, str) else str(m)) and then counting tokens (len(text.split()) * 2) for each item in memories; modify the total_tokens generator/loop accordingly to prevent repeated conversions while keeping the same token formula.
35-35: 💤 Low value

Token estimation fallback may mask API changes.

The `getattr(result, "tokens_used", len(content.split()) * 2)` pattern provides resilience when the Memanto API doesn't return `tokens_used`, but silently falling back to a heuristic could mask:

- API contract breaks where `tokens_used` is expected but missing
- Inaccurate token accounting when the fallback is used

Consider logging when the fallback is triggered to aid debugging.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/memanto_adapter.py` at line 35, The token fallback silently masks missing API fields; change the single-line fallback assignment tokens = getattr(result, "tokens_used", len(content.split()) * 2) to explicitly detect absence (use hasattr(result, "tokens_used") or result.__dict__.get("tokens_used") is None), compute the heuristic only in that branch, and emit a warning-level log that includes context (the result object or its keys, the content length, and the heuristic value) so you can detect API changes or when the heuristic is used; keep the original value name tokens and the same heuristic calculation but only assign it after logging the fallback.

benchmarks/scenario_a.py (2)
21-24: 💤 Low value

Dataset loading lacks error handling.

`load_dataset()` will raise `FileNotFoundError` or `json.JSONDecodeError` if the dataset file is missing or malformed, but these exceptions are not caught. While Python's stack traces are informative, a benchmark runner would benefit from a clearer error message.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/scenario_a.py` around lines 21 - 24, The load_dataset() function currently opens DATASET_PATH and calls json.load without handling errors; wrap the file open and json.load call in a try/except that catches FileNotFoundError and json.JSONDecodeError, include the original exception details in a clearer error message (e.g., via logging or raising a new RuntimeError) and re-raise or exit so the benchmark runner gets a readable, actionable message; update references to DATASET_PATH and load_dataset() only (no other logic changes).
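One way the error handling described above could look, as a sketch only. The `DATASET_PATH` value here is a placeholder (the real constant is defined in `benchmarks/scenario_a.py` and is not shown in this review), and raising `RuntimeError` is just one of the options the prompt mentions.

```python
import json
from pathlib import Path

# Placeholder path; the real DATASET_PATH lives in benchmarks/scenario_a.py.
DATASET_PATH = Path("datasets/scenario_a.json")


def load_dataset(path: Path = DATASET_PATH):
    """Load the scenario dataset, turning low-level errors into a clear message."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError as exc:
        raise RuntimeError(f"Benchmark dataset not found: {path}") from exc
    except json.JSONDecodeError as exc:
        raise RuntimeError(
            f"Benchmark dataset is not valid JSON: {path} ({exc})"
        ) from exc
```

Chaining with `from exc` keeps the original traceback available while the top-level message stays readable.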
78-83: 💤 Low value

Query extraction fallback may produce low-quality queries.

When `retrieval_queries` is missing or empty, the fallback generates queries from the first 100 characters of the first 5 log entries (lines 82-83). These truncated content strings may not be meaningful queries, especially if the logs start with timestamps or metadata rather than semantic content.

This could produce artificially low retrieval accuracy scores.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/scenario_a.py` around lines 78 - 83, The current fallback that builds queries when retrieval_queries is missing uses truncated prefixes (queries variable built from dataset[:5] with entry["content"][:100]), which can yield poor, non-semantic queries; update the fallback in the queries construction to produce higher-quality queries by (1) selecting entries with substantive content (e.g., filter out lines that look like timestamps/metadata), (2) prefer full sentences or the longest N entries rather than the first 100 chars, and (3) optionally extract a question-like phrase or heuristically trim to the first meaningful sentence; change the logic around the queries variable and the dataset[:5] fallback to implement these heuristics (e.g., filter entries by content length and regex for timestamps, then extract first sentence or longest content) so generated queries are semantic and representative.

tests/test_adapters.py (2)
128-128: 💤 Low value

Remove unused variable `monkeypatch_env`.

Line 128 defines `monkeypatch_env = {"OPENAI_API_KEY": ""}` but never uses it. The test correctly instantiates `LLMEvaluator(api_key="")` to bypass environment-based API key loading, making this variable unnecessary.

🧹 Remove unused variable

     from benchmarks.scenario_a import run_scenario_a
     from benchmarks.evaluator import LLMEvaluator
    -import os
    -monkeypatch_env = {"OPENAI_API_KEY": ""}
     evaluator = LLMEvaluator(api_key="")
     adapter = MockAdapter()

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_adapters.py` at line 128, Remove the unused variable monkeypatch_env from the test file: locate the assignment monkeypatch_env = {"OPENAI_API_KEY": ""} in tests/test_adapters.py and delete it; the test already constructs LLMEvaluator(api_key="") so no environment-mocking is needed and removing the unused monkeypatch_env variable will clean up the test.
107-135: ⚡ Quick win

Extract MockAdapter to eliminate duplication across test scenarios.

The `MockAdapter` implementation in `TestDryRunScenarioA` (lines 109-123) is duplicated identically in `TestDryRunScenarioB` (lines 149-163). Extracting the mock to a shared test fixture or module-level helper reduces maintenance burden and ensures consistent adapter behavior across integration tests.

♻️ Extract MockAdapter to module-level fixture

Add before the test classes (after imports):

    class MockAdapter(MemoryAdapter):
        """Shared mock adapter for dry-run integration tests."""

        @property
        def name(self):
            return "Mock"

        def setup(self, uid):
            pass

        def store(self, content, metadata=None):
            return MemoryResult(
                success=True, latency_ms=5.0, tokens_used=len(content.split()) * 2
            )

        def retrieve(self, query, limit=5):
            return MemoryResult(
                success=True, latency_ms=3.0, tokens_used=100, data=["mock memory"]
            )

        def update(self, mid, content):
            return MemoryResult(success=True, latency_ms=5.0)

        def delete(self, mid):
            return MemoryResult(success=True, latency_ms=2.0)

        def get_all(self):
            return MemoryResult(success=True, latency_ms=1.0, data=[])

        def cleanup(self):
            pass

Then remove the inline `MockAdapter` definitions from both test methods and instantiate the shared class.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_adapters.py` around lines 107 - 135, The MockAdapter class is duplicated between TestDryRunScenarioA and TestDryRunScenarioB; extract a single shared MockAdapter (subclassing MemoryAdapter and returning MemoryResult instances) at module scope or as a fixture, then remove the inline class definitions from both tests and instantiate the shared MockAdapter in each test; ensure the shared class implements the same methods used in the diff (name property, setup, store, retrieve, update, delete, get_all, cleanup) and returns the same MemoryResult shapes so run_scenario_a/run_scenario_b and LLMEvaluator-based assertions remain unchanged.

run_benchmark.py (2)
362-364: 💤 Low value

Timestamp inconsistency between benchmark results and report filenames.

Line 362 generates a new timestamp for report filenames, but `run_benchmark` already assigned a timestamp to each result dict at line 93. If the benchmark run spans a second boundary, the filenames will use a different timestamp than the one embedded in `results[0]["timestamp"]`, creating a minor inconsistency in reports.

🔧 Use the benchmark session timestamp from results

    -    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    +    timestamp = results[0]["timestamp"] if results else datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
         json_path = REPORTS_DIR / f"benchmark_results_{timestamp}.json"
         html_path = REPORTS_DIR / f"benchmark_report_{timestamp}.html"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@run_benchmark.py` around lines 362 - 364, The filenames are using a fresh datetime instead of the benchmark session timestamp stored in the results, causing possible mismatch; change the logic that sets json_path and html_path to derive the timestamp from results[0]["timestamp"] (from the run_benchmark results list) rather than calling datetime.now(), validating that results is non-empty and that the timestamp key exists and falling back to datetime.now(timezone.utc).strftime(...) only if results are missing or malformed; update the variables json_path and html_path to use that session timestamp so the filenames match the embedded result timestamp.
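A sketch of the validated fallback this prompt describes. `report_paths` is a hypothetical helper name and `REPORTS_DIR` is a placeholder for the constant in `run_benchmark.py`. It also assumes the timestamp stored in each result is already filename-safe (for example `%Y%m%d_%H%M%S`); the format assigned at line 93 is not shown in this review, so that needs checking.

```python
from datetime import datetime, timezone
from pathlib import Path

REPORTS_DIR = Path("reports")  # placeholder for the constant in run_benchmark.py


def report_paths(results: list) -> tuple:
    """Build report filenames from the session timestamp carried in the results."""
    first = results[0] if results else None
    timestamp = first.get("timestamp") if isinstance(first, dict) else None
    if not timestamp:
        # Fall back to "now" only when results are missing or malformed.
        timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    json_path = REPORTS_DIR / f"benchmark_results_{timestamp}.json"
    html_path = REPORTS_DIR / f"benchmark_report_{timestamp}.html"
    return json_path, html_path
```

Compared with the one-line diff, this also covers a result dict that lacks the `timestamp` key, which would otherwise raise `KeyError`.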
99-150: ⚖️ Poor tradeoff

Extract shared aggregation logic to eliminate duplication.

The aggregation logic in `generate_console_report` (lines 104-139) and `generate_html_report` (lines 265-308) is duplicated: both group results by `(framework, scenario)` and compute means/sums over the same metric fields using identical patterns. This violates the DRY principle and increases maintenance burden if aggregation logic needs to change.

♻️ Refactor to extract shared aggregation helper

Add a helper function before `generate_console_report`:

    def _aggregate_results(results: list[dict]) -> dict:
        """Aggregate results by (framework, scenario) with computed means/sums."""
        import numpy as np

        aggregated = {}
        for r in results:
            key = (r["framework"], r["scenario"])
            if key not in aggregated:
                aggregated[key] = {
                    "framework": r["framework"],
                    "scenario": r["scenario"],
                    "store_tokens": [],
                    "retrieve_tokens": [],
                    "store_p95": [],
                    "retrieve_p95": [],
                    "accuracy": [],
                    "errors": [],
                }
            agg = aggregated[key]
            agg["store_tokens"].append(r["total_store_tokens"])
            agg["retrieve_tokens"].append(r["total_retrieve_tokens"])
            agg["store_p95"].append(r["store_p95_latency_ms"])
            agg["retrieve_p95"].append(r["retrieve_p95_latency_ms"])
            agg["accuracy"].append(r["retrieval_accuracy"])
            agg["errors"].append(r["errors"])

        # Compute means/sums
        for key, agg in aggregated.items():
            agg["mean_store_tokens"] = int(np.mean(agg["store_tokens"]))
            agg["mean_retrieve_tokens"] = int(np.mean(agg["retrieve_tokens"]))
            agg["mean_store_p95"] = round(np.mean(agg["store_p95"]), 1)
            agg["mean_retrieve_p95"] = round(np.mean(agg["retrieve_p95"]), 1)
            agg["mean_accuracy"] = round(np.mean(agg["accuracy"]), 3)
            agg["total_errors"] = sum(agg["errors"])

        return aggregated

Then refactor both report functions to call `_aggregate_results(results)` and consume the computed means.

Also applies to: 170-320
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@run_benchmark.py` around lines 99 - 150, The aggregation block duplicated in generate_console_report and generate_html_report should be extracted into a shared helper _aggregate_results(results: list[dict]) that groups by (r["framework"], r["scenario"]), accumulates lists for "store_tokens","retrieve_tokens","store_p95","retrieve_p95","accuracy","errors", then computes and stores aggregated values (e.g. mean_store_tokens, mean_retrieve_tokens, mean_store_p95, mean_retrieve_p95, mean_accuracy, total_errors) using numpy and returns the aggregated dict; update generate_console_report and generate_html_report to call _aggregate_results(results) and read the precomputed mean_* and total_errors fields instead of repeating the accumulation logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/base.py`:
- Around line 37-50: The p95 calculation in store_p95_latency and
retrieve_p95_latency uses idx = int(len(sorted_l) * 0.95) which can select the
maximum element for many n (off-by-one); change the index calculation to use the
1-based percentile mapping: idx = max(0, math.ceil(len(sorted_l) * 0.95) - 1)
(add import math if missing) so the 95th percentile selects the correct element
from sorted_l in both store_p95_latency and retrieve_p95_latency.
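To make the index arithmetic concrete, here is a small sketch comparing the two formulas. The function names are illustrative, and returning `0.0` for an empty list is an assumption about the desired behaviour, not something taken from `benchmarks/base.py`.

```python
import math


def naive_index(n: int) -> int:
    """Index used by the current code: int(n * 0.95)."""
    return int(n * 0.95)


def nearest_rank_index(n: int) -> int:
    """1-based nearest-rank percentile mapped to a 0-based index."""
    return max(0, math.ceil(n * 0.95) - 1)


def p95(latencies: list) -> float:
    """95th percentile by nearest rank; 0.0 for no samples (assumed)."""
    if not latencies:
        return 0.0
    sorted_l = sorted(latencies)
    return sorted_l[nearest_rank_index(len(sorted_l))]
```

With 20 samples, `int(20 * 0.95)` is 19, the last index, so the current code reports the maximum; the nearest-rank formula picks index 18, the 19th of 20 values.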
In `@benchmarks/evaluator.py`:
- Line 68: The returned judge score should be normalized to the [0.0, 1.0]
contract before being used; replace the direct return of
float(parsed.get("score", 0.0)) with a safe conversion that handles non-numeric
values (fall back to 0.0), then clamp the resulting value to the range 0.0..1.0
(e.g., max(0.0, min(1.0, score))), and return that normalized score alongside
parsed.get("reasoning",""); update the return site where the tuple is produced
(the line using parsed.get("score", 0.0) and parsed.get("reasoning")) so
downstream metrics like retrieval_accuracy remain valid and comparable to
fallback runs.
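A sketch of the normalization described above. `normalize_score` is a hypothetical helper name; `parsed` stands for the dict decoded from the judge's JSON reply. Treating NaN as `0.0` is an extra assumption beyond what the comment asks for.

```python
import math
from typing import Any


def normalize_score(parsed: dict) -> tuple:
    """Return (score, reasoning) with score forced into [0.0, 1.0]."""
    raw: Any = parsed.get("score", 0.0)
    try:
        score = float(raw)
    except (TypeError, ValueError):
        score = 0.0  # non-numeric judge output falls back to 0.0
    if math.isnan(score):
        score = 0.0  # NaN would otherwise slip through the clamp
    score = max(0.0, min(1.0, score))
    return score, str(parsed.get("reasoning", ""))
```

For example, a judge reply of `{"score": 1.7}` would be recorded as `1.0`, and `{"score": "high"}` as `0.0`.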
In `@benchmarks/mem0_adapter.py`:
- Around line 85-89: The cleanup() method currently swallows all exceptions from
self._client.delete_all(user_id=self._user_id), which hides failures; update
cleanup() to catch exceptions but log the error (including exception details and
context like the user id and operation) via the project logger or raise a
wrapped exception instead of silently passing so cleanup failures are
observable; locate the cleanup method and modify the except block around
self._client.delete_all to record the exception (or re-raise) while preserving
original exception information.
- Around line 23-29: The setup method initializes MemoryClient correctly but
doesn't validate the incoming user_id; update the setup function to check that
user_id is non-empty (e.g., raise ValueError("user_id is required") when blank)
before assigning self._user_id, and keep the existing
MemoryClient(api_key=api_key) initialization using MemoryClient to preserve SDK
usage; refer to setup, self._client, and self._user_id when making the change.
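Both mem0_adapter.py points can be sketched together. `Mem0AdapterSketch` is an illustrative stand-in, not the PR's class: it takes the client as a constructor argument so the example runs without the Mem0 SDK, whereas the real `setup` builds `MemoryClient(api_key=api_key)` itself.

```python
import logging

logger = logging.getLogger(__name__)


class Mem0AdapterSketch:
    """Illustrative stand-in for the adapter in benchmarks/mem0_adapter.py."""

    def __init__(self, client):
        self._client = client  # the real adapter creates MemoryClient in setup()
        self._user_id = None

    def setup(self, user_id: str) -> None:
        # Reject empty or whitespace-only ids before any state is assigned.
        if not user_id or not user_id.strip():
            raise ValueError("user_id is required")
        self._user_id = user_id

    def cleanup(self) -> None:
        try:
            self._client.delete_all(user_id=self._user_id)
        except Exception:
            # Record the failure with its traceback instead of swallowing it.
            logger.exception(
                "Mem0 cleanup (delete_all) failed for user_id=%s", self._user_id
            )
```

Logging rather than re-raising keeps a failed cleanup from aborting the remaining runs; re-raising a wrapped exception is the stricter alternative the comment also allows.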
In `@benchmarks/scenario_a.py`:
- Around line 86-91: The code reads query_item["query"] (and ["golden_answer"])
without validation which can raise KeyError for malformed dataset entries;
update the parsing in both scenario_a and scenario_b to use
query_item.get("query") and query_item.get("golden_answer", "") (or similar .get
defaults), validate that query is truthy before using it, and on missing/invalid
entries log the error and increment the existing error/metrics counter (e.g.,
error metric or errors.increment) and skip that item instead of letting it crash
the benchmark.
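A sketch of the defensive parsing described above. `valid_queries` is a hypothetical helper, and returning the error count is a simplification; the scenarios would add it to whatever error counter they already keep.

```python
import logging

logger = logging.getLogger(__name__)


def valid_queries(query_items: list) -> tuple:
    """Return ([(query, golden_answer), ...], error_count), skipping bad items."""
    pairs = []
    errors = 0
    for index, item in enumerate(query_items):
        query = item.get("query") if isinstance(item, dict) else None
        if not query:
            logger.error("Skipping malformed query item at index %d: %r", index, item)
            errors += 1
            continue
        pairs.append((query, item.get("golden_answer", "")))
    return pairs, errors
```

A dataset entry without a `query` key is then counted and skipped instead of raising `KeyError` partway through a run.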
In `@README.md`:
- Around line 43-51: The environment variables table under "## ⚙️ Environment
Variables" is missing BENCHMARK_RUNS and RANDOM_SEED; add two rows for
`BENCHMARK_RUNS` (describe as number of times to run benchmark/default value or
optional) and `RANDOM_SEED` (describe as deterministic seed for
benchmark/randomness/default value or optional) to document their purpose and
whether they are required, referencing the variables by name so users can find
them easily.
In `@requirements.txt`:
- Line 2: requirements.txt currently allows mem0ai>=0.1.0 which is too
permissive given advisories; tighten the version specifier (e.g., set a safe
minimum like mem0ai>=2.0.4 or another non-vulnerable minimum) to exclude the
vulnerable ranges (<=1.0.0 and <2.0.0b2), and ensure benchmark code actually
uses authenticated endpoints by verifying the MemoryClient construction in
benchmarks/mem0_adapter.py and memanto/cli/analyze/mem0_export.py requires a
non-empty MEM0_API_KEY (fail fast if missing) and does not default to any
open/demo server URL so the tests target an endpoint that enforces
auth/authorization.
---
Outside diff comments:
In @.env.example:
- Around line 16-21: The repo documents BENCHMARK_RUNS and RANDOM_SEED but
run_benchmark.py only uses CLI --runs and there is no RNG seeding; update
run_benchmark.py to read BENCHMARK_RUNS and RANDOM_SEED from the environment as
fallbacks for the CLI (use os.getenv('BENCHMARK_RUNS') to set default for the
--runs value) and implement deterministic seeding in the benchmark startup
(e.g., a seed_randomness(seed) helper called from main() that seeds Python
random, numpy.random, and any other RNGs) so env vars are actually consumed;
alternatively remove BENCHMARK_RUNS and RANDOM_SEED from .env.example if you
prefer not to support env-based configuration.
---
Nitpick comments:
In `@benchmarks/mem0_adapter.py`:
- Line 35: The current heuristic estimating tokens via tokens =
len(content.split()) * 2 is arbitrary; replace it by using the actual
token-count returned by Mem0 API when available (use the response's usage field)
or, if the API doesn't provide usage, compute tokens via a proper tokenizer
(e.g., tiktoken encoding relevant to the model) instead of content.split();
update every place that uses the tokens variable (the tokens assignment and any
methods in this file that reuse that heuristic) to either read
response.usage.total_tokens or call a tokenizer helper, and add a brief comment
documenting the fallback behavior.
- Around line 31-43: The store method in benchmarks/mem0_adapter.py currently
returns latency_ms=0 which relies on callers wrapping calls with
adapter.timed_call; add a concise clarifying comment above the store method
(similar to the note in memanto_adapter.py) stating that latency_ms is
intentionally 0 and callers must call adapter.timed_call(self.store, ...) (or
equivalent) to capture real latency metrics, and reference MemoryResult and the
_client.add call in the comment so future maintainers know this is by design
rather than a bug.
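For context, the external-timing contract could work along these lines. The PR's actual `timed_call` is not shown in this review, so this is an assumed implementation written as a free function, and `MemoryResult` here is a simplified stand-in for the type in `benchmarks/base.py`.

```python
import time
from dataclasses import dataclass, replace
from typing import Any, Callable


@dataclass
class MemoryResult:  # simplified stand-in, fields inferred from this review
    success: bool
    latency_ms: float = 0.0
    tokens_used: int = 0
    data: Any = None


def timed_call(fn: Callable[..., MemoryResult], *args, **kwargs) -> MemoryResult:
    """Run fn and return its result with latency_ms filled in (assumed design)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return replace(result, latency_ms=elapsed_ms)
```

Under this design an adapter method that returns `latency_ms=0` is only a placeholder; calling `store` directly without the wrapper is what silently yields zero latencies.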
In `@benchmarks/memanto_adapter.py`:
- Around line 49-51: The token estimation in retrieve() is inefficient because
it calls str(m) for every memory even when m is already a string; update the
computation of total_tokens (and any helper used there) to avoid redundant str()
by checking isinstance(m, str) or converting once (e.g., text = m if
isinstance(m, str) else str(m)) and then counting tokens (len(text.split()) * 2)
for each item in memories; modify the total_tokens generator/loop accordingly to
prevent repeated conversions while keeping the same token formula.
- Line 35: The token fallback silently masks missing API fields; change the
single-line fallback assignment tokens = getattr(result, "tokens_used",
len(content.split()) * 2) to explicitly detect absence (use hasattr(result,
"tokens_used") or result.__dict__.get("tokens_used") is None), compute the
heuristic only in that branch, and emit a warning-level log that includes
context (the result object or its keys, the content length, and the heuristic
value) so you can detect API changes or when the heuristic is used; keep the
original value name tokens and the same heuristic calculation but only assign it
after logging the fallback.
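A sketch of the logged fallback for the Memanto token count. `tokens_from_result` is a hypothetical helper; it uses `getattr(..., None)` so that both a missing attribute and an explicit `None` trigger the heuristic.

```python
import logging

logger = logging.getLogger(__name__)


def tokens_from_result(result: object, content: str) -> int:
    """Prefer the API-reported token count; log when falling back to the heuristic."""
    tokens = getattr(result, "tokens_used", None)
    if tokens is None:
        tokens = len(content.split()) * 2  # same heuristic as the current code
        logger.warning(
            "tokens_used missing on %s; using heuristic estimate %d for %d chars",
            type(result).__name__,
            tokens,
            len(content),
        )
    return tokens
```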
In `@benchmarks/scenario_a.py`:
- Around line 21-24: The load_dataset() function currently opens DATASET_PATH
and calls json.load without handling errors; wrap the file open and json.load
call in a try/except that catches FileNotFoundError and json.JSONDecodeError,
include the original exception details in a clearer error message (e.g., via
logging or raising a new RuntimeError) and re-raise or exit so the benchmark
runner gets a readable, actionable message; update references to DATASET_PATH
and load_dataset() only (no other logic changes).
- Around line 78-83: The current fallback that builds queries when
retrieval_queries is missing uses truncated prefixes (queries variable built
from dataset[:5] with entry["content"][:100]), which can yield poor,
non-semantic queries; update the fallback in the queries construction to produce
higher-quality queries by (1) selecting entries with substantive content (e.g.,
filter out lines that look like timestamps/metadata), (2) prefer full sentences
or the longest N entries rather than the first 100 chars, and (3) optionally
extract a question-like phrase or heuristically trim to the first meaningful
sentence; change the logic around the queries variable and the dataset[:5]
fallback to implement these heuristics (e.g., filter entries by content length
and regex for timestamps, then extract first sentence or longest content) so
generated queries are semantic and representative.
In `@README.md`:
- Around line 54-76: Summary: The README's fenced code block showing the project
tree lacks a language identifier. Fix: open the README.md, locate the fenced
block containing the ASCII project structure (the block that begins with
"memanto-benchmark/" and the triple backticks), and add a language tag such as
text or plaintext after the opening backticks (e.g., ```text) so the code fence
is Markdown-compliant; ensure you update only that fence and preserve the
existing ASCII tree content.
- Around line 78-92: Update the "🔬 Experimental Design" text to show that run
counts and seeds are configurable: change the phrasing that currently reads "3
runs per scenario" to indicate it is the default controlled by the
BENCHMARK_RUNS environment variable (default: 3), and change "Random seeds set
for deterministic evaluation" to note the RANDOM_SEED env var is used to set the
seed; reference these env vars (BENCHMARK_RUNS, RANDOM_SEED) in the
Reproducibility or Isolation & Controls subsection and say they are configurable
defaults rather than fixed values.
In `@run_benchmark.py`:
- Around line 362-364: The filenames are using a fresh datetime instead of the
benchmark session timestamp stored in the results, causing possible mismatch;
change the logic that sets json_path and html_path to derive the timestamp from
results[0]["timestamp"] (from the run_benchmark results list) rather than
calling datetime.now(), validating that results is non-empty and that the
timestamp key exists and falling back to
datetime.now(timezone.utc).strftime(...) only if results are missing or
malformed; update the variables json_path and html_path to use that session
timestamp so the filenames match the embedded result timestamp.
- Around line 99-150: The aggregation block duplicated in
generate_console_report and generate_html_report should be extracted into a
shared helper _aggregate_results(results: list[dict]) that groups by
(r["framework"], r["scenario"]), accumulates lists for
"store_tokens","retrieve_tokens","store_p95","retrieve_p95","accuracy","errors",
then computes and stores aggregated values (e.g. mean_store_tokens,
mean_retrieve_tokens, mean_store_p95, mean_retrieve_p95, mean_accuracy,
total_errors) using numpy and returns the aggregated dict; update
generate_console_report and generate_html_report to call
_aggregate_results(results) and read the precomputed mean_* and total_errors
fields instead of repeating the accumulation logic.
In `@tests/test_adapters.py`:
- Line 128: Remove the unused variable monkeypatch_env from the test file:
locate the assignment monkeypatch_env = {"OPENAI_API_KEY": ""} in
tests/test_adapters.py and delete it; the test already constructs
LLMEvaluator(api_key="") so no environment-mocking is needed and removing the
unused monkeypatch_env variable will clean up the test.
- Around line 107-135: The MockAdapter class is duplicated between
TestDryRunScenarioA and TestDryRunScenarioB; extract a single shared MockAdapter
(subclassing MemoryAdapter and returning MemoryResult instances) at module scope
or as a fixture, then remove the inline class definitions from both tests and
instantiate the shared MockAdapter in each test; ensure the shared class
implements the same methods used in the diff (name property, setup, store,
retrieve, update, delete, get_all, cleanup) and returns the same MemoryResult
shapes so run_scenario_a/run_scenario_b and LLMEvaluator-based assertions remain
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 720ede40-17dc-413a-ad8e-abebb2706f94
📒 Files selected for processing (14)
- .env.example
- README.md
- benchmarks/__init__.py
- benchmarks/base.py
- benchmarks/evaluator.py
- benchmarks/mem0_adapter.py
- benchmarks/memanto_adapter.py
- benchmarks/scenario_a.py
- benchmarks/scenario_b.py
- datasets/persona_evolution.json
- datasets/technical_logs.json
- requirements.txt
- run_benchmark.py
- tests/test_adapters.py
    def store_p95_latency(self) -> float:
        if not self.store_latencies:
            return 0.0
        sorted_l = sorted(self.store_latencies)
        idx = int(len(sorted_l) * 0.95)
        return sorted_l[min(idx, len(sorted_l) - 1)]

    @property
    def retrieve_p95_latency(self) -> float:
        if not self.retrieve_latencies:
            return 0.0
        sorted_l = sorted(self.retrieve_latencies)
        idx = int(len(sorted_l) * 0.95)
        return sorted_l[min(idx, len(sorted_l) - 1)]
Fix off-by-one in p95 latency calculation.
Line 41 and Line 49 currently compute int(n * 0.95), which maps many sample sizes to the max element (e.g., n=20 → index 19). That reports p100 instead of p95 and skews published benchmark metrics.
Suggested fix
  @property
  def store_p95_latency(self) -> float:
      if not self.store_latencies:
          return 0.0
      sorted_l = sorted(self.store_latencies)
-     idx = int(len(sorted_l) * 0.95)
+     idx = int((len(sorted_l) - 1) * 0.95)
      return sorted_l[min(idx, len(sorted_l) - 1)]

  @property
  def retrieve_p95_latency(self) -> float:
      if not self.retrieve_latencies:
          return 0.0
      sorted_l = sorted(self.retrieve_latencies)
-     idx = int(len(sorted_l) * 0.95)
+     idx = int((len(sorted_l) - 1) * 0.95)
      return sorted_l[min(idx, len(sorted_l) - 1)]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmarks/base.py` around lines 37 - 50, The p95 calculation in
store_p95_latency and retrieve_p95_latency uses idx = int(len(sorted_l) * 0.95)
which can select the maximum element for many n (off-by-one); change the index
calculation to use the 1-based percentile mapping: idx = max(0,
math.ceil(len(sorted_l) * 0.95) - 1) (add import math if missing) so the 95th
percentile selects the correct element from sorted_l in both store_p95_latency
and retrieve_p95_latency.
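The index formulas that appear in this thread are not interchangeable, so it may help to compare them side by side. The sketch below is illustrative only: the function names are hypothetical, and the interpolated variant uses the standard library's `statistics.quantiles` as one possible definition rather than anything the PR ships.

```python
import math
from statistics import quantiles


def p95_floor_n(values):
    """Original index: int(n * 0.95), clamped to the last element."""
    s = sorted(values)
    return s[min(int(len(s) * 0.95), len(s) - 1)]


def p95_floor_n_minus_1(values):
    """Index from the suggested diff: int((n - 1) * 0.95)."""
    s = sorted(values)
    return s[int((len(s) - 1) * 0.95)]


def p95_nearest_rank(values):
    """Index from the agent prompt: ceil(0.95 * n) - 1 (nearest-rank method)."""
    s = sorted(values)
    return s[max(0, math.ceil(len(s) * 0.95) - 1)]


def p95_interpolated(values):
    """Linearly interpolated 95th percentile from the standard library."""
    return quantiles(values, n=100, method="inclusive")[94]


ten = list(range(10, 101, 10))   # 10, 20, ..., 100 (the list used in the test)
twenty = list(range(1, 21))      # 1, 2, ..., 20

for fn in (p95_floor_n, p95_floor_n_minus_1, p95_nearest_rank, p95_interpolated):
    print(f"{fn.__name__:22s} n=10 -> {fn(ten)}  n=20 -> {fn(twenty)}")
```

On the ten-element list, the original index and the nearest-rank index both return 100, the `(n - 1)` floor returns 90, and interpolation returns 95.5. For n=20 the original index still selects the maximum, while the other two index formulas select the 19th element. The value a test should assert therefore depends on which definition the fix adopts.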
import json
content = response.choices[0].message.content
parsed = json.loads(content)
return float(parsed.get("score", 0.0)), parsed.get("reasoning", "")
Normalize judge scores to the same [0.0, 1.0] contract as fallback.
Line 68 trusts model output verbatim; if the judge returns out-of-range values, retrieval_accuracy becomes invalid and incomparable to fallback runs.
Suggested fix
- return float(parsed.get("score", 0.0)), parsed.get("reasoning", "")
+ raw_score = float(parsed.get("score", 0.0))
+ score = max(0.0, min(1.0, raw_score))
+ return score, parsed.get("reasoning", "")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmarks/evaluator.py` at line 68, The returned judge score should be
normalized to the [0.0, 1.0] contract before being used; replace the direct
return of float(parsed.get("score", 0.0)) with a safe conversion that handles
non-numeric values (fall back to 0.0), then clamp the resulting value to the
range 0.0..1.0 (e.g., max(0.0, min(1.0, score))), and return that normalized
score alongside parsed.get("reasoning",""); update the return site where the
tuple is produced (the line using parsed.get("score", 0.0) and
parsed.get("reasoning")) so downstream metrics like retrieval_accuracy remain
valid and comparable to fallback runs.
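The safe conversion described in the prompt above can be sketched as a small helper. This is an assumption-laden illustration: `normalize_judge_score` is a hypothetical name, and returning `(0.0, "")` for an unparseable reply is a choice made for the sketch; the PR's evaluator may instead route such replies to its fallback scorer.

```python
import json
import math


def normalize_judge_score(content: str) -> tuple[float, str]:
    """Parse the judge's JSON reply and clamp the score into [0.0, 1.0]."""
    try:
        parsed = json.loads(content)
    except (TypeError, ValueError):
        # Not valid JSON (json.JSONDecodeError subclasses ValueError).
        return 0.0, ""
    if not isinstance(parsed, dict):
        return 0.0, ""

    try:
        raw = float(parsed.get("score", 0.0))
    except (TypeError, ValueError):
        # Non-numeric score such as "high", null, or a list.
        raw = 0.0
    if math.isnan(raw):
        raw = 0.0

    score = max(0.0, min(1.0, raw))
    return score, str(parsed.get("reasoning", ""))
```

Clamping after the conversion keeps out-of-range judge replies comparable with fallback runs, which already stay within the same range.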
    def setup(self, user_id: str) -> None:
        api_key = os.environ.get("MEM0_API_KEY", "")
        if not api_key:
            raise ValueError("MEM0_API_KEY environment variable is required")
        from mem0 import MemoryClient
        self._client = MemoryClient(api_key=api_key)
        self._user_id = user_id
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify Mem0 client initialization requirements
# Check if there are any examples or documentation about MemoryClient usage
rg -nP -A5 -B5 'MemoryClient\s*\(' --type=py
# Search for user_id parameter usage in mem0 contexts
rg -nP 'user_id' --type=py -g '*mem0*'
Repository: moorcheh-ai/memanto
Length of output: 2287
🌐 Web query:
Mem0 Python SDK MemoryClient initialization and user_id configuration
💡 Result:
To initialize the Mem0 Python SDK and configure user-specific memory, follow these steps:
Initialization: To use the Mem0 Platform API, you initialize the MemoryClient class [1][2]. The api_key parameter is required; it can be passed directly or set via the MEM0_API_KEY environment variable [3][4][2].
    from mem0 import MemoryClient
    # Initialize client
    client = MemoryClient(api_key="your-api-key")
Optional configuration parameters for the constructor include host, org_id, and project_id [1][2].
User ID Configuration: In Mem0, user_id is not a global configuration setting for the client instance itself [5][2]. Instead, it is passed as a mandatory or contextual parameter to individual memory operations (such as add, search, get_all, or delete) to scope the operation to a specific user [3][6][7][2].
Adding memory for a specific user:
    client.add(messages=[{"role": "user", "content": "I like dark chocolate."}], user_id="user123")
Searching memory for a specific user:
    results = client.search("What do I like?", user_id="user123")
Scope and Entity Configuration: Mem0 uses a multi-dimensional scoping system where you can define boundaries using user_id, agent_id, app_id, and run_id [7]. When performing operations, you specify these identifiers to maintain data privacy and context [5][7]. Note that memory records are scoped to these primary entities; for example, if you include both a user_id and an agent_id in an operation, Mem0 treats them as specific entity scopes [7].
Top results: [1][3][6][7][2]
Citations:
- 1: https://github.com/mem0ai/mem0/blob/HEAD/skills/mem0/references/sdk-guide.md
- 2: https://github.com/mem0ai/mem0/blob/ece7ff6b/LLM.md
- 3: https://github.com/mem0ai/mem0/blob/HEAD/skills/mem0/references/quickstart.md
- 4: https://docs.mem0.ai/platform/quickstart
- 5: https://github.com/mem0ai/mem0/blob/main/mem0/memory/main.py
- 6: https://docs.mem0.ai/core-concepts/memory-operations/add
- 7: https://docs.mem0.ai/platform/features/entity-scoped-memory
Review Mem0 client initialization and user_id scoping in benchmarks/mem0_adapter.py
- Mem0 docs state that `MemoryClient` initialization requires `api_key`; other constructor options such as `host`, `org_id`, and `project_id` are optional, so `MemoryClient(api_key=api_key)` is consistent with the SDK's expected initialization pattern.
- `user_id` is not configured at the client level in Mem0; it is provided to each memory operation for scoping (and this adapter passes `user_id` for `add`, `search`, `get_all`, and `delete_all`).
- `setup()` therefore doesn't need to "register" `user_id`; adding a basic guard (e.g., a non-empty `user_id`) before storing or using it would be a maintainability improvement.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmarks/mem0_adapter.py` around lines 23 - 29, The setup method
initializes MemoryClient correctly but doesn't validate the incoming user_id;
update the setup function to check that user_id is non-empty (e.g., raise
ValueError("user_id is required") when blank) before assigning self._user_id,
and keep the existing MemoryClient(api_key=api_key) initialization using
MemoryClient to preserve SDK usage; refer to setup, self._client, and
self._user_id when making the change.
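A guard along the lines described above might look like this. The class name and the `client_factory` parameter are hypothetical, introduced only so the sketch runs without the mem0 package installed; the PR's adapter constructs `MemoryClient(api_key=api_key)` directly.

```python
import os


class Mem0AdapterSketch:
    """Hypothetical stand-in for the adapter's setup(); not the PR's class."""

    def __init__(self, client_factory):
        # Illustrative injection point; in the PR this would be mem0.MemoryClient.
        self._client_factory = client_factory
        self._client = None
        self._user_id = None

    def setup(self, user_id: str) -> None:
        # Reject empty or whitespace-only ids before touching the client.
        if not user_id or not user_id.strip():
            raise ValueError("user_id is required")
        api_key = os.environ.get("MEM0_API_KEY", "")
        if not api_key:
            raise ValueError("MEM0_API_KEY environment variable is required")
        self._client = self._client_factory(api_key=api_key)
        self._user_id = user_id
```

Validating `user_id` first means a misconfigured benchmark run fails before any client is created, rather than on the first scoped memory operation.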
Actionable comments posted: 3
♻️ Duplicate comments (2)
requirements.txt (1)
2-2: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Tighten mem0ai version constraint to exclude vulnerable versions.
The current specification `mem0ai>=0.1.0` allows vulnerable versions. Security advisories indicate issues in versions ≤1.0.0 and <2.0.0b2 (authentication/authorization bypass, improper input validation). The latest version on PyPI is 2.0.4. The benchmark adapters already pass API keys, but you should ensure the specific mem0 endpoints used enforce authentication.
🔒 Proposed fix
-mem0ai>=0.1.0
+mem0ai>=2.0.4
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@requirements.txt` at line 2, Update the mem0ai requirement to avoid known vulnerable releases by tightening the constraint from "mem0ai>=0.1.0" to a safe range (e.g., "mem0ai>=2.0.4,<3.0.0") in requirements.txt; keep the benchmark adapters that call mem0 endpoints (the existing adapter code) using explicit API key parameters and verify those endpoint calls enforce authentication by checking the adapter functions that pass the API keys.
benchmarks/base.py (1)
37-50: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Fix off-by-one error in p95 latency calculation.
Both `store_p95_latency` and `retrieve_p95_latency` use `idx = int(len(sorted_l) * 0.95)`, which incorrectly returns the maximum element for many sample sizes. For example, with n=10 items, this computes idx=9 (the last element), reporting the 100th percentile instead of the 95th percentile. This corrupts published benchmark metrics and misrepresents the performance comparison between Memanto and Mem0.
🐛 Proposed fix
  @property
  def store_p95_latency(self) -> float:
      if not self.store_latencies:
          return 0.0
      sorted_l = sorted(self.store_latencies)
-     idx = int(len(sorted_l) * 0.95)
+     idx = int((len(sorted_l) - 1) * 0.95)
      return sorted_l[min(idx, len(sorted_l) - 1)]

  @property
  def retrieve_p95_latency(self) -> float:
      if not self.retrieve_latencies:
          return 0.0
      sorted_l = sorted(self.retrieve_latencies)
-     idx = int(len(sorted_l) * 0.95)
+     idx = int((len(sorted_l) - 1) * 0.95)
      return sorted_l[min(idx, len(sorted_l) - 1)]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/base.py` around lines 37 - 50, The p95 calculation in store_p95_latency and retrieve_p95_latency uses idx = int(len(sorted_l) * 0.95) which can pick the last element (100th percentile) for many n; change the index computation to compute the 95th percentile rank properly (e.g., idx = math.ceil(0.95 * len(sorted_l)) - 1) and then clamp with min/max to the valid range; update both functions and add the necessary math import if not present.
🧹 Nitpick comments (4)
tests/test_adapters.py (3)
109-123: 💤 Low value
Consider extracting MockAdapter to a shared pytest fixture.
The `MockAdapter` implementation is duplicated in both `TestDryRunScenarioA` (lines 109-123) and `TestDryRunScenarioB` (lines 149-163) with nearly identical code. Extracting to a pytest fixture would reduce duplication and improve maintainability.
♻️ Example refactor
@pytest.fixture
def mock_adapter():
    class MockAdapter(MemoryAdapter):
        @property
        def name(self): return "Mock"
        def setup(self, uid): pass
        def store(self, content, metadata=None): return MemoryResult(True, 5.0, len(content.split())*2)
        def retrieve(self, query, limit=5): return MemoryResult(True, 3.0, 100, ["mock memory"])
        def update(self, mid, content): return MemoryResult(True, 5.0)
        def delete(self, mid): return MemoryResult(True, 2.0)
        def get_all(self): return MemoryResult(True, 1.0, data=[])
        def cleanup(self): pass
    return MockAdapter()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_adapters.py` around lines 109 - 123, Extract the duplicated MockAdapter class into a pytest fixture named mock_adapter and use it in both TestDryRunScenarioA and TestDryRunScenarioB; locate the existing MockAdapter (subclassing MemoryAdapter and returning MemoryResult in methods store/retrieve/update/delete/get_all) and move that implementation into a fixture function decorated with `@pytest.fixture` that returns an instance, then replace the inline class definitions in both tests with the fixture parameter (mock_adapter) so tests reuse the same adapter instance and eliminate duplication.
128-128: ⚡ Quick win
Remove unused variable `monkeypatch_env`.
Line 128 defines `monkeypatch_env = {"OPENAI_API_KEY": ""}` but never uses it. This is dead code.
♻️ Proposed fix
  from benchmarks.scenario_a import run_scenario_a
  from benchmarks.evaluator import LLMEvaluator
  import os
- monkeypatch_env = {"OPENAI_API_KEY": ""}
  evaluator = LLMEvaluator(api_key="")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_adapters.py` at line 128, Remove the unused variable monkeypatch_env defined in tests/test_adapters.py (the assignment monkeypatch_env = {"OPENAI_API_KEY": ""}) since it is dead code; locate the variable by name and delete the line or unused declaration so there are no unused locals left in the test file.
7-7: ⚡ Quick win
Remove sys.path manipulation in tests.
Manipulating `sys.path` in tests is an anti-pattern. Tests should rely on proper package structure and installation (e.g., `pip install -e .` for an editable install). This ensures tests run in the same environment as production code.
♻️ Recommended approach
Remove line 7 and ensure the project has a proper `setup.py` or `pyproject.toml` that allows:
pip install -e .
pytest tests/
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_adapters.py` at line 7, Remove the sys.path manipulation by deleting the call to sys.path.insert(0, str(Path(__file__).parent.parent)) in tests (remove the use of sys.path.insert and Path(__file__).parent.parent in tests/test_adapters.py); instead rely on a proper package install for test imports (use pip install -e . via pyproject.toml or setup.py) so tests import modules from the installed package rather than mutating sys.path.
README.md (1)
54-76: 💤 Low value
Add language identifier to the project structure code fence.
Markdown best practice is to specify a language (or use plain text) for fenced code blocks to improve rendering and accessibility.
📝 Proposed fix
-```
+```plaintext
 memanto-benchmark/
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 54 - 76, The README's project-structure fenced code block lacks a language identifier; edit the README.md and change the opening triple backticks for the tree (the block showing "memanto-benchmark/ ..." under the diff) to include a language token such as ```plaintext (or ```text) so the directory tree renders/accessibility is improved and syntax highlighters treat it as plain text.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/base.py`:
- Around line 1-9: Add a short statement to the README documenting the project's
minimum Python version (Python 3.10+) to match pyproject.toml's requires-python
and classifiers; update the top-level README.md to explicitly state "Requires
Python 3.10 or newer (tested on 3.10/3.11/3.12)" and, if appropriate, add a
short note about compatibility or testing matrix to keep it consistent with the
package metadata.
In `@requirements.txt`:
- Line 7: Update the jinja2 requirement in requirements.txt to pin the minimum
safe version by replacing the current "jinja2>=3.1.0" entry with "jinja2>=3.1.6"
so vulnerable 3.1.x releases (and earlier) are excluded; ensure the change is
applied to the jinja2 line in requirements.txt and run dependency install/checks
afterward.
In `@tests/test_adapters.py`:
- Around line 37-40: The test test_p95_latency currently asserts the buggy
behavior; update it to assert the value that the corrected
BenchmarkMetric.store_p95_latency (property in benchmarks/base.py) returns for
the list [10,20,...,100]. The expected value depends on the percentile
definition base.py adopts: 90.0 for idx = int((len(sorted_l) - 1) * 0.95), or
95.5 (via pytest.approx) for a linearly interpolated percentile; note that the
nearest-rank index math.ceil(0.95 * len(sorted_l)) - 1 still yields 100.0 for
this ten-element list.
---
Duplicate comments:
In `@benchmarks/base.py`:
- Around line 37-50: The p95 calculation in store_p95_latency and
retrieve_p95_latency uses idx = int(len(sorted_l) * 0.95) which can pick the
last element (100th percentile) for many n; change the index computation to
compute the 95th percentile rank properly (e.g., idx = math.ceil(0.95 *
len(sorted_l)) - 1) and then clamp with min/max to the valid range; update both
functions and add the necessary math import if not present.
In `@requirements.txt`:
- Line 2: Update the mem0ai requirement to avoid known vulnerable releases by
tightening the constraint from "mem0ai>=0.1.0" to a safe range (e.g.,
"mem0ai>=2.0.4,<3.0.0") in requirements.txt; keep the benchmark adapters that
call mem0 endpoints (the existing adapter code) using explicit API key
parameters and verify those endpoint calls enforce authentication by checking
the adapter functions that pass the API keys.
---
Nitpick comments:
In `@README.md`:
- Around line 54-76: The README's project-structure fenced code block lacks a
language identifier; edit the README.md and change the opening triple backticks
for the tree (the block showing "memanto-benchmark/ ..." under the diff) to
include a language token such as ```plaintext (or ```text) so the directory tree
renders/accessibility is improved and syntax highlighters treat it as plain
text.
In `@tests/test_adapters.py`:
- Around line 109-123: Extract the duplicated MockAdapter class into a pytest
fixture named mock_adapter and use it in both TestDryRunScenarioA and
TestDryRunScenarioB; locate the existing MockAdapter (subclassing MemoryAdapter
and returning MemoryResult in methods store/retrieve/update/delete/get_all) and
move that implementation into a fixture function decorated with `@pytest.fixture`
that returns an instance, then replace the inline class definitions in both
tests with the fixture parameter (mock_adapter) so tests reuse the same adapter
instance and eliminate duplication.
- Line 128: Remove the unused variable monkeypatch_env defined in
tests/test_adapters.py (the assignment monkeypatch_env = {"OPENAI_API_KEY": ""})
since it is dead code; locate the variable by name and delete the line or unused
declaration so there are no unused locals left in the test file.
- Line 7: Remove the sys.path manipulation by deleting the call to
sys.path.insert(0, str(Path(__file__).parent.parent)) in tests (remove the use
of sys.path.insert and Path(__file__).parent.parent in tests/test_adapters.py);
instead rely on a proper package install for test imports (use pip install -e .
via pyproject.toml or setup.py) so tests import modules from the installed
package rather than mutating sys.path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: b96bc24d-05c6-4c92-bc80-14766910138d
📒 Files selected for processing (14)
- .env.example
- README.md
- benchmarks/__init__.py
- benchmarks/base.py
- benchmarks/evaluator.py
- benchmarks/mem0_adapter.py
- benchmarks/memanto_adapter.py
- benchmarks/scenario_a.py
- benchmarks/scenario_b.py
- datasets/persona_evolution.json
- datasets/technical_logs.json
- requirements.txt
- run_benchmark.py
- tests/test_adapters.py
✅ Files skipped from review due to trivial changes (4)
- datasets/persona_evolution.json
- .env.example
- benchmarks/__init__.py
- datasets/technical_logs.json
🚧 Files skipped from review as they are similar to previous changes (6)
- benchmarks/evaluator.py
- benchmarks/scenario_a.py
- benchmarks/memanto_adapter.py
- run_benchmark.py
- benchmarks/scenario_b.py
- benchmarks/mem0_adapter.py
"""
Base classes for the memory benchmark framework.
"""

import time
import statistics
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check for Python version requirements in project files
echo "=== Searching for Python version declarations ==="
rg -n "python_requires|requires-python|Programming Language :: Python :: 3\." setup.py pyproject.toml setup.cfg README.md .python-version 2>/dev/null || echo "No version constraints found"
echo -e "\n=== Checking if README documents Python version ==="
rg -n -i "python 3\.(10|11|12)" README.md || echo "Python version not documented in README"
Repository: moorcheh-ai/memanto
Length of output: 474
Document Python 3.10+ requirement in README (pyproject already set)
- `pyproject.toml` already declares `requires-python = ">=3.10,<4"` and includes Python 3.10/3.11/3.12 classifiers.
- `README.md` has no explicit mention of Python 3.10/3.11/3.12; add the Python 3.10+ minimum there.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmarks/base.py` around lines 1 - 9, Add a short statement to the README
documenting the project's minimum Python version (Python 3.10+) to match
pyproject.toml's requires-python and classifiers; update the top-level README.md
to explicitly state "Requires Python 3.10 or newer (tested on 3.10/3.11/3.12)"
and, if appropriate, add a short note about compatibility or testing matrix to
keep it consistent with the package metadata.
python-dotenv>=1.0.0
numpy>=1.24.0
tabulate>=0.9.0
jinja2>=3.1.0
🧩 Analysis chain
🌐 Web query:
What is the latest stable version of jinja2 and which version fixed the sandbox breakout vulnerabilities (GHSA-cpwx-vrp4-4pq7, GHSA-gmj6-6f8f-6699)?
💡 Result:
As of June 10, 2026, the latest stable version of Jinja2 is 3.1.6 [1][2][3]. The sandbox breakout vulnerabilities you referenced were addressed in the following versions: - GHSA-gmj6-6f8f-6699 (CVE-2024-56201): Fixed in version 3.1.5 [4][5][6]. This vulnerability involved a flaw in the Jinja compiler that allowed for a sandbox breakout if an attacker controlled both the content and the filename of a template [7][5]. - GHSA-cpwx-vrp4-4pq7 (CVE-2025-27516): Fixed in version 3.1.6 [8][9][10]. This vulnerability stemmed from an oversight where the |attr filter could be used to bypass the environment's attribute lookup, allowing access to restricted methods like str.format [8][9]. Users are encouraged to use the latest stable version (3.1.6) to ensure protection against these and other vulnerabilities [2].
Citations:
- 1: https://pypi.org/project/Jinja2/
- 2: https://github.com/pallets/jinja/releases
- 3: https://github.com/pallets/jinja/tree/stable
- 4: GHSA-gmj6-6f8f-6699
- 5: https://nvd.nist.gov/vuln/detail/CVE-2024-56201
- 6: https://feedly.com/cve/CVE-2024-56201
- 7: https://osv.dev/vulnerability/GHSA-gmj6-6f8f-6699
- 8: GHSA-cpwx-vrp4-4pq7
- 9: https://nvd.nist.gov/vuln/detail/CVE-2025-27516
- 10: https://github.com/pallets/jinja/blob/main/CHANGES.rst
🌐 Web query:
GHSA-gmj6-6f8f-6699 affected versions fixed in 3.1.5 Jinja2 what versions are impacted
💡 Result:
The vulnerability GHSA-gmj6-6f8f-6699 (also identified as CVE-2024-56201) affects Jinja2 versions in the 3.x branch, specifically those from 3.0.0 up to and including 3.1.4 [1][2][3]. This vulnerability is fixed in Jinja2 version 3.1.5 [2][4][5]. The issue involves a sandbox breakout that can occur when an attacker controls both the content and the filename of a template, allowing for the execution of arbitrary Python code regardless of whether the Jinja sandbox is enabled [2][6][3]. It has been confirmed that versions prior to 3.0.0a1 are not impacted because they did not use the f-string syntax in the error messages that caused the vulnerability [7]. Users are advised to upgrade to version 3.1.5 or later to mitigate this risk [2][4][3].
Citations:
- 1: https://cvepremium.circl.lu/vuln/ghsa-gmj6-6f8f-6699
- 2: GHSA-gmj6-6f8f-6699
- 3: https://cve.circl.lu/vuln/cve-2024-56201
- 4: https://github.com/pallets/jinja/releases/tag/3.1.5
- 5: https://jinja.palletsprojects.com/en/stable/changes/
- 6: https://osv.dev/vulnerability/GHSA-gmj6-6f8f-6699
- 7: [GHSA-gmj6-6f8f-6699] Jinja has a sandbox breakout through malicious filenames github/advisory-database#5150
🌐 Web query:
GHSA-cpwx-vrp4-4pq7 affected versions fixed in 3.1.6 Jinja2 what versions are impacted
💡 Result:
The security vulnerability GHSA-cpwx-vrp4-4pq7 (CVE-2025-27516) affects all Jinja2 versions up to and including 3.1.5 [1]. This vulnerability, which involves a sandbox breakout through the attr filter, was addressed and fixed in Jinja2 version 3.1.6, released on March 5, 2025 [2][3][4]. Users are advised to upgrade to version 3.1.6 or later to mitigate this issue [2].
Citations:
- 1: https://osv.dev/vulnerability/GHSA-cpwx-vrp4-4pq7
- 2: https://github.com/pallets/jinja/releases/tag/3.1.6
- 3: https://raw.githubusercontent.com/pallets/jinja/refs/tags/3.1.6/CHANGES.rst
- 4: https://jinja.palletsprojects.com/changes/
Tighten the jinja2 constraint to avoid sandbox-breakout CVEs
File: requirements.txt
Lines: 7-7
jinja2>=3.1.0
jinja2>=3.1.0 permits vulnerable releases:
- GHSA-gmj6-6f8f-6699 (CVE-2024-56201): affects 3.0.0–3.1.4; fixed in 3.1.5 (sandbox breakout when template content+filename are attacker-controlled).
- GHSA-cpwx-vrp4-4pq7 (CVE-2025-27516): affects up to 3.1.5; fixed in 3.1.6 (sandbox breakout via the `attr` filter).
Upgrade the requirement to exclude everything before the fixed version (e.g., use jinja2>=3.1.6).
🧰 Tools
🪛 OSV Scanner (2.3.8)
[HIGH] 7-7: jinja2 3.1.0: Jinja2 vulnerable to sandbox breakout through attr filter selecting format method
[HIGH] 7-7: jinja2 3.1.0: Jinja has a sandbox breakout through malicious filenames
[HIGH] 7-7: jinja2 3.1.0: Jinja vulnerable to HTML attribute injection when passing user input as keys to xmlattr filter
[HIGH] 7-7: jinja2 3.1.0: Jinja vulnerable to HTML attribute injection when passing user input as keys to xmlattr filter
[HIGH] 7-7: jinja2 3.1.0: Jinja has a sandbox breakout through indirect reference to format method
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@requirements.txt` at line 7, Update the jinja2 requirement in
requirements.txt to pin the minimum safe version by replacing the current
"jinja2>=3.1.0" entry with "jinja2>=3.1.6" so vulnerable 3.1.x releases (and
earlier) are excluded; ensure the change is applied to the jinja2 line in
requirements.txt and run dependency install/checks afterward.
    def test_p95_latency(self):
        m = BenchmarkMetric(framework="Test", scenario="S1")
        m.store_latencies = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
        assert m.store_p95_latency == 100.0  # 95th percentile of 10 items
Test validates incorrect p95 calculation — will fail when base.py is fixed.
This test expects store_p95_latency to return 100.0 (the maximum value) for a 10-element list. However, the 95th percentile should be approximately 95 or 90, not 100. The test currently passes because it validates the buggy behavior in benchmarks/base.py lines 41 and 49. Once the p95 calculation is corrected, this test will fail.
💚 Proposed fix to match correct p95 calculation
def test_p95_latency(self):
m = BenchmarkMetric(framework="Test", scenario="S1")
m.store_latencies = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
- assert m.store_p95_latency == 100.0 # 95th percentile of 10 items
+ # 95th percentile: idx = int((10-1)*0.95) = int(8.55) = 8 → 90.0
+ assert m.store_p95_latency == 90.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_adapters.py` around lines 37 - 40, The test test_p95_latency
currently asserts the buggy behavior; update it to assert the correct 95th
percentile for BenchmarkMetric.store_p95_latency (property in
benchmarks/base.py) using the proper percentile calculation for the list
[10,20,...,100]; replace the hardcoded 100.0 with the correct value
(approximately 95.5) and use pytest.approx(95.5) or a small tolerance to avoid
brittle equality.
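The values quoted in this comment (90.0, about 95.5, and 100.0) correspond to different percentile definitions applied to the same ten samples. The sketch below is illustrative only: benchmarks/base.py is not shown in this thread, so the function names are hypothetical and none of them is claimed to be the project's implementation. Inputs are assumed to be non-empty.

```python
import math


def p95_truncated_index(values):
    """Index int((n - 1) * 0.95) into the sorted data (the proposed fix's formula)."""
    data = sorted(values)
    return float(data[int((len(data) - 1) * 0.95)])


def p95_linear(values):
    """Linear interpolation between the two closest ranks."""
    data = sorted(values)
    pos = (len(data) - 1) * 0.95
    lo = math.floor(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (pos - lo) * (data[hi] - data[lo])


def p95_nearest_rank(values):
    """Nearest rank: the value at 1-based rank ceil(0.95 * n) in the sorted data."""
    data = sorted(values)
    rank = math.ceil(0.95 * len(data))
    return float(data[rank - 1])


latencies = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(p95_truncated_index(latencies))   # 90.0
print(round(p95_linear(latencies), 6))  # 95.5
print(p95_nearest_rank(latencies))      # 100.0
```

Whichever definition the fix adopts, the test should assert the value that definition produces; with interpolation, `pytest.approx` avoids floating-point brittleness, as the prompt above suggests.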
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
README.md (2)
40-40: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Update the path to reflect the timestamped filename.
The generated HTML report includes a timestamp in its filename (e.g., `benchmark_report_20260610_120000.html`), but this line shows a static path. Users following this command will encounter a file-not-found error.
📝 Proposed fix
-# 4. View report
-open reports/benchmark_report.html
+# 4. View report (use the latest generated file)
+open reports/benchmark_report_*.html
Alternatively, document the timestamp pattern:
-# 4. View report
-open reports/benchmark_report.html
+# 4. View report
+# Reports are timestamped: benchmark_report_YYYYMMDD_HHMMSS.html
+open reports/benchmark_report_*.html
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@README.md` at line 40, Update the README command that opens the generated report to use the timestamped filename pattern instead of the static "reports/benchmark_report.html"; replace or augment the example command that references "benchmark_report.html" with a pattern or placeholder like "reports/benchmark_report_YYYYMMDD_HHMMSS.html" (or show how to glob for the latest file, e.g., use a shell glob or find command) so users can reliably open the actual generated file.
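A shell glob such as `benchmark_report_*.html` matches every report in the directory, not only the newest. Selecting the latest one can be sketched in Python as follows, assuming the `reports/` directory and the `YYYYMMDD_HHMMSS` suffix described above; `latest_report` is a hypothetical helper, not something the PR provides.

```python
from pathlib import Path


def latest_report(reports_dir="reports", pattern="benchmark_report_*.html"):
    """Return the newest matching report path, or None if nothing matches.

    Relies on YYYYMMDD_HHMMSS suffixes sorting chronologically as plain strings.
    """
    candidates = sorted(Path(reports_dir).glob(pattern), key=lambda p: p.name)
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    report = latest_report()
    if report is None:
        print("no report found; run the benchmark first")
    else:
        print(report.resolve())
```

Sorting by name rather than modification time keeps the result stable even if older reports are copied or touched later.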
99-100: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Update output paths to reflect the timestamped filenames.
The actual generated reports include timestamps in their filenames (e.g., `benchmark_results_20260610_120000.json`), but these lines show static paths. This inconsistency may confuse users searching for the output files.
📝 Proposed fix
 The benchmark generates:
 1. **Console summary** — key metrics comparison table
-2. **JSON report** — `reports/benchmark_results.json`
-3. **HTML report** — `reports/benchmark_report.html` with charts
+2. **JSON report** — `reports/benchmark_results_<timestamp>.json`
+3. **HTML report** — `reports/benchmark_report_<timestamp>.html` with charts
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@README.md` around lines 99 - 100, The README lists static output paths 'reports/benchmark_results.json' and 'reports/benchmark_report.html' but the tool actually emits timestamped filenames; update these entries to reflect the actual filename patterns (e.g., 'reports/benchmark_results_YYYYMMDD_HHMMSS.json' and 'reports/benchmark_report_YYYYMMDD_HHMMSS.html' or a glob like 'reports/benchmark_results_*.json' and 'reports/benchmark_report_*.html') so users can find the generated files; adjust the two lines that reference those filenames accordingly.
🧹 Nitpick comments (1)
README.md (1)
87-87: 💤 Low value
Clarify that the run count is configurable.
While 3 is the default, the number of runs is configurable via the `BENCHMARK_RUNS` environment variable or the `--runs` CLI flag. This line presents it as a fixed value, which may mislead users.
📝 Suggested clarification
-- **3 runs per scenario** with statistical aggregation
+- **Configurable runs per scenario** (default: 3) with statistical aggregation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@README.md` at line 87, Update the README line that states "3 runs per scenario" to clarify that 3 is the default but configurable: mention the BENCHMARK_RUNS environment variable and the --runs CLI flag as ways to change the run count, e.g., "Default: 3 runs per scenario (configurable via BENCHMARK_RUNS or --runs)"; ensure this references the default value and both configuration options (BENCHMARK_RUNS and --runs).
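How the two configuration sources might combine can be sketched with `argparse`. The precedence shown (CLI flag, then environment variable, then the default of 3) is an assumption; the thread does not show how run_benchmark.py resolves it, and `resolve_runs` is a hypothetical name.

```python
import argparse
import os

DEFAULT_RUNS = 3  # default stated in the README


def resolve_runs(argv=None, env=None):
    """Resolve the run count.

    Assumed precedence: --runs flag, then BENCHMARK_RUNS, then DEFAULT_RUNS.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="benchmark runner (sketch)")
    parser.add_argument("--runs", type=int, default=None,
                        help="number of runs per scenario")
    args = parser.parse_args(argv)

    if args.runs is not None:
        runs = args.runs
    else:
        runs = int(env.get("BENCHMARK_RUNS", DEFAULT_RUNS))

    if runs < 1:
        raise ValueError("runs must be a positive integer")
    return runs


# Explicit argv/env are passed here so the example does not depend on the shell.
print(resolve_runs([], {}))                                   # 3
print(resolve_runs([], {"BENCHMARK_RUNS": "5"}))              # 5
print(resolve_runs(["--runs", "7"], {"BENCHMARK_RUNS": "5"}))  # 7
```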
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@README.md`:
- Around line 47-48: The table rows for BENCHMARK_RUNS and RANDOM_SEED are
missing the third "Required" column; update the README table rows for
`BENCHMARK_RUNS` and `RANDOM_SEED` to include the required-status column and
mark both as not required (❌) so each row has three columns matching the header.
---
Outside diff comments:
In `@README.md`:
- Line 40: Update the README command that opens the generated report to use the
timestamped filename pattern instead of the static
"reports/benchmark_report.html"; replace or augment the example command that
references "benchmark_report.html" with a pattern or placeholder like
"reports/benchmark_report_YYYYMMDD_HHMMSS.html" (or show how to glob for the
latest file, e.g., use a shell glob or find command) so users can reliably open
the actual generated file.
- Around line 99-100: The README lists static output paths
'reports/benchmark_results.json' and 'reports/benchmark_report.html' but the
tool actually emits timestamped filenames; update these entries to reflect the
actual filename patterns (e.g., 'reports/benchmark_results_YYYYMMDD_HHMMSS.json'
and 'reports/benchmark_report_YYYYMMDD_HHMMSS.html' or a glob like
'reports/benchmark_results_*.json' and 'reports/benchmark_report_*.html') so
users can find the generated files; adjust the two lines that reference those
filenames accordingly.
---
Nitpick comments:
In `@README.md`:
- Line 87: Update the README line that states "3 runs per scenario" to clarify
that 3 is the default but configurable: mention the BENCHMARK_RUNS environment
variable and the --runs CLI flag as ways to change the run count, e.g.,
"Default: 3 runs per scenario (configurable via BENCHMARK_RUNS or --runs)";
ensure this references the default value and both configuration options
(BENCHMARK_RUNS and --runs).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: fc18ff0a-d59f-42f5-885c-ede1d197a236
📒 Files selected for processing (6)
- README.md
- benchmarks/base.py
- benchmarks/evaluator.py
- benchmarks/mem0_adapter.py
- benchmarks/scenario_a.py
- requirements.txt
🚧 Files skipped from review as they are similar to previous changes (5)
- requirements.txt
- benchmarks/mem0_adapter.py
- benchmarks/evaluator.py
- benchmarks/scenario_a.py
- benchmarks/base.py
| `BENCHMARK_RUNS` | Number of benchmark runs per scenario (default: 3) |
| `RANDOM_SEED` | Random seed for reproducibility (default: 42) |
Add the missing "Required" column to complete the table.
Lines 47-48 are missing the third column expected by the table header. Since both BENCHMARK_RUNS and RANDOM_SEED have default values, they should be marked as not required (❌).
📝 Proposed fix
-| `BENCHMARK_RUNS` | Number of benchmark runs per scenario (default: 3) |
-| `RANDOM_SEED` | Random seed for reproducibility (default: 42) |
+| `BENCHMARK_RUNS` | Number of benchmark runs per scenario (default: 3) | ❌ |
+| `RANDOM_SEED` | Random seed for reproducibility (default: 42) | ❌ |
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 47-47: Table column count
Expected: 3; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
[warning] 48-48: Table column count
Expected: 3; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` around lines 47 - 48, The table rows for BENCHMARK_RUNS and
RANDOM_SEED are missing the third "Required" column; update the README table
rows for `BENCHMARK_RUNS` and `RANDOM_SEED` to include the required-status
column and mark both as not required (❌) so each row has three columns matching
the header.
Source: Linters/SAST tools
🧹 Nitpick comments (1)
benchmarks/memanto_adapter.py (1)
45-149: ⚡ Quick win
Consider adding defensive `_client` validation.
All methods access `self._client` without checking whether `setup()` was called first. If used incorrectly, the error message (AttributeError: 'NoneType' object has no attribute 'add') will be cryptic.
🛡️ Proposed defensive check
Add a helper method and call it at the start of each operation:
+    def _ensure_setup(self) -> None:
+        """Raise RuntimeError if setup() was not called."""
+        if self._client is None:
+            raise RuntimeError(
+                f"{self.name} adapter: setup() must be called before operations"
+            )
+
     def store(self, content: str, metadata: dict | None = None) -> MemoryResult:
         """Store a memory entry via the Memanto API.
         ...
         """
+        self._ensure_setup()
         try:
             meta = metadata or {}
Apply the same pattern to `retrieve`, `update`, `delete`, `get_all`, and `cleanup`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@benchmarks/memanto_adapter.py` around lines 45 - 149, The methods store, retrieve, update, delete, get_all, and cleanup use self._client without verifying setup, causing cryptic AttributeError if setup() wasn't called; add a small defensive helper (e.g., _ensure_client() or validate_client()) that raises a clear RuntimeError with a helpful message when self._client is None, then call that helper at the start of each method (store, retrieve, update, delete, get_all, cleanup) to fail fast and provide a descriptive error instead of letting None attribute access appear.
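The fail-fast behavior this comment asks for can be seen in a self-contained sketch. `FakeClient` and `GuardedAdapter` are hypothetical stand-ins written for illustration; they are not the Memanto SDK or the adapter in this PR.

```python
class FakeClient:
    """Hypothetical stand-in for an SDK client; stores items in memory."""

    def __init__(self):
        self.items = []

    def add(self, content):
        self.items.append(content)
        return len(self.items) - 1  # index of the stored item


class GuardedAdapter:
    """Minimal adapter showing the setup guard; not the PR's adapter."""

    name = "Example"

    def __init__(self):
        self._client = None  # populated by setup()

    def setup(self, user_id):
        self._client = FakeClient()

    def _ensure_setup(self):
        """Raise a descriptive error if setup() has not been called."""
        if self._client is None:
            raise RuntimeError(
                f"{self.name} adapter: setup() must be called before operations"
            )

    def store(self, content):
        self._ensure_setup()
        return self._client.add(content)


adapter = GuardedAdapter()
try:
    adapter.store("hello")
except RuntimeError as exc:
    print(exc)  # Example adapter: setup() must be called before operations

adapter.setup("user-1")
print(adapter.store("hello"))  # 0
```

Without the guard, the first `store` call would surface as an `AttributeError` on `None`, which says nothing about the missing `setup()` call.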
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@benchmarks/memanto_adapter.py`:
- Around line 45-149: The methods store, retrieve, update, delete, get_all, and
cleanup use self._client without verifying setup, causing cryptic AttributeError
if setup() wasn't called; add a small defensive helper (e.g., _ensure_client()
or validate_client()) that raises a clear RuntimeError with a helpful message
when self._client is None, then call that helper at the start of each method
(store, retrieve, update, delete, get_all, cleanup) to fail fast and provide a
descriptive error instead of letting None attribute access appear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 3dbfbb69-0528-4d54-b099-387873c8f2ae
📒 Files selected for processing (3)
- benchmarks/mem0_adapter.py
- benchmarks/memanto_adapter.py
- tests/test_adapters.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/test_adapters.py
- benchmarks/mem0_adapter.py
Hi, thank you for the thorough review! I've addressed all the issues raised:
Fixes Applied
Additional Improvements
All tests pass locally. Please let me know if any further changes are needed!
Best regards
0b2db73 to dffa010
Resolve merge conflicts by rebuilding branch from main. All original files preserved, benchmark files added.
dffa010 to 9e9782e
Hi, just a quick follow-up. I've resolved the merge conflicts and the PR is now fully mergeable with the latest main branch. All CodeRabbit feedback has been addressed:
Is there anything else needed for this to be merged? Thanks!
Closes #639
Changes
How to run
Summary by CodeRabbit
New Features
Documentation
Tests
Chores