Add live Memanto vs Mem0 temporal memory benchmark by 2077196405-commits · Pull Request #730 · moorcheh-ai/memanto

2077196405-commits · 2026-06-12T21:39:38Z

Summary

Adds a fully local, reproducible temporal-memory benchmark under examples/benchmarks/temporal-memory-showdown/.
Runs the same 32-record, 10-session evolving-persona dataset through Memanto On-Prem and Mem0 2.0.5.
Uses Mem0's default agentic extraction (infer=True) as the primary competitor and reports infer=False only as a clearly labeled vector-only ablation.
Measures golden concept coverage, stale-state leakage, strict accuracy, source/retrieved/native-LLM tokens, ingest/query p50 and p95 latency, readiness time, and RSS delta.
Adds a GitHub Actions workflow that provisions Moorcheh On-Prem, Qdrant, and Ollama without cloud secrets.
Fixes Memanto On-Prem agent creation so a namespace conflict (MoorchehApiError(status_code=409)) is treated idempotently, with a regression test.
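
The p50 and p95 latency figures listed above are percentiles over per-operation timings. A minimal sketch of one way to compute them, assuming the nearest-rank method; the function name `percentile_nearest_rank` and the sample timings are illustrative and not taken from the committed `metrics.py`:

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of values for 0 < pct <= 100."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# Made-up timings in seconds; one slow outlier dominates the tail.
latencies_s = [0.004, 0.005, 0.003, 0.090, 0.004]
p50 = percentile_nearest_rank(latencies_s, 50)
p95 = percentile_nearest_rank(latencies_s, 95)
print(f"p50={p50:.3f}s p95={p95:.3f}s")
```

With only a handful of samples, p95 under nearest-rank is simply the slowest observation, which is worth keeping in mind when reading tail latencies from short runs.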

Verified benchmark result

A real GitHub-hosted run completed successfully: Actions run 27441595257.

Metric	Memanto	Mem0 agentic
Golden concept coverage	97.2%	69.4%
Ingest total	0.096 s	2912.082 s
Ingest p95	0.0049 s	93.1412 s
Query p95	0.0983 s	0.1032 s
Retrieved tokens	1,779	1,793
Native LLM tokens	0	134,690
Strict accuracy	0.0%	11.1%
Stale-state leakage	100.0%	88.9%

The paired bootstrap estimate for Memanto's concept-coverage advantage is +27.8 percentage points with a 95% CI of +9.3 to +48.1 points. Memanto ingestion was approximately 30,286x faster in this run.

The report does not hide the temporal failure mode: raw top-5 retrieval from every tested backend still surfaced superseded values. Mem0 agentic's lower stale-leak rate also coincided with missing more current facts. The benchmark therefore reports coverage, strict accuracy, and stale leakage separately rather than collapsing them into a single favorable score.
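
To illustrate why the three metrics can diverge, here is a hedged sketch of per-query scoring. It assumes each query lists required concepts (with aliases) and forbidden stale values, and that matching is a case-insensitive substring test; the names `Score` and `score_retrieval` and the example text are hypothetical, and the real scoring rules live in the committed `metrics.py`:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    coverage: float   # fraction of required concepts found in the retrieved text
    stale_leak: bool  # True if any superseded value was retrieved
    strict: bool      # full coverage and no stale leak


def score_retrieval(
    text: str, required: list[list[str]], forbidden: list[str]
) -> Score:
    """Score retrieved text; each required concept is a list of accepted aliases."""
    if not required:
        raise ValueError("at least one required concept is needed")
    haystack = text.casefold()
    found = sum(
        any(alias.casefold() in haystack for alias in aliases)
        for aliases in required
    )
    coverage = found / len(required)
    stale_leak = any(term.casefold() in haystack for term in forbidden)
    return Score(coverage, stale_leak, coverage == 1.0 and not stale_leak)


# Made-up retrieval: the current fact is present, but so is the superseded one.
retrieved = "Session 9: user moved to Lisbon. Session 2: user lives in Berlin."
score = score_retrieval(retrieved, required=[["lisbon"]], forbidden=["berlin"])
print(score)
```

In this sketch a result can have full coverage and still fail strict accuracy, which is the pattern the headline table shows for raw top-5 retrieval.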

The vector-only Mem0 ablation reached 98.6% coverage with 3.996 s total ingestion and zero LLM tokens; it is included for diagnostic context, not presented as the primary agentic comparison.

Reproduce and audit

Public benchmark report
Exact JSON result
Workflow: Benchmark memory showdown via workflow_dispatch
Dataset, scoring rules, bootstrap seed, model versions, prompts, engine toggles, and host metadata are committed with the runner.

Validation

pytest -q passes for all available non-live tests; 24 existing cloud E2E tests skip without a Moorcheh cloud API key.
Focused benchmark and On-Prem conflict regression tests: 10 passed.
ruff check and ruff format --check pass.
The committed JSON SHA-256 matches the successful Actions artifact.

Bounty submission

Refs #639

Public showcase: auditable benchmark report and result tables. GitHub reactions and technical discussion on this PR are also part of the issue's published social scoring formula.

Social showcase

GitHub PR #730 is the public technical showcase and reaction-scored discussion, consistent with the bounty's published GitHub PR reaction formula. The committed live report provides the shareable result tables and limitations.

Summary by CodeRabbit

New Features
- Added "Temporal Memory Showdown" benchmark suite for comparing memory system performance across coverage, accuracy, latency, and token efficiency metrics.
- Added GitHub Actions workflow for automated benchmark execution and result collection.
Bug Fixes
- Fixed agent service to gracefully handle namespace creation conflicts in on-prem environments.
Tests
- Added comprehensive benchmark test suite validating dataset integrity, scoring logic, and system configuration.

coderabbitai · 2026-06-12T21:39:51Z

Walkthrough

This PR introduces a comprehensive "Temporal Memory Showdown" benchmark suite comparing Memanto On-Prem versus Mem0 (with direct and agentic ablations) over a synthetic temporal dataset, plus a small fix enabling AgentService to handle on-prem namespace creation conflicts. The benchmark includes dataset definitions, backend adapters, metrics/scoring logic, a runner orchestrating execution and reporting, a CI workflow, and full test coverage.

Changes

Temporal Memory Showdown Benchmark

Layer / File(s)	Summary
Dataset model and validation `examples/benchmarks/temporal-memory-showdown/dataset.py`	`MemoryRecord` and `QueryCase` dataclasses define the experiment structure; `RECORDS` and `QUERIES` tuples populate synthetic temporal memory entries and query cases with concept constraints; `validate_dataset()` enforces unique IDs, positive sessions, and non-empty required concepts at import time.
Metrics, scoring, and statistical utilities `examples/benchmarks/temporal-memory-showdown/metrics.py`	`QueryScore` captures per-query coverage, exactness, stale-leak flag, and required/forbidden match counts; utilities implement token counting via tiktoken, percentile computation, text normalization, per-query scoring by alias matching, category-grouped summarization, and deterministic paired bootstrap confidence intervals for baseline-challenger deltas.
Memory backend protocol and adapters `examples/benchmarks/temporal-memory-showdown/backends.py`	`MemoryBackend` protocol defines ingest/search/usage/close contract; `MemantoBackend` bootstraps on-prem agent lifecycle and performs recall-based search; `Mem0Backend` wires Ollama for LLM+embeddings with persistent Qdrant storage; `MeteredOllamaClient` wraps underlying Ollama clients to count token usage; supporting functions generate Mem0 config, poll readiness, and create run identifiers.
Benchmark runner and reporting `examples/benchmarks/temporal-memory-showdown/run_benchmark.py`	Parses CLI arguments (backends, service URLs, model/top-k/repeat settings), constructs backend instances via factory, orchestrates record ingestion and query execution across repeats with latency tracking, captures environment metadata, builds baseline-vs-challenger bootstrap comparisons, renders Markdown reports with headline metrics and per-query audit tables, writes JSON and Markdown outputs.
Comprehensive test coverage and import setup `examples/benchmarks/temporal-memory-showdown/tests/conftest.py`, `examples/benchmarks/temporal-memory-showdown/tests/test_metrics.py`	Conftest adjusts sys.path for test module imports; tests validate dataset stability, scoring correctness (coverage, exactness, stale-leak), percentile nearest-rank behavior, category summarization, bootstrap determinism, Mem0 config schema matching, Ollama token metering, Memanto agent retry logic, and complete runner/report shape with rendered Markdown.
CI workflow, dependencies, and documentation `.github/workflows/benchmark-memory-showdown.yml`, `examples/benchmarks/temporal-memory-showdown/requirements.txt`, `examples/benchmarks/temporal-memory-showdown/README.md`, `examples/benchmarks/temporal-memory-showdown/.gitignore`, `examples/benchmarks/temporal-memory-showdown/results/*`	GitHub Actions workflow checks out repo, installs dependencies, configures Moorcheh on-prem with Ollama, waits for service readiness, runs benchmark with configurable repeats, collects diagnostics, uploads artifacts, and tears down services; requirements.txt pins benchmark dependencies; README documents purpose, methodology, and reproduction steps; .gitignore ignores generated artifacts; results directory holds sample benchmark report and placeholder.
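
The ingest/search/usage/close contract in the table above could be expressed as a structural protocol. The following is a sketch under stated assumptions: `search(query, top_k)`, `hit.text`, and `backend.name` appear in the reviewed code, while the `ingest` signature, the `usage` return shape, and the toy `InMemoryBackend` are invented here for illustration:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Hit:
    text: str


class MemoryBackend(Protocol):
    """Assumed shape of the adapter contract; signatures are illustrative."""

    name: str

    def ingest(self, text: str, session: int) -> None: ...
    def search(self, query: str, top_k: int) -> list[Hit]: ...
    def usage(self) -> dict[str, int]: ...
    def close(self) -> None: ...


class InMemoryBackend:
    """Toy backend that ranks records by word overlap with the query."""

    name = "in-memory"

    def __init__(self) -> None:
        self._records: list[str] = []

    def ingest(self, text: str, session: int) -> None:
        self._records.append(text)

    def search(self, query: str, top_k: int) -> list[Hit]:
        words = set(query.casefold().split())
        ranked = sorted(
            self._records,
            key=lambda record: len(words & set(record.casefold().split())),
            reverse=True,
        )
        return [Hit(record) for record in ranked[:top_k]]

    def usage(self) -> dict[str, int]:
        return {"llm_tokens": 0}  # no LLM is involved in this toy backend

    def close(self) -> None:
        self._records.clear()


backend: MemoryBackend = InMemoryBackend()
backend.ingest("user grows basil", session=1)
backend.ingest("user drives a truck", session=2)
hits = backend.search("basil garden", top_k=1)
print(hits[0].text)
```

Because `Protocol` is structural, a runner written against `MemoryBackend` can accept any adapter with matching methods without the adapters sharing a base class.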

Agent Service Namespace Conflict Handling

Layer / File(s)	Summary
409 conflict handling and test `memanto/app/services/agent_service.py`, `tests/test_unit.py`	`AgentService.create_agent` now detects exceptions with `status_code == 409` (on-prem namespace already exists) as non-fatal alongside SDK `ConflictError`, logs success, and continues; unit test simulates on-prem 409 scenario to verify idempotent agent creation succeeds.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

moorcheh-ai/memanto#639: This PR directly implements the reproducible Memanto vs. Mem0 benchmarking infrastructure (dataset, runners, metrics, backends, tests, and CI workflow) requested in that issue.

Possibly related PRs

moorcheh-ai/memanto#631: The change to AgentService.create_agent (treating on-prem 409 conflicts as non-fatal) supports that PR's LangGraph tools that call client.create_agent during retry logic.

Suggested reviewers

het0814
Neelpatel1604
Xenogents

Poem

🐰 A benchmark born from whisker-thin vision—
Memanto meets Mem0 in temporal collision,
With metrics that measure and backends that hum,
Bootstrap deltas tell stories of which one won.
Now workflows and datasets dance in the glow,
Testing the memory that mortals will know! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 8.20% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly and specifically describes the main change: adding a temporal memory benchmark comparing Memanto and Mem0 live systems.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (2)

memanto/app/services/agent_service.py (1)

88-90: 📐 Maintainability & Code Quality | ⚡ Quick win

Preserve exception chain when re-raising.

The current code creates a new Exception which discards the original exception type, traceback, and context. Use raise ... from e to preserve the exception chain for debugging.

🔗 Proposed fix to preserve exception chain

             else:
                 # Unexpected error - fail the agent creation
-                raise Exception(
-                    f"Failed to create namespace '{namespace}' in Moorcheh: {str(e)}"
-                )
+                raise Exception(
+                    f"Failed to create namespace '{namespace}' in Moorcheh: {str(e)}"
+                ) from e
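
A small self-contained demonstration of what the chained raise changes. The function names here are stand-ins rather than the real `AgentService` code, and `RuntimeError` is used only to keep the example short:

```python
def create_namespace(namespace: str) -> None:
    # Stand-in for the real SDK call; always fails for demonstration.
    raise ConnectionError("backend unreachable")


def create_agent(namespace: str) -> None:
    try:
        create_namespace(namespace)
    except Exception as e:
        # "from e" records the original exception as the explicit cause.
        raise RuntimeError(f"Failed to create namespace '{namespace}': {e}") from e


try:
    create_agent("demo")
except RuntimeError as err:
    cause = err.__cause__
    print(type(cause).__name__, "->", err)
```

Python already stores the original exception on `__context__` when a new one is raised inside an `except` block; `from e` additionally sets `__cause__`, so the traceback reports the first error as the direct cause rather than as an error that happened during handling.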

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@memanto/app/services/agent_service.py` around lines 88 - 90, Replace the
current re-raise that discards the original traceback—specifically the line
raising Exception(f"Failed to create namespace '{namespace}' in Moorcheh:
{str(e)}")—with a chained raise so the original exception is preserved (use
raise ... from e). Locate the raise in the function that creates the Moorcheh
namespace (the block where the variable namespace and exception variable e are
available) and update it to re-raise the new Exception using "from e" to keep
the original context and traceback.

examples/benchmarks/temporal-memory-showdown/tests/test_metrics.py (1)

57-63: 📐 Maintainability & Code Quality | ⚡ Quick win

Add a negative-path test for invalid bootstrap sample counts.

Please add a regression test that asserts paired_bootstrap_delta(..., samples=0) raises ValueError, so the input-contract fix remains enforced.

Suggested test addition

+import pytest
 ...
 def test_bootstrap_is_deterministic():
     baseline = [score_query(QUERIES[0], "basil") for _ in range(4)]
     challenger = [score_query(QUERIES[0], "dwarf radish") for _ in range(4)]
     result = paired_bootstrap_delta(baseline, challenger, samples=100, seed=123)
     assert result["observed_delta"] == 1.0
     assert result["ci95"] == [1.0, 1.0]
+
+
+def test_bootstrap_rejects_non_positive_samples():
+    baseline = [score_query(QUERIES[0], "basil")]
+    challenger = [score_query(QUERIES[0], "dwarf radish")]
+    with pytest.raises(ValueError, match="samples must be a positive integer"):
+        paired_bootstrap_delta(baseline, challenger, samples=0)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/temporal-memory-showdown/tests/test_metrics.py` around
lines 57 - 63, Add a negative-path unit test to ensure paired_bootstrap_delta
enforces its samples>0 contract: create a new test (e.g., next to
test_bootstrap_is_deterministic) that calls paired_bootstrap_delta(baseline,
challenger, samples=0) and asserts it raises ValueError (use pytest.raises).
Reference the paired_bootstrap_delta function and reuse simple
baseline/challenger lists like in test_bootstrap_is_deterministic to keep the
test focused on the samples parameter validation.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/benchmark-memory-showdown.yml:
- Around line 6-10: The workflow directly interpolates the workflow_dispatch
input inputs.repeats into a shell command which can lead to command injection;
add a validation step that sanitizes and constrains inputs.repeats before use:
create an early run step that reads the input into a shell variable (e.g.
repeats="${{ inputs.repeats }}"), test it against a strict numeric regex (e.g.
if ! [[ "$repeats" =~ ^[0-9]+$ ]]; then echo "invalid repeats" >&2; exit 1; fi),
export or write the sanitized numeric value to an environment file (GITHUB_ENV)
and then use that sanitized variable (quoted) in the later run invocation
instead of interpolating ${{ inputs.repeats }} directly; also ensure all uses
quote the variable to avoid word-splitting/expansion.
- Line 24: The workflow uses floating action tags (actions/checkout@v4,
actions/setup-python@v5, actions/upload-artifact@v4) which weakens supply-chain
guarantees; update each usage to a pinned immutable commit SHA instead of the
tag (replace actions/checkout@v4, actions/setup-python@v5, and
actions/upload-artifact@v4 with their corresponding commit SHA refs) and verify
the SHA values point to the intended release commits, keeping the action names
for readability in the workflow comments.

In `@examples/benchmarks/temporal-memory-showdown/backends.py`:
- Around line 291-297: The readiness loop currently lets any exception from
backend.search escape and break the retry logic; wrap the call to backend.search
inside the while loop in a try/except that catches transient exceptions,
optionally logs them, and continues retrying until the deadline instead of
propagating (refer to backend.search, the while time.perf_counter() < deadline
loop, hits, expected_lower and backend.name); preserve the existing success
check and sleep behavior and only raise the TimeoutError after the deadline
elapses.
- Around line 124-130: The retry loop in create_memanto_agent currently catches
all Exceptions, stores last_error and always re-raises it, which causes a
bootstrap to fail on HTTP 409 (already exists) even though that should be
considered a successful idempotent outcome; update the except block inside
create_memanto_agent to detect a 409 response (e.g., check
error.response.status_code, getattr(error, "status_code", None), or inspect
HTTPError.response) and treat it as success by breaking/returning without
setting last_error, otherwise keep the existing retry/backoff behavior and
re-raise the last_error after attempts are exhausted; reference
create_memanto_agent, last_error, attempts and delay_s to locate and update the
logic.

In `@examples/benchmarks/temporal-memory-showdown/metrics.py`:
- Around line 114-127: The code does not validate that `samples` is positive
before generating resamples, which can leave `deltas` empty and cause an
IndexError when computing `lower`/`upper`; add an early check (e.g., if samples
<= 0: raise ValueError("samples must be a positive integer for paired
bootstrap")) before creating `rng` and the resampling loop (referencing the
`samples`, `deltas`, `ordered`, `baseline`, and `challenger` variables) so the
function fails fast on non-positive sample counts.

In `@examples/benchmarks/temporal-memory-showdown/run_benchmark.py`:
- Around line 278-283: Guard the division by checking denominators before
computing ingest_speedup and query_reduction: verify
memanto_metrics["ingest_total_s"] is not zero before computing ingest_speedup
and mem0_metrics["query_p95_s"] is not zero before computing query_reduction; if
a denominator is zero (or nearly zero) return a safe fallback (e.g., set
ingest_speedup/query_reduction to None or a sentinel like float("inf")/0.0) to
avoid ZeroDivisionError, and use the existing variable names ingest_speedup and
query_reduction so the rest of the code can handle the fallback consistently.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f42d686b-5b43-41c4-8970-53babffaaeb1

📥 Commits

Reviewing files that changed from the base of the PR and between 7665bfb and dd8609b.

📒 Files selected for processing (15)

.github/workflows/benchmark-memory-showdown.yml
examples/benchmarks/temporal-memory-showdown/.gitignore
examples/benchmarks/temporal-memory-showdown/README.md
examples/benchmarks/temporal-memory-showdown/backends.py
examples/benchmarks/temporal-memory-showdown/dataset.py
examples/benchmarks/temporal-memory-showdown/metrics.py
examples/benchmarks/temporal-memory-showdown/requirements.txt
examples/benchmarks/temporal-memory-showdown/results/.gitkeep
examples/benchmarks/temporal-memory-showdown/results/latest.json
examples/benchmarks/temporal-memory-showdown/results/latest.md
examples/benchmarks/temporal-memory-showdown/run_benchmark.py
examples/benchmarks/temporal-memory-showdown/tests/conftest.py
examples/benchmarks/temporal-memory-showdown/tests/test_metrics.py
memanto/app/services/agent_service.py
tests/test_unit.py

coderabbitai · 2026-06-12T21:47:41Z

+      repeats:
+        description: Measured query repetitions after warm-up
+        required: false
+        default: "5"
+


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Sanitize workflow_dispatch input before passing it to shell command arguments.

Line 86 interpolates ${{ inputs.repeats }} directly into a shell command. A crafted input using command substitution can execute before Python validates int.

Suggested patch

   workflow_dispatch:
     inputs:
       repeats:
         description: Measured query repetitions after warm-up
         required: false
         default: "5"
+        type: number
@@
       - name: Run live benchmark
         env:
           HOME: ${{ env.BENCH_HOME }}
+          REPEATS: ${{ inputs.repeats }}
         run: |
+          [[ "$REPEATS" =~ ^[0-9]+$ ]] || { echo "Invalid repeats: $REPEATS"; exit 1; }
           python examples/benchmarks/temporal-memory-showdown/run_benchmark.py \
             --backends memanto,mem0-direct,mem0-agentic \
-            --repeats "${{ inputs.repeats }}"
+            --repeats "$REPEATS"

Also applies to: 86-86
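
As a complement to the workflow fix, the runner itself can reject bad values early. This is a sketch only, assuming the runner uses `argparse`; the `positive_int` helper is hypothetical. It does not replace the shell-side fix, because an injected command substitution would run in the shell before Python ever starts:

```python
import argparse


def positive_int(raw: str) -> int:
    """argparse type callback that accepts only integers greater than zero."""
    try:
        value = int(raw)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected an integer, got {raw!r}") from None
    if value <= 0:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {value}")
    return value


parser = argparse.ArgumentParser()
parser.add_argument("--repeats", type=positive_int, default=5)

args = parser.parse_args(["--repeats", "3"])
print(args.repeats)
```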


Source: Linters/SAST tools

coderabbitai · 2026-06-12T21:47:41Z

+      SETUPTOOLS_SCM_PRETEND_VERSION: "0.0.0"
+
+    steps:
+      - uses: actions/checkout@v4


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Detect non-SHA-pinned actions in workflows.
rg -n 'uses:\s*[^@]+@v[0-9]+' .github/workflows

Repository: moorcheh-ai/memanto

Length of output: 1140

Pin GitHub Actions to immutable commit SHAs.

In .github/workflows/benchmark-memory-showdown.yml, the workflow uses floating tags (weak supply-chain guarantees) instead of commit SHAs:

Line 24: actions/checkout@v4

Line 26: actions/setup-python@v5

Line 103: actions/upload-artifact@v4
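
The same check can be scripted without ripgrep. A rough sketch, assuming a reference counts as pinned only when it is a full 40-character hexadecimal commit SHA; the helper name and the sample workflow text are illustrative, and the SHA in the sample is a placeholder rather than a real release commit:

```python
import re

# Matches "uses: owner/repo@ref", optionally preceded by a list dash.
USES_RE = re.compile(r"^\s*-?\s*uses:\s*(?P<action>[^@\s]+)@(?P<ref>\S+)")
# A full commit SHA is 40 lowercase hexadecimal characters.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[tuple[int, str]]:
    """Return (line number, reference) for each action not pinned to a SHA."""
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if match and not SHA_RE.match(match["ref"]):
            findings.append((lineno, f"{match['action']}@{match['ref']}"))
    return findings


sample = """\
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@0123456789abcdef0123456789abcdef01234567  # placeholder SHA
"""

for lineno, ref in unpinned_actions(sample):
    print(f"line {lineno}: {ref} is not pinned to a commit SHA")
```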

🧰 Tools

🪛 zizmor (1.25.2)

[warning] 24-24: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

[error] 24-24: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)


Source: Linters/SAST tools

coderabbitai · 2026-06-12T21:47:41Z

+        except Exception as error:
+            last_error = error
+            if attempt + 1 < attempts:
+                time.sleep(delay_s)
+    assert last_error is not None
+    raise last_error
+


🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify current conflict-handling paths for agent creation.
rg -n -C3 'def create_memanto_agent|create_agent\(|status_code\s*==\s*409|MoorchehApiError'

Repository: moorcheh-ai/memanto

Length of output: 18565

Handle HTTP 409 “already exists” as a successful idempotent bootstrap in create_memanto_agent

examples/benchmarks/temporal-memory-showdown/backends.py retries except Exception and always re-raises the last error; if the server already created the agent/namespace but the call still raises a 409, bootstrap can fail instead of proceeding. The app layer already treats status_code == 409 as OK in memanto/app/services/agent_service.py.

Suggested patch

         except Exception as error:
+            if getattr(error, "status_code", None) == 409:
+                return
             last_error = error
             if attempt + 1 < attempts:
                 time.sleep(delay_s)
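
For context, a runnable sketch of the whole retry loop with the 409 branch in place. `ApiError`, `create_with_retry`, and the fake responses are hypothetical stand-ins; the real `create_memanto_agent` talks to the Memanto client and its signature is not reproduced here:

```python
import time
from typing import Callable, Optional


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def create_with_retry(
    create: Callable[[], None], attempts: int = 3, delay_s: float = 0.0
) -> str:
    """Return 'created' or 'already-exists'; re-raise the last error otherwise."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            create()
            return "created"
        except Exception as error:
            # A conflict means the resource already exists: treat as success.
            if getattr(error, "status_code", None) == 409:
                return "already-exists"
            last_error = error
            if attempt + 1 < attempts:
                time.sleep(delay_s)
    assert last_error is not None
    raise last_error


# First call fails transiently, second reports that the agent already exists.
responses = iter([ApiError(503), ApiError(409)])


def fake_create() -> None:
    raise next(responses)


outcome = create_with_retry(fake_create, attempts=3)
print(outcome)
```

This sequence models the case the comment describes: an earlier attempt may have created the namespace server-side even though the client saw an error, so the follow-up 409 is the expected idempotent outcome.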

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-        except Exception as error:
-            last_error = error
-            if attempt + 1 < attempts:
-                time.sleep(delay_s)
-    assert last_error is not None
-    raise last_error
+        except Exception as error:
+            if getattr(error, "status_code", None) == 409:
+                return
+            last_error = error
+            if attempt + 1 < attempts:
+                time.sleep(delay_s)
+    assert last_error is not None
+    raise last_error


coderabbitai · 2026-06-12T21:47:41Z

+    while time.perf_counter() < deadline:
+        hits = backend.search(query, top_k)
+        if expected_lower in "\n".join(hit.text for hit in hits).casefold():
+            return time.perf_counter() - started
+        time.sleep(1.0)
+    raise TimeoutError(
+        f"{backend.name} did not surface {expected!r} within {timeout_s:.0f}s"


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don’t let transient search failures bypass the readiness timeout loop.

At Line 292, any temporary backend/search exception exits immediately and skips the intended timeout retry behavior, which can make CI flaky.

Suggested patch

     expected_lower = expected.casefold()
+    last_error: Exception | None = None
     while time.perf_counter() < deadline:
-        hits = backend.search(query, top_k)
+        try:
+            hits = backend.search(query, top_k)
+        except Exception as error:
+            last_error = error
+            time.sleep(1.0)
+            continue
         if expected_lower in "\n".join(hit.text for hit in hits).casefold():
             return time.perf_counter() - started
         time.sleep(1.0)
-    raise TimeoutError(
-        f"{backend.name} did not surface {expected!r} within {timeout_s:.0f}s"
-    )
+    message = f"{backend.name} did not surface {expected!r} within {timeout_s:.0f}s"
+    if last_error is not None:
+        message = f"{message}; last search error: {last_error}"
+    raise TimeoutError(message)
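
The same pattern, reduced to a generic polling helper that can be run in isolation. `wait_until_ready` and `flaky_probe` are illustrative names, and the probe stands in for the `backend.search` call plus the expected-text check:

```python
import time
from typing import Callable, Optional


def wait_until_ready(
    probe: Callable[[], bool], timeout_s: float, interval_s: float = 0.01
) -> float:
    """Poll probe until it returns True; return elapsed seconds or time out."""
    started = time.perf_counter()
    deadline = started + timeout_s
    last_error: Optional[Exception] = None
    while time.perf_counter() < deadline:
        try:
            if probe():
                return time.perf_counter() - started
        except Exception as error:
            # Transient failures are remembered but do not end the wait.
            last_error = error
        time.sleep(interval_s)
    message = f"not ready within {timeout_s:.1f}s"
    if last_error is not None:
        message = f"{message}; last error: {last_error}"
    raise TimeoutError(message)


attempts: list[int] = []


def flaky_probe() -> bool:
    """Fail twice with a connection error, then report ready."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("service still starting")
    return True


elapsed_s = wait_until_ready(flaky_probe, timeout_s=2.0)
print(f"ready after {len(attempts)} probes in {elapsed_s:.3f}s")
```

Including the last error in the timeout message keeps the eventual failure diagnosable, at the cost of swallowing errors that are not actually transient until the deadline passes.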


coderabbitai · 2026-06-12T21:47:42Z

+    if len(baseline) != len(challenger) or not baseline:
+        raise ValueError("Paired bootstrap requires equal non-empty score lists")
+    rng = random.Random(seed)
+    deltas: list[float] = []
+    count = len(baseline)
+    for _ in range(samples):
+        indices = [rng.randrange(count) for _ in range(count)]
+        baseline_mean = mean(baseline[i].coverage for i in indices)
+        challenger_mean = mean(challenger[i].coverage for i in indices)
+        deltas.append(challenger_mean - baseline_mean)
+
+    ordered = sorted(deltas)
+    lower = ordered[max(0, math.floor(0.025 * samples))]
+    upper = ordered[min(samples - 1, math.ceil(0.975 * samples) - 1)]


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate bootstrap sample count before generating resamples.

At Line 119 onward, samples <= 0 yields an empty deltas list, and Lines 126–127 then index into an empty ordered list (IndexError). Please fail fast with a ValueError for non-positive sample counts.

Proposed fix

 def paired_bootstrap_delta(
     baseline: Sequence[QueryScore],
     challenger: Sequence[QueryScore],
     *,
     samples: int = 5000,
     seed: int = 639,
 ) -> dict:
     if len(baseline) != len(challenger) or not baseline:
         raise ValueError("Paired bootstrap requires equal non-empty score lists")
+    if samples <= 0:
+        raise ValueError("samples must be a positive integer")
     rng = random.Random(seed)
     deltas: list[float] = []
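
A standalone version of the function with the guard applied, for experimenting outside the repository. It is a simplification under stated assumptions: it takes plain coverage floats instead of `QueryScore` objects, and the returned keys `observed_delta` and `ci95` follow the names used in the existing test rather than the full return value of the real function:

```python
import math
import random
from statistics import mean
from typing import Sequence


def paired_bootstrap_delta(
    baseline: Sequence[float],
    challenger: Sequence[float],
    *,
    samples: int = 5000,
    seed: int = 639,
) -> dict:
    """Percentile bootstrap CI for mean(challenger) - mean(baseline), paired by index."""
    if len(baseline) != len(challenger) or not baseline:
        raise ValueError("Paired bootstrap requires equal non-empty score lists")
    if samples <= 0:
        raise ValueError("samples must be a positive integer")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    count = len(baseline)
    deltas: list[float] = []
    for _ in range(samples):
        # Resample query indices with replacement, keeping pairs aligned.
        indices = [rng.randrange(count) for _ in range(count)]
        baseline_mean = mean(baseline[i] for i in indices)
        challenger_mean = mean(challenger[i] for i in indices)
        deltas.append(challenger_mean - baseline_mean)
    ordered = sorted(deltas)
    lower = ordered[max(0, math.floor(0.025 * samples))]
    upper = ordered[min(samples - 1, math.ceil(0.975 * samples) - 1)]
    return {
        "observed_delta": mean(challenger) - mean(baseline),
        "ci95": [lower, upper],
    }


result = paired_bootstrap_delta([0.0] * 4, [1.0] * 4, samples=100, seed=123)
print(result)
```

The seeded `random.Random` is what makes the interval deterministic across runs; the `secrets` module flagged by the linter note cannot be seeded, so it would not fit a reproducible benchmark.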

🧰 Tools

🪛 ast-grep (0.43.0)

[info] 115-115: use secrets package over random package
Context: random.Random(seed)
Note: [CWE-330].

(avoid-random-python)


coderabbitai · 2026-06-12T21:47:42Z

+        ingest_speedup = (
+            mem0_metrics["ingest_total_s"] / memanto_metrics["ingest_total_s"]
+        )
+        query_reduction = 1 - (
+            memanto_metrics["query_p95_s"] / mem0_metrics["query_p95_s"]
+        )


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard headline ratio calculations against zero denominators.

Line 279 and Line 282 divide by metric fields that are pre-rounded; if those round to 0.0, Markdown rendering crashes with ZeroDivisionError.

Suggested patch

-        ingest_speedup = (
-            mem0_metrics["ingest_total_s"] / memanto_metrics["ingest_total_s"]
-        )
-        query_reduction = 1 - (
-            memanto_metrics["query_p95_s"] / mem0_metrics["query_p95_s"]
-        )
+        memanto_ingest = memanto_metrics["ingest_total_s"]
+        mem0_query_p95 = mem0_metrics["query_p95_s"]
+        ingest_speedup = (
+            mem0_metrics["ingest_total_s"] / memanto_ingest if memanto_ingest > 0 else None
+        )
+        query_reduction = (
+            1 - (memanto_metrics["query_p95_s"] / mem0_query_p95)
+            if mem0_query_p95 > 0
+            else None
+        )
@@
-            f"- Full ingestion was **{ingest_speedup:,.1f}x faster** "
+            f"- Full ingestion was **{ingest_speedup:,.1f}x faster** "
             f"({memanto_metrics['ingest_total_s']:.3f}s vs "
-            f"{mem0_metrics['ingest_total_s']:.3f}s).",
-            f"- Query p95 was **{query_reduction:.1%} lower** "
+            f"{mem0_metrics['ingest_total_s']:.3f}s)."
+            if ingest_speedup is not None
+            else "- Full ingestion speedup is not available (zero denominator).",
+            f"- Query p95 was **{query_reduction:.1%} lower** "
             f"({memanto_metrics['query_p95_s']:.4f}s vs "
-            f"{mem0_metrics['query_p95_s']:.4f}s).",
+            f"{mem0_metrics['query_p95_s']:.4f}s)."
+            if query_reduction is not None
+            else "- Query p95 reduction is not available (zero denominator).",
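
One way to keep the guard and the fallback text together is a pair of small helpers. This is a sketch only; `safe_ratio` and `format_speedup` are hypothetical names, and the wording of the rendered lines is shortened from the report:

```python
from typing import Optional


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is not positive."""
    return numerator / denominator if denominator > 0 else None


def format_speedup(ratio: Optional[float]) -> str:
    """Render the headline bullet, falling back when no ratio is available."""
    if ratio is None:
        return "- Full ingestion speedup is not available (zero denominator)."
    return f"- Full ingestion was **{ratio:,.1f}x faster**."


# Made-up timings: a normal case and one where rounding produced 0.0.
normal_line = format_speedup(safe_ratio(10.0, 2.0))
fallback_line = format_speedup(safe_ratio(10.0, 0.0))
print(normal_line)
print(fallback_line)
```

Returning `None` rather than `float("inf")` forces the rendering code to handle the missing value explicitly instead of printing an infinite speedup.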

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/temporal-memory-showdown/run_benchmark.py` around lines 278 - 283, Guard the division by checking denominators before computing ingest_speedup and query_reduction: verify memanto_metrics["ingest_total_s"] is not zero before computing ingest_speedup and mem0_metrics["query_p95_s"] is not zero before computing query_reduction; if a denominator is zero (or nearly zero) return a safe fallback (e.g., set ingest_speedup/query_reduction to None or a sentinel like float("inf")/0.0) to avoid ZeroDivisionError, and use the existing variable names ingest_speedup and query_reduction so the rest of the code can handle the fallback consistently.
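One way to apply the guard described above is to route both ratios through a small helper and pick the fallback text in one place. The sketch below is illustrative only: `safe_ratio` and `headline_lines` are hypothetical names, not functions from `run_benchmark.py`, and the fallback wording is copied from the suggested patch.

```python
def safe_ratio(numerator, denominator):
    """Hypothetical helper: return the ratio, or None for a non-positive denominator."""
    return numerator / denominator if denominator > 0 else None


def headline_lines(memanto_metrics, mem0_metrics):
    """Build the two headline bullets without risking ZeroDivisionError."""
    ingest_speedup = safe_ratio(
        mem0_metrics["ingest_total_s"], memanto_metrics["ingest_total_s"]
    )
    p95_ratio = safe_ratio(
        memanto_metrics["query_p95_s"], mem0_metrics["query_p95_s"]
    )
    query_reduction = None if p95_ratio is None else 1 - p95_ratio

    lines = []
    if ingest_speedup is not None:
        lines.append(
            f"- Full ingestion was **{ingest_speedup:,.1f}x faster** "
            f"({memanto_metrics['ingest_total_s']:.3f}s vs "
            f"{mem0_metrics['ingest_total_s']:.3f}s)."
        )
    else:
        lines.append("- Full ingestion speedup is not available (zero denominator).")

    if query_reduction is not None:
        lines.append(
            f"- Query p95 was **{query_reduction:.1%} lower** "
            f"({memanto_metrics['query_p95_s']:.4f}s vs "
            f"{mem0_metrics['query_p95_s']:.4f}s)."
        )
    else:
        lines.append("- Query p95 reduction is not available (zero denominator).")
    return lines


if __name__ == "__main__":
    # Pre-rounded values that collapse to 0.0 no longer crash rendering.
    zeros = {"ingest_total_s": 0.0, "query_p95_s": 0.0}
    for line in headline_lines(zeros, zeros):
        print(line)
```

Returning `None` rather than `float("inf")` keeps the format specifiers (`:,.1f`, `:.1%`) from rendering a misleading number, at the cost of an explicit branch per headline.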

2077196405-commits added 6 commits June 13, 2026 04:06

feat(benchmarks): add live temporal memory showdown

812782a

fix(ci): use static benchmark home

f1d3549

fix(ci): wait for namespace readiness

fa883d6

fix(benchmarks): retry on-prem agent bootstrap

7ba3493

fix(on-prem): accept namespace conflict responses

66219b4

docs(benchmarks): publish live showdown results

dd8609b

2077196405-commits mentioned this pull request Jun 12, 2026

[BOUNTY $100] 🐜 The Great Agentic Memory Showdown: Memanto Benchmarking & Evaluation Challenge #639

Open

coderabbitai Bot reviewed Jun 12, 2026


jdjioe5-cpu mentioned this pull request Jun 13, 2026

Add Claude Code skills context capsules example #619

Open

2077196405-commits wants to merge 6 commits into moorcheh-ai:main from 2077196405-commits:codex/memanto-benchmark-639
