Add live Memanto vs Mem0 temporal memory benchmark #730

Open
2077196405-commits wants to merge 6 commits into moorcheh-ai:main from 2077196405-commits:codex/memanto-benchmark-639

2077196405-commits commented Jun 12, 2026

Summary

  • Adds a fully local, reproducible temporal-memory benchmark under examples/benchmarks/temporal-memory-showdown/.
  • Runs the same 32-record, 10-session evolving-persona dataset through Memanto On-Prem and Mem0 2.0.5.
  • Uses Mem0's default agentic extraction (infer=True) as the primary competitor and reports infer=False only as a clearly labeled vector-only ablation.
  • Measures golden concept coverage, stale-state leakage, strict accuracy, source/retrieved/native-LLM tokens, ingest/query p50 and p95 latency, readiness time, and RSS delta.
  • Adds a GitHub Actions workflow that provisions Moorcheh On-Prem, Qdrant, and Ollama without cloud secrets.
  • Fixes Memanto On-Prem agent creation so a namespace conflict (MoorchehApiError(status_code=409)) is treated idempotently, with a regression test.

Verified benchmark result

A real GitHub-hosted run completed successfully: Actions run 27441595257.

| Metric | Memanto | Mem0 agentic |
| --- | --- | --- |
| Golden concept coverage | 97.2% | 69.4% |
| Ingest total | 0.096 s | 2912.082 s |
| Ingest p95 | 0.0049 s | 93.1412 s |
| Query p95 | 0.0983 s | 0.1032 s |
| Retrieved tokens | 1,779 | 1,793 |
| Native LLM tokens | 0 | 134,690 |
| Strict accuracy | 0.0% | 11.1% |
| Stale-state leakage | 100.0% | 88.9% |

The paired bootstrap estimate for Memanto's concept-coverage advantage is +27.8 percentage points with a 95% CI of +9.3 to +48.1 points. Memanto ingestion was approximately 30,286x faster in this run.
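For readers unfamiliar with the method, a paired bootstrap interval of this kind can be sketched as below. This is a minimal illustration that assumes per-query coverage values are plain floats; the name `paired_bootstrap_ci` and its return shape are hypothetical and are not the runner's `paired_bootstrap_delta` API.

```python
import math
import random
from statistics import mean


def paired_bootstrap_ci(baseline, challenger, samples=5000, seed=639):
    """Return (observed_delta, lower, upper) for challenger minus baseline."""
    if len(baseline) != len(challenger) or not baseline:
        raise ValueError("paired bootstrap requires equal non-empty score lists")
    if samples <= 0:
        raise ValueError("samples must be a positive integer")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    count = len(baseline)
    deltas = []
    for _ in range(samples):
        # Resample query indices with replacement. Using the same indices for
        # both systems is what makes the bootstrap "paired".
        indices = [rng.randrange(count) for _ in range(count)]
        deltas.append(
            mean(challenger[i] for i in indices) - mean(baseline[i] for i in indices)
        )
    ordered = sorted(deltas)
    lower = ordered[max(0, math.floor(0.025 * samples))]
    upper = ordered[min(samples - 1, math.ceil(0.975 * samples) - 1)]
    return mean(challenger) - mean(baseline), lower, upper
```

Because each resample draws whole queries, the interval reflects query-to-query variation rather than run-to-run noise.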

The report does not hide the temporal failure mode: raw top-5 retrieval from every tested backend still surfaced superseded values. Mem0 agentic's lower stale-leak rate also coincided with missing more current facts. The benchmark therefore reports coverage, strict accuracy, and stale leakage separately rather than collapsing them into a single favorable score.
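A small sketch may help show how a backend can score high on coverage yet zero on strict accuracy. It assumes coverage is the fraction of required concepts found in the retrieved text, stale leakage means any superseded value appears, and strict means full coverage with no leak. The `Score` type, the `score` function, and the sample strings are hypothetical; the benchmark's own `score_query` uses alias matching that is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    coverage: float   # fraction of required concepts found in retrieved text
    stale_leak: bool  # True if any superseded value was retrieved
    strict: bool      # full coverage and no stale leak


def score(retrieved: str, required: list[str], forbidden: list[str]) -> Score:
    text = retrieved.casefold()
    hits = sum(1 for concept in required if concept.casefold() in text)
    coverage = hits / len(required) if required else 0.0
    stale_leak = any(value.casefold() in text for value in forbidden)
    return Score(coverage, stale_leak, coverage == 1.0 and not stale_leak)


# Hypothetical top-k text that contains both the current and the old value.
example = score(
    "Current city: Lisbon. Older note: lived in Berlin.",
    required=["Lisbon"],
    forbidden=["Berlin"],
)
```

Here `example` has full coverage but also a stale leak, so it is not a strict hit, which is the pattern the table above shows for raw top-5 retrieval.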

The vector-only Mem0 ablation reached 98.6% coverage with 3.996 s total ingestion and zero LLM tokens; it is included for diagnostic context, not presented as the primary agentic comparison.

Reproduce and audit

  • Public benchmark report
  • Exact JSON result
  • Workflow: Benchmark memory showdown via workflow_dispatch
  • Dataset, scoring rules, bootstrap seed, model versions, prompts, engine toggles, and host metadata are committed with the runner.

Validation

  • pytest -q passes for all available non-live tests; 24 existing cloud E2E tests skip without a Moorcheh cloud API key.
  • Focused benchmark and On-Prem conflict regression tests: 10 passed.
  • ruff check and ruff format --check pass.
  • The committed JSON SHA-256 matches the successful Actions artifact.
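The SHA-256 comparison can be repeated locally with a few lines of standard-library Python. This is a sketch only; `sha256_of` is a hypothetical helper, and where the downloaded Actions artifact is saved is an assumption left to the reader.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of` on the committed `results/latest.json` and on the JSON file from the downloaded artifact should yield the same hex string if the two are byte-identical.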

Bounty submission

Refs #639

Public showcase: auditable benchmark report and result tables. GitHub reactions and technical discussion on this PR are also part of the issue's published social scoring formula.

Social showcase

GitHub PR #730 is the public technical showcase and reaction-scored discussion, consistent with the bounty's published GitHub PR reaction formula. The committed live report provides the shareable result tables and limitations.

Summary by CodeRabbit

  • New Features

    • Added "Temporal Memory Showdown" benchmark suite for comparing memory system performance across coverage, accuracy, latency, and token efficiency metrics.
    • Added GitHub Actions workflow for automated benchmark execution and result collection.
  • Bug Fixes

    • Fixed agent service to gracefully handle namespace creation conflicts in on-prem environments.
  • Tests

    • Added comprehensive benchmark test suite validating dataset integrity, scoring logic, and system configuration.

coderabbitai Bot commented Jun 12, 2026


📝 Walkthrough

This PR introduces a comprehensive "Temporal Memory Showdown" benchmark suite comparing Memanto On-Prem versus Mem0 (with direct and agentic ablations) over a synthetic temporal dataset, plus a small fix enabling AgentService to handle on-prem namespace creation conflicts. The benchmark includes dataset definitions, backend adapters, metrics/scoring logic, a runner orchestrating execution and reporting, a CI workflow, and full test coverage.

Changes

Temporal Memory Showdown Benchmark

| Layer / File(s) | Summary |
| --- | --- |
| Dataset model and validation: examples/benchmarks/temporal-memory-showdown/dataset.py | MemoryRecord and QueryCase dataclasses define the experiment structure; RECORDS and QUERIES tuples populate synthetic temporal memory entries and query cases with concept constraints; validate_dataset() enforces unique IDs, positive sessions, and non-empty required concepts at import time. |
| Metrics, scoring, and statistical utilities: examples/benchmarks/temporal-memory-showdown/metrics.py | QueryScore captures per-query coverage, exactness, stale-leak flag, and required/forbidden match counts; utilities implement token counting via tiktoken, percentile computation, text normalization, per-query scoring by alias matching, category-grouped summarization, and deterministic paired bootstrap confidence intervals for baseline-challenger deltas. |
| Memory backend protocol and adapters: examples/benchmarks/temporal-memory-showdown/backends.py | MemoryBackend protocol defines ingest/search/usage/close contract; MemantoBackend bootstraps on-prem agent lifecycle and performs recall-based search; Mem0Backend wires Ollama for LLM+embeddings with persistent Qdrant storage; MeteredOllamaClient wraps underlying Ollama clients to count token usage; supporting functions generate Mem0 config, poll readiness, and create run identifiers. |
| Benchmark runner and reporting: examples/benchmarks/temporal-memory-showdown/run_benchmark.py | Parses CLI arguments (backends, service URLs, model/top-k/repeat settings), constructs backend instances via factory, orchestrates record ingestion and query execution across repeats with latency tracking, captures environment metadata, builds baseline-vs-challenger bootstrap comparisons, renders Markdown reports with headline metrics and per-query audit tables, writes JSON and Markdown outputs. |
| Comprehensive test coverage and import setup: examples/benchmarks/temporal-memory-showdown/tests/conftest.py, examples/benchmarks/temporal-memory-showdown/tests/test_metrics.py | Conftest adjusts sys.path for test module imports; tests validate dataset stability, scoring correctness (coverage, exactness, stale-leak), percentile nearest-rank behavior, category summarization, bootstrap determinism, Mem0 config schema matching, Ollama token metering, Memanto agent retry logic, and complete runner/report shape with rendered Markdown. |
| CI workflow, dependencies, and documentation: .github/workflows/benchmark-memory-showdown.yml, examples/benchmarks/temporal-memory-showdown/requirements.txt, examples/benchmarks/temporal-memory-showdown/README.md, examples/benchmarks/temporal-memory-showdown/.gitignore, examples/benchmarks/temporal-memory-showdown/results/* | GitHub Actions workflow checks out repo, installs dependencies, configures Moorcheh on-prem with Ollama, waits for service readiness, runs benchmark with configurable repeats, collects diagnostics, uploads artifacts, and tears down services; requirements.txt pins benchmark dependencies; README documents purpose, methodology, and reproduction steps; .gitignore ignores generated artifacts; results directory holds sample benchmark report and placeholder. |
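The test summary above mentions nearest-rank percentile behavior for the p50 and p95 latency figures. A minimal sketch of the nearest-rank definition follows; `percentile_nearest_rank` is a hypothetical name and the benchmark's own helper may differ in signature and edge-case handling.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the smallest observed value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # 1-based rank; multiplying before dividing avoids most float rounding drift.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

Nearest-rank always returns a latency that was actually measured, with no interpolation, so with few samples the p95 is often simply the slowest observation.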

Agent Service Namespace Conflict Handling

| Layer / File(s) | Summary |
| --- | --- |
| 409 conflict handling and test: memanto/app/services/agent_service.py, tests/test_unit.py | AgentService.create_agent now detects exceptions with status_code == 409 (on-prem namespace already exists) as non-fatal alongside SDK ConflictError, logs success, and continues; unit test simulates on-prem 409 scenario to verify idempotent agent creation succeeds. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

  • moorcheh-ai/memanto#639: This PR directly implements the reproducible Memanto vs. Mem0 benchmarking infrastructure (dataset, runners, metrics, backends, tests, and CI workflow) requested in that issue.

Possibly related PRs

  • moorcheh-ai/memanto#631: The change to AgentService.create_agent (treating on-prem 409 conflicts as non-fatal) supports that PR's LangGraph tools that call client.create_agent during retry logic.

Suggested reviewers

  • het0814
  • Neelpatel1604
  • Xenogents

Poem

🐰 A benchmark born from whisker-thin vision—
Memanto meets Mem0 in temporal collision,
With metrics that measure and backends that hum,
Bootstrap deltas tell stories of which one won.
Now workflows and datasets dance in the glow,
Testing the memory that mortals will know! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.20% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and specifically describes the main change: adding a temporal memory benchmark comparing Memanto and Mem0 live systems. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
memanto/app/services/agent_service.py (1)

88-90: 📐 Maintainability & Code Quality | ⚡ Quick win

Preserve exception chain when re-raising.

The current code creates a new Exception which discards the original exception type, traceback, and context. Use raise ... from e to preserve the exception chain for debugging.

🔗 Proposed fix to preserve exception chain
             else:
                 # Unexpected error - fail the agent creation
-                raise Exception(
-                    f"Failed to create namespace '{namespace}' in Moorcheh: {str(e)}"
-                )
+                raise Exception(
+                    f"Failed to create namespace '{namespace}' in Moorcheh: {str(e)}"
+                ) from e
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@memanto/app/services/agent_service.py` around lines 88 - 90, Replace the
current re-raise that discards the original traceback—specifically the line
raising Exception(f"Failed to create namespace '{namespace}' in Moorcheh:
{str(e)}")—with a chained raise so the original exception is preserved (use
raise ... from e). Locate the raise in the function that creates the Moorcheh
namespace (the block where the variable namespace and exception variable e are
available) and update it to re-raise the new Exception using "from e" to keep
the original context and traceback.
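To illustrate what the chained raise preserves, here is a self-contained sketch. `NamespaceError` and the simulated `ConnectionError` are stand-ins invented for this example; they are not the types used in `agent_service.py`.

```python
class NamespaceError(Exception):
    """Hypothetical wrapper error used only for this illustration."""


def create_namespace(namespace: str) -> None:
    try:
        # Stand-in for the underlying SDK call failing.
        raise ConnectionError("backend unreachable")
    except ConnectionError as e:
        # "from e" stores the original exception on __cause__, so the
        # traceback shows both errors instead of only the wrapper.
        raise NamespaceError(f"Failed to create namespace '{namespace}': {e}") from e


try:
    create_namespace("demo")
except NamespaceError as err:
    cause = err.__cause__  # the original ConnectionError, still inspectable
```

Without `from e`, Python would still attach the original error as implicit context, but `__cause__` would be `None` and the traceback would describe it as an error raised during handling rather than as the direct cause.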
examples/benchmarks/temporal-memory-showdown/tests/test_metrics.py (1)

57-63: 📐 Maintainability & Code Quality | ⚡ Quick win

Add a negative-path test for invalid bootstrap sample counts.

Please add a regression test that asserts paired_bootstrap_delta(..., samples=0) raises ValueError, so the input-contract fix remains enforced.

Suggested test addition
+import pytest
 ...
 def test_bootstrap_is_deterministic():
     baseline = [score_query(QUERIES[0], "basil") for _ in range(4)]
     challenger = [score_query(QUERIES[0], "dwarf radish") for _ in range(4)]
     result = paired_bootstrap_delta(baseline, challenger, samples=100, seed=123)
     assert result["observed_delta"] == 1.0
     assert result["ci95"] == [1.0, 1.0]
+
+
+def test_bootstrap_rejects_non_positive_samples():
+    baseline = [score_query(QUERIES[0], "basil")]
+    challenger = [score_query(QUERIES[0], "dwarf radish")]
+    with pytest.raises(ValueError, match="samples must be a positive integer"):
+        paired_bootstrap_delta(baseline, challenger, samples=0)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/temporal-memory-showdown/tests/test_metrics.py` around
lines 57 - 63, Add a negative-path unit test to ensure paired_bootstrap_delta
enforces its samples>0 contract: create a new test (e.g., next to
test_bootstrap_is_deterministic) that calls paired_bootstrap_delta(baseline,
challenger, samples=0) and asserts it raises ValueError (use pytest.raises).
Reference the paired_bootstrap_delta function and reuse simple
baseline/challenger lists like in test_bootstrap_is_deterministic to keep the
test focused on the samples parameter validation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f42d686b-5b43-41c4-8970-53babffaaeb1

📥 Commits

Reviewing files that changed from the base of the PR and between 7665bfb and dd8609b.

📒 Files selected for processing (15)
  • .github/workflows/benchmark-memory-showdown.yml
  • examples/benchmarks/temporal-memory-showdown/.gitignore
  • examples/benchmarks/temporal-memory-showdown/README.md
  • examples/benchmarks/temporal-memory-showdown/backends.py
  • examples/benchmarks/temporal-memory-showdown/dataset.py
  • examples/benchmarks/temporal-memory-showdown/metrics.py
  • examples/benchmarks/temporal-memory-showdown/requirements.txt
  • examples/benchmarks/temporal-memory-showdown/results/.gitkeep
  • examples/benchmarks/temporal-memory-showdown/results/latest.json
  • examples/benchmarks/temporal-memory-showdown/results/latest.md
  • examples/benchmarks/temporal-memory-showdown/run_benchmark.py
  • examples/benchmarks/temporal-memory-showdown/tests/conftest.py
  • examples/benchmarks/temporal-memory-showdown/tests/test_metrics.py
  • memanto/app/services/agent_service.py
  • tests/test_unit.py

Comment on lines +6 to +10
      repeats:
        description: Measured query repetitions after warm-up
        required: false
        default: "5"


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Sanitize workflow_dispatch input before passing it to shell command arguments.

Line 86 interpolates ${{ inputs.repeats }} directly into a shell command. A crafted input that uses command substitution can execute in the shell before Python ever validates the value as an integer.

Suggested patch
   workflow_dispatch:
     inputs:
       repeats:
         description: Measured query repetitions after warm-up
         required: false
         default: "5"
+        type: number
@@
       - name: Run live benchmark
         env:
           HOME: ${{ env.BENCH_HOME }}
+          REPEATS: ${{ inputs.repeats }}
         run: |
+          [[ "$REPEATS" =~ ^[0-9]+$ ]] || { echo "Invalid repeats: $REPEATS"; exit 1; }
           python examples/benchmarks/temporal-memory-showdown/run_benchmark.py \
             --backends memanto,mem0-direct,mem0-agentic \
-            --repeats "${{ inputs.repeats }}"
+            --repeats "$REPEATS"

Also applies to: 86-86

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/benchmark-memory-showdown.yml around lines 6 - 10, The
workflow directly interpolates the workflow_dispatch input inputs.repeats into a
shell command which can lead to command injection; add a validation step that
sanitizes and constrains inputs.repeats before use: create an early run step
that reads the input into a shell variable (e.g. repeats="${{ inputs.repeats
}}"), test it against a strict numeric regex (e.g. if ! [[ "$repeats" =~
^[0-9]+$ ]]; then echo "invalid repeats" >&2; exit 1; fi), export or write the
sanitized numeric value to an environment file (GITHUB_ENV) and then use that
sanitized variable (quoted) in the later run invocation instead of interpolating
${{ inputs.repeats }} directly; also ensure all uses quote the variable to avoid
word-splitting/expansion.

Source: Linters/SAST tools
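As a complementary layer, the runner's own argument parsing can reject non-numeric values. The sketch below is an assumption about how that might look, not the runner's actual parser: `positive_int` is a hypothetical helper, and the lower bound of 1 is a guess. It does not replace the shell-side fix, because injection through `${{ inputs.repeats }}` happens before Python starts.

```python
import argparse


def positive_int(raw: str) -> int:
    """Accept only ASCII digit strings that parse to an integer >= 1."""
    if not raw.isascii() or not raw.isdigit():
        raise argparse.ArgumentTypeError(f"invalid repeats: {raw!r}")
    value = int(raw)
    if value < 1:
        raise argparse.ArgumentTypeError("repeats must be at least 1")
    return value


parser = argparse.ArgumentParser()
parser.add_argument("--repeats", type=positive_int, default=5)
```

With this in place, `--repeats 7` parses to the integer 7, while a value such as `$(id)` is rejected with a usage error instead of reaching the benchmark loop.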

SETUPTOOLS_SCM_PRETEND_VERSION: "0.0.0"

steps:
- uses: actions/checkout@v4


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Detect non-SHA-pinned actions in workflows.
rg -n 'uses:\s*[^@]+@v[0-9]+' .github/workflows

Repository: moorcheh-ai/memanto

Length of output: 1140


Pin GitHub Actions to immutable commit SHAs.

In .github/workflows/benchmark-memory-showdown.yml, the workflow uses floating tags (weak supply-chain guarantees) instead of commit SHAs:

  • Line 24: actions/checkout@v4
  • Line 26: actions/setup-python@v5
  • Line 103: actions/upload-artifact@v4
🧰 Tools
🪛 zizmor (1.25.2)

[warning] 24-24: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 24-24: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/benchmark-memory-showdown.yml at line 24, The workflow
uses floating action tags (actions/checkout@v4, actions/setup-python@v5,
actions/upload-artifact@v4) which weakens supply-chain guarantees; update each
usage to a pinned immutable commit SHA instead of the tag (replace
actions/checkout@v4, actions/setup-python@v5, and actions/upload-artifact@v4
with their corresponding commit SHA refs) and verify the SHA values point to the
intended release commits, keeping the action names for readability in the
workflow comments.

Source: Linters/SAST tools
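The same check the `rg` script performs can be expressed in Python for use in a pre-commit hook or test. This is a sketch under the assumption that a reference counts as pinned only when it is a full 40-character lowercase hex commit SHA; `unpinned_actions` is a hypothetical helper.

```python
import re

# Matches "uses: owner/repo@ref" and captures the action name and the ref.
USES = re.compile(r"uses:\s*([^@\s]+)@(\S+)")
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, action, ref) for every ref that is not a full SHA."""
    findings = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES.search(line)
        if match and not FULL_SHA.fullmatch(match.group(2)):
            findings.append((number, match.group(1), match.group(2)))
    return findings
```

Local actions such as `uses: ./path` contain no `@` and are therefore ignored, and a trailing version comment after a pinned SHA does not affect the match.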

Comment on lines +124 to +130
        except Exception as error:
            last_error = error
            if attempt + 1 < attempts:
                time.sleep(delay_s)
    assert last_error is not None
    raise last_error


🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify current conflict-handling paths for agent creation.
rg -n -C3 'def create_memanto_agent|create_agent\(|status_code\s*==\s*409|MoorchehApiError'

Repository: moorcheh-ai/memanto

Length of output: 18565


Handle HTTP 409 “already exists” as a successful idempotent bootstrap in create_memanto_agent

examples/benchmarks/temporal-memory-showdown/backends.py retries except Exception and always re-raises the last error; if the server already created the agent/namespace but the call still raises a 409, bootstrap can fail instead of proceeding. The app layer already treats status_code == 409 as OK in memanto/app/services/agent_service.py.

Suggested patch
         except Exception as error:
+            if getattr(error, "status_code", None) == 409:
+                return
             last_error = error
             if attempt + 1 < attempts:
                 time.sleep(delay_s)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
         except Exception as error:
+            if getattr(error, "status_code", None) == 409:
+                return
             last_error = error
             if attempt + 1 < attempts:
                 time.sleep(delay_s)
     assert last_error is not None
     raise last_error
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/temporal-memory-showdown/backends.py` around lines 124 -
130, The retry loop in create_memanto_agent currently catches all Exceptions,
stores last_error and always re-raises it, which causes a bootstrap to fail on
HTTP 409 (already exists) even though that should be considered a successful
idempotent outcome; update the except block inside create_memanto_agent to
detect a 409 response (e.g., check error.response.status_code, getattr(error,
"status_code", None), or inspect HTTPError.response) and treat it as success by
breaking/returning without setting last_error, otherwise keep the existing
retry/backoff behavior and re-raise the last_error after attempts are exhausted;
reference create_memanto_agent, last_error, attempts and delay_s to locate and
update the logic.
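A self-contained version of the idea, for experimentation outside the benchmark, might look like the sketch below. `create_with_retry` and its `create` callback are hypothetical; the real `create_memanto_agent` takes different arguments that are not shown in this review.

```python
import time


def create_with_retry(create, attempts=3, delay_s=0.0):
    """Call create() until it succeeds; a 409 conflict counts as success."""
    last_error = None
    for attempt in range(attempts):
        try:
            create()
            return
        except Exception as error:
            if getattr(error, "status_code", None) == 409:
                return  # resource already exists, so creation is idempotent
            last_error = error
            if attempt + 1 < attempts:
                time.sleep(delay_s)
    if last_error is None:
        raise ValueError("attempts must be at least 1")
    raise last_error
```

Using `getattr` with a default keeps the check safe for exceptions that carry no `status_code` attribute, such as plain connection errors, which continue through the normal retry path.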

Comment on lines +291 to +297
    while time.perf_counter() < deadline:
        hits = backend.search(query, top_k)
        if expected_lower in "\n".join(hit.text for hit in hits).casefold():
            return time.perf_counter() - started
        time.sleep(1.0)
    raise TimeoutError(
        f"{backend.name} did not surface {expected!r} within {timeout_s:.0f}s"


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don’t let transient search failures bypass the readiness timeout loop.

At Line 292, any temporary backend/search exception exits immediately and skips the intended timeout retry behavior, which can make CI flaky.

Suggested patch
     expected_lower = expected.casefold()
+    last_error: Exception | None = None
     while time.perf_counter() < deadline:
-        hits = backend.search(query, top_k)
+        try:
+            hits = backend.search(query, top_k)
+        except Exception as error:
+            last_error = error
+            time.sleep(1.0)
+            continue
         if expected_lower in "\n".join(hit.text for hit in hits).casefold():
             return time.perf_counter() - started
         time.sleep(1.0)
-    raise TimeoutError(
-        f"{backend.name} did not surface {expected!r} within {timeout_s:.0f}s"
-    )
+    message = f"{backend.name} did not surface {expected!r} within {timeout_s:.0f}s"
+    if last_error is not None:
+        message = f"{message}; last search error: {last_error}"
+    raise TimeoutError(message)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/temporal-memory-showdown/backends.py` around lines 291 -
297, The readiness loop currently lets any exception from backend.search escape
and break the retry logic; wrap the call to backend.search inside the while loop
in a try/except that catches transient exceptions, optionally logs them, and
continues retrying until the deadline instead of propagating (refer to
backend.search, the while time.perf_counter() < deadline loop, hits,
expected_lower and backend.name); preserve the existing success check and sleep
behavior and only raise the TimeoutError after the deadline elapses.
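For trying the pattern in isolation, the sketch below restates the loop as a standalone function. `wait_until_ready` takes a zero-argument `search` callable that returns a list of strings; both that shape and the `poll_s` parameter are simplifications assumed for this example and differ from the benchmark's backend interface.

```python
import time


def wait_until_ready(search, expected, timeout_s=5.0, poll_s=0.01):
    """Poll search() until expected appears; return the elapsed seconds."""
    started = time.perf_counter()
    deadline = started + timeout_s
    expected_lower = expected.casefold()
    last_error = None
    while time.perf_counter() < deadline:
        try:
            texts = search()
        except Exception as error:  # transient failure: remember it and retry
            last_error = error
            time.sleep(poll_s)
            continue
        if expected_lower in "\n".join(texts).casefold():
            return time.perf_counter() - started
        time.sleep(poll_s)
    message = f"did not surface {expected!r} within {timeout_s:.0f}s"
    if last_error is not None:
        message = f"{message}; last search error: {last_error}"
    raise TimeoutError(message)
```

Carrying the last error into the timeout message means a run that never becomes ready still reports why searches were failing, rather than only that the deadline passed.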

Comment on lines +114 to +127
    if len(baseline) != len(challenger) or not baseline:
        raise ValueError("Paired bootstrap requires equal non-empty score lists")
    rng = random.Random(seed)
    deltas: list[float] = []
    count = len(baseline)
    for _ in range(samples):
        indices = [rng.randrange(count) for _ in range(count)]
        baseline_mean = mean(baseline[i].coverage for i in indices)
        challenger_mean = mean(challenger[i].coverage for i in indices)
        deltas.append(challenger_mean - baseline_mean)

    ordered = sorted(deltas)
    lower = ordered[max(0, math.floor(0.025 * samples))]
    upper = ordered[min(samples - 1, math.ceil(0.975 * samples) - 1)]


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate bootstrap sample count before generating resamples.

At Line 119 onward, samples <= 0 yields an empty deltas list, and Lines 126–127 then index into an empty ordered list (IndexError). Please fail fast with a ValueError for non-positive sample counts.

Proposed fix
 def paired_bootstrap_delta(
     baseline: Sequence[QueryScore],
     challenger: Sequence[QueryScore],
     *,
     samples: int = 5000,
     seed: int = 639,
 ) -> dict:
     if len(baseline) != len(challenger) or not baseline:
         raise ValueError("Paired bootstrap requires equal non-empty score lists")
+    if samples <= 0:
+        raise ValueError("samples must be a positive integer")
     rng = random.Random(seed)
     deltas: list[float] = []
🧰 Tools
🪛 ast-grep (0.43.0)

[info] 115-115: use secrets package over random package
Context: random.Random(seed)
Note: [CWE-330].

(avoid-random-python)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/temporal-memory-showdown/metrics.py` around lines 114 -
127, The code does not validate that `samples` is positive before generating
resamples, which can leave `deltas` empty and cause an IndexError when computing
`lower`/`upper`; add an early check (e.g., if samples <= 0: raise
ValueError("samples must be a positive integer for paired bootstrap")) before
creating `rng` and the resampling loop (referencing the `samples`, `deltas`,
`ordered`, `baseline`, and `challenger` variables) so the function fails fast on
non-positive sample counts.

Comment on lines +278 to +283
        ingest_speedup = (
            mem0_metrics["ingest_total_s"] / memanto_metrics["ingest_total_s"]
        )
        query_reduction = 1 - (
            memanto_metrics["query_p95_s"] / mem0_metrics["query_p95_s"]
        )


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard headline ratio calculations against zero denominators.

Line 279 and Line 282 divide by metric fields that are pre-rounded; if those round to 0.0, Markdown rendering crashes with ZeroDivisionError.

Suggested patch
-        ingest_speedup = (
-            mem0_metrics["ingest_total_s"] / memanto_metrics["ingest_total_s"]
-        )
-        query_reduction = 1 - (
-            memanto_metrics["query_p95_s"] / mem0_metrics["query_p95_s"]
-        )
+        memanto_ingest = memanto_metrics["ingest_total_s"]
+        mem0_query_p95 = mem0_metrics["query_p95_s"]
+        ingest_speedup = (
+            mem0_metrics["ingest_total_s"] / memanto_ingest if memanto_ingest > 0 else None
+        )
+        query_reduction = (
+            1 - (memanto_metrics["query_p95_s"] / mem0_query_p95)
+            if mem0_query_p95 > 0
+            else None
+        )
@@
-                f"- Full ingestion was **{ingest_speedup:,.1f}x faster** "
+                f"- Full ingestion was **{ingest_speedup:,.1f}x faster** "
                 f"({memanto_metrics['ingest_total_s']:.3f}s vs "
-                f"{mem0_metrics['ingest_total_s']:.3f}s).",
-                f"- Query p95 was **{query_reduction:.1%} lower** "
+                f"{mem0_metrics['ingest_total_s']:.3f}s)."
+                if ingest_speedup is not None
+                else "- Full ingestion speedup is not available (zero denominator).",
+                f"- Query p95 was **{query_reduction:.1%} lower** "
                 f"({memanto_metrics['query_p95_s']:.4f}s vs "
-                f"{mem0_metrics['query_p95_s']:.4f}s).",
+                f"{mem0_metrics['query_p95_s']:.4f}s)."
+                if query_reduction is not None
+                else "- Query p95 reduction is not available (zero denominator).",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/temporal-memory-showdown/run_benchmark.py` around lines
278 - 283, Guard the division by checking denominators before computing
ingest_speedup and query_reduction: verify memanto_metrics["ingest_total_s"] is
not zero before computing ingest_speedup and mem0_metrics["query_p95_s"] is not
zero before computing query_reduction; if a denominator is zero (or nearly zero)
return a safe fallback (e.g., set ingest_speedup/query_reduction to None or a
sentinel like float("inf")/0.0) to avoid ZeroDivisionError, and use the existing
variable names ingest_speedup and query_reduction so the rest of the code can
handle the fallback consistently.
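One way to keep the guard and the formatting together is a small helper pair, sketched here. `safe_ratio` and `format_speedup` are hypothetical names introduced for illustration; the message strings follow the suggested patch, but this is not the runner's code.

```python
from typing import Optional


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is not positive."""
    return numerator / denominator if denominator > 0 else None


def format_speedup(mem0_s: float, memanto_s: float) -> str:
    speedup = safe_ratio(mem0_s, memanto_s)
    if speedup is None:
        return "- Full ingestion speedup is not available (zero denominator)."
    return (
        f"- Full ingestion was **{speedup:,.1f}x faster** "
        f"({memanto_s:.3f}s vs {mem0_s:.3f}s)."
    )
```

Returning `None` rather than `float("inf")` forces the caller to choose wording for the degenerate case, which avoids rendering a headline such as "infx faster" in the report.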
