Add live Memanto vs Mem0 temporal memory benchmark #730
2077196405-commits wants to merge 6 commits into
📝 Walkthrough
This PR introduces a comprehensive "Temporal Memory Showdown" benchmark suite comparing Memanto On-Prem versus Mem0 (with direct and agentic ablations) over a synthetic temporal dataset, plus a small fix enabling AgentService to handle on-prem namespace creation conflicts. The benchmark includes dataset definitions, backend adapters, metrics/scoring logic, a runner orchestrating execution and reporting, a CI workflow, and full test coverage.
Changes
Temporal Memory Showdown Benchmark
Agent Service Namespace Conflict Handling
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 6
🧹 Nitpick comments (2)
memanto/app/services/agent_service.py (1)
88-90: 📐 Maintainability & Code Quality | ⚡ Quick win
Preserve exception chain when re-raising.
The current code creates a new `Exception`, which discards the original exception type, traceback, and context. Use `raise ... from e` to preserve the exception chain for debugging.
🔗 Proposed fix to preserve exception chain
 else:
     # Unexpected error - fail the agent creation
-    raise Exception(
-        f"Failed to create namespace '{namespace}' in Moorcheh: {str(e)}"
-    )
+    raise Exception(
+        f"Failed to create namespace '{namespace}' in Moorcheh: {str(e)}"
+    ) from e
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@memanto/app/services/agent_service.py` around lines 88 - 90, Replace the current re-raise that discards the original traceback (specifically the line raising Exception(f"Failed to create namespace '{namespace}' in Moorcheh: {str(e)}")) with a chained raise so the original exception is preserved (use raise ... from e). Locate the raise in the function that creates the Moorcheh namespace (the block where the variable namespace and exception variable e are available) and update it to re-raise the new Exception using "from e" to keep the original context and traceback.
examples/benchmarks/temporal-memory-showdown/tests/test_metrics.py (1)
57-63: 📐 Maintainability & Code Quality | ⚡ Quick win
Add a negative-path test for invalid bootstrap sample counts.
Please add a regression test that asserts `paired_bootstrap_delta(..., samples=0)` raises `ValueError`, so the input-contract fix remains enforced.
Suggested test addition
+import pytest
 ...
 def test_bootstrap_is_deterministic():
     baseline = [score_query(QUERIES[0], "basil") for _ in range(4)]
     challenger = [score_query(QUERIES[0], "dwarf radish") for _ in range(4)]
     result = paired_bootstrap_delta(baseline, challenger, samples=100, seed=123)
     assert result["observed_delta"] == 1.0
     assert result["ci95"] == [1.0, 1.0]
+
+
+def test_bootstrap_rejects_non_positive_samples():
+    baseline = [score_query(QUERIES[0], "basil")]
+    challenger = [score_query(QUERIES[0], "dwarf radish")]
+    with pytest.raises(ValueError, match="samples must be a positive integer"):
+        paired_bootstrap_delta(baseline, challenger, samples=0)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/temporal-memory-showdown/tests/test_metrics.py` around lines 57 - 63, Add a negative-path unit test to ensure paired_bootstrap_delta enforces its samples>0 contract: create a new test (e.g., next to test_bootstrap_is_deterministic) that calls paired_bootstrap_delta(baseline, challenger, samples=0) and asserts it raises ValueError (use pytest.raises). Reference the paired_bootstrap_delta function and reuse simple baseline/challenger lists like in test_bootstrap_is_deterministic to keep the test focused on the samples parameter validation.
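As an illustration of the `agent_service.py` nitpick above, here is a minimal sketch of explicit exception chaining. Python already records the original error implicitly as `__context__` when a new exception is raised inside an `except` block; adding `from e` also sets `__cause__`, so the traceback reports the original error as the direct cause. `NamespaceError`, `create_namespace`, and `create_agent_namespace` are hypothetical stand-ins, not Memanto or Moorcheh APIs.

```python
class NamespaceError(Exception):
    """Hypothetical wrapper error used only for this illustration."""


def create_namespace(namespace: str) -> None:
    # Stand-in for the real client call; it always fails so the wrapper path runs.
    raise ConnectionError("upstream unavailable")


def create_agent_namespace(namespace: str) -> None:
    try:
        create_namespace(namespace)
    except Exception as e:
        # "from e" sets __cause__, so the original error is reported as the
        # direct cause of the wrapper in the traceback.
        raise NamespaceError(
            f"Failed to create namespace '{namespace}': {e}"
        ) from e


try:
    create_agent_namespace("demo")
except NamespaceError as err:
    # Keep a reference to the chained original for inspection.
    original = err.__cause__
```

Catching the wrapper and reading `__cause__` is how a caller or a log formatter can recover the underlying failure type.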
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/benchmark-memory-showdown.yml:
- Around line 6-10: The workflow directly interpolates the workflow_dispatch
input inputs.repeats into a shell command which can lead to command injection;
add a validation step that sanitizes and constrains inputs.repeats before use:
create an early run step that reads the input into a shell variable (e.g.
repeats="${{ inputs.repeats }}"), test it against a strict numeric regex (e.g.
if ! [[ "$repeats" =~ ^[0-9]+$ ]]; then echo "invalid repeats" >&2; exit 1; fi),
export or write the sanitized numeric value to an environment file (GITHUB_ENV)
and then use that sanitized variable (quoted) in the later run invocation
instead of interpolating ${{ inputs.repeats }} directly; also ensure all uses
quote the variable to avoid word-splitting/expansion.
- Line 24: The workflow uses floating action tags (actions/checkout@v4,
actions/setup-python@v5, actions/upload-artifact@v4) which weakens supply-chain
guarantees; update each usage to a pinned immutable commit SHA instead of the
tag (replace actions/checkout@v4, actions/setup-python@v5, and
actions/upload-artifact@v4 with their corresponding commit SHA refs) and verify
the SHA values point to the intended release commits, keeping the action names
for readability in the workflow comments.
In `@examples/benchmarks/temporal-memory-showdown/backends.py`:
- Around line 291-297: The readiness loop currently lets any exception from
backend.search escape and break the retry logic; wrap the call to backend.search
inside the while loop in a try/except that catches transient exceptions,
optionally logs them, and continues retrying until the deadline instead of
propagating (refer to backend.search, the while time.perf_counter() < deadline
loop, hits, expected_lower and backend.name); preserve the existing success
check and sleep behavior and only raise the TimeoutError after the deadline
elapses.
- Around line 124-130: The retry loop in create_memanto_agent currently catches
all Exceptions, stores last_error and always re-raises it, which causes a
bootstrap to fail on HTTP 409 (already exists) even though that should be
considered a successful idempotent outcome; update the except block inside
create_memanto_agent to detect a 409 response (e.g., check
error.response.status_code, getattr(error, "status_code", None), or inspect
HTTPError.response) and treat it as success by breaking/returning without
setting last_error, otherwise keep the existing retry/backoff behavior and
re-raise the last_error after attempts are exhausted; reference
create_memanto_agent, last_error, attempts and delay_s to locate and update the
logic.
In `@examples/benchmarks/temporal-memory-showdown/metrics.py`:
- Around line 114-127: The code does not validate that `samples` is positive
before generating resamples, which can leave `deltas` empty and cause an
IndexError when computing `lower`/`upper`; add an early check (e.g., if samples
<= 0: raise ValueError("samples must be a positive integer for paired
bootstrap")) before creating `rng` and the resampling loop (referencing the
`samples`, `deltas`, `ordered`, `baseline`, and `challenger` variables) so the
function fails fast on non-positive sample counts.
In `@examples/benchmarks/temporal-memory-showdown/run_benchmark.py`:
- Around line 278-283: Guard the division by checking denominators before
computing ingest_speedup and query_reduction: verify
memanto_metrics["ingest_total_s"] is not zero before computing ingest_speedup
and mem0_metrics["query_p95_s"] is not zero before computing query_reduction; if
a denominator is zero (or nearly zero) return a safe fallback (e.g., set
ingest_speedup/query_reduction to None or a sentinel like float("inf")/0.0) to
avoid ZeroDivisionError, and use the existing variable names ingest_speedup and
query_reduction so the rest of the code can handle the fallback consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: f42d686b-5b43-41c4-8970-53babffaaeb1
📒 Files selected for processing (15)
.github/workflows/benchmark-memory-showdown.yml
examples/benchmarks/temporal-memory-showdown/.gitignore
examples/benchmarks/temporal-memory-showdown/README.md
examples/benchmarks/temporal-memory-showdown/backends.py
examples/benchmarks/temporal-memory-showdown/dataset.py
examples/benchmarks/temporal-memory-showdown/metrics.py
examples/benchmarks/temporal-memory-showdown/requirements.txt
examples/benchmarks/temporal-memory-showdown/results/.gitkeep
examples/benchmarks/temporal-memory-showdown/results/latest.json
examples/benchmarks/temporal-memory-showdown/results/latest.md
examples/benchmarks/temporal-memory-showdown/run_benchmark.py
examples/benchmarks/temporal-memory-showdown/tests/conftest.py
examples/benchmarks/temporal-memory-showdown/tests/test_metrics.py
memanto/app/services/agent_service.py
tests/test_unit.py
repeats:
  description: Measured query repetitions after warm-up
  required: false
  default: "5"
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Sanitize workflow_dispatch input before passing it to shell command arguments.
Line 86 interpolates ${{ inputs.repeats }} directly into a shell command. A crafted input using command substitution can execute before Python validates int.
Suggested patch
 workflow_dispatch:
   inputs:
     repeats:
       description: Measured query repetitions after warm-up
       required: false
       default: "5"
+      type: number
@@
       - name: Run live benchmark
         env:
           HOME: ${{ env.BENCH_HOME }}
+          REPEATS: ${{ inputs.repeats }}
         run: |
+          [[ "$REPEATS" =~ ^[0-9]+$ ]] || { echo "Invalid repeats: $REPEATS"; exit 1; }
           python examples/benchmarks/temporal-memory-showdown/run_benchmark.py \
             --backends memanto,mem0-direct,mem0-agentic \
-            --repeats "${{ inputs.repeats }}"
+            --repeats "$REPEATS"
Also applies to: 86-86
Source: Linters/SAST tools
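The shell-side check above is what closes the injection path, because `${{ inputs.repeats }}` is expanded before Python ever starts. As a complementary sketch, the Python entry point could also reject anything that is not a plain positive integer. The parser below is hypothetical and assumes only that the runner accepts a `--repeats` flag, as shown in the workflow command; it is not taken from `run_benchmark.py`.

```python
import argparse


def positive_int(value: str) -> int:
    """Accept only plain ASCII digits that parse to an integer >= 1."""
    if not (value.isascii() and value.isdigit()):
        raise argparse.ArgumentTypeError(
            f"expected a positive integer, got {value!r}"
        )
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("repeats must be at least 1")
    return number


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI; only the --repeats flag name comes from the workflow.
    parser = argparse.ArgumentParser(description="Hypothetical benchmark CLI")
    parser.add_argument("--repeats", type=positive_int, default=5)
    return parser
```

With this in place, `build_parser().parse_args(["--repeats", "7"])` yields `repeats == 7`, while a value such as `$(id)` makes argparse exit with a usage error.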
SETUPTOOLS_SCM_PRETEND_VERSION: "0.0.0"

steps:
  - uses: actions/checkout@v4
🔒 Security & Privacy | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Detect non-SHA-pinned actions in workflows.
rg -n 'uses:\s*[^@]+@v[0-9]+' .github/workflows
Repository: moorcheh-ai/memanto
Length of output: 1140
Pin GitHub Actions to immutable commit SHAs.
In .github/workflows/benchmark-memory-showdown.yml, the workflow uses floating tags (weak supply-chain guarantees) instead of commit SHAs:
- Line 24: `actions/checkout@v4`
- Line 26: `actions/setup-python@v5`
- Line 103: `actions/upload-artifact@v4`
🧰 Tools
🪛 zizmor (1.25.2)
[warning] 24-24: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false
(artipacked)
[error] 24-24: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
Source: Linters/SAST tools
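The `rg` check in the analysis chain above only matches `@vN` tags. A slightly broader sketch, assuming that a pinned reference is exactly a 40-character lowercase hex commit SHA, could flag every `uses:` entry whose ref is anything else. The function name `unpinned_actions` is hypothetical and this is not part of the PR.

```python
import re

# Matches "uses: owner/repo@ref" (optionally as a list item) and captures
# the action name and the ref that follows "@".
USES_RE = re.compile(r"^\s*-?\s*uses:\s*([^@\s]+)@(\S+)", re.MULTILINE)
# Assumption: a pinned ref is a full 40-character lowercase hex commit SHA.
SHA_RE = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text: str) -> list[tuple[str, str]]:
    """Return (action, ref) pairs whose ref is not a full commit SHA."""
    return [
        (action, ref)
        for action, ref in USES_RE.findall(workflow_text)
        if not SHA_RE.fullmatch(ref)
    ]
```

Local actions referenced by path (no `@`) are not matched, so they are never reported.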
    except Exception as error:
        last_error = error
        if attempt + 1 < attempts:
            time.sleep(delay_s)
assert last_error is not None
raise last_error
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify current conflict-handling paths for agent creation.
rg -n -C3 'def create_memanto_agent|create_agent\(|status_code\s*==\s*409|MoorchehApiError'
Repository: moorcheh-ai/memanto
Length of output: 18565
Handle HTTP 409 “already exists” as a successful idempotent bootstrap in create_memanto_agent
`create_memanto_agent` in examples/benchmarks/temporal-memory-showdown/backends.py retries on `except Exception` and always re-raises the last error. If the server already created the agent or namespace but the call still raises a 409, bootstrap fails instead of proceeding. The app layer already treats `status_code == 409` as OK in memanto/app/services/agent_service.py.
Suggested patch
     except Exception as error:
+        if getattr(error, "status_code", None) == 409:
+            return
         last_error = error
         if attempt + 1 < attempts:
             time.sleep(delay_s)
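The idempotent-bootstrap idea in this comment can be sketched in isolation as follows. `ApiError` and `create_with_retry` are hypothetical names introduced for illustration; the only detail borrowed from the review is the `getattr(error, "status_code", None) == 409` check, and the real client error type may expose the status differently.

```python
import time
from typing import Callable, Optional


class ApiError(Exception):
    """Hypothetical client error carrying an HTTP status code."""

    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


def create_with_retry(
    create: Callable[[], None], attempts: int = 3, delay_s: float = 0.0
) -> None:
    """Call create(); treat HTTP 409 as success, retry other failures."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            create()
            return
        except Exception as error:
            # 409 means the resource already exists, which is the desired
            # end state, so the bootstrap is considered done.
            if getattr(error, "status_code", None) == 409:
                return
            last_error = error
            if attempt + 1 < attempts:
                time.sleep(delay_s)
    if last_error is None:
        raise ValueError("attempts must be at least 1")
    raise last_error
```

Raising `ValueError` for a non-positive `attempts` replaces the `assert` in the original loop, which would be stripped under `python -O`; that substitution is a design choice of this sketch, not something the review requested.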
while time.perf_counter() < deadline:
    hits = backend.search(query, top_k)
    if expected_lower in "\n".join(hit.text for hit in hits).casefold():
        return time.perf_counter() - started
    time.sleep(1.0)
raise TimeoutError(
    f"{backend.name} did not surface {expected!r} within {timeout_s:.0f}s"
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Don’t let transient search failures bypass the readiness timeout loop.
At Line 292, any temporary backend/search exception exits immediately and skips the intended timeout retry behavior, which can make CI flaky.
Suggested patch
     expected_lower = expected.casefold()
+    last_error: Exception | None = None
     while time.perf_counter() < deadline:
-        hits = backend.search(query, top_k)
+        try:
+            hits = backend.search(query, top_k)
+        except Exception as error:
+            last_error = error
+            time.sleep(1.0)
+            continue
         if expected_lower in "\n".join(hit.text for hit in hits).casefold():
             return time.perf_counter() - started
         time.sleep(1.0)
-    raise TimeoutError(
-        f"{backend.name} did not surface {expected!r} within {timeout_s:.0f}s"
-    )
+    message = f"{backend.name} did not surface {expected!r} within {timeout_s:.0f}s"
+    if last_error is not None:
+        message = f"{message}; last search error: {last_error}"
+    raise TimeoutError(message)
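A self-contained version of the readiness loop described in this comment might look like the sketch below. `wait_for_text` and its `search` callable are hypothetical simplifications: the real code polls `backend.search(query, top_k)` and reads `hit.text`, whereas here `search` returns plain strings so the example runs without a backend.

```python
import time
from typing import Callable, Optional, Sequence


def wait_for_text(
    search: Callable[[], Sequence[str]],
    expected: str,
    timeout_s: float = 5.0,
    poll_s: float = 0.1,
) -> float:
    """Poll search() until expected appears; return the elapsed seconds."""
    started = time.perf_counter()
    deadline = started + timeout_s
    needle = expected.casefold()
    last_error: Optional[Exception] = None
    while time.perf_counter() < deadline:
        try:
            hits = search()
        except Exception as error:
            # A transient failure should not end the wait; remember it so
            # the timeout message can explain what went wrong.
            last_error = error
        else:
            if needle in "\n".join(hits).casefold():
                return time.perf_counter() - started
        time.sleep(poll_s)
    message = f"did not surface {expected!r} within {timeout_s:.0f}s"
    if last_error is not None:
        message += f"; last search error: {last_error}"
    raise TimeoutError(message)
```

Only the deadline ends the loop, so a backend that fails a few times and then recovers still counts as ready.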
if len(baseline) != len(challenger) or not baseline:
    raise ValueError("Paired bootstrap requires equal non-empty score lists")
rng = random.Random(seed)
deltas: list[float] = []
count = len(baseline)
for _ in range(samples):
    indices = [rng.randrange(count) for _ in range(count)]
    baseline_mean = mean(baseline[i].coverage for i in indices)
    challenger_mean = mean(challenger[i].coverage for i in indices)
    deltas.append(challenger_mean - baseline_mean)

ordered = sorted(deltas)
lower = ordered[max(0, math.floor(0.025 * samples))]
upper = ordered[min(samples - 1, math.ceil(0.975 * samples) - 1)]
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Validate bootstrap sample count before generating resamples.
At Line 119 onward, samples <= 0 yields an empty deltas list, and Lines 126–127 then index into an empty ordered list (IndexError). Please fail fast with a ValueError for non-positive sample counts.
Proposed fix
 def paired_bootstrap_delta(
     baseline: Sequence[QueryScore],
     challenger: Sequence[QueryScore],
     *,
     samples: int = 5000,
     seed: int = 639,
 ) -> dict:
     if len(baseline) != len(challenger) or not baseline:
         raise ValueError("Paired bootstrap requires equal non-empty score lists")
+    if samples <= 0:
+        raise ValueError("samples must be a positive integer")
     rng = random.Random(seed)
     deltas: list[float] = []
🧰 Tools
🪛 ast-grep (0.43.0)
[info] 115-115: use secrets package over random package
Context: random.Random(seed)
Note: [CWE-330].
(avoid-random-python)
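For readers who want to try the percentile logic without the benchmark's `QueryScore` type, here is a reduced sketch that works on plain floats. The function name `paired_bootstrap` and the float inputs are assumptions made for this example; the resampling and index arithmetic follow the snippet quoted above, with the positive-`samples` check added.

```python
import math
import random
from statistics import mean
from typing import Sequence


def paired_bootstrap(
    baseline: Sequence[float],
    challenger: Sequence[float],
    samples: int = 1000,
    seed: int = 0,
) -> dict:
    if len(baseline) != len(challenger) or not baseline:
        raise ValueError("paired bootstrap requires equal non-empty score lists")
    if samples <= 0:
        raise ValueError("samples must be a positive integer")
    rng = random.Random(seed)
    count = len(baseline)
    deltas = []
    for _ in range(samples):
        # Resample query indices with replacement; the same indices are used
        # for both systems so the comparison stays paired.
        indices = [rng.randrange(count) for _ in range(count)]
        deltas.append(
            mean(challenger[i] for i in indices)
            - mean(baseline[i] for i in indices)
        )
    ordered = sorted(deltas)
    # Percentile bounds of the resampled deltas, clamped to valid indices.
    lower = ordered[max(0, math.floor(0.025 * samples))]
    upper = ordered[min(samples - 1, math.ceil(0.975 * samples) - 1)]
    return {
        "observed_delta": mean(challenger) - mean(baseline),
        "ci95": [lower, upper],
    }
```

Because the generator is seeded, repeated calls with the same inputs return the same interval, which is what makes the deterministic test in the PR possible.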
ingest_speedup = (
    mem0_metrics["ingest_total_s"] / memanto_metrics["ingest_total_s"]
)
query_reduction = 1 - (
    memanto_metrics["query_p95_s"] / mem0_metrics["query_p95_s"]
)
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Guard headline ratio calculations against zero denominators.
Line 279 and Line 282 divide by metric fields that are pre-rounded; if those round to 0.0, Markdown rendering crashes with ZeroDivisionError.
Suggested patch
-    ingest_speedup = (
-        mem0_metrics["ingest_total_s"] / memanto_metrics["ingest_total_s"]
-    )
-    query_reduction = 1 - (
-        memanto_metrics["query_p95_s"] / mem0_metrics["query_p95_s"]
-    )
+    memanto_ingest = memanto_metrics["ingest_total_s"]
+    mem0_query_p95 = mem0_metrics["query_p95_s"]
+    ingest_speedup = (
+        mem0_metrics["ingest_total_s"] / memanto_ingest if memanto_ingest > 0 else None
+    )
+    query_reduction = (
+        1 - (memanto_metrics["query_p95_s"] / mem0_query_p95)
+        if mem0_query_p95 > 0
+        else None
+    )
@@
-        f"- Full ingestion was **{ingest_speedup:,.1f}x faster** "
+        f"- Full ingestion was **{ingest_speedup:,.1f}x faster** "
         f"({memanto_metrics['ingest_total_s']:.3f}s vs "
-        f"{mem0_metrics['ingest_total_s']:.3f}s).",
-        f"- Query p95 was **{query_reduction:.1%} lower** "
+        f"{mem0_metrics['ingest_total_s']:.3f}s)."
+        if ingest_speedup is not None
+        else "- Full ingestion speedup is not available (zero denominator).",
+        f"- Query p95 was **{query_reduction:.1%} lower** "
         f"({memanto_metrics['query_p95_s']:.4f}s vs "
-        f"{mem0_metrics['query_p95_s']:.4f}s).",
+        f"{mem0_metrics['query_p95_s']:.4f}s)."
+        if query_reduction is not None
+        else "- Query p95 reduction is not available (zero denominator).",
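The guard proposed here can be factored into a small helper so the ratio and its formatting share one fallback path. The sketch below is a hypothetical refactoring, not code from `run_benchmark.py`; `safe_ratio` and `speedup_line` are names invented for the example, and the message strings mirror those in the suggested patch.

```python
from typing import Optional


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None if the denominator is not positive."""
    return numerator / denominator if denominator > 0 else None


def speedup_line(slow_s: float, fast_s: float) -> str:
    """Format the ingestion headline, falling back when the ratio is undefined."""
    speedup = safe_ratio(slow_s, fast_s)
    if speedup is None:
        return "- Full ingestion speedup is not available (zero denominator)."
    return (
        f"- Full ingestion was **{speedup:,.1f}x faster** "
        f"({fast_s:.3f}s vs {slow_s:.3f}s)."
    )
```

Returning `None` rather than `float("inf")` keeps the fallback explicit, so the report cannot print a misleading "inf x faster" headline.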
Summary
examples/benchmarks/temporal-memory-showdown/.
`infer=True`) as the primary competitor and reports `infer=False` only as a clearly labeled vector-only ablation.
`MoorchehApiError(status_code=409)`) is treated idempotently, with a regression test.
Verified benchmark result
A real GitHub-hosted run completed successfully: Actions run 27441595257.
The paired bootstrap estimate for Memanto's concept-coverage advantage is +27.8 percentage points with a 95% CI of +9.3 to +48.1 points. Memanto ingestion was approximately 30,286x faster in this run.
The report does not hide the temporal failure mode: raw top-5 retrieval from every tested backend still surfaced superseded values. Mem0 agentic's lower stale-leak rate also coincided with missing more current facts. The benchmark therefore reports coverage, strict accuracy, and stale leakage separately rather than collapsing them into a single favorable score.
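To make the idea of reporting the three signals separately concrete, here is a hypothetical scoring sketch. The definitions below are assumptions chosen for illustration (substring matching, coverage as the share of current concepts found, stale leakage as any superseded value appearing, strict accuracy as full coverage with no leak); the PR's `metrics.py` may define them differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class QueryResult:
    coverage: float   # share of expected current concepts found (assumed definition)
    stale_leak: bool  # True if any superseded value appears (assumed definition)
    strict: bool      # full coverage and no stale value (assumed definition)


def score(
    retrieved: str, current: Sequence[str], superseded: Sequence[str]
) -> QueryResult:
    """Score one query's retrieved text on three separate axes."""
    text = retrieved.casefold()
    found = sum(1 for concept in current if concept.casefold() in text)
    coverage = found / len(current) if current else 0.0
    stale_leak = any(value.casefold() in text for value in superseded)
    return QueryResult(coverage, stale_leak, coverage == 1.0 and not stale_leak)
```

Under these assumed definitions, a result that surfaces both the current and the superseded value gets full coverage but fails strict accuracy, which is the pattern a single blended score would hide.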
The vector-only Mem0 ablation reached 98.6% coverage with 3.996 s total ingestion and zero LLM tokens; it is included for diagnostic context, not presented as the primary agentic comparison.
Reproduce and audit
`Benchmark memory showdown` via `workflow_dispatch`
Validation
`pytest -q` passes for all available non-live tests; 24 existing cloud E2E tests skip without a Moorcheh cloud API key.
`ruff check` and `ruff format --check` pass.
Bounty submission
Refs #639
Public showcase: auditable benchmark report and result tables. GitHub reactions and technical discussion on this PR are also part of the issue's published social scoring formula.
Social showcase
GitHub PR #730 is the public technical showcase and reaction-scored discussion, consistent with the bounty's published GitHub PR reaction formula. The committed live report provides the shareable result tables and limitations.