113 changes: 113 additions & 0 deletions .github/workflows/benchmark-memory-showdown.yml
@@ -0,0 +1,113 @@
name: Temporal memory showdown

"on":
  workflow_dispatch:
    inputs:
      repeats:
        description: Measured query repetitions after warm-up
        required: false
        default: "5"

Comment on lines +6 to +10


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Sanitize workflow_dispatch input before passing it to shell command arguments.

Line 86 interpolates ${{ inputs.repeats }} directly into a shell command. GitHub expands the expression before the shell parses the script, so a crafted input containing command substitution can execute before Python ever validates the value as an integer.

Suggested patch
   workflow_dispatch:
     inputs:
       repeats:
         description: Measured query repetitions after warm-up
         required: false
         default: "5"
+        type: number
@@
       - name: Run live benchmark
         env:
           HOME: ${{ env.BENCH_HOME }}
+          REPEATS: ${{ inputs.repeats }}
         run: |
+          [[ "$REPEATS" =~ ^[0-9]+$ ]] || { echo "Invalid repeats: $REPEATS"; exit 1; }
           python examples/benchmarks/temporal-memory-showdown/run_benchmark.py \
             --backends memanto,mem0-direct,mem0-agentic \
-            --repeats "${{ inputs.repeats }}"
+            --repeats "$REPEATS"

Also applies to: 86-86

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/benchmark-memory-showdown.yml around lines 6 - 10, The
workflow directly interpolates the workflow_dispatch input inputs.repeats into a
shell command, which can lead to command injection. Do not expand the expression
inside the script at all: an assignment such as repeats="${{ inputs.repeats }}"
is still substituted into the script text before the shell runs. Instead, pass
the input through a step-level env entry (e.g. REPEATS: ${{ inputs.repeats }}),
test it against a strict numeric regex (e.g. if ! [[ "$REPEATS" =~ ^[0-9]+$ ]];
then echo "invalid repeats" >&2; exit 1; fi), and use the quoted variable in the
later run invocation instead of interpolating ${{ inputs.repeats }} directly.
Quote every use of the variable to avoid word-splitting and unintended
expansion.

Source: Linters/SAST tools
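
As defense in depth, the same strict check can be repeated on the Python side. The following is a minimal sketch, assuming the count arrives as a string (for example from the `REPEATS` environment variable in the suggested patch). The `parse_repeats` helper and its upper bound of 1000 are hypothetical and not part of this PR. It complements the env-based fix and does not replace it, because the injection happens before Python starts.

```python
import re

# ASCII digits only; fullmatch also rejects a trailing newline.
_DIGITS = re.compile(r"[0-9]+")


def parse_repeats(raw: str, maximum: int = 1000) -> int:
    """Return raw as an integer in [1, maximum], or raise ValueError.

    Hypothetical helper for illustration; the bound is an assumption.
    """
    if not _DIGITS.fullmatch(raw):
        raise ValueError(f"invalid repeats: {raw!r}")
    value = int(raw)
    if not 1 <= value <= maximum:
        raise ValueError(f"repeats out of range: {value}")
    return value
```

A caller could use `parse_repeats(os.environ["REPEATS"])` or pass the function as an `argparse` `type=` callback.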

permissions:
  contents: read

jobs:
  benchmark:
    runs-on: ubuntu-latest
    timeout-minutes: 90
    env:
      MEM0_TELEMETRY: "false"
      BENCH_HOME: /tmp/temporal-memory-showdown
      SETUPTOOLS_SCM_PRETEND_VERSION: "0.0.0"

    steps:
      - uses: actions/checkout@v4


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Detect non-SHA-pinned actions in workflows.
rg -n 'uses:\s*[^@]+@v[0-9]+' .github/workflows

Repository: moorcheh-ai/memanto

Length of output: 1140


Pin GitHub Actions to immutable commit SHAs.

In .github/workflows/benchmark-memory-showdown.yml, the workflow references actions by floating tags instead of commit SHAs. A tag can be moved to a different commit later, so it gives weak supply-chain guarantees:

  • Line 24: actions/checkout@v4
  • Line 26: actions/setup-python@v5
  • Line 103: actions/upload-artifact@v4
🧰 Tools
🪛 zizmor (1.25.2)

[warning] 24-24: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 24-24: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/benchmark-memory-showdown.yml at line 24, The workflow
uses floating action tags (actions/checkout@v4, actions/setup-python@v5,
actions/upload-artifact@v4), which weakens supply-chain guarantees. Replace each
tag with the full commit SHA of the intended release, verify that every SHA
points to that release commit, and keep the version tag in a trailing comment
on each uses line for readability.

Source: Linters/SAST tools
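
The `rg` check in the analysis chain can be approximated in Python for use in a test or pre-commit hook. This is a rough sketch under stated assumptions: `unpinned_actions` is a hypothetical helper, it treats any ref that is not a full 40-character hex SHA as unpinned, and it ignores local (`./path`) and `docker://` references.

```python
import re

# Matches "uses: owner/repo@ref" with an optional leading list dash.
USES = re.compile(r"^[ \t]*-?[ \t]*uses:[ \t]*([^@\s]+)@(\S+)", re.MULTILINE)
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return 'action@ref' entries whose ref is not a 40-hex commit SHA."""
    found = []
    for action, ref in USES.findall(workflow_text):
        if not FULL_SHA.fullmatch(ref):
            found.append(f"{action}@{ref}")
    return found


# "example/pinned" and its SHA are made up for the demonstration.
sample = """
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v5
    with:
      python-version: "3.12"
  - uses: example/pinned@0123456789abcdef0123456789abcdef01234567 # v1
"""
print(unpinned_actions(sample))
```

For the sample text, only the two tag-based references are reported; the SHA-pinned one passes.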


      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip

      - name: Install benchmark
        run: |
          python -m pip install -e .
          python -m pip install -r examples/benchmarks/temporal-memory-showdown/requirements.txt

      - name: Configure Moorcheh On-Prem
        env:
          HOME: ${{ env.BENCH_HOME }}
        run: |
          python - <<'PY'
          from moorcheh.user_config import EmbeddingConfig, LlmConfig, save_runtime_config

          save_runtime_config(
              EmbeddingConfig(provider="ollama", model="nomic-embed-text"),
              LlmConfig(provider="ollama", model="qwen2.5:1.5b"),
          )
          PY
          python -m moorcheh up \
            --bundled-ollama \
            --embedding-provider ollama \
            --embedding-model nomic-embed-text

      - name: Wait for services
        env:
          HOME: ${{ env.BENCH_HOME }}
        run: |
          for attempt in $(seq 1 60); do
            if curl --fail --silent http://127.0.0.1:8080/health >/dev/null; then
              break
            fi
            sleep 2
          done
          curl --fail --silent http://127.0.0.1:8080/health >/dev/null

          # The HTTP health endpoint can become ready before namespace storage.
          for attempt in $(seq 1 30); do
            if python -m moorcheh namespace-create \
              --name benchmark-readiness \
              --type text; then
              exit 0
            fi
            if python -m moorcheh namespace-list | grep -q benchmark-readiness; then
              exit 0
            fi
            sleep 2
          done
          docker logs moorcheh-onprem-server
          exit 1

      - name: Run live benchmark
        env:
          HOME: ${{ env.BENCH_HOME }}
        run: |
          python examples/benchmarks/temporal-memory-showdown/run_benchmark.py \
            --backends memanto,mem0-direct,mem0-agentic \
            --repeats "${{ inputs.repeats }}"

      - name: Collect service logs
        if: always()
        run: |
          mkdir -p examples/benchmarks/temporal-memory-showdown/results/logs
          docker logs moorcheh-onprem-server \
            > examples/benchmarks/temporal-memory-showdown/results/logs/moorcheh.log 2>&1 || true
          docker logs moorcheh-ollama \
            > examples/benchmarks/temporal-memory-showdown/results/logs/ollama.log 2>&1 || true
          docker inspect moorcheh-onprem-server \
            > examples/benchmarks/temporal-memory-showdown/results/logs/server-inspect.json 2>&1 || true
          find "$BENCH_HOME/.moorcheh" -maxdepth 3 -printf "%M %u:%g %s %p\n" \
            > examples/benchmarks/temporal-memory-showdown/results/logs/data-layout.txt 2>&1 || true

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: temporal-memory-showdown-results
          path: examples/benchmarks/temporal-memory-showdown/results/
          if-no-files-found: warn

      - name: Stop services
        if: always()
        env:
          HOME: ${{ env.BENCH_HOME }}
        run: python -m moorcheh down --bundled-ollama || true
4 changes: 4 additions & 0 deletions examples/benchmarks/temporal-memory-showdown/.gitignore
@@ -0,0 +1,4 @@
.benchmark-data/
results/latest.json
results/latest.md
results/logs/
150 changes: 150 additions & 0 deletions examples/benchmarks/temporal-memory-showdown/README.md
@@ -0,0 +1,150 @@
# Temporal Memory Showdown

A live, reproducible benchmark of actual Memanto On-Prem and actual Mem0 OSS.
It tests a long-running research mission whose facts change across ten sessions.
The benchmark is designed to answer two separate questions:

1. How do the retrieval layers compare when both store the same raw memories?
2. What changes when Mem0's normal LLM extraction and reconciliation pipeline is enabled?

No backend is simulated. No LLM judges the results.

## Systems under test

| Backend | Storage path | Ingestion mode |
| --- | --- | --- |
| `memanto-on-prem` | Memanto `SdkClient` -> Moorcheh On-Prem | Typed `remember()` calls |
| `mem0-direct` | Mem0 2.0.5 -> local Qdrant | `infer=False`, raw memory |
| `mem0-agentic` | Mem0 2.0.5 -> local Qdrant | `infer=True`, Ollama extraction |

All three use `nomic-embed-text` through the same Ollama service. The agentic
Mem0 run uses `qwen2.5:1.5b` for extraction; its input and output token counts
are taken directly from Ollama's native counters rather than estimated.

## Dataset

The synthetic Asteria mission contains 32 records across ten sessions. It has
eleven explicit state changes, including:

- crop: Genovese basil -> dwarf radish
- launch: August 14 -> September 2
- commander: Elena Park -> Priya Nair
- channel: Slack -> Matrix
- nutrient protocol: N-17 / pH 6.2 -> N-21 / pH 5.9
- landing site: Malapert Ridge -> Shackleton rim
- vendor: Helios / PO-81 -> Nova / PO-96
- valve procedure: V1 -> V3

The 18 golden queries cover current state, history, and multi-hop briefs.
Every answer is scored with required and forbidden concept groups. This makes
the result deterministic and exposes stale-value leakage directly.
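
The required/forbidden scoring can be sketched as follows. This is an illustration only: `score_answer`, the group format (each group is a list of acceptable phrasings), and the case-insensitive substring match are assumptions, not the benchmark's actual implementation.

```python
def score_answer(context, required, forbidden):
    """Score retrieved context against required and forbidden concept groups.

    A group counts as matched when any of its phrasings appears in the context.
    """
    text = context.lower()

    def hit(group):
        return any(term.lower() in text for term in group)

    matched = sum(hit(group) for group in required)
    leaked = any(hit(group) for group in forbidden)
    coverage = matched / len(required) if required else 1.0
    return {
        "coverage": coverage,                        # share of required groups found
        "stale_leak": leaked,                        # a superseded value surfaced
        "exact": coverage == 1.0 and not leaked,     # strict, contradiction-free pass
    }


# Made-up context for the crop change listed above.
result = score_answer(
    "The crop was switched from Genovese basil to dwarf radish.",
    required=[["dwarf radish"]],
    forbidden=[["genovese basil"]],
)
print(result)
```

Here the current value is found, so coverage is full, but the superseded value is also present, so the query fails the strict check. Keeping the two signals separate is what makes stale-value leakage visible.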

## Metrics

- deterministic required-concept coverage
- exact query accuracy
- stale-value leak rate
- source and retrieved context tokens (`cl100k_base` accounting unit)
- native Ollama extraction tokens for agentic Mem0
- ingestion total, p50, and p95 latency
- time until the final update becomes searchable
- query mean, p50, and p95 latency after warm-up
- client process RSS delta
- paired bootstrap 95% confidence interval for coverage differences
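
For the last metric, a paired percentile bootstrap can be sketched as below. The resample count, the seed value, and the name `paired_bootstrap_ci` are assumptions made for illustration; the benchmark's own procedure may differ in detail.

```python
import random


def paired_bootstrap_ci(a, b, resamples=10_000, seed=0, alpha=0.05):
    """Percentile bootstrap CI for mean(a) - mean(b).

    a and b hold per-query scores for two backends in the same query order,
    so each resample draws whole query pairs rather than independent scores.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(resamples)
    )
    lower = means[int((alpha / 2) * resamples)]
    upper = means[min(resamples - 1, int((1 - alpha / 2) * resamples))]
    return lower, upper
```

Resampling the per-query differences keeps each query's two scores together, which is what makes the interval "paired".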

## Reproduce

Prerequisites: Python 3.10+, Docker with Compose, and enough disk space for
`nomic-embed-text` plus `qwen2.5:1.5b`.

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r examples/benchmarks/temporal-memory-showdown/requirements.txt
```

Configure Moorcheh to use the same local Ollama models:

```bash
python - <<'PY'
from moorcheh.user_config import (
    EmbeddingConfig,
    LlmConfig,
    save_runtime_config,
)

save_runtime_config(
    EmbeddingConfig(provider="ollama", model="nomic-embed-text"),
    LlmConfig(provider="ollama", model="qwen2.5:1.5b"),
)
PY

python -m moorcheh up \
  --bundled-ollama \
  --embedding-provider ollama \
  --embedding-model nomic-embed-text
```

Run all systems:

```bash
python examples/benchmarks/temporal-memory-showdown/run_benchmark.py \
  --backends memanto,mem0-direct,mem0-agentic \
  --repeats 5
```

The runner writes machine-readable JSON and an audit-friendly Markdown table to
`results/latest.json` and `results/latest.md`.

Run only deterministic unit tests:

```bash
pytest examples/benchmarks/temporal-memory-showdown/tests -q
```

## Verified live result

The committed result was produced by
[GitHub Actions run 27441595257](https://github.com/2077196405-commits/memanto/actions/runs/27441595257)
on June 12, 2026, using a four-core Ubuntu runner:

| Metric | Memanto On-Prem | Mem0 agentic |
| --- | ---: | ---: |
| Golden concept coverage | 97.2% | 69.4% |
| Total ingestion time | 0.096s | 2912.082s |
| Query p95 | 0.0983s | 0.1032s |
| Retrieved context tokens | 1779 | 1793 |
| Extraction LLM tokens | 0 | 134,690 |

The paired coverage advantage is 27.8 percentage points, with a bootstrap 95%
confidence interval of 9.3 to 48.1 points. Memanto completed ingestion about
30,286 times faster while avoiding all extraction-model tokens.
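
The headline figures can be rechecked from the table with a few lines of arithmetic. The variable names are illustrative. Because the table values are rounded, the ratio computed here need not match the quoted speed-up exactly, which may have been derived from unrounded timings.

```python
# Values copied from the result table above.
memanto_coverage, mem0_coverage = 97.2, 69.4   # golden concept coverage, percent
memanto_ingest, mem0_ingest = 0.096, 2912.082  # total ingestion time, seconds

coverage_gap = round(memanto_coverage - mem0_coverage, 1)
speedup = mem0_ingest / memanto_ingest

print(f"coverage gap: {coverage_gap} percentage points")
print(f"ingestion speed-up from rounded values: about {speedup:,.0f}x")
```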

`mem0-direct` reached 98.6% coverage, but it deliberately disables Mem0's
normal extraction and reconciliation (`infer=False`). It is included as a
vector-only ablation, not the primary agentic competitor.

The run also exposed a limitation worth keeping visible: raw top-five context
from every backend can contain superseded values. The report therefore
separates required-concept coverage from strict contradiction-free accuracy
instead of hiding stale-value leakage.

## Experimental controls

- same records, order, queries, and `top_k=5`
- same embedding model and Ollama service
- fresh Memanto agent and fresh Mem0 collection per run
- one warm-up query pass before measured latency samples
- first measured pass used for accuracy and context-token totals
- no answer-generation model and no LLM-as-a-judge
- fixed bootstrap seed and fixed tokenizer accounting unit
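
The warm-up and latency controls can be sketched as a small timing loop. This is a hedged illustration: `measure_latency` and `run_query` are hypothetical names, and the inclusive percentile method from the standard library is an assumption about how p50 and p95 might be computed.

```python
import statistics
import time


def measure_latency(run_query, queries, repeats=5):
    """Time every query `repeats` times after one untimed warm-up pass."""
    for query in queries:  # warm-up pass: timings are discarded
        run_query(query)

    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            run_query(query)
            samples.append(time.perf_counter() - start)

    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples),
        "p50": cuts[49],
        "p95": cuts[94],
        "n": len(samples),
    }
```

With 18 queries and `repeats=5`, this would yield 90 measured samples per backend, plus 18 untimed warm-up calls.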

## Interpretation limits

The benchmark measures retrieval context, not final answer quality. Memanto's
server runs in Docker while Mem0's Qdrant runs in the Python process, so client
RSS is reported but is not treated as a total-system memory comparison.
`cl100k_base` is a stable cross-system accounting unit, not the native embedding
tokenizer. Exact internal LLM tokens are reported only where Ollama exposes
them.