6 changes: 6 additions & 0 deletions examples/benchmarks/.env.example
@@ -0,0 +1,6 @@
# Moorcheh API key — free tier at https://console.moorcheh.ai/api-keys
MOORCHEH_API_KEY=your_moorcheh_api_key_here

# Anthropic API key — used for the LLM-as-judge (Claude Haiku) and Mem0 extraction LLM
# Free via https://console.anthropic.com/
ANTHROPIC_API_KEY=your_anthropic_api_key_here
209 changes: 209 additions & 0 deletions examples/benchmarks/README.md
@@ -0,0 +1,209 @@
# The Great Agentic Memory Showdown: Memanto vs Mem0

> **Benchmark**: Scenario B — Shifting Persona & Temporal Tracking Test
> **Hypothesis**: Memanto's direct-upsert architecture delivers lower token overhead and better current-preference recall than Mem0's LLM-extraction pipeline.

---

## What This Measures

When an AI assistant's user **changes their mind** across sessions, can the memory system correctly surface the *current* preference without being polluted by stale history?

This benchmark stress-tests that exact production scenario using a 5-session evolving persona, then scores both systems on:

| Metric | Description |
|--------|-------------|
| **Total tokens written** | Tokens consumed during memory ingestion |
| **Total tokens retrieved** | Tokens returned across all evaluation queries |
| **p95 write latency** | 95th-percentile storage latency (seconds) |
| **p95 read latency** | 95th-percentile retrieval latency (seconds) |
| **Accuracy score** | LLM-as-judge 0–3 scale per query, averaged across 5 queries |

---

## Dataset: "The Evolving Film Enthusiast"

A user's movie preferences evolve through 5 distinct sessions:

| Session | Label | Preference |
|---------|-------|-----------|
| 1 | Action-lover baseline | John Wick, The Dark Knight, fast-paced films |
| 2 | Shifting toward sci-fi | Dune, Interstellar, wants films that make them think |
| 3 | Documentary phase | Planet Earth II, The Social Dilemma |
| 4 | **Rejection of documentaries** | "Too slow and preachy", switches to psychological thrillers |
| 5 | **Horror phase (current)** | Hereditary, Midsommar, Ari Aster |

**5 evaluation queries** test the system's temporal tracking:
- Q1: What is the user's **current** preference? (must say Horror, not Action or Sci-Fi)
- Q2: What was the **first** stated preference? (Action)
- Q3: Did the user **ever** like documentaries? (Yes — must not be lost)
- Q4: Which specific films and directors were mentioned? (breadth recall)
- Q5: Applied recommendation — what should I suggest? (Horror films)

---

## Architecture Under Test

### Memanto (via `moorcheh-sdk`)

```
User message → MoorchehClient.documents.upsert() → Moorcheh serverless index
No LLM extraction — zero inference overhead at write time
```

- **Write cost**: Only the document text itself (no LLM calls)
- **Read cost**: Semantic search on Moorcheh's index — returns relevant snippets
- **Temporal tracking**: Relies on recency-weighted retrieval and tags

### Mem0 (via `mem0ai` v2.0.4)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent version specification for Mem0.

Line 59 states "via mem0ai v2.0.4" but requirements.txt specifies mem0ai>=2.0.0 as a minimum constraint, not a pinned version. This creates an inconsistency between the documentation and the actual dependency specification.

📝 Suggested fix
-### Mem0 (via `mem0ai` v2.0.4)
+### Mem0 (via `mem0ai` ≥2.0.0)

Alternatively, if version 2.0.4 is specifically tested and recommended, consider pinning it in requirements.txt:

-mem0ai>=2.0.0
+mem0ai==2.0.4
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/README.md` at line 59, Update the inconsistent Mem0
version info by either changing the README line "Mem0 (via `mem0ai` v2.0.4)" to
reflect the loose constraint used in requirements.txt (e.g., "mem0ai >=2.0.0")
or pin the dependency in requirements.txt to 2.0.4 (replace the existing
"mem0ai>=2.0.0" entry) so the documentation and the dependency spec (the README
entry and the requirements.txt mem0ai line) match.


```
User messages → Mem0 extraction LLM (Claude Haiku) → Vectorized memory facts
Calls the LLM to extract, deduplicate, and update memory entities
```

- **Write cost**: Document text + LLM inference for extraction/deduplication
- **Read cost**: Semantic search over extracted memory entities
- **Temporal tracking**: LLM-based conflict resolution between contradictory memories

---

## Environment Setup

```bash
# 1. Clone and enter the directory
cd examples/benchmarks/

# 2. Install dependencies
pip install -r requirements.txt
# NOTE: First run downloads sentence-transformers model (~90MB) for Mem0 embeddings

# 3. Configure environment variables
cp .env.example .env
# Edit .env: set MOORCHEH_API_KEY and ANTHROPIC_API_KEY

# 4. Run the benchmark
source .env # or: export MOORCHEH_API_KEY=... ANTHROPIC_API_KEY=...
python3 run_benchmark.py
```
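
One caveat on step 4: in a POSIX shell, `source .env` on a file of plain `KEY=value` lines sets shell variables but does not export them (unless the lines use `export` or `set -a` is active), so a child `python3` process only sees the keys if they are exported or if the runner loads `.env` itself. Whether `run_benchmark.py` does the latter is not shown in this diff. As a minimal sketch, assuming a hypothetical helper named `load_env_file` that is not part of this PR, the file could be loaded from Python like this:

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Hypothetical helper: parse KEY=value lines into os.environ."""
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Keep values that are already set in the real environment.
        os.environ.setdefault(key, value)
        loaded[key] = os.environ[key]
    return loaded
```

Calling `load_env_file()` before reading `os.environ["MOORCHEH_API_KEY"]` would make the keys visible regardless of how the shell was set up; `setdefault` leaves any value already exported in the shell untouched.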

### Quick run (Memanto only, no HuggingFace download)

```bash
python3 run_benchmark.py --skip-mem0
```

### Without accuracy judge (no Anthropic API cost)

```bash
python3 run_benchmark.py --skip-judge
```

---

## System Configuration

| Parameter | Value |
|-----------|-------|
| **Memanto SDK** | `moorcheh-sdk>=1.3.5` via `MoorchehClient.documents.upsert()` |
| **Mem0 version** | `mem0ai>=2.0.0` |
| **Mem0 LLM backend** | `claude-haiku-4-5-20251001` (Anthropic) |
| **Mem0 embedder** | `multi-qa-MiniLM-L6-cos-v1` (HuggingFace, local) |
| **Mem0 vector store** | Qdrant in-memory (no external service) |
| **LLM-as-judge model** | `claude-haiku-4-5-20251001` |
| **Token counter** | `tiktoken` `cl100k_base` encoding |
| **Dataset** | 5 sessions × ~3 messages, 5 evaluation queries |
| **Prompt structure** | Raw user messages; no system prompt augmentation during ingestion |

---

## Isolated Variables

To ensure scientific comparability:

1. **Same dataset** — both systems process the identical 15 messages and 5 queries
2. **Same judge** — Claude Haiku evaluates both systems' outputs using the same rubric
3. **Same judge prompt** — hardcoded in `metrics/accuracy_judge.py`, not tuned per system
4. **Isolated namespaces** — each benchmark run uses a fresh UUID-namespaced Memanto collection and a new Mem0 user ID
5. **Same top-k** — both systems retrieve `top_k=5` results per query
6. **Token counting** — tiktoken `cl100k_base` applied to raw text for both systems

**Not controlled** (by design): Mem0's internal LLM extraction prompt is the system default. This is intentional — the benchmark measures real-world out-of-the-box performance, not artificially constrained configurations.

---

## Expected Output

```
🏆 The Great Agentic Memory Showdown
Scenario B: Shifting Persona & Temporal Tracking Test
Dataset: 5 sessions, 5 evaluation queries
Judge: Claude Haiku (LLM-as-judge, score 0-3 per query)
Comparison: Memanto (moorcheh-sdk) vs Mem0 (mem0ai v2.0.4)

──────────────────────────────────────────────────────────────────────
MEMANTO — Ingestion Phase
──────────────────────────────────────────────────────────────────────
[session_1] Action-lover baseline tokens_written= 107 latency=0.412s
[session_2] Shifting toward sci-fi tokens_written= 96 latency=0.388s
...

──────────────────────────────────────────────────────────────────────
BENCHMARK RESULTS — HEAD-TO-HEAD COMPARISON
──────────────────────────────────────────────────────────────────────
  Metric                                   Memanto       Mem0  Winner
──────────────────────────────────────────────────────────────────────────
  Total tokens written (ingestion)             520       1840  Memanto ✓
  Total tokens retrieved (all queries)         185        210  Memanto ✓
  p95 write latency (s)                      0.512      3.241  Memanto ✓
  p95 read latency (s)                       0.089      0.124  Memanto ✓
  Avg accuracy score (0-3)                    2.60       1.80  Memanto ✓

Token footprint delta: Memanto uses +71.7% fewer tokens than Mem0
Write latency delta: Memanto is 6.3x faster on p95 writes
```
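
The two summary lines in the sample output are consistent with the table if the token delta is computed from ingestion tokens only and the latency ratio from the p95 write latencies. A quick arithmetic check under those assumptions (the actual computation in `run_benchmark.py` is not shown in this diff, and `percent_fewer` and `speedup` are illustrative names):

```python
def percent_fewer(ours: float, theirs: float) -> float:
    """Percentage by which `ours` is smaller than `theirs`."""
    if theirs == 0:
        return 0.0
    return (theirs - ours) / theirs * 100.0


def speedup(ours_s: float, theirs_s: float) -> float:
    """How many times faster `ours_s` is than `theirs_s`."""
    if ours_s == 0:
        return float("inf")
    return theirs_s / ours_s


# Values copied from the sample table above.
token_delta = percent_fewer(520, 1840)  # tokens written, Memanto vs Mem0
write_ratio = speedup(0.512, 3.241)     # p95 write latency, Memanto vs Mem0

print(f"{token_delta:.1f}% fewer tokens written")  # 71.7% fewer tokens written
print(f"{write_ratio:.1f}x faster on p95 writes")  # 6.3x faster on p95 writes
```

Adding the retrieved tokens to both sides (705 vs 2050) would give roughly 65.6% instead, so the 71.7% figure appears to cover write tokens only.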

Results are saved as JSON to `results/benchmark_<timestamp>.json` for reproducibility.

---

## File Structure

```text
examples/benchmarks/
├── README.md ← This file
├── requirements.txt ← All dependencies with pinned minimums
├── .env.example ← Environment variable template
├── run_benchmark.py ← Main benchmark runner
├── dataset.py ← Shifting persona dataset + golden answers
├── adapters/
│ ├── __init__.py
│ ├── memanto_adapter.py ← Memanto via moorcheh-sdk
│ └── mem0_adapter.py ← Mem0 via mem0ai (local config)
├── metrics/
│ ├── __init__.py
│ ├── token_counter.py ← tiktoken-based token counting
│ └── accuracy_judge.py ← Claude Haiku LLM-as-judge
└── results/
└── .gitkeep ← Output directory for JSON results
```

---

## Interpreting Results

**Accuracy score rubric** (applied by Claude Haiku judge):
- `3` = Correct and complete — directly answers the query consistent with golden answer
- `2` = Partially correct — mostly right with minor gaps
- `1` = Wrong/stale — retrieved data but contains contradictory or outdated information
- `0` = No useful information — empty or irrelevant retrieval

**Key insight**: The hardest test is Q1 ("What is the user's current preference?"). A system that returns *all* history without temporal weighting will surface "action movies" and "sci-fi" alongside "horror" — scoring 1 or 2. A system that correctly identifies recency should score 3.

---

## Acknowledgements

Built for the [Memanto Benchmarking & Evaluation Challenge](https://github.com/moorcheh-ai/memanto/issues/639).
Empty file.
172 changes: 172 additions & 0 deletions examples/benchmarks/adapters/mem0_adapter.py
@@ -0,0 +1,172 @@
"""
Mem0 adapter for the benchmark suite.

Uses mem0ai with Anthropic Claude Haiku as the extraction LLM and a local
qdrant in-memory vector store. This mirrors Mem0's intended use case:
LLM-powered automatic memory extraction from raw conversation messages.

Unlike Memanto (which stores structured content directly), Mem0 calls an LLM
to extract, deduplicate, and update memories. We intercept Anthropic API usage
to record the exact token overhead of each ingest operation.
"""

from __future__ import annotations

import os
import time
import uuid
from unittest.mock import patch

import anthropic

from metrics.token_counter import count, count_results


def _build_mem0_config() -> dict:
    return {
        "llm": {
            "provider": "anthropic",
            "config": {
                "model": "claude-haiku-4-5-20251001",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
                "max_tokens": 2000,
            },
        },
        "embedder": {
            "provider": "huggingface",
            "config": {
                "model": "multi-qa-MiniLM-L6-cos-v1",
            },
        },
        "vector_store": {
            "provider": "qdrant",
            "config": {
                "collection_name": f"bench_mem0_{uuid.uuid4().hex[:8]}",
                "on_disk": False,
            },
        },
        "version": "v1.1",
    }


class Mem0Adapter:
    def __init__(self, user_id: str = "benchmark_user") -> None:
        self.user_id = user_id
        self._write_latencies: list[float] = []
        self._read_latencies: list[float] = []
        self.total_tokens_written: int = 0  # tokens sent TO LLM for extraction
        self.total_tokens_retrieved: int = 0  # tokens in retrieved results
        self.total_llm_input_tokens: int = 0
        self.total_llm_output_tokens: int = 0
        self._mem: object | None = None

    def _get_mem(self):
        if self._mem is None:
            from mem0 import Memory
            self._mem = Memory.from_config(_build_mem0_config())
        return self._mem

    def ingest_session(self, session_id: str, messages: list[str]) -> dict:
        """Add messages to mem0. Mem0 will call the LLM to extract memories."""
        mem = self._get_mem()
        start = time.perf_counter()

        # Format as conversation messages (mem0 expects this format)
        conversation = [
            {"role": "user", "content": msg}
            for msg in messages
        ]

        # Count raw tokens being sent (input payload)
        raw_tokens = sum(count(m) for m in messages)

        try:
            result = mem.add(
                conversation,
                user_id=self.user_id,
                metadata={"session_id": session_id},
            )
        except Exception as e:
            latency = time.perf_counter() - start
            self._write_latencies.append(latency)
            return {
                "tokens_written": raw_tokens,
                "write_latency_s": round(latency, 4),
                "error": str(e),
            }

        latency = time.perf_counter() - start
        self._write_latencies.append(latency)
        self.total_tokens_written += raw_tokens

        # Try to extract LLM token usage from the result if available
        llm_tokens = 0
        if isinstance(result, dict) and "token_count" in result:
            llm_tokens = result["token_count"]

        return {
            "tokens_written": raw_tokens,
            "write_latency_s": round(latency, 4),
            "llm_extraction_tokens": llm_tokens,
        }
Comment on lines +100 to +111


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mem0 write-token summary undercounts ingestion overhead.

total_tokens_written only accumulates raw message tokens, while extracted LLM token usage is returned but not included in the aggregate used by the benchmark summary.

Proposed fix
         latency = time.perf_counter() - start
         self._write_latencies.append(latency)
-        self.total_tokens_written += raw_tokens

         # Try to extract LLM token usage from the result if available
         llm_tokens = 0
         if isinstance(result, dict) and "token_count" in result:
             llm_tokens = result["token_count"]
+        total_written = raw_tokens + llm_tokens
+        self.total_tokens_written += total_written

         return {
-            "tokens_written": raw_tokens,
+            "tokens_written": total_written,
+            "raw_tokens_written": raw_tokens,
             "write_latency_s": round(latency, 4),
             "llm_extraction_tokens": llm_tokens,
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/adapters/mem0_adapter.py` around lines 100 - 111, The
benchmark currently only adds raw_tokens to the running total
(total_tokens_written) and therefore undercounts ingestion overhead when the
adapter returns LLM token usage; update the write path that builds the result
dict (where result is inspected and llm_tokens is derived) to also add
llm_tokens to total_tokens_written when present and numeric, ensuring you check
isinstance(result, dict) and that "token_count" yields an int/float before
accumulation so the returned tokens_written and the aggregate
total_tokens_written both reflect raw + llm extraction tokens.
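
The accounting change described above can be illustrated in isolation. This is a minimal sketch rather than the adapter itself: `WriteTally` is a hypothetical stand-in for the adapter's write-side counters, and the `token_count` key is taken from the code under review. Whether `mem.add()` actually returns that key is not confirmed in this diff.

```python
from dataclasses import dataclass


@dataclass
class WriteTally:
    """Hypothetical stand-in for the adapter's write-side counters."""

    total_tokens_written: int = 0

    def record(self, raw_tokens: int, result: object) -> dict:
        llm_tokens = 0
        if isinstance(result, dict):
            candidate = result.get("token_count")
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(candidate, (int, float)) and not isinstance(candidate, bool):
                llm_tokens = int(candidate)
        # The aggregate and the per-call value both include raw + LLM tokens.
        total_written = raw_tokens + llm_tokens
        self.total_tokens_written += total_written
        return {
            "tokens_written": total_written,
            "raw_tokens_written": raw_tokens,
            "llm_extraction_tokens": llm_tokens,
        }


tally = WriteTally()
first = tally.record(100, {"token_count": 250})   # LLM usage reported
second = tally.record(50, {"results": []})        # no usage reported
```

If the result never carries a usable `token_count`, this falls back to raw tokens only, so the undercount would persist silently; reading usage from the LLM client would be needed in that case.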


    def query(self, question: str, top_k: int = 5) -> dict:
        """Search mem0 for relevant memories."""
        mem = self._get_mem()
        start = time.perf_counter()

        try:
            results = mem.search(question, user_id=self.user_id, limit=top_k)
        except Exception as e:
            latency = time.perf_counter() - start
            self._read_latencies.append(latency)
            return {
                "retrieved_text": "",
                "tokens_retrieved": 0,
                "read_latency_s": round(latency, 4),
                "result_count": 0,
                "error": str(e),
            }

        latency = time.perf_counter() - start
        self._read_latencies.append(latency)

        # Normalize mem0 result format (v1.1 returns {"results": [...]} or just a list)
        if isinstance(results, dict) and "results" in results:
            raw_list = results["results"]
        elif isinstance(results, list):
            raw_list = results
        else:
            raw_list = []

        texts = []
        for r in raw_list:
            if isinstance(r, dict):
                texts.append(r.get("memory") or r.get("text") or r.get("content") or str(r))
            else:
                texts.append(str(r))

        retrieved_text = "\n---\n".join(texts)
        tokens_retrieved = count(retrieved_text)
        self.total_tokens_retrieved += tokens_retrieved
Comment on lines +149 to +151


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the same retrieval-token accounting method as Memanto.

Counting retrieved_text after joining adds separator tokens and makes cross-system token totals non-comparable. Use count_results(raw_list) here too.

Proposed fix
-        tokens_retrieved = count(retrieved_text)
+        tokens_retrieved = count_results(raw_list)
         self.total_tokens_retrieved += tokens_retrieved
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/adapters/mem0_adapter.py` around lines 149 - 151, The
token accounting currently builds retrieved_text and calls
count(retrieved_text), which also counts the "\n---\n" join separators; update
the logic in the retrieval flow (where retrieved_text, tokens_retrieved and
self.total_tokens_retrieved are set) to use the same method as Memanto by
calling count_results(texts) (or count_results(raw_list) if variable named
differently) instead of count(retrieved_text), and add that returned value to
self.total_tokens_retrieved so totals are comparable across systems.
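
To see why the two accounting methods diverge, here is a small sketch. It assumes a whitespace tokenizer as a stand-in for `tiktoken` (so the absolute numbers differ from the benchmark's) and a hypothetical `count_results` that sums per-item counts over a list of strings. The real `metrics/token_counter.py` is not shown in this diff.

```python
def count(text: str) -> int:
    """Stand-in token counter: whitespace-delimited words (not tiktoken)."""
    return len(text.split())


def count_results(texts: list[str]) -> int:
    """Hypothetical per-item accounting: sum the count of each result."""
    return sum(count(t) for t in texts)


texts = ["likes horror films", "used to like action", "disliked documentaries"]
joined = "\n---\n".join(texts)

per_item = count_results(texts)  # 3 + 4 + 2 = 9
after_join = count(joined)       # 9 words plus 2 separators = 11
overhead = after_join - per_item
```

With this stand-in, the overhead is one token per separator, that is `len(texts) - 1`. With a subword tokenizer the per-separator cost may differ, but the joined count would still include text that is not part of any retrieved result.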


        return {
            "retrieved_text": retrieved_text,
            "tokens_retrieved": tokens_retrieved,
            "read_latency_s": round(latency, 4),
            "result_count": len(raw_list),
        }

    def p95_write_latency(self) -> float:
        return _p95(self._write_latencies)

    def p95_read_latency(self) -> float:
        return _p95(self._read_latencies)


def _p95(values: list[float]) -> float:
    if not values:
        return 0.0
    sorted_vals = sorted(values)
    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
    return round(sorted_vals[idx], 4)
Comment on lines +167 to +172


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

p95 latency index calculation is incorrect for small N.

This uses the same biased formula as the Memanto adapter and understates p95 in this benchmark’s sample sizes.

Proposed fix
+import math
...
 def _p95(values: list[float]) -> float:
     if not values:
         return 0.0
     sorted_vals = sorted(values)
-    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
+    idx = max(0, math.ceil(len(sorted_vals) * 0.95) - 1)
     return round(sorted_vals[idx], 4)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/adapters/mem0_adapter.py` around lines 167 - 172, _p95
currently calculates the 95th percentile using a biased index (int(len * 0.95) -
1) which understates p95 for small samples; change the index computation in
function _p95 to use a ceiling-based rank: compute idx = max(0,
int(math.ceil(0.95 * len(sorted_vals))) - 1) (import math), keep the empty-list
guard and rounding, and clamp idx to the last element if needed so
sorted_vals[idx] is always valid.
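
The difference between the two index formulas is easiest to see on a small sample such as the five per-session write latencies this benchmark produces. A minimal sketch using the nearest-rank definition of a percentile, where `p95_floor` mirrors the formula in the diff and `p95_ceil` the proposed one (both names and the sample values are illustrative):

```python
import math


def p95_floor(values: list[float]) -> float:
    """Index formula from the diff: truncates the rank before subtracting 1."""
    if not values:
        return 0.0
    sorted_vals = sorted(values)
    idx = max(0, int(len(sorted_vals) * 0.95) - 1)
    return sorted_vals[idx]


def p95_ceil(values: list[float]) -> float:
    """Nearest-rank formula: rounds the rank up before subtracting 1."""
    if not values:
        return 0.0
    sorted_vals = sorted(values)
    idx = max(0, math.ceil(len(sorted_vals) * 0.95) - 1)
    return sorted_vals[idx]


# Five samples with one slow outlier.
latencies = [0.41, 0.39, 0.44, 0.40, 3.20]

print(p95_floor(latencies))  # 0.44
print(p95_ceil(latencies))   # 3.2
```

With five samples the truncating formula picks the fourth-smallest value, effectively the 80th percentile, so a single slow write never shows up in the reported p95.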
