Merged
Changes from 2 commits
4 changes: 2 additions & 2 deletions AGENTS.md
@@ -47,8 +47,8 @@ The benchmark system uses **snapshot-based evaluation**: ingest once, eval many
| Tier | Benchmark | Command | Runtime | Cost | When to use |
|------|-----------|---------|---------|------|-------------|
| 0 | Unit tests | `make test` | 30s | free | Every change |
| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free | Rapid iteration |
| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free | Before merge |
| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free / ~$0.20 with judge | Rapid iteration |
| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free / ~$1-3 with judge | Before merge |
| 3 | LongMemEval-mini (20 Qs) | `make bench-mini-longmemeval` | 15 min | ~$1 | Scoring/entity changes |
| 4 | LongMemEval-full (500 Qs) | `make test-longmemeval` | 1-2 hr | ~$10 | Milestones only |

28 changes: 19 additions & 9 deletions README.md
@@ -14,7 +14,7 @@

# **AI Memory That Actually Learns**

AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **89.27% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for current baselines.
AutoMem is a **production-grade long-term memory system** for AI assistants with transparent [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) baselines (ACL 2024): **89.27%** on `locomo-mini` categories 1-4 with category 5 skipped, and **87.56%** on full `locomo` with the opt-in category-5 judge enabled. See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for methodology and current baselines.

**Deploy in 60 seconds:**

@@ -522,7 +522,7 @@ Vector databases match embeddings. AutoMem builds knowledge graphs:

AutoMem saves you months of iteration:

- ✅ **Benchmark-proven** - 89.27% on LoCoMo (ACL 2024), beats CORE SOTA
- ✅ **Benchmark-proven** - Transparent LoCoMo baselines for both judge-off and judge-on evaluation
- ✅ **Research-validated** - Implements HippoRAG 2, A-MEM, MELODI, ReadAgent principles
- ✅ **Production-ready** - Auth, admin tools, health monitoring, automated backups
- ✅ **Battle-tested** - Enrichment pipeline, consolidation engine, retry logic, dual storage
@@ -532,26 +532,36 @@ AutoMem saves you months of iteration:

### LoCoMo Benchmark (ACL 2024)

**89.27% accuracy** on categories 1–4 (233 scored questions, Voyage 4 embeddings):
AutoMem publishes two reference baselines with Voyage 4 embeddings:

| Setup | Scope | Score | Notes |
|-------|-------|-------|-------|
| Fast iteration | `locomo-mini`, judge off | **89.27% (208/233)** | Categories 1-4 only; 71 category-5 questions skipped |
| Full benchmark | `locomo`, judge on (`gpt-4o`) | **87.56% (1739/1986)** | Includes category 5 at 95.74% (427/446) |

`locomo-mini` category breakdown with the judge disabled:

| Category | AutoMem | Notes |
| -------------------------- | ---------- | --------------------------------------- |
| **Open Domain** | **96.49%** | General knowledge recall |
| **Temporal Understanding** | **92.06%** | Time-aware queries |
| **Single-hop Recall** | **79.07%** | Basic fact retrieval |
| **Multi-hop Reasoning** | **46.15%** | Connecting disparate memories |
| **Complex Reasoning** | N/A | Requires LLM judge (not yet scored) |
| **Complex Reasoning** | N/A | Skipped in this setup; use judge-on run |

**Comparison with other systems:**
Reference point:

| System | Score |
|--------|-------|
| AutoMem | 89.27% |
| CORE | 88.24% |
| Published CORE result | 88.24% |
| AutoMem `locomo-mini` judge off | 89.27% |
| AutoMem `locomo` judge on | 87.56% |

> **Note:** Earlier versions reported 90.53% which included two evaluator bugs: temporal matching compared the wrong text (false negatives → 22%) and category 5 matched empty strings (false positives → 100%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for full history.
> **Methodology note:** We do not present this as a strict leaderboard claim. The published CORE number is a useful reference point, but the public LoCoMo setups are not perfectly apples-to-apples, especially around category-5 handling. AutoMem is above that published reference on the `locomo-mini` categories 1-4 run and below it on the full judge-enabled run.
>
> **History note:** Earlier versions reported 90.53%, but that included two evaluator bugs: temporal matching compared the wrong text (false negatives) and category 5 matched empty strings (false positives). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for the corrected timeline.

Run benchmarks: `make bench-eval BENCH=locomo-mini` (quick) or `make bench-eval BENCH=locomo` (full)
Run benchmarks: `make bench-eval BENCH=locomo-mini CONFIG=baseline` (quick) or `BENCH_JUDGE_MODEL=gpt-4o make bench-eval BENCH=locomo CONFIG=baseline` (full, includes category 5)

### Production Characteristics

8 changes: 5 additions & 3 deletions benchmarks/EXPERIMENT_LOG.md
@@ -10,8 +10,8 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
| Tier | Benchmark | Runtime | Cost | When to use |
|------|-----------|---------|------|-------------|
| 0 | `make test` (unit) | 30s | free | Every change |
| 1 | `locomo-mini` (2 convos, 304 Qs) | 2-3 min | free | Rapid iteration |
| 2 | `locomo` (10 convos, 1986 Qs) | 5-10 min | free | Before merge |
| 1 | `locomo-mini` (2 convos, 304 Qs) | 2-3 min | free / ~$0.20 with judge | Rapid iteration |
| 2 | `locomo` (10 convos, 1986 Qs) | 5-10 min | free / ~$1-3 with judge | Before merge |
| 3 | `longmemeval-mini` (20 Qs) | 15 min | ~$1 | Scoring/entity changes |
| 4 | `longmemeval` (500 Qs) | 1-2 hr | ~$10 | Milestones only |

@@ -26,16 +26,18 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
| 2026-03-02 | #78 | exp/78-decay-fix | 76.97% (+0.0) | 79.51% (-0.55) | -- | Decay rate 0.1→0.01, importance floor, archive filter. Within variance. Impact is on production (rehabilitated via rescore) |
| 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
| 2026-03-10 | eval-fix | docs/benchmark-agent-guidelines | **89.27% (208/233)** | -- | -- | Fix temporal matching (answer vs memory dates) + skip cat5 (no ground truth). Honest score, beats CORE by 1.03pp. |
| 2026-03-10 | cat5-judge | feat/bench-cat5-judge | **89.80% (273/304)** | **87.56% (1739/1986)** | -- | Opt-in GPT-4o judge for cat5. Full run scored cat5 at 95.74% (427/446) with 0 judge skips/errors; added 90s OpenAI request timeout to prevent stuck full runs. |

### Category Breakdown (LoCoMo-mini)

Categories 1-4 scored by word-overlap/date matching. Category 5 requires LLM judge (not yet implemented).
Categories 1-4 are scored by word-overlap/date matching. Category 5 uses an opt-in LLM judge when `BENCH_JUDGE_MODEL` or `--judge` is enabled; otherwise it remains `N/A`.

| Date | Issue/PR | Single-hop | Temporal | Multi-hop | Open Domain | Complex |
|------|----------|------------|----------|-----------|-------------|---------|
| 2026-03-02 | baseline | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
| 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
| 2026-03-10 | eval-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | N/A (71 skipped) |
| 2026-03-10 | cat5-judge | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | **91.5% (65/71)** |

\* Temporal was artificially low: evaluator compared question dates (empty) vs memory dates instead of answer dates.
\*\* Complex was artificially 100%: dataset has no `answer` field for cat5 → empty string → `"" in content` always True.
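
The second footnote can be reproduced in isolation. The sketch below is illustrative only: `naive_match` and `guarded_match` are hypothetical names, not functions from `tests/benchmarks/test_locomo.py`, and the sample question and memory text are made up. It shows why a missing `answer` field turned every category-5 question into a hit.

```python
def naive_match(answer: str, content: str) -> bool:
    """Buggy check: plain substring containment with no guard for empty answers."""
    return answer in content


def guarded_match(answer: str, content: str) -> bool:
    """Same check, but an empty or whitespace-only answer never counts as a hit."""
    answer = answer.strip()
    return bool(answer) and answer in content


# Hypothetical category-5 item with no `answer` field, so the lookup falls back to "".
question = {"question": "What would Alice likely do next?", "category": 5}
answer = question.get("answer", "")
content = "Alice mentioned she adopted a dog in May."

print(naive_match(answer, content))    # True: "" is a substring of every string
print(guarded_match(answer, content))  # False
```

The fix recorded in the log skips category 5 when no judge is configured instead of guarding the match. The guard above is just one possible way to make the failure visible.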
54 changes: 42 additions & 12 deletions docs/TESTING.md
@@ -218,6 +218,27 @@ make test-integration

AutoMem can be evaluated against the **LoCoMo benchmark** (ACL 2024), which tests long-term conversational memory across 10 conversations and 1,986 questions.

### LoCoMo Cat-5 Judge

Category 5 covers evidence-grounded complex reasoning. Scoring it requires an LLM judge, so it is opt-in for cost reasons.

```bash
# Default: categories 1-4 scored, category 5 skipped
make bench-eval BENCH=locomo-mini CONFIG=baseline

# Enable cat-5 judge with env var
BENCH_JUDGE_MODEL=gpt-4o make bench-eval BENCH=locomo-mini CONFIG=baseline

# Or use the runner CLI flags directly
./test-locomo-benchmark.sh --conversations 0,1 --judge
./test-locomo-benchmark.sh --conversations 0,1 --judge-model gpt-4o-mini
```

- `BENCH_JUDGE_MODEL` enables category-5 judging for `tests/benchmarks/test_locomo.py`.
- `--judge` and `--judge-model` both enable the judge; `--judge` defaults to `gpt-4o` unless overridden by `BENCH_JUDGE_MODEL` or `--judge-model`.
- If the judge is disabled, category 5 remains `N/A`.
- If the judge is enabled but evidence is missing or the LLM response is invalid, the affected category-5 questions are skipped rather than counted wrong.
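
The rules above can be sketched as a small resolver. This is a minimal sketch, not the runner's actual code: `resolve_judge_model` is a hypothetical helper, and the relative precedence of `--judge-model` over `BENCH_JUDGE_MODEL` is an assumption, since the list only says that either one overrides the `gpt-4o` default.

```python
import os
from typing import Mapping, Optional

DEFAULT_JUDGE_MODEL = "gpt-4o"


def resolve_judge_model(
    judge_flag: bool = False,
    judge_model_flag: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the judge model to use, or None when category 5 stays N/A."""
    env = os.environ if env is None else env
    env_model = env.get("BENCH_JUDGE_MODEL", "").strip()

    if judge_model_flag:
        # Assumption: an explicit --judge-model wins over the environment variable.
        return judge_model_flag
    if env_model:
        # BENCH_JUDGE_MODEL enables the judge on its own and overrides the default.
        return env_model
    if judge_flag:
        # Bare --judge falls back to the documented default.
        return DEFAULT_JUDGE_MODEL
    return None  # judge disabled: category 5 is skipped


print(resolve_judge_model(env={}))                                     # None
print(resolve_judge_model(judge_flag=True, env={}))                    # gpt-4o
print(resolve_judge_model(judge_model_flag="gpt-4o-mini", env={}))     # gpt-4o-mini
print(resolve_judge_model(env={"BENCH_JUDGE_MODEL": "gpt-4o"}))        # gpt-4o
```

Passing `env` explicitly keeps the function testable without touching the real process environment.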

### What is LoCoMo?

LoCoMo evaluates AI systems' ability to remember and reason across very long conversations (300+ turns). It measures performance across 5 categories:
@@ -228,7 +249,14 @@ LoCoMo evaluates AI systems' ability to remember and reason across very long con
4. **Open Domain** (Category 4) - General knowledge questions
5. **Complex Reasoning** (Category 5) - Advanced inference tasks

**Comparison**: CORE achieved 88.24% (June 2025). AutoMem achieved 90.53%.
Published reference point: CORE is widely cited at **88.24%** (June 2025), but public LoCoMo setups are not perfectly apples-to-apples, especially around category-5 handling.

AutoMem currently publishes two LoCoMo baselines:

| Setup | Scope | Score | Notes |
|------|-------|-------|-------|
| `locomo-mini`, judge off | 2 conversations, categories 1-4 only | **89.27% (208/233)** | 71 category-5 questions skipped |
| `locomo`, judge on (`gpt-4o`) | Full 10 conversations | **87.56% (1739/1986)** | Category 5 scored at 95.74% (427/446) |
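
The percentages in this table follow directly from the counts, with skipped questions removed from the denominator rather than counted as wrong. The check below is a sketch that uses only figures quoted in this PR; the `accuracy` helper is a hypothetical name, not part of the benchmark runner.

```python
def accuracy(correct: int, scored: int) -> float:
    """Percentage of correct answers among scored (non-skipped) questions."""
    if scored <= 0:
        raise ValueError("no scored questions")
    return round(100.0 * correct / scored, 2)


# locomo-mini, judge off: 304 questions total, 71 category-5 questions skipped.
mini_scored = 304 - 71
print(mini_scored)                 # 233
print(accuracy(208, mini_scored))  # 89.27

# locomo, judge on: every question is scored, including category 5.
print(accuracy(1739, 1986))        # 87.56
print(accuracy(427, 446))          # 95.74 (category 5 only)

# Gap to the published CORE reference (88.24%) on the full judge-on run.
print(round(88.24 - accuracy(1739, 1986), 2))  # 0.68
```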

### Running the Benchmark

@@ -265,24 +293,26 @@ Memory usage:
Example benchmark output:
```text
📊 FINAL RESULTS
🎯 Overall Accuracy: 90.53% (1798/1986)
⏱️ Total Time: 1665s
🎯 Overall Accuracy: 87.56% (1739/1986)
⏱️ Total Time: 3497s
💾 Total Memories Stored: 5882

📈 Category Breakdown:
Single-hop Recall : 79.79% (225/282)
Temporal Understanding : 85.05% (273/321)
Multi-hop Reasoning : 50.00% ( 48/ 96)
Open Domain : 95.84% (806/841)
Complex Reasoning : 100.00% (446/446)
Single-hop Recall : 66.31% (187/282)
Temporal Understanding : 87.23% (280/321)
Multi-hop Reasoning : 45.83% ( 44/ 96)
Open Domain : 95.24% (801/841)
Complex Reasoning : 95.74% (427/446)

📊 Comparison:
📊 Comparison with published CORE reference:
CORE: 88.24%
AutoMem: 90.53%
AutoMem: 87.56%
📉 AutoMem is 0.68% behind that reference
```

All benchmark reports live in `tests/benchmarks/`.
```
If you run without the judge, category 5 will show as `N/A` and the comparison should be treated as directional rather than apples-to-apples.

Current baselines and methodology notes live in `benchmarks/EXPERIMENT_LOG.md`.

### AutoMem's Advantages

8 changes: 6 additions & 2 deletions scripts/bench/restore_and_eval.sh
@@ -6,6 +6,10 @@ set -euo pipefail
BENCH_NAME="${1:-locomo}"
CONFIG="${2:-baseline}"
REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
PYTHON_BIN="${REPO_ROOT}/venv/bin/python"
if [[ ! -x "$PYTHON_BIN" ]]; then
    PYTHON_BIN="python3"
fi

# Shared utilities (colors + wait_for_api)
source "$(dirname "$0")/../lib/common.sh"
@@ -90,7 +94,7 @@ if [[ "$BENCH_NAME" == locomo* ]]; then
    if [[ "$BENCH_NAME" == "locomo-mini" ]]; then
        EVAL_ARGS="--conversations 0,1 ${EVAL_ARGS}"
    fi
    python3 tests/benchmarks/test_locomo.py \
    "$PYTHON_BIN" tests/benchmarks/test_locomo.py \
        --base-url "$AUTOMEM_TEST_BASE_URL" \
        --api-token "$AUTOMEM_TEST_API_TOKEN" \
        ${EVAL_ARGS}
@@ -99,7 +103,7 @@ elif [[ "$BENCH_NAME" == longmemeval* ]]; then
    if [[ "$BENCH_NAME" == "longmemeval-mini" ]]; then
        LONGMEM_ARGS+=(--max-questions 20)
    fi
    python3 tests/benchmarks/longmemeval/test_longmemeval.py \
    "$PYTHON_BIN" tests/benchmarks/longmemeval/test_longmemeval.py \
        --base-url "$AUTOMEM_TEST_BASE_URL" \
        --api-token "$AUTOMEM_TEST_API_TOKEN" \
        "${LONGMEM_ARGS[@]}"
30 changes: 29 additions & 1 deletion test-locomo-benchmark.sh
@@ -19,12 +19,20 @@ SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
# Shared utilities (colors + wait_for_api)
source "${SCRIPT_DIR}/scripts/lib/common.sh"

if [ -x "${SCRIPT_DIR}/venv/bin/python" ]; then
    PYTHON_BIN="${SCRIPT_DIR}/venv/bin/python"
else
    PYTHON_BIN="python3"
fi

# Default configuration
RUN_LIVE=false
CONVERSATIONS=""
RECALL_LIMIT=10
NO_CLEANUP=false
OUTPUT_FILE=""
JUDGE=false
JUDGE_MODEL=""

# Parse arguments
while [[ $# -gt 0 ]]; do
@@ -45,6 +53,15 @@ while [[ $# -gt 0 ]]; do
            OUTPUT_FILE="$2"
            shift 2
            ;;
        --judge)
            JUDGE=true
            shift
            ;;
        --judge-model)
            JUDGE=true
            JUDGE_MODEL="$2"
            shift 2
            ;;
        --conversations)
            CONVERSATIONS="$2"
            shift 2
@@ -56,6 +73,8 @@ while [[ $# -gt 0 ]]; do
echo " --live Run against Railway deployment (default: local Docker)"
echo " --recall-limit N Number of memories to recall per question (default: 10)"
echo " --conversations I,J Comma-separated conversation indices (e.g. 0,1 for mini mode)"
echo " --judge Enable category-5 LLM judge (defaults to gpt-4o)"
echo " --judge-model MODEL Set the category-5 judge model (also enables judge)"
echo " --no-cleanup Don't cleanup test data after evaluation"
echo " --output FILE Save results to JSON file"
echo " --help, -h Show this help message"
@@ -64,6 +83,7 @@ while [[ $# -gt 0 ]]; do
echo " $0 # Run locally"
echo " $0 --live # Run against Railway"
echo " $0 --conversations 0,1 # Mini mode (2 conversations)"
echo " $0 --conversations 0,1 --judge # Mini mode with cat-5 judge"
echo " $0 --recall-limit 20 --output results.json"
exit 0
;;
@@ -159,7 +179,7 @@ else
fi

# Build python command
PYTHON_CMD="python3 $SCRIPT_DIR/tests/benchmarks/test_locomo.py"
PYTHON_CMD="$PYTHON_BIN $SCRIPT_DIR/tests/benchmarks/test_locomo.py"
PYTHON_CMD="$PYTHON_CMD --base-url $AUTOMEM_TEST_BASE_URL"
PYTHON_CMD="$PYTHON_CMD --api-token $AUTOMEM_TEST_API_TOKEN"
PYTHON_CMD="$PYTHON_CMD --recall-limit $RECALL_LIMIT"
@@ -176,6 +196,14 @@ if [ -n "$OUTPUT_FILE" ]; then
    PYTHON_CMD="$PYTHON_CMD --output $OUTPUT_FILE"
fi

if [ "$JUDGE" = true ]; then
    PYTHON_CMD="$PYTHON_CMD --judge"
fi

if [ -n "$JUDGE_MODEL" ]; then
    PYTHON_CMD="$PYTHON_CMD --judge-model $JUDGE_MODEL"
fi

echo ""
echo -e "${BLUE}🚀 Starting benchmark evaluation...${NC}"
echo ""
2 changes: 2 additions & 0 deletions tests/benchmarks/BENCHMARK_2025-11-08.md
@@ -1,5 +1,7 @@
# AutoMem Benchmark Results

> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so these scores and comparisons are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology.

## LoCoMo Benchmark (Long-term Conversational Memory)

**Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations)
2 changes: 2 additions & 0 deletions tests/benchmarks/BENCHMARK_2025-11-20.md
@@ -1,5 +1,7 @@
# AutoMem Benchmark Results

> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so the scores and any "SOTA" language here are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology.

## LoCoMo Benchmark (Long-term Conversational Memory)

**Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations)
2 changes: 2 additions & 0 deletions tests/benchmarks/BENCHMARK_2025-12-02.md
@@ -1,5 +1,7 @@
# AutoMem Benchmark Results

> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so the scores and any "SOTA" language here are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology.

## LoCoMo Benchmark (Long-term Conversational Memory)

**Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations)