Merged
4 changes: 2 additions & 2 deletions AGENTS.md
@@ -47,8 +47,8 @@ The benchmark system uses **snapshot-based evaluation**: ingest once, eval many
| Tier | Benchmark | Command | Runtime | Cost | When to use |
|------|-----------|---------|---------|------|-------------|
| 0 | Unit tests | `make test` | 30s | free | Every change |
| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free | Rapid iteration |
| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free | Before merge |
| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free / ~$0.20 with judge | Rapid iteration |
| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free / ~$1-3 with judge | Before merge |
| 3 | LongMemEval-mini (20 Qs) | `make bench-mini-longmemeval` | 15 min | ~$1 | Scoring/entity changes |
| 4 | LongMemEval-full (500 Qs) | `make test-longmemeval` | 1-2 hr | ~$10 | Milestones only |

28 changes: 19 additions & 9 deletions README.md
@@ -14,7 +14,7 @@

# **AI Memory That Actually Learns**

AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **89.27% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for current baselines.
AutoMem is a **production-grade long-term memory system** for AI assistants with transparent [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) baselines (ACL 2024): **89.27%** on `locomo-mini` categories 1-4 with category 5 skipped, and **87.56%** on full `locomo` with the opt-in category-5 judge enabled. See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for methodology and current baselines.

**Deploy in 60 seconds:**

@@ -522,7 +522,7 @@ Vector databases match embeddings. AutoMem builds knowledge graphs:

AutoMem saves you months of iteration:

- ✅ **Benchmark-proven** - 89.27% on LoCoMo (ACL 2024), beats CORE SOTA
- ✅ **Benchmark-proven** - Transparent LoCoMo baselines for both judge-off and judge-on evaluation
- ✅ **Research-validated** - Implements HippoRAG 2, A-MEM, MELODI, ReadAgent principles
- ✅ **Production-ready** - Auth, admin tools, health monitoring, automated backups
- ✅ **Battle-tested** - Enrichment pipeline, consolidation engine, retry logic, dual storage
@@ -532,26 +532,36 @@ AutoMem saves you months of iteration:

### LoCoMo Benchmark (ACL 2024)

**89.27% accuracy** on categories 1–4 (233 scored questions, Voyage 4 embeddings):
AutoMem publishes two reference baselines with Voyage 4 embeddings:

| Setup | Scope | Score | Notes |
|-------|-------|-------|-------|
| Fast iteration | `locomo-mini`, judge off | **89.27% (208/233)** | Categories 1-4 only; 71 category-5 questions skipped |
| Full benchmark | `locomo`, judge on (`gpt-4o`) | **87.56% (1739/1986)** | Includes category 5 at 95.74% (427/446) |

`locomo-mini` category breakdown with the judge disabled:

| Category | AutoMem | Notes |
| -------------------------- | ---------- | --------------------------------------- |
| **Open Domain** | **96.49%** | General knowledge recall |
| **Temporal Understanding** | **92.06%** | Time-aware queries |
| **Single-hop Recall** | **79.07%** | Basic fact retrieval |
| **Multi-hop Reasoning** | **46.15%** | Connecting disparate memories |
| **Complex Reasoning** | N/A | Requires LLM judge (not yet scored) |
| **Complex Reasoning** | N/A | Skipped in this setup; use judge-on run |

**Comparison with other systems:**
Reference point:

| System | Score |
|--------|-------|
| AutoMem | 89.27% |
| CORE | 88.24% |
| Published CORE result | 88.24% |
| AutoMem `locomo-mini` judge off | 89.27% |
| AutoMem `locomo` judge on | 87.56% |

> **Note:** Earlier versions reported 90.53% which included two evaluator bugs: temporal matching compared the wrong text (false negatives → 22%) and category 5 matched empty strings (false positives → 100%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for full history.
> **Methodology note:** We do not present this as a strict leaderboard claim. The published CORE number is a useful reference point, but the public LoCoMo setups are not perfectly apples-to-apples, especially around category-5 handling. AutoMem is above that published reference on the `locomo-mini` categories 1-4 run and below it on the full judge-enabled run.
>
> **History note:** Earlier versions reported 90.53%, but that included two evaluator bugs: temporal matching compared the wrong text (false negatives) and category 5 matched empty strings (false positives). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for the corrected timeline.

Run benchmarks: `make bench-eval BENCH=locomo-mini` (quick) or `make bench-eval BENCH=locomo` (full)
Run benchmarks: `make bench-eval BENCH=locomo-mini CONFIG=baseline` (quick) or `BENCH_JUDGE_MODEL=gpt-4o make bench-eval BENCH=locomo CONFIG=baseline` (full, includes category 5)

### Production Characteristics

8 changes: 5 additions & 3 deletions benchmarks/EXPERIMENT_LOG.md
@@ -10,8 +10,8 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
| Tier | Benchmark | Runtime | Cost | When to use |
|------|-----------|---------|------|-------------|
| 0 | `make test` (unit) | 30s | free | Every change |
| 1 | `locomo-mini` (2 convos, 304 Qs) | 2-3 min | free | Rapid iteration |
| 2 | `locomo` (10 convos, 1986 Qs) | 5-10 min | free | Before merge |
| 1 | `locomo-mini` (2 convos, 304 Qs) | 2-3 min | free / ~$0.20 with judge | Rapid iteration |
| 2 | `locomo` (10 convos, 1986 Qs) | 5-10 min | free / ~$1-3 with judge | Before merge |
| 3 | `longmemeval-mini` (20 Qs) | 15 min | ~$1 | Scoring/entity changes |
| 4 | `longmemeval` (500 Qs) | 1-2 hr | ~$10 | Milestones only |

@@ -26,16 +26,18 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
| 2026-03-02 | #78 | exp/78-decay-fix | 76.97% (+0.0) | 79.51% (-0.55) | -- | Decay rate 0.1→0.01, importance floor, archive filter. Within variance. Impact is on production (rehabilitated via rescore) |
| 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
| 2026-03-10 | eval-fix | docs/benchmark-agent-guidelines | **89.27% (208/233)** | -- | -- | Fix temporal matching (answer vs memory dates) + skip cat5 (no ground truth). Honest score, beats CORE by 1.03pp. |
| 2026-03-10 | cat5-judge | feat/bench-cat5-judge | **89.80% (273/304)** | **87.56% (1739/1986)** | -- | Opt-in GPT-4o judge for cat5. Full run scored cat5 at 95.74% (427/446) with 0 judge skips/errors; added 90s OpenAI request timeout to prevent stuck full runs. |

### Category Breakdown (LoCoMo-mini)

Categories 1-4 scored by word-overlap/date matching. Category 5 requires LLM judge (not yet implemented).
Categories 1-4 are scored by word-overlap/date matching. Category 5 uses an opt-in LLM judge when `BENCH_JUDGE_MODEL` or `--judge` is enabled; otherwise it remains `N/A`.

| Date | Issue/PR | Single-hop | Temporal | Multi-hop | Open Domain | Complex |
|------|----------|------------|----------|-----------|-------------|---------|
| 2026-03-02 | baseline | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
| 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
| 2026-03-10 | eval-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | N/A (71 skipped) |
| 2026-03-10 | cat5-judge | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | **91.5% (65/71)** |

\* Temporal was artificially low: evaluator compared question dates (empty) vs memory dates instead of answer dates.
\*\* Complex was artificially 100%: dataset has no `answer` field for cat5 → empty string → `"" in content` always True.
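
The second footnote describes a general Python pitfall: the empty string is a substring of every string, so a containment check against a missing answer always passes. The sketch below reproduces the failure and shows one possible guard. `naive_match` and `guarded_match` are hypothetical names for illustration, not the functions in `tests/benchmarks/test_locomo.py`, and the memory text is made up.

```python
from typing import Optional


def naive_match(answer: str, content: str) -> bool:
    # Mirrors the buggy check: `"" in content` is True for any content.
    return answer in content


def guarded_match(answer: Optional[str], content: str) -> Optional[bool]:
    # Return None (meaning "skip") when there is no ground-truth answer.
    if answer is None or not answer.strip():
        return None
    return answer.strip().lower() in content.lower()


missing_answer = ""  # a cat5 entry without an `answer` field ends up as ""
memory = "Caroline moved to Lisbon in May 2023."  # illustrative text only

print(naive_match(missing_answer, memory))    # True: a false positive
print(guarded_match(missing_answer, memory))  # None: the question is skipped
print(guarded_match("Lisbon", memory))        # True: a real match
```

Returning `None` instead of `False` keeps unanswerable questions out of both the numerator and the denominator, which matches the "N/A (71 skipped)" row in the table above.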
54 changes: 42 additions & 12 deletions docs/TESTING.md
@@ -218,6 +218,27 @@ make test-integration

AutoMem can be evaluated against the **LoCoMo benchmark** (ACL 2024), which tests long-term conversational memory across 10 conversations and 1,986 questions.

### LoCoMo Cat-5 Judge

Category 5 uses evidence-grounded complex reasoning and is opt-in for cost reasons.

```bash
# Default: categories 1-4 scored, category 5 skipped
make bench-eval BENCH=locomo-mini CONFIG=baseline

# Enable cat-5 judge with env var
BENCH_JUDGE_MODEL=gpt-4o make bench-eval BENCH=locomo-mini CONFIG=baseline

# Or use the runner CLI flags directly
./test-locomo-benchmark.sh --conversations 0,1 --judge
./test-locomo-benchmark.sh --conversations 0,1 --judge-model gpt-4o-mini
```

- `BENCH_JUDGE_MODEL` enables category-5 judging for `tests/benchmarks/test_locomo.py`.
- `--judge` and `--judge-model` both enable the judge; `--judge` defaults to `gpt-4o` unless overridden by `BENCH_JUDGE_MODEL` or `--judge-model`.
- If the judge is disabled, category 5 remains `N/A`.
- If the judge is enabled but evidence is missing or the LLM response is invalid, the affected category-5 questions are skipped rather than counted wrong.
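
The rules in this list can be read as a small resolution function. The following is a minimal sketch, not the runner's actual code: `resolve_judge_model` is a hypothetical helper, and the precedence of `--judge-model` over `BENCH_JUDGE_MODEL` when both are set is an assumption, since the text does not say which one wins.

```python
import os
from typing import Mapping, Optional

DEFAULT_JUDGE_MODEL = "gpt-4o"


def resolve_judge_model(
    judge_flag: bool,
    judge_model_flag: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the judge model to use, or None when category 5 stays N/A."""
    # Assumption: an explicit --judge-model beats BENCH_JUDGE_MODEL.
    if judge_model_flag:
        return judge_model_flag
    env_model = env.get("BENCH_JUDGE_MODEL", "").strip()
    if env_model:
        return env_model
    # Bare --judge falls back to the documented default.
    if judge_flag:
        return DEFAULT_JUDGE_MODEL
    return None


print(resolve_judge_model(False, None, {}))  # None
print(resolve_judge_model(True, None, {}))   # gpt-4o
print(resolve_judge_model(True, None, {"BENCH_JUDGE_MODEL": "gpt-4o-mini"}))  # gpt-4o-mini
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.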

### What is LoCoMo?

LoCoMo evaluates AI systems' ability to remember and reason across very long conversations (300+ turns). It measures performance across 5 categories:
@@ -228,7 +249,14 @@ LoCoMo evaluates AI systems' ability to remember and reason across very long con
4. **Open Domain** (Category 4) - General knowledge questions
5. **Complex Reasoning** (Category 5) - Advanced inference tasks

**Comparison**: CORE achieved 88.24% (June 2025). AutoMem achieved 90.53%.
Published reference point: CORE is widely cited at **88.24%** (June 2025), but public LoCoMo setups are not perfectly apples-to-apples, especially around category-5 handling.

AutoMem currently publishes two LoCoMo baselines:

| Setup | Scope | Score | Notes |
|------|-------|-------|-------|
| `locomo-mini`, judge off | 2 conversations, categories 1-4 only | **89.27% (208/233)** | 71 category-5 questions skipped |
| `locomo`, judge on (`gpt-4o`) | Full 10 conversations | **87.56% (1739/1986)** | Category 5 scored at 95.74% (427/446) |
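
The two denominators differ because skipped questions are excluded from scoring rather than counted as wrong. A minimal sketch of that accounting follows, using the `locomo-mini` counts reported in this PR; the `accuracy` helper and its `None`-means-skipped convention are assumptions for illustration, not the evaluator's real interface.

```python
from typing import Iterable, Optional, Tuple


def accuracy(outcomes: Iterable[Optional[bool]]) -> Tuple[int, int, float]:
    """Return (correct, scored, percent); None outcomes are skipped."""
    scored = [o for o in outcomes if o is not None]
    correct = sum(scored)
    percent = 100.0 * correct / len(scored) if scored else 0.0
    return correct, len(scored), percent


# locomo-mini, judge off: 208 correct, 25 wrong, 71 category-5 questions skipped.
judge_off = [True] * 208 + [False] * 25 + [None] * 71
# locomo-mini, judge on: 273 correct out of all 304 questions, nothing skipped.
judge_on = [True] * 273 + [False] * 31

for label, outcomes in (("judge off", judge_off), ("judge on", judge_on)):
    correct, scored, percent = accuracy(outcomes)
    print(f"{label}: {percent:.2f}% ({correct}/{scored})")
# judge off: 89.27% (208/233)
# judge on: 89.80% (273/304)
```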

### Running the Benchmark

@@ -265,24 +293,26 @@ Memory usage:
Example benchmark output:
```text
📊 FINAL RESULTS
🎯 Overall Accuracy: 90.53% (1798/1986)
⏱️ Total Time: 1665s
🎯 Overall Accuracy: 87.56% (1739/1986)
⏱️ Total Time: 3497s
💾 Total Memories Stored: 5882

📈 Category Breakdown:
Single-hop Recall : 79.79% (225/282)
Temporal Understanding : 85.05% (273/321)
Multi-hop Reasoning : 50.00% ( 48/ 96)
Open Domain : 95.84% (806/841)
Complex Reasoning : 100.00% (446/446)
Single-hop Recall : 66.31% (187/282)
Temporal Understanding : 87.23% (280/321)
Multi-hop Reasoning : 45.83% ( 44/ 96)
Open Domain : 95.24% (801/841)
Complex Reasoning : 95.74% (427/446)

📊 Comparison:
📊 Comparison with published CORE reference:
CORE: 88.24%
AutoMem: 90.53%
AutoMem: 87.56%
📉 AutoMem is 0.68% behind that reference
```
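
As a sanity check, the per-category counts in the new example output can be summed to confirm the overall figure and the gap to the published CORE number. This is a small sketch that parses the category lines as printed above; the regular expression is an assumption about the report layout, not part of the benchmark tooling.

```python
import re

REPORT = """\
Single-hop Recall      : 66.31% (187/282)
Temporal Understanding : 87.23% (280/321)
Multi-hop Reasoning    : 45.83% ( 44/ 96)
Open Domain            : 95.24% (801/841)
Complex Reasoning      : 95.74% (427/446)
"""

# One row per line: name, percentage, then "(correct/total)" with optional padding.
ROW = re.compile(
    r"^(?P<name>.+?)\s*:\s*[\d.]+% \(\s*(?P<ok>\d+)/\s*(?P<n>\d+)\)$",
    re.MULTILINE,
)

rows = [(m["name"], int(m["ok"]), int(m["n"])) for m in ROW.finditer(REPORT)]
correct = sum(ok for _, ok, _ in rows)
total = sum(n for _, _, n in rows)
overall = 100.0 * correct / total
gap = 88.24 - overall  # published CORE reference minus AutoMem

print(f"Overall: {overall:.2f}% ({correct}/{total})")  # Overall: 87.56% (1739/1986)
print(f"Behind CORE reference by {gap:.2f} points")
```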

All benchmark reports live in `tests/benchmarks/`.
```
If you run without the judge, category 5 will show as `N/A` and the comparison should be treated as directional rather than apples-to-apples.

Current baselines and methodology notes live in `benchmarks/EXPERIMENT_LOG.md`.

### AutoMem's Advantages

8 changes: 6 additions & 2 deletions scripts/bench/restore_and_eval.sh
@@ -6,6 +6,10 @@ set -euo pipefail
BENCH_NAME="${1:-locomo}"
CONFIG="${2:-baseline}"
REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
PYTHON_BIN="${REPO_ROOT}/venv/bin/python"
if [[ ! -x "$PYTHON_BIN" ]]; then
PYTHON_BIN="python3"
fi

# Shared utilities (colors + wait_for_api)
source "$(dirname "$0")/../lib/common.sh"
@@ -90,7 +94,7 @@ if [[ "$BENCH_NAME" == locomo* ]]; then
if [[ "$BENCH_NAME" == "locomo-mini" ]]; then
EVAL_ARGS="--conversations 0,1 ${EVAL_ARGS}"
fi
python3 tests/benchmarks/test_locomo.py \
"$PYTHON_BIN" tests/benchmarks/test_locomo.py \
--base-url "$AUTOMEM_TEST_BASE_URL" \
--api-token "$AUTOMEM_TEST_API_TOKEN" \
${EVAL_ARGS}
@@ -99,7 +103,7 @@ elif [[ "$BENCH_NAME" == longmemeval* ]]; then
if [[ "$BENCH_NAME" == "longmemeval-mini" ]]; then
LONGMEM_ARGS+=(--max-questions 20)
fi
python3 tests/benchmarks/longmemeval/test_longmemeval.py \
"$PYTHON_BIN" tests/benchmarks/longmemeval/test_longmemeval.py \
--base-url "$AUTOMEM_TEST_BASE_URL" \
--api-token "$AUTOMEM_TEST_API_TOKEN" \
"${LONGMEM_ARGS[@]}"
30 changes: 29 additions & 1 deletion test-locomo-benchmark.sh
@@ -19,12 +19,20 @@ SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
# Shared utilities (colors + wait_for_api)
source "${SCRIPT_DIR}/scripts/lib/common.sh"

if [ -x "${SCRIPT_DIR}/venv/bin/python" ]; then
PYTHON_BIN="${SCRIPT_DIR}/venv/bin/python"
else
PYTHON_BIN="python3"
fi

# Default configuration
RUN_LIVE=false
CONVERSATIONS=""
RECALL_LIMIT=10
NO_CLEANUP=false
OUTPUT_FILE=""
JUDGE=false
JUDGE_MODEL=""

# Parse arguments
while [[ $# -gt 0 ]]; do
@@ -45,6 +53,15 @@ while [[ $# -gt 0 ]]; do
OUTPUT_FILE="$2"
shift 2
;;
--judge)
JUDGE=true
shift
;;
--judge-model)
JUDGE=true
JUDGE_MODEL="$2"
shift 2
;;
--conversations)
CONVERSATIONS="$2"
shift 2
@@ -56,6 +73,8 @@ while [[ $# -gt 0 ]]; do
echo " --live Run against Railway deployment (default: local Docker)"
echo " --recall-limit N Number of memories to recall per question (default: 10)"
echo " --conversations I,J Comma-separated conversation indices (e.g. 0,1 for mini mode)"
echo " --judge Enable category-5 LLM judge (defaults to gpt-4o)"
echo " --judge-model MODEL Set the category-5 judge model (also enables judge)"
echo " --no-cleanup Don't cleanup test data after evaluation"
echo " --output FILE Save results to JSON file"
echo " --help, -h Show this help message"
@@ -64,6 +83,7 @@ while [[ $# -gt 0 ]]; do
echo " $0 # Run locally"
echo " $0 --live # Run against Railway"
echo " $0 --conversations 0,1 # Mini mode (2 conversations)"
echo " $0 --conversations 0,1 --judge # Mini mode with cat-5 judge"
echo " $0 --recall-limit 20 --output results.json"
exit 0
;;
@@ -159,7 +179,7 @@ else
fi

# Build python command
PYTHON_CMD="python3 $SCRIPT_DIR/tests/benchmarks/test_locomo.py"
PYTHON_CMD="$PYTHON_BIN $SCRIPT_DIR/tests/benchmarks/test_locomo.py"
PYTHON_CMD="$PYTHON_CMD --base-url $AUTOMEM_TEST_BASE_URL"
PYTHON_CMD="$PYTHON_CMD --api-token $AUTOMEM_TEST_API_TOKEN"
PYTHON_CMD="$PYTHON_CMD --recall-limit $RECALL_LIMIT"
@@ -176,6 +196,14 @@ if [ -n "$OUTPUT_FILE" ]; then
PYTHON_CMD="$PYTHON_CMD --output $OUTPUT_FILE"
fi

if [ "$JUDGE" = true ]; then
PYTHON_CMD="$PYTHON_CMD --judge"
fi

if [ -n "$JUDGE_MODEL" ]; then
PYTHON_CMD="$PYTHON_CMD --judge-model $JUDGE_MODEL"
fi

echo ""
echo -e "${BLUE}🚀 Starting benchmark evaluation...${NC}"
echo ""
2 changes: 2 additions & 0 deletions tests/benchmarks/BENCHMARK_2025-11-08.md
@@ -1,5 +1,7 @@
# AutoMem Benchmark Results

> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so these scores and comparisons are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology.

## LoCoMo Benchmark (Long-term Conversational Memory)

**Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations)
2 changes: 2 additions & 0 deletions tests/benchmarks/BENCHMARK_2025-11-20.md
@@ -1,5 +1,7 @@
# AutoMem Benchmark Results

> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so the scores and any "SOTA" language here are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology.

## LoCoMo Benchmark (Long-term Conversational Memory)

**Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations)
2 changes: 2 additions & 0 deletions tests/benchmarks/BENCHMARK_2025-12-02.md
@@ -1,5 +1,7 @@
# AutoMem Benchmark Results

> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so the scores and any "SOTA" language here are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology.

## LoCoMo Benchmark (Long-term Conversational Memory)

**Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations)