verygoodplugins · jack-arturo · Mar 10, 2026 · Mar 10, 2026 · Mar 10, 2026 · Mar 10, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -4,7 +4,9 @@
 
 - `automem/`: Core package. Notable dirs: `api/` (Flask blueprints), `utils/`, `stores/`, `config.py`.
 - `app.py`: Flask API entry point used in local/dev and tests.
-- `tests/`: Pytest suite (`test_*.py`), plus benchmarks under `tests/benchmarks/`.
+- `tests/`: Pytest suite (`test_*.py`), plus legacy benchmark harnesses under `tests/benchmarks/`.
+- `benchmarks/`: Snapshot-based benchmark system. See `EXPERIMENT_LOG.md` for current baselines and results.
+- `scripts/bench/`: Benchmark tooling (ingest, eval, compare, health check).
 - `docs/`: API, testing, deployment, monitoring, and env var references.
 - `scripts/`: Maintenance and ops helpers (backup, reembed, health monitor).
 - `mcp-sse-server/`: Optional MCP bridge used in some deployments.
@@ -17,6 +19,7 @@
 - `make test`: Run unit tests (fast, no services).
 - `make test-integration`: Start Docker and run full integration tests.
 - `make fmt` / `make lint`: Format with Black/Isort and lint with Flake8.
+- `make bench-eval BENCH=locomo-mini`: Run snapshot-based benchmark (~2 min). See Benchmarking section below.
 - `make deploy` / `make status`: Deploy/check Railway. Quick health: `curl :8001/health`.
 
 ## Coding Style & Naming
@@ -33,6 +36,47 @@
 - Integration: `make test-integration` (requires Docker). See `docs/TESTING.md` for env flags and live testing options.
 - Add/adjust tests for new endpoints, stores, or utils; prefer fixtures over globals.
 
+## Benchmarking
+
+The benchmark system uses **snapshot-based evaluation**: ingest once, eval many times from the same snapshot. This keeps runs deterministic and fast.
+
+**Source of truth**: `benchmarks/EXPERIMENT_LOG.md` — contains current baselines, all experiment results, and the tiered benchmark table.
+
+### Tiered System
+
+| Tier | Benchmark | Command | Runtime | Cost | When to use |
+|------|-----------|---------|---------|------|-------------|
+| 0 | Unit tests | `make test` | 30s | free | Every change |
+| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free | Rapid iteration |
+| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free | Before merge |
+| 3 | LongMemEval-mini (20 Qs) | `make bench-mini-longmemeval` | 15 min | ~$1 | Scoring/entity changes |
+| 4 | LongMemEval-full (500 Qs) | `make test-longmemeval` | 1-2 hr | ~$10 | Milestones only |
+
+### Key Commands
+
+- `make bench-eval BENCH=locomo-mini CONFIG=baseline` — eval from snapshot (~2 min).
+- `make bench-compare BENCH=locomo CONFIG=<name> BASELINE=baseline` — A/B compare two configs.
+- `make bench-compare-branch BRANCH=<branch>` — compare a branch against baseline.
+- `make bench-ingest BENCH=locomo` — ingest + snapshot (run once per embedding change).
+- `make bench-health` — recall health check (score distribution, entity quality, latency).
+
+### Workflow for Recall/Retrieval Changes
+
+1. Run `make bench-eval BENCH=locomo-mini` on `main` to confirm the current baseline.
+2. Create a feature branch and implement changes.
+3. Run the same eval on the branch.
+4. Record both results as a new row in `benchmarks/EXPERIMENT_LOG.md`.
+5. Promote to `make bench-eval BENCH=locomo` (full) before merge.
+
+### Directory Layout
+
+- `benchmarks/EXPERIMENT_LOG.md` — results table and experiment metadata (committed).
+- `benchmarks/baselines/` — baseline result JSONs (small files committed, large ones gitignored).
+- `benchmarks/snapshots/` — Qdrant/FalkorDB snapshot data (gitignored, regenerate with `make bench-ingest`).
+- `benchmarks/results/` — per-run result JSONs (gitignored).
+- `scripts/bench/` — shell and Python scripts driving ingest, eval, compare, and health checks.
+- `tests/benchmarks/` — legacy benchmark harnesses (LoCoMo, LongMemEval) and historical result markdown files.
+
 ## Commit & Pull Requests
 
 - Use Conventional Commits style: `feat`, `fix`, `docs`, `refactor`, `test`, `chore` (e.g., `feat(api): add /analyze endpoint`).

diff --git a/README.md b/README.md
@@ -14,7 +14,7 @@
 
 # **AI Memory That Actually Learns**
 
-AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **90.53% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%).
+AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **89.27% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for current baselines.
 
 **Deploy in 60 seconds:**
 
@@ -522,7 +522,7 @@ Vector databases match embeddings. AutoMem builds knowledge graphs:
 
 AutoMem saves you months of iteration:
 
-- ✅ **Benchmark-proven** - 90.53% on LoCoMo (ACL 2024)
+- ✅ **Benchmark-proven** - 89.27% on LoCoMo (ACL 2024), beats CORE SOTA
 - ✅ **Research-validated** - Implements HippoRAG 2, A-MEM, MELODI, ReadAgent principles
 - ✅ **Production-ready** - Auth, admin tools, health monitoring, automated backups
 - ✅ **Battle-tested** - Enrichment pipeline, consolidation engine, retry logic, dual storage
@@ -532,24 +532,26 @@ AutoMem saves you months of iteration:
 
 ### LoCoMo Benchmark (ACL 2024)
 
-**90.53% overall accuracy** across 1,986 questions:
+**89.27% accuracy** on categories 1–4 (233 scored questions, Voyage 4 embeddings):
 
 | Category                   | AutoMem    | Notes                                   |
 | -------------------------- | ---------- | --------------------------------------- |
-| **Complex Reasoning**      | **100%**   | Perfect score on multi-step reasoning   |
-| **Open Domain**            | **95.84%** | General knowledge recall                |
-| **Temporal Understanding** | **85.05%** | Time-aware queries                      |
-| **Single-hop Recall**      | **79.79%** | Basic fact retrieval                    |
-| **Multi-hop Reasoning**    | **50.00%** | Connecting disparate memories (+12.5pp) |
+| **Open Domain**            | **96.49%** | General knowledge recall                |
+| **Temporal Understanding** | **92.06%** | Time-aware queries                      |
+| **Single-hop Recall**      | **79.07%** | Basic fact retrieval                    |
+| **Multi-hop Reasoning**    | **46.15%** | Connecting disparate memories           |
+| **Complex Reasoning**      | N/A        | Requires LLM judge (not yet scored)     |
 
 **Comparison with other systems:**
 
 | System | Score |
 |--------|-------|
-| AutoMem | 90.53% |
+| AutoMem | 89.27% |
 | CORE | 88.24% |
 
-Run the benchmark yourself: `make test-locomo`
+> **Note:** Earlier versions reported 90.53% which included two evaluator bugs: temporal matching compared the wrong text (false negatives → 22%) and category 5 matched empty strings (false positives → 100%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for full history.
+
+Run benchmarks: `make bench-eval BENCH=locomo-mini` (quick) or `make bench-eval BENCH=locomo` (full)
 
 ### Production Characteristics
 

diff --git a/benchmarks/EXPERIMENT_LOG.md b/benchmarks/EXPERIMENT_LOG.md
@@ -24,13 +24,29 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
 | 2026-03-02 | PR #80 | jescalan/feat/enhanced-recall | BLOCKED | -- | -- | Merge conflicts with main (recall.py), needs rebase before eval |
 | 2026-03-02 | PR #87 | jescalan/feat/write-time-dedup | 76.97% (+0.0) | -- | -- | Write-time dedup gate. Neutral on recall (expected) |
 | 2026-03-02 | #78 | exp/78-decay-fix | 76.97% (+0.0) | 79.51% (-0.55) | -- | Decay rate 0.1→0.01, importance floor, archive filter. Within variance. Impact is on production (rehabilitated via rescore) |
+| 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
+| 2026-03-10 | eval-fix | docs/benchmark-agent-guidelines | **89.27% (208/233)** | -- | -- | Fix temporal matching (answer vs memory dates) + skip cat5 (no ground truth). Honest score, beats CORE by 1.03pp. |
+
+### Category Breakdown (LoCoMo-mini)
+
+Categories 1-4 scored by word-overlap/date matching. Category 5 requires LLM judge (not yet implemented).
+
+| Date | Issue/PR | Single-hop | Temporal | Multi-hop | Open Domain | Complex |
+|------|----------|------------|----------|-----------|-------------|---------|
+| 2026-03-02 | baseline | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
+| 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
+| 2026-03-10 | eval-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | N/A (71 skipped) |
+
+\* Temporal was artificially low: evaluator compared question dates (empty) vs memory dates instead of answer dates.
+\*\* Complex was artificially 100%: dataset has no `answer` field for cat5 → empty string → `"" in content` always True.
 
 ## How to add an entry
 
 1. Run the benchmark: `make bench-eval BENCH=locomo-mini CONFIG=baseline`
 2. Record the overall accuracy from the output JSON
-3. Add a row to the table above with the date, issue/PR, branch, and scores
-4. For deltas, show as `XX.X% (+Y.Y)` relative to the baseline row
+3. Add a row to the Results table with the date, issue/PR, branch, and scores
+4. Add a row to the Category Breakdown table with per-category scores
+5. For deltas, show as `XX.X% (+Y.Y)` relative to the baseline row
 
 ## Snapshot metadata