Merged
46 changes: 45 additions & 1 deletion AGENTS.md
@@ -4,7 +4,9 @@

- `automem/`: Core package. Notable dirs: `api/` (Flask blueprints), `utils/`, `stores/`, `config.py`.
- `app.py`: Flask API entry point used in local/dev and tests.
- `tests/`: Pytest suite (`test_*.py`), plus benchmarks under `tests/benchmarks/`.
- `tests/`: Pytest suite (`test_*.py`), plus legacy benchmark harnesses under `tests/benchmarks/`.
- `benchmarks/`: Snapshot-based benchmark system. See `EXPERIMENT_LOG.md` for current baselines and results.
- `scripts/bench/`: Benchmark tooling (ingest, eval, compare, health check).
- `docs/`: API, testing, deployment, monitoring, and env var references.
- `scripts/`: Maintenance and ops helpers (backup, reembed, health monitor).
- `mcp-sse-server/`: Optional MCP bridge used in some deployments.
@@ -17,6 +19,7 @@
- `make test`: Run unit tests (fast, no services).
- `make test-integration`: Start Docker and run full integration tests.
- `make fmt` / `make lint`: Format with Black/isort and lint with Flake8.
- `make bench-eval BENCH=locomo-mini`: Run snapshot-based benchmark (~2 min). See Benchmarking section below.
- `make deploy` / `make status`: Deploy/check Railway. Quick health: `curl :8001/health`.

## Coding Style & Naming
@@ -33,6 +36,47 @@
- Integration: `make test-integration` (requires Docker). See `docs/TESTING.md` for env flags and live testing options.
- Add/adjust tests for new endpoints, stores, or utils; prefer fixtures over globals.

## Benchmarking

The benchmark system uses **snapshot-based evaluation**: ingest once, eval many times from the same snapshot. This keeps runs deterministic and fast.

**Source of truth**: `benchmarks/EXPERIMENT_LOG.md` — contains current baselines, all experiment results, and the tiered benchmark table.

### Tiered System

| Tier | Benchmark | Command | Runtime | Cost | When to use |
|------|-----------|---------|---------|------|-------------|
| 0 | Unit tests | `make test` | 30s | free | Every change |
| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free | Rapid iteration |
| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free | Before merge |
| 3 | LongMemEval-mini (20 Qs) | `make bench-mini-longmemeval` | 15 min | ~$1 | Scoring/entity changes |
| 4 | LongMemEval-full (500 Qs) | `make test-longmemeval` | 1-2 hr | ~$10 | Milestones only |

### Key Commands

- `make bench-eval BENCH=locomo-mini CONFIG=baseline` — eval from snapshot (~2 min).
- `make bench-compare BENCH=locomo CONFIG=<name> BASELINE=baseline` — A/B compare two configs.
- `make bench-compare-branch BRANCH=<branch>` — compare a branch against baseline.
- `make bench-ingest BENCH=locomo` — ingest + snapshot (run once per embedding change).
- `make bench-health` — recall health check (score distribution, entity quality, latency).

### Workflow for Recall/Retrieval Changes

1. Run `make bench-eval BENCH=locomo-mini` on `main` to confirm the current baseline.
2. Create a feature branch and implement changes.
3. Run the same eval on the branch.
4. Record both results as a new row in `benchmarks/EXPERIMENT_LOG.md`.
5. Promote to `make bench-eval BENCH=locomo` (full) before merge.

### Directory Layout

- `benchmarks/EXPERIMENT_LOG.md` — results table and experiment metadata (committed).
- `benchmarks/baselines/` — baseline result JSONs (small files committed, large ones gitignored).
- `benchmarks/snapshots/` — Qdrant/FalkorDB snapshot data (gitignored, regenerate with `make bench-ingest`).
- `benchmarks/results/` — per-run result JSONs (gitignored).
- `scripts/bench/` — shell and Python scripts driving ingest, eval, compare, and health checks.
- `tests/benchmarks/` — legacy benchmark harnesses (LoCoMo, LongMemEval) and historical result markdown files.

## Commit & Pull Requests

- Use Conventional Commits style: `feat`, `fix`, `docs`, `refactor`, `test`, `chore` (e.g., `feat(api): add /analyze endpoint`).
22 changes: 12 additions & 10 deletions README.md
@@ -14,7 +14,7 @@

# **AI Memory That Actually Learns**

AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **90.53% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%).
AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **89.27% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for current baselines.

**Deploy in 60 seconds:**

@@ -522,7 +522,7 @@ Vector databases match embeddings. AutoMem builds knowledge graphs:

AutoMem saves you months of iteration:

- ✅ **Benchmark-proven** - 90.53% on LoCoMo (ACL 2024)
- ✅ **Benchmark-proven** - 89.27% on LoCoMo (ACL 2024), beats CORE SOTA
- ✅ **Research-validated** - Implements HippoRAG 2, A-MEM, MELODI, ReadAgent principles
- ✅ **Production-ready** - Auth, admin tools, health monitoring, automated backups
- ✅ **Battle-tested** - Enrichment pipeline, consolidation engine, retry logic, dual storage
@@ -532,24 +532,26 @@ AutoMem saves you months of iteration:

### LoCoMo Benchmark (ACL 2024)

**90.53% overall accuracy** across 1,986 questions:
**89.27% accuracy** on categories 1–4 (233 scored questions, Voyage 4 embeddings):

| Category | AutoMem | Notes |
| -------------------------- | ---------- | --------------------------------------- |
| **Complex Reasoning** | **100%** | Perfect score on multi-step reasoning |
| **Open Domain** | **95.84%** | General knowledge recall |
| **Temporal Understanding** | **85.05%** | Time-aware queries |
| **Single-hop Recall** | **79.79%** | Basic fact retrieval |
| **Multi-hop Reasoning** | **50.00%** | Connecting disparate memories (+12.5pp) |
| **Open Domain** | **96.49%** | General knowledge recall |
| **Temporal Understanding** | **92.06%** | Time-aware queries |
| **Single-hop Recall** | **79.07%** | Basic fact retrieval |
| **Multi-hop Reasoning** | **46.15%** | Connecting disparate memories |
| **Complex Reasoning** | N/A | Requires LLM judge (not yet scored) |

**Comparison with other systems:**

| System | Score |
|--------|-------|
| AutoMem | 90.53% |
| AutoMem | 89.27% |
| CORE | 88.24% |

Run the benchmark yourself: `make test-locomo`
> **Note:** Earlier versions reported 90.53%, a figure distorted by two evaluator bugs: temporal matching compared the wrong text (false negatives, so the category scored 22%), and category 5 matched empty strings (false positives, so the category scored 100%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for the full history.

Run benchmarks: `make bench-eval BENCH=locomo-mini` (quick) or `make bench-eval BENCH=locomo` (full)

### Production Characteristics

20 changes: 18 additions & 2 deletions benchmarks/EXPERIMENT_LOG.md
@@ -24,13 +24,29 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
| 2026-03-02 | PR #80 | jescalan/feat/enhanced-recall | BLOCKED | -- | -- | Merge conflicts with main (recall.py), needs rebase before eval |
| 2026-03-02 | PR #87 | jescalan/feat/write-time-dedup | 76.97% (+0.0) | -- | -- | Write-time dedup gate. Neutral on recall (expected) |
| 2026-03-02 | #78 | exp/78-decay-fix | 76.97% (+0.0) | 79.51% (-0.55) | -- | Decay rate 0.1→0.01, importance floor, archive filter. Within variance. Impact is on production (rehabilitated via rescore) |
| 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
| 2026-03-10 | eval-fix | docs/benchmark-agent-guidelines | **89.27% (208/233)** | -- | -- | Fix temporal matching (answer vs memory dates) + skip cat5 (no ground truth). Honest score, beats CORE by 1.03pp. |
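
The headline numbers in the eval-fix row can be re-derived from the per-category counts recorded in this log. The snippet below is only an arithmetic check, not part of the benchmark tooling; the dictionary keys are illustrative names, while the counts and the CORE score (88.24%) are copied from this document.

```python
# Per-category (correct, total) counts for the eval-fix run, copied from the
# Category Breakdown table. The key names are illustrative only.
eval_fix = {
    "single_hop": (34, 43),
    "temporal": (58, 63),
    "multi_hop": (6, 13),
    "open_domain": (110, 114),
}

# Baseline counts, including the 71 category-5 questions that were scored
# (incorrectly) as all correct before the evaluator fix.
baseline = {
    "single_hop": (33, 43),
    "temporal": (14, 63),
    "multi_hop": (6, 13),
    "open_domain": (110, 114),
    "complex": (71, 71),
}

CORE_SCORE = 88.24  # CORE's reported LoCoMo score, as quoted in the README


def overall(counts: dict[str, tuple[int, int]]) -> tuple[int, int, float]:
    """Return (correct, total, accuracy in percent rounded to 2 places)."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct, total, round(100 * correct / total, 2)


fix_correct, fix_total, fix_accuracy = overall(eval_fix)
base_correct, base_total, base_accuracy = overall(baseline)
margin_over_core = round(fix_accuracy - CORE_SCORE, 2)

print(f"eval-fix: {fix_correct}/{fix_total} = {fix_accuracy}%")
print(f"baseline: {base_correct}/{base_total} = {base_accuracy}%")
print(f"margin over CORE: {margin_over_core}pp")
```

Note that the two runs use different denominators (233 versus 304 questions) because category 5 is skipped after the fix, so the jump from 76.97% to 89.27% is not a like-for-like delta.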

### Category Breakdown (LoCoMo-mini)

Categories 1-4 are scored by word-overlap or date matching. Category 5 requires an LLM judge (not yet implemented).

| Date | Issue/PR | Single-hop | Temporal | Multi-hop | Open Domain | Complex |
|------|----------|------------|----------|-----------|-------------|---------|
| 2026-03-02 | baseline | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
| 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
| 2026-03-10 | eval-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | N/A (71 skipped) |

\* Temporal was artificially low: the evaluator compared question dates (which are empty) against memory dates, instead of comparing answer dates against memory dates.
\*\* Complex was artificially 100%: dataset has no `answer` field for cat5 → empty string → `"" in content` always True.
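
The category-5 false positive follows from how Python's `in` operator treats the empty string: it is a substring of every string. The sketch below reproduces that behaviour in isolation. The function names and the sample record are hypothetical and are not taken from the actual evaluator; the guard shown is one possible way to skip questions that have no ground truth.

```python
from typing import Optional


def naive_match(answer: str, content: str) -> bool:
    # Buggy pattern: an empty answer is a substring of every string,
    # so this returns True for any content when the answer is "".
    return answer in content


def guarded_match(answer: str, content: str) -> Optional[bool]:
    # Return None (meaning "skip, do not score") when there is no
    # ground-truth answer to compare against.
    if not answer.strip():
        return None
    return answer.lower() in content.lower()


# Hypothetical category-5 record with no "answer" field.
question = {"category": 5, "question": "Why did she change jobs?"}
answer = question.get("answer", "")

memory = "She moved to Paris in 2021."

print(naive_match(answer, memory))     # True: counted as correct by mistake
print(guarded_match(answer, memory))   # None: excluded from scoring
print(guarded_match("paris", memory))  # True: a real substring match
```

Returning a third value instead of `False` keeps skipped questions out of both the numerator and the denominator, which matches how the eval-fix row reports 71 questions as skipped rather than wrong.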

## How to add an entry

1. Run the benchmark: `make bench-eval BENCH=locomo-mini CONFIG=baseline`
2. Record the overall accuracy from the output JSON
3. Add a row to the table above with the date, issue/PR, branch, and scores
4. For deltas, show as `XX.X% (+Y.Y)` relative to the baseline row
3. Add a row to the Results table with the date, issue/PR, branch, and scores
4. Add a row to the Category Breakdown table with per-category scores
5. For deltas, show as `XX.X% (+Y.Y)` relative to the baseline row
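
For step 5, a small helper can produce the delta string consistently. This is a minimal sketch, not an existing script in `scripts/bench/`; the function name and the `places` parameter are assumptions, with the default of one decimal place taken from the `XX.X% (+Y.Y)` template above.

```python
def format_delta(score: float, baseline: float, places: int = 1) -> str:
    """Format a score with its signed delta relative to the baseline.

    Both inputs are percentages, e.g. 76.97 for 76.97%.
    """
    delta = score - baseline
    # The "+" flag forces an explicit sign on the delta, so improvements
    # render as "+Y.Y" and regressions as "-Y.Y".
    return f"{score:.{places}f}% ({delta:+.{places}f})"


print(format_delta(76.97, 76.97))  # 77.0% (+0.0)
print(format_delta(89.27, 76.97))  # 89.3% (+12.3)
```

Existing rows in the Results table use two decimal places for the score, so passing `places=2` may match them more closely.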

## Snapshot metadata
