Skip to content
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 45 additions & 1 deletion AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,9 @@

- `automem/`: Core package. Notable dirs: `api/` (Flask blueprints), `utils/`, `stores/`, `config.py`.
- `app.py`: Flask API entry point used in local/dev and tests.
- `tests/`: Pytest suite (`test_*.py`), plus benchmarks under `tests/benchmarks/`.
- `tests/`: Pytest suite (`test_*.py`), plus legacy benchmark harnesses under `tests/benchmarks/`.
- `benchmarks/`: Snapshot-based benchmark system. See `EXPERIMENT_LOG.md` for current baselines and results.
- `scripts/bench/`: Benchmark tooling (ingest, eval, compare, health check).
- `docs/`: API, testing, deployment, monitoring, and env var references.
- `scripts/`: Maintenance and ops helpers (backup, reembed, health monitor).
- `mcp-sse-server/`: Optional MCP bridge used in some deployments.
Expand All @@ -17,6 +19,7 @@
- `make test`: Run unit tests (fast, no services).
- `make test-integration`: Start Docker and run full integration tests.
- `make fmt` / `make lint`: Format with Black/Isort and lint with Flake8.
- `make bench-eval BENCH=locomo-mini`: Run snapshot-based benchmark (~2 min). See Benchmarking section below.
- `make deploy` / `make status`: Deploy/check Railway. Quick health: `curl :8001/health`.

## Coding Style & Naming
Expand All @@ -33,6 +36,47 @@
- Integration: `make test-integration` (requires Docker). See `docs/TESTING.md` for env flags and live testing options.
- Add/adjust tests for new endpoints, stores, or utils; prefer fixtures over globals.

## Benchmarking

The benchmark system uses **snapshot-based evaluation**: ingest once, eval many times from the same snapshot. This keeps runs deterministic and fast.

**Source of truth**: `benchmarks/EXPERIMENT_LOG.md` — contains current baselines, all experiment results, and the tiered benchmark table.

### Tiered System

| Tier | Benchmark | Command | Runtime | Cost | When to use |
|------|-----------|---------|---------|------|-------------|
| 0 | Unit tests | `make test` | 30s | free | Every change |
| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free | Rapid iteration |
| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free | Before merge |
| 3 | LongMemEval-mini (20 Qs) | `make bench-mini-longmemeval` | 15 min | ~$1 | Scoring/entity changes |
| 4 | LongMemEval-full (500 Qs) | `make test-longmemeval` | 1-2 hr | ~$10 | Milestones only |

### Key Commands

- `make bench-eval BENCH=locomo-mini CONFIG=baseline` — eval from snapshot (~2 min).
- `make bench-compare BENCH=locomo CONFIG=<name> BASELINE=baseline` — A/B compare two configs.
- `make bench-compare-branch BRANCH=<branch>` — compare a branch against baseline.
- `make bench-ingest BENCH=locomo` — ingest + snapshot (run once per embedding change).
- `make bench-health` — recall health check (score distribution, entity quality, latency).

### Workflow for Recall/Retrieval Changes

1. Run `make bench-eval BENCH=locomo-mini` on `main` to confirm the current baseline.
2. Create a feature branch and implement changes.
3. Run the same eval on the branch.
4. Record both results as a new row in `benchmarks/EXPERIMENT_LOG.md`.
5. Promote to `make bench-eval BENCH=locomo` (full) before merge.

### Directory Layout

- `benchmarks/EXPERIMENT_LOG.md` — results table and experiment metadata (committed).
- `benchmarks/baselines/` — baseline result JSONs (small files committed, large ones gitignored).
- `benchmarks/snapshots/` — Qdrant/FalkorDB snapshot data (gitignored, regenerate with `make bench-ingest`).
- `benchmarks/results/` — per-run result JSONs (gitignored).
- `scripts/bench/` — shell and Python scripts driving ingest, eval, compare, and health checks.
- `tests/benchmarks/` — legacy benchmark harnesses (LoCoMo, LongMemEval) and historical result markdown files.

## Commit & Pull Requests

- Use Conventional Commits style: `feat`, `fix`, `docs`, `refactor`, `test`, `chore` (e.g., `feat(api): add /analyze endpoint`).
Expand Down
22 changes: 12 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@

# **AI Memory That Actually Learns**

AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **90.53% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%).
AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **89.27% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for current baselines.

**Deploy in 60 seconds:**

Expand Down Expand Up @@ -522,7 +522,7 @@ Vector databases match embeddings. AutoMem builds knowledge graphs:

AutoMem saves you months of iteration:

- ✅ **Benchmark-proven** - 90.53% on LoCoMo (ACL 2024)
- ✅ **Benchmark-proven** - 89.27% on LoCoMo (ACL 2024), beats CORE SOTA
- ✅ **Research-validated** - Implements HippoRAG 2, A-MEM, MELODI, ReadAgent principles
- ✅ **Production-ready** - Auth, admin tools, health monitoring, automated backups
- ✅ **Battle-tested** - Enrichment pipeline, consolidation engine, retry logic, dual storage
Expand All @@ -532,24 +532,26 @@ AutoMem saves you months of iteration:

### LoCoMo Benchmark (ACL 2024)

**90.53% overall accuracy** across 1,986 questions:
**89.27% accuracy** on categories 1–4 (233 scored questions, Voyage 4 embeddings):

| Category | AutoMem | Notes |
| -------------------------- | ---------- | --------------------------------------- |
| **Complex Reasoning** | **100%** | Perfect score on multi-step reasoning |
| **Open Domain** | **95.84%** | General knowledge recall |
| **Temporal Understanding** | **85.05%** | Time-aware queries |
| **Single-hop Recall** | **79.79%** | Basic fact retrieval |
| **Multi-hop Reasoning** | **50.00%** | Connecting disparate memories (+12.5pp) |
| **Open Domain** | **96.49%** | General knowledge recall |
| **Temporal Understanding** | **92.06%** | Time-aware queries |
| **Single-hop Recall** | **79.07%** | Basic fact retrieval |
| **Multi-hop Reasoning** | **46.15%** | Connecting disparate memories |
| **Complex Reasoning** | N/A | Requires LLM judge (not yet scored) |

**Comparison with other systems:**

| System | Score |
|--------|-------|
| AutoMem | 90.53% |
| AutoMem | 89.27% |
| CORE | 88.24% |

Run the benchmark yourself: `make test-locomo`
> **Note:** Earlier versions reported 90.53% which included two evaluator bugs: temporal matching compared the wrong text (false negatives → 22%) and category 5 matched empty strings (false positives → 100%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for full history.

Run benchmarks: `make bench-eval BENCH=locomo-mini` (quick) or `make bench-eval BENCH=locomo` (full)

### Production Characteristics

Expand Down
20 changes: 18 additions & 2 deletions benchmarks/EXPERIMENT_LOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,13 +24,29 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
| 2026-03-02 | PR #80 | jescalan/feat/enhanced-recall | BLOCKED | -- | -- | Merge conflicts with main (recall.py), needs rebase before eval |
| 2026-03-02 | PR #87 | jescalan/feat/write-time-dedup | 76.97% (+0.0) | -- | -- | Write-time dedup gate. Neutral on recall (expected) |
| 2026-03-02 | #78 | exp/78-decay-fix | 76.97% (+0.0) | 79.51% (-0.55) | -- | Decay rate 0.1→0.01, importance floor, archive filter. Within variance. Impact is on production (rehabilitated via rescore) |
| 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
| 2026-03-10 | eval-fix | docs/benchmark-agent-guidelines | **89.27% (208/233)** | -- | -- | Fix temporal matching (answer vs memory dates) + skip cat5 (no ground truth). Honest score, beats CORE by 1.03pp. |

### Category Breakdown (LoCoMo-mini)

Categories 1-4 scored by word-overlap/date matching. Category 5 requires LLM judge (not yet implemented).

| Date | Issue/PR | Single-hop | Temporal | Multi-hop | Open Domain | Complex |
|------|----------|------------|----------|-----------|-------------|---------|
| 2026-03-02 | baseline | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
| 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
| 2026-03-10 | eval-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | N/A (71 skipped) |

\* Temporal was artificially low: evaluator compared question dates (empty) vs memory dates instead of answer dates.
\*\* Complex was artificially 100%: dataset has no `answer` field for cat5 → empty string → `"" in content` always True.

## How to add an entry

1. Run the benchmark: `make bench-eval BENCH=locomo-mini CONFIG=baseline`
2. Record the overall accuracy from the output JSON
3. Add a row to the table above with the date, issue/PR, branch, and scores
4. For deltas, show as `XX.X% (+Y.Y)` relative to the baseline row
3. Add a row to the Results table with the date, issue/PR, branch, and scores
4. Add a row to the Category Breakdown table with per-category scores
5. For deltas, show as `XX.X% (+Y.Y)` relative to the baseline row

## Snapshot metadata

Expand Down
132 changes: 118 additions & 14 deletions tests/benchmarks/test_locomo.py
Original file line number Diff line number Diff line change
Expand Up @@ -486,6 +486,26 @@ def _extract_speaker_from_question(self, question: str) -> Optional[str]:

return None

@staticmethod
def _session_datetime_to_words(iso_str: str) -> str:
"""Decompose an ISO-8601 timestamp into human-readable date words.

'2023-05-08T13:56:00+00:00' -> '2023 may 8 08 05 may'
This lets word-overlap matching find '2023', 'may', '8', etc.
"""
if not iso_str:
return ""
try:
dt = date_parser.parse(iso_str)
month_name = dt.strftime("%B").lower() # 'may'
month_abbr = dt.strftime("%b").lower() # 'may'
return (
f"{dt.year} {month_name} {month_abbr} {dt.day} "
f"{dt.strftime('%d')} {dt.strftime('%m')}"
)
except (ValueError, OverflowError):
return ""

def is_temporal_question(self, question: str) -> bool:
"""Detect if question is asking about time/dates"""
temporal_keywords = [
Expand Down Expand Up @@ -1030,7 +1050,7 @@ def check_answer_in_memories(

# For temporal questions, try fuzzy date matching across the joined evidence
if self.is_temporal_question(question) and self.match_dates_fuzzy(
question, joined_text
str(expected_answer), joined_text
):
return (
True,
Expand Down Expand Up @@ -1084,12 +1104,15 @@ def check_answer_in_memories(

# Phase 1 Improvement: For temporal questions, also check session_datetime
if is_temporal:
session_datetime = metadata.get("session_datetime", "").lower()
# Combine content and datetime for temporal matching
searchable_text = f"{content_normalized} {session_datetime}"

# Quick Win #1: Fuzzy date matching for temporal questions
if self.match_dates_fuzzy(question, content + " " + session_datetime):
session_datetime = metadata.get("session_datetime", "")
session_readable = self._session_datetime_to_words(session_datetime)
searchable_text = f"{content_normalized} {session_readable}"

# Fuzzy date matching: compare ANSWER dates vs memory dates
if self.match_dates_fuzzy(
str(expected_answer),
content + " " + session_datetime,
):
return (
True,
0.95,
Expand Down Expand Up @@ -1127,6 +1150,23 @@ def check_answer_in_memories(
content = memory.get("content", "").lower()
content_normalized = self.normalize_answer(content)

# For temporal questions, enrich searchable text with session_datetime
if is_temporal:
metadata = memory.get("metadata", {})
session_dt = metadata.get("session_datetime", "")
session_words = self._session_datetime_to_words(session_dt)
content_normalized = f"{content_normalized} {session_words}"

# Fuzzy date matching: compare answer dates vs memory dates
if session_dt and self.match_dates_fuzzy(
str(expected_answer), content + " " + session_dt
):
return (
True,
0.95,
f"Date match in memory {memory.get('id', '?')[:8]}",
)

# Exact substring match
if expected_normalized in content_normalized:
confidence = 1.0
Expand Down Expand Up @@ -1215,6 +1255,22 @@ def _evaluate_only(self, conversation: Dict[str, Any], sample_id: str) -> Dict[s
category = qa.get("category", 0)
evidence = qa.get("evidence", [])

# Category 5 (Complex Reasoning) needs an LLM judge — the
# dataset's ground-truth is either absent or trivial (yes/no).
if category == 5:
qa_results.append(
{
"question": question,
"expected_answer": qa.get("adversarial_answer", answer),
"category": category,
"is_correct": None,
"confidence": 0.0,
"recalled_count": 0,
"explanation": "Skipped: requires LLM judge",
}
)
continue

if evidence and len(evidence) > 1:
recalled_memories = self.multi_hop_recall_with_graph(
question,
Expand Down Expand Up @@ -1249,11 +1305,16 @@ def _evaluate_only(self, conversation: Dict[str, Any], sample_id: str) -> Dict[s
if (i + 1) % 10 == 0:
print(f" Processed {i+1}/{len(questions)} questions...")

correct_count = sum(1 for r in qa_results if r["is_correct"])
total_count = len(qa_results)
scored = [r for r in qa_results if r["is_correct"] is not None]
skipped = len(qa_results) - len(scored)
correct_count = sum(1 for r in scored if r["is_correct"])
total_count = len(scored)
accuracy = correct_count / total_count if total_count > 0 else 0.0

print(f"\nConversation Results: {accuracy:.2%} ({correct_count}/{total_count})")
msg = f"\nConversation Results: {accuracy:.2%} ({correct_count}/{total_count})"
if skipped:
msg += f" [{skipped} skipped (no ground truth)]"
print(msg)

return {
"sample_id": sample_id,
Expand Down Expand Up @@ -1297,6 +1358,24 @@ def evaluate_conversation(self, conversation: Dict[str, Any], sample_id: str) ->
category = qa.get("category", 0)
evidence = qa.get("evidence", [])

# Category 5 (Complex Reasoning) needs an LLM judge — the
# dataset's ground-truth is either absent or trivial (yes/no).
if category == 5:
qa_results.append(
{
"question": question,
"expected_answer": qa.get("adversarial_answer", answer),
"category": category,
"is_correct": None,
"confidence": 0.0,
"recalled_count": 0,
"explanation": "Skipped: requires LLM judge",
}
)
if (i + 1) % 10 == 0:
print(f" Processed {i+1}/{len(questions)} questions...")
continue

# Recall memories for this question
# Use graph expansion for multi-hop questions (evidence > 1)
if evidence and len(evidence) > 1:
Expand Down Expand Up @@ -1338,13 +1417,16 @@ def evaluate_conversation(self, conversation: Dict[str, Any], sample_id: str) ->
if (i + 1) % 10 == 0:
print(f" Processed {i+1}/{len(questions)} questions...")

# Calculate conversation-level statistics
correct_count = sum(1 for r in qa_results if r["is_correct"])
total_count = len(qa_results)
# Calculate conversation-level statistics (exclude skipped/None results)
scored = [r for r in qa_results if r["is_correct"] is not None]
skipped = len(qa_results) - len(scored)
correct_count = sum(1 for r in scored if r["is_correct"])
total_count = len(scored)
accuracy = correct_count / total_count if total_count > 0 else 0.0

skip_note = f" [{skipped} skipped (no ground truth)]" if skipped else ""
print(f"\n📊 Conversation Results:")
print(f" Accuracy: {accuracy:.2%} ({correct_count}/{total_count})")
print(f" Accuracy: {accuracy:.2%} ({correct_count}/{total_count}){skip_note}")
Comment thread
coderabbitai[bot] marked this conversation as resolved.

return {
"sample_id": sample_id,
Expand Down Expand Up @@ -1485,6 +1567,14 @@ def run_benchmark(
5: "Complex Reasoning",
}

# Count skipped category-5 questions for reporting
cat5_skipped = sum(
1
for cr in conversation_results
for qa in cr.get("qa_results", [])
if qa["category"] == 5 and qa["is_correct"] is None
)

category_results = {}
for category, scores in sorted(self.results.items()):
correct = sum(scores)
Expand All @@ -1500,6 +1590,20 @@ def run_benchmark(
f" {category_names.get(category, f'Category {category}'):25s}: {accuracy:6.2%} ({correct:3d}/{total:3d})"
)

if cat5_skipped:
cat5_name = category_names[5]
if 5 not in category_results:
category_results[5] = {
"name": cat5_name,
"accuracy": None,
"correct": 0,
"total": cat5_skipped,
"skipped": True,
}
else:
category_results[5]["skipped_count"] = cat5_skipped
print(f" {cat5_name:25s}: N/A ({cat5_skipped:3d} skipped, needs LLM judge)")

# Comparison with CORE
core_sota = 0.8824
improvement = overall_accuracy - core_sota
Expand Down