From 4cb7ecbba1b21856f5c23a926d513799948f5b81 Mon Sep 17 00:00:00 2001
From: Jack Arturo <info@verygoodplugins.com>
Date: Tue, 10 Mar 2026 14:45:19 +0530
Subject: [PATCH 1/6] docs(agents): add benchmarking section to AGENTS.md

The new snapshot-based benchmark system (PR #97) wasn't documented in
AGENTS.md, causing AI agents to miss the tiered eval pipeline and
reference stale Oct 2025 numbers instead of current Voyage 4 baselines.

Adds: tiered benchmark table, key commands, recall/retrieval change
workflow, and directory layout. Updates project structure to reference
benchmarks/ and scripts/bench/.

Made-with: Cursor
---
 AGENTS.md | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 45 insertions(+), 1 deletion(-)
diff --git a/AGENTS.md b/AGENTS.md
index 26108aad..c343a85a 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -4,7 +4,9 @@
 
 - `automem/`: Core package. Notable dirs: `api/` (Flask blueprints), `utils/`, `stores/`, `config.py`.
 - `app.py`: Flask API entry point used in local/dev and tests.
-- `tests/`: Pytest suite (`test_*.py`), plus benchmarks under `tests/benchmarks/`.
+- `tests/`: Pytest suite (`test_*.py`), plus legacy benchmark harnesses under `tests/benchmarks/`.
+- `benchmarks/`: Snapshot-based benchmark system. See `EXPERIMENT_LOG.md` for current baselines and results.
+- `scripts/bench/`: Benchmark tooling (ingest, eval, compare, health check).
 - `docs/`: API, testing, deployment, monitoring, and env var references.
 - `scripts/`: Maintenance and ops helpers (backup, reembed, health monitor).
 - `mcp-sse-server/`: Optional MCP bridge used in some deployments.
@@ -17,6 +19,7 @@
 - `make test`: Run unit tests (fast, no services).
 - `make test-integration`: Start Docker and run full integration tests.
 - `make fmt` / `make lint`: Format with Black/Isort and lint with Flake8.
+- `make bench-eval BENCH=locomo-mini`: Run snapshot-based benchmark (~2 min). See Benchmarking section below.
 - `make deploy` / `make status`: Deploy/check Railway. Quick health: `curl :8001/health`.
 
 ## Coding Style & Naming
@@ -33,6 +36,47 @@
 - Integration: `make test-integration` (requires Docker). See `docs/TESTING.md` for env flags and live testing options.
 - Add/adjust tests for new endpoints, stores, or utils; prefer fixtures over globals.
 
+## Benchmarking
+
+The benchmark system uses **snapshot-based evaluation**: ingest once, eval many times from the same snapshot. This keeps runs deterministic and fast.
+
+**Source of truth**: `benchmarks/EXPERIMENT_LOG.md` — contains current baselines, all experiment results, and the tiered benchmark table.
+
+### Tiered System
+
+| Tier | Benchmark | Command | Runtime | Cost | When to use |
+|------|-----------|---------|---------|------|-------------|
+| 0 | Unit tests | `make test` | 30s | free | Every change |
+| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free | Rapid iteration |
+| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free | Before merge |
+| 3 | LongMemEval-mini (20 Qs) | `make bench-mini-longmemeval` | 15 min | ~$1 | Scoring/entity changes |
+| 4 | LongMemEval-full (500 Qs) | `make test-longmemeval` | 1-2 hr | ~$10 | Milestones only |
+
+### Key Commands
+
+- `make bench-eval BENCH=locomo-mini CONFIG=baseline` — eval from snapshot (~2 min).
+- `make bench-compare BENCH=locomo CONFIG=<name> BASELINE=baseline` — A/B compare two configs.
+- `make bench-compare-branch BRANCH=<branch>` — compare a branch against baseline.
+- `make bench-ingest BENCH=locomo` — ingest + snapshot (run once per embedding change).
+- `make bench-health` — recall health check (score distribution, entity quality, latency).
+
+### Workflow for Recall/Retrieval Changes
+
+1. Run `make bench-eval BENCH=locomo-mini` on `main` to confirm the current baseline.
+2. Create a feature branch and implement changes.
+3. Run the same eval on the branch.
+4. Record both results as a new row in `benchmarks/EXPERIMENT_LOG.md`.
+5. Promote to `make bench-eval BENCH=locomo` (full) before merge.
+
+### Directory Layout
+
+- `benchmarks/EXPERIMENT_LOG.md` — results table and experiment metadata (committed).
+- `benchmarks/baselines/` — baseline result JSONs (small files committed, large ones gitignored).
+- `benchmarks/snapshots/` — Qdrant/FalkorDB snapshot data (gitignored, regenerate with `make bench-ingest`).
+- `benchmarks/results/` — per-run result JSONs (gitignored).
+- `scripts/bench/` — shell and Python scripts driving ingest, eval, compare, and health checks.
+- `tests/benchmarks/` — legacy benchmark harnesses (LoCoMo, LongMemEval) and historical result markdown files.
+
 ## Commit & Pull Requests
 
 - Use Conventional Commits style: `feat`, `fix`, `docs`, `refactor`, `test`, `chore` (e.g., `feat(api): add /analyze endpoint`).

From e96fa5c517a7c4803e6ecb34faba730cbf53005c Mon Sep 17 00:00:00 2001
From: Jack Arturo <info@verygoodplugins.com>
Date: Tue, 10 Mar 2026 14:51:57 +0530
Subject: [PATCH 2/6] docs(bench): log pre-refactor baseline confirmation
 (76.97%)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Re-ran locomo-mini on current main (795368a) after merging #73, #78,
#115, #116. Score unchanged at 76.97% — baseline is stable. This
serves as B₀ for the upcoming relation tier refactor.

Made-with: Cursor
---
 benchmarks/EXPERIMENT_LOG.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/benchmarks/EXPERIMENT_LOG.md b/benchmarks/EXPERIMENT_LOG.md
index e5c399b7..1f8fe467 100644
--- a/benchmarks/EXPERIMENT_LOG.md
+++ b/benchmarks/EXPERIMENT_LOG.md
@@ -24,6 +24,7 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
 | 2026-03-02 | PR #80 | jescalan/feat/enhanced-recall | BLOCKED | -- | -- | Merge conflicts with main (recall.py), needs rebase before eval |
 | 2026-03-02 | PR #87 | jescalan/feat/write-time-dedup | 76.97% (+0.0) | -- | -- | Write-time dedup gate. Neutral on recall (expected) |
 | 2026-03-02 | #78 | exp/78-decay-fix | 76.97% (+0.0) | 79.51% (-0.55) | -- | Decay rate 0.1→0.01, importance floor, archive filter. Within variance. Impact is on production (rehabilitated via rescore) |
+| 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
 
 ## How to add an entry
 

From b79d4ed31030bcad51431005712e2b6fd75fe69f Mon Sep 17 00:00:00 2001
From: Jack Arturo <info@verygoodplugins.com>
Date: Tue, 10 Mar 2026 15:06:53 +0530
Subject: [PATCH 3/6] fix(benchmarks): correct temporal answer matching in
 LoCoMo evaluator
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two bugs in check_answer_in_memories caused 22% temporal accuracy
(should be 92%):

1. match_dates_fuzzy compared question dates vs memory dates, but
   temporal questions ("When did X happen?") contain no dates — the
   date is in the expected answer. Now compares answer dates vs memory
   session_datetime with 1-day tolerance.

2. Strategy 2 (semantic search) only checked memory content text,
   ignoring session_datetime metadata. Added _session_datetime_to_words
   helper that decomposes ISO timestamps into matchable tokens (year,
   month name, day) and injects them into searchable text.

Result: locomo-mini 76.97% → 91.78% (+14.81pp). Temporal category
22.2% → 92.1% (+69.8pp). Beats CORE SOTA by 3.54pp.

Also adds Category Breakdown table to EXPERIMENT_LOG.md so per-category
regressions are visible at a glance.

Made-with: Cursor
---
 benchmarks/EXPERIMENT_LOG.md    | 14 +++++++--
 tests/benchmarks/test_locomo.py | 54 ++++++++++++++++++++++++++++-----
 2 files changed, 59 insertions(+), 9 deletions(-)

diff --git a/benchmarks/EXPERIMENT_LOG.md b/benchmarks/EXPERIMENT_LOG.md
index 1f8fe467..422bb903 100644
--- a/benchmarks/EXPERIMENT_LOG.md
+++ b/benchmarks/EXPERIMENT_LOG.md
@@ -25,13 +25,23 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
 | 2026-03-02 | PR #87 | jescalan/feat/write-time-dedup | 76.97% (+0.0) | -- | -- | Write-time dedup gate. Neutral on recall (expected) |
 | 2026-03-02 | #78 | exp/78-decay-fix | 76.97% (+0.0) | 79.51% (-0.55) | -- | Decay rate 0.1→0.01, importance floor, archive filter. Within variance. Impact is on production (rehabilitated via rescore) |
 | 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
+| 2026-03-10 | temporal-fix | docs/benchmark-agent-guidelines | **91.78% (+14.81)** | -- | -- | Fix: match_dates_fuzzy was comparing question dates vs memory (always empty); now compares answer dates. Also: Strategy 2 now includes session_datetime in searchable text. |
+
+### Category Breakdown (LoCoMo-mini)
+
+| Date | Issue/PR | Single-hop | Temporal | Multi-hop | Open Domain | Complex |
+|------|----------|------------|----------|-----------|-------------|---------|
+| 2026-03-02 | baseline | 76.7% (33/43) | 22.2% (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100% (71/71) |
+| 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2% (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100% (71/71) |
+| 2026-03-10 | temporal-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | 100% (71/71) |
 
 ## How to add an entry
 
 1. Run the benchmark: `make bench-eval BENCH=locomo-mini CONFIG=baseline`
 2. Record the overall accuracy from the output JSON
-3. Add a row to the table above with the date, issue/PR, branch, and scores
-4. For deltas, show as `XX.X% (+Y.Y)` relative to the baseline row
+3. Add a row to the Results table with the date, issue/PR, branch, and scores
+4. Add a row to the Category Breakdown table with per-category scores
+5. For deltas, show as `XX.X% (+Y.Y)` relative to the baseline row
 
 ## Snapshot metadata
 
diff --git a/tests/benchmarks/test_locomo.py b/tests/benchmarks/test_locomo.py
index b3e06910..2adcc1c1 100644
--- a/tests/benchmarks/test_locomo.py
+++ b/tests/benchmarks/test_locomo.py
@@ -486,6 +486,26 @@ def _extract_speaker_from_question(self, question: str) -> Optional[str]:
 
         return None
 
+    @staticmethod
+    def _session_datetime_to_words(iso_str: str) -> str:
+        """Decompose an ISO-8601 timestamp into human-readable date words.
+
+        '2023-05-08T13:56:00+00:00' -> '2023 may 8 08 05 may'
+        This lets word-overlap matching find '2023', 'may', '8', etc.
+        """
+        if not iso_str:
+            return ""
+        try:
+            dt = date_parser.parse(iso_str)
+            month_name = dt.strftime("%B").lower()  # 'may'
+            month_abbr = dt.strftime("%b").lower()  # 'may'
+            return (
+                f"{dt.year} {month_name} {month_abbr} {dt.day} "
+                f"{dt.strftime('%d')} {dt.strftime('%m')}"
+            )
+        except (ValueError, OverflowError):
+            return ""
+
     def is_temporal_question(self, question: str) -> bool:
         """Detect if question is asking about time/dates"""
         temporal_keywords = [
@@ -1030,7 +1050,7 @@ def check_answer_in_memories(
 
                     # For temporal questions, try fuzzy date matching across the joined evidence
                     if self.is_temporal_question(question) and self.match_dates_fuzzy(
-                        question, joined_text
+                        str(expected_answer), joined_text
                     ):
                         return (
                             True,
@@ -1084,12 +1104,15 @@ def check_answer_in_memories(
 
                     # Phase 1 Improvement: For temporal questions, also check session_datetime
                     if is_temporal:
-                        session_datetime = metadata.get("session_datetime", "").lower()
-                        # Combine content and datetime for temporal matching
-                        searchable_text = f"{content_normalized} {session_datetime}"
-
-                        # Quick Win #1: Fuzzy date matching for temporal questions
-                        if self.match_dates_fuzzy(question, content + " " + session_datetime):
+                        session_datetime = metadata.get("session_datetime", "")
+                        session_readable = self._session_datetime_to_words(session_datetime)
+                        searchable_text = f"{content_normalized} {session_readable}"
+
+                        # Fuzzy date matching: compare ANSWER dates vs memory dates
+                        if self.match_dates_fuzzy(
+                            str(expected_answer),
+                            content + " " + session_datetime,
+                        ):
                             return (
                                 True,
                                 0.95,
@@ -1127,6 +1150,23 @@ def check_answer_in_memories(
             content = memory.get("content", "").lower()
             content_normalized = self.normalize_answer(content)
 
+            # For temporal questions, enrich searchable text with session_datetime
+            if is_temporal:
+                metadata = memory.get("metadata", {})
+                session_dt = metadata.get("session_datetime", "")
+                session_words = self._session_datetime_to_words(session_dt)
+                content_normalized = f"{content_normalized} {session_words}"
+
+                # Fuzzy date matching: compare answer dates vs memory dates
+                if session_dt and self.match_dates_fuzzy(
+                    str(expected_answer), content + " " + session_dt
+                ):
+                    return (
+                        True,
+                        0.95,
+                        f"Date match in memory {memory.get('id', '?')[:8]}",
+                    )
+
             # Exact substring match
             if expected_normalized in content_normalized:
                 confidence = 1.0

From bbf2720e46bdb1ad983613d874636726853e11b1 Mon Sep 17 00:00:00 2001
From: Jack Arturo <info@verygoodplugins.com>
Date: Tue, 10 Mar 2026 15:33:10 +0530
Subject: [PATCH 4/6] fix(benchmarks): skip category 5 (no ground truth) in
 LoCoMo evaluator

Category 5 (Complex Reasoning) questions either have no `answer` field
or only trivial yes/no answers. The evaluator was getting 100% because
`"" in <any string>` is always True, and `"no" in "know"` matches as
substring. These are designed for LLM-judge evaluation per the LoCoMo
paper.

Now skips all category 5 uniformly and reports as N/A. Overall honest
score: 89.27% (208/233 on cats 1-4), still beats CORE by 1.03pp.

Adds footnotes to EXPERIMENT_LOG.md explaining the two evaluator bugs
(temporal false negatives + complex false positives) so future readers
understand why earlier rows show different numbers.

Made-with: Cursor
---
 benchmarks/EXPERIMENT_LOG.md    | 13 ++++--
 tests/benchmarks/test_locomo.py | 78 ++++++++++++++++++++++++++++++---
 2 files changed, 80 insertions(+), 11 deletions(-)

diff --git a/benchmarks/EXPERIMENT_LOG.md b/benchmarks/EXPERIMENT_LOG.md
index 422bb903..dd50ade8 100644
--- a/benchmarks/EXPERIMENT_LOG.md
+++ b/benchmarks/EXPERIMENT_LOG.md
@@ -25,15 +25,20 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
 | 2026-03-02 | PR #87 | jescalan/feat/write-time-dedup | 76.97% (+0.0) | -- | -- | Write-time dedup gate. Neutral on recall (expected) |
 | 2026-03-02 | #78 | exp/78-decay-fix | 76.97% (+0.0) | 79.51% (-0.55) | -- | Decay rate 0.1→0.01, importance floor, archive filter. Within variance. Impact is on production (rehabilitated via rescore) |
 | 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
-| 2026-03-10 | temporal-fix | docs/benchmark-agent-guidelines | **91.78% (+14.81)** | -- | -- | Fix: match_dates_fuzzy was comparing question dates vs memory (always empty); now compares answer dates. Also: Strategy 2 now includes session_datetime in searchable text. |
+| 2026-03-10 | eval-fix | docs/benchmark-agent-guidelines | **89.27% (208/233)** | -- | -- | Fix temporal matching (answer vs memory dates) + skip cat5 (no ground truth). Honest score, beats CORE by 1.03pp. |
 
 ### Category Breakdown (LoCoMo-mini)
 
+Categories 1-4 scored by word-overlap/date matching. Category 5 requires LLM judge (not yet implemented).
+
 | Date | Issue/PR | Single-hop | Temporal | Multi-hop | Open Domain | Complex |
 |------|----------|------------|----------|-----------|-------------|---------|
-| 2026-03-02 | baseline | 76.7% (33/43) | 22.2% (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100% (71/71) |
-| 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2% (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100% (71/71) |
-| 2026-03-10 | temporal-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | 100% (71/71) |
+| 2026-03-02 | baseline | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
+| 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
+| 2026-03-10 | eval-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | N/A (71 skipped) |
+
+\* Temporal was artificially low: evaluator compared question dates (empty) vs memory dates instead of answer dates.
+\*\* Complex was artificially 100%: dataset has no `answer` field for cat5 → empty string → `"" in content` always True.
 
 ## How to add an entry
 
diff --git a/tests/benchmarks/test_locomo.py b/tests/benchmarks/test_locomo.py
index 2adcc1c1..9d1427aa 100644
--- a/tests/benchmarks/test_locomo.py
+++ b/tests/benchmarks/test_locomo.py
@@ -1255,6 +1255,22 @@ def _evaluate_only(self, conversation: Dict[str, Any], sample_id: str) -> Dict[s
             category = qa.get("category", 0)
             evidence = qa.get("evidence", [])
 
+            # Category 5 (Complex Reasoning) needs an LLM judge — the
+            # dataset's ground-truth is either absent or trivial (yes/no).
+            if category == 5:
+                qa_results.append(
+                    {
+                        "question": question,
+                        "expected_answer": qa.get("adversarial_answer", answer),
+                        "category": category,
+                        "is_correct": None,
+                        "confidence": 0.0,
+                        "recalled_count": 0,
+                        "explanation": "Skipped: requires LLM judge",
+                    }
+                )
+                continue
+
             if evidence and len(evidence) > 1:
                 recalled_memories = self.multi_hop_recall_with_graph(
                     question,
@@ -1289,11 +1305,16 @@ def _evaluate_only(self, conversation: Dict[str, Any], sample_id: str) -> Dict[s
             if (i + 1) % 10 == 0:
                 print(f"  Processed {i+1}/{len(questions)} questions...")
 
-        correct_count = sum(1 for r in qa_results if r["is_correct"])
-        total_count = len(qa_results)
+        scored = [r for r in qa_results if r["is_correct"] is not None]
+        skipped = len(qa_results) - len(scored)
+        correct_count = sum(1 for r in scored if r["is_correct"])
+        total_count = len(scored)
         accuracy = correct_count / total_count if total_count > 0 else 0.0
 
-        print(f"\nConversation Results: {accuracy:.2%} ({correct_count}/{total_count})")
+        msg = f"\nConversation Results: {accuracy:.2%} ({correct_count}/{total_count})"
+        if skipped:
+            msg += f"  [{skipped} skipped (no ground truth)]"
+        print(msg)
 
         return {
             "sample_id": sample_id,
@@ -1337,6 +1358,24 @@ def evaluate_conversation(self, conversation: Dict[str, Any], sample_id: str) ->
             category = qa.get("category", 0)
             evidence = qa.get("evidence", [])
 
+            # Category 5 (Complex Reasoning) needs an LLM judge — the
+            # dataset's ground-truth is either absent or trivial (yes/no).
+            if category == 5:
+                qa_results.append(
+                    {
+                        "question": question,
+                        "expected_answer": qa.get("adversarial_answer", answer),
+                        "category": category,
+                        "is_correct": None,
+                        "confidence": 0.0,
+                        "recalled_count": 0,
+                        "explanation": "Skipped: requires LLM judge",
+                    }
+                )
+                if (i + 1) % 10 == 0:
+                    print(f"  Processed {i+1}/{len(questions)} questions...")
+                continue
+
             # Recall memories for this question
             # Use graph expansion for multi-hop questions (evidence > 1)
             if evidence and len(evidence) > 1:
@@ -1378,13 +1417,16 @@ def evaluate_conversation(self, conversation: Dict[str, Any], sample_id: str) ->
             if (i + 1) % 10 == 0:
                 print(f"  Processed {i+1}/{len(questions)} questions...")
 
-        # Calculate conversation-level statistics
-        correct_count = sum(1 for r in qa_results if r["is_correct"])
-        total_count = len(qa_results)
+        # Calculate conversation-level statistics (exclude skipped/None results)
+        scored = [r for r in qa_results if r["is_correct"] is not None]
+        skipped = len(qa_results) - len(scored)
+        correct_count = sum(1 for r in scored if r["is_correct"])
+        total_count = len(scored)
         accuracy = correct_count / total_count if total_count > 0 else 0.0
 
+        skip_note = f"  [{skipped} skipped (no ground truth)]" if skipped else ""
         print(f"\n📊 Conversation Results:")
-        print(f"  Accuracy: {accuracy:.2%} ({correct_count}/{total_count})")
+        print(f"  Accuracy: {accuracy:.2%} ({correct_count}/{total_count}){skip_note}")
 
         return {
             "sample_id": sample_id,
@@ -1525,6 +1567,14 @@ def run_benchmark(
             5: "Complex Reasoning",
         }
 
+        # Count skipped category-5 questions for reporting
+        cat5_skipped = sum(
+            1
+            for cr in conversation_results
+            for qa in cr.get("qa_results", [])
+            if qa["category"] == 5 and qa["is_correct"] is None
+        )
+
         category_results = {}
         for category, scores in sorted(self.results.items()):
             correct = sum(scores)
@@ -1540,6 +1590,20 @@ def run_benchmark(
                 f"  {category_names.get(category, f'Category {category}'):25s}: {accuracy:6.2%} ({correct:3d}/{total:3d})"
             )
 
+        if cat5_skipped:
+            cat5_name = category_names[5]
+            if 5 not in category_results:
+                category_results[5] = {
+                    "name": cat5_name,
+                    "accuracy": None,
+                    "correct": 0,
+                    "total": cat5_skipped,
+                    "skipped": True,
+                }
+            else:
+                category_results[5]["skipped_count"] = cat5_skipped
+            print(f"  {cat5_name:25s}:    N/A ({cat5_skipped:3d} skipped, needs LLM judge)")
+
         # Comparison with CORE
         core_sota = 0.8824
         improvement = overall_accuracy - core_sota

From b527d765d5af81eee541a8dee6f91a90057728c8 Mon Sep 17 00:00:00 2001
From: Jack Arturo <info@verygoodplugins.com>
Date: Tue, 10 Mar 2026 15:36:49 +0530
Subject: [PATCH 5/6] docs: update README benchmark scores to reflect evaluator
 fixes

Replace inflated 90.53% with honest 89.27% (categories 1-4, Voyage 4).
Add note explaining the two evaluator bugs that affected prior numbers.
Point to benchmarks/EXPERIMENT_LOG.md as source of truth for baselines.
Update make commands to reference new snapshot-based bench system.

Made-with: Cursor
---
 README.md | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index f9e599c4..75269485 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@
 
 # **AI Memory That Actually Learns**
 
-AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **90.53% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%).
+AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **89.27% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for current baselines.
 
 **Deploy in 60 seconds:**
 
@@ -522,7 +522,7 @@ Vector databases match embeddings. AutoMem builds knowledge graphs:
 
 AutoMem saves you months of iteration:
 
-- ✅ **Benchmark-proven** - 90.53% on LoCoMo (ACL 2024)
+- ✅ **Benchmark-proven** - 89.27% on LoCoMo (ACL 2024), beats CORE SOTA
 - ✅ **Research-validated** - Implements HippoRAG 2, A-MEM, MELODI, ReadAgent principles
 - ✅ **Production-ready** - Auth, admin tools, health monitoring, automated backups
 - ✅ **Battle-tested** - Enrichment pipeline, consolidation engine, retry logic, dual storage
@@ -532,24 +532,26 @@ AutoMem saves you months of iteration:
 
 ### LoCoMo Benchmark (ACL 2024)
 
-**90.53% overall accuracy** across 1,986 questions:
+**89.27% accuracy** on categories 1–4 (233 scored questions, Voyage 4 embeddings):
 
 | Category                   | AutoMem    | Notes                                   |
 | -------------------------- | ---------- | --------------------------------------- |
-| **Complex Reasoning**      | **100%**   | Perfect score on multi-step reasoning   |
-| **Open Domain**            | **95.84%** | General knowledge recall                |
-| **Temporal Understanding** | **85.05%** | Time-aware queries                      |
-| **Single-hop Recall**      | **79.79%** | Basic fact retrieval                    |
-| **Multi-hop Reasoning**    | **50.00%** | Connecting disparate memories (+12.5pp) |
+| **Open Domain**            | **96.49%** | General knowledge recall                |
+| **Temporal Understanding** | **92.06%** | Time-aware queries                      |
+| **Single-hop Recall**      | **79.07%** | Basic fact retrieval                    |
+| **Multi-hop Reasoning**    | **46.15%** | Connecting disparate memories           |
+| **Complex Reasoning**      | N/A        | Requires LLM judge (not yet scored)     |
 
 **Comparison with other systems:**
 
 | System | Score |
 |--------|-------|
-| AutoMem | 90.53% |
+| AutoMem | 89.27% |
 | CORE | 88.24% |
 
-Run the benchmark yourself: `make test-locomo`
+> **Note:** Earlier versions reported 90.53% which included two evaluator bugs: temporal matching compared the wrong text (false negatives → 22%) and category 5 matched empty strings (false positives → 100%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for full history.
+
+Run benchmarks: `make bench-eval BENCH=locomo-mini` (quick) or `make bench-eval BENCH=locomo` (full)
 
 ### Production Characteristics
 

From 9c759a18214a1c8f93041cfeb042fce9bad2a56b Mon Sep 17 00:00:00 2001
From: Jack Arturo <info@verygoodplugins.com>
Date: Tue, 10 Mar 2026 15:51:42 +0530
Subject: [PATCH 6/6] fix(benchmarks): address CodeRabbit review findings

- Remove empty f-string (F541) in evaluate_conversation print
- Add _session_datetime_to_words to multi-hop joined text so
  word-overlap fallback can match date components (same fix already
  applied to Strategy 1 and Strategy 2)
- Add caveat to CORE comparison when cat5 is excluded, noting that
  CORE's 88.24% includes cat5 via GPT-4 judge

Made-with: Cursor
---
 tests/benchmarks/test_locomo.py | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/tests/benchmarks/test_locomo.py b/tests/benchmarks/test_locomo.py
index 9d1427aa..52749239 100644
--- a/tests/benchmarks/test_locomo.py
+++ b/tests/benchmarks/test_locomo.py
@@ -1036,7 +1036,6 @@ def check_answer_in_memories(
 
                 # Quick Win: Multi-hop joining — evaluate on concatenated evidence
                 if len(evidence_dialog_ids) > 1:
-                    # Build a single searchable text by concatenating evidence contents and session times
                     joined_text_parts = []
                     for mem in evidence_memories:
                         content = mem.get("content", "")
@@ -1045,6 +1044,7 @@ def check_answer_in_memories(
                         joined_text_parts.append(str(content))
                         if session_dt:
                             joined_text_parts.append(str(session_dt))
+                            joined_text_parts.append(self._session_datetime_to_words(session_dt))
                     joined_text = " \n ".join(joined_text_parts).lower()
                     joined_norm = self.normalize_answer(joined_text)
 
@@ -1425,7 +1425,7 @@ def evaluate_conversation(self, conversation: Dict[str, Any], sample_id: str) ->
         accuracy = correct_count / total_count if total_count > 0 else 0.0
 
         skip_note = f"  [{skipped} skipped (no ground truth)]" if skipped else ""
-        print(f"\n📊 Conversation Results:")
+        print("\n📊 Conversation Results:")
         print(f"  Accuracy: {accuracy:.2%} ({correct_count}/{total_count}){skip_note}")
 
         return {
@@ -1604,18 +1604,22 @@ def run_benchmark(
                 category_results[5]["skipped_count"] = cat5_skipped
             print(f"  {cat5_name:25s}:    N/A ({cat5_skipped:3d} skipped, needs LLM judge)")
 
-        # Comparison with CORE
+        # Comparison with CORE (their 88.24% includes cat5 via GPT-4 judge)
         core_sota = 0.8824
         improvement = overall_accuracy - core_sota
-        print(f"\n🏆 Comparison with CORE (SOTA):")
+        print("\n🏆 Comparison with CORE (SOTA):")
         print(f"  CORE: {core_sota:.2%}")
         print(f"  AutoMem: {overall_accuracy:.2%}")
+        if cat5_skipped:
+            print(
+                f"  ⚠️  AutoMem excludes {cat5_skipped} cat-5 Qs (needs LLM judge); CORE includes them"
+            )
         if improvement > 0:
-            print(f"  🎉 AutoMem BEATS CORE by {improvement:.2%}!")
+            print(f"  🎉 AutoMem leads by {improvement:.2%}")
         elif improvement < 0:
             print(f"  📉 AutoMem is {abs(improvement):.2%} behind CORE")
         else:
-            print(f"  🤝 AutoMem matches CORE")
+            print("  🤝 AutoMem matches CORE")
 
         # Cleanup
         if cleanup_after:
@@ -1636,6 +1640,8 @@ def run_benchmark(
                 "core_sota": core_sota,
                 "automem": overall_accuracy,
                 "improvement": improvement,
+                "cat5_excluded": cat5_skipped,
+                "note": "CORE 88.24% includes cat-5 via GPT-4 judge" if cat5_skipped else None,
             },
         }