verygoodplugins · jack-arturo · Mar 10, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -47,8 +47,8 @@ The benchmark system uses **snapshot-based evaluation**: ingest once, eval many
 | Tier | Benchmark | Command | Runtime | Cost | When to use |
 |------|-----------|---------|---------|------|-------------|
 | 0 | Unit tests | `make test` | 30s | free | Every change |
-| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free | Rapid iteration |
-| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free | Before merge |
+| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free / ~$0.20 with judge | Rapid iteration |
+| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free / ~$1-3 with judge | Before merge |
 | 3 | LongMemEval-mini (20 Qs) | `make bench-mini-longmemeval` | 15 min | ~$1 | Scoring/entity changes |
 | 4 | LongMemEval-full (500 Qs) | `make test-longmemeval` | 1-2 hr | ~$10 | Milestones only |
 

diff --git a/README.md b/README.md
@@ -14,7 +14,7 @@
 
 # **AI Memory That Actually Learns**
 
-AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **89.27% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for current baselines.
+AutoMem is a **production-grade long-term memory system** for AI assistants with transparent baselines on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024): **89.27%** on `locomo-mini` (categories 1-4, category 5 skipped) and **87.56%** on full `locomo` with the opt-in category-5 judge enabled. See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for methodology and current baselines.
 
 **Deploy in 60 seconds:**
 
@@ -522,7 +522,7 @@ Vector databases match embeddings. AutoMem builds knowledge graphs:
 
 AutoMem saves you months of iteration:
 
-- ✅ **Benchmark-proven** - 89.27% on LoCoMo (ACL 2024), beats CORE SOTA
+- ✅ **Benchmark-proven** - Transparent LoCoMo baselines for both judge-off and judge-on evaluation
 - ✅ **Research-validated** - Implements HippoRAG 2, A-MEM, MELODI, ReadAgent principles
 - ✅ **Production-ready** - Auth, admin tools, health monitoring, automated backups
 - ✅ **Battle-tested** - Enrichment pipeline, consolidation engine, retry logic, dual storage
@@ -532,26 +532,36 @@ AutoMem saves you months of iteration:
 
 ### LoCoMo Benchmark (ACL 2024)
 
-**89.27% accuracy** on categories 1–4 (233 scored questions, Voyage 4 embeddings):
+AutoMem publishes two reference baselines with Voyage 4 embeddings:
+
+| Setup | Scope | Score | Notes |
+|-------|-------|-------|-------|
+| Fast iteration | `locomo-mini`, judge off | **89.27% (208/233)** | Categories 1-4 only; 71 category-5 questions skipped |
+| Full benchmark | `locomo`, judge on (`gpt-4o`) | **87.56% (1739/1986)** | Includes category 5 at 95.74% (427/446) |
+
+`locomo-mini` category breakdown with the judge disabled:
 
 | Category                   | AutoMem    | Notes                                   |
 | -------------------------- | ---------- | --------------------------------------- |
 | **Open Domain**            | **96.49%** | General knowledge recall                |
 | **Temporal Understanding** | **92.06%** | Time-aware queries                      |
 | **Single-hop Recall**      | **79.07%** | Basic fact retrieval                    |
 | **Multi-hop Reasoning**    | **46.15%** | Connecting disparate memories           |
-| **Complex Reasoning**      | N/A        | Requires LLM judge (not yet scored)     |
+| **Complex Reasoning**      | N/A        | Skipped in this setup; use judge-on run |
 
-**Comparison with other systems:**
+Reference point:
 
 | System | Score |
 |--------|-------|
-| AutoMem | 89.27% |
-| CORE | 88.24% |
+| Published CORE result | 88.24% |
+| AutoMem `locomo-mini` judge off | 89.27% |
+| AutoMem `locomo` judge on | 87.56% |
 
-> **Note:** Earlier versions reported 90.53% which included two evaluator bugs: temporal matching compared the wrong text (false negatives → 22%) and category 5 matched empty strings (false positives → 100%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for full history.
+> **Methodology note:** We do not present this as a strict leaderboard claim. The published CORE number is a useful reference point, but the public LoCoMo setups are not perfectly apples-to-apples, especially around category-5 handling. AutoMem is above that published reference on the `locomo-mini` categories 1-4 run and below it on the full judge-enabled run.
+>
+> **History note:** Earlier versions reported 90.53%, but that included two evaluator bugs: temporal matching compared the wrong text (false negatives) and category 5 matched empty strings (false positives). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for the corrected timeline.
 
-Run benchmarks: `make bench-eval BENCH=locomo-mini` (quick) or `make bench-eval BENCH=locomo` (full)
+Run benchmarks: `make bench-eval BENCH=locomo-mini CONFIG=baseline` (quick) or `BENCH_JUDGE_MODEL=gpt-4o make bench-eval BENCH=locomo CONFIG=baseline` (full, includes category 5)
 
 ### Production Characteristics
 

diff --git a/benchmarks/EXPERIMENT_LOG.md b/benchmarks/EXPERIMENT_LOG.md
@@ -10,8 +10,8 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
 | Tier | Benchmark | Runtime | Cost | When to use |
 |------|-----------|---------|------|-------------|
 | 0 | `make test` (unit) | 30s | free | Every change |
-| 1 | `locomo-mini` (2 convos, 304 Qs) | 2-3 min | free | Rapid iteration |
-| 2 | `locomo` (10 convos, 1986 Qs) | 5-10 min | free | Before merge |
+| 1 | `locomo-mini` (2 convos, 304 Qs) | 2-3 min | free / ~$0.20 with judge | Rapid iteration |
+| 2 | `locomo` (10 convos, 1986 Qs) | 5-10 min | free / ~$1-3 with judge | Before merge |
 | 3 | `longmemeval-mini` (20 Qs) | 15 min | ~$1 | Scoring/entity changes |
 | 4 | `longmemeval` (500 Qs) | 1-2 hr | ~$10 | Milestones only |
 
@@ -26,16 +26,18 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
 | 2026-03-02 | #78 | exp/78-decay-fix | 76.97% (+0.0) | 79.51% (-0.55) | -- | Decay rate 0.1→0.01, importance floor, archive filter. Within variance. Impact is on production (rehabilitated via rescore) |
 | 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
 | 2026-03-10 | eval-fix | docs/benchmark-agent-guidelines | **89.27% (208/233)** | -- | -- | Fix temporal matching (answer vs memory dates) + skip cat5 (no ground truth). Honest score, beats CORE by 1.03pp. |
+| 2026-03-10 | cat5-judge | feat/bench-cat5-judge | **89.80% (273/304)** | **87.56% (1739/1986)** | -- | Opt-in GPT-4o judge for cat5. Full run scored cat5 at 95.74% (427/446) with 0 judge skips/errors; added 90s OpenAI request timeout to prevent stuck full runs. |
 
 ### Category Breakdown (LoCoMo-mini)
 
-Categories 1-4 scored by word-overlap/date matching. Category 5 requires LLM judge (not yet implemented).
+Categories 1-4 are scored by word-overlap/date matching. Category 5 is scored by an opt-in LLM judge, enabled by setting `BENCH_JUDGE_MODEL` or passing `--judge` / `--judge-model`; otherwise it remains `N/A`.
 
 | Date | Issue/PR | Single-hop | Temporal | Multi-hop | Open Domain | Complex |
 |------|----------|------------|----------|-----------|-------------|---------|
 | 2026-03-02 | baseline | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
 | 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
 | 2026-03-10 | eval-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | N/A (71 skipped) |
+| 2026-03-10 | cat5-judge | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | **91.5% (65/71)** |
 
 \* Temporal was artificially low: evaluator compared question dates (empty) vs memory dates instead of answer dates.
 \*\* Complex was artificially 100%: dataset has no `answer` field for cat5 → empty string → `"" in content` always True.
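The category-5 false positive in the footnote above follows from how Python evaluates a substring test against an empty string. The sketch below is illustrative only: `naive_match` and `guarded_match` are hypothetical helpers, not the evaluator's actual functions, and the memory text is a made-up example.

```python
from typing import Optional


def naive_match(answer: str, content: str) -> bool:
    """Unguarded substring check: an empty answer matches any content."""
    return answer in content


def guarded_match(answer: Optional[str], content: str) -> Optional[bool]:
    """Return None (meaning "skip") when there is no ground truth to compare."""
    if answer is None or not answer.strip():
        return None
    return answer.lower() in content.lower()


# Made-up memory text, used only to exercise the two helpers.
memory = "The user adopted a dog in May."

print(naive_match("", memory))         # True, although nothing was compared
print(guarded_match("", memory))       # None, so the question is skipped
print(guarded_match("a dog", memory))  # True
```

Returning a third "skip" value instead of `False` is one way to keep unanswerable questions out of both the numerator and the denominator, which matches the `N/A (71 skipped)` row in the table.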

diff --git a/docs/TESTING.md b/docs/TESTING.md
@@ -218,6 +218,27 @@ make test-integration
 
 AutoMem can be evaluated against the **LoCoMo benchmark** (ACL 2024), which tests long-term conversational memory across 10 conversations and 1,986 questions.
 
+### LoCoMo Cat-5 Judge
+
+Category 5 (complex reasoning) is scored by an evidence-grounded LLM judge, which is opt-in for cost reasons.
+
+```bash
+# Default: categories 1-4 scored, category 5 skipped
+make bench-eval BENCH=locomo-mini CONFIG=baseline
+
+# Enable cat-5 judge with env var
+BENCH_JUDGE_MODEL=gpt-4o make bench-eval BENCH=locomo-mini CONFIG=baseline
+
+# Or use the runner CLI flags directly
+./test-locomo-benchmark.sh --conversations 0,1 --judge
+./test-locomo-benchmark.sh --conversations 0,1 --judge-model gpt-4o-mini
+```
+
+- `BENCH_JUDGE_MODEL` enables category-5 judging for `tests/benchmarks/test_locomo.py`.
+- `--judge` and `--judge-model` both enable the judge; `--judge` defaults to `gpt-4o` unless overridden by `BENCH_JUDGE_MODEL` or `--judge-model`.
+- If the judge is disabled, category 5 remains `N/A`.
+- If the judge is enabled but evidence is missing or the LLM response is invalid, the affected category-5 questions are skipped rather than counted wrong.
+
 ### What is LoCoMo?
 
 LoCoMo evaluates AI systems' ability to remember and reason across very long conversations (300+ turns). It measures performance across 5 categories:
@@ -228,7 +249,14 @@ LoCoMo evaluates AI systems' ability to remember and reason across very long con
 4. **Open Domain** (Category 4) - General knowledge questions
 5. **Complex Reasoning** (Category 5) - Advanced inference tasks
 
-**Comparison**: CORE achieved 88.24% (June 2025). AutoMem achieved 90.53%.
+Published reference point: CORE is widely cited at **88.24%** (June 2025), but public LoCoMo setups are not perfectly apples-to-apples, especially around category-5 handling.
+
+AutoMem currently publishes two LoCoMo baselines:
+
+| Setup | Scope | Score | Notes |
+|------|-------|-------|-------|
+| `locomo-mini`, judge off | 2 conversations, categories 1-4 only | **89.27% (208/233)** | 71 category-5 questions skipped |
+| `locomo`, judge on (`gpt-4o`) | Full 10 conversations | **87.56% (1739/1986)** | Category 5 scored at 95.74% (427/446) |
 
 ### Running the Benchmark
 
@@ -265,24 +293,26 @@ Memory usage:
 Example benchmark output:
 ```text
 📊 FINAL RESULTS
-🎯 Overall Accuracy: 90.53% (1798/1986)
-⏱️ Total Time: 1665s
+🎯 Overall Accuracy: 87.56% (1739/1986)
+⏱️ Total Time: 3497s
 💾 Total Memories Stored: 5882
 
 📈 Category Breakdown:
-  Single-hop Recall        : 79.79% (225/282)
-  Temporal Understanding   : 85.05% (273/321)
-  Multi-hop Reasoning      : 50.00% ( 48/ 96)
-  Open Domain              : 95.84% (806/841)
-  Complex Reasoning        : 100.00% (446/446)
+  Single-hop Recall        : 66.31% (187/282)
+  Temporal Understanding   : 87.23% (280/321)
+  Multi-hop Reasoning      : 45.83% ( 44/ 96)
+  Open Domain              : 95.24% (801/841)
+  Complex Reasoning        : 95.74% (427/446)
 
-📊 Comparison:
+📊 Comparison with published CORE reference:
   CORE: 88.24%
-  AutoMem: 90.53%
+  AutoMem: 87.56%
+  📉 AutoMem is 0.68% behind that reference
 ```
 
-All benchmark reports live in `tests/benchmarks/`.
-```
+If you run without the judge, category 5 will show as `N/A` and the comparison should be treated as directional rather than apples-to-apples.
+
+Current baselines and methodology notes live in `benchmarks/EXPERIMENT_LOG.md`.
 
 ### AutoMem's Advantages
 

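As a quick sanity check on the example output in the `docs/TESTING.md` hunk above, the overall accuracy and the gap to the published CORE reference can be recomputed from the per-category counts. This is a standalone sketch, not part of the benchmark runner; the `accuracy` helper and variable names are hypothetical.

```python
# Per-category (correct, total) counts copied from the example output.
full_run = {
    "Single-hop Recall": (187, 282),
    "Temporal Understanding": (280, 321),
    "Multi-hop Reasoning": (44, 96),
    "Open Domain": (801, 841),
    "Complex Reasoning": (427, 446),
}
CORE_REFERENCE = 88.24  # published CORE score, in percent


def accuracy(counts):
    """Sum correct and total across categories and return the percentage."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct, total, round(100 * correct / total, 2)


correct, total, pct = accuracy(full_run)
print(correct, total, pct)             # 1739 1986 87.56
print(round(pct - CORE_REFERENCE, 2))  # -0.68 (percentage points)
```

The difference is in percentage points, which is how the "0.68% behind" line in the example output should be read.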
diff --git a/scripts/bench/restore_and_eval.sh b/scripts/bench/restore_and_eval.sh
@@ -6,6 +6,10 @@ set -euo pipefail
 BENCH_NAME="${1:-locomo}"
 CONFIG="${2:-baseline}"
 REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
+PYTHON_BIN="${REPO_ROOT}/venv/bin/python"
+if [[ ! -x "$PYTHON_BIN" ]]; then
+    PYTHON_BIN="python3"
+fi
 
 # Shared utilities (colors + wait_for_api)
 source "$(dirname "$0")/../lib/common.sh"
@@ -90,7 +94,7 @@ if [[ "$BENCH_NAME" == locomo* ]]; then
     if [[ "$BENCH_NAME" == "locomo-mini" ]]; then
         EVAL_ARGS="--conversations 0,1 ${EVAL_ARGS}"
     fi
-    python3 tests/benchmarks/test_locomo.py \
+    "$PYTHON_BIN" tests/benchmarks/test_locomo.py \
         --base-url "$AUTOMEM_TEST_BASE_URL" \
         --api-token "$AUTOMEM_TEST_API_TOKEN" \
         ${EVAL_ARGS}
@@ -99,7 +103,7 @@ elif [[ "$BENCH_NAME" == longmemeval* ]]; then
     if [[ "$BENCH_NAME" == "longmemeval-mini" ]]; then
         LONGMEM_ARGS+=(--max-questions 20)
     fi
-    python3 tests/benchmarks/longmemeval/test_longmemeval.py \
+    "$PYTHON_BIN" tests/benchmarks/longmemeval/test_longmemeval.py \
         --base-url "$AUTOMEM_TEST_BASE_URL" \
         --api-token "$AUTOMEM_TEST_API_TOKEN" \
         "${LONGMEM_ARGS[@]}"

diff --git a/test-locomo-benchmark.sh b/test-locomo-benchmark.sh
@@ -19,12 +19,20 @@ SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
 # Shared utilities (colors + wait_for_api)
 source "${SCRIPT_DIR}/scripts/lib/common.sh"
 
+if [ -x "${SCRIPT_DIR}/venv/bin/python" ]; then
+    PYTHON_BIN="${SCRIPT_DIR}/venv/bin/python"
+else
+    PYTHON_BIN="python3"
+fi
+
 # Default configuration
 RUN_LIVE=false
 CONVERSATIONS=""
 RECALL_LIMIT=10
 NO_CLEANUP=false
 OUTPUT_FILE=""
+JUDGE=false
+JUDGE_MODEL=""
 
 # Parse arguments
 while [[ $# -gt 0 ]]; do
@@ -45,6 +53,15 @@ while [[ $# -gt 0 ]]; do
             OUTPUT_FILE="$2"
             shift 2
             ;;
+        --judge)
+            JUDGE=true
+            shift
+            ;;
+        --judge-model)
+            JUDGE=true
+            JUDGE_MODEL="$2"
+            shift 2
+            ;;
         --conversations)
             CONVERSATIONS="$2"
             shift 2
@@ -56,6 +73,8 @@ while [[ $# -gt 0 ]]; do
             echo "  --live              Run against Railway deployment (default: local Docker)"
             echo "  --recall-limit N    Number of memories to recall per question (default: 10)"
             echo "  --conversations I,J Comma-separated conversation indices (e.g. 0,1 for mini mode)"
+            echo "  --judge             Enable category-5 LLM judge (defaults to gpt-4o)"
+            echo "  --judge-model MODEL Set the category-5 judge model (also enables judge)"
             echo "  --no-cleanup        Don't cleanup test data after evaluation"
             echo "  --output FILE       Save results to JSON file"
             echo "  --help, -h          Show this help message"
@@ -64,6 +83,7 @@ while [[ $# -gt 0 ]]; do
             echo "  $0                                    # Run locally"
             echo "  $0 --live                             # Run against Railway"
             echo "  $0 --conversations 0,1                # Mini mode (2 conversations)"
+            echo "  $0 --conversations 0,1 --judge        # Mini mode with cat-5 judge"
             echo "  $0 --recall-limit 20 --output results.json"
             exit 0
             ;;
@@ -159,7 +179,7 @@ else
 fi
 
 # Build python command
-PYTHON_CMD="python3 $SCRIPT_DIR/tests/benchmarks/test_locomo.py"
+PYTHON_CMD="$PYTHON_BIN $SCRIPT_DIR/tests/benchmarks/test_locomo.py"
 PYTHON_CMD="$PYTHON_CMD --base-url $AUTOMEM_TEST_BASE_URL"
 PYTHON_CMD="$PYTHON_CMD --api-token $AUTOMEM_TEST_API_TOKEN"
 PYTHON_CMD="$PYTHON_CMD --recall-limit $RECALL_LIMIT"
@@ -176,6 +196,14 @@ if [ -n "$OUTPUT_FILE" ]; then
     PYTHON_CMD="$PYTHON_CMD --output $OUTPUT_FILE"
 fi
 
+if [ "$JUDGE" = true ]; then
+    PYTHON_CMD="$PYTHON_CMD --judge"
+fi
+
+if [ -n "$JUDGE_MODEL" ]; then
+    PYTHON_CMD="$PYTHON_CMD --judge-model $JUDGE_MODEL"
+fi
+
 echo ""
 echo -e "${BLUE}🚀 Starting benchmark evaluation...${NC}"
 echo ""
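The shell wrapper above only forwards `--judge` and `--judge-model`; the effective model is decided on the Python side. The sketch below shows one way that resolution could work. It is an assumption, not the actual code in `tests/benchmarks/test_locomo.py`: the `resolve_judge_model` function is hypothetical, and the precedence (explicit `--judge-model`, then `BENCH_JUDGE_MODEL`, then the `gpt-4o` default) is a guess, since the docs say both can override the default without stating which of the two wins.

```python
import argparse
from typing import Mapping, Optional, Sequence

DEFAULT_JUDGE_MODEL = "gpt-4o"  # default named in the --help text above


def resolve_judge_model(argv: Sequence[str], env: Mapping[str, str]) -> Optional[str]:
    """Return the judge model to use, or None when category 5 stays N/A."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--judge", action="store_true")
    parser.add_argument("--judge-model", default=None)
    # Ignore unrelated runner flags such as --conversations.
    args, _unknown = parser.parse_known_args(list(argv))

    env_model = env.get("BENCH_JUDGE_MODEL") or None
    # ASSUMED precedence: explicit flag, then environment, then default.
    if args.judge_model:
        return args.judge_model
    if env_model:
        return env_model
    if args.judge:
        return DEFAULT_JUDGE_MODEL
    return None


print(resolve_judge_model(["--conversations", "0,1"], {}))             # None
print(resolve_judge_model(["--conversations", "0,1", "--judge"], {}))  # gpt-4o
print(resolve_judge_model(["--judge-model", "gpt-4o-mini"], {}))       # gpt-4o-mini
```

Passing the environment in as a mapping (for example `os.environ`) rather than reading it inside the function keeps the resolution easy to test.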

diff --git a/tests/benchmarks/BENCHMARK_2025-11-08.md b/tests/benchmarks/BENCHMARK_2025-11-08.md
@@ -1,5 +1,7 @@
 # AutoMem Benchmark Results
 
+> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so these scores and comparisons are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology.
+
 ## LoCoMo Benchmark (Long-term Conversational Memory)
 
 **Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations)

diff --git a/tests/benchmarks/BENCHMARK_2025-11-20.md b/tests/benchmarks/BENCHMARK_2025-11-20.md
@@ -1,5 +1,7 @@
 # AutoMem Benchmark Results
 
+> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so the scores and any "SOTA" language here are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology.
+
 ## LoCoMo Benchmark (Long-term Conversational Memory)
 
 **Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations)

diff --git a/tests/benchmarks/BENCHMARK_2025-12-02.md b/tests/benchmarks/BENCHMARK_2025-12-02.md
@@ -1,5 +1,7 @@
 # AutoMem Benchmark Results
 
+> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so the scores and any "SOTA" language here are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology.
+
 ## LoCoMo Benchmark (Long-term Conversational Memory)
 
 **Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations)