diff --git a/AGENTS.md b/AGENTS.md
index c343a85a..29794fc5 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -47,8 +47,8 @@ The benchmark system uses **snapshot-based evaluation**: ingest once, eval many
 | Tier | Benchmark | Command | Runtime | Cost | When to use |
 |------|-----------|---------|---------|------|-------------|
 | 0 | Unit tests | `make test` | 30s | free | Every change |
-| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free | Rapid iteration |
-| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free | Before merge |
+| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free / ~$0.20 with judge | Rapid iteration |
+| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free / ~$1-3 with judge | Before merge |
 | 3 | LongMemEval-mini (20 Qs) | `make bench-mini-longmemeval` | 15 min | ~$1 | Scoring/entity changes |
 | 4 | LongMemEval-full (500 Qs) | `make test-longmemeval` | 1-2 hr | ~$10 | Milestones only |
 
diff --git a/README.md b/README.md
index 75269485..2b9429af 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@
 
 # **AI Memory That Actually Learns**
 
-AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **89.27% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for current baselines.
+AutoMem is a **production-grade long-term memory system** for AI assistants with transparent [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) baselines (ACL 2024): **89.27%** on `locomo-mini` categories 1-4 with category 5 skipped, and **87.56%** on full `locomo` with the opt-in category-5 judge enabled. See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for methodology and current baselines.
 
 **Deploy in 60 seconds:**
 
@@ -522,7 +522,7 @@ Vector databases match embeddings. AutoMem builds knowledge graphs:
 
 AutoMem saves you months of iteration:
 
-- ✅ **Benchmark-proven** - 89.27% on LoCoMo (ACL 2024), beats CORE SOTA
+- ✅ **Benchmark-proven** - Transparent LoCoMo baselines for both judge-off and judge-on evaluation
 - ✅ **Research-validated** - Implements HippoRAG 2, A-MEM, MELODI, ReadAgent principles
 - ✅ **Production-ready** - Auth, admin tools, health monitoring, automated backups
 - ✅ **Battle-tested** - Enrichment pipeline, consolidation engine, retry logic, dual storage
@@ -532,7 +532,14 @@ AutoMem saves you months of iteration:
 
 ### LoCoMo Benchmark (ACL 2024)
 
-**89.27% accuracy** on categories 1–4 (233 scored questions, Voyage 4 embeddings):
+AutoMem publishes two reference baselines with Voyage 4 embeddings:
+
+| Setup | Scope | Score | Notes |
+|-------|-------|-------|-------|
+| Fast iteration | `locomo-mini`, judge off | **89.27% (208/233)** | Categories 1-4 only; 71 category-5 questions skipped |
+| Full benchmark | `locomo`, judge on (`gpt-4o`) | **87.56% (1739/1986)** | Includes category 5 at 95.74% (427/446) |
+
+`locomo-mini` category breakdown with the judge disabled:
 
 | Category                   | AutoMem    | Notes                                   |
 | -------------------------- | ---------- | --------------------------------------- |
@@ -540,18 +547,21 @@ AutoMem saves you months of iteration:
 | **Temporal Understanding** | **92.06%** | Time-aware queries                      |
 | **Single-hop Recall**      | **79.07%** | Basic fact retrieval                    |
 | **Multi-hop Reasoning**    | **46.15%** | Connecting disparate memories           |
-| **Complex Reasoning**      | N/A        | Requires LLM judge (not yet scored)     |
+| **Complex Reasoning**      | N/A        | Skipped in this setup; use judge-on run |
 
-**Comparison with other systems:**
+Reference point:
 
 | System | Score |
 |--------|-------|
-| AutoMem | 89.27% |
-| CORE | 88.24% |
+| Published CORE result | 88.24% |
+| AutoMem `locomo-mini` judge off | 89.27% |
+| AutoMem `locomo` judge on | 87.56% |
 
-> **Note:** Earlier versions reported 90.53% which included two evaluator bugs: temporal matching compared the wrong text (false negatives → 22%) and category 5 matched empty strings (false positives → 100%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for full history.
+> **Methodology note:** We do not present this as a strict leaderboard claim. The published CORE number is a useful reference point, but the public LoCoMo setups are not perfectly apples-to-apples, especially around category-5 handling. AutoMem is above that published reference on the `locomo-mini` categories 1-4 run and below it on the full judge-enabled run.
+>
+> **History note:** Earlier versions reported 90.53%, but that included two evaluator bugs: temporal matching compared the wrong text (false negatives) and category 5 matched empty strings (false positives). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for the corrected timeline.
 
-Run benchmarks: `make bench-eval BENCH=locomo-mini` (quick) or `make bench-eval BENCH=locomo` (full)
+Run benchmarks: `make bench-eval BENCH=locomo-mini CONFIG=baseline` (quick) or `BENCH_JUDGE_MODEL=gpt-4o make bench-eval BENCH=locomo CONFIG=baseline` (full, includes category 5)
 
 ### Production Characteristics
 
diff --git a/benchmarks/EXPERIMENT_LOG.md b/benchmarks/EXPERIMENT_LOG.md
index dd50ade8..9b7db1d6 100644
--- a/benchmarks/EXPERIMENT_LOG.md
+++ b/benchmarks/EXPERIMENT_LOG.md
@@ -10,8 +10,8 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
 | Tier | Benchmark | Runtime | Cost | When to use |
 |------|-----------|---------|------|-------------|
 | 0 | `make test` (unit) | 30s | free | Every change |
-| 1 | `locomo-mini` (2 convos, 304 Qs) | 2-3 min | free | Rapid iteration |
-| 2 | `locomo` (10 convos, 1986 Qs) | 5-10 min | free | Before merge |
+| 1 | `locomo-mini` (2 convos, 304 Qs) | 2-3 min | free / ~$0.20 with judge | Rapid iteration |
+| 2 | `locomo` (10 convos, 1986 Qs) | 5-10 min | free / ~$1-3 with judge | Before merge |
 | 3 | `longmemeval-mini` (20 Qs) | 15 min | ~$1 | Scoring/entity changes |
 | 4 | `longmemeval` (500 Qs) | 1-2 hr | ~$10 | Milestones only |
 
@@ -26,16 +26,18 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
 | 2026-03-02 | #78 | exp/78-decay-fix | 76.97% (+0.0) | 79.51% (-0.55) | -- | Decay rate 0.1→0.01, importance floor, archive filter. Within variance. Impact is on production (rehabilitated via rescore) |
 | 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
 | 2026-03-10 | eval-fix | docs/benchmark-agent-guidelines | **89.27% (208/233)** | -- | -- | Fix temporal matching (answer vs memory dates) + skip cat5 (no ground truth). Honest score, beats CORE by 1.03pp. |
+| 2026-03-10 | cat5-judge | feat/bench-cat5-judge | **89.80% (273/304)** | **87.56% (1739/1986)** | -- | Opt-in GPT-4o judge for cat5. Full run scored cat5 at 95.74% (427/446) with 0 judge skips/errors; added 90s OpenAI request timeout to prevent stuck full runs. |
 
 ### Category Breakdown (LoCoMo-mini)
 
-Categories 1-4 scored by word-overlap/date matching. Category 5 requires LLM judge (not yet implemented).
+Categories 1-4 are scored by word-overlap/date matching. Category 5 uses an opt-in LLM judge when `BENCH_JUDGE_MODEL` or `--judge` is enabled; otherwise it remains `N/A`.
 
 | Date | Issue/PR | Single-hop | Temporal | Multi-hop | Open Domain | Complex |
 |------|----------|------------|----------|-----------|-------------|---------|
 | 2026-03-02 | baseline | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
 | 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
 | 2026-03-10 | eval-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | N/A (71 skipped) |
+| 2026-03-10 | cat5-judge | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | **91.5% (65/71)** |
 
 \* Temporal was artificially low: evaluator compared question dates (empty) vs memory dates instead of answer dates.
 \*\* Complex was artificially 100%: dataset has no `answer` field for cat5 → empty string → `"" in content` always True.
diff --git a/docs/TESTING.md b/docs/TESTING.md
index 1fd6fd89..ce89946c 100644
--- a/docs/TESTING.md
+++ b/docs/TESTING.md
@@ -218,6 +218,27 @@ make test-integration
 
 AutoMem can be evaluated against the **LoCoMo benchmark** (ACL 2024), which tests long-term conversational memory across 10 conversations and 1,986 questions.
 
+### LoCoMo Cat-5 Judge
+
+Category 5 (complex reasoning) is scored by an evidence-grounded LLM judge, which is opt-in for cost reasons.
+
+```bash
+# Default: categories 1-4 scored, category 5 skipped
+make bench-eval BENCH=locomo-mini CONFIG=baseline
+
+# Enable cat-5 judge with env var
+BENCH_JUDGE_MODEL=gpt-4o make bench-eval BENCH=locomo-mini CONFIG=baseline
+
+# Or use the runner CLI flags directly
+./test-locomo-benchmark.sh --conversations 0,1 --judge
+./test-locomo-benchmark.sh --conversations 0,1 --judge-model gpt-4o-mini
+```
+
+- `BENCH_JUDGE_MODEL` enables category-5 judging for `tests/benchmarks/test_locomo.py`.
+- `--judge` and `--judge-model` both enable the judge; `--judge` defaults to `gpt-4o` unless overridden by `BENCH_JUDGE_MODEL` or `--judge-model`.
+- If the judge is disabled, category 5 remains `N/A`.
+- If the judge is enabled but evidence is missing or the LLM response is invalid, the affected category-5 questions are skipped rather than counted wrong.
+
 ### What is LoCoMo?
 
 LoCoMo evaluates AI systems' ability to remember and reason across very long conversations (300+ turns). It measures performance across 5 categories:
@@ -228,7 +249,14 @@ LoCoMo evaluates AI systems' ability to remember and reason across very long con
 4. **Open Domain** (Category 4) - General knowledge questions
 5. **Complex Reasoning** (Category 5) - Advanced inference tasks
 
-**Comparison**: CORE achieved 88.24% (June 2025). AutoMem achieved 90.53%.
+Published reference point: CORE reports **88.24%** (June 2025), but public LoCoMo setups are not strictly comparable, especially in how they handle category 5.
+
+AutoMem currently publishes two LoCoMo baselines:
+
+| Setup | Scope | Score | Notes |
+|------|-------|-------|-------|
+| `locomo-mini`, judge off | 2 conversations, categories 1-4 only | **89.27% (208/233)** | 71 category-5 questions skipped |
+| `locomo`, judge on (`gpt-4o`) | Full 10 conversations | **87.56% (1739/1986)** | Category 5 scored at 95.74% (427/446) |
 
 ### Running the Benchmark
 
@@ -265,24 +293,26 @@ Memory usage:
 Example benchmark output:
 ```text
 📊 FINAL RESULTS
-🎯 Overall Accuracy: 90.53% (1798/1986)
-⏱️ Total Time: 1665s
+🎯 Overall Accuracy: 87.56% (1739/1986)
+⏱️ Total Time: 3497s
 💾 Total Memories Stored: 5882
 
 📈 Category Breakdown:
-  Single-hop Recall        : 79.79% (225/282)
-  Temporal Understanding   : 85.05% (273/321)
-  Multi-hop Reasoning      : 50.00% ( 48/ 96)
-  Open Domain              : 95.84% (806/841)
-  Complex Reasoning        : 100.00% (446/446)
+  Single-hop Recall        : 66.31% (187/282)
+  Temporal Understanding   : 87.23% (280/321)
+  Multi-hop Reasoning      : 45.83% ( 44/ 96)
+  Open Domain              : 95.24% (801/841)
+  Complex Reasoning        : 95.74% (427/446)
 
-📊 Comparison:
+📊 Comparison with published CORE reference:
   CORE: 88.24%
-  AutoMem: 90.53%
+  AutoMem: 87.56%
+  📉 AutoMem is 0.68% behind that reference
 ```
 
-All benchmark reports live in `tests/benchmarks/`.
-```
+If you run without the judge, category 5 will show as `N/A` and the comparison should be treated as directional rather than apples-to-apples.
+
+Current baselines and methodology notes live in `benchmarks/EXPERIMENT_LOG.md`.
 
 ### AutoMem's Advantages
 
diff --git a/scripts/bench/restore_and_eval.sh b/scripts/bench/restore_and_eval.sh
index e7ad7768..5d37a935 100755
--- a/scripts/bench/restore_and_eval.sh
+++ b/scripts/bench/restore_and_eval.sh
@@ -6,6 +6,10 @@ set -euo pipefail
 BENCH_NAME="${1:-locomo}"
 CONFIG="${2:-baseline}"
 REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
+PYTHON_BIN="${REPO_ROOT}/venv/bin/python"
+if [[ ! -x "$PYTHON_BIN" ]]; then
+    PYTHON_BIN="python3"
+fi
 
 # Shared utilities (colors + wait_for_api)
 source "$(dirname "$0")/../lib/common.sh"
@@ -90,7 +94,7 @@ if [[ "$BENCH_NAME" == locomo* ]]; then
     if [[ "$BENCH_NAME" == "locomo-mini" ]]; then
         EVAL_ARGS="--conversations 0,1 ${EVAL_ARGS}"
     fi
-    python3 tests/benchmarks/test_locomo.py \
+    "$PYTHON_BIN" tests/benchmarks/test_locomo.py \
         --base-url "$AUTOMEM_TEST_BASE_URL" \
         --api-token "$AUTOMEM_TEST_API_TOKEN" \
         ${EVAL_ARGS}
@@ -99,7 +103,7 @@ elif [[ "$BENCH_NAME" == longmemeval* ]]; then
     if [[ "$BENCH_NAME" == "longmemeval-mini" ]]; then
         LONGMEM_ARGS+=(--max-questions 20)
     fi
-    python3 tests/benchmarks/longmemeval/test_longmemeval.py \
+    "$PYTHON_BIN" tests/benchmarks/longmemeval/test_longmemeval.py \
         --base-url "$AUTOMEM_TEST_BASE_URL" \
         --api-token "$AUTOMEM_TEST_API_TOKEN" \
         "${LONGMEM_ARGS[@]}"
diff --git a/test-locomo-benchmark.sh b/test-locomo-benchmark.sh
index 64071e75..611e2727 100755
--- a/test-locomo-benchmark.sh
+++ b/test-locomo-benchmark.sh
@@ -19,12 +19,20 @@ SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
 # Shared utilities (colors + wait_for_api)
 source "${SCRIPT_DIR}/scripts/lib/common.sh"
 
+if [ -x "${SCRIPT_DIR}/venv/bin/python" ]; then
+    PYTHON_BIN="${SCRIPT_DIR}/venv/bin/python"
+else
+    PYTHON_BIN="python3"
+fi
+
 # Default configuration
 RUN_LIVE=false
 CONVERSATIONS=""
 RECALL_LIMIT=10
 NO_CLEANUP=false
 OUTPUT_FILE=""
+JUDGE=false
+JUDGE_MODEL=""
 
 # Parse arguments
 while [[ $# -gt 0 ]]; do
@@ -45,6 +53,15 @@ while [[ $# -gt 0 ]]; do
             OUTPUT_FILE="$2"
             shift 2
             ;;
+        --judge)
+            JUDGE=true
+            shift
+            ;;
+        --judge-model)
+            JUDGE=true
+            JUDGE_MODEL="$2"
+            shift 2
+            ;;
         --conversations)
             CONVERSATIONS="$2"
             shift 2
@@ -56,6 +73,8 @@ while [[ $# -gt 0 ]]; do
             echo "  --live              Run against Railway deployment (default: local Docker)"
             echo "  --recall-limit N    Number of memories to recall per question (default: 10)"
             echo "  --conversations I,J Comma-separated conversation indices (e.g. 0,1 for mini mode)"
+            echo "  --judge             Enable category-5 LLM judge (defaults to gpt-4o)"
+            echo "  --judge-model MODEL Set the category-5 judge model (also enables judge)"
             echo "  --no-cleanup        Don't cleanup test data after evaluation"
             echo "  --output FILE       Save results to JSON file"
             echo "  --help, -h          Show this help message"
@@ -64,6 +83,7 @@ while [[ $# -gt 0 ]]; do
             echo "  $0                                    # Run locally"
             echo "  $0 --live                             # Run against Railway"
             echo "  $0 --conversations 0,1                # Mini mode (2 conversations)"
+            echo "  $0 --conversations 0,1 --judge        # Mini mode with cat-5 judge"
             echo "  $0 --recall-limit 20 --output results.json"
             exit 0
             ;;
@@ -159,7 +179,7 @@ else
 fi
 
 # Build python command
-PYTHON_CMD="python3 $SCRIPT_DIR/tests/benchmarks/test_locomo.py"
+PYTHON_CMD="$PYTHON_BIN $SCRIPT_DIR/tests/benchmarks/test_locomo.py"
 PYTHON_CMD="$PYTHON_CMD --base-url $AUTOMEM_TEST_BASE_URL"
 PYTHON_CMD="$PYTHON_CMD --api-token $AUTOMEM_TEST_API_TOKEN"
 PYTHON_CMD="$PYTHON_CMD --recall-limit $RECALL_LIMIT"
@@ -176,6 +196,14 @@ if [ -n "$OUTPUT_FILE" ]; then
     PYTHON_CMD="$PYTHON_CMD --output $OUTPUT_FILE"
 fi
 
+if [ "$JUDGE" = true ]; then
+    PYTHON_CMD="$PYTHON_CMD --judge"
+fi
+
+if [ -n "$JUDGE_MODEL" ]; then
+    PYTHON_CMD="$PYTHON_CMD --judge-model $JUDGE_MODEL"
+fi
+
 echo ""
 echo -e "${BLUE}🚀 Starting benchmark evaluation...${NC}"
 echo ""
diff --git a/tests/benchmarks/BENCHMARK_2025-11-08.md b/tests/benchmarks/BENCHMARK_2025-11-08.md
index 604b855a..0b138914 100644
--- a/tests/benchmarks/BENCHMARK_2025-11-08.md
+++ b/tests/benchmarks/BENCHMARK_2025-11-08.md
@@ -1,5 +1,7 @@
 # AutoMem Benchmark Results
 
+> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so these scores and comparisons are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology.
+
 ## LoCoMo Benchmark (Long-term Conversational Memory)
 
 **Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations)
diff --git a/tests/benchmarks/BENCHMARK_2025-11-20.md b/tests/benchmarks/BENCHMARK_2025-11-20.md
index 5435452b..6dbe9515 100644
--- a/tests/benchmarks/BENCHMARK_2025-11-20.md
+++ b/tests/benchmarks/BENCHMARK_2025-11-20.md
@@ -1,5 +1,7 @@
 # AutoMem Benchmark Results
 
+> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so the scores and any "SOTA" language here are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology.
+
 ## LoCoMo Benchmark (Long-term Conversational Memory)
 
 **Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations)
diff --git a/tests/benchmarks/BENCHMARK_2025-12-02.md b/tests/benchmarks/BENCHMARK_2025-12-02.md
index b9016405..c32017fe 100644
--- a/tests/benchmarks/BENCHMARK_2025-12-02.md
+++ b/tests/benchmarks/BENCHMARK_2025-12-02.md
@@ -1,5 +1,7 @@
 # AutoMem Benchmark Results
 
+> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so the scores and any "SOTA" language here are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology.
+
 ## LoCoMo Benchmark (Long-term Conversational Memory)
 
 **Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations)
diff --git a/tests/benchmarks/test_locomo.py b/tests/benchmarks/test_locomo.py
index 52749239..83bdd9f5 100644
--- a/tests/benchmarks/test_locomo.py
+++ b/tests/benchmarks/test_locomo.py
@@ -18,13 +18,14 @@
 - CORE blog: https://blog.heysol.ai/core-build-memory-knowledge-graph-for-individuals-and-achieved-sota-on-locomo-benchmark/
 """
 
+import hashlib
 import json
 import os
 import re
 import sys
 import time
 from collections import defaultdict
-from dataclasses import dataclass
+from dataclasses import dataclass, field
 from datetime import datetime, timedelta, timezone
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
@@ -64,11 +65,20 @@ class LoCoMoConfig:
     # Performance tuning
     batch_size: int = 50  # Memories to store before pausing
     pause_between_batches: float = 0.5  # Seconds to wait between batches
+    judge_model: Optional[str] = field(
+        default_factory=lambda: os.getenv("BENCH_JUDGE_MODEL") or None
+    )
+
+    def __post_init__(self) -> None:
+        if self.judge_model is not None:
+            self.judge_model = self.judge_model.strip() or None
 
 
 class LoCoMoEvaluator:
     """Evaluates AutoMem against the LoCoMo benchmark"""
 
+    OPENAI_REQUEST_TIMEOUT_SECONDS = 90.0
+
     def __init__(self, config: LoCoMoConfig):
         self.config = config
         self.headers = {
@@ -77,17 +87,20 @@ def __init__(self, config: LoCoMoConfig):
         }
         # memory_map is returned per-conversation by _load_batch/_load_individual
         self.results = defaultdict(list)  # Category -> [True/False scores]
+        self.local_conversation_memories = {}  # sample_id -> dialog_id -> prepared memory
 
         # Phase 2: Initialize OpenAI client for LLM-based answer extraction
-        self.openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+        api_key = os.getenv("OPENAI_API_KEY")
+        self.openai_client = OpenAI(api_key=api_key) if api_key else None
+        self.has_openai_api_key = bool(api_key)
         # DISABLED: LLM extraction too slow for iteration; using word-overlap for now
         self.use_llm_extraction = False
 
         # Phase 2.5: Cache LLM responses to avoid redundant API calls
-        self.llm_cache = {}  # (question, answer) -> (result, confidence, explanation)
+        self.llm_cache = {}  # Operation-scoped cache key -> response tuple
 
         # Embedding-based answer checking (fast, handles semantic similarity)
-        self.use_embedding_similarity = bool(os.getenv("OPENAI_API_KEY"))
+        self.use_embedding_similarity = self.has_openai_api_key
         self.embedding_cache = {}  # text -> embedding vector
 
     def health_check(self) -> bool:
@@ -132,6 +145,24 @@ def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
             return 0.0
         return dot / (norm_a * norm_b)
 
+    @staticmethod
+    def _cache_value(value: Any, max_len: int = 500) -> str:
+        """Stringify a cache-key part; truncate long payloads and append a SHA-1 digest."""
+        if isinstance(value, (dict, list)):
+            text = json.dumps(value, sort_keys=True)
+        else:
+            text = str(value)
+
+        if len(text) <= max_len:
+            return text
+
+        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
+        return f"{text[:max_len]}::{digest}"
+
+    def _make_llm_cache_key(self, operation: str, *parts: Any) -> Tuple[str, ...]:
+        """Build a stable, operation-scoped cache key for LLM calls."""
+        return tuple([operation] + [self._cache_value(part) for part in parts])
+
     def cleanup_test_data(self, tag_prefix: str = "locomo-test", max_iterations: int = 200) -> bool:
         """Remove all test memories from AutoMem"""
         print(f"\nCleaning up test memories with tag: {tag_prefix}")
@@ -303,12 +334,23 @@ def load_conversation_into_automem(
         print(f"\nLoading conversation {sample_id} into AutoMem...")
 
         all_memories = self._prepare_conversation_memories(conversation, sample_id)
+        self._cache_prepared_memories(sample_id, all_memories)
 
         if self._has_batch_api():
             return self._load_batch(all_memories, sample_id)
         else:
             return self._load_individual(all_memories, sample_id)
 
+    def _cache_prepared_memories(self, sample_id: str, memories: List[Dict[str, Any]]) -> None:
+        """Cache prepared conversation memories locally by dialog ID for evidence lookup."""
+        memory_index = {}
+        for memory in memories:
+            dialog_id = memory.get("_dia_id")
+            if not dialog_id:
+                continue
+            memory_index[dialog_id] = {k: v for k, v in memory.items() if k != "_dia_id"}
+        self.local_conversation_memories[sample_id] = memory_index
+
     def _load_batch(self, memories: List[Dict[str, Any]], sample_id: str) -> Dict[str, str]:
         """Load memories using batch API (much faster)."""
         memory_map = {}
@@ -748,7 +790,10 @@ def normalize_answer(self, text: str) -> str:
         return " ".join(stemmed)
 
     def fetch_evidence_memories(
-        self, evidence_dialog_ids: List[str], sample_id: str
+        self,
+        evidence_dialog_ids: List[str],
+        sample_id: str,
+        use_local_cache: bool = False,
     ) -> List[Dict[str, Any]]:
         """
         Phase 2.5: Fetch specific evidence memories by dialog ID.
@@ -758,14 +803,21 @@ def fetch_evidence_memories(
         """
         evidence_memories = []
 
+        if use_local_cache:
+            local_index = self.local_conversation_memories.get(sample_id, {})
+            for dialog_id in evidence_dialog_ids:
+                memory = local_index.get(dialog_id)
+                if memory:
+                    evidence_memories.append(memory)
+            return evidence_memories
+
         try:
-            # Get all memories for this conversation
             response = requests.get(
                 f"{self.config.base_url}/recall",
                 headers=self.headers,
                 params={
-                    "query": "",  # Empty query to get all
-                    "limit": 1000,  # High limit to get all conversation memories
+                    "query": "",
+                    "limit": 1000,
                     "tags": f"conversation:{sample_id}",
                     "tag_match": "exact",
                 },
@@ -776,14 +828,11 @@ def fetch_evidence_memories(
                 result = response.json()
                 results = result.get("results", [])
                 all_memories = [r.get("memory", {}) for r in results if "memory" in r]
-
-                # Filter to just the evidence dialogs
                 for memory in all_memories:
                     metadata = memory.get("metadata", {})
                     dialog_id = metadata.get("dialog_id", "")
                     if dialog_id in evidence_dialog_ids:
                         evidence_memories.append(memory)
-
         except Exception as e:
             print(f"⚠️  Evidence fetch error: {e}")
 
@@ -896,7 +945,12 @@ def llm_extract_answer(
         )
         # Ensure is_multi_hop is bool
         is_multi_hop_bool = bool(is_multi_hop)
-        cache_key = (question[:200], answer_str[:100], is_multi_hop_bool)
+        cache_key = self._make_llm_cache_key(
+            "extract_answer",
+            question[:200],
+            answer_str[:100],
+            is_multi_hop_bool,
+        )
         if cache_key in self.llm_cache:
             return self.llm_cache[cache_key]
 
@@ -980,6 +1034,7 @@ def llm_extract_answer(
                 temperature=0.0,
                 max_tokens=200,
                 response_format={"type": "json_object"},
+                timeout=self.OPENAI_REQUEST_TIMEOUT_SECONDS,
             )
 
             # Parse response
@@ -1000,6 +1055,156 @@ def llm_extract_answer(
             self.llm_cache[cache_key] = error_result  # Cache errors too
             return error_result
 
+    def _format_memories_for_llm(
+        self,
+        memories: List[Dict[str, Any]],
+        limit: int = 10,
+        include_session_datetime: bool = True,
+    ) -> str:
+        """Format recalled or evidence memories for prompt context."""
+        if not memories:
+            return "None"
+
+        contexts = []
+        for i, mem in enumerate(memories[:limit]):
+            metadata = mem.get("metadata", {})
+            dialog_id = metadata.get("dialog_id", f"mem-{i}")
+            session_dt = metadata.get("session_datetime", "")
+            content = mem.get("content", "")
+            context = f"[{dialog_id}] {content}"
+            if include_session_datetime and session_dt:
+                context += f" (Session: {session_dt})"
+            contexts.append(context)
+
+        return "\n\n".join(contexts)
+
+    def _parse_json_object_response(self, content: str) -> Dict[str, Any]:
+        """Parse a JSON object response, tolerating fenced markdown wrappers."""
+        content = content.strip()
+        fence_match = re.match(r"^\s*```[a-zA-Z0-9_-]*\s*\n(?P<body>.*)\n\s*```\s*$", content, re.S)
+        if fence_match:
+            content = fence_match.group("body").strip()
+        return json.loads(content)
+
+    def judge_complex_reasoning(
+        self,
+        question: str,
+        adversarial_answer: str,
+        recalled_memories: List[Dict[str, Any]],
+        evidence_dialog_ids: List[str],
+        sample_id: str,
+    ) -> Tuple[Optional[bool], float, str, Optional[str], Optional[str]]:
+        """Judge a category-5 question using recalled memories and evidence dialogs."""
+        if not self.config.judge_model:
+            return None, 0.0, "Skipped: requires LLM judge", None, None
+
+        if not self.openai_client:
+            return None, 0.0, "Skipped: OPENAI_API_KEY not set for LLM judge", None, None
+
+        evidence_memories = self.fetch_evidence_memories(
+            evidence_dialog_ids,
+            sample_id,
+            use_local_cache=True,
+        )
+        if len(evidence_memories) < len(set(evidence_dialog_ids)):
+            return (
+                None,
+                0.0,
+                "Skipped: incomplete evidence memories for LLM judge",
+                None,
+                None,
+            )
+
+        recalled_text = self._format_memories_for_llm(recalled_memories, limit=25)
+        evidence_text = self._format_memories_for_llm(evidence_memories, limit=8)
+        cache_key = self._make_llm_cache_key(
+            "judge_cat5",
+            self.config.judge_model,
+            question,
+            adversarial_answer,
+            evidence_dialog_ids,
+            recalled_text,
+            evidence_text,
+        )
+        if cache_key in self.llm_cache:
+            return self.llm_cache[cache_key]
+
+        prompt = f"""You are judging a LoCoMo category-5 complex reasoning answer.
+
+Question:
+{question}
+
+Adversarial answer (distractor; may be wrong, incomplete, or occasionally overlap with the evidence):
+{adversarial_answer or "None"}
+
+Recalled memories:
+{recalled_text}
+
+Evidence dialogs (ground truth):
+{evidence_text}
+
+Instructions:
+1. Draft the best answer you can using ONLY the recalled memories.
+2. If the recalled memories do not support a positive answer, it is valid to answer with:
+   - "I don't know"
+   - "The recalled memories do not say"
+   - "The question premise or person appears incorrect"
+3. Compare that drafted answer to the evidence dialogs, which are the only ground truth.
+4. Do NOT assume the adversarial answer is always false; use the evidence dialogs to decide.
+5. Mark the answer correct if the drafted answer materially agrees with the evidence dialogs, including when the correct outcome is abstaining, correcting the person/entity, or stating that the premise is unsupported.
+6. Mark the answer incorrect if the drafted answer asserts content contradicted by the evidence, misses clearly available support in the recalled memories, or simply repeats unsupported adversarial content.
+
+Respond with ONLY a JSON object:
+{{
+  "generated_answer": "answer drafted from recalled memories",
+  "verdict": "supported | abstain | contradiction | unsupported | incorrect",
+  "correct": true,
+  "confidence": 0.0,
+  "reasoning": "brief explanation grounded in the evidence dialogs"
+}}"""
+
+        try:
+            response = self.openai_client.chat.completions.create(
+                model=self.config.judge_model,
+                messages=[
+                    {
+                        "role": "system",
+                        "content": (
+                            "You are a strict benchmark judge. "
+                            "Use recalled memories to draft the answer and evidence dialogs "
+                            "only to verify correctness."
+                        ),
+                    },
+                    {"role": "user", "content": prompt},
+                ],
+                temperature=0.0,
+                max_tokens=250,
+                response_format={"type": "json_object"},
+                timeout=self.OPENAI_REQUEST_TIMEOUT_SECONDS,
+            )
+            result = self._parse_json_object_response(response.choices[0].message.content)
+            correct = result.get("correct")
+            if not isinstance(correct, bool):
+                raise ValueError("Judge response missing boolean 'correct'")
+
+            confidence = float(result.get("confidence", 0.0))
+            generated_answer = str(result.get("generated_answer", "")).strip() or None
+            reasoning = str(result.get("reasoning", "")).strip()
+            explanation = f"LLM judge: {reasoning}" if reasoning else "LLM judge"
+            judge_result = (correct, confidence, explanation, generated_answer, reasoning or None)
+            self.llm_cache[cache_key] = judge_result
+            return judge_result
+        except Exception as e:
+            error_result = (
+                None,
+                0.0,
+                f"Skipped: LLM judge error: {e}",
+                None,
+                f"LLM judge error: {e}",
+            )
+            self.llm_cache[cache_key] = error_result
+            return error_result
+
     def check_answer_in_memories(
         self,
         question: str,
@@ -1238,11 +1443,109 @@ def check_answer_in_memories(
 
         return is_correct, max_confidence, explanation
 
+    def _recall_memories_for_qa(
+        self,
+        question: str,
+        sample_id: str,
+        evidence: List[str],
+    ) -> List[Dict[str, Any]]:
+        """Recall memories for a single benchmark QA row."""
+        if evidence and len(evidence) > 1:
+            return self.multi_hop_recall_with_graph(
+                question,
+                sample_id,
+                initial_limit=20,
+                max_connected=60,
+            )
+
+        return self.recall_for_question(
+            question,
+            sample_id,
+            evidence_count=len(evidence),
+        )
+
+    def _evaluate_question(self, qa: Dict[str, Any], sample_id: str) -> Dict[str, Any]:
+        """Evaluate one QA item and return a normalized result payload."""
+        question = qa.get("question", "")
+        answer = qa.get("answer", "")
+        category = qa.get("category", 0)
+        evidence = qa.get("evidence", [])
+        adversarial_answer = qa.get("adversarial_answer")
+
+        base_result = {
+            "question": question,
+            "expected_answer": answer,
+            "adversarial_answer": adversarial_answer,
+            "category": category,
+            "is_correct": None,
+            "confidence": 0.0,
+            "recalled_count": 0,
+            "explanation": "",
+            "judge_generated_answer": None,
+            "judge_reasoning": None,
+        }
+
+        if category == 5 and not answer and not self.config.judge_model:
+            base_result["explanation"] = "Skipped: requires LLM judge"
+            return base_result
+
+        recalled_memories = self._recall_memories_for_qa(question, sample_id, evidence)
+        base_result["recalled_count"] = len(recalled_memories)
+
+        if category == 5 and not answer:
+            (
+                is_correct,
+                confidence,
+                explanation,
+                generated_answer,
+                judge_reasoning,
+            ) = self.judge_complex_reasoning(
+                question,
+                adversarial_answer or "",
+                recalled_memories,
+                evidence,
+                sample_id,
+            )
+            base_result.update(
+                {
+                    "is_correct": is_correct,
+                    "confidence": confidence,
+                    "explanation": explanation,
+                    "judge_generated_answer": generated_answer,
+                    "judge_reasoning": judge_reasoning,
+                }
+            )
+            return base_result
+
+        is_correct, confidence, explanation = self.check_answer_in_memories(
+            question,
+            answer,
+            recalled_memories,
+            evidence,
+            sample_id,
+        )
+        if category == 5:
+            explanation = f"Deterministic cat-5 scoring: {explanation}"
+
+        base_result.update(
+            {
+                "is_correct": is_correct,
+                "confidence": confidence,
+                "explanation": explanation,
+            }
+        )
+        return base_result
+
     def _evaluate_only(self, conversation: Dict[str, Any], sample_id: str) -> Dict[str, Any]:
         """Evaluate a conversation without ingesting data (assumes already loaded)."""
         print(f"\n{'='*60}")
         print(f"Evaluating Conversation (eval-only): {sample_id}")
         print(f"{'='*60}")
+        if "conversation" in conversation:
+            self._cache_prepared_memories(
+                sample_id,
+                self._prepare_conversation_memories(conversation, sample_id),
+            )
 
         # Skip straight to evaluation — same logic as evaluate_conversation but no load step
         qa_results = []
@@ -1250,57 +1553,10 @@ def _evaluate_only(self, conversation: Dict[str, Any], sample_id: str) -> Dict[s
         print(f"\nEvaluating {len(questions)} questions...")
 
         for i, qa in enumerate(questions):
-            question = qa.get("question", "")
-            answer = qa.get("answer", "")
-            category = qa.get("category", 0)
-            evidence = qa.get("evidence", [])
-
-            # Category 5 (Complex Reasoning) needs an LLM judge — the
-            # dataset's ground-truth is either absent or trivial (yes/no).
-            if category == 5:
-                qa_results.append(
-                    {
-                        "question": question,
-                        "expected_answer": qa.get("adversarial_answer", answer),
-                        "category": category,
-                        "is_correct": None,
-                        "confidence": 0.0,
-                        "recalled_count": 0,
-                        "explanation": "Skipped: requires LLM judge",
-                    }
-                )
-                continue
-
-            if evidence and len(evidence) > 1:
-                recalled_memories = self.multi_hop_recall_with_graph(
-                    question,
-                    sample_id,
-                    initial_limit=20,
-                    max_connected=60,
-                )
-            else:
-                recalled_memories = self.recall_for_question(
-                    question,
-                    sample_id,
-                    evidence_count=len(evidence),
-                )
-
-            is_correct, confidence, explanation = self.check_answer_in_memories(
-                question, answer, recalled_memories, evidence, sample_id
-            )
-
-            qa_results.append(
-                {
-                    "question": question,
-                    "expected_answer": answer,
-                    "category": category,
-                    "is_correct": is_correct,
-                    "confidence": confidence,
-                    "recalled_count": len(recalled_memories),
-                    "explanation": explanation,
-                }
-            )
-            self.results[category].append(is_correct)
+            qa_result = self._evaluate_question(qa, sample_id)
+            qa_results.append(qa_result)
+            if qa_result["is_correct"] is not None:
+                self.results[qa_result["category"]].append(qa_result["is_correct"])
 
             if (i + 1) % 10 == 0:
                 print(f"  Processed {i+1}/{len(questions)} questions...")
@@ -1353,65 +1609,12 @@ def evaluate_conversation(self, conversation: Dict[str, Any], sample_id: str) ->
         print(f"\n❓ Evaluating {len(questions)} questions...")
 
         for i, qa in enumerate(questions):
-            question = qa.get("question", "")
-            answer = qa.get("answer", "")
-            category = qa.get("category", 0)
-            evidence = qa.get("evidence", [])
-
-            # Category 5 (Complex Reasoning) needs an LLM judge — the
-            # dataset's ground-truth is either absent or trivial (yes/no).
-            if category == 5:
-                qa_results.append(
-                    {
-                        "question": question,
-                        "expected_answer": qa.get("adversarial_answer", answer),
-                        "category": category,
-                        "is_correct": None,
-                        "confidence": 0.0,
-                        "recalled_count": 0,
-                        "explanation": "Skipped: requires LLM judge",
-                    }
-                )
-                if (i + 1) % 10 == 0:
-                    print(f"  Processed {i+1}/{len(questions)} questions...")
-                continue
-
-            # Recall memories for this question
-            # Use graph expansion for multi-hop questions (evidence > 1)
-            if evidence and len(evidence) > 1:
-                recalled_memories = self.multi_hop_recall_with_graph(
-                    question,
-                    sample_id,
-                    initial_limit=20,
-                    max_connected=60,
-                )
-            else:
-                recalled_memories = self.recall_for_question(
-                    question,
-                    sample_id,
-                    evidence_count=len(evidence),
-                )
-
-            # Check if answer is in recalled memories
-            # Phase 2.5: Pass sample_id to enable evidence fetching
-            is_correct, confidence, explanation = self.check_answer_in_memories(
-                question, answer, recalled_memories, evidence, sample_id
-            )
-
-            # Record result
-            qa_result = {
-                "question": question,
-                "expected_answer": answer,
-                "category": category,
-                "is_correct": is_correct,
-                "confidence": confidence,
-                "recalled_count": len(recalled_memories),
-                "explanation": explanation,
-            }
+            qa_result = self._evaluate_question(qa, sample_id)
             qa_results.append(qa_result)
 
             # Track results by category
-            self.results[category].append(is_correct)
+            if qa_result["is_correct"] is not None:
+                self.results[qa_result["category"]].append(qa_result["is_correct"])
 
             # Progress indicator
             if (i + 1) % 10 == 0:
@@ -1586,9 +1789,10 @@ def run_benchmark(
                 "correct": correct,
                 "total": total,
             }
-            print(
-                f"  {category_names.get(category, f'Category {category}'):25s}: {accuracy:6.2%} ({correct:3d}/{total:3d})"
-            )
+            if category != 5 or cat5_skipped == 0:
+                print(
+                    f"  {category_names.get(category, f'Category {category}'):25s}: {accuracy:6.2%} ({correct:3d}/{total:3d})"
+                )
 
         if cat5_skipped:
             cat5_name = category_names[5]
@@ -1597,29 +1801,47 @@ def run_benchmark(
                     "name": cat5_name,
                     "accuracy": None,
                     "correct": 0,
-                    "total": cat5_skipped,
+                    "total": 0,
                     "skipped": True,
+                    "skipped_count": cat5_skipped,
                 }
+                reason = (
+                    "judge unavailable/errors" if self.config.judge_model else "needs LLM judge"
+                )
+                print(f"  {cat5_name:25s}:    N/A ({cat5_skipped:3d} skipped, {reason})")
             else:
                 category_results[5]["skipped_count"] = cat5_skipped
-            print(f"  {cat5_name:25s}:    N/A ({cat5_skipped:3d} skipped, needs LLM judge)")
+                category_results[5]["skipped"] = True
+                correct = category_results[5]["correct"]
+                total = category_results[5]["total"]
+                accuracy = category_results[5]["accuracy"]
+                print(
+                    f"  {cat5_name:25s}: {accuracy:6.2%} ({correct:3d}/{total:3d}, {cat5_skipped:3d} skipped)"
+                )
 
-        # Comparison with CORE (their 88.24% includes cat5 via GPT-4 judge)
+        # Comparison with the published CORE reference.
+        # Treat 88.24% as a useful external reference point, not a strict
+        # apples-to-apples leaderboard, because public LoCoMo setups differ.
         core_sota = 0.8824
         improvement = overall_accuracy - core_sota
-        print("\n🏆 Comparison with CORE (SOTA):")
+        print("\n📊 Comparison with published CORE reference:")
         print(f"  CORE: {core_sota:.2%}")
         print(f"  AutoMem: {overall_accuracy:.2%}")
         if cat5_skipped:
-            print(
-                f"  ⚠️  AutoMem excludes {cat5_skipped} cat-5 Qs (needs LLM judge); CORE includes them"
-            )
+            if self.config.judge_model:
+                print(
+                    f"  ⚠️  AutoMem skipped {cat5_skipped} cat-5 Qs due to judge/missing-evidence errors"
+                )
+            else:
+                print(
+                    f"  ⚠️  AutoMem excludes {cat5_skipped} cat-5 Qs (needs LLM judge); treat comparison as directional only"
+                )
         if improvement > 0:
-            print(f"  🎉 AutoMem leads by {improvement:.2%}")
+            print(f"  📈 AutoMem is {improvement:.2%} above that reference")
         elif improvement < 0:
-            print(f"  📉 AutoMem is {abs(improvement):.2%} behind CORE")
+            print(f"  📉 AutoMem is {abs(improvement):.2%} behind that reference")
         else:
-            print("  🤝 AutoMem matches CORE")
+            print("  🤝 AutoMem matches that reference")
 
         # Cleanup
         if cleanup_after:
@@ -1634,6 +1856,9 @@ def run_benchmark(
                 "total": total_questions,
                 "elapsed_time": elapsed_time,
             },
+            "judge_requested": bool(self.config.judge_model),
+            "judge_available": bool(self.config.judge_model and self.openai_client),
+            "judge_model": self.config.judge_model,
             "categories": category_results,
             "conversations": conversation_results,
             "comparison": {
@@ -1641,7 +1866,11 @@ def run_benchmark(
                 "automem": overall_accuracy,
                 "improvement": improvement,
                 "cat5_excluded": cat5_skipped,
-                "note": "CORE 88.24% includes cat-5 via GPT-4 judge" if cat5_skipped else None,
+                "note": (
+                    "CORE 88.24% includes cat-5 via GPT-4 judge"
+                    if cat5_skipped and not self.config.judge_model
+                    else None
+                ),
             },
         }
 
@@ -1694,6 +1923,16 @@ def main():
         action="store_true",
         help="Only evaluate (skip ingestion). Assumes data already loaded.",
     )
+    parser.add_argument(
+        "--judge",
+        action="store_true",
+        help="Enable category-5 LLM judging (defaults to gpt-4o unless BENCH_JUDGE_MODEL is set).",
+    )
+    parser.add_argument(
+        "--judge-model",
+        default=None,
+        help="LLM model for category-5 judging (also enables judge mode).",
+    )
 
     args = parser.parse_args()
 
@@ -1704,6 +1943,8 @@ def main():
 
     if args.data_file:
         config.data_file = args.data_file
+    if args.judge or args.judge_model:
+        config.judge_model = args.judge_model or config.judge_model or "gpt-4o"
 
     # Run evaluation
     evaluator = LoCoMoEvaluator(config)
diff --git a/tests/test_locomo_cat5_judge.py b/tests/test_locomo_cat5_judge.py
new file mode 100644
index 00000000..20320920
--- /dev/null
+++ b/tests/test_locomo_cat5_judge.py
@@ -0,0 +1,408 @@
+import sys
+from importlib.util import module_from_spec, spec_from_file_location
+from pathlib import Path
+from types import SimpleNamespace
+from typing import Any, Dict, List
+
+import pytest
+
+_MODULE_NAME = "locomo_benchmark_module_cat5"
+
+
+def _load_locomo_module() -> Any:
+    if _MODULE_NAME in sys.modules:
+        return sys.modules[_MODULE_NAME]
+    module_path = Path(__file__).resolve().parent / "benchmarks" / "test_locomo.py"
+    spec = spec_from_file_location(_MODULE_NAME, module_path)
+    assert spec is not None
+    assert spec.loader is not None
+    module = module_from_spec(spec)
+    sys.modules[_MODULE_NAME] = module
+    spec.loader.exec_module(module)
+    return module
+
+
+@pytest.fixture()
+def locomo_module() -> Any:
+    return _load_locomo_module()
+
+
+@pytest.fixture()
+def locomo_evaluator(locomo_module: Any) -> Any:
+    config = locomo_module.LoCoMoConfig()
+    config.judge_model = None
+    evaluator = locomo_module.LoCoMoEvaluator(config)
+    evaluator.openai_client = None
+    evaluator.use_embedding_similarity = False
+    return evaluator
+
+
+def _fake_memory(dialog_id: str, content: str) -> Dict[str, Any]:
+    return {
+        "content": content,
+        "metadata": {
+            "dialog_id": dialog_id,
+            "session_datetime": "2023-05-08T13:56:00+00:00",
+        },
+    }
+
+
+def _cat5_qa(**overrides: Any) -> Dict[str, Any]:
+    qa = {
+        "question": "How does Jolene plan to pursue her dream of climbing mountains?",
+        "answer": "",
+        "adversarial_answer": "By ignoring training and winging it.",
+        "category": 5,
+        "evidence": ["D10:20"],
+    }
+    qa.update(overrides)
+    return qa
+
+
+def test_cat5_without_judge_still_skips(locomo_evaluator: Any) -> None:
+    result = locomo_evaluator._evaluate_question(_cat5_qa(), "conv-1")
+
+    assert result["is_correct"] is None
+    assert result["recalled_count"] == 0
+    assert result["explanation"] == "Skipped: requires LLM judge"
+
+
+def test_cat5_with_judge_counts_toward_category_results(
+    locomo_evaluator: Any, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    locomo_evaluator.config.judge_model = "gpt-4o"
+    monkeypatch.setattr(
+        locomo_evaluator,
+        "_recall_memories_for_qa",
+        lambda question, sample_id, evidence: [_fake_memory("D10:20", "Jolene watches videos.")],
+    )
+    monkeypatch.setattr(
+        locomo_evaluator,
+        "judge_complex_reasoning",
+        lambda question, adversarial_answer, recalled_memories, evidence_dialog_ids, sample_id: (
+            True,
+            0.92,
+            "LLM judge: supported by evidence",
+            "She plans to study and start with beginner climbs.",
+            "supported by evidence",
+        ),
+    )
+
+    conversation = {"qa": [_cat5_qa()]}
+    result = locomo_evaluator._evaluate_only(conversation, "conv-1")
+
+    assert result["correct"] == 1
+    assert result["total_questions"] == 1
+    assert locomo_evaluator.results[5] == [True]
+    assert result["qa_results"][0]["judge_generated_answer"] is not None
+
+
+def test_fetch_evidence_memories_uses_local_conversation_cache(
+    locomo_evaluator: Any,
+) -> None:
+    locomo_evaluator.local_conversation_memories["conv-26"] = {
+        "D2:3": _fake_memory("D2:3", "Target evidence"),
+        "D5:8": _fake_memory("D5:8", "Secondary evidence"),
+    }
+
+    evidence = locomo_evaluator.fetch_evidence_memories(
+        ["D2:3", "D5:8"],
+        "conv-26",
+        use_local_cache=True,
+    )
+
+    assert [memory["metadata"]["dialog_id"] for memory in evidence] == ["D2:3", "D5:8"]
+
+
+def test_cat5_judge_uses_cache(locomo_evaluator: Any, monkeypatch: pytest.MonkeyPatch) -> None:
+    locomo_evaluator.config.judge_model = "gpt-4o"
+    calls = {"count": 0}
+
+    class FakeCompletions:
+        def create(self, **kwargs: Any) -> Any:
+            calls["count"] += 1
+            return SimpleNamespace(
+                choices=[
+                    SimpleNamespace(
+                        message=SimpleNamespace(
+                            content=(
+                                '{"generated_answer": "She will research and train.", '
+                                '"verdict": "supported", '
+                                '"correct": true, "confidence": 0.88, '
+                                '"reasoning": "Matches the evidence dialog."}'
+                            )
+                        )
+                    )
+                ]
+            )
+
+    locomo_evaluator.openai_client = SimpleNamespace(
+        chat=SimpleNamespace(completions=FakeCompletions())
+    )
+    monkeypatch.setattr(
+        locomo_evaluator,
+        "fetch_evidence_memories",
+        lambda evidence_dialog_ids, sample_id, use_local_cache=False: [
+            _fake_memory("D10:20", "Jolene is gathering information and watching videos.")
+        ],
+    )
+
+    recalled = [_fake_memory("R1", "Jolene is gathering information and watching videos.")]
+    first = locomo_evaluator.judge_complex_reasoning(
+        _cat5_qa()["question"],
+        _cat5_qa()["adversarial_answer"],
+        recalled,
+        ["D10:20"],
+        "conv-1",
+    )
+    second = locomo_evaluator.judge_complex_reasoning(
+        _cat5_qa()["question"],
+        _cat5_qa()["adversarial_answer"],
+        recalled,
+        ["D10:20"],
+        "conv-1",
+    )
+
+    assert calls["count"] == 1
+    assert first == second
+    assert first[0] is True
+
+
+def test_cat5_judge_sets_request_timeout(
+    locomo_evaluator: Any, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    locomo_evaluator.config.judge_model = "gpt-4o"
+    captured = {}
+
+    class FakeCompletions:
+        def create(self, **kwargs: Any) -> Any:
+            captured["timeout"] = kwargs.get("timeout")
+            return SimpleNamespace(
+                choices=[
+                    SimpleNamespace(
+                        message=SimpleNamespace(
+                            content=(
+                                '{"generated_answer": "She will research and train.", '
+                                '"verdict": "supported", '
+                                '"correct": true, "confidence": 0.88, '
+                                '"reasoning": "Matches the evidence dialog."}'
+                            )
+                        )
+                    )
+                ]
+            )
+
+    locomo_evaluator.openai_client = SimpleNamespace(
+        chat=SimpleNamespace(completions=FakeCompletions())
+    )
+    monkeypatch.setattr(
+        locomo_evaluator,
+        "fetch_evidence_memories",
+        lambda evidence_dialog_ids, sample_id, use_local_cache=False: [
+            _fake_memory("D10:20", "Jolene is gathering information and watching videos.")
+        ],
+    )
+
+    result = locomo_evaluator.judge_complex_reasoning(
+        _cat5_qa()["question"],
+        _cat5_qa()["adversarial_answer"],
+        [_fake_memory("R1", "Jolene is gathering information and watching videos.")],
+        ["D10:20"],
+        "conv-1",
+    )
+
+    assert result[0] is True
+    assert captured["timeout"] == locomo_evaluator.OPENAI_REQUEST_TIMEOUT_SECONDS
+
+
+@pytest.mark.parametrize(
+    ("evidence_memories", "response_content", "expected_message"),
+    [
+        (
+            [],
+            '{"generated_answer": "", "verdict": "unsupported", "correct": true, "confidence": 0.1, "reasoning": ""}',
+            "Skipped: no evidence memories available for LLM judge",
+        ),
+        (
+            [{"content": "evidence", "metadata": {"dialog_id": "D10:20"}}],
+            "not-json",
+            "Skipped: LLM judge error:",
+        ),
+    ],
+)
+def test_cat5_judge_failures_skip_instead_of_marking_wrong(
+    locomo_evaluator: Any,
+    monkeypatch: pytest.MonkeyPatch,
+    evidence_memories: List[Dict[str, Any]],
+    response_content: str,
+    expected_message: str,
+) -> None:
+    locomo_evaluator.config.judge_model = "gpt-4o"
+
+    class FakeCompletions:
+        def create(self, **kwargs: Any) -> Any:
+            return SimpleNamespace(
+                choices=[SimpleNamespace(message=SimpleNamespace(content=response_content))]
+            )
+
+    locomo_evaluator.openai_client = SimpleNamespace(
+        chat=SimpleNamespace(completions=FakeCompletions())
+    )
+    monkeypatch.setattr(
+        locomo_evaluator,
+        "fetch_evidence_memories",
+        lambda evidence_dialog_ids, sample_id, use_local_cache=False: evidence_memories,
+    )
+    monkeypatch.setattr(
+        locomo_evaluator,
+        "_recall_memories_for_qa",
+        lambda question, sample_id, evidence: [
+            _fake_memory("R1", "Jolene is gathering information and watching videos.")
+        ],
+    )
+
+    result = locomo_evaluator._evaluate_question(_cat5_qa(), "conv-1")
+
+    assert result["is_correct"] is None
+    assert result["confidence"] == 0.0
+    assert expected_message in result["explanation"]
+
+
+def test_cat5_judge_prompt_allows_abstention_and_wrong_premise(
+    locomo_evaluator: Any, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    locomo_evaluator.config.judge_model = "gpt-4o"
+    captured = {}
+
+    class FakeCompletions:
+        def create(self, **kwargs: Any) -> Any:
+            captured["messages"] = kwargs["messages"]
+            return SimpleNamespace(
+                choices=[
+                    SimpleNamespace(
+                        message=SimpleNamespace(
+                            content=(
+                                '{"generated_answer": "I don\'t know; the premise seems wrong.", '
+                                '"verdict": "contradiction", '
+                                '"correct": true, "confidence": 0.91, '
+                                '"reasoning": "The evidence shows the question names the wrong person."}'
+                            )
+                        )
+                    )
+                ]
+            )
+
+    locomo_evaluator.openai_client = SimpleNamespace(
+        chat=SimpleNamespace(completions=FakeCompletions())
+    )
+    monkeypatch.setattr(
+        locomo_evaluator,
+        "fetch_evidence_memories",
+        lambda evidence_dialog_ids, sample_id, use_local_cache=False: [
+            _fake_memory("D2:3", "The realization belonged to Melanie, not Caroline.")
+        ],
+    )
+
+    result = locomo_evaluator.judge_complex_reasoning(
+        "What did Caroline realize after her charity race?",
+        "self-care is important",
+        [_fake_memory("R1", "Melanie talked about self-care after the race.")],
+        ["D2:3"],
+        "conv-26",
+    )
+
+    prompt = captured["messages"][1]["content"]
+    assert "Do NOT assume the adversarial answer is always false" in prompt
+    assert (
+        "abstaining, correcting the person/entity, or stating that the premise is unsupported"
+        in prompt
+    )
+    assert result[0] is True
+
+
+def test_cat5_judge_prompt_includes_more_than_old_top_12_memories(
+    locomo_evaluator: Any, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    locomo_evaluator.config.judge_model = "gpt-4o"
+    captured = {}
+
+    class FakeCompletions:
+        def create(self, **kwargs: Any) -> Any:
+            captured["messages"] = kwargs["messages"]
+            return SimpleNamespace(
+                choices=[
+                    SimpleNamespace(
+                        message=SimpleNamespace(
+                            content=(
+                                '{"generated_answer": "Answer", '
+                                '"verdict": "supported", '
+                                '"correct": true, "confidence": 0.8, '
+                                '"reasoning": "ok"}'
+                            )
+                        )
+                    )
+                ]
+            )
+
+    locomo_evaluator.openai_client = SimpleNamespace(
+        chat=SimpleNamespace(completions=FakeCompletions())
+    )
+    monkeypatch.setattr(
+        locomo_evaluator,
+        "fetch_evidence_memories",
+        lambda evidence_dialog_ids, sample_id, use_local_cache=False: [
+            _fake_memory("D10:20", "Target evidence")
+        ],
+    )
+    recalled = [_fake_memory(f"R{i}", f"Memory {i}") for i in range(1, 16)]
+
+    locomo_evaluator.judge_complex_reasoning(
+        _cat5_qa()["question"],
+        _cat5_qa()["adversarial_answer"],
+        recalled,
+        ["D10:20"],
+        "conv-1",
+    )
+
+    prompt = captured["messages"][1]["content"]
+    assert "[R15] Memory 15" in prompt
+
+
+def test_cat5_with_canonical_answer_stays_deterministic(
+    locomo_evaluator: Any, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    locomo_evaluator.config.judge_model = "gpt-4o"
+    monkeypatch.setattr(
+        locomo_evaluator,
+        "_recall_memories_for_qa",
+        lambda question, sample_id, evidence: [_fake_memory("D5:8", "Caroline did not make it.")],
+    )
+    monkeypatch.setattr(
+        locomo_evaluator,
+        "check_answer_in_memories",
+        lambda question, expected_answer, recalled_memories, evidence_dialog_ids, sample_id: (
+            True,
+            1.0,
+            "Found answer",
+        ),
+    )
+    monkeypatch.setattr(
+        locomo_evaluator,
+        "judge_complex_reasoning",
+        lambda *args, **kwargs: (_ for _ in ()).throw(AssertionError("judge should not be used")),
+    )
+
+    result = locomo_evaluator._evaluate_question(
+        _cat5_qa(
+            question="Did Caroline make the black and white bowl in the photo?",
+            answer="No",
+            adversarial_answer="Yes",
+            evidence=["D5:8"],
+        ),
+        "conv-26",
+    )
+
+    assert result["is_correct"] is True
+    assert result["expected_answer"] == "No"
+    assert result["judge_generated_answer"] is None
+    assert result["explanation"].startswith("Deterministic cat-5 scoring:")