diff --git a/AGENTS.md b/AGENTS.md index c343a85a..29794fc5 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -47,8 +47,8 @@ The benchmark system uses **snapshot-based evaluation**: ingest once, eval many | Tier | Benchmark | Command | Runtime | Cost | When to use | |------|-----------|---------|---------|------|-------------| | 0 | Unit tests | `make test` | 30s | free | Every change | -| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free | Rapid iteration | -| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free | Before merge | +| 1 | LoCoMo-mini (2 convos, 304 Qs) | `make bench-eval BENCH=locomo-mini` | 2-3 min | free / ~$0.20 with judge | Rapid iteration | +| 2 | LoCoMo-full (10 convos, 1986 Qs) | `make bench-eval BENCH=locomo` | 5-10 min | free / ~$1-3 with judge | Before merge | | 3 | LongMemEval-mini (20 Qs) | `make bench-mini-longmemeval` | 15 min | ~$1 | Scoring/entity changes | | 4 | LongMemEval-full (500 Qs) | `make test-longmemeval` | 1-2 hr | ~$10 | Milestones only | diff --git a/README.md b/README.md index 75269485..2b9429af 100644 --- a/README.md +++ b/README.md @@ -14,7 +14,7 @@ # **AI Memory That Actually Learns** -AutoMem is a **production-grade long-term memory system** for AI assistants, achieving **89.27% accuracy** on the [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) (ACL 2024)—outperforming CORE (88.24%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for current baselines. +AutoMem is a **production-grade long-term memory system** for AI assistants with transparent [LoCoMo benchmark](docs/TESTING.md#locomo-benchmark) baselines (ACL 2024): **89.27%** on `locomo-mini` categories 1-4 with category 5 skipped, and **87.56%** on full `locomo` with the opt-in category-5 judge enabled. See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for methodology and current baselines. 
**Deploy in 60 seconds:** @@ -522,7 +522,7 @@ Vector databases match embeddings. AutoMem builds knowledge graphs: AutoMem saves you months of iteration: -- ✅ **Benchmark-proven** - 89.27% on LoCoMo (ACL 2024), beats CORE SOTA +- ✅ **Benchmark-proven** - Transparent LoCoMo baselines for both judge-off and judge-on evaluation - ✅ **Research-validated** - Implements HippoRAG 2, A-MEM, MELODI, ReadAgent principles - ✅ **Production-ready** - Auth, admin tools, health monitoring, automated backups - ✅ **Battle-tested** - Enrichment pipeline, consolidation engine, retry logic, dual storage @@ -532,7 +532,14 @@ AutoMem saves you months of iteration: ### LoCoMo Benchmark (ACL 2024) -**89.27% accuracy** on categories 1–4 (233 scored questions, Voyage 4 embeddings): +AutoMem publishes two reference baselines with Voyage 4 embeddings: + +| Setup | Scope | Score | Notes | +|-------|-------|-------|-------| +| Fast iteration | `locomo-mini`, judge off | **89.27% (208/233)** | Categories 1-4 only; 71 category-5 questions skipped | +| Full benchmark | `locomo`, judge on (`gpt-4o`) | **87.56% (1739/1986)** | Includes category 5 at 95.74% (427/446) | + +`locomo-mini` category breakdown with the judge disabled: | Category | AutoMem | Notes | | -------------------------- | ---------- | --------------------------------------- | @@ -540,18 +547,21 @@ AutoMem saves you months of iteration: | **Temporal Understanding** | **92.06%** | Time-aware queries | | **Single-hop Recall** | **79.07%** | Basic fact retrieval | | **Multi-hop Reasoning** | **46.15%** | Connecting disparate memories | -| **Complex Reasoning** | N/A | Requires LLM judge (not yet scored) | +| **Complex Reasoning** | N/A | Skipped in this setup; use judge-on run | -**Comparison with other systems:** +Reference point: | System | Score | |--------|-------| -| AutoMem | 89.27% | -| CORE | 88.24% | +| Published CORE result | 88.24% | +| AutoMem `locomo-mini` judge off | 89.27% | +| AutoMem `locomo` judge on | 87.56% | -> 
**Note:** Earlier versions reported 90.53% which included two evaluator bugs: temporal matching compared the wrong text (false negatives → 22%) and category 5 matched empty strings (false positives → 100%). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for full history. +> **Methodology note:** We do not present this as a strict leaderboard claim. The published CORE number is a useful reference point, but the public LoCoMo setups are not perfectly apples-to-apples, especially around category-5 handling. AutoMem is above that published reference on the `locomo-mini` categories 1-4 run and below it on the full judge-enabled run. +> +> **History note:** Earlier versions reported 90.53%, but that included two evaluator bugs: temporal matching compared the wrong text (false negatives) and category 5 matched empty strings (false positives). See [`benchmarks/EXPERIMENT_LOG.md`](benchmarks/EXPERIMENT_LOG.md) for the corrected timeline. -Run benchmarks: `make bench-eval BENCH=locomo-mini` (quick) or `make bench-eval BENCH=locomo` (full) +Run benchmarks: `make bench-eval BENCH=locomo-mini CONFIG=baseline` (quick) or `BENCH_JUDGE_MODEL=gpt-4o make bench-eval BENCH=locomo CONFIG=baseline` (full, includes category 5) ### Production Characteristics diff --git a/benchmarks/EXPERIMENT_LOG.md b/benchmarks/EXPERIMENT_LOG.md index dd50ade8..9b7db1d6 100644 --- a/benchmarks/EXPERIMENT_LOG.md +++ b/benchmarks/EXPERIMENT_LOG.md @@ -10,8 +10,8 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02). 
| Tier | Benchmark | Runtime | Cost | When to use | |------|-----------|---------|------|-------------| | 0 | `make test` (unit) | 30s | free | Every change | -| 1 | `locomo-mini` (2 convos, 304 Qs) | 2-3 min | free | Rapid iteration | -| 2 | `locomo` (10 convos, 1986 Qs) | 5-10 min | free | Before merge | +| 1 | `locomo-mini` (2 convos, 304 Qs) | 2-3 min | free / ~$0.20 with judge | Rapid iteration | +| 2 | `locomo` (10 convos, 1986 Qs) | 5-10 min | free / ~$1-3 with judge | Before merge | | 3 | `longmemeval-mini` (20 Qs) | 15 min | ~$1 | Scoring/entity changes | | 4 | `longmemeval` (500 Qs) | 1-2 hr | ~$10 | Milestones only | @@ -26,16 +26,18 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02). | 2026-03-02 | #78 | exp/78-decay-fix | 76.97% (+0.0) | 79.51% (-0.55) | -- | Decay rate 0.1→0.01, importance floor, archive filter. Within variance. Impact is on production (rehabilitated via rescore) | | 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. | | 2026-03-10 | eval-fix | docs/benchmark-agent-guidelines | **89.27% (208/233)** | -- | -- | Fix temporal matching (answer vs memory dates) + skip cat5 (no ground truth). Honest score, beats CORE by 1.03pp. | +| 2026-03-10 | cat5-judge | feat/bench-cat5-judge | **89.80% (273/304)** | **87.56% (1739/1986)** | -- | Opt-in GPT-4o judge for cat5. Full run scored cat5 at 95.74% (427/446) with 0 judge skips/errors; added 90s OpenAI request timeout to prevent stuck full runs. | ### Category Breakdown (LoCoMo-mini) -Categories 1-4 scored by word-overlap/date matching. Category 5 requires LLM judge (not yet implemented). +Categories 1-4 are scored by word-overlap/date matching. Category 5 uses an opt-in LLM judge when `BENCH_JUDGE_MODEL` or `--judge` is enabled; otherwise it remains `N/A`. 
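The skip rule above (unjudged category-5 questions stay `N/A` and are left out of the denominator) can be checked against the `locomo-mini` counts in this log. This is a minimal sketch, not the benchmark's own code: `score` and `fmt` are hypothetical helper names, and the only assumption carried over from the evaluator is that a skipped question is recorded as `None`.

```python
from typing import List, Optional, Tuple


def score(results: List[Optional[bool]]) -> Tuple[int, int, int]:
    """Hypothetical helper: return (correct, scored, skipped); None marks a skip."""
    scored = [r for r in results if r is not None]
    return sum(scored), len(scored), len(results) - len(scored)


def fmt(correct: int, total: int) -> str:
    """Hypothetical helper: format like the log rows, or N/A when nothing was scored."""
    return f"{correct / total:.2%} ({correct}/{total})" if total else "N/A"


# Counts taken from the locomo-mini rows in this log.
cats_1_to_4 = [True] * 208 + [False] * 25    # 233 deterministic questions
cat5_judge_off = [None] * 71                 # skipped, so excluded from the total
cat5_judge_on = [True] * 65 + [False] * 6    # scored by the opt-in judge

judge_off = score(cats_1_to_4 + cat5_judge_off)
judge_on = score(cats_1_to_4 + cat5_judge_on)
print(fmt(judge_off[0], judge_off[1]))
print(fmt(judge_on[0], judge_on[1]))
```

With the judge off, the 71 skipped questions shrink the denominator to 233; with the judge on, all 304 questions count, which is why the two `locomo-mini` rows are not directly comparable.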
| Date | Issue/PR | Single-hop | Temporal | Multi-hop | Open Domain | Complex | |------|----------|------------|----------|-----------|-------------|---------| | 2026-03-02 | baseline | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) | | 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) | | 2026-03-10 | eval-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | N/A (71 skipped) | +| 2026-03-10 | cat5-judge | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | **91.5% (65/71)** | \* Temporal was artificially low: evaluator compared question dates (empty) vs memory dates instead of answer dates. \*\* Complex was artificially 100%: dataset has no `answer` field for cat5 → empty string → `"" in content` always True. diff --git a/docs/TESTING.md b/docs/TESTING.md index 1fd6fd89..ce89946c 100644 --- a/docs/TESTING.md +++ b/docs/TESTING.md @@ -218,6 +218,27 @@ make test-integration AutoMem can be evaluated against the **LoCoMo benchmark** (ACL 2024), which tests long-term conversational memory across 10 conversations and 1,986 questions. +### LoCoMo Cat-5 Judge + +Category 5 uses evidence-grounded complex reasoning and is opt-in for cost reasons. + +```bash +# Default: categories 1-4 scored, category 5 skipped +make bench-eval BENCH=locomo-mini CONFIG=baseline + +# Enable cat-5 judge with env var +BENCH_JUDGE_MODEL=gpt-4o make bench-eval BENCH=locomo-mini CONFIG=baseline + +# Or use the runner CLI flags directly +./test-locomo-benchmark.sh --conversations 0,1 --judge +./test-locomo-benchmark.sh --conversations 0,1 --judge-model gpt-4o-mini +``` + +- `BENCH_JUDGE_MODEL` enables category-5 judging for `tests/benchmarks/test_locomo.py`. +- `--judge` and `--judge-model` both enable the judge; `--judge` defaults to `gpt-4o` unless overridden by `BENCH_JUDGE_MODEL` or `--judge-model`. 
+- If the judge is disabled, category 5 remains `N/A`. +- If the judge is enabled but evidence is missing or the LLM response is invalid, the affected category-5 questions are skipped rather than counted wrong. + ### What is LoCoMo? LoCoMo evaluates AI systems' ability to remember and reason across very long conversations (300+ turns). It measures performance across 5 categories: @@ -228,7 +249,14 @@ LoCoMo evaluates AI systems' ability to remember and reason across very long con 4. **Open Domain** (Category 4) - General knowledge questions 5. **Complex Reasoning** (Category 5) - Advanced inference tasks -**Comparison**: CORE achieved 88.24% (June 2025). AutoMem achieved 90.53%. +Published reference point: CORE is widely cited at **88.24%** (June 2025), but public LoCoMo setups are not perfectly apples-to-apples, especially around category-5 handling. + +AutoMem currently publishes two LoCoMo baselines: + +| Setup | Scope | Score | Notes | +|------|-------|-------|-------| +| `locomo-mini`, judge off | 2 conversations, categories 1-4 only | **89.27% (208/233)** | 71 category-5 questions skipped | +| `locomo`, judge on (`gpt-4o`) | Full 10 conversations | **87.56% (1739/1986)** | Category 5 scored at 95.74% (427/446) | ### Running the Benchmark @@ -265,24 +293,26 @@ Memory usage: Example benchmark output: ```text 📊 FINAL RESULTS -🎯 Overall Accuracy: 90.53% (1798/1986) -⏱️ Total Time: 1665s +🎯 Overall Accuracy: 87.56% (1739/1986) +⏱️ Total Time: 3497s 💾 Total Memories Stored: 5882 📈 Category Breakdown: - Single-hop Recall : 79.79% (225/282) - Temporal Understanding : 85.05% (273/321) - Multi-hop Reasoning : 50.00% ( 48/ 96) - Open Domain : 95.84% (806/841) - Complex Reasoning : 100.00% (446/446) + Single-hop Recall : 66.31% (187/282) + Temporal Understanding : 87.23% (280/321) + Multi-hop Reasoning : 45.83% ( 44/ 96) + Open Domain : 95.24% (801/841) + Complex Reasoning : 95.74% (427/446) -📊 Comparison: +📊 Comparison with published CORE reference: CORE: 88.24% - 
AutoMem: 90.53% + AutoMem: 87.56% + 📉 AutoMem is 0.68% behind that reference ``` -All benchmark reports live in `tests/benchmarks/`. -``` +If you run without the judge, category 5 will show as `N/A` and the comparison should be treated as directional rather than apples-to-apples. + +Current baselines and methodology notes live in `benchmarks/EXPERIMENT_LOG.md`. ### AutoMem's Advantages diff --git a/scripts/bench/restore_and_eval.sh b/scripts/bench/restore_and_eval.sh index e7ad7768..5d37a935 100755 --- a/scripts/bench/restore_and_eval.sh +++ b/scripts/bench/restore_and_eval.sh @@ -6,6 +6,10 @@ set -euo pipefail BENCH_NAME="${1:-locomo}" CONFIG="${2:-baseline}" REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)" +PYTHON_BIN="${REPO_ROOT}/venv/bin/python" +if [[ ! -x "$PYTHON_BIN" ]]; then + PYTHON_BIN="python3" +fi # Shared utilities (colors + wait_for_api) source "$(dirname "$0")/../lib/common.sh" @@ -90,7 +94,7 @@ if [[ "$BENCH_NAME" == locomo* ]]; then if [[ "$BENCH_NAME" == "locomo-mini" ]]; then EVAL_ARGS="--conversations 0,1 ${EVAL_ARGS}" fi - python3 tests/benchmarks/test_locomo.py \ + "$PYTHON_BIN" tests/benchmarks/test_locomo.py \ --base-url "$AUTOMEM_TEST_BASE_URL" \ --api-token "$AUTOMEM_TEST_API_TOKEN" \ ${EVAL_ARGS} @@ -99,7 +103,7 @@ elif [[ "$BENCH_NAME" == longmemeval* ]]; then if [[ "$BENCH_NAME" == "longmemeval-mini" ]]; then LONGMEM_ARGS+=(--max-questions 20) fi - python3 tests/benchmarks/longmemeval/test_longmemeval.py \ + "$PYTHON_BIN" tests/benchmarks/longmemeval/test_longmemeval.py \ --base-url "$AUTOMEM_TEST_BASE_URL" \ --api-token "$AUTOMEM_TEST_API_TOKEN" \ "${LONGMEM_ARGS[@]}" diff --git a/test-locomo-benchmark.sh b/test-locomo-benchmark.sh index 64071e75..611e2727 100755 --- a/test-locomo-benchmark.sh +++ b/test-locomo-benchmark.sh @@ -19,12 +19,20 @@ SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" # Shared utilities (colors + wait_for_api) source "${SCRIPT_DIR}/scripts/lib/common.sh" +if [ -x "${SCRIPT_DIR}/venv/bin/python" 
]; then + PYTHON_BIN="${SCRIPT_DIR}/venv/bin/python" +else + PYTHON_BIN="python3" +fi + # Default configuration RUN_LIVE=false CONVERSATIONS="" RECALL_LIMIT=10 NO_CLEANUP=false OUTPUT_FILE="" +JUDGE=false +JUDGE_MODEL="" # Parse arguments while [[ $# -gt 0 ]]; do @@ -45,6 +53,15 @@ while [[ $# -gt 0 ]]; do OUTPUT_FILE="$2" shift 2 ;; + --judge) + JUDGE=true + shift + ;; + --judge-model) + JUDGE=true + JUDGE_MODEL="$2" + shift 2 + ;; --conversations) CONVERSATIONS="$2" shift 2 @@ -56,6 +73,8 @@ while [[ $# -gt 0 ]]; do echo " --live Run against Railway deployment (default: local Docker)" echo " --recall-limit N Number of memories to recall per question (default: 10)" echo " --conversations I,J Comma-separated conversation indices (e.g. 0,1 for mini mode)" + echo " --judge Enable category-5 LLM judge (defaults to gpt-4o)" + echo " --judge-model MODEL Set the category-5 judge model (also enables judge)" echo " --no-cleanup Don't cleanup test data after evaluation" echo " --output FILE Save results to JSON file" echo " --help, -h Show this help message" @@ -64,6 +83,7 @@ while [[ $# -gt 0 ]]; do echo " $0 # Run locally" echo " $0 --live # Run against Railway" echo " $0 --conversations 0,1 # Mini mode (2 conversations)" + echo " $0 --conversations 0,1 --judge # Mini mode with cat-5 judge" echo " $0 --recall-limit 20 --output results.json" exit 0 ;; @@ -159,7 +179,7 @@ else fi # Build python command -PYTHON_CMD="python3 $SCRIPT_DIR/tests/benchmarks/test_locomo.py" +PYTHON_CMD="$PYTHON_BIN $SCRIPT_DIR/tests/benchmarks/test_locomo.py" PYTHON_CMD="$PYTHON_CMD --base-url $AUTOMEM_TEST_BASE_URL" PYTHON_CMD="$PYTHON_CMD --api-token $AUTOMEM_TEST_API_TOKEN" PYTHON_CMD="$PYTHON_CMD --recall-limit $RECALL_LIMIT" @@ -176,6 +196,14 @@ if [ -n "$OUTPUT_FILE" ]; then PYTHON_CMD="$PYTHON_CMD --output $OUTPUT_FILE" fi +if [ "$JUDGE" = true ]; then + PYTHON_CMD="$PYTHON_CMD --judge" +fi + +if [ -n "$JUDGE_MODEL" ]; then + PYTHON_CMD="$PYTHON_CMD --judge-model $JUDGE_MODEL" +fi + echo "" 
echo -e "${BLUE}🚀 Starting benchmark evaluation...${NC}" echo "" diff --git a/tests/benchmarks/BENCHMARK_2025-11-08.md b/tests/benchmarks/BENCHMARK_2025-11-08.md index 604b855a..0b138914 100644 --- a/tests/benchmarks/BENCHMARK_2025-11-08.md +++ b/tests/benchmarks/BENCHMARK_2025-11-08.md @@ -1,5 +1,7 @@ # AutoMem Benchmark Results +> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so these scores and comparisons are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology. + ## LoCoMo Benchmark (Long-term Conversational Memory) **Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations) diff --git a/tests/benchmarks/BENCHMARK_2025-11-20.md b/tests/benchmarks/BENCHMARK_2025-11-20.md index 5435452b..6dbe9515 100644 --- a/tests/benchmarks/BENCHMARK_2025-11-20.md +++ b/tests/benchmarks/BENCHMARK_2025-11-20.md @@ -1,5 +1,7 @@ # AutoMem Benchmark Results +> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so the scores and any "SOTA" language here are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology. + ## LoCoMo Benchmark (Long-term Conversational Memory) **Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations) diff --git a/tests/benchmarks/BENCHMARK_2025-12-02.md b/tests/benchmarks/BENCHMARK_2025-12-02.md index b9016405..c32017fe 100644 --- a/tests/benchmarks/BENCHMARK_2025-12-02.md +++ b/tests/benchmarks/BENCHMARK_2025-12-02.md @@ -1,5 +1,7 @@ # AutoMem Benchmark Results +> Historical note: This report predates the March 10, 2026 LoCoMo evaluator fixes. Temporal and category-5 scoring were corrected later, so the scores and any "SOTA" language here are not current. See `benchmarks/EXPERIMENT_LOG.md` for current baselines and methodology. 
+ ## LoCoMo Benchmark (Long-term Conversational Memory) **Benchmark Version**: LoCoMo-10 (1,986 questions across 10 conversations) diff --git a/tests/benchmarks/test_locomo.py b/tests/benchmarks/test_locomo.py index 52749239..83bdd9f5 100644 --- a/tests/benchmarks/test_locomo.py +++ b/tests/benchmarks/test_locomo.py @@ -18,13 +18,14 @@ - CORE blog: https://blog.heysol.ai/core-build-memory-knowledge-graph-for-individuals-and-achieved-sota-on-locomo-benchmark/ """ +import hashlib import json import os import re import sys import time from collections import defaultdict -from dataclasses import dataclass +from dataclasses import dataclass, field from datetime import datetime, timedelta, timezone from pathlib import Path from typing import Any, Dict, List, Optional, Tuple @@ -64,11 +65,20 @@ class LoCoMoConfig: # Performance tuning batch_size: int = 50 # Memories to store before pausing pause_between_batches: float = 0.5 # Seconds to wait between batches + judge_model: Optional[str] = field( + default_factory=lambda: os.getenv("BENCH_JUDGE_MODEL") or None + ) + + def __post_init__(self) -> None: + if self.judge_model is not None: + self.judge_model = self.judge_model.strip() or None class LoCoMoEvaluator: """Evaluates AutoMem against the LoCoMo benchmark""" + OPENAI_REQUEST_TIMEOUT_SECONDS = 90.0 + def __init__(self, config: LoCoMoConfig): self.config = config self.headers = { @@ -77,17 +87,20 @@ def __init__(self, config: LoCoMoConfig): } # memory_map is returned per-conversation by _load_batch/_load_individual self.results = defaultdict(list) # Category -> [True/False scores] + self.local_conversation_memories = {} # sample_id -> dialog_id -> prepared memory # Phase 2: Initialize OpenAI client for LLM-based answer extraction - self.openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY")) + api_key = os.getenv("OPENAI_API_KEY") + self.openai_client = OpenAI(api_key=api_key) if api_key else None + self.has_openai_api_key = bool(api_key) # DISABLED: LLM extraction 
too slow for iteration; using word-overlap for now self.use_llm_extraction = False # Phase 2.5: Cache LLM responses to avoid redundant API calls - self.llm_cache = {} # (question, answer) -> (result, confidence, explanation) + self.llm_cache = {} # Operation-scoped cache key -> response tuple # Embedding-based answer checking (fast, handles semantic similarity) - self.use_embedding_similarity = bool(os.getenv("OPENAI_API_KEY")) + self.use_embedding_similarity = self.has_openai_api_key self.embedding_cache = {} # text -> embedding vector def health_check(self) -> bool: @@ -132,6 +145,24 @@ def _cosine_similarity(self, a: List[float], b: List[float]) -> float: return 0.0 return dot / (norm_a * norm_b) + @staticmethod + def _cache_value(value: Any, max_len: int = 500) -> str: + """Normalize cache parts and hash long payloads.""" + if isinstance(value, (dict, list)): + text = json.dumps(value, sort_keys=True) + else: + text = str(value) + + if len(text) <= max_len: + return text + + digest = hashlib.sha1(text.encode("utf-8")).hexdigest() + return f"{text[:max_len]}::{digest}" + + def _make_llm_cache_key(self, operation: str, *parts: Any) -> Tuple[str, ...]: + """Build a stable, operation-scoped cache key for LLM calls.""" + return tuple([operation] + [self._cache_value(part) for part in parts]) + def cleanup_test_data(self, tag_prefix: str = "locomo-test", max_iterations: int = 200) -> bool: """Remove all test memories from AutoMem""" print(f"\nCleaning up test memories with tag: {tag_prefix}") @@ -303,12 +334,23 @@ def load_conversation_into_automem( print(f"\nLoading conversation {sample_id} into AutoMem...") all_memories = self._prepare_conversation_memories(conversation, sample_id) + self._cache_prepared_memories(sample_id, all_memories) if self._has_batch_api(): return self._load_batch(all_memories, sample_id) else: return self._load_individual(all_memories, sample_id) + def _cache_prepared_memories(self, sample_id: str, memories: List[Dict[str, Any]]) -> None: + 
"""Cache prepared conversation memories locally by dialog ID for evidence lookup.""" + memory_index = {} + for memory in memories: + dialog_id = memory.get("_dia_id") + if not dialog_id: + continue + memory_index[dialog_id] = {k: v for k, v in memory.items() if k != "_dia_id"} + self.local_conversation_memories[sample_id] = memory_index + def _load_batch(self, memories: List[Dict[str, Any]], sample_id: str) -> Dict[str, str]: """Load memories using batch API (much faster).""" memory_map = {} @@ -748,7 +790,10 @@ def normalize_answer(self, text: str) -> str: return " ".join(stemmed) def fetch_evidence_memories( - self, evidence_dialog_ids: List[str], sample_id: str + self, + evidence_dialog_ids: List[str], + sample_id: str, + use_local_cache: bool = False, ) -> List[Dict[str, Any]]: """ Phase 2.5: Fetch specific evidence memories by dialog ID. @@ -758,14 +803,21 @@ def fetch_evidence_memories( """ evidence_memories = [] + if use_local_cache: + local_index = self.local_conversation_memories.get(sample_id, {}) + for dialog_id in evidence_dialog_ids: + memory = local_index.get(dialog_id) + if memory: + evidence_memories.append(memory) + return evidence_memories + try: - # Get all memories for this conversation response = requests.get( f"{self.config.base_url}/recall", headers=self.headers, params={ - "query": "", # Empty query to get all - "limit": 1000, # High limit to get all conversation memories + "query": "", + "limit": 1000, "tags": f"conversation:{sample_id}", "tag_match": "exact", }, @@ -776,14 +828,11 @@ def fetch_evidence_memories( result = response.json() results = result.get("results", []) all_memories = [r.get("memory", {}) for r in results if "memory" in r] - - # Filter to just the evidence dialogs for memory in all_memories: metadata = memory.get("metadata", {}) dialog_id = metadata.get("dialog_id", "") if dialog_id in evidence_dialog_ids: evidence_memories.append(memory) - except Exception as e: print(f"⚠️ Evidence fetch error: {e}") @@ -896,7 +945,12 
@@ def llm_extract_answer( ) # Ensure is_multi_hop is bool is_multi_hop_bool = bool(is_multi_hop) - cache_key = (question[:200], answer_str[:100], is_multi_hop_bool) + cache_key = self._make_llm_cache_key( + "extract_answer", + question[:200], + answer_str[:100], + is_multi_hop_bool, + ) if cache_key in self.llm_cache: return self.llm_cache[cache_key] @@ -980,6 +1034,7 @@ def llm_extract_answer( temperature=0.0, max_tokens=200, response_format={"type": "json_object"}, + timeout=self.OPENAI_REQUEST_TIMEOUT_SECONDS, ) # Parse response @@ -1000,6 +1055,156 @@ def llm_extract_answer( self.llm_cache[cache_key] = error_result # Cache errors too return error_result + def _format_memories_for_llm( + self, + memories: List[Dict[str, Any]], + limit: int = 10, + include_session_datetime: bool = True, + ) -> str: + """Format recalled or evidence memories for prompt context.""" + if not memories: + return "None" + + contexts = [] + for i, mem in enumerate(memories[:limit]): + metadata = mem.get("metadata", {}) + dialog_id = metadata.get("dialog_id", f"mem-{i}") + session_dt = metadata.get("session_datetime", "") + content = mem.get("content", "") + context = f"[{dialog_id}] {content}" + if include_session_datetime and session_dt: + context += f" (Session: {session_dt})" + contexts.append(context) + + return "\n\n".join(contexts) + + def _parse_json_object_response(self, content: str) -> Dict[str, Any]: + """Parse a JSON object response, tolerating fenced markdown wrappers.""" + content = content.strip() + fence_match = re.match(r"^\s*```[a-zA-Z0-9_-]*\s*\n(?P<body>.*)\n\s*```\s*$", content, re.S) + if fence_match: + content = fence_match.group("body").strip() + return json.loads(content) + + def judge_complex_reasoning( + self, + question: str, + adversarial_answer: str, + recalled_memories: List[Dict[str, Any]], + evidence_dialog_ids: List[str], + sample_id: str, + ) -> Tuple[Optional[bool], float, str, Optional[str], Optional[str]]: + """Judge a category-5 question using recalled 
memories and evidence dialogs.""" + if not self.config.judge_model: + return None, 0.0, "Skipped: requires LLM judge", None, None + + if not self.openai_client: + return None, 0.0, "Skipped: OPENAI_API_KEY not set for LLM judge", None, None + + evidence_memories = self.fetch_evidence_memories( + evidence_dialog_ids, + sample_id, + use_local_cache=True, + ) + if len(evidence_memories) < len(set(evidence_dialog_ids)): + return ( + None, + 0.0, + "Skipped: no evidence memories available for LLM judge", + None, + None, + ) + + recalled_text = self._format_memories_for_llm(recalled_memories, limit=25) + evidence_text = self._format_memories_for_llm(evidence_memories, limit=8) + cache_key = self._make_llm_cache_key( + "judge_cat5", + self.config.judge_model, + question, + adversarial_answer, + evidence_dialog_ids, + recalled_text, + evidence_text, + ) + if cache_key in self.llm_cache: + return self.llm_cache[cache_key] + + prompt = f"""You are judging a LoCoMo category-5 complex reasoning answer. + +Question: +{question} + +Adversarial answer (distractor; may be wrong, incomplete, or occasionally overlap with the evidence): +{adversarial_answer or "None"} + +Recalled memories: +{recalled_text} + +Evidence dialogs (ground truth): +{evidence_text} + +Instructions: +1. Draft the best answer you can using ONLY the recalled memories. +2. If the recalled memories do not support a positive answer, it is valid to answer with: + - "I don't know" + - "The recalled memories do not say" + - "The question premise or person appears incorrect" +3. Compare that drafted answer to the evidence dialogs, which are the only ground truth. +4. Do NOT assume the adversarial answer is always false; use the evidence dialogs to decide. +5. Mark the answer correct if the drafted answer materially agrees with the evidence dialogs, including when the correct outcome is abstaining, correcting the person/entity, or stating that the premise is unsupported. +6. 
Mark the answer incorrect if the drafted answer asserts content contradicted by the evidence, misses clearly available support in the recalled memories, or simply repeats unsupported adversarial content. + +Respond with ONLY a JSON object: +{{ + "generated_answer": "answer drafted from recalled memories", + "verdict": "supported | abstain | contradiction | unsupported | incorrect", + "correct": true, + "confidence": 0.0, + "reasoning": "brief explanation grounded in the evidence dialogs" +}}""" + + try: + response = self.openai_client.chat.completions.create( + model=self.config.judge_model, + messages=[ + { + "role": "system", + "content": ( + "You are a strict benchmark judge. " + "Use recalled memories to draft the answer and evidence dialogs " + "only to verify correctness." + ), + }, + {"role": "user", "content": prompt}, + ], + temperature=0.0, + max_tokens=250, + response_format={"type": "json_object"}, + timeout=self.OPENAI_REQUEST_TIMEOUT_SECONDS, + ) + result = self._parse_json_object_response(response.choices[0].message.content) + correct = result.get("correct") + if not isinstance(correct, bool): + raise ValueError("Judge response missing boolean 'correct'") + + confidence = float(result.get("confidence", 0.0)) + generated_answer = str(result.get("generated_answer", "")).strip() or None + reasoning = str(result.get("reasoning", "")).strip() + explanation = f"LLM judge: {reasoning}" if reasoning else "LLM judge" + judge_result = (correct, confidence, explanation, generated_answer, reasoning or None) + self.llm_cache[cache_key] = judge_result + return judge_result + except Exception as e: + error_result = ( + None, + 0.0, + f"Skipped: LLM judge error: {e}", + None, + f"LLM judge error: {e}", + ) + self.llm_cache[cache_key] = error_result + return error_result + def check_answer_in_memories( self, question: str, @@ -1238,11 +1443,109 @@ def check_answer_in_memories( return is_correct, max_confidence, explanation + def _recall_memories_for_qa( + self, + 
question: str, + sample_id: str, + evidence: List[str], + ) -> List[Dict[str, Any]]: + """Recall memories for a single benchmark QA row.""" + if evidence and len(evidence) > 1: + return self.multi_hop_recall_with_graph( + question, + sample_id, + initial_limit=20, + max_connected=60, + ) + + return self.recall_for_question( + question, + sample_id, + evidence_count=len(evidence), + ) + + def _evaluate_question(self, qa: Dict[str, Any], sample_id: str) -> Dict[str, Any]: + """Evaluate one QA item and return a normalized result payload.""" + question = qa.get("question", "") + answer = qa.get("answer", "") + category = qa.get("category", 0) + evidence = qa.get("evidence", []) + adversarial_answer = qa.get("adversarial_answer") + + base_result = { + "question": question, + "expected_answer": answer, + "adversarial_answer": adversarial_answer, + "category": category, + "is_correct": None, + "confidence": 0.0, + "recalled_count": 0, + "explanation": "", + "judge_generated_answer": None, + "judge_reasoning": None, + } + + if category == 5 and not answer and not self.config.judge_model: + base_result["explanation"] = "Skipped: requires LLM judge" + return base_result + + recalled_memories = self._recall_memories_for_qa(question, sample_id, evidence) + base_result["recalled_count"] = len(recalled_memories) + + if category == 5 and not answer: + ( + is_correct, + confidence, + explanation, + generated_answer, + judge_reasoning, + ) = self.judge_complex_reasoning( + question, + adversarial_answer or "", + recalled_memories, + evidence, + sample_id, + ) + base_result.update( + { + "is_correct": is_correct, + "confidence": confidence, + "explanation": explanation, + "judge_generated_answer": generated_answer, + "judge_reasoning": judge_reasoning, + } + ) + return base_result + + is_correct, confidence, explanation = self.check_answer_in_memories( + question, + answer, + recalled_memories, + evidence, + sample_id, + ) + if category == 5: + explanation = f"Deterministic cat-5 
scoring: {explanation}" + + base_result.update( + { + "is_correct": is_correct, + "confidence": confidence, + "explanation": explanation, + } + ) + return base_result + def _evaluate_only(self, conversation: Dict[str, Any], sample_id: str) -> Dict[str, Any]: """Evaluate a conversation without ingesting data (assumes already loaded).""" print(f"\n{'='*60}") print(f"Evaluating Conversation (eval-only): {sample_id}") print(f"{'='*60}") + if "conversation" in conversation: + self._cache_prepared_memories( + sample_id, + self._prepare_conversation_memories(conversation, sample_id), + ) # Skip straight to evaluation — same logic as evaluate_conversation but no load step qa_results = [] @@ -1250,57 +1553,10 @@ def _evaluate_only(self, conversation: Dict[str, Any], sample_id: str) -> Dict[s print(f"\nEvaluating {len(questions)} questions...") for i, qa in enumerate(questions): - question = qa.get("question", "") - answer = qa.get("answer", "") - category = qa.get("category", 0) - evidence = qa.get("evidence", []) - - # Category 5 (Complex Reasoning) needs an LLM judge — the - # dataset's ground-truth is either absent or trivial (yes/no). 
- if category == 5: - qa_results.append( - { - "question": question, - "expected_answer": qa.get("adversarial_answer", answer), - "category": category, - "is_correct": None, - "confidence": 0.0, - "recalled_count": 0, - "explanation": "Skipped: requires LLM judge", - } - ) - continue - - if evidence and len(evidence) > 1: - recalled_memories = self.multi_hop_recall_with_graph( - question, - sample_id, - initial_limit=20, - max_connected=60, - ) - else: - recalled_memories = self.recall_for_question( - question, - sample_id, - evidence_count=len(evidence), - ) - - is_correct, confidence, explanation = self.check_answer_in_memories( - question, answer, recalled_memories, evidence, sample_id - ) - - qa_results.append( - { - "question": question, - "expected_answer": answer, - "category": category, - "is_correct": is_correct, - "confidence": confidence, - "recalled_count": len(recalled_memories), - "explanation": explanation, - } - ) - self.results[category].append(is_correct) + qa_result = self._evaluate_question(qa, sample_id) + qa_results.append(qa_result) + if qa_result["is_correct"] is not None: + self.results[qa_result["category"]].append(qa_result["is_correct"]) if (i + 1) % 10 == 0: print(f" Processed {i+1}/{len(questions)} questions...") @@ -1353,65 +1609,12 @@ def evaluate_conversation(self, conversation: Dict[str, Any], sample_id: str) -> print(f"\n❓ Evaluating {len(questions)} questions...") for i, qa in enumerate(questions): - question = qa.get("question", "") - answer = qa.get("answer", "") - category = qa.get("category", 0) - evidence = qa.get("evidence", []) - - # Category 5 (Complex Reasoning) needs an LLM judge — the - # dataset's ground-truth is either absent or trivial (yes/no). 
- if category == 5: - qa_results.append( - { - "question": question, - "expected_answer": qa.get("adversarial_answer", answer), - "category": category, - "is_correct": None, - "confidence": 0.0, - "recalled_count": 0, - "explanation": "Skipped: requires LLM judge", - } - ) - if (i + 1) % 10 == 0: - print(f" Processed {i+1}/{len(questions)} questions...") - continue - - # Recall memories for this question - # Use graph expansion for multi-hop questions (evidence > 1) - if evidence and len(evidence) > 1: - recalled_memories = self.multi_hop_recall_with_graph( - question, - sample_id, - initial_limit=20, - max_connected=60, - ) - else: - recalled_memories = self.recall_for_question( - question, - sample_id, - evidence_count=len(evidence), - ) - - # Check if answer is in recalled memories - # Phase 2.5: Pass sample_id to enable evidence fetching - is_correct, confidence, explanation = self.check_answer_in_memories( - question, answer, recalled_memories, evidence, sample_id - ) - - # Record result - qa_result = { - "question": question, - "expected_answer": answer, - "category": category, - "is_correct": is_correct, - "confidence": confidence, - "recalled_count": len(recalled_memories), - "explanation": explanation, - } + qa_result = self._evaluate_question(qa, sample_id) qa_results.append(qa_result) # Track results by category - self.results[category].append(is_correct) + if qa_result["is_correct"] is not None: + self.results[qa_result["category"]].append(qa_result["is_correct"]) # Progress indicator if (i + 1) % 10 == 0: @@ -1586,9 +1789,10 @@ def run_benchmark( "correct": correct, "total": total, } - print( - f" {category_names.get(category, f'Category {category}'):25s}: {accuracy:6.2%} ({correct:3d}/{total:3d})" - ) + if category != 5 or cat5_skipped == 0: + print( + f" {category_names.get(category, f'Category {category}'):25s}: {accuracy:6.2%} ({correct:3d}/{total:3d})" + ) if cat5_skipped: cat5_name = category_names[5] @@ -1597,29 +1801,47 @@ def run_benchmark( 
"name": cat5_name, "accuracy": None, "correct": 0, - "total": cat5_skipped, + "total": 0, "skipped": True, + "skipped_count": cat5_skipped, } + reason = ( + "judge unavailable/errors" if self.config.judge_model else "needs LLM judge" + ) + print(f" {cat5_name:25s}: N/A ({cat5_skipped:3d} skipped, {reason})") else: category_results[5]["skipped_count"] = cat5_skipped - print(f" {cat5_name:25s}: N/A ({cat5_skipped:3d} skipped, needs LLM judge)") + category_results[5]["skipped"] = True + correct = category_results[5]["correct"] + total = category_results[5]["total"] + accuracy = category_results[5]["accuracy"] + print( + f" {cat5_name:25s}: {accuracy:6.2%} ({correct:3d}/{total:3d}, {cat5_skipped:3d} skipped)" + ) - # Comparison with CORE (their 88.24% includes cat5 via GPT-4 judge) + # Comparison with the published CORE reference. + # Treat 88.24% as a useful external reference point, not a strict + # apples-to-apples leaderboard, because public LoCoMo setups differ. core_sota = 0.8824 improvement = overall_accuracy - core_sota - print("\n🏆 Comparison with CORE (SOTA):") + print("\n📊 Comparison with published CORE reference:") print(f" CORE: {core_sota:.2%}") print(f" AutoMem: {overall_accuracy:.2%}") if cat5_skipped: - print( - f" ⚠️ AutoMem excludes {cat5_skipped} cat-5 Qs (needs LLM judge); CORE includes them" - ) + if self.config.judge_model: + print( + f" ⚠️ AutoMem skipped {cat5_skipped} cat-5 Qs due to judge/missing-evidence errors" + ) + else: + print( + f" ⚠️ AutoMem excludes {cat5_skipped} cat-5 Qs (needs LLM judge); treat comparison as directional only" + ) if improvement > 0: - print(f" 🎉 AutoMem leads by {improvement:.2%}") + print(f" 📈 AutoMem is {improvement:.2%} above that reference") elif improvement < 0: - print(f" 📉 AutoMem is {abs(improvement):.2%} behind CORE") + print(f" 📉 AutoMem is {abs(improvement):.2%} behind that reference") else: - print(" 🤝 AutoMem matches CORE") + print(" 🤝 AutoMem matches that reference") # Cleanup if cleanup_after: @@ 
-1634,6 +1856,9 @@ def run_benchmark( "total": total_questions, "elapsed_time": elapsed_time, }, + "judge_requested": bool(self.config.judge_model), + "judge_available": bool(self.config.judge_model and self.openai_client), + "judge_model": self.config.judge_model, "categories": category_results, "conversations": conversation_results, "comparison": { @@ -1641,7 +1866,11 @@ def run_benchmark( "automem": overall_accuracy, "improvement": improvement, "cat5_excluded": cat5_skipped, - "note": "CORE 88.24% includes cat-5 via GPT-4 judge" if cat5_skipped else None, + "note": ( + "CORE 88.24% includes cat-5 via GPT-4 judge" + if cat5_skipped and not self.config.judge_model + else None + ), }, } @@ -1694,6 +1923,16 @@ def main(): action="store_true", help="Only evaluate (skip ingestion). Assumes data already loaded.", ) + parser.add_argument( + "--judge", + action="store_true", + help="Enable category-5 LLM judging (defaults to gpt-4o unless BENCH_JUDGE_MODEL is set).", + ) + parser.add_argument( + "--judge-model", + default=None, + help="LLM model for category-5 judging (also enables judge mode).", + ) args = parser.parse_args() @@ -1704,6 +1943,8 @@ def main(): if args.data_file: config.data_file = args.data_file + if args.judge or args.judge_model: + config.judge_model = args.judge_model or config.judge_model or "gpt-4o" # Run evaluation evaluator = LoCoMoEvaluator(config) diff --git a/tests/test_locomo_cat5_judge.py b/tests/test_locomo_cat5_judge.py new file mode 100644 index 00000000..20320920 --- /dev/null +++ b/tests/test_locomo_cat5_judge.py @@ -0,0 +1,408 @@ +import sys +from importlib.util import module_from_spec, spec_from_file_location +from pathlib import Path +from types import SimpleNamespace +from typing import Any, Dict, List + +import pytest + +_MODULE_NAME = "locomo_benchmark_module_cat5" + + +def _load_locomo_module() -> Any: + if _MODULE_NAME in sys.modules: + return sys.modules[_MODULE_NAME] + module_path = Path(__file__).resolve().parent / 
"benchmarks" / "test_locomo.py" + spec = spec_from_file_location(_MODULE_NAME, module_path) + assert spec is not None + assert spec.loader is not None + module = module_from_spec(spec) + sys.modules[_MODULE_NAME] = module + spec.loader.exec_module(module) + return module + + +@pytest.fixture() +def locomo_module() -> Any: + return _load_locomo_module() + + +@pytest.fixture() +def locomo_evaluator(locomo_module: Any) -> Any: + config = locomo_module.LoCoMoConfig() + config.judge_model = None + evaluator = locomo_module.LoCoMoEvaluator(config) + evaluator.openai_client = None + evaluator.use_embedding_similarity = False + return evaluator + + +def _fake_memory(dialog_id: str, content: str) -> Dict[str, Any]: + return { + "content": content, + "metadata": { + "dialog_id": dialog_id, + "session_datetime": "2023-05-08T13:56:00+00:00", + }, + } + + +def _cat5_qa(**overrides: Any) -> Dict[str, Any]: + qa = { + "question": "How does Jolene plan to pursue her dream of climbing mountains?", + "answer": "", + "adversarial_answer": "By ignoring training and winging it.", + "category": 5, + "evidence": ["D10:20"], + } + qa.update(overrides) + return qa + + +def test_cat5_without_judge_still_skips(locomo_evaluator: Any) -> None: + result = locomo_evaluator._evaluate_question(_cat5_qa(), "conv-1") + + assert result["is_correct"] is None + assert result["recalled_count"] == 0 + assert result["explanation"] == "Skipped: requires LLM judge" + + +def test_cat5_with_judge_counts_toward_category_results( + locomo_evaluator: Any, monkeypatch: pytest.MonkeyPatch +) -> None: + locomo_evaluator.config.judge_model = "gpt-4o" + monkeypatch.setattr( + locomo_evaluator, + "_recall_memories_for_qa", + lambda question, sample_id, evidence: [_fake_memory("D10:20", "Jolene watches videos.")], + ) + monkeypatch.setattr( + locomo_evaluator, + "judge_complex_reasoning", + lambda question, adversarial_answer, recalled_memories, evidence_dialog_ids, sample_id: ( + True, + 0.92, + "LLM judge: supported 
by evidence", + "She plans to study and start with beginner climbs.", + "supported by evidence", + ), + ) + + conversation = {"qa": [_cat5_qa()]} + result = locomo_evaluator._evaluate_only(conversation, "conv-1") + + assert result["correct"] == 1 + assert result["total_questions"] == 1 + assert locomo_evaluator.results[5] == [True] + assert result["qa_results"][0]["judge_generated_answer"] is not None + + +def test_fetch_evidence_memories_uses_local_conversation_cache( + locomo_evaluator: Any, +) -> None: + locomo_evaluator.local_conversation_memories["conv-26"] = { + "D2:3": _fake_memory("D2:3", "Target evidence"), + "D5:8": _fake_memory("D5:8", "Secondary evidence"), + } + + evidence = locomo_evaluator.fetch_evidence_memories( + ["D2:3", "D5:8"], + "conv-26", + use_local_cache=True, + ) + + assert [memory["metadata"]["dialog_id"] for memory in evidence] == ["D2:3", "D5:8"] + + +def test_cat5_judge_uses_cache(locomo_evaluator: Any, monkeypatch: pytest.MonkeyPatch) -> None: + locomo_evaluator.config.judge_model = "gpt-4o" + calls = {"count": 0} + + class FakeCompletions: + def create(self, **kwargs: Any) -> Any: + calls["count"] += 1 + return SimpleNamespace( + choices=[ + SimpleNamespace( + message=SimpleNamespace( + content=( + '{"generated_answer": "She will research and train.", ' + '"verdict": "supported", ' + '"correct": true, "confidence": 0.88, ' + '"reasoning": "Matches the evidence dialog."}' + ) + ) + ) + ] + ) + + locomo_evaluator.openai_client = SimpleNamespace( + chat=SimpleNamespace(completions=FakeCompletions()) + ) + monkeypatch.setattr( + locomo_evaluator, + "fetch_evidence_memories", + lambda evidence_dialog_ids, sample_id, use_local_cache=False: [ + _fake_memory("D10:20", "Jolene is gathering information and watching videos.") + ], + ) + + recalled = [_fake_memory("R1", "Jolene is gathering information and watching videos.")] + first = locomo_evaluator.judge_complex_reasoning( + _cat5_qa()["question"], + _cat5_qa()["adversarial_answer"], + 
recalled, + ["D10:20"], + "conv-1", + ) + second = locomo_evaluator.judge_complex_reasoning( + _cat5_qa()["question"], + _cat5_qa()["adversarial_answer"], + recalled, + ["D10:20"], + "conv-1", + ) + + assert calls["count"] == 1 + assert first == second + assert first[0] is True + + +def test_cat5_judge_sets_request_timeout( + locomo_evaluator: Any, monkeypatch: pytest.MonkeyPatch +) -> None: + locomo_evaluator.config.judge_model = "gpt-4o" + captured = {} + + class FakeCompletions: + def create(self, **kwargs: Any) -> Any: + captured["timeout"] = kwargs.get("timeout") + return SimpleNamespace( + choices=[ + SimpleNamespace( + message=SimpleNamespace( + content=( + '{"generated_answer": "She will research and train.", ' + '"verdict": "supported", ' + '"correct": true, "confidence": 0.88, ' + '"reasoning": "Matches the evidence dialog."}' + ) + ) + ) + ] + ) + + locomo_evaluator.openai_client = SimpleNamespace( + chat=SimpleNamespace(completions=FakeCompletions()) + ) + monkeypatch.setattr( + locomo_evaluator, + "fetch_evidence_memories", + lambda evidence_dialog_ids, sample_id, use_local_cache=False: [ + _fake_memory("D10:20", "Jolene is gathering information and watching videos.") + ], + ) + + result = locomo_evaluator.judge_complex_reasoning( + _cat5_qa()["question"], + _cat5_qa()["adversarial_answer"], + [_fake_memory("R1", "Jolene is gathering information and watching videos.")], + ["D10:20"], + "conv-1", + ) + + assert result[0] is True + assert captured["timeout"] == locomo_evaluator.OPENAI_REQUEST_TIMEOUT_SECONDS + + +@pytest.mark.parametrize( + ("evidence_memories", "response_content", "expected_message"), + [ + ( + [], + '{"generated_answer": "", "verdict": "unsupported", "correct": true, "confidence": 0.1, "reasoning": ""}', + "Skipped: no evidence memories available for LLM judge", + ), + ( + [{"content": "evidence", "metadata": {"dialog_id": "D10:20"}}], + "not-json", + "Skipped: LLM judge error:", + ), + ], +) +def 
test_cat5_judge_failures_skip_instead_of_marking_wrong( + locomo_evaluator: Any, + monkeypatch: pytest.MonkeyPatch, + evidence_memories: List[Dict[str, Any]], + response_content: str, + expected_message: str, +) -> None: + locomo_evaluator.config.judge_model = "gpt-4o" + + class FakeCompletions: + def create(self, **kwargs: Any) -> Any: + return SimpleNamespace( + choices=[SimpleNamespace(message=SimpleNamespace(content=response_content))] + ) + + locomo_evaluator.openai_client = SimpleNamespace( + chat=SimpleNamespace(completions=FakeCompletions()) + ) + monkeypatch.setattr( + locomo_evaluator, + "fetch_evidence_memories", + lambda evidence_dialog_ids, sample_id, use_local_cache=False: evidence_memories, + ) + monkeypatch.setattr( + locomo_evaluator, + "_recall_memories_for_qa", + lambda question, sample_id, evidence: [ + _fake_memory("R1", "Jolene is gathering information and watching videos.") + ], + ) + + result = locomo_evaluator._evaluate_question(_cat5_qa(), "conv-1") + + assert result["is_correct"] is None + assert result["confidence"] == 0.0 + assert expected_message in result["explanation"] + + +def test_cat5_judge_prompt_allows_abstention_and_wrong_premise( + locomo_evaluator: Any, monkeypatch: pytest.MonkeyPatch +) -> None: + locomo_evaluator.config.judge_model = "gpt-4o" + captured = {} + + class FakeCompletions: + def create(self, **kwargs: Any) -> Any: + captured["messages"] = kwargs["messages"] + return SimpleNamespace( + choices=[ + SimpleNamespace( + message=SimpleNamespace( + content=( + '{"generated_answer": "I don\'t know; the premise seems wrong.", ' + '"verdict": "contradiction", ' + '"correct": true, "confidence": 0.91, ' + '"reasoning": "The evidence shows the question names the wrong person."}' + ) + ) + ) + ] + ) + + locomo_evaluator.openai_client = SimpleNamespace( + chat=SimpleNamespace(completions=FakeCompletions()) + ) + monkeypatch.setattr( + locomo_evaluator, + "fetch_evidence_memories", + lambda evidence_dialog_ids, sample_id, 
use_local_cache=False: [ + _fake_memory("D2:3", "The realization belonged to Melanie, not Caroline.") + ], + ) + + result = locomo_evaluator.judge_complex_reasoning( + "What did Caroline realize after her charity race?", + "self-care is important", + [_fake_memory("R1", "Melanie talked about self-care after the race.")], + ["D2:3"], + "conv-26", + ) + + prompt = captured["messages"][1]["content"] + assert "Do NOT assume the adversarial answer is always false" in prompt + assert ( + "abstaining, correcting the person/entity, or stating that the premise is unsupported" + in prompt + ) + assert result[0] is True + + +def test_cat5_judge_prompt_includes_more_than_old_top_12_memories( + locomo_evaluator: Any, monkeypatch: pytest.MonkeyPatch +) -> None: + locomo_evaluator.config.judge_model = "gpt-4o" + captured = {} + + class FakeCompletions: + def create(self, **kwargs: Any) -> Any: + captured["messages"] = kwargs["messages"] + return SimpleNamespace( + choices=[ + SimpleNamespace( + message=SimpleNamespace( + content=( + '{"generated_answer": "Answer", ' + '"verdict": "supported", ' + '"correct": true, "confidence": 0.8, ' + '"reasoning": "ok"}' + ) + ) + ) + ] + ) + + locomo_evaluator.openai_client = SimpleNamespace( + chat=SimpleNamespace(completions=FakeCompletions()) + ) + monkeypatch.setattr( + locomo_evaluator, + "fetch_evidence_memories", + lambda evidence_dialog_ids, sample_id, use_local_cache=False: [ + _fake_memory("D10:20", "Target evidence") + ], + ) + recalled = [_fake_memory(f"R{i}", f"Memory {i}") for i in range(1, 16)] + + locomo_evaluator.judge_complex_reasoning( + _cat5_qa()["question"], + _cat5_qa()["adversarial_answer"], + recalled, + ["D10:20"], + "conv-1", + ) + + prompt = captured["messages"][1]["content"] + assert "[R15] Memory 15" in prompt + + +def test_cat5_with_canonical_answer_stays_deterministic( + locomo_evaluator: Any, monkeypatch: pytest.MonkeyPatch +) -> None: + locomo_evaluator.config.judge_model = "gpt-4o" + monkeypatch.setattr( + 
locomo_evaluator, + "_recall_memories_for_qa", + lambda question, sample_id, evidence: [_fake_memory("D5:8", "Caroline did not make it.")], + ) + monkeypatch.setattr( + locomo_evaluator, + "check_answer_in_memories", + lambda question, expected_answer, recalled_memories, evidence_dialog_ids, sample_id: ( + True, + 1.0, + "Found answer", + ), + ) + monkeypatch.setattr( + locomo_evaluator, + "judge_complex_reasoning", + lambda *args, **kwargs: (_ for _ in ()).throw(AssertionError("judge should not be used")), + ) + + result = locomo_evaluator._evaluate_question( + _cat5_qa( + question="Did Caroline make the black and white bowl in the photo?", + answer="No", + adversarial_answer="Yes", + evidence=["D5:8"], + ), + "conv-26", + ) + + assert result["is_correct"] is True + assert result["expected_answer"] == "No" + assert result["judge_generated_answer"] is None + assert result["explanation"].startswith("Deterministic cat-5 scoring:")
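Reviewer note: the tests above exercise three scoring paths for a QA row (skip, LLM judge, deterministic match), plus the CLI precedence for choosing the judge model. The sketch below restates that decision logic in isolation so it can be checked without an evaluator instance. `route_question` and `resolve_judge_model` are hypothetical helper names introduced only for illustration; they are not part of the diff, and the real `_evaluate_question` also performs recall and builds the result payload.

```python
from typing import Optional


def route_question(category: int, answer: str, judge_model: Optional[str]) -> str:
    """Return the scoring path a QA row would take (hypothetical helper).

    Mirrors the branch order in the diff's _evaluate_question:
    category-5 rows with no canonical answer need the judge; every other
    row (including category-5 rows that do carry an answer) is scored
    deterministically.
    """
    needs_judge = category == 5 and not answer
    if needs_judge and not judge_model:
        return "skip"
    if needs_judge:
        return "judge"
    return "deterministic"


def resolve_judge_model(
    judge_flag: bool,
    cli_model: Optional[str],
    config_model: Optional[str],
) -> Optional[str]:
    """Apply the precedence used in main(): CLI model, then config, then gpt-4o.

    When neither --judge nor --judge-model is passed, the configured value
    is left untouched (assumed here to be returned as-is).
    """
    if judge_flag or cli_model:
        return cli_model or config_model or "gpt-4o"
    return config_model


if __name__ == "__main__":
    print(route_question(5, "", None))
    print(route_question(5, "", "gpt-4o"))
    print(route_question(5, "No", "gpt-4o"))
    print(resolve_judge_model(True, None, None))
```

Keeping the routing as a pure function of `(category, answer, judge_model)` is one way to make the skip/judge/deterministic split testable without monkeypatching recall; whether that refactor is worth it here is a judgment call for the author.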