diff --git a/benchmarks/EXPERIMENT_LOG.md b/benchmarks/EXPERIMENT_LOG.md
index 9b7db1d6..72d31bfb 100644
--- a/benchmarks/EXPERIMENT_LOG.md
+++ b/benchmarks/EXPERIMENT_LOG.md
@@ -27,6 +27,16 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
 | 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
 | 2026-03-10 | eval-fix | docs/benchmark-agent-guidelines | **89.27% (208/233)** | -- | -- | Fix temporal matching (answer vs memory dates) + skip cat5 (no ground truth). Honest score, beats CORE by 1.03pp. |
 | 2026-03-10 | cat5-judge | feat/bench-cat5-judge | **89.80% (273/304)** | **87.56% (1739/1986)** | -- | Opt-in GPT-4o judge for cat5. Full run scored cat5 at 95.74% (427/446) with 0 judge skips/errors; added 90s OpenAI request timeout to prevent stuck full runs. |
+| 2026-03-10 | main-refresh (no judge) | main | **89.36% (210/235)** | -- | -- | Fresh current-main rerun before PR #80 experiment. Comparison anchor for judge-off. |
+| 2026-03-10 | main-refresh (judge) | main | **90.13% (274/304)** | -- | -- | Fresh current-main rerun with `BENCH_JUDGE_MODEL=gpt-4o`. Comparison anchor for judge-on. |
+| 2026-03-11 | PR #80 port (no judge) | exp/pr80-enhanced-recall-v2 | 85.53% (201/235) | -- | -- | BM25 + query expansion + rerank port. Regressed **-3.83pp** vs fresh main. Open-domain -11.4pp. Runtime 7.4x. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
+| 2026-03-11 | PR #80 port (judge) | exp/pr80-enhanced-recall-v2 | 88.16% (268/304) | -- | -- | Same branch with GPT-4o cat5 judge. Regressed **-1.97pp** vs fresh main. Runtime 10.2x. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
+| 2026-03-11 | PR #80 BM25-only f10 | exp/pr80-bm25-only-f10 | 88.09% (-1.28) | -- | -- | Best config variant, still regressed. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
+| 2026-03-11 | PR #80 BM25-only f20 | exp/pr80-bm25-only-f20 | 87.66% (-1.70) | -- | -- | More BM25 results = more dilution. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
+| 2026-03-11 | PR #80 BM25+rerank top5 | exp/pr80-bm25-rerank-top5 | 87.23% (-2.13) | -- | -- | Reranking didn't recover regression. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
+| 2026-03-11 | PR #80 BM25+rerank top10 | exp/pr80-bm25-rerank-top10 | 86.81% (-2.55) | -- | -- | Wider rerank window = worse. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
+| 2026-03-11 | #74 entity expansion | exp/74-entity-expansion-precision-v1 | 89.36% (+0.0) | -- | -- | Hub-node detection. Zero delta — benchmark doesn't exercise graph expansion. → [postmortem](postmortems/2026-03-11_issue74_entity_expansion_precision.md) |
+| 2026-03-12 | #79 (PR #125) | exp/79-priority-ids-fetch-v1 | 89.36% (+0.0) | -- | -- | Bug fix: priority_ids now fetches by ID. Merged. → [postmortem](postmortems/2026-03-12_issue79_priority_ids_fetch.md) |
 
 ### Category Breakdown (LoCoMo-mini)
 
@@ -38,6 +48,13 @@ Categories 1-4 are scored by word-overlap/date matching. Category 5 uses an opt-
 | 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
 | 2026-03-10 | eval-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | N/A (71 skipped) |
 | 2026-03-10 | cat5-judge | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | **91.5% (65/71)** |
+| 2026-03-10 | main-refresh (no judge) | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | **96.5% (110/114)** | 100.0% (2/2, 69 skipped) |
+| 2026-03-10 | main-refresh (judge) | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | **96.5% (110/114)** | **93.0% (66/71)** |
+| 2026-03-11 | PR #80 port (no judge) | **86.0% (37/43)** | **93.7% (59/63)** | 46.2% (6/13) | 85.1% (97/114) | 100.0% (2/2, 69 skipped) |
+| 2026-03-11 | PR #80 port (judge) | **88.4% (38/43)** | **92.1% (58/63)** | 46.2% (6/13) | 86.0% (98/114) | **95.8% (68/71)** |
+| 2026-03-11 | PR #80 BM25-only f10 | 81.4% (35/43) | 92.1% (58/63) | 46.2% (6/13) | 93.0% (106/114) | N/A |
+| 2026-03-11 | #74 entity expansion | 79.1% (34/43) | 92.1% (58/63) | 46.2% (6/13) | 96.5% (110/114) | N/A |
+| 2026-03-12 | #79 (PR #125) | 79.1% (34/43) | 92.1% (58/63) | 46.2% (6/13) | 96.5% (110/114) | N/A |
 
 \* Temporal was artificially low: evaluator compared question dates (empty) vs memory dates instead of answer dates.
 \*\* Complex was artificially 100%: dataset has no `answer` field for cat5 → empty string → `"" in content` always True.
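The cat5 footnote in the experiment log describes a substring check that passes vacuously when the expected answer is empty. The sketch below reproduces that behavior in plain Python and shows one possible guard. `naive_match` and `guarded_match` are hypothetical names used only for illustration; they are not the benchmark's actual evaluator functions, and returning `None` for an unscorable item is an assumption about how a skip might be represented.

```python
from typing import Optional


def naive_match(answer: str, content: str) -> bool:
    """Unguarded substring check, mirroring the failure mode in the footnote."""
    # In Python, the empty string is a substring of every string,
    # so a missing answer ("") makes this return True for any content.
    return answer in content


def guarded_match(answer: Optional[str], content: str) -> Optional[bool]:
    """Hypothetical guard: report None (unscorable) when there is no ground truth."""
    if answer is None or not answer.strip():
        return None  # nothing to compare against, so do not count it as correct
    return answer.strip().lower() in content.lower()


if __name__ == "__main__":
    print(naive_match("", "any retrieved memory"))    # vacuous pass
    print(guarded_match("", "any retrieved memory"))  # skipped instead of scored
```

A guard of this shape would turn the spurious 100% into skipped items, which is consistent with the "N/A (71 skipped)" entry in the eval-fix row.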
diff --git a/benchmarks/postmortems/.gitkeep b/benchmarks/postmortems/.gitkeep
new file mode 100644
index 00000000..e69de29b
diff --git a/benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md b/benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md
new file mode 100644
index 00000000..61e633eb
--- /dev/null
+++ b/benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md
@@ -0,0 +1,44 @@
+# Issue #74: Graph expansion follows too many hops through hub nodes
+
+**Status**: NON-PROMOTED / REJECTED (this direction)
+**Date**: 2026-03-11
+**Issue**: #74 (remains open)
+**Branch**: `exp/74-entity-expansion-precision-v1` (commit `236b075`)
+
+## Hypothesis
+
+Hub-node pollution in graph expansion degrades recall precision. Querying "Alex Panagis" with `expand_entities=true` incorrectly pulls in unrelated memories through shared generic nodes (e.g., the "AutoMem" tool node connects all users). Hub-node detection and deprioritization should improve precision without harming recall.
+
+## Benchmark
+
+| Metric | Baseline | Test | Delta |
+|--------|----------|------|-------|
+| LoCoMo-mini | 89.36% | 89.36% | **0.0** |
+
+**Category deltas**: All zero across all five categories.
+
+## Why Zero Delta
+
+The LoCoMo benchmark doesn't exercise graph expansion (`expand_entities`) in its query path. All LoCoMo questions are answered via vector + keyword search. The hub-node problem is real in production (where users query with `expand_entities=true`), but this benchmark can't surface the improvement.
+
+## Commands Run
+
+```bash
+make bench-eval BENCH=locomo-mini CONFIG=baseline # on exp/74 branch
+make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline # vs main
+```
+
+## Outcome
+
+Hub-node detection alone doesn't move the needle on LoCoMo. The experiment was correctly executed but the benchmark isn't the right instrument to measure this improvement.
+
+## Promoted Artifact
+
+`tests/benchmarks/results/compare_issue74_entity_precision_20260311.json`
+
+## Follow-up
+
+- Issue #74 remains open — the problem is real, just not benchmark-measurable yet
+- Consider Personalized PageRank (#100) or configurable `max_hops` as alternative approaches
+- Need a graph-expansion-specific test suite (queries with `expand_entities=true`) to measure future attempts
+- Should not be the default direction for recall improvement work
diff --git a/benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md b/benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
new file mode 100644
index 00000000..f2a13cd9
--- /dev/null
+++ b/benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
@@ -0,0 +1,93 @@
+# PR #80: Enhanced Recall — BM25, LLM Reranking, Query Expansion
+
+**Status**: REJECTED — regression across all tested configurations
+**Date**: 2026-03-11
+**PR**: #80 (by jescalan)
+**Branch**: `exp/pr80-enhanced-recall-v2` (commit `a122ba2`)
+**Author**: jescalan
+
+## Hypothesis
+
+Adding BM25 full-text search, LLM reranking, and query expansion to the recall pipeline will improve accuracy by catching keyword matches that vector search misses and filtering false positives.
+
+**Pipeline**: query expansion → vector + graph + BM25 search → RRF fusion → metadata scoring → LLM reranking
+
+## Benchmark Results
+
+### Full Port (all features enabled)
+
+| Metric | Baseline (main) | PR #80 | Delta |
+|--------|-----------------|--------|-------|
+| LoCoMo-mini (judge-off) | 89.36% | 85.53% | **-3.83pp** |
+| LoCoMo-mini (judge-on) | 90.13% | 88.16% | **-1.97pp** |
+
+### Category Breakdown (judge-off)
+
+| Category | Baseline | PR #80 | Delta |
+|----------|----------|--------|-------|
+| Single-hop | 79.1% | 86.0% | **+7.0pp** |
+| Temporal | 92.1% | 93.7% | **+1.6pp** |
+| Multi-hop | 46.2% | 46.2% | **0.0** |
+| Open Domain | 96.5% | 85.1% | **-11.4pp** |
+| Complex | N/A | N/A | -- |
+
+### Config Sweep (Round 1, judge-off)
+
+| Config | Accuracy | Delta vs Main | Notes |
+|--------|----------|---------------|-------|
+| BM25-only f10 | 88.09% | **-1.28pp** | Best variant, still regressed |
+| BM25-only f20 | 87.66% | **-1.70pp** | More BM25 results = more dilution |
+| BM25 + rerank top-5 | 87.23% | **-2.13pp** | Reranking didn't help |
+| BM25 + rerank top-10 | 86.81% | **-2.55pp** | Wider rerank window = worse |
+
+### Runtime Impact
+
+| Config | Runtime | vs Baseline |
+|--------|---------|-------------|
+| Main (baseline) | 209s | -- |
+| Full port (judge-off) | 1546s | **7.4x slower** |
+| Full port (judge-on) | 2136s | **10.2x slower** |
+
+## Root Cause Analysis
+
+**Open-domain regression is the primary problem.** Open-domain questions (e.g., "What does Alex think about remote work?") rely on semantic similarity — the answer is conceptually related but doesn't share keywords with the question. BM25 keyword matches dilute the vector results via Reciprocal Rank Fusion (RRF), pushing semantically-relevant results down.
+
+The single-hop improvement (+7.0pp) shows BM25 *does* help for factual lookups where exact keywords matter. But the open-domain loss (-11.4pp) is 1.6x larger and affects more questions (114 vs 43).
+
+**Runtime cost** is driven by LLM reranking (one API call per candidate per question) and query expansion (one API call per question). Even BM25-only configs add SQLite FTS5 overhead.
+
+## Commands Run
+
+```bash
+# Full port evaluation
+make bench-eval BENCH=locomo-mini CONFIG=baseline # on exp/pr80-enhanced-recall-v2
+make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline
+
+# Config sweep (4 variants)
+make bench-eval BENCH=locomo-mini CONFIG=bm25_only_f10
+make bench-eval BENCH=locomo-mini CONFIG=bm25_only_f20
+make bench-eval BENCH=locomo-mini CONFIG=bm25_rerank_top5
+make bench-eval BENCH=locomo-mini CONFIG=bm25_rerank_top10
+```
+
+## Outcome
+
+**REJECTED.** No configuration recovered the open-domain regression. The best variant (BM25-only f10) still regressed -1.28pp overall.
+
+## Decision
+
+PR #80's always-on BM25 fusion approach hurts more than it helps on current benchmarks. The single-hop gains don't compensate for open-domain losses. The 7-10x runtime increase is also prohibitive for production use.
+
+## Promoted Artifacts
+
+- `tests/benchmarks/results/compare_pr80_judge_off_20260311.json`
+- `tests/benchmarks/results/compare_pr80_judge_on_20260311.json`
+- `tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`
+
+## Follow-up Recommendations
+
+1. **Targeted BM25 fallback**: Only invoke BM25 when vector search returns low-confidence results (e.g., all scores < 0.5), rather than always-on fusion
+2. **Category-aware fusion**: Weight BM25 differently for factual vs open-domain queries (requires query classification)
+3. **RRF tuning**: The RRF constant (k=60 default) may be too aggressive for BM25 results — could try lower k to reduce BM25 influence
+4. **Reranking without BM25**: Test LLM reranking on vanilla vector results (without BM25 dilution) as a separate experiment
+5. **Contributor summary**: If closing PR #80, provide benchmark evidence (this postmortem) rather than preference-based feedback
diff --git a/benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md b/benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
new file mode 100644
index 00000000..636c6d06
--- /dev/null
+++ b/benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
@@ -0,0 +1,39 @@
+# Issue #79: priority_ids parameter only boosts relevance
+
+**Status**: ACCEPTED / MERGED
+**Date**: 2026-03-12
+**PR**: #125 (commit `5d3708c`)
+**Branch**: `exp/79-priority-ids-fetch-v1`
+
+## Hypothesis
+
+The `priority_ids` parameter should fetch memories directly by ID and guarantee their inclusion in results, not merely boost their score if they happen to appear in normal search results.
+
+## Benchmark
+
+| Metric | Baseline | Test | Delta |
+|--------|----------|------|-------|
+| LoCoMo-mini | 89.36% | 89.36% | **0.0** |
+
+**Category deltas**: All zero across Single-hop, Temporal, Multi-hop, Open Domain, Complex.
+
+Delta of zero is expected — this is a fetch-behavior bug fix, not a recall scoring change. The benchmark exercises natural-language queries, not ID-based lookups.
+
+## Commands Run
+
+```bash
+make bench-eval BENCH=locomo-mini CONFIG=baseline # on exp/79 branch
+make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline # vs main
+```
+
+## Outcome
+
+Bug fix merged without benchmark regression. MCP clients can now reliably fetch specific memories by ID via `priority_ids`.
+
+## Promoted Artifact
+
+`tests/benchmarks/results/compare_issue79_priority_ids_20260311.json`
+
+## Follow-up
+
+None needed. Related MCP-side fix: `verygoodplugins/mcp-automem#67`.
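The PR #80 postmortem attributes the open-domain regression to BM25 matches diluting vector results through Reciprocal Rank Fusion. The sketch below illustrates that mechanism with the standard RRF formula, score(d) = Σ 1/(k + rank), using the k=60 default the postmortem mentions. The function name `rrf_fuse`, the document IDs, and the two toy rankings are all made up for illustration; this is not the code from PR #80, and the real pipeline also applies metadata scoring and reranking that are omitted here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion (illustrative sketch)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        # Ranks are 1-based; every list contributes 1 / (k + rank) per document.
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break exact ties by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    # Hypothetical open-domain query: the vector ranker finds two semantic hits,
    # while the keyword ranker returns items that merely share terms.
    vector_hits = ["sem1", "sem2", "kw1"]
    bm25_hits = ["kw1", "kw2", "kw3"]
    fused = rrf_fuse([vector_hits, bm25_hits])
    print(fused)      # fused order across both rankers
    print(fused[:3])  # what survives a top-3 cutoff
```

In this toy case `sem2` is second in the vector list but falls outside the fused top three, because keyword-only items are interleaved ahead of it. Note that in this formulation `k` is shared by every ranker, so changing it reshapes all lists' contributions alike; a per-ranker weight would be a separate, hypothetical knob.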
diff --git a/tests/benchmarks/results/compare_issue74_entity_precision_20260311.json b/tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
new file mode 100644
index 00000000..109a983b
--- /dev/null
+++ b/tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
@@ -0,0 +1,14 @@
+{
+  "baseline_accuracy": 0.8936170212765957,
+  "test_accuracy": 0.8936170212765957,
+  "delta": 0.0,
+  "category_deltas": {
+    "Single-hop Recall": 0.0,
+    "Temporal Understanding": 0.0,
+    "Multi-hop Reasoning": 0.0,
+    "Open Domain": 0.0,
+    "Complex Reasoning": 0.0
+  },
+  "baseline_file": "benchmarks/results/locomo-mini_baseline_20260311_041134.json",
+  "test_file": "benchmarks/results/locomo-mini_baseline_20260311_041656.json"
+}
diff --git a/tests/benchmarks/results/compare_issue79_priority_ids_20260311.json b/tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
new file mode 100644
index 00000000..7ad434f6
--- /dev/null
+++ b/tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
@@ -0,0 +1,14 @@
+{
+  "baseline_accuracy": 0.8936170212765957,
+  "test_accuracy": 0.8936170212765957,
+  "delta": 0.0,
+  "category_deltas": {
+    "Single-hop Recall": 0.0,
+    "Temporal Understanding": 0.0,
+    "Multi-hop Reasoning": 0.0,
+    "Open Domain": 0.0,
+    "Complex Reasoning": 0.0
+  },
+  "baseline_file": "benchmarks/results/locomo-mini_baseline_20260311_041134.json",
+  "test_file": "benchmarks/results/locomo-mini_baseline_20260311_055154.json"
+}
diff --git a/tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json b/tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
new file mode 100644
index 00000000..27ca4691
--- /dev/null
+++ b/tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
@@ -0,0 +1,14 @@
+{
+  "baseline_accuracy": 0.8936170212765957,
+  "test_accuracy": 0.8808510638297873,
+  "delta": -0.012765957446808418,
+  "category_deltas": {
+    "Single-hop Recall": 0.023255813953488413,
+    "Temporal Understanding": 0.0,
+    "Multi-hop Reasoning": 0.0,
+    "Open Domain": -0.03508771929824561,
+    "Complex Reasoning": 0.0
+  },
+  "baseline_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_baseline_20260310_233631.json",
+  "test_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json"
+}
diff --git a/tests/benchmarks/results/compare_pr80_judge_off_20260311.json b/tests/benchmarks/results/compare_pr80_judge_off_20260311.json
new file mode 100644
index 00000000..05b3c57a
--- /dev/null
+++ b/tests/benchmarks/results/compare_pr80_judge_off_20260311.json
@@ -0,0 +1,14 @@
+{
+  "baseline_accuracy": 0.8936170212765957,
+  "test_accuracy": 0.8553191489361702,
+  "delta": -0.038297872340425476,
+  "category_deltas": {
+    "Single-hop Recall": 0.06976744186046513,
+    "Temporal Understanding": 0.015873015873015928,
+    "Multi-hop Reasoning": 0.0,
+    "Open Domain": -0.11403508771929827,
+    "Complex Reasoning": 0.0
+  },
+  "baseline_file": "benchmarks/results/locomo-mini_baseline_20260310_233631.json",
+  "test_file": "benchmarks/results/locomo-mini_baseline_20260311_004300.json"
+}
diff --git a/tests/benchmarks/results/compare_pr80_judge_on_20260311.json b/tests/benchmarks/results/compare_pr80_judge_on_20260311.json
new file mode 100644
index 00000000..c8962bd2
--- /dev/null
+++ b/tests/benchmarks/results/compare_pr80_judge_on_20260311.json
@@ -0,0 +1,14 @@
+{
+  "baseline_accuracy": 0.9013157894736842,
+  "test_accuracy": 0.881578947368421,
+  "delta": -0.019736842105263164,
+  "category_deltas": {
+    "Single-hop Recall": 0.09302325581395354,
+    "Temporal Understanding": 0.0,
+    "Multi-hop Reasoning": 0.0,
+    "Open Domain": -0.10526315789473684,
+    "Complex Reasoning": 0.028169014084507005
+  },
+  "baseline_file": "benchmarks/results/locomo-mini_baseline_20260310_234028.json",
+  "test_file": "benchmarks/results/locomo-mini_baseline_20260311_011001_judge.json"
+}
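The headline deltas and runtime multipliers reported in this diff can be re-derived from the raw counts in the experiment log and the runtimes in the PR #80 postmortem. The sketch below does that arithmetic as a sanity check. The helper names `pp_delta` and `slowdown` are hypothetical and not part of the benchmark tooling; the input numbers are copied from the tables above.

```python
def pp_delta(base_correct: int, base_total: int,
             test_correct: int, test_total: int) -> float:
    """Accuracy change in percentage points, rounded to two decimals."""
    base = base_correct / base_total
    test = test_correct / test_total
    return round(100 * (test - base), 2)


def slowdown(test_seconds: float, base_seconds: float) -> float:
    """Runtime multiplier relative to baseline, rounded to one decimal."""
    return round(test_seconds / base_seconds, 1)


if __name__ == "__main__":
    # Overall accuracy, judge-off: main 210/235 vs PR #80 port 201/235.
    print(pp_delta(210, 235, 201, 235))
    # Overall accuracy, judge-on: main 274/304 vs PR #80 port 268/304.
    print(pp_delta(274, 304, 268, 304))
    # Open Domain, judge-off: 110/114 vs 97/114.
    print(pp_delta(110, 114, 97, 114))
    # Runtime: 209s baseline vs 1546s (judge-off) and 2136s (judge-on).
    print(slowdown(1546, 209), slowdown(2136, 209))
```

The unrounded differences are the `delta` values stored in the `compare_pr80_judge_off_20260311.json` and `compare_pr80_judge_on_20260311.json` artifacts.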