verygoodplugins · jack-arturo · Mar 25, 2026 · Mar 13, 2026
diff --git a/benchmarks/EXPERIMENT_LOG.md b/benchmarks/EXPERIMENT_LOG.md
@@ -27,6 +27,16 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
 | 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
 | 2026-03-10 | eval-fix | docs/benchmark-agent-guidelines | **89.27% (208/233)** | -- | -- | Fix temporal matching (answer vs memory dates) + skip cat5 (no ground truth). Honest score, beats CORE by 1.03pp. |
 | 2026-03-10 | cat5-judge | feat/bench-cat5-judge | **89.80% (273/304)** | **87.56% (1739/1986)** | -- | Opt-in GPT-4o judge for cat5. Full run scored cat5 at 95.74% (427/446) with 0 judge skips/errors; added 90s OpenAI request timeout to prevent stuck full runs. |
+| 2026-03-10 | main-refresh (no judge) | main | **89.36% (210/235)** | -- | -- | Fresh current-main rerun before PR #80 experiment. Comparison anchor for judge-off. |
+| 2026-03-10 | main-refresh (judge) | main | **90.13% (274/304)** | -- | -- | Fresh current-main rerun with `BENCH_JUDGE_MODEL=gpt-4o`. Comparison anchor for judge-on. |
+| 2026-03-11 | PR #80 port (no judge) | exp/pr80-enhanced-recall-v2 | 85.53% (201/235) | -- | -- | BM25 + query expansion + rerank port. Regressed **-3.83pp** vs fresh main. Open-domain -11.4pp. Runtime 7.4x. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
+| 2026-03-11 | PR #80 port (judge) | exp/pr80-enhanced-recall-v2 | 88.16% (268/304) | -- | -- | Same branch with GPT-4o cat5 judge. Regressed **-1.97pp** vs fresh main. Runtime 10.2x. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
+| 2026-03-11 | PR #80 BM25-only f10 | exp/pr80-bm25-only-f10 | 88.09% (-1.28) | -- | -- | Best config variant, still regressed. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
+| 2026-03-11 | PR #80 BM25-only f20 | exp/pr80-bm25-only-f20 | 87.66% (-1.70) | -- | -- | More BM25 results = more dilution. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
+| 2026-03-11 | PR #80 BM25+rerank top5 | exp/pr80-bm25-rerank-top5 | 87.23% (-2.13) | -- | -- | Reranking didn't recover regression. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
+| 2026-03-11 | PR #80 BM25+rerank top10 | exp/pr80-bm25-rerank-top10 | 86.81% (-2.55) | -- | -- | Wider rerank window = worse. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
+| 2026-03-11 | #74 entity expansion | exp/74-entity-expansion-precision-v1 | 89.36% (+0.0) | -- | -- | Hub-node detection. Zero delta — benchmark doesn't exercise graph expansion. → [postmortem](postmortems/2026-03-11_issue74_entity_expansion_precision.md) |
+| 2026-03-12 | #79 (PR #125) | exp/79-priority-ids-fetch-v1 | 89.36% (+0.0) | -- | -- | Bug fix: priority_ids now fetches by ID. Merged. → [postmortem](postmortems/2026-03-12_issue79_priority_ids_fetch.md) |
 
 ### Category Breakdown (LoCoMo-mini)
 
@@ -38,6 +48,13 @@ Categories 1-4 are scored by word-overlap/date matching. Category 5 uses an opt-
 | 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
 | 2026-03-10 | eval-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | N/A (71 skipped) |
 | 2026-03-10 | cat5-judge | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | **91.5% (65/71)** |
+| 2026-03-10 | main-refresh (no judge) | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | **96.5% (110/114)** | 100.0% (2/2, 69 skipped) |
+| 2026-03-10 | main-refresh (judge) | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | **96.5% (110/114)** | **93.0% (66/71)** |
+| 2026-03-11 | PR #80 port (no judge) | **86.0% (37/43)** | **93.7% (59/63)** | 46.2% (6/13) | 85.1% (97/114) | 100.0% (2/2, 69 skipped) |
+| 2026-03-11 | PR #80 port (judge) | **88.4% (38/43)** | **92.1% (58/63)** | 46.2% (6/13) | 86.0% (98/114) | **95.8% (68/71)** |
+| 2026-03-11 | PR #80 BM25-only f10 | 81.4% (35/43) | 92.1% (58/63) | 46.2% (6/13) | 93.0% (106/114) | N/A |
+| 2026-03-11 | #74 entity expansion | 79.1% (34/43) | 92.1% (58/63) | 46.2% (6/13) | 96.5% (110/114) | N/A |
+| 2026-03-12 | #79 (PR #125) | 79.1% (34/43) | 92.1% (58/63) | 46.2% (6/13) | 96.5% (110/114) | N/A |
 
 \* Temporal was artificially low: evaluator compared question dates (empty) vs memory dates instead of answer dates.
 \*\* Complex was artificially 100%: dataset has no `answer` field for cat5 → empty string → `"" in content` always True.
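
Reviewer note on the second footnote: the inflated cat5 score follows from Python's substring semantics, where the empty string is contained in every string. The sketch below reproduces the pitfall and shows one possible guard. The function names and the sample memory text are hypothetical and are not taken from the evaluator's source.

```python
from typing import Optional


def naive_match(answer: str, content: str) -> bool:
    """Unguarded substring check; an empty answer always matches."""
    return answer in content


def guarded_match(answer: Optional[str], content: str) -> Optional[bool]:
    """Return None (meaning: skip the question) when no ground truth exists."""
    if answer is None or not answer.strip():
        return None
    return answer.strip().lower() in content.lower()


# Hypothetical memory text, used only for illustration.
memory = "Alex moved to Lisbon in May 2023."

print(naive_match("", memory))          # True: the bug described above
print(guarded_match("", memory))        # None: question is skipped
print(guarded_match("Lisbon", memory))  # True: real ground truth still matches
```

Returning `None` rather than `False` keeps skipped questions out of both the numerator and the denominator, which is consistent with the "71 skipped" entries in the category table.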

diff --git a/benchmarks/postmortems/.gitkeep b/benchmarks/postmortems/.gitkeep
diff --git a/benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md b/benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md
@@ -0,0 +1,44 @@
+# Issue #74: Graph expansion follows too many hops through hub nodes
+
+**Status**: NON-PROMOTED / REJECTED (this direction)
+**Date**: 2026-03-11
+**Issue**: #74 (remains open)
+**Branch**: `exp/74-entity-expansion-precision-v1` (commit `236b075`)
+
+## Hypothesis
+
+Hub-node pollution in graph expansion degrades recall precision. Querying "Alex Panagis" with `expand_entities=true` incorrectly pulls in unrelated memories through shared generic nodes (e.g., the "AutoMem" tool node connects all users). Hub-node detection and deprioritization should improve precision without harming recall.
+
+## Benchmark
+
+| Metric | Baseline | Test | Delta |
+|--------|----------|------|-------|
+| LoCoMo-mini | 89.36% | 89.36% | **0.0** |
+
+**Category deltas**: All zero across all five categories.
+
+## Why Zero Delta
+
+The LoCoMo benchmark doesn't exercise graph expansion (`expand_entities`) in its query path. All LoCoMo questions are answered via vector + keyword search. The hub-node problem is real in production (where users query with `expand_entities=true`), but this benchmark can't surface the improvement.
+
+## Commands Run
+
+```bash
+make bench-eval BENCH=locomo-mini CONFIG=baseline  # on exp/74 branch
+make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline  # vs main
+```
+
+## Outcome
+
+Hub-node detection alone doesn't move the needle on LoCoMo. The experiment was correctly executed but the benchmark isn't the right instrument to measure this improvement.
+
+## Promoted Artifact
+
+`tests/benchmarks/results/compare_issue74_entity_precision_20260311.json`
+
+## Follow-up
+
+- Issue #74 remains open — the problem is real, just not benchmark-measurable yet
+- Consider Personalized PageRank (#100) or configurable `max_hops` as alternative approaches
+- Need a graph-expansion-specific test suite (queries with `expand_entities=true`) to measure future attempts
+- Hub-node detection should not be the default direction for recall improvement work
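
Reviewer note on the hypothesis in this postmortem: hub-node deprioritization can be illustrated with a degree-based weight. This is a minimal sketch under stated assumptions, not the code on `exp/74-entity-expansion-precision-v1`. The edge list, the `hub_threshold` parameter, and the `1 / degree` penalty are all hypothetical choices made for illustration.

```python
from collections import Counter


def node_degrees(edges):
    """Count how many edges touch each node in an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def expansion_weight(node, degree, hub_threshold=3):
    """Keep full weight for ordinary nodes; damp nodes at or above the hub threshold."""
    d = degree[node]
    return 1.0 if d < hub_threshold else 1.0 / d


# Hypothetical graph: every user is linked to the shared "AutoMem" tool node.
edges = [
    ("alex", "AutoMem"),
    ("sam", "AutoMem"),
    ("kim", "AutoMem"),
    ("alex", "lisbon"),
]

degree = node_degrees(edges)
weights = {node: expansion_weight(node, degree) for node in degree}
print(weights)
```

In this toy graph only "AutoMem" reaches the threshold, so expansion through it is damped while the "alex" to "lisbon" edge keeps full weight. A graph-expansion test suite like the one proposed in the follow-up would be needed to tell whether such a penalty helps on real queries.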
diff --git a/benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md b/benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
@@ -0,0 +1,93 @@
+# PR #80: Enhanced Recall — BM25, LLM Reranking, Query Expansion
+
+**Status**: REJECTED — regression across all tested configurations
+**Date**: 2026-03-11
+**PR**: #80 (by jescalan)
+**Branch**: `exp/pr80-enhanced-recall-v2` (commit `a122ba2`)
+**Author**: jescalan
+
+## Hypothesis
+
+Adding BM25 full-text search, LLM reranking, and query expansion to the recall pipeline will improve accuracy by catching keyword matches that vector search misses and filtering false positives.
+
+**Pipeline**: query expansion → vector + graph + BM25 search → RRF fusion → metadata scoring → LLM reranking
+
+## Benchmark Results
+
+### Full Port (all features enabled)
+
+| Metric | Baseline (main) | PR #80 | Delta |
+|--------|-----------------|--------|-------|
+| LoCoMo-mini (judge-off) | 89.36% | 85.53% | **-3.83pp** |
+| LoCoMo-mini (judge-on) | 90.13% | 88.16% | **-1.97pp** |
+
+### Category Breakdown (judge-off)
+
+| Category | Baseline | PR #80 | Delta |
+|----------|----------|--------|-------|
+| Single-hop | 79.1% | 86.0% | **+7.0pp** |
+| Temporal | 92.1% | 93.7% | **+1.6pp** |
+| Multi-hop | 46.2% | 46.2% | **0.0** |
+| Open Domain | 96.5% | 85.1% | **-11.4pp** |
+| Complex | N/A | N/A | -- |
+
+### Config Sweep (Round 1, judge-off)
+
+| Config | Accuracy | Delta vs Main | Notes |
+|--------|----------|---------------|-------|
+| BM25-only f10 | 88.09% | **-1.28pp** | Best variant, still regressed |
+| BM25-only f20 | 87.66% | **-1.70pp** | More BM25 results = more dilution |
+| BM25 + rerank top-5 | 87.23% | **-2.13pp** | Reranking didn't help |
+| BM25 + rerank top-10 | 86.81% | **-2.55pp** | Wider rerank window = worse |
+
+### Runtime Impact
+
+| Config | Runtime | vs Baseline |
+|--------|---------|-------------|
+| Main (baseline) | 209s | -- |
+| Full port (judge-off) | 1546s | **7.4x slower** |
+| Full port (judge-on) | 2136s | **10.2x slower** |
+
+## Root Cause Analysis
+
+**Open-domain regression is the primary problem.** Open-domain questions (e.g., "What does Alex think about remote work?") rely on semantic similarity: the answer is conceptually related but doesn't share keywords with the question. BM25 keyword matches dilute the vector results via Reciprocal Rank Fusion (RRF), pushing semantically relevant results down.
+
+The single-hop improvement (+7.0pp, 3 more correct of 43) shows BM25 *does* help for factual lookups where exact keywords matter. But the open-domain loss (-11.4pp, 13 fewer correct of 114) is 1.6x larger in percentage points and falls on a larger category (114 vs 43 questions).
+
+**Runtime cost** is driven by LLM reranking (one API call per candidate per question) and query expansion (one API call per question). Even BM25-only configs add SQLite FTS5 overhead.
+
+## Commands Run
+
+```bash
+# Full port evaluation
+make bench-eval BENCH=locomo-mini CONFIG=baseline  # on exp/pr80-enhanced-recall-v2
+make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline
+
+# Config sweep (4 variants)
+make bench-eval BENCH=locomo-mini CONFIG=bm25_only_f10
+make bench-eval BENCH=locomo-mini CONFIG=bm25_only_f20
+make bench-eval BENCH=locomo-mini CONFIG=bm25_rerank_top5
+make bench-eval BENCH=locomo-mini CONFIG=bm25_rerank_top10
+```
+
+## Outcome
+
+**REJECTED.** No configuration recovered the open-domain regression. The best variant (BM25-only f10) still regressed -1.28pp overall.
+
+## Decision
+
+PR #80's always-on BM25 fusion approach hurts more than it helps on current benchmarks. The single-hop gains don't compensate for open-domain losses. The 7-10x runtime increase is also prohibitive for production use.
+
+## Promoted Artifacts
+
+- `tests/benchmarks/results/compare_pr80_judge_off_20260311.json`
+- `tests/benchmarks/results/compare_pr80_judge_on_20260311.json`
+- `tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`
+
+## Follow-up Recommendations
+
+1. **Targeted BM25 fallback**: Only invoke BM25 when vector search returns low-confidence results (e.g., all scores < 0.5), rather than always-on fusion
+2. **Category-aware fusion**: Weight BM25 differently for factual vs open-domain queries (requires query classification)
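+3. **RRF tuning**: In standard RRF the constant (k=60 default) is shared by every retriever, so changing k alone does not down-weight BM25; a lower k only sharpens the preference for top-ranked hits in each list. To reduce BM25 influence, try a per-retriever weight below 1.0 on the BM25 term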
+4. **Reranking without BM25**: Test LLM reranking on vanilla vector results (without BM25 dilution) as a separate experiment
+5. **Contributor summary**: If closing PR #80, provide benchmark evidence (this postmortem) rather than preference-based feedback
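
Reviewer note on follow-up items 1 and 3: the dilution effect and two possible mitigations can be sketched with the textbook RRF formula, score(d) = sum over retrievers of w / (k + rank). This is not the fusion code from PR #80. The `weights` argument, the `needs_bm25` helper, and the document ids are hypothetical; the 0.5 threshold is the example value from item 1.

```python
def rrf_fuse(rankings, weights=None, k=60):
    """Fuse ranked lists with (optionally weighted) Reciprocal Rank Fusion.

    rankings: dict mapping retriever name -> list of doc ids, best first.
    weights:  optional dict mapping retriever name -> multiplier (default 1.0).
    """
    weights = weights or {}
    scores = {}
    for name, ranked in rankings.items():
        w = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def needs_bm25(vector_scores, threshold=0.5):
    """Targeted fallback: only consult BM25 when every vector score is low."""
    return all(score < threshold for score in vector_scores)


# Hypothetical ids: "sem*" are semantic matches, "kw*" are keyword matches.
rankings = {
    "vector": ["sem1", "sem2", "kw1"],
    "bm25": ["kw1", "kw2"],
}

equal = rrf_fuse(rankings)
damped = rrf_fuse(rankings, weights={"bm25": 0.3})
print(equal)
print(damped)
print(needs_bm25([0.42, 0.31]), needs_bm25([0.80, 0.31]))
```

With equal weights the keyword-only hit `kw2` ties with `sem2` and can displace it from a short result window; with the BM25 weight at 0.3, `sem2` ranks ahead again. The weight value is arbitrary here and would need its own sweep.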
diff --git a/benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md b/benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
@@ -0,0 +1,39 @@
+# Issue #79: priority_ids parameter only boosts relevance
+
+**Status**: ACCEPTED / MERGED
+**Date**: 2026-03-12
+**PR**: #125 (commit `5d3708c`)
+**Branch**: `exp/79-priority-ids-fetch-v1`
+
+## Hypothesis
+
+The `priority_ids` parameter should fetch memories directly by ID and guarantee their inclusion in results, not merely boost their score if they happen to appear in normal search results.
+
+## Benchmark
+
+| Metric | Baseline | Test | Delta |
+|--------|----------|------|-------|
+| LoCoMo-mini | 89.36% | 89.36% | **0.0** |
+
+**Category deltas**: All zero across Single-hop, Temporal, Multi-hop, Open Domain, Complex.
+
+Delta of zero is expected — this is a fetch-behavior bug fix, not a recall scoring change. The benchmark exercises natural-language queries, not ID-based lookups.
+
+## Commands Run
+
+```bash
+make bench-eval BENCH=locomo-mini CONFIG=baseline  # on exp/79 branch
+make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline  # vs main
+```
+
+## Outcome
+
+Bug fix merged without benchmark regression. MCP clients can now reliably fetch specific memories by ID via `priority_ids`.
+
+## Promoted Artifact
+
+`tests/benchmarks/results/compare_issue79_priority_ids_20260311.json`
+
+## Follow-up
+
+None needed. Related MCP-side fix: `verygoodplugins/mcp-automem#67`.
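
Reviewer note on the hypothesis in this postmortem: the difference between "boost if present" and "fetch by ID and guarantee inclusion" can be sketched as follows. This is an illustration under stated assumptions, not the code merged in PR #125. The `recall` function, the dict-based `store`, and the memory ids are hypothetical.

```python
def recall(query_hits, priority_ids, store, limit=5):
    """Return priority memories first (fetched by ID), then ranked search hits.

    query_hits:   list of memory dicts from normal search, best first.
    priority_ids: ids that must appear in the result if they exist in the store.
    store:        dict mapping memory id -> memory dict (stand-in for the database).
    """
    results = []
    seen = set()

    # Fetch priority memories directly, whether or not search returned them.
    for mem_id in priority_ids:
        if mem_id in store and mem_id not in seen:
            results.append(store[mem_id])
            seen.add(mem_id)
    n_priority = len(results)

    # Fill the remaining slots with search hits, skipping duplicates.
    for hit in query_hits:
        if hit["id"] not in seen:
            results.append(hit)
            seen.add(hit["id"])

    # Never truncate away a guaranteed priority memory.
    return results[: max(limit, n_priority)]


store = {
    "m1": {"id": "m1", "text": "Alex prefers remote work."},
    "m2": {"id": "m2", "text": "Alex moved to Lisbon."},
    "m9": {"id": "m9", "text": "Pinned onboarding note."},
}
hits = [store["m1"], store["m2"]]  # search did not surface m9

out = recall(hits, priority_ids=["m9"], store=store)
print([m["id"] for m in out])
```

A boost-only implementation would return just `m1` and `m2` here, because `m9` never appears in the search hits and so has no score to boost.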
diff --git a/tests/benchmarks/results/compare_issue74_entity_precision_20260311.json b/tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
@@ -0,0 +1,14 @@
+{
+  "baseline_accuracy": 0.8936170212765957,
+  "test_accuracy": 0.8936170212765957,
+  "delta": 0.0,
+  "category_deltas": {
+    "Single-hop Recall": 0.0,
+    "Temporal Understanding": 0.0,
+    "Multi-hop Reasoning": 0.0,
+    "Open Domain": 0.0,
+    "Complex Reasoning": 0.0
+  },
+  "baseline_file": "benchmarks/results/locomo-mini_baseline_20260311_041134.json",
+  "test_file": "benchmarks/results/locomo-mini_baseline_20260311_041656.json"
+}
diff --git a/tests/benchmarks/results/compare_issue79_priority_ids_20260311.json b/tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
@@ -0,0 +1,14 @@
+{
+  "baseline_accuracy": 0.8936170212765957,
+  "test_accuracy": 0.8936170212765957,
+  "delta": 0.0,
+  "category_deltas": {
+    "Single-hop Recall": 0.0,
+    "Temporal Understanding": 0.0,
+    "Multi-hop Reasoning": 0.0,
+    "Open Domain": 0.0,
+    "Complex Reasoning": 0.0
+  },
+  "baseline_file": "benchmarks/results/locomo-mini_baseline_20260311_041134.json",
+  "test_file": "benchmarks/results/locomo-mini_baseline_20260311_055154.json"
+}
diff --git a/tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json b/tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
@@ -0,0 +1,14 @@
+{
+  "baseline_accuracy": 0.8936170212765957,
+  "test_accuracy": 0.8808510638297873,
+  "delta": -0.012765957446808418,
+  "category_deltas": {
+    "Single-hop Recall": 0.023255813953488413,
+    "Temporal Understanding": 0.0,
+    "Multi-hop Reasoning": 0.0,
+    "Open Domain": -0.03508771929824561,
+    "Complex Reasoning": 0.0
+  },
+  "baseline_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_baseline_20260310_233631.json",
+  "test_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json"
+}
diff --git a/tests/benchmarks/results/compare_pr80_judge_off_20260311.json b/tests/benchmarks/results/compare_pr80_judge_off_20260311.json
@@ -0,0 +1,14 @@
+{
+  "baseline_accuracy": 0.8936170212765957,
+  "test_accuracy": 0.8553191489361702,
+  "delta": -0.038297872340425476,
+  "category_deltas": {
+    "Single-hop Recall": 0.06976744186046513,
+    "Temporal Understanding": 0.015873015873015928,
+    "Multi-hop Reasoning": 0.0,
+    "Open Domain": -0.11403508771929827,
+    "Complex Reasoning": 0.0
+  },
+  "baseline_file": "benchmarks/results/locomo-mini_baseline_20260310_233631.json",
+  "test_file": "benchmarks/results/locomo-mini_baseline_20260311_004300.json"
+}
diff --git a/tests/benchmarks/results/compare_pr80_judge_on_20260311.json b/tests/benchmarks/results/compare_pr80_judge_on_20260311.json
@@ -0,0 +1,14 @@
+{
+  "baseline_accuracy": 0.9013157894736842,
+  "test_accuracy": 0.881578947368421,
+  "delta": -0.019736842105263164,
+  "category_deltas": {
+    "Single-hop Recall": 0.09302325581395354,
+    "Temporal Understanding": 0.0,
+    "Multi-hop Reasoning": 0.0,
+    "Open Domain": -0.10526315789473684,
+    "Complex Reasoning": 0.028169014084507005
+  },
+  "baseline_file": "benchmarks/results/locomo-mini_baseline_20260310_234028.json",
+  "test_file": "benchmarks/results/locomo-mini_baseline_20260311_011001_judge.json"
+}
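
Reviewer note on the comparison artifacts: the headline deltas and runtime multipliers can be re-derived from the counts and timings reported in the experiment log and the PR #80 postmortem. The sketch below is a sanity check only; it is not the `bench-compare` implementation, and the helper name is hypothetical.

```python
import math


def accuracy(correct: int, total: int) -> float:
    """Fraction of questions answered correctly."""
    return correct / total


# Judge-off counts from the experiment log: main 210/235, PR #80 port 201/235.
baseline = accuracy(210, 235)
test = accuracy(201, 235)
delta = test - baseline
print(f"{baseline:.2%} -> {test:.2%} ({delta * 100:+.2f}pp)")

# Runtimes in seconds from the postmortem's runtime table.
main_s, judge_off_s, judge_on_s = 209, 1546, 2136
print(f"{judge_off_s / main_s:.1f}x, {judge_on_s / main_s:.1f}x")

# Compare against the values stored in compare_pr80_judge_off_20260311.json.
assert math.isclose(baseline, 0.8936170212765957)
assert math.isclose(test, 0.8553191489361702)
assert math.isclose(delta, -0.038297872340425476)
```

The recomputed values agree with the stored JSON fields and with the -3.83pp, 7.4x, and 10.2x figures quoted in the tables.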