17 changes: 17 additions & 0 deletions benchmarks/EXPERIMENT_LOG.md
@@ -27,6 +27,16 @@ on the snapshot-based bench infrastructure (PR #97, merged 2026-03-02).
| 2026-03-10 | pre-refactor | main (@ 795368a) | 76.97% (+0.0) | -- | -- | Baseline re-confirmed after #73, #78, #115, #116 merged. Stable. Pre-relation-tier-refactor checkpoint. |
| 2026-03-10 | eval-fix | docs/benchmark-agent-guidelines | **89.27% (208/233)** | -- | -- | Fix temporal matching (answer vs memory dates) + skip cat5 (no ground truth). Honest score, beats CORE by 1.03pp. |
| 2026-03-10 | cat5-judge | feat/bench-cat5-judge | **89.80% (273/304)** | **87.56% (1739/1986)** | -- | Opt-in GPT-4o judge for cat5. Full run scored cat5 at 95.74% (427/446) with 0 judge skips/errors; added 90s OpenAI request timeout to prevent stuck full runs. |
| 2026-03-10 | main-refresh (no judge) | main | **89.36% (210/235)** | -- | -- | Fresh current-main rerun before PR #80 experiment. Comparison anchor for judge-off. |
| 2026-03-10 | main-refresh (judge) | main | **90.13% (274/304)** | -- | -- | Fresh current-main rerun with `BENCH_JUDGE_MODEL=gpt-4o`. Comparison anchor for judge-on. |
| 2026-03-11 | PR #80 port (no judge) | exp/pr80-enhanced-recall-v2 | 85.53% (201/235) | -- | -- | BM25 + query expansion + rerank port. Regressed **-3.83pp** vs fresh main. Open-domain -11.4pp. Runtime 7.4x. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
| 2026-03-11 | PR #80 port (judge) | exp/pr80-enhanced-recall-v2 | 88.16% (268/304) | -- | -- | Same branch with GPT-4o cat5 judge. Regressed **-1.97pp** vs fresh main. Runtime 10.2x. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
| 2026-03-11 | PR #80 BM25-only f10 | exp/pr80-bm25-only-f10 | 88.09% (-1.28) | -- | -- | Best config variant, still regressed. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
| 2026-03-11 | PR #80 BM25-only f20 | exp/pr80-bm25-only-f20 | 87.66% (-1.70) | -- | -- | More BM25 results = more dilution. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
| 2026-03-11 | PR #80 BM25+rerank top5 | exp/pr80-bm25-rerank-top5 | 87.23% (-2.13) | -- | -- | Reranking didn't recover regression. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
| 2026-03-11 | PR #80 BM25+rerank top10 | exp/pr80-bm25-rerank-top10 | 86.81% (-2.55) | -- | -- | Wider rerank window = worse. → [postmortem](postmortems/2026-03-11_pr80_enhanced_recall.md) |
| 2026-03-11 | #74 entity expansion | exp/74-entity-expansion-precision-v1 | 89.36% (+0.0) | -- | -- | Hub-node detection. Zero delta — benchmark doesn't exercise graph expansion. → [postmortem](postmortems/2026-03-11_issue74_entity_expansion_precision.md) |
| 2026-03-12 | #79 (PR #125) | exp/79-priority-ids-fetch-v1 | 89.36% (+0.0) | -- | -- | Bug fix: priority_ids now fetches by ID. Merged. → [postmortem](postmortems/2026-03-12_issue79_priority_ids_fetch.md) |

### Category Breakdown (LoCoMo-mini)

@@ -38,6 +48,13 @@ Categories 1-4 are scored by word-overlap/date matching. Category 5 uses an opt-
| 2026-03-10 | pre-refactor | 76.7% (33/43) | 22.2%\* (14/63) | 46.2% (6/13) | 96.5% (110/114) | 100%\*\* (71/71) |
| 2026-03-10 | eval-fix | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | N/A (71 skipped) |
| 2026-03-10 | cat5-judge | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | 96.5% (110/114) | **91.5% (65/71)** |
| 2026-03-10 | main-refresh (no judge) | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | **96.5% (110/114)** | 100.0% (2/2, 69 skipped) |
| 2026-03-10 | main-refresh (judge) | **79.1% (34/43)** | **92.1% (58/63)** | 46.2% (6/13) | **96.5% (110/114)** | **93.0% (66/71)** |
| 2026-03-11 | PR #80 port (no judge) | **86.0% (37/43)** | **93.7% (59/63)** | 46.2% (6/13) | 85.1% (97/114) | 100.0% (2/2, 69 skipped) |
| 2026-03-11 | PR #80 port (judge) | **88.4% (38/43)** | **92.1% (58/63)** | 46.2% (6/13) | 86.0% (98/114) | **95.8% (68/71)** |
| 2026-03-11 | PR #80 BM25-only f10 | 81.4% (35/43) | 92.1% (58/63) | 46.2% (6/13) | 93.0% (106/114) | N/A |
| 2026-03-11 | #74 entity expansion | 79.1% (34/43) | 92.1% (58/63) | 46.2% (6/13) | 96.5% (110/114) | N/A |
| 2026-03-12 | #79 (PR #125) | 79.1% (34/43) | 92.1% (58/63) | 46.2% (6/13) | 96.5% (110/114) | N/A |

\* Temporal was artificially low: evaluator compared question dates (empty) vs memory dates instead of answer dates.
\*\* Complex was artificially 100%: dataset has no `answer` field for cat5 → empty string → `"" in content` always True.
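
The failure mode in the second footnote is easy to reproduce in plain Python: the empty string is a substring of every string, so a containment check against a missing answer always passes. The sketch below is illustrative only; `is_correct_naive` and `is_correct_guarded` are hypothetical names, not the benchmark's actual evaluator functions, and the sample memory text is made up.

```python
from typing import Optional


def is_correct_naive(answer: Optional[str], content: str) -> bool:
    """Buggy check: a missing answer collapses to "" and always matches."""
    return (answer or "") in content


def is_correct_guarded(answer: Optional[str], content: str) -> Optional[bool]:
    """Return None (meaning: skip) when there is no ground truth to compare."""
    if not answer or not answer.strip():
        return None
    return answer.strip().lower() in content.lower()


memory = "Alex adopted a dog in May."  # made-up memory content
print(is_correct_naive(None, memory))       # True, although nothing was checked
print(is_correct_guarded(None, memory))     # None, so the question is skipped
print(is_correct_guarded("a dog", memory))  # True
```

Returning `None` rather than `False` keeps skipped questions out of both the numerator and the denominator, which matches how the eval-fix row reports cat5 as "N/A (71 skipped)".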
Empty file added benchmarks/postmortems/.gitkeep
@@ -0,0 +1,44 @@
# Issue #74: Graph expansion follows too many hops through hub nodes

**Status**: NON-PROMOTED / REJECTED (this direction)
**Date**: 2026-03-11
**Issue**: #74 (remains open)
**Branch**: `exp/74-entity-expansion-precision-v1` (commit `236b075`)

## Hypothesis

Hub-node pollution in graph expansion degrades recall precision. Querying "Alex Panagis" with `expand_entities=true` incorrectly pulls in unrelated memories through shared generic nodes (e.g., the "AutoMem" tool node connects all users). Hub-node detection and deprioritization should improve precision without harming recall.
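
As a rough illustration of the idea, the sketch below flags hubs by degree and refuses to expand through them. It assumes a simple undirected memory-to-entity graph, and everything in it is hypothetical: `find_hubs`, `expand`, the `max_degree` threshold, and the memory ids are not taken from the branch. It also skips hubs outright, which is a simplification of the deprioritization described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def find_hubs(edges: Iterable[Tuple[str, str]], max_degree: int) -> Set[str]:
    """Flag nodes whose degree exceeds max_degree (hypothetical threshold)."""
    degree: Dict[str, int] = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {node for node, d in degree.items() if d > max_degree}


def expand(seed: str, edges: List[Tuple[str, str]], hubs: Set[str]) -> Set[str]:
    """Reach other memories through the seed's entities, never through hubs."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    reached: Set[str] = set()
    for entity in neighbors[seed]:
        if entity in hubs:
            continue  # do not follow generic nodes shared by everyone
        reached |= neighbors[entity] - {seed}
    return reached


# m1 and m2 mention Alex; m1, m3, m4, m5 all mention the generic tool node.
edges = [
    ("m1", "Alex Panagis"), ("m2", "Alex Panagis"),
    ("m1", "AutoMem"), ("m3", "AutoMem"), ("m4", "AutoMem"), ("m5", "AutoMem"),
]
hubs = find_hubs(edges, max_degree=3)
print(sorted(expand("m1", edges, hubs=set())))  # ['m2', 'm3', 'm4', 'm5']
print(sorted(expand("m1", edges, hubs=hubs)))   # ['m2']
```

A fixed degree cutoff is the crudest possible detector; a real implementation would more likely scale the threshold with graph size or down-weight hub edges instead of dropping them.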

## Benchmark

| Metric | Baseline | Test | Delta |
|--------|----------|------|-------|
| LoCoMo-mini | 89.36% | 89.36% | **0.0** |

**Category deltas**: All zero across all five categories.

## Why Zero Delta

The LoCoMo benchmark doesn't exercise graph expansion (`expand_entities`) in its query path. All LoCoMo questions are answered via vector + keyword search. The hub-node problem is real in production (where users query with `expand_entities=true`), but this benchmark can't surface the improvement.

## Commands Run

```bash
make bench-eval BENCH=locomo-mini CONFIG=baseline # on exp/74 branch
make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline # vs main
```

## Outcome

Hub-node detection alone produced no measurable change on LoCoMo-mini. The experiment was executed correctly, but the benchmark is the wrong instrument for measuring this improvement.

## Promoted Artifact

`tests/benchmarks/results/compare_issue74_entity_precision_20260311.json`

## Follow-up

- Issue #74 remains open — the problem is real, just not benchmark-measurable yet
- Consider Personalized PageRank (#100) or configurable `max_hops` as alternative approaches
- Need a graph-expansion-specific test suite (queries with `expand_entities=true`) to measure future attempts
- Hub-node detection should not be the default direction for recall improvement work
93 changes: 93 additions & 0 deletions benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
@@ -0,0 +1,93 @@
# PR #80: Enhanced Recall — BM25, LLM Reranking, Query Expansion

**Status**: REJECTED — regression across all tested configurations
**Date**: 2026-03-11
**PR**: #80
**Branch**: `exp/pr80-enhanced-recall-v2` (commit `a122ba2`)
**Author**: jescalan

## Hypothesis

Adding BM25 full-text search, LLM reranking, and query expansion to the recall pipeline will improve accuracy by catching keyword matches that vector search misses and filtering false positives.

**Pipeline**: query expansion → vector + graph + BM25 search → RRF fusion → metadata scoring → LLM reranking

## Benchmark Results

### Full Port (all features enabled)

| Metric | Baseline (main) | PR #80 | Delta |
|--------|-----------------|--------|-------|
| LoCoMo-mini (judge-off) | 89.36% | 85.53% | **-3.83pp** |
| LoCoMo-mini (judge-on) | 90.13% | 88.16% | **-1.97pp** |

### Category Breakdown (judge-off)

| Category | Baseline | PR #80 | Delta |
|----------|----------|--------|-------|
| Single-hop | 79.1% | 86.0% | **+7.0pp** |
| Temporal | 92.1% | 93.7% | **+1.6pp** |
| Multi-hop | 46.2% | 46.2% | **0.0** |
| Open Domain | 96.5% | 85.1% | **-11.4pp** |
| Complex | N/A | N/A | -- |
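
The per-category deltas do not average into the overall delta; each is weighted by its question count. The short script below works this through, using the judge-off counts from the `main-refresh (no judge)` and `PR #80 port (no judge)` rows of the Category Breakdown table in `EXPERIMENT_LOG.md`. The variable names are illustrative.

```python
# (baseline_correct, test_correct, scored_questions) per category, judge-off.
counts = {
    "Single-hop": (34, 37, 43),
    "Temporal": (58, 59, 63),
    "Multi-hop": (6, 6, 13),
    "Open Domain": (110, 97, 114),
    "Complex": (2, 2, 2),  # 69 of 71 questions are skipped without the judge
}

total = sum(n for _, _, n in counts.values())
net = sum(test - base for base, test, _ in counts.values())

for name, (base, test, n) in counts.items():
    diff = test - base
    print(f"{name:12s} {diff:+3d} questions ({100 * diff / n:+.1f}pp)")
print(f"{'Overall':12s} {net:+3d} questions ({100 * net / total:+.2f}pp of {total})")
```

The net change is 9 fewer correct answers out of 235, which reproduces the -3.83pp overall delta: 13 lost open-domain answers against 4 gained elsewhere.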

### Config Sweep (Round 1, judge-off)

| Config | Accuracy | Delta vs Main | Notes |
|--------|----------|---------------|-------|
| BM25-only f10 | 88.09% | **-1.28pp** | Best variant, still regressed |
| BM25-only f20 | 87.66% | **-1.70pp** | More BM25 results = more dilution |
| BM25 + rerank top-5 | 87.23% | **-2.13pp** | Reranking didn't help |
| BM25 + rerank top-10 | 86.81% | **-2.55pp** | Wider rerank window = worse |

### Runtime Impact

| Config | Runtime | vs Baseline |
|--------|---------|-------------|
| Main (baseline) | 209s | -- |
| Full port (judge-off) | 1546s | **7.4x slower** |
| Full port (judge-on) | 2136s | **10.2x slower** |

## Root Cause Analysis

**Open-domain regression is the primary problem.** Open-domain questions (e.g., "What does Alex think about remote work?") rely on semantic similarity — the answer is conceptually related but doesn't share keywords with the question. BM25 keyword matches dilute the vector results via Reciprocal Rank Fusion (RRF), pushing semantically-relevant results down.
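
The dilution effect can be seen in a toy example. The sketch below uses the standard RRF formulation (score is the sum of `1 / (k + rank)` over the input lists); PR #80's actual fusion code is not shown in this postmortem, so treat the function and the made-up memory ids as assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# The best semantic match shares no keywords with the question,
# so it is ranked first by vector search and is absent from the BM25 list.
vector = ["opinion_remote_work", "team_offsite", "office_lease"]
bm25 = ["team_offsite", "office_lease", "work_laptop_receipt"]

fused = rrf([vector, bm25])
print(fused)
# ['team_offsite', 'office_lease', 'opinion_remote_work', 'work_laptop_receipt']
```

Any document that appears in both lists collects two reciprocal-rank terms and outranks a document that tops only one list, so the vector search's first choice falls to third place here.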

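The single-hop improvement (+7.0pp, 34 → 37 of 43) shows BM25 *does* help for factual lookups where exact keywords matter. But the open-domain loss (-11.4pp, 110 → 97 of 114) is 1.6x larger in percentage points and falls on a category with far more questions (114 vs 43), so the 13 lost open-domain answers outweigh the 4 gained in single-hop and temporal combined.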

**Runtime cost** is driven by LLM reranking (one API call per candidate per question) and query expansion (one API call per question). Even BM25-only configs add SQLite FTS5 overhead.

## Commands Run

```bash
# Full port evaluation
make bench-eval BENCH=locomo-mini CONFIG=baseline # on exp/pr80-enhanced-recall-v2
make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline

# Config sweep (4 variants)
make bench-eval BENCH=locomo-mini CONFIG=bm25_only_f10
make bench-eval BENCH=locomo-mini CONFIG=bm25_only_f20
make bench-eval BENCH=locomo-mini CONFIG=bm25_rerank_top5
make bench-eval BENCH=locomo-mini CONFIG=bm25_rerank_top10
```

## Outcome

**REJECTED.** No configuration recovered the open-domain regression. The best variant (BM25-only f10) still regressed -1.28pp overall.

## Decision

PR #80's always-on BM25 fusion approach hurts more than it helps on current benchmarks. The single-hop gains don't compensate for open-domain losses. The 7-10x runtime increase is also prohibitive for production use.

## Promoted Artifacts

- `tests/benchmarks/results/compare_pr80_judge_off_20260311.json`
- `tests/benchmarks/results/compare_pr80_judge_on_20260311.json`
- `tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`

## Follow-up Recommendations

1. **Targeted BM25 fallback**: Only invoke BM25 when vector search returns low-confidence results (e.g., all scores < 0.5), rather than always-on fusion
2. **Category-aware fusion**: Weight BM25 differently for factual vs open-domain queries (requires query classification)
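3. **RRF tuning**: The RRF constant (k=60 default) applies to every ranker alike, so lowering k sharpens the preference for top-ranked items from both the vector and BM25 lists rather than reducing BM25 influence specifically; a per-ranker weight below 1 on the BM25 list (weighted RRF) is the more direct lever, with k tuned alongside it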
4. **Reranking without BM25**: Test LLM reranking on vanilla vector results (without BM25 dilution) as a separate experiment
5. **Contributor summary**: If closing PR #80, provide benchmark evidence (this postmortem) rather than preference-based feedback
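
Recommendation 1 could look roughly like the sketch below. It is only an outline under stated assumptions: `recall`, `vector_search`, `bm25_search`, and `min_confidence` are hypothetical names, scores are assumed to be similarities in [0, 1], and the 0.5 cutoff is the example value from the list above, not a tuned threshold.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (memory_id, similarity score in [0, 1])


def recall(
    query: str,
    vector_search: Callable[[str], List[Hit]],
    bm25_search: Callable[[str], List[Hit]],
    min_confidence: float = 0.5,
) -> List[Hit]:
    """Use vector results as-is unless every score is below min_confidence."""
    hits = vector_search(query)
    if hits and max(score for _, score in hits) >= min_confidence:
        return hits  # confident: BM25 is never called, so nothing is diluted
    # Low confidence (or no hits): append keyword hits not already present.
    seen = {memory_id for memory_id, _ in hits}
    extra = [hit for hit in bm25_search(query) if hit[0] not in seen]
    return hits + extra


# Stub searchers standing in for the real backends.
confident = recall("q", lambda q: [("a", 0.9)], lambda q: [("b", 1.0)])
fallback = recall("q", lambda q: [("a", 0.3)], lambda q: [("a", 1.0), ("b", 1.0)])
print(confident)  # [('a', 0.9)]
print(fallback)   # [('a', 0.3), ('b', 1.0)]
```

Because BM25 is skipped on the confident path, this shape would also avoid the keyword-search overhead for most queries, though that would need to be measured rather than assumed.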
39 changes: 39 additions & 0 deletions benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
@@ -0,0 +1,39 @@
# Issue #79: priority_ids parameter only boosts relevance

**Status**: ACCEPTED / MERGED
**Date**: 2026-03-12
**PR**: #125 (commit `5d3708c`)
**Branch**: `exp/79-priority-ids-fetch-v1`

## Hypothesis

The `priority_ids` parameter should fetch memories directly by ID and guarantee their inclusion in results, not merely boost their score if they happen to appear in normal search results.
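
The difference between boosting and fetching can be sketched as follows. This is not the code from PR #125; `recall_with_priority`, `fetch_by_id`, and the dict-based memory shape are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

Memory = Dict[str, object]


def recall_with_priority(
    search_hits: List[Memory],
    priority_ids: List[str],
    fetch_by_id: Callable[[str], Optional[Memory]],
    limit: int,
) -> List[Memory]:
    """Place priority memories first, then fill remaining slots from search."""
    results: List[Memory] = []
    seen = set()
    for memory_id in priority_ids:
        memory = fetch_by_id(memory_id)  # direct lookup, independent of search
        if memory is not None and memory_id not in seen:
            results.append(memory)
            seen.add(memory_id)
    for hit in search_hits:
        if len(results) >= limit:
            break
        if hit["id"] not in seen:
            results.append(hit)
            seen.add(hit["id"])
    return results


store = {"m1": {"id": "m1"}, "m9": {"id": "m9"}}  # made-up memory store
hits = [{"id": "m1"}, {"id": "m2"}, {"id": "m3"}]
out = recall_with_priority(hits, ["m9", "missing"], store.get, limit=3)
print([m["id"] for m in out])  # ['m9', 'm1', 'm2']
```

Here `m9` is returned even though the search never surfaced it, while the unknown id is ignored. In this sketch priority memories are not truncated by `limit`; whether a real implementation should cap them is a separate design decision.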

## Benchmark

| Metric | Baseline | Test | Delta |
|--------|----------|------|-------|
| LoCoMo-mini | 89.36% | 89.36% | **0.0** |

**Category deltas**: All zero across Single-hop, Temporal, Multi-hop, Open Domain, Complex.

Delta of zero is expected — this is a fetch-behavior bug fix, not a recall scoring change. The benchmark exercises natural-language queries, not ID-based lookups.

## Commands Run

```bash
make bench-eval BENCH=locomo-mini CONFIG=baseline # on exp/79 branch
make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline # vs main
```

## Outcome

Bug fix merged without benchmark regression. MCP clients can now reliably fetch specific memories by ID via `priority_ids`.

## Promoted Artifact

`tests/benchmarks/results/compare_issue79_priority_ids_20260311.json`

## Follow-up

None needed. Related MCP-side fix: `verygoodplugins/mcp-automem#67`.
@@ -0,0 +1,14 @@
{
"baseline_accuracy": 0.8936170212765957,
"test_accuracy": 0.8936170212765957,
"delta": 0.0,
"category_deltas": {
"Single-hop Recall": 0.0,
"Temporal Understanding": 0.0,
"Multi-hop Reasoning": 0.0,
"Open Domain": 0.0,
"Complex Reasoning": 0.0
},
"baseline_file": "benchmarks/results/locomo-mini_baseline_20260311_041134.json",
"test_file": "benchmarks/results/locomo-mini_baseline_20260311_041656.json"
}
@@ -0,0 +1,14 @@
{
"baseline_accuracy": 0.8936170212765957,
"test_accuracy": 0.8936170212765957,
"delta": 0.0,
"category_deltas": {
"Single-hop Recall": 0.0,
"Temporal Understanding": 0.0,
"Multi-hop Reasoning": 0.0,
"Open Domain": 0.0,
"Complex Reasoning": 0.0
},
"baseline_file": "benchmarks/results/locomo-mini_baseline_20260311_041134.json",
"test_file": "benchmarks/results/locomo-mini_baseline_20260311_055154.json"
}
14 changes: 14 additions & 0 deletions tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
@@ -0,0 +1,14 @@
{
"baseline_accuracy": 0.8936170212765957,
"test_accuracy": 0.8808510638297873,
"delta": -0.012765957446808418,
"category_deltas": {
"Single-hop Recall": 0.023255813953488413,
"Temporal Understanding": 0.0,
"Multi-hop Reasoning": 0.0,
"Open Domain": -0.03508771929824561,
"Complex Reasoning": 0.0
},
"baseline_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_baseline_20260310_233631.json",
"test_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json"
}
14 changes: 14 additions & 0 deletions tests/benchmarks/results/compare_pr80_judge_off_20260311.json
@@ -0,0 +1,14 @@
{
"baseline_accuracy": 0.8936170212765957,
"test_accuracy": 0.8553191489361702,
"delta": -0.038297872340425476,
"category_deltas": {
"Single-hop Recall": 0.06976744186046513,
"Temporal Understanding": 0.015873015873015928,
"Multi-hop Reasoning": 0.0,
"Open Domain": -0.11403508771929827,
"Complex Reasoning": 0.0
},
"baseline_file": "benchmarks/results/locomo-mini_baseline_20260310_233631.json",
"test_file": "benchmarks/results/locomo-mini_baseline_20260311_004300.json"
}
14 changes: 14 additions & 0 deletions tests/benchmarks/results/compare_pr80_judge_on_20260311.json
@@ -0,0 +1,14 @@
{
"baseline_accuracy": 0.9013157894736842,
"test_accuracy": 0.881578947368421,
"delta": -0.019736842105263164,
"category_deltas": {
"Single-hop Recall": 0.09302325581395354,
"Temporal Understanding": 0.0,
"Multi-hop Reasoning": 0.0,
"Open Domain": -0.10526315789473684,
"Complex Reasoning": 0.028169014084507005
},
"baseline_file": "benchmarks/results/locomo-mini_baseline_20260310_234028.json",
"test_file": "benchmarks/results/locomo-mini_baseline_20260311_011001_judge.json"
}
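
Comparison reports like these are simple enough to sanity-check mechanically before they are promoted. The sketch below is a hypothetical helper, not part of the bench tooling: `check_report` verifies that `delta` equals `test_accuracy - baseline_accuracy` and flags machine-specific absolute paths, using the values from `compare_pr80_judge_on_20260311.json` above as input.

```python
import json
import math
from typing import List

report = json.loads("""
{
  "baseline_accuracy": 0.9013157894736842,
  "test_accuracy": 0.881578947368421,
  "delta": -0.019736842105263164,
  "category_deltas": {
    "Single-hop Recall": 0.09302325581395354,
    "Temporal Understanding": 0.0,
    "Multi-hop Reasoning": 0.0,
    "Open Domain": -0.10526315789473684,
    "Complex Reasoning": 0.028169014084507005
  },
  "baseline_file": "benchmarks/results/locomo-mini_baseline_20260310_234028.json",
  "test_file": "benchmarks/results/locomo-mini_baseline_20260311_011001_judge.json"
}
""")


def check_report(data: dict, tol: float = 1e-9) -> List[str]:
    """Return a list of human-readable problems; empty means the report is sane."""
    problems: List[str] = []
    expected = data["test_accuracy"] - data["baseline_accuracy"]
    if not math.isclose(data["delta"], expected, abs_tol=tol):
        problems.append(f"delta {data['delta']} != test - baseline {expected}")
    for field in ("baseline_file", "test_file"):
        if str(data.get(field, "")).startswith("/"):
            problems.append(f"{field} is an absolute path")
    return problems


print(check_report(report))  # []
```

The path check is aimed at reports such as `compare_pr80_bm25_only_f10_judge_off.json`, whose `baseline_file` and `test_file` point into a local home directory rather than the repository.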