chore(bench): experiment postmortems, artifact promotion, log update #126
Merged
44 changes: 44 additions & 0 deletions
benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| # Issue #74: Graph expansion follows too many hops through hub nodes | ||
|
|
||
| **Status**: NON-PROMOTED / REJECTED (this direction) | ||
| **Date**: 2026-03-11 | ||
| **Issue**: #74 (remains open) | ||
| **Branch**: `exp/74-entity-expansion-precision-v1` (commit `236b075`) | ||
|
|
||
| ## Hypothesis | ||
|
|
||
| Hub-node pollution in graph expansion degrades recall precision. Querying "Alex Panagis" with `expand_entities=true` incorrectly pulls in unrelated memories through shared generic nodes (e.g., the "AutoMem" tool node connects all users). Hub-node detection and deprioritization should improve precision without harming recall. | ||
|
|
||
| ## Benchmark | ||
|
|
||
| | Metric | Baseline | Test | Delta | | ||
| |--------|----------|------|-------| | ||
| | LoCoMo-mini | 89.36% | 89.36% | **0.0** | | ||
|
|
||
| **Category deltas**: All zero across all five categories. | ||
|
|
||
| ## Why Zero Delta | ||
|
|
||
| The LoCoMo benchmark doesn't exercise graph expansion (`expand_entities`) in its query path. All LoCoMo questions are answered via vector + keyword search. The hub-node problem is real in production (where users query with `expand_entities=true`), but this benchmark can't surface the improvement. | ||
|
|
||
| ## Commands Run | ||
|
|
||
| ```bash | ||
| make bench-eval BENCH=locomo-mini CONFIG=baseline # on exp/74 branch | ||
| make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline # vs main | ||
| ``` | ||
|
|
||
| ## Outcome | ||
|
|
||
| Hub-node detection alone produces no measurable change on LoCoMo-mini (0.0 delta overall and in every category). The experiment was executed correctly, but the benchmark is not the right instrument for measuring this improvement. | ||
|
|
||
| ## Promoted Artifact | ||
|
|
||
| `tests/benchmarks/results/compare_issue74_entity_precision_20260311.json` | ||
|
|
||
| ## Follow-up | ||
|
|
||
| - Issue #74 remains open — the problem is real, just not benchmark-measurable yet | ||
| - Consider Personalized PageRank (#100) or configurable `max_hops` as alternative approaches | ||
| - Need a graph-expansion-specific test suite (queries with `expand_entities=true`) to measure future attempts | ||
| - Hub-node deprioritization should not be the default direction for recall improvement work |
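The hub-node idea from the hypothesis can be sketched as a breadth-first expansion that may reach a high-degree node but does not continue through it. This is a minimal illustration, not the code on the `exp/74` branch: the `expand` function, the `hub_degree` threshold, the toy `EDGES` graph, and the way `max_hops` is applied are all assumptions made for the example.

```python
from collections import defaultdict, deque

# Toy graph (hypothetical): three users share the generic "AutoMem" tool node.
EDGES = [
    ("alex", "mem:alex-1"),
    ("alex", "AutoMem"),
    ("bob", "AutoMem"),
    ("carol", "AutoMem"),
    ("bob", "mem:bob-1"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) edge pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def expand(seed, edges, max_hops=2, hub_degree=3):
    """Return nodes reachable from seed without traversing through a hub.

    A node whose degree is >= hub_degree can appear in the result,
    but its neighbors are not followed.
    """
    adjacency = build_adjacency(edges)
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # hop budget exhausted for this path
        if node != seed and len(adjacency[node]) >= hub_degree:
            continue  # hub: reachable, but not expanded further
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen - {seed}


print(sorted(expand("alex", EDGES, max_hops=3, hub_degree=3)))
print(sorted(expand("alex", EDGES, max_hops=3, hub_degree=100)))
```

With the threshold at 3, expansion from `alex` stops at the shared node; with the threshold effectively disabled, it leaks into another user's memory (`mem:bob-1`). A graph-expansion test suite of the kind proposed in the follow-up could assert on exactly this difference.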
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,93 @@ | ||
| # PR #80: Enhanced Recall — BM25, LLM Reranking, Query Expansion | ||
|
|
||
| **Status**: REJECTED — regression across all tested configurations | ||
| **Date**: 2026-03-11 | ||
| **PR**: #80 (by jescalan) | ||
| **Branch**: `exp/pr80-enhanced-recall-v2` (commit `a122ba2`) | ||
| **Author**: jescalan | ||
|
|
||
| ## Hypothesis | ||
|
|
||
| Adding BM25 full-text search, LLM reranking, and query expansion to the recall pipeline will improve accuracy by catching keyword matches that vector search misses and filtering false positives. | ||
|
|
||
| **Pipeline**: query expansion → vector + graph + BM25 search → RRF fusion → metadata scoring → LLM reranking | ||
|
|
||
| ## Benchmark Results | ||
|
|
||
| ### Full Port (all features enabled) | ||
|
|
||
| | Metric | Baseline (main) | PR #80 | Delta | | ||
| |--------|-----------------|--------|-------| | ||
| | LoCoMo-mini (judge-off) | 89.36% | 85.53% | **-3.83pp** | | ||
| | LoCoMo-mini (judge-on) | 90.13% | 88.16% | **-1.97pp** | | ||
|
|
||
| ### Category Breakdown (judge-off) | ||
|
|
||
| | Category | Baseline | PR #80 | Delta | | ||
| |----------|----------|--------|-------| | ||
| | Single-hop | 79.1% | 86.0% | **+7.0pp** | | ||
| | Temporal | 92.1% | 93.7% | **+1.6pp** | | ||
| | Multi-hop | 46.2% | 46.2% | **0.0** | | ||
| | Open Domain | 96.5% | 85.1% | **-11.4pp** | | ||
| | Complex | N/A | N/A | -- | | ||
|
|
||
| ### Config Sweep (Round 1, judge-off) | ||
|
|
||
| | Config | Accuracy | Delta vs Main | Notes | | ||
| |--------|----------|---------------|-------| | ||
| | BM25-only f10 | 88.09% | **-1.28pp** | Best variant, still regressed | | ||
| | BM25-only f20 | 87.66% | **-1.70pp** | More BM25 results = more dilution | | ||
| | BM25 + rerank top-5 | 87.23% | **-2.13pp** | Reranking didn't help | | ||
| | BM25 + rerank top-10 | 86.81% | **-2.55pp** | Wider rerank window = worse | | ||
|
|
||
| ### Runtime Impact | ||
|
|
||
| | Config | Runtime | vs Baseline | | ||
| |--------|---------|-------------| | ||
| | Main (baseline) | 209s | -- | | ||
| | Full port (judge-off) | 1546s | **7.4x slower** | | ||
| | Full port (judge-on) | 2136s | **10.2x slower** | | ||
|
|
||
| ## Root Cause Analysis | ||
|
|
||
| **Open-domain regression is the primary problem.** Open-domain questions (e.g., "What does Alex think about remote work?") rely on semantic similarity — the answer is conceptually related but doesn't share keywords with the question. BM25 keyword matches dilute the vector results via Reciprocal Rank Fusion (RRF), pushing semantically-relevant results down. | ||
|
|
||
| The single-hop improvement (+7.0pp) shows BM25 *does* help for factual lookups where exact keywords matter. But the open-domain loss (-11.4pp) is 1.6x larger and affects more questions (114 vs 43). | ||
|
|
||
| **Runtime cost** is driven by LLM reranking (one API call per candidate per question) and query expansion (one API call per question). Even BM25-only configs add SQLite FTS5 overhead. | ||
|
|
||
| ## Commands Run | ||
|
|
||
| ```bash | ||
| # Full port evaluation | ||
| make bench-eval BENCH=locomo-mini CONFIG=baseline # on exp/pr80-enhanced-recall-v2 | ||
| make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline | ||
|
|
||
| # Config sweep (4 variants) | ||
| make bench-eval BENCH=locomo-mini CONFIG=bm25_only_f10 | ||
| make bench-eval BENCH=locomo-mini CONFIG=bm25_only_f20 | ||
| make bench-eval BENCH=locomo-mini CONFIG=bm25_rerank_top5 | ||
| make bench-eval BENCH=locomo-mini CONFIG=bm25_rerank_top10 | ||
| ``` | ||
|
|
||
| ## Outcome | ||
|
|
||
| **REJECTED.** No configuration recovered the open-domain regression. The best variant (BM25-only f10) still regressed by 1.28pp overall. | ||
|
|
||
| ## Decision | ||
|
|
||
| PR #80's always-on BM25 fusion approach hurts more than it helps on current benchmarks. The single-hop gains don't compensate for open-domain losses. The 7-10x runtime increase is also prohibitive for production use. | ||
|
|
||
| ## Promoted Artifacts | ||
|
|
||
| - `tests/benchmarks/results/compare_pr80_judge_off_20260311.json` | ||
| - `tests/benchmarks/results/compare_pr80_judge_on_20260311.json` | ||
| - `tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json` | ||
|
|
||
| ## Follow-up Recommendations | ||
|
|
||
| 1. **Targeted BM25 fallback**: Only invoke BM25 when vector search returns low-confidence results (e.g., all scores < 0.5), rather than always-on fusion | ||
| 2. **Category-aware fusion**: Weight BM25 differently for factual vs open-domain queries (requires query classification) | ||
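| 3. **RRF tuning**: The RRF constant (k=60 default) may give BM25 results too much weight. In standard RRF a single k applies to every ranked list, so lowering it alone does not rebalance the sources; a per-source weight, or a larger k for the BM25 list only, could reduce BM25 influence | ||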
| 4. **Reranking without BM25**: Test LLM reranking on vanilla vector results (without BM25 dilution) as a separate experiment | ||
| 5. **Contributor summary**: If closing PR #80, provide benchmark evidence (this postmortem) rather than preference-based feedback |
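The dilution described in the root cause analysis, and the first and third recommendations, can be illustrated with a small Reciprocal Rank Fusion sketch. It uses the standard RRF score, the sum of `1 / (k + rank)` over the ranked lists. The per-source `weights` argument, the `fuse_with_fallback` helper, its `min_confidence` parameter, and the document IDs are assumptions for illustration and are not taken from PR #80.

```python
def rrf_fuse(ranked_lists, k=60, weights=None):
    """Fuse ranked ID lists: score(d) = sum over lists of weight / (k + rank)."""
    weights = weights or {}
    scores = {}
    for source, ids in ranked_lists.items():
        weight = weights.get(source, 1.0)  # hypothetical per-source weight
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def fuse_with_fallback(vector_hits, bm25_ids, min_confidence=0.5, k=60):
    """Use BM25 only when every vector score is below min_confidence.

    vector_hits is a list of (doc_id, score) pairs, best first.
    """
    vector_ids = [doc_id for doc_id, _score in vector_hits]
    if any(score >= min_confidence for _doc_id, score in vector_hits):
        return vector_ids  # vector search is confident: skip BM25 entirely
    return rrf_fuse({"vector": vector_ids, "bm25": bm25_ids}, k=k)


VECTOR = ["sem1", "sem2", "kw1"]  # semantic matches rank first
BM25 = ["kw1", "kw2", "kw3"]      # keyword-only matches

always_on = rrf_fuse({"vector": VECTOR, "bm25": BM25})
down_weighted = rrf_fuse({"vector": VECTOR, "bm25": BM25}, weights={"bm25": 0.25})
print(always_on)
print(down_weighted)
```

In the always-on case the keyword-only hit `kw2` ties with the semantic hit `sem2` and, with this tie-break, lands ahead of it. Down-weighting the BM25 list restores the vector order for the semantic hits while still promoting `kw1`, which both lists agree on.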
39 changes: 39 additions & 0 deletions
benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| # Issue #79: priority_ids parameter only boosts relevance | ||
|
|
||
| **Status**: ACCEPTED / MERGED | ||
| **Date**: 2026-03-12 | ||
| **PR**: #125 (commit `5d3708c`) | ||
| **Branch**: `exp/79-priority-ids-fetch-v1` | ||
|
|
||
| ## Hypothesis | ||
|
|
||
| The `priority_ids` parameter should fetch memories directly by ID and guarantee their inclusion in results, not merely boost their score if they happen to appear in normal search results. | ||
|
|
||
| ## Benchmark | ||
|
|
||
| | Metric | Baseline | Test | Delta | | ||
| |--------|----------|------|-------| | ||
| | LoCoMo-mini | 89.36% | 89.36% | **0.0** | | ||
|
|
||
| **Category deltas**: All zero across Single-hop, Temporal, Multi-hop, Open Domain, Complex. | ||
|
|
||
| Delta of zero is expected — this is a fetch-behavior bug fix, not a recall scoring change. The benchmark exercises natural-language queries, not ID-based lookups. | ||
|
|
||
| ## Commands Run | ||
|
|
||
| ```bash | ||
| make bench-eval BENCH=locomo-mini CONFIG=baseline # on exp/79 branch | ||
| make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline # vs main | ||
| ``` | ||
|
|
||
| ## Outcome | ||
|
|
||
| Bug fix merged without benchmark regression. MCP clients can now reliably fetch specific memories by ID via `priority_ids`. | ||
|
|
||
| ## Promoted Artifact | ||
|
|
||
| `tests/benchmarks/results/compare_issue79_priority_ids_20260311.json` | ||
|
|
||
| ## Follow-up | ||
|
|
||
| None needed. Related MCP-side fix: `verygoodplugins/mcp-automem#67`. |
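The behavior change in the hypothesis (fetch by ID and guarantee inclusion, rather than boost a score) can be sketched as follows. This is an illustration under stated assumptions, not the code merged in PR #125: the `recall` function, its `fetch_by_id` callback, the `limit` handling, and the in-memory `STORE` are hypothetical.

```python
# Hypothetical in-memory store standing in for the real memory backend.
STORE = {
    "m1": {"content": "first"},
    "m2": {"content": "second"},
    "m3": {"content": "third"},
    "m7": {"content": "pinned"},
}


def recall(query_hits, priority_ids, fetch_by_id, limit=5):
    """Return memory IDs with every resolvable priority ID guaranteed present.

    query_hits is a list of (memory_id, score) pairs, best first.
    Priority IDs are fetched directly, so they appear even when the
    search itself did not return them; unknown IDs are skipped.
    """
    results = []
    seen = set()
    for memory_id in priority_ids:
        if memory_id in seen:
            continue
        if fetch_by_id(memory_id) is not None:
            results.append(memory_id)
            seen.add(memory_id)
    # Fill the remaining slots with ordinary search hits.
    for memory_id, _score in query_hits:
        if len(results) >= limit:
            break
        if memory_id not in seen:
            results.append(memory_id)
            seen.add(memory_id)
    return results


hits = [("m1", 0.9), ("m2", 0.8), ("m3", 0.7)]
print(recall(hits, ["m7", "missing"], STORE.get, limit=3))
```

One design choice in this sketch is that priority IDs are never truncated by `limit`; they only reduce the number of slots left for search hits. A score boost alone could not return `m7` here, because the search never surfaced it.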
14 changes: 14 additions & 0 deletions
tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| { | ||
| "baseline_accuracy": 0.8936170212765957, | ||
| "test_accuracy": 0.8936170212765957, | ||
| "delta": 0.0, | ||
| "category_deltas": { | ||
| "Single-hop Recall": 0.0, | ||
| "Temporal Understanding": 0.0, | ||
| "Multi-hop Reasoning": 0.0, | ||
| "Open Domain": 0.0, | ||
| "Complex Reasoning": 0.0 | ||
| }, | ||
| "baseline_file": "benchmarks/results/locomo-mini_baseline_20260311_041134.json", | ||
| "test_file": "benchmarks/results/locomo-mini_baseline_20260311_041656.json" | ||
| } |
14 changes: 14 additions & 0 deletions
tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| { | ||
| "baseline_accuracy": 0.8936170212765957, | ||
| "test_accuracy": 0.8936170212765957, | ||
| "delta": 0.0, | ||
| "category_deltas": { | ||
| "Single-hop Recall": 0.0, | ||
| "Temporal Understanding": 0.0, | ||
| "Multi-hop Reasoning": 0.0, | ||
| "Open Domain": 0.0, | ||
| "Complex Reasoning": 0.0 | ||
| }, | ||
| "baseline_file": "benchmarks/results/locomo-mini_baseline_20260311_041134.json", | ||
| "test_file": "benchmarks/results/locomo-mini_baseline_20260311_055154.json" | ||
| } |
14 changes: 14 additions & 0 deletions
tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| { | ||
| "baseline_accuracy": 0.8936170212765957, | ||
| "test_accuracy": 0.8808510638297873, | ||
| "delta": -0.012765957446808418, | ||
| "category_deltas": { | ||
| "Single-hop Recall": 0.023255813953488413, | ||
| "Temporal Understanding": 0.0, | ||
| "Multi-hop Reasoning": 0.0, | ||
| "Open Domain": -0.03508771929824561, | ||
| "Complex Reasoning": 0.0 | ||
| }, | ||
| "baseline_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_baseline_20260310_233631.json", | ||
| "test_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json" | ||
| } | ||
14 changes: 14 additions & 0 deletions
tests/benchmarks/results/compare_pr80_judge_off_20260311.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| { | ||
| "baseline_accuracy": 0.8936170212765957, | ||
| "test_accuracy": 0.8553191489361702, | ||
| "delta": -0.038297872340425476, | ||
| "category_deltas": { | ||
| "Single-hop Recall": 0.06976744186046513, | ||
| "Temporal Understanding": 0.015873015873015928, | ||
| "Multi-hop Reasoning": 0.0, | ||
| "Open Domain": -0.11403508771929827, | ||
| "Complex Reasoning": 0.0 | ||
| }, | ||
| "baseline_file": "benchmarks/results/locomo-mini_baseline_20260310_233631.json", | ||
| "test_file": "benchmarks/results/locomo-mini_baseline_20260311_004300.json" | ||
| } |
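The comparison artifacts store accuracies and deltas as fractions, while the postmortem tables report percentage points. The sketch below checks that a record's stored `delta` equals `test_accuracy - baseline_accuracy` and formats the values the way the tables do. The values are copied from the judge-off record above; the `to_pp` and `summarize` helpers are hypothetical and not part of the benchmark tooling.

```python
# Values copied from compare_pr80_judge_off_20260311.json.
RECORD = {
    "baseline_accuracy": 0.8936170212765957,
    "test_accuracy": 0.8553191489361702,
    "delta": -0.038297872340425476,
    "category_deltas": {
        "Single-hop Recall": 0.06976744186046513,
        "Temporal Understanding": 0.015873015873015928,
        "Multi-hop Reasoning": 0.0,
        "Open Domain": -0.11403508771929827,
        "Complex Reasoning": 0.0,
    },
}


def to_pp(fraction, digits):
    """Format a fractional delta as signed percentage points."""
    return f"{fraction * 100:+.{digits}f}pp"


def summarize(record, tolerance=1e-9):
    """Validate the stored delta and return table-ready strings."""
    recomputed = record["test_accuracy"] - record["baseline_accuracy"]
    if abs(recomputed - record["delta"]) > tolerance:
        raise ValueError("stored delta does not match test - baseline")
    return {
        "overall": to_pp(record["delta"], 2),
        "categories": {
            name: to_pp(value, 1)
            for name, value in record["category_deltas"].items()
        },
    }


summary = summarize(RECORD)
print(summary["overall"])
for name, value in summary["categories"].items():
    print(f"{name}: {value}")
```

The formatted values match the Full Port and Category Breakdown tables in the PR #80 postmortem: -3.83pp overall, +7.0pp single-hop, and -11.4pp open-domain.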
14 changes: 14 additions & 0 deletions
tests/benchmarks/results/compare_pr80_judge_on_20260311.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| { | ||
| "baseline_accuracy": 0.9013157894736842, | ||
| "test_accuracy": 0.881578947368421, | ||
| "delta": -0.019736842105263164, | ||
| "category_deltas": { | ||
| "Single-hop Recall": 0.09302325581395354, | ||
| "Temporal Understanding": 0.0, | ||
| "Multi-hop Reasoning": 0.0, | ||
| "Open Domain": -0.10526315789473684, | ||
| "Complex Reasoning": 0.028169014084507005 | ||
| }, | ||
| "baseline_file": "benchmarks/results/locomo-mini_baseline_20260310_234028.json", | ||
| "test_file": "benchmarks/results/locomo-mini_baseline_20260311_011001_judge.json" | ||
| } |