Add research decision memory benchmark by wcj007 · Pull Request #735 · moorcheh-ai/memanto

wcj007 · 2026-06-13T14:27:03Z

/claim #639

BountyHub bounty: https://www.bountyhub.dev/bounty/view/ec58c800-823e-4134-b027-99838e096d4b

Summary

Adds examples/benchmarks/research-decision-memory, a credential-free benchmark for issue #639 focused on long-running research and product agents that must preserve the current decision trail while suppressing superseded assumptions.

The scenario tracks target segment, pricing, deployment constraints, launch readiness, data handling, customer ownership, model routing, and the primary benchmark metric across multiple sessions where early hypotheses are corrected by later evidence.

Strategies compared

- active_decision_digest: Memanto-style typed digest with current decision slots, evidence IDs, and redaction of erased facts
- passive_graph_history: graph-style historical state memory that keeps conflicting states unless reconciled by the caller
- append_only_log: raw transcript retrieval that brings back stale assumptions and synthetic secrets
- recent_window_log: sliding recent-window baseline that avoids old stale facts but forgets older still-valid decisions
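The difference between these strategies can be illustrated with a minimal sketch. This is not the code in run_benchmark.py: the `Record` dataclass, the three helper functions, and the sample decisions below are hypothetical and only assume that each decision carries a session number, a slot, a status, and an evidence ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical decision record; field names are assumptions."""
    session: int       # order in which the decision was recorded
    slot: str          # decision topic, e.g. "pricing"
    value: str
    status: str        # "current" or "superseded"
    evidence_id: str


def active_digest(records):
    """Keep only the latest current-state record per slot."""
    digest = {}
    for rec in sorted(records, key=lambda r: r.session):
        if rec.status == "current":
            digest[rec.slot] = rec
    return digest


def append_only(records, slot):
    """Return every record for the slot, stale or not."""
    return [r for r in records if r.slot == slot]


def recent_window(records, slot, window):
    """Return slot records from the last `window` sessions only."""
    latest = max(r.session for r in records)
    return [r for r in records
            if r.slot == slot and r.session > latest - window]


# Made-up illustrative data, not the benchmark fixtures.
RECORDS = [
    Record(1, "pricing", "flat monthly fee", "superseded", "E1"),
    Record(1, "segment", "mid-market research teams", "current", "E2"),
    Record(3, "pricing", "usage-based tiers", "current", "E3"),
]

print(active_digest(RECORDS)["pricing"].value)
print([r.evidence_id for r in append_only(RECORDS, "pricing")])
print(recent_window(RECORDS, "segment", window=1))
```

In this toy data the digest keeps only the corrected pricing decision, the append-only lookup returns both the stale and the current pricing records, and a one-session window loses the still-valid segment decision from session 1.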

Sample results

| Backend | Accuracy | Evidence | Stale Conflicts | Secret Leaks | Avg Tokens | Signal/Noise |
| --- | --- | --- | --- | --- | --- | --- |
| active_decision_digest | 100.00% | 100.00% | 0.00% | 0.00% | 21.12 | 100.00% |
| passive_graph_history | 50.00% | 100.00% | 50.00% | 12.50% | 30.50 | 66.67% |
| append_only_log | 50.00% | 100.00% | 50.00% | 12.50% | 167.25 | 12.31% |
| recent_window_log | 50.00% | 50.00% | 0.00% | 0.00% | 10.62 | 100.00% |

Validation

py -3 examples/benchmarks/research-decision-memory/run_benchmark.py --output examples/benchmarks/research-decision-memory/results/sample_results.json --markdown examples/benchmarks/research-decision-memory/results/sample_results.md
py -3 -m unittest discover -s examples/benchmarks/research-decision-memory -p test_*.py
py -3 -m py_compile examples/benchmarks/research-decision-memory/run_benchmark.py examples/benchmarks/research-decision-memory/test_benchmark.py
git diff --check

Local validation passed: report generation, 5 unittest tests, py_compile, and git diff --check.

Social showcase: pending. I did not fabricate a Reddit/X link from this environment.

No private payout details or real secrets are included.

Summary by CodeRabbit

Release Notes

New Features
- Added a research decision memory benchmark comparing four different memory retrieval strategies with comprehensive evaluation metrics including accuracy, evidence coverage, stale conflict rates, and token analysis.
Documentation
- Provided detailed benchmark documentation, sample results in JSON and Markdown formats, and step-by-step execution instructions.
Tests
- Added unit tests to validate benchmark behavior across all memory strategies and verify reproducibility of benchmark outputs.

coderabbitai · 2026-06-13T14:27:15Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 493490b6-50cd-441c-96e9-47a8c705a85e

📥 Commits

Reviewing files that changed from the base of the PR and between a83f6fa and 62e0e85.

📒 Files selected for processing (1)

examples/benchmarks/research-decision-memory/test_benchmark.py

🚧 Files skipped from review as they are similar to previous changes (1)

examples/benchmarks/research-decision-memory/test_benchmark.py

📝 Walkthrough

This PR introduces a complete benchmark suite for comparing four in-memory decision retrieval strategies (active_decision_digest, passive_graph_history, append_only_log, recent_window_log) within a simulated research decision scenario, including deterministic test data, metric evaluation, JSON/markdown output generation, comprehensive unit tests, and sample results.

Changes

Research Decision Memory Benchmark

| Layer / File(s) | Summary |
| --- | --- |
| Benchmark overview and foundational data structures: `README.md`, `requirements.txt`, `run_benchmark.py` (lines 1-99, 100-311) | Documentation, stdlib requirement marker, and core module setup with dataclasses (DecisionRecord, Probe, Answer, BackendSummary), utility functions (tokenize, token_count, contains_all, contains_any, percentile_95), and deterministic test fixtures (decision_records with stale/superseded/rejected states and embedded secret; probes with expected terms/evidence/stale terms). |
| Memory backend implementations: `run_benchmark.py` (lines 314-422) | Base class MemoryBackend provides answer timing, rendering, and optional secret appending. Four concrete strategies: ActiveDecisionDigest filters to current-state records per slot; PassiveGraphHistory groups possible decision states; AppendOnlyLog selects by term overlap and tokenization; RecentWindowLog limits results to a trailing window per slot. |
| Benchmark evaluation, execution, and output: `run_benchmark.py` (lines 423-615) | score_answers() validates expected terms and evidence while detecting stale terms and secret markers, computing accuracy, coverage, stale conflict, secret leak, and signal/noise metrics. run() orchestrates backend execution and answer gathering. JSON/markdown output helpers conditionally omit latency for reproducible artifacts. markdown_report() formats results table. CLI interface (parse_args, main) exposes output path, markdown generation, and latency inclusion flags. |
| Unit tests validating backend behavior and outputs: `test_benchmark.py` | Benchmark execution tests verify ActiveDecisionDigest achieves perfect accuracy/coverage with zero stale/secret leaks; AppendOnlyLog retrieves more tokens with lower accuracy and stale conflicts; RecentWindowLog drops accuracy and returns "No decision found." for backfill gaps; PassiveGraphHistory exhibits stale conflicts with accuracy below 1.0. File I/O tests confirm JSON/markdown writing omits/includes latency as configured. Helper methods locate summaries and answers in payloads. Standard unittest entry point. |
| Sample benchmark results: `results/sample_results.json`, `results/sample_results.md` | Pre-computed JSON output containing decision records with session/slot/evidence/status/rationale/relevant/stale terms and embedded secret; probe definitions with expected terms/evidence/stale expectations; per-backend summaries with accuracy, coverage, stale conflict, secret leak, token, latency, and signal-noise metrics; and per-backend answers mapping each probe to synthesized text, evidence IDs, and token counts. Markdown results table with metric columns and notes on latency omission and regeneration. |
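The per-answer checks described for score_answers() can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the helper names `tokenize`, `contains_all`, and `contains_any` come from the walkthrough, but their signatures, the `score_answer` function, the probe dictionary layout, and the example strings are all hypothetical. How the real benchmark aggregates these flags into the reported percentages is not shown in this page, so the sketch keeps them as separate booleans.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens (assumed tokenization rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_all(text, terms):
    """True when every term appears as a case-insensitive substring."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def contains_any(text, terms):
    """True when at least one term appears as a case-insensitive substring."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def score_answer(answer_text, answer_evidence, probe):
    """Return independent per-answer flags for one probe (hypothetical layout)."""
    return {
        "has_expected_terms": contains_all(answer_text, probe["expected_terms"]),
        "evidence_covered": set(probe["expected_evidence"]) <= set(answer_evidence),
        "stale_conflict": contains_any(answer_text, probe["stale_terms"]),
        "secret_leak": contains_any(answer_text, probe["secret_markers"]),
        "tokens": len(tokenize(answer_text)),
    }


# Made-up probe and answer, not taken from the benchmark fixtures.
probe = {
    "expected_terms": ["usage-based"],
    "expected_evidence": ["E3"],
    "stale_terms": ["flat monthly"],
    "secret_markers": ["SYNTHETIC_SECRET"],
}
flags = score_answer(
    "Pricing: usage-based tiers (earlier: flat monthly fee)",
    ["E1", "E3"],
    probe,
)
print(flags)
```

The example answer contains the expected term and the expected evidence ID but also repeats a superseded value, which is the kind of answer a history-preserving backend would produce: correct content accompanied by a stale conflict.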

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A benchmark hops into the fold,
Four strategies competing, brave and bold,
Decision records dance, evidence shines bright,
Stale assumptions banished from sight—
Memory tested, metrics align,
The research trail crystalline and fine! 🏃‍♂️✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'Add research decision memory benchmark' accurately and concisely describes the main change: introducing a new benchmark system for evaluating memory strategies in research and product agents. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

🧹 Nitpick comments (1)

examples/benchmarks/research-decision-memory/test_benchmark.py (1)

44-59: ⚡ Quick win

Add golden-file parity assertions for checked-in sample artifacts.

This test currently validates structure/content markers but not that generated reproducible outputs match the committed results/sample_results.json and results/sample_results.md. That allows sample artifacts to drift silently.

🔧 Proposed update

     def test_writers_emit_reproducible_json_and_markdown(self) -> None:
         payload = run_benchmark.run()
         with tempfile.TemporaryDirectory() as tmpdir:
             json_path = Path(tmpdir) / "results.json"
             md_path = Path(tmpdir) / "results.md"
             run_benchmark.write_json(json_path, payload)
             run_benchmark.write_markdown(md_path, payload)

             json_text = json_path.read_text()
             markdown_text = md_path.read_text()
             self.assertIn("active_decision_digest", json_text)
             self.assertNotIn("p95_latency_ms", json_text)
             self.assertNotIn("latency_ms", json_text)
             self.assertIn("Research Decision Memory Results", markdown_text)
             self.assertNotIn("| p95 ms |", markdown_text)
+
+            sample_dir = Path(__file__).parent / "results"
+            self.assertEqual(
+                json_text,
+                (sample_dir / "sample_results.json").read_text(),
+            )
+            self.assertEqual(
+                markdown_text,
+                (sample_dir / "sample_results.md").read_text(),
+            )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/research-decision-memory/test_benchmark.py` around lines
44 - 59, The test test_writers_emit_reproducible_json_and_markdown should assert
that generated outputs exactly match the checked-in sample artifacts; load the
committed sample JSON and Markdown (e.g., sample_results.json and
sample_results.md), generate the outputs via run_benchmark.run() +
run_benchmark.write_json/write_markdown into the temp files, then compare them
after normalizing non-deterministic parts: for JSON, load both into Python
objects, remove/normalize ephemeral keys (timestamps, latency fields like
"p95_latency_ms"/"latency_ms", and any run-specific IDs/digests if necessary),
re-dump with json.dumps(..., sort_keys=True, separators=(',',':')) and assert
equality; for Markdown, normalize line endings and strip trailing whitespace
before asserting equality with the committed sample; add these comparisons to
test_writers_emit_reproducible_json_and_markdown so any drift in the sample
artifacts fails the test.
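The normalization steps named in this prompt can be sketched as two small helpers. This is a minimal sketch, not code from the PR: `strip_ephemeral`, `canonical_json`, `canonical_markdown`, and the `EPHEMERAL_KEYS` set are hypothetical names, and the key list only covers the two latency fields the prompt mentions.

```python
import json

# Keys treated as run-specific; assumed from the two fields named in the prompt.
EPHEMERAL_KEYS = {"p95_latency_ms", "latency_ms"}


def strip_ephemeral(value):
    """Recursively drop ephemeral keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {
            key: strip_ephemeral(item)
            for key, item in value.items()
            if key not in EPHEMERAL_KEYS
        }
    if isinstance(value, list):
        return [strip_ephemeral(item) for item in value]
    return value


def canonical_json(text):
    """Parse, strip ephemeral keys, and re-dump in a stable compact form."""
    data = strip_ephemeral(json.loads(text))
    return json.dumps(data, sort_keys=True, separators=(",", ":"))


def canonical_markdown(text):
    """Normalize line endings and strip trailing whitespace per line."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip("\n")


generated = '{"b": 1, "a": {"latency_ms": 3, "x": [{"p95_latency_ms": 1, "y": 2}]}}'
committed = '{"a": {"x": [{"y": 2}]}, "b": 1}'
print(canonical_json(generated) == canonical_json(committed))
print(canonical_markdown("| a |  \r\n| b |\r\n") == canonical_markdown("| a |\n| b |\n"))
```

Inside the test, the generated and committed files would each be passed through the matching helper before `assertEqual`, so differences in key order, latency fields, or line endings would not cause a failure while real drift in the sample artifacts still would.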

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b6d1bc6a-bde1-49f4-aaf1-6b5ba938ed91

📥 Commits

Reviewing files that changed from the base of the PR and between 262db90 and a83f6fa.

📒 Files selected for processing (6)

examples/benchmarks/research-decision-memory/README.md
examples/benchmarks/research-decision-memory/requirements.txt
examples/benchmarks/research-decision-memory/results/sample_results.json
examples/benchmarks/research-decision-memory/results/sample_results.md
examples/benchmarks/research-decision-memory/run_benchmark.py
examples/benchmarks/research-decision-memory/test_benchmark.py

wcj007 added 3 commits June 7, 2026 04:48

Add research decision memory benchmark

5549b6a

Use synthetic secret markers in benchmark fixtures

79adfaf

fix: make decision memory samples reproducible

a83f6fa

coderabbitai Bot reviewed Jun 13, 2026

Add benchmark sample parity test

62e0e85

Add research decision memory benchmark #735
wcj007 wants to merge 4 commits into moorcheh-ai:main from wcj007:improve/research-decision-memory-639
