Add research decision memory benchmark #735

Open
wcj007 wants to merge 4 commits into moorcheh-ai:main from wcj007:improve/research-decision-memory-639

@wcj007 wcj007 commented Jun 13, 2026

/claim #639

BountyHub bounty: https://www.bountyhub.dev/bounty/view/ec58c800-823e-4134-b027-99838e096d4b

Summary

Adds examples/benchmarks/research-decision-memory, a credential-free benchmark for issue #639 focused on long-running research and product agents that must preserve the current decision trail while suppressing superseded assumptions.

The scenario tracks target segment, pricing, deployment constraints, launch readiness, data handling, customer ownership, model routing, and the primary benchmark metric across multiple sessions where early hypotheses are corrected by later evidence.

Strategies compared

  • active_decision_digest: Memanto-style typed digest with current decision slots, evidence IDs, and redaction of erased facts
  • passive_graph_history: graph-style historical state memory that keeps conflicting states unless reconciled by the caller
  • append_only_log: raw transcript retrieval that brings back stale assumptions and synthetic secrets
  • recent_window_log: sliding recent-window baseline that avoids old stale facts but forgets older still-valid decisions
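
To make the difference between these strategies concrete, here is a minimal sketch, not the PR's actual implementation. The `Record` fields, the sample records, and the helper names `active_digest`, `append_only`, and `recent_window` are all hypothetical and chosen only for illustration; the graph-history strategy is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical decision record; field names are assumptions."""
    session: int
    slot: str
    status: str  # "current" or "superseded"
    text: str
    evidence_id: str


# Illustrative data only, not taken from the benchmark fixtures.
RECORDS = [
    Record(1, "pricing", "superseded", "Flat monthly fee", "E1"),
    Record(1, "target_segment", "current", "Mid-market research teams", "E2"),
    Record(4, "pricing", "current", "Usage-based pricing", "E3"),
]


def active_digest(records):
    """Keep one current record per slot; later sessions win."""
    digest = {}
    for record in sorted(records, key=lambda r: r.session):
        if record.status == "current":
            digest[record.slot] = record
    return digest


def append_only(records, slot):
    """Return every record for a slot, stale ones included."""
    return [r for r in records if r.slot == slot]


def recent_window(records, slot, window=2):
    """Return only records from the last `window` sessions."""
    latest = max(r.session for r in records)
    return [r for r in records if r.slot == slot and r.session > latest - window]


if __name__ == "__main__":
    print(active_digest(RECORDS)["pricing"].text)
    print([r.text for r in append_only(RECORDS, "pricing")])
    print(recent_window(RECORDS, "target_segment"))
```

In this toy setup the digest returns only the current pricing decision, the append-only log returns both the superseded and current ones, and the recent window returns nothing for the target segment because that still-valid decision was made outside the window.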

Sample results

| Backend | Accuracy | Evidence | Stale Conflicts | Secret Leaks | Avg Tokens | Signal/Noise |
| --- | --- | --- | --- | --- | --- | --- |
| active_decision_digest | 100.00% | 100.00% | 0.00% | 0.00% | 21.12 | 100.00% |
| passive_graph_history | 50.00% | 100.00% | 50.00% | 12.50% | 30.50 | 66.67% |
| append_only_log | 50.00% | 100.00% | 50.00% | 12.50% | 167.25 | 12.31% |
| recent_window_log | 50.00% | 50.00% | 0.00% | 0.00% | 10.62 | 100.00% |

Validation

py -3 examples/benchmarks/research-decision-memory/run_benchmark.py --output examples/benchmarks/research-decision-memory/results/sample_results.json --markdown examples/benchmarks/research-decision-memory/results/sample_results.md
py -3 -m unittest discover -s examples/benchmarks/research-decision-memory -p test_*.py
py -3 -m py_compile examples/benchmarks/research-decision-memory/run_benchmark.py examples/benchmarks/research-decision-memory/test_benchmark.py
git diff --check

Local validation passed: report generation, 5 unittest tests, py_compile, and git diff --check.

Social showcase: pending. I did not fabricate a Reddit/X link from this environment.

No private payout details or real secrets are included.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added a research decision memory benchmark comparing four different memory retrieval strategies with comprehensive evaluation metrics including accuracy, evidence coverage, stale conflict rates, and token analysis.
  • Documentation

    • Provided detailed benchmark documentation, sample results in JSON and Markdown formats, and step-by-step execution instructions.
  • Tests

    • Added unit tests to validate benchmark behavior across all memory strategies and verify reproducibility of benchmark outputs.

coderabbitai Bot commented Jun 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 493490b6-50cd-441c-96e9-47a8c705a85e

📥 Commits

Reviewing files that changed from the base of the PR and between a83f6fa and 62e0e85.

📒 Files selected for processing (1)
  • examples/benchmarks/research-decision-memory/test_benchmark.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/benchmarks/research-decision-memory/test_benchmark.py

📝 Walkthrough

This PR introduces a complete benchmark suite for comparing four in-memory decision retrieval strategies (active_decision_digest, passive_graph_history, append_only_log, recent_window_log) within a simulated research decision scenario, including deterministic test data, metric evaluation, JSON/markdown output generation, comprehensive unit tests, and sample results.

Changes

Research Decision Memory Benchmark

| Layer / File(s) | Summary |
| --- | --- |
| Benchmark overview and foundational data structures: README.md, requirements.txt, run_benchmark.py (lines 1-99, 100-311) | Documentation, stdlib requirement marker, and core module setup with dataclasses (DecisionRecord, Probe, Answer, BackendSummary), utility functions (tokenize, token_count, contains_all, contains_any, percentile_95), and deterministic test fixtures (decision_records with stale/superseded/rejected states and embedded secret; probes with expected terms/evidence/stale terms). |
| Memory backend implementations: run_benchmark.py (lines 314-422) | Base class MemoryBackend provides answer timing, rendering, and optional secret appending. Four concrete strategies: ActiveDecisionDigest filters to current-state records per slot; PassiveGraphHistory groups possible decision states; AppendOnlyLog selects by term overlap and tokenization; RecentWindowLog limits results to a trailing window per slot. |
| Benchmark evaluation, execution, and output: run_benchmark.py (lines 423-615) | score_answers() validates expected terms and evidence while detecting stale terms and secret markers, computing accuracy, coverage, stale conflict, secret leak, and signal/noise metrics. run() orchestrates backend execution and answer gathering. JSON/markdown output helpers conditionally omit latency for reproducible artifacts. markdown_report() formats results table. CLI interface (parse_args, main) exposes output path, markdown generation, and latency inclusion flags. |
| Unit tests validating backend behavior and outputs: test_benchmark.py | Benchmark execution tests verify ActiveDecisionDigest achieves perfect accuracy/coverage with zero stale/secret leaks; AppendOnlyLog retrieves more tokens with lower accuracy and stale conflicts; RecentWindowLog drops accuracy and returns "No decision found." for backfill gaps; PassiveGraphHistory exhibits stale conflicts with accuracy below 1.0. File I/O tests confirm JSON/markdown writing omits/includes latency as configured. Helper methods locate summaries and answers in payloads. Standard unittest entry point. |
| Sample benchmark results: results/sample_results.json, results/sample_results.md | Pre-computed JSON output containing decision records with session/slot/evidence/status/rationale/relevant/stale terms and embedded secret; probe definitions with expected terms/evidence/stale expectations; per-backend summaries with accuracy, coverage, stale conflict, secret leak, token, latency, and signal-noise metrics; and per-backend answers mapping each probe to synthesized text, evidence IDs, and token counts. Markdown results table with metric columns and notes on latency omission and regeneration. |
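
The scoring step described in the walkthrough can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in run_benchmark.py: the names `contains_all` and `contains_any` appear in the walkthrough, but their bodies here are guesses, and `Probe`/`Answer` field names, `SECRET_MARKER`, `score_pairs`, and the rule that an answer containing a stale term does not count as accurate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Hypothetical probe shape; field names are assumptions."""
    expected_terms: tuple
    expected_evidence: tuple
    stale_terms: tuple


@dataclass(frozen=True)
class Answer:
    """Hypothetical answer shape; field names are assumptions."""
    text: str
    evidence_ids: tuple


SECRET_MARKER = "sk-synthetic"  # assumed marker for a synthetic secret


def contains_all(text, terms):
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def contains_any(text, terms):
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def score_pairs(pairs):
    """Aggregate per-answer flags into rates between 0.0 and 1.0."""
    total = len(pairs)
    correct = evidence = stale = leaks = 0
    for probe, answer in pairs:
        has_stale = contains_any(answer.text, probe.stale_terms)
        has_terms = contains_all(answer.text, probe.expected_terms)
        # Assumption: a stale term disqualifies an otherwise correct answer.
        correct += has_terms and not has_stale
        evidence += set(probe.expected_evidence) <= set(answer.evidence_ids)
        stale += has_stale
        leaks += SECRET_MARKER in answer.text.lower()
    return {
        "accuracy": correct / total,
        "evidence_coverage": evidence / total,
        "stale_conflict_rate": stale / total,
        "secret_leak_rate": leaks / total,
    }
```

Keeping the stale and leak checks separate from the accuracy check means a backend that retrieves the right answer alongside outdated or sensitive text is still penalized, which appears to be the distinction the benchmark is designed to surface.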

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A benchmark hops into the fold,
Four strategies competing, brave and bold,
Decision records dance, evidence shines bright,
Stale assumptions banished from sight—
Memory tested, metrics align,
The research trail crystalline and fine! 🏃‍♂️✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'Add research decision memory benchmark' accurately and concisely describes the main change: introducing a new benchmark system for evaluating memory strategies in research and product agents. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
examples/benchmarks/research-decision-memory/test_benchmark.py (1)

44-59: ⚡ Quick win

Add golden-file parity assertions for checked-in sample artifacts.

This test currently validates structure/content markers but not that generated reproducible outputs match the committed results/sample_results.json and results/sample_results.md. That allows sample artifacts to drift silently.

🔧 Proposed update
     def test_writers_emit_reproducible_json_and_markdown(self) -> None:
         payload = run_benchmark.run()
         with tempfile.TemporaryDirectory() as tmpdir:
             json_path = Path(tmpdir) / "results.json"
             md_path = Path(tmpdir) / "results.md"
             run_benchmark.write_json(json_path, payload)
             run_benchmark.write_markdown(md_path, payload)

             json_text = json_path.read_text()
             markdown_text = md_path.read_text()
             self.assertIn("active_decision_digest", json_text)
             self.assertNotIn("p95_latency_ms", json_text)
             self.assertNotIn("latency_ms", json_text)
             self.assertIn("Research Decision Memory Results", markdown_text)
             self.assertNotIn("| p95 ms |", markdown_text)
+
+            sample_dir = Path(__file__).parent / "results"
+            self.assertEqual(
+                json_text,
+                (sample_dir / "sample_results.json").read_text(),
+            )
+            self.assertEqual(
+                markdown_text,
+                (sample_dir / "sample_results.md").read_text(),
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/benchmarks/research-decision-memory/test_benchmark.py` around lines
44 - 59, The test test_writers_emit_reproducible_json_and_markdown should assert
that generated outputs exactly match the checked-in sample artifacts; load the
committed sample JSON and Markdown (e.g., sample_results.json and
sample_results.md), generate the outputs via run_benchmark.run() +
run_benchmark.write_json/write_markdown into the temp files, then compare them
after normalizing non-deterministic parts: for JSON, load both into Python
objects, remove/normalize ephemeral keys (timestamps, latency fields like
"p95_latency_ms"/"latency_ms", and any run-specific IDs/digests if necessary),
re-dump with json.dumps(..., sort_keys=True, separators=(',',':')) and assert
equality; for Markdown, normalize line endings and strip trailing whitespace
before asserting equality with the committed sample; add these comparisons to
test_writers_emit_reproducible_json_and_markdown so any drift in the sample
artifacts fails the test.
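
The normalization this prompt describes could look roughly like the sketch below. It is one possible reading, not code from the PR: the helper names `strip_ephemeral`, `canonical_json`, and `canonical_markdown` are hypothetical, and `"timestamp"` in `EPHEMERAL_KEYS` is an assumed key name (only the two latency keys are named in the review).

```python
import json

# Keys treated as run-specific; "timestamp" is an assumption.
EPHEMERAL_KEYS = {"p95_latency_ms", "latency_ms", "timestamp"}


def strip_ephemeral(value):
    """Recursively drop ephemeral keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {
            key: strip_ephemeral(item)
            for key, item in value.items()
            if key not in EPHEMERAL_KEYS
        }
    if isinstance(value, list):
        return [strip_ephemeral(item) for item in value]
    return value


def canonical_json(text):
    """Parse, strip ephemeral keys, and re-dump in a stable compact form."""
    cleaned = strip_ephemeral(json.loads(text))
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))


def canonical_markdown(text):
    """Normalize line endings and strip trailing whitespace per line."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip("\n")
```

A test could then assert `canonical_json(generated) == canonical_json(committed)` and the same for `canonical_markdown`, so that platform line endings or key ordering do not cause false failures while real drift in the sample artifacts still does.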

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b6d1bc6a-bde1-49f4-aaf1-6b5ba938ed91

📥 Commits

Reviewing files that changed from the base of the PR and between 262db90 and a83f6fa.

📒 Files selected for processing (6)
  • examples/benchmarks/research-decision-memory/README.md
  • examples/benchmarks/research-decision-memory/requirements.txt
  • examples/benchmarks/research-decision-memory/results/sample_results.json
  • examples/benchmarks/research-decision-memory/results/sample_results.md
  • examples/benchmarks/research-decision-memory/run_benchmark.py
  • examples/benchmarks/research-decision-memory/test_benchmark.py
