Add research decision memory benchmark #735
No actionable comments were generated in the recent review.
📝 Walkthrough
This PR introduces a complete benchmark suite for comparing four in-memory decision retrieval strategies.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (1)
examples/benchmarks/research-decision-memory/test_benchmark.py (1)
44-59: ⚡ Quick win: Add golden-file parity assertions for checked-in sample artifacts.
This test currently validates structure and content markers, but it does not check that the generated reproducible outputs match the committed `results/sample_results.json` and `results/sample_results.md`. That allows the sample artifacts to drift silently.
🔧 Proposed update
     def test_writers_emit_reproducible_json_and_markdown(self) -> None:
         payload = run_benchmark.run()
         with tempfile.TemporaryDirectory() as tmpdir:
             json_path = Path(tmpdir) / "results.json"
             md_path = Path(tmpdir) / "results.md"
             run_benchmark.write_json(json_path, payload)
             run_benchmark.write_markdown(md_path, payload)
             json_text = json_path.read_text()
             markdown_text = md_path.read_text()
             self.assertIn("active_decision_digest", json_text)
             self.assertNotIn("p95_latency_ms", json_text)
             self.assertNotIn("latency_ms", json_text)
             self.assertIn("Research Decision Memory Results", markdown_text)
             self.assertNotIn("| p95 ms |", markdown_text)
+
+            sample_dir = Path(__file__).parent / "results"
+            self.assertEqual(
+                json_text,
+                (sample_dir / "sample_results.json").read_text(),
+            )
+            self.assertEqual(
+                markdown_text,
+                (sample_dir / "sample_results.md").read_text(),
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/research-decision-memory/test_benchmark.py` around lines 44 - 59, The test test_writers_emit_reproducible_json_and_markdown should assert that generated outputs exactly match the checked-in sample artifacts; load the committed sample JSON and Markdown (e.g., sample_results.json and sample_results.md), generate the outputs via run_benchmark.run() + run_benchmark.write_json/write_markdown into the temp files, then compare them after normalizing non-deterministic parts: for JSON, load both into Python objects, remove/normalize ephemeral keys (timestamps, latency fields like "p95_latency_ms"/"latency_ms", and any run-specific IDs/digests if necessary), re-dump with json.dumps(..., sort_keys=True, separators=(',',':')) and assert equality; for Markdown, normalize line endings and strip trailing whitespace before asserting equality with the committed sample; add these comparisons to test_writers_emit_reproducible_json_and_markdown so any drift in the sample artifacts fails the test.
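The normalization steps described in this prompt could be sketched roughly as follows. This is a minimal illustration, not code from the PR: the helper names `strip_ephemeral`, `normalize_json_text`, and `normalize_markdown_text` and the `EPHEMERAL_KEYS` set are assumptions, and the keys that actually need dropping depend on what `run_benchmark.run()` emits.

```python
import json
from typing import Any

# Hypothetical set of run-specific keys; adjust to the real payload.
EPHEMERAL_KEYS = {"timestamp", "p95_latency_ms", "latency_ms"}


def strip_ephemeral(value: Any) -> Any:
    """Recursively drop ephemeral keys from dicts nested in dicts and lists."""
    if isinstance(value, dict):
        return {
            key: strip_ephemeral(item)
            for key, item in value.items()
            if key not in EPHEMERAL_KEYS
        }
    if isinstance(value, list):
        return [strip_ephemeral(item) for item in value]
    return value


def normalize_json_text(text: str) -> str:
    """Parse, strip ephemeral keys, and re-dump in a canonical compact form."""
    cleaned = strip_ephemeral(json.loads(text))
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))


def normalize_markdown_text(text: str) -> str:
    """Normalize line endings and strip trailing whitespace on every line."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip("\n") + "\n"
```

In the test, both the generated text and the committed sample text would pass through the same helper before `assertEqual`, so only substantive drift fails the comparison.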
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: b6d1bc6a-bde1-49f4-aaf1-6b5ba938ed91
📒 Files selected for processing (6)
examples/benchmarks/research-decision-memory/README.md
examples/benchmarks/research-decision-memory/requirements.txt
examples/benchmarks/research-decision-memory/results/sample_results.json
examples/benchmarks/research-decision-memory/results/sample_results.md
examples/benchmarks/research-decision-memory/run_benchmark.py
examples/benchmarks/research-decision-memory/test_benchmark.py
/claim #639
BountyHub bounty: https://www.bountyhub.dev/bounty/view/ec58c800-823e-4134-b027-99838e096d4b
Summary
Adds `examples/benchmarks/research-decision-memory`, a credential-free benchmark for issue #639 focused on long-running research and product agents that must preserve the current decision trail while suppressing superseded assumptions.

The scenario tracks target segment, pricing, deployment constraints, launch readiness, data handling, customer ownership, model routing, and the primary benchmark metric across multiple sessions where early hypotheses are corrected by later evidence.
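To make the scenario concrete, here is a toy sketch of a decision trail in which a later session corrects an earlier hypothesis. It is not the benchmark's implementation: the `DecisionEvent` type, the `TRAIL` data, and the three helper functions are hypothetical, and they only illustrate how keeping the latest value per slot differs from returning a raw log or a recent window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionEvent:
    session: int  # session in which the decision was recorded
    slot: str     # decision slot, e.g. "pricing"
    value: str    # decision text at that point in time


# Hypothetical trail: the session 1 pricing hypothesis is corrected in session 3.
TRAIL = [
    DecisionEvent(1, "pricing", "per-seat"),
    DecisionEvent(1, "target_segment", "mid-market"),
    DecisionEvent(3, "pricing", "usage-based"),
]


def current_decisions(trail):
    """Keep only the latest value per slot; later events supersede earlier ones."""
    latest = {}
    for event in sorted(trail, key=lambda e: e.session):
        latest[event.slot] = event.value
    return latest


def recent_window(trail, min_session):
    """Keep only events from recent sessions; older but still valid decisions are lost."""
    return {e.slot: e.value for e in trail if e.session >= min_session}


def raw_log(trail):
    """Return every recorded value per slot, stale ones included."""
    history = {}
    for event in trail:
        history.setdefault(event.slot, []).append(event.value)
    return history
```

With this data, `current_decisions(TRAIL)` keeps the corrected pricing and the still-valid target segment, `recent_window(TRAIL, 2)` drops the target segment entirely, and `raw_log(TRAIL)` returns both the stale and the corrected pricing values.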
Strategies compared
- `active_decision_digest`: Memanto-style typed digest with current decision slots, evidence IDs, and redaction of erased facts
- `passive_graph_history`: graph-style historical state memory that keeps conflicting states unless reconciled by the caller
- `append_only_log`: raw transcript retrieval that brings back stale assumptions and synthetic secrets
- `recent_window_log`: sliding recent-window baseline that avoids old stale facts but forgets older still-valid decisions

Sample results
Validation
    py -3 examples/benchmarks/research-decision-memory/run_benchmark.py --output examples/benchmarks/research-decision-memory/results/sample_results.json --markdown examples/benchmarks/research-decision-memory/results/sample_results.md
    py -3 -m unittest discover -s examples/benchmarks/research-decision-memory -p test_*.py
    py -3 -m py_compile examples/benchmarks/research-decision-memory/run_benchmark.py examples/benchmarks/research-decision-memory/test_benchmark.py
    git diff --check

Local validation passed: report generation, 5 unittest tests, py_compile, and git diff --check.
Social showcase: pending. I did not fabricate a Reddit/X link from this environment.
No private payout details or real secrets are included.