[codex] Add incident runbook memory benchmark #739
📝 Walkthrough
Adds a new
Changes
Incident Runbook Memory Benchmark
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/benchmarks/incident-runbook-memory/run_benchmark.py`:
- Around line 147-151: The filtering logic at both the initial location (lines
147-151) and the sibling location (lines 182-186) currently filters only by
subject using "event.subject in query.subjects", but they should also filter by
keys to match the filtering behavior of ActiveIncidentDigest, which filters by
both subject and key. Modify both filtering blocks to additionally check that
the event key is in query.keys (e.g., "and event.key in query.keys") so that all
baseline backends apply consistent subject+key filtering criteria, making
backend comparisons equivalent.
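The combined subject+key filter described above might look roughly like the sketch below. The `Event` and `Query` dataclasses and the `matching_events` helper are hypothetical stand-ins for illustration; the actual types and loop structure in run_benchmark.py may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real fields in run_benchmark.py may differ.
    subject: str
    key: str
    value: str


@dataclass(frozen=True)
class Query:
    # Hypothetical query shape carrying the two filter dimensions from the review.
    subjects: frozenset
    keys: frozenset


def matching_events(events, query):
    """Return events whose subject AND key are both requested by the query."""
    return [
        event
        for event in events
        if event.subject in query.subjects and event.key in query.keys
    ]


# Illustrative data only, not taken from the benchmark dataset.
events = [
    Event("payments-api", "status", "degraded"),
    Event("payments-api", "owner", "alice"),
    Event("search-api", "status", "healthy"),
]
query = Query(subjects=frozenset({"payments-api"}), keys=frozenset({"status"}))
print(matching_events(events, query))
```

With a subject-only filter the `owner` event would also be returned, which is the inconsistency the comment asks to remove.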
In `@examples/benchmarks/incident-runbook-memory/test_benchmark.py`:
- Around line 64-77: Replace the hardcoded result directory path with a
temporary directory to avoid parallel run collisions and read-only checkout
failures. Remove the line that creates result_dir from a fixed repo path and
instead wrap the benchmark write and file read operations in a
tempfile.TemporaryDirectory() context manager. Create json_path and md_path
within the temporary directory, and remove the manual cleanup calls
(json_path.unlink() and md_path.unlink()) since the temporary directory will be
automatically cleaned up when exiting the context. Keep the
benchmark.write_json(), benchmark.write_markdown(), json.loads(), and
read_text() operations unchanged in their logic, only changing the target
directory references.
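A minimal sketch of the temporary-directory pattern described above, assuming `write_json` and `write_markdown` accept a results object and a target path. Both functions here are hypothetical stand-ins for `benchmark.write_json()` and `benchmark.write_markdown()`, whose real signatures are not shown in this review.

```python
import json
import tempfile
import unittest
from pathlib import Path


def write_json(results, path):
    # Hypothetical stand-in for benchmark.write_json; the real signature may differ.
    Path(path).write_text(json.dumps(results), encoding="utf-8")


def write_markdown(results, path):
    # Hypothetical stand-in for benchmark.write_markdown.
    lines = [f"- {name}: {score}" for name, score in results.items()]
    Path(path).write_text("\n".join(lines), encoding="utf-8")


class WriteOutputsTest(unittest.TestCase):
    def test_outputs_land_in_temp_dir(self):
        results = {"example_backend": 1.0}  # placeholder data, not benchmark output
        with tempfile.TemporaryDirectory() as tmp:
            json_path = Path(tmp) / "results.json"
            md_path = Path(tmp) / "results.md"
            write_json(results, json_path)
            write_markdown(results, md_path)
            loaded = json.loads(json_path.read_text(encoding="utf-8"))
            self.assertEqual(loaded, results)
            self.assertIn("example_backend", md_path.read_text(encoding="utf-8"))
        # Leaving the context removes the directory, so no unlink() calls are needed.
        self.assertFalse(json_path.exists())
```

Because each run gets its own directory, parallel test runs cannot collide and nothing is written into the checkout.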
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 66614bf5-25df-444a-8032-6bb7b77257ce
📒 Files selected for processing (6)
examples/benchmarks/incident-runbook-memory/README.md
examples/benchmarks/incident-runbook-memory/requirements.txt
examples/benchmarks/incident-runbook-memory/results/sample_results.json
examples/benchmarks/incident-runbook-memory/results/sample_results.md
examples/benchmarks/incident-runbook-memory/run_benchmark.py
examples/benchmarks/incident-runbook-memory/test_benchmark.py
/claim #639
Summary
Adds a reproducible incident-response memory benchmark under examples/benchmarks/incident-runbook-memory for the bounty in #639.
The benchmark uses one dense, shifting incident dataset across three deterministic backends:
- active_incident_digest: Memanto-style active current-state memory with secret redaction
- append_only_log: archive-memory control that returns every raw matching event for the queried subject/key
- recent_window_log: short-context control that keeps only the five newest raw events and returns subject/key matches
Sample Results
- active_incident_digest
- append_only_log
- recent_window_log
Review Follow-up
Addressed CodeRabbit's actionable review comments by:
- tempfile.TemporaryDirectory()
- results/sample_results.json and results/sample_results.md
Reproduce
python examples/benchmarks/incident-runbook-memory/run_benchmark.py --output examples/benchmarks/incident-runbook-memory/results/sample_results.json --markdown examples/benchmarks/incident-runbook-memory/results/sample_results.md
python -m unittest discover -s examples/benchmarks/incident-runbook-memory -p test_*.py
Validation
python -m unittest discover -s examples/benchmarks/incident-runbook-memory -p test_*.py
python -m ruff check examples/benchmarks/incident-runbook-memory
python -m py_compile examples/benchmarks/incident-runbook-memory/run_benchmark.py examples/benchmarks/incident-runbook-memory/test_benchmark.py
git diff --check
Bounty Notes
This is a deterministic harness with plug-in backend boundaries, not a published hosted-provider benchmark. The README calls that out explicitly and explains how hosted Mem0/Letta/Zep/Hindsight adapters can reuse the same dataset and evaluator.
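As a rough illustration of what a plug-in backend boundary could look like, the sketch below defines a small interface and three toy backends that loosely mirror the descriptions in the summary. The `MemoryBackend` protocol, the `ingest`/`recall` method names, the tuple-shaped events, and the `run` helper are all assumptions made for this example; the harness's real interface may differ, and secret redaction is not modeled here.

```python
from collections import deque
from typing import Dict, Iterable, List, Protocol, Tuple

# Hypothetical event shape: (subject, key, value).
Event = Tuple[str, str, str]


class MemoryBackend(Protocol):
    """Assumed boundary that a hosted adapter could also implement."""

    def ingest(self, event: Event) -> None: ...

    def recall(self, subject: str, key: str) -> List[Event]: ...


class AppendOnlyLog:
    """Keeps every raw event and returns all subject/key matches."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def ingest(self, event: Event) -> None:
        self._events.append(event)

    def recall(self, subject: str, key: str) -> List[Event]:
        return [e for e in self._events if e[0] == subject and e[1] == key]


class RecentWindowLog:
    """Keeps only the newest `window` raw events."""

    def __init__(self, window: int = 5) -> None:
        self._events = deque(maxlen=window)

    def ingest(self, event: Event) -> None:
        self._events.append(event)

    def recall(self, subject: str, key: str) -> List[Event]:
        return [e for e in self._events if e[0] == subject and e[1] == key]


class ActiveDigest:
    """Keeps only the latest value per (subject, key) pair."""

    def __init__(self) -> None:
        self._state: Dict[Tuple[str, str], Event] = {}

    def ingest(self, event: Event) -> None:
        self._state[(event[0], event[1])] = event

    def recall(self, subject: str, key: str) -> List[Event]:
        hit = self._state.get((subject, key))
        return [hit] if hit is not None else []


def run(backend: MemoryBackend, events: Iterable[Event], subject: str, key: str) -> List[Event]:
    """Feed the same events to any backend, then ask the same question."""
    for event in events:
        backend.ingest(event)
    return backend.recall(subject, key)


# Illustrative events only, not the benchmark dataset.
EVENTS: List[Event] = [
    ("db", "status", "down"),
    ("db", "status", "degraded"),
    ("api", "status", "ok"),
    ("api", "owner", "bob"),
    ("cache", "status", "ok"),
    ("cache", "owner", "eve"),
    ("db", "status", "recovered"),
]

for backend in (AppendOnlyLog(), RecentWindowLog(), ActiveDigest()):
    print(type(backend).__name__, run(backend, EVENTS, "db", "status"))
```

Keeping the dataset and the `run` step fixed while swapping only the backend object is what would let a hosted adapter be compared against the same evaluator.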
Social media showcase: pending contributor post.