
[codex] Add incident runbook memory benchmark#739

Open
tanayv77 wants to merge 2 commits into moorcheh-ai:main from tanayv77:codex/incident-memory-benchmark-639

Conversation

tanayv77 commented Jun 15, 2026


/claim #639

Summary

Adds a reproducible incident-response memory benchmark under examples/benchmarks/incident-runbook-memory for the bounty in #639.

The benchmark uses one dense, shifting incident dataset across three deterministic backends:

  • active_incident_digest: Memanto-style active current-state memory with secret redaction
  • append_only_log: archive-memory control that returns every raw matching event for the queried subject/key
  • recent_window_log: short-context control that keeps only the five newest raw events and returns subject/key matches
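
The difference between the three backends can be sketched in a few lines. This is a minimal illustration, not the PR's implementation: the `Event` fields, the `redact` helper, the `sk-...` secret pattern, and the sample events are all assumptions made for the example. Only the window size of five and the subject/key matching come from the description above.

```python
import re
from collections import deque
from dataclasses import dataclass

# Hypothetical secret pattern; the PR's actual SECRET_PATTERN is not shown here.
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{8,}")


@dataclass(frozen=True)
class Event:
    """Illustrative event shape (assumed fields)."""
    subject: str
    key: str
    value: str


def redact(text: str) -> str:
    """Replace anything that looks like a secret with a placeholder."""
    return SECRET_PATTERN.sub("[REDACTED]", text)


class ActiveDigest:
    """Keeps only the latest redacted value per (subject, key)."""

    def __init__(self) -> None:
        self._state: dict[tuple[str, str], str] = {}

    def write(self, event: Event) -> None:
        # A newer fact overwrites the older one, so stale values cannot be returned.
        self._state[(event.subject, event.key)] = redact(event.value)

    def query(self, subject: str, key: str) -> list[str]:
        value = self._state.get((subject, key))
        return [] if value is None else [value]


class AppendOnlyLog:
    """Stores every raw event and returns all subject/key matches."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def write(self, event: Event) -> None:
        self._events.append(event)

    def query(self, subject: str, key: str) -> list[str]:
        return [e.value for e in self._events if e.subject == subject and e.key == key]


class RecentWindowLog:
    """Keeps only the newest `window` raw events."""

    def __init__(self, window: int = 5) -> None:
        self._events: deque[Event] = deque(maxlen=window)

    def write(self, event: Event) -> None:
        self._events.append(event)  # the oldest event is dropped once the window is full

    def query(self, subject: str, key: str) -> list[str]:
        return [e.value for e in self._events if e.subject == subject and e.key == key]


stream = [
    Event("db-primary", "status", "degraded"),
    Event("db-primary", "failover_target", "replica-2"),
    Event("api", "deploy_token", "rotated to sk-abcdef123456"),
    Event("db-primary", "status", "healthy"),
    Event("api", "owner", "team-a"),
    Event("api", "owner", "team-b"),
    Event("cache", "status", "warm"),
]

digest, log, recent = ActiveDigest(), AppendOnlyLog(), RecentWindowLog(window=5)
for event in stream:
    for backend in (digest, log, recent):
        backend.write(event)

print(digest.query("db-primary", "status"))           # current state only
print(log.query("db-primary", "status"))              # stale and current values together
print(recent.query("db-primary", "failover_target"))  # fact has aged out of the window
print(digest.query("api", "deploy_token"))            # secret is redacted
```

In this toy stream the append-only log returns both the stale and the current status, the window log has already forgotten the failover target, and only the digest redacts the token. These are the failure modes the sample results below are meant to quantify.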

Sample Results

| Backend | Retrieval accuracy | Avg retrieved tokens | p95 latency (ms) | Stale conflict rate | Secret leak rate |
| --- | --- | --- | --- | --- | --- |
| active_incident_digest | 100.0% | 5.4 | 12.1 | 0.0% | 0.0% |
| append_only_log | 14.3% | 15.1 | 20.6 | 85.7% | 14.3% |
| recent_window_log | 57.1% | 5.3 | 7.5 | 0.0% | 0.0% |

Review Follow-up

Addressed CodeRabbit's actionable review comments by:

  • applying subject+key filtering consistently across baseline backends
  • switching writer tests to tempfile.TemporaryDirectory()
  • adding missing docstrings for the new benchmark/test functions and methods
  • regenerating results/sample_results.json and results/sample_results.md
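
The `tempfile.TemporaryDirectory()` change can be illustrated with a small writer test. This is a sketch under assumptions: the `write_json` function below is a stand-in defined for the example, and its `(result, path)` signature is a guess rather than the signature used in `run_benchmark.py`.

```python
import json
import tempfile
import unittest
from pathlib import Path


def write_json(result: dict, path: Path) -> None:
    """Stand-in writer for illustration; the PR's write_json may differ."""
    path.write_text(json.dumps(result, indent=2, sort_keys=True), encoding="utf-8")


class WriterTest(unittest.TestCase):
    """Checks that the writer round-trips without touching the repository tree."""

    def test_write_json_round_trips(self) -> None:
        result = {"benchmark": "incident-runbook-memory", "backends": []}
        with tempfile.TemporaryDirectory() as tmp:
            json_path = Path(tmp) / "sample_results.json"
            write_json(result, json_path)
            loaded = json.loads(json_path.read_text(encoding="utf-8"))
            self.assertEqual(loaded, result)
        # The directory and its files are removed on exit, so no manual unlink is needed.
        self.assertFalse(json_path.exists())
```

Because each run gets its own directory, parallel test runs cannot collide and a read-only checkout does not cause failures. The test can be run with `python -m unittest` from the directory containing the file.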

Reproduce

python examples/benchmarks/incident-runbook-memory/run_benchmark.py --output examples/benchmarks/incident-runbook-memory/results/sample_results.json --markdown examples/benchmarks/incident-runbook-memory/results/sample_results.md

python -m unittest discover -s examples/benchmarks/incident-runbook-memory -p "test_*.py"

Validation

  • python -m unittest discover -s examples/benchmarks/incident-runbook-memory -p "test_*.py"
  • python -m ruff check examples/benchmarks/incident-runbook-memory
  • python -m py_compile examples/benchmarks/incident-runbook-memory/run_benchmark.py examples/benchmarks/incident-runbook-memory/test_benchmark.py
  • git diff --check

Bounty Notes

This is a deterministic harness with plug-in backend boundaries, not a published hosted-provider benchmark. The README calls that out explicitly and explains how hosted Mem0/Letta/Zep/Hindsight adapters can reuse the same dataset and evaluator.

Social media showcase: pending contributor post.

Summary by CodeRabbit

  • New Features

    • Added an incident-response “agent-memory tradeoff” benchmark with three retrieval backends and a CLI to generate deterministic JSON and Markdown reports.
    • Reports accuracy, latency (estimated), token efficiency, stale-conflict rate, and secret-leak rate (including redaction checks).
  • Documentation

    • Added detailed README describing dataset design, golden query categories (including a non-retrievable credential), and how to run/validate the benchmark.
    • Added sample benchmark report output.
  • Tests

    • Added automated benchmark tests verifying backend comparisons, zero stale/conflict and secret-leak expectations, and report writer output.

coderabbitai Bot commented Jun 15, 2026

Walkthrough

Adds a new incident-runbook-memory benchmark example. A standalone Python script defines three in-memory retrieval backends, a synthetic five-session incident event stream, golden queries with secret-leak detection, deterministic latency modeling, and JSON/Markdown report output. Unit tests, pre-generated sample results, and documentation are included.

Changes

Incident Runbook Memory Benchmark

| Layer | File(s) | Summary |
| --- | --- | --- |
| Data contracts, protocol, and utilities | run_benchmark.py | Defines BENCHMARK_NAME, BENCHMARK_VERSION, SECRET_PATTERN, frozen dataclasses (MemoryEvent, GoldenQuery, Retrieval), the MemoryBackend protocol, and helper functions for token counting, secret redaction, and deterministic latency estimation. |
| Three retrieval backend implementations | run_benchmark.py | Implements ActiveIncidentDigest (subject/key map with secret redaction and stale-fact overwrite), AppendOnlyLog (verbatim append, full scan), and RecentWindowLog (fixed-size sliding window), each computing scanned record count and retrieved token count. |
| Synthetic dataset and golden queries | run_benchmark.py | build_dataset produces an ordered five-session incident event stream; build_queries produces queries with expected and forbidden terms, including a synthetic credential that must not appear in results. |
| Evaluation, aggregation, and runner | run_benchmark.py | evaluate_retrieval scores retrieval against expected/forbidden terms and flags secret-pattern matches. percentile computes nearest-rank p95. summarize_backend aggregates accuracy, stale-conflict rate, secret-leak rate, token stats, and p95 latency. run_benchmark orchestrates all three backends and returns a JSON-serializable result dict. |
| CLI, report writers, and entry point | run_benchmark.py | write_json and format_percent handle serialization; write_markdown validates shape and builds a per-backend metrics table with fixed Interpretation and Reproduce sections. parse_args and main provide the CLI, defaulting to results/sample_results.*. |
| Unit tests | test_benchmark.py | Dynamically loads run_benchmark.py via importlib; three tests assert that the digest outperforms the log baseline on accuracy with zero leak/conflict rates, that recent-window accuracy is lower than the digest's, and that the report writers produce valid JSON and Markdown files. |
| README, requirements, and sample results | README.md, requirements.txt, results/sample_results.json, results/sample_results.md | README documents the scenario, backends, metrics, and run commands. requirements.txt declares a standard-library-only dependency set. Pre-generated results include per-backend aggregate metrics, per-query records with secret-leak flags, and notes on deterministic latency. |
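
The evaluation row above mentions term-based scoring and a nearest-rank p95. A rough sketch of both ideas follows. The function names mirror the summary, but the bodies, the return shape, the `sk-...` pattern, and the rule that a leak or forbidden term makes an answer incorrect are assumptions for illustration, not the PR's code.

```python
import math
import re

# Hypothetical secret pattern; the PR's SECRET_PATTERN is not reproduced here.
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{8,}")


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_retrieval(text: str, expected: list[str], forbidden: list[str]) -> dict[str, bool]:
    """Score one retrieved text against expected and forbidden terms (assumed rule)."""
    lowered = text.lower()
    has_expected = all(term.lower() in lowered for term in expected)
    stale_conflict = any(term.lower() in lowered for term in forbidden)
    secret_leak = SECRET_PATTERN.search(text) is not None
    return {
        "correct": has_expected and not stale_conflict and not secret_leak,
        "stale_conflict": stale_conflict,
        "secret_leak": secret_leak,
    }


clean = evaluate_retrieval("status healthy after failover", ["healthy"], ["degraded"])
stale = evaluate_retrieval("degraded, later healthy", ["healthy"], ["degraded"])
leaky = evaluate_retrieval("token is sk-abcdef123456", ["token"], [])
p95 = percentile([7.5, 12.1, 20.6, 9.0, 8.2], 95)
print(clean, stale, leaky, p95)
```

With only five samples, the nearest-rank p95 is simply the largest value, which is worth keeping in mind when reading p95 figures computed over a small query set.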

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 Hop, hop through the incident log,
Three backends compared in the fog!
The digest redacts,
The window compacts,
And secrets stay out of the bog. 🔐

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 51.52%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title clearly and concisely summarizes the main change: adding a new incident runbook memory benchmark to the examples directory. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/incident-runbook-memory/run_benchmark.py`:
- Around line 147-151: The filtering logic at both the initial location (lines
147-151) and the sibling location (lines 182-186) currently filters only by
subject using "event.subject in query.subjects", but they should also filter by
keys to match the filtering behavior of ActiveIncidentDigest, which filters by
both subject and key. Modify both filtering blocks to additionally check that
the event key is in query.keys (e.g., "and event.key in query.keys") so that all
baseline backends apply consistent subject+key filtering criteria, making
backend comparisons equivalent.

In `@examples/benchmarks/incident-runbook-memory/test_benchmark.py`:
- Around line 64-77: Replace the hardcoded result directory path with a
temporary directory to avoid parallel run collisions and read-only checkout
failures. Remove the line that creates result_dir from a fixed repo path and
instead wrap the benchmark write and file read operations in a
tempfile.TemporaryDirectory() context manager. Create json_path and md_path
within the temporary directory, and remove the manual cleanup calls
(json_path.unlink() and md_path.unlink()) since the temporary directory will be
automatically cleaned up when exiting the context. Keep the
benchmark.write_json(), benchmark.write_markdown(), json.loads(), and
read_text() operations unchanged in their logic, only changing the target
directory references.
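
The subject+key filtering requested for `run_benchmark.py` can be sketched as a single shared predicate. The attribute names `event.subject`, `event.key`, `query.subjects`, and `query.keys` come from the review comment; the dataclass layouts, the `text` field, the `matches` helper, and the sample events are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEvent:
    """Illustrative event shape; the PR's MemoryEvent may carry more fields."""
    subject: str
    key: str
    text: str


@dataclass(frozen=True)
class GoldenQuery:
    """Illustrative query shape with the two filter dimensions."""
    subjects: tuple[str, ...]
    keys: tuple[str, ...]


def matches(event: MemoryEvent, query: GoldenQuery) -> bool:
    """Hypothetical shared predicate: an event matches on both subject and key."""
    return event.subject in query.subjects and event.key in query.keys


events = [
    MemoryEvent("checkout-api", "owner", "owner is team-payments"),
    MemoryEvent("checkout-api", "status", "status is degraded"),
    MemoryEvent("search-api", "owner", "owner is team-search"),
]
query = GoldenQuery(subjects=("checkout-api",), keys=("owner",))

# Subject-only filtering also pulls in the unrelated status event.
subject_only = [e.text for e in events if e.subject in query.subjects]
# Subject+key filtering returns only what the query actually asks about.
subject_and_key = [e.text for e in events if matches(e, query)]

print(subject_only)
print(subject_and_key)
```

Routing every baseline through one predicate like this keeps the comparison with the digest backend fair, since all backends then answer the same question.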

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 66614bf5-25df-444a-8032-6bb7b77257ce

📥 Commits

Reviewing files that changed from the base of the PR and between 262db90 and 7c41b48.

📒 Files selected for processing (6)
  • examples/benchmarks/incident-runbook-memory/README.md
  • examples/benchmarks/incident-runbook-memory/requirements.txt
  • examples/benchmarks/incident-runbook-memory/results/sample_results.json
  • examples/benchmarks/incident-runbook-memory/results/sample_results.md
  • examples/benchmarks/incident-runbook-memory/run_benchmark.py
  • examples/benchmarks/incident-runbook-memory/test_benchmark.py
