[codex] Add incident runbook memory benchmark by tanayv77 · Pull Request #739 · moorcheh-ai/memanto

tanayv77 · 2026-06-15T11:16:19Z

/claim #639

Summary

Adds a reproducible incident-response memory benchmark under examples/benchmarks/incident-runbook-memory for the bounty in #639.

The benchmark uses one dense, shifting incident dataset across three deterministic backends:

- `active_incident_digest`: Memanto-style active current-state memory with secret redaction
- `append_only_log`: archive-memory control that returns every raw matching event for the queried subject/key
- `recent_window_log`: short-context control that keeps only the five newest raw events and returns subject/key matches
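
To make the contrast between the three backends concrete, here is a minimal sketch rather than the PR's code. The class names, the `Event` shape, and the sample events are hypothetical, and secret redaction is left out for brevity.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape; the PR's MemoryEvent may carry more fields."""
    subject: str
    key: str
    value: str


class DigestSketch:
    """Keeps only the newest value per (subject, key), so stale facts are overwritten."""

    def __init__(self):
        self._state = {}

    def write(self, event):
        self._state[(event.subject, event.key)] = event.value

    def query(self, subject, key):
        value = self._state.get((subject, key))
        return [] if value is None else [value]


class AppendOnlySketch:
    """Keeps every event and returns all matches, stale ones included."""

    def __init__(self):
        self._events = []

    def write(self, event):
        self._events.append(event)

    def query(self, subject, key):
        return [e.value for e in self._events if e.subject == subject and e.key == key]


class RecentWindowSketch(AppendOnlySketch):
    """Same matching rule, but only the newest `size` events survive."""

    def __init__(self, size=5):
        self._events = deque(maxlen=size)


events = [
    Event("api-gateway", "status", "degraded"),
    Event("api-gateway", "owner", "alice"),
    Event("api-gateway", "status", "recovered"),
]
digest, log, window = DigestSketch(), AppendOnlySketch(), RecentWindowSketch(size=1)
for backend in (digest, log, window):
    for event in events:
        backend.write(event)

status_answers = [b.query("api-gateway", "status") for b in (digest, log, window)]
owner_in_window = window.query("api-gateway", "owner")
```

In this toy run the digest returns only the current status, the append-only log also returns the stale one, and the one-event window has already forgotten the owner. That is the same kind of tradeoff the accuracy and stale-conflict columns are meant to capture.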

Sample Results

| Backend | Retrieval accuracy | Avg retrieved tokens | p95 latency (ms) | Stale conflict rate | Secret leak rate |
| --- | --- | --- | --- | --- | --- |
| `active_incident_digest` | 100.0% | 5.4 | 12.1 | 0.0% | 0.0% |
| `append_only_log` | 14.3% | 15.1 | 20.6 | 85.7% | 14.3% |
| `recent_window_log` | 57.1% | 5.3 | 7.5 | 0.0% | 0.0% |

Review Follow-up

Addressed CodeRabbit's actionable review comments by:

- applying subject+key filtering consistently across baseline backends
- switching writer tests to `tempfile.TemporaryDirectory()`
- adding missing docstrings for the new benchmark/test functions and methods
- regenerating `results/sample_results.json` and `results/sample_results.md`

Reproduce

python examples/benchmarks/incident-runbook-memory/run_benchmark.py --output examples/benchmarks/incident-runbook-memory/results/sample_results.json --markdown examples/benchmarks/incident-runbook-memory/results/sample_results.md

python -m unittest discover -s examples/benchmarks/incident-runbook-memory -p "test_*.py"

Validation

python -m unittest discover -s examples/benchmarks/incident-runbook-memory -p "test_*.py"
python -m ruff check examples/benchmarks/incident-runbook-memory
python -m py_compile examples/benchmarks/incident-runbook-memory/run_benchmark.py examples/benchmarks/incident-runbook-memory/test_benchmark.py
git diff --check

Bounty Notes

This is a deterministic harness with plug-in backend boundaries, not a published hosted-provider benchmark. The README calls that out explicitly and explains how hosted Mem0/Letta/Zep/Hindsight adapters can reuse the same dataset and evaluator.
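
As a rough illustration of what a plug-in backend boundary could look like, the sketch below wraps a client behind a small interface so the same replay loop can drive any backend. Everything here is an assumption: `BackendSketch`, `FakeHostedClient`, `HostedAdapterSketch`, and `replay` are hypothetical names, the method signatures are not taken from the PR's `MemoryBackend` protocol, and no real Mem0, Letta, Zep, or Hindsight API is modeled.

```python
from typing import Protocol


class BackendSketch(Protocol):
    """Hypothetical interface; the PR's MemoryBackend protocol may differ."""

    name: str

    def write(self, subject: str, key: str, text: str) -> None: ...

    def retrieve(self, subject: str, key: str) -> list[str]: ...


class FakeHostedClient:
    """In-memory stand-in for a hosted SDK; it models no real provider API."""

    def __init__(self) -> None:
        self.rows: list[dict[str, str]] = []

    def add(self, row: dict[str, str]) -> None:
        self.rows.append(row)

    def search(self, subject: str, key: str) -> list[dict[str, str]]:
        return [r for r in self.rows if r["subject"] == subject and r["key"] == key]


class HostedAdapterSketch:
    """Wraps a client so the evaluator only ever sees write() and retrieve()."""

    name = "hosted_adapter_sketch"

    def __init__(self, client: FakeHostedClient) -> None:
        self._client = client

    def write(self, subject: str, key: str, text: str) -> None:
        self._client.add({"subject": subject, "key": key, "text": text})

    def retrieve(self, subject: str, key: str) -> list[str]:
        return [row["text"] for row in self._client.search(subject, key)]


def replay(backend: BackendSketch, events, query):
    """Feed the same events to any backend, then run a single query."""
    for subject, key, text in events:
        backend.write(subject, key, text)
    return backend.retrieve(*query)


adapter = HostedAdapterSketch(FakeHostedClient())
answer = replay(
    adapter,
    [("db-primary", "status", "failover started"), ("db-primary", "owner", "bob")],
    ("db-primary", "owner"),
)
```

Because the adapter satisfies the protocol structurally, the dataset and evaluator would not need to know whether a backend is in-memory or hosted.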

Social media showcase: pending contributor post.

Summary by CodeRabbit

New Features
- Added an incident-response “agent-memory tradeoff” benchmark with three retrieval backends and a CLI to generate deterministic JSON and Markdown reports.
- Reports accuracy, latency (estimated), token efficiency, stale-conflict rate, and secret-leak rate (including redaction checks).
Documentation
- Added detailed README describing dataset design, golden query categories (including a non-retrievable credential), and how to run/validate the benchmark.
- Added sample benchmark report output.
Tests
- Added automated benchmark tests that verify backend comparisons, the expected zero stale-conflict and secret-leak rates, and report writer output.

coderabbitai · 2026-06-15T11:16:33Z

Walkthrough

Adds a new incident-runbook-memory benchmark example. A standalone Python script defines three in-memory retrieval backends, a synthetic five-session incident event stream, golden queries with secret-leak detection, deterministic latency modeling, and JSON/Markdown report output. Unit tests, pre-generated sample results, and documentation are included.
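
Secret redaction and secret-leak detection can share a single pattern, as in the minimal sketch below. The regular expression, the `[REDACTED]` placeholder, and the function names are assumptions chosen for illustration; the PR's actual `SECRET_PATTERN` is not shown on this page.

```python
import re

# Hypothetical pattern: matches synthetic tokens such as "sk-test-abc123".
SECRET_RE = re.compile(r"\bsk-[A-Za-z0-9-]{6,}\b")


def redact(text: str) -> str:
    """Replace anything that looks like a credential before it is stored."""
    return SECRET_RE.sub("[REDACTED]", text)


def leaks_secret(text: str) -> bool:
    """Flag retrieved text that still contains a credential-shaped token."""
    return SECRET_RE.search(text) is not None


raw = "rotate key sk-test-abc123 now"
stored = redact(raw)
```

A backend that stores `raw` verbatim would be flagged by `leaks_secret`, while one that stores `stored` would not.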

Changes

Incident Runbook Memory Benchmark

| Layer / File(s) | Summary |
| --- | --- |
| Data contracts, protocol, and utilities `run_benchmark.py` | Defines `BENCHMARK_NAME`, `BENCHMARK_VERSION`, `SECRET_PATTERN`, frozen dataclasses (`MemoryEvent`, `GoldenQuery`, `Retrieval`), the `MemoryBackend` protocol, and helper functions for token counting, secret redaction, and deterministic latency estimation. |
| Three retrieval backend implementations `run_benchmark.py` | Implements `ActiveIncidentDigest` (subject/key map with secret redaction and stale-fact overwrite), `AppendOnlyLog` (verbatim append, full scan), and `RecentWindowLog` (fixed-size sliding window), each computing scanned record count and retrieved token count. |
| Synthetic dataset and golden queries `run_benchmark.py` | `build_dataset` produces an ordered five-session incident event stream; `build_queries` produces queries with expected and forbidden terms including a synthetic credential that must not appear in results. |
| Evaluation, aggregation, and runner `run_benchmark.py` | `evaluate_retrieval` scores retrieval against expected/forbidden terms and flags secret-pattern matches. `percentile` computes nearest-rank p95. `summarize_backend` aggregates accuracy, stale-conflict rate, secret-leak rate, token stats, and p95 latency. `run_benchmark` orchestrates all three backends and returns a JSON-serializable result dict. |
| CLI, report writers, and entry point `run_benchmark.py` | `write_json` and `format_percent` handle serialization; `write_markdown` validates shape and builds a per-backend metrics table with fixed Interpretation and Reproduce sections. `parse_args` and `main` provide the CLI defaulting to `results/sample_results.*`. |
| Unit tests `test_benchmark.py` | Dynamically loads `run_benchmark.py` via `importlib`; three tests assert digest outperforms log baseline on accuracy with zero leak/conflict rates, recent-window accuracy is lower than digest, and report writers produce valid JSON and Markdown files. |
| README, requirements, and sample results `README.md`, `requirements.txt`, `results/sample_results.json`, `results/sample_results.md` | README documents the scenario, backends, metrics, and run commands. `requirements.txt` declares standard-library-only dependency. Pre-generated results include per-backend aggregate metrics, per-query records with secret-leak flags, and notes on deterministic latency. |
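
The scoring and nearest-rank p95 steps described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's `evaluate_retrieval` or `percentile`: the function names, the case-insensitive substring matching, and the rule that any forbidden term counts as a stale conflict are all hypothetical.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n), 1-indexed."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def score(retrieved_text, expected_terms, forbidden_terms):
    """Correct only if every expected term is present and no forbidden term is."""
    text = retrieved_text.lower()
    has_expected = all(term.lower() in text for term in expected_terms)
    has_forbidden = any(term.lower() in text for term in forbidden_terms)
    return {
        "correct": has_expected and not has_forbidden,
        "stale_conflict": has_forbidden,
    }


latencies_ms = [7.5, 12.1, 20.6, 3.0]
p95 = nearest_rank_percentile(latencies_ms, 95)
fresh = score("status: recovered", ["recovered"], ["degraded"])
mixed = score("status: degraded; status: recovered", ["recovered"], ["degraded"])
```

With only four samples the nearest-rank p95 is simply the largest value, which is worth keeping in mind when reading p95 figures computed over a handful of queries.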

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 Hop, hop through the incident log,
Three backends compared in the fog!
The digest redacts,
The window compacts,
And secrets stay out of the bog. 🔐

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 51.52% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title clearly and concisely summarizes the main change: adding a new incident runbook memory benchmark to the examples directory. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/incident-runbook-memory/run_benchmark.py`:
- Around line 147-151: The filtering logic at both the initial location (lines
147-151) and the sibling location (lines 182-186) currently filters only by
subject using "event.subject in query.subjects", but they should also filter by
keys to match the filtering behavior of ActiveIncidentDigest, which filters by
both subject and key. Modify both filtering blocks to additionally check that
the event key is in query.keys (e.g., "and event.key in query.keys") so that all
baseline backends apply consistent subject+key filtering criteria, making
backend comparisons equivalent.
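
The requested change amounts to adding a key check next to the subject check. A minimal sketch follows; the `Event` and `Query` dataclasses and the `matches` helper are hypothetical stand-ins, and only the two membership tests come from the review comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for the benchmark's event record."""
    subject: str
    key: str
    text: str


@dataclass(frozen=True)
class Query:
    """Hypothetical stand-in for a golden query."""
    subjects: tuple
    keys: tuple


def matches(event: Event, query: Query) -> bool:
    # Subject-only filtering would stop after the first condition.
    return event.subject in query.subjects and event.key in query.keys


events = [
    Event("api-gateway", "status", "degraded"),
    Event("api-gateway", "owner", "alice"),
    Event("db-primary", "status", "healthy"),
]
query = Query(subjects=("api-gateway",), keys=("status",))
hits = [e.text for e in events if matches(e, query)]
```

With subject-only filtering the owner event would also be returned, which inflates retrieved tokens for the baselines and makes them harder to compare with the digest.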

In `@examples/benchmarks/incident-runbook-memory/test_benchmark.py`:
- Around line 64-77: Replace the hardcoded result directory path with a
temporary directory to avoid parallel run collisions and read-only checkout
failures. Remove the line that creates result_dir from a fixed repo path and
instead wrap the benchmark write and file read operations in a
tempfile.TemporaryDirectory() context manager. Create json_path and md_path
within the temporary directory, and remove the manual cleanup calls
(json_path.unlink() and md_path.unlink()) since the temporary directory will be
automatically cleaned up when exiting the context. Keep the
benchmark.write_json(), benchmark.write_markdown(), json.loads(), and
read_text() operations unchanged in their logic, only changing the target
directory references.
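
A minimal sketch of the temporary-directory pattern described above. The two writer functions are simplified stand-ins for `benchmark.write_json()` and `benchmark.write_markdown()`, and the test class name and payload are hypothetical.

```python
import json
import tempfile
import unittest
from pathlib import Path


def write_json(path: Path, payload: dict) -> None:
    """Stand-in writer; the PR's write_json may differ."""
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def write_markdown(path: Path, payload: dict) -> None:
    """Stand-in writer that emits a one-line heading."""
    path.write_text(f"# {payload['benchmark']}\n", encoding="utf-8")


class WriterSketchTest(unittest.TestCase):
    def test_writers_round_trip_in_temp_dir(self):
        payload = {"benchmark": "incident-runbook-memory"}
        with tempfile.TemporaryDirectory() as tmp:
            json_path = Path(tmp) / "sample_results.json"
            md_path = Path(tmp) / "sample_results.md"
            write_json(json_path, payload)
            write_markdown(md_path, payload)
            self.assertEqual(json.loads(json_path.read_text(encoding="utf-8")), payload)
            self.assertIn("incident-runbook-memory", md_path.read_text(encoding="utf-8"))
        # The directory and both files are removed on exit; no unlink() calls needed.
        self.assertFalse(Path(tmp).exists())
```

Each run gets its own directory, so parallel runs cannot collide and the test never writes into the repository checkout.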

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 66614bf5-25df-444a-8032-6bb7b77257ce

📥 Commits

Reviewing files that changed from the base of the PR and between 262db90 and 7c41b48.

📒 Files selected for processing (6)

examples/benchmarks/incident-runbook-memory/README.md
examples/benchmarks/incident-runbook-memory/requirements.txt
examples/benchmarks/incident-runbook-memory/results/sample_results.json
examples/benchmarks/incident-runbook-memory/results/sample_results.md
examples/benchmarks/incident-runbook-memory/run_benchmark.py
examples/benchmarks/incident-runbook-memory/test_benchmark.py

Add incident runbook memory benchmark

7c41b48

tanayv77 mentioned this pull request Jun 15, 2026

[BOUNTY $100] 🐜 The Great Agentic Memory Showdown: Memanto Benchmarking & Evaluation Challenge #639

Open

coderabbitai Bot reviewed Jun 15, 2026

Comment thread examples/benchmarks/incident-runbook-memory/run_benchmark.py

Comment thread examples/benchmarks/incident-runbook-memory/test_benchmark.py Outdated

Address incident benchmark review comments

2b8eb5b

tanayv77 wants to merge 2 commits into moorcheh-ai:main from tanayv77:codex/incident-memory-benchmark-639
