Add policy drift memory benchmark #742
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID:
📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
A new self-contained benchmark module is added at
Changes
Policy Drift Memory Benchmark
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related issues
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/benchmarks/policy_drift_memory/benchmark.py`:
- Around line 245-249: The `--repeats` argument currently accepts zero or
negative integers, which causes the run_once function to skip retrieval loops
and subsequently fail at the assertion on line 201 when result is None. Add
validation to the `--repeats` argument parser definition to ensure only positive
integers (greater than 0) are accepted. This can be accomplished by using a
custom type function that validates the integer is positive, or by adding
appropriate constraints to the argument configuration.
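The custom type function suggested above could look roughly like the following. This is a minimal sketch, not the PR's actual parser: `positive_int`, `build_parser`, the description string, and the default of `3` are assumptions made for illustration, and only the `--repeats` flag name comes from the review comment.

```python
import argparse


def positive_int(value: str) -> int:
    """Parse `value` as an int and reject anything below 1."""
    try:
        parsed = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer") from None
    if parsed < 1:
        raise argparse.ArgumentTypeError(f"{value!r} must be greater than 0")
    return parsed


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: the real benchmark.py presumably defines more options.
    parser = argparse.ArgumentParser(description="policy drift benchmark (sketch)")
    parser.add_argument(
        "--repeats",
        type=positive_int,  # argparse calls this on the raw string argument
        default=3,          # assumed default, not taken from the PR
        help="number of retrieval repetitions (must be > 0)",
    )
    return parser
```

With this in place, `--repeats 0` or `--repeats -2` is rejected by argparse with a usage error before `run_once` is ever called, so the later assertion on `result` is never reached with an empty loop.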
In `@examples/benchmarks/policy_drift_memory/README.md`:
- Line 4: The line starting with `#639: retrieval accuracy versus resource
footprint when instructions mutate` violates markdownlint's MD018 rule because
it begins with a `#` character outside of a proper Markdown header context.
Rephrase this line to avoid starting with the `#` symbol, such as by moving the
issue number reference to after an introductory word or restructuring the text
so the line begins with a letter or other non-hash character while preserving
the content meaning.
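For context, markdownlint's MD018 flags ATX-style headings that have no space after the leading hash characters, which is why a line starting with `#639:` trips it. The check below is a rough local approximation, assuming a simple regex is good enough for a README; `find_md018_candidates` is a hypothetical helper, not part of markdownlint, and it skips fenced code blocks only in a naive way.

```python
import re

# Approximation of MD018: one or more '#' at the start of a line, followed
# directly by a character that is neither '#' nor whitespace.
MD018_PATTERN = re.compile(r"^#+[^#\s]")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def find_md018_candidates(text: str) -> list[int]:
    """Return 1-based line numbers that look like MD018 violations."""
    hits = []
    in_fence = False
    for number, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on every fence marker
            continue
        if not in_fence and MD018_PATTERN.match(line):
            hits.append(number)
    return hits
```

Rewording the README line so it starts with a word, for example `For #639: ...`, makes it fall outside this pattern while keeping the issue reference.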
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: b5caec22-1197-4568-900c-e14306ec9e97
📒 Files selected for processing (4)
examples/benchmarks/policy_drift_memory/README.md
examples/benchmarks/policy_drift_memory/benchmark.py
examples/benchmarks/policy_drift_memory/dataset.json
examples/benchmarks/policy_drift_memory/requirements.txt
For #639
BountyHub bounty: https://www.bountyhub.dev/bounty/view/ec58c800-823e-4134-b027-99838e096d4b
Showcase: https://gist.github.com/wly12312/76303a6655e269c94c31b8e5e9fd7883
Summary
This PR adds `examples/benchmarks/policy_drift_memory`, a deterministic benchmark for agent-memory policy drift. It evaluates whether memory backends can keep current policy/preferences while suppressing stale but historically true facts.

The benchmark compares:
- `memanto_style_active_digest`
- `episode_graph_baseline`
- `recent_window_3`
- `append_only_log`

It includes a source dataset, scoring logic, and reproducible run instructions, and it requires no external dependencies for the default offline path.
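To make the "keep current, suppress stale" idea concrete, here is a toy sketch of how such a comparison might be scored. Everything in it is an assumption for illustration: the `Event` type, the two simplified backends (loosely inspired by the `append_only_log` and `recent_window_3` names), the `score` metrics, and the sample history are hypothetical and are not taken from the PR's `benchmark.py` or `dataset.json`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    key: str    # policy or preference being set, e.g. "deploy_window"
    value: str  # value stated at that point in the history


def append_only_log(history: list[Event], key: str) -> list[str]:
    """Toy backend: return every value ever recorded for `key`."""
    return [e.value for e in history if e.key == key]


def recent_window(history: list[Event], key: str, size: int = 3) -> list[str]:
    """Toy backend: only consult the last `size` events of the history."""
    return [e.value for e in history[-size:] if e.key == key]


def score(retrieved: list[str], current: str, stale: set[str]) -> dict[str, float]:
    """Hypothetical metrics: current-value hit and fraction of stale results."""
    hit = 1.0 if current in retrieved else 0.0
    leaked = sum(1 for value in retrieved if value in stale)
    leak_rate = leaked / len(retrieved) if retrieved else 0.0
    return {"current_hit": hit, "stale_leak_rate": leak_rate}


# Made-up history in which the deploy window policy changes twice.
HISTORY = [
    Event("deploy_window", "fridays"),
    Event("review_count", "1"),
    Event("deploy_window", "tuesdays"),
    Event("review_count", "2"),
    Event("deploy_window", "mondays"),
]
```

In this toy setup both backends return the current value `mondays`, but the full log also returns both superseded values, while the three-event window returns only one. That gap between "found the current policy" and "also surfaced stale ones" is the kind of trade-off the benchmark description refers to.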
Local Metrics
Verification
Summary by CodeRabbit