Add policy drift memory benchmark #742

Open
wly12312 wants to merge 3 commits into moorcheh-ai:main from wly12312:bounty-agent-support-memory-benchmark

wly12312 commented Jun 15, 2026

For #639

BountyHub bounty: https://www.bountyhub.dev/bounty/view/ec58c800-823e-4134-b027-99838e096d4b

Showcase: https://gist.github.com/wly12312/76303a6655e269c94c31b8e5e9fd7883

Summary

This PR adds examples/benchmarks/policy_drift_memory, a deterministic benchmark for agent-memory policy drift. It evaluates whether memory backends can keep current policy/preferences while suppressing stale but historically true facts.

The benchmark compares:

  • memanto_style_active_digest
  • episode_graph_baseline
  • recent_window_3
  • append_only_log

It includes a source dataset, scoring logic, reproducible run instructions, and no external dependency requirement for the default offline path.
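
To make the contrast between retention strategies concrete, here is a minimal sketch of the idea rather than the PR's actual code. It assumes an append-only log returns every event for an entity while an active digest overwrites the value per (entity, field). The `Event` shape, the class names `AppendOnlyLog` and `ActiveDigest`, and the sample `acme` data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape; the PR's dataset.json may differ."""
    entity: str
    field: str
    value: str


class AppendOnlyLog:
    """Keeps every event, so stale values come back alongside current ones."""

    def __init__(self) -> None:
        self.events: list[Event] = []

    def ingest(self, event: Event) -> None:
        self.events.append(event)

    def answer(self, entity: str) -> str:
        return " ".join(
            f"{e.field}={e.value}" for e in self.events if e.entity == entity
        )


class ActiveDigest:
    """Keeps only the latest value per (entity, field), overwriting stale ones."""

    def __init__(self) -> None:
        self.current: dict[tuple[str, str], str] = {}

    def ingest(self, event: Event) -> None:
        self.current[(event.entity, event.field)] = event.value

    def answer(self, entity: str) -> str:
        return " ".join(
            f"{field}={value}"
            for (ent, field), value in self.current.items()
            if ent == entity
        )


# Illustrative data only: the second event supersedes the first.
events = [
    Event("acme", "refund_window", "30 days"),
    Event("acme", "refund_window", "14 days"),
]
log, digest = AppendOnlyLog(), ActiveDigest()
for ev in events:
    log.ingest(ev)
    digest.ingest(ev)

print(log.answer("acme"))     # includes the stale and the current value
print(digest.answer("acme"))  # includes only the current value
```

Under this sketch, a query that forbids the stale value would fail against the log and pass against the digest, which is the behaviour the benchmark is described as measuring.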

Local Metrics

backend,accuracy,total_retrieved_tokens,avg_retrieved_tokens,p95_latency_ms
memanto_style_active_digest,1.0,282,56.4,0.0144
episode_graph_baseline,0.4,261,52.2,0.008
recent_window_3,0.0,315,63,0.0004
append_only_log,0.0,361,72.2,0.0005
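
As a quick sanity check on the table above, the reported averages are consistent with `avg_retrieved_tokens = total_retrieved_tokens / 5`. The query count of 5 is inferred from the figures and is not stated in the PR description, so treat it as an assumption. A small sketch that verifies the arithmetic:

```python
import csv
import io

# Metrics copied verbatim from the PR description.
REPORTED = """\
backend,accuracy,total_retrieved_tokens,avg_retrieved_tokens,p95_latency_ms
memanto_style_active_digest,1.0,282,56.4,0.0144
episode_graph_baseline,0.4,261,52.2,0.008
recent_window_3,0.0,315,63,0.0004
append_only_log,0.0,361,72.2,0.0005
"""

ASSUMED_QUERY_COUNT = 5  # inferred from total / avg; not stated in the PR

rows = list(csv.DictReader(io.StringIO(REPORTED)))
for row in rows:
    total = int(row["total_retrieved_tokens"])
    avg = float(row["avg_retrieved_tokens"])
    # The reported average should equal total tokens divided by the query count.
    assert abs(total / ASSUMED_QUERY_COUNT - avg) < 1e-9, row["backend"]
    print(f'{row["backend"]}: {total} / {ASSUMED_QUERY_COUNT} = {avg}')
```

If the same assumption holds, an accuracy of 0.4 for `episode_graph_baseline` would correspond to 2 of 5 queries answered correctly.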

Verification

python examples/benchmarks/policy_drift_memory/benchmark.py --repeats 200
python -m py_compile examples/benchmarks/policy_drift_memory/benchmark.py

Summary by CodeRabbit

  • New Features
    • Added a Policy Drift Memory Benchmark tool to compare retrieval backends on answer accuracy and resource usage under instruction changes.
    • Included benchmark documentation, a ready-to-run sample dataset, and an offline-only CLI with summary (and optional detailed JSON) outputs.

coderabbitai Bot commented Jun 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c5aef7ad-3b8b-4f4a-9ff4-8fd696fcb08b

📥 Commits

Reviewing files that changed from the base of the PR and between fc781c6 and 7b98832.

📒 Files selected for processing (2)
  • examples/benchmarks/policy_drift_memory/README.md
  • examples/benchmarks/policy_drift_memory/benchmark.py
✅ Files skipped from review due to trivial changes (1)
  • examples/benchmarks/policy_drift_memory/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/benchmarks/policy_drift_memory/benchmark.py

📝 Walkthrough

A new self-contained benchmark module is added at examples/benchmarks/policy_drift_memory/ comprising a JSON dataset of policy-mutation events and queries, four in-memory retrieval backends with different retention strategies, a scoring and orchestration layer, a CLI entry point, and comprehensive documentation.

Changes

Policy Drift Memory Benchmark

| Layer | File(s) | Summary |
|---|---|---|
| Dataset and backend contracts | examples/benchmarks/policy_drift_memory/dataset.json, examples/benchmarks/policy_drift_memory/benchmark.py (lines 1–49) | Defines 10 chronological policy-update events across three entities and per-entity queries with required/forbidden fields; introduces approximate token counting, p95 latency computation, text normalization utilities, QueryResult dataclass, and the abstract Backend interface with ingest and answer methods. |
| Concrete backend implementations and scoring | examples/benchmarks/policy_drift_memory/benchmark.py (lines 51–182) | Implements AppendOnlyLogBackend, RecentWindowBackend, EpisodeGraphBaseline, and ActiveDigestBackend with distinct event storage and query answering strategies; adds score_answer for required/forbidden term presence checking and build_backends for instantiation and ordering. |
| Benchmark runner and CLI | examples/benchmarks/policy_drift_memory/benchmark.py (lines 184–281) | run_once ingests all dataset events into each backend, executes repeated query answering per backend to collect latency and token metrics, and aggregates accuracy and p95 latency results with per-query details; main parses CLI arguments, loads dataset.json, prints a CSV header and summary rows, and optionally writes full JSON output. |
| README and requirements | examples/benchmarks/policy_drift_memory/README.md, examples/benchmarks/policy_drift_memory/requirements.txt | README documents the benchmark objective, lists the four backends, describes accuracy and resource metrics, provides run commands with optional JSON output mode, includes example output, and notes Python 3.10+ and fully offline execution requirements; requirements.txt clarifies stdlib-only dependency. |
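
The walkthrough names three small utilities: approximate token counting, a p95 latency computation, and `score_answer` for required/forbidden term checks. The sketch below shows one plausible shape for each. The signatures, the whitespace-based token estimate, and the nearest-rank percentile method are assumptions for illustration; the PR's benchmark.py may implement them differently.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated word count (an assumption)."""
    return len(text.split())


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile; the PR may use a different method."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), giving a 1-based nearest rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def score_answer(answer: str, required: list[str], forbidden: list[str]) -> bool:
    """Pass only if every required term appears and no forbidden term does."""
    text = answer.lower()
    has_required = all(term.lower() in text for term in required)
    has_forbidden = any(term.lower() in text for term in forbidden)
    return has_required and not has_forbidden


# Illustrative inputs only; these strings are not from the PR's dataset.
current_only = score_answer("Refund window is 14 days", ["14 days"], ["30 days"])
with_stale = score_answer(
    "Refund window was 30 days, now 14 days", ["14 days"], ["30 days"]
)
print(current_only, with_stale)
print(p95([float(i) for i in range(1, 101)]))
print(approx_tokens("Refund window is 14 days"))
```

Scoring on forbidden terms as well as required ones is what would let a backend that returns everything, such as an append-only log, fail even though the correct value is present in its output.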

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Poem

🐇 Hop, hop through the policy maze,
Four backends tested across shifting days,
The active digest knows what's current and true,
While append-only logs pile up the stew.
With tokens counted and p95 in hand,
This benchmark bunny makes facts firmly stand! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add policy drift memory benchmark' directly and specifically describes the main addition in the changeset - a new benchmark for evaluating agent-memory policy drift located at examples/benchmarks/policy_drift_memory. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/policy_drift_memory/benchmark.py`:
- Around line 245-249: The `--repeats` argument currently accepts zero or
negative integers, which causes the run_once function to skip retrieval loops
and subsequently fail at the assertion on line 201 when result is None. Add
validation to the `--repeats` argument parser definition to ensure only positive
integers (greater than 0) are accepted. This can be accomplished by using a
custom type function that validates the integer is positive, or by adding
appropriate constraints to the argument configuration.

In `@examples/benchmarks/policy_drift_memory/README.md`:
- Line 4: The line starting with `#639: retrieval accuracy versus resource
footprint when instructions mutate` violates markdownlint's MD018 rule because
it begins with a `#` character outside of a proper Markdown header context.
Rephrase this line to avoid starting with the `#` symbol, such as by moving the
issue number reference to after an introductory word or restructuring the text
so the line begins with a letter or other non-hash character while preserving
the content meaning.
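
For the `--repeats` finding above, one way to reject zero and negative values is a custom argparse type function, as the comment suggests. This is a minimal sketch: the helper name `positive_int` and the default of 200 are hypothetical, and the PR's actual parser definition is not shown in this thread.

```python
import argparse


def positive_int(raw: str) -> int:
    """argparse type: accept only integers greater than zero."""
    try:
        value = int(raw)
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected an integer, got {raw!r}"
        ) from None
    if value <= 0:
        raise argparse.ArgumentTypeError(f"must be greater than 0, got {value}")
    return value


parser = argparse.ArgumentParser(description="Sketch of --repeats validation")
# The default here is illustrative; the PR's real default is not shown.
parser.add_argument("--repeats", type=positive_int, default=200)

args = parser.parse_args(["--repeats", "5"])
print(args.repeats)
```

With this type in place, `--repeats 0` makes argparse print a usage error and exit before `run_once` is reached, so the assertion described in the comment is never hit.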

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b5caec22-1197-4568-900c-e14306ec9e97

📥 Commits

Reviewing files that changed from the base of the PR and between 262db90 and fc781c6.

📒 Files selected for processing (4)
  • examples/benchmarks/policy_drift_memory/README.md
  • examples/benchmarks/policy_drift_memory/benchmark.py
  • examples/benchmarks/policy_drift_memory/dataset.json
  • examples/benchmarks/policy_drift_memory/requirements.txt

Comment thread examples/benchmarks/policy_drift_memory/benchmark.py
Comment thread examples/benchmarks/policy_drift_memory/README.md Outdated