Add policy drift memory benchmark by wly12312 · Pull Request #742 · moorcheh-ai/memanto

wly12312 · 2026-06-15T14:51:17Z

BountyHub bounty: https://www.bountyhub.dev/bounty/view/ec58c800-823e-4134-b027-99838e096d4b

Showcase: https://gist.github.com/wly12312/76303a6655e269c94c31b8e5e9fd7883

Summary

This PR adds examples/benchmarks/policy_drift_memory, a deterministic benchmark for agent-memory policy drift. It evaluates whether memory backends can keep current policy/preferences while suppressing stale but historically true facts.

The benchmark compares:

memanto_style_active_digest
episode_graph_baseline
recent_window_3
append_only_log

It includes a source dataset, scoring logic, and reproducible run instructions. The default offline path requires no external dependencies.

Local Metrics

backend,accuracy,total_retrieved_tokens,avg_retrieved_tokens,p95_latency_ms
memanto_style_active_digest,1.0,282,56.4,0.0144
episode_graph_baseline,0.4,261,52.2,0.008
recent_window_3,0.0,315,63,0.0004
append_only_log,0.0,361,72.2,0.0005
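
As a quick consistency check on the table above, the sketch below parses the reported rows and divides `total_retrieved_tokens` by `avg_retrieved_tokens` for each backend. The helper name `implied_query_counts` is hypothetical and not part of the PR. The query count it recovers is inferred from the arithmetic only; the PR description does not state it.

```python
import csv
import io

# The summary rows exactly as reported in the PR description.
CSV_TEXT = """backend,accuracy,total_retrieved_tokens,avg_retrieved_tokens,p95_latency_ms
memanto_style_active_digest,1.0,282,56.4,0.0144
episode_graph_baseline,0.4,261,52.2,0.008
recent_window_3,0.0,315,63,0.0004
append_only_log,0.0,361,72.2,0.0005
"""


def implied_query_counts(csv_text: str) -> dict[str, int]:
    """Return total/avg per backend, i.e. the number of queries each row implies."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        total = int(row["total_retrieved_tokens"])
        average = float(row["avg_retrieved_tokens"])
        # round() absorbs floating-point noise from the division.
        counts[row["backend"]] = round(total / average)
    return counts


if __name__ == "__main__":
    for backend, count in implied_query_counts(CSV_TEXT).items():
        print(f"{backend}: {count} queries implied")
```

Every backend yields the same implied count, which suggests all four were scored against the same query set.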

Verification

python examples/benchmarks/policy_drift_memory/benchmark.py --repeats 200
python -m py_compile examples/benchmarks/policy_drift_memory/benchmark.py

Summary by CodeRabbit

New Features
- Added a Policy Drift Memory Benchmark tool to compare retrieval backends on answer accuracy and resource usage under instruction changes.
- Included benchmark documentation, a ready-to-run sample dataset, and an offline-only CLI with summary (and optional detailed JSON) outputs.

coderabbitai · 2026-06-15T14:51:33Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c5aef7ad-3b8b-4f4a-9ff4-8fd696fcb08b

📥 Commits

Reviewing files that changed from the base of the PR and between fc781c6 and 7b98832.

📒 Files selected for processing (2)

examples/benchmarks/policy_drift_memory/README.md
examples/benchmarks/policy_drift_memory/benchmark.py

✅ Files skipped from review due to trivial changes (1)

examples/benchmarks/policy_drift_memory/README.md

🚧 Files skipped from review as they are similar to previous changes (1)

examples/benchmarks/policy_drift_memory/benchmark.py

📝 Walkthrough


A new self-contained benchmark module is added at examples/benchmarks/policy_drift_memory/ comprising a JSON dataset of policy-mutation events and queries, four in-memory retrieval backends with different retention strategies, a scoring and orchestration layer, a CLI entry point, and comprehensive documentation.
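
To illustrate how retention strategy alone can change an answer, here is a minimal sketch of two contrasting backends behind an `ingest`/`answer` interface. It is not the code from `benchmark.py`: the `Event` fields (`entity`, `key`, `value`), the class names, and the answer format are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical policy-update event; field names are assumptions."""
    entity: str
    key: str
    value: str


class Backend(ABC):
    @abstractmethod
    def ingest(self, event: Event) -> None:
        """Store one event."""

    @abstractmethod
    def answer(self, entity: str) -> str:
        """Return the retrieved context for an entity."""


class AppendOnlyLog(Backend):
    """Keeps every event, so superseded values are retrieved too."""

    def __init__(self) -> None:
        self.events: list[Event] = []

    def ingest(self, event: Event) -> None:
        self.events.append(event)

    def answer(self, entity: str) -> str:
        return " ".join(
            f"{e.key}={e.value}" for e in self.events if e.entity == entity
        )


class ActiveDigest(Backend):
    """Keeps only the latest value per (entity, key) pair."""

    def __init__(self) -> None:
        self.current: dict[tuple[str, str], str] = {}

    def ingest(self, event: Event) -> None:
        # A newer event overwrites the older value for the same key.
        self.current[(event.entity, event.key)] = event.value

    def answer(self, entity: str) -> str:
        return " ".join(
            f"{key}={value}"
            for (ent, key), value in self.current.items()
            if ent == entity
        )


def replay(backend: Backend, events: list[Event]) -> Backend:
    """Ingest events in chronological order and return the backend."""
    for event in events:
        backend.ingest(event)
    return backend
```

With two successive updates to the same key, the log-style backend returns both the stale and the current value, while the digest-style backend returns only the current one.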

Changes

Policy Drift Memory Benchmark

| Layer / File(s) | Summary |
| --- | --- |
| Dataset and backend contracts: `examples/benchmarks/policy_drift_memory/dataset.json`, `examples/benchmarks/policy_drift_memory/benchmark.py` (lines 1–49) | Defines 10 chronological policy-update events across three entities and per-entity queries with `required`/`forbidden` fields; introduces approximate token counting, p95 latency computation, text normalization utilities, `QueryResult` dataclass, and the abstract `Backend` interface with `ingest` and `answer` methods. |
| Concrete backend implementations and scoring: `examples/benchmarks/policy_drift_memory/benchmark.py` (lines 51–182) | Implements `AppendOnlyLogBackend`, `RecentWindowBackend`, `EpisodeGraphBaseline`, and `ActiveDigestBackend` with distinct event storage and query answering strategies; adds `score_answer` for required/forbidden term presence checking and `build_backends` for instantiation and ordering. |
| Benchmark runner and CLI: `examples/benchmarks/policy_drift_memory/benchmark.py` (lines 184–281) | `run_once` ingests all dataset events into each backend, executes repeated query answering per backend to collect latency and token metrics, and aggregates accuracy and p95 latency results with per-query details; `main` parses CLI arguments, loads `dataset.json`, prints a CSV header and summary rows, and optionally writes full JSON output. |
| README and requirements: `examples/benchmarks/policy_drift_memory/README.md`, `examples/benchmarks/policy_drift_memory/requirements.txt` | README documents the benchmark objective, lists the four backends, describes accuracy and resource metrics, provides run commands with optional JSON output mode, includes example output, and notes Python 3.10+ and fully offline execution requirements; `requirements.txt` clarifies stdlib-only dependency. |
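
The scoring and metric utilities summarized above could look roughly like the following sketch. The function bodies are assumptions, not the PR's implementation: whitespace splitting stands in for "approximate token counting", the nearest-rank method stands in for the p95 computation, and substring matching stands in for required/forbidden term checking.

```python
import math
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so term matching is layout-insensitive."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def approx_tokens(text: str) -> int:
    """Deterministic token estimate: count whitespace-separated words (assumption)."""
    return len(text.split())


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile; one of several common definitions."""
    if not samples:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def score_answer(answer: str, required: list[str], forbidden: list[str]) -> bool:
    """Pass only if every required term is present and no forbidden term is."""
    text = normalize(answer)
    has_required = all(normalize(term) in text for term in required)
    has_forbidden = any(normalize(term) in text for term in forbidden)
    return has_required and not has_forbidden
```

Under this kind of scoring, an answer that includes the current policy but also repeats a stale one fails, which is consistent with the benchmark's stated goal of penalizing stale but historically true facts.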

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

[BOUNTY $100] 🐜 The Great Agentic Memory Showdown: Memanto Benchmarking & Evaluation Challenge #639: This PR directly implements the policy drift memory benchmark described in that issue, including all four backends, accuracy vs. token-footprint metrics, p95 latency reporting, JSON output mode, and offline reproducibility requirements with deterministic token counting.

Poem

🐇 Hop, hop through the policy maze,
Four backends tested across shifting days,
The active digest knows what's current and true,
While append-only logs pile up the stew.
With tokens counted and p95 in hand,
This benchmark bunny makes facts firmly stand! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add policy drift memory benchmark' directly and specifically describes the main addition in the changeset - a new benchmark for evaluating agent-memory policy drift located at examples/benchmarks/policy_drift_memory. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/policy_drift_memory/benchmark.py`:
- Around line 245-249: The `--repeats` argument currently accepts zero or
negative integers, which causes the run_once function to skip retrieval loops
and subsequently fail at the assertion on line 201 when result is None. Add
validation to the `--repeats` argument parser definition to ensure only positive
integers (greater than 0) are accepted. This can be accomplished by using a
custom type function that validates the integer is positive, or by adding
appropriate constraints to the argument configuration.

In `@examples/benchmarks/policy_drift_memory/README.md`:
- Line 4: The line starting with `#639: retrieval accuracy versus resource
footprint when instructions mutate` violates markdownlint's MD018 rule because
it begins with a `#` character outside of a proper Markdown header context.
Rephrase this line to avoid starting with the `#` symbol, such as by moving the
issue number reference to after an introductory word or restructuring the text
so the line begins with a letter or other non-hash character while preserving
the content meaning.
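
For the `--repeats` finding above, one way to reject zero and negative values at parse time is a custom `type` callable, as in the sketch below. The names `positive_int` and `build_parser` and the default of 200 are assumptions for illustration (200 is simply the value used in the verification command); the actual fix in commit 7b98832 may differ.

```python
import argparse


def positive_int(raw: str) -> int:
    """argparse type callable that accepts only integers greater than zero."""
    try:
        value = int(raw)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected an integer, got {raw!r}")
    if value <= 0:
        raise argparse.ArgumentTypeError(f"must be greater than 0, got {value}")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the --repeats option."""
    parser = argparse.ArgumentParser(description="Policy drift memory benchmark")
    parser.add_argument(
        "--repeats",
        type=positive_int,
        default=200,  # assumed default, not confirmed by the PR
        help="number of times each query is repeated (must be > 0)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"repeats={args.repeats}")
```

With this approach, an invalid value makes argparse print a usage error and exit before any benchmark code runs, so the downstream assertion is never reached with an empty result.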

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b5caec22-1197-4568-900c-e14306ec9e97

📥 Commits

Reviewing files that changed from the base of the PR and between 262db90 and fc781c6.

📒 Files selected for processing (4)

examples/benchmarks/policy_drift_memory/README.md
examples/benchmarks/policy_drift_memory/benchmark.py
examples/benchmarks/policy_drift_memory/dataset.json
examples/benchmarks/policy_drift_memory/requirements.txt

Add policy drift memory benchmark

fc781c6

coderabbitai Bot reviewed Jun 15, 2026

Comment thread examples/benchmarks/policy_drift_memory/benchmark.py

Comment thread examples/benchmarks/policy_drift_memory/README.md Outdated

wly12312 added 2 commits June 16, 2026 09:23

Fix benchmark README lint warning

9535cb0

Validate benchmark repeat count

7b98832

wly12312 wants to merge 3 commits into moorcheh-ai:main from wly12312:bounty-agent-support-memory-benchmark
