[BOUNTY #639] Add robotics fleet memory benchmark by MyouzzZ · Pull Request #738 · moorcheh-ai/memanto

MyouzzZ · 2026-06-15T09:33:45Z

/claim #639

BountyHub: https://www.bountyhub.dev/bounty/view/ec58c800-823e-4134-b027-99838e096d4b
Public showcase: https://gist.github.com/MyouzzZ/7c7eb76f11c6ffbb1a8b13ae767187db

Summary

Adds examples/benchmarks/robotics-fleet-memory, a deterministic no-key benchmark for the Great Agentic Memory Showdown. The scenario stresses robotics fleet operations where memory has to preserve current state while avoiding stale shift notes and a synthetic credential leak.

Included artifacts:

- Source dataset with 12 shift events and 7 golden queries.
- Three comparable backends: append-only log, recent-window baseline, and Memanto-style active digest.
- Metrics for retrieval accuracy, ingested/retrieved tokens, p95 latency, stale conflict rate, and secret leak rate.
- Committed JSON and Markdown sample results.
- Node test coverage for dataset shape, scoring, invalid iteration handling, leakage detection, and active-digest improvement.
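For readers unfamiliar with the p95 latency metric listed above, here is a minimal sketch of a nearest-rank percentile. The function name `percentile`, the nearest-rank rule, and the sample latencies are assumptions for illustration; the harness in `run_benchmark.mjs` may interpolate differently.

```javascript
// Nearest-rank percentile: sort ascending, then take the value at
// rank ceil(p / 100 * n). Assumption: the real harness may use another rule.
function percentile(values, p) {
  if (values.length === 0) {
    throw new RangeError("percentile of an empty sample is undefined");
  }
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical per-query latencies in milliseconds (not taken from the PR).
const latenciesMs = [0.12, 0.15, 0.11, 0.19, 0.14, 0.13, 0.91, 0.16, 0.12, 0.17];
console.log(percentile(latenciesMs, 95));
```

With only ten samples, nearest-rank p95 is simply the slowest observation, which is one reason a benchmark like this repeats each query over many iterations.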

Benchmark metric summary

| Backend | Accuracy | Retrieved tokens | Avg retrieved tokens | Stale conflict rate | Secret leak rate | p95 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| append_only_log | 0.143 | 222 | 31.71 | 0.714 | 0.143 | 0.9139 |
| recent_window | 0.571 | 141 | 20.14 | 0.143 | 0 | 0.2101 |
| active_fleet_digest | 1 | 69 | 9.86 | 0 | 0 | 0.1926 |
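The rates in the table are consistent with simple per-query fractions over the 7 golden queries (for example, 0.143 ≈ 1/7 and 31.71 ≈ 222/7). The sketch below reproduces the rounded values from counts that are inferred from the table, not read from `sample_results.json`; the field names are hypothetical.

```javascript
const QUERY_COUNT = 7; // from the PR summary: 7 golden queries

// Round to a fixed number of decimals and return a number.
function round(value, decimals) {
  const factor = 10 ** decimals;
  return Math.round(value * factor) / factor;
}

// Assumption: counts chosen so that count / 7 matches the published table.
const inferredCounts = {
  append_only_log: { correct: 1, staleConflicts: 5, leaks: 1, retrievedTokens: 222 },
  recent_window: { correct: 4, staleConflicts: 1, leaks: 0, retrievedTokens: 141 },
  active_fleet_digest: { correct: 7, staleConflicts: 0, leaks: 0, retrievedTokens: 69 },
};

const summary = {};
for (const [backend, c] of Object.entries(inferredCounts)) {
  summary[backend] = {
    accuracy: round(c.correct / QUERY_COUNT, 3),
    avgRetrievedTokens: round(c.retrievedTokens / QUERY_COUNT, 2),
    staleConflictRate: round(c.staleConflicts / QUERY_COUNT, 3),
    secretLeakRate: round(c.leaks / QUERY_COUNT, 3),
  };
}
console.log(summary);
```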

Validation

npm.cmd run benchmark
npm.cmd test
git diff --check

Notes

The benchmark is intentionally offline and deterministic so it can run in CI without API keys. The backend contract is limited to ingest(event) and retrieve(query), so live Memanto, Mem0, or Zep adapters can be added without changing the dataset or scoring harness.
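To illustrate the `ingest(event)` / `retrieve(query)` contract described above, here is a minimal adapter sketch. Everything except those two method names is an assumption: `RemoteMemoryAdapter`, the `client.add` / `client.search` calls, and the `text` field on events are hypothetical placeholders and do not correspond to the Memanto, Mem0, or Zep SDKs.

```javascript
// Hypothetical adapter: forwards the two contract methods to some client.
// `client.add` and `client.search` are placeholders, not a real SDK API.
class RemoteMemoryAdapter {
  constructor(client) {
    this.client = client;
  }

  ingest(event) {
    this.client.add(event);
  }

  retrieve(query) {
    return this.client.search(query);
  }
}

// In-memory stand-in so the sketch runs offline without API keys.
class FakeClient {
  constructor() {
    this.items = [];
  }

  add(item) {
    this.items.push(item);
  }

  // Return every stored item whose text contains any word of the query.
  search(query) {
    const words = query.toLowerCase().split(/\s+/).filter(Boolean);
    return this.items.filter((item) =>
      words.some((word) => item.text.toLowerCase().includes(word)),
    );
  }
}

const adapter = new RemoteMemoryAdapter(new FakeClient());
adapter.ingest({ text: "R2 battery at 40 percent" }); // assumed event shape
adapter.ingest({ text: "Aisle 9 blocked by pallet" });
console.log(adapter.retrieve("battery status"));
```

A live adapter would most likely need asynchronous calls; whether the harness awaits `ingest` and `retrieve` is not stated in this PR, so that would need checking against `run_benchmark.mjs`.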

Follow-up: addressed CodeRabbit's iteration-count validation comment in 1adec47.

Summary by CodeRabbit

New Features
- Added a “robotics fleet memory” benchmark with three in-memory backend options and reporting for accuracy, token efficiency, latency (p95), stale conflict rate, and secret-leak prevention.
- Added BENCHMARK_ITERATIONS configuration to control benchmark iteration count.
- Included a ready-to-review dataset and sample results for the benchmark.
Documentation
- Added a comprehensive README covering scenarios, metrics, how to run, and the evaluation harness behavior.
Tests
- Added an automated test suite to verify determinism, validation, scoring behavior, benchmark comparisons, and report output.

coderabbitai · 2026-06-15T09:33:59Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 539230e6-df60-4be0-909c-d20441a7371a

📥 Commits

Reviewing files that changed from the base of the PR and between c6bd323 and 1adec47.

📒 Files selected for processing (4)

examples/benchmarks/robotics-fleet-memory/results/sample_results.json
examples/benchmarks/robotics-fleet-memory/results/sample_results.md
examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs
examples/benchmarks/robotics-fleet-memory/test_benchmark.mjs

✅ Files skipped from review due to trivial changes (1)

examples/benchmarks/robotics-fleet-memory/results/sample_results.md

🚧 Files skipped from review as they are similar to previous changes (3)

examples/benchmarks/robotics-fleet-memory/results/sample_results.json
examples/benchmarks/robotics-fleet-memory/test_benchmark.mjs
examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs

📝 Walkthrough

Adds a new self-contained robotics-fleet-memory benchmark example under examples/benchmarks/. It includes a JSON evaluation dataset with fleet state events and a credential-suppression scenario, three in-memory backend implementations, a scoring and Markdown reporting engine, a Node.js CLI runner with configurable iterations, a test suite, and committed sample results.
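As a rough illustration of how an in-memory backend of this kind might rank events by keyword overlap, consider the sketch below. The helper names `keywords` and `rankByOverlap`, the `text` field, and the sample events are assumptions, not the code in `run_benchmark.mjs`.

```javascript
// Lowercase a string and collect its alphanumeric tokens into a set.
function keywords(text) {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Score each event by how many of its keywords also appear in the query,
// drop events with no overlap, and return the top `limit` events.
function rankByOverlap(events, query, limit = 3) {
  const queryWords = keywords(query);
  return events
    .map((event) => {
      let overlap = 0;
      for (const word of keywords(event.text)) {
        if (queryWords.has(word)) overlap += 1;
      }
      return { event, overlap };
    })
    .filter((scored) => scored.overlap > 0)
    .sort((a, b) => b.overlap - a.overlap)
    .slice(0, limit)
    .map((scored) => scored.event);
}

// Hypothetical fleet events, for illustration only.
const sampleEvents = [
  { id: "e1", text: "R3 battery swapped at dock 2" },
  { id: "e2", text: "R3 moved to dock 4 for charging" },
  { id: "e3", text: "Aisle 9 blocked by pallet" },
];
console.log(rankByOverlap(sampleEvents, "Where is R3 charging?").map((e) => e.id));
```

Pure overlap ranking has no notion of recency, which is why an append-only log can surface superseded notes alongside current ones.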

Changes

Robotics Fleet Memory Benchmark

| Layer / File(s) | Summary |
| --- | --- |
| Package setup and evaluation dataset: `examples/benchmarks/robotics-fleet-memory/package.json`, `.env.example`, `dataset/robotics_fleet_sessions.json` | ESM package with benchmark/test scripts and Node >=20 constraint; `BENCHMARK_ITERATIONS=50` env example; full dataset JSON with fleet events (including one sensitive API-token event) and retrieval queries with expected, stale, and prohibited term sets. |
| Backend implementations and scoring engine: `examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs` (lines 1–307) | Text normalization, keyword extraction, and overlap/percentile utilities; retrieval helper ranking events by keyword overlap; `AppendOnlyLogBackend`, `RecentWindowBackend`, and `ActiveFleetDigestBackend` classes with distinct retention and sensitive-event suppression semantics; `scoreContext()` that checks context against expected/stale/prohibited terms; per-backend execution loop measuring accuracy, token counts, stale conflict rate, secret leak rate, and p95 latency. |
| Markdown reporting and CLI entry: `examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs` (lines 309–425) | `renderMarkdown()` producing summary table and per-query failure section; CLI parsing for `--dataset`/`--output`/`--markdown`/`--iterations`; `main()` orchestrating benchmark execution and optional file writes; module exports for all public classes and helpers. |
| Node.js test suite: `examples/benchmarks/robotics-fleet-memory/test_benchmark.mjs` | Seven test cases: dataset shape validation, deterministic `tokenCount` output, `scoreContext` stale/prohibited detection, iteration validation with error messages, relative backend performance assertions (`active_fleet_digest` vs `append_only_log`), and `renderMarkdown` output shape. |
| Committed sample results and README: `examples/benchmarks/robotics-fleet-memory/README.md`, `results/sample_results.json`, `results/sample_results.md` | README documenting benchmark purpose, backends, metrics, run commands, adapter contract, and dataset layout; `sample_results.json` with full per-backend per-query metrics; `sample_results.md` with rendered aggregated metrics table and per-query failure report. |
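The summary above says `scoreContext()` checks retrieved context against expected, stale, and prohibited terms. A minimal sketch of that idea follows, under stated assumptions: the function name `scoreContextSketch`, the query field names, the substring matching, and the rule that a stale or prohibited hit makes the answer incorrect are all guesses and may differ from the PR's implementation.

```javascript
// Case-insensitive substring checks over the retrieved context.
function containsAny(context, terms) {
  const haystack = context.toLowerCase();
  return terms.some((term) => haystack.includes(term.toLowerCase()));
}

function containsAll(context, terms) {
  const haystack = context.toLowerCase();
  return terms.every((term) => haystack.includes(term.toLowerCase()));
}

// Assumed rule: a query counts as correct only when every expected term is
// present and no stale or prohibited term appears in the context.
function scoreContextSketch(context, query) {
  const staleConflict = containsAny(context, query.stale);
  const secretLeak = containsAny(context, query.prohibited);
  const hasExpected = containsAll(context, query.expected);
  return {
    correct: hasExpected && !staleConflict && !secretLeak,
    staleConflict,
    secretLeak,
  };
}

// Hypothetical golden query; the token string is a made-up placeholder.
const goldenQuery = {
  expected: ["dock 4"],
  stale: ["dock 2"],
  prohibited: ["tok_example_123"],
};
console.log(scoreContextSketch("R7 is charging at Dock 4", goldenQuery));
console.log(scoreContextSketch("R7 was at dock 2, then moved to dock 4", goldenQuery));
```

Under this rule the second context contains the right answer but still fails, because the superseded location is returned alongside it.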

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

[BOUNTY $100] 🐜 The Great Agentic Memory Showdown: Memanto Benchmarking & Evaluation Challenge #639: Directly fulfills the bounty specification for a robotics fleet memory benchmarking suite with quantifiable metrics (accuracy, token counts, latency, stale conflict rate, secret leak rate) and reproducible test scenarios under /examples/benchmarks/.

Poem

🐇 Hop, hop through the fleet's busy halls,
Three backends compete down memory walls,
The digest beats all — stale facts swept away,
No token leaks out, secrets kept at bay,
The rabbit runs benchmarks to measure the score,
50 iterations, then asks for more! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title '[BOUNTY `#639`] Add robotics fleet memory benchmark' clearly summarizes the main change: adding a new robotics fleet memory benchmark to the examples directory. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs`:
- Around lines 282-283: The iterations variable assignment at lines 282-283 and
the similar assignment at lines 369-370 do not validate that the resulting value
is a positive integer, allowing invalid values like NaN, 0, negative numbers, or
non-integers to be used in benchmark loops. Add validation after each assignment
to check that iterations is a valid positive integer (greater than 0 and an
integer), and throw an error with a clear message if the validation fails. This
will ensure the benchmark runs with valid iteration counts and fails fast with
meaningful feedback rather than producing misleading metrics.
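The validation requested in this comment could look roughly like the sketch below. The helper name `parseIterations` and the fallback of 50 (taken from the `.env.example` value mentioned in the walkthrough) are assumptions; the actual fix in 1adec47 may be structured differently.

```javascript
// Parse an iteration count from a raw string (for example an env var or
// CLI flag) and fail fast unless it is a positive integer.
// Assumption: an undefined or empty value falls back to a default of 50.
function parseIterations(raw, fallback = 50) {
  const value = raw === undefined || raw === "" ? fallback : Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new RangeError(
      `iterations must be a positive integer, got: ${String(raw)}`,
    );
  }
  return value;
}

console.log(parseIterations("25"));
console.log(parseIterations(undefined));
```

In the runner this might be called as `parseIterations(process.env.BENCHMARK_ITERATIONS)`, so that values such as `0`, `-3`, `2.5`, or `abc` raise an error before any timing loop starts.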


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8352f2f6-a854-495e-8dc4-2bf562d9216c

📥 Commits

Reviewing files that changed from the base of the PR and between 262db90 and c6bd323.

📒 Files selected for processing (8)

examples/benchmarks/robotics-fleet-memory/.env.example
examples/benchmarks/robotics-fleet-memory/README.md
examples/benchmarks/robotics-fleet-memory/dataset/robotics_fleet_sessions.json
examples/benchmarks/robotics-fleet-memory/package.json
examples/benchmarks/robotics-fleet-memory/results/sample_results.json
examples/benchmarks/robotics-fleet-memory/results/sample_results.md
examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs
examples/benchmarks/robotics-fleet-memory/test_benchmark.mjs

feat: add robotics fleet memory benchmark

c6bd323

coderabbitai Bot reviewed Jun 15, 2026

Comment thread examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs Outdated

fix: validate benchmark iteration counts

1adec47

MyouzzZ wants to merge 2 commits into moorcheh-ai:main from MyouzzZ:codex/robotics-fleet-memory-639
