
[BOUNTY #639] Add robotics fleet memory benchmark #738

Open
MyouzzZ wants to merge 2 commits into moorcheh-ai:main from MyouzzZ:codex/robotics-fleet-memory-639

MyouzzZ commented Jun 15, 2026

/claim #639

BountyHub: https://www.bountyhub.dev/bounty/view/ec58c800-823e-4134-b027-99838e096d4b
Public showcase: https://gist.github.com/MyouzzZ/7c7eb76f11c6ffbb1a8b13ae767187db

Summary

Adds examples/benchmarks/robotics-fleet-memory, a deterministic no-key benchmark for the Great Agentic Memory Showdown. The scenario stresses robotics fleet operations where memory has to preserve current state while avoiding stale shift notes and a synthetic credential leak.

Included artifacts:

  • Source dataset with 12 shift events and 7 golden queries.
  • Three comparable backends: append-only log, recent-window baseline, and Memanto-style active digest.
  • Metrics for retrieval accuracy, ingested/retrieved tokens, p95 latency, stale conflict rate, and secret leak rate.
  • Committed JSON and Markdown sample results.
  • Node test coverage for dataset shape, scoring, invalid iteration handling, leakage detection, and active-digest improvement.
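
For readers unfamiliar with the p95 latency metric listed above, here is a minimal sketch of a nearest-rank percentile. The function name `percentile` and the sample latencies are illustrative assumptions, and the helper in `run_benchmark.mjs` may use a different percentile definition (for example, linear interpolation).

```javascript
// Nearest-rank percentile: one common definition, assumed here for illustration.
function percentile(values, p) {
  if (values.length === 0) {
    throw new Error("percentile() needs at least one sample");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...values].sort((a, b) => a - b);
  // Smallest value with at least p percent of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical latency samples in milliseconds (not taken from the PR's results).
const latenciesMs = [0.21, 0.18, 0.19, 0.95, 0.2, 0.22, 0.17, 0.19, 0.2, 0.18];
console.log(percentile(latenciesMs, 95)); // 0.95
```

With only ten samples, a single slow call sets the p95, which is one reason a benchmark like this runs many iterations per query.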

Benchmark metric summary

| Backend | Accuracy | Retrieved tokens | Avg retrieved tokens | Stale conflict rate | Secret leak rate | p95 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| append_only_log | 0.143 | 222 | 31.71 | 0.714 | 0.143 | 0.9139 |
| recent_window | 0.571 | 141 | 20.14 | 0.143 | 0 | 0.2101 |
| active_fleet_digest | 1 | 69 | 9.86 | 0 | 0 | 0.1926 |
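
The rates and averages in the table appear to be per-query figures over the 7 golden queries. The sketch below checks that reading. The token totals come from the table, while the `passed` and `staleHits` counts are inferred (an assumption): they are the whole-number counts out of 7 that round to the reported values.

```javascript
const QUERY_COUNT = 7; // "7 golden queries" per the PR summary

// retrievedTokens is copied from the table; passed and staleHits are inferred.
const rows = [
  { backend: "append_only_log", retrievedTokens: 222, passed: 1, staleHits: 5 },
  { backend: "recent_window", retrievedTokens: 141, passed: 4, staleHits: 1 },
  { backend: "active_fleet_digest", retrievedTokens: 69, passed: 7, staleHits: 0 },
];

// Round to a fixed number of decimals and return a number, not a string.
const round = (value, digits) => Number(value.toFixed(digits));

const derived = rows.map((row) => ({
  backend: row.backend,
  accuracy: round(row.passed / QUERY_COUNT, 3),
  avgRetrievedTokens: round(row.retrievedTokens / QUERY_COUNT, 2),
  staleConflictRate: round(row.staleHits / QUERY_COUNT, 3),
}));

console.table(derived);
```

Under that assumption the derived values match the table: 0.143, 31.71, and 0.714 for append_only_log; 0.571, 20.14, and 0.143 for recent_window; 1, 9.86, and 0 for active_fleet_digest.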

Validation

  • npm.cmd run benchmark
  • npm.cmd test
  • git diff --check

Notes

The benchmark is intentionally offline and deterministic so it can run in CI without API keys. The backend contract is limited to ingest(event) and retrieve(query), so live Memanto, Mem0, or Zep adapters can be added without changing the dataset or scoring harness.
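
To illustrate the two-method contract, here is a minimal sketch of what an adapter could look like. Everything beyond the method names `ingest` and `retrieve` is an assumption: the class name, the event fields `robotId` and `text`, the synchronous calls, and the string return value are hypothetical and may not match the PR's dataset or harness.

```javascript
// Hypothetical adapter: keeps only the most recent note per robot.
class LatestPerRobotBackend {
  constructor() {
    this.latestByRobot = new Map();
  }

  // A later event for the same robot overwrites the earlier one.
  ingest(event) {
    this.latestByRobot.set(event.robotId, event.text);
  }

  // Returns stored notes that mention any word from the query, one per line.
  retrieve(query) {
    const words = query.toLowerCase().split(/\W+/).filter(Boolean);
    return [...this.latestByRobot.values()]
      .filter((text) => words.some((word) => text.toLowerCase().includes(word)))
      .join("\n");
  }
}

const backend = new LatestPerRobotBackend();
backend.ingest({ robotId: "amr-1", text: "amr-1 battery at 20 percent" });
backend.ingest({ robotId: "amr-1", text: "amr-1 battery swapped, now 100 percent" });

const context = backend.retrieve("battery status");
console.log(context); // amr-1 battery swapped, now 100 percent
```

A live adapter would replace the `Map` with calls to the memory service; the harness would only need those two methods.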

Follow-up: addressed CodeRabbit's iteration-count validation comment in 1adec47.

Summary by CodeRabbit

  • New Features
    • Added a “robotics fleet memory” benchmark with three in-memory backend options and reporting for accuracy, token efficiency, latency (p95), stale conflict rate, and secret-leak prevention.
    • Added BENCHMARK_ITERATIONS configuration to control benchmark iteration count.
    • Included a ready-to-review dataset and sample results for the benchmark.
  • Documentation
    • Added a comprehensive README covering scenarios, metrics, how to run, and the evaluation harness behavior.
  • Tests
    • Added an automated test suite to verify determinism, validation, scoring behavior, benchmark comparisons, and report output.

coderabbitai Bot commented Jun 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 539230e6-df60-4be0-909c-d20441a7371a

📥 Commits

Reviewing files that changed from the base of the PR and between c6bd323 and 1adec47.

📒 Files selected for processing (4)
  • examples/benchmarks/robotics-fleet-memory/results/sample_results.json
  • examples/benchmarks/robotics-fleet-memory/results/sample_results.md
  • examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs
  • examples/benchmarks/robotics-fleet-memory/test_benchmark.mjs
✅ Files skipped from review due to trivial changes (1)
  • examples/benchmarks/robotics-fleet-memory/results/sample_results.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • examples/benchmarks/robotics-fleet-memory/results/sample_results.json
  • examples/benchmarks/robotics-fleet-memory/test_benchmark.mjs
  • examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs

📝 Walkthrough

Adds a new self-contained robotics-fleet-memory benchmark example under examples/benchmarks/. It includes a JSON evaluation dataset with fleet state events and a credential-suppression scenario, three in-memory backend implementations, a scoring and Markdown reporting engine, a Node.js CLI runner with configurable iterations, a test suite, and committed sample results.
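
As a rough idea of how a CLI runner might collect its flags, here is a hand-rolled sketch with no dependencies. The flag names follow the ones listed in this review (`--dataset`, `--output`, `--markdown`, `--iterations`); the function name `parseCliArgs` and its error messages are hypothetical, and the PR's actual parsing may differ.

```javascript
// Flags named in the review walkthrough; each one takes exactly one value.
const KNOWN_FLAGS = new Set(["dataset", "output", "markdown", "iterations"]);

// Hypothetical parser for arguments of the form "--flag value".
function parseCliArgs(argv) {
  const options = {};
  for (let i = 0; i < argv.length; i += 1) {
    const arg = argv[i];
    if (!arg.startsWith("--")) {
      throw new Error(`Unexpected argument: ${arg}`);
    }
    const name = arg.slice(2);
    if (!KNOWN_FLAGS.has(name)) {
      throw new Error(`Unknown flag: ${arg}`);
    }
    const value = argv[i + 1];
    if (value === undefined || value.startsWith("--")) {
      throw new Error(`Missing value for ${arg}`);
    }
    options[name] = value;
    i += 1; // skip the value that was just consumed
  }
  return options;
}

const parsed = parseCliArgs([
  "--dataset", "dataset/robotics_fleet_sessions.json",
  "--iterations", "50",
]);
console.log(parsed);
```

In a real runner the input would be `process.argv.slice(2)`. Values stay as strings here, so numeric options such as the iteration count still need their own validation.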

Changes

Robotics Fleet Memory Benchmark

| Layer / File(s) | Summary |
| --- | --- |
| Package setup and evaluation dataset: examples/benchmarks/robotics-fleet-memory/package.json, .env.example, dataset/robotics_fleet_sessions.json | ESM package with benchmark/test scripts and Node >=20 constraint; BENCHMARK_ITERATIONS=50 env example; full dataset JSON with fleet events (including one sensitive API-token event) and retrieval queries with expected, stale, and prohibited term sets. |
| Backend implementations and scoring engine: examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs (lines 1–307) | Text normalization, keyword extraction, and overlap/percentile utilities; retrieval helper ranking events by keyword overlap; AppendOnlyLogBackend, RecentWindowBackend, and ActiveFleetDigestBackend classes with distinct retention and sensitive-event suppression semantics; scoreContext() that checks context against expected/stale/prohibited terms; per-backend execution loop measuring accuracy, token counts, stale conflict rate, secret leak rate, and p95 latency. |
| Markdown reporting and CLI entry: examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs (lines 309–425) | renderMarkdown() producing summary table and per-query failure section; CLI parsing for --dataset/--output/--markdown/--iterations; main() orchestrating benchmark execution and optional file writes; module exports for all public classes and helpers. |
| Node.js test suite: examples/benchmarks/robotics-fleet-memory/test_benchmark.mjs | Seven test cases: dataset shape validation, deterministic tokenCount output, scoreContext stale/prohibited detection, iteration validation with error messages, relative backend performance assertions (active_fleet_digest vs append_only_log), and renderMarkdown output shape. |
| Committed sample results and README: examples/benchmarks/robotics-fleet-memory/README.md, results/sample_results.json, results/sample_results.md | README documenting benchmark purpose, backends, metrics, run commands, adapter contract, and dataset layout; sample_results.json with full per-backend per-query metrics; sample_results.md with rendered aggregated metrics table and per-query failure report. |
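
The summary above describes a scoring step that checks retrieved context against expected, stale, and prohibited terms. A minimal sketch of that idea follows. The name `scoreContextSketch`, the option names, and the pass rule (all expected terms present, no stale or prohibited term present) are assumptions for illustration, not the PR's `scoreContext()` implementation.

```javascript
// Lowercase and collapse punctuation so matching is whole-word and case-insensitive.
function normalize(text) {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Hypothetical scoring rule: correct only when nothing is missing, stale, or leaked.
function scoreContextSketch(context, { expected = [], stale = [], prohibited = [] }) {
  const haystack = ` ${normalize(context)} `;
  const contains = (term) => haystack.includes(` ${normalize(term)} `);

  const missing = expected.filter((term) => !contains(term));
  const staleHits = stale.filter(contains);
  const leaks = prohibited.filter(contains);

  return {
    correct: missing.length === 0 && staleHits.length === 0 && leaks.length === 0,
    missing,
    staleHits,
    leaks,
  };
}

// Made-up context and term sets, not taken from the PR's dataset.
const result = scoreContextSketch(
  "Dock 3 is closed for repair. Dock 3 was open this morning.",
  { expected: ["dock 3 is closed"], stale: ["dock 3 was open"], prohibited: ["sk-test"] },
);
console.log(result.correct, result.staleHits);
```

Here the expected fact is present, but the stale note is retrieved alongside it, so the sketch marks the answer incorrect and reports one stale hit. Whether the PR counts a stale hit against accuracy is not stated in this thread.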

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 Hop, hop through the fleet's busy halls,
Three backends compete down memory walls,
The digest beats all — stale facts swept away,
No token leaks out, secrets kept at bay,
The rabbit runs benchmarks to measure the score,
50 iterations, then asks for more! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title '[BOUNTY #639] Add robotics fleet memory benchmark' clearly summarizes the main change: adding a new robotics fleet memory benchmark to the examples directory. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs`:
- Around lines 282-283: The iterations assignment at lines 282-283 and the
similar assignment at lines 369-370 do not validate that the resulting value
is a positive integer, so invalid values such as NaN, 0, negative numbers, or
non-integers can reach the benchmark loops. Add validation after each assignment
that checks iterations is a positive integer (an integer greater than 0), and
throw an error with a clear message if the check fails. The benchmark then
fails fast with meaningful feedback instead of producing misleading metrics.
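
A minimal sketch of the kind of check this comment asks for, assuming the raw value arrives as a string from an environment variable or CLI flag. The helper name `parseIterations`, the error wording, and the fallback of 50 (borrowed from the `.env.example` value mentioned in this review) are assumptions; the fix in 1adec47 may look different.

```javascript
// Hypothetical helper: converts a raw string to a positive integer or throws.
function parseIterations(raw, fallback = 50) {
  // Treat an unset or empty value as "use the fallback".
  const value = raw === undefined || raw === "" ? fallback : Number(raw);
  // Number.isInteger rejects NaN, Infinity, and fractional values in one check.
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`iterations must be a positive integer, received: ${String(raw)}`);
  }
  return value;
}

console.log(parseIterations("50")); // 50
console.log(parseIterations(undefined)); // 50

// Each of these inputs is rejected with a descriptive error.
for (const bad of ["abc", "0", "-3", "2.5"]) {
  try {
    parseIterations(bad);
  } catch (err) {
    console.log(err.message);
  }
}
```

In a runner, the call might be `parseIterations(process.env.BENCHMARK_ITERATIONS)`, applied once so both code paths named in the comment share the same check.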

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8352f2f6-a854-495e-8dc4-2bf562d9216c

📥 Commits

Reviewing files that changed from the base of the PR and between 262db90 and c6bd323.

📒 Files selected for processing (8)
  • examples/benchmarks/robotics-fleet-memory/.env.example
  • examples/benchmarks/robotics-fleet-memory/README.md
  • examples/benchmarks/robotics-fleet-memory/dataset/robotics_fleet_sessions.json
  • examples/benchmarks/robotics-fleet-memory/package.json
  • examples/benchmarks/robotics-fleet-memory/results/sample_results.json
  • examples/benchmarks/robotics-fleet-memory/results/sample_results.md
  • examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs
  • examples/benchmarks/robotics-fleet-memory/test_benchmark.mjs

Comment thread examples/benchmarks/robotics-fleet-memory/run_benchmark.mjs Outdated