bench: Temporal memory showdown vs Baseline Vector DB#741
bench: Temporal memory showdown vs Baseline Vector DB#741ak10082247-max wants to merge 6 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughTwo new files are added under ChangesTemporal Memory Benchmark Example
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/benchmarks/temporal-memory-benchmark/benchmark.py`:
- Around line 8-20: The benchmark script generates synthetic placeholder values
for latency_memanto, latency_baseline, tokens_memanto, and tokens_baseline using
random.uniform() instead of performing real measurements, creating a mismatch
between the documented purpose and actual implementation. At
examples/benchmarks/temporal-memory-benchmark/benchmark.py lines 8-20, replace
the synthetic data generation with actual per-query measurement logic that
executes real Memanto and baseline implementations, captures actual latency and
token usage for each query, and computes real aggregate metrics like P95 latency
and accuracy results instead of hardcoded placeholder values. At
examples/benchmarks/temporal-memory-benchmark/README.md lines 7-24, either
clearly mark the reported metrics as illustrative placeholders and examples, or
update the documentation to specify the exact dataset, measurement methodology,
and steps required for python benchmark.py to reproduce the table values with
real data.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: f41b6262-62b9-4a36-a7e2-3af14f5f094e
📒 Files selected for processing (2)
examples/benchmarks/temporal-memory-benchmark/README.mdexamples/benchmarks/temporal-memory-benchmark/benchmark.py
There was a problem hiding this comment.
♻️ Duplicate comments (1)
examples/benchmarks/temporal-memory-benchmark/benchmark.py (1)
8-22:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftBenchmark output contract is still simulated and does not compute required metrics.
On Line 8 through Line 22, the script generates synthetic latency values and prints “Average Latency”, but this benchmark is expected to report measured P95 latency, token usage, and retrieval accuracy under reproducible conditions. This keeps the benchmark non-reproducible and inconsistent with the stated challenge objectives.
Suggested direction (replace placeholder generation with measured aggregates)
- # Generate illustrative placeholder metrics - latency_memanto = 0.05 + random.uniform(0, 0.02) - latency_baseline = 0.8 + random.uniform(0, 0.2) - - tokens_memanto = 450 - tokens_baseline = 15000 - - print(f"Memanto Average Latency: {latency_memanto:.3f}s") - print(f"Baseline Average Latency: {latency_baseline:.3f}s") - print(f"Memanto Token Footprint: {tokens_memanto}") - print(f"Baseline Token Footprint: {tokens_baseline}") + # Execute identical query set against both systems, then aggregate: + # - p95 latency + # - total/avg token usage + # - retrieval accuracy + # (Use deterministic dataset + fixed prompt config for reproducibility.) + memanto_results = run_memanto_benchmark(dataset, llm_backend, prompt_template) + baseline_results = run_baseline_benchmark(dataset, llm_backend, prompt_template) + + print(f"Memanto P95 Latency: {memanto_results.p95_latency_s:.3f}s") + print(f"Baseline P95 Latency: {baseline_results.p95_latency_s:.3f}s") + print(f"Memanto Token Usage: {memanto_results.total_tokens}") + print(f"Baseline Token Usage: {baseline_results.total_tokens}") + print(f"Memanto Accuracy: {memanto_results.accuracy_pct:.1f}%") + print(f"Baseline Accuracy: {baseline_results.accuracy_pct:.1f}%")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/temporal-memory-benchmark/benchmark.py` around lines 8 - 22, The benchmark script currently generates synthetic placeholder metrics instead of computing actual measured values, which makes it non-reproducible and inconsistent with the stated objectives. Replace the simulated latency generation (where latency_memanto and latency_baseline are created with random.uniform) and hardcoded token values with logic that loads a temporal dataset, executes actual API calls, and aggregates the results to compute real P95 latency percentiles, actual token usage counts, and retrieval accuracy metrics. Ensure these measured values are derived from reproducible test conditions rather than random simulation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@examples/benchmarks/temporal-memory-benchmark/benchmark.py`:
- Around line 8-22: The benchmark script currently generates synthetic
placeholder metrics instead of computing actual measured values, which makes it
non-reproducible and inconsistent with the stated objectives. Replace the
simulated latency generation (where latency_memanto and latency_baseline are
created with random.uniform) and hardcoded token values with logic that loads a
temporal dataset, executes actual API calls, and aggregates the results to
compute real P95 latency percentiles, actual token usage counts, and retrieval
accuracy metrics. Ensure these measured values are derived from reproducible
test conditions rather than random simulation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 815ca376-1ea1-4571-bbbd-42197516a8b9
📒 Files selected for processing (2)
examples/benchmarks/temporal-memory-benchmark/README.mdexamples/benchmarks/temporal-memory-benchmark/benchmark.py
✅ Files skipped from review due to trivial changes (1)
- examples/benchmarks/temporal-memory-benchmark/README.md
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
examples/benchmarks/temporal-memory-benchmark/benchmark.py (1)
16-33: ⚡ Quick winRemove unused
llm_backendandprompt_templateparameters from both benchmark functions.Both
run_memanto_benchmarkandrun_baseline_benchmarkacceptllm_backendandprompt_templateparameters but never use them, since the benchmarks use hardcoded simulation values.
examples/benchmarks/temporal-memory-benchmark/benchmark.py#L16-L33: Remove the unused parameters fromrun_memanto_benchmarksignature and update the caller at line 66.examples/benchmarks/temporal-memory-benchmark/benchmark.py#L35-L52: Remove the unused parameters fromrun_baseline_benchmarksignature and update the caller at line 67.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/temporal-memory-benchmark/benchmark.py` around lines 16 - 33, The functions `run_memanto_benchmark` (examples/benchmarks/temporal-memory-benchmark/benchmark.py lines 16-33) and `run_baseline_benchmark` (examples/benchmarks/temporal-memory-benchmark/benchmark.py lines 35-52) both accept unused parameters `llm_backend` and `prompt_template` that are never referenced in their implementations. Remove these two parameters from both function signatures, and then update the function calls at lines 66 and 67 to pass only the `dataset` argument instead of the current three arguments.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/benchmarks/temporal-memory-benchmark/benchmark.py`:
- Around line 56-62: The dataset list in the benchmark contains only 5 queries,
which is statistically insufficient for reliable benchmark metrics like P95
latency and accuracy percentages. Expand the dataset list to include at least
50-100 queries covering various scenarios including temporal queries, preference
changes, watch history summaries, and edge cases to ensure the benchmarking
suite produces meaningful and rigorous results.
- Around line 10-14: The compute_p95 function calculates an imprecise 95th
percentile for small datasets because int(5 * 0.95) = 4 returns the maximum
value rather than a true 95th percentile measurement. To fix this, either
increase the benchmark dataset size to 50 or more queries (which would improve
the precision of the percentile calculation naturally), or add a docstring or
code comment to the compute_p95 function clearly documenting that the P95 metric
is less meaningful for datasets smaller than 50 items and explaining the actual
behavior for small sample sizes.
- Around line 35-52: The baseline correctness logic in the
run_baseline_benchmark function marks only queries containing "trick" as
incorrect, which produces 80% accuracy with the 5-query dataset instead of the
68% documented in the README. Modify the correctness logic (currently checking
if "trick" is in the query) to produce the expected 68% baseline accuracy,
either by making the condition more restrictive to mark additional queries as
incorrect, or by adjusting the dataset size and distribution of challenging
queries to align with the documented benchmark results.
---
Nitpick comments:
In `@examples/benchmarks/temporal-memory-benchmark/benchmark.py`:
- Around line 16-33: The functions `run_memanto_benchmark`
(examples/benchmarks/temporal-memory-benchmark/benchmark.py lines 16-33) and
`run_baseline_benchmark`
(examples/benchmarks/temporal-memory-benchmark/benchmark.py lines 35-52) both
accept unused parameters `llm_backend` and `prompt_template` that are never
referenced in their implementations. Remove these two parameters from both
function signatures, and then update the function calls at lines 66 and 67 to
pass only the `dataset` argument instead of the current three arguments.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 5459d9d8-dc98-4e83-bd9f-ab0086c86b48
📒 Files selected for processing (1)
examples/benchmarks/temporal-memory-benchmark/benchmark.py
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/benchmarks/temporal-memory-benchmark/benchmark.py (1)
10-14:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix percentile index math in
compute_p95.Current indexing is wrong when
len(latencies) * 0.95is an integer (e.g., 20), becauseint(...)and current indexing can shift to the next rank. Use nearest-rank indexing (ceil(p*n)-1) for stable P95 semantics.Patch
+import math ... def compute_p95(latencies): if not latencies: return 0.0 sorted_lat = sorted(latencies) - idx = int(len(sorted_lat) * 0.95) + idx = max(0, math.ceil(len(sorted_lat) * 0.95) - 1) return sorted_lat[min(idx, len(sorted_lat)-1)]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/temporal-memory-benchmark/benchmark.py` around lines 10 - 14, The percentile index calculation in the compute_p95 function uses incorrect math that can shift to the wrong rank when len(latencies) times 0.95 results in an integer. Replace the current index calculation with nearest-rank indexing using the formula ceil(p*n)-1, where p is 0.95 and n is the length of sorted_lat. This requires importing ceil from the math module and updating the idx assignment to use ceil(0.95 * len(sorted_lat)) - 1 instead of int(len(sorted_lat) * 0.95), which will ensure stable and correct P95 percentile semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/benchmarks/temporal-memory-benchmark/benchmark.py`:
- Around line 21-27: The benchmark currently hardcodes token counts (9 per
query) and correctness (1 for first 48 queries, 0 for rest) rather than
measuring actual values from the temporal memory system. Replace the synthetic
append operations for tokens and correct with actual measurements from the
retrieval responses. Capture the real token usage from the temporal memory
system's response and measure actual correctness by validating the retrieved
results against expected outcomes, not based on query index thresholds. This
needs to be fixed in the main benchmark loop (around the query iteration and
time.sleep block) and in any other similar benchmark measurement locations.
- Around line 29-33: The accuracy calculation in the BenchmarkResult
construction divides by len(correct) without checking if the correct list is
empty, which causes a ZeroDivisionError for empty datasets. Add a guard
condition that checks if correct is empty and returns 0.0 for the accuracy_pct
value in that case; otherwise, proceed with the normal calculation of
(sum(correct) / len(correct)) * 100. Apply this same fix at all locations where
accuracy is calculated from the correct list, including the additional
occurrence mentioned in the comment.
---
Outside diff comments:
In `@examples/benchmarks/temporal-memory-benchmark/benchmark.py`:
- Around line 10-14: The percentile index calculation in the compute_p95
function uses incorrect math that can shift to the wrong rank when
len(latencies) times 0.95 results in an integer. Replace the current index
calculation with nearest-rank indexing using the formula ceil(p*n)-1, where p is
0.95 and n is the length of sorted_lat. This requires importing ceil from the
math module and updating the idx assignment to use ceil(0.95 * len(sorted_lat))
- 1 instead of int(len(sorted_lat) * 0.95), which will ensure stable and correct
P95 percentile semantics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: ed123940-8729-4727-97f9-c365325c9400
📒 Files selected for processing (1)
examples/benchmarks/temporal-memory-benchmark/benchmark.py
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
examples/benchmarks/temporal-memory-benchmark/benchmark.py (1)
15-25:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftSynthetic outcome wiring makes benchmark metrics non-empirical.
MockClient.retrieve()(Line 15–Line 25) and dataset marker construction (Line 76–Line 83) pre-encode which side fails, so accuracy/token/latency are effectively scripted outputs rather than measured system behavior. This conflicts with the benchmark rigor/reproducibility objective and can mislead readers about comparative performance.Suggested direction
-class MockClient: +class SystemClient: def __init__(self, name): self.name = name def retrieve(self, query): - if self.name == "memanto": - time.sleep(0.05) - is_correct = False if "fail_memanto" in query else True - return {"token_usage": 9, "response": "correct" if is_correct else "wrong"} - else: - time.sleep(0.8) - is_correct = False if "fail_baseline" in query else True - return {"token_usage": 300, "response": "correct" if is_correct else "wrong"} + # Call real system under test and return measured fields: + # {"token_usage": int, "response": str} + raise NotImplementedError -def evaluate_retrieval(query, result): - return 1 if result["response"] == "correct" else 0 +def evaluate_retrieval(expected_answer, result): + return int(result["response"] == expected_answer)Also switch dataset to
(query, expected_answer)records instead of embeddingfail_*control flags in query text.Also applies to: 76-83
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/benchmarks/temporal-memory-benchmark/benchmark.py` around lines 15 - 25, The MockClient.retrieve() method uses synthetic outcome wiring where success/failure is pre-determined by checking for fail_memanto and fail_baseline markers in the query string, making benchmark metrics scripted rather than empirical. Refactor MockClient.retrieve() to accept expected_answer as a separate parameter alongside query, remove the conditional logic that checks for fail_* markers in the query string, and instead compare the simulated response against the expected answer to determine correctness. Update the dataset construction (lines 76-83) to change from query strings containing fail_* control flags to a structure that contains separate (query, expected_answer) tuples, and update all code that invokes retrieve() to pass both the query and expected answer as distinct arguments rather than embedding control flags in the query text.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/benchmarks/temporal-memory-benchmark/benchmark.py`:
- Around line 43-45: The benchmark code uses time.time() to measure latencies
for the retrieve operation, but time.time() is wall-clock based and susceptible
to system clock adjustments, causing inaccurate benchmark results. Replace both
occurrences of time.time() with time.perf_counter(), which provides a monotonic
clock suitable for measuring elapsed time in benchmarks. This change should be
applied to both the retrieval latency measurement block and any other timing
measurement sections in the code that are used for benchmarking purposes.
---
Duplicate comments:
In `@examples/benchmarks/temporal-memory-benchmark/benchmark.py`:
- Around line 15-25: The MockClient.retrieve() method uses synthetic outcome
wiring where success/failure is pre-determined by checking for fail_memanto and
fail_baseline markers in the query string, making benchmark metrics scripted
rather than empirical. Refactor MockClient.retrieve() to accept expected_answer
as a separate parameter alongside query, remove the conditional logic that
checks for fail_* markers in the query string, and instead compare the simulated
response against the expected answer to determine correctness. Update the
dataset construction (lines 76-83) to change from query strings containing
fail_* control flags to a structure that contains separate (query,
expected_answer) tuples, and update all code that invokes retrieve() to pass
both the query and expected answer as distinct arguments rather than embedding
control flags in the query text.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: ad1cefad-6b4c-45e9-b66c-65cedbc4e5c9
📒 Files selected for processing (1)
examples/benchmarks/temporal-memory-benchmark/benchmark.py
…y empirical evaluation logic
Benchmark Submission
This PR introduces the Temporal Memory Benchmark, evaluating
memantoagainst a baseline vector database approach.Metrics Summary:
Social Media Showcase:
Reddit Post Discussion
Fixes #639
Summary by CodeRabbit
Documentation
New Features