16 changes: 16 additions & 0 deletions benchmarks/__init__.py
@@ -0,0 +1,16 @@
"""
Memory framework adapters for benchmarking.
Each adapter implements the MemoryAdapter interface.
"""

from .base import MemoryAdapter, MemoryResult, BenchmarkMetric
from .memanto_adapter import MemantoAdapter
from .mem0_adapter import Mem0Adapter

__all__ = [
    "MemoryAdapter",
    "MemoryResult",
    "BenchmarkMetric",
    "MemantoAdapter",
    "Mem0Adapter",
]
131 changes: 131 additions & 0 deletions benchmarks/base.py
@@ -0,0 +1,131 @@
"""
Base classes for the memory benchmark framework.
"""

import time
import statistics
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any
Comment on lines +1 to +9


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for Python version requirements in project files
echo "=== Searching for Python version declarations ==="
rg -n "python_requires|requires-python|Programming Language :: Python :: 3\." setup.py pyproject.toml setup.cfg README.md .python-version 2>/dev/null || echo "No version constraints found"

echo -e "\n=== Checking if README documents Python version ==="
rg -n -i "python 3\.(10|11|12)" README.md || echo "Python version not documented in README"

Repository: moorcheh-ai/memanto

Length of output: 474


Document Python 3.10+ requirement in README (pyproject already set)

  • pyproject.toml already declares requires-python = ">=3.10,<4" and includes Python 3.10/3.11/3.12 classifiers.
  • README.md has no explicit mention of Python 3.10/3.11/3.12; add the Python 3.10+ minimum there.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/base.py` around lines 1 - 9, Add a short statement to the README
documenting the project's minimum Python version (Python 3.10+) to match
pyproject.toml's requires-python and classifiers; update the top-level README.md
to explicitly state "Requires Python 3.10 or newer (tested on 3.10/3.11/3.12)"
and, if appropriate, add a short note about compatibility or testing matrix to
keep it consistent with the package metadata.



@dataclass
class MemoryResult:
    """Result of a memory operation including success status, data, and metrics."""
    """Result from a single memory operation."""
    success: bool
    latency_ms: float
    tokens_used: int = 0
    data: Any = None
    error: str | None = None


@dataclass
class BenchmarkMetric:
    """Aggregated benchmark metrics for a set of runs."""
    """Aggregated metrics from a benchmark run."""
    framework: str
    scenario: str
    total_store_calls: int = 0
    total_retrieve_calls: int = 0
    total_store_tokens: int = 0
    total_retrieve_tokens: int = 0
    store_latencies: list[float] = field(default_factory=list)
    retrieve_latencies: list[float] = field(default_factory=list)
    retrieval_scores: list[float] = field(default_factory=list)
    errors: int = 0

    @property
    def store_p95_latency(self) -> float:
        if not self.store_latencies:
            return 0.0
        sorted_l = sorted(self.store_latencies)
        idx = int(len(sorted_l) * 0.95)
        return sorted_l[min(idx, len(sorted_l) - 1)]

    @property
    def retrieve_p95_latency(self) -> float:
        if not self.retrieve_latencies:
            return 0.0
        sorted_l = sorted(self.retrieve_latencies)
        idx = int(len(sorted_l) * 0.95)
        return sorted_l[min(idx, len(sorted_l) - 1)]
Comment on lines +39 to +52


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix off-by-one in p95 latency calculation.

Line 41 and Line 49 currently compute int(n * 0.95), which selects the maximum element for every sample size up to 20 (e.g., n=20 → index 19). For those runs the property reports p100 instead of p95, which skews the published benchmark metrics.

Suggested fix
     @property
     def store_p95_latency(self) -> float:
         if not self.store_latencies:
             return 0.0
         sorted_l = sorted(self.store_latencies)
-        idx = int(len(sorted_l) * 0.95)
+        idx = int((len(sorted_l) - 1) * 0.95)
         return sorted_l[min(idx, len(sorted_l) - 1)]

     @property
     def retrieve_p95_latency(self) -> float:
         if not self.retrieve_latencies:
             return 0.0
         sorted_l = sorted(self.retrieve_latencies)
-        idx = int(len(sorted_l) * 0.95)
+        idx = int((len(sorted_l) - 1) * 0.95)
         return sorted_l[min(idx, len(sorted_l) - 1)]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    def store_p95_latency(self) -> float:
-        if not self.store_latencies:
-            return 0.0
-        sorted_l = sorted(self.store_latencies)
-        idx = int(len(sorted_l) * 0.95)
-        return sorted_l[min(idx, len(sorted_l) - 1)]
-    @property
-    def retrieve_p95_latency(self) -> float:
-        if not self.retrieve_latencies:
-            return 0.0
-        sorted_l = sorted(self.retrieve_latencies)
-        idx = int(len(sorted_l) * 0.95)
-        return sorted_l[min(idx, len(sorted_l) - 1)]
+    def store_p95_latency(self) -> float:
+        if not self.store_latencies:
+            return 0.0
+        sorted_l = sorted(self.store_latencies)
+        idx = int((len(sorted_l) - 1) * 0.95)
+        return sorted_l[min(idx, len(sorted_l) - 1)]
+    @property
+    def retrieve_p95_latency(self) -> float:
+        if not self.retrieve_latencies:
+            return 0.0
+        sorted_l = sorted(self.retrieve_latencies)
+        idx = int((len(sorted_l) - 1) * 0.95)
+        return sorted_l[min(idx, len(sorted_l) - 1)]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/base.py` around lines 37 - 50, The p95 calculation in
store_p95_latency and retrieve_p95_latency uses idx = int(len(sorted_l) * 0.95)
which can select the maximum element for many n (off-by-one); change the index
calculation to use the 1-based percentile mapping: idx = max(0,
math.ceil(len(sorted_l) * 0.95) - 1) (add import math if missing) so the 95th
percentile selects the correct element from sorted_l in both store_p95_latency
and retrieve_p95_latency.
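To make the off-by-one concrete, here is a minimal sketch that compares the three index rules mentioned in this comment on 20 evenly spaced samples. The function names (`p95_truncating`, `p95_scaled`, `p95_nearest_rank`) are hypothetical and exist only for this illustration; they are not part of `benchmarks/base.py`.

```python
import math


def p95_truncating(samples: list[float]) -> float:
    """Rule currently in the diff: idx = int(n * 0.95)."""
    s = sorted(samples)
    idx = int(len(s) * 0.95)
    return s[min(idx, len(s) - 1)]


def p95_scaled(samples: list[float]) -> float:
    """Rule from the suggested fix: idx = int((n - 1) * 0.95)."""
    s = sorted(samples)
    idx = int((len(s) - 1) * 0.95)
    return s[min(idx, len(s) - 1)]


def p95_nearest_rank(samples: list[float]) -> float:
    """Rule from the agent prompt: idx = ceil(n * 0.95) - 1 (zero-based)."""
    s = sorted(samples)
    idx = max(0, math.ceil(len(s) * 0.95) - 1)
    return s[idx]


# 20 samples: 1.0, 2.0, ..., 20.0 (assumed data, for illustration only)
latencies = [float(i) for i in range(1, 21)]

print(p95_truncating(latencies))    # 20.0, the maximum
print(p95_scaled(latencies))        # 19.0
print(p95_nearest_rank(latencies))  # 19.0
```

Both proposed rules pick index 18 for n=20, so either would resolve the issue for this case; they can differ for other sample sizes, so the PR should settle on one rule and use it in both properties.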


    @property
    def mean_retrieval_accuracy(self) -> float:
        if not self.retrieval_scores:
            return 0.0
        return statistics.mean(self.retrieval_scores)

    def to_dict(self) -> dict:
        return {
            "framework": self.framework,
            "scenario": self.scenario,
            "total_store_calls": self.total_store_calls,
            "total_retrieve_calls": self.total_retrieve_calls,
            "total_store_tokens": self.total_store_tokens,
            "total_retrieve_tokens": self.total_retrieve_tokens,
            "store_p95_latency_ms": round(self.store_p95_latency, 2),
            "retrieve_p95_latency_ms": round(self.retrieve_p95_latency, 2),
            "mean_store_latency_ms": round(
                statistics.mean(self.store_latencies), 2
            ) if self.store_latencies else 0,
            "mean_retrieve_latency_ms": round(
                statistics.mean(self.retrieve_latencies), 2
            ) if self.retrieve_latencies else 0,
            "retrieval_accuracy": round(self.mean_retrieval_accuracy, 4),
            "errors": self.errors,
        }


class MemoryAdapter(ABC):
    """Abstract base class for memory framework adapters."""
    """Abstract interface for memory framework adapters."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Framework name."""
        ...

    @abstractmethod
    def setup(self, user_id: str) -> None:
        """Initialize the memory store for a user."""
        ...

    @abstractmethod
    def store(self, content: str, metadata: dict | None = None) -> MemoryResult:
        """Store a memory and return metrics."""
        ...

    @abstractmethod
    def retrieve(self, query: str, limit: int = 5) -> MemoryResult:
        """Retrieve memories matching a query."""
        ...

    @abstractmethod
    def update(self, memory_id: str, content: str) -> MemoryResult:
        """Update an existing memory."""
        ...

    @abstractmethod
    def delete(self, memory_id: str) -> MemoryResult:
        """Delete a memory."""
        ...

    @abstractmethod
    def get_all(self) -> MemoryResult:
        """Get all stored memories."""
        ...

    @abstractmethod
    def cleanup(self) -> None:
        """Clean up resources."""
        ...

    def timed_call(self, fn, *args, **kwargs) -> tuple[float, Any]:
        """Time a function call and return (latency_ms, result)."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = (time.perf_counter() - start) * 1000
        return elapsed, result
86 changes: 86 additions & 0 deletions benchmarks/evaluator.py
@@ -0,0 +1,86 @@
"""
LLM-as-a-Judge evaluator for retrieval accuracy.
"""

import os
from openai import OpenAI


JUDGE_SYSTEM_PROMPT = """You are an expert evaluator for AI memory systems.
You will be given:
1. A QUERY that was used to search a memory system
2. A GOLDEN ANSWER (the ideal/correct response)
3. A set of RETRIEVED MEMORIES from the system

Score the retrieval quality on a scale from 0.0 to 1.0:
- 1.0: Retrieved memories fully contain the golden answer information
- 0.7-0.9: Retrieved memories mostly contain relevant info, minor gaps
- 0.4-0.6: Partial match, some relevant info but significant gaps
- 0.1-0.3: Poor match, mostly irrelevant
- 0.0: Completely irrelevant or no useful information

Respond with ONLY a JSON object: {"score": <float>, "reasoning": "<brief explanation>"}"""


class LLMEvaluator:
    """Evaluates retrieval quality using LLM-as-a-judge with keyword fallback."""
    """Evaluates retrieval accuracy using an LLM judge."""

    def __init__(self, model: str | None = None, api_key: str | None = None):
        key = api_key or os.environ.get("OPENAI_API_KEY", "")
        self.model = model or os.environ.get("JUDGE_MODEL", "gpt-4o")
        self.client = OpenAI(api_key=key) if key else None

    def score_retrieval(
        self,
        query: str,
        golden_answer: str,
        retrieved_memories: list[str],
    ) -> tuple[float, str]:
        """Score a retrieval against a golden answer. Returns (score, reasoning)."""
        if not self.client:
            # Fallback: simple keyword overlap scoring
            return self._keyword_score(golden_answer, retrieved_memories)

        memories_text = "\n---\n".join(
            f"Memory {i+1}: {m}" for i, m in enumerate(retrieved_memories)
        )
        user_prompt = f"""QUERY: {query}

GOLDEN ANSWER: {golden_answer}

RETRIEVED MEMORIES:
{memories_text}"""

        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
                    {"role": "user", "content": user_prompt},
                ],
                temperature=0.0,
                max_tokens=200,
                response_format={"type": "json_object"},
            )
            import json
            content = response.choices[0].message.content
            parsed = json.loads(content)
            return float(parsed.get("score", 0.0)), parsed.get("reasoning", "")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize judge scores to the same [0.0, 1.0] contract as fallback.

Line 68 trusts model output verbatim; if the judge returns out-of-range values, retrieval_accuracy becomes invalid and incomparable to fallback runs.

Suggested fix
-            return float(parsed.get("score", 0.0)), parsed.get("reasoning", "")
+            raw_score = float(parsed.get("score", 0.0))
+            score = max(0.0, min(1.0, raw_score))
+            return score, parsed.get("reasoning", "")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            return float(parsed.get("score", 0.0)), parsed.get("reasoning", "")
+            raw_score = float(parsed.get("score", 0.0))
+            score = max(0.0, min(1.0, raw_score))
+            return score, parsed.get("reasoning", "")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/evaluator.py` at line 68, The returned judge score should be
normalized to the [0.0, 1.0] contract before being used; replace the direct
return of float(parsed.get("score", 0.0)) with a safe conversion that handles
non-numeric values (fall back to 0.0), then clamp the resulting value to the
range 0.0..1.0 (e.g., max(0.0, min(1.0, score))), and return that normalized
score alongside parsed.get("reasoning",""); update the return site where the
tuple is produced (the line using parsed.get("score", 0.0) and
parsed.get("reasoning")) so downstream metrics like retrieval_accuracy remain
valid and comparable to fallback runs.
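The safe conversion plus clamp described above could look roughly like the following sketch. The helper name `normalize_score` is hypothetical and not part of `benchmarks/evaluator.py`; treating NaN as 0.0 is an extra assumption made here, since NaN would otherwise pass through `max`/`min` unpredictably.

```python
import math


def normalize_score(raw: object) -> float:
    """Coerce a judge-provided score to a float in [0.0, 1.0].

    Non-numeric or missing values fall back to 0.0 (assumed policy).
    """
    try:
        score = float(raw)  # accepts ints, floats, and numeric strings
    except (TypeError, ValueError):
        return 0.0
    if math.isnan(score):
        return 0.0
    return max(0.0, min(1.0, score))


print(normalize_score(1.7))     # 1.0
print(normalize_score(-0.2))    # 0.0
print(normalize_score("0.8"))   # 0.8
print(normalize_score(None))    # 0.0
```

At the return site this would be used as `normalize_score(parsed.get("score", 0.0))`. Note that the existing `except Exception` already routes a `float()` failure to the keyword fallback, so whether a malformed score should become 0.0 or trigger the fallback is a design choice for the author.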

        except Exception as e:
            return self._keyword_score(golden_answer, retrieved_memories)

    @staticmethod
    def _keyword_score(
        golden: str, memories: list[str]
    ) -> tuple[float, str]:
        """Fallback keyword-overlap scoring when no LLM judge is available."""
        golden_words = set(golden.lower().split())
        if not golden_words:
            return 0.0, "Empty golden answer"

        all_memory_text = " ".join(memories).lower()
        memory_words = set(all_memory_text.split())
        overlap = golden_words & memory_words
        score = len(overlap) / len(golden_words) if golden_words else 0.0
        return min(score, 1.0), f"Keyword overlap: {len(overlap)}/{len(golden_words)}"
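
For readers comparing fallback runs against judge runs, the standalone sketch below mirrors the `_keyword_score` logic above to show how it behaves on a small input. The function name `keyword_score` and the sample strings are assumptions made for illustration. Because tokens are split on whitespace only, trailing punctuation counts as part of a word.

```python
def keyword_score(golden: str, memories: list[str]) -> tuple[float, str]:
    """Standalone copy of the whitespace-token overlap fallback."""
    golden_words = set(golden.lower().split())
    if not golden_words:
        return 0.0, "Empty golden answer"

    memory_words = set(" ".join(memories).lower().split())
    overlap = golden_words & memory_words
    score = len(overlap) / len(golden_words)
    return score, f"Keyword overlap: {len(overlap)}/{len(golden_words)}"


# "tea." (with the period) does not match "tea", so only 3 of 4 words overlap.
score, reason = keyword_score(
    "Alice likes green tea",
    ["Alice said she likes green tea."],
)
print(score, reason)  # 0.75 Keyword overlap: 3/4
```

Fallback scores are therefore sensitive to punctuation and phrasing in a way the LLM judge is not, which is worth keeping in mind when both kinds of score are averaged into `retrieval_accuracy`.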