Mab runner #135
Open
Jasonya wants to merge 3 commits into
End-to-end eval pipelines for the MemoryAgentBench benchmarks running
against MIRIX via the remote client. All four datasets share a common
runner shape (ingest -> wrap_user_prompt -> task_agent.answer ->
judge -> snapshot), are namespace-isolated per user_id, and produce a
metrics.json that organize_results.py aggregates.
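The shared runner shape can be sketched as a single loop. This is a minimal illustration, not the code in evals/: the ingest, answer, and judge callables, the eval_ user_id prefix, the prompt wording, and the metrics.json keys are all assumptions, and the snapshot step is left out.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_eval(
    samples: Iterable[dict],
    ingest: Callable[[str, str], None],   # (user_id, context) -> None
    answer: Callable[[str, str], str],    # (user_id, prompt) -> prediction
    judge: Callable[[str, str], bool],    # (prediction, gold) -> correct?
    out_dir: Path,
) -> dict:
    """Hypothetical runner loop: ingest -> wrap prompt -> answer -> judge."""
    records = []
    for sample in samples:
        # One user_id per sample keeps each sample's memories in its own namespace.
        user_id = f"eval_{sample['sample_id']}"
        ingest(user_id, sample["context"])
        # Stand-in for wrap_user_prompt; the real wording is not shown in the source.
        prompt = f"Answer from memory only.\nQuestion: {sample['question']}"
        prediction = answer(user_id, prompt)
        records.append({
            "sample_id": sample["sample_id"],
            "prediction": prediction,
            "correct": judge(prediction, sample["answer"]),
        })
    accuracy = sum(r["correct"] for r in records) / len(records) if records else 0.0
    metrics = {"n": len(records), "accuracy": accuracy}
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
    return metrics
```

Passing the three stages in as callables keeps the loop identical across datasets; only the parsing and the judge change per benchmark.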
Runners (one per benchmark)
- evals/run_mab_longmem_eval.sh + evals/longmem_eval.py -- LongMemEval-S,
session-aware ingest preserving occurred_at per session, MAB judge
routed by category.
- evals/run_mab_ruler_eval.sh + evals/ruler_eval.py -- RULER QA1
(SHDOCQA, single-hop) and QA2 (MHDOCQA, multi-hop). Each sample_id is
prefixed with its source (shdocqa_/mhdocqa_) so the two subsets
can share the same Postgres DB.
- evals/run_mab_lru_eval.sh + evals/lru_eval.py -- Long_Range_Understanding
split (infbench_sum_eng_shots2 + detective_qa).
Source-routed judge: mab_summary for infbench summarization,
substring for detective_qa multiple choice.
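A rough sketch of source-routed judging with a substring judge follows. The routing table mirrors the two sources named above, but judge_name_for and the exact normalisation rules (lowercase, strip punctuation, collapse whitespace) are assumptions; the real normalise in evals/llm_judge_substring.py may differ.

```python
import string

# Source -> judge name, taken from the runner description; the dict itself is illustrative.
JUDGE_BY_SOURCE = {
    "infbench_sum_eng_shots2": "mab_summary",
    "detective_qa": "substring",
}


def judge_name_for(source: str) -> str:
    """Pick the judge for a record based on its source field."""
    try:
        return JUDGE_BY_SOURCE[source]
    except KeyError:
        raise ValueError(f"no judge configured for source {source!r}") from None


def normalise(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed rules)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())


def substring_exact_match(prediction: str, gold: str) -> bool:
    """True when the normalised gold answer occurs inside the normalised prediction."""
    gold_norm = normalise(gold)
    # Guard: an empty gold string would otherwise match every prediction.
    return bool(gold_norm) and gold_norm in normalise(prediction)
```

For example, substring_exact_match("It is the Eiffel Tower, in Paris.", "Eiffel Tower") is true. Very short gold answers (such as a single option letter) match almost any prediction under plain substring matching, so a real judge for multiple choice likely needs stricter handling than this sketch.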
Shared eval modules
- evals/_chunking.py -- sentence-aware NLTK punkt + tiktoken
gpt-4o-mini chunker at the MAB-canonical 4096-token budget.
Each parse_* function delegates here so RULER / LRU / LongMemEval
all chunk identically.
- evals/_eval_db.py -- direct-PG helpers (measure_memory_size,
dump_memories) so memlength comes from authoritative PG state
rather than /memory/components, which caps at 50 items per type
and would silently under-report on conversations with thousands
of memory items.
- evals/llm_judge_substring.py -- normalise + substring_exact_match
judge for short-answer benchmarks (RULER, detective).
- evals/llm_judge_mab.py -- MAB-aligned LongMemEval judge
(gpt-4o, 5 category prompts + abstention, yes/no scoring).
- evals/llm_judge_mab_summary.py -- MAB summarization judge for
LRU/infbench (gpt-4o-2024-05-13, three calls per record:
fluency * recall * precision -> F1).
- evals/configs/mab.yaml -- MAB-profile Mirix config (gpt-4.1-mini
answer LLM, gpt-4.1-nano topic extractor).
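The sentence-aware chunking idea can be illustrated with a greedy packer. This is a sketch under stated assumptions, not evals/_chunking.py: a regex splitter stands in for NLTK punkt and a whitespace word count stands in for the tiktoken counter so the example runs without either library; chunk_text and split_sentences are hypothetical names.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Crude stand-in for NLTK punkt: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(
    text: str,
    budget: int = 4096,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `budget` tokens.

    A single sentence longer than the budget becomes its own oversized chunk;
    a production chunker would need to split it further.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for sentence in split_sentences(text):
        cost = count_tokens(sentence)
        # Close the current chunk before it would exceed the budget.
        if current and used + cost > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks
```

To approximate the described setup, count_tokens could be replaced by a tiktoken-based counter (for instance the length of the encoding returned for gpt-4o-mini). Summing per-sentence counts can differ slightly from tokenising the joined chunk, so a strict budget would need a small safety margin.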
Eval orchestration
- evals/organize_results.py -- aggregator with --judge {default,
substring, mab, mab_summary} routing. Carries keypoints through
the judge task tuple so the summarization judge has what it needs.
Surfaces gpt-4-{fluency, recall, precision, f1} in metrics when
mab_summary is used.
- evals/memory_snapshot.py -- pg_dump + neo4j export + results copy
helper invoked by each runner at end-of-job.
- evals/task_agent.py -- thin OpenAI tool-calling agent that does
retrieval via MirixClient.search; defensive empty-key filter
and TypeError guard around the kwarg splat keep the eval running
when the LLM emits a malformed tool call.
- evals/mirix_memory_system.py + evals/task_agent.py -- construct
MirixClient with timeout=1800 so a 4096-token chunk (100-150s
server-side) clears the per-request timeout with ample headroom.
- evals/main_eval.py -- update LoCoMo runner to use dump_memories.
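The defensive handling of malformed tool calls described for task_agent.py might look roughly like this. It is a sketch only: safe_tool_call and the returned dict shape are hypothetical, and the real agent may report errors differently.

```python
import json
from typing import Any, Callable, Dict


def safe_tool_call(tool: Callable[..., Any], raw_arguments: str) -> Dict[str, Any]:
    """Invoke `tool` with LLM-supplied JSON arguments without letting bad input raise."""
    try:
        parsed = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON arguments: {exc}"}
    if not isinstance(parsed, dict):
        return {"ok": False, "error": "tool arguments must be a JSON object"}
    # Empty-key filter: splatting {"": ...} can never bind to a real parameter.
    kwargs = {k: v for k, v in parsed.items() if isinstance(k, str) and k.strip()}
    try:
        return {"ok": True, "result": tool(**kwargs)}
    except TypeError as exc:
        # Unknown or missing keyword arguments surface here as TypeError.
        return {"ok": False, "error": f"bad tool arguments: {exc}"}
```

Returning an error record instead of raising lets the eval loop log the failed call and move on to the next sample. One trade-off of this sketch is that a TypeError raised inside the tool itself is also reported as bad arguments.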
MIRIX core patches
- mirix/database/redis_client.py: add an instance _modules_missing
flag plus _check_modules_missing(exc) helper. Every set_json /
search_* method short-circuits to its empty sentinel ([], False)
once a Redis ResponseError ("unknown command 'JSON.SET' /
'FT.SEARCH'") has been seen. On a stock OSS Redis without
RedisJSON + RediSearch the modules are absent, so every cache
write raised on each memory item and a full SHDOCQA ingest emitted
thousands of ERROR lines. The first such error now logs a single
WARNING and flips the latch; subsequent calls return silently.
The PG-fallback path is functionally unchanged (the managers' if
results: branch still sees [] and falls through to PostgreSQL);
only the log volume changes.
- mirix/functions/function_sets/memory_tools.py: in
semantic_memory_insert, read item["source"] via item.get("source",
"") so an LLM tool call that omits the optional source field
doesn't crash the agent with KeyError mid-ingest. Observed on
RULER ingest where ~5% of semantic items had a missing source.
- mirix/client/remote_client.py: replace the inaccurate comment
next to the per-request timeout in _request. The old text
claimed AsyncClient(timeout=N) was silently overridden by the
wrapped AsyncHTTPTransport; a fake-transport repro showed
RetryTransport propagates request.extensions['timeout']
unchanged. The correct framing is that self.timeout is the
caller-controlled budget the per-request argument reads from.
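The modules-missing latch from the redis_client.py patch can be illustrated as follows. This is a simplified sketch, not the patched class: ResponseError here is a local stand-in for redis.exceptions.ResponseError so the example runs without redis-py, LatchedCache and the client methods json_set / ft_search are hypothetical, and re-raising unrelated errors is a choice made for the sketch.

```python
import logging

logger = logging.getLogger("redis_latch_sketch")


class ResponseError(Exception):
    """Stand-in for redis.exceptions.ResponseError (assumption for this sketch)."""


class LatchedCache:
    def __init__(self, client):
        self._client = client
        self._modules_missing = False  # flips once, never resets for this instance

    def _check_modules_missing(self, exc: Exception) -> bool:
        """Flip the latch (warning once) if `exc` reports an unknown module command."""
        if "unknown command" not in str(exc).lower():
            return False
        if not self._modules_missing:
            logger.warning("Redis modules unavailable, skipping cache: %s", exc)
            self._modules_missing = True
        return True

    def set_json(self, key: str, value: dict) -> bool:
        if self._modules_missing:
            return False  # empty sentinel; the caller falls back to PostgreSQL
        try:
            self._client.json_set(key, value)  # hypothetical client method
            return True
        except ResponseError as exc:
            if self._check_modules_missing(exc):
                return False
            raise

    def search(self, query: str) -> list:
        if self._modules_missing:
            return []  # empty sentinel; `if results:` in the caller is falsy
        try:
            return self._client.ft_search(query)  # hypothetical client method
        except ResponseError as exc:
            if self._check_modules_missing(exc):
                return []
            raise
```

After the first failure, later calls never reach Redis, which is what collapses thousands of ERROR lines into one WARNING while callers still see the same empty results.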
478-line reference covering:
- Eval inventory (every runner, helper, judge, runner shell, config with file:line refs)
- MIRIX core patches (Redis modules-missing latch, defensive dict access in semantic_memory_insert, per-request timeout plumbing in remote_client); all additive, no structural change
- Benchmark results (SHDOCQA char-4096 82% vs token-4096 69% on fixed retrieval; smoke tests across 4 datasets; memory store granularity numbers)
- Bugs and lessons learned (macOS tmp_cleaner /tmp/.venv wipe; users.organization_id NULL killing retrieval silently; Redis user cache hiding the PG fix; chunk-size apples-to-apples confounder; the RetryTransport-timeout claim that was not true; eval crash from empty-string kwarg in LLM tool calls)
- Next steps (push branch, optional full runs, write-through cache invalidation, summarisation-workload chunk-size re-measure)