Mab runner by Jasonya · Pull Request #135 · Mirix-AI/MIRIX

Jasonya · 2026-06-04T07:29:56Z

No description provided.

End-to-end eval pipelines for the MemoryAgentBench benchmarks running against MIRIX via the remote client. All four datasets share a common runner shape (ingest -> wrap_user_prompt -> task_agent.answer -> judge -> snapshot), are namespace-isolated per user_id, and produce a metrics.json that organize_results.py aggregates.

Runners (one per benchmark)
- evals/run_mab_longmem_eval.sh + evals/longmem_eval.py -- LongMemEval-S, session-aware ingest preserving occurred_at per session, MAB judge routed by category.
- evals/run_mab_ruler_eval.sh + evals/ruler_eval.py -- RULER QA1 (SHDOCQA, single-hop) and QA2 (MHDOCQA, multi-hop). Sample_id is prefixed with the source (shdocqa_/mhdocqa_) so the two subsets can share the same Postgres DB.
- evals/run_mab_lru_eval.sh + evals/lru_eval.py -- Long_Range_Understanding split (infbench_sum_eng_shots2 + detective_qa). Source-routed judge: mab_summary for infbench summarization, substring for detective_qa multiple choice.

Shared eval modules
- evals/_chunking.py -- sentence-aware NLTK punkt + tiktoken gpt-4o-mini chunker at the MAB-canonical 4096-token budget. Each parse_* function delegates here so RULER / LRU / LongMemEval all chunk identically.
- evals/_eval_db.py -- direct-PG helpers (measure_memory_size, dump_memories) so memlength comes from authoritative PG state rather than /memory/components, which caps at 50 items per type and would silently under-report on conversations with thousands of memory items.
- evals/llm_judge_substring.py -- normalise + substring_exact_match judge for short-answer benchmarks (RULER, detective).
- evals/llm_judge_mab.py -- MAB-aligned LongMemEval judge (gpt-4o, 5 category prompts + abstention, yes/no scoring).
- evals/llm_judge_mab_summary.py -- MAB summarization judge for LRU/infbench (gpt-4o-2024-05-13, three calls per record: fluency * recall * precision -> F1).
- evals/configs/mab.yaml -- MAB-profile Mirix config (gpt-4.1-mini answer LLM, gpt-4.1-nano topic extractor).
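For reviewers unfamiliar with the substring judge, the following is a minimal sketch of what a normalise + substring_exact_match pair typically looks like. It is not the code in evals/llm_judge_substring.py: the exact normalisation rules (lowercasing, punctuation and article stripping) and the function signatures here are assumptions chosen for illustration.

```python
import re
import string


def normalise(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace.

    Assumed normalisation rules; the real judge may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def substring_exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """Return True when any normalised gold answer occurs in the normalised prediction."""
    pred = normalise(prediction)
    golds = [normalise(g) for g in gold_answers]
    # Skip gold answers that normalise to the empty string: "" is a substring
    # of everything and would count every prediction as correct.
    return any(g in pred for g in golds if g)
```

A short-answer record would then be scored with something like `substring_exact_match(model_answer, record_answers)`, where both names are hypothetical.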
Eval orchestration
- evals/organize_results.py -- aggregator with --judge {default, substring, mab, mab_summary} routing. Carries keypoints through the judge task tuple so the summarization judge has what it needs. Surfaces gpt-4-{fluency, recall, precision, f1} in metrics when mab_summary is used.
- evals/memory_snapshot.py -- pg_dump + neo4j export + results copy helper invoked by each runner at end-of-job.
- evals/task_agent.py -- thin OpenAI tool-calling agent that does retrieval via MirixClient.search; defensive empty-key filter and TypeError guard around the kwarg splat keep the eval running when the LLM emits a malformed tool call.
- evals/mirix_memory_system.py + evals/task_agent.py -- construct MirixClient with timeout=1800 so a 4096-token chunk (100-150s server-side) clears the per-request timeout with ample headroom.
- evals/main_eval.py -- update LoCoMo runner to use dump_memories.
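The empty-key filter and TypeError guard described for evals/task_agent.py can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's code: `call_tool_safely` and the toy `search` function are hypothetical names, and returning an error dict on failure is an assumed recovery policy.

```python
from typing import Any, Callable


def call_tool_safely(tool: Callable[..., Any], raw_args: dict[str, Any]) -> Any:
    """Invoke tool(**raw_args) while tolerating malformed LLM tool-call arguments."""
    # Drop empty or whitespace-only keys, which cannot be valid keyword names.
    kwargs = {k: v for k, v in raw_args.items() if isinstance(k, str) and k.strip()}
    try:
        return tool(**kwargs)
    except TypeError as exc:
        # Unknown or missing keyword arguments land here. Note this also
        # catches a TypeError raised inside the tool body itself.
        return {"error": f"malformed tool call: {exc}"}


def search(query: str, limit: int = 5) -> list[str]:
    """Toy stand-in for a retrieval call (hypothetical)."""
    return [f"hit for {query!r}"][:limit]
```

With this shape, `call_tool_safely(search, {"query": "x", "": "junk"})` still performs the search, and a call with an unknown argument yields an error payload the agent loop can report back to the LLM instead of crashing the run.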

- mirix/database/redis_client.py: add an instance _modules_missing flag plus _check_modules_missing(exc) helper. Every set_json / search_* method short-circuits to its empty sentinel ([], False) once a Redis ResponseError ("unknown command 'JSON.SET' / 'FT.SEARCH'") has been seen. On a stock OSS Redis without RedisJSON + RediSearch the modules are absent and every cache write raised on each memory item; a full SHDOCQA ingest emitted thousands of ERROR lines. The first such error now logs a single WARNING and flips the latch; subsequent calls return silently. Functionally equivalent for the PG-fallback path (managers' if results: branch still sees [] and falls through to PostgreSQL), only the log volume changes.
- mirix/functions/function_sets/memory_tools.py: in semantic_memory_insert, read item["source"] via item.get("source", "") so an LLM tool call that omits the optional source field doesn't crash the agent with KeyError mid-ingest. Observed on RULER ingest where ~5% of semantic items had a missing source.
- mirix/client/remote_client.py: replace the inaccurate comment next to the per-request timeout in _request. The old text claimed AsyncClient(timeout=N) was silently overridden by the wrapped AsyncHTTPTransport; a fake-transport repro showed RetryTransport propagates request.extensions['timeout'] unchanged. The correct framing is that self.timeout is the caller-controlled budget the per-request argument reads from.
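The modules-missing latch in the first bullet can be illustrated with a small self-contained sketch. It borrows the names _modules_missing and _check_modules_missing from the description above, but everything else is an assumption: `LatchedCache`, the fake `StockRedis` backend, the local `ResponseError` stand-in (used so the sketch runs without redis-py), and the choice to re-raise unrelated errors.

```python
import logging

logger = logging.getLogger("redis_latch_sketch")


class ResponseError(Exception):
    """Local stand-in for the Redis client's ResponseError."""


class LatchedCache:
    def __init__(self, backend):
        self._backend = backend
        self._modules_missing = False  # flips once, never resets

    def _check_modules_missing(self, exc: Exception) -> bool:
        msg = str(exc).lower()
        if "unknown command" in msg and ("json.set" in msg or "ft.search" in msg):
            if not self._modules_missing:
                # Log once, then stay silent on subsequent calls.
                logger.warning("Redis modules missing, cache disabled: %s", exc)
            self._modules_missing = True
            return True
        return False

    def set_json(self, key, value) -> bool:
        if self._modules_missing:
            return False  # empty sentinel for writes
        try:
            self._backend.json_set(key, value)
            return True
        except ResponseError as exc:
            if self._check_modules_missing(exc):
                return False
            raise  # assumed policy: unrelated errors still surface

    def search(self, query) -> list:
        if self._modules_missing:
            return []  # empty sentinel, callers fall back to PostgreSQL
        try:
            return self._backend.ft_search(query)
        except ResponseError as exc:
            if self._check_modules_missing(exc):
                return []
            raise


class StockRedis:
    """Fake backend mimicking a Redis server without RedisJSON / RediSearch."""

    def __init__(self):
        self.calls = 0

    def json_set(self, key, value):
        self.calls += 1
        raise ResponseError("unknown command 'JSON.SET'")

    def ft_search(self, query):
        self.calls += 1
        raise ResponseError("unknown command 'FT.SEARCH'")
```

After the first failed call the backend is never contacted again, which is the property that removes the per-item ERROR lines during ingest.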

478-line reference covering:
- Eval inventory (every runner, helper, judge, runner shell, config with file:line refs)
- MIRIX core patches (Redis modules-missing latch, defensive dict access in semantic_memory_insert, per-request timeout plumbing in remote_client) - all additive, no structural change
- Benchmark results (SHDOCQA char-4096 82% vs token-4096 69% on fixed retrieval; smoke tests across 4 datasets; memory store granularity numbers)
- Bugs and lessons learned (macOS tmp_cleaner /tmp/.venv wipe; users.organization_id NULL killing retrieval silently; Redis user cache hiding the PG fix; chunk-size apples-to-apples confounder; the RetryTransport-timeout claim that was not true; eval crash from empty-string kwarg in LLM tool calls)
- Next steps (push branch, optional full runs, write-through cache invalidation, summarisation-workload chunk-size re-measure)

Jasonya added 3 commits June 4, 2026 14:49

Jasonya requested a review from wangyu-ustc June 4, 2026 07:29

Jasonya wants to merge 3 commits into main from mab_runner
