Mab runner #135

Open
Jasonya wants to merge 3 commits into main from mab_runner
@Jasonya Jasonya commented Jun 4, 2026

Collaborator

No description provided.

Jasonya added 3 commits June 4, 2026 14:49
End-to-end eval pipelines for the MemoryAgentBench benchmarks running
against MIRIX via the remote client. All four datasets share a common
runner shape (ingest -> wrap_user_prompt -> task_agent.answer ->
judge -> snapshot), are namespace-isolated per user_id, and produce a
metrics.json that organize_results.py aggregates.

Runners (one per benchmark)
- evals/run_mab_longmem_eval.sh + evals/longmem_eval.py -- LongMemEval-S,
  session-aware ingest preserving occurred_at per session, MAB judge
  routed by category.
- evals/run_mab_ruler_eval.sh + evals/ruler_eval.py -- RULER QA1
  (SHDOCQA, single-hop) and QA2 (MHDOCQA, multi-hop). Sample_id is
  prefixed with the source (shdocqa_/mhdocqa_) so the two subsets
  can share the same Postgres DB.
- evals/run_mab_lru_eval.sh + evals/lru_eval.py --
  Long_Range_Understanding split (infbench_sum_eng_shots2 + detective_qa).
  Source-routed judge: mab_summary for infbench summarization,
  substring for detective_qa multiple choice.
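
A minimal sketch of the two routing ideas in the runner list above: prefixing sample ids with their source so subsets can share one database, and picking a judge by source. The function names and the lookup table are hypothetical; only the prefixes and judge names come from the commit message.

```python
# Hypothetical helpers; the real logic lives in evals/ruler_eval.py
# and evals/lru_eval.py and may be organised differently.

# Source name -> judge name, as described for the LRU runner.
JUDGE_BY_SOURCE = {
    "infbench_sum_eng_shots2": "mab_summary",
    "detective_qa": "substring",
}


def prefixed_sample_id(source: str, raw_id: str) -> str:
    """Prefix the id with its lower-cased source so subsets cannot collide."""
    return f"{source.lower()}_{raw_id}"


def judge_for(source: str) -> str:
    """Return the judge configured for a source, failing loudly if unknown."""
    try:
        return JUDGE_BY_SOURCE[source]
    except KeyError:
        raise ValueError(f"no judge configured for source {source!r}") from None


print(prefixed_sample_id("SHDOCQA", "17"))
print(judge_for("detective_qa"))
```

Failing on an unknown source, rather than falling back to a default judge, is a design assumption here: a silently misrouted judge would skew metrics without any visible error.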

Shared eval modules
- evals/_chunking.py -- sentence-aware NLTK punkt + tiktoken
  gpt-4o-mini chunker at the MAB-canonical 4096-token budget.
  Each parse_* function delegates here so RULER / LRU / LongMemEval
  all chunk identically.
- evals/_eval_db.py -- direct-PG helpers (measure_memory_size,
  dump_memories) so memlength comes from authoritative PG state
  rather than /memory/components, which caps at 50 items per type
  and would silently under-report on conversations with thousands
  of memory items.
- evals/llm_judge_substring.py -- normalise + substring_exact_match
  judge for short-answer benchmarks (RULER, detective).
- evals/llm_judge_mab.py -- MAB-aligned LongMemEval judge
  (gpt-4o, 5 category prompts + abstention, yes/no scoring).
- evals/llm_judge_mab_summary.py -- MAB summarization judge for
  LRU/infbench (gpt-4o-2024-05-13, three calls per record:
  fluency * recall * precision -> F1).
- evals/configs/mab.yaml -- MAB-profile Mirix config (gpt-4.1-mini
  answer LLM, gpt-4.1-nano topic extractor).
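
The sentence-aware chunking described for evals/_chunking.py can be sketched as greedy packing of whole sentences under a token budget. This is an illustration, not the module's code: the regex splitter stands in for NLTK punkt, the whitespace counter stands in for a tiktoken encoder, and all function names are hypothetical.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Stand-in for NLTK punkt: split after '.', '!' or '?' plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_tokens(text: str) -> int:
    # Stand-in for a tiktoken encoder: count whitespace-delimited words.
    return len(text.split())


def chunk_text(
    text: str,
    budget: int = 4096,
    count: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `budget` tokens.

    A single sentence longer than the budget becomes its own oversized
    chunk (an assumption; the real chunker may split it further).
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for sentence in split_sentences(text):
        n = count(sentence)
        # Close the current chunk before it would overflow the budget.
        if current and used + n > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_text("One two three. Four five. Six seven eight nine.", budget=5))
```

With a real subword tokenizer the count of a joined chunk can differ slightly from the sum of per-sentence counts, so a production version would likely re-check the joined text against the budget.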

Eval orchestration
- evals/organize_results.py -- aggregator with --judge {default,
  substring, mab, mab_summary} routing. Carries keypoints through
  the judge task tuple so the summarization judge has what it needs.
  Surfaces gpt-4-{fluency, recall, precision, f1} in metrics when
  mab_summary is used.
- evals/memory_snapshot.py -- pg_dump + neo4j export + results copy
  helper invoked by each runner at end-of-job.
- evals/task_agent.py -- thin OpenAI tool-calling agent that does
  retrieval via MirixClient.search; defensive empty-key filter
  and TypeError guard around the kwarg splat keep the eval running
  when the LLM emits a malformed tool call.
- evals/mirix_memory_system.py + evals/task_agent.py -- construct
  MirixClient with timeout=1800 so a 4096-token chunk (100-150s
  server-side) clears the per-request timeout with ample headroom.
- evals/main_eval.py -- update LoCoMo runner to use dump_memories.
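
The defensive handling described for evals/task_agent.py might look roughly like the sketch below. `safe_tool_call` and `fake_search` are hypothetical names, and the JSON-decoding guard is an added assumption; the commit message only mentions the empty-key filter and the TypeError guard around the kwarg splat.

```python
import json
from typing import Any, Callable, List


def safe_tool_call(search: Callable[..., List[Any]], raw_arguments: str) -> List[Any]:
    """Invoke `search(**kwargs)` from an LLM tool call without crashing the eval."""
    try:
        kwargs = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError:
        return []  # assumption: unparseable arguments count as "no results"
    if not isinstance(kwargs, dict):
        return []
    # Empty-key filter: drop entries such as {"": "x"} that cannot be kwargs.
    kwargs = {k: v for k, v in kwargs.items() if isinstance(k, str) and k.strip()}
    try:
        return search(**kwargs)
    except TypeError:
        # Missing or unexpected keyword in a malformed tool call.
        return []


def fake_search(query: str, limit: int = 5) -> List[str]:
    # Stand-in for MirixClient.search, used only to exercise the guard.
    return [f"{query}:{limit}"]


print(safe_tool_call(fake_search, '{"query": "cats", "": "x"}'))
print(safe_tool_call(fake_search, '{"bogus": 1}'))
```

One trade-off of catching TypeError this broadly is that a genuine TypeError raised inside the search function is also swallowed; logging the exception before returning would keep such bugs visible.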
- mirix/database/redis_client.py: add an instance-level
  _modules_missing flag plus a _check_modules_missing(exc) helper.
  Every set_json / search_* method short-circuits to its empty
  sentinel ([], False) once a Redis ResponseError ("unknown command
  'JSON.SET' / 'FT.SEARCH'") has been seen. On a stock OSS Redis
  without RedisJSON + RediSearch the modules are absent, so every
  cache write raised for each memory item, and a full SHDOCQA ingest
  emitted thousands of ERROR lines. The first such error now logs a
  single WARNING and flips the latch; subsequent calls return
  silently. The PG-fallback path is functionally unchanged (the
  managers' if results: branch still sees [] and falls through to
  PostgreSQL); only the log volume changes.
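
A minimal sketch of the latch pattern described above, written so it runs without redis-py. `ResponseError` here is a local stand-in for redis.exceptions.ResponseError, and `LatchedCache`, `StockRedis`, `json_set`, and `ft_search` are hypothetical names; only `_modules_missing` and `_check_modules_missing` come from the commit message.

```python
import logging

logger = logging.getLogger("redis_latch_sketch")


class ResponseError(Exception):
    """Local stand-in for redis.exceptions.ResponseError."""


class LatchedCache:
    def __init__(self, client):
        self._client = client
        self._modules_missing = False  # flipped once, never reset

    def _check_modules_missing(self, exc: Exception) -> bool:
        """Return True (and latch) if exc means the Redis module is absent."""
        if "unknown command" not in str(exc).lower():
            return False
        if not self._modules_missing:
            logger.warning("Redis modules missing, cache disabled: %s", exc)
            self._modules_missing = True
        return True

    def set_json(self, key, value) -> bool:
        if self._modules_missing:
            return False  # short-circuit: no round trip, no log line
        try:
            self._client.json_set(key, value)  # hypothetical client method
            return True
        except ResponseError as exc:
            if self._check_modules_missing(exc):
                return False
            raise

    def search(self, query) -> list:
        if self._modules_missing:
            return []  # callers fall through to PostgreSQL on []
        try:
            return self._client.ft_search(query)  # hypothetical client method
        except ResponseError as exc:
            if self._check_modules_missing(exc):
                return []
            raise


class StockRedis:
    """Fake client that behaves like Redis without RedisJSON / RediSearch."""

    def __init__(self):
        self.calls = 0

    def json_set(self, key, value):
        self.calls += 1
        raise ResponseError("unknown command 'JSON.SET'")

    def ft_search(self, query):
        self.calls += 1
        raise ResponseError("unknown command 'FT.SEARCH'")


cache = LatchedCache(StockRedis())
cache.set_json("a", {})
cache.search("anything")
print(cache._client.calls)
```

Other ResponseErrors are re-raised in this sketch so that unrelated Redis failures stay visible; whether the real helper does the same is not stated in the commit message.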

- mirix/functions/function_sets/memory_tools.py: in
  semantic_memory_insert, read item["source"] via item.get("source",
  "") so an LLM tool call that omits the optional source field
  doesn't crash the agent with KeyError mid-ingest. Observed on
  RULER ingest where ~5% of semantic items had a missing source.
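
The fix above amounts to reading the optional field with a default instead of indexing it. A small sketch, assuming hypothetical `name` and `summary` fields alongside `source` (the real item schema in semantic_memory_insert is not shown in this PR description):

```python
from typing import Dict, List


def normalise_semantic_items(items: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Copy items, defaulting the optional 'source' field to an empty string."""
    normalised = []
    for item in items:
        normalised.append(
            {
                "name": item["name"],        # hypothetical required field
                "summary": item["summary"],  # hypothetical required field
                # item["source"] would raise KeyError when the LLM omits it.
                "source": item.get("source", ""),
            }
        )
    return normalised


rows = normalise_semantic_items(
    [
        {"name": "a", "summary": "first", "source": "doc-1"},
        {"name": "b", "summary": "second"},  # source omitted by the LLM
    ]
)
print(rows[1]["source"] == "")
```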

- mirix/client/remote_client.py: replace the inaccurate comment
  next to the per-request timeout in _request. The old text
  claimed AsyncClient(timeout=N) was silently overridden by the
  wrapped AsyncHTTPTransport; a fake-transport repro showed
  RetryTransport propagates request.extensions['timeout']
  unchanged. The correct framing is that self.timeout is the
  caller-controlled budget the per-request argument reads from.

478-line reference covering:

- Eval inventory (every runner, runner shell script, helper, judge,
  and config, with file:line refs)
- MIRIX core patches (Redis modules-missing latch, defensive dict
  access in semantic_memory_insert, per-request timeout plumbing in
  remote_client) - all additive, no structural change
- Benchmark results (SHDOCQA char-4096 82% vs token-4096 69% on
  fixed retrieval; smoke tests across 4 datasets; memory store
  granularity numbers)
- Bugs and lessons learned (macOS tmp_cleaner /tmp/.venv wipe;
  users.organization_id NULL killing retrieval silently; Redis user
  cache hiding the PG fix; chunk-size apples-to-apples confounder;
  the RetryTransport-timeout claim that was not true; eval crash
  from empty-string kwarg in LLM tool calls)
- Next steps (push branch, optional full runs, write-through cache
  invalidation, summarisation-workload chunk-size re-measure)
@Jasonya Jasonya requested a review from wangyu-ustc June 4, 2026 07:29