Skip to content

fix(adapter-backends): game scenarios trainable/servable + meaningful in-training assessment#1061

Merged
jayscambler merged 3 commits into
mainfrom
demo/recursive-loop
Jun 9, 2026
Merged

fix(adapter-backends): game scenarios trainable/servable + meaningful in-training assessment#1061
jayscambler merged 3 commits into
mainfrom
demo/recursive-loop

Conversation

@jayscambler

Copy link
Copy Markdown
Contributor

What

A live end-to-end recursive-loop demo on local MLX (train → publish → auto-resolve → serve) surfaced two real gaps on the mlxlm/opd adapter path, both of which silently broke game scenarios (most of SCENARIO_REGISTRY). Both fixed here, with the demo + a writeup.

Fix 1 — game scenarios produced an empty task prompt

ScenarioInterface (game) scenarios expose describe_rules / describe_strategy_interface / describe_evaluation_criteria but no get_task_prompt or description. resolve_scenario_context only checked the latter two, so it returned "" — every game scenario trained and served on an empty prompt. It now composes the describe_* methods into a task instruction.

Fix 2 — the in-training assessment fed the model a raw prompt

_assess_mlxlm passed the bare task string to generate(), but mlx-lm's LoRA trainer and the serving path (MLXLMProvider.format_mlxlm_prompt) both apply the instruct chat template. An instruct model given a raw prompt emits prose, not verifier-scorable JSON, so the in-training avg_score read ~0 even for a good adapter. Extracted format_assess_prompt (chat template + optional quality prefix, with a raw fallback for non-instruct tokenizers) and use it in assessment — so train / assess / serve all format identically.

Live result

grid_ctf, base Qwen2.5-0.5B-Instruct-4bit, 8 verifier-scored samples per stage:

Stage Mean verifier score Valid JSON
run N — base model as agent 0.5809 75%
run N+1 — auto-served LoRA adapter 0.8241 100%
delta +0.2432 (+41.9%)

Whole loop (train → publish → auto-resolve → serve → re-measure) in 43s. After Fix 2 the in-training assessment (0.857) agrees with the independent served-adapter measurement (0.824). The serving run is given no model path — the adapter is resolved purely from the registry the training run published to, via the bridge (_resolve_local_recordplan_local_clientMLXLMClient).

Added

  • scripts/demo_recursive_loop.py — self-contained, runnable end-to-end loop.
  • docs/case-study-recursive-loop.md — the writeup.

Tests (CI-safe, no mlx)

resolve_scenario_context: describe_* composition, get_task_prompt/description precedence, empty fallback. format_assess_prompt: chat-template application, quality-prefix when score-conditioned, raw fallback when the tokenizer has no chat template. Full touched-area regression green; ruff + mypy clean.

…ngful in-training assessment

Surfaced by a live recursive-loop demo on local MLX (train -> publish -> auto-resolve ->
serve). Two real gaps on the mlxlm/opd adapter path, both blocking game scenarios:

1. resolve_scenario_context returned '' for ScenarioInterface (game) scenarios -- they
   expose describe_rules/strategy_interface/evaluation_criteria but no get_task_prompt or
   description, so every game scenario trained/served on an EMPTY prompt. Now composes the
   describe_* methods into a task instruction.

2. _assess_mlxlm fed the model a RAW prompt, but mlx-lm's LoRA trainer and the serving path
   (MLXLMProvider.format_mlxlm_prompt) both apply the instruct chat template. An instruct
   model given a raw prompt emits prose, not verifier-scorable JSON, so the in-training
   metric read ~0 even for a good adapter. Extracted format_assess_prompt (chat-template +
   optional quality prefix, raw fallback for non-instruct tokenizers) and use it in assess.

Adds scripts/demo_recursive_loop.py (self-contained end-to-end loop on grid_ctf) and
docs/case-study-recursive-loop.md. Live result: base 0.58 -> auto-served adapter 0.82
(+41.9%) in 43s; in-training assessment (0.857) now agrees with the served measurement.

Tests: resolve_scenario_context describe_* composition + precedence + empty fallback;
format_assess_prompt chat-template + quality-prefix + raw fallback. CI-safe (no mlx).
#1061)

The new case-study-recursive-loop.md was orphaned from the public docs path. Link it
beside the MLX-training / OPD-case-study entries in the root README, docs/README, and
the package README so users discover it. Root README re-applied without the formatter
to avoid mangling the synced whats-new block + Surfaces table.
…tudy trajectory

Take the loop further: scripts/demo_recursive_loop_multigen.py runs the genuine recursive
loop, where each generation trains ONLY on the model's own verifier-curated proposals (no
external data, no human). The served model proposes -> verifier scores -> best-so-far becomes
the next adapter's training set -> publish + auto-activate -> next gen is served by it.

Live result (grid_ctf, Qwen2.5-0.5B, 3 generations, 33s): mean 0.577 -> 0.619 and best
0.689 -> 0.737, monotonic, valid-JSON 13/20 -> 20/20. The +7.4% is smaller than the single
step's +41.9% and honestly so -- bootstrapping from a weak cold-start distribution is slower
than training on a globally-curated elite; the point is the shape (it compounds on its own
output), not the magnitude. Documented as a new section in the case study.
@jayscambler jayscambler merged commit f296732 into main Jun 9, 2026
33 of 34 checks passed
@jayscambler jayscambler deleted the demo/recursive-loop branch June 9, 2026 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant