fix(adapter-backends): game scenarios trainable/servable + meaningful in-training assessment#1061
Merged
Conversation
…ngful in-training assessment Surfaced by a live recursive-loop demo on local MLX (train -> publish -> auto-resolve -> serve). Two real gaps on the mlxlm/opd adapter path, both blocking game scenarios: 1. resolve_scenario_context returned '' for ScenarioInterface (game) scenarios -- they expose describe_rules/strategy_interface/evaluation_criteria but no get_task_prompt or description, so every game scenario trained/served on an EMPTY prompt. Now composes the describe_* methods into a task instruction. 2. _assess_mlxlm fed the model a RAW prompt, but mlx-lm's LoRA trainer and the serving path (MLXLMProvider.format_mlxlm_prompt) both apply the instruct chat template. An instruct model given a raw prompt emits prose, not verifier-scorable JSON, so the in-training metric read ~0 even for a good adapter. Extracted format_assess_prompt (chat-template + optional quality prefix, raw fallback for non-instruct tokenizers) and use it in assess. Adds scripts/demo_recursive_loop.py (self-contained end-to-end loop on grid_ctf) and docs/case-study-recursive-loop.md. Live result: base 0.58 -> auto-served adapter 0.82 (+41.9%) in 43s; in-training assessment (0.857) now agrees with the served measurement. Tests: resolve_scenario_context describe_* composition + precedence + empty fallback; format_assess_prompt chat-template + quality-prefix + raw fallback. CI-safe (no mlx).
#1061) The new case-study-recursive-loop.md was orphaned from the public docs path. Link it beside the MLX-training / OPD-case-study entries in the root README, docs/README, and the package README so users discover it. Root README re-applied without the formatter to avoid mangling the synced whats-new block + Surfaces table.
…tudy trajectory Take the loop further: scripts/demo_recursive_loop_multigen.py runs the genuine recursive loop, where each generation trains ONLY on the model's own verifier-curated proposals (no external data, no human). The served model proposes -> verifier scores -> best-so-far becomes the next adapter's training set -> publish + auto-activate -> next gen is served by it. Live result (grid_ctf, Qwen2.5-0.5B, 3 generations, 33s): mean 0.577 -> 0.619 and best 0.689 -> 0.737, monotonic, valid-JSON 13/20 -> 20/20. The +7.4% is smaller than the single step's +41.9% and honestly so -- bootstrapping from a weak cold-start distribution is slower than training on a globally-curated elite; the point is the shape (it compounds on its own output), not the magnitude. Documented as a new section in the case study.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
A live end-to-end recursive-loop demo on local MLX (train → publish → auto-resolve → serve) surfaced two real gaps on the
mlxlm/opdadapter path, both of which silently broke game scenarios (most ofSCENARIO_REGISTRY). Both fixed here, with the demo + a writeup.Fix 1 — game scenarios produced an empty task prompt
ScenarioInterface(game) scenarios exposedescribe_rules/describe_strategy_interface/describe_evaluation_criteriabut noget_task_promptordescription.resolve_scenario_contextonly checked the latter two, so it returned""— every game scenario trained and served on an empty prompt. It now composes thedescribe_*methods into a task instruction.Fix 2 — the in-training assessment fed the model a raw prompt
_assess_mlxlmpassed the bare task string togenerate(), but mlx-lm's LoRA trainer and the serving path (MLXLMProvider.format_mlxlm_prompt) both apply the instruct chat template. An instruct model given a raw prompt emits prose, not verifier-scorable JSON, so the in-trainingavg_scoreread ~0 even for a good adapter. Extractedformat_assess_prompt(chat template + optional quality prefix, with a raw fallback for non-instruct tokenizers) and use it in assessment — so train / assess / serve all format identically.Live result
grid_ctf, baseQwen2.5-0.5B-Instruct-4bit, 8 verifier-scored samples per stage:Whole loop (train → publish → auto-resolve → serve → re-measure) in 43s. After Fix 2 the in-training assessment (0.857) agrees with the independent served-adapter measurement (0.824). The serving run is given no model path — the adapter is resolved purely from the registry the training run published to, via the bridge (
_resolve_local_record→plan_local_client→MLXLMClient).Added
scripts/demo_recursive_loop.py— self-contained, runnable end-to-end loop.docs/case-study-recursive-loop.md— the writeup.Tests (CI-safe, no mlx)
resolve_scenario_context: describe_* composition, get_task_prompt/description precedence, empty fallback.format_assess_prompt: chat-template application, quality-prefix when score-conditioned, raw fallback when the tokenizer has no chat template. Full touched-area regression green; ruff + mypy clean.