feat(self-improve): generalize ReST-EM loop to mlxlm backend + agent-task scenarios#1064
Merged
Conversation
… scenarios
The self-improving loop (autoctx self-improve) was hardcoded to the from-scratch mlx
backend and game scenarios. Generalize it along both axes:
Backend: the loop now drives either SFT backend -- mlx (from-scratch GPT) or mlxlm (LoRA on
a pretrained base). mlxlm's in-scenario assessment now collects {strategy, score} samples to
a collect_samples_path (the adapter analogue of the mlx collect path), so ReST-EM can keep the
elite and retrain. train.py allows collect_samples_path for mlx + mlxlm and rejects it for the
online-RL / distillation backends (grpo/opd/trl), which have no SFT sample stream.
Scenarios: agent-task scenarios (scored via evaluate_output on free text) are now first-class.
_assess_mlxlm collects the raw text as the sample's strategy, and records_to_completions uses a
string strategy verbatim as the completion (json.dumps would quote/escape it). Game scenarios
(JSON strategies) are unchanged.
CLI: autoctx self-improve gains --backend, --base-model, --fine-tune-type, --num-layers, and
--batch-size (lower it for mlxlm on small seeds: mlx-lm requires the validation split to hold
>= batch_size examples).
Verified live: autoctx self-improve --backend mlxlm on grid_ctf ran 2 rounds, growing the
dataset 24->28 and improving avg_score 0.755 -> 0.878 across rounds.
Tests (CI-safe): records_to_completions verbatim-text vs JSON strategy; loop rejects non-SFT
backends; loop threads backend + adapter params into run_training (mocked); collect_samples_path
rejected for non-SFT backends. Documents the new flags + backend/scenario coverage in mlx-training.md.
…d after scoring (review #1064) [P2] The agent-task assessment called scenario.evaluate_output(output=...) WITHOUT the required state arg, so it raised for AgentTaskInterface scenarios, got swallowed by the broad except, and could report valid_rate>0 while collecting nothing. Fixes: - scenario_task_prompt_and_state resolves the prompt AND the state it was built from; the mlxlm assessment passes that same state into evaluate_output(output, state). - valid is incremented only AFTER scoring succeeds (both mlxlm and the mlx/prepare path), so a scoring error can no longer inflate valid_rate. - the from-scratch mlx path (prepare.py) likewise resolves + passes state. [P2] Docs over-claimed agent-task free-text across both SFT backends. The from-scratch mlx backend emits the structured <|...|> token contract (JSON strategies), so free-text agent tasks fit the mlxlm backend (pretrained instruct model). Narrowed mlx-training.md: both backends handle game/JSON-strategy scenarios; free-text agent tasks use --backend mlxlm. Tests: scenario_task_prompt_and_state returns state for agent tasks / None for games.
jayscambler
added a commit
that referenced
this pull request
Jun 11, 2026
…K) (#1070) Two bugs found live on a GSM8K GRPO run made every reward 0 -> no gradient (RLVR silently learned nothing), on top of the length issue fixed in #1067: 1. GRPO prompts were raw strings, so TRL skipped the chat template -> the instruct model never entered chat mode, never emitted EOS, and every completion ran to max_completion_length (clipped_ratio=1, mean_terminated_length=0) with no parseable answer. build_prompt_dataset_rows now emits CONVERSATIONAL prompts ([{role:user,content}]) so TRL applies the chat template and completions terminate. 2. score_completions required a JSON construction (extract_json_object) for ALL scenarios, so a free-text GSM8K answer scored 0 even when correct. It is now hybrid for agent tasks: pass the extracted JSON when present (back-compat for JSON-construction agent tasks) else the RAW text, and it accepts conversational (message-list) completions. Tests: free-text agent-task scoring (no JSON), conversational completion handling, conversational prompt rows. Both bugs are the same game-vs-agent-task split fixed earlier in the mlxlm assessment path (#1064).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
The self-improving loop (
autoctx self-improve, ReST-EM / expert iteration) was hardcoded to the from-scratchmlxbackend and game scenarios. This generalizes it along both axes — the two follow-ons chosen after closing the recursive loop.Backend:
mlx→{mlx, mlxlm}The loop now drives either SFT backend:
mlx(from-scratch GPT) ormlxlm(LoRA on a pretrained base,--backend mlxlm).mlxlm's in-scenario assessment now collects{strategy, score}samples to acollect_samples_path(the adapter analogue of the existing mlx collect path), so ReST-EM can keep the elite and retrain.train.pypermitscollect_samples_pathformlx+mlxlmand rejects it for the online-RL/distillation backends (grpo/opd/trl), which have no SFT sample stream.Scenarios: game → game + agent-task
Agent-task scenarios (scored via
evaluate_outputon free text) are now first-class._assess_mlxlmcollects the raw text as the sample's strategy, andrecords_to_completionsuses a string strategy verbatim as the completion (json.dumpswould quote/escape it). Game scenarios (JSON strategies) are unchanged.CLI
autoctx self-improvegains--backend,--base-model,--fine-tune-type,--num-layers, and--batch-size(lower it formlxlmon small seeds: mlx-lm requires the validation split to hold ≥ batch_size examples).Verified live
autoctx self-improve --backend mlxlmongrid_ctf(cached Qwen2.5-0.5B), 2 rounds: dataset grew 24→28, per-round avg_score 0.755 → 0.878. The loop collects samples, filters elite, grows the dataset, and retrains the adapter each round — end to end through the CLI.Tests (CI-safe, no mlx)
records_to_completionsverbatim-text vs JSON-strategy; loop rejects non-SFT backends; loop threads backend + adapter params intorun_training(mocked);collect_samples_pathrejected for non-SFT backends. Updated the priorMLX-onlyguard test to the newmlx and mlxlmmessage. Documents the new flags + backend/scenario coverage inmlx-training.md. ruff + mypy clean.