fix(cli-train): drive adapter backends end-to-end (--train-steps wiring + per-backend defaults) #1062
Merged
…ng + per-backend defaults)

The CLI/runner/train.py were already wired for the mlxlm/opd/grpo/trl backends, but two from-scratch-tuned defaults silently broke them when driven via 'autoctx train':

1. --train-steps was never forwarded CLI -> runner subprocess, so every backend trained at train.py's 8-step default. 8 LoRA steps learns ~nothing. Now TrainingConfig carries train_steps, the runner forwards it when set, and train.py resolves an unset (0) sentinel per backend: 8 from-scratch (mlx/cuda), 100 pretrained-adapter (mlxlm/opd/grpo/trl).

2. train.py's default learning_rate=1e-3 is ~10x too high for a LoRA adapter and DIVERGED it to garbage tokens (assessment avg_score=0). It now resolves an unset (0) sentinel per backend: 1e-3 from-scratch, 1e-4 mlxlm, 1e-5 opd/grpo/trl (each backend's own tuned rate).

Also: CLI exposes --train-steps and validates --backend against the known set.

Verified live: 'train.py --backend mlxlm --train-steps 80' now trains a healthy adapter (assessment avg_score 0.865, valid_rate 1.0; was 0.0/garbage at the 1e-3 default).

Tests: runner forwards --train-steps when set / omits when unset; _default_train_steps and _default_learning_rate resolution per backend. Full training-backend + runner + CLI regression green.

Documents the resolved defaults in mlx-training.md.

mlx-training.md documented --learning-rate as an 'autoctx train' flag, but only --train-steps was wired -- the CLI rejected --learning-rate with 'No such option'. Wire it through parallel to --train-steps: TrainingConfig.learning_rate (0 = backend default), runner forwards it when > 0, CLI exposes + validates it. Now 'autoctx train --backend mlxlm --learning-rate 1e-4' works and the documented row is accurate.

Tests: runner forwards --learning-rate when set / omits when unset.
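
The forwarding rule these commits describe (pass a flag to the subprocess only when it is set, otherwise let `train.py` pick its own default) could look roughly like the sketch below. It is illustrative only: `TrainingConfig` here is a hypothetical two-field stand-in for the real class, `build_train_command` is an invented helper name, and the `python train.py` invocation is an assumption about how the runner launches the subprocess.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TrainingConfig:
    """Hypothetical stand-in for the real config; 0 means "unset"."""

    backend: str = "mlx"
    train_steps: int = 0          # 0 = let train.py choose the backend default
    learning_rate: float = 0.0    # 0 = let train.py choose the backend default


def build_train_command(config: TrainingConfig) -> list[str]:
    """Build the subprocess argv, forwarding overrides only when they are set."""
    # Assumed invocation; the real runner may launch train.py differently.
    cmd = ["python", "train.py", "--backend", config.backend]
    if config.train_steps > 0:
        cmd += ["--train-steps", str(config.train_steps)]
    if config.learning_rate > 0:
        cmd += ["--learning-rate", str(config.learning_rate)]
    return cmd


if __name__ == "__main__":
    # Unset values are omitted, so train.py resolves them per backend.
    print(build_train_command(TrainingConfig(backend="mlxlm")))
    # Explicit values are forwarded verbatim.
    print(build_train_command(TrainingConfig(backend="mlxlm", train_steps=80)))
```

Omitting the flag rather than forwarding a literal `0` keeps the subprocess command line identical to the pre-change one whenever the user does not override anything.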
## What

Make the recursive loop fully drivable from the CLI for the pretrained-adapter backends (`mlxlm`/`opd`/`grpo`/`trl`). The CLI, runner, and `train.py` were already wired to accept these backends, but two from-scratch-tuned defaults silently broke them when driven through `autoctx train`:

### 1. `--train-steps` was never forwarded

`TrainingConfig` had no `train_steps` field and the runner never passed `--train-steps` to the subprocess, so every backend trained at `train.py`'s 8-step default. An 8-step LoRA learns essentially nothing. Now:

- `TrainingConfig.train_steps` (default `0` = unset) is forwarded by the runner only when `> 0`.
- `train.py` resolves the `0` sentinel per backend via `_default_train_steps`: 8 for from-scratch (`mlx`/`cuda`), 100 for adapter backends.
- The CLI exposes `--train-steps`.

### 2. `learning_rate=1e-3` diverged LoRA adapters

`train.py`'s default LR is tuned for the from-scratch GPT; it is ~10x too high for a LoRA adapter and diverged it to garbage tokens (in-training assessment `avg_score=0`). Now `train.py` resolves an unset (`0`) LR per backend via `_default_learning_rate`: 1e-3 from-scratch, 1e-4 `mlxlm`, 1e-5 `opd`/`grpo`/`trl` (each backend's own tuned rate).

Also: the CLI validates `--backend` against the known set and rejects negative `--train-steps`.

## Verified live

`train.py --backend mlxlm --train-steps 80` on `grid_ctf` (cached Qwen2.5-0.5B):

- Before (1e-3 default LR): `avg_score=0.0`, `valid_rate=0.0`; the adapter emits `!!!!` garbage.
- After (per-backend default LR): `avg_score=0.8654`, `valid_rate=1.0`, 80 steps, 20s.

## Tests

Runner forwards `--train-steps` when set / omits when unset; `_default_train_steps` and `_default_learning_rate` resolution per backend. Full training-backend + runner + CLI regression green; ruff + mypy clean. Documents the resolved defaults in `mlx-training.md`.

## Note

Both bugs are the same class the recursive-loop demo work keeps surfacing: a from-scratch-tuned default silently breaking the pretrained-adapter path. The sentinel-default pattern (`0` = per-backend default) keeps existing mlx/cuda behavior byte-identical while making the adapter backends correct out of the box.
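
To make the sentinel-default pattern concrete, here is a minimal sketch of how the per-backend resolution might be written. The function names `_default_train_steps` and `_default_learning_rate` and the numeric defaults come from this PR; the function bodies, the backend-set constants, and the `resolve_hyperparameters` helper are assumptions made for illustration, not the code that was merged. The sketch also assumes the known backend set is exactly the six backends named here.

```python
from __future__ import annotations

# Assumption: the known set is exactly the backends named in this PR.
FROM_SCRATCH_BACKENDS = frozenset({"mlx", "cuda"})
ADAPTER_BACKENDS = frozenset({"mlxlm", "opd", "grpo", "trl"})
KNOWN_BACKENDS = FROM_SCRATCH_BACKENDS | ADAPTER_BACKENDS


def _default_train_steps(backend: str) -> int:
    """8 steps for from-scratch backends, 100 for pretrained-adapter backends."""
    return 8 if backend in FROM_SCRATCH_BACKENDS else 100


def _default_learning_rate(backend: str) -> float:
    """1e-3 from-scratch, 1e-4 for mlxlm, 1e-5 for opd/grpo/trl."""
    if backend in FROM_SCRATCH_BACKENDS:
        return 1e-3
    if backend == "mlxlm":
        return 1e-4
    return 1e-5


def resolve_hyperparameters(
    backend: str, train_steps: int = 0, learning_rate: float = 0.0
) -> tuple[int, float]:
    """Hypothetical helper: validate inputs, then replace 0 sentinels with defaults."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {sorted(KNOWN_BACKENDS)}")
    if train_steps < 0:
        raise ValueError("--train-steps must be >= 0")
    steps = train_steps if train_steps > 0 else _default_train_steps(backend)
    lr = learning_rate if learning_rate > 0 else _default_learning_rate(backend)
    return steps, lr


if __name__ == "__main__":
    for name in sorted(KNOWN_BACKENDS):
        print(name, resolve_hyperparameters(name))
```

Because an explicit positive value always wins over the sentinel, a from-scratch run with no overrides still resolves to 8 steps at 1e-3, which is what keeps the existing mlx/cuda behavior unchanged.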