chore: bump trl to 1.6.0 and adapt the worker training stack #75
Merged
trl 0.23.0 mis-read transformers 5.8.1's tuple-returning package probe and tried to import the absent `mergekit`, so the DPO executor failed to import and was skipped at worker startup. trl 1.6.0 fixes this but relocates PPO to `trl.experimental.ppo` and tightens `SFTTrainer`'s model type, so repoint the PPO imports and the `get_reward` reward-dispatch patch, and narrow the LoRA peft model with an isinstance guard. trl 1.6.0 requires datasets>=4.7.0, which moves datasets to 5.0.0. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
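The failure mode in this commit message comes down to Python truthiness: any non-empty tuple is truthy, even when its first element is `False`. The sketch below is only an illustration with hypothetical stand-in functions; it does not reproduce the real transformers probe or trl's code, the `"N/A"` second element is an assumed placeholder, and it makes no claim about how trl 1.6.0 handles the probe internally.

```python
# Hypothetical stand-ins: these are NOT the real transformers or trl functions.
def probe_returning_tuple(package_name: str) -> tuple[bool, str]:
    """Tuple-style probe: reports (is_available, version)."""
    return (False, "N/A")  # assumed placeholder for "not installed"


def probe_returning_bool(package_name: str) -> bool:
    """Bool-style probe: reports availability as a bare bool."""
    return False


def caller_expecting_bool(probe) -> bool:
    """Caller written against the bool contract: tests the result directly."""
    return bool(probe("mergekit"))


def caller_handling_tuple(probe) -> bool:
    """Caller that unpacks a tuple result before testing it."""
    result = probe("mergekit")
    available = result[0] if isinstance(result, tuple) else result
    return bool(available)


# The bool-contract caller is right with a bool probe ...
matches_contract = caller_expecting_bool(probe_returning_bool)
# ... but treats the (False, "N/A") tuple as "available".
misread = caller_expecting_bool(probe_returning_tuple)
# Unpacking the tuple first restores the intended answer.
unpacked = caller_handling_tuple(probe_returning_tuple)
```

Here `misread` is the case that would send a caller down an optional-import branch for a package that is not installed.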
timzsu
requested changes
Jun 17, 2026
timzsu
left a comment
Collaborator
Overall looks good. A minor comment about docstring cleanup.
Collaborator
There seem to be some stale references to previous TRL versions, such as lines 782-784 and 829.
Collaborator
Author
Fixed. Also dropped all unnecessary guards for legacy TRL versions.
With trl pinned to 1.6.0, the multi-version compatibility shims in the training executors are dead code. Collapse the tokenizer/processing_class construction ladders (SFT/DPO/LoRA) and the PPO signature-introspection fallback into direct trainer calls, and drop the eval dataset/dataloader, reward_model, output_hidden_states, and return_dict presets that trl 1.6.0's PPO loop already sets itself per call. Type the PPO data collator as a DataCollatorWithPadding subclass so the call site no longer needs a type-ignore. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
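As a rough illustration of why such shims become dead code once a single version is pinned, the sketch below contrasts a signature-introspection "ladder" with a direct call. Both trainer classes and both helper functions are hypothetical stand-ins, not trl's classes or the executors' actual code; only the `tokenizer` / `processing_class` keyword names come from the commit message.

```python
import inspect


class LegacyTrainer:
    """Hypothetical stand-in for an older trainer that accepts `tokenizer`."""

    def __init__(self, model, tokenizer=None):
        self.model = model
        self.tok = tokenizer


class PinnedTrainer:
    """Hypothetical stand-in for the pinned trainer that accepts `processing_class`."""

    def __init__(self, model, processing_class=None):
        self.model = model
        self.tok = processing_class


def build_with_ladder(trainer_cls, model, tok):
    """Multi-version shim: inspect the constructor and pick the keyword it accepts."""
    params = inspect.signature(trainer_cls.__init__).parameters
    if "processing_class" in params:
        return trainer_cls(model=model, processing_class=tok)
    return trainer_cls(model=model, tokenizer=tok)


def build_direct(model, tok):
    """With one pinned version, the ladder reduces to a single direct call."""
    return PinnedTrainer(model=model, processing_class=tok)
```

With only `PinnedTrainer` ever installed, the `tokenizer` branch of `build_with_ladder` can never run, which is the sense in which the commit calls the shims dead code.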
Purpose
Bumps `trl` from 0.23.0 to 1.6.0 to fix the DPO executor failing to load under transformers 5.8.1. trl 0.23.0 reads transformers 5.8.1's tuple-returning `_is_package_available` as a truthy value, so it enters the `mergekit` import branch and the DPO import chain (`dpo_trainer` → `callbacks` → `mergekit_utils`) raises `ModuleNotFoundError: No module named 'mergekit'`; the worker then skips the executor at startup ("Skipping executor dpo"). trl 1.6.0 handles the probe correctly. The 0.23→1.6 jump is a major bump that relocates PPO and tightens trainer typing, so this PR also adapts the worker training executors. trl 1.6.0 requires `datasets>=4.7.0`, which moves datasets 4.3→5.0.

Changes
- `pyproject.toml`, `uv.lock`, `src/worker/requirements/requirements.txt`: raise `trl>=1.6.0` and `datasets>=4.7.0`, re-lock, and regenerate. datasets resolves to 5.0.0 (the latest satisfying trl's new floor).
- `src/worker/executors/ppo_executor.py`: PPO left the stable trl path, so repoint `PPOConfig`, `PPOTrainer`, and `AutoModelForCausalLMWithValueHead` to `trl.experimental.ppo.*`, and retarget the `_patched_reward_dispatch` monkeypatch to `get_reward` in `trl.experimental` (it moved out of `trl.trainer.utils`), reading the canonical reference from its defining module and patching the trainer namespace via `getattr` as before.
- `src/worker/executors/lora_sft_executor.py`: trl 1.6.0 tightens `SFTTrainer.model` to `str | PreTrainedModel | PeftModel`, so narrow `get_peft_model`'s `PeftModel | PeftMixedModel` return with an isinstance guard (mixed adapters are never used here) rather than passing the broader union.
- `tests/worker/test_ppo_early_stopping.py`: retarget the `PPOTrainer.log` patches to the experimental module.

Design
- `trl.trainer.ppo_*` (≤0.25.1) still has the DPO bug, and every version that fixes DPO (≥0.29) has already moved PPO to `trl.experimental.ppo`. PPO must be repointed regardless, so there is no cheaper intermediate target; 1.6.0 is the furthest-forward option at equal cost. The `PPOTrainer` signature there still matches the executor's existing introspection-based construction, so only the import paths and the reward-dispatch patch change.
- `get_peft_model` only returns a `PeftMixedModel` under `mixed=True`, which this executor never sets, so the value is always a `PeftModel`. A runtime isinstance guard fails fast at the boundary if that assumption ever breaks, whereas a `cast` would silently pass an incompatible model into `SFTTrainer`.
- `datasets>=4.7.0` and the resolver picks the latest (5.0.0). The training executors use only the stable `load_dataset` / `Dataset` core, and the `trust_remote_code` kwarg some pass to `load_dataset` was already an inert passthrough on the pre-bump datasets 4.3.0, so the bump introduces no behavior change there. lxml stays <6, so the documented pip-audit ignores remain valid.

Test Plan
`pre-commit run --all-files` and `uv run pytest tests/` (the GPU-only cleanup test `tests/worker/test_mp_executor_cleanup_gpu.py` requires CUDA and is excluded, as in CI).

Test Result
`pytest tests/` pass.
`DONE` on the rebuilt images: DPO trains with no `mergekit` import (the pre-bump failure is gone), SFT and LoRA SFT complete their loops on datasets 5.0 and the LoRA adapter saves, and PPO runs through `trl.experimental.ppo` with the KL early-stop path firing, exercising both the relocated imports and the `get_reward` reward-dispatch patch.

Pre-submission Checklist
- `pre-commit run --all-files` and fixed any issues.
- `uv run pytest tests/` passes locally.
- `uv sync --all-packages --group ci --frozen`). (No SDK/CLI code changes; dependency floors only.)
- `[BREAKING]` and described migration steps above.
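
The Design argument for an isinstance guard over `cast` can be sketched as follows. This is a minimal illustration, not the executor's code: `PeftModel` and `PeftMixedModel` here are empty hypothetical stand-in classes rather than the real peft types, and both helper functions are invented for the example.

```python
from typing import Union, cast


class PeftModel:
    """Hypothetical stand-in for peft's PeftModel."""


class PeftMixedModel:
    """Hypothetical stand-in for peft's PeftMixedModel."""


def narrow_with_cast(model: Union[PeftModel, PeftMixedModel]) -> PeftModel:
    # cast() only informs the type checker; at runtime it returns its
    # argument unchanged, so a mixed model would pass straight through.
    return cast(PeftModel, model)


def narrow_with_guard(model: Union[PeftModel, PeftMixedModel]) -> PeftModel:
    # The isinstance check runs at runtime and rejects the unexpected variant
    # at the boundary, before it reaches the trainer.
    if not isinstance(model, PeftModel):
        raise TypeError(f"expected PeftModel, got {type(model).__name__}")
    return model
```

Both functions satisfy a static type checker; only the guard turns a broken assumption into an immediate `TypeError`.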