Skip to content

chore: bump trl to 1.6.0 and adapt the worker training stack#75

Merged
kaiitunnz merged 2 commits into
mainfrom
kaiitunnz/chore/bump-trl
Jun 18, 2026
Merged

chore: bump trl to 1.6.0 and adapt the worker training stack#75
kaiitunnz merged 2 commits into
mainfrom
kaiitunnz/chore/bump-trl

Conversation

@kaiitunnz

Copy link
Copy Markdown
Collaborator

Purpose

Bumps trl from 0.23.0 to 1.6.0 to fix the DPO executor failing to load under transformers 5.8.1. trl 0.23.0 reads transformers 5.8.1's tuple-returning _is_package_available as a truthy value, so it enters the mergekit import branch and the DPO import chain (dpo_trainercallbacksmergekit_utils) raises ModuleNotFoundError: No module named 'mergekit'; the worker then skips the executor at startup ("Skipping executor dpo"). trl 1.6.0 handles the probe correctly. The 0.23→1.6 jump is a major bump that relocates PPO and tightens trainer typing, so this PR also adapts the worker training executors. trl 1.6.0 requires datasets>=4.7.0, which moves datasets 4.3→5.0.

Changes

  • Dependency bumppyproject.toml, uv.lock, src/worker/requirements/requirements.txt: raise trl>=1.6.0 and datasets>=4.7.0, re-lock, and regenerate. datasets resolves to 5.0.0 (the latest satisfying trl's new floor).
  • PPO executorsrc/worker/executors/ppo_executor.py: PPO left the stable trl path, so repoint PPOConfig, PPOTrainer, and AutoModelForCausalLMWithValueHead to trl.experimental.ppo.*, and retarget the _patched_reward_dispatch monkeypatch to get_reward in trl.experimental (it moved out of trl.trainer.utils), reading the canonical reference from its defining module and patching the trainer namespace via getattr as before.
  • LoRA SFT executorsrc/worker/executors/lora_sft_executor.py: trl 1.6.0 tightens SFTTrainer.model to str | PreTrainedModel | PeftModel, so narrow get_peft_model's PeftModel | PeftMixedModel return with an isinstance guard (mixed adapters are never used here) rather than passing the broader union.
  • Teststests/worker/test_ppo_early_stopping.py: retarget the PPOTrainer.log patches to the experimental module.

Design

  • Land on the latest 1.6.0, not a conservative 0.x. The mergekit/DPO fix and a stable PPO import path are mutually exclusive: every trl version old enough to keep PPO on trl.trainer.ppo_* (≤0.25.1) still has the DPO bug, and every version that fixes DPO (≥0.29) has already moved PPO to trl.experimental.ppo. PPO must be repointed regardless, so there is no cheaper intermediate target — 1.6.0 is the furthest-forward option at equal cost. The PPOTrainer signature there still matches the executor's existing introspection-based construction, so only the import paths and the reward-dispatch patch change.
  • isinstance guard over a cast for the LoRA model. get_peft_model only returns a PeftMixedModel under mixed=True, which this executor never sets, so the value is always a PeftModel. A runtime isinstance guard fails fast at the boundary if that assumption ever breaks, whereas a cast would silently pass an incompatible model into SFTTrainer.
  • datasets 5.0 is forced, not chosen. trl 1.6.0 requires datasets>=4.7.0 and the resolver picks the latest (5.0.0). The training executors use only the stable load_dataset/Dataset core, and the trust_remote_code kwarg some pass to load_dataset was already an inert passthrough on the pre-bump datasets 4.3.0, so the bump introduces no behavior change there. lxml stays <6, so the documented pip-audit ignores remain valid.

Test Plan

  • pre-commit run --all-files and uv run pytest tests/ (the GPU-only cleanup test tests/worker/test_mp_executor_cleanup_gpu.py requires CUDA and is excluded, as in CI).
  • End-to-end on rebuilt server/worker images: one GPU-capped workflow per trl trainer path — DPO, SFT, LoRA SFT, and PPO on TinyLlama (PPO with a small sentiment reward model) — to confirm each executor loads and completes a real training loop on the new wheels.

Test Result

  • pre-commit and pytest tests/ pass.
  • All four end-to-end training workflows reach DONE on the rebuilt images: DPO trains with no mergekit import (the pre-bump failure is gone), SFT and LoRA SFT complete their loops on datasets 5.0 and the LoRA adapter saves, and PPO runs through trl.experimental.ppo with the KL early-stop path firing — exercising both the relocated imports and the get_reward reward-dispatch patch.

Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker. (No schema/proto changes.)
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen). (No SDK/CLI code changes; dependency floors only.)
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed. (No user-facing behavior change.)

trl 0.23.0 mis-read transformers 5.8.1's tuple-returning package probe and
tried to import the absent `mergekit`, so the DPO executor failed to import and
was skipped at worker startup. trl 1.6.0 fixes this but relocates PPO to
`trl.experimental.ppo` and tightens `SFTTrainer`'s model type, so repoint the
PPO imports and the `get_reward` reward-dispatch patch, and narrow the LoRA
peft model with an isinstance guard. trl 1.6.0 requires datasets>=4.7.0, which
moves datasets to 5.0.0.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

@timzsu timzsu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. A minor comment about docstring cleanup.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be some stale references to previous TRL versions, such as line 782-784 and 829.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Also dropped all unnecessary guards for legacy TRL versions.

With trl pinned to 1.6.0, the multi-version compatibility shims in the
training executors are dead code. Collapse the tokenizer/processing_class
construction ladders (SFT/DPO/LoRA) and the PPO signature-introspection
fallback into direct trainer calls, and drop the eval dataset/dataloader,
reward_model, output_hidden_states, and return_dict presets that trl 1.6.0's
PPO loop already sets itself per call. Type the PPO data collator as a
DataCollatorWithPadding subclass so the call site no longer needs a
type-ignore.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz requested a review from timzsu June 17, 2026 18:00

@timzsu timzsu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@kaiitunnz kaiitunnz merged commit d551a73 into main Jun 18, 2026
11 checks passed
@kaiitunnz kaiitunnz deleted the kaiitunnz/chore/bump-trl branch June 18, 2026 03:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants