chore: bump trl to 1.6.0 and adapt the worker training stack by kaiitunnz · Pull Request #75 · mlsys-io/FlowMesh

kaiitunnz · 2026-06-17T06:46:48Z

Purpose

Bumps trl from 0.23.0 to 1.6.0 to fix the DPO executor failing to load under transformers 5.8.1. trl 0.23.0 reads transformers 5.8.1's tuple-returning _is_package_available as a truthy value, so it enters the mergekit import branch and the DPO import chain (dpo_trainer → callbacks → mergekit_utils) raises ModuleNotFoundError: No module named 'mergekit'; the worker then skips the executor at startup ("Skipping executor dpo"). trl 1.6.0 handles the probe correctly. The 0.23→1.6 jump is a major bump that relocates PPO and tightens trainer typing, so this PR also adapts the worker training executors. trl 1.6.0 requires datasets>=4.7.0, which moves datasets 4.3→5.0.

Changes

Dependency bump — pyproject.toml, uv.lock, src/worker/requirements/requirements.txt: raise trl>=1.6.0 and datasets>=4.7.0, re-lock, and regenerate. datasets resolves to 5.0.0 (the latest satisfying trl's new floor).
PPO executor — src/worker/executors/ppo_executor.py: PPO left the stable trl path, so repoint PPOConfig, PPOTrainer, and AutoModelForCausalLMWithValueHead to trl.experimental.ppo.*, and retarget the _patched_reward_dispatch monkeypatch to get_reward in trl.experimental (it moved out of trl.trainer.utils), reading the canonical reference from its defining module and patching the trainer namespace via getattr as before.
LoRA SFT executor — src/worker/executors/lora_sft_executor.py: trl 1.6.0 tightens SFTTrainer.model to str | PreTrainedModel | PeftModel, so narrow get_peft_model's PeftModel | PeftMixedModel return with an isinstance guard (mixed adapters are never used here) rather than passing the broader union.
Tests — tests/worker/test_ppo_early_stopping.py: retarget the PPOTrainer.log patches to the experimental module.

Design

Land on the latest 1.6.0, not a conservative 0.x. The mergekit/DPO fix and a stable PPO import path are mutually exclusive: every trl version old enough to keep PPO on trl.trainer.ppo_* (≤0.25.1) still has the DPO bug, and every version that fixes DPO (≥0.29) has already moved PPO to trl.experimental.ppo. PPO must be repointed regardless, so there is no cheaper intermediate target — 1.6.0 is the furthest-forward option at equal cost. The PPOTrainer signature there still matches the executor's existing introspection-based construction, so only the import paths and the reward-dispatch patch change.
isinstance guard over a cast for the LoRA model. get_peft_model only returns a PeftMixedModel under mixed=True, which this executor never sets, so the value is always a PeftModel. A runtime isinstance guard fails fast at the boundary if that assumption ever breaks, whereas a cast would silently pass an incompatible model into SFTTrainer.
datasets 5.0 is forced, not chosen. trl 1.6.0 requires datasets>=4.7.0 and the resolver picks the latest (5.0.0). The training executors use only the stable load_dataset/Dataset core, and the trust_remote_code kwarg some pass to load_dataset was already an inert passthrough on the pre-bump datasets 4.3.0, so the bump introduces no behavior change there. lxml stays <6, so the documented pip-audit ignores remain valid.

Test Plan

pre-commit run --all-files and uv run pytest tests/ (the GPU-only cleanup test tests/worker/test_mp_executor_cleanup_gpu.py requires CUDA and is excluded, as in CI).
End-to-end on rebuilt server/worker images: one GPU-capped workflow per trl trainer path — DPO, SFT, LoRA SFT, and PPO on TinyLlama (PPO with a small sentiment reward model) — to confirm each executor loads and completes a real training loop on the new wheels.

Test Result

pre-commit and pytest tests/ pass.
All four end-to-end training workflows reach DONE on the rebuilt images: DPO trains with no mergekit import (the pre-bump failure is gone), SFT and LoRA SFT complete their loops on datasets 5.0 and the LoRA adapter saves, and PPO runs through trl.experimental.ppo with the KL early-stop path firing — exercising both the relocated imports and the get_reward reward-dispatch patch.

Pre-submission Checklist

I have read the contribution guidelines.
I have run pre-commit run --all-files and fixed any issues.
I have added or updated tests covering my changes (if applicable).
I have verified that uv run pytest tests/ passes locally.
If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker. (No schema/proto changes.)
If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen). (No SDK/CLI code changes; dependency floors only.)
If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
I have updated documentation or config examples if user-facing behavior changed. (No user-facing behavior change.)

trl 0.23.0 mis-read transformers 5.8.1's tuple-returning package probe and tried to import the absent `mergekit`, so the DPO executor failed to import and was skipped at worker startup. trl 1.6.0 fixes this but relocates PPO to `trl.experimental.ppo` and tightens `SFTTrainer`'s model type, so repoint the PPO imports and the `get_reward` reward-dispatch patch, and narrow the LoRA peft model with an isinstance guard. trl 1.6.0 requires datasets>=4.7.0, which moves datasets to 5.0.0. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

timzsu

Overall looks good. A minor comment about docstring cleanup.

timzsu · 2026-06-17T13:44:28Z

There seem to be some stale references to previous TRL versions, such as line 782-784 and 829.

Fixed. Also dropped all unnecessary guards for legacy TRL versions.

With trl pinned to 1.6.0, the multi-version compatibility shims in the training executors are dead code. Collapse the tokenizer/processing_class construction ladders (SFT/DPO/LoRA) and the PPO signature-introspection fallback into direct trainer calls, and drop the eval dataset/dataloader, reward_model, output_hidden_states, and return_dict presets that trl 1.6.0's PPO loop already sets itself per call. Type the PPO data collator as a DataCollatorWithPadding subclass so the call site no longer needs a type-ignore. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

timzsu

LGTM.

kaiitunnz requested a review from timzsu as a code owner June 17, 2026 06:46

kaiitunnz mentioned this pull request Jun 17, 2026

feat: extend capability gate to all task types #76

Open

8 tasks

timzsu requested changes Jun 17, 2026

View reviewed changes

kaiitunnz requested a review from timzsu June 17, 2026 18:00

timzsu approved these changes Jun 18, 2026

View reviewed changes

kaiitunnz merged commit d551a73 into main Jun 18, 2026
11 checks passed

kaiitunnz deleted the kaiitunnz/chore/bump-trl branch June 18, 2026 03:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: bump trl to 1.6.0 and adapt the worker training stack#75

chore: bump trl to 1.6.0 and adapt the worker training stack#75
kaiitunnz merged 2 commits into
mainfrom
kaiitunnz/chore/bump-trl

kaiitunnz commented Jun 17, 2026

Uh oh!

timzsu left a comment

Uh oh!

timzsu Jun 17, 2026

Uh oh!

kaiitunnz Jun 17, 2026

Uh oh!

timzsu left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kaiitunnz commented Jun 17, 2026

Purpose

Changes

Design

Test Plan

Test Result

Uh oh!

timzsu left a comment

Choose a reason for hiding this comment

Uh oh!

timzsu Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

kaiitunnz Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

timzsu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants