[cfg] fix: sync strategy from ActorConfig/CriticConfig to EngineConfig #5885
FSDPActorConfig and FSDPCriticConfig set self.engine = self.fsdp_config but never sync self.strategy to self.engine.strategy. Since EngineConfig.strategy defaults to None, engine_workers.py (the new worker path used with use_legacy_worker_impl=disable) always falls back to FSDP1 regardless of the user's actor.strategy setting. This causes crashes for models that require FSDP2, such as Qwen3.5 and other models with multi-dimensional RoPE position_ids, where FSDP1's parameter wrapping breaks apply_rotary_pos_emb with shape mismatches. Fix: sync strategy in __post_init__ using object.__setattr__ (needed because BaseConfig has frozen field logic).
Code Review
This pull request ensures that the FSDP strategy is correctly propagated to the engine configuration in both actor and critic workers, preventing an unintended fallback to FSDP1. A review comment suggests also syncing the ulysses_sequence_parallel_size in the critic configuration to maintain consistency and ensure sequence parallelism settings are properly applied.
# Sync strategy to engine config so engine_workers can pick the right FSDP version.
# EngineConfig.strategy defaults to None, so without this, engine_workers.py always
# falls back to FSDP1 even when critic.strategy="fsdp2".
object.__setattr__(self.engine, "strategy", self.strategy)
In addition to syncing the strategy, ulysses_sequence_parallel_size should also be synced to the engine configuration in FSDPCriticConfig for consistency and backward compatibility, similar to the implementation in FSDPActorConfig. Without this, sequence parallelism settings defined at the top level of the critic configuration will not propagate to the underlying FSDP engine.
Note that ulysses_sequence_parallel_size is already defined as a mutable field in FSDPEngineConfig, so direct assignment is permitted.
object.__setattr__(self.engine, "strategy", self.strategy)
# backward compatibility
if self.ulysses_sequence_parallel_size > 1:
    self.fsdp.ulysses_sequence_parallel_size = self.ulysses_sequence_parallel_size
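The difference between the two assignments in the suggestion above (a bypass for `strategy`, plain assignment for `ulysses_sequence_parallel_size`) can be illustrated with a minimal, self-contained sketch. Everything here is a hypothetical stand-in rather than verl's actual code: `FrozenAfterInit` only approximates the frozen field logic attributed to BaseConfig, and the `*Sketch` classes are simplified. The sketch also writes through `self.engine`, which aliases `fsdp_config` here.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from typing import Optional


class FrozenAfterInit:
    """Hypothetical stand-in for BaseConfig: a field is frozen once set,
    unless it is listed in _mutable_fields."""

    _mutable_fields = frozenset()

    def __setattr__(self, name, value):
        # The first assignment (from the dataclass __init__) is allowed;
        # later ones are rejected unless the field is declared mutable.
        if name in self.__dict__ and name not in self._mutable_fields:
            raise FrozenInstanceError(f"field '{name}' is frozen")
        super().__setattr__(name, value)


@dataclass
class EngineConfigSketch(FrozenAfterInit):
    _mutable_fields = frozenset({"ulysses_sequence_parallel_size"})
    strategy: Optional[str] = None  # mirrors the None default described in the PR
    ulysses_sequence_parallel_size: int = 1


@dataclass
class CriticConfigSketch(FrozenAfterInit):
    strategy: str = "fsdp"
    ulysses_sequence_parallel_size: int = 1
    fsdp_config: EngineConfigSketch = field(default_factory=EngineConfigSketch)

    def __post_init__(self):
        self.engine = self.fsdp_config
        # strategy is already set (to None) on the engine config and is frozen,
        # so the sync has to bypass the custom __setattr__.
        object.__setattr__(self.engine, "strategy", self.strategy)
        # This field is declared mutable, so plain assignment is enough.
        if self.ulysses_sequence_parallel_size > 1:
            self.engine.ulysses_sequence_parallel_size = self.ulysses_sequence_parallel_size


critic = CriticConfigSketch(strategy="fsdp2", ulysses_sequence_parallel_size=2)
print(critic.engine.strategy, critic.engine.ulysses_sequence_parallel_size)
```

In this sketch, a plain `critic.engine.strategy = "fsdp"` raises `FrozenInstanceError`, which is the situation the `object.__setattr__` call works around.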
Summary
FSDPActorConfig.__post_init__ and FSDPCriticConfig.__post_init__ set self.engine = self.fsdp_config but never sync self.strategy to self.engine.strategy. Since EngineConfig.strategy defaults to None, engine_workers.py:162 always passes None as the backend to EngineRegistry.new(), which falls back to FSDP1 regardless of the user's actor.strategy setting.
This causes crashes for models that require FSDP2, such as Qwen3.5 and other models with multi-dimensional RoPE position_ids, where FSDP1's parameter wrapping breaks apply_rotary_pos_emb with shape mismatches.
Repro
1. Set actor.strategy=fsdp2 with use_legacy_worker_impl=disable (new engine_workers.py path)
2. engine_workers.py reads engine_config.strategy → gets None → defaults to FSDP1
3. Crash in apply_rotary_pos_emb (shape mismatch)
Fix
Sync strategy from the actor/critic config to the engine config in __post_init__:
object.__setattr__(self.engine, "strategy", self.strategy)
Uses object.__setattr__ because BaseConfig has frozen field logic that prevents normal attribute assignment.
Affected configs
- FSDPActorConfig (verl/workers/config/actor.py)
- FSDPCriticConfig (verl/workers/config/critic.py)
Note: McoreActorConfig, VeOmniActorConfig, and TorchTitanActorConfig are not affected because their engine configs have matching hardcoded strategy defaults.
Impact
Affects all FSDP2 training using the new engine_workers.py path (use_legacy_worker_impl=disable). The legacy worker path is unaffected because it doesn't read engine_config.strategy.
Test plan
- engine_config.strategy == "fsdp2" when actor.strategy = "fsdp2" with use_legacy_worker_impl=disable
- strategy=fsdp2 + use_legacy_worker_impl=disable: FSDP2 correctly applied, 3 training steps completed
- strategy=fsdp (default) is unchanged
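The first and last checks in the test plan could be expressed as assertions along the following lines. This is only a sketch under stated assumptions: the `*Sketch` dataclasses are simplified stand-ins for the real configs, `sync_strategy` is a hypothetical switch added purely to reproduce the pre-fix behavior, and `resolve_backend` is a hypothetical stand-in for the "None falls back to FSDP1" behavior described in the summary, not the actual EngineRegistry logic.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EngineConfigSketch:
    strategy: Optional[str] = None  # mirrors the None default described above


@dataclass
class ActorConfigSketch:
    strategy: str = "fsdp"
    sync_strategy: bool = True  # hypothetical switch: False reproduces the pre-fix behavior
    fsdp_config: EngineConfigSketch = field(default_factory=EngineConfigSketch)

    def __post_init__(self):
        self.engine = self.fsdp_config
        if self.sync_strategy:
            # The one-line fix from this PR
            object.__setattr__(self.engine, "strategy", self.strategy)


def resolve_backend(engine_config: EngineConfigSketch) -> str:
    """Hypothetical stand-in for the fallback: a missing strategy means FSDP1."""
    return engine_config.strategy if engine_config.strategy is not None else "fsdp"


def check_test_plan() -> None:
    # Pre-fix: the user's choice never reaches the engine config.
    before = ActorConfigSketch(strategy="fsdp2", sync_strategy=False)
    assert before.engine.strategy is None
    assert resolve_backend(before.engine) == "fsdp"

    # Post-fix: the engine config carries the requested strategy.
    after = ActorConfigSketch(strategy="fsdp2")
    assert after.engine.strategy == "fsdp2"
    assert resolve_backend(after.engine) == "fsdp2"

    # The default strategy resolves to the same backend as before the fix.
    assert resolve_backend(ActorConfigSketch().engine) == "fsdp"


check_test_plan()
```

The end-to-end item (three training steps with FSDP2 applied) needs a real training run and is not covered by a sketch like this.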