Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(TaskRunner pid=37717) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_ref_init_model() (pid=38309, ip=172.17.0.3, actor_id=d6503e2665e2d40fc6795be001000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7114d32adb20>)
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(TaskRunner pid=37717) return self.__get_result()
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=37717) raise self._exception
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/workspace/verl/single_controller/ray/base.py", line 844, in func
(TaskRunner pid=37717) return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/workspace/verl/single_controller/base/decorator.py", line 462, in inner
(TaskRunner pid=37717) return func(*args, **kwargs)
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/workspace/verl/utils/transferqueue_utils.py", line 314, in dummy_inner
(TaskRunner pid=37717) output = func(*args, **kwargs)
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/workspace/verl/workers/fsdp_workers.py", line 812, in init_model
(TaskRunner pid=37717) ) = self._build_model_optimizer(
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/workspace/verl/workers/fsdp_workers.py", line 400, in _build_model_optimizer
(TaskRunner pid=37717) actor_module = actor_module_class.from_pretrained(
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 604, in from_pretrained
(TaskRunner pid=37717) return model_class.from_pretrained(
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 288, in _wrapper
(TaskRunner pid=37717) return func(*args, **kwargs)
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 5103, in from_pretrained
(TaskRunner pid=37717) model = cls(config, *model_args, **model_kwargs)
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/models/qwen3/modeling_qwen3.py", line 435, in __init__
(TaskRunner pid=37717) super().__init__(config)
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2197, in __init__
(TaskRunner pid=37717) self.config._attn_implementation_internal = self._check_and_adjust_attn_implementation(
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2812, in _check_and_adjust_attn_implementation
(TaskRunner pid=37717) lazy_import_flash_attention(applicable_attn_implementation)
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 136, in lazy_import_flash_attention
(TaskRunner pid=37717) _flash_fn, _flash_varlen_fn, _pad_fn, _unpad_fn = _lazy_imports(implementation)
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 83, in _lazy_imports
(TaskRunner pid=37717) from flash_attn import flash_attn_func, flash_attn_varlen_func
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/flash_attn/__init__.py", line 3, in <module>
(TaskRunner pid=37717) from flash_attn.flash_attn_interface import (
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 15, in <module>
(TaskRunner pid=37717) import flash_attn_2_cuda as flash_attn_gpu
(TaskRunner pid=37717) ImportError: /venv/main/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_jb
(WorkerDict pid=38309) /workspace/verl/utils/tokenizer.py:107: UserWarning: Failed to create processor: Unsupported processor type: Qwen2TokenizerFast. This may affect multimodal processing [repeated 3x across cluster]
(WorkerDict pid=38309) warnings.warn(f"Failed to create processor: {e}. This may affect multimodal processing", stacklevel=1) [repeated 3x across cluster]
(WorkerDict pid=38309) `torch_dtype` is deprecated! Use `dtype` instead! [repeated 3x across cluster]
Description

I followed the installation instructions in the repository's README.md and attempted to run the SDPO generalization experiment. However, the training process fails during model initialization.

Environment

Steps to Reproduce

Set up the environment as described in README.md, then launch the SDPO generalization experiment.

Error Logs

See the full stack trace at the top of this issue.
Any guidance would be appreciated. Thanks!