Description
🐞 Bugs in Distributed Loading of Non-Distributed Checkpoints and/or Model Creation via HF Wrapper
Loading non-distributed checkpoints, or creating models via `from_pretrained` with the HF wrapper, currently fails under the parallelism settings listed here. (For details, see discussion below.)
- When `tensor_parallel` is not 1: model creation succeeds, but checkpoint loading fails.
- When `sequence_parallel` is not 1: model creation succeeds, but checkpoint loading fails.
- When `pipeline_parallel` is not 1: model creation fails at different points, depending on whether `sequence_tensor_parallel` is set.