Bugs in Distributed Loading of Non-Distributed Checkpoints #244

Closed
@bigximik

🐞 Bugs in Distributed Loading of Non-Distributed Checkpoints and/or Model Creation via HF Wrapper

Loading non-distributed checkpoints in a distributed configuration, or creating models via from_pretrained through the HF wrapper, fails in several cases. (For details, see discussion below.)

  • When tensor_parallel is not 1: model creation succeeds, but checkpoint loading fails.
  • When sequence_parallel is not 1: model creation succeeds, but checkpoint loading fails.
  • When pipeline_parallel is not 1: model creation fails at different points, depending on whether sequence_tensor_parallel is set.
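For quick reference, the three cases above can be written down as a small lookup. This is only a sketch of the reported behavior, not code from the library: `ParallelConfig` and `reported_outcome` are hypothetical names, the field names just mirror the settings mentioned in this issue, and the ordering (a creation failure pre-empting a loading failure when several settings are not 1) is an assumption, as is the "no failure reported" baseline when every setting is 1.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ParallelConfig:
    # Hypothetical container; field names mirror the settings named in this issue.
    tensor_parallel: int = 1
    sequence_parallel: int = 1
    pipeline_parallel: int = 1


def reported_outcome(cfg: ParallelConfig) -> str:
    """Return the stage at which this issue reports a failure, if any."""
    # Assumption: model creation happens before checkpoint loading,
    # so a pipeline-parallel creation failure is checked first.
    if cfg.pipeline_parallel != 1:
        return "model creation fails"
    if cfg.tensor_parallel != 1 or cfg.sequence_parallel != 1:
        return "checkpoint loading fails"
    # Assumption: the issue reports no failure when every setting is 1.
    return "no failure reported"


if __name__ == "__main__":
    # Enumerate every combination of 1 and 2 for the three settings.
    for tp, sp, pp in product((1, 2), repeat=3):
        cfg = ParallelConfig(tp, sp, pp)
        print(cfg, "->", reported_outcome(cfg))
```

A table like this could serve as the expectation matrix for a parametrized regression test once the actual loading entry point is wired in.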
