
KeeProMise commented on Dec 8, 2025

see: #7709

Problem

When loading model weights across multiple devices (tensor parallelism), buffers registered via register_buffer were either loaded only on device 0 or, in inference scenarios, not loaded at all.
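
For context, here is a minimal sketch of the kind of state involved. The RotaryCache module below is hypothetical, not taken from the DeepSpeed codebase; it only illustrates that register_buffer holds non-parameter state that every tensor-parallel rank needs a local copy of:

```python
import torch
import torch.nn as nn

class RotaryCache(nn.Module):
    """Hypothetical module; illustrates buffer state only."""

    def __init__(self, dim: int):
        super().__init__()
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        # Buffers are saved in the checkpoint but are not Parameters,
        # so a loader that only walks named_parameters() skips them.
        self.register_buffer("inv_freq", inv_freq)
```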

Root Causes

  1. The load_buffer() function lacked device awareness (it had no mp_group parameter)
  2. The Inference Engine's load_model_with_checkpoint() ignored buffers entirely (sketched below)
  3. Buffer loading was inconsistent across different code paths
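
A hedged sketch of what root cause 2 looks like in practice (this is an assumed shape of the bug, not the actual DeepSpeed source): the checkpoint loader walks parameters only, so buffers silently keep their initialization values.

```python
import torch.nn as nn

def load_params_only(module: nn.Module, state_dict: dict, prefix: str = "") -> None:
    # Assumed pre-fix pattern: only named_parameters() is consulted.
    for name, param in module.named_parameters(recurse=False):
        key = prefix + name
        if key in state_dict:
            param.data.copy_(state_dict[key])
    # named_buffers() is never visited here, so anything registered
    # via register_buffer() is left at its initialization-time value.
```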

Solution

  • Enhanced load_buffer() to accept an mp_group parameter and handle device migration (see the sketch after this list)
  • Updated all load_buffer() call sites to pass the mp_group parameter
  • Added buffer-loading logic to the Inference Engine's load_model_with_checkpoint()
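
A minimal sketch of the enhanced loader, assuming the signature described above (the exact device-resolution logic in DeepSpeed may differ):

```python
import torch
import torch.nn as nn

def load_buffer(module: nn.Module, state_dict: dict, prefix: str = "",
                mp_group=None) -> None:
    # mp_group identifies the tensor-parallel process group; when it is
    # set, each rank resolves its own local CUDA device instead of
    # implicitly landing on device 0.
    if mp_group is not None and torch.cuda.is_available():
        device = torch.device("cuda", torch.cuda.current_device())
    else:
        device = torch.device("cpu")
    for name, _ in module.named_buffers(recurse=False):
        key = prefix + name
        if key in state_dict:
            # Migrate the checkpoint tensor to the rank-local device
            # before installing it as the module's buffer.
            module._buffers[name] = state_dict[key].to(device)
```

Call sites would then thread the group through, e.g. load_buffer(child, sd, prefix, mp_group=self.mp_group) (the self.mp_group attribute name is an assumption), and a buffer loop of this shape is what load_model_with_checkpoint() gains in the Inference Engine.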

Files Changed

  • deepspeed/module_inject/auto_tp.py
  • deepspeed/module_inject/replace_module.py
  • deepspeed/inference/engine.py
