fix: reclaim GPU memory after checkpoint load with CPU optimizer offloading#2154

Draft
samsja wants to merge 1 commit into main from fix/cpu-offload-ckpt-resume-oom
Conversation


@samsja samsja commented Mar 31, 2026

Summary

  • Fixes OOM when resuming from checkpoint with optim_cpu_offload enabled
  • Root cause: after set_state_dict() followed by _move_states("cpu"), the state_dict (owned by dcp_load) still holds references to the now-stale GPU optimizer tensors, so the CUDA caching allocator keeps that memory reserved, causing OOM on the first training step
  • Fix: clear stale references, gc.collect(), and torch.cuda.empty_cache() to fully reclaim GPU memory
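The three-step reclaim described above can be sketched as a small helper. This is a hedged illustration of the pattern, not the PR's actual code; the helper name `release_stale_states` and its argument are hypothetical, and `torch` is imported lazily so the sketch also runs without a GPU:

```python
import gc


def release_stale_states(state_dict):
    """Drop references to stale GPU tensors held by a loaded state_dict.

    Hypothetical helper illustrating the fix: after optimizer states have
    been moved to CPU, the state_dict returned by the checkpoint loader
    still references the original GPU tensors, so the CUDA caching
    allocator cannot release that memory until the references are gone.
    """
    # 1. Clear the stale references held by the loaded state_dict.
    state_dict.clear()
    # 2. Collect, in case any reference cycles still pin the tensors.
    gc.collect()
    # 3. Return the now-unused cached blocks to the driver.
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # allows running the sketch without torch installed
```

Note that `torch.cuda.empty_cache()` only frees blocks that are no longer referenced, which is why dropping the references and collecting must happen first.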

Before / After

Measured on Qwen3-0.6B with optim_cpu_offload=true, single GPU, expandable_segments:True:

| Metric | Before | After |
| --- | --- | --- |
| Reserved after load | 14,362 MB | 2,942 MB |
| Allocated after load | 2,884 MB | 2,884 MB |
| Wasted (reserved − allocated) | 11,478 MB | 58 MB |
| Spike from baseline | 11,420 MB | 0 MB |
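The "reserved", "allocated", and "wasted" figures above can be reproduced with the standard CUDA allocator counters. A minimal sketch, assuming PyTorch is installed; the function name `gpu_memory_stats` is hypothetical, and the fallback zeros let it run on a machine without a GPU:

```python
def gpu_memory_stats():
    """Return (reserved_mb, allocated_mb, wasted_mb) for the current CUDA device.

    Hypothetical measurement helper matching the table's definitions:
    wasted = reserved - allocated, i.e. memory the caching allocator holds
    but no live tensor uses.
    """
    try:
        import torch
        if torch.cuda.is_available():
            reserved = torch.cuda.memory_reserved() // 2**20
            allocated = torch.cuda.memory_allocated() // 2**20
            return reserved, allocated, reserved - allocated
    except ImportError:
        pass
    return 0, 0, 0  # no CUDA available
```

Calling this immediately after checkpoint load (and comparing against a pre-load baseline) gives the "spike from baseline" row.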

🤖 Generated with Claude Code


When resuming from a checkpoint with optim_cpu_offload, set_state_dict()
materializes optimizer states on GPU. After _move_states("cpu") moves them
to CPU, the state_dict (owned by dcp_load) still holds references to the
stale GPU tensors, and the CUDA caching allocator retains the memory.
This causes OOM on the first training step.

Fix: clear the state_dict references, run gc.collect(), and call
torch.cuda.empty_cache() to fully reclaim GPU memory.

Measured on Qwen3-0.6B (single GPU):
- Before: 11.4GB wasted GPU memory after checkpoint load
- After:  58MB wasted (essentially zero)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
