fix: reclaim GPU memory after checkpoint load with CPU optimizer offloading by samsja · Pull Request #2154 · PrimeIntellect-ai/prime-rl

samsja · 2026-03-31T17:06:02Z

Summary

- Fixes OOM when resuming from a checkpoint with optim_cpu_offload enabled.
- After set_state_dict() and _move_states("cpu"), the state_dict (owned by dcp_load) still holds references to stale GPU optimizer tensors, and the CUDA caching allocator retains the memory, causing OOM on the first training step.
- Fix: clear the stale references, run gc.collect(), and call torch.cuda.empty_cache() to fully reclaim GPU memory.
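
The three-step cleanup described above can be sketched roughly as follows. This is an illustration, not the code from this PR: `reclaim_gpu_memory`, `FakeTensor`, `demo`, and the state key are hypothetical names, and the sketch assumes the stale tensors are reachable only through the dict passed in. The PyTorch import is optional so the example runs on a machine without a GPU.

```python
import gc
import weakref

try:
    import torch  # optional: the sketch still runs without PyTorch installed
except ImportError:
    torch = None


def reclaim_gpu_memory(state_dict: dict) -> None:
    """Hypothetical helper: drop stale references, then release cached CUDA blocks."""
    # 1. Remove the dict's references to the (now stale) GPU tensors.
    state_dict.clear()
    # 2. Collect any reference cycles that might still keep those tensors alive.
    gc.collect()
    # 3. Return unused cached blocks from the CUDA caching allocator to the driver.
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


class FakeTensor:
    """Stand-in for a GPU tensor so the demo runs on any machine."""


def demo() -> bool:
    stale = FakeTensor()
    probe = weakref.ref(stale)
    # Hypothetical key; real optimizer state keys depend on the checkpoint layout.
    state_dict = {"optimizer.state.0.exp_avg": stale}
    del stale  # the state_dict is now the only owner
    reclaim_gpu_memory(state_dict)
    return probe() is None  # True once nothing references the object


if __name__ == "__main__":
    print("stale object released:", demo())
```

The order matters: torch.cuda.empty_cache() can only release blocks that no live tensor occupies, so the references have to be dropped and collected first.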

Before / After

Measured on Qwen3-0.6B with optim_cpu_offload=true, single GPU, expandable_segments:True:

Metric	Before	After
Reserved after load	14,362 MB	2,942 MB
Allocated after load	2,884 MB	2,884 MB
Wasted (reserved − allocated)	11,478 MB	58 MB
Spike from baseline	11,420 MB	0 MB
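
The derived rows of the table can be reproduced from the two measured rows. In the sketch below, `wasted_mb`, `spike_mb`, and `snapshot_mb` are hypothetical helpers, not part of prime-rl. It assumes the baseline is the 2,942 MB reserved figure, which the table does not state explicitly but which matches both columns. Whether the PR's "MB" means 2^20 or 10^6 bytes is also not stated; the snapshot helper assumes 2^20.

```python
try:
    import torch  # optional: only needed for live measurements
except ImportError:
    torch = None

MB = 1024 * 1024  # assumption: the PR does not say which definition of MB it uses


def wasted_mb(reserved_mb: int, allocated_mb: int) -> int:
    """Memory the caching allocator holds that no live tensor is using."""
    return reserved_mb - allocated_mb


def spike_mb(reserved_mb: int, baseline_reserved_mb: int) -> int:
    """Extra reserved memory relative to an assumed baseline."""
    return reserved_mb - baseline_reserved_mb


def snapshot_mb():
    """Return (reserved, allocated) in MB for the current device, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    return (torch.cuda.memory_reserved() // MB,
            torch.cuda.memory_allocated() // MB)


if __name__ == "__main__":
    # Figures taken from the table above.
    print("wasted before:", wasted_mb(14362, 2884))
    print("wasted after:", wasted_mb(2942, 2884))
    print("spike before:", spike_mb(14362, 2942))
    print("spike after:", spike_mb(2942, 2942))
```

Allocated memory is identical before and after the fix, so the whole difference is in what the caching allocator keeps reserved.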

🤖 Generated with Claude Code

…oading

When resuming from a checkpoint with optim_cpu_offload, set_state_dict() materializes optimizer states on GPU. After _move_states("cpu") moves them to CPU, the state_dict (owned by dcp_load) still holds references to the stale GPU tensors, and the CUDA caching allocator retains the memory. This causes OOM on the first training step.

Fix: clear the state_dict references, run gc.collect(), and call torch.cuda.empty_cache() to fully reclaim GPU memory.

Measured on Qwen3-0.6B (single GPU):
- Before: 11.4GB wasted GPU memory after checkpoint load
- After: 58MB wasted (essentially zero)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

S1ro1 approved these changes Mar 31, 2026

samsja wants to merge 1 commit into main from fix/cpu-offload-ckpt-resume-oom
