Fix GC pause during CUDAGraph capture to prevent abort #1339

Merged
valarLip merged 1 commit into main from Jasen/cudagraph-gc-fix
Jun 25, 2026
Conversation

@Jasen2201

Contributor

…bort (#1322)

During CUDAGraph capture, MiniMax-M3's autotuned _topk_index_partial_kernel discards candidate CompiledKernels. A gen-0 GC firing inside the stream-capture region runs CompiledKernel.__del__ -> hipModuleUnload, which HIP forbids while a stream is capturing (HIP 900), corrupting the capture and aborting the custom_all_reduce IPC handshake (SIGABRT). gc.freeze() did not help because the discarded kernels are created mid-loop. Disable GC for the whole capture window and restore via try/finally.
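The gc.freeze() point can be reproduced in plain Python without a GPU. The sketch below is illustrative only: `FakeKernel`, `state`, and `capture_with_freeze` are hypothetical names, the reference cycle is an assumption (an object without a cycle would be reclaimed by reference counting rather than by the cyclic GC), and the explicit `gc.collect(0)` stands in for a gen-0 collection that would normally be triggered by allocation counts.

```python
import gc

# Hypothetical stand-in for "a stream is currently capturing".
state = {"capturing": False}
finalized_during_capture = []


class FakeKernel:
    """Hypothetical stand-in for a discarded autotune candidate."""

    def __init__(self, name):
        self.name = name
        self.cycle = self  # assumed cycle: only the cyclic GC can reclaim it

    def __del__(self):
        # The real finalizer would unload a module; here we only record
        # whether it ran while the fake capture window was open.
        if state["capturing"]:
            finalized_during_capture.append(self.name)


def capture_with_freeze():
    gc.collect()
    gc.freeze()  # moves only the objects that exist right now out of GC's view
    state["capturing"] = True
    try:
        FakeKernel("candidate")  # created and dropped inside the window
        gc.collect(0)            # stand-in for an automatic gen-0 collection
    finally:
        state["capturing"] = False
        gc.unfreeze()


capture_with_freeze()
print(finalized_during_capture)
```

Because the candidate is created after `gc.freeze()`, it is still tracked in the young generation, so its finalizer runs inside the window despite the freeze.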

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings June 24, 2026 08:17

Copilot AI left a comment

Contributor

Pull request overview

This PR prevents HIP stream-capture aborts during CUDA/HIP graph capture by pausing Python’s cyclic GC across the entire capture window, avoiding finalizers (e.g., Triton/HIP module unload) running inside capture.

Changes:

  • Add a local pause_gc() context manager that runs gc.collect(), disables GC for the capture window, then restores GC in finally.
  • Wrap graph_capture() with pause_gc() to cover the full capture loop.
  • Rename the graph_capture() context variable from gc to capture_ctx and update stream usages accordingly.
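A minimal sketch of how the changes listed above might fit together, assuming a simplified capture loop. The `graph_capture()` here is a hypothetical stand-in that only yields an object with a `stream` attribute, and `capture_all` and `events` are illustrative names; none of this is the repository's actual code. The rename to `capture_ctx` presumably keeps the name `gc` bound to the module inside the function, though that motive is an inference.

```python
import contextlib
import gc
from types import SimpleNamespace

events = []


@contextlib.contextmanager
def pause_gc():
    # Same shape as the snippet quoted in the review.
    gc.collect()
    gc.disable()
    try:
        yield
    finally:
        gc.enable()
        gc.collect()


@contextlib.contextmanager
def graph_capture():
    # Hypothetical stand-in: the real context manager sets up a capture stream.
    events.append(("capture_enter", gc.isenabled()))
    try:
        yield SimpleNamespace(stream="fake-stream")
    finally:
        events.append(("capture_exit", gc.isenabled()))


def capture_all(batch_sizes):
    # pause_gc() is entered first and exited last, so GC stays off for the
    # entire capture window, including graph_capture() setup and teardown.
    with pause_gc(), graph_capture() as capture_ctx:
        for bs in batch_sizes:
            # `gc` still names the module here because the context object
            # is bound to `capture_ctx`.
            events.append((bs, capture_ctx.stream, gc.isenabled()))


capture_all([1, 2])
```

Every entry recorded in `events` carries the GC state observed at that point, which makes it easy to confirm that no step of the loop ran with collection enabled.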

Comment on lines +2288 to +2296

    def pause_gc():
        # No GC during capture: a finalizer's hipModuleUnload aborts it (HIP 900).
        gc.collect()
        gc.disable()
        try:
            yield
        finally:
            gc.enable()
            gc.collect()
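One possible variation, not part of the merged change: the quoted helper calls `gc.enable()` unconditionally, so a caller that had already disabled GC would find it enabled afterwards. The sketch below records the prior state with `gc.isenabled()` and restores only that. The name `pause_gc_preserving_state` is hypothetical.

```python
import contextlib
import gc


@contextlib.contextmanager
def pause_gc_preserving_state():
    was_enabled = gc.isenabled()  # remember what the caller had configured
    gc.collect()                  # drain pending garbage before the window
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()           # re-enable only if it was on before
        gc.collect()              # run deferred finalizers outside the window


# Case 1: the caller had GC disabled; it stays disabled afterwards.
gc.disable()
with pause_gc_preserving_state():
    pass
still_disabled = not gc.isenabled()

# Case 2: the caller had GC enabled; it is off inside and back on afterwards.
gc.enable()
with pause_gc_preserving_state():
    enabled_inside = gc.isenabled()
enabled_after = gc.isenabled()
```

Whether this matters depends on whether any caller of the capture path runs with GC already disabled, which the PR does not state.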
@valarLip valarLip merged commit 069399b into main Jun 25, 2026
27 of 33 checks passed
@valarLip valarLip deleted the Jasen/cudagraph-gc-fix branch June 25, 2026 04:57
3 participants