Commit d06e19f

add explanation about ctrl-c
Signed-off-by: youkaichao <[email protected]>
1 parent 0616acf commit d06e19f

1 file changed: +2 -0 lines changed

_posts/2025-11-27-improved-cuda-debugging.md

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ GPU computational power has been increasing exponentially, but memory bandwidth

When a GPU kernel hangs, the program typically freezes or becomes unresponsive—even pressing Ctrl-C cannot stop it. The most straightforward solution is to kill the process, but this approach provides no information about the root cause. Developers are left to guess blindly, bisecting code changes and running tests iteratively until they identify the issue.

+> Side note on why pressing Ctrl-C doesn't work: pressing Ctrl-C sends a SIGINT signal to the process. The Python interpreter catches SIGINT and converts it into a `KeyboardInterrupt` exception, but it can only raise that exception once control returns to Python code. If the process is instead blocked inside a low-level CUDA API call, waiting for a GPU kernel to finish, no Python code is running, so the `KeyboardInterrupt` is never raised. To make Ctrl-C terminate the `conditional_hang.py` example below, add `import signal; signal.signal(signal.SIGINT, signal.SIG_DFL)` at the beginning of the script. This restores the default SIGINT behavior, so the Python interpreter no longer intercepts the signal and Ctrl-C terminates the process immediately. The downside is that the Python interpreter cannot print a stack trace when the process is stopped this way.
+
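
The one-liner from the side note can be wrapped in a small helper. This is a minimal sketch, not part of the original post: the function name `restore_default_sigint` is hypothetical, and the only calls used are `signal.signal` and `signal.SIG_DFL` from the Python standard library.

```python
import signal


def restore_default_sigint():
    """Install the OS default action for SIGINT and return the old handler.

    With the default action in place, Ctrl-C terminates the process even
    while it is blocked inside a native call, because Python no longer
    waits to raise KeyboardInterrupt. No traceback is printed on exit.
    Must be called from the main thread (a signal.signal requirement).
    """
    return signal.signal(signal.SIGINT, signal.SIG_DFL)
```

Calling `restore_default_sigint()` at the top of a script has the same effect as the one-liner above. The returned handler can be passed back to `signal.signal` later to bring back the usual `KeyboardInterrupt` behavior.
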
Fortunately, there is a better way. The CUDA driver includes a feature called `user induced GPU core dump generation`: the driver opens pipes in the operating system that allow users to trigger a core dump by writing to them. When triggered, the CUDA driver dumps the GPU state to core dump files, enabling inspection of what's happening inside the GPU and, most importantly, identifying which GPU kernel is hanging.
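
As a rough sketch of what "writing to the pipe" can look like from a second terminal or a watchdog script, the helper below writes a few bytes to a named pipe. Everything specific here is an assumption rather than something stated in this passage: the helper name `trigger_gpu_core_dump` is hypothetical, the pipe path must be whatever your process was configured with (in NVIDIA's CUDA-GDB documentation, to our understanding, this is controlled by environment variables such as `CUDA_ENABLE_USER_TRIGGERED_COREDUMP` and `CUDA_COREDUMP_PIPE`; verify the names against your driver version), and the amount of data the driver expects is not specified here, so the payload is a parameter.

```python
import errno
import os


def trigger_gpu_core_dump(pipe_path: str, payload: bytes = b"\0" * 4096) -> int:
    """Write `payload` to the named pipe at `pipe_path`; return bytes written.

    The pipe is opened with O_NONBLOCK so the call fails fast, instead of
    hanging, when no process has the pipe open for reading.
    The default payload size is an assumption, not a documented requirement.
    """
    try:
        fd = os.open(pipe_path, os.O_WRONLY | os.O_NONBLOCK)
    except OSError as exc:
        if exc.errno == errno.ENXIO:
            # ENXIO on a FIFO means nobody is reading from it.
            raise RuntimeError(f"no reader on {pipe_path}") from exc
        raise
    try:
        return os.write(fd, payload)
    finally:
        os.close(fd)
```

Opening the pipe in non-blocking mode is a design choice for the sketch: if the hung process has already exited, the helper raises an error right away rather than blocking the terminal as well.
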

Consider a simple example of a conditional hanging kernel:
