Clean up GPU memory after killing sglang processes #2457
scripts/killall_sglang.sh
Outdated
fi
# Clean all GPU processes
kill -9 $(nvidia-smi | sed -n '/Processes:/,$p' | grep " [0-9]" | awk '{print $5}') 2>/dev/null
lsof /dev/nvidia* | awk '{print $2}' | xargs kill -9 2>/dev/null
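For readers following the `nvidia-smi` pipeline in the snippet above, here is a minimal sketch of what it extracts. It runs the same `sed | grep | awk` chain on canned text instead of a live GPU, and it never calls `kill`. The helper name `extract_gpu_pids` and the sample table are hypothetical; the table assumes a driver layout in which the PID is the fifth whitespace-separated field, which may not hold for every driver version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `nvidia-smi` text on stdin and prints one PID per line.
# It reuses the pipeline from the diff: keep everything from "Processes:" onward,
# keep rows that contain a space followed by a digit, then print the fifth field.
extract_gpu_pids() {
    sed -n '/Processes:/,$p' | grep " [0-9]" | awk '{print $5}'
}

# Canned stand-in for real `nvidia-smi` output (the column layout is an assumption).
sample_output='+-------------------------------------------------------------+
| Processes:                                                  |
|  GPU   GI   CI        PID   Type   Process name  GPU Memory |
|        ID   ID                                   Usage      |
|=============================================================|
|    0   N/A  N/A     12345      C   python           1024MiB |
|    1   N/A  N/A     67890      C   python           2048MiB |
+-------------------------------------------------------------+'

pids=$(printf '%s\n' "$sample_output" | extract_gpu_pids)
echo "$pids"
```

On a real machine the helper would be fed with `nvidia-smi | extract_gpu_pids`. If the table layout is a concern, `nvidia-smi --query-compute-apps=pid --format=csv,noheader` prints PIDs directly; that is an alternative, not what this PR uses.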
This one is from another PR, which doesn't contain this fix, though. I updated the script to show the GPU status again.
ata is fast 🚀
a2a73de to 1dc84ae
scripts/killall_sglang.sh
Outdated
kill -9 $(nvidia-smi | sed -n '/Processes:/,$p' | grep " [0-9]" | awk '{print $5}') 2>/dev/null
fi
# Clean all GPU processes
kill -9 $(nvidia-smi | sed -n '/Processes:/,$p' | grep " [0-9]" | awk '{print $5}') 2>/dev/null
I didn't enable it by default because, in some environments, we shouldn't terminate other processes.
some environment
Like where?
For example, in our development environment, we share an H100 node, and others might use the GPU to run other tasks.
This is why I didn't set it as default, but it's also easy to enable when we want to, just by adding an extra parameter.
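The opt-in behaviour described here can be sketched as follows. This is an illustration only: the function name `clean_mode` is hypothetical, and treating any extra argument as the opt-in is an assumption, since the thread does not show the exact condition the script uses. The sketch only reports which mode would run and does not kill anything.

```shell
#!/usr/bin/env bash
# Hypothetical gate: with no arguments only sglang processes are targeted;
# any extra argument opts in to cleaning every process on the GPUs.
# (Assumption: the real script may test for a specific value instead.)
clean_mode() {
    if [ "$#" -gt 0 ]; then
        echo "all-gpu-processes"
    else
        echo "sglang-only"
    fi
}

default_mode=$(clean_mode)
opt_in_mode=$(clean_mode "nuk_gpus")
echo "$default_mode"
echo "$opt_in_mode"
```

Keeping the destructive path behind an explicit argument means a shared node is unaffected unless the caller asks for the wider cleanup.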
Hmm OK, lemme update it then.
@zhyncs put back the if condition and enabled killing for the router tests.
a469eb1 to 230aede
@MrAta can you fix the conflict?
Signed-off-by: Ata Fatahi <[email protected]>
Fixed!
Signed-off-by: Ata Fatahi <[email protected]>
@@ -60,7 +60,7 @@ jobs:
pip install --force-reinstall dist/*.whl
- name: Run e2e test
run: |
bash scripts/killall_sglang.sh
bash scripts/killall_sglang.sh "nuk_gpus"
Why does pr-test-rust need this parameter?
It seems that, most of the time, GPU memory isn't cleaned up properly during those tests, so this makes sure we can load the model for the tests.
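One way to confirm the cleanup worked before loading a model is to check used memory per GPU. The sketch below is not part of this PR: the helper name `gpus_below_threshold` and the 100 MiB limit are hypothetical, and it is exercised on canned numbers rather than a live GPU.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads used memory per GPU (MiB, one number per line) on stdin
# and succeeds only if every value is at or below the limit passed as $1.
gpus_below_threshold() {
    awk -v limit="$1" 'BEGIN { busy = 0 } ($1 + 0) > (limit + 0) { busy = 1 } END { exit busy }'
}

# Canned values standing in for two idle GPUs and for one busy GPU.
if printf '3\n5\n' | gpus_below_threshold 100; then idle_check="free"; else idle_check="busy"; fi
if printf '3\n70000\n' | gpus_below_threshold 100; then busy_check="free"; else busy_check="busy"; fi
echo "$idle_check $busy_check"

# On a real machine the input would come from:
#   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | gpus_below_threshold 100
```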
Motivation
Clean up GPU memory after killing sglang processes
Modifications
Modify the killall_sglang.sh script to clean up any foreground or background processes that are using GPU memory.
Currently, running the script doesn't clean up the memory properly (see screenshot below). With this PR, it should be fixed.
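The `lsof /dev/nvidia*` line in the script is what catches processes that hold a GPU device file open, whether or not they appear in the `nvidia-smi` table. The sketch below shows that extraction on canned text, with two small changes that are suggestions rather than what the PR does: it skips the `lsof` header row (otherwise the literal string `PID` is handed to `kill`) and removes duplicate PIDs. The helper name `extract_lsof_pids` and the sample rows are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `lsof` output on stdin and prints unique PIDs.
# NR > 1 skips the header row; sort -un removes repeated PIDs.
extract_lsof_pids() {
    awk 'NR > 1 {print $2}' | sort -un
}

# Canned stand-in for `lsof /dev/nvidia*` output (the rows are made up).
sample_lsof='COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
python  12345 user  mem    CHR   195,0           449 /dev/nvidia0
python  12345 user  mem    CHR 195,255           448 /dev/nvidiactl
python  67890 user   12u   CHR   195,0      0t0  449 /dev/nvidia0'

lsof_pids=$(printf '%s\n' "$sample_lsof" | extract_lsof_pids)
echo "$lsof_pids"
```

On a real machine the input would come from `lsof /dev/nvidia* 2>/dev/null | extract_lsof_pids`, and the result would only be passed to `kill` when the wider cleanup has been requested.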
Checklist