Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up GPU memory after killing sglang processes #2457

Merged
merged 5 commits into from
Dec 17, 2024

Conversation

MrAta
Copy link
Contributor

@MrAta MrAta commented Dec 11, 2024

Motivation

Clean up GPU memory after killing sglang processes

Modifications

Modify the killall_sglang.sh script to clean up any fore and background process using up the GPU memory.

Currently, running the script doesn't clean up the memory properly (see screenshot below), with this PR it should be fixed.
Screenshot 2024-12-11 at 7 27 42 PM

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

fi
# Clean all GPU processes
kill -9 $(nvidia-smi | sed -n '/Processes:/,$p' | grep " [0-9]" | awk '{print $5}') 2>/dev/null
lsof /dev/nvidia* | awk '{print $2}' | xargs kill -9 2>/dev/null
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we show nvidia-smi again at the end to ensure everything is cleaned up? seems that
image
still has issues

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one is from another PR which doesn't contain this fix tho. updated the script to show gpu status again.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ata is fast 🚀

@MrAta MrAta force-pushed the ata/clean branch 2 times, most recently from a2a73de to 1dc84ae Compare December 12, 2024 03:36
@ByronHsu
Copy link
Collaborator

@zhyncs

kill -9 $(nvidia-smi | sed -n '/Processes:/,$p' | grep " [0-9]" | awk '{print $5}') 2>/dev/null
fi
# Clean all GPU processes
kill -9 $(nvidia-smi | sed -n '/Processes:/,$p' | grep " [0-9]" | awk '{print $5}') 2>/dev/null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't enable it by default because, in some environments, we shouldn't terminate other processes.

Copy link
Contributor Author

@MrAta MrAta Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some environment

Like where?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, in our development environment, we share an H100 node, and others might use the GPU to run other tasks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why I didn't set it as default, but it's also easy to enable when we want to, just by adding an extra parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK, lemme update it then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhyncs put back the if condition and enabled killing for the router tests.

@MrAta MrAta force-pushed the ata/clean branch 2 times, most recently from a469eb1 to 230aede Compare December 12, 2024 07:33
@ByronHsu
Copy link
Collaborator

@MrAta can you fix the conflict

@MrAta
Copy link
Contributor Author

MrAta commented Dec 12, 2024

@MrAta can you fix the conflict

Fixed!

@MrAta MrAta requested a review from zhyncs December 12, 2024 23:38
@MrAta MrAta requested a review from ByronHsu December 12, 2024 23:42
@@ -60,7 +60,7 @@ jobs:
pip install --force-reinstall dist/*.whl
- name: Run e2e test
run: |
bash scripts/killall_sglang.sh
bash scripts/killall_sglang.sh "nuk_gpus"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does pr-test-rust need this parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like most of the time GPU memory isn't cleaned up properly during those tests; so it's to make sure we can load the model for tests.

@merrymercy merrymercy merged commit ce094a5 into sgl-project:main Dec 17, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants