Benchmark modal script revamped - profiling and external libraries (Issue #504) #510
I am still working on a newer base image for this so that the nsight errors disappear, but in the meantime most of it should be working.
Makefile
@@ -14,6 +14,7 @@ CUDA_OUTPUT_FILE = -o $@
# NVCC flags
# -t=0 is short for --threads, 0 = number of CPUs on the machine
NVCC_FLAGS = -O3 -t=0 --use_fast_math
NVCC_FLAGS += --std=c++17
I'm not 100% sure what the repercussions of this would be right now. Is it needed for the modal script?
What version of GCC are you using?
As in on my container? This is what I got from using 'gcc --version':
command_args = ['gcc', '--version']
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
I used a cuda image from docker hub as a base, but the current image I am using is this:
and this is the cuda image I am layering on:
That helps - my guess is that the gcc compiler on Ubuntu 20.04 is too old. Try finding a container with a newer version of Ubuntu, such as 22.04. That should fix your compilation issues.
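When trying out candidate base images, it may help to check the compiler version programmatically instead of eyeballing the output. The following is only a minimal sketch: the helper names `parse_gcc_version` and `gcc_at_least` are hypothetical, and the minimum version passed in is an assumption the caller has to supply, since this thread does not establish which gcc version is actually required.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_gcc_version(version_output: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch) from the first line of `gcc --version` output.

    The first line looks like "gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0";
    the compiler version is the last dotted triple on that line.
    """
    lines = version_output.strip().splitlines()
    if not lines:
        return None
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\s*$", lines[0])
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def gcc_at_least(version_output: str, minimum: Tuple[int, int, int]) -> bool:
    """True if the parsed version is >= `minimum` (a caller-chosen threshold)."""
    version = parse_gcc_version(version_output)
    return version is not None and version >= minimum


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["gcc", "--version"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("gcc is not available in this environment")
    else:
        print(parse_gcc_version(result.stdout))
```

Run inside the container, this prints the parsed version tuple, which can then be compared against whatever minimum the build turns out to need.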
I am experimenting with images based on newer versions to also fix the errors I was getting with nsight, but it takes a while because reloading all the containers after every small change is slow. It is such a hassle to build a CUDA dev environment on Docker 😅 but I am trying to solve those right now.
Yep, you're right on that. Good luck!
I just fixed the gcc versioning mistake, and cuDNN should also work with the modal script now.
This PR adds support for libraries such as cuDNN on Modal so that people can run library-dependent builds on cloud GPUs. It adds my own custom Docker image for the CUDA environment setup, and it adds support for profiling with Nsight Systems, which comes pre-installed. Based on issue #504.
Features:
- (Please note that profile_gpt2cu.py does not work with this; you have to manually change the run command to use nsys profile.)
Examples:
To train the GPT-2 model with cuDNN, use:
GPU_MEM=80 modal run dev/cuda/benchmark_on_modal.py \
    --compile-command "make train_gpt2cu USE_CUDNN=1" \
    --run-command "./train_gpt2cu -i dev/data/tinyshakespeare/tiny_shakespeare_train.bin -j dev/data/tinyshakespeare/tiny_shakespeare_val.bin -v 250 -s 250 -g 144 -f shakespeare.log -b 4"
To profile with Nsight Systems:
GPU_MEM=80 modal run dev/cuda/benchmark_on_modal.py \
    --compile-command "make train_gpt2cu USE_CUDNN=1" \
    --run-command "nsys profile ./train_gpt2cu -i dev/data/tinyshakespeare/tiny_shakespeare_train.bin -j dev/data/tinyshakespeare/tiny_shakespeare_val.bin -v 250 -s 250 -g 144 -f shakespeare.log -b 4"
NOTE: There is currently a bug when profiling with Nsight Systems that produces a lot of errors on the command line, but it does not actually interfere with model training and validation. The report (which you download) is still generated and can be viewed in Nsight Systems. Additionally, 'profile_gpt2cu.py' does not work with this either.
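Since the run command has to be changed by hand to go through nsys profile, a small wrapper can do that step. This is only a sketch under stated assumptions: `with_nsys` and `build_modal_invocation` are hypothetical helpers that are not part of this PR, and the only flags and variables used are the ones shown in the examples (`GPU_MEM`, `--compile-command`, `--run-command`).

```python
import shlex
from typing import Dict, List, Tuple

# Script path taken from the examples in this PR description.
MODAL_SCRIPT = "dev/cuda/benchmark_on_modal.py"


def with_nsys(run_command: str) -> str:
    """Prefix a run command with `nsys profile` unless it already starts with nsys."""
    stripped = run_command.strip()
    if stripped.startswith("nsys "):
        return stripped
    return "nsys profile " + stripped


def build_modal_invocation(
    compile_command: str,
    run_command: str,
    gpu_mem: str = "80",
    profile: bool = False,
) -> Tuple[Dict[str, str], List[str]]:
    """Return (extra environment variables, argv) for the modal benchmark script."""
    if profile:
        run_command = with_nsys(run_command)
    env = {"GPU_MEM": gpu_mem}
    argv = [
        "modal", "run", MODAL_SCRIPT,
        "--compile-command", compile_command,
        "--run-command", run_command,
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_modal_invocation(
        "make train_gpt2cu USE_CUDNN=1",
        "./train_gpt2cu -b 4",
        profile=True,
    )
    # Print the equivalent shell command rather than executing it.
    print(" ".join(f"{k}={v}" for k, v in env.items()), shlex.join(argv))
```

The sketch only prints the command. To actually launch it, the returned `env` could be merged into `os.environ` and passed to `subprocess.run` together with `argv`.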
This enables more features to be used on cloud GPUs. Please feel free to add to or optimize this script so that more folks can explore the codebase and try out the features and profiling it currently offers.