Benchmark modal script revamped - profiling and external libraries (Issue #504) #510

Closed
wants to merge 14 commits into from

vyom1611
Contributor

@vyom1611 vyom1611 commented Jun 1, 2024

This PR adds support for libraries such as cuDNN on Modal so that people can build and run code that depends on them on cloud GPUs. It uses my own custom Docker image for the CUDA environment setup and adds support for profiling with Nsight Systems, which comes pre-installed. Based on issue #504.

Features:

  • cuDNN supported
  • openMPI supported
  • Nsight Systems supported; by attaching a volume, you can now download the report and view it in the GUI on your local machine
  • Instructions in the benchmark_on_modal.py script on how to use the volumes and download reports, plus notes on existing errors
    (Please note that profile_gpt2cu.py does not work with this; you have to manually change the run command to use nsys profile.)

Examples:
To train the GPT-2 model with cuDNN, use:
GPU_MEM=80 modal run dev/cuda/benchmark_on_modal.py \
  --compile-command "make train_gpt2cu USE_CUDNN=1" \
  --run-command "./train_gpt2cu -i dev/data/tinyshakespeare/tiny_shakespeare_train.bin -j dev/data/tinyshakespeare/tiny_shakespeare_val.bin -v 250 -s 250 -g 144 -f shakespeare.log -b 4"

To profile with Nsight Systems, use:
GPU_MEM=80 modal run dev/cuda/benchmark_on_modal.py \
  --compile-command "make train_gpt2cu USE_CUDNN=1" \
  --run-command "nsys profile ./train_gpt2cu -i dev/data/tinyshakespeare/tiny_shakespeare_train.bin -j dev/data/tinyshakespeare/tiny_shakespeare_val.bin -v 250 -s 250 -g 144 -f shakespeare.log -b 4"
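
The two examples differ only in the run command: the profiling variant prefixes it with `nsys profile`. A minimal Python sketch of assembling these invocations is below. `build_modal_command` is a hypothetical helper, not part of the script, and it only builds the command line without running anything.

```python
import shlex

SCRIPT = "dev/cuda/benchmark_on_modal.py"  # path taken from the examples above


def build_modal_command(compile_command, run_command, profile=False, gpu_mem=80):
    """Return a shell command string equivalent to the examples above.

    Hypothetical helper for illustration; it does not execute anything.
    """
    if profile:
        # Profiling wraps the training binary with `nsys profile`.
        run_command = "nsys profile " + run_command
    args = [
        "modal", "run", SCRIPT,
        "--compile-command", compile_command,
        "--run-command", run_command,
    ]
    # shlex.join quotes every argument that contains spaces.
    return f"GPU_MEM={gpu_mem} " + shlex.join(args)


train = "./train_gpt2cu -i dev/data/tinyshakespeare/tiny_shakespeare_train.bin -b 4"
print(build_modal_command("make train_gpt2cu USE_CUDNN=1", train, profile=True))
```

Keeping the compile and run commands as separate strings mirrors the script's two flags, so switching between a plain run and a profiled run is a single boolean.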

NOTE: Profiling with Nsight Systems currently has a bug that prints a lot of errors on the command line, but they do not actually interfere with model training and validation. The report (which you download) is still generated and can be viewed in Nsight Systems. Additionally, 'profile_gpt2cu.py' does not work with this setup either.
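
Since the errors do not stop the report from being written, one way to confirm a profiling run succeeded is to check for the report file rather than for clean terminal output. A minimal sketch, assuming the downloaded files sit in a local directory and that the report uses the `.nsys-rep` extension of recent Nsight Systems releases (older releases wrote `.qdrep`). `find_reports` and the directory name are hypothetical.

```python
from pathlib import Path

# Assumption: report extensions written by Nsight Systems (new and legacy).
REPORT_SUFFIXES = {".nsys-rep", ".qdrep"}


def find_reports(directory):
    """Return non-empty Nsight Systems report files under `directory`, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in REPORT_SUFFIXES and p.stat().st_size > 0
    )


reports = find_reports("downloaded_reports")  # hypothetical local directory
print(reports if reports else "no report found")
```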

Overall, this makes more features usable on cloud GPUs. Please feel free to extend or optimize this script so that more people can explore the codebase and try out the features and profiling it currently offers.

@vyom1611
Contributor Author

vyom1611 commented Jun 3, 2024

I am still working on a newer base image so that the Nsight errors disappear, but in the meantime most of it should work.

@vyom1611 vyom1611 reopened this Jun 3, 2024
Makefile Outdated
@@ -14,6 +14,7 @@ CUDA_OUTPUT_FILE = -o $@
# NVCC flags
# -t=0 is short for --threads, 0 = number of CPUs on the machine
NVCC_FLAGS = -O3 -t=0 --use_fast_math
+NVCC_FLAGS += --std=c++17
Owner

I'm not 100% sure what the repercussions would be here right now. Is it needed for the modal script?

Contributor Author

Currently yes. Without it, the build prints hundreds of lines of errors in the terminal (I believe it's because the Docker image I created has the newest version of CUDA), and the current script will not work.

Screenshot 2024-06-04 at 9 30 24 AM

Contributor

What version of GCC are you using?

Contributor Author

@vyom1611 vyom1611 Jun 4, 2024

As in on my container? This is what I got from using 'gcc --version':
command_args = ['gcc', '--version']
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

I used a cuda image from docker hub as a base, but the current image I am using is this:

https://hub.docker.com/layers/totallyvyom/cuda-env/latest/images/sha256-a273f155a5ad9dcd4dd366a6da2f48f80f10b675bd136ab350f73b695d23b23b?tab=layers

and this is the cuda image I am layering on:

https://hub.docker.com/layers/nvidia/cuda/12.4.1-cudnn-devel-ubuntu20.04/images/sha256-f18cf1a9ac2842e59f13b0d0729594da8cbd68cadd2379308cdd98c0374dbd80?context=explore

Contributor

@rosslwheeler rosslwheeler Jun 4, 2024

That helps. My guess is that the gcc compiler on Ubuntu 20.04 is too old. Try finding a container with a newer version of Ubuntu, such as 22.04. That should fix your compilation issues.
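
The compiler check suggested here can be scripted. A minimal sketch that parses the first line of `gcc --version` and compares the major version to a threshold. `parse_gcc_version` is a hypothetical helper, and the minimum of 11 is an assumption chosen because that is the default gcc major version on Ubuntu 22.04, not a requirement stated in this thread.

```python
import re


def parse_gcc_version(version_output):
    """Extract (major, minor, patch) from the first line of `gcc --version`."""
    lines = version_output.splitlines()
    if not lines:
        raise ValueError("empty gcc version output")
    first_line = lines[0]
    # The version is the last dotted triple on the line.
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\s*$", first_line)
    if match is None:
        raise ValueError(f"unrecognized gcc version line: {first_line!r}")
    return tuple(int(part) for part in match.groups())


MIN_MAJOR = 11  # assumption: default gcc major version on Ubuntu 22.04

output = "gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0"  # line reported earlier in this thread
version = parse_gcc_version(output)
print(version, "ok" if version[0] >= MIN_MAJOR else "older than assumed minimum")
```

To check a live container, the input string could come from `subprocess.run(["gcc", "--version"], capture_output=True, text=True).stdout`.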

Contributor Author

@vyom1611 vyom1611 Jun 4, 2024

I am experimenting with images based on newer versions to also fix the errors I was getting with Nsight, but it takes a while because reloading all the containers after every small change is slow. It is such a hassle to build a CUDA dev env in Docker 😅 but I am trying to solve those issues right now.

Contributor

Yep, you're right on that. Good luck!

@vyom1611
Contributor Author

vyom1611 commented Jun 4, 2024

I just fixed the gcc versioning mistake, and cuDNN should also work with the modal script now.

@vyom1611 vyom1611 requested a review from karpathy June 7, 2024 06:43
@vyom1611 vyom1611 requested a review from rosslwheeler June 10, 2024 08:24
@vyom1611 vyom1611 marked this pull request as draft June 12, 2024 19:00
@vyom1611 vyom1611 marked this pull request as ready for review June 12, 2024 19:00
@vyom1611 vyom1611 closed this Jun 12, 2024
karpathy added a commit that referenced this pull request Jun 13, 2024
Benchmark modal script fixed - profiling and cuDNN (Issue #504 and PR #510 fixes)

3 participants