
Remove MPI from multi-GPU example #268

Merged: 9 commits, Feb 7, 2025
Conversation

@jwallwork23 (Collaborator):
Closes #253.

I decided to close #258 and split it into two separate PRs. Here's the first one that removes MPI from the multi-GPU example. (The second one will introduce a CPU-only example that uses MPI.)

Note that I decided to switch the names of the multigpu.py module / MultiGPU class back to simplenet.py / SimpleNet in this example, because the class is a direct copy (unlike MultiIONet, where it is modified).

@jwallwork23 jwallwork23 added documentation Improvements or additions to documentation testing Related to FTorch testing labels Jan 30, 2025
@jwallwork23 jwallwork23 self-assigned this Jan 30, 2025
@jwallwork23 (Collaborator, Author) commented Jan 30, 2025:

Tested on a laptop with one CUDA GPU device, using the following patch:

diff --git a/examples/3_MultiGPU/multigpu_infer_fortran.f90 b/examples/3_MultiGPU/multigpu_infer_fortran.f90
index 297844e..cfba096 100644
--- a/examples/3_MultiGPU/multigpu_infer_fortran.f90
+++ b/examples/3_MultiGPU/multigpu_infer_fortran.f90
@@ -27,7 +27,7 @@ program inference
    type(torch_tensor), dimension(1) :: out_tensors

    ! Variables for multi-GPU setup
-   integer, parameter :: num_devices = 2
+   integer, parameter :: num_devices = 1
    integer :: device_index, i

    ! Get TorchScript model file as a command line argument
diff --git a/examples/3_MultiGPU/multigpu_infer_python.py b/examples/3_MultiGPU/multigpu_infer_python.py
index 1b49398..c504063 100644
--- a/examples/3_MultiGPU/multigpu_infer_python.py
+++ b/examples/3_MultiGPU/multigpu_infer_python.py
@@ -53,7 +53,7 @@ def deploy(saved_model: str, device: str, batch_size: int = 1) -> torch.Tensor:
 if __name__ == "__main__":
     saved_model_file = "saved_multigpu_model_cuda.pt"

-    for device_index in range(2):
+    for device_index in range(1):
         device_to_run = f"cuda:{device_index}"

         batch_size_to_run = 1
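As an aside, the hard-coded device count changed by the patch above could instead be queried at runtime. A minimal sketch, assuming PyTorch is available (this is not part of the PR, just an illustration):

```python
# Hypothetical sketch (not part of this PR): determine the number of
# CUDA devices at runtime instead of hard-coding num_devices.
try:
    import torch

    num_devices = torch.cuda.device_count()  # 0 if CUDA is unavailable
except ImportError:
    num_devices = 0  # torch not installed; nothing to run on

# Build the device strings the example loops over, e.g. ["cuda:0"].
devices = [f"cuda:{i}" for i in range(num_devices)]
print(f"Running inference on {num_devices} CUDA device(s): {devices}")
```

On a machine with one GPU this produces the same single `cuda:0` iteration as the patch, while on two GPUs it reproduces the original `range(2)` behaviour without any edit.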

@jwallwork23 jwallwork23 marked this pull request as ready for review January 30, 2025 14:03
@jwallwork23 jwallwork23 mentioned this pull request Jan 30, 2025
@jwallwork23 jwallwork23 added the gpu Related to building and running on GPU label Jan 30, 2025
@jatkinson1000 (Member) left a comment:
This looks good to me @jwallwork23
Happy for merging after a final confirmation of running on GPU.

One small suggestion: it may be worth adding a note about how to change the number of devices you run on and, alongside this, altering the top of the multi-GPU inference Python script to use an n_devices variable, as in the Fortran.

@jwallwork23 (Collaborator, Author):
> This looks good to me @jwallwork23. Happy for merging after a final confirmation of running on GPU.

Thanks!

> One small suggestion: it may be worth adding a note about how to change the number of devices you run on and, alongside this, altering the top of the multi-GPU inference Python script to use an n_devices variable, as in the Fortran.

Good suggestion - addressed in 3b67ae8.
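For reference, the reviewer's suggestion amounts to something like the following at the top of multigpu_infer_python.py. This is a hedged sketch; the actual change landed in commit 3b67ae8 and may differ in detail:

```python
# Hypothetical sketch of the suggested change (the real version is in
# commit 3b67ae8): hoist the device count into a single variable at the
# top of the script, mirroring `num_devices` in the Fortran example.
num_devices = 2  # edit this to match the number of GPUs available

for device_index in range(num_devices):
    device_to_run = f"cuda:{device_index}"
    batch_size_to_run = 1
    # ... run inference on this device, as in the original example ...
    print(f"Would run on {device_to_run} with batch size {batch_size_to_run}")
```

The benefit is that changing the GPU count becomes a one-line edit in both the Fortran and Python versions of the example, rather than a hard-coded literal buried in a loop.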

@jwallwork23 jwallwork23 mentioned this pull request Feb 6, 2025
@jwallwork23 (Collaborator, Author):
Tested with 2 Ampere GPUs; passes with the latest fix in 4fd8800. Will merge once CI passes.

@jwallwork23 jwallwork23 merged commit 78914a9 into main Feb 7, 2025
6 checks passed
@jwallwork23 jwallwork23 deleted the 253_multigpu-no-mpi branch February 7, 2025 13:45
jatkinson1000 added a commit that referenced this pull request Feb 25, 2025
* Drop ENABLE_MPI CMake argument
* Remove MPI from MultiGPU example
* Mention GPU/CUDA as dependency

---------

Co-authored-by: Jack Atkinson <[email protected]>
Labels: documentation (Improvements or additions to documentation), gpu (Related to building and running on GPU), testing (Related to FTorch testing)
Development

Successfully merging this pull request may close these issues.

Move Example 3 to not require MPI
2 participants