
Remove MPI from multi-GPU example #268

Merged: 9 commits, Feb 7, 2025
Conversation

@jwallwork23 (Collaborator):
Closes #253.

I decided to close #258 and split it into two separate PRs. Here's the first one that removes MPI from the multi-GPU example. (The second one will introduce a CPU-only example that uses MPI.)

Note that I decided to switch the names of the multigpu.py module / MultiGPU class back to simplenet.py / SimpleNet in this example, because the class is a direct copy (unlike MultiIONet, where it is modified).

@jwallwork23 jwallwork23 added documentation Improvements or additions to documentation testing Related to FTorch testing labels Jan 30, 2025
@jwallwork23 jwallwork23 self-assigned this Jan 30, 2025
@jwallwork23 (Collaborator, Author) commented Jan 30, 2025:

Tested on a laptop with one CUDA GPU device, using the following patch:

diff --git a/examples/3_MultiGPU/multigpu_infer_fortran.f90 b/examples/3_MultiGPU/multigpu_infer_fortran.f90
index 297844e..cfba096 100644
--- a/examples/3_MultiGPU/multigpu_infer_fortran.f90
+++ b/examples/3_MultiGPU/multigpu_infer_fortran.f90
@@ -27,7 +27,7 @@ program inference
    type(torch_tensor), dimension(1) :: out_tensors

    ! Variables for multi-GPU setup
-   integer, parameter :: num_devices = 2
+   integer, parameter :: num_devices = 1
    integer :: device_index, i

    ! Get TorchScript model file as a command line argument
diff --git a/examples/3_MultiGPU/multigpu_infer_python.py b/examples/3_MultiGPU/multigpu_infer_python.py
index 1b49398..c504063 100644
--- a/examples/3_MultiGPU/multigpu_infer_python.py
+++ b/examples/3_MultiGPU/multigpu_infer_python.py
@@ -53,7 +53,7 @@ def deploy(saved_model: str, device: str, batch_size: int = 1) -> torch.Tensor:
 if __name__ == "__main__":
     saved_model_file = "saved_multigpu_model_cuda.pt"

-    for device_index in range(2):
+    for device_index in range(1):
         device_to_run = f"cuda:{device_index}"

         batch_size_to_run = 1
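As an aside, the hard-coded device count changed by the patch above could instead be queried at runtime. A minimal sketch, assuming PyTorch is available (this is not part of the PR, just an illustration):

```python
# Hypothetical sketch (not part of this PR): determine the number of
# CUDA devices at runtime instead of hard-coding num_devices.
try:
    import torch

    num_devices = torch.cuda.device_count()  # 0 if CUDA is unavailable
except ImportError:
    num_devices = 0  # torch not installed; nothing to run on

# Build the device strings the example loops over, e.g. ["cuda:0"].
devices = [f"cuda:{i}" for i in range(num_devices)]
print(f"Running inference on {num_devices} CUDA device(s): {devices}")
```

On a machine with one GPU this produces the same single `cuda:0` iteration as the patch, while on two GPUs it reproduces the original `range(2)` behaviour without any edit.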

@jwallwork23 jwallwork23 marked this pull request as ready for review January 30, 2025 14:03
@jwallwork23 jwallwork23 mentioned this pull request Jan 30, 2025
@jwallwork23 jwallwork23 added the gpu Related to building and running on GPU label Jan 30, 2025
@jatkinson1000 (Member) left a comment:
This looks good to me @jwallwork23
Happy for merging after a final confirmation of running on GPU.

One small suggestion: it may be worth adding a note about how to change the number of devices you run on and, alongside this, altering the top of the multi-GPU inference Python script to use an n_devices variable, as in the Fortran.

@jwallwork23 (Collaborator, Author):
> This looks good to me @jwallwork23. Happy for merging after a final confirmation of running on GPU.

Thanks!

> One small suggestion: it may be worth adding a note about how to change the number of devices you run on and, alongside this, altering the top of the multi-GPU inference Python script to use an n_devices variable, as in the Fortran.

Good suggestion - addressed in 3b67ae8.
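For reference, the reviewer's suggestion amounts to something like the following at the top of multigpu_infer_python.py. This is a hedged sketch; the actual change landed in commit 3b67ae8 and may differ in detail:

```python
# Hypothetical sketch of the suggested change (the real version is in
# commit 3b67ae8): hoist the device count into a single variable at the
# top of the script, mirroring `num_devices` in the Fortran example.
num_devices = 2  # edit this to match the number of GPUs available

for device_index in range(num_devices):
    device_to_run = f"cuda:{device_index}"
    batch_size_to_run = 1
    # ... run inference on this device, as in the original example ...
    print(f"Would run on {device_to_run} with batch size {batch_size_to_run}")
```

The benefit is that changing the GPU count becomes a one-line edit in both the Fortran and Python versions of the example, rather than a hard-coded literal buried in a loop.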

@jwallwork23 jwallwork23 mentioned this pull request Feb 6, 2025
@jwallwork23 (Collaborator, Author):
Tested with 2 Ampere GPUs; passes with the latest fix in 4fd8800. Will merge once CI passes.

@jwallwork23 jwallwork23 merged commit 78914a9 into main Feb 7, 2025
6 checks passed
@jwallwork23 jwallwork23 deleted the 253_multigpu-no-mpi branch February 7, 2025 13:45
jatkinson1000 added a commit that referenced this pull request Feb 25, 2025
* Drop ENABLE_MPI CMake argument
* Remove MPI from MultiGPU example
* Mention GPU/CUDA as dependency

---------

Co-authored-by: Jack Atkinson <[email protected]>
Labels: documentation (Improvements or additions to documentation), gpu (Related to building and running on GPU), testing (Related to FTorch testing)
Development

Successfully merging this pull request may close these issues.

Move Example 3 to not require MPI
2 participants