Add notes about running TF with GPUs (#2348)
YuanTingHsieh authored Feb 2, 2024
1 parent 8c10f56 commit 899f11e
Showing 3 changed files with 76 additions and 9 deletions.
28 changes: 26 additions & 2 deletions examples/hello-world/hello-cyclic/README.md
@@ -27,8 +27,8 @@ bash ./prepare_data.sh

Use nvflare simulator to run the hello-examples:

-```
-nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 hello-cyclic/jobs/hello-cyclic
+```bash
+nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 ./jobs/hello-cyclic
 ```

### 3. Access the logs and results
@@ -40,3 +40,27 @@ $ ls /tmp/nvflare/simulate_job/
app_server app_site-1 app_site-2 log.txt

```

### 4. Notes on running with GPUs

For running with GPUs, we recommend using the
[NVIDIA TensorFlow Docker image](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow).

If you run the example on GPUs, note that by default TensorFlow attempts to
allocate all available GPU memory at startup. When multiple clients share the
same GPU, you have a couple of options to address this.

One approach is to set an environment variable that stops TensorFlow from
allocating all GPU memory up front. For instance:

```bash
TF_FORCE_GPU_ALLOW_GROWTH=true nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 ./jobs/hello-cyclic
```
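
If you prefer configuring this in code instead of the environment, here is a
minimal sketch of the equivalent setting (assuming TensorFlow 2.x; it must run
before the first GPU operation):

```python
import tensorflow as tf

# Let TensorFlow grow GPU memory on demand instead of grabbing it all
# at startup; must be called before any GPU operation runs.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```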

If you have more GPUs than clients, an alternative is to run each client on
its own GPU, as illustrated below:

```bash
TF_FORCE_GPU_ALLOW_GROWTH=true nvflare simulator -w /tmp/nvflare/ -n 2 -gpu 0,1 ./jobs/hello-cyclic
```
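
For intuition, assigning clients to GPUs has the same effect as restricting
each client process to one visible device, e.g. via `CUDA_VISIBLE_DEVICES`.
The sketch below is an illustration only (`train.py` is a hypothetical client
script; the simulator handles the assignment for you):

```bash
# Illustration only: each process sees a single GPU as device 0.
CUDA_VISIBLE_DEVICES=0 python train.py  # first client on GPU 0
CUDA_VISIBLE_DEVICES=1 python train.py  # second client on GPU 1
```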
30 changes: 27 additions & 3 deletions examples/hello-world/hello-tf2/README.md
@@ -26,10 +26,10 @@ Prepare the data first:
bash ./prepare_data.sh
```

-Use nvflare simulator to run the hello-examples: (TF2 does not allow multiple processes to be running on a single GPU at the same time. Need to set the simulator threads to 1. "-gpu" option can be used to run multiple concurrent clients.)
+Use nvflare simulator to run the hello-examples:

-```
-nvflare simulator -w /tmp/nvflare/ -n 2 -t 1 hello-tf2/jobs/hello-tf2
+```bash
+nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 ./jobs/hello-tf2
 ```

### 3. Access the logs and results
@@ -41,3 +41,27 @@ $ ls /tmp/nvflare/simulate_job/
app_server app_site-1 app_site-2 log.txt

```

### 4. Notes on running with GPUs

For running with GPUs, we recommend using the
[NVIDIA TensorFlow Docker image](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow).
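
As a sketch, such a container could be launched as below (the image tag is a
placeholder; pick a current tag from the NGC catalog):

```bash
docker run --gpus all -it --rm \
  -v "$PWD":/workspace -w /workspace \
  nvcr.io/nvidia/tensorflow:24.01-tf2-py3
```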

If you run the example on GPUs, note that by default TensorFlow attempts to
allocate all available GPU memory at startup. When multiple clients share the
same GPU, you have a couple of options to address this.

One approach is to set an environment variable that stops TensorFlow from
allocating all GPU memory up front. For instance:

```bash
TF_FORCE_GPU_ALLOW_GROWTH=true nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 ./jobs/hello-tf2
```
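
If you would rather give each client a hard memory cap than allow on-demand
growth, TensorFlow can also pin a process to a fixed slice of a GPU. A minimal
sketch (assuming TensorFlow 2.x; the 2048 MB limit is an arbitrary example):

```python
import tensorflow as tf

# Cap this process at a fixed slice of GPU 0 so that two clients
# can share the device without starving each other.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=2048)],  # in MB
    )
```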

If you have more GPUs than clients, an alternative is to run each client on
its own GPU, as illustrated below:

```bash
TF_FORCE_GPU_ALLOW_GROWTH=true nvflare simulator -w /tmp/nvflare/ -n 2 -gpu 0,1 ./jobs/hello-tf2
```
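
To check that each client actually landed on its own GPU while the simulator
runs, you can list the compute processes per device (assuming `nvidia-smi` is
available on the host):

```bash
# One row per GPU process: which device, which PID, how much memory.
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_gpu_memory --format=csv
```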
27 changes: 23 additions & 4 deletions examples/hello-world/ml-to-fl/tf/README.md
@@ -31,8 +31,7 @@ nvflare job list_templates
\* depends on whether TF can find a GPU or not


-Note that for running with GPUs, we recommend using [NVIDIA TensorFlow docker](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
-
+For running with GPUs, please check the [note](#notes-on-running-with-gpus).

## Transform CIFAR10 TensorFlow training code to FL with NVFLARE Client API

@@ -108,7 +107,27 @@ Then we can run the job using the simulator:

```bash
bash ./prepare_data.sh
-TF_GPU_ALLOCATOR=cuda_malloc_async nvflare simulator -n 2 -t 2 ./jobs/tensorflow_multi_gpu -w tensorflow_multi_gpu_workspace
+nvflare simulator -n 2 -t 2 ./jobs/tensorflow_multi_gpu -w tensorflow_multi_gpu_workspace
```

-Note that the flag "TF_GPU_ALLOCATOR=cuda_malloc_async" is only needed if you are going to run more than one process in the same GPU.
+## Notes on running with GPUs

If you run the example on GPUs, note that by default TensorFlow attempts to
allocate all available GPU memory at startup. When multiple clients share the
same GPU, you have a couple of options to address this.

One approach is to set environment variables that stop TensorFlow from
allocating all GPU memory up front. For instance:

```bash
TF_FORCE_GPU_ALLOW_GROWTH=true TF_GPU_ALLOCATOR=cuda_malloc_async nvflare simulator -n 2 -t 2 ./jobs/tensorflow_multi_gpu -w tensorflow_multi_gpu_workspace
```
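
To see how much GPU memory each client process actually uses under these
settings, one option is to query it from inside the training code. A sketch
(`tf.config.experimental.get_memory_info` is available in TensorFlow 2.5+):

```python
import tensorflow as tf

# Report this process's current and peak GPU memory usage, in bytes.
info = tf.config.experimental.get_memory_info("GPU:0")
print(f"current={info['current']} peak={info['peak']}")
```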

If you have more GPUs than clients, an alternative is to run each client on
its own GPU, as illustrated below:

```bash
nvflare simulator -n 2 -gpu 0,1 ./jobs/tensorflow_multi_gpu -w tensorflow_multi_gpu_workspace
```
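
To confirm the per-client isolation from inside the training script, each
process can print the devices it sees. A minimal sketch; with one client per
GPU, each client should report exactly one device:

```python
import tensorflow as tf

# With per-client GPU assignment, this should list exactly one device.
print(tf.config.list_physical_devices("GPU"))
```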
