feat(gpu): Update and enhance GPU initialization script #1363

cjac · 2025-10-12T04:06:09Z

This commit introduces several improvements to the install_gpu_driver.sh script and its documentation:

Refactored for Custom Images: The script now supports a deferred configuration model when used for building custom Dataproc images. By passing --metadata invocation-type=custom-images, the script performs driver/toolkit installations but defers Hadoop/Spark-specific settings to the first boot of a cluster instance via a systemd service (dataproc-gpu-config.service). This ensures compatibility with the custom image build process.
Improved Dependency Handling:
- Added logic to handle potentially missing kernel-devel packages from vaulted or staging Rocky Linux repositories.
- Ensures python3-venv is installed on Ubuntu 2.2+ for the GPU agent.
- Corrected Conda root path for Dataproc 2.3+.
Enhanced Repository and Key Management:
- Updated GPG key fetching for NVIDIA Container Toolkit and CUDA repositories on Debian/Ubuntu to include necessary keys and proxy support.
NVIDIA Artifact Hash Verification: Added an associative array recognized_hashes to store known SHA256 sums for downloaded NVIDIA driver and CUDA .run files. The script now checks the hash of downloaded files against this list, although it currently only warns on mismatch.
Documentation Updates (README.md):
- Clarified default CUDA versions per Dataproc image series.
- Updated example gcloud commands to be more complete and modern.
- Detailed the new invocation-type metadata for custom image builds.
- Reorganized and updated sections on cuDNN, metadata parameters, Secure Boot, and troubleshooting.
- Added an important section on performance implications and the benefits of cache warming, especially when builds from source are required.
- Noted that the GPU agent now handles metric creation, deprecating the need for the create_gpu_metrics.py script.
- Changed default for install-gpu-agent to true.
Script Robustness:
- Added set +e around get_metadata_attribute calls to handle missing attributes gracefully.
- Improved error handling and messages in various functions.
- Ensured YARN/Spark configurations are only applied if the respective config directories exist.
- MIG scripts are fetched if missing when a MIG-enabled GPU is detected during the configuration phase.

These changes aim to make the GPU initialization action more reliable, flexible, and easier to use, both for regular cluster creation and custom image building.
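
The warn-only hash check described above can be sketched roughly as follows. Only the array name `recognized_hashes` comes from this PR; the function name `verify_download` and the placeholder entry are hypothetical and do not reflect the actual contents of `install_gpu_driver.sh`.

```bash
#!/usr/bin/env bash
# Minimal sketch, assuming hashes are keyed by file name.
# The entry below is a placeholder, not a real NVIDIA artifact hash.
declare -A recognized_hashes=(
  ["example-driver.run"]="0000000000000000000000000000000000000000000000000000000000000000"
)

# verify_download FILE
# Compares the SHA256 sum of FILE with the recorded value, if any.
# Mirrors the warn-only behaviour: it never fails the caller.
verify_download() {
  local file="$1"
  local name expected actual
  name="$(basename "${file}")"
  expected="${recognized_hashes[${name}]:-}"
  if [[ -z "${expected}" ]]; then
    echo "WARN: no recognized hash for ${name}" >&2
    return 0
  fi
  actual="$(sha256sum "${file}" | awk '{print $1}')"
  if [[ "${actual}" != "${expected}" ]]; then
    echo "WARN: hash mismatch for ${name}: expected ${expected}, got ${actual}" >&2
  fi
  return 0
}
```

Turning the warning into a hard failure would only require returning non-zero from the mismatch branch, which may be a reasonable follow-up once the hash list is considered complete.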

Accepts PR #1357
Fixes Issue #1356
Addresses GoogleCloudDataproc/custom-images#110 in the initialization-actions repository. I don't think a separate Issue was opened for this work.
Fixes a long-running issue in which Rocky systems could not stop the hadoop-yarn-nodemanager service with /etc/init.d/hadoop-yarn-nodemanager stop or systemctl stop hadoop-yarn-nodemanager. This fix should be contributed upstream to Bigtop; HEAD on that repo has native systemd services, which may address the issue going forward. This fix only addresses legacy images built before the next Bigtop release.

cjac · 2025-10-12T04:06:53Z

/gcbrun

cjac · 2025-10-12T04:33:00Z

/gcbrun

cjac · 2025-10-12T16:35:54Z

/gcbrun

cjac · 2025-10-12T16:58:45Z

/gcbrun

cjac · 2025-10-12T17:03:10Z

/gcbrun

cjac · 2025-10-12T19:37:57Z

/gcbrun

This commit introduces several improvements to the `install_gpu_driver.sh` script and its documentation:

1. **Refactored for Custom Images:** The script now supports a deferred configuration model when used for building custom Dataproc images. By passing `--metadata invocation-type=custom-images`, the script performs driver/toolkit installations but defers Hadoop/Spark-specific settings to the first boot of a cluster instance via a systemd service (`dataproc-gpu-config.service`). This ensures compatibility with the custom image build process.
2. **Improved Dependency Handling:**
   * Added logic to handle potentially missing `kernel-devel` packages from vaulted or staging Rocky Linux repositories.
   * Ensures `python3-venv` is installed on Ubuntu 2.2+ for the GPU agent.
   * Corrected Conda root path for Dataproc 2.3+.
3. **Enhanced Repository and Key Management:**
   * Updated GPG key fetching for NVIDIA Container Toolkit and CUDA repositories on Debian/Ubuntu to include necessary keys and proxy support.
4. **NVIDIA Artifact Hash Verification:** Added an associative array `recognized_hashes` to store known SHA256 sums for downloaded NVIDIA driver and CUDA `.run` files. The script now checks the hash of downloaded files against this list, although it currently only warns on mismatch.
5. **Documentation Updates (README.md):**
   * Clarified default CUDA versions per Dataproc image series.
   * Updated example `gcloud` commands to be more complete and modern.
   * Detailed the new `invocation-type` metadata for custom image builds.
   * Reorganized and updated sections on cuDNN, metadata parameters, Secure Boot, and troubleshooting.
   * Added an important section on performance implications and the benefits of cache warming, especially when builds from source are required.
   * Noted that the GPU agent now handles metric creation, deprecating the need for the `create_gpu_metrics.py` script.
   * Changed default for `install-gpu-agent` to `true`.
6. **Script Robustness:**
   * Added `set +e` around `get_metadata_attribute` calls to handle missing attributes gracefully.
   * Improved error handling and messages in various functions.
   * Ensured YARN/Spark configurations are only applied if the respective config directories exist.
   * MIG scripts are fetched if missing when a MIG-enabled GPU is detected during the configuration phase.
   * Repaired the broken `stop` function in `/etc/init.d/hadoop-yarn-nodemanager`.
   * Removed the dependency on `lspci`.
   * A `false` value for either the `install-gpu-agent` or the `enable-gpu-monitoring` metadata key now disables GPU metrics collection.

These changes aim to make the GPU initialization action more reliable, flexible, and easier to use, both for regular cluster creation and custom image building.
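
As a rough illustration of the `set +e` handling and the two metadata switches mentioned in this commit message, here is a minimal sketch. The `get_metadata_attribute` below is a stand-in backed by an in-memory map (`FAKE_METADATA`, hypothetical) rather than the real helper, and `read_attribute_or_default` and `gpu_metrics_enabled` are hypothetical names. The `true` default for `enable-gpu-monitoring` is an assumption; only the `install-gpu-agent` default is stated in this PR.

```bash
#!/usr/bin/env bash
set -e

# Stand-in for the real helper: returns non-zero when the key is absent.
declare -A FAKE_METADATA=( ["install-gpu-agent"]="false" )

get_metadata_attribute() {
  local key="$1"
  if [[ -z "${FAKE_METADATA[${key}]+set}" ]]; then
    return 1
  fi
  echo "${FAKE_METADATA[${key}]}"
}

# Wrap the lookup in set +e / set -e so a missing attribute does not
# abort a script that otherwise runs with errexit enabled.
read_attribute_or_default() {
  local key="$1" default_value="$2" value rc
  set +e
  value="$(get_metadata_attribute "${key}")"
  rc=$?
  set -e
  if (( rc != 0 )) || [[ -z "${value}" ]]; then
    value="${default_value}"
  fi
  echo "${value}"
}

# Metrics collection stays on unless either switch is explicitly "false".
gpu_metrics_enabled() {
  local agent monitoring
  agent="$(read_attribute_or_default install-gpu-agent true)"
  monitoring="$(read_attribute_or_default enable-gpu-monitoring true)"
  [[ "${agent,,}" != "false" && "${monitoring,,}" != "false" ]]
}
```

A caller could then guard the agent installation with `if gpu_metrics_enabled; then ...; fi`.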

cjac · 2025-10-13T17:28:38Z

/gcbrun

This commit addresses several issues related to NodeManager stability on Rocky Linux and fixes errors in the verification scripts.

**NodeManager Restart:**

* Ensures `hadoop-yarn-nodemanager` service is disabled at the start of the init action to prevent conflicts with the `Restart=always` policy.
* The service is now masked within the `yarn_exit_handler` before port checks and unmasked/enabled just before starting.
* The LSB init script (`/etc/init.d/hadoop-yarn-nodemanager`) now uses the `daemon` function correctly, passing the `nodemanager` command without additional `--daemon start` flags, allowing the LSB wrapper to manage the daemon lifecycle.
* Removed a duplicate definition of the `ensure_good_nodemanager_init_script` function.
* Added aggressive port clearing for all NodeManager related ports in the `stop()` function of the LSB script.

**Verification Script Fixes:**

* Corrected quoting and variable expansion in the `verify_pytorch` command string in `gpu_test_case_base.py` to prevent remote shell syntax errors.
* Fixed an `AttributeError` in `verify_cluster.py` by changing `self.getClusterRegion()` to `self.cluster_region`.

These changes aim to make NodeManager restarts more reliable during the GPU initialization process and ensure the verification scripts run correctly.
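
The service ordering described in this commit message (disable early, mask before the port checks, unmask and enable just before starting) might look roughly like the sketch below. `DRY_RUN`, `run`, `disable_nodemanager`, and `restart_nodemanager` are hypothetical names introduced for illustration; the real `yarn_exit_handler` is not reproduced here, and the port checks are only indicated by a comment.

```bash
#!/usr/bin/env bash
# Dry-run by default so the sequence can be inspected without systemd.
DRY_RUN="${DRY_RUN:-1}"
SERVICE="hadoop-yarn-nodemanager"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [[ "${DRY_RUN}" == "1" ]]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Start of the init action: disable the service so it does not
# conflict with the Restart=always policy.
disable_nodemanager() {
  run systemctl disable "${SERVICE}"
}

# Exit-handler phase: mask, check ports, then unmask/enable and start.
restart_nodemanager() {
  run systemctl mask "${SERVICE}"
  # (port checks would run here; omitted in this sketch)
  run systemctl unmask "${SERVICE}"
  run systemctl enable "${SERVICE}"
  run systemctl start "${SERVICE}"
}
```

Masking prevents any start request from succeeding while the ports are being checked, which is why the unmask has to come before the final `enable` and `start`.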

cjac · 2025-10-15T17:36:54Z

/gcbrun

cjac · 2025-10-15T18:19:03Z

/gcbrun

cjac · 2025-10-15T18:26:30Z

/gcbrun

cjac · 2025-10-15T18:48:00Z

/gcbrun

cjac · 2025-10-15T19:01:41Z

/gcbrun

cjac · 2025-10-15T19:09:32Z

/gcbrun

cjac · 2025-10-15T19:18:41Z

/gcbrun

cjac · 2025-10-15T19:33:42Z

/gcbrun

cjac · 2025-10-15T20:11:31Z

/gcbrun

cjac · 2025-10-20T07:43:43Z

reverting to previous commit

cjac · 2025-10-20T07:43:48Z

/gcbrun

cjac · 2025-10-21T03:45:00Z

/gcbrun

cjac · 2025-10-21T05:03:26Z

/gcbrun

cjac · 2025-10-21T05:33:42Z

/gcbrun

cjac · 2025-10-21T05:49:34Z

/gcbrun

This commit introduces several major improvements to the GPU initialization action and its testing:

1. **Custom Image Deferred Configuration:**
   - The `install_gpu_driver.sh` script now supports a deferred configuration model for custom image builds, triggered by the `invocation-type=custom-images` metadata.
   - A systemd service (`dataproc-gpu-config.service`) is created to run Hadoop/Spark specific configurations on the first boot of instances created from the custom image.
   - Logic for generating the deferred script and service is encapsulated in new functions.
2. **Presubmit Script Enhancements:**
   - Fixed an issue in `run-presubmit-on-k8s.sh` where the log streaming loop could run indefinitely if a pod was deleted unexpectedly. The loop now checks for pod existence.
   - Updated `cloudbuild/Dockerfile` to install Python dependencies from `requirements.txt`.
3. **Test Infrastructure:**
   - Added `gpu/gpu_test_case_base.py` to provide a base class for GPU-related test cases.
   - Introduced `gpu/verify_cluster.py` for post-creation cluster validation.
   - Updated `gpu/BUILD` to include the new test base.
4. **Script Improvements (`install_gpu_driver.sh`):**
   - Improved robustness of package/URL fetching with retries.
   - Enhanced NVIDIA Container Toolkit repository and key setup.
   - More sophisticated handling of CUDA, Driver, cuDNN, and NCCL version matrix.
   - Better caching mechanisms for build artifacts.
   - Refined Secure Boot module signing logic.
   - Updated Conda environment paths for newer Dataproc versions.
   - Improved error handling and cleanup.
5. **Documentation:**
   - Extensively revamped `gpu/README.md` to reflect new features, metadata, custom image workflow, caching benefits, and troubleshooting.

These changes aim to make the GPU initialization action more robust, easier to test, and fully compatible with custom image build pipelines.
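
To make the deferred-configuration idea concrete, the sketch below writes a oneshot unit that runs a generated script once and then records completion in a marker file. Only the unit name `dataproc-gpu-config.service` comes from this PR; the function name `write_deferred_config_unit`, the script path, the marker file, and every directive in the unit are assumptions for illustration and may differ from what `install_gpu_driver.sh` actually generates.

```bash
#!/usr/bin/env bash
# write_deferred_config_unit UNIT_DIR SCRIPT_PATH MARKER
#   UNIT_DIR     directory for the unit file (e.g. /etc/systemd/system)
#   SCRIPT_PATH  hypothetical location of the generated deferred script
#   MARKER       hypothetical file whose presence means "already configured"
write_deferred_config_unit() {
  local unit_dir="$1" script_path="$2" marker="$3"
  cat > "${unit_dir}/dataproc-gpu-config.service" <<EOF
[Unit]
Description=Deferred GPU configuration for Dataproc (sketch)
After=network-online.target
Wants=network-online.target
# Skip the unit entirely once the marker file exists.
ConditionPathExists=!${marker}

[Service]
Type=oneshot
ExecStart=${script_path}
ExecStartPost=/usr/bin/touch ${marker}
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
}
```

During an image build, the init action would call something like `write_deferred_config_unit /etc/systemd/system <script> <marker>` and then enable the unit, so the Hadoop/Spark settings are applied on the first boot of a cluster instance rather than at build time.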

cjac self-assigned this Oct 12, 2025

cjac force-pushed the gpu-20251011 branch from 73edb7a to 9f0fbfa Compare October 12, 2025 16:35

cjac force-pushed the gpu-20251011 branch from 9f0fbfa to c57bb21 Compare October 12, 2025 16:58

cjac force-pushed the gpu-20251011 branch from c57bb21 to db0297d Compare October 12, 2025 17:02

cjac force-pushed the gpu-20251011 branch from 3837fc1 to 5e70d67 Compare October 13, 2025 17:28

cjac mentioned this pull request Oct 13, 2025

Remove the unsupported 2.2-ubuntu22 from the table #1357

Open

cjac force-pushed the gpu-20251011 branch from 1e7cdfa to 1688073 Compare October 15, 2025 17:36

cjac force-pushed the gpu-20251011 branch from d2eb527 to edfcee2 Compare October 15, 2025 18:26

cjac force-pushed the gpu-20251011 branch from edfcee2 to 11f7c4b Compare October 15, 2025 18:47

cjac force-pushed the gpu-20251011 branch from 11f7c4b to a5fcb12 Compare October 15, 2025 19:01

cjac force-pushed the gpu-20251011 branch from a5fcb12 to d4e0a09 Compare October 15, 2025 19:09

cjac force-pushed the gpu-20251011 branch from d4e0a09 to c74b7b4 Compare October 15, 2025 19:18

cjac force-pushed the gpu-20251011 branch from c74b7b4 to c9d628e Compare October 15, 2025 19:33

cjac force-pushed the gpu-20251011 branch from c9d628e to 6938c22 Compare October 15, 2025 20:11

cjac force-pushed the gpu-20251011 branch 3 times, most recently from 2c57c65 to 9f8ebaa Compare October 20, 2025 07:43

cjac force-pushed the gpu-20251011 branch from 9f8ebaa to 7b9e5f4 Compare October 21, 2025 03:44

cjac force-pushed the gpu-20251011 branch from 7b9e5f4 to 30615ad Compare October 21, 2025 05:03

cjac force-pushed the gpu-20251011 branch from 30615ad to b9dcb58 Compare October 21, 2025 05:33

cjac requested a review from dilipgodhia October 21, 2025 16:12

dilipgodhia approved these changes Oct 21, 2025

View reviewed changes

cjac force-pushed the gpu-20251011 branch from b9dcb58 to bc30135 Compare October 21, 2025 17:04

cjac force-pushed the gpu-20251011 branch from bc30135 to 7b00fda Compare October 21, 2025 17:11

cjac marked this pull request as ready for review October 21, 2025 17:12

cjac merged commit 349f7d3 into GoogleCloudDataproc:main Oct 21, 2025
1 of 2 checks passed
