627 commits
405eb21
Merge pull request #1164 from ajdecon/release-22.04
ajdecon Apr 26, 2022
7835e21
update ansible to match kubespray supported versions
ajdecon Apr 26, 2022
3a547e1
Merge pull request #1165 from ajdecon/update-ansible-kubespray
ajdecon Apr 27, 2022
8a8b560
Update NVIDIA signing key for package repos
ajdecon Apr 28, 2022
9b80ba2
Merge pull request #1166 from ajdecon/update-nv-signing-key
ajdecon May 2, 2022
4ce7605
Update default Slurm version to 21.08.8
ajdecon May 4, 2022
1dd8a46
update to -2 release from schedmd
ajdecon May 10, 2022
7e0b7d6
Merge pull request #1169 from ajdecon/slurm-21.08.8
ajdecon May 12, 2022
f9bb50e
Simpler fix for deploying trident on kube masters.
jguynvidia May 26, 2022
e26f082
Added newline at end of file
jguynvidia May 26, 2022
b8bf790
update default mofed version
ajdecon Jun 3, 2022
26aa7cf
Merge pull request #1100 from ajdecon/mofed-role
ajdecon Jun 6, 2022
e91ba49
Merge pull request #1176 from jasonguy/trident_fix_alt
ajdecon Jun 8, 2022
740df4b
Updated NCCL results with DGX A100s; updated MPI commands; Updated NC…
yangatgithub Jun 14, 2022
132636c
Minor editing to update image name
yangatgithub Jun 14, 2022
10d152a
Add Dockerfile for NCCL MPI validation
yangatgithub Jun 14, 2022
3f89eaa
Have same slurm.conf among nodes and controller
seyong-um Jun 15, 2022
508c9ae
Add slurm_conf_symlink flag
seyong-um Jun 16, 2022
c4f029a
Change URL of Fedora EPEL GPG key
ajdecon Jun 16, 2022
22d8d2d
Merge pull request #1184 from ajdecon/molecule-fix
ajdecon Jun 16, 2022
255c56e
Switch to using official MetalLB helm repo
ajdecon Jun 16, 2022
4c8bede
Apply code review
seyong-um Jun 16, 2022
36cd63d
Merge pull request #1182 from hkmc-airlab/share-slurm-conf
ajdecon Jun 17, 2022
6593691
Fix bugs preventing slurm reinstall or rebuild
biocyberman Jun 21, 2022
6e17f2b
Merge pull request #1180 from yangatgithub/network_operator_06
ajdecon Jun 22, 2022
89afecb
Change execution condidtion for stop docker.slurm-exporter service
biocyberman Jun 23, 2022
efbf24f
Merge pull request #1187 from biocyberman/vang/dev
ajdecon Jun 23, 2022
891c7a7
update main readme with additional levels of transparency.
tuttlebr Jun 27, 2022
2e0ba5b
add more details and format appropriately
tuttlebr Jun 27, 2022
436be63
Specify /run partition size
seyong-um Jul 1, 2022
cc3585a
Merge pull request #1191 from hkmc-airlab/resize-run-partition
ajdecon Jul 5, 2022
48056f9
Update default Slurm version to 22.05.2
ajdecon Jul 6, 2022
d680b41
Remove CgroupReleaseAgentDir which is no longer supported in 22.05.x
ajdecon Jul 7, 2022
5266e5c
Convert gres.conf syntax from CPUs to Cores
ajdecon Jul 7, 2022
a9d0e32
Update prometheus roles
0leaf Jul 8, 2022
8888b4e
Add alertmanager roles
0leaf Jul 8, 2022
5391467
Add alertmanager playbook
0leaf Jul 8, 2022
baf7b6c
fixed typos and removed static verion links in Release Notes
tuttlebr Jul 8, 2022
561842f
update airgap docs with TOC and standard formatting
tuttlebr Jul 8, 2022
fd7fadb
update cloud-native readme as deprecated and link to migrated repo.
tuttlebr Jul 8, 2022
6c17660
update container docs with standard TOC and formatting
tuttlebr Jul 8, 2022
604de4b
fix ngc-ready doc formatting issue
tuttlebr Jul 8, 2022
198c0a7
fix formatting issue with docker-rootless doc
tuttlebr Jul 8, 2022
ab97569
format and update docs/deepods documentation
tuttlebr Jul 8, 2022
a64c869
format and update docs/dev documentation
tuttlebr Jul 8, 2022
a06113d
format and update docs/dev documentation
tuttlebr Jul 8, 2022
4440a6c
format and update docs/k8s-cluster documentation
tuttlebr Jul 8, 2022
5f52f72
format and update docs/misc documentation
tuttlebr Jul 8, 2022
1ac9f85
format and update docs/ngc-ready documentation
tuttlebr Jul 8, 2022
3e1b6fb
format and update docs/pxe documentation
tuttlebr Jul 8, 2022
224980b
format and update docs/slurm-cluster documentation
tuttlebr Jul 8, 2022
0c9ac0e
format and update docs/slurm-cluster documentation
tuttlebr Jul 8, 2022
e7ac21c
format and update documentation
tuttlebr Jul 8, 2022
d4c1522
add logic to autodetect gpus with nvml in slurm
ajdecon Apr 13, 2022
6ae6520
add documentation on use of NVML
ajdecon Apr 13, 2022
f633f34
Add note on configuring MIG to Slurm NVML docs
ajdecon Apr 14, 2022
e4ae3c9
nvidia-mig-manager role: specify use of bash so we can run pipefail s…
ajdecon Apr 18, 2022
bb48796
add mig configuration example to slurm docs
ajdecon Apr 18, 2022
1fdc88f
fix doc reference to nhc config
ajdecon Apr 18, 2022
fa19fbd
Default use of NVML to true
ajdecon Jul 7, 2022
23e6e01
disable nvml in molecule test
ajdecon Jul 8, 2022
e216065
Merge pull request #1188 from tuttlebr/documentation-updates-062022
ajdecon Jul 8, 2022
c4a8dab
Fix alertmanager roles relative path to variable
0leaf Jul 10, 2022
2927a59
Fix prometheus roles relative path to variable
0leaf Jul 10, 2022
d9214ad
Add slurm monitoring images/readme
0leaf Jul 10, 2022
1fd6d28
Add slurm monitoring alertmanager
0leaf Jul 10, 2022
ac47483
Merge pull request #1198 from hkmc-airlab/slurm-alertmanager
ajdecon Jul 10, 2022
ad84379
update kubernetes-sigs/kubespray link
Jul 21, 2022
3ff2942
Merge pull request #1200 from elgalu/kubespray-link
ajdecon Jul 22, 2022
221fc55
update to kubespray v2.19.0
ajdecon Aug 5, 2022
81dcc0d
Update dependency versions for Ansible Galaxy roles and collections
ajdecon Aug 9, 2022
9f6749e
Update default values for role components
ajdecon Aug 9, 2022
6cdd1bf
Update chart versions in deploy scripts
ajdecon Aug 9, 2022
65bf672
rollback new version of metallb due to backward compat issue
ajdecon Aug 9, 2022
a639210
Move nccl test container from network operator direcotry to src/conta…
yangatgithub Aug 10, 2022
2388da9
Merge branch 'slurm-22.05.2' into 22.08-versions
ajdecon Aug 16, 2022
2bc4d0b
Merge pull request #1209 from yangatgithub/nccl_container
ajdecon Aug 16, 2022
28dd8bd
Fix to Kubespray configuration for Slurm setup (Docker)
ajdecon Aug 17, 2022
756d10b
Merge pull request #1208 from ajdecon/22.08-versions
ajdecon Aug 17, 2022
d89c746
Merge pull request #1196 from ajdecon/slurm-gres-fix
ajdecon Aug 18, 2022
e3cf2b5
Update to latest version of Nvidia driver role
dholt Aug 18, 2022
0a235c7
Merge pull request #1216 from NVIDIA/dholt-patch-1
dholt Aug 19, 2022
5b4bce1
DeepOps Release 22.08
dholt Aug 23, 2022
b7f7fd9
Update doc with instructions for creating a new release
dholt Aug 23, 2022
2d85edf
update wording
dholt Aug 23, 2022
ed01c11
Fix link to PR
dholt Aug 23, 2022
5fdde40
Merge pull request #1218 from dholt/release-22.08
dholt Aug 24, 2022
2cf44a8
Merge pull request #1219 from dholt/update-release-howto-docs
dholt Aug 24, 2022
30f4dee
Merge pull request #1157 from ajdecon/slurm-nvml
ajdecon Sep 30, 2022
9c0e79f
Bump Kubeflow fo v1.6
supertetelman Oct 28, 2022
6ec8ffd
Merge pull request #11 from NVIDIA/master
supertetelman Oct 28, 2022
064dd98
Merge pull request #12 from supertetelman/master
supertetelman Oct 28, 2022
4fc1d85
Remove docker from nightly K8s tests
supertetelman Oct 31, 2022
bc93738
Airgap documentation update.
mkunin-work Nov 16, 2022
d0370ff
Add CodeQL workflow for GitHub code scanning
lgtm-migrator Nov 30, 2022
5e47208
Merge pull request #1249 from supertetelman/kubeflow-v1.6
ajdecon Dec 22, 2022
76a0011
Bump GFD from v0.6-> v0.7
iamadrigal Dec 20, 2022
80ce28c
Bump GPU device plugin v0.12.2 -> v0.13.0
iamadrigal Dec 20, 2022
4e8b207
Bump GPU Operator v1.11.1 -> v22.9.1
iamadrigal Dec 20, 2022
c55e786
bump GPU Operator from github to ngc
supertetelman Mar 28, 2023
574eea3
Change centos epel uri from download.* to dl.* for valid SSL
supertetelman Mar 29, 2023
047a566
Conform to standard gpu operator namespacing
supertetelman Apr 15, 2023
972e70c
Install new jmespath requirement in setup.sh
supertetelman Apr 15, 2023
b6ee676
Fix non-GPU Operator installs by allowing installation into default n…
supertetelman Apr 15, 2023
5261d43
Add virtual support for Ubuntu 22.04
supertetelman Apr 15, 2023
f2c4438
Merge pull request #1262 from supertetelman/jmespath
ajdecon Apr 17, 2023
e4eee7f
Merge pull request #1264 from supertetelman/non-operator-default-ns
ajdecon Apr 17, 2023
3cca90e
Merge pull request #1265 from supertetelman/ubuntu2204
ajdecon Apr 17, 2023
0e8964f
Merge pull request #1261 from supertetelman/gpu-operator-ns
ajdecon Apr 17, 2023
1f86eba
Merge pull request #1244 from lgtm-migrator/codeql
dholt Apr 19, 2023
7fb8f2c
Merge pull request #1243 from mkunin-nvidia/airgapped-documentation-u…
dholt Apr 19, 2023
4e6b079
Fix hardcoded slurm username
jeremyfix Jun 28, 2023
ad45ce0
Add version to k8s debug
supertetelman Jan 25, 2023
a9e59c4
remove docker runtime tests from multinode jenkins
supertetelman Jan 25, 2023
c8a2162
Add dle test back to nightly jenkins
supertetelman Jan 25, 2023
cc7c2f3
Test local docker registry even with conatainerd runtime
supertetelman Jan 25, 2023
25895ae
add ansible version to debug
supertetelman Jan 25, 2023
2ea2547
Bump to latest kubespreay release of 2.21 with bugfixes
supertetelman Mar 29, 2023
fc78304
change containerd_snapshotter default to native, based on GitHub work…
supertetelman Mar 29, 2023
fbde120
Comment out/remove support for local insecure containerd registries u…
supertetelman Apr 15, 2023
b6d24e3
Bump metallb from 0.12.1 to 0.13.9
supertetelman Apr 15, 2023
a34fd68
Update metallb deployment to use new CRD and remove deprecated inline…
supertetelman Apr 15, 2023
983544a
label metallb ns properly
supertetelman Apr 15, 2023
e65946a
Update Jenkins munge for new metallb config
supertetelman Jul 6, 2023
795e5e9
Update core monitoring/LN services to use control-plane instead of ma…
supertetelman Jul 10, 2023
ddf6511
More comprehensive update of master role -> control-plane
supertetelman Jul 10, 2023
93b0273
Fix multinode Jenkinsfile
supertetelman Jul 10, 2023
40d35ae
Version bumps for gpu operator (23.3.2), GFD (0.8.0), and device plug…
supertetelman Jul 10, 2023
78b363c
Merge pull request #1279 from jeremyfix/patch-1
dholt Jul 25, 2023
c71488e
Bump Kubespray to v2.22.1
supertetelman Jul 10, 2023
495b7a6
Bump Kubeflow (1.7.0) and kustomize (5.1.0)
Jul 11, 2023
a25fdb2
Workaround bug to add kubeflow support for K8s v1.26
supertetelman Jul 11, 2023
7729f46
Update networking config for kubeflow v1.7
supertetelman Jul 11, 2023
c06e9ee
Disable secure cookies in Kubeflow
supertetelman Jul 11, 2023
4c8db30
BUG:1284 - K8s Dashboard update
supertetelman Jul 18, 2023
f339cf1
update roles to latest versions
dholt Jul 25, 2023
55c302f
update nvidia_driver_ubuntu_cuda_keyring_package to latest version
JH-LEE-KR Jul 26, 2023
2b95117
Fix for docker install playbook due to kubespray changes
dholt Jul 27, 2023
b25e195
add ubuntu 22.04 support
dholt Jul 27, 2023
05f52a3
update slurm version
dholt Jul 27, 2023
98a2444
update HPC SDK versions
dholt Jul 27, 2023
d109602
remove duplicate variable for openmpi version
dholt Jul 27, 2023
3311a4b
use latest version by default
dholt Jul 27, 2023
06b6f20
update version
dholt Jul 27, 2023
8ff806d
move some version defaults out of config to simplify updates
dholt Jul 27, 2023
e118b3a
set default newer driver version
dholt Aug 2, 2023
59af370
remove driver version
dholt Aug 2, 2023
c2f0aa8
move config to roles
dholt Aug 3, 2023
b6bdffb
update version
dholt Aug 3, 2023
4ac96a4
Merge pull request #1292 from JH-LEE-KR/update-cuda_keyring
dholt Aug 7, 2023
5d98c78
update release tag
dholt Aug 7, 2023
22ef6c0
remove release notes section
dholt Aug 7, 2023
d248b65
Merge pull request #1296 from dholt/release-23.08
michael-balint Aug 28, 2023
6186cf1
Update ansible.cfg
Musab0 Nov 2, 2023
5efe4a5
Update GPU process cleanup logic in SLURM epilog script
ilya-da Sep 14, 2024
a87841c
Increase KillWait to 120 in slurm.conf
ilya-da Sep 14, 2024
78264a9
Merge pull request #1318 from ilya-da/killwait_update
dholt Jan 16, 2025
eb1fa28
Merge pull request #1316 from ilya-da/nvidia-smi_kill
dholt Jan 16, 2025
69e1d48
Fixing the broken link.
mkunin-work Mar 5, 2025
257df81
Merge pull request #1324 from mkunin-work/ngc-ready
dholt Mar 6, 2025
72ac337
update kubernetes version to v1.32.0
alexfrolov Feb 25, 2025
e9d1f11
update versions in setup.sh
alexfrolov Feb 25, 2025
608abeb
ignore error when un-holding docker (in case docker or other packages…
alexfrolov Mar 1, 2025
3420c85
fix bug in install_helm.sh
alexfrolov Mar 1, 2025
4c26441
Merge pull request #1322 from alexfrolov/upgrade-k8s-v1-32
dholt Mar 10, 2025
414d8de
fix typo in docker playbook
alexfrolov Mar 18, 2025
0669eda
Merge pull request #1325 from alexfrolov/upgrade-k8s-v1-32
dholt Mar 18, 2025
4de6428
Use include_tasks instead of ansible.builtin.include since this modul…
jungyh0218 Mar 31, 2025
2594f6b
Merge pull request #1326 from jungyh0218/master
dholt Apr 1, 2025
8237d64
Revert "Update ansible.cfg"
jungyh0218 Apr 3, 2025
05b8046
Merge pull request #1327 from jungyh0218/master
dholt Apr 9, 2025
b465bb6
Fix typo and version for installing docker and kubespray
JH-LEE-KR Jun 29, 2025
79c1763
Merge pull request #1330 from JH-LEE-KR/fix-docker-kubespray
dholt Jun 30, 2025
dbb824f
remove dgxie rest api due to potential vulnerability
dholt Jan 9, 2026
b462a1a
Merge pull request #1334 from dholt/remove-dgxie
dholt Jan 9, 2026
698d6c6
fix: replace distutils with packaging in setup.sh
dholt Feb 18, 2026
16d746e
ci: update GitHub Actions workflows and add setup.sh test
dholt Feb 18, 2026
879ab71
ci: fix ansible-lint version pairing and molecule docker driver
dholt Feb 18, 2026
122d571
fix: install packaging before version check in setup.sh
dholt Feb 18, 2026
f9811e7
fix: replace deprecated ansible patterns in roles
dholt Feb 18, 2026
15e6dc7
refactor: drop Ubuntu 18.04 and CentOS 7 support
dholt Feb 18, 2026
4226a3b
feat: upgrade Ansible to 10.7.0 and all dependencies
dholt Feb 18, 2026
2b1539d
fix: resolve compatibility issues with upgraded dependencies
dholt Feb 18, 2026
e3cc7b1
fix: update kubespray role paths for v2.30.0
dholt Feb 18, 2026
adac6a8
fix: replace pip install docker with apt python3-docker (PEP 668)
dholt Feb 18, 2026
7f3c71c
fix: add passlib to setup.sh dependencies
dholt Feb 19, 2026
a9e4b02
Add MAAS dynamic inventory script for Ansible
dholt Feb 19, 2026
30093ec
Address PR review: remove StrictHostKeyChecking, optimize --host, fix…
dholt Feb 19, 2026
5f82bd7
Remove implicit etcd-kube-master relationship
dholt Feb 19, 2026
ee1cb5d
Update MAAS docs: drop Ubuntu 18.04 refs, update MAAS version, add TO…
dholt Feb 19, 2026
2b62ae3
refactor: rename k8s inventory groups for kubespray v2.30
dholt Feb 19, 2026
e3d93e4
fix: switch containerd snapshotter from native to overlayfs
dholt Feb 19, 2026
07aa6b4
fix: update lint script and molecule configs for CI
dholt Feb 19, 2026
87cf056
fix: update spack role deps from gcc-7 to gcc (Ubuntu 22.04+)
dholt Feb 19, 2026
f2ffb8b
Merge pull request #1336 from dholt/fix/setup-distutils-to-packaging
michael-balint Feb 19, 2026
d462d56
Merge pull request #1337 from dholt/feature/maas-dynamic-inventory
dholt Feb 20, 2026
29eaca5
feat: add MAAS deploy workflow and dynamic inventory integration
dholt Feb 20, 2026
fb5180a
fix: kubectl binary copy works through bastion and cross-platform
dholt Feb 20, 2026
66ef5ba
fix: update config.example and scripts for kubespray v2.30 group names
dholt Feb 20, 2026
13ff300
fix: detect placeholder config values in MAAS inventory script
dholt Feb 20, 2026
0618771
Merge pull request #1338 from dholt/feature/maas-deploy-workflow-v2
michael-balint Feb 20, 2026
d15df99
Merge pull request #1339 from dholt/fix/kubectl-cross-platform
michael-balint Feb 20, 2026
97051de
Merge pull request #1340 from dholt/fix/config-group-renames
michael-balint Feb 20, 2026
cf9f321
fix: address Copilot review feedback on kubectl cross-platform
dholt Feb 20, 2026
e2b4910
fix: address Copilot review feedback on MAAS deploy and inventory
dholt Feb 20, 2026
91ed38e
Merge pull request #1341 from dholt/fix/maas-copilot-feedback
michael-balint Feb 23, 2026
428ca13
Merge pull request #1342 from dholt/fix/kubectl-copilot-feedback
michael-balint Feb 23, 2026
67cd68a
fix: replace yaml.load with yaml.safe_load to prevent deserialization…
dholt Apr 27, 2026
742f923
Merge pull request #1343 from dholt/fix/psirt-yaml-deserialization
dholt Apr 27, 2026
e9bcfa4
chore(release): bump component versions for 26.05
dholt May 14, 2026
252dd33
docs: update Kubernetes inventory group names
dholt May 14, 2026
ff1d89d
chore(maas): update role dependency for Ansible 10
dholt May 14, 2026
ed9cda3
fix(nvidia): support Ubuntu open kernel modules
dholt May 15, 2026
56be783
fix(slurm): make controller setup rerunnable
dholt May 15, 2026
a5c9a97
fix(k8s): use mapping vars for NFS role include
dholt May 15, 2026
908555e
fix(k8s): update network operator role for current chart
dholt May 15, 2026
e6ae743
fix(k8s): run Helm installer with bash
dholt May 16, 2026
f091348
Fix MAAS requirements and document staged upgrades
dholt May 19, 2026
440dd0d
Merge pull request #1344 from NVIDIA/dholt/release-26.05-version-bumps
michael-balint May 26, 2026
8d47978
release: 26.05 notes and README tag
dholt May 26, 2026
8a5a1bc
Merge pull request #1345 from NVIDIA/dholt/release-26.05-notes
michael-balint May 26, 2026
e6bf1f6
feat(dgx): update DGX software stack role
dholt May 27, 2026
c8c74ed
fix(dgx): harden EL8 driver install
dholt May 27, 2026
03e19e9
Merge pull request #1346 from NVIDIA/dholt/dgx-stack-rhel8
dholt May 27, 2026
40d707f
feat: support Ubuntu 24.04 container toolkit
dholt May 28, 2026
a6242f6
Refresh DCGM exporter
dholt May 28, 2026
d4632fe
Refresh CUDA example images
dholt May 28, 2026
11e9b59
Merge pull request #1349 from NVIDIA/dholt/cuda-example-refresh
michael-balint May 28, 2026
e6553c4
Merge pull request #1348 from NVIDIA/dholt/dcgm-exporter-refresh
michael-balint May 28, 2026
975337f
Merge pull request #1347 from NVIDIA/dholt/ubuntu-24-doc-refresh
dholt May 28, 2026
617e72b
feat: support Red Hat container toolkit path
dholt May 28, 2026
ad241ba
Retire legacy PXE provisioning paths
dholt May 28, 2026
0df3a98
docs: clarify legacy CI and virtual lab status
dholt May 28, 2026
226aa8d
Merge pull request #1351 from NVIDIA/dholt/maas-provisioning-cleanup
michael-balint May 28, 2026
1e77de7
Merge pull request #1353 from NVIDIA/dholt/legacy-ci-doc-refresh
michael-balint May 28, 2026
b3f1510
Merge pull request #1350 from NVIDIA/dholt/container-toolkit-rhel
michael-balint May 28, 2026
d59fc62
Refresh Container Toolkit airgap guidance
dholt May 28, 2026
ab5f1e9
Merge pull request #1352 from NVIDIA/dholt/container-toolkit-doc-cleanup
michael-balint Jun 2, 2026
972c7a8
Clarify legacy OS support matrix
dholt May 28, 2026
8858399
Merge pull request #1354 from NVIDIA/dholt/os-matrix-doc-cleanup
michael-balint Jun 2, 2026
9db02ec
Refresh RAPIDS and RoCE examples
dholt May 28, 2026
ce10816
Refresh workload and registry examples
dholt May 29, 2026
2e08909
fix: handle non-dict parsed JSON in firmware version manifest parser
dholt Jul 1, 2026
01b332f
chore(release): bump component versions for 26.07
dholt Jul 2, 2026
3dfe40e
fix: derive DEEPOPS_VERSION for script debug output
dholt Jul 2, 2026
7821567
Merge pull request #1355 from NVIDIA/dholt/fix-parse-manifest-typeerr…
michael-balint Jul 2, 2026
9533349
Merge pull request #1356 from NVIDIA/dholt/example-roce-refresh
michael-balint Jul 2, 2026
fa1e345
Merge pull request #1357 from NVIDIA/dholt/release-26.07-version-bumps
michael-balint Jul 2, 2026
e806b5c
Clarify Kubernetes storage helper ownership
dholt May 29, 2026
78908da
Merge pull request #1358 from NVIDIA/dholt/k8s-storage-helper-refresh
michael-balint Jul 2, 2026
26 changes: 26 additions & 0 deletions .ansible-lint
@@ -0,0 +1,26 @@
---
# Project-level ansible-lint configuration for ansible-lint 26.x
# Profile levels: min, basic, moderate, safety, shared, production
profile: min

# Exclude external/vendored roles and hardware-specific roles
exclude_paths:
- roles/galaxy/
- roles/nvidia-dgx/
- roles/nvidia-dgx-firmware/
- roles/nvidia-gpu-tests/
- submodules/

# Skip rules for pre-existing issues across the codebase.
# These should be fixed incrementally in future PRs.
skip_list:
- fqcn[action-core] # 634 violations: modules not using FQCN
- fqcn[action] # 24 violations: same for non-core
- name[casing] # 526 violations: task names not capitalized
- yaml[truthy] # 152 violations: yes/no instead of true/false
- yaml[octal-values] # 52 violations: octal file modes
- jinja[spacing] # 20 violations: jinja2 spacing
- name[play] # 12 violations: play names
- schema[meta] # 3 violations: meta/main.yml schema
- key-order[task] # 3 violations: task key ordering
- ignore-errors # pre-existing ignore_errors usage
40 changes: 40 additions & 0 deletions .claude/skills/test-playbooks.md
@@ -0,0 +1,40 @@
---
name: test-playbooks
description: Test Ansible playbooks across Ubuntu versions on target machines
---

## Prerequisites
- Virtualenv activated (`source .venv/bin/activate` or `source /opt/deepops/env/bin/activate`)
- Target machines provisioned and accessible via SSH
- Inventory configured: either `config/inventory` (static) or `config/maas-inventory.yml` (MAAS dynamic)

## Steps
1. Verify connectivity: `ansible -m ping all`
2. Run playbook: `ansible-playbook playbooks/<playbook>.yml`
3. Verify results (check playbook output, run smoke tests on targets)
4. To test another OS version: reprovision targets with the new OS, re-run playbook, verify again

## Test Matrix
| Playbook | Inventory groups needed | Test on 24.04 | Test on 22.04 |
|----------|------------------------|---------------|---------------|
| k8s-cluster.yml | kube_control_plane, kube_node, etcd | yes | yes |
| slurm-cluster.yml | slurm-master, slurm-node | yes | yes |
| ngc-ready-server.yml | (any host group) | yes | yes |
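
The matrix above can be turned into a quick pre-flight check before running a playbook. This is a minimal sketch, not part of DeepOps: the `REQUIRED_GROUPS` table is copied from the matrix, while the `missing_groups` helper and the dict-shaped `inventory` argument (group name mapped to a list of hostnames) are hypothetical.

```python
# Required inventory groups per playbook, copied from the test matrix above.
# ngc-ready-server.yml accepts any host group, so it has no fixed requirement.
REQUIRED_GROUPS = {
    "k8s-cluster.yml": ["kube_control_plane", "kube_node", "etcd"],
    "slurm-cluster.yml": ["slurm-master", "slurm-node"],
    "ngc-ready-server.yml": [],
}


def missing_groups(playbook, inventory):
    """List required groups that are absent or empty in the inventory.

    `inventory` is a hypothetical dict of group name -> list of hostnames;
    it is an assumption for illustration, not a DeepOps data structure.
    """
    return [group for group in REQUIRED_GROUPS[playbook] if not inventory.get(group)]


if __name__ == "__main__":
    example = {"kube_control_plane": ["vm-01"], "etcd": ["vm-01"], "kube_node": []}
    print(missing_groups("k8s-cluster.yml", example))
```

An empty result means every group the playbook needs has at least one host.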

## MAAS Users
If using MAAS dynamic inventory (`scripts/maas_inventory.py`), the deploy script automates provisioning:
```bash
./scripts/maas_deploy.sh --status # check VM state
./scripts/maas_deploy.sh --os noble --profile k8s # deploy + tag for K8s
./scripts/maas_deploy.sh --os jammy --profile slurm # deploy + tag for Slurm
./scripts/maas_deploy.sh --profile k8s --tags-only # re-tag without redeploying
./scripts/maas_deploy.sh --release # release VMs
```
Profiles assign MAAS tags that the dynamic inventory maps to Ansible groups:
- **k8s**: first machine = `kube_control_plane` + `etcd`, remaining = `kube_node`
- **slurm**: first machine = `slurm-master`, remaining = `slurm-node`
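
The profile rules above can be sketched as a small function. This is illustrative only and is not the logic in `scripts/maas_deploy.sh`: the `PROFILES` table and `assign_profile_groups` are hypothetical names, and the sketch assumes machines arrive as an ordered list of hostnames.

```python
# Hypothetical sketch of the profile-to-group rules described above.
# Not taken from scripts/maas_deploy.sh; all names here are illustrative.
PROFILES = {
    "k8s": {"first": ["kube_control_plane", "etcd"], "rest": ["kube_node"]},
    "slurm": {"first": ["slurm-master"], "rest": ["slurm-node"]},
}


def assign_profile_groups(profile, machines):
    """Map each machine to the groups its position implies for a profile."""
    rules = PROFILES[profile]  # raises KeyError for an unknown profile
    assignment = {}
    for index, machine in enumerate(machines):
        key = "first" if index == 0 else "rest"
        assignment[machine] = list(rules[key])
    return assignment


if __name__ == "__main__":
    print(assign_profile_groups("k8s", ["vm-01", "vm-02", "vm-03"]))
```

With a single machine, only the "first" groups are assigned, so a one-node `k8s` profile would have a control plane and etcd but no `kube_node` member under these assumed rules.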

## Group Naming
- K8s groups use underscores: `kube_control_plane`, `kube_node`, `k8s_cluster`
- Slurm groups use hyphens: `slurm-master`, `slurm-node`, `slurm-cluster`
- Old hyphenated K8s names (`kube-master`, `kube-node`) are accepted via TAG_ALIASES
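
The alias behaviour can be pictured as a lookup applied to each tag before grouping. A minimal sketch, assuming `TAG_ALIASES` is a plain mapping from old names to current names: the real contents of `TAG_ALIASES` in `scripts/maas_inventory.py` are not shown in this change, so the table and the `normalize_tag` helper below are hypothetical, including the choice of `kube_control_plane` as the target for `kube-master`.

```python
# Hypothetical alias table. Only the two old names come from the text above;
# the actual TAG_ALIASES in scripts/maas_inventory.py may differ.
TAG_ALIASES = {
    "kube-master": "kube_control_plane",
    "kube-node": "kube_node",
}


def normalize_tag(tag):
    """Return the current group name for a tag, leaving unknown tags as-is."""
    return TAG_ALIASES.get(tag, tag)


if __name__ == "__main__":
    for tag in ["kube-master", "kube-node", "slurm-master"]:
        print(tag, "->", normalize_tag(tag))
```

Slurm tags pass through unchanged, which matches the note that Slurm groups keep their hyphenated names.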
31 changes: 31 additions & 0 deletions .github/workflows/ansible-lint-roles.yml
@@ -0,0 +1,31 @@
---
name: run ansible-lint on deepops roles
on:
  - push
  - pull_request
jobs:
  lint:
    runs-on: ubuntu-22.04
    steps:

      - name: check out repo
        uses: actions/checkout@v4
        with:
          path: "${{ github.repository }}"

      - name: set up python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: install dependencies
        run: |
          python3 -m pip install --upgrade pip
          python3 -m pip install ansible-lint==26.1.1 ansible==10.7.0

      - name: run lint script
        env:
          ANSIBLE_LINT_EXCLUDE: "nvidia-dgx|nvidia-gpu-tests"
        run: |
          cd "${{ github.repository }}"
          bash ./scripts/deepops/ansible-lint-roles.sh
41 changes: 41 additions & 0 deletions .github/workflows/codeql.yml
@@ -0,0 +1,41 @@
name: "CodeQL"

on:
  push:
    branches: [ "master", "release-20.02", "release-20.06", "release-20.08", "release-20.10", "release-20.11", "release-20.12", "release-21.03", "release-21.05", "release-21.06", "release-21.09", "release-21.12", "release-22.01", "release-22.04" ]
  pull_request:
    branches: [ "master" ]
  schedule:
    - cron: "38 16 * * 2"

jobs:
  analyze:
    name: Analyze
    runs-on: ubuntu-latest
    permissions:
      actions: read
      contents: read
      security-events: write

    strategy:
      fail-fast: false
      matrix:
        language: [ python ]

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Initialize CodeQL
        uses: github/codeql-action/init@v3
        with:
          languages: ${{ matrix.language }}
          queries: +security-and-quality

      - name: Autobuild
        uses: github/codeql-action/autobuild@v3

      - name: Perform CodeQL Analysis
        uses: github/codeql-action/analyze@v3
        with:
          category: "/language:${{ matrix.language }}"
50 changes: 50 additions & 0 deletions .github/workflows/molecule.yml
@@ -0,0 +1,50 @@
---
name: test ansible roles with molecule
on:
  - push
  - pull_request
jobs:
  build:
    runs-on: ubuntu-24.04
    strategy:
      fail-fast: false
      matrix:
        deepops-role:
          - cachefilesd
          - facts
          - kerberos_client
          - lmod
          - nfs
          - nhc
          - nvidia_dcgm
          - openmpi
          - openshift
          - mofed
          - spack
          # Excluded from Docker CI (require systemd services that can't
          # run in containers): nis_client, rsyslog_client, rsyslog_server,
          # slurm. Tested end-to-end on real MAAS VMs instead.
          # Also excluded: singularity_wrapper (broken upstream Galaxy dep)
    steps:
      - name: check out repo
        uses: actions/checkout@v4
        with:
          path: "${{ github.repository }}"
      - name: set up python
        uses: actions/setup-python@v4
        with:
          python-version: "3.12"
      - name: install dependencies
        run: |
          python3 -m pip install --upgrade pip
          python3 -m pip install ansible==10.7.0 passlib
          python3 -m pip install molecule molecule-plugins[docker] docker
      - name: run molecule test
        env:
          ANSIBLE_ROLES_PATH: "${{ github.workspace }}/${{ github.repository }}/roles/galaxy:${{ github.workspace }}/${{ github.repository }}/roles"
        run: |
          cd "${{ github.repository }}/roles"
          ansible-galaxy role install --force -r ./requirements.yml
          ansible-galaxy collection install --force -r ./requirements.yml
          cd "${{ matrix.deepops-role }}"
          molecule test
25 changes: 25 additions & 0 deletions .github/workflows/setup.yml
@@ -0,0 +1,25 @@
---
name: test setup.sh
on:
  - push
  - pull_request
jobs:
  setup:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os:
          - ubuntu-22.04
          - ubuntu-24.04
    steps:
      - name: check out repo
        uses: actions/checkout@v4

      - name: run setup.sh
        run: bash scripts/setup.sh

      - name: verify ansible in venv
        run: |
          source /opt/deepops/env/bin/activate
          ansible --version
          python3 -c "from packaging.version import Version; print('packaging OK')"
31 changes: 31 additions & 0 deletions .github/workflows/stale.yml
@@ -0,0 +1,31 @@
# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
#
# You can adjust the behavior by modifying this file.
# For more information, see:
# https://github.com/actions/stale
name: Mark stale issues and pull requests

on:
  schedule:
    - cron: '40 0 * * *'

jobs:
  stale:

    runs-on: ubuntu-latest
    permissions:
      issues: write
      pull-requests: write

    steps:
      - uses: actions/stale@v9
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          stale-issue-message: 'This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.'
          stale-pr-message: 'This PR is stale because it has been open for 180 days with no activity. Please update the PR or it will be closed in 7 days.'
          stale-issue-label: 'no-issue-activity'
          stale-pr-label: 'no-pr-activity'
          days-before-issue-stale: 60
          days-before-pr-stale: 180
          exempt-pr-labels: 'no-stale'
          exempt-issue-labels: 'no-stale'
7 changes: 7 additions & 0 deletions .gitignore
@@ -4,6 +4,13 @@
# misc.
.*.swp

# virtualenv
/.venv/

# claude code
/CLAUDE.md
/tasks/

# project-specific
/admin.conf
/config*/
2 changes: 1 addition & 1 deletion .gitmodules
@@ -1,6 +1,6 @@
[submodule "kubespray"]
path = submodules/kubespray
-url = https://github.com/kubernetes-incubator/kubespray.git
+url = https://github.com/kubernetes-sigs/kubespray.git
[submodule "packer-maas"]
path = submodules/packer-maas
url = https://github.com/DeepOps/packer-maas.git
83 changes: 53 additions & 30 deletions README.md
@@ -1,69 +1,92 @@
DeepOps
===

GPU infrastructure and automation tools
# DeepOps

Infrastructure automation tools for Kubernetes and Slurm clusters with NVIDIA GPUs.

## Table of Contents

- [DeepOps](#deepops)
- [Table of Contents](#table-of-contents)
- [Overview](#overview)
- [Deployment Requirements](#deployment-requirements)
- [Provisioning System](#provisioning-system)
- [Cluster System](#cluster-system)
- [Kubernetes](#kubernetes)
- [Slurm](#slurm)
- [Hybrid clusters](#hybrid-clusters)
- [Virtual](#virtual)
- [Updating DeepOps](#updating-deepops)
- [Copyright and License](#copyright-and-license)
- [Issues](#issues)
- [Contributing](#contributing)

## Overview

The DeepOps project encapsulates best practices in the deployment of GPU server clusters and sharing single powerful nodes (such as [NVIDIA DGX Systems](https://www.nvidia.com/en-us/data-center/dgx-systems/)). DeepOps can also be adapted or used in a modular fashion to match site-specific cluster needs. For example:
The DeepOps project encapsulates best practices in the deployment of GPU server clusters and sharing single powerful nodes (such as [NVIDIA DGX Systems](https://www.nvidia.com/en-us/data-center/dgx-systems/)). DeepOps may also be adapted or used in a modular fashion to match site-specific cluster needs. For example:

* An on-prem data center of NVIDIA DGX servers where DeepOps provides end-to-end capabilities to set up the entire cluster management stack
* An existing cluster running Kubernetes where DeepOps scripts are used to deploy Kubeflow and connect NFS storage
* An existing cluster that needs a resource manager / batch scheduler, where DeepOps is used to install Slurm, Kubernetes, or a hybrid of both
* A single machine where no scheduler is desired, only NVIDIA drivers, Docker, and the NVIDIA Container Runtime
- An on-prem data center of NVIDIA DGX servers where DeepOps provides end-to-end capabilities to set up the entire cluster management stack
- An existing cluster running Kubernetes where DeepOps scripts are used to deploy KubeFlow and connect NFS storage
- An existing cluster that needs a resource manager / batch scheduler, where DeepOps is used to install Slurm or Kubernetes
- A single machine where no scheduler is desired, only NVIDIA drivers, Docker, and the NVIDIA Container Runtime

Check out the [video tutorial](https://drive.google.com/file/d/1RNLQYlgJqE8JMv0np8SdEDqeCN2piavF/view) for how to use DeepOps to deploy Kubernetes and Kubeflow on a single DGX Station. This provides a good base test ground for larger deployments.
Latest release: [DeepOps 26.05 Release](https://github.com/NVIDIA/deepops/releases/tag/26.05)

## Releases
It is recommended to use the latest release branch for stable code (linked above). All development takes place on the master branch, which is generally [functional](docs/deepops/testing.md) but may change significantly between releases.

Latest release: [DeepOps 21.09 Release](https://github.com/NVIDIA/deepops/releases/tag/21.09)
## Deployment Requirements

It is recommended to use the latest release branch for stable code (linked above). All development takes place on the master branch, which is generally [functional](docs/deepops/testing.md) but may change significantly between releases.
### Provisioning System

## Getting Started
The provisioning system is used to orchestrate the running of all playbooks and one will be needed when instantiating Kubernetes or Slurm clusters. Current release validation focuses on:

For detailed help or guidance, read through our [Getting Started Guide](docs/) or pick one of the deployment options documented below.
- Ubuntu 22.04 LTS and 24.04 LTS
- NVIDIA DGX OS 6 and 7

## Deployment Options
DeepOps still retains legacy/community-maintained paths for older environments such as DGX OS 4/5, Ubuntu 18.04/20.04, and CentOS 7/8. Treat those paths as compatibility references unless your site validates them for the release you deploy.

### Supported Ansible versions
### Cluster System

DeepOps supports using Ansible 2.9.x.
Ansible 2.10.x and newer are not currently supported.
The cluster nodes follow the requirements described by Slurm or Kubernetes. You may also use a cluster node as the provisioning system, but this is not required. Current release validation focuses on:

### Supported distributions
- Ubuntu 22.04 LTS and 24.04 LTS for generic Kubernetes and Slurm deployments
- NVIDIA DGX OS 6 and 7 for DGX systems
- Red Hat Enterprise Linux / Rocky Linux 8 and 9 for DGX platform software installation through the `nvidia-dgx` role

DeepOps currently supports the following Linux distributions:
DeepOps still retains legacy/community-maintained paths for older environments such as DGX OS 4/5, Ubuntu 18.04/20.04, CentOS 7/8, and the historical DGX EL7 stack. Treat those paths as compatibility references unless your site validates them for the release you deploy.
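
The validated distributions listed above can be checked on a node before any playbooks are run. The following is a minimal sketch, not part of DeepOps: the `VALIDATED` table and the helper names `parse_os_release` and `is_validated` are hypothetical, and the sketch assumes the node exposes the standard `ID` and `VERSION_ID` fields in `/etc/os-release`. It does not try to distinguish DGX OS from generic Ubuntu, and the RHEL / Rocky entries only reflect the `nvidia-dgx` role scope described above.

```python
# Hypothetical pre-flight check; the table mirrors the list in this README
# and is an assumption, not something DeepOps ships or reads.
VALIDATED = {
    "ubuntu": {"22.04", "24.04"},  # generic Kubernetes and Slurm deployments
    "rhel": {"8", "9"},            # nvidia-dgx role only
    "rocky": {"8", "9"},           # nvidia-dgx role only
}


def parse_os_release(text):
    """Parse os-release style KEY=VALUE lines into a dict, stripping quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


def is_validated(info):
    """Return True if the parsed ID / VERSION_ID pair is in VALIDATED."""
    distro = info.get("ID", "")
    version = info.get("VERSION_ID", "")
    # RHEL-family VERSION_ID values include a minor version (for example
    # "9.3"), so only the major version is compared for those distributions.
    if distro in ("rhel", "rocky"):
        version = version.split(".")[0]
    return version in VALIDATED.get(distro, set())
```

On a real node, the input would be the contents of `/etc/os-release`, for example `is_validated(parse_os_release(open("/etc/os-release").read()))`. A `False` result only means the distribution is outside the list above; legacy paths may still work after site-specific validation.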

* NVIDIA DGX OS 4, 5
* Ubuntu 18.04 LTS, 20.04 LTS
* CentOS 7, 8
You may also install a supported operating system on all servers via a 3rd-party solution such as [MAAS](https://maas.io/) or [Foreman](https://www.theforeman.org/), or via an existing site-standard automated installer.
For new Ubuntu 24.04 or DGX OS 7 deployments, prefer Ubuntu autoinstall/cloud-init or MAAS and then apply DeepOps roles after the OS is present.
For DGX platform software installation on top of vanilla Ubuntu or Red Hat family operating systems, see the [DGX software stack role guide](docs/deepops/dgx-software-stack.md).

### Kubernetes

Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications.
Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications. The Kubernetes cluster is instantiated by [Kubespray](submodules/kubespray). Kubespray runs on bare metal and most clouds, using Ansible as its substrate for provisioning and orchestration. Kubespray is a good choice if you are familiar with Ansible, have existing Ansible deployments, or want to run a Kubernetes cluster across multiple platforms. Kubespray handles generic configuration management tasks from the "OS operators" Ansible world, plus some initial K8s clustering (with networking plugins included) and control plane bootstrapping. DeepOps provides additional playbooks for orchestration and optimization of GPU environments.

Consult the [DeepOps Kubernetes Deployment Guide](docs/k8s-cluster/) for instructions on building a GPU-enabled Kubernetes cluster using DeepOps.

For more information on Kubernetes in general, refer to the [official Kubernetes docs](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/).

### Slurm

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
Slurm is an open-source cluster resource management and job scheduling system that strives to be simple, scalable, portable, fault-tolerant, and interconnect agnostic. Slurm currently has been tested only under Linux.

As a cluster resource manager, Slurm provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work. Slurm cluster instantiation is achieved through [SchedMD](https://slurm.schedmd.com/download.html).

Consult the [DeepOps Slurm Deployment Guide](docs/slurm-cluster/) for instructions on building a GPU-enabled Slurm cluster using DeepOps.

For more information on Slurm in general, refer to the [official Slurm docs](https://slurm.schedmd.com/overview.html).

### DGX POD Hybrid Cluster
### Hybrid clusters

**DeepOps does not test or support a configuration where both Kubernetes and Slurm are deployed on the same physical cluster.**

A hybrid cluster with both Kubernetes and Slurm can also be deployed. This is recommended for [DGX POD](https://www.nvidia.com/en-us/data-center/dgx-pod-reference-architecture/) and other setups that wish to make maximal use of the cluster.
[NVIDIA Bright Cluster Manager](https://www.brightcomputing.com/brightclustermanager) is recommended as an enterprise solution which enables managing multiple workload managers within a single cluster, including Kubernetes, Slurm, Univa Grid Engine, and PBS Pro.

Consult the [DeepOps DGX POD Deployment Guide](docs/deepops/dgx-pod.md) for step-by-step instructions on building a GPU-enabled hybrid cluster using DeepOps.
**DeepOps does not test or support a configuration where nodes run heterogeneous operating systems.**
The `nvidia-dgx` role can install NVIDIA DGX platform software on supported DGX systems running Red Hat Enterprise Linux / Rocky Linux 8 or 9; broader Kubernetes or Slurm cluster support on RHEL still requires site-specific validation.

### Virtual

To try DeepOps before deploying it on an actual cluster, a virtualized version of DeepOps may be deployed on a single node using Vagrant. This can be used for testing, adding new features, or configuring DeepOps to meet deployment-specific needs.
To try DeepOps before deploying it on an actual cluster, a virtualized version of DeepOps may be deployed on a single node using Vagrant. This path is useful for learning and local experimentation, but it is a legacy/community-supported lab path and should not be treated as release-grade validation for current GPU clusters.

Consult the [Virtual DeepOps Deployment Guide](virtual/README.md) to build a GPU-enabled virtual cluster with DeepOps.
