Add a new MultiNodeEnvironment CRD to set up the environment needed for running GPU workloads across multiple nodes #225

Open: wants to merge 32 commits into base: main
Commits
9334351
Remove +clientgen from GpuConfig, MigConfig, and ImexChannelConfig
klueska Jan 9, 2025
3acadd6
Add CRD for creating a multi-node environment
klueska Jan 9, 2025
a574e88
Update Makefile to generate MultiNodeEnvironment CRD, client, deepcopy
klueska Jan 9, 2025
78ed90d
Add generated MultiNodeEnvironment CRD, client, and deepcopy
klueska Jan 9, 2025
401ecc3
Make the nvidia.com client set available to the driver
klueska Jan 10, 2025
330c383
Add indirection of ImexManager through wrapping Controller abstraction
klueska Jan 10, 2025
96e1b71
Add a workqueue abstraction for processing objects pulled from informers
klueska Jan 12, 2025
059bffe
Add logic to autogenerate a ResourceClaim from a MultiNodeEnvironment
klueska Jan 10, 2025
57cd7cf
Add logic to autogenerate a DeviceClass from a MultiNodeEnvironment
klueska Jan 12, 2025
ec4a456
Allow either a resourceClaimName or a deviceClassName to be specified
klueska Jan 12, 2025
a3fec3a
Add Deployment support for MultiNodeEnvironments and completely refactor
klueska Jan 14, 2025
c69abf8
Rename and copy cmds / helm charts to split GPU and IMEX drivers
klueska Jan 18, 2025
a6ea6f3
Strip GPU / IMEX drivers to remove corresponding devices
klueska Jan 20, 2025
d5287ce
Add ability to allocate a per-node IMEX daemon via a ResourceClaim
klueska Jan 22, 2025
c1b7b91
Rename MultiNodeEnvironment to ComputeDomain
klueska Jan 22, 2025
05d6482
Rename gpu.nvidia.com/v1alpha1 API to resource.nvidia.com/v1beta1
klueska Jan 22, 2025
3837fe5
Move to GetComputeDomainFunc instead of ComputeDomainExistsFunc
klueska Jan 23, 2025
f3cf66e
Add ComputeDomainStatus and set it as its deployment pods come online.
klueska Jan 23, 2025
da2a4ff
Update ImexDaemonSettingsManager to pull IPs from ComputeDomain status
klueska Jan 23, 2025
f40f376
Add the ability to set affinities as part of a ComputeDomain
klueska Jan 23, 2025
8242c79
Move creation of IMEX channel pool to after the deployment is fully up.
klueska Jan 24, 2025
2eab64d
Add placeholder for Delayed vs. Immediate mode for ComputeDomain
klueska Jan 24, 2025
ee76c09
Support ResourceClaimNames as a list in a ComputeDomain
klueska Jan 25, 2025
c4082bf
WIP: Temporarily import branch of nvidia-container-toolkit
klueska Jan 25, 2025
0a96b95
WIP: Remove explicit mounting of nvidia-imex and nvidia-imex-ctl
klueska Jan 25, 2025
76385cd
Add a finalizer to ComputeDomains to ensure they are the last removed
klueska Jan 27, 2025
fc1e1da
Add optimization to avoid redundant Delete calls
klueska Jan 27, 2025
c19f6c0
Standardize on passing ComputeDomainUID to RemoveFinalizer calls
klueska Jan 27, 2025
b7c3182
Remove unnecessary code to check for ComputeDomain existence
klueska Jan 27, 2025
bf742a7
Pull RemoveFinalizer() out of Delete() and call it conditionally
klueska Jan 28, 2025
a9ea832
Rename ImexDaemonSettingsManager and move CDI edits for Channeln there
klueska Jan 28, 2025
3b7a65b
Skip injection of the IMEX channel device node if no cliqueID
klueska Jan 28, 2025
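
Several of the commits above (76385cd, fc1e1da, bf742a7) concern adding and removing a finalizer on ComputeDomains and avoiding redundant calls. The following is a minimal, standard-library-only sketch of that general pattern, not the code in this PR: the `object` type, the finalizer string, and both helper functions are hypothetical names introduced for illustration. Each helper reports whether it changed anything, so a caller can skip an unnecessary update.

```go
package main

import "fmt"

// computeDomainFinalizer is a hypothetical finalizer name used only in this sketch.
const computeDomainFinalizer = "example.com/computeDomain"

// object is a stand-in for the metadata of a Kubernetes object (assumption).
type object struct {
	Name       string
	Finalizers []string
}

// addFinalizer appends f if it is not already present and reports whether
// the object changed, so callers can skip a redundant update.
func addFinalizer(o *object, f string) bool {
	for _, existing := range o.Finalizers {
		if existing == f {
			return false
		}
	}
	o.Finalizers = append(o.Finalizers, f)
	return true
}

// removeFinalizer drops every occurrence of f and reports whether the
// object changed.
func removeFinalizer(o *object, f string) bool {
	kept := make([]string, 0, len(o.Finalizers))
	for _, existing := range o.Finalizers {
		if existing != f {
			kept = append(kept, existing)
		}
	}
	changed := len(kept) != len(o.Finalizers)
	o.Finalizers = kept
	return changed
}

func main() {
	cd := &object{Name: "demo"}
	fmt.Println(addFinalizer(cd, computeDomainFinalizer))    // true
	fmt.Println(addFinalizer(cd, computeDomainFinalizer))    // false
	fmt.Println(removeFinalizer(cd, computeDomainFinalizer)) // true
	fmt.Println(removeFinalizer(cd, computeDomainFinalizer)) // false
}
```

Returning a "changed" flag is one way to make removal conditional at the call site; the PR itself may structure this differently.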
5 changes: 3 additions & 2 deletions .gitignore
@@ -1,7 +1,8 @@
 .cache/
 .bash_history
-/nvidia-dra-controller
-/nvidia-dra-plugin
+/nvidia-dra-imex-controller
+/nvidia-dra-imex-plugin
+/nvidia-dra-gpu-plugin
 .idea
 [._]*.sw[a-p]
 coverage.out
72 changes: 70 additions & 2 deletions Makefile
@@ -79,7 +79,7 @@ goimports:
 	find . -name \*.go \
 		-not -name "zz_generated.deepcopy.go" \
 		-not -path "./vendor/*" \
-		-not -path "./pkg/nvidia.com/resource/clientset/versioned/*" \
+		-not -path "./$(PKG_BASE)/clientset/versioned/*" \
 		-exec goimports -local $(MODULE) -w {} \;
 
 golangci-lint:
@@ -101,7 +101,22 @@ coverage: test
 	cat $(COVERAGE_FILE) | grep -v "_mock.go" > $(COVERAGE_FILE).no-mocks
 	go tool cover -func=$(COVERAGE_FILE).no-mocks
 
-generate: generate-deepcopy fmt
+generate: generate-crds generate-informers fmt
+
+generate-crds: generate-deepcopy .remove-crds
+	for dir in $(CLIENT_SOURCES); do \
+		controller-gen crd:crdVersions=v1 \
+			paths=$(CURDIR)/$${dir} \
+			output:crd:dir=$(CURDIR)/deployments/helm/tmp_crds; \
+	done
+	mkdir -p $(CURDIR)/deployments/helm/$(GPU_DRIVER_NAME)/crds
+	cp -R $(CURDIR)/deployments/helm/tmp_crds/* \
+		$(CURDIR)/deployments/helm/$(GPU_DRIVER_NAME)/crds
+	mkdir -p $(CURDIR)/deployments/helm/$(IMEX_DRIVER_NAME)/crds
+	cp -R $(CURDIR)/deployments/helm/tmp_crds/* \
+		$(CURDIR)/deployments/helm/$(IMEX_DRIVER_NAME)/crds
+	rm -rf $(CURDIR)/deployments/helm/tmp_crds
+
 
 generate-deepcopy: .remove-deepcopy
 	for dir in $(DEEPCOPY_SOURCES); do \
@@ -111,16 +126,69 @@ generate-deepcopy: .remove-deepcopy
 			output:object:dir=$(CURDIR)/$${dir}; \
 	done
 
+generate-informers: .remove-informers generate-listers
+	informer-gen \
+		--go-header-file=$(CURDIR)/hack/boilerplate.go.txt \
+		--output-package "$(MODULE)/$(PKG_BASE)/informers" \
+		--input-dirs "$(shell for api in $(CLIENT_APIS); do echo -n "$(MODULE)/$(API_BASE)/$$api,"; done | sed 's/,$$//')" \
+		--output-base "$(CURDIR)/pkg/tmp_informers" \
+		--versioned-clientset-package "$(MODULE)/$(PKG_BASE)/clientset/versioned" \
+		--listers-package "$(MODULE)/$(PKG_BASE)/listers"
+	mkdir -p $(CURDIR)/$(PKG_BASE)
+	mv $(CURDIR)/pkg/tmp_informers/$(MODULE)/$(PKG_BASE)/informers \
+		$(CURDIR)/$(PKG_BASE)/informers
+	rm -rf $(CURDIR)/pkg/tmp_informers
+
+generate-listers: .remove-listers generate-clientset
+	lister-gen \
+		--go-header-file=$(CURDIR)/hack/boilerplate.go.txt \
+		--output-package "$(MODULE)/$(PKG_BASE)/listers" \
+		--input-dirs "$(shell for api in $(CLIENT_APIS); do echo -n "$(MODULE)/$(API_BASE)/$$api,"; done | sed 's/,$$//')" \
+		--output-base "$(CURDIR)/pkg/tmp_listers"
+	mkdir -p $(CURDIR)/$(PKG_BASE)
+	mv $(CURDIR)/pkg/tmp_listers/$(MODULE)/$(PKG_BASE)/listers \
+		$(CURDIR)/$(PKG_BASE)/listers
+	rm -rf $(CURDIR)/pkg/tmp_listers
+
+generate-clientset: .remove-clientset
+	client-gen \
+		--go-header-file=$(CURDIR)/hack/boilerplate.go.txt \
+		--clientset-name "versioned" \
+		--build-tag "ignore_autogenerated" \
+		--output-package "$(MODULE)/$(PKG_BASE)/clientset" \
+		--input-base "$(MODULE)/$(API_BASE)" \
+		--output-base "$(CURDIR)/pkg/tmp_clientset" \
+		--input "$(shell echo $(CLIENT_APIS) | tr ' ' ',')" \
+		--plural-exceptions "$(shell echo $(PLURAL_EXCEPTIONS) | tr ' ' ',')"
+	mkdir -p $(CURDIR)/$(PKG_BASE)
+	mv $(CURDIR)/pkg/tmp_clientset/$(MODULE)/$(PKG_BASE)/clientset \
+		$(CURDIR)/$(PKG_BASE)/clientset
+	rm -rf $(CURDIR)/pkg/tmp_clientset
+
+.remove-crds:
+	rm -rf $(CURDIR)/deployments/helm/$(DRIVER_NAME)/crds
+
 .remove-deepcopy:
 	for dir in $(DEEPCOPY_SOURCES); do \
 		rm -f $(CURDIR)/$${dir}/zz_generated.deepcopy.go; \
 	done
 
+.remove-clientset:
+	rm -rf $(CURDIR)/$(PKG_BASE)/clientset
+
+.remove-listers:
+	rm -rf $(CURDIR)/$(PKG_BASE)/listers
+
+.remove-informers:
+	rm -rf $(CURDIR)/$(PKG_BASE)/informers
+
 build-image:
 	$(DOCKER) build \
 		--progress=plain \
 		--build-arg GOLANG_VERSION="$(GOLANG_VERSION)" \
 		--build-arg CLIENT_GEN_VERSION="$(CLIENT_GEN_VERSION)" \
+		--build-arg LISTER_GEN_VERSION="$(LISTER_GEN_VERSION)" \
+		--build-arg INFORMER_GEN_VERSION="$(INFORMER_GEN_VERSION)" \
 		--build-arg CONTROLLER_GEN_VERSION="$(CONTROLLER_GEN_VERSION)" \
 		--build-arg GOLANGCI_LINT_VERSION="$(GOLANGCI_LINT_VERSION)" \
 		--build-arg MOQ_VERSION="$(MOQ_VERSION)" \
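
The `--input-dirs` arguments in the `generate-informers` and `generate-listers` targets are built by a shell loop that prints `$(MODULE)/$(API_BASE)/<api>,` for each entry in `CLIENT_APIS` and then strips the trailing comma with `sed`. As a rough illustration of what that expression produces, here is the same joining logic in Go. The function name `inputDirs` and the values passed in `main` are placeholders chosen for this sketch; the real values of `MODULE`, `API_BASE`, and `CLIENT_APIS` are defined outside the hunks shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// inputDirs mirrors the Makefile shell loop: it prefixes each API with
// "<module>/<apiBase>/" and joins the results with commas, leaving no
// trailing comma.
func inputDirs(module, apiBase string, apis []string) string {
	paths := make([]string, 0, len(apis))
	for _, api := range apis {
		paths = append(paths, module+"/"+apiBase+"/"+api)
	}
	return strings.Join(paths, ",")
}

func main() {
	// All three arguments are placeholders, not the repository's settings.
	fmt.Println(inputDirs("example.com/driver", "api/example", []string{"resource/v1beta1"}))
}
```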
5 changes: 2 additions & 3 deletions README.md
@@ -84,9 +84,8 @@ This should show two pods running in the `nvidia` namespace:
 kubectl get pods -n nvidia
 ```
 ```
-NAME                                                          READY   STATUS    RESTARTS   AGE
-nvidia-dra-driver-k8s-dra-driver-controller-844fcb94b-ktbkc   1/1     Running   0          69s
-nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-5vfp9         1/1     Running   0          69s
+NAME                                                        READY   STATUS    RESTARTS   AGE
+nvidia-dra-driver-k8s-dra-driver-gpu-kubelet-plugin-5vfp9   1/1     Running   0          69s
 ```
 
 ### Run the examples by following the steps in the demo script
225 changes: 0 additions & 225 deletions api/nvidia.com/resource/gpu/v1alpha1/zz_generated.deepcopy.go

This file was deleted.

@@ -14,7 +14,7 @@
  * limitations under the License.
  */
 
-package v1alpha1
+package v1beta1
 
 import (
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
@@ -24,12 +24,14 @@ import (
 )
 
 const (
-	GroupName = "gpu.nvidia.com"
-	Version   = "v1alpha1"
+	GroupName = "resource.nvidia.com"
+	Version   = "v1beta1"
 
 	GpuConfigKind         = "GpuConfig"
 	MigDeviceConfigKind   = "MigDeviceConfig"
 	ImexChannelConfigKind = "ImexChannelConfig"
+	ImexDaemonConfigKind  = "ImexDaemonConfig"
+	ComputeDomainKind     = "ComputeDomain"
 )
 
 // Interface defines the set of common APIs for all configs
@@ -56,6 +58,8 @@ func init() {
 		&GpuConfig{},
 		&MigDeviceConfig{},
 		&ImexChannelConfig{},
+		&ImexDaemonConfig{},
+		&ComputeDomain{},
 	)
 	metav1.AddToGroupVersion(scheme, schemeGroupVersion)
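
The rename in this file changes the group from `gpu.nvidia.com` to `resource.nvidia.com` and the version from `v1alpha1` to `v1beta1`, which in turn changes the `apiVersion` string that manifests for these kinds must carry. As a small, standard-library-only sketch of that consequence: the constants below are copied from the diff, while the `typeMeta` struct and the `apiVersion` and `computeDomainTypeMeta` helpers are hypothetical and exist only for this illustration. It relies on the Kubernetes convention that a non-core `apiVersion` is written as `<group>/<version>`.

```go
package main

import "fmt"

// Constants copied from the diff above.
const (
	GroupName         = "resource.nvidia.com"
	Version           = "v1beta1"
	ComputeDomainKind = "ComputeDomain"
)

// typeMeta is a hypothetical stand-in for the apiVersion/kind pair found at
// the top of a Kubernetes manifest.
type typeMeta struct {
	APIVersion string
	Kind       string
}

// apiVersion joins a group and version using the "<group>/<version>" form.
func apiVersion(group, version string) string {
	return group + "/" + version
}

// computeDomainTypeMeta returns the identifying fields a ComputeDomain
// object would carry after the rename.
func computeDomainTypeMeta() typeMeta {
	return typeMeta{
		APIVersion: apiVersion(GroupName, Version),
		Kind:       ComputeDomainKind,
	}
}

func main() {
	tm := computeDomainTypeMeta()
	fmt.Printf("apiVersion: %s\nkind: %s\n", tm.APIVersion, tm.Kind)
}
```

Any existing object or config written against `gpu.nvidia.com/v1alpha1` would need its `apiVersion` updated to match.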
