
Can't get vGPU licensed against my License Server #663

Closed

urbaman opened this issue Jan 29, 2024 · 4 comments
Comments


urbaman commented Jan 29, 2024

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version: 5.15.0-92-generic
  • Container Runtime Type/Version: containerd 1.6.27
  • K8s Flavor/Version: kubeadm 1.28.6
  • GPU Operator Version: 23.9.1

2. Issue or feature description

Installed the Helm chart to enable vGPU on my 9 workers, but I can't get the GPUs licensed.

3. Steps to reproduce the issue

Install the Helm chart, then run a test workload:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
  dnsConfig:
    options:
      - name: ndots
        value: "1"

kubectl logs cuda-vectoradd
Error from server (NotFound): pods "cuda-vectoradd" not found
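
For reference, that NotFound error just means no pod by that name exists in the current namespace; a minimal sketch of creating and checking the workload first, assuming the manifest above is saved as cuda-vectoradd.yaml:

kubectl apply -f cuda-vectoradd.yaml
kubectl get pod cuda-vectoradd
kubectl logs cuda-vectoradd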

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log

kubectl get pods -n gpu-operator
NAME                                                         READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-5rqbc                                  1/1     Running     0             39h
gpu-feature-discovery-6swjq                                  1/1     Running     0             39h
gpu-feature-discovery-7zgjt                                  1/1     Running     0             39h
gpu-feature-discovery-898bd                                  1/1     Running     0             39h
gpu-feature-discovery-b59sm                                  1/1     Running     0             39h
gpu-feature-discovery-c5zg6                                  1/1     Running     0             39h
gpu-feature-discovery-dh4cz                                  1/1     Running     0             39h
gpu-feature-discovery-pv7sv                                  1/1     Running     0             39h
gpu-feature-discovery-xc8mk                                  1/1     Running     0             39h
gpu-operator-999cc8dcc-dsmhq                                 1/1     Running     2 (81m ago)   39h
gpu-operator-node-feature-discovery-gc-7cc7ccfff8-z54zc      1/1     Running     0             39h
gpu-operator-node-feature-discovery-master-d8597d549-zhqn8   1/1     Running     0             39h
gpu-operator-node-feature-discovery-worker-4fph8             1/1     Running     0             39h
gpu-operator-node-feature-discovery-worker-68t7k             1/1     Running     0             39h
gpu-operator-node-feature-discovery-worker-6cknl             1/1     Running     0             39h
gpu-operator-node-feature-discovery-worker-88q42             1/1     Running     0             39h
gpu-operator-node-feature-discovery-worker-bg6t6             1/1     Running     0             39h
gpu-operator-node-feature-discovery-worker-bws98             1/1     Running     0             39h
gpu-operator-node-feature-discovery-worker-fwmfn             1/1     Running     0             39h
gpu-operator-node-feature-discovery-worker-lt8xl             1/1     Running     0             39h
gpu-operator-node-feature-discovery-worker-mjkzj             1/1     Running     0             39h
gpu-operator-node-feature-discovery-worker-v7stk             1/1     Running     0             39h
gpu-operator-node-feature-discovery-worker-vhv88             1/1     Running     0             39h
gpu-operator-node-feature-discovery-worker-z87zz             1/1     Running     0             39h
nvidia-container-toolkit-daemonset-79d2c                     1/1     Running     0             39h
nvidia-container-toolkit-daemonset-b2d6m                     1/1     Running     0             39h
nvidia-container-toolkit-daemonset-bdr5n                     1/1     Running     0             39h
nvidia-container-toolkit-daemonset-hdrx4                     1/1     Running     0             39h
nvidia-container-toolkit-daemonset-jbzds                     1/1     Running     0             39h
nvidia-container-toolkit-daemonset-jg24g                     1/1     Running     0             39h
nvidia-container-toolkit-daemonset-l5tfj                     1/1     Running     0             39h
nvidia-container-toolkit-daemonset-rbd55                     1/1     Running     0             39h
nvidia-container-toolkit-daemonset-xrk9w                     1/1     Running     0             39h
nvidia-cuda-validator-27wp9                                  0/1     Completed   0             39h
nvidia-cuda-validator-4xp9l                                  0/1     Completed   0             39h
nvidia-cuda-validator-f2wjc                                  0/1     Completed   0             39h
nvidia-cuda-validator-hqx7r                                  0/1     Completed   0             39h
nvidia-cuda-validator-hw7d5                                  0/1     Completed   0             39h
nvidia-cuda-validator-mk72b                                  0/1     Completed   0             39h
nvidia-cuda-validator-rl8pt                                  0/1     Completed   0             39h
nvidia-cuda-validator-xk4rt                                  0/1     Completed   0             39h
nvidia-cuda-validator-z8sln                                  0/1     Completed   0             39h
nvidia-dcgm-exporter-5mcf6                                   1/1     Running     0             39h
nvidia-dcgm-exporter-5rvl5                                   1/1     Running     0             39h
nvidia-dcgm-exporter-65p2q                                   1/1     Running     0             39h
nvidia-dcgm-exporter-7wn6c                                   1/1     Running     0             39h
nvidia-dcgm-exporter-dhxqp                                   1/1     Running     0             39h
nvidia-dcgm-exporter-m2f2f                                   1/1     Running     0             39h
nvidia-dcgm-exporter-q9wwt                                   1/1     Running     0             39h
nvidia-dcgm-exporter-t8qcd                                   1/1     Running     0             39h
nvidia-dcgm-exporter-zsxzd                                   1/1     Running     0             39h
nvidia-device-plugin-daemonset-6w2rj                         1/1     Running     0             39h
nvidia-device-plugin-daemonset-8qpmf                         1/1     Running     0             39h
nvidia-device-plugin-daemonset-bq6q7                         1/1     Running     6 (39h ago)   39h
nvidia-device-plugin-daemonset-ms8hz                         1/1     Running     0             39h
nvidia-device-plugin-daemonset-pgcxc                         1/1     Running     0             39h
nvidia-device-plugin-daemonset-s5cdd                         1/1     Running     0             39h
nvidia-device-plugin-daemonset-t7vdm                         1/1     Running     0             39h
nvidia-device-plugin-daemonset-t9mjq                         1/1     Running     0             39h
nvidia-device-plugin-daemonset-xn6lz                         1/1     Running     0             39h
nvidia-driver-daemonset-5jk8g                                1/1     Running     0             39h
nvidia-driver-daemonset-f8t75                                1/1     Running     0             39h
nvidia-driver-daemonset-fk2qn                                1/1     Running     0             39h
nvidia-driver-daemonset-l2lwl                                1/1     Running     0             39h
nvidia-driver-daemonset-mh89m                                1/1     Running     0             39h
nvidia-driver-daemonset-mqhst                                1/1     Running     0             39h
nvidia-driver-daemonset-q8rh9                                1/1     Running     0             39h
nvidia-driver-daemonset-wm2lc                                1/1     Running     0             39h
nvidia-driver-daemonset-zzn7k                                1/1     Running     0             39h
nvidia-operator-validator-4glqt                              1/1     Running     0             39h
nvidia-operator-validator-ddzw9                              1/1     Running     0             39h
nvidia-operator-validator-dvqwn                              1/1     Running     0             39h
nvidia-operator-validator-kvs4g                              1/1     Running     0             39h
nvidia-operator-validator-kzkcp                              1/1     Running     0             39h
nvidia-operator-validator-ptv76                              1/1     Running     0             39h
nvidia-operator-validator-rpnkp                              1/1     Running     0             39h
nvidia-operator-validator-sc6g5                              1/1     Running     0             39h
nvidia-operator-validator-z2qpl                              1/1     Running     0             39h

kubectl get ds -n gpu-operator
NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                        9         9         9       9            9           nvidia.com/gpu.deploy.gpu-feature-discovery=true   39h
gpu-operator-node-feature-discovery-worker   12        12        12      12           12          <none>                                             39h
nvidia-container-toolkit-daemonset           9         9         9       9            9           nvidia.com/gpu.deploy.container-toolkit=true       39h
nvidia-dcgm-exporter                         9         9         9       9            9           nvidia.com/gpu.deploy.dcgm-exporter=true           39h
nvidia-device-plugin-daemonset               9         9         9       9            9           nvidia.com/gpu.deploy.device-plugin=true           39h
nvidia-driver-daemonset                      9         9         9       9            9           nvidia.com/gpu.deploy.driver=true                  39h
nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             39h
nvidia-operator-validator                    9         9         9       9            9           nvidia.com/gpu.deploy.operator-validator=true      39h

kubectl exec nvidia-driver-daemonset-5jk8g -n gpu-operator -c nvidia-driver-ctr -- nvidia-smi
Mon Jan 29 15:38:33 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID P40-1Q                    On  | 00000000:00:10.0 Off |                  N/A |
| N/A   N/A    P8              N/A /  N/A |      0MiB /  1024MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]


urbaman commented Jan 29, 2024

This is what I found during driver creation:

Copying ClientConfigToken...
cp: cannot stat '/drivers/ClientConfigToken/*': No such file or directory

But I do have the ConfigMap with the token, and the following Helm values:

  # vGPU licensing configuration
  licensingConfig:
    configMapName: "licensing-config"
    nlsEnabled: false

kubectl get cm -n gpu-operator licensing-config -o yaml
apiVersion: v1
data:
  client_configuration_token.tok: <TOKEN>
  gridd.conf: |
    # Description: Set Feature to be enabled
    # Data type: integer
    # Possible values:
    # 0 => for unlicensed state
    # 1 => for NVIDIA vGPU
    # 2 => for NVIDIA RTX Virtual Workstation
    # 4 => for NVIDIA Virtual Compute Server
    FeatureType=1
kind: ConfigMap
metadata:
  creationTimestamp: "2024-01-27T23:54:50Z"
  name: licensing-config
  namespace: gpu-operator
  resourceVersion: "114723913"
  uid: da5072fc-faa9-467e-9157-baf4a3d50083
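
For reference, a ConfigMap with these two keys can be created straight from local files; a minimal sketch, assuming gridd.conf and client_configuration_token.tok are in the current directory:

kubectl create configmap licensing-config \
  -n gpu-operator \
  --from-file=gridd.conf \
  --from-file=client_configuration_token.tok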


urbaman commented Jan 30, 2024

OK, solved by manually editing the driver DaemonSet after creation and rollout-restarting it, adding the needed volumeMount and volume:

volumeMount:

        - mountPath: /drivers/ClientConfigToken/client_configuration_token.tok
          name: licensing-token
          readOnly: true
          subPath: client_configuration_token.tok

volume:

      - configMap:
          defaultMode: 420
          items:
          - key: client_configuration_token.tok
            path: client_configuration_token.tok
          name: licensing-config
        name: licensing-token
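
A minimal sketch of how that edit can be applied and rolled out, assuming the DaemonSet name shown in the listing above:

kubectl -n gpu-operator edit ds nvidia-driver-daemonset
kubectl -n gpu-operator rollout restart ds nvidia-driver-daemonset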

@cdesiniotis (Contributor) commented:

Hi @urbaman, please set driver.licensingConfig.nlsEnabled=true. The licensing token is only mounted when NLS is enabled.

As a matter of interest, what GPU Operator helm chart version are you using? NLS should be enabled by default in the helm chart.
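
For reference, a minimal sketch of the corresponding values snippet (the value path is the one named above; the release and namespace names are taken from this thread):

driver:
  licensingConfig:
    configMapName: licensing-config
    nlsEnabled: true

or, equivalently, via an upgrade flag:

helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --set driver.licensingConfig.nlsEnabled=true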


urbaman commented Jan 31, 2024

Ah, ok.

You're right, I must have carried the values over from a previous chart version. I usually work with values files, so I modified an old one instead of creating a new one.

ubuntu@k8cp1:~$ helm list -n gpu-operator
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
gpu-operator    gpu-operator    1               2024-01-29 17:26:23.167445283 +0100 CET deployed        gpu-operator-v23.9.1    v23.9.1
ubuntu@k8cp1:~$ helm search repo gpu-operator
NAME                    CHART VERSION   APP VERSION     DESCRIPTION
nvidia/gpu-operator     v23.9.1         v23.9.1         NVIDIA GPU Operator creates/configures/manages ...
ubuntu@k8cp1:~$ helm show values nvidia/gpu-operator -o yaml | grep nls
Error: unknown shorthand flag: 'o' in -o
ubuntu@k8cp1:~$ helm show values nvidia/gpu-operator | grep nls
    nlsEnabled: true

It could be helpful to have a complete Helm values reference somewhere (there's only a partial one in the docs).
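
For what it's worth, the full defaults can be dumped from the chart itself; the output filename here is just an example:

helm show values nvidia/gpu-operator > gpu-operator-default-values.yaml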

Thank you very much!

@urbaman urbaman closed this as completed Jan 31, 2024