Wrong node capacity and allocatable when using MIG #637

xhejtman · 2023-12-17T12:00:29Z

1. Quick Debug Information

OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
Kernel Version: 6.2.0-37-generic
Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd, 1.7.7
K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Rancher/RKE2, 1.27.8
GPU Operator Version: 23.9.1

2. Issue or feature description

When MIG is enabled, both the MIG resources and the nvidia.com/gpu resource are reported as allocatable:

Allocatable:
  cerit.io/gpu-count:      2
  cerit.io/gpu-mem:        0
  cpu:                     64
  ephemeral-storage:       7104643354787
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  519659388Ki
  nvidia.com/gpu:          2
  nvidia.com/mig-1g.10gb:  6
  nvidia.com/mig-2g.20gb:  4
  nvidia.com/mig-3g.40gb:  0
  pods:                    160

This means that pods requesting either nvidia.com/gpu or nvidia.com/mig-1g.10gb can be scheduled on the node. However, a pod requesting nvidia.com/gpu fails to get a GPU injected.

3. Steps to reproduce the issue

Enable MIG on an A100 GPU.

This may just be a bug in Kubernetes rather than in the GPU Operator itself.


shivamerla · 2023-12-21T00:54:00Z

@xhejtman this is controlled by the mig.strategy: mixed parameter. When the mixed strategy is used, the device plugin will:

- Expose any GPUs not in MIG mode using the traditional nvidia.com/gpu resource type
- Expose individual MIG devices with a new resource type following the schema nvidia.com/mig-<slice_count>g.<memory_size>gb

So in your case, you do seem to have some GPUs with MIG disabled and others with it enabled. Is that correct? Otherwise this would be a bug.
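
To make the two rules above concrete, here is a minimal Python sketch of what a node would be expected to advertise under the mixed strategy. It is only an illustration of the naming schema, not the device plugin's actual code: the `expected_resources` function and the `mig_enabled` / `mig_profiles` fields are hypothetical names chosen for this example.

```python
from collections import Counter


def expected_resources(gpus):
    """Return the resource counts expected under the mixed strategy.

    `gpus` is a hypothetical node description: one dict per physical GPU
    with `mig_enabled` (bool) and `mig_profiles` (list of profile names
    such as "1g.10gb"). This mirrors the two rules quoted above only.
    """
    counts = Counter()
    for gpu in gpus:
        if gpu["mig_enabled"]:
            # Each MIG device is exposed as nvidia.com/mig-<profile>.
            for profile in gpu["mig_profiles"]:
                counts[f"nvidia.com/mig-{profile}"] += 1
        else:
            # A GPU that is not in MIG mode is exposed as a whole GPU.
            counts["nvidia.com/gpu"] += 1
    return dict(counts)


# Two GPUs, both in MIG mode: no nvidia.com/gpu entry is expected at all.
all_mig = [
    {"mig_enabled": True, "mig_profiles": ["2g.20gb", "2g.20gb", "1g.10gb", "1g.10gb", "1g.10gb"]},
    {"mig_enabled": True, "mig_profiles": ["2g.20gb", "2g.20gb", "1g.10gb", "1g.10gb", "1g.10gb"]},
]

# One GPU in MIG mode and one not: both resource kinds are expected.
half_mig = [
    all_mig[0],
    {"mig_enabled": False, "mig_profiles": []},
]

print(expected_resources(all_mig))
print(expected_resources(half_mig))
```

Under these assumptions, a non-zero nvidia.com/gpu count should only appear when at least one GPU on the node has MIG disabled.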

xhejtman · 2023-12-21T00:56:05Z

I have both GPUs set to a MIG configuration:

Thu Dec 21 00:55:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:27:00.0 Off |                   On |
| N/A   50C    P0              83W / 300W |     38MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000000:A3:00.0 Off |                   On |
| N/A   48C    P0              81W / 300W |     38MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   1  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    9   0   2  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   10   0   3  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   13   0   4  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    3   0   0  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    5   0   1  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    9   0   2  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   10   0   3  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   13   0   4  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

shivamerla · 2023-12-21T01:46:57Z

Ah, this seems to be a bug then. Will look into this. cc @elezar @klueska

elezar · 2024-01-08T11:59:44Z

@xhejtman could you provide the logs from the device plugin?

xhejtman · 2024-01-08T12:30:15Z

2.log

In the meantime, I checked that Kubernetes 1.27.8 is not the problem: I have a different cluster running the 23.6.1 operator, and it works OK.

elezar · 2024-01-08T12:46:02Z

Looking at the logs, we're only starting 2 GRPC servers:

2023-12-18T12:40:20.600590354+01:00 stderr F I1218 11:40:20.600444       1 server.go:165] Starting GRPC server for 'nvidia.com/mig-1g.10gb'
2023-12-18T12:40:20.601080041+01:00 stderr F I1218 11:40:20.600967       1 server.go:117] Starting to serve 'nvidia.com/mig-1g.10gb' on /var/lib/kubelet/device-plugins/nvidia-mig-1g.10gb.sock
2023-12-18T12:40:20.633441912+01:00 stderr F I1218 11:40:20.632289       1 server.go:125] Registered device plugin for 'nvidia.com/mig-1g.10gb' with Kubelet
2023-12-18T12:40:20.633473571+01:00 stderr F I1218 11:40:20.632494       1 server.go:165] Starting GRPC server for 'nvidia.com/mig-2g.20gb'
2023-12-18T12:40:20.633492757+01:00 stderr F I1218 11:40:20.632946       1 server.go:117] Starting to serve 'nvidia.com/mig-2g.20gb' on /var/lib/kubelet/device-plugins/nvidia-mig-2g.20gb.sock
2023-12-18T12:40:20.649231279+01:00 stderr F I1218 11:40:20.644793       1 server.go:125] Registered device plugin for 'nvidia.com/mig-2g.20gb' with Kubelet

meaning that the running instance of the plugin should only be exposing these as allocatable resources.
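
For anyone repeating this check on a longer log, a small Python sketch along these lines could pull out the registered resource names. It assumes the message format seen in the lines above ("Registered device plugin for '...' with Kubelet"); the `registered_resources` helper is a hypothetical name, and the sample text is trimmed from the log excerpt in this comment.

```python
import re

# Pattern taken from the log lines quoted above; other plugin versions
# may word this message differently.
REGISTERED = re.compile(r"Registered device plugin for '([^']+)' with Kubelet")


def registered_resources(log_text):
    """Return the sorted, de-duplicated resource names registered with Kubelet."""
    return sorted(set(REGISTERED.findall(log_text)))


sample_log = """\
I1218 11:40:20.600444       1 server.go:165] Starting GRPC server for 'nvidia.com/mig-1g.10gb'
I1218 11:40:20.632289       1 server.go:125] Registered device plugin for 'nvidia.com/mig-1g.10gb' with Kubelet
I1218 11:40:20.632494       1 server.go:165] Starting GRPC server for 'nvidia.com/mig-2g.20gb'
I1218 11:40:20.644793       1 server.go:125] Registered device plugin for 'nvidia.com/mig-2g.20gb' with Kubelet
"""

print(registered_resources(sample_log))
```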

Could you confirm that /var/lib/kubelet/device-plugins/ only references these two resource types? It could be that the other socket was not removed when the MIG config update was applied.

xhejtman · 2024-01-08T12:49:16Z

root@kub-as6:/var/lib/kubelet/device-plugins# ls -1
kubelet.sock
kubelet_internal_checkpoint
nvidia-mig-1g.10gb.sock
nvidia-mig-2g.20gb.sock
root@kub-as6:/var/lib/kubelet/device-plugins#
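
The same comparison can be sketched in Python: list which nvidia.com/* resources the node reports with a non-zero count but for which no plugin socket exists. The socket-to-resource mapping (nvidia-<name>.sock to nvidia.com/<name>) is inferred from the log lines earlier in this thread and is an assumption here, and both helper names are hypothetical. The input values are copied from the Allocatable output and the directory listing above.

```python
def socket_to_resource(filename):
    """Map e.g. 'nvidia-mig-1g.10gb.sock' to 'nvidia.com/mig-1g.10gb'.

    The naming convention is inferred from the plugin logs in this thread.
    Returns None for files that do not look like an NVIDIA plugin socket.
    """
    prefix, suffix = "nvidia-", ".sock"
    if not (filename.startswith(prefix) and filename.endswith(suffix)):
        return None
    return "nvidia.com/" + filename[len(prefix):-len(suffix)]


def unbacked_resources(allocatable, socket_files):
    """Return nvidia.com/* resources with a non-zero count and no matching socket."""
    served = {socket_to_resource(name) for name in socket_files}
    return sorted(
        name
        for name, quantity in allocatable.items()
        if name.startswith("nvidia.com/") and int(quantity) > 0 and name not in served
    )


# Values copied from the node description and the `ls -1` output in this issue.
allocatable = {
    "nvidia.com/gpu": "2",
    "nvidia.com/mig-1g.10gb": "6",
    "nvidia.com/mig-2g.20gb": "4",
    "nvidia.com/mig-3g.40gb": "0",
}
socket_files = [
    "kubelet.sock",
    "kubelet_internal_checkpoint",
    "nvidia-mig-1g.10gb.sock",
    "nvidia-mig-2g.20gb.sock",
]

print(unbacked_resources(allocatable, socket_files))
```

With these inputs, only nvidia.com/gpu is reported by the node without a corresponding socket, so the count does not come from a leftover socket in this directory.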

klueska assigned klueska and elezar Jan 25, 2024

klueska added the bug Issue/PR to expose/discuss/fix a bug label Jan 25, 2024
