Wrong node capacity and allocatable when using MIG #637

Open
xhejtman opened this issue Dec 17, 2023 · 7 comments
@xhejtman

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Kernel Version: 6.2.0-37-generic
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd, 1.7.7
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Rancher/RKE2, 1.27.8
  • GPU Operator Version: 23.9.1

2. Issue or feature description

When MIG is enabled, both the MIG resources and the nvidia.com/gpu resource are reported as allocatable:

Allocatable:
  cerit.io/gpu-count:      2
  cerit.io/gpu-mem:        0
  cpu:                     64
  ephemeral-storage:       7104643354787
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  519659388Ki
  nvidia.com/gpu:          2
  nvidia.com/mig-1g.10gb:  6
  nvidia.com/mig-2g.20gb:  4
  nvidia.com/mig-3g.40gb:  0
  pods:                    160

This means that pods requesting either nvidia.com/gpu or nvidia.com/mig-1g.10gb can be scheduled on the node. However, a pod that requests nvidia.com/gpu does not get a GPU injected.
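To spot this condition on a node, the `Allocatable:` block can be parsed and the full-GPU count compared against the MIG counts. The following is a minimal sketch, assuming the text comes from `kubectl describe node`; the helper names `parse_allocatable` and `split_gpu_resources` are hypothetical and not part of any NVIDIA or Kubernetes tooling.

```python
# Excerpt of the Allocatable block reported in this issue.
ALLOCATABLE = """\
Allocatable:
  cpu:                     64
  memory:                  519659388Ki
  nvidia.com/gpu:          2
  nvidia.com/mig-1g.10gb:  6
  nvidia.com/mig-2g.20gb:  4
  nvidia.com/mig-3g.40gb:  0
  pods:                    160
"""


def parse_allocatable(text):
    """Return {resource_name: raw_value} for every 'name: value' line."""
    resources = {}
    for line in text.splitlines():
        # Split on the last colon; resource names here contain none.
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if sep and name and value:
            resources[name] = value
    return resources


def split_gpu_resources(resources):
    """Return (full_gpu_count, {mig_resource: count}) for non-zero MIG entries."""
    full = int(resources.get("nvidia.com/gpu", "0"))
    mig = {
        name: int(value)
        for name, value in resources.items()
        if name.startswith("nvidia.com/mig-") and int(value) > 0
    }
    return full, mig


full, mig = split_gpu_resources(parse_allocatable(ALLOCATABLE))
if full > 0 and mig:
    print(f"node advertises {full} full GPU(s) alongside MIG devices: {mig}")
```

A non-zero full-GPU count next to MIG resources is only suspicious when every GPU on the node is in MIG mode; on a node that really mixes MIG and non-MIG GPUs it is expected.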

3. Steps to reproduce the issue

Enable MIG on an A100 GPU.

This may be a bug in Kubernetes rather than in the GPU Operator itself.

@shivamerla
Contributor

@xhejtman this is controlled by the mig.strategy: mixed parameter. When the mixed strategy is used, the device plugin will:

  • Expose any GPUs not in MIG mode using the traditional nvidia.com/gpu resource type
  • Expose individual MIG devices with a new resource type following the schema nvidia.com/mig-<slice_count>g.<memory_size>gb

So in your case, you seem to have some GPUs with MIG disabled and others with MIG enabled. Is that correct? Otherwise this would be a bug.
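The two rules above can be sketched as a small function that derives the resource names a node is expected to advertise. This is only an illustration of the naming schema, not the device plugin's actual code; the `Gpu` class and `expected_resources` function are hypothetical, and the example input assumes each of the two GPUs carries two 2g.20gb and three 1g.10gb instances, which is consistent with the counts reported in the issue.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Gpu:
    mig_enabled: bool
    # MIG profiles present on this GPU, written as "<slice_count>g.<memory_size>gb".
    mig_profiles: list = field(default_factory=list)


def expected_resources(gpus):
    """Resource counts expected under the mixed strategy as described above."""
    counts = Counter()
    for gpu in gpus:
        if gpu.mig_enabled:
            # Each MIG device is exposed under its own profile-specific name.
            for profile in gpu.mig_profiles:
                counts[f"nvidia.com/mig-{profile}"] += 1
        else:
            # GPUs not in MIG mode use the traditional resource type.
            counts["nvidia.com/gpu"] += 1
    return dict(counts)


# Assumed layout: both GPUs in MIG mode with identical partitioning.
layout = ["2g.20gb", "2g.20gb", "1g.10gb", "1g.10gb", "1g.10gb"]
node = [Gpu(True, list(layout)), Gpu(True, list(layout))]
print(expected_resources(node))
```

Under these rules a node whose GPUs are all in MIG mode should not list nvidia.com/gpu at all, which is why a non-zero count there points to a bug.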

@xhejtman
Author

I have both GPUs set to a MIG configuration:

Thu Dec 21 00:55:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:27:00.0 Off |                   On |
| N/A   50C    P0              83W / 300W |     38MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000000:A3:00.0 Off |                   On |
| N/A   48C    P0              81W / 300W |     38MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   1  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    9   0   2  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   10   0   3  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   13   0   4  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    3   0   0  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    5   0   1  |              10MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    9   0   2  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   10   0   3  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   13   0   4  |               5MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

@shivamerla
Contributor

Ah, this seems to be a bug then. Will look into this. cc @elezar @klueska

@elezar
Member

elezar commented Jan 8, 2024

@xhejtman could you provide the logs from the device plugin?

@xhejtman
Author

xhejtman commented Jan 8, 2024

2.log

In the meantime, I checked that Kubernetes 1.27.8 is not the problem: I have a different cluster with the 23.6.1 operator and it works OK.

@elezar
Member

elezar commented Jan 8, 2024

Looking at the logs, we're only starting 2 GRPC servers:

2023-12-18T12:40:20.600590354+01:00 stderr F I1218 11:40:20.600444       1 server.go:165] Starting GRPC server for 'nvidia.com/mig-1g.10gb'
2023-12-18T12:40:20.601080041+01:00 stderr F I1218 11:40:20.600967       1 server.go:117] Starting to serve 'nvidia.com/mig-1g.10gb' on /var/lib/kubelet/device-plugins/nvidia-mig-1g.10gb.sock
2023-12-18T12:40:20.633441912+01:00 stderr F I1218 11:40:20.632289       1 server.go:125] Registered device plugin for 'nvidia.com/mig-1g.10gb' with Kubelet
2023-12-18T12:40:20.633473571+01:00 stderr F I1218 11:40:20.632494       1 server.go:165] Starting GRPC server for 'nvidia.com/mig-2g.20gb'
2023-12-18T12:40:20.633492757+01:00 stderr F I1218 11:40:20.632946       1 server.go:117] Starting to serve 'nvidia.com/mig-2g.20gb' on /var/lib/kubelet/device-plugins/nvidia-mig-2g.20gb.sock
2023-12-18T12:40:20.649231279+01:00 stderr F I1218 11:40:20.644793       1 server.go:125] Registered device plugin for 'nvidia.com/mig-2g.20gb' with Kubelet

meaning that the running instance of the plugin should only be exposing these as allocatable resources.
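The same check can be done mechanically on a longer log. The sketch below is a minimal example that pulls the registered resource names out of device plugin log text; the regular expression is based only on the message format visible in the lines quoted above, and `registered_resources` is a hypothetical helper name.

```python
import re

# Matches the registration message format seen in the log excerpt above.
REGISTERED = re.compile(r"Registered device plugin for '([^']+)' with Kubelet")

# Two of the log lines quoted in this comment.
LOG = """\
2023-12-18T12:40:20.633441912+01:00 stderr F I1218 11:40:20.632289       1 server.go:125] Registered device plugin for 'nvidia.com/mig-1g.10gb' with Kubelet
2023-12-18T12:40:20.649231279+01:00 stderr F I1218 11:40:20.644793       1 server.go:125] Registered device plugin for 'nvidia.com/mig-2g.20gb' with Kubelet
"""


def registered_resources(log_text):
    """Return the sorted, de-duplicated resource names registered with the kubelet."""
    return sorted(set(REGISTERED.findall(log_text)))


print(registered_resources(LOG))
```

Any resource that the node reports as allocatable but that is missing from this list was not registered by the currently running plugin instance.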

Could you confirm that /var/lib/kubelet/device-plugins/ only references these two resource types? It could be that, when the MIG config update was applied, the other socket was not removed.

@xhejtman
Author

xhejtman commented Jan 8, 2024

root@kub-as6:/var/lib/kubelet/device-plugins# ls -1
kubelet.sock
kubelet_internal_checkpoint
nvidia-mig-1g.10gb.sock
nvidia-mig-2g.20gb.sock
root@kub-as6:/var/lib/kubelet/device-plugins#
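
The directory listing can be cross-checked against the node's allocatable resources. The sketch below assumes that a socket named `nvidia-<name>.sock` serves the resource `nvidia.com/<name>`; that mapping is inferred only from the two log lines quoted earlier in this thread, and both function names are hypothetical.

```python
def resource_from_socket(filename):
    """Map a plugin socket filename to a resource name (assumed naming scheme)."""
    prefix, suffix = "nvidia-", ".sock"
    if not (filename.startswith(prefix) and filename.endswith(suffix)):
        return None  # e.g. kubelet.sock or the checkpoint file
    return "nvidia.com/" + filename[len(prefix):-len(suffix)]


def resources_without_socket(allocatable, socket_files):
    """Return non-zero nvidia.com/* resources that no listed socket serves."""
    served = {r for r in map(resource_from_socket, socket_files) if r}
    return sorted(
        name
        for name, count in allocatable.items()
        if name.startswith("nvidia.com/") and count > 0 and name not in served
    )


# Values taken from this issue.
allocatable = {
    "nvidia.com/gpu": 2,
    "nvidia.com/mig-1g.10gb": 6,
    "nvidia.com/mig-2g.20gb": 4,
    "nvidia.com/mig-3g.40gb": 0,
}
socket_files = [
    "kubelet.sock",
    "kubelet_internal_checkpoint",
    "nvidia-mig-1g.10gb.sock",
    "nvidia-mig-2g.20gb.sock",
]
print(resources_without_socket(allocatable, socket_files))
```

With the values from this issue, the only resource advertised without a backing socket is nvidia.com/gpu, so a leftover socket does not explain the stale count.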

@klueska klueska added the bug Issue/PR to expose/discuss/fix a bug label Jan 25, 2024