dcgm-exporter collects metrics incorrectly? #348

Closed
happy2048 opened this issue May 5, 2022 · 6 comments

@happy2048

Environment

● Kubernetes: 1.20.11
● OS: CentOS 7 (3.10.0-1160.15.2.el7.x86_64)
● Docker: 19.03.15
● NVIDIA Driver Version: 510.47.03
● DCGM Exporter Docker Image: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubuntu20.04

Issue description

No process is using the GPUs on my node; the output of nvidia-smi is:

# nvidia-smi

Tue Apr 26 15:29:49 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:54:00.0 Off |                    0 |
| N/A   33C    P0    67W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:5A:00.0 Off |                    0 |
| N/A   32C    P0    65W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:6B:00.0 Off |                    0 |
| N/A   32C    P0    64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:70:00.0 Off |                    0 |
| N/A   34C    P0    71W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:BE:00.0 Off |                    0 |
| N/A   33C    P0    64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:C3:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:DA:00.0 Off |                    0 |
| N/A   32C    P0    63W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:E0:00.0 Off |                    0 |
| N/A   33C    P0    66W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, the metric DCGM_FI_DEV_FB_USED reports 850 MiB:

# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-14cdc0a4-f52f-a50c-f758-eb93e013e555",device="nvidia6",gpu="6",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-4eb035e5-2709-daa2-e3a1-5d2d8da60610",device="nvidia1",gpu="1",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-6fdb3b1a-3e7f-abc1-3e6a-7378a3cb2778",device="nvidia3",gpu="3",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-8ded7a54-6143-7898-563a-eb598623d740",device="nvidia0",gpu="0",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-a4f2126d-7321-144c-aa01-cdb6e7c8022a",device="nvidia7",gpu="7",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-bfbd4faf-d197-bd20-e054-2735bdd0c49e",device="nvidia4",gpu="4",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-c4d991e3-a1a7-0966-056b-a26d078c2f67",device="nvidia5",gpu="5",modelName="NVIDIA A100-SXM4-80GB"} 850
DCGM_FI_DEV_FB_USED{Hostname="nvidia-dcgm-exporter-cdxbn",NodeName="cn-beijing.192.168.10.140",UUID="GPU-f7b1a978-75f0-a263-4206-8d7cd53641e0",device="nvidia2",gpu="2",modelName="NVIDIA A100-SXM4-80GB"} 850

At the same time, querying the used GPU memory directly through NVML returns 0, so why does dcgm-exporter report 850 MiB?
I also tested M40, P100, P4, T4, V100, and A10 GPUs with driver 510.47.03, and DCGM_FI_DEV_FB_USED is non-zero even when no process is using the GPUs.
Is this a bug in DCGM, or a bug in NVIDIA driver 510.47.03?
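
For reference, this kind of NVML cross-check can be sketched with the pynvml bindings (a minimal example, not the exact script used for the numbers above):

# Print used framebuffer memory per GPU as reported by NVML
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: {mem.used / (1024 * 1024):.0f} MiB used "
              f"of {mem.total / (1024 * 1024):.0f} MiB")
finally:
    pynvml.nvmlShutdown()

On this idle node the NVML value is 0 MiB per GPU, matching nvidia-smi but not the 850 MiB reported by DCGM_FI_DEV_FB_USED.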

@shivamerla
Contributor

@happy2048 Thanks for reporting this. I am working with the DCGM team to get this analyzed.

@glowkey

glowkey commented May 5, 2022

This is caused by a change in driver 510 that lumps the reserved memory into the used category. We are updating DCGM to handle this case and split the used and reserved into separate fields.
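
For reference, after the split the reserved portion is exposed through its own DCGM field, DCGM_FI_DEV_FB_RESERVED, which can be added to the dcgm-exporter counters CSV next to the existing framebuffer entries. A sketch, assuming the default counters file format (DCGM field, Prometheus metric type, help message):

DCGM_FI_DEV_FB_FREE,     gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED,     gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_RESERVED, gauge, Framebuffer memory reserved (in MiB).

With that in place, used and reserved memory can be tracked as separate time series instead of being lumped together.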

@wsxiaozhang

This is caused by a change in driver 510 that lumps the reserved memory into the used category. We are updating DCGM to handle this case and split the used and reserved into separate fields.

@glowkey is there any timeline to release that fix in DCGM?

@glowkey

glowkey commented May 12, 2022

The timeline is a DCGM 2.4-based dcgm-exporter release by the end of May 2022.

@cdesiniotis
Contributor

@glowkey I am assuming this bug was fixed in DCGM / dcgm-exporter. Can we close this issue?

@glowkey

glowkey commented Jan 31, 2024

Yes, this is fixed!
