dcgm-exporter collects metrics incorrectly? #348
Comments
@happy2048 Thanks for reporting this. I am working with the DCGM team to get this analyzed.
This is caused by a change in driver 510 that lumps the reserved memory into the used category. We are updating DCGM to handle this case and split used and reserved into separate fields.
@glowkey Is there any timeline to release that fix in DCGM?
The timeline is for a DCGM 2.4-based dcgm-exporter by the end of May 2022.
@glowkey I am assuming this bug was fixed in DCGM / dcgm-exporter. Can we close this issue?
Yes, this is fixed!
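For anyone verifying the fix on an upgraded node, below is a minimal sketch that scrapes the dcgm-exporter `/metrics` endpoint and prints the framebuffer fields so used and reserved memory can be compared. It assumes the exporter is reachable on its default port 9400 and that the running DCGM build exposes a separate reserved-memory field (shown here as `DCGM_FI_DEV_FB_RESERVED`; the exact field name and its availability depend on the DCGM release).

```python
# Sketch: scrape the dcgm-exporter /metrics endpoint and print the
# framebuffer metrics, so used vs. reserved memory can be compared.
# Assumes the exporter listens on localhost:9400 (its default port) and
# that the running DCGM build exposes DCGM_FI_DEV_FB_RESERVED; older
# builds only report DCGM_FI_DEV_FB_USED.
import urllib.request

METRICS_URL = "http://localhost:9400/metrics"  # adjust host/port as needed
FIELDS = ("DCGM_FI_DEV_FB_USED", "DCGM_FI_DEV_FB_FREE", "DCGM_FI_DEV_FB_RESERVED")

with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    # Skip Prometheus comment lines (# HELP / # TYPE) and unrelated metrics.
    if line.startswith("#"):
        continue
    if line.startswith(FIELDS):
        print(line)
```

With the pre-fix exporter on driver 510, the used value includes the reserved memory; after the fix, used should return to roughly 0 on an idle GPU, with the remainder reported under the separate reserved field, per the comment above.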
Environment
● Kubernetes: 1.20.11
● OS: CentOS 7 (3.10.0-1160.15.2.el7.x86_64)
● Docker: 19.03.15
● NVIDIA Driver Version: 510.47.03
● DCGM Exporter Docker Image: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubuntu20.04
Issue description
No process is using the GPUs on my node; output of nvidia-smi:
But the output of the metric DCGM_FI_DEV_FB_USED is 850 MiB:
At the same time I used NVML to query the used GPU memory and it is 0, so why does dcgm-exporter report 850 MiB?
I also tested M40, P100, P4, T4, V100, and A10 GPUs with driver 510.47.03, and the value of the metric DCGM_FI_DEV_FB_USED is not 0 even when no GPU process is running.
Is this a bug in DCGM, or a bug in NVIDIA driver 510.47.03?
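For reference, a minimal NVML check along the lines described above might look like the following sketch; it assumes the pynvml (nvidia-ml-py) Python bindings are installed.

```python
# Sketch: query per-GPU framebuffer memory directly through NVML, for
# comparison with the DCGM_FI_DEV_FB_USED value reported by dcgm-exporter.
# Assumes the pynvml (nvidia-ml-py) bindings are installed.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        # total/free/used are reported in bytes.
        print(f"GPU {i}: used={mem.used / 1024**2:.0f} MiB, "
              f"free={mem.free / 1024**2:.0f} MiB, "
              f"total={mem.total / 1024**2:.0f} MiB")
finally:
    pynvml.nvmlShutdown()
```

Note that newer NVML releases also provide a v2 memory-info query that reports reserved memory separately, which is the distinction behind the driver 510 behavior discussed in the comments; whether that query is exposed depends on the driver and bindings version.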