This repository has been archived by the owner on Nov 2, 2021. It is now read-only.
exporter returns no profiling metrics after some period of time #189
Comments
There are some other errors as well.
Hi guys, I am using the Docker version of dcgm-exporter.
Right after I start the dcgm-exporter container, the profiling metrics are exported correctly.
After some period of time, I can no longer get the profiling metrics (DCGM_FI_PROF_*).
More precisely, the exporter reports the profiling metrics as zero, while other metrics such as DCGM_FI_DEV_POWER_USAGE are still reported correctly.
If I restart the container, the metrics are exported correctly again.
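To make the symptom easier to catch automatically, here is a minimal Python sketch that parses the exporter's Prometheus text output and flags the state described above: DCGM_FI_DEV_POWER_USAGE is non-zero while every DCGM_FI_PROF_* sample is zero or missing. The function names and the `http://localhost:9400/metrics` URL are my own assumptions for illustration, not part of dcgm-exporter, so adjust them to your setup.

```python
import re
import urllib.request
from typing import Dict, List

# One Prometheus text-format sample: metric name, optional {labels}, value.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{.*\})?\s+(\S+)')


def parse_samples(text: str) -> Dict[str, List[float]]:
    """Collect all numeric sample values per metric name."""
    samples: Dict[str, List[float]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = _SAMPLE.match(line)
        if not match:
            continue
        try:
            value = float(match.group(2))
        except ValueError:
            continue
        samples.setdefault(match.group(1), []).append(value)
    return samples


def profiling_looks_stalled(text: str) -> bool:
    """True if power usage is reported but all DCGM_FI_PROF_* values are zero or absent."""
    samples = parse_samples(text)
    prof = [v for name, values in samples.items()
            if name.startswith("DCGM_FI_PROF_") for v in values]
    power = samples.get("DCGM_FI_DEV_POWER_USAGE", [])
    return any(p > 0 for p in power) and all(v == 0 for v in prof)


def fetch_metrics(url: str = "http://localhost:9400/metrics") -> str:
    """Fetch the metrics page. The default URL is an assumption; change it as needed."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

Running `profiling_looks_stalled(fetch_metrics())` periodically would show roughly when the profiling metrics stop. Note that an idle GPU also reports zero profiling values, so this check is only meaningful while a workload is running.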
Here is an example
As you can see, the GPU is being utilized, but DCGM_FI_PROF_* reports zero.
This is the output of nvidia-smi:
Here is /var/log/nv-hostengine.log from the dcgm-exporter container:
The version I am currently using is the following:
(I also tried nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04.
The same thing happened, except that with that version the profiling metrics are not exported at all, while the newer version prints zero.)
The OS is
The NVIDIA driver and GPUs are
This is the command I used
and /NAS/dcgm_exporter/dcp-metrics-included-all.csv contains
Is there something I missed?