DCGM-Exporter msg="Could not retrieve ConfigMap ..." #400

Closed
devnjw opened this issue Aug 30, 2022 · 6 comments

Comments


devnjw commented Aug 30, 2022

I am running a cluster with a number of NVIDIA GPUs and monitoring them with dcgm-exporter. However, dcgm-exporter sometimes fails to report metrics and produces the logs below.

time="2022-08-19T07:25:51Z" level=info msg="Starting dcgm-exporter"
time="2022-08-19T07:25:51Z" level=info msg="DCGM successfully initialized!"
time="2022-08-19T07:25:51Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2022-08-19T07:26:21Z" level=info msg="Could not retrieve ConfigMap 'gpu-monitor:exporter-metrics-config-map': Get \"https://{ip}/api/v1/namespaces/gpu-monitor/configmaps/exporter-metrics-config-map\": dial tcp {ip}: i/o timeout, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2022-08-19T07:26:21Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-08-19T07:26:21Z" level=info msg="Pipeline starting"
time="2022-08-19T07:26:21Z" level=info msg="Starting webserver"

I would expect the Pod to restart when the exporter cannot retrieve the ConfigMap, but it does not. (Or at least the Pod should be marked as not ready.)
I would appreciate it if you could look into this and either give me feedback or fix the issue.
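
Until the exporter itself fails on this condition, one way to spot affected Pods is to scan their logs for the fallback message shown above. The following is a minimal sketch, assuming the logs keep the `key="value"` text format seen in this issue; the function names `parse_line` and `find_configmap_fallbacks` are hypothetical and are not part of dcgm-exporter.

```python
import re

# Matches key="value" or key=value pairs as they appear in the logs above.
# Quoted values may contain escaped quotes (\").
_PAIR = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|(\S+))')


def parse_line(line):
    """Return a dict of the key/value pairs found in one log line."""
    fields = {}
    for key, quoted, bare in _PAIR.findall(line):
        # Unescape \" inside quoted values; bare values are used as-is.
        fields[key] = quoted.replace('\\"', '"') if quoted else bare
    return fields


def find_configmap_fallbacks(lines):
    """Return (time, msg) for each line that reports a ConfigMap fallback."""
    hits = []
    for line in lines:
        fields = parse_line(line)
        msg = fields.get("msg", "")
        if (msg.startswith("Could not retrieve ConfigMap")
                and "falling back to metric file" in msg):
            hits.append((fields.get("time", ""), msg))
    return hits
```

The input could be the lines of a Pod's log output (for example, saved from `kubectl logs`). A non-empty result means that the exporter started with the default counters file instead of the configured ConfigMap.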

Other dcgm-exporter instances that are working normally produce the following logs.

time="2022-08-05T00:13:49Z" level=info msg="Starting dcgm-exporter"
time="2022-08-05T00:13:49Z" level=info msg="DCGM successfully initialized!"
time="2022-08-05T00:13:49Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2022-08-05T00:13:49Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled"
time="2022-08-05T00:13:49Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled"
time="2022-08-05T00:13:49Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled"
time="2022-08-05T00:13:49Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled"
time="2022-08-05T00:13:49Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled"
time="2022-08-05T00:13:49Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-08-05T00:13:49Z" level=info msg="Pipeline starting"
time="2022-08-05T00:13:49Z" level=info msg="Starting webserver"
@shivamerla
Contributor

@devnjw What env are you passing to dcgm-exporter? Are you trying to pass the ConfigMap name using the DCGM_EXPORTER_CONFIGMAP_DATA env? For custom metrics, you can create a ConfigMap and deploy it as described here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-metrics-config.

Author

devnjw commented Sep 7, 2022

@shivamerla Thank you for your reply, but I don't need a custom exporter.
I am asking if it is more appropriate to panic() the exporter when an error such as the first log above occurs.

@shivamerla
Contributor

Got it. Yes, I will relay this to the DCGM exporter team. When it is configured to run with a custom ConfigMap and that ConfigMap is not found, the exporter should error out.
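
The behavior described here (fail when an explicitly configured ConfigMap cannot be retrieved, and use the default counters file only when no ConfigMap is configured) can be sketched as follows. This is an illustration only, not dcgm-exporter's actual code, which is written in Go; `load_counters`, `fetch_configmap`, `read_default_file`, and `ConfigMapUnavailable` are all hypothetical names.

```python
class ConfigMapUnavailable(RuntimeError):
    """Raised when an explicitly configured ConfigMap cannot be retrieved."""


def load_counters(configmap_name, fetch_configmap, read_default_file):
    """Choose the metric definitions the exporter should use.

    configmap_name: a "namespace:name" string, or None if no ConfigMap
        is configured.
    fetch_configmap: hypothetical callable that returns the ConfigMap
        contents and raises OSError (e.g. TimeoutError) on failure.
    read_default_file: hypothetical callable that returns the default
        counters.
    """
    if configmap_name is None:
        # Nothing was requested, so the default file is the intended source.
        return read_default_file()
    try:
        return fetch_configmap(configmap_name)
    except OSError as err:
        # Fail fast instead of silently falling back, so that the process
        # exits and the Pod can be restarted or reported as unhealthy.
        raise ConfigMapUnavailable(
            f"could not retrieve ConfigMap {configmap_name!r}: {err}"
        ) from err
```

Letting the exception propagate ends the process with a non-zero exit status, which gives the orchestrator a visible failure to act on, unlike the silent fallback shown in the first log.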

@shivamerla
Contributor

@glowkey @dualvtable Please take a look at this.


glowkey commented Oct 21, 2022

Tracking here: NVIDIA/dcgm-exporter#111

@cdesiniotis
Contributor

@devnjw this should be fixed in newer versions of dcgm-exporter. Closing. Please re-open if you are still experiencing this issue.
