
dcgm-exporter doesn't see GPU processes and GPU memory usage #209

Open
lev-stas opened this issue Aug 23, 2021 · 0 comments

Hi, I'm trying to set up GPU monitoring via Grafana/Prometheus. I have a standalone server with two GPUs and use dcgm-exporter in a Docker container as the metrics exporter. I start the container in privileged mode with the following command:

docker run -d -e --priveleged \
  -v /home/dockeradm/nvidia-smi-exporter/default-counters.csv:/etc/dcgm-exporter/default-counters.csv \
  -p9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04

The exporter sees both GPUs, but it can't detect GPU processes or GPU memory usage.
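For reference, the exporter's raw output can be checked by scraping it directly; a minimal check, assuming the framebuffer-memory fields (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE) are enabled in the mounted default-counters.csv:

# scrape the exporter and filter for framebuffer-memory fields
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_FB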
Here is the output of the nvidia-smi utility on the host:

$ nvidia-smi
Mon Aug 23 23:03:29 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:37:00.0 Off |                    0 |
| N/A   60C    P0    42W / 250W |   1393MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   64C    P0    47W / 250W |  10095MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     17748      C   ...189c/arasov/bin/python3.7        0MiB |
|    0   N/A  N/A     53799      C   ...189c/arasov/bin/python3.7     1389MiB |
|    1   N/A  N/A     17748      C   ...189c/arasov/bin/python3.7    10091MiB |
|    1   N/A  N/A     53799      C   ...189c/arasov/bin/python3.7        0MiB |
+-----------------------------------------------------------------------------+

And here is the output of nvidia-smi inside the container:

root@ccdc999ac0bd:/# nvidia-smi
Mon Aug 23 19:25:22 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:37:00.0 Off |                    0 |
| N/A   59C    P0    41W / 250W |   1393MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   62C    P0    46W / 250W |  10095MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Am I missing something or doing something wrong? How should I configure the container so that it detects GPU processes and GPU memory usage?
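In case it is relevant, the counters file I mount follows the usual dcgm-exporter CSV layout (DCGM field name, Prometheus metric type, help text). A minimal sketch of the memory-related entries I would expect to drive the GPU memory usage metrics, assuming they are present and uncommented in default-counters.csv:

# Format: DCGM field, Prometheus metric type, help string
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).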
