This repository was archived by the owner on Nov 2, 2021. It is now read-only.

exporter returns no profiling metrics after some period of time #189

Open
@shovsj

Description


Hi, I am running the Docker version of dcgm-exporter.
Right after starting the dcgm-exporter container, the profiling metrics are reported correctly.
After some period of time, however, the DCGM_FI_PROF_* profiling metrics stop working: the exporter reports them as zero, while other metrics such as DCGM_FI_DEV_POWER_USAGE are still reported correctly.
If I restart the container, the metrics are exported correctly again.

Here is an example:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
# HELP DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Total energy consumption since boot (in mJ).
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
# HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries.
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
# HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %).
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
# HELP DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %).
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
# HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes.
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
# HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
# HELP DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS Number of remapped rows for uncorrectable errors
# TYPE DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS counter
# HELP DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS Number of remapped rows for correctable errors
# TYPE DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS counter
# HELP DCGM_FI_DEV_ROW_REMAP_FAILURE Whether remapping of rows has failed
# TYPE DCGM_FI_DEV_ROW_REMAP_FAILURE gauge
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
# HELP DCGM_FI_PROF_SM_ACTIVE The ratio of cycles an SM has at least 1 warp assigned (in %).
# TYPE DCGM_FI_PROF_SM_ACTIVE gauge
# HELP DCGM_FI_PROF_SM_OCCUPANCY The ratio of number of warps resident on an SM (in %).
# TYPE DCGM_FI_PROF_SM_OCCUPANCY gauge
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data (in %).
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
# HELP DCGM_FI_PROF_PIPE_FP64_ACTIVE Ratio of cycles the fp64 pipes are active (in %).
# TYPE DCGM_FI_PROF_PIPE_FP64_ACTIVE gauge
# HELP DCGM_FI_PROF_PIPE_FP32_ACTIVE Ratio of cycles the fp32 pipes are active (in %).
# TYPE DCGM_FI_PROF_PIPE_FP32_ACTIVE gauge
# HELP DCGM_FI_PROF_PIPE_FP16_ACTIVE Ratio of cycles the fp16 pipes are active (in %).
# TYPE DCGM_FI_PROF_PIPE_FP16_ACTIVE gauge
# HELP DCGM_FI_PROF_PCIE_TX_BYTES The number of bytes of active pcie tx data including both header and payload.
# TYPE DCGM_FI_PROF_PCIE_TX_BYTES counter
# HELP DCGM_FI_PROF_PCIE_RX_BYTES The number of bytes of active pcie rx data including both header and payload.
# TYPE DCGM_FI_PROF_PCIE_RX_BYTES counter


DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 1380
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 877
DCGM_FI_DEV_MEMORY_TEMP{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 43
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 92.562000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 118893321950
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 10
DCGM_FI_DEV_ENC_UTIL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_DEC_UTIL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 9939
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 22571
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_SM_OCCUPANCY{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_FP64_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_FP32_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_FP16_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0

DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 1380
DCGM_FI_DEV_MEM_CLOCK{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 877
DCGM_FI_DEV_MEMORY_TEMP{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 54
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 56
DCGM_FI_DEV_POWER_USAGE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 167.362000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 129931554972
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 22
DCGM_FI_DEV_ENC_UTIL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_DEC_UTIL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_FB_FREE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 30332
DCGM_FI_DEV_FB_USED{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 2178
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_SM_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_SM_OCCUPANCY{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_FP64_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_FP32_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_FP16_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0

As you can see, the GPUs are currently being utilized, yet all of the DCGM_FI_PROF_* metrics report zero.
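Not part of the original report, but a quick way to confirm this symptom from a script: the sketch below scrapes the exporter's /metrics endpoint and checks whether every DCGM_FI_PROF_* sample reads zero. The endpoint URL and helper names here are assumptions, not from the report; adjust the port to match the `--address` flag you start the exporter with.

```python
import urllib.request


def prof_metrics_all_zero(metrics_text: str) -> bool:
    """True when DCGM_FI_PROF_* samples are present and every one reads zero."""
    values = []
    for line in metrics_text.splitlines():
        # Skip # HELP / # TYPE comments and non-profiling metrics;
        # keep only the DCP (profiling) samples.
        if line.startswith("DCGM_FI_PROF_"):
            values.append(float(line.rsplit(None, 1)[-1]))
    return bool(values) and all(v == 0.0 for v in values)


def check_exporter(url: str = "http://localhost:31400/metrics") -> bool:
    # Hypothetical endpoint, matching the --address flag used in this report.
    with urllib.request.urlopen(url) as resp:
        return prof_metrics_all_zero(resp.read().decode())
```

A `True` result while `nvidia-smi` shows real GPU utilization reproduces the symptom described above.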

This is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   43C    P0    96W / 250W |  22571MiB / 32510MiB |     43%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   56C    P0   168W / 250W |   2178MiB / 32510MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Here is the log from /var/log/nv-hostengine.log inside the dcgm-exporter container:

2021-05-18 05:28:26.755 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:26.945 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:26.945 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.055 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 05:28:27.136 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.136 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.136 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.136 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.326 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.326 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.326 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.326 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]

The version I am currently using is:

nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04

(I also tried nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04, and the same thing happened, except that that version did not export the profiling metrics at all, whereas the newer version reports them as zero.)

The OS is:

CentOS Linux release 7.9.2009 (Core)
Linux cl-platform-gpu02 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

The NVIDIA driver and GPUs are:

450.51.05
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB

This is the command I used:

docker run --gpus all --cap-add CAP_SYS_ADMIN --network host -v /NAS:/NAS -d nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04 --address 0.0.0.0:31400 -f /NAS/dcgm_exporter/dcp-metrics-included-all.csv --address 0.0.0.0:31400
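Since restarting the container restores the profiling metrics, an automatic restart is a possible stopgap until the underlying DCGM issue is resolved. Below is a minimal watchdog sketch; the container name `dcgm-exporter`, the endpoint URL, and the "three consecutive all-zero scrapes" threshold are all assumptions for illustration, not something from this report.

```python
import subprocess
import time
import urllib.request

PROF_PREFIX = "DCGM_FI_PROF_"


def prof_flatlined(metrics_text: str) -> bool:
    """True when profiling samples are present but every one reads zero."""
    vals = [float(line.rsplit(None, 1)[-1])
            for line in metrics_text.splitlines()
            if line.startswith(PROF_PREFIX)]
    return bool(vals) and not any(vals)


def watchdog(container="dcgm-exporter",
             url="http://localhost:31400/metrics",
             strikes=3, interval=60):
    """Restart the container after several consecutive all-zero scrapes."""
    misses = 0
    while True:
        with urllib.request.urlopen(url) as resp:
            flat = prof_flatlined(resp.read().decode())
        misses = misses + 1 if flat else 0
        if misses >= strikes:
            # A single zero scrape can be a legitimately idle GPU; only
            # restart after `strikes` zero scrapes in a row.
            subprocess.run(["docker", "restart", container], check=True)
            misses = 0
        time.sleep(interval)
```

This obviously loses a scrape or two around each restart and does not fix the root cause, so it is only a workaround.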

and /NAS/dcgm_exporter/dcp-metrics-included-all.csv contains:

# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power,,
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE,,
# DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
# DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product),,
# DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC,,
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages,,
# DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink,,
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.

# VGPU License status,,
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

# DCP metrics,,
DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES,      counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES,      counter, The number of bytes of active pcie rx data including both header and payload.

Is there something I missed?
