This repository has been archived by the owner on Nov 2, 2021. It is now read-only.

exporter returns no profiling metrics after some period of time #189

Open
shovsj opened this issue May 18, 2021 · 1 comment

shovsj commented May 18, 2021

Hi all, I am running the Docker version of dcgm-exporter.
Right after the dcgm-exporter container starts, the profiling metrics are reported correctly.
After some period of time, the profiling metrics (DCGM_FI_PROF_*) stop reporting real values.
More precisely, the exporter prints the profiling metrics as zero, while other metrics such as DCGM_FI_DEV_POWER_USAGE are still reported correctly.
If I restart the container, the metrics are exported correctly again.

Here is an example of the exporter output:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
# HELP DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Total energy consumption since boot (in mJ).
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
# HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries.
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
# HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %).
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
# HELP DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %).
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
# HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes.
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
# HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
# HELP DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS Number of remapped rows for uncorrectable errors
# TYPE DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS counter
# HELP DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS Number of remapped rows for correctable errors
# TYPE DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS counter
# HELP DCGM_FI_DEV_ROW_REMAP_FAILURE Whether remapping of rows has failed
# TYPE DCGM_FI_DEV_ROW_REMAP_FAILURE gauge
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
# HELP DCGM_FI_PROF_SM_ACTIVE The ratio of cycles an SM has at least 1 warp assigned (in %).
# TYPE DCGM_FI_PROF_SM_ACTIVE gauge
# HELP DCGM_FI_PROF_SM_OCCUPANCY The ratio of number of warps resident on an SM (in %).
# TYPE DCGM_FI_PROF_SM_OCCUPANCY gauge
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data (in %).
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
# HELP DCGM_FI_PROF_PIPE_FP64_ACTIVE Ratio of cycles the fp64 pipes are active (in %).
# TYPE DCGM_FI_PROF_PIPE_FP64_ACTIVE gauge
# HELP DCGM_FI_PROF_PIPE_FP32_ACTIVE Ratio of cycles the fp32 pipes are active (in %).
# TYPE DCGM_FI_PROF_PIPE_FP32_ACTIVE gauge
# HELP DCGM_FI_PROF_PIPE_FP16_ACTIVE Ratio of cycles the fp16 pipes are active (in %).
# TYPE DCGM_FI_PROF_PIPE_FP16_ACTIVE gauge
# HELP DCGM_FI_PROF_PCIE_TX_BYTES The number of bytes of active pcie tx data including both header and payload.
# TYPE DCGM_FI_PROF_PCIE_TX_BYTES counter
# HELP DCGM_FI_PROF_PCIE_RX_BYTES The number of bytes of active pcie rx data including both header and payload.
# TYPE DCGM_FI_PROF_PCIE_RX_BYTES counter


DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 1380
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 877
DCGM_FI_DEV_MEMORY_TEMP{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 43
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 92.562000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 118893321950
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 10
DCGM_FI_DEV_ENC_UTIL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_DEC_UTIL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 9939
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 22571
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_SM_OCCUPANCY{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_FP64_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_FP32_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_FP16_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0

DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 1380
DCGM_FI_DEV_MEM_CLOCK{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 877
DCGM_FI_DEV_MEMORY_TEMP{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 54
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 56
DCGM_FI_DEV_POWER_USAGE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 167.362000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 129931554972
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 22
DCGM_FI_DEV_ENC_UTIL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_DEC_UTIL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_FB_FREE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 30332
DCGM_FI_DEV_FB_USED{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 2178
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_SM_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_SM_OCCUPANCY{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_FP64_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_FP32_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_FP16_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0

As you can see, the GPUs are currently being utilized, but every DCGM_FI_PROF_* metric reports zero.
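For anyone who wants to detect this state automatically, here is a minimal sketch in Python. It assumes the scrape output looks like the sample above (one `NAME{labels} VALUE` sample per line, with a `gpu` label). The function name `zeroed_prof_gpus` is my own and is not part of dcgm-exporter. Note that all-zero profiling values are also legitimate on an idle GPU, so the result is only meaningful when compared against actual utilization.

```python
import re

# One Prometheus sample line: NAME{label="value",...} VALUE
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
GPU_RE = re.compile(r'gpu="([^"]*)"')


def zeroed_prof_gpus(text):
    """Return the sorted gpu labels whose DCGM_FI_PROF_* samples are all zero."""
    per_gpu = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if not match or not match.group("name").startswith("DCGM_FI_PROF_"):
            continue
        gpu_match = GPU_RE.search(match.group("labels"))
        gpu = gpu_match.group(1) if gpu_match else "?"
        per_gpu.setdefault(gpu, []).append(float(match.group("value")))
    return sorted(
        gpu for gpu, values in per_gpu.items()
        if values and all(value == 0.0 for value in values)
    )
```

Feeding it the text scraped from the exporter's metrics endpoint (port 31400 in my setup) returns the GPUs whose profiling metrics are all zero, which could then be checked against nvidia-smi utilization.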

This is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   43C    P0    96W / 250W |  22571MiB / 32510MiB |     43%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   56C    P0   168W / 250W |   2178MiB / 32510MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Here is /var/log/nv-hostengine.log from inside the dcgm-exporter container:

2021-05-18 05:28:26.755 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:26.945 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:26.945 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.055 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 05:28:27.136 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.136 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.136 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.136 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.326 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.326 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.326 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.326 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]

The image version I am currently using is the following:

nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04

(I also tried nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04.
The same thing happened, except that the older version stops exporting the profiling metrics altogether, while the newer version prints them as zero.)

The OS is:

CentOS Linux release 7.9.2009 (Core)
Linux cl-platform-gpu02 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

The NVIDIA driver version and GPUs are:

450.51.05
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB

This is the command I used:

docker run --gpus all --cap-add CAP_SYS_ADMIN --network host -v /NAS:/NAS -d nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04 --address 0.0.0.0:31400 -f /NAS/dcgm_exporter/dcp-metrics-included-all.csv --address 0.0.0.0:31400

The file /NAS/dcgm_exporter/dcp-metrics-included-all.csv contains:

# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power,,
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE,,
# DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
# DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product),,
# DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC,,
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages,,
# DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink,,
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.

# VGPU License status,,
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

# DCP metrics,,
DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES,      counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES,      counter, The number of bytes of active pcie rx data including both header and payload.
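To double-check which fields are actually enabled in a file like this, here is a rough Python sketch. It assumes only the format described in the file's own header comments (lines starting with `#` are comments, and each entry is `DCGM FIELD, Prometheus metric type, help message`). It is not the exporter's real parser, and `enabled_fields` is a hypothetical helper name.

```python
def enabled_fields(csv_text):
    """Return (field, metric_type, help) tuples for every uncommented entry."""
    fields = []
    for raw in csv_text.splitlines():
        line = raw.strip()
        # Blank lines and '#' lines are comments or disabled entries.
        if not line or line.startswith("#"):
            continue
        # Split on the first two commas only, so a help message that
        # itself contains a comma stays in one piece.
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) != 3:
            continue
        fields.append(tuple(parts))
    return fields


def enabled_profiling_fields(csv_text):
    """Return only the enabled DCGM_FI_PROF_* field names."""
    return [
        name for name, _type, _help in enabled_fields(csv_text)
        if name.startswith("DCGM_FI_PROF_")
    ]
```

Stripping each part also tolerates stray spaces before a comma, as in the `DCGM_FI_DEV_DEC_UTIL ,` line above.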

Is there a step or setting that I missed?

shovsj commented May 18, 2021

There are some other errors in the log as well:

2021-05-18 07:32:52.824 ERROR [1:1] NSCQ returned warning during session creation. Ensure driver version matches NSCQ version. NSCQ warning ret: 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:432] [DcgmNs::DcgmNvSwitchManager::AttachToNscq]
2021-05-18 07:32:52.824 ERROR [1:1] Could not read NvSwitch status [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:552] [DcgmNs::DcgmNvSwitchManager::AttachNvSwitches]
2021-05-18 07:32:52.824 ERROR [1:1] AttachNvSwitches returned -14 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:446] [DcgmNs::DcgmNvSwitchManager::AttachToNscq]
2021-05-18 07:32:52.824 ERROR [1:1] AttachToNscq() returned -14 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:290] [DcgmNs::DcgmNvSwitchManager::Init]
2021-05-18 07:32:52.824 ERROR [1:1] Could not initialize switch manager. Ret: No data is available [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:33] [DcgmNs::DcgmModuleNvSwitch::DcgmModuleNvSwitch]
2021-05-18 07:32:52.825 ERROR [1:26] ReadNvSwitchStatusAllSwitches() returned No data is available [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:387] [DcgmNs::DcgmModuleNvSwitch::RunOnce]
2021-05-18 07:32:52.825 WARN  [1:1] Got 0 entities from GetAllEntitiesOfEntityGroup() of eg 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmGroupManager.cpp:159] [DcgmGroupManager::AddAllEntitiesToGroup]
2021-05-18 07:32:53.200 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:53.201 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:54.200 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:54.201 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:55.204 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:55.217 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:56.146 ERROR [1:28] [PerfWorks] Decoded zero samples after 42 attempt(s) to decode counters [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:539] [DcgmLopConfig::GetSamples]
2021-05-18 07:32:56.146 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 07:32:56.201 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:56.214 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:57.032 ERROR [1:28] [PerfWorks] Decoded zero samples after 47 attempt(s) to decode counters [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:539] [DcgmLopConfig::GetSamples]
2021-05-18 07:32:57.032 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 07:32:57.201 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:57.214 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:57.737 ERROR [1:28] [PerfWorks] Decoded zero samples after 42 attempt(s) to decode counters [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:539] [DcgmLopConfig::GetSamples]
2021-05-18 07:32:57.737 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 07:32:58.201 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:58.214 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:58.781 ERROR [1:28] [PerfWorks] Decoded zero samples after 47 attempt(s) to decode counters [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:539] [DcgmLopConfig::GetSamples]
2021-05-18 07:32:58.781 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 07:32:59.202 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:59.214 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:32:59.675 ERROR [1:28] [PerfWorks] Decoded zero samples after 48 attempt(s) to decode counters [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:539] [DcgmLopConfig::GetSamples]
2021-05-18 07:32:59.675 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 07:33:00.201 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:33:00.214 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:33:00.376 ERROR [1:28] [PerfWorks] Decoded zero samples after 40 attempt(s) to decode counters [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:539] [DcgmLopConfig::GetSamples]
2021-05-18 07:33:00.376 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 07:33:01.201 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:33:01.218 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:33:01.570 ERROR [1:28] [PerfWorks] Decoded zero samples after 47 attempt(s) to decode counters [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:539] [DcgmLopConfig::GetSamples]
2021-05-18 07:33:01.570 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 07:33:02.201 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:33:02.215 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:33:02.651 ERROR [1:28] [PerfWorks] Decoded zero samples after 47 attempt(s) to decode counters [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:539] [DcgmLopConfig::GetSamples]
2021-05-18 07:33:02.651 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 07:33:03.201 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:33:03.215 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 07:33:03.348 ERROR [1:28] [PerfWorks] Decoded zero samples after 48 attempt(s) to decode counters [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:539] [DcgmLopConfig::GetSamples]
2021-05-18 07:33:03.348 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 07:33:03.540 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
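Since the log is very repetitive, a tally by reporting function makes it easier to see which errors dominate. The following is a minimal Python sketch that assumes every line has the layout seen above: timestamp, level, `[pid:tid]`, message, `[source file:line]`, `[function]`. The regular expression and the name `count_by_function` are my own and do not come from DCGM.

```python
import re
from collections import Counter

# timestamp  LEVEL  [pid:tid]  message  [source:line]  [function]
LINE_RE = re.compile(
    r'^\S+ \S+\s+(?P<level>[A-Z]+)\s+\[\d+:\d+\]\s+(?P<msg>.*)\s+'
    r'\[(?P<src>[^\]]+:\d+)\]\s+\[(?P<func>[^\]]+)\]$'
)


def count_by_function(log_text):
    """Count log lines per (level, reporting function); skip unparsable lines."""
    counts = Counter()
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            counts[(match.group("level"), match.group("func"))] += 1
    return counts
```

Calling `count_by_function(text).most_common()` on the contents of /var/log/nv-hostengine.log lists the noisiest sources first, which separates the NvLink and NvSwitch messages from the profiling (`GetSamples`) failures.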
