NVML errors with Jupyterlab GPU dashboard #114

Open
brianklobucher opened this issue Nov 15, 2021 · 1 comment
brianklobucher commented Nov 15, 2021

It looks like PCIe and NVLink queries are not supported on this GPU, but jupyterlab-nvdashboard assumes they are, so the dashboard fails with NVML "Not Supported" errors.

$ docker run -it --gpus all nvcr.io/nvaie/nvidia-rapids:21.08-cuda11.4-ubuntu20.04-py3.8 bash
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.download.nvidia.com/licenses/NVIDIA_Deep_Learning_Container_License.pdf
(rapids) root@955ce2376d75:/workspace# conda list | grep nvml
pynvml                    11.0.0             pyhd8ed1ab_0    conda-forge
(rapids) root@955ce2376d75:/workspace# conda list jupyterlab-nvdashboard
# packages in environment at /opt/conda/envs/rapids:
#
# Name                    Version                   Build  Channel
jupyterlab-nvdashboard    0.6.0                      py_0    rapidsai
(rapids) root@955ce2376d75:/workspace# nvidia-smi
Wed Nov  3 00:18:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID T4-16C         On   | 00000000:02:00.0 Off |                    0 |
| N/A   N/A    P8    N/A /  N/A |   2220MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(rapids) root@955ce2376d75:/workspace# grep nvmlDeviceGet /opt/conda/envs/rapids/lib/python3.8/site-packages/jupyterlab_nvdashboard/apps/gpu.py
ngpus = pynvml.nvmlDeviceGetCount()
gpu_handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(ngpus)]
            pynvml.nvmlDeviceGetUtilizationRates(gpu_handles[i]).gpu
        return [pynvml.nvmlDeviceGetMemoryInfo(handle).used for handle in gpu_handles]
        return pynvml.nvmlDeviceGetMemoryInfo(gpu_handles[0]).total
    pci_gen = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(gpu_handles[0])
    pci_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(gpu_handles[0])
        pynvml.nvmlDeviceGetPcieThroughput(
        pynvml.nvmlDeviceGetPcieThroughput(
            pynvml.nvmlDeviceGetPcieThroughput(
            pynvml.nvmlDeviceGetPcieThroughput(
    nvlink_ver = pynvml.nvmlDeviceGetNvLinkVersion(gpu_handles[0], 0)
                pynvml.nvmlDeviceGetNvLinkUtilizationCounter(
                pynvml.nvmlDeviceGetNvLinkUtilizationCounter(
                    pynvml.nvmlDeviceGetNvLinkUtilizationCounter(
                    pynvml.nvmlDeviceGetNvLinkUtilizationCounter(
                pynvml.nvmlDeviceGetNvLinkUtilizationCounter(
                pynvml.nvmlDeviceGetNvLinkUtilizationCounter(
                    pynvml.nvmlDeviceGetNvLinkUtilizationCounter(
                    pynvml.nvmlDeviceGetNvLinkUtilizationCounter(
        pynvml.nvmlDeviceGetMemoryInfo(handle).total / (1024 * 1024)
            gpu = pynvml.nvmlDeviceGetUtilizationRates(gpu_handles[i]).gpu
            mem = pynvml.nvmlDeviceGetMemoryInfo(gpu_handles[i]).used
                pynvml.nvmlDeviceGetPcieThroughput(
                pynvml.nvmlDeviceGetPcieThroughput(
(rapids) root@955ce2376d75:/workspace# python
Python 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pynvml import *
>>> nvmlInit()
>>> nvmlSystemGetDriverVersion()
b'470.63.01'
>>> nvmlDeviceGetName(nvmlDeviceGetHandleByIndex(0))
b'GRID T4-16C'
>>> nvmlDeviceGetPcieThroughput(nvmlDeviceGetHandleByIndex(0), NVML_PCIE_UTIL_TX_BYTES)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py", line 2542, in nvmlDeviceGetPcieThroughput
    _nvmlCheckReturn(ret)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
>>> nvmlDeviceGetNvLinkVersion(nvmlDeviceGetHandleByIndex(0), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py", line 2656, in nvmlDeviceGetNvLinkVersion
    _nvmlCheckReturn(ret)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
>>> nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0)).total
17179869184
>>> nvmlDeviceGetUtilizationRates(nvmlDeviceGetHandleByIndex(0)).gpu
0

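One possible mitigation, sketched below, would be for the dashboard to probe each metric once at startup and only wire up the PCIe/NVLink plots when the query succeeds. This is not the current gpu.py logic; supports is a hypothetical helper:

# Hypothetical startup probe -- not the current jupyterlab_nvdashboard code.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def supports(query, *args):
    # Returns True when the NVML query works on this GPU, False when the
    # driver reports NVML_ERROR_NOT_SUPPORTED (as it does here for PCIe/NVLink).
    try:
        query(*args)
        return True
    except pynvml.NVMLError_NotSupported:
        return False

pcie_ok = supports(pynvml.nvmlDeviceGetPcieThroughput, handle,
                   pynvml.NVML_PCIE_UTIL_TX_BYTES)
nvlink_ok = supports(pynvml.nvmlDeviceGetNvLinkVersion, handle, 0)
print(pcie_ok, nvlink_ok)  # both print False in the session above
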
Below is the NVML stack trace from the dashboard:

ERROR:bokeh.util.tornado:Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/tornado/gen.py", line 526, in callback
    result_list.append(f.result())
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/bokeh/server/session.py", line 67, in _needs_document_lock_wrapper
    result = func(self, *args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/bokeh/server/session.py", line 195, in with_document_locked
    return func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/bokeh/document/document.py", line 1212, in wrapper
    return doc._with_self_as_curdoc(invoke)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/bokeh/document/document.py", line 1198, in _with_self_as_curdoc
    return f()
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/bokeh/document/document.py", line 1211, in invoke
    return f(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 575, in cb
    pynvml.nvmlDeviceGetPcieThroughput(
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py", line 2542, in nvmlDeviceGetPcieThroughput
    _nvmlCheckReturn(ret)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported

ERROR:bokeh.util.tornado:Error thrown from periodic callback:
(identical traceback repeated)
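
Alternatively (again just a sketch, not the actual code at gpu.py line 575), the periodic callback could catch the "Not Supported" error and skip the reading instead of raising; pcie_throughput_kb is not an existing nvdashboard function:

# Hypothetical guard for the PCIe throughput read in the periodic callback.
import pynvml

def pcie_throughput_kb(handle, counter):
    # Returns the PCIe throughput counter (KB/s), or None when the GPU
    # reports NVML_ERROR_NOT_SUPPORTED so the caller can skip the plot.
    try:
        return pynvml.nvmlDeviceGetPcieThroughput(handle, counter)
    except pynvml.NVMLError_NotSupported:
        return None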

jacobtomlinson (Member) commented

Is this related to #28? We are aware that not all GPU configurations expose NVLink info, and in those cases the plot shows a 500 error.
