Move exporter to stable release 2.1.0
Includes the following changes:
1. Updated DCGM version to 2.0.13 (includes MIG metrics)
2. Added error handling for setcap in the entrypoint script; this fixes an issue where the setcap command fails in OpenShift, in turn causing a CrashLoopBackOff error for the dcgm-exporter pod
3. Renamed entrypoint script
4. Fixed typo in the DCGM download URL for UBI8
5. Updated Helm chart versioning and metadata
6. Updated documentation
dualvtable committed Oct 20, 2020
1 parent 488614d commit 4d93acc
Showing 11 changed files with 96 additions and 65 deletions.
4 changes: 2 additions & 2 deletions Makefile
@@ -16,9 +16,9 @@ DOCKER ?= docker
MKDIR ?= mkdir
REGISTRY ?= nvidia

DCGM_VERSION := 2.0.10
DCGM_VERSION := 2.0.13
GOLANG_VERSION := 1.14.2
VERSION := 2.1.0-rc.2
VERSION := 2.1.0
FULL_VERSION := $(DCGM_VERSION)-$(VERSION)

.PHONY: all binary install check-format
62 changes: 40 additions & 22 deletions README.md
@@ -1,24 +1,24 @@
# NVIDIA GPU Monitoring Tools

This repository contains Golang bindings and DCGM-Exporter for gathering GPU telemetry in Kubernetes.

## Bindings

This Github repository contains Golang bindings for the following two libraries:
Golang bindings are provided for the following two libraries:
- [NVIDIA Management Library (NVML)](https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference) is a C-based API for monitoring and managing NVIDIA GPU devices.
- [NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm) is a set of tools for managing and monitoring NVIDIA GPUs in cluster environments. It's a low overhead tool suite that performs a variety of functions on each host system including active health monitoring, diagnostics, system validation, policies, power and clock management, group configuration and accounting.

You will also find samples for both of these bindings in this repository.

## DCGM exporter

This Github repository also contains the DCGM exporter software. It exposes GPU metrics exporter for [Prometheus](https://prometheus.io/) leveraging [NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm).
## DCGM-Exporter

Find the installation and run instructions [here](https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/README.md).
The repository also contains DCGM-Exporter, a GPU metrics exporter for [Prometheus](https://prometheus.io/) leveraging [NVIDIA DCGM](https://developer.nvidia.com/dcgm).

### Quickstart

To gather metrics on a GPU node, simply start the dcgm-exporter container:
To gather metrics on a GPU node, simply start the `dcgm-exporter` container:
```
$ docker run -d --gpus all --rm -p 9400:9400 nvidia/dcgm-exporter:latest
$ docker run -d --gpus all --rm -p 9400:9400 nvidia/dcgm-exporter:2.0.13-2.1.0-ubuntu18.04
$ curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
Expand All @@ -35,15 +35,34 @@ DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"

### Quickstart on Kubernetes

Note: Consider using the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) rather than the DCGM exporter directly.
Note: Consider using the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) rather than DCGM-Exporter directly.

Ensure you have already set up your cluster with the [default runtime as NVIDIA](https://github.com/NVIDIA/nvidia-container-runtime#docker-engine-setup).
To gather metrics on your GPU nodes you can deploy the daemonset:

The recommended way to install DCGM-Exporter is to use the Helm chart:
```
$ helm repo add gpu-helm-charts \
https://nvidia.github.io/gpu-monitoring-tools/helm-charts
```
Update the repo:
```
$ helm repo update
```
And install the chart:
```
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/2.0.0-rc.12/dcgm-exporter.yaml
$ helm install \
--generate-name \
gpu-helm-charts/dcgm-exporter
```
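
If you need to pin a specific image or change the collected counters at install time, the chart keys shown later in this commit (`image.tag` and `arguments` in `deployment/dcgm-exporter/values.yaml`) can be overridden on the command line. A minimal sketch, assuming those key names; the tag mirrors the image referenced elsewhere in this commit:
```
# Override the image tag and the counters file at install time
$ helm install \
   --generate-name \
   gpu-helm-charts/dcgm-exporter \
   --set image.tag="2.0.13-2.1.0-ubuntu18.04" \
   --set "arguments={-f,/etc/dcgm-exporter/dcp-metrics-included.csv}"
```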

Once the `dcgm-exporter` pod is deployed, you can use port forwarding to obtain metrics quickly:


```
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/master/dcgm-exporter.yaml
# Let's get the output of a random pod:
$ NAME=$(kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter, app.kubernetes.io/version=2.0.0-rc.12" \
$ NAME=$(kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter" \
-o "jsonpath={ .items[0].metadata.name}")
$ kubectl port-forward $NAME 8080:9400 &
Expand All @@ -61,16 +80,15 @@ DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"
...
```
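
The scrape itself is collapsed above; with the port-forward in place it would look roughly like this, assuming the 8080:9400 mapping shown earlier:
```
$ curl localhost:8080/metrics
...
```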
To integrate `dcgm-exporter` with Prometheus and Grafana, see the full instructions in the [user guide](https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#gpu-telemetry).
To integrate DCGM-Exporter with Prometheus and Grafana, see the full instructions in the [user guide](https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#gpu-telemetry).
`dcgm-exporter` is deployed as part of the GPU Operator. To get started with Prometheus integration, check the Operator [user guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#gpu-telemetry).

### Building From source and Running on Bare Metal
### Building from Source

The dcgm-exporter is actually fairly straightforward to build and use.
`dcgm-exporter` is fairly straightforward to build and use.
Ensure you have the following:
- [Golang >= 1.14 installed](https://golang.org/)
- [DCGM installed](https://developer.nvidia.com/dcgm)
- On DGX, the NVIDIA Fabric Manager up and running

```
$ git clone https://github.com/NVIDIA/gpu-monitoring-tools.git
Expand All @@ -93,11 +111,11 @@ DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"
...
```
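
The remaining build and run steps are collapsed in the hunk above; a plausible continuation, based on the `binary` and `install` targets declared in the Makefile earlier in this commit (the install location is an assumption about the Makefile's install rule):
```
$ cd gpu-monitoring-tools
$ make binary          # builds the dcgm-exporter binary with the pinned Go toolchain
$ sudo make install    # installs the binary somewhere on PATH (Makefile-dependent)
$ dcgm-exporter &
$ curl localhost:9400/metrics
```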

### Changing Metrics

### Changing the Metrics

With dcgm-exporter 2.0 you can configure which fields are collected by specifying a custom CSV file.
You will find the [default CSV file here](https://github.com/NVIDIA/gpu-monitoring-tools/blob/2.0.0-rc.12/etc/dcgm-exporter/default-counters.csv) and on your system or container at /etc/dcgm-exporter/default-counters.csv
With `dcgm-exporter` you can configure which fields are collected by specifying a custom CSV file.
You will find the default CSV file under `etc/dcgm-exporter/default-counters.csv` in the repository; it is copied to your system or container at `/etc/dcgm-exporter/default-counters.csv`.

The format of this file is pretty straightforward:
```
Expand All @@ -117,13 +135,13 @@ $ dcgm-exporter -f /tmp/custom-collectors.csv

Notes:
- Always make sure your entries have 3 commas (',')
- The complete list of counters that can be collected can be found on the DCGM API reference website: https://docs.nvidia.com/datacenter/dcgm/1.7/dcgm-api/group__dcgmFieldIdentifiers.html
- The complete list of counters that can be collected can be found on the DCGM API reference manual: https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/group__dcgmFieldIdentifiers.html
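
To make the collapsed format example above concrete, here is a purely illustrative counters file; the path is arbitrary and the field names come from the metrics shown earlier plus the DCGM field identifier list linked above:
```
# Hypothetical custom counters file (not the shipped default-counters.csv)
$ cat > /tmp/custom-collectors.csv <<'EOF'
# DCGM field,            Prometheus metric type, help message
DCGM_FI_DEV_SM_CLOCK,    gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
EOF
$ dcgm-exporter -f /tmp/custom-collectors.csv
```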

### What about a Grafana Dashboard?

You can find the official NVIDIA dcgm-exporter dashboard here: https://grafana.com/grafana/dashboards/12239
You can find the official NVIDIA DCGM-Exporter dashboard here: https://grafana.com/grafana/dashboards/12239

You will also find the json file on this repo: https://github.com/NVIDIA/gpu-monitoring-tools/blob/2.0.0-rc.12/grafana/dcgm-exporter-dashboard.json
You will also find the JSON file in this repo under `grafana/dcgm-exporter-dashboard.json`.

Pull requests are accepted!

12 changes: 6 additions & 6 deletions dcgm-exporter.yaml
@@ -18,23 +18,23 @@ metadata:
name: "dcgm-exporter"
labels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.0-rc.2"
app.kubernetes.io/version: "2.1.0"
spec:
updateStrategy:
type: RollingUpdate
selector:
matchLabels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.0-rc.2"
app.kubernetes.io/version: "2.1.0"
template:
metadata:
labels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.0-rc.2"
app.kubernetes.io/version: "2.1.0"
name: "dcgm-exporter"
spec:
containers:
- image: "nvidia/dcgm-exporter:2.0.10-2.1.0-rc.2-ubuntu18.04"
- image: "nvidia/dcgm-exporter:2.0.13-2.1.0-ubuntu18.04"
env:
- name: "DCGM_EXPORTER_LISTEN"
value: ":9400"
@@ -64,11 +64,11 @@ metadata:
name: "dcgm-exporter"
labels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.0-rc.2"
app.kubernetes.io/version: "2.1.0"
spec:
selector:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.0-rc.2"
app.kubernetes.io/version: "2.1.0"
ports:
- name: "metrics"
port: 9400
17 changes: 13 additions & 4 deletions deployment/dcgm-exporter/Chart.yaml
@@ -1,8 +1,17 @@
apiVersion: v1
apiVersion: v2
name: dcgm-exporter
description: A Helm chart for DCGM exporter
version: "2.1.0"
kubeVersion: ">= 1.13.0"
appVersion: "2.1.0"
sources:
- https://gitlab.com/nvidia/container-toolkit/gpu-monitoring-tools
version: "1.1.0"
appVersion: "2.0.10"
kubeVersion: ">= 1.13.0"
home: https://github.com/nvidia/gpu-monitoring-tools/
icon: https://assets.nvidiagrid.net/ngc/logos/DCGM.png
keywords:
- gpu
- cuda
- compute
- monitoring
- telemetry
- tesla
2 changes: 1 addition & 1 deletion deployment/dcgm-exporter/values.yaml
@@ -17,7 +17,7 @@ image:
pullPolicy: IfNotPresent
# Image tag defaults to AppVersion, but you can use the tag key
# for the image tag, e.g:
tag: 2.0.10-2.1.0-rc.2-ubuntu18.04
tag: 2.0.13-2.1.0-ubuntu18.04

# Comment the following line to stop profiling metrics from DCGM
arguments: ["-f", "/etc/dcgm-exporter/dcp-metrics-included.csv"]
10 changes: 5 additions & 5 deletions docker/Dockerfile.ubi8
@@ -11,7 +11,7 @@ LABEL io.k8s.display-name="NVIDIA DCGM Exporter"

ARG DCGM_VERSION
RUN yum install -y wget libgomp && \
wget https://developer.download.nvidia.com/compute/redist/dcgm/${DCGM_VERSION}/RPMS/x86_64/datacenter-gpu-manager-${DCGM_VERSION}-1.x86_64.rpm && \
wget --no-check-certificate https://developer.download.nvidia.com/compute/redist/dcgm/${DCGM_VERSION}/RPMS/x86_64/datacenter-gpu-manager-${DCGM_VERSION}-1-x86_64.rpm && \
rpm -i ./datacenter-gpu-manager-*.rpm && \
rm ./datacenter-gpu-manager-*.rpm

Expand All @@ -29,13 +29,13 @@ LABEL name="NVIDIA DCGM Exporter"
LABEL vendor="NVIDIA"
LABEL version="${VERSION}"
LABEL release="N/A"
LABEL summary="Exports GPU Metrics in the prometheus format"
LABEL summary="Exports GPU Metrics to Prometheus"
LABEL description="See summary"

COPY ./LICENSE ./licenses/LICENSE

ENV NO_SETCAP=
COPY docker/docker-entrypoint.sh /usr/local/dcgm/docker-entrypoint.sh
RUN chmod +x /usr/local/dcgm/docker-entrypoint.sh
COPY docker/dcgm-exporter-entrypoint.sh /usr/local/dcgm/dcgm-exporter-entrypoint.sh
RUN chmod +x /usr/local/dcgm/dcgm-exporter-entrypoint.sh

ENTRYPOINT ["/usr/local/dcgm/docker-entrypoint.sh"]
ENTRYPOINT ["/usr/local/dcgm/dcgm-exporter-entrypoint.sh"]
6 changes: 3 additions & 3 deletions docker/Dockerfile.ubuntu18.04
@@ -28,7 +28,7 @@ ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility,compat32
ENV NVIDIA_VISIBLE_DEVICES=all

ENV NO_SETCAP=
COPY docker/docker-entrypoint.sh /usr/local/dcgm/docker-entrypoint.sh
RUN chmod +x /usr/local/dcgm/docker-entrypoint.sh
COPY docker/dcgm-exporter-entrypoint.sh /usr/local/dcgm/dcgm-exporter-entrypoint.sh
RUN chmod +x /usr/local/dcgm/dcgm-exporter-entrypoint.sh

ENTRYPOINT ["/usr/local/dcgm/docker-entrypoint.sh"]
ENTRYPOINT ["/usr/local/dcgm/dcgm-exporter-entrypoint.sh"]
6 changes: 3 additions & 3 deletions docker/Dockerfile.ubuntu20.04
@@ -28,7 +28,7 @@ ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility,compat32
ENV NVIDIA_VISIBLE_DEVICES=all

ENV NO_SETCAP=
COPY docker/docker-entrypoint.sh /usr/local/dcgm/docker-entrypoint.sh
RUN chmod +x /usr/local/dcgm/docker-entrypoint.sh
COPY docker/dcgm-exporter-entrypoint.sh /usr/local/dcgm/dcgm-exporter-entrypoint.sh
RUN chmod +x /usr/local/dcgm/dcgm-exporter-entrypoint.sh

ENTRYPOINT ["/usr/local/dcgm/docker-entrypoint.sh"]
ENTRYPOINT ["/usr/local/dcgm/dcgm-exporter-entrypoint.sh"]
21 changes: 21 additions & 0 deletions docker/dcgm-exporter-entrypoint.sh
@@ -0,0 +1,21 @@
#!/usr/bin/env bash
set -euo pipefail

# We want to setcap only when the container is started with the right caps
DCGM_EXPORTER=$(readlink -f $(which dcgm-exporter))
if [ -z "$NO_SETCAP" ]; then
setcap 'cap_sys_admin=+ep' $DCGM_EXPORTER
if [ $? -eq 0 ]; then
if ! $DCGM_EXPORTER -v 1>/dev/null 2>/dev/null; then
>&2 echo "Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN"
setcap 'cap_sys_admin=-ep' $DCGM_EXPORTER
fi
else
>&2 echo "Warning #1: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN"
fi

fi

# Pass the command line arguments to dcgm-exporter
set -- $DCGM_EXPORTER "$@"
exec "$@"
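
For context, a sketch of how the two switches in this script are exercised at run time; the image tag mirrors the quickstart above, and `NO_SETCAP=1` is simply an arbitrary non-empty value:
```
# Grant CAP_SYS_ADMIN so setcap sticks and profiling metrics are exposed,
# as the warnings in the script suggest:
$ docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 \
   nvidia/dcgm-exporter:2.0.13-2.1.0-ubuntu18.04

# Skip the setcap call entirely, e.g. where setcap is not permitted
# (the OpenShift failure this commit works around):
$ docker run -d --gpus all --rm -p 9400:9400 -e NO_SETCAP=1 \
   nvidia/dcgm-exporter:2.0.13-2.1.0-ubuntu18.04
```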
17 changes: 0 additions & 17 deletions docker/docker-entrypoint.sh

This file was deleted.

4 changes: 2 additions & 2 deletions service-monitor.yaml
@@ -18,12 +18,12 @@ metadata:
name: "dcgm-exporter"
labels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.0-rc.2"
app.kubernetes.io/version: "2.1.0"
spec:
selector:
matchLabels:
app.kubernetes.io/name: "dcgm-exporter"
app.kubernetes.io/version: "2.1.0-rc.2"
app.kubernetes.io/version: "2.1.0"
endpoints:
- port: "metrics"
path: "/metrics"
