
Commit

Setcap only when the right permissions are passed in at runtime
To support profiling metrics, dcgm-exporter needs at least cap_sys_admin. Making this the default prevents users from running the dcgm-exporter image
when they don't provide this capability.
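The runtime check this commit message describes can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: it assumes the decision is made by inspecting the process's capability bounding set (the `CapBnd` line of `/proc/self/status`), and the helper names `parseCapMask` and `hasCapability` are hypothetical. The only fixed fact used is that `CAP_SYS_ADMIN` is capability number 21 on Linux.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capSysAdmin is the bit index of CAP_SYS_ADMIN in Linux capability masks.
const capSysAdmin = 21

// parseCapMask (hypothetical helper) extracts a capability mask such as
// "CapBnd" or "CapEff" from the text of /proc/<pid>/status, where each mask
// is printed as a hexadecimal number after the field name.
func parseCapMask(status, field string) (uint64, error) {
	prefix := field + ":"
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, prefix) {
			value := strings.TrimSpace(strings.TrimPrefix(line, prefix))
			return strconv.ParseUint(value, 16, 64)
		}
	}
	return 0, fmt.Errorf("field %q not found", field)
}

// hasCapability (hypothetical helper) reports whether the given bit is set.
func hasCapability(mask uint64, bit uint) bool {
	return mask&(uint64(1)<<bit) != 0
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	// Assumption: the bounding set is what reflects capabilities granted
	// to the container at runtime.
	mask, err := parseCapMask(string(data), "CapBnd")
	if err != nil {
		fmt.Println("cannot parse capabilities:", err)
		return
	}
	if hasCapability(mask, capSysAdmin) {
		fmt.Println("cap_sys_admin available: profiling metrics can be enabled")
	} else {
		fmt.Println("cap_sys_admin not available: skip setcap and run without profiling metrics")
	}
}
```

The point of such a check is that the image stays runnable without extra privileges, and the elevated capability is applied only when the user has granted it to the container.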
dualvtable authored and guptaNswati committed Sep 29, 2020
1 parent 3a29257 commit a26a2fe
Showing 16 changed files with 3,820 additions and 3,419 deletions.
4 changes: 2 additions & 2 deletions Makefile
@@ -16,9 +16,9 @@ DOCKER ?= docker
MKDIR ?= mkdir
REGISTRY ?= nvidia

-DCGM_VERSION := 1.7.2
+DCGM_VERSION := 2.0.10
GOLANG_VERSION := 1.14.2
-VERSION := 2.0.0-rc.7
+VERSION := 2.1.0-rc.1
FULL_VERSION := $(DCGM_VERSION)-$(VERSION)

.PHONY: all binary install check-format
47 changes: 4 additions & 43 deletions README.md
@@ -4,13 +4,13 @@

This Github repository contains Golang bindings for the following two libraries:
- [NVIDIA Management Library (NVML)](https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference) is a C-based API for monitoring and managing NVIDIA GPU devices.
- [NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/data-center-gpu-manager-dcgm) is a set of tools for managing and monitoring NVIDIA GPUs in cluster environments. It's a low overhead tool suite that performs a variety of functions on each host system including active health monitoring, diagnostics, system validation, policies, power and clock management, group configuration and accounting.
- [NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm) is a set of tools for managing and monitoring NVIDIA GPUs in cluster environments. It's a low overhead tool suite that performs a variety of functions on each host system including active health monitoring, diagnostics, system validation, policies, power and clock management, group configuration and accounting.

You will also find samples for both of these bindings in this repository.

## DCGM exporter

This Github repository also contains the DCGM exporter software. It exposes GPU metrics exporter for [Prometheus](https://prometheus.io/) leveraging [NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/data-center-gpu-manager-dcgm).
This Github repository also contains the DCGM exporter software. It exposes GPU metrics exporter for [Prometheus](https://prometheus.io/) leveraging [NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm).

Find the installation and run instructions [here](https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/README.md).

@@ -60,48 +60,9 @@ DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",c
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 9223372036854775794
...
# If you are using the Prometheus operator
# Note on exporters here:
# https://github.com/coreos/prometheus-operator/blob/release-0.38/Documentation/user-guides/running-exporters.md
$ helm repo add stable https://kubernetes-charts.storage.googleapis.com
$ helm install stable/prometheus-operator --generate-name \
--set "prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false"
$ kubectl create -f \
https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/2.0.0-rc.12/service-monitor.yaml
# Note: it might take ~1-2 minutes for Prometheus to pick up the metrics and display them
# You can also check the service-discovery tab (in the Status category) of the web UI
$ NAME=$(kubectl get svc -l app=prometheus-operator-prometheus -o jsonpath='{.items[0].metadata.name}')
$ kubectl port-forward $NAME 9090:9090 &
$ curl -sL "http://127.0.0.1:9090/api/v1/query?query=DCGM_FI_DEV_MEMORY_TEMP"
{
status: "success",
data: {
resultType: "vector",
result: [
{
metric: {
UUID: "GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",
__name__: "DCGM_FI_DEV_MEMORY_TEMP",
__container__: "",
__pod__: "",
__namespace__: "",
...
pod: "dcgm-exporter-fn7fm",
service: "dcgm-exporter"
},
value: [
1588399049.227,
"9223372036854776000"
]
},
...
]
}
}
```

To integrate `dcgm-exporter` with Prometheus and Grafana, see the full instructions in the [user guide](https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#gpu-telemetry).
`dcgm-exporter` is deployed as part of the GPU Operator. To get started with integrating with Prometheus, check the Operator [user guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#gpu-telemetry).

### Building From source and Running on Bare Metal

2 changes: 1 addition & 1 deletion bindings/go/dcgm/admin.go
@@ -57,7 +57,7 @@ var (

func initDcgm(m mode, args ...string) (err error) {
const (
-dcgmLib = "libdcgm.so.1"
+dcgmLib = "libdcgm.so"
)
lib := C.CString(dcgmLib)
defer freeCString(lib)
1,588 changes: 891 additions & 697 deletions bindings/go/dcgm/dcgm_agent.h

Large diffs are not rendered by default.

636 changes: 345 additions & 291 deletions bindings/go/dcgm/dcgm_errors.h

Large diffs are not rendered by default.
