A comprehensive Helm chart for monitoring GPU resources in Kubernetes clusters. This tool provides real-time visibility into GPU allocation, utilization, memory usage, and pod status through an integrated Prometheus and Grafana monitoring stack.
The GPU Assessment Tool helps you:
- Monitor GPU allocation: Track total vs. allocated GPUs across your cluster
- Measure GPU utilization: View real-time GPU compute utilization percentages
- Track memory usage: Monitor GPU memory consumption and availability
- Observe pod status: See running and pending GPU-enabled pods
- Filter by GPU type: Dynamic filtering by GPU model (e.g., A100, V100)
The tool uses NVIDIA DCGM (Data Center GPU Manager) metrics collected by Prometheus and visualized through a pre-configured Grafana dashboard.
Before installing the GPU Assessment Tool, ensure you have:
- Kubernetes cluster (v1.19+)
- Helm 3 installed on your local machine
- NVIDIA GPU Operator (specifically, the DCGM exporter) deployed in your cluster
- kubectl configured to access your cluster
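The version requirements above can be checked with a small helper. This is a minimal sketch, not part of the chart: the function name `version_at_least` is hypothetical, and it assumes a `sort` that supports `-V` (GNU coreutils and recent BSD/macOS versions do).

```shell
#!/usr/bin/env bash
# version_at_least ACTUAL MINIMUM   (hypothetical helper, not shipped with the chart)
# Succeeds when ACTUAL >= MINIMUM. A leading "v" (as in "v1.28.3") is ignored.
version_at_least() {
  local actual="${1#v}" minimum="${2#v}"
  # With version sort, the smaller version comes first; the check passes
  # when that smallest entry is the required minimum.
  [ "$(printf '%s\n%s\n' "$minimum" "$actual" | sort -V | head -n 1)" = "$minimum" ]
}

# Example usage (assumes jq is installed; adjust to your environment):
#   version_at_least "$(kubectl version -o json | jq -r .serverVersion.gitVersion)" 1.19 \
#     || echo "Kubernetes server is older than 1.19"
#   version_at_least "$(helm version --template '{{.Version}}')" 3.0 \
#     || echo "Helm is older than 3.0"
```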
Ensure DCGM metrics are available in your cluster:
# Check if DCGM exporter pods are running
kubectl get pods -A | grep dcgm
# Verify metrics are being exposed
kubectl port-forward -n <dcgm-namespace> <dcgm-pod-name> 9400:9400
# In a second terminal (port-forward keeps running in the foreground)
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV
First, update the Helm dependencies to download the Prometheus and Grafana charts:
helm dependency update
This will download the required charts into the charts/ directory.
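To confirm that the dependency archives were actually downloaded, you can look for them in charts/. The sketch below assumes Helm's usual `<name>-<version>.tgz` archive naming; the function name `missing_dependency_charts` is hypothetical.

```shell
#!/usr/bin/env bash
# missing_dependency_charts CHARTS_DIR NAME...   (hypothetical helper)
# Prints every dependency NAME that has no NAME-*.tgz archive in CHARTS_DIR
# and returns non-zero if at least one is missing.
missing_dependency_charts() {
  local dir="$1" missing=0 name
  shift
  for name in "$@"; do
    if ! ls "$dir"/"$name"-*.tgz >/dev/null 2>&1; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage, run from the chart directory:
#   missing_dependency_charts charts prometheus grafana \
#     || echo "Run 'helm dependency update' again"
```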
Install the chart with default configuration:
helm install gpu-assessment-tool . --namespace gpu-assessment-tool --create-namespace
Or install with custom values:
helm install gpu-assessment-tool . \
  --namespace gpu-assessment-tool \
  --create-namespace \
  --values custom-values.yaml
After installation, access the Grafana dashboard:
# Port-forward to Grafana service
kubectl port-forward -n gpu-assessment-tool svc/gpu-assessment-tool-grafana 3000:80
Open your browser and navigate to: http://localhost:3000
The GPU Assessment dashboard will automatically load as the home dashboard.
To edit the dashboards, log in with the following credentials:
- Username: admin
- Password: admin
The values.yaml file contains the default configuration.
By default, the installation deploys one Prometheus pod and one Grafana pod.
If Prometheus is not installed on your cluster, the kube-state-metrics exporter is probably missing as well. In that case, enable it:
prometheus:
  kube-state-metrics:
    enabled: true
Note: Enabling kube-state-metrics when one is already installed on your cluster might cause duplicate metrics.
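Before enabling the bundled exporter, it can help to check whether a kube-state-metrics pod already runs in the cluster. The sketch below is an assumption-laden illustration: the function name `count_ksm_pods` is hypothetical, it relies on the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...), and it only detects pods whose name contains `kube-state-metrics`.

```shell
#!/usr/bin/env bash
# count_ksm_pods   (hypothetical helper)
# Reads `kubectl get pods -A` output on stdin and prints the number of
# Running pods whose name contains "kube-state-metrics".
count_ksm_pods() {
  awk '$2 ~ /kube-state-metrics/ && $4 == "Running" { n++ } END { print n + 0 }'
}

# Example usage:
#   if [ "$(kubectl get pods -A | count_ksm_pods)" -gt 0 ]; then
#     echo "kube-state-metrics already present; leave the bundled one disabled"
#   fi
```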
If you already have Prometheus running in your cluster, we recommend using it because it already holds historical data. To do so, disable the bundled Prometheus installation and set your Prometheus service endpoint:
prometheus:
  enabled: false # Disable built-in Prometheus
global:
  prometheusUrl: "http://my-prometheus-server.monitoring.svc:9090"
If the dashboards respond slowly, try increasing the resources:
prometheus:
  resources:
    limits:
      cpu: 1000m
      memory: 4096Mi
    requests:
      cpu: 200m
      memory: 1024Mi
grafana:
  resources:
    limits:
      cpu: 500m
      memory: 2048Mi
    requests:
      cpu: 100m
      memory: 512Mi
If you plan on exposing the dashboard, changing the credentials is recommended:
grafana:
  adminUser: your-admin-user
  adminPassword: your-secure-password
The GPU Assessment Dashboard provides:
- Time-series graph showing total GPUs vs. allocated GPUs
- Percentage gauge of GPU allocation
- Average GPU compute utilization across all GPUs
- Real-time percentage display
- Threshold indicators (green: >80%, yellow: 50-80%, red: <50%)
- Total memory capacity vs. used memory in mebibytes
- Memory usage percentage
- Count of running pods using GPUs
- Count of pods waiting for GPU resources
- Helps identify resource constraints
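The allocation percentage and the utilization color thresholds listed above can be reproduced with a few lines of shell. This is an illustrative sketch only: the function names are hypothetical, and the dashboard itself computes these values from Prometheus queries that are not shown in this document.

```shell
#!/usr/bin/env bash
# allocation_percent ALLOCATED TOTAL   (hypothetical helper)
# Prints allocated GPUs as a percentage of total GPUs, or "n/a" when TOTAL is 0.
allocation_percent() {
  awk -v a="$1" -v t="$2" 'BEGIN {
    if (t <= 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * a / t
  }'
}

# utilization_color PERCENT   (hypothetical helper)
# Maps a utilization percentage to the threshold colors described above:
# green above 80%, yellow from 50% to 80%, red below 50%.
utilization_color() {
  awk -v p="$1" 'BEGIN {
    if (p > 80) print "green"
    else if (p >= 50) print "yellow"
    else print "red"
  }'
}

# Example usage:
#   allocation_percent 6 8
#   utilization_color 85
```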
To remove the GPU Assessment Tool:
helm uninstall gpu-assessment-tool --namespace gpu-assessment-tool
To also remove the namespace:
kubectl delete namespace gpu-assessment-tool
- Verify DCGM exporter is running:
kubectl get pods -A | grep dcgm
- Check Prometheus is scraping DCGM metrics (use the namespace the chart was installed into):
kubectl logs -n gpu-assessment-tool deployment/gpu-assessment-tool-prometheus-server
- Ensure Prometheus has the correct ServiceMonitor or scrape configuration for DCGM
- Check Prometheus data source connection in Grafana
- Verify the Prometheus URL is correct
- Confirm DCGM metrics are available: DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_GPU_UTIL
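One way to confirm that both metrics are present is to scan the exporter's text output for sample lines. The helper below is a sketch under assumptions: the name `missing_dcgm_metrics` is hypothetical, and it expects Prometheus text-format metrics on stdin, such as the output of the `curl` command from the prerequisites check.

```shell
#!/usr/bin/env bash
# missing_dcgm_metrics   (hypothetical helper)
# Reads Prometheus text-format metrics on stdin and prints each required
# DCGM metric that has no sample line. Prints nothing when both are present.
missing_dcgm_metrics() {
  local input metric
  input="$(cat)"
  for metric in DCGM_FI_DEV_FB_FREE DCGM_FI_DEV_GPU_UTIL; do
    # A sample line starts with the metric name followed by "{" or a space;
    # "# HELP" and "# TYPE" comment lines do not count.
    if ! printf '%s\n' "$input" | grep -Eq "^${metric}(\{| )"; then
      echo "$metric"
    fi
  done
}

# Example usage (with the DCGM exporter port-forwarded to localhost:9400):
#   curl -s http://localhost:9400/metrics | missing_dcgm_metrics
```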
Check resource availability:
kubectl describe pod -n gpu-assessment-tool <pod-name>
The tool consists of four main components:
- DCGM Exporter: Exposes NVIDIA GPU metrics (external - deployed via GPU Operator)
- kube-state-metrics: Exposes Kubernetes pod and resource metrics
- Prometheus: Collects and stores metrics from DCGM and kube-state-metrics
- Grafana: Provides visualization through the GPU Assessment Dashboard
┌─────────────────┐     ┌──────────────────┐
│  DCGM Exporter  │     │   kube-state-    │
│                 │     │     metrics      │
└────────┬────────┘     └────────┬─────────┘
         │ GPU Metrics           │ K8s Metrics
         │                       │
         └───────────┬───────────┘
                     │
                     ▼
            ┌─────────────────┐
            │   Prometheus    │  Scrapes & Stores Metrics
            └────────┬────────┘
                     │ Queries
                     ▼
            ┌─────────────────┐
            │     Grafana     │  Visualizes Dashboard
            └─────────────────┘
| Component | Version | Required |
|---|---|---|
| Kubernetes | 1.19+ | Yes |
| Helm | 3.0+ | Yes |
| DCGM Exporter | --- | Yes |
| Prometheus | 27.45.0 (included) | Yes |
| Grafana | 10.1.4 (included) | Yes |
For issues, questions, or contributions, please contact your cluster administrator or refer to:
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this project except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
