graph TD
    subgraph Hardware["Hardware"]
        A[2× RTX 3090] --> B[PCIe Lanes]
        C[Google Coral TPU] --> D[USB 3.1]
    end
    subgraph Kubernetes["Kubernetes"]
        E[NVIDIA Operator] --> F[NVIDIA Container Runtime]
        F --> G[NVIDIA Device Plugin]
        G --> H[GPU Resources]
    end
    subgraph Applications["Applications"]
        I[AI Models] --> J[Ollama]
        I --> K[ComfyUI]
        I --> L[LLM Applications]
    end
    B --> E
    D --> Applications
    H --> Applications
    style A fill:#f9f,stroke:#333
    style C fill:#bbf,stroke:#333
    style E fill:#9cf,stroke:#333
- NVIDIA GPUs: 2× RTX 3090 (24GB VRAM each)
- Google Coral: USB Accelerator (TPU)
- PCIe Lanes: 64 lanes via Threadripper
- Cooling: Custom water cooling loop
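Before any Kubernetes configuration, it is worth confirming the host actually sees this hardware. A minimal check; the lsusb vendor strings are the ones the Coral typically reports and are an assumption here:

# Both RTX 3090s should be listed by the driver and visible on the PCIe bus
nvidia-smi -L
lspci | grep -i nvidia

# The Coral USB accelerator typically enumerates as "Global Unichip Corp."
# before first use and as "Google Inc." afterwards
lsusb | grep -Ei "google|unichip"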
The NVIDIA drivers are pre-installed on the host (the operator runs with driver.enabled=false), so start by verifying them:
# Verify NVIDIA drivers are installed
nvidia-smi
# Check driver version
cat /proc/driver/nvidia/version
The NVIDIA GPU Operator itself is installed via Helm as part of the infrastructure tier:
# The installation is handled by the infrastructure ApplicationSet
# The Helm chart is located at infrastructure/controllers/nvidia-gpu-operator/
# If you need to install manually:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=true
# Check that the pods are running
kubectl get pods -n gpu-operator
# Verify GPU allocation
kubectl get nodes -o json | jq '.items[].status.allocatable | select(has("nvidia.com/gpu"))'
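Beyond checking allocatable resources, a throwaway pod that requests a GPU and runs nvidia-smi exercises the whole chain (scheduler, container runtime, driver). This is a minimal sketch; the CUDA image tag is an assumption, and any image shipping nvidia-smi works:

# One-off smoke test: request a GPU and run nvidia-smi inside a pod
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed tag; any CUDA base image works
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Wait for completion, read the output, then clean up
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-smoke-test --timeout=120s
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test

Workloads then request GPUs through ordinary resource limits, as in the Ollama Deployment below: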
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
            requests:
              nvidia.com/gpu: 1
              memory: 16Gi
          volumeMounts:
            - name: ollama-models
              mountPath: /root/.ollama
      nodeSelector:
        gpu: "true"
      volumes:
        - name: ollama-models
          persistentVolumeClaim:
            claimName: ollama-models-pvc
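The Deployment relies on two objects that are not shown in the snippet: a gpu: "true" label on the GPU node and the ollama-models-pvc claim. A sketch of creating both by hand; the node name, storage class, and size are placeholders:

# Label the GPU node so the nodeSelector above can match it
kubectl label node <node-name> gpu=true

# Create the model storage the Deployment mounts
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: <storage-class>
  resources:
    requests:
      storage: 100Gi
EOF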
GPU Time-Slicing is configured through the NVIDIA Operator:
# Located in infrastructure/controllers/nvidia-gpu-operator/values.yaml
devicePlugin:
  config:
    name: "time-slicing-config"
    default: "0"
sharing:
  timeSlicing:
    renameByDefault: false
    failRequestsGreaterThanOne: false
    resources:
      - name: nvidia.com/gpu
        replicas: 4
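With replicas: 4 and two physical RTX 3090s on the node, the scheduler should see 8 allocatable nvidia.com/gpu resources once the device plugin picks up the config; a quick check using the same jq pattern as above (node name is a placeholder):

# Expect 8: 2 physical GPUs x 4 time-slicing replicas
kubectl get node <node-name> -o json | jq '.status.allocatable["nvidia.com/gpu"]'

Workloads can then request individual slices, as the ComfyUI Deployment below does: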
apiVersion: apps/v1
kind: Deployment
metadata:
  name: comfyui
  namespace: ai
spec:
  template:
    spec:
      containers:
        - name: comfyui
          resources:
            limits:
              nvidia.com/gpu: 2 # Uses 2 GPU slices (50% of one GPU)
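To confirm that slices of the same physical GPU are really being shared, list the compute processes on the host; both the Ollama and ComfyUI processes should show up against the same device (exact field names can be checked with nvidia-smi --help-query-compute-apps):

# Show every process currently on the GPUs together with its memory footprint
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv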
For best performance, use the following node selection, toleration, and priority settings:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: high-priority-ai
spec:
  template:
    spec:
      priorityClassName: high-priority
      nodeSelector:
        gpu: "true"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority AI workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority
value: 100000
globalDefault: false
description: "Medium priority AI workloads"
For Google Coral TPU:
apiVersion: v1
kind: Pod
metadata:
  name: tpu-pod
spec:
  containers:
    - name: tpu-container
      image: tensorflow/tensorflow:latest
      volumeMounts:
        - name: coral-device
          mountPath: /dev/bus/usb
  volumes:
    - name: coral-device
      hostPath:
        path: /dev/bus/usb
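Mounting /dev/bus/usb makes the bus visible, but the container usually also needs an elevated security context or matching device permissions to actually open the Coral, which the manifest above does not set. A quick check that the bus at least appears inside the pod:

# The USB bus tree should be visible at the mount path inside the container
kubectl exec tpu-pod -- ls -R /dev/bus/usb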
GPU metrics are collected by Prometheus and visualized in Grafana:
# Part of the monitoring tier
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  podMetricsEndpoints:
    - port: metrics
      interval: 15s
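Two follow-ups are worth noting. If the dcgm-exporter pods run in the gpu-operator namespace (the operator's default) rather than in monitoring, the PodMonitor also needs a namespaceSelector to reach them. And once metrics flow, the standard dcgm-exporter series below are the usual starting points for Grafana panels; the port-forward service name assumes a Prometheus Operator deployment:

# Useful dcgm-exporter series (default counter set):
#   DCGM_FI_DEV_GPU_UTIL  - per-GPU utilisation (%)
#   DCGM_FI_DEV_FB_USED   - framebuffer memory in use (MiB)
#   DCGM_FI_DEV_GPU_TEMP  - GPU temperature (C)

# Quick check that Prometheus is scraping them
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq '.data.result | length'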
Common issues and checks:

- GPU Not Detected

  # Check NVIDIA driver status
  nvidia-smi

  # Check device plugin pods
  kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

- Resource Allocation Issues

  # Check GPU allocation
  kubectl describe node <node-name> | grep nvidia.com/gpu

  # Check pod resource requests
  kubectl describe pod <pod-name> -n <namespace>

- Container Runtime Problems

  # Check NVIDIA runtime configuration
  kubectl get cm -n gpu-operator nvidia-container-toolkit-config -o yaml

  # Verify container runtime
  kubectl get pods -n gpu-operator -l app=nvidia-container-toolkit-daemonset
If a GPU becomes unavailable:
# Restart NVIDIA device plugin
kubectl rollout restart ds -n gpu-operator nvidia-device-plugin-daemonset
# Verify GPU resources are available
kubectl get nodes -o json | jq '.items[].status.allocatable | select(has("nvidia.com/gpu"))'
# Check GPU node status
kubectl describe node <node-name> | grep nvidia
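If restarting the device plugin does not bring the GPU back, the host's kernel log usually records the underlying fault as an NVIDIA Xid event; run these on the GPU node itself:

# Look for NVIDIA Xid errors (driver or hardware faults) in the kernel log
dmesg | grep -i xid

# Recent NVIDIA-related kernel messages
journalctl -k --since "1 hour ago" | grep -iE "nvidia|nvrm"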