100 changes: 100 additions & 0 deletions examples/keda-autoscaling/README.md
# KEDA Autoscaling for ModelServing

Autoscale ModelServing instances using Prometheus metrics and [KEDA](https://keda.sh/).

**Pipeline:** Prometheus -> KEDA -> HPA -> ModelServing -> Pod Scaling

## Prerequisites

- Kubernetes cluster with [Kthena (Volcano)](https://github.com/volcano-sh/volcano) installed
- [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) (for ServiceMonitor/PodMonitor CRDs)
- [KEDA](https://keda.sh/docs/deploy/) v2.x installed in the cluster
- A deployed ModelServing instance with kthena-router

## Manifests

| File | Purpose |
|------|---------|
| `servicemonitor-router.yaml` | Scrapes metrics from kthena-router (port 8080) |
| `podmonitor-inference.yaml` | Scrapes vLLM metrics from inference pods (port 8000) |
| `keda-scaledobject.yaml` | KEDA ScaledObject that drives HPA based on Prometheus queries |

## Deployment

### 1. Deploy monitoring targets

```bash
kubectl apply -f examples/keda-autoscaling/servicemonitor-router.yaml
kubectl apply -f examples/keda-autoscaling/podmonitor-inference.yaml
```

Verify metrics are being scraped in your Prometheus UI.
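You can also check from the command line against the Prometheus HTTP API. This is a sketch assuming the in-cluster service address used in `keda-scaledobject.yaml`; adjust `PROM` (or `kubectl port-forward` first) for your deployment.

```shell
# Assumed in-cluster Prometheus address; adjust to your deployment.
PROM=http://prometheus.monitoring.svc.cluster.local:9090
QUERY='sum(kthena_router_active_downstream_requests{model="my-model-name"})'

# -G with --data-urlencode handles the braces/quotes in the PromQL expression.
# A non-empty .data.result array means the router metric is being scraped:
#   curl -sG "$PROM/api/v1/query" --data-urlencode "query=$QUERY" | jq '.data.result'
echo "checking: $QUERY"
```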

### 2. Configure the ScaledObject

Edit `keda-scaledobject.yaml` before applying:

- **`spec.scaleTargetRef.name`** — set to your ModelServing resource name
- **`model` label in router query** — set to the model name your ModelServing instance serves (e.g. `"deepseek-ai/DeepSeek-R1"`)
- **`model_serving` label in vLLM query** — set to your ModelServing `metadata.name`
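For reference, the three fields sit here in the manifest (abridged excerpt; the values shown are the placeholders you replace):

```yaml
# Abridged excerpt of keda-scaledobject.yaml; replace placeholder values.
spec:
  scaleTargetRef:
    name: my-modelserving            # your ModelServing resource name
  triggers:
    - type: prometheus
      metadata:
        # model_serving label: your ModelServing metadata.name
        query: avg(vllm:num_requests_waiting{model_serving="my-modelserving"})
    - type: prometheus
      metadata:
        # model label: the model name this instance serves
        query: sum(kthena_router_active_downstream_requests{model="my-model-name"})
```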

```bash
kubectl apply -f examples/keda-autoscaling/keda-scaledobject.yaml
```

### 3. Verify

```bash
# Check KEDA created the HPA
kubectl get hpa

# Check ScaledObject status
kubectl get scaledobject modelserving-scaler

# Watch pods scale
kubectl get pods -w
```

## Prometheus Queries Explained

### Router active requests (per model)

```promql
sum(kthena_router_active_downstream_requests{model="my-model-name"})
```

The `kthena_router_active_downstream_requests` gauge tracks currently active client requests to the router. The `model` label identifies which model the request targets, so filtering by `model` ensures each ModelServing instance scales based only on **its own traffic** — not aggregate load across all models.

### vLLM pending requests (per ModelServing)

```promql
avg(vllm:num_requests_waiting{model_serving="my-modelserving"})
```

The `vllm:num_requests_waiting` gauge is exposed by each vLLM inference pod. Filtering by `model_serving` scopes the average to pods belonging to a specific ModelServing instance.

## Per-Model Scaling

The kthena-router supports routing to multiple models simultaneously. Without per-model filtering, a single busy model would trigger scaling for **all** ModelServing instances. Each ScaledObject must filter metrics to the specific model/ModelServing it manages:

- Create **one ScaledObject per ModelServing** instance
- Set the `model` label filter to match the model name served by that instance
- Adjust `threshold`, `minReplicaCount`, and `maxReplicaCount` per model based on expected load
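As a sketch, a second ScaledObject for a hypothetical second instance (all names and thresholds below are placeholders, not values from this repo):

```yaml
# Hypothetical second instance; replace names/thresholds for your deployment.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: modelserving-scaler-chat      # one ScaledObject per ModelServing
spec:
  scaleTargetRef:
    apiVersion: workload.serving.volcano.sh/v1alpha1
    kind: ModelServing
    name: my-chat-modelserving
  minReplicaCount: 1
  maxReplicaCount: 4                  # sized for this model's expected load
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        # Filter on this instance's model so it ignores other models' traffic.
        query: sum(kthena_router_active_downstream_requests{model="my-chat-model"})
        threshold: "10"
```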

## Testing

Generate load against a specific model and watch scaling:

```bash
# Send requests to trigger scaling
for i in $(seq 1 50); do
  curl -s http://<router-endpoint>/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "my-model-name", "prompt": "Hello", "max_tokens": 100}' &
done

# Watch the HPA react
kubectl get hpa -w
```

The ScaledObject's `cooldownPeriod: 120` means pods scale down 2 minutes after load drops below the threshold.
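To build intuition for picking `threshold`, the replica target the KEDA-managed HPA computes is, in simplified form, `ceil(metricValue / threshold)` clamped to the replica bounds (ignoring stabilization windows and the cooldown). A quick sketch with the example values from `keda-scaledobject.yaml`:

```shell
# Simplified HPA target: ceil(metric / threshold), clamped to [min, max].
# Ignores stabilization windows and cooldownPeriod.
metric=42        # e.g. active router requests
threshold=20     # router trigger threshold from keda-scaledobject.yaml
min_r=1
max_r=10

want=$(( (metric + threshold - 1) / threshold ))   # integer ceiling
if [ "$want" -lt "$min_r" ]; then want=$min_r; fi
if [ "$want" -gt "$max_r" ]; then want=$max_r; fi
echo "desired replicas: $want"   # 42 requests at threshold 20 -> 3
```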
31 changes: 31 additions & 0 deletions examples/keda-autoscaling/keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: modelserving-scaler
spec:
  scaleTargetRef:
    apiVersion: workload.serving.volcano.sh/v1alpha1
    kind: ModelServing
    name: my-modelserving
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 120
  pollingInterval: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        # Scope to pods belonging to this specific ModelServing instance.
        # Replace "my-modelserving" with your ModelServing metadata.name.
        query: avg(vllm:num_requests_waiting{model_serving="my-modelserving"})
        threshold: "5"
        metricName: vllm_requests_waiting_avg
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        # Scope to the specific model served by this ModelServing instance.
        # The "model" label on router metrics identifies the model name
        # (e.g. "deepseek-ai/DeepSeek-R1"). Replace with your model name.
        query: sum(kthena_router_active_downstream_requests{model="my-model-name"})
        threshold: "20"
        metricName: router_active_downstream_requests
15 changes: 15 additions & 0 deletions examples/keda-autoscaling/podmonitor-inference.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: inference-pods
  labels:
    app.kubernetes.io/component: inference
spec:
  selector:
    matchLabels:
      modelserving.volcano.sh/entry: "true"
  podMetricsEndpoints:
    - port: http
      targetPort: 8000
**Review comment (Copilot AI, Mar 20, 2026) on lines +12 to +13:** The PodMonitor endpoint sets `port: http`, but ModelServing-generated inference pods/templates don't name the container port `http` (they typically only set `containerPort: 8000` without a name). With a named-port mismatch, Prometheus Operator won't be able to resolve the scrape target. Either name the metrics port `http` in the pod spec, remove `port:` and rely on `targetPort`, or set `port` to the actual named port used by the inference pods. Suggested change (replace the two endpoint port lines):

    - targetPort: 8000
      path: /metrics
      interval: 15s
**Review comment (critical) on lines +12 to +15:** The `targetPort` field is not a valid field within a `podMetricsEndpoints` item for a PodMonitor resource. Its inclusion will likely make this manifest invalid and prevent metrics from being scraped. The `port` field is sufficient. Please remove the `targetPort` line:

    - port: http
      path: /metrics
      interval: 15s
14 changes: 14 additions & 0 deletions examples/keda-autoscaling/servicemonitor-router.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
**Review comment (Copilot AI, Mar 20, 2026) on `metadata:`:** All three example manifests omit `metadata.namespace`. In this repo, most example YAMLs set an explicit namespace (e.g. `kthena-system`/`default`), and for ServiceMonitor/PodMonitor it also affects discovery, because they only select targets within their own namespace unless `spec.namespaceSelector` is configured. Consider adding an explicit namespace (and a `namespaceSelector` if you expect scraping across namespaces) so the example applies correctly out of the box:

    metadata:
      namespace: kthena-system
  name: kthena-router
  labels:
    app.kubernetes.io/component: kthena-router
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: kthena-router
**Review comment (Copilot AI, Mar 20, 2026):** `ServiceMonitor.spec.selector.matchLabels` matches any Service with `app.kubernetes.io/component: kthena-router`. The Helm chart also labels the kthena-router-webhook Service with the same component label, but that Service exposes only the webhook port (no `http` metrics port). This can lead to failed scrapes and confusing Prometheus targets. Prefer a label that uniquely identifies the metrics Service, or add a dedicated label on the router metrics Service and match on that. Suggested change:

    app.kubernetes.io/name: kthena-router
  endpoints:
    - port: http
      path: /metrics
      interval: 15s