feat: add KEDA autoscaling example manifests #831
# KEDA Autoscaling for ModelServing

Autoscale ModelServing instances using Prometheus metrics and [KEDA](https://keda.sh/).

**Pipeline:** Prometheus -> KEDA -> HPA -> ModelServing -> Pod Scaling

## Prerequisites

- Kubernetes cluster with [Kthena (Volcano)](https://github.com/volcano-sh/volcano) installed
- [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) (for the ServiceMonitor/PodMonitor CRDs)
- [KEDA](https://keda.sh/docs/deploy/) v2.x installed in the cluster
- A deployed ModelServing instance with kthena-router

## Manifests

| File | Purpose |
|------|---------|
| `servicemonitor-router.yaml` | Scrapes metrics from kthena-router (port 8080) |
| `podmonitor-inference.yaml` | Scrapes vLLM metrics from inference pods (port 8000) |
| `keda-scaledobject.yaml` | KEDA ScaledObject that drives an HPA based on Prometheus queries |
## Deployment

### 1. Deploy monitoring targets

```bash
kubectl apply -f examples/keda-autoscaling/servicemonitor-router.yaml
kubectl apply -f examples/keda-autoscaling/podmonitor-inference.yaml
```

Verify in your Prometheus UI that metrics are being scraped.

### 2. Configure the ScaledObject

Edit `keda-scaledobject.yaml` before applying:

- **`spec.scaleTargetRef.name`** — set to your ModelServing resource name
- **`model` label in the router query** — set to the model name your ModelServing instance serves (e.g. `"deepseek-ai/DeepSeek-R1"`)
- **`model_serving` label in the vLLM query** — set to your ModelServing `metadata.name`

```bash
kubectl apply -f examples/keda-autoscaling/keda-scaledobject.yaml
```
### 3. Verify

```bash
# Check that KEDA created the HPA
kubectl get hpa

# Check the ScaledObject status
kubectl get scaledobject modelserving-scaler

# Watch pods scale
kubectl get pods -w
```
## Prometheus Queries Explained

### Router active requests (per model)

```promql
sum(kthena_router_active_downstream_requests{model="my-model-name"})
```

The `kthena_router_active_downstream_requests` gauge tracks currently active client requests to the router. The `model` label identifies which model each request targets, so filtering by `model` ensures each ModelServing instance scales based only on **its own traffic**, not on aggregate load across all models.
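Before choosing per-model thresholds, it can help to see the live load broken down across all models; the same gauge can be grouped rather than filtered (run in the Prometheus UI):

```promql
sum by (model) (kthena_router_active_downstream_requests)
```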
### vLLM pending requests (per ModelServing)

```promql
avg(vllm:num_requests_waiting{model_serving="my-modelserving"})
```

The `vllm:num_requests_waiting` gauge is exposed by each vLLM inference pod. Filtering by `model_serving` scopes the average to pods belonging to a specific ModelServing instance.
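Similarly, grouping this gauge by its scoping label gives a side-by-side view of queue pressure across instances, which is useful when tuning per-instance thresholds:

```promql
avg by (model_serving) (vllm:num_requests_waiting)
```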
## Per-Model Scaling

The kthena-router supports routing to multiple models simultaneously. Without per-model filtering, a single busy model would trigger scaling for **all** ModelServing instances, so each ScaledObject must filter its metrics down to the specific model/ModelServing it manages:

- Create **one ScaledObject per ModelServing instance**
- Set the `model` label filter to match the model name served by that instance
- Adjust `threshold`, `minReplicaCount`, and `maxReplicaCount` per model based on expected load
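To reason about what a given `threshold` implies, note that KEDA exposes each Prometheus trigger to the HPA as an external metric with an `AverageValue` target, and when multiple triggers fire the HPA acts on the highest proposed replica count. A rough sketch of the resulting arithmetic (illustrative only; the function and its simplifications are ours, not KEDA's API):

```python
import math

def desired_replicas(metric_total: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate HPA AverageValue math: ceil(metric / threshold),
    clamped to the ScaledObject's replica bounds."""
    return max(min_replicas, min(max_replicas, math.ceil(metric_total / threshold)))

# 60 active router requests against threshold "20" -> 3 replicas
print(desired_replicas(60, 20, min_replicas=1, max_replicas=10))
```

So with `threshold: "20"` on the router trigger, roughly 60 concurrent requests drive an instance to 3 replicas, within the configured bounds.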
## Testing

Generate load against a specific model and watch the scaling react:

```bash
# Send concurrent requests to push the metrics over their thresholds
for i in $(seq 1 50); do
  curl -s http://<router-endpoint>/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "my-model-name", "prompt": "Hello", "max_tokens": 100}' &
done

# Watch the HPA react
kubectl get hpa -w
```
A note on scale-down timing: KEDA's `cooldownPeriod: 120` applies to scaling the workload back to zero after the last trigger stops reporting active. Since this example sets `minReplicaCount: 1`, scale-down between 1 and 10 replicas is governed by the underlying HPA's standard downscale stabilization rather than by `cooldownPeriod`.
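For finer control of scale-down pacing between `minReplicaCount` and `maxReplicaCount`, KEDA can pass HPA scaling behavior through its `advanced` block; a sketch with illustrative values:

```yaml
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300  # require 5 min of low load before shrinking
          policies:
            - type: Percent
              value: 50          # remove at most half the pods per period
              periodSeconds: 60
```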
## `keda-scaledobject.yaml`

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: modelserving-scaler
spec:
  scaleTargetRef:
    apiVersion: workload.serving.volcano.sh/v1alpha1
    kind: ModelServing
    name: my-modelserving
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 120
  pollingInterval: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        # Scope to pods belonging to this specific ModelServing instance.
        # Replace "my-modelserving" with your ModelServing metadata.name.
        query: avg(vllm:num_requests_waiting{model_serving="my-modelserving"})
        threshold: "5"
        metricName: vllm_requests_waiting_avg
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        # Scope to the specific model served by this ModelServing instance.
        # The "model" label on router metrics identifies the model name
        # (e.g. "deepseek-ai/DeepSeek-R1"). Replace with your model name.
        query: sum(kthena_router_active_downstream_requests{model="my-model-name"})
        threshold: "20"
        metricName: router_active_downstream_requests
```
## `podmonitor-inference.yaml`

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: inference-pods
  labels:
    app.kubernetes.io/component: inference
spec:
  selector:
    matchLabels:
      modelserving.volcano.sh/entry: "true"
  podMetricsEndpoints:
    - port: http
      targetPort: 8000
      path: /metrics
      interval: 15s
```
A review comment was left on lines +12 to +15 of this file (the `podMetricsEndpoints` block); see the Copilot note below.
## `servicemonitor-router.yaml`

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  # ... (remainder truncated in the captured diff)
```

**Copilot AI** (Mar 20, 2026) suggested pinning the resource to an explicit namespace:

```diff
 metadata:
+  namespace: kthena-system
```

**Copilot AI:** `ServiceMonitor.spec.selector.matchLabels` matches any Service with `app.kubernetes.io/component: kthena-router`. The Helm chart also labels the `kthena-router-webhook` Service with the same component label, but that Service exposes only the webhook port (no `http` metrics port). This can lead to failed scrapes and confusing Prometheus targets. Prefer selecting a label that uniquely identifies the metrics Service, or add a dedicated label on the router metrics Service and match on that:

```diff
-      app.kubernetes.io/component: kthena-router
+      app.kubernetes.io/name: kthena-router
```

**Copilot AI** (on `podmonitor-inference.yaml`, lines +12 to +15): The PodMonitor endpoint sets `port: http`, but ModelServing-generated inference pods/templates don't name the container port `http` (they typically only set `containerPort: 8000` without a `name`). With a named-port mismatch, Prometheus Operator won't be able to resolve the scrape target. Either name the metrics port `http` in the pod spec, remove `port:` and rely on `targetPort`, or set `port` to the actual named port used by the inference pods.