feat: add KEDA autoscaling example manifests #831
WHOIM1205 wants to merge 2 commits into volcano-sh:main
Conversation
Add ScaledObject, PodMonitor, and ServiceMonitor manifests for KEDA-based autoscaling of inference workloads.

Signed-off-by: WHOIM1205 <[email protected]>
Summary of Changes (Gemini Code Assist)

This pull request introduces a prototype for integrating KEDA-based external autoscaling with ModelServing. It sets up the necessary monitoring components, including a ServiceMonitor for the kthena-router and a PodMonitor for inference pods, to expose relevant metrics to Prometheus. An example ScaledObject is also provided, demonstrating how ModelServing can be scaled dynamically based on these Prometheus metrics, specifically targeting backend queue pressure and incoming request load. The primary goal is to validate this approach for future, more complete implementations.
Code Review
This pull request introduces valuable example manifests for integrating KEDA-based autoscaling with ModelServing. The changes are well-structured and provide a good starting point. My review includes suggestions to enhance the robustness of the Prometheus queries in the ScaledObject to ensure correct behavior in multi-model/multi-instance deployments. I've also identified and proposed a fix for a critical issue in the PodMonitor configuration that would prevent metrics from being scraped.
```yaml
- port: http
  targetPort: 8000
  path: /metrics
  interval: 15s
```
The targetPort field is not a valid field within a podMetricsEndpoints item for a PodMonitor resource. Its inclusion will likely make this manifest invalid and prevent metrics from being scraped. The port field is sufficient. Please remove the targetPort line.
Suggested change:

```yaml
- port: http
  path: /metrics
  interval: 15s
```

```yaml
- type: prometheus
  metadata:
    serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
    query: avg(vllm:num_requests_waiting)
```
The vllm:num_requests_waiting metric is scraped from all inference pods matched by the PodMonitor. In an environment with multiple ModelServing instances, this query will calculate the average across all of them, leading to incorrect scaling decisions. The query should be filtered by a label that uniquely identifies the pods belonging to this ModelServing instance (my-modelserving).
For example, if pods have a label like modelserving.volcano.sh/name: my-modelserving, the query should be updated to use it. Note that Prometheus relabeling will convert label characters like / and . to _. You'll need to verify the exact label on the pods.
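If the pod label is not already attached to the scraped series, one way to expose it is `podTargetLabels` on the PodMonitor — a minimal sketch, assuming the pods carry a `modelserving.volcano.sh/name` label:

```yaml
# Sketch: copy a pod label onto all scraped series (label name assumed).
# Prometheus relabeling sanitizes "/" and "." to "_", so the resulting
# series label is modelserving_volcano_sh_name.
spec:
  podTargetLabels:
    - modelserving.volcano.sh/name
```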
Suggested change:

```yaml
query: avg(vllm:num_requests_waiting{modelserving_volcano_sh_name="my-modelserving"})
```

```yaml
- type: prometheus
  metadata:
    serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
    query: sum(kthena_router_active_downstream_requests)
```
The kthena_router_active_downstream_requests metric is labeled by model. This query sums up active requests across all models, which could lead to incorrect scaling behavior if multiple models are served by the same router. To ensure this ScaledObject only considers metrics for my-modelserving, you should filter the query by the model name.
Suggested change:

```yaml
query: sum(kthena_router_active_downstream_requests{model="my-modelserving"})
```
Pull request overview
Adds prototype Kubernetes manifests demonstrating how to use Prometheus-scraped metrics with KEDA to autoscale the ModelServing custom resource (via its scale subresource), as an exploration towards HPA/KEDA integration discussed in #799.
Changes:
- Add a ServiceMonitor to scrape kthena-router metrics.
- Add a PodMonitor to scrape inference pod (vLLM/SGLang) metrics.
- Add a KEDA ScaledObject that scales a ModelServing based on Prometheus queries.
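For reference, a ScaledObject along these lines might look like the following — a sketch only; the ModelServing `apiVersion` and the replica bounds are assumptions, not taken from this PR:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-modelserving-scaler
spec:
  scaleTargetRef:
    apiVersion: serving.volcano.sh/v1alpha1  # assumed group/version for ModelServing
    kind: ModelServing
    name: my-modelserving
  minReplicaCount: 1    # assumed bounds
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: vllm_requests_waiting_avg
        query: avg(vllm:num_requests_waiting)
        threshold: "5"
```

KEDA drives the target's scale subresource through an HPA it manages, which is why ModelServing exposing `scale` is the key prerequisite here.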
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| examples/keda-autoscaling/servicemonitor-router.yaml | ServiceMonitor example for scraping router /metrics. |
| examples/keda-autoscaling/podmonitor-inference.yaml | PodMonitor example for scraping inference pods’ /metrics. |
| examples/keda-autoscaling/keda-scaledobject.yaml | KEDA ScaledObject example targeting ModelServing using Prometheus triggers. |
```yaml
  query: avg(vllm:num_requests_waiting)
  threshold: "5"
  metricName: vllm_requests_waiting_avg
- type: prometheus
  metadata:
    serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
    query: sum(kthena_router_active_downstream_requests)
```
These Prometheus queries are not scoped to the specific ModelServing/router instance (no namespace/service/pod/model label filters). In a cluster with multiple ModelServings or routers, this will aggregate unrelated traffic/queue metrics and drive incorrect scaling. Update the queries to filter to the intended target (e.g., by namespace, service, and/or ModelServing-related pod labels) so the ScaledObject only reacts to metrics from my-modelserving.
Suggested change:

```yaml
  query: avg(vllm:num_requests_waiting{namespace="default",service="my-modelserving"})
  threshold: "5"
  metricName: vllm_requests_waiting_avg
- type: prometheus
  metadata:
    serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
    query: sum(kthena_router_active_downstream_requests{namespace="default",service="my-modelserving"})
```
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
```
All three example manifests omit metadata.namespace. In this repo, most example YAMLs set an explicit namespace (e.g. kthena-system/default), and for ServiceMonitor/PodMonitor it also affects discovery because they only select targets within their own namespace unless spec.namespaceSelector is configured. Consider adding an explicit namespace (and namespaceSelector if you expect scraping across namespaces) to make the example apply correctly out-of-the-box.
Suggested change:

```yaml
metadata:
  namespace: kthena-system
```
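To illustrate the cross-namespace case mentioned above, a ServiceMonitor living in `kthena-system` that scrapes a Service in `default` could add a `namespaceSelector` — a sketch with assumed names:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kthena-router        # assumed name
  namespace: kthena-system
spec:
  namespaceSelector:
    matchNames:
      - default              # namespace of the router Service (assumed)
  selector:
    matchLabels:
      app.kubernetes.io/component: kthena-router
  endpoints:
    - port: http
      path: /metrics
```

Without `namespaceSelector`, the Prometheus Operator only discovers targets in the monitor's own namespace.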
```yaml
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: kthena-router
```
ServiceMonitor.spec.selector.matchLabels matches any Service with app.kubernetes.io/component: kthena-router. The Helm chart also labels the kthena-router-webhook Service with the same component label, but that Service exposes only the webhook port (no http metrics port). This can lead to failed scrapes / confusing Prometheus targets. Prefer selecting a label that uniquely identifies the metrics Service, or add a dedicated label on the router metrics Service and match on that.
Suggested change:

```yaml
      app.kubernetes.io/name: kthena-router
```
```yaml
- port: http
  targetPort: 8000
```
PodMonitor endpoint sets port: http, but ModelServing-generated inference pods/templates don’t name the container port http (they typically only set containerPort: 8000 without a name). With a named port mismatch Prometheus Operator won’t be able to resolve the scrape target. Either name the metrics port http in the pod spec, or remove port: and rely on targetPort, or set port to the actual named port used by the inference pods.
Suggested change:

```yaml
- targetPort: 8000
```
/assign @hzxuzhonghu
LiZhenCheng9527 left a comment
Have you tested all these configurations?
I validated the configs and will test on a cluster next.
hey @LiZhenCheng9527 @hzxuzhonghu
```yaml
- type: prometheus
  metadata:
    serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
    query: sum(kthena_router_active_downstream_requests)
```
I am curious: this metric is not per model, so how could it be appropriate? The kthena router supports routing to multiple models.
hzxuzhonghu left a comment
Please also note this is missing a README.
- Filter kthena_router_active_downstream_requests by model label
- Filter vllm:num_requests_waiting by model_serving label
- Add README with deployment steps and per-model scaling docs

Signed-off-by: WHOIM1205 <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED.
hey @hzxuzhonghu, I've updated the example to scope the metric per model using the model label, so each ModelServing instance scales only based on its own traffic. I also added a README with setup and usage steps. Let me know if you think there's a better way to handle the model-to-resource mapping here.



This PR adds a small prototype to explore KEDA-based autoscaling for ModelServing.
It includes:
The idea is to validate how external autoscaling (via KEDA + Prometheus) could integrate with the existing ModelServing controller, which already exposes the Kubernetes scale subresource.
For the prototype:
This is meant as a minimal example to validate the approach before moving to a more complete implementation.
Note: this assumes the built-in autoscaler is not active for the same ModelServing to avoid conflicting updates to spec.replicas.
Refs #799