
feat: add KEDA autoscaling example manifests #831

Open

WHOIM1205 wants to merge 2 commits into volcano-sh:main from WHOIM1205:keda-autoscaling-prototype

Conversation

@WHOIM1205
Contributor

This PR adds a small prototype to explore KEDA-based autoscaling for ModelServing.

It includes:

  • a ServiceMonitor for scraping kthena-router metrics
  • a PodMonitor for scraping inference pod metrics (vLLM/SGLang)
  • an example ScaledObject targeting ModelServing using Prometheus metrics
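
As a sketch of the shape these manifests take (illustrative, not the exact files in this PR), the PodMonitor could look like the following, assuming the inference pods carry an `app: inference` label and name their metrics port `http` (both assumptions that need checking against the actual pod spec):

```yaml
# Illustrative PodMonitor sketch; the pod label and port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: inference-pods
  namespace: kthena-system        # adjust to where the inference pods run
spec:
  selector:
    matchLabels:
      app: inference              # assumed pod label
  podMetricsEndpoints:
    - port: http                  # assumes a named container port "http"
      path: /metrics
      interval: 15s
```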

The idea is to validate how external autoscaling (via KEDA + Prometheus) could integrate with the existing ModelServing controller, which already exposes the Kubernetes scale subresource.

For the prototype:

  • scaling is based on vllm:num_requests_waiting (backend queue pressure)
  • and kthena_router_active_downstream_requests (incoming request load)
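
Wired together as a ScaledObject, this could look roughly like the sketch below (the ModelServing apiVersion, resource names, thresholds, and replica bounds are assumptions, not the exact manifest in this PR):

```yaml
# Illustrative ScaledObject sketch; apiVersion, names, thresholds,
# and replica bounds are assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-modelserving-scaler
spec:
  scaleTargetRef:
    apiVersion: workload.volcano.sh/v1alpha1   # assumed group/version for ModelServing
    kind: ModelServing
    name: my-modelserving
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: avg(vllm:num_requests_waiting)
        threshold: "5"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(kthena_router_active_downstream_requests)
        threshold: "10"
```

KEDA can drive any custom resource that exposes the scale subresource, which is what lets the ScaledObject target ModelServing directly instead of a Deployment.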

This is meant as a minimal example to validate the approach before moving to a more complete implementation.

Note: this assumes the built-in autoscaler is not active for the same ModelServing to avoid conflicting updates to spec.replicas.

Refs #799

Add ScaledObject, PodMonitor, and ServiceMonitor manifests
for KEDA-based autoscaling of inference workloads.

Signed-off-by: WHOIM1205 <[email protected]>
Copilot AI review requested due to automatic review settings March 20, 2026 21:47
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a prototype for integrating KEDA-based external autoscaling with ModelServing. It sets up the necessary monitoring components, including a ServiceMonitor for the kthena-router and a PodMonitor for inference pods, to expose relevant metrics to Prometheus. An example ScaledObject is also provided, demonstrating how ModelServing can be scaled dynamically based on these Prometheus metrics, specifically targeting backend queue pressure and incoming request load. The primary goal is to validate this approach for future, more complete implementations.

Highlights

  • KEDA ScaledObject: Introduced an example KEDA ScaledObject to enable external autoscaling for ModelServing, configured to scale based on Prometheus metrics from vLLM request queues and kthena-router active requests.
  • Prometheus Monitors: Added a ServiceMonitor for scraping metrics from the kthena-router and a PodMonitor for scraping metrics from inference pods (vLLM/SGLang) to provide data for KEDA autoscaling.
  • External Autoscaling Prototype: Provided a minimal prototype to validate the integration of KEDA and Prometheus for external autoscaling with the existing ModelServing controller.



@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces valuable example manifests for integrating KEDA-based autoscaling with ModelServing. The changes are well-structured and provide a good starting point. My review includes suggestions to enhance the robustness of the Prometheus queries in the ScaledObject to ensure correct behavior in multi-model/multi-instance deployments. I've also identified and proposed a fix for a critical issue in the PodMonitor configuration that would prevent metrics from being scraped.

Comment on lines +12 to +15

    - port: http
      targetPort: 8000
      path: /metrics
      interval: 15s


critical

The targetPort field is not a valid field within a podMetricsEndpoints item for a PodMonitor resource. Its inclusion will likely make this manifest invalid and prevent metrics from being scraped. The port field is sufficient. Please remove the targetPort line.

    - port: http
      path: /metrics
      interval: 15s

    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: avg(vllm:num_requests_waiting)


high

The vllm:num_requests_waiting metric is scraped from all inference pods matched by the PodMonitor. In an environment with multiple ModelServing instances, this query will calculate the average across all of them, leading to incorrect scaling decisions. The query should be filtered by a label that uniquely identifies the pods belonging to this ModelServing instance (my-modelserving).

For example, if pods have a label like modelserving.volcano.sh/name: my-modelserving, the query should be updated to use it. Note that Prometheus relabeling will convert label characters like / and . to _. You'll need to verify the exact label on the pods.

        query: avg(vllm:num_requests_waiting{modelserving_volcano_sh_name="my-modelserving"})

    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(kthena_router_active_downstream_requests)


high

The kthena_router_active_downstream_requests metric is labeled by model. This query sums up active requests across all models, which could lead to incorrect scaling behavior if multiple models are served by the same router. To ensure this ScaledObject only considers metrics for my-modelserving, you should filter the query by the model name.

        query: sum(kthena_router_active_downstream_requests{model="my-modelserving"})

Contributor

Copilot AI left a comment


Pull request overview

Adds prototype Kubernetes manifests demonstrating how to use Prometheus-scraped metrics with KEDA to autoscale the ModelServing custom resource (via its scale subresource), as an exploration towards HPA/KEDA integration discussed in #799.

Changes:

  • Add a ServiceMonitor to scrape kthena-router metrics.
  • Add a PodMonitor to scrape inference pod (vLLM/SGLang) metrics.
  • Add a KEDA ScaledObject that scales a ModelServing based on Prometheus queries.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
examples/keda-autoscaling/servicemonitor-router.yaml ServiceMonitor example for scraping router /metrics.
examples/keda-autoscaling/podmonitor-inference.yaml PodMonitor example for scraping inference pods’ /metrics.
examples/keda-autoscaling/keda-scaledobject.yaml KEDA ScaledObject example targeting ModelServing using Prometheus triggers.


Comment on lines +18 to +24

        query: avg(vllm:num_requests_waiting)
        threshold: "5"
        metricName: vllm_requests_waiting_avg
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(kthena_router_active_downstream_requests)

Copilot AI Mar 20, 2026


These Prometheus queries are not scoped to the specific ModelServing/router instance (no namespace/service/pod/model label filters). In a cluster with multiple ModelServings or routers, this will aggregate unrelated traffic/queue metrics and drive incorrect scaling. Update the queries to filter to the intended target (e.g., by namespace, service, and/or ModelServing-related pod labels) so the ScaledObject only reacts to metrics from my-modelserving.

Suggested change:

        query: avg(vllm:num_requests_waiting{namespace="default",service="my-modelserving"})
        threshold: "5"
        metricName: vllm_requests_waiting_avg
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(kthena_router_active_downstream_requests{namespace="default",service="my-modelserving"})

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,14 @@
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
Copy link

Copilot AI Mar 20, 2026


All three example manifests omit metadata.namespace. In this repo, most example YAMLs set an explicit namespace (e.g. kthena-system/default), and for ServiceMonitor/PodMonitor it also affects discovery because they only select targets within their own namespace unless spec.namespaceSelector is configured. Consider adding an explicit namespace (and namespaceSelector if you expect scraping across namespaces) to make the example apply correctly out-of-the-box.

Suggested change:

    metadata:
      namespace: kthena-system

    spec:
      selector:
        matchLabels:
          app.kubernetes.io/component: kthena-router

Copilot AI Mar 20, 2026


ServiceMonitor.spec.selector.matchLabels matches any Service with app.kubernetes.io/component: kthena-router. The Helm chart also labels the kthena-router-webhook Service with the same component label, but that Service exposes only the webhook port (no http metrics port). This can lead to failed scrapes / confusing Prometheus targets. Prefer selecting a label that uniquely identifies the metrics Service, or add a dedicated label on the router metrics Service and match on that.

Suggested change:

          app.kubernetes.io/name: kthena-router

Comment on lines +12 to +13

    - port: http
      targetPort: 8000

Copilot AI Mar 20, 2026


PodMonitor endpoint sets port: http, but ModelServing-generated inference pods/templates don’t name the container port http (they typically only set containerPort: 8000 without a name). With a named port mismatch Prometheus Operator won’t be able to resolve the scrape target. Either name the metrics port http in the pod spec, or remove port: and rely on targetPort, or set port to the actual named port used by the inference pods.

Suggested change:

    - targetPort: 8000

@WHOIM1205
Contributor Author

/assign @hzxuzhonghu

Contributor

@LiZhenCheng9527 left a comment


Have you tested all these configurations?

@WHOIM1205
Contributor Author

> Have you tested all these configurations?

I validated the configs locally and will test them on a cluster next.

@WHOIM1205
Contributor Author

Hey @LiZhenCheng9527 @hzxuzhonghu,
I tested KEDA scaling locally using a sample deployment.
After setting up metrics-server and generating load, scaling works as expected: the number of pods increases under load and stabilizes correctly.
This validates that the KEDA-based approach is functioning as intended.
[screenshots attached]

    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(kthena_router_active_downstream_requests)
Member


I'm curious: this metric is not per model, so how can it be appropriate? The kthena router supports routing to multiple models.


Member

@hzxuzhonghu left a comment


Please also note that this is missing a README.

- Filter kthena_router_active_downstream_requests by model label
- Filter vllm:num_requests_waiting by model_serving label
- Add README with deployment steps and per-model scaling docs

Signed-off-by: WHOIM1205 <[email protected]>
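
With those filters applied, the triggers could look roughly like this (the label names follow the review discussion and should be verified against what Prometheus actually records after relabeling):

```yaml
# Per-model filtered triggers; label names are taken from the review
# suggestions and are assumptions until verified against live metrics.
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      query: sum(kthena_router_active_downstream_requests{model="my-modelserving"})
      threshold: "10"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      query: avg(vllm:num_requests_waiting{model_serving="my-modelserving"})
      threshold: "5"
```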
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hzxuzhonghu. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@WHOIM1205
Copy link
Copy Markdown
Contributor Author

Hey @hzxuzhonghu,

I've updated the example to scope the metrics per model using the model label, so each ModelServing instance scales only on its own traffic.

I also added a README with setup and usage steps.

Let me know if you think there's a better way to handle the model-to-resource mapping here.
