
feat: add KEDA autoscaling example manifests #831

Open

WHOIM1205 wants to merge 2 commits into volcano-sh:main from WHOIM1205:keda-autoscaling-prototype

Conversation

@WHOIM1205
Contributor

This PR adds a small prototype to explore KEDA-based autoscaling for ModelServing.

It includes:

  • a ServiceMonitor for scraping kthena-router metrics
  • a PodMonitor for scraping inference pod metrics (vLLM/SGLang)
  • an example ScaledObject targeting ModelServing using Prometheus metrics
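
As a sketch of the shape these manifests take (illustrative, not the exact files in this PR), the PodMonitor could look like the following, assuming the inference pods carry an `app: inference` label and name their metrics port `http` (both assumptions that need checking against the actual pod spec):

```yaml
# Illustrative PodMonitor sketch; the pod label and port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: inference-pods
  namespace: kthena-system        # adjust to where the inference pods run
spec:
  selector:
    matchLabels:
      app: inference              # assumed pod label
  podMetricsEndpoints:
    - port: http                  # assumes a named container port "http"
      path: /metrics
      interval: 15s
```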

The idea is to validate how external autoscaling (via KEDA + Prometheus) could integrate with the existing ModelServing controller, which already exposes the Kubernetes scale subresource.

For the prototype:

  • scaling is based on vllm:num_requests_waiting (backend queue pressure)
  • and kthena_router_active_downstream_requests (incoming request load)
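
Wired together as a ScaledObject, this could look roughly like the sketch below (the ModelServing apiVersion, resource names, thresholds, and replica bounds are assumptions, not the exact manifest in this PR):

```yaml
# Illustrative ScaledObject sketch; apiVersion, names, thresholds,
# and replica bounds are assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-modelserving-scaler
spec:
  scaleTargetRef:
    apiVersion: workload.volcano.sh/v1alpha1   # assumed group/version for ModelServing
    kind: ModelServing
    name: my-modelserving
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: avg(vllm:num_requests_waiting)
        threshold: "5"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(kthena_router_active_downstream_requests)
        threshold: "10"
```

KEDA can drive any custom resource that exposes the scale subresource, which is what lets the ScaledObject target ModelServing directly instead of a Deployment.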

This is meant as a minimal example to validate the approach before moving to a more complete implementation.

Note: this assumes the built-in autoscaler is not active for the same ModelServing to avoid conflicting updates to spec.replicas.

Refs #799

Add ScaledObject, PodMonitor, and ServiceMonitor manifests
for KEDA-based autoscaling of inference workloads.

Signed-off-by: WHOIM1205 <[email protected]>
Copilot AI review requested due to automatic review settings March 20, 2026 21:47
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a prototype for integrating KEDA-based external autoscaling with ModelServing. It sets up the necessary monitoring components, including a ServiceMonitor for the kthena-router and a PodMonitor for inference pods, to expose relevant metrics to Prometheus. An example ScaledObject is also provided, demonstrating how ModelServing can be scaled dynamically based on these Prometheus metrics, specifically targeting backend queue pressure and incoming request load. The primary goal is to validate this approach for future, more complete implementations.

Highlights

  • KEDA ScaledObject: Introduced an example KEDA ScaledObject to enable external autoscaling for ModelServing, configured to scale based on Prometheus metrics from vLLM request queues and kthena-router active requests.
  • Prometheus Monitors: Added a ServiceMonitor for scraping metrics from the kthena-router and a PodMonitor for scraping metrics from inference pods (vLLM/SGLang) to provide data for KEDA autoscaling.
  • External Autoscaling Prototype: Provided a minimal prototype to validate the integration of KEDA and Prometheus for external autoscaling with the existing ModelServing controller.



@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces valuable example manifests for integrating KEDA-based autoscaling with ModelServing. The changes are well-structured and provide a good starting point. My review includes suggestions to enhance the robustness of the Prometheus queries in the ScaledObject to ensure correct behavior in multi-model/multi-instance deployments. I've also identified and proposed a fix for a critical issue in the PodMonitor configuration that would prevent metrics from being scraped.

Comment on lines +12 to +15

    - port: http
      targetPort: 8000
      path: /metrics
      interval: 15s


critical

The targetPort field is not a valid field within a podMetricsEndpoints item for a PodMonitor resource. Its inclusion will likely make this manifest invalid and prevent metrics from being scraped. The port field is sufficient. Please remove the targetPort line.

    - port: http
      path: /metrics
      interval: 15s

    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: avg(vllm:num_requests_waiting)


high

The vllm:num_requests_waiting metric is scraped from all inference pods matched by the PodMonitor. In an environment with multiple ModelServing instances, this query will calculate the average across all of them, leading to incorrect scaling decisions. The query should be filtered by a label that uniquely identifies the pods belonging to this ModelServing instance (my-modelserving).

For example, if pods have a label like modelserving.volcano.sh/name: my-modelserving, the query should be updated to use it. Note that Prometheus relabeling will convert label characters like / and . to _. You'll need to verify the exact label on the pods.

        query: avg(vllm:num_requests_waiting{modelserving_volcano_sh_name="my-modelserving"})

    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(kthena_router_active_downstream_requests)


high

The kthena_router_active_downstream_requests metric is labeled by model. This query sums up active requests across all models, which could lead to incorrect scaling behavior if multiple models are served by the same router. To ensure this ScaledObject only considers metrics for my-modelserving, you should filter the query by the model name.

        query: sum(kthena_router_active_downstream_requests{model="my-modelserving"})

Contributor

Copilot AI left a comment


Pull request overview

Adds prototype Kubernetes manifests demonstrating how to use Prometheus-scraped metrics with KEDA to autoscale the ModelServing custom resource (via its scale subresource), as an exploration towards HPA/KEDA integration discussed in #799.

Changes:

  • Add a ServiceMonitor to scrape kthena-router metrics.
  • Add a PodMonitor to scrape inference pod (vLLM/SGLang) metrics.
  • Add a KEDA ScaledObject that scales a ModelServing based on Prometheus queries.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
examples/keda-autoscaling/servicemonitor-router.yaml ServiceMonitor example for scraping router /metrics.
examples/keda-autoscaling/podmonitor-inference.yaml PodMonitor example for scraping inference pods’ /metrics.
examples/keda-autoscaling/keda-scaledobject.yaml KEDA ScaledObject example targeting ModelServing using Prometheus triggers.


Comment on lines +18 to +24

        query: avg(vllm:num_requests_waiting)
        threshold: "5"
        metricName: vllm_requests_waiting_avg
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(kthena_router_active_downstream_requests)

Copilot AI Mar 20, 2026


These Prometheus queries are not scoped to the specific ModelServing/router instance (no namespace/service/pod/model label filters). In a cluster with multiple ModelServings or routers, this will aggregate unrelated traffic/queue metrics and drive incorrect scaling. Update the queries to filter to the intended target (e.g., by namespace, service, and/or ModelServing-related pod labels) so the ScaledObject only reacts to metrics from my-modelserving.

Suggested change:

        query: avg(vllm:num_requests_waiting{namespace="default",service="my-modelserving"})
        threshold: "5"
        metricName: vllm_requests_waiting_avg
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(kthena_router_active_downstream_requests{namespace="default",service="my-modelserving"})

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,14 @@
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
Copy link

Copilot AI Mar 20, 2026


All three example manifests omit metadata.namespace. In this repo, most example YAMLs set an explicit namespace (e.g. kthena-system/default), and for ServiceMonitor/PodMonitor it also affects discovery because they only select targets within their own namespace unless spec.namespaceSelector is configured. Consider adding an explicit namespace (and namespaceSelector if you expect scraping across namespaces) to make the example apply correctly out-of-the-box.

Suggested change:

    metadata:
      namespace: kthena-system

    spec:
      selector:
        matchLabels:
          app.kubernetes.io/component: kthena-router

Copilot AI Mar 20, 2026


ServiceMonitor.spec.selector.matchLabels matches any Service with app.kubernetes.io/component: kthena-router. The Helm chart also labels the kthena-router-webhook Service with the same component label, but that Service exposes only the webhook port (no http metrics port). This can lead to failed scrapes / confusing Prometheus targets. Prefer selecting a label that uniquely identifies the metrics Service, or add a dedicated label on the router metrics Service and match on that.

Suggested change:

          app.kubernetes.io/name: kthena-router

Comment on lines +12 to +13

    - port: http
      targetPort: 8000

Copilot AI Mar 20, 2026


PodMonitor endpoint sets port: http, but ModelServing-generated inference pods/templates don’t name the container port http (they typically only set containerPort: 8000 without a name). With a named port mismatch Prometheus Operator won’t be able to resolve the scrape target. Either name the metrics port http in the pod spec, or remove port: and rely on targetPort, or set port to the actual named port used by the inference pods.

Suggested change:

    - targetPort: 8000

@WHOIM1205
Contributor Author

/assign @hzxuzhonghu

Contributor

@LiZhenCheng9527 left a comment


Have you tested all these configurations?

@WHOIM1205
Contributor Author

> Have you tested all these configurations?

I validated the configs locally and will test them on a cluster next.

@WHOIM1205
Contributor Author

Hey @LiZhenCheng9527 @hzxuzhonghu,
I tested KEDA scaling locally using a sample deployment.
After setting up metrics-server and generating load, scaling works as expected: the number of pods increases under load and stabilizes correctly.
This validates that the KEDA-based approach is functioning as intended.
[screenshots attached]

    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(kthena_router_active_downstream_requests)
Member


I'm curious: this metric is not per model, so how can it be appropriate? The kthena router supports routing to multiple models.


Member

@hzxuzhonghu left a comment


Please also note that this is missing a README.

- Filter kthena_router_active_downstream_requests by model label
- Filter vllm:num_requests_waiting by model_serving label
- Add README with deployment steps and per-model scaling docs

Signed-off-by: WHOIM1205 <[email protected]>
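
With those filters applied, the triggers could look roughly like this (the label names follow the review discussion and should be verified against what Prometheus actually records after relabeling):

```yaml
# Per-model filtered triggers; label names are taken from the review
# suggestions and are assumptions until verified against live metrics.
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      query: sum(kthena_router_active_downstream_requests{model="my-modelserving"})
      threshold: "10"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      query: avg(vllm:num_requests_waiting{model_serving="my-modelserving"})
      threshold: "5"
```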
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hzxuzhonghu. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@WHOIM1205
Copy link
Copy Markdown
Contributor Author

Hey @hzxuzhonghu,

I've updated the example to scope the metrics per model using the model label, so each ModelServing instance scales only on its own traffic.

I also added a README with setup and usage steps.

Let me know if you think there's a better way to handle the model-to-resource mapping here.
