
Proposal: Prometheus + KEDA autoscaling for ModelServing #868

Open
WHOIM1205 wants to merge 2 commits into volcano-sh:main from WHOIM1205:keda-autoscaling-proposal

Conversation

@WHOIM1205 (Contributor)

Proposal: Prometheus + KEDA autoscaling for ModelServing

This PR adds a design proposal for integrating Prometheus + KEDA-based autoscaling into Kthena ModelServing.

The proposal outlines:

  • Problem statement and gaps in current autoscaling
  • Proposed architecture (Prometheus → KEDA → HPA → ModelServing)
  • Key design decisions and trade-offs
  • Failure modes and rollout plan
  • Learnings from the working prototype

This builds on earlier work where the end-to-end flow has already been validated locally.


Related


Notes

  • This proposal introduces KEDA as an optional autoscaling path
  • It does not modify or replace the existing AutoscalingPolicy system
  • The goal is to align on design before continuing further implementation

Copilot AI review requested due to automatic review settings April 4, 2026 19:06

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a proposal and supporting Kubernetes manifests for implementing KEDA and Prometheus-based autoscaling for ModelServing. The changes include new CRD definitions, deployment configurations, and a detailed design document. Feedback on the manifests identifies several critical issues: crd.json contains a leading typo and cluster-specific metadata that should be removed; fixed-crd.yaml is truncated and incomplete; and prom.yaml has invalid YAML syntax in its selector definitions. Additionally, the documentation includes redundant proposal files that should be consolidated.

crd.json Outdated
@@ -0,0 +1,532 @@
i{

critical

The JSON file starts with an 'i' character, which is likely a typo (e.g., from a vim insert command). This makes the JSON invalid and unparseable.

Suggested change
i{
{

fixed-crd.yaml Outdated
properties:
metadata:
description: Object's metadata.
properties:

high

This file appears to be truncated. It ends abruptly at line 249 with properties:, leaving the ModelServing CRD definition incomplete.

crd.json Outdated
Comment on lines +4 to +21
"metadata": {
"annotations": {
"controller-gen.kubebuilder.io/version": "v0.14.0",
"kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"apiextensions.k8s.io/v1\",\"kind\":\"CustomResourceDefinition\",\"metadata\":{\"annotations\":{\"controller-gen.kubebuilder.io/version\":\"v0.14.0\"},\"labels\":{\"app.kubernetes.io/part-of\":\"keda-operator\",\"app.kubernetes.io/version\":\"2.14.0\"},\"name\":\"scaledobjects.keda.sh\"},\"spec\":{\"group\":\"keda.sh\",\"names\":{\"kind\":\"ScaledObject\",\"listKind\":\"ScaledObjectList\",\"plural\":\"scaledobjects\",\"shortNames\":[\"so\"],\"singular\":\"scaledobject\"},\"scope\":\"Namespaced\",\"versions\":[{\"additionalPrinterColumns\":[{\"jsonPath\":\".status.scaleTargetKind\",\"name\":\"ScaleTargetKind\",\"type\":\"string\"},{\"jsonPath\":\".spec.scaleTargetRef.name\",\"name\":\"ScaleTargetName\",\"type\":\"string\"},{\"jsonPath\":\".spec.minReplicaCount\",\"name\":\"Min\",\"type\":\"integer\"},{\"jsonPath\":\".spec.maxReplicaCount\",\"name\":\"Max\",\"type\":\"integer\"},{\"jsonPath\":\".spec.triggers[*].type\",\"name\":\"Triggers\",\"type\":\"string\"},{\"jsonPath\":\".spec.triggers[*].authenticationRef.name\",\"name\":\"Authentication\",\"type\":\"string\"},{\"jsonPath\":\".status.conditions[?(@.type==\\\"Ready\\\")].status\",\"name\":\"Ready\",\"type\":\"string\"},{\"jsonPath\":\".status.conditions[?(@.type==\\\"Active\\\")].status\",\"name\":\"Active\",\"type\":\"string\"},{\"jsonPath\":\".status.conditions[?(@.type==\\\"Fallback\\\")].status\",\"name\":\"Fallback\",\"type\":\"string\"},{\"jsonPath\":\".status.conditions[?(@.type==\\\"Paused\\\")].status\",\"name\":\"Paused\",\"type\":\"string\"},{\"jsonPath\":\".metadata.creationTimestamp\",\"name\":\"Age\",\"type\":\"date\"}],\"name\":\"v1alpha1\",\"schema\":{\"openAPIV3Schema\":{\"description\":\"ScaledObject is a specification for a ScaledObject resource\",\"properties\":{\"apiVersion\":{\"description\":\"APIVersion defines the versioned schema of this representation of an object.\\nServers should convert recognized schemas to the 
latest internal value, and\\nmay reject unrecognized values.\\nMore info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources\",\"type\":\"string\"},\"kind\":{\"description\":\"Kind is a string value representing the REST resource this object represents.\\nServers may infer this from the endpoint the client submits requests to.\\nCannot be updated.\\nIn CamelCase.\\nMore info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds\",\"type\":\"string\"},\"metadata\":{\"type\":\"object\"},\"spec\":{\"description\":\"ScaledObjectSpec is the spec for a ScaledObject resource\",\"properties\":{\"advanced\":{\"description\":\"AdvancedConfig specifies advance scaling options\",\"properties\":{\"horizontalPodAutoscalerConfig\":{\"description\":\"HorizontalPodAutoscalerConfig specifies horizontal scale config\",\"properties\":{\"behavior\":{\"description\":\"HorizontalPodAutoscalerBehavior configures the scaling behavior of the target\\nin both Up and Down directions (scaleUp and scaleDown fields respectively).\",\"properties\":{\"scaleDown\":{\"description\":\"scaleDown is scaling policy for scaling Down.\\nIf not set, the default value is to allow to scale down to minReplicas pods, with a\\n300 second stabilization window (i.e., the highest recommendation for\\nthe last 300sec is used).\",\"properties\":{\"policies\":{\"description\":\"policies is a list of potential scaling polices which can be used during scaling.\\nAt least one policy must be specified, otherwise the HPAScalingRules will be discarded as invalid\",\"items\":{\"description\":\"HPAScalingPolicy is a single policy which must hold true for a specified past interval.\",\"properties\":{\"periodSeconds\":{\"description\":\"periodSeconds specifies the window of time for which the policy should hold true.\\nPeriodSeconds must be greater than zero and less than or equal to 1800 (30 
min).\",\"format\":\"int32\",\"type\":\"integer\"},\"type\":{\"description\":\"type is used to specify the scaling policy.\",\"type\":\"string\"},\"value\":{\"description\":\"value contains the amount of change which is permitted by the policy.\\nIt must be greater than zero\",\"format\":\"int32\",\"type\":\"integer\"}},\"required\":[\"periodSeconds\",\"type\",\"value\"],\"type\":\"object\"},\"type\":\"array\",\"x-kubernetes-list-type\":\"atomic\"},\"selectPolicy\":{\"description\":\"selectPolicy is used to specify which policy should be used.\\nIf not set, the default value Max is used.\",\"type\":\"string\"},\"stabilizationWindowSeconds\":{\"description\":\"stabilizationWindowSeconds is the number of seconds for which past recommendations should be\\nconsidered while scaling up or scaling down.\\nStabilizationWindowSeconds must be greater than or equal to zero and less than or equal to 3600 (one hour).\\nIf not set, use the default values:\\n- For scale up: 0 (i.e. no stabilization is done).\\n- For scale down: 300 (i.e. 
the stabilization window is 300 seconds long).\",\"format\":\"int32\",\"maximum\":3600,\"minimum\":0,\"type\":\"integer\"}},\"type\":\"object\"},\"scaleUp\":{\"description\":\"scaleUp is scaling policy for scaling Up.\\nIf not set, the default value is the higher of:\\n * increase no more than 4 pods per 60 seconds\\n * double the number of pods per 60 seconds\\nNo stabilization is used.\",\"properties\":{\"policies\":{\"description\":\"policies is a list of potential scaling polices which can be used during scaling.\\nAt least one policy must be specified, otherwise the HPAScalingRules will be discarded as invalid\",\"items\":{\"description\":\"HPAScalingPolicy is a single policy which must hold true for a specified past interval.\",\"properties\":{\"periodSeconds\":{\"description\":\"periodSeconds specifies the window of time for which the policy should hold true.\\nPeriodSeconds must be greater than zero and less than or equal to 1800 (30 min).\",\"format\":\"int32\",\"type\":\"integer\"},\"type\":{\"description\":\"type is used to specify the scaling policy.\",\"type\":\"string\"},\"value\":{\"description\":\"value contains the amount of change which is permitted by the policy.\\nIt must be greater than zero\",\"format\":\"int32\",\"type\":\"integer\"}},\"required\":[\"periodSeconds\",\"type\",\"value\"],\"type\":\"object\"},\"type\":\"array\",\"x-kubernetes-list-type\":\"atomic\"},\"selectPolicy\":{\"description\":\"selectPolicy is used to specify which policy should be used.\\nIf not set, the default value Max is used.\",\"type\":\"string\"},\"stabilizationWindowSeconds\":{\"description\":\"stabilizationWindowSeconds is the number of seconds for which past recommendations should be\\nconsidered while scaling up or scaling down.\\nStabilizationWindowSeconds must be greater than or equal to zero and less than or equal to 3600 (one hour).\\nIf not set, use the default values:\\n- For scale up: 0 (i.e. no stabilization is done).\\n- For scale down: 300 (i.e. 
the stabilization window is 300 seconds long).\",\"format\":\"int32\",\"maximum\":3600,\"minimum\":0,\"type\":\"integer\"}},\"type\":\"object\"}},\"type\":\"object\"},\"name\":{\"type\":\"string\"}},\"type\":\"object\"},\"restoreToOriginalReplicaCount\":{\"type\":\"boolean\"},\"scalingModifiers\":{\"description\":\"ScalingModifiers describes advanced scaling logic options like formula\",\"properties\":{\"activationTarget\":{\"type\":\"string\"},\"formula\":{\"type\":\"string\"},\"metricType\":{\"description\":\"MetricTargetType specifies the type of metric being targeted, and should be either\\n\\\"Value\\\", \\\"AverageValue\\\", or \\\"Utilization\\\"\",\"type\":\"string\"},\"target\":{\"type\":\"string\"}},\"type\":\"object\"}},\"type\":\"object\"},\"cooldownPeriod\":{\"format\":\"int32\",\"type\":\"integer\"},\"fallback\":{\"description\":\"Fallback is the spec for fallback options\",\"properties\":{\"failureThreshold\":{\"format\":\"int32\",\"type\":\"integer\"},\"replicas\":{\"format\":\"int32\",\"type\":\"integer\"}},\"required\":[\"failureThreshold\",\"replicas\"],\"type\":\"object\"},\"idleReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"initialCooldownPeriod\":{\"format\":\"int32\",\"type\":\"integer\"},\"maxReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"minReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"pollingInterval\":{\"format\":\"int32\",\"type\":\"integer\"},\"scaleTargetRef\":{\"description\":\"ScaleTarget holds the reference to the scale target Object\",\"properties\":{\"apiVersion\":{\"type\":\"string\"},\"envSourceContainerName\":{\"type\":\"string\"},\"kind\":{\"type\":\"string\"},\"name\":{\"type\":\"string\"}},\"required\":[\"name\"],\"type\":\"object\"},\"triggers\":{\"items\":{\"description\":\"ScaleTriggers reference the scaler that will be used\",\"properties\":{\"authenticationRef\":{\"description\":\"AuthenticationRef points to the TriggerAuthentication or ClusterTriggerAuthentication object 
that\\nis used to authenticate the scaler with the environment\",\"properties\":{\"kind\":{\"description\":\"Kind of the resource being referred to. Defaults to TriggerAuthentication.\",\"type\":\"string\"},\"name\":{\"type\":\"string\"}},\"required\":[\"name\"],\"type\":\"object\"},\"metadata\":{\"additionalProperties\":{\"type\":\"string\"},\"type\":\"object\"},\"metricType\":{\"description\":\"MetricTargetType specifies the type of metric being targeted, and should be either\\n\\\"Value\\\", \\\"AverageValue\\\", or \\\"Utilization\\\"\",\"type\":\"string\"},\"name\":{\"type\":\"string\"},\"type\":{\"type\":\"string\"},\"useCachedMetrics\":{\"type\":\"boolean\"}},\"required\":[\"metadata\",\"type\"],\"type\":\"object\"},\"type\":\"array\"}},\"required\":[\"scaleTargetRef\",\"triggers\"],\"type\":\"object\"},\"status\":{\"description\":\"ScaledObjectStatus is the status for a ScaledObject resource\",\"properties\":{\"compositeScalerName\":{\"type\":\"string\"},\"conditions\":{\"description\":\"Conditions an array representation to store multiple Conditions\",\"items\":{\"description\":\"Condition to store the condition state\",\"properties\":{\"message\":{\"description\":\"A human readable message indicating details about the transition.\",\"type\":\"string\"},\"reason\":{\"description\":\"The reason for the condition's last transition.\",\"type\":\"string\"},\"status\":{\"description\":\"Status of the condition, one of True, False, Unknown.\",\"type\":\"string\"},\"type\":{\"description\":\"Type of condition\",\"type\":\"string\"}},\"required\":[\"status\",\"type\"],\"type\":\"object\"},\"type\":\"array\"},\"externalMetricNames\":{\"items\":{\"type\":\"string\"},\"type\":\"array\"},\"health\":{\"additionalProperties\":{\"description\":\"HealthStatus is the status for a ScaledObject's health\",\"properties\":{\"numberOfFailures\":{\"format\":\"int32\",\"type\":\"integer\"},\"status\":{\"description\":\"HealthStatusType is an indication of whether the health 
status is happy or failing\",\"type\":\"string\"}},\"type\":\"object\"},\"type\":\"object\"},\"hpaName\":{\"type\":\"string\"},\"lastActiveTime\":{\"format\":\"date-time\",\"type\":\"string\"},\"originalReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"pausedReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"resourceMetricNames\":{\"items\":{\"type\":\"string\"},\"type\":\"array\"},\"scaleTargetGVKR\":{\"description\":\"GroupVersionKindResource provides unified structure for schema.GroupVersionKind and Resource\",\"properties\":{\"group\":{\"type\":\"string\"},\"kind\":{\"type\":\"string\"},\"resource\":{\"type\":\"string\"},\"version\":{\"type\":\"string\"}},\"required\":[\"group\",\"kind\",\"resource\",\"version\"],\"type\":\"object\"},\"scaleTargetKind\":{\"type\":\"string\"}},\"type\":\"object\"}},\"required\":[\"spec\"],\"type\":\"object\"}},\"served\":true,\"storage\":true,\"subresources\":{\"status\":{}}}]}}\n"
},
"creationTimestamp": "2026-03-24T10:11:35Z",
"deletionGracePeriodSeconds": 0,
"deletionTimestamp": "2026-03-24T20:00:31Z",
"finalizers": [],
"generation": 1,
"labels": {
"app.kubernetes.io/part-of": "keda-operator",
"app.kubernetes.io/version": "2.14.0"
},
"name": "scaledobjects.keda.sh",
"resourceVersion": "19239",
"uid": "a141b8da-d181-4bc3-aceb-238dbafea7e8"
},

medium

This manifest appears to be a raw export from a running cluster. It contains cluster-specific metadata such as uid, resourceVersion, creationTimestamp, and a large last-applied-configuration annotation. These should be removed to ensure the manifest is clean and portable.

crd.json Outdated
Comment on lines +495 to +531
"status": {
"acceptedNames": {
"kind": "ScaledObject",
"listKind": "ScaledObjectList",
"plural": "scaledobjects",
"shortNames": [
"so"
],
"singular": "scaledobject"
},
"conditions": [
{
"lastTransitionTime": "2026-03-24T10:11:35Z",
"message": "no conflicts found",
"reason": "NoConflicts",
"status": "True",
"type": "NamesAccepted"
},
{
"lastTransitionTime": "2026-03-24T10:11:35Z",
"message": "the initial names have been accepted",
"reason": "InitialNamesAccepted",
"status": "True",
"type": "Established"
},
{
"lastTransitionTime": "2026-03-24T20:00:31Z",
"message": "CustomResource deletion is in progress",
"reason": "InstanceDeletionInProgress",
"status": "True",
"type": "Terminating"
}
],
"storedVersions": [
"v1alpha1"
]
}

medium

The status block should not be included in a static CRD definition file stored in the repository. Please remove this section.

prom.yaml Outdated
Comment on lines +52 to +54
podMonitorSelector: {}
matchLabels:
release: prometheus

medium

The podMonitorSelector is incorrectly defined. The empty braces {} on line 52 make the subsequent matchLabels block orphaned or invalid depending on the YAML parser.

  podMonitorSelector:
    matchLabels:
      release: prometheus

prom.yaml Outdated
Comment on lines +81 to +83
serviceMonitorSelector: {}
matchLabels:
release: prometheus

medium

The serviceMonitorSelector is incorrectly defined. The empty braces {} on line 81 make the subsequent matchLabels block orphaned.

  serviceMonitorSelector:
    matchLabels:
      release: prometheus

@@ -0,0 +1,359 @@
# Proposal: KEDA + Prometheus Autoscaling for Kthena ModelServing

medium

This file is redundant as it is nearly identical to docs/proposals/keda-autoscaling.md. Please consolidate the proposal into a single file to avoid confusion and maintenance overhead.

Copilot AI (Contributor) left a comment

Pull request overview

This PR proposes adding Prometheus + KEDA-based autoscaling support for Kthena ModelServing. The proposal is comprehensive, addressing the limitation of the existing AutoscalingPolicy that cannot directly query Prometheus and lacks per-model demand signals from the router.

Changes:

  • Design proposal document for KEDA + Prometheus autoscaling architecture, including problem statement, proposed approach, design decisions, failure modes, and rollout plan
  • Example YAML artifacts and CRD definitions (though these appear to be temporary test files that should not be committed)

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| docs/proposals/keda-prometheus-autoscaling.md | Comprehensive design proposal covering architecture, metrics, design decisions, alternatives, failure modes, and rollout plan |
| docs/proposals/keda-autoscaling.md | Near-duplicate of the above proposal with only a date difference |
| prom.yaml | Prometheus CRD example with YAML syntax errors |
| crd.json | KEDA ScaledObject CRD definition with a JSON syntax error at the start |
| svc-backup.yaml, deploy-backup.yaml, service.yaml, fixed-crd.yaml | Test/backup artifacts that appear unrelated to this proposal |



prom.yaml Outdated
Comment on lines +52 to +54
podMonitorSelector: {}
matchLabels:
release: prometheus
Copilot AI Apr 4, 2026

This file contains malformed YAML. Lines 52-54 show podMonitorSelector: {} followed by indented matchLabels: which is invalid YAML structure. An empty {} object cannot have properties. Similarly, lines 81-83 show the same issue with serviceMonitorSelector:. The selectors should either be {} (empty) or contain the matchLabels: at the same indentation level without the preceding {}.
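The two valid forms the comment describes would look like this (label values follow the snippet under review):

```yaml
# Option A: empty selector — matches all PodMonitors in scope
podMonitorSelector: {}

# Option B: label-based selector — note there is no preceding {}
podMonitorSelector:
  matchLabels:
    release: prometheus
```

The same applies to `serviceMonitorSelector`; the broken original combined both forms at once.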

Comment on lines +3 to +46
**Status:** Draft
**Authors:** @david_laid
**Date:** 2026-04-02

### Related PRs

| PR | Description |
|----|-------------|
| [#831](https://github.com/volcano-sh/kthena/pull/831) | Initial prototype -- example manifests (ServiceMonitor, PodMonitor, ScaledObject) |
| [#836](https://github.com/volcano-sh/kthena/pull/836) | Helm integration -- templates + `values.yaml` for monitoring and autoscaling |
| [#839](https://github.com/volcano-sh/kthena/pull/839) | Controller fix -- populates `status.labelSelector` so HPA can actually find pods |

All three reference [#799](https://github.com/volcano-sh/kthena/issues/799). We've validated the full flow end-to-end before writing this up.

---

## 1. Problem

LLM inference traffic is bursty. Without autoscaling, you either overprovision GPUs or eat latency spikes when load surges.

Kthena already has an autoscaler (`AutoscalingPolicy` + `AutoscalingPolicyBinding`). It scrapes metrics from pod endpoints, has panic mode, supports heterogeneous cost-optimized scaling. Works fine for pod-level signals like `kthena:num_requests_waiting`.

Where it falls short:

1. **Can't talk to Prometheus.** It scrapes pods directly. Most teams already have Prometheus running -- the autoscaler can't use it.

2. **No per-model demand signal.** The router exposes `kthena_router_active_downstream_requests{model="..."}` which tells you how much traffic a model is getting *before* it hits backends. The built-in autoscaler doesn't use this.

3. **Extra moving parts.** Teams already running KEDA end up maintaining two autoscaling systems side by side.

The goal here is to add KEDA as an optional autoscaling path. We're not touching AutoscalingPolicy.

### Non-goals

- Modifying or replacing `AutoscalingPolicy` / `AutoscalingPolicyBinding`
- Building a custom metrics adapter
- Multi-model-per-ModelServing (we assume 1:1)
- Role-level scaling via KEDA (built-in autoscaler handles that with `subTargets.kind: Role`)
- Auto-generating ScaledObjects (Phase 3 at earliest)

---

## 2. Proposed Approach

Copilot AI Apr 4, 2026

This is a duplicate of docs/proposals/keda-prometheus-autoscaling.md with only a different date (2026-04-02 vs 2026-03-31). Having two identical proposal files in the repository is confusing and should be avoided. Consider removing this file and keeping only the more recent version, or clarifying why both files are needed.

Suggested change
**Status:** Draft
**Authors:** @david_laid
**Date:** 2026-04-02
### Related PRs
| PR | Description |
|----|-------------|
| [#831](https://github.com/volcano-sh/kthena/pull/831) | Initial prototype -- example manifests (ServiceMonitor, PodMonitor, ScaledObject) |
| [#836](https://github.com/volcano-sh/kthena/pull/836) | Helm integration -- templates + `values.yaml` for monitoring and autoscaling |
| [#839](https://github.com/volcano-sh/kthena/pull/839) | Controller fix -- populates `status.labelSelector` so HPA can actually find pods |
All three reference [#799](https://github.com/volcano-sh/kthena/issues/799). We've validated the full flow end-to-end before writing this up.
---
## 1. Problem
LLM inference traffic is bursty. Without autoscaling, you either overprovision GPUs or eat latency spikes when load surges.
Kthena already has an autoscaler (`AutoscalingPolicy` + `AutoscalingPolicyBinding`). It scrapes metrics from pod endpoints, has panic mode, supports heterogeneous cost-optimized scaling. Works fine for pod-level signals like `kthena:num_requests_waiting`.
Where it falls short:
1. **Can't talk to Prometheus.** It scrapes pods directly. Most teams already have Prometheus running -- the autoscaler can't use it.
2. **No per-model demand signal.** The router exposes `kthena_router_active_downstream_requests{model="..."}` which tells you how much traffic a model is getting *before* it hits backends. The built-in autoscaler doesn't use this.
3. **Extra moving parts.** Teams already running KEDA end up maintaining two autoscaling systems side by side.
The goal here is to add KEDA as an optional autoscaling path. We're not touching AutoscalingPolicy.
### Non-goals
- Modifying or replacing `AutoscalingPolicy` / `AutoscalingPolicyBinding`
- Building a custom metrics adapter
- Multi-model-per-ModelServing (we assume 1:1)
- Role-level scaling via KEDA (built-in autoscaler handles that with `subTargets.kind: Role`)
- Auto-generating ScaledObjects (Phase 3 at earliest)
---
## 2. Proposed Approach
**Status:** Superseded
**Authors:** @david_laid
**Date:** 2026-04-02
This file previously duplicated the full proposal now maintained at
[`docs/proposals/keda-prometheus-autoscaling.md`](docs/proposals/keda-prometheus-autoscaling.md).
To avoid keeping two identical proposal documents in the repository, that file is the
canonical version and should be used for all future updates and references.
This file is retained only as a compatibility pointer for any existing links to
`docs/proposals/keda-autoscaling.md`.

crd.json Outdated
@@ -0,0 +1,532 @@
i{
Copilot AI Apr 4, 2026

The JSON file starts with invalid syntax: i{ on line 1 instead of just {. This makes the entire JSON file invalid and unparseable. This appears to be a typo or corruption.

Suggested change
i{
{

@hzxuzhonghu (Member) left a comment


Please remove all the YAMLs in the root dir.


Where it falls short:

1. **Can't talk to Prometheus.** It scrapes pods directly. Most teams already have Prometheus running -- the autoscaler can't use it.
Member

Thanks for the point; I am wondering if we can support it later.


1. **Can't talk to Prometheus.** It scrapes pods directly. Most teams already have Prometheus running -- the autoscaler can't use it.

2. **No per-model demand signal.** The router exposes `kthena_router_active_downstream_requests{model="..."}` which tells you how much traffic a model is getting *before* it hits backends. The built-in autoscaler doesn't use this.
Member

It does not support using Kthena router metrics.

- Modifying or replacing `AutoscalingPolicy` / `AutoscalingPolicyBinding`
- Building a custom metrics adapter
- Multi-model-per-ModelServing (we assume 1:1)
- Role-level scaling via KEDA (built-in autoscaler handles that with `subTargets.kind: Role`)
Member

Prefer leaving it as second-step work.

### Scale-up flow at runtime

```
User traffic → Router → Prometheus → KEDA → HPA → ModelServing
```
Member

Can you replace this with Mermaid, which is more readable?
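A minimal Mermaid sketch of the scale-up flow named above (node labels are illustrative):

```mermaid
flowchart LR
    U[User traffic] --> R[Router]
    P[Prometheus] -->|scrapes metrics| R
    K[KEDA] -->|PromQL query| P
    K -->|external metric| H[HPA]
    H -->|scales replicas| M[ModelServing]
```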


### How it works

1. Prometheus scrapes the router, collects `kthena_router_active_downstream_requests{model="..."}`.
Member

`kthena_router_active_downstream_requests` is just an example metric, not a must.


Right now operators have to manually match the `model` label in the PromQL query to the correct ModelServing CR. Easy to mess up.

We should add a `kthena.io/model-name` annotation on ModelServing. Doesn't need a new controller -- just a standard place to record the mapping. We can build tooling on top of it later if auto-generating ScaledObjects makes sense.
Member

Keep in mind that the model name from the router metric can differ from the real model name running in vLLM or SGLang, because we can do model-name matching in the router and route to the real backend model instance.
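To make that distinction explicit, the `kthena.io/model-name` annotation the proposal suggests could record the router-facing name (a sketch; the value is illustrative):

```yaml
metadata:
  annotations:
    # Router-facing model name, matching the `model` label in the PromQL query;
    # may differ from the model name the vLLM/SGLang backend itself reports.
    kthena.io/model-name: "llama-3-8b"   # illustrative value
```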

triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring.svc:9090
Member

Can use a placeholder here
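For example, the trigger could carry a templated placeholder instead of a hard-coded address (a sketch, to be substituted per cluster, e.g. via Helm values):

```yaml
triggers:
  - type: prometheus
    metadata:
      # Placeholder — replace with the cluster's Prometheus service address
      serverAddress: http://<prometheus-service>.<namespace>.svc:9090
```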

name: my-model-serving
minReplicaCount: 1
maxReplicaCount: 10
cooldownPeriod: 60
Member

This seems too short, as an LLM can take 5 minutes to start from cold.
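A longer cooldown, on the order of the model's cold-start time, would address this (600 s is an illustrative value, not a recommendation):

```yaml
minReplicaCount: 1
maxReplicaCount: 10
cooldownPeriod: 600   # ~10 min; should exceed the LLM's cold-start time
```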

Member

Isn't it the same as keda-autoscaling.md?

crd.json Outdated
@@ -0,0 +1,532 @@
i{
Member

What is this, and why is it in a proposal PR?

Copilot AI review requested due to automatic review settings April 8, 2026 08:59
@volcano-sh-bot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lizhencheng9527 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.



@WHOIM1205 (Contributor, Author)

Hey @hzxuzhonghu,
Thanks a lot for the detailed review; this was really helpful. I've addressed all the points:

  1. cleaned up root-level files
  2. reduced scope and separated future work
  3. replaced the diagrams with Mermaid
  4. made the metrics generic
  5. clarified the model-to-ModelServing mapping
  6. improved the examples and placeholders

Would really appreciate another look whenever you get time.
