Proposal: Prometheus + KEDA autoscaling for ModelServing by WHOIM1205 · Pull Request #868 · volcano-sh/kthena

WHOIM1205 · 2026-04-04T19:06:08Z

Proposal: Prometheus + KEDA autoscaling for ModelServing

This PR adds a design proposal for integrating Prometheus + KEDA-based autoscaling into Kthena ModelServing.

The proposal outlines:

Problem statement and gaps in current autoscaling
Proposed architecture (Prometheus → KEDA → HPA → ModelServing)
Key design decisions and trade-offs
Failure modes and rollout plan
Learnings from the working prototype

This builds on earlier work where the end-to-end flow has already been validated locally.

Understanding current Kthena autoscaler for HPA/KEDA integration #799 (umbrella issue)
feat: add KEDA autoscaling example manifests #831 (example manifests)
Add Prometheus monitoring and KEDA autoscaling support #836 (Helm integration)
Add KEDA autoscaling support for ModelServing via Prometheus #839 (controller fix for status.labelSelector)

Notes

This proposal introduces KEDA as an optional autoscaling path
It does not modify or replace the existing AutoscalingPolicy system
The goal is to align on design before continuing further implementation

gemini-code-assist

Code Review

This pull request introduces a proposal and supporting Kubernetes manifests for implementing KEDA and Prometheus-based autoscaling for ModelServing. The changes include new CRD definitions, deployment configurations, and a detailed design document. Feedback on the manifests identifies several critical issues: crd.json contains a leading typo and cluster-specific metadata that should be removed; fixed-crd.yaml is truncated and incomplete; and prom.yaml has invalid YAML syntax in its selector definitions. Additionally, the documentation includes redundant proposal files that should be consolidated.

gemini-code-assist · 2026-04-04T19:07:27Z

crd.json

@@ -0,0 +1,532 @@
+i{


The JSON file starts with an 'i' character, which is likely a typo (e.g., from a vim insert command). This makes the JSON invalid and unparseable.

Suggested change

i{

{

gemini-code-assist · 2026-04-04T19:07:27Z

fixed-crd.yaml

+                          properties:
+                            metadata:
+                              description: Object's metadata.
+                              properties:


This file appears to be truncated. It ends abruptly at line 249 with properties:, leaving the ModelServing CRD definition incomplete.

gemini-code-assist · 2026-04-04T19:07:27Z

crd.json

+    "metadata": {
+        "annotations": {
+            "controller-gen.kubebuilder.io/version": "v0.14.0",
+            "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"apiextensions.k8s.io/v1\",\"kind\":\"CustomResourceDefinition\",\"metadata\":{\"annotations\":{\"controller-gen.kubebuilder.io/version\":\"v0.14.0\"},\"labels\":{\"app.kubernetes.io/part-of\":\"keda-operator\",\"app.kubernetes.io/version\":\"2.14.0\"},\"name\":\"scaledobjects.keda.sh\"},\"spec\":{\"group\":\"keda.sh\",\"names\":{\"kind\":\"ScaledObject\",\"listKind\":\"ScaledObjectList\",\"plural\":\"scaledobjects\",\"shortNames\":[\"so\"],\"singular\":\"scaledobject\"},\"scope\":\"Namespaced\",\"versions\":[{\"additionalPrinterColumns\":[{\"jsonPath\":\".status.scaleTargetKind\",\"name\":\"ScaleTargetKind\",\"type\":\"string\"},{\"jsonPath\":\".spec.scaleTargetRef.name\",\"name\":\"ScaleTargetName\",\"type\":\"string\"},{\"jsonPath\":\".spec.minReplicaCount\",\"name\":\"Min\",\"type\":\"integer\"},{\"jsonPath\":\".spec.maxReplicaCount\",\"name\":\"Max\",\"type\":\"integer\"},{\"jsonPath\":\".spec.triggers[*].type\",\"name\":\"Triggers\",\"type\":\"string\"},{\"jsonPath\":\".spec.triggers[*].authenticationRef.name\",\"name\":\"Authentication\",\"type\":\"string\"},{\"jsonPath\":\".status.conditions[?(@.type==\\\"Ready\\\")].status\",\"name\":\"Ready\",\"type\":\"string\"},{\"jsonPath\":\".status.conditions[?(@.type==\\\"Active\\\")].status\",\"name\":\"Active\",\"type\":\"string\"},{\"jsonPath\":\".status.conditions[?(@.type==\\\"Fallback\\\")].status\",\"name\":\"Fallback\",\"type\":\"string\"},{\"jsonPath\":\".status.conditions[?(@.type==\\\"Paused\\\")].status\",\"name\":\"Paused\",\"type\":\"string\"},{\"jsonPath\":\".metadata.creationTimestamp\",\"name\":\"Age\",\"type\":\"date\"}],\"name\":\"v1alpha1\",\"schema\":{\"openAPIV3Schema\":{\"description\":\"ScaledObject is a specification for a ScaledObject resource\",\"properties\":{\"apiVersion\":{\"description\":\"APIVersion defines the versioned schema of this representation of an object.\\nServers should convert recognized 
schemas to the latest internal value, and\\nmay reject unrecognized values.\\nMore info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources\",\"type\":\"string\"},\"kind\":{\"description\":\"Kind is a string value representing the REST resource this object represents.\\nServers may infer this from the endpoint the client submits requests to.\\nCannot be updated.\\nIn CamelCase.\\nMore info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds\",\"type\":\"string\"},\"metadata\":{\"type\":\"object\"},\"spec\":{\"description\":\"ScaledObjectSpec is the spec for a ScaledObject resource\",\"properties\":{\"advanced\":{\"description\":\"AdvancedConfig specifies advance scaling options\",\"properties\":{\"horizontalPodAutoscalerConfig\":{\"description\":\"HorizontalPodAutoscalerConfig specifies horizontal scale config\",\"properties\":{\"behavior\":{\"description\":\"HorizontalPodAutoscalerBehavior configures the scaling behavior of the target\\nin both Up and Down directions (scaleUp and scaleDown fields respectively).\",\"properties\":{\"scaleDown\":{\"description\":\"scaleDown is scaling policy for scaling Down.\\nIf not set, the default value is to allow to scale down to minReplicas pods, with a\\n300 second stabilization window (i.e., the highest recommendation for\\nthe last 300sec is used).\",\"properties\":{\"policies\":{\"description\":\"policies is a list of potential scaling polices which can be used during scaling.\\nAt least one policy must be specified, otherwise the HPAScalingRules will be discarded as invalid\",\"items\":{\"description\":\"HPAScalingPolicy is a single policy which must hold true for a specified past interval.\",\"properties\":{\"periodSeconds\":{\"description\":\"periodSeconds specifies the window of time for which the policy should hold true.\\nPeriodSeconds must be greater than zero and less than or equal to 1800 (30 
min).\",\"format\":\"int32\",\"type\":\"integer\"},\"type\":{\"description\":\"type is used to specify the scaling policy.\",\"type\":\"string\"},\"value\":{\"description\":\"value contains the amount of change which is permitted by the policy.\\nIt must be greater than zero\",\"format\":\"int32\",\"type\":\"integer\"}},\"required\":[\"periodSeconds\",\"type\",\"value\"],\"type\":\"object\"},\"type\":\"array\",\"x-kubernetes-list-type\":\"atomic\"},\"selectPolicy\":{\"description\":\"selectPolicy is used to specify which policy should be used.\\nIf not set, the default value Max is used.\",\"type\":\"string\"},\"stabilizationWindowSeconds\":{\"description\":\"stabilizationWindowSeconds is the number of seconds for which past recommendations should be\\nconsidered while scaling up or scaling down.\\nStabilizationWindowSeconds must be greater than or equal to zero and less than or equal to 3600 (one hour).\\nIf not set, use the default values:\\n- For scale up: 0 (i.e. no stabilization is done).\\n- For scale down: 300 (i.e. 
the stabilization window is 300 seconds long).\",\"format\":\"int32\",\"maximum\":3600,\"minimum\":0,\"type\":\"integer\"}},\"type\":\"object\"},\"scaleUp\":{\"description\":\"scaleUp is scaling policy for scaling Up.\\nIf not set, the default value is the higher of:\\n  * increase no more than 4 pods per 60 seconds\\n  * double the number of pods per 60 seconds\\nNo stabilization is used.\",\"properties\":{\"policies\":{\"description\":\"policies is a list of potential scaling polices which can be used during scaling.\\nAt least one policy must be specified, otherwise the HPAScalingRules will be discarded as invalid\",\"items\":{\"description\":\"HPAScalingPolicy is a single policy which must hold true for a specified past interval.\",\"properties\":{\"periodSeconds\":{\"description\":\"periodSeconds specifies the window of time for which the policy should hold true.\\nPeriodSeconds must be greater than zero and less than or equal to 1800 (30 min).\",\"format\":\"int32\",\"type\":\"integer\"},\"type\":{\"description\":\"type is used to specify the scaling policy.\",\"type\":\"string\"},\"value\":{\"description\":\"value contains the amount of change which is permitted by the policy.\\nIt must be greater than zero\",\"format\":\"int32\",\"type\":\"integer\"}},\"required\":[\"periodSeconds\",\"type\",\"value\"],\"type\":\"object\"},\"type\":\"array\",\"x-kubernetes-list-type\":\"atomic\"},\"selectPolicy\":{\"description\":\"selectPolicy is used to specify which policy should be used.\\nIf not set, the default value Max is used.\",\"type\":\"string\"},\"stabilizationWindowSeconds\":{\"description\":\"stabilizationWindowSeconds is the number of seconds for which past recommendations should be\\nconsidered while scaling up or scaling down.\\nStabilizationWindowSeconds must be greater than or equal to zero and less than or equal to 3600 (one hour).\\nIf not set, use the default values:\\n- For scale up: 0 (i.e. no stabilization is done).\\n- For scale down: 300 (i.e. 
the stabilization window is 300 seconds long).\",\"format\":\"int32\",\"maximum\":3600,\"minimum\":0,\"type\":\"integer\"}},\"type\":\"object\"}},\"type\":\"object\"},\"name\":{\"type\":\"string\"}},\"type\":\"object\"},\"restoreToOriginalReplicaCount\":{\"type\":\"boolean\"},\"scalingModifiers\":{\"description\":\"ScalingModifiers describes advanced scaling logic options like formula\",\"properties\":{\"activationTarget\":{\"type\":\"string\"},\"formula\":{\"type\":\"string\"},\"metricType\":{\"description\":\"MetricTargetType specifies the type of metric being targeted, and should be either\\n\\\"Value\\\", \\\"AverageValue\\\", or \\\"Utilization\\\"\",\"type\":\"string\"},\"target\":{\"type\":\"string\"}},\"type\":\"object\"}},\"type\":\"object\"},\"cooldownPeriod\":{\"format\":\"int32\",\"type\":\"integer\"},\"fallback\":{\"description\":\"Fallback is the spec for fallback options\",\"properties\":{\"failureThreshold\":{\"format\":\"int32\",\"type\":\"integer\"},\"replicas\":{\"format\":\"int32\",\"type\":\"integer\"}},\"required\":[\"failureThreshold\",\"replicas\"],\"type\":\"object\"},\"idleReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"initialCooldownPeriod\":{\"format\":\"int32\",\"type\":\"integer\"},\"maxReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"minReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"pollingInterval\":{\"format\":\"int32\",\"type\":\"integer\"},\"scaleTargetRef\":{\"description\":\"ScaleTarget holds the reference to the scale target Object\",\"properties\":{\"apiVersion\":{\"type\":\"string\"},\"envSourceContainerName\":{\"type\":\"string\"},\"kind\":{\"type\":\"string\"},\"name\":{\"type\":\"string\"}},\"required\":[\"name\"],\"type\":\"object\"},\"triggers\":{\"items\":{\"description\":\"ScaleTriggers reference the scaler that will be used\",\"properties\":{\"authenticationRef\":{\"description\":\"AuthenticationRef points to the TriggerAuthentication or ClusterTriggerAuthentication object 
that\\nis used to authenticate the scaler with the environment\",\"properties\":{\"kind\":{\"description\":\"Kind of the resource being referred to. Defaults to TriggerAuthentication.\",\"type\":\"string\"},\"name\":{\"type\":\"string\"}},\"required\":[\"name\"],\"type\":\"object\"},\"metadata\":{\"additionalProperties\":{\"type\":\"string\"},\"type\":\"object\"},\"metricType\":{\"description\":\"MetricTargetType specifies the type of metric being targeted, and should be either\\n\\\"Value\\\", \\\"AverageValue\\\", or \\\"Utilization\\\"\",\"type\":\"string\"},\"name\":{\"type\":\"string\"},\"type\":{\"type\":\"string\"},\"useCachedMetrics\":{\"type\":\"boolean\"}},\"required\":[\"metadata\",\"type\"],\"type\":\"object\"},\"type\":\"array\"}},\"required\":[\"scaleTargetRef\",\"triggers\"],\"type\":\"object\"},\"status\":{\"description\":\"ScaledObjectStatus is the status for a ScaledObject resource\",\"properties\":{\"compositeScalerName\":{\"type\":\"string\"},\"conditions\":{\"description\":\"Conditions an array representation to store multiple Conditions\",\"items\":{\"description\":\"Condition to store the condition state\",\"properties\":{\"message\":{\"description\":\"A human readable message indicating details about the transition.\",\"type\":\"string\"},\"reason\":{\"description\":\"The reason for the condition's last transition.\",\"type\":\"string\"},\"status\":{\"description\":\"Status of the condition, one of True, False, Unknown.\",\"type\":\"string\"},\"type\":{\"description\":\"Type of condition\",\"type\":\"string\"}},\"required\":[\"status\",\"type\"],\"type\":\"object\"},\"type\":\"array\"},\"externalMetricNames\":{\"items\":{\"type\":\"string\"},\"type\":\"array\"},\"health\":{\"additionalProperties\":{\"description\":\"HealthStatus is the status for a ScaledObject's health\",\"properties\":{\"numberOfFailures\":{\"format\":\"int32\",\"type\":\"integer\"},\"status\":{\"description\":\"HealthStatusType is an indication of whether the health 
status is happy or failing\",\"type\":\"string\"}},\"type\":\"object\"},\"type\":\"object\"},\"hpaName\":{\"type\":\"string\"},\"lastActiveTime\":{\"format\":\"date-time\",\"type\":\"string\"},\"originalReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"pausedReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"resourceMetricNames\":{\"items\":{\"type\":\"string\"},\"type\":\"array\"},\"scaleTargetGVKR\":{\"description\":\"GroupVersionKindResource provides unified structure for schema.GroupVersionKind and Resource\",\"properties\":{\"group\":{\"type\":\"string\"},\"kind\":{\"type\":\"string\"},\"resource\":{\"type\":\"string\"},\"version\":{\"type\":\"string\"}},\"required\":[\"group\",\"kind\",\"resource\",\"version\"],\"type\":\"object\"},\"scaleTargetKind\":{\"type\":\"string\"}},\"type\":\"object\"}},\"required\":[\"spec\"],\"type\":\"object\"}},\"served\":true,\"storage\":true,\"subresources\":{\"status\":{}}}]}}\n"
+        },
+        "creationTimestamp": "2026-03-24T10:11:35Z",
+        "deletionGracePeriodSeconds": 0,
+        "deletionTimestamp": "2026-03-24T20:00:31Z",
+        "finalizers": [],
+        "generation": 1,
+        "labels": {
+            "app.kubernetes.io/part-of": "keda-operator",
+            "app.kubernetes.io/version": "2.14.0"
+        },
+        "name": "scaledobjects.keda.sh",
+        "resourceVersion": "19239",
+        "uid": "a141b8da-d181-4bc3-aceb-238dbafea7e8"
+    },


This manifest appears to be a raw export from a running cluster. It contains cluster-specific metadata such as uid, resourceVersion, creationTimestamp, and a large last-applied-configuration annotation. These should be removed to ensure the manifest is clean and portable.

gemini-code-assist · 2026-04-04T19:07:27Z

crd.json

+    "status": {
+        "acceptedNames": {
+            "kind": "ScaledObject",
+            "listKind": "ScaledObjectList",
+            "plural": "scaledobjects",
+            "shortNames": [
+                "so"
+            ],
+            "singular": "scaledobject"
+        },
+        "conditions": [
+            {
+                "lastTransitionTime": "2026-03-24T10:11:35Z",
+                "message": "no conflicts found",
+                "reason": "NoConflicts",
+                "status": "True",
+                "type": "NamesAccepted"
+            },
+            {
+                "lastTransitionTime": "2026-03-24T10:11:35Z",
+                "message": "the initial names have been accepted",
+                "reason": "InitialNamesAccepted",
+                "status": "True",
+                "type": "Established"
+            },
+            {
+                "lastTransitionTime": "2026-03-24T20:00:31Z",
+                "message": "CustomResource deletion is in progress",
+                "reason": "InstanceDeletionInProgress",
+                "status": "True",
+                "type": "Terminating"
+            }
+        ],
+        "storedVersions": [
+            "v1alpha1"
+        ]
+    }


The status block should not be included in a static CRD definition file stored in the repository. Please remove this section.

gemini-code-assist · 2026-04-04T19:07:27Z

prom.yaml

+  podMonitorSelector: {}
+    matchLabels:
+      release: prometheus


The podMonitorSelector is incorrectly defined. The empty braces {} on line 52 make the subsequent matchLabels block orphaned or invalid depending on the YAML parser.

podMonitorSelector:
  matchLabels:
    release: prometheus

gemini-code-assist · 2026-04-04T19:07:27Z

prom.yaml

+  serviceMonitorSelector: {}
+    matchLabels:
+      release: prometheus


The serviceMonitorSelector is incorrectly defined. The empty braces {} on line 81 make the subsequent matchLabels block orphaned.

serviceMonitorSelector:
  matchLabels:
    release: prometheus

gemini-code-assist · 2026-04-04T19:07:27Z

docs/proposals/keda-prometheus-autoscaling.md

@@ -0,0 +1,359 @@
+# Proposal: KEDA + Prometheus Autoscaling for Kthena ModelServing


This file is redundant as it is nearly identical to docs/proposals/keda-autoscaling.md. Please consolidate the proposal into a single file to avoid confusion and maintenance overhead.

Copilot

Pull request overview

This PR proposes adding Prometheus + KEDA-based autoscaling support for Kthena ModelServing. The proposal is comprehensive, addressing the limitation of the existing AutoscalingPolicy that cannot directly query Prometheus and lacks per-model demand signals from the router.

Changes:

Design proposal document for KEDA + Prometheus autoscaling architecture, including problem statement, proposed approach, design decisions, failure modes, and rollout plan
Example YAML artifacts and CRD definitions (though these appear to be temporary test files that should not be committed)

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| docs/proposals/keda-prometheus-autoscaling.md | Comprehensive design proposal covering architecture, metrics, design decisions, alternatives, failure modes, and rollout plan |
| docs/proposals/keda-autoscaling.md | Near-duplicate of the above proposal with only date difference |
| prom.yaml | Prometheus CRD example with YAML syntax errors |
| crd.json | KEDA ScaledObject CRD definition with JSON syntax error at the start |
| svc-backup.yaml, deploy-backup.yaml, service.yaml, fixed-crd.yaml | Test/backup artifacts that appear unrelated to this proposal |

Copilot · 2026-04-04T19:08:23Z

prom.yaml

+  podMonitorSelector: {}
+    matchLabels:
+      release: prometheus


This file contains malformed YAML. Lines 52-54 show podMonitorSelector: {} followed by indented matchLabels: which is invalid YAML structure. An empty {} object cannot have properties. Similarly, lines 81-83 show the same issue with serviceMonitorSelector:. The selectors should either be {} (empty) or contain the matchLabels: at the same indentation level without the preceding {}.
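
As a rough sketch of the two valid shapes described above (not the actual contents of `prom.yaml`, which is only partially quoted here), the selectors in a Prometheus Operator `Prometheus` spec could be written either way. The `release: prometheus` label comes from the diff; pairing one form with each selector is purely illustrative.

```yaml
# Fragment of a Prometheus custom resource spec (monitoring.coreos.com/v1).
spec:
  # Form 1: select only PodMonitors that carry the release label.
  podMonitorSelector:
    matchLabels:
      release: prometheus
  # Form 2: an empty selector, which matches every ServiceMonitor in the
  # namespaces chosen by serviceMonitorNamespaceSelector.
  serviceMonitorSelector: {}
```

Either form parses cleanly; the error in the PR comes from combining them on the same key.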

Copilot · 2026-04-04T19:08:24Z

docs/proposals/keda-autoscaling.md

+**Status:** Draft
+**Authors:** @david_laid
+**Date:** 2026-04-02
+
+### Related PRs
+
+| PR | Description |
+|----|-------------|
+| [#831](https://github.com/volcano-sh/kthena/pull/831) | Initial prototype -- example manifests (ServiceMonitor, PodMonitor, ScaledObject) |
+| [#836](https://github.com/volcano-sh/kthena/pull/836) | Helm integration -- templates + `values.yaml` for monitoring and autoscaling |
+| [#839](https://github.com/volcano-sh/kthena/pull/839) | Controller fix -- populates `status.labelSelector` so HPA can actually find pods |
+
+All three reference [#799](https://github.com/volcano-sh/kthena/issues/799). We've validated the full flow end-to-end before writing this up.
+
+---
+
+## 1. Problem
+
+LLM inference traffic is bursty. Without autoscaling, you either overprovision GPUs or eat latency spikes when load surges.
+
+Kthena already has an autoscaler (`AutoscalingPolicy` + `AutoscalingPolicyBinding`). It scrapes metrics from pod endpoints, has panic mode, supports heterogeneous cost-optimized scaling. Works fine for pod-level signals like `kthena:num_requests_waiting`.
+
+Where it falls short:
+
+1. **Can't talk to Prometheus.** It scrapes pods directly. Most teams already have Prometheus running -- the autoscaler can't use it.
+
+2. **No per-model demand signal.** The router exposes `kthena_router_active_downstream_requests{model="..."}` which tells you how much traffic a model is getting *before* it hits backends. The built-in autoscaler doesn't use this.
+
+3. **Extra moving parts.** Teams already running KEDA end up maintaining two autoscaling systems side by side.
+
+The goal here is to add KEDA as an optional autoscaling path. We're not touching AutoscalingPolicy.
+
+### Non-goals
+
+- Modifying or replacing `AutoscalingPolicy` / `AutoscalingPolicyBinding`
+- Building a custom metrics adapter
+- Multi-model-per-ModelServing (we assume 1:1)
+- Role-level scaling via KEDA (built-in autoscaler handles that with `subTargets.kind: Role`)
+- Auto-generating ScaledObjects (Phase 3 at earliest)
+
+---
+
+## 2. Proposed Approach
+


This is a duplicate of docs/proposals/keda-prometheus-autoscaling.md with only a different date (2026-04-02 vs 2026-03-31). Having two identical proposal files in the repository is confusing and should be avoided. Consider removing this file and keeping only the more recent version, or clarifying why both files are needed.

Suggested change

**Status:** Superseded
**Authors:** @david_laid
**Date:** 2026-04-02

This file previously duplicated the full proposal now maintained at [`docs/proposals/keda-prometheus-autoscaling.md`](docs/proposals/keda-prometheus-autoscaling.md). To avoid keeping two identical proposal documents in the repository, that file is the canonical version and should be used for all future updates and references.

This file is retained only as a compatibility pointer for any existing links to `docs/proposals/keda-autoscaling.md`.

Copilot · 2026-04-04T19:08:24Z

crd.json

@@ -0,0 +1,532 @@
+i{


The JSON file starts with invalid syntax: i{ on line 1 instead of just {. This makes the entire JSON file invalid and unparseable. This appears to be a typo or corruption.

Suggested change

i{

{

hzxuzhonghu

Please remove all the YAML files in the root directory.

hzxuzhonghu · 2026-04-08T03:07:01Z

docs/proposals/keda-autoscaling.md

+
+Where it falls short:
+
+1. **Can't talk to Prometheus.** It scrapes pods directly. Most teams already have Prometheus running -- the autoscaler can't use it.


Thanks for the point. I am wondering whether we can support it later.

hzxuzhonghu · 2026-04-08T03:07:44Z

docs/proposals/keda-autoscaling.md

+
+1. **Can't talk to Prometheus.** It scrapes pods directly. Most teams already have Prometheus running -- the autoscaler can't use it.
+
+2. **No per-model demand signal.** The router exposes `kthena_router_active_downstream_requests{model="..."}` which tells you how much traffic a model is getting *before* it hits backends. The built-in autoscaler doesn't use this.


It does not support using Kthena router metrics.

hzxuzhonghu · 2026-04-08T03:08:46Z

docs/proposals/keda-autoscaling.md

+- Modifying or replacing `AutoscalingPolicy` / `AutoscalingPolicyBinding`
+- Building a custom metrics adapter
+- Multi-model-per-ModelServing (we assume 1:1)
+- Role-level scaling via KEDA (built-in autoscaler handles that with `subTargets.kind: Role`)


I'd prefer leaving this as second-step work.

hzxuzhonghu · 2026-04-08T03:13:18Z

docs/proposals/keda-autoscaling.md

+### Scale-up flow at runtime
+
+```
+  User traffic              Router               Prometheus           KEDA              HPA             ModelServing


Can you replace this with a Mermaid diagram? It would be more readable.

hzxuzhonghu · 2026-04-08T03:16:10Z

docs/proposals/keda-autoscaling.md

+
+### How it works
+
+1. Prometheus scrapes the router, collects `kthena_router_active_downstream_requests{model="..."}`.


`kthena_router_active_downstream_requests` is just an example metric, not a requirement.

hzxuzhonghu · 2026-04-08T03:35:53Z

docs/proposals/keda-autoscaling.md

+
+Right now operators have to manually match the `model` label in the PromQL query to the correct ModelServing CR. Easy to mess up.
+
+We should add a `kthena.io/model-name` annotation on ModelServing. Doesn't need a new controller -- just a standard place to record the mapping. We can build tooling on top of it later if auto-generating ScaledObjects makes sense.


Keep in mind that the model name in the router metric can differ from the real model name running in vLLM or SGLang, because the router can match on the model name and route to the real backend model instance.
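
One possible reading of this point, sketched below, is that the proposed `kthena.io/model-name` annotation would record the router-facing name (the value of the `model` label in router metrics) rather than the name the backend engine serves. That interpretation, the `apiVersion`, the resource name, and the placeholder value are all assumptions for illustration; the thread does not settle them.

```yaml
# Metadata fragment only; the ModelServing spec is omitted.
apiVersion: workload.serving.volcano.sh/v1alpha1   # assumed API group/version
kind: ModelServing
metadata:
  name: my-model-serving
  annotations:
    # Assumed semantics: the name the router exposes in its metrics,
    # which may differ from the model name the backend reports.
    kthena.io/model-name: "<router-model-name>"
```

Whichever name is recorded, the PromQL query in the ScaledObject would need to use the same one, since it filters on the router's `model` label.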

hzxuzhonghu · 2026-04-08T03:37:38Z

docs/proposals/keda-autoscaling.md

+  triggers:
+    - type: prometheus
+      metadata:
+        serverAddress: http://prometheus.monitoring.svc:9090


You can use a placeholder here.

hzxuzhonghu · 2026-04-08T03:38:20Z

docs/proposals/keda-autoscaling.md

+    name: my-model-serving
+  minReplicaCount: 1
+  maxReplicaCount: 10
+  cooldownPeriod: 60


This seems too short, as an LLM can take 5 minutes to start from scratch.
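
A sketch of how the example might look with this and the placeholder comment applied. As far as KEDA's documented behaviour goes, `cooldownPeriod` only governs scaling back to zero, so with `minReplicaCount: 1` the scale-down pace between replicas is set by the HPA stabilization window instead (300 seconds by default). The `apiVersion` under `scaleTargetRef`, the ScaledObject name, the 600-second values, and the threshold are assumptions, not values taken from the proposal.

```yaml
# Illustrative ScaledObject; values are placeholders, not recommendations.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-model-serving-scaler          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: workload.serving.volcano.sh/v1alpha1   # assumed API group/version
    kind: ModelServing
    name: my-model-serving
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 600                    # assumed; only applies when scaling to zero
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          # Assumed value, chosen to exceed a roughly 5-minute cold start.
          stabilizationWindowSeconds: 600
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://<prometheus-host>:9090   # placeholder
        query: 'sum(kthena_router_active_downstream_requests{model="<model-name>"})'
        threshold: "5"                   # assumed target value
```

The metric in `query` is only an example, as noted earlier in the review; any Prometheus series that tracks demand for the model would fit the same shape.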

hzxuzhonghu · 2026-04-08T03:39:37Z

docs/proposals/keda-prometheus-autoscaling.md

Isn't this the same as keda-autoscaling.md?

hzxuzhonghu · 2026-04-08T03:40:30Z

crd.json

@@ -0,0 +1,532 @@
+i{


What's this, and why is it in a proposal PR?

volcano-sh-bot · 2026-04-08T08:59:31Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lizhencheng9527 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

docs/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

WHOIM1205 · 2026-04-08T09:01:23Z

Hey @hzxuzhonghu,
thanks a lot for the detailed review, this was really helpful.
I've addressed all the points:

cleaned up root-level files
reduced scope and separated future work
replaced diagrams with mermaid
made metrics generic
clarified model-to-ModelServing mapping
improved examples and placeholders
Would really appreciate another look whenever you get time.

Copilot AI review requested due to automatic review settings April 4, 2026 19:06

volcano-sh-bot requested review from YaoZengzeng and hzxuzhonghu April 4, 2026 19:06

volcano-sh-bot added the size/XXL label Apr 4, 2026

Copilot started reviewing on behalf of WHOIM1205 April 4, 2026 19:06

gemini-code-assist bot reviewed Apr 4, 2026

Copilot AI reviewed Apr 4, 2026

add proposal for keda autoscaling

ae3308a

Signed-off-by: WHOIM1205 <[email protected]>

WHOIM1205 force-pushed the keda-autoscaling-proposal branch from 6815263 to ae3308a Compare April 5, 2026 17:14

WHOIM1205 mentioned this pull request Apr 7, 2026

examples: add kthena.io/model-name annotation to ModelServing samples #871

Open

hzxuzhonghu reviewed Apr 8, 2026

refine keda autoscaling proposal based on review feedback

827e7a9

Signed-off-by: WHOIM1205 <[email protected]>

Copilot AI review requested due to automatic review settings April 8, 2026 08:59

volcano-sh-bot added size/L and removed size/XXL labels Apr 8, 2026

Copilot started reviewing on behalf of WHOIM1205 April 8, 2026 09:00

Copilot AI reviewed Apr 8, 2026

WHOIM1205 mentioned this pull request Apr 9, 2026

refactor(model-serving): remove controller label propagation and add autoscaling example #877

Open
