Proposal: Prometheus + KEDA autoscaling for ModelServing #868
WHOIM1205 wants to merge 2 commits into volcano-sh:main
Code Review
This pull request introduces a proposal and supporting Kubernetes manifests for implementing KEDA and Prometheus-based autoscaling for ModelServing. The changes include new CRD definitions, deployment configurations, and a detailed design document. Feedback on the manifests identifies several critical issues: crd.json contains a leading typo and cluster-specific metadata that should be removed; fixed-crd.yaml is truncated and incomplete; and prom.yaml has invalid YAML syntax in its selector definitions. Additionally, the documentation includes redundant proposal files that should be consolidated.
crd.json
Outdated
| @@ -0,0 +1,532 @@ | |||
| i{ | |||
fixed-crd.yaml
Outdated
| properties: | ||
| metadata: | ||
| description: Object's metadata. | ||
| properties: |
crd.json
Outdated
| "metadata": { | ||
| "annotations": { | ||
| "controller-gen.kubebuilder.io/version": "v0.14.0", | ||
| "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"apiextensions.k8s.io/v1\",\"kind\":\"CustomResourceDefinition\",\"metadata\":{\"annotations\":{\"controller-gen.kubebuilder.io/version\":\"v0.14.0\"},\"labels\":{\"app.kubernetes.io/part-of\":\"keda-operator\",\"app.kubernetes.io/version\":\"2.14.0\"},\"name\":\"scaledobjects.keda.sh\"},\"spec\":{\"group\":\"keda.sh\",\"names\":{\"kind\":\"ScaledObject\",\"listKind\":\"ScaledObjectList\",\"plural\":\"scaledobjects\",\"shortNames\":[\"so\"],\"singular\":\"scaledobject\"},\"scope\":\"Namespaced\",\"versions\":[{\"additionalPrinterColumns\":[{\"jsonPath\":\".status.scaleTargetKind\",\"name\":\"ScaleTargetKind\",\"type\":\"string\"},{\"jsonPath\":\".spec.scaleTargetRef.name\",\"name\":\"ScaleTargetName\",\"type\":\"string\"},{\"jsonPath\":\".spec.minReplicaCount\",\"name\":\"Min\",\"type\":\"integer\"},{\"jsonPath\":\".spec.maxReplicaCount\",\"name\":\"Max\",\"type\":\"integer\"},{\"jsonPath\":\".spec.triggers[*].type\",\"name\":\"Triggers\",\"type\":\"string\"},{\"jsonPath\":\".spec.triggers[*].authenticationRef.name\",\"name\":\"Authentication\",\"type\":\"string\"},{\"jsonPath\":\".status.conditions[?(@.type==\\\"Ready\\\")].status\",\"name\":\"Ready\",\"type\":\"string\"},{\"jsonPath\":\".status.conditions[?(@.type==\\\"Active\\\")].status\",\"name\":\"Active\",\"type\":\"string\"},{\"jsonPath\":\".status.conditions[?(@.type==\\\"Fallback\\\")].status\",\"name\":\"Fallback\",\"type\":\"string\"},{\"jsonPath\":\".status.conditions[?(@.type==\\\"Paused\\\")].status\",\"name\":\"Paused\",\"type\":\"string\"},{\"jsonPath\":\".metadata.creationTimestamp\",\"name\":\"Age\",\"type\":\"date\"}],\"name\":\"v1alpha1\",\"schema\":{\"openAPIV3Schema\":{\"description\":\"ScaledObject is a specification for a ScaledObject resource\",\"properties\":{\"apiVersion\":{\"description\":\"APIVersion defines the versioned schema of this representation of an object.\\nServers should convert recognized schemas to the 
latest internal value, and\\nmay reject unrecognized values.\\nMore info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources\",\"type\":\"string\"},\"kind\":{\"description\":\"Kind is a string value representing the REST resource this object represents.\\nServers may infer this from the endpoint the client submits requests to.\\nCannot be updated.\\nIn CamelCase.\\nMore info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds\",\"type\":\"string\"},\"metadata\":{\"type\":\"object\"},\"spec\":{\"description\":\"ScaledObjectSpec is the spec for a ScaledObject resource\",\"properties\":{\"advanced\":{\"description\":\"AdvancedConfig specifies advance scaling options\",\"properties\":{\"horizontalPodAutoscalerConfig\":{\"description\":\"HorizontalPodAutoscalerConfig specifies horizontal scale config\",\"properties\":{\"behavior\":{\"description\":\"HorizontalPodAutoscalerBehavior configures the scaling behavior of the target\\nin both Up and Down directions (scaleUp and scaleDown fields respectively).\",\"properties\":{\"scaleDown\":{\"description\":\"scaleDown is scaling policy for scaling Down.\\nIf not set, the default value is to allow to scale down to minReplicas pods, with a\\n300 second stabilization window (i.e., the highest recommendation for\\nthe last 300sec is used).\",\"properties\":{\"policies\":{\"description\":\"policies is a list of potential scaling polices which can be used during scaling.\\nAt least one policy must be specified, otherwise the HPAScalingRules will be discarded as invalid\",\"items\":{\"description\":\"HPAScalingPolicy is a single policy which must hold true for a specified past interval.\",\"properties\":{\"periodSeconds\":{\"description\":\"periodSeconds specifies the window of time for which the policy should hold true.\\nPeriodSeconds must be greater than zero and less than or equal to 1800 (30 
min).\",\"format\":\"int32\",\"type\":\"integer\"},\"type\":{\"description\":\"type is used to specify the scaling policy.\",\"type\":\"string\"},\"value\":{\"description\":\"value contains the amount of change which is permitted by the policy.\\nIt must be greater than zero\",\"format\":\"int32\",\"type\":\"integer\"}},\"required\":[\"periodSeconds\",\"type\",\"value\"],\"type\":\"object\"},\"type\":\"array\",\"x-kubernetes-list-type\":\"atomic\"},\"selectPolicy\":{\"description\":\"selectPolicy is used to specify which policy should be used.\\nIf not set, the default value Max is used.\",\"type\":\"string\"},\"stabilizationWindowSeconds\":{\"description\":\"stabilizationWindowSeconds is the number of seconds for which past recommendations should be\\nconsidered while scaling up or scaling down.\\nStabilizationWindowSeconds must be greater than or equal to zero and less than or equal to 3600 (one hour).\\nIf not set, use the default values:\\n- For scale up: 0 (i.e. no stabilization is done).\\n- For scale down: 300 (i.e. 
the stabilization window is 300 seconds long).\",\"format\":\"int32\",\"maximum\":3600,\"minimum\":0,\"type\":\"integer\"}},\"type\":\"object\"},\"scaleUp\":{\"description\":\"scaleUp is scaling policy for scaling Up.\\nIf not set, the default value is the higher of:\\n * increase no more than 4 pods per 60 seconds\\n * double the number of pods per 60 seconds\\nNo stabilization is used.\",\"properties\":{\"policies\":{\"description\":\"policies is a list of potential scaling polices which can be used during scaling.\\nAt least one policy must be specified, otherwise the HPAScalingRules will be discarded as invalid\",\"items\":{\"description\":\"HPAScalingPolicy is a single policy which must hold true for a specified past interval.\",\"properties\":{\"periodSeconds\":{\"description\":\"periodSeconds specifies the window of time for which the policy should hold true.\\nPeriodSeconds must be greater than zero and less than or equal to 1800 (30 min).\",\"format\":\"int32\",\"type\":\"integer\"},\"type\":{\"description\":\"type is used to specify the scaling policy.\",\"type\":\"string\"},\"value\":{\"description\":\"value contains the amount of change which is permitted by the policy.\\nIt must be greater than zero\",\"format\":\"int32\",\"type\":\"integer\"}},\"required\":[\"periodSeconds\",\"type\",\"value\"],\"type\":\"object\"},\"type\":\"array\",\"x-kubernetes-list-type\":\"atomic\"},\"selectPolicy\":{\"description\":\"selectPolicy is used to specify which policy should be used.\\nIf not set, the default value Max is used.\",\"type\":\"string\"},\"stabilizationWindowSeconds\":{\"description\":\"stabilizationWindowSeconds is the number of seconds for which past recommendations should be\\nconsidered while scaling up or scaling down.\\nStabilizationWindowSeconds must be greater than or equal to zero and less than or equal to 3600 (one hour).\\nIf not set, use the default values:\\n- For scale up: 0 (i.e. no stabilization is done).\\n- For scale down: 300 (i.e. 
the stabilization window is 300 seconds long).\",\"format\":\"int32\",\"maximum\":3600,\"minimum\":0,\"type\":\"integer\"}},\"type\":\"object\"}},\"type\":\"object\"},\"name\":{\"type\":\"string\"}},\"type\":\"object\"},\"restoreToOriginalReplicaCount\":{\"type\":\"boolean\"},\"scalingModifiers\":{\"description\":\"ScalingModifiers describes advanced scaling logic options like formula\",\"properties\":{\"activationTarget\":{\"type\":\"string\"},\"formula\":{\"type\":\"string\"},\"metricType\":{\"description\":\"MetricTargetType specifies the type of metric being targeted, and should be either\\n\\\"Value\\\", \\\"AverageValue\\\", or \\\"Utilization\\\"\",\"type\":\"string\"},\"target\":{\"type\":\"string\"}},\"type\":\"object\"}},\"type\":\"object\"},\"cooldownPeriod\":{\"format\":\"int32\",\"type\":\"integer\"},\"fallback\":{\"description\":\"Fallback is the spec for fallback options\",\"properties\":{\"failureThreshold\":{\"format\":\"int32\",\"type\":\"integer\"},\"replicas\":{\"format\":\"int32\",\"type\":\"integer\"}},\"required\":[\"failureThreshold\",\"replicas\"],\"type\":\"object\"},\"idleReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"initialCooldownPeriod\":{\"format\":\"int32\",\"type\":\"integer\"},\"maxReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"minReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"pollingInterval\":{\"format\":\"int32\",\"type\":\"integer\"},\"scaleTargetRef\":{\"description\":\"ScaleTarget holds the reference to the scale target Object\",\"properties\":{\"apiVersion\":{\"type\":\"string\"},\"envSourceContainerName\":{\"type\":\"string\"},\"kind\":{\"type\":\"string\"},\"name\":{\"type\":\"string\"}},\"required\":[\"name\"],\"type\":\"object\"},\"triggers\":{\"items\":{\"description\":\"ScaleTriggers reference the scaler that will be used\",\"properties\":{\"authenticationRef\":{\"description\":\"AuthenticationRef points to the TriggerAuthentication or ClusterTriggerAuthentication object 
that\\nis used to authenticate the scaler with the environment\",\"properties\":{\"kind\":{\"description\":\"Kind of the resource being referred to. Defaults to TriggerAuthentication.\",\"type\":\"string\"},\"name\":{\"type\":\"string\"}},\"required\":[\"name\"],\"type\":\"object\"},\"metadata\":{\"additionalProperties\":{\"type\":\"string\"},\"type\":\"object\"},\"metricType\":{\"description\":\"MetricTargetType specifies the type of metric being targeted, and should be either\\n\\\"Value\\\", \\\"AverageValue\\\", or \\\"Utilization\\\"\",\"type\":\"string\"},\"name\":{\"type\":\"string\"},\"type\":{\"type\":\"string\"},\"useCachedMetrics\":{\"type\":\"boolean\"}},\"required\":[\"metadata\",\"type\"],\"type\":\"object\"},\"type\":\"array\"}},\"required\":[\"scaleTargetRef\",\"triggers\"],\"type\":\"object\"},\"status\":{\"description\":\"ScaledObjectStatus is the status for a ScaledObject resource\",\"properties\":{\"compositeScalerName\":{\"type\":\"string\"},\"conditions\":{\"description\":\"Conditions an array representation to store multiple Conditions\",\"items\":{\"description\":\"Condition to store the condition state\",\"properties\":{\"message\":{\"description\":\"A human readable message indicating details about the transition.\",\"type\":\"string\"},\"reason\":{\"description\":\"The reason for the condition's last transition.\",\"type\":\"string\"},\"status\":{\"description\":\"Status of the condition, one of True, False, Unknown.\",\"type\":\"string\"},\"type\":{\"description\":\"Type of condition\",\"type\":\"string\"}},\"required\":[\"status\",\"type\"],\"type\":\"object\"},\"type\":\"array\"},\"externalMetricNames\":{\"items\":{\"type\":\"string\"},\"type\":\"array\"},\"health\":{\"additionalProperties\":{\"description\":\"HealthStatus is the status for a ScaledObject's health\",\"properties\":{\"numberOfFailures\":{\"format\":\"int32\",\"type\":\"integer\"},\"status\":{\"description\":\"HealthStatusType is an indication of whether the health 
status is happy or failing\",\"type\":\"string\"}},\"type\":\"object\"},\"type\":\"object\"},\"hpaName\":{\"type\":\"string\"},\"lastActiveTime\":{\"format\":\"date-time\",\"type\":\"string\"},\"originalReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"pausedReplicaCount\":{\"format\":\"int32\",\"type\":\"integer\"},\"resourceMetricNames\":{\"items\":{\"type\":\"string\"},\"type\":\"array\"},\"scaleTargetGVKR\":{\"description\":\"GroupVersionKindResource provides unified structure for schema.GroupVersionKind and Resource\",\"properties\":{\"group\":{\"type\":\"string\"},\"kind\":{\"type\":\"string\"},\"resource\":{\"type\":\"string\"},\"version\":{\"type\":\"string\"}},\"required\":[\"group\",\"kind\",\"resource\",\"version\"],\"type\":\"object\"},\"scaleTargetKind\":{\"type\":\"string\"}},\"type\":\"object\"}},\"required\":[\"spec\"],\"type\":\"object\"}},\"served\":true,\"storage\":true,\"subresources\":{\"status\":{}}}]}}\n" | ||
| }, | ||
| "creationTimestamp": "2026-03-24T10:11:35Z", | ||
| "deletionGracePeriodSeconds": 0, | ||
| "deletionTimestamp": "2026-03-24T20:00:31Z", | ||
| "finalizers": [], | ||
| "generation": 1, | ||
| "labels": { | ||
| "app.kubernetes.io/part-of": "keda-operator", | ||
| "app.kubernetes.io/version": "2.14.0" | ||
| }, | ||
| "name": "scaledobjects.keda.sh", | ||
| "resourceVersion": "19239", | ||
| "uid": "a141b8da-d181-4bc3-aceb-238dbafea7e8" | ||
| }, |
crd.json
Outdated
| "status": { | ||
| "acceptedNames": { | ||
| "kind": "ScaledObject", | ||
| "listKind": "ScaledObjectList", | ||
| "plural": "scaledobjects", | ||
| "shortNames": [ | ||
| "so" | ||
| ], | ||
| "singular": "scaledobject" | ||
| }, | ||
| "conditions": [ | ||
| { | ||
| "lastTransitionTime": "2026-03-24T10:11:35Z", | ||
| "message": "no conflicts found", | ||
| "reason": "NoConflicts", | ||
| "status": "True", | ||
| "type": "NamesAccepted" | ||
| }, | ||
| { | ||
| "lastTransitionTime": "2026-03-24T10:11:35Z", | ||
| "message": "the initial names have been accepted", | ||
| "reason": "InitialNamesAccepted", | ||
| "status": "True", | ||
| "type": "Established" | ||
| }, | ||
| { | ||
| "lastTransitionTime": "2026-03-24T20:00:31Z", | ||
| "message": "CustomResource deletion is in progress", | ||
| "reason": "InstanceDeletionInProgress", | ||
| "status": "True", | ||
| "type": "Terminating" | ||
| } | ||
| ], | ||
| "storedVersions": [ | ||
| "v1alpha1" | ||
| ] | ||
| } |
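The `metadata` and `status` hunks above look like an object read back from a live cluster rather than a source manifest: `creationTimestamp`, `deletionTimestamp`, `resourceVersion`, `uid`, the `last-applied-configuration` annotation, and the whole `status` block are populated by the API server. If the CRD were kept at all (a later comment asks for the root-level manifests to be removed), a cleaned header might look like the sketch below. It is written in YAML for brevity although the file in the PR is JSON, and the `spec` is assumed to stay exactly as exported.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: scaledobjects.keda.sh
  annotations:
    controller-gen.kubebuilder.io/version: v0.14.0
  labels:
    app.kubernetes.io/part-of: keda-operator
    app.kubernetes.io/version: "2.14.0"
# spec: unchanged from the exported object (omitted in this sketch)
# status: dropped entirely; the API server fills it in after apply
```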
prom.yaml
Outdated
| podMonitorSelector: {} | ||
| matchLabels: | ||
| release: prometheus |
prom.yaml
Outdated
| serviceMonitorSelector: {} | ||
| matchLabels: | ||
| release: prometheus |
| @@ -0,0 +1,359 @@ | |||
| # Proposal: KEDA + Prometheus Autoscaling for Kthena ModelServing | |||
Pull request overview
This PR proposes adding Prometheus + KEDA-based autoscaling support for Kthena ModelServing. The proposal is comprehensive, addressing the limitation of the existing AutoscalingPolicy that cannot directly query Prometheus and lacks per-model demand signals from the router.
Changes:
- Design proposal document for KEDA + Prometheus autoscaling architecture, including problem statement, proposed approach, design decisions, failure modes, and rollout plan
- Example YAML artifacts and CRD definitions (though these appear to be temporary test files that should not be committed)
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| docs/proposals/keda-prometheus-autoscaling.md | Comprehensive design proposal covering architecture, metrics, design decisions, alternatives, failure modes, and rollout plan |
| docs/proposals/keda-autoscaling.md | Near-duplicate of the above proposal with only date difference |
| prom.yaml | Prometheus CRD example with YAML syntax errors |
| crd.json | KEDA ScaledObject CRD definition with JSON syntax error at the start |
| svc-backup.yaml, deploy-backup.yaml, service.yaml, fixed-crd.yaml | Test/backup artifacts that appear unrelated to this proposal |
prom.yaml
Outdated
| podMonitorSelector: {} | ||
| matchLabels: | ||
| release: prometheus |
This file contains malformed YAML. Lines 52-54 show podMonitorSelector: {} followed by indented matchLabels: which is invalid YAML structure. An empty {} object cannot have properties. Similarly, lines 81-83 show the same issue with serviceMonitorSelector:. The selectors should either be {} (empty) or contain the matchLabels: at the same indentation level without the preceding {}.
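A minimal sketch of the two valid forms this comment describes, reusing the `release: prometheus` label from the diff. The `spec:` nesting is an assumption about where these fields sit in `prom.yaml`; in the Prometheus Operator API they are fields of the Prometheus resource spec.

```yaml
# Form 1: select only monitors that carry the label from the diff.
spec:
  podMonitorSelector:
    matchLabels:
      release: prometheus
  serviceMonitorSelector:
    matchLabels:
      release: prometheus
---
# Form 2: empty selectors, which match every PodMonitor / ServiceMonitor
# in the namespaces the Prometheus resource is allowed to watch.
spec:
  podMonitorSelector: {}
  serviceMonitorSelector: {}
```

The two forms behave differently, so the choice depends on whether the proposal wants Prometheus to pick up all monitors or only those labeled for this release.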
| **Status:** Draft | ||
| **Authors:** @david_laid | ||
| **Date:** 2026-04-02 | ||
| ### Related PRs | ||
| | PR | Description | | ||
| |----|-------------| | ||
| | [#831](https://github.com/volcano-sh/kthena/pull/831) | Initial prototype -- example manifests (ServiceMonitor, PodMonitor, ScaledObject) | | ||
| | [#836](https://github.com/volcano-sh/kthena/pull/836) | Helm integration -- templates + `values.yaml` for monitoring and autoscaling | | ||
| | [#839](https://github.com/volcano-sh/kthena/pull/839) | Controller fix -- populates `status.labelSelector` so HPA can actually find pods | | ||
| All three reference [#799](https://github.com/volcano-sh/kthena/issues/799). We've validated the full flow end-to-end before writing this up. | ||
| --- | ||
| ## 1. Problem | ||
| LLM inference traffic is bursty. Without autoscaling, you either overprovision GPUs or eat latency spikes when load surges. | ||
| Kthena already has an autoscaler (`AutoscalingPolicy` + `AutoscalingPolicyBinding`). It scrapes metrics from pod endpoints, has panic mode, supports heterogeneous cost-optimized scaling. Works fine for pod-level signals like `kthena:num_requests_waiting`. | ||
| Where it falls short: | ||
| 1. **Can't talk to Prometheus.** It scrapes pods directly. Most teams already have Prometheus running -- the autoscaler can't use it. | ||
| 2. **No per-model demand signal.** The router exposes `kthena_router_active_downstream_requests{model="..."}` which tells you how much traffic a model is getting *before* it hits backends. The built-in autoscaler doesn't use this. | ||
| 3. **Extra moving parts.** Teams already running KEDA end up maintaining two autoscaling systems side by side. | ||
| The goal here is to add KEDA as an optional autoscaling path. We're not touching AutoscalingPolicy. | ||
| ### Non-goals | ||
| - Modifying or replacing `AutoscalingPolicy` / `AutoscalingPolicyBinding` | ||
| - Building a custom metrics adapter | ||
| - Multi-model-per-ModelServing (we assume 1:1) | ||
| - Role-level scaling via KEDA (built-in autoscaler handles that with `subTargets.kind: Role`) | ||
| - Auto-generating ScaledObjects (Phase 3 at earliest) | ||
| --- | ||
| ## 2. Proposed Approach | ||
This is a duplicate of docs/proposals/keda-prometheus-autoscaling.md with only a different date (2026-04-02 vs 2026-03-31). Having two identical proposal files in the repository is confusing and should be avoided. Consider removing this file and keeping only the more recent version, or clarifying why both files are needed.
| **Status:** Superseded | |
| **Authors:** @david_laid | |
| **Date:** 2026-04-02 | |
| This file previously duplicated the full proposal now maintained at | |
| [`docs/proposals/keda-prometheus-autoscaling.md`](docs/proposals/keda-prometheus-autoscaling.md). | |
| To avoid keeping two identical proposal documents in the repository, that file is the | |
| canonical version and should be used for all future updates and references. | |
| This file is retained only as a compatibility pointer for any existing links to | |
| `docs/proposals/keda-autoscaling.md`. |
crd.json
Outdated
| @@ -0,0 +1,532 @@ | |||
| i{ | |||
The JSON file starts with invalid syntax: i{ on line 1 instead of just {. This makes the entire JSON file invalid and unparseable. This appears to be a typo or corruption.
| i{ | |
| { |
Signed-off-by: WHOIM1205 <[email protected]>
Force-pushed from 6815263 to ae3308a
hzxuzhonghu left a comment
Please remove all the YAML files in the root directory.
docs/proposals/keda-autoscaling.md
Outdated
| Where it falls short: | ||
| 1. **Can't talk to Prometheus.** It scrapes pods directly. Most teams already have Prometheus running -- the autoscaler can't use it. |
Thanks for the point. I am wondering whether we can support this later.
docs/proposals/keda-autoscaling.md
Outdated
| 1. **Can't talk to Prometheus.** It scrapes pods directly. Most teams already have Prometheus running -- the autoscaler can't use it. | ||
| 2. **No per-model demand signal.** The router exposes `kthena_router_active_downstream_requests{model="..."}` which tells you how much traffic a model is getting *before* it hits backends. The built-in autoscaler doesn't use this. |
The built-in autoscaler does not support using Kthena router metrics.
| - Modifying or replacing `AutoscalingPolicy` / `AutoscalingPolicyBinding` | ||
| - Building a custom metrics adapter | ||
| - Multi-model-per-ModelServing (we assume 1:1) | ||
| - Role-level scaling via KEDA (built-in autoscaler handles that with `subTargets.kind: Role`) |
I would prefer to leave this as second-step work.
docs/proposals/keda-autoscaling.md
Outdated
| ### Scale-up flow at runtime | ||
| ``` | ||
| User traffic Router Prometheus KEDA HPA ModelServing |
Can you replace this with a Mermaid diagram? It would be more readable.
docs/proposals/keda-autoscaling.md
Outdated
| ### How it works | ||
| 1. Prometheus scrapes the router, collects `kthena_router_active_downstream_requests{model="..."}`. |
`kthena_router_active_downstream_requests` is just an example metric, not a requirement.
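To illustrate that the metric is interchangeable, here is a sketch of the KEDA Prometheus trigger with two different queries. The `serverAddress` is the one from the proposal's example; the `threshold` values and the `model="my-model"` label value are placeholders, and whether `kthena:num_requests_waiting` is actually available in Prometheus (it requires the serving pods to be scraped) is an assumption.

```yaml
triggers:
  # Router-level demand signal used as the example in the proposal.
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(kthena_router_active_downstream_requests{model="my-model"})
      threshold: "10"   # placeholder target per replica
  # Pod-level queue depth, the signal the built-in autoscaler already uses.
  # No label filter is shown because the metric's labels are not given here,
  # so this sums across everything Prometheus has scraped.
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(kthena:num_requests_waiting)
      threshold: "5"    # placeholder target per replica
```

Either trigger is valid on its own. When both are listed, the HPA that KEDA creates computes a desired replica count per metric and uses the larger one.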
docs/proposals/keda-autoscaling.md
Outdated
| Right now operators have to manually match the `model` label in the PromQL query to the correct ModelServing CR. Easy to mess up. | ||
| We should add a `kthena.io/model-name` annotation on ModelServing. Doesn't need a new controller -- just a standard place to record the mapping. We can build tooling on top of it later if auto-generating ScaledObjects makes sense. |
Keep in mind that the model name in the router metric can differ from the real model name running in vLLM or SGLang, because the router can match on the model name and route the request to the real backend model instance.
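One reading of this comment is that the annotation should record the router-facing name, since that is the value that appears in the `model` label of the router metric. A sketch under that reading follows; the `apiVersion` for ModelServing and the names `my-model-serving`, `my-model`, and `my-model-serving-scaler` are assumptions for illustration, and the ModelServing `spec` is omitted.

```yaml
apiVersion: workload.serving.volcano.sh/v1alpha1   # assumed API group/version
kind: ModelServing
metadata:
  name: my-model-serving
  annotations:
    # Name clients send to the router. This may differ from the model name
    # that vLLM or SGLang reports on the backend pods.
    kthena.io/model-name: my-model
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-model-serving-scaler   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: workload.serving.volcano.sh/v1alpha1   # assumed, as above
    kind: ModelServing
    name: my-model-serving
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        # The label value must equal the annotation value above.
        query: sum(kthena_router_active_downstream_requests{model="my-model"})
        threshold: "10"   # placeholder
```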
docs/proposals/keda-autoscaling.md
Outdated
| triggers: | ||
| - type: prometheus | ||
| metadata: | ||
| serverAddress: http://prometheus.monitoring.svc:9090 |
docs/proposals/keda-autoscaling.md
Outdated
| name: my-model-serving | ||
| minReplicaCount: 1 | ||
| maxReplicaCount: 10 | ||
| cooldownPeriod: 60 |
This seems too short, as an LLM can take 5 minutes to start from scratch.
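A sketch of how the example could be adjusted for slow-starting replicas. Two things are worth separating: in KEDA, `cooldownPeriod` governs only the final scale-down to zero, while scale-down between `maxReplicaCount` and `minReplicaCount` is handled by the generated HPA and its `scaleDown.stabilizationWindowSeconds`. With `minReplicaCount: 1`, the HPA window is therefore the setting that matters most. The 600-second values (roughly twice the 5-minute startup mentioned above), the object name, the query label value, and the ModelServing `apiVersion` are assumptions.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-model-serving-scaler   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: workload.serving.volcano.sh/v1alpha1   # assumed API group/version
    kind: ModelServing
    name: my-model-serving
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 600   # was 60; only applies when scaling to zero
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          # Keep replicas for 10 minutes after load drops so a short lull
          # does not discard a replica that took minutes to warm up.
          stabilizationWindowSeconds: 600
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(kthena_router_active_downstream_requests{model="my-model"})
        threshold: "10"   # placeholder
```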
Isn't this the same as keda-autoscaling.md?
crd.json
Outdated
| @@ -0,0 +1,532 @@ | |||
| i{ | |||
What is this, and why is it in a proposal PR?
Signed-off-by: WHOIM1205 <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
hey @hzxuzhonghu
Proposal: Prometheus + KEDA autoscaling for ModelServing
This PR adds a design proposal for integrating Prometheus + KEDA-based autoscaling into Kthena ModelServing.
The proposal outlines:
This builds on earlier work where the end-to-end flow has already been validated locally.
Related
status.labelSelector)
Notes