Align heterogeneous autoscaler optimizer with CRD validation guarantees by WHOIM1205 · Pull Request #766 · volcano-sh/kthena

WHOIM1205 · 2026-02-10T23:56:43Z

Align optimizer with CRD validation contract

Summary

This PR aligns the heterogeneous autoscaler optimizer logic with the CRD validation guarantees for panicThresholdPercent.

Based on the API definition:

// +kubebuilder:validation:Minimum=110
// +kubebuilder:validation:Maximum=1000
// +kubebuilder:default=200
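
To make the contract concrete, here is a minimal sketch of what these markers imply once an object has passed through the API server. The type name `PanicPolicy` and the helper `applyDefaultAndValidate` are assumptions for illustration, not code from this repository. In a real cluster the generated CRD schema does this work, not hand-written Go.

```go
package main

import (
	"errors"
	"fmt"
)

// PanicPolicy is a hypothetical stand-in for the real API type.
type PanicPolicy struct {
	// +kubebuilder:validation:Minimum=110
	// +kubebuilder:validation:Maximum=1000
	// +kubebuilder:default=200
	PanicThresholdPercent *int32
}

// applyDefaultAndValidate imitates what the CRD schema does on admission:
// fill in the default when the field is omitted, then enforce the bounds.
func applyDefaultAndValidate(p *PanicPolicy) error {
	if p.PanicThresholdPercent == nil {
		def := int32(200)
		p.PanicThresholdPercent = &def
	}
	if v := *p.PanicThresholdPercent; v < 110 || v > 1000 {
		return errors.New("panicThresholdPercent must be between 110 and 1000")
	}
	return nil
}

func main() {
	p := PanicPolicy{} // field omitted by the user
	if err := applyDefaultAndValidate(&p); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(*p.PanicThresholdPercent)
}
```

Objects built directly in Go, for example in unit tests, never pass through this admission step, so a test must set the field itself.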

volcano-sh-bot · 2026-02-10T23:56:49Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lizhencheng9527 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist · 2026-02-10T23:56:57Z

Summary of Changes

Hello @WHOIM1205, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical nil pointer dereference issue within the heterogeneous autoscaler optimizer. Previously, if panicThresholdPercent was omitted from panicPolicy configuration, the controller would crash. The implemented fix introduces a nil guard, ensuring robust handling of such configurations. This enhancement significantly improves the stability of the autoscaler, preventing unexpected restarts and maintaining consistent autoscaling functionality across workloads.

Highlights

Nil Pointer Dereference Fix: Added a nil guard before dereferencing panicThresholdPercent in the heterogeneous autoscaler optimizer.
Panic Mode Evaluation: Ensured panic mode is only evaluated when panicThresholdPercent is explicitly set.
Behavior Consistency: Preserved existing behavior for fully configured panicPolicy and aligned heterogeneous optimizer behavior with the existing homogeneous scaler logic.
Autoscaler Stability: Prevents autoscaler controller crashes caused by partially configured panicPolicy and eliminates CrashLoopBackOff scenarios.


Changelog

pkg/autoscaler/autoscaler/optimizer.go
- Modified the condition for evaluating panic mode to include a nil check for autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent.
pkg/autoscaler/controller/autoscale_controller_test.go
- Added TestNilPanicThreshold_DoOptimize_NoPanic to verify the optimizer's behavior when PanicThresholdPercent is nil.
- Added TestSetPanicThreshold_DoOptimize_PanicModeWorks to confirm that panic mode functions correctly when PanicThresholdPercent is explicitly set.

Activity

Added TestNilPanicThreshold_DoOptimize_NoPanic to verify the optimizer does not panic when panicThresholdPercent is omitted.
Added TestSetPanicThreshold_DoOptimize_PanicModeWorks to verify panic mode behavior remains unchanged when panicThresholdPercent is set.
All new test cases passed locally.
All existing autoscaler tests were run to confirm no regressions.


gemini-code-assist

Code Review

This pull request fixes a nil pointer dereference in the heterogeneous autoscaler optimizer by adding a nil check. The check prevents a controller panic (a high-severity Denial of Service vulnerability) when processing an AutoscalingPolicy with a partially configured panicPolicy. However, the same line of code contains a potential integer overflow, which could lead to incorrect scaling behavior at very large instance counts. Additionally, the new tests could be improved for better maintainability and robustness.

gemini-code-assist · 2026-02-10T23:59:31Z

pkg/autoscaler/autoscaler/optimizer.go

 		return nil, nil
 	}
-	if recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) {
+	if autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent != nil && recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) {


This line remediates a high-severity Denial of Service (DoS) vulnerability by preventing a nil pointer dereference when autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent is not set, which could otherwise lead to a controller panic. However, the arithmetic in this condition can overflow. The variables recommendedInstances, instancesCountSum, and PanicThresholdPercent are all int32, so their products wrap around once they exceed the int32 maximum of 2,147,483,647. With the maximum threshold of 1000, for example, that happens when instancesCountSum is above roughly 2.1 million, and the panic mode evaluation then becomes incorrect. To prevent this, consider casting the operands to int64 before multiplication.

Suggested change

-	if autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent != nil && recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) {
+	if autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent != nil && int64(recommendedInstances)*100 >= int64(instancesCountSum)*int64(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) {
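
The difference between the two forms can be seen in a small standalone sketch. The function names `exceedsNaive` and `exceedsWide` are hypothetical and only mirror the shape of the comparison. The inputs in the second call are deliberately extreme, since int32 overflow here needs instance counts in the millions.

```go
package main

import "fmt"

// exceedsNaive mirrors the int32 arithmetic of the original condition.
// The products wrap around silently once they exceed the int32 range.
func exceedsNaive(recommended, current, thresholdPercent int32) bool {
	return recommended*100 >= current*thresholdPercent
}

// exceedsWide widens every operand to int64 before multiplying, so the
// products of any int32 inputs stay well inside the int64 range.
func exceedsWide(recommended, current, thresholdPercent int32) bool {
	return int64(recommended)*100 >= int64(current)*int64(thresholdPercent)
}

func main() {
	// Ordinary sizes: 10*100 = 1000 >= 4*200 = 800, so both forms agree.
	fmt.Println(exceedsNaive(10, 4, 200), exceedsWide(10, 4, 200))

	// Extreme sizes: 3000000*1000 does not fit in int32 and wraps to a
	// negative number, so the naive form reports a threshold breach that
	// the widened form does not.
	fmt.Println(exceedsNaive(3000000, 3000000, 1000), exceedsWide(3000000, 3000000, 1000))
}
```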

gemini-code-assist · 2026-02-10T23:59:31Z

pkg/autoscaler/controller/autoscale_controller_test.go

 	}
 }

+func TestNilPanicThreshold_DoOptimize_NoPanic(t *testing.T) {


The new test functions TestNilPanicThreshold_DoOptimize_NoPanic and TestSetPanicThreshold_DoOptimize_PanicModeWorks share a lot of setup code. To improve maintainability, consider extracting the common setup logic into a test helper function. This would reduce duplication and make the tests easier to read and maintain.

gemini-code-assist · 2026-02-10T23:59:32Z

pkg/autoscaler/controller/autoscale_controller_test.go

+	if updates == 0 {
+		t.Fatalf("expected update actions when PanicThresholdPercent is set, got 0")
+	}


The assertion in this test only checks that at least one update action occurred (updates > 0), which is quite broad. For a more robust test, you should assert the specific final replica counts for each ModelServing instance, similar to TestTwoBackendsHighLoad_then_DoOptimize_expect_DistributionA5B4. This would verify that the panic mode scaling logic is working precisely as intended. Based on the test's configuration, the expected distribution should be 5 replicas for ms-panic-a and 4 for ms-panic-b.
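
One way to express the stricter assertion is a small comparison helper. This is a sketch only: `diffReplicas` is a hypothetical name, and the observed values in `main` are made up to show a failing case. A real test would read the replica counts from the fake client after `doOptimize` returns.

```go
package main

import (
	"fmt"
	"sort"
)

// diffReplicas compares observed replica counts against the expected ones
// and returns one message per mismatch, sorted for stable output.
func diffReplicas(got, want map[string]int32) []string {
	var problems []string
	for name, w := range want {
		g, ok := got[name]
		switch {
		case !ok:
			problems = append(problems, fmt.Sprintf("%s: missing, want %d", name, w))
		case g != w:
			problems = append(problems, fmt.Sprintf("%s: got %d, want %d", name, g, w))
		}
	}
	sort.Strings(problems)
	return problems
}

func main() {
	// Expected distribution taken from the review comment above.
	want := map[string]int32{"ms-panic-a": 5, "ms-panic-b": 4}
	// Hypothetical observed values, chosen to show a mismatch.
	got := map[string]int32{"ms-panic-a": 5, "ms-panic-b": 2}
	for _, p := range diffReplicas(got, want) {
		fmt.Println(p)
	}
}
```

Inside a test, each returned message would be passed to `t.Errorf`, so a wrong distribution fails with the exact target and counts instead of a bare "got 0 updates".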

Copilot

Pull request overview

Fixes a nil pointer dereference in the heterogeneous autoscaler optimizer when panicPolicy.panicThresholdPercent is omitted, and adds regression tests to cover the nil vs non-nil threshold cases.

Changes:

Added a nil guard before dereferencing PanicThresholdPercent in heterogeneous Optimizer.Optimize().
Added controller tests to ensure doOptimize() does not crash when PanicThresholdPercent is nil.
Added a test intended to validate behavior when PanicThresholdPercent is set.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`pkg/autoscaler/autoscaler/optimizer.go`	Prevents nil dereference by guarding access to `PanicThresholdPercent` before evaluating panic-mode threshold logic.
`pkg/autoscaler/controller/autoscale_controller_test.go`	Adds regression tests for `doOptimize()` with nil and non-nil `panicThresholdPercent`.


Copilot · 2026-02-11T00:05:12Z

pkg/autoscaler/controller/autoscale_controller_test.go

+	// PanicThresholdPercent set to 200 — with load=100, recommended will far exceed threshold
+	var threshold int32 = 200
+	policy := &workload.AutoscalingPolicy{Spec: workload.AutoscalingPolicySpec{TolerancePercent: 0, Metrics: []workload.AutoscalingPolicyMetric{{MetricName: "load", TargetValue: resource.MustParse("1")}}, Behavior: workload.AutoscalingPolicyBehavior{ScaleUp: workload.AutoscalingPolicyScaleUpPolicy{PanicPolicy: workload.AutoscalingPolicyPanicPolicy{Period: metav1.Duration{Duration: 1 * time.Second}, PanicThresholdPercent: &threshold}}}}}
+	binding := &workload.AutoscalingPolicyBinding{ObjectMeta: metav1.ObjectMeta{Name: "binding-panic", Namespace: ns}, Spec: workload.AutoscalingPolicyBindingSpec{PolicyRef: corev1.LocalObjectReference{Name: "ap"}, HeterogeneousTarget: &workload.HeterogeneousTarget{Params: []workload.HeterogeneousTargetParam{paramA, paramB}, CostExpansionRatePercent: 100}}}


TestSetPanicThreshold_DoOptimize_PanicModeWorks doesn’t actually validate panic mode: PanicModeHold is left nil, so NewStatus() sets PanicModeHoldMilliseconds to 0 and RefreshPanicMode() will never make IsPanicMode() true. Either set PanicModeHold in the policy and assert the optimizer entered panic mode (e.g., via ac.optimizerMap[formatAutoscalerMapKey(...)]), or rename the test/assertions to reflect that it only checks update actions.
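
The reasoning can be illustrated with a deliberately simplified model. Everything in this sketch is an assumption: the real `NewStatus`, `RefreshPanicMode`, and `IsPanicMode` are not shown in this thread, so the type, fields, and lowercase function names below are hypothetical and only mimic the behavior the comment describes.

```go
package main

import "fmt"

// status is a hypothetical, simplified model of the optimizer's panic-mode
// state. Times are plain millisecond timestamps to keep the sketch
// deterministic.
type status struct {
	panicModeHoldMillis int64 // 0 models a policy with PanicModeHold left nil
	panicModeEndsAt     int64
}

func newStatus(holdMillis int64) *status {
	return &status{panicModeHoldMillis: holdMillis}
}

// refreshPanicMode extends the panic window by the hold duration.
func (s *status) refreshPanicMode(nowMillis int64) {
	s.panicModeEndsAt = nowMillis + s.panicModeHoldMillis
}

// isPanicMode reports whether the panic window is still open.
func (s *status) isPanicMode(nowMillis int64) bool {
	return nowMillis < s.panicModeEndsAt
}

func main() {
	now := int64(1000)

	// Hold of 0: the window closes the instant it opens.
	noHold := newStatus(0)
	noHold.refreshPanicMode(now)
	fmt.Println(noHold.isPanicMode(now))

	// Hold of 30 seconds: the window stays open after the refresh.
	withHold := newStatus(30000)
	withHold.refreshPanicMode(now)
	fmt.Println(withHold.isPanicMode(now + 1))
}
```

Under this model, a test that leaves the hold duration unset can never observe panic mode, regardless of how far the load exceeds the threshold.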

Copilot · 2026-02-11T00:05:12Z

pkg/autoscaler/controller/autoscale_controller_test.go

+func TestNilPanicThreshold_DoOptimize_NoPanic(t *testing.T) {
+	ns := "ns"
+	msA := &workload.ModelServing{ObjectMeta: metav1.ObjectMeta{Name: "ms-nil-a", Namespace: ns}, Spec: workload.ModelServingSpec{Replicas: ptrInt32(1)}}
+	msB := &workload.ModelServing{ObjectMeta: metav1.ObjectMeta{Name: "ms-nil-b", Namespace: ns}, Spec: workload.ModelServingSpec{Replicas: ptrInt32(2)}}


PR description’s “Code Changes” section mentions pkg/autoscaler/autoscaler/autoscale_controller_test.go, but the tests in this PR are in pkg/autoscaler/controller/autoscale_controller_test.go. Consider updating the description to avoid confusion when tracking changes.

LiZhenCheng9527 · 2026-02-11T03:04:08Z

pkg/autoscaler/autoscaler/optimizer.go

 		return nil, nil
 	}
-	if recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) {
+	if autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent != nil && recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) {


In the current API definition for the autoscaler, if PanicPolicy is set, PanicThresholdPercent must also be configured.

WHOIM1205

Thanks for the clarification.
The nil check is just a defensive safeguard to prevent a controller panic if an unexpected object without panicThresholdPercent reaches this path. It doesn’t change behavior when the field is correctly configured, but keeps the controller safe.

hzxuzhonghu

PanicThresholdPercent can never be nil:

	// +kubebuilder:validation:Minimum=110
	// +kubebuilder:validation:Maximum=1000
	// +kubebuilder:default=200

Signed-off-by: WHOIM1205 <[email protected]>

WHOIM1205 · 2026-02-12T21:35:17Z

PanicThresholdPercent can never be nil:

	// +kubebuilder:validation:Minimum=110
	// +kubebuilder:validation:Maximum=1000
	// +kubebuilder:default=200

Thanks for the clarification.
I’ve removed the nil guard and aligned the optimizer logic with the CRD validation guarantees. The code now directly dereferences PanicThresholdPercent as intended by the API contract.

All autoscaler tests pass locally.

hzxuzhonghu

Please update the PR title and description. I would also suggest testing some more meaningful logic.

Replace weak assertion that only checked update count with meaningful validation of actual replica distribution after panic mode scaling.

The test now verifies that when panic threshold is exceeded:
- ms-panic-a scales to 5 replicas (maxReplicas, lower cost)
- ms-panic-b scales to 4 replicas (maxReplicas, higher cost)

This provides stronger guarantees about optimizer behavior while keeping the test simple and maintainable.

Signed-off-by: WHOIM1205 <[email protected]>

WHOIM1205 · 2026-02-16T21:38:55Z

Please update the PR title and description. I would also suggest testing some more meaningful logic.

Thanks for the suggestion.

I’ve updated the test to validate the actual replica changes instead of only checking for update actions. The assertions now verify the expected replica counts under panic mode.

All autoscaler tests pass locally.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


Copilot · 2026-02-16T21:41:28Z

pkg/autoscaler/controller/autoscale_controller_test.go

+func TestDefaultPanicThreshold_DoOptimize_NoPanic(t *testing.T) {
+	ns := "ns"
+	msA := &workload.ModelServing{ObjectMeta: metav1.ObjectMeta{Name: "ms-nil-a", Namespace: ns}, Spec: workload.ModelServingSpec{Replicas: ptrInt32(1)}}
+	msB := &workload.ModelServing{ObjectMeta: metav1.ObjectMeta{Name: "ms-nil-b", Namespace: ns}, Spec: workload.ModelServingSpec{Replicas: ptrInt32(2)}}
+	client := clientfake.NewSimpleClientset(msA, msB)
+	msLister := workloadLister.NewModelServingLister(newModelServingIndexer(msA, msB))
+
+	srv := httptest.NewServer(httpHandlerWithBody("# TYPE load gauge\nload 10\n"))
+	defer srv.Close()
+	u, _ := url.Parse(srv.URL)
+	host, portStr, _ := net.SplitHostPort(u.Host)
+	port := toInt32(portStr)
+
+	paramA := workload.HeterogeneousTargetParam{Target: workload.Target{TargetRef: corev1.ObjectReference{Kind: workload.ModelServingKind.Kind, Namespace: ns, Name: "ms-nil-a"}, MetricEndpoint: workload.MetricEndpoint{Uri: u.Path, Port: port}}, MinReplicas: 1, MaxReplicas: 5, Cost: 10}
+	paramB := workload.HeterogeneousTargetParam{Target: workload.Target{TargetRef: corev1.ObjectReference{Kind: workload.ModelServingKind.Kind, Namespace: ns, Name: "ms-nil-b"}, MetricEndpoint: workload.MetricEndpoint{Uri: u.Path, Port: port}}, MinReplicas: 2, MaxReplicas: 4, Cost: 20}
+	// PanicThresholdPercent set to CRD default of 200 — per API guarantee this is never nil
+	var threshold int32 = 200
+	policy := &workload.AutoscalingPolicy{Spec: workload.AutoscalingPolicySpec{TolerancePercent: 0, Metrics: []workload.AutoscalingPolicyMetric{{MetricName: "load", TargetValue: resource.MustParse("1")}}, Behavior: workload.AutoscalingPolicyBehavior{ScaleUp: workload.AutoscalingPolicyScaleUpPolicy{PanicPolicy: workload.AutoscalingPolicyPanicPolicy{Period: metav1.Duration{Duration: 1 * time.Second}, PanicThresholdPercent: &threshold}}}}}
+	binding := &workload.AutoscalingPolicyBinding{ObjectMeta: metav1.ObjectMeta{Name: "binding-nil", Namespace: ns}, Spec: workload.AutoscalingPolicyBindingSpec{PolicyRef: corev1.LocalObjectReference{Name: "ap"}, HeterogeneousTarget: &workload.HeterogeneousTarget{Params: []workload.HeterogeneousTargetParam{paramA, paramB}, CostExpansionRatePercent: 100}}}
+
+	lbsA := map[string]string{}
+	lbsB := map[string]string{}
+	pods := []*corev1.Pod{readyPod(ns, "pod-nil-a", host, lbsA), readyPod(ns, "pod-nil-b", host, lbsB)}
+	ac := &AutoscaleController{client: client, namespace: ns, modelServingLister: msLister, podsLister: fakePodLister{podsByNs: map[string][]*corev1.Pod{ns: pods}}, scalerMap: map[string]*autoscalerAutoscaler{}, optimizerMap: map[string]*autoscalerOptimizer{}}
+
+	if err := ac.doOptimize(context.Background(), binding, policy); err != nil {
+		t.Fatalf("doOptimize should not error with default PanicThresholdPercent: %v", err)
+	}
+}


The test name "TestDefaultPanicThreshold_DoOptimize_NoPanic" is ambiguous. It's unclear whether "NoPanic" means "no crash/error" or "panic mode not activated". The test only verifies that doOptimize doesn't return an error, but doesn't verify whether panic mode was activated or not. Consider either: (1) Renaming to something like "TestDefaultPanicThreshold_DoOptimize_NoError" to clarify it's testing for absence of errors, or (2) Adding assertions to verify whether panic mode was actually triggered and renaming accordingly.

Copilot AI review requested due to automatic review settings February 10, 2026 23:56

volcano-sh-bot requested review from YaoZengzeng and git-malu February 10, 2026 23:56

volcano-sh-bot added the size/M label Feb 10, 2026

Copilot started reviewing on behalf of WHOIM1205 February 10, 2026 23:57

gemini-code-assist bot reviewed Feb 10, 2026

Copilot AI reviewed Feb 11, 2026

LiZhenCheng9527 reviewed Feb 11, 2026

hzxuzhonghu reviewed Feb 12, 2026

fix: remove unnecessary nil guard on PanicThresholdPercent in optimizer

e2dc2b2

Signed-off-by: WHOIM1205 <[email protected]>

WHOIM1205 force-pushed the fix/optimizer-nil-panic-threshold branch from 7104c79 to e2dc2b2 February 12, 2026 21:34

hzxuzhonghu reviewed Feb 13, 2026

WHOIM1205 changed the title from "Fix nil pointer panic in heterogeneous autoscaler optimizer" to "Align heterogeneous autoscaler optimizer with CRD validation guarantees" Feb 16, 2026

Copilot AI review requested due to automatic review settings February 16, 2026 21:37

Copilot started reviewing on behalf of WHOIM1205 February 16, 2026 21:38

Copilot AI reviewed Feb 16, 2026


5 participants
