Skip to content

Align heterogeneous autoscaler optimizer with CRD validation guarantees#766

Open
WHOIM1205 wants to merge 2 commits intovolcano-sh:mainfrom
WHOIM1205:fix/optimizer-nil-panic-threshold
Open

Align heterogeneous autoscaler optimizer with CRD validation guarantees#766
WHOIM1205 wants to merge 2 commits intovolcano-sh:mainfrom
WHOIM1205:fix/optimizer-nil-panic-threshold

Conversation

@WHOIM1205
Copy link
Copy Markdown
Contributor

@WHOIM1205 WHOIM1205 commented Feb 10, 2026

Align optimizer with CRD validation contract

Summary

This PR aligns the heterogeneous autoscaler optimizer logic with the CRD validation guarantees for panicThresholdPercent.

Based on the API definition:

// +kubebuilder:validation:Minimum=110
// +kubebuilder:validation:Maximum=1000
// +kubebuilder:default=200

Copilot AI review requested due to automatic review settings February 10, 2026 23:56
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lizhencheng9527 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello @WHOIM1205, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical nil pointer dereference issue within the heterogeneous autoscaler optimizer. Previously, if panicThresholdPercent was omitted from panicPolicy configuration, the controller would crash. The implemented fix introduces a nil guard, ensuring robust handling of such configurations. This enhancement significantly improves the stability of the autoscaler, preventing unexpected restarts and maintaining consistent autoscaling functionality across workloads.

Highlights

  • Nil Pointer Dereference Fix: Added a nil guard before dereferencing panicThresholdPercent in the heterogeneous autoscaler optimizer.
  • Panic Mode Evaluation: Ensured panic mode is only evaluated when panicThresholdPercent is explicitly set.
  • Behavior Consistency: Preserved existing behavior for fully configured panicPolicy and aligned heterogeneous optimizer behavior with the existing homogeneous scaler logic.
  • Autoscaler Stability: Prevents autoscaler controller crashes caused by partially configured panicPolicy and eliminates CrashLoopBackOff scenarios.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • pkg/autoscaler/autoscaler/optimizer.go
    • Modified the condition for evaluating panic mode to include a nil check for autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent.
  • pkg/autoscaler/controller/autoscale_controller_test.go
    • Added TestNilPanicThreshold_DoOptimize_NoPanic to verify the optimizer's behavior when PanicThresholdPercent is nil.
    • Added TestSetPanicThreshold_DoOptimize_PanicModeWorks to confirm that panic mode functions correctly when PanicThresholdPercent is explicitly set.
Activity
  • Added TestNilPanicThreshold_DoOptimize_NoPanic to verify the optimizer does not panic when panicThresholdPercent is omitted.
  • Added TestSetPanicThreshold_DoOptimize_PanicModeWorks to verify panic mode behavior remains unchanged when panicThresholdPercent is set.
  • All new test cases passed locally.
  • All existing autoscaler tests were run to confirm no regressions.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses a nil pointer dereference in the heterogeneous autoscaler, preventing a potential panic. It also remediates a high-severity Denial of Service (DoS) vulnerability in the autoscaler optimizer by introducing a nil-pointer check to prevent a controller panic when processing an AutoscalingPolicy with a partially configured panicPolicy. However, a critical integer overflow vulnerability has been identified in the same line of code, which could lead to incorrect scaling behavior under high load. Additionally, the new tests could be improved for better maintainability and robustness.

return nil, nil
}
if recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) {
if autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent != nil && recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

This line correctly remediates a high-severity Denial of Service (DoS) vulnerability by preventing a nil pointer dereference when autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent is not set, which could otherwise lead to a controller panic. However, the arithmetic in this condition is prone to integer overflow. The variables recommendedInstances, instancesCountSum, and PanicThresholdPercent are all int32. Multiplying them can easily exceed the maximum value of int32, potentially leading to incorrect panic mode evaluation. To prevent this, consider casting the operands to int64 before multiplication.

Suggested change
if autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent != nil && recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) {
if autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent != nil && int64(recommendedInstances)*100 >= int64(instancesCountSum)*int64(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) {

}
}

func TestNilPanicThreshold_DoOptimize_NoPanic(t *testing.T) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The new test functions TestNilPanicThreshold_DoOptimize_NoPanic and TestSetPanicThreshold_DoOptimize_PanicModeWorks share a lot of setup code. To improve maintainability, consider extracting the common setup logic into a test helper function. This would reduce duplication and make the tests easier to read and maintain.

Comment on lines +289 to +291
if updates == 0 {
t.Fatalf("expected update actions when PanicThresholdPercent is set, got 0")
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The assertion in this test only checks that at least one update action occurred (updates > 0), which is quite broad. For a more robust test, you should assert the specific final replica counts for each ModelServing instance, similar to TestTwoBackendsHighLoad_then_DoOptimize_expect_DistributionA5B4. This would verify that the panic mode scaling logic is working precisely as intended. Based on the test's configuration, the expected distribution should be 5 replicas for ms-panic-a and 4 for ms-panic-b.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a nil pointer dereference in the heterogeneous autoscaler optimizer when panicPolicy.panicThresholdPercent is omitted, and adds regression tests to cover the nil vs non-nil threshold cases.

Changes:

  • Added a nil guard before dereferencing PanicThresholdPercent in heterogeneous Optimizer.Optimize().
  • Added controller tests to ensure doOptimize() does not crash when PanicThresholdPercent is nil.
  • Added a test intended to validate behavior when PanicThresholdPercent is set.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
pkg/autoscaler/autoscaler/optimizer.go Prevents nil dereference by guarding access to PanicThresholdPercent before evaluating panic-mode threshold logic.
pkg/autoscaler/controller/autoscale_controller_test.go Adds regression tests for doOptimize() with nil and non-nil panicThresholdPercent.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +270 to +273
// PanicThresholdPercent set to 200 — with load=100, recommended will far exceed threshold
var threshold int32 = 200
policy := &workload.AutoscalingPolicy{Spec: workload.AutoscalingPolicySpec{TolerancePercent: 0, Metrics: []workload.AutoscalingPolicyMetric{{MetricName: "load", TargetValue: resource.MustParse("1")}}, Behavior: workload.AutoscalingPolicyBehavior{ScaleUp: workload.AutoscalingPolicyScaleUpPolicy{PanicPolicy: workload.AutoscalingPolicyPanicPolicy{Period: metav1.Duration{Duration: 1 * time.Second}, PanicThresholdPercent: &threshold}}}}}
binding := &workload.AutoscalingPolicyBinding{ObjectMeta: metav1.ObjectMeta{Name: "binding-panic", Namespace: ns}, Spec: workload.AutoscalingPolicyBindingSpec{PolicyRef: corev1.LocalObjectReference{Name: "ap"}, HeterogeneousTarget: &workload.HeterogeneousTarget{Params: []workload.HeterogeneousTargetParam{paramA, paramB}, CostExpansionRatePercent: 100}}}
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestSetPanicThreshold_DoOptimize_PanicModeWorks doesn’t actually validate panic mode: PanicModeHold is left nil, so NewStatus() sets PanicModeHoldMilliseconds to 0 and RefreshPanicMode() will never make IsPanicMode() true. Either set PanicModeHold in the policy and assert the optimizer entered panic mode (e.g., via ac.optimizerMap[formatAutoscalerMapKey(...)]), or rename the test/assertions to reflect that it only checks update actions.

Copilot uses AI. Check for mistakes.
Comment on lines +226 to +229
func TestNilPanicThreshold_DoOptimize_NoPanic(t *testing.T) {
ns := "ns"
msA := &workload.ModelServing{ObjectMeta: metav1.ObjectMeta{Name: "ms-nil-a", Namespace: ns}, Spec: workload.ModelServingSpec{Replicas: ptrInt32(1)}}
msB := &workload.ModelServing{ObjectMeta: metav1.ObjectMeta{Name: "ms-nil-b", Namespace: ns}, Spec: workload.ModelServingSpec{Replicas: ptrInt32(2)}}
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description’s “Code Changes” section mentions pkg/autoscaler/autoscaler/autoscale_controller_test.go, but the tests in this PR are in pkg/autoscaler/controller/autoscale_controller_test.go. Consider updating the description to avoid confusion when tracking changes.

Copilot uses AI. Check for mistakes.
return nil, nil
}
if recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) {
if autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent != nil && recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current API definition for the autoscaler, if PanicPolicy is set, PanicThresholdPercent must also be configured.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification.
The nil check is just a defensive safeguard to prevent a controller panic if an unexpected object without panicThresholdPercent reaches this path. It doesn’t change behavior when the field is correctly configured, but keeps the controller safe.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PanicThresholdPercent can never be nil,

	// +kubebuilder:validation:Minimum=110
	// +kubebuilder:validation:Maximum=1000
	// +kubebuilder:default=200

@WHOIM1205 WHOIM1205 force-pushed the fix/optimizer-nil-panic-threshold branch from 7104c79 to e2dc2b2 Compare February 12, 2026 21:34
@WHOIM1205
Copy link
Copy Markdown
Contributor Author

PanicThresholdPercent can never be nil,

	// +kubebuilder:validation:Minimum=110
	// +kubebuilder:validation:Maximum=1000
	// +kubebuilder:default=200

Thanks for the clarification.
I’ve removed the nil guard and aligned the optimizer logic with the CRD validation guarantees. The code now directly dereferences PanicThresholdPercent as intended by the API contract.

All autoscaler tests pass locally.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the pr title and description. And also i would suggest test some more meaningful logic

@WHOIM1205 WHOIM1205 changed the title Fix nil pointer panic in heterogeneous autoscaler optimizer Align heterogeneous autoscaler optimizer with CRD validation guarantees Feb 16, 2026
Replace weak assertion that only checked update count with meaningful
validation of actual replica distribution after panic mode scaling.

The test now verifies that when panic threshold is exceeded:
- ms-panic-a scales to 5 replicas (maxReplicas, lower cost)
- ms-panic-b scales to 4 replicas (maxReplicas, higher cost)

This provides stronger guarantees about optimizer behavior while
keeping the test simple and maintainable.

Signed-off-by: WHOIM1205 <[email protected]>
Copilot AI review requested due to automatic review settings February 16, 2026 21:37
@WHOIM1205
Copy link
Copy Markdown
Contributor Author

Please update the pr title and description. And also i would suggest test some more meaningful logic

Thanks for the suggestion.

I’ve updated the test to validate the actual replica changes instead of only checking for update actions. The assertions now verify the expected replica counts under panic mode.

All autoscaler tests pass locally.
Screenshot 2026-02-17 030502

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +226 to +254
func TestDefaultPanicThreshold_DoOptimize_NoPanic(t *testing.T) {
ns := "ns"
msA := &workload.ModelServing{ObjectMeta: metav1.ObjectMeta{Name: "ms-nil-a", Namespace: ns}, Spec: workload.ModelServingSpec{Replicas: ptrInt32(1)}}
msB := &workload.ModelServing{ObjectMeta: metav1.ObjectMeta{Name: "ms-nil-b", Namespace: ns}, Spec: workload.ModelServingSpec{Replicas: ptrInt32(2)}}
client := clientfake.NewSimpleClientset(msA, msB)
msLister := workloadLister.NewModelServingLister(newModelServingIndexer(msA, msB))

srv := httptest.NewServer(httpHandlerWithBody("# TYPE load gauge\nload 10\n"))
defer srv.Close()
u, _ := url.Parse(srv.URL)
host, portStr, _ := net.SplitHostPort(u.Host)
port := toInt32(portStr)

paramA := workload.HeterogeneousTargetParam{Target: workload.Target{TargetRef: corev1.ObjectReference{Kind: workload.ModelServingKind.Kind, Namespace: ns, Name: "ms-nil-a"}, MetricEndpoint: workload.MetricEndpoint{Uri: u.Path, Port: port}}, MinReplicas: 1, MaxReplicas: 5, Cost: 10}
paramB := workload.HeterogeneousTargetParam{Target: workload.Target{TargetRef: corev1.ObjectReference{Kind: workload.ModelServingKind.Kind, Namespace: ns, Name: "ms-nil-b"}, MetricEndpoint: workload.MetricEndpoint{Uri: u.Path, Port: port}}, MinReplicas: 2, MaxReplicas: 4, Cost: 20}
// PanicThresholdPercent set to CRD default of 200 — per API guarantee this is never nil
var threshold int32 = 200
policy := &workload.AutoscalingPolicy{Spec: workload.AutoscalingPolicySpec{TolerancePercent: 0, Metrics: []workload.AutoscalingPolicyMetric{{MetricName: "load", TargetValue: resource.MustParse("1")}}, Behavior: workload.AutoscalingPolicyBehavior{ScaleUp: workload.AutoscalingPolicyScaleUpPolicy{PanicPolicy: workload.AutoscalingPolicyPanicPolicy{Period: metav1.Duration{Duration: 1 * time.Second}, PanicThresholdPercent: &threshold}}}}}
binding := &workload.AutoscalingPolicyBinding{ObjectMeta: metav1.ObjectMeta{Name: "binding-nil", Namespace: ns}, Spec: workload.AutoscalingPolicyBindingSpec{PolicyRef: corev1.LocalObjectReference{Name: "ap"}, HeterogeneousTarget: &workload.HeterogeneousTarget{Params: []workload.HeterogeneousTargetParam{paramA, paramB}, CostExpansionRatePercent: 100}}}

lbsA := map[string]string{}
lbsB := map[string]string{}
pods := []*corev1.Pod{readyPod(ns, "pod-nil-a", host, lbsA), readyPod(ns, "pod-nil-b", host, lbsB)}
ac := &AutoscaleController{client: client, namespace: ns, modelServingLister: msLister, podsLister: fakePodLister{podsByNs: map[string][]*corev1.Pod{ns: pods}}, scalerMap: map[string]*autoscalerAutoscaler{}, optimizerMap: map[string]*autoscalerOptimizer{}}

if err := ac.doOptimize(context.Background(), binding, policy); err != nil {
t.Fatalf("doOptimize should not error with default PanicThresholdPercent: %v", err)
}
}
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test name "TestDefaultPanicThreshold_DoOptimize_NoPanic" is ambiguous. It's unclear whether "NoPanic" means "no crash/error" or "panic mode not activated". The test only verifies that doOptimize doesn't return an error, but doesn't verify whether panic mode was activated or not. Consider either: (1) Renaming to something like "TestDefaultPanicThreshold_DoOptimize_NoError" to clarify it's testing for absence of errors, or (2) Adding assertions to verify whether panic mode was actually triggered and renaming accordingly.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants