Align heterogeneous autoscaler optimizer with CRD validation guarantees#766
Align heterogeneous autoscaler optimizer with CRD validation guarantees#766WHOIM1205 wants to merge 2 commits intovolcano-sh:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Summary of ChangesHello @WHOIM1205, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical nil pointer dereference issue within the heterogeneous autoscaler optimizer. Previously, if Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request correctly addresses a nil pointer dereference in the heterogeneous autoscaler, preventing a potential panic. It also remediates a high-severity Denial of Service (DoS) vulnerability in the autoscaler optimizer by introducing a nil-pointer check to prevent a controller panic when processing an AutoscalingPolicy with a partially configured panicPolicy. However, a critical integer overflow vulnerability has been identified in the same line of code, which could lead to incorrect scaling behavior under high load. Additionally, the new tests could be improved for better maintainability and robustness.
| return nil, nil | ||
| } | ||
| if recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) { | ||
| if autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent != nil && recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) { |
There was a problem hiding this comment.
This line correctly remediates a high-severity Denial of Service (DoS) vulnerability by preventing a nil pointer dereference when autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent is not set, which could otherwise lead to a controller panic. However, the arithmetic in this condition is prone to integer overflow. The variables recommendedInstances, instancesCountSum, and PanicThresholdPercent are all int32. Multiplying them can easily exceed the maximum value of int32, potentially leading to incorrect panic mode evaluation. To prevent this, consider casting the operands to int64 before multiplication.
| if autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent != nil && recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) { | |
| if autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent != nil && int64(recommendedInstances)*100 >= int64(instancesCountSum)*int64(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) { |
| } | ||
| } | ||
|
|
||
| func TestNilPanicThreshold_DoOptimize_NoPanic(t *testing.T) { |
There was a problem hiding this comment.
The new test functions TestNilPanicThreshold_DoOptimize_NoPanic and TestSetPanicThreshold_DoOptimize_PanicModeWorks share a lot of setup code. To improve maintainability, consider extracting the common setup logic into a test helper function. This would reduce duplication and make the tests easier to read and maintain.
| if updates == 0 { | ||
| t.Fatalf("expected update actions when PanicThresholdPercent is set, got 0") | ||
| } |
There was a problem hiding this comment.
The assertion in this test only checks that at least one update action occurred (updates > 0), which is quite broad. For a more robust test, you should assert the specific final replica counts for each ModelServing instance, similar to TestTwoBackendsHighLoad_then_DoOptimize_expect_DistributionA5B4. This would verify that the panic mode scaling logic is working precisely as intended. Based on the test's configuration, the expected distribution should be 5 replicas for ms-panic-a and 4 for ms-panic-b.
There was a problem hiding this comment.
Pull request overview
Fixes a nil pointer dereference in the heterogeneous autoscaler optimizer when panicPolicy.panicThresholdPercent is omitted, and adds regression tests to cover the nil vs non-nil threshold cases.
Changes:
- Added a nil guard before dereferencing
PanicThresholdPercentin heterogeneousOptimizer.Optimize(). - Added controller tests to ensure
doOptimize()does not crash whenPanicThresholdPercentis nil. - Added a test intended to validate behavior when
PanicThresholdPercentis set.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
pkg/autoscaler/autoscaler/optimizer.go |
Prevents nil dereference by guarding access to PanicThresholdPercent before evaluating panic-mode threshold logic. |
pkg/autoscaler/controller/autoscale_controller_test.go |
Adds regression tests for doOptimize() with nil and non-nil panicThresholdPercent. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // PanicThresholdPercent set to 200 — with load=100, recommended will far exceed threshold | ||
| var threshold int32 = 200 | ||
| policy := &workload.AutoscalingPolicy{Spec: workload.AutoscalingPolicySpec{TolerancePercent: 0, Metrics: []workload.AutoscalingPolicyMetric{{MetricName: "load", TargetValue: resource.MustParse("1")}}, Behavior: workload.AutoscalingPolicyBehavior{ScaleUp: workload.AutoscalingPolicyScaleUpPolicy{PanicPolicy: workload.AutoscalingPolicyPanicPolicy{Period: metav1.Duration{Duration: 1 * time.Second}, PanicThresholdPercent: &threshold}}}}} | ||
| binding := &workload.AutoscalingPolicyBinding{ObjectMeta: metav1.ObjectMeta{Name: "binding-panic", Namespace: ns}, Spec: workload.AutoscalingPolicyBindingSpec{PolicyRef: corev1.LocalObjectReference{Name: "ap"}, HeterogeneousTarget: &workload.HeterogeneousTarget{Params: []workload.HeterogeneousTargetParam{paramA, paramB}, CostExpansionRatePercent: 100}}} |
There was a problem hiding this comment.
TestSetPanicThreshold_DoOptimize_PanicModeWorks doesn’t actually validate panic mode: PanicModeHold is left nil, so NewStatus() sets PanicModeHoldMilliseconds to 0 and RefreshPanicMode() will never make IsPanicMode() true. Either set PanicModeHold in the policy and assert the optimizer entered panic mode (e.g., via ac.optimizerMap[formatAutoscalerMapKey(...)]), or rename the test/assertions to reflect that it only checks update actions.
| func TestNilPanicThreshold_DoOptimize_NoPanic(t *testing.T) { | ||
| ns := "ns" | ||
| msA := &workload.ModelServing{ObjectMeta: metav1.ObjectMeta{Name: "ms-nil-a", Namespace: ns}, Spec: workload.ModelServingSpec{Replicas: ptrInt32(1)}} | ||
| msB := &workload.ModelServing{ObjectMeta: metav1.ObjectMeta{Name: "ms-nil-b", Namespace: ns}, Spec: workload.ModelServingSpec{Replicas: ptrInt32(2)}} |
There was a problem hiding this comment.
PR description’s “Code Changes” section mentions pkg/autoscaler/autoscaler/autoscale_controller_test.go, but the tests in this PR are in pkg/autoscaler/controller/autoscale_controller_test.go. Consider updating the description to avoid confusion when tracking changes.
| return nil, nil | ||
| } | ||
| if recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) { | ||
| if autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent != nil && recommendedInstances*100 >= instancesCountSum*(*autoscalePolicy.Spec.Behavior.ScaleUp.PanicPolicy.PanicThresholdPercent) { |
There was a problem hiding this comment.
In the current API definition for the autoscaler, if PanicPolicy is set, PanicThresholdPercent must also be configured.
There was a problem hiding this comment.
Thanks for the clarification.
The nil check is just a defensive safeguard to prevent a controller panic if an unexpected object without panicThresholdPercent reaches this path. It doesn’t change behavior when the field is correctly configured, but keeps the controller safe.
hzxuzhonghu
left a comment
There was a problem hiding this comment.
PanicThresholdPercent can never be nil,
// +kubebuilder:validation:Minimum=110
// +kubebuilder:validation:Maximum=1000
// +kubebuilder:default=200
Signed-off-by: WHOIM1205 <[email protected]>
7104c79 to
e2dc2b2
Compare
Thanks for the clarification. All autoscaler tests pass locally. |
hzxuzhonghu
left a comment
There was a problem hiding this comment.
Please update the pr title and description. And also i would suggest test some more meaningful logic
Replace weak assertion that only checked update count with meaningful validation of actual replica distribution after panic mode scaling. The test now verifies that when panic threshold is exceeded: - ms-panic-a scales to 5 replicas (maxReplicas, lower cost) - ms-panic-b scales to 4 replicas (maxReplicas, higher cost) This provides stronger guarantees about optimizer behavior while keeping the test simple and maintainable. Signed-off-by: WHOIM1205 <[email protected]>
Thanks for the suggestion. I’ve updated the test to validate the actual replica changes instead of only checking for update actions. The assertions now verify the expected replica counts under panic mode. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func TestDefaultPanicThreshold_DoOptimize_NoPanic(t *testing.T) { | ||
| ns := "ns" | ||
| msA := &workload.ModelServing{ObjectMeta: metav1.ObjectMeta{Name: "ms-nil-a", Namespace: ns}, Spec: workload.ModelServingSpec{Replicas: ptrInt32(1)}} | ||
| msB := &workload.ModelServing{ObjectMeta: metav1.ObjectMeta{Name: "ms-nil-b", Namespace: ns}, Spec: workload.ModelServingSpec{Replicas: ptrInt32(2)}} | ||
| client := clientfake.NewSimpleClientset(msA, msB) | ||
| msLister := workloadLister.NewModelServingLister(newModelServingIndexer(msA, msB)) | ||
|
|
||
| srv := httptest.NewServer(httpHandlerWithBody("# TYPE load gauge\nload 10\n")) | ||
| defer srv.Close() | ||
| u, _ := url.Parse(srv.URL) | ||
| host, portStr, _ := net.SplitHostPort(u.Host) | ||
| port := toInt32(portStr) | ||
|
|
||
| paramA := workload.HeterogeneousTargetParam{Target: workload.Target{TargetRef: corev1.ObjectReference{Kind: workload.ModelServingKind.Kind, Namespace: ns, Name: "ms-nil-a"}, MetricEndpoint: workload.MetricEndpoint{Uri: u.Path, Port: port}}, MinReplicas: 1, MaxReplicas: 5, Cost: 10} | ||
| paramB := workload.HeterogeneousTargetParam{Target: workload.Target{TargetRef: corev1.ObjectReference{Kind: workload.ModelServingKind.Kind, Namespace: ns, Name: "ms-nil-b"}, MetricEndpoint: workload.MetricEndpoint{Uri: u.Path, Port: port}}, MinReplicas: 2, MaxReplicas: 4, Cost: 20} | ||
| // PanicThresholdPercent set to CRD default of 200 — per API guarantee this is never nil | ||
| var threshold int32 = 200 | ||
| policy := &workload.AutoscalingPolicy{Spec: workload.AutoscalingPolicySpec{TolerancePercent: 0, Metrics: []workload.AutoscalingPolicyMetric{{MetricName: "load", TargetValue: resource.MustParse("1")}}, Behavior: workload.AutoscalingPolicyBehavior{ScaleUp: workload.AutoscalingPolicyScaleUpPolicy{PanicPolicy: workload.AutoscalingPolicyPanicPolicy{Period: metav1.Duration{Duration: 1 * time.Second}, PanicThresholdPercent: &threshold}}}}} | ||
| binding := &workload.AutoscalingPolicyBinding{ObjectMeta: metav1.ObjectMeta{Name: "binding-nil", Namespace: ns}, Spec: workload.AutoscalingPolicyBindingSpec{PolicyRef: corev1.LocalObjectReference{Name: "ap"}, HeterogeneousTarget: &workload.HeterogeneousTarget{Params: []workload.HeterogeneousTargetParam{paramA, paramB}, CostExpansionRatePercent: 100}}} | ||
|
|
||
| lbsA := map[string]string{} | ||
| lbsB := map[string]string{} | ||
| pods := []*corev1.Pod{readyPod(ns, "pod-nil-a", host, lbsA), readyPod(ns, "pod-nil-b", host, lbsB)} | ||
| ac := &AutoscaleController{client: client, namespace: ns, modelServingLister: msLister, podsLister: fakePodLister{podsByNs: map[string][]*corev1.Pod{ns: pods}}, scalerMap: map[string]*autoscalerAutoscaler{}, optimizerMap: map[string]*autoscalerOptimizer{}} | ||
|
|
||
| if err := ac.doOptimize(context.Background(), binding, policy); err != nil { | ||
| t.Fatalf("doOptimize should not error with default PanicThresholdPercent: %v", err) | ||
| } | ||
| } |
There was a problem hiding this comment.
The test name "TestDefaultPanicThreshold_DoOptimize_NoPanic" is ambiguous. It's unclear whether "NoPanic" means "no crash/error" or "panic mode not activated". The test only verifies that doOptimize doesn't return an error, but doesn't verify whether panic mode was activated or not. Consider either: (1) Renaming to something like "TestDefaultPanicThreshold_DoOptimize_NoError" to clarify it's testing for absence of errors, or (2) Adding assertions to verify whether panic mode was actually triggered and renaming accordingly.

Align optimizer with CRD validation contract
Summary
This PR aligns the heterogeneous autoscaler optimizer logic with the CRD validation guarantees for
panicThresholdPercent.Based on the API definition: