feat : e2e test for Model serving partition revision control#837
feat : e2e test for Model serving partition revision control#837katara-Jayprakash wants to merge 5 commits intovolcano-sh:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the ModelServing controller's rolling update capabilities by introducing and thoroughly testing partition-based revision control. It ensures that during phased rollouts, a specified portion of replicas maintains their current stable configuration while the remaining replicas are updated, providing a robust mechanism for controlled deployments and rollbacks. The changes primarily involve adding comprehensive end-to-end tests to validate these new behaviors. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
30cdfac to
13d5704
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces new end-to-end tests for ModelServing to validate partition-based rolling update behavior, including scenarios for boundary protection, historical revision usage upon pod deletion within a protected partition, and standard rolling updates without partitions. The reviewer suggests refactoring duplicated ModelServing setup logic and pod verification logic into helper functions to improve maintainability and readability. Additionally, a performance improvement is suggested by compiling a regular expression once at the package level instead of repeatedly inside a function.
There was a problem hiding this comment.
Pull request overview
Adds end-to-end coverage for ModelServing “partition-based” rolling update revision behavior, validating that lower ordinals remain pinned to the historical revision while higher ordinals move to the new revision, and that non-partitioned rollouts still fully converge as before.
Changes:
- Add E2E test for partition boundary protection (CurrentRevision pinned; UpdateRevision applied only to ordinals >= partition).
- Add E2E test ensuring deleted pods within the protected partition range are recreated from the historical (CurrentRevision) template.
- Add E2E test verifying default (no partition) rolling update fully converges (CurrentRevision == UpdateRevision; all replicas updated).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Verify Status.CurrentRevision and Status.UpdateRevision. | ||
| t.Log("Verifying Status.CurrentRevision and Status.UpdateRevision after partitioned update") | ||
| var finalMS *workload.ModelServing | ||
| require.Eventually(t, func() bool { |
There was a problem hiding this comment.
If ModelServing is already running, you can simply retrieve ModelServing.Status to compare the results.
|
|
||
| assert.Equal(t, initialCurrentRevision, finalMS.Status.CurrentRevision, | ||
| "CurrentRevision should remain the initial revision") | ||
| assert.NotEqual(t, finalMS.Status.CurrentRevision, finalMS.Status.UpdateRevision, |
There was a problem hiding this comment.
Isn't these two assert already checked in the require.Eventually block above?
| ms.Status.UpdatedReplicas == (replicas-partition) | ||
| }, 3*time.Minute, 5*time.Second, "ModelServing revision status did not converge to expected partitioned state") | ||
|
|
||
| // Verify the partitioned state is established: R-0,R-1,R-2 have old image, R-3,R-4 have new |
There was a problem hiding this comment.
Could we encapsulate some duplicated logic into functions? The code looks so redundant.
|
|
||
| // TestModelServingNoPartitionRollingUpdate verifies default rolling-update behavior | ||
| // when partition is nil: all replicas move to the new revision and image. | ||
| func TestModelServingNoPartitionRollingUpdate(t *testing.T) { |
There was a problem hiding this comment.
| func TestModelServingNoPartitionRollingUpdate(t *testing.T) { | |
| func TestModelServingRollingUpdate(t *testing.T) { |
Just the simplest rolling update, no need to specify "NoPartition" explicitly.
| return false | ||
| } | ||
| return verify(pods.Items) | ||
| }, timeout, 5*time.Second, failureMsg) |
There was a problem hiding this comment.
5*time.Second seems too long., which increase the total test time
There was a problem hiding this comment.
5*time.Second seems too long., which increase the total test time
working on it! sorry for keeping it delay
87ba6f4 to
8cbf1fc
Compare
8cbf1fc to
4f6e541
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
4f6e541 to
b73bca6
Compare
b73bca6 to
59b1d82
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
sorry for so much comments and things going on! i actually experimenting with my claude opus4.5! |
59b1d82 to
02e60dd
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
dd9c895 to
6b4231d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
not sure i understand, if this is fort e2e test, why do you change this file
| // Create missing ServingGroups for ordinals in [partition, expectedCount) using the new revision. | ||
| // Use dense ordinals [0,expectedCount) instead of appending from maxOrdinal+1 so that a gap | ||
| // left by rolling update (e.g. ordinals 0,1,2,4 after deleting 3) is filled at 3 instead of | ||
| // incorrectly creating maxOrdinal+1 (which would exceed replica bounds). |
There was a problem hiding this comment.
As far as I recall, the implementation of ControllerRevision already fills the sequence numbers in the range (0, Partition).
There was a problem hiding this comment.
kthena/pkg/model-serving-controller/controller/model_serving_controller.go
Lines 700 to 749 in 024db2f
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| const replicas = int32(3) | ||
|
|
||
| modelServing := createBasicModelServing("test-rolling-update", replicas) |
There was a problem hiding this comment.
createBasicModelServing is defined with three parameters (name, servingGroupReplicas, workloadRoleReplicas, ...), but this call passes only two args. This won’t compile; pass the missing workloadRoleReplicas value (often 0 in this file) or use the correct helper/signature.
| modelServing := createBasicModelServing("test-rolling-update", replicas) | |
| modelServing := createBasicModelServing("test-rolling-update", replicas, 0) |
| // Create missing ServingGroups for ordinals in [partition, expectedCount) using the new revision. | ||
| // Use dense ordinals [0,expectedCount) instead of appending from maxOrdinal+1 so that a gap | ||
| // left by rolling update (e.g. ordinals 0,1,2,4 after deleting 3) is filled at 3 instead of | ||
| // incorrectly creating maxOrdinal+1 (which would exceed replica bounds). |
There was a problem hiding this comment.
The function-level comment for scaleUpServingGroups still states that when partition is not set it creates new groups starting from maxOrdinal+1, but the updated logic now fills missing ordinals densely in [partition, expectedCount). Please update the comment to match the new behavior so future readers don’t assume the old append-from-max semantics.
| // Create missing ServingGroups for ordinals in [partition, expectedCount) using the new revision. | |
| // Use dense ordinals [0,expectedCount) instead of appending from maxOrdinal+1 so that a gap | |
| // left by rolling update (e.g. ordinals 0,1,2,4 after deleting 3) is filled at 3 instead of | |
| // incorrectly creating maxOrdinal+1 (which would exceed replica bounds). | |
| // Create missing ServingGroups by densely filling any absent ordinals in [partition, expectedCount) | |
| // using the new revision. This applies even when partition == 0: the controller fills gaps within | |
| // the expected ordinal range instead of appending new groups starting from maxOrdinal+1. For example, | |
| // if ordinals 0,1,2,4 exist and 3 is missing, ordinal 3 is created rather than incorrectly creating 5. |
| klog.V(4).Infof("scaleUpServingGroups: missing ordinals in [%d,%d) with newRevision=%d, modelServing=%s", | ||
| partition, expectedCount, missingNewRevision, utils.GetNamespaceName(ms)) |
There was a problem hiding this comment.
This log line is misleading: it prints with newRevision=%d but the value is missingNewRevision (a count), not the revision string. Rename the log key to something like missingCount/missingNewRevisionCount (and optionally log newRevision separately) to avoid confusion when debugging.
| klog.V(4).Infof("scaleUpServingGroups: missing ordinals in [%d,%d) with newRevision=%d, modelServing=%s", | |
| partition, expectedCount, missingNewRevision, utils.GetNamespaceName(ms)) | |
| klog.V(4).Infof("scaleUpServingGroups: missing ordinals in [%d,%d) with missingNewRevisionCount=%d, newRevision=%s, modelServing=%s", | |
| partition, expectedCount, missingNewRevision, newRevision, utils.GetNamespaceName(ms)) |
There was a problem hiding this comment.
no need to change this part
| if isProtected && container.Image == nginxImage { | ||
| protectedCorrect++ | ||
| } else if !isProtected && container.Image == nginxAlpineImage { | ||
| updatedCorrect++ |
There was a problem hiding this comment.
you cant verify index in this way due to th special machanism of ms index
#874
Of course, checking the partition index is fine, so the reasonable approach should be that the indexes within the partition are normal, and for those outside the partition, there's no need to check the index; just check that the number of running pods meets expectations.
Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
1b8d57c to
41523d6
Compare
Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
41523d6 to
189ef64
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| targetOrdinal := 1 | ||
| targetGroupName := fmt.Sprintf("%s-%d", modelServing.Name, targetOrdinal) | ||
| labelSelector := fmt.Sprintf("modelserving.volcano.sh/group-name=%s", targetGroupName) |
There was a problem hiding this comment.
In this test you hardcode the group-name label key as a string ("modelserving.volcano.sh/group-name"). The repo already defines and uses workload.GroupNameLabelKey for this label (and the same file uses it in collectRunningServingGroupStates). Using the constant here avoids drift if the label key ever changes and keeps tests consistent.
| labelSelector := fmt.Sprintf("modelserving.volcano.sh/group-name=%s", targetGroupName) | |
| labelSelector := fmt.Sprintf("%s=%s", workload.GroupNameLabelKey, targetGroupName) |
| } | ||
| _, servingGroupOrdinal := utils.GetParentNameAndOrdinal(servingGroup.Name) | ||
| isPartitionProtected := partition > 0 && index < partition | ||
| isPartitionProtected := partition > 0 && servingGroupOrdinal < partition |
There was a problem hiding this comment.
utils.GetParentNameAndOrdinal returns ordinal=-1 when the name doesn't match the expected "-" pattern. With the new logic, servingGroupOrdinal < partition would treat -1 as partition-protected, which is incorrect and could make the controller use a historical revision for an unrelated/invalid group name. Consider guarding with servingGroupOrdinal >= 0 (and optionally logging/skipping invalid names) before applying partition protection.
| isPartitionProtected := partition > 0 && servingGroupOrdinal < partition | |
| if partition > 0 && servingGroupOrdinal < 0 { | |
| klog.Warningf("manageRole: invalid serving group name %s for partition protection, skipping protected revision logic", servingGroup.Name) | |
| } | |
| isPartitionProtected := partition > 0 && servingGroupOrdinal >= 0 && servingGroupOrdinal < partition |
…roller/model_serving_controller.go file Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
What type of PR is this?
implemented e2e test for Model serving partition revision control
/cc @FAUST-BENCHOU
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #801
Special notes for your reviewer:
Does this PR introduce a user-facing change?: