
feat : e2e test for Model serving partition revision control #837

Open
katara-Jayprakash wants to merge 5 commits into volcano-sh:main from katara-Jayprakash:ModelServing-Partition-Revision-Control

Member

@katara-Jayprakash katara-Jayprakash commented Mar 25, 2026

What type of PR is this?
Implemented e2e tests for ModelServing partition revision control.
/cc @FAUST-BENCHOU

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #801

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


Copilot AI review requested due to automatic review settings March 25, 2026 07:44
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the ModelServing controller's rolling update capabilities by introducing and thoroughly testing partition-based revision control. It ensures that during phased rollouts, a specified portion of replicas maintains their current stable configuration while the remaining replicas are updated, providing a robust mechanism for controlled deployments and rollbacks. The changes primarily involve adding comprehensive end-to-end tests to validate these new behaviors.

Highlights

  • Partitioned Rolling Updates: Introduced new E2E tests to verify the behavior of ModelServing partitioned rolling updates, ensuring that a specified number of replicas retain their current revision while others are updated.
  • Historical Revision Protection: Added a test case to confirm that if a pod within a partition-protected range is deleted, it is recreated using its historical revision, maintaining the stability of the protected partition.
  • Default Rolling Update Behavior: Included an E2E test to validate the standard rolling update process when no partition is specified, ensuring all replicas are updated to the new revision and image.
  • Helper Functions: Implemented utility functions getGroupOrdinal and getPodContainerImage to assist in parsing group names and extracting container images within the E2E tests.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces new end-to-end tests for ModelServing to validate partition-based rolling update behavior, including scenarios for boundary protection, historical revision usage upon pod deletion within a protected partition, and standard rolling updates without partitions. The reviewer suggests refactoring duplicated ModelServing setup logic and pod verification logic into helper functions to improve maintainability and readability. Additionally, a performance improvement is suggested by compiling a regular expression once at the package level instead of repeatedly inside a function.
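
The regular-expression suggestion in the review above can be sketched as follows. This is a minimal illustration, not the PR's code: `groupOrdinal` is a hypothetical stand-in for the `getGroupOrdinal` helper, and the `<name>-<ordinal>` naming pattern is an assumption.

```go
package main

import (
	"regexp"
	"strconv"
)

// groupOrdinalRe is compiled once at package initialization instead of on
// every call. The "<name>-<ordinal>" pattern is an assumption for illustration.
var groupOrdinalRe = regexp.MustCompile(`^(.+)-(\d+)$`)

// groupOrdinal is a hypothetical stand-in for the PR's getGroupOrdinal helper.
// It returns the trailing ordinal of a group name, or -1 if the name does not
// end in "-<digits>".
func groupOrdinal(name string) int {
	m := groupOrdinalRe.FindStringSubmatch(name)
	if m == nil {
		return -1
	}
	n, err := strconv.Atoi(m[2])
	if err != nil {
		return -1
	}
	return n
}
```

Hoisting `regexp.MustCompile` to a package-level variable avoids recompiling the pattern each time the helper runs inside a polling loop.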

Comment thread test/e2e/controller-manager/model_serving_test.go Outdated
Comment thread test/e2e/controller-manager/model_serving_test.go Outdated
Comment thread test/e2e/controller-manager/model_serving_test.go Outdated
Contributor

Copilot AI left a comment


Pull request overview

Adds end-to-end coverage for ModelServing “partition-based” rolling update revision behavior, validating that lower ordinals remain pinned to the historical revision while higher ordinals move to the new revision, and that non-partitioned rollouts still fully converge as before.

Changes:

  • Add E2E test for partition boundary protection (CurrentRevision pinned; UpdateRevision applied only to ordinals >= partition).
  • Add E2E test ensuring deleted pods within the protected partition range are recreated from the historical (CurrentRevision) template.
  • Add E2E test verifying default (no partition) rolling update fully converges (CurrentRevision == UpdateRevision; all replicas updated).
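
The partition rule these tests exercise can be written down as a small sketch. The function names are hypothetical and the code is not taken from the controller; it only restates the rule described above (ordinals below the partition stay on CurrentRevision, the rest move to UpdateRevision).

```go
package main

// expectedRevision returns the revision a replica at the given ordinal is
// expected to run. A partition <= 0 is treated as "no partition", so every
// ordinal moves to the update revision.
func expectedRevision(ordinal, partition int, currentRevision, updateRevision string) string {
	if partition > 0 && ordinal < partition {
		return currentRevision
	}
	return updateRevision
}

// expectedUpdatedReplicas returns how many replicas should end up on the
// update revision, i.e. replicas - partition, clamped to [0, replicas].
func expectedUpdatedReplicas(replicas, partition int) int {
	if partition <= 0 {
		return replicas
	}
	if partition >= replicas {
		return 0
	}
	return replicas - partition
}
```

With 5 replicas and a partition of 3, ordinals 0 to 2 keep the current revision and 2 replicas are expected on the update revision.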

Copilot AI review requested due to automatic review settings March 25, 2026 18:12
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.


// Verify Status.CurrentRevision and Status.UpdateRevision.
t.Log("Verifying Status.CurrentRevision and Status.UpdateRevision after partitioned update")
var finalMS *workload.ModelServing
require.Eventually(t, func() bool {
Contributor


If ModelServing is already running, you can simply retrieve ModelServing.Status to compare the results.

@katara-Jayprakash katara-Jayprakash changed the title feat : Model serving partition revision control feat : e2e test for Model serving partition revision control Mar 30, 2026

assert.Equal(t, initialCurrentRevision, finalMS.Status.CurrentRevision,
"CurrentRevision should remain the initial revision")
assert.NotEqual(t, finalMS.Status.CurrentRevision, finalMS.Status.UpdateRevision,
Member


Aren't these two asserts already checked in the require.Eventually block above?

ms.Status.UpdatedReplicas == (replicas-partition)
}, 3*time.Minute, 5*time.Second, "ModelServing revision status did not converge to expected partitioned state")

// Verify the partitioned state is established: R-0,R-1,R-2 have old image, R-3,R-4 have new
Member

@YaoZengzeng YaoZengzeng Mar 31, 2026


Could we encapsulate some duplicated logic into functions? The code looks so redundant.


// TestModelServingNoPartitionRollingUpdate verifies default rolling-update behavior
// when partition is nil: all replicas move to the new revision and image.
func TestModelServingNoPartitionRollingUpdate(t *testing.T) {
Member


Suggested change
- func TestModelServingNoPartitionRollingUpdate(t *testing.T) {
+ func TestModelServingRollingUpdate(t *testing.T) {

Just the simplest rolling update, no need to specify "NoPartition" explicitly.

return false
}
return verify(pods.Items)
}, timeout, 5*time.Second, failureMsg)
Member


5*time.Second seems too long; it increases the total test time.

Member Author


> 5*time.Second seems too long; it increases the total test time.

Working on it! Sorry for the delay.

@katara-Jayprakash katara-Jayprakash force-pushed the ModelServing-Partition-Revision-Control branch from 87ba6f4 to 8cbf1fc Compare April 3, 2026 18:12
Copilot AI review requested due to automatic review settings April 3, 2026 19:32
@katara-Jayprakash katara-Jayprakash force-pushed the ModelServing-Partition-Revision-Control branch from 8cbf1fc to 4f6e541 Compare April 3, 2026 19:32
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.


@katara-Jayprakash katara-Jayprakash force-pushed the ModelServing-Partition-Revision-Control branch from 4f6e541 to b73bca6 Compare April 3, 2026 21:27
Copilot AI review requested due to automatic review settings April 4, 2026 05:40
@katara-Jayprakash katara-Jayprakash force-pushed the ModelServing-Partition-Revision-Control branch from b73bca6 to 59b1d82 Compare April 4, 2026 05:40
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.



@katara-Jayprakash
Member Author

Sorry for so many comments and so much going on! I'm actually experimenting with my Claude Opus 4.5!

@katara-Jayprakash katara-Jayprakash force-pushed the ModelServing-Partition-Revision-Control branch from 59b1d82 to 02e60dd Compare April 4, 2026 06:07
Copilot AI review requested due to automatic review settings April 4, 2026 07:04
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.



Copilot AI review requested due to automatic review settings April 4, 2026 07:43
@katara-Jayprakash katara-Jayprakash force-pushed the ModelServing-Partition-Revision-Control branch from dd9c895 to 6b4231d Compare April 4, 2026 07:43
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


Member


Not sure I understand: if this is for an e2e test, why do you change this file?

// Create missing ServingGroups for ordinals in [partition, expectedCount) using the new revision.
// Use dense ordinals [0,expectedCount) instead of appending from maxOrdinal+1 so that a gap
// left by rolling update (e.g. ordinals 0,1,2,4 after deleting 3) is filled at 3 instead of
// incorrectly creating maxOrdinal+1 (which would exceed replica bounds).
Contributor


As far as I recall, the implementation of ControllerRevision already fills the sequence numbers in the range [0, partition).

Contributor


if partition > 0 {
	klog.V(4).Infof("scaleUpServingGroups: partition=%d set, filling missing ordinals in [0, %d) for modelServing=%s",
		partition, partition, utils.GetNamespaceName(ms))
	// When partition is set, fill missing ordinals in [0, partition) using CurrentRevision
	for ordinal := 0; ordinal < partition && ordinal < expectedCount; ordinal++ {
		if existingOrdinals[ordinal] {
			klog.V(4).Infof("scaleUpServingGroups: ordinal %d already exists, skipping", ordinal)
			continue
		}
		// Use CurrentRevision for partition-protected ordinals
		revisionToUse := newRevision
		if ms.Status.CurrentRevision != "" {
			revisionToUse = ms.Status.CurrentRevision
		}
		klog.V(4).Infof("scaleUpServingGroups: ordinal %d missing (partition-protected), revisionToUse=%s, currentRevision=%s",
			ordinal, revisionToUse, ms.Status.CurrentRevision)
		// For ordinal < partition, we should use the old template from the revision
		// Two cases:
		// 1. First startup: use ms.Spec.Template.Roles (which corresponds to CurrentRevision)
		// 2. During recovery: use template from ControllerRevision retrieved by revision
		var rolesToUse []workloadv1alpha1.Role
		cr, _ := utils.GetControllerRevision(ctx, c.kubeClientSet, ms, revisionToUse)
		if cr != nil {
			// Case 2: Recovery scenario - use template from ControllerRevision
			if roles, err := utils.GetRolesFromControllerRevision(cr); err != nil {
				klog.Warningf("Failed to get roles from ControllerRevision for revision %s (ordinal %d): %v, falling back to ms.Spec.Template.Roles", revisionToUse, ordinal, err)
				rolesToUse = ms.Spec.Template.Roles
			} else {
				rolesToUse = roles
				klog.V(4).Infof("Recovering ServingGroup at ordinal %d with revision %s using template from ControllerRevision (partition=%d)", ordinal, revisionToUse, partition)
			}
		} else {
			// Case 1: First startup - ControllerRevision not found, use ms.Spec.Template.Roles
			rolesToUse = ms.Spec.Template.Roles
			klog.V(4).Infof("Creating missing ServingGroup at ordinal %d with revision %s using ms.Spec.Template.Roles (partition=%d, first startup)", ordinal, revisionToUse, partition)
		}
		if err := createServingGroup(ordinal, revisionToUse, rolesToUse); err != nil {
			return err
		}
		// Update existingOrdinals and maxOrdinal
		existingOrdinals[ordinal] = true
		if ordinal > maxOrdinal {
			maxOrdinal = ordinal
		}
	}
}
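
The ordinal-filling behavior discussed in this thread can be summarized as a simplified sketch. It is not the controller's code: `plannedGroup` and `planMissingGroups` are hypothetical names, and the sketch only decides which ordinals to create and with which revision, leaving out the ControllerRevision lookup and the group creation itself.

```go
package main

// plannedGroup describes one ServingGroup to create: its ordinal and the
// revision it should be created from.
type plannedGroup struct {
	Ordinal  int
	Revision string
}

// planMissingGroups walks the dense range [0, expectedCount) and plans a
// group for every absent ordinal. Ordinals below the partition use the
// current revision when one is recorded; all others use the new revision.
func planMissingGroups(existing map[int]bool, partition, expectedCount int, currentRevision, newRevision string) []plannedGroup {
	var plan []plannedGroup
	for ordinal := 0; ordinal < expectedCount; ordinal++ {
		if existing[ordinal] {
			continue
		}
		revision := newRevision
		if ordinal < partition && currentRevision != "" {
			revision = currentRevision
		}
		plan = append(plan, plannedGroup{Ordinal: ordinal, Revision: revision})
	}
	return plan
}
```

For the example from the diff comment (ordinals 0, 1, 2, 4 present, 5 expected), this plans ordinal 3 rather than appending ordinal 5.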

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.




const replicas = int32(3)

modelServing := createBasicModelServing("test-rolling-update", replicas)

Copilot AI Apr 15, 2026


createBasicModelServing is defined with three parameters (name, servingGroupReplicas, workloadRoleReplicas, ...), but this call passes only two args. This won’t compile; pass the missing workloadRoleReplicas value (often 0 in this file) or use the correct helper/signature.

Suggested change
- modelServing := createBasicModelServing("test-rolling-update", replicas)
+ modelServing := createBasicModelServing("test-rolling-update", replicas, 0)

Comment on lines +752 to +755
// Create missing ServingGroups for ordinals in [partition, expectedCount) using the new revision.
// Use dense ordinals [0,expectedCount) instead of appending from maxOrdinal+1 so that a gap
// left by rolling update (e.g. ordinals 0,1,2,4 after deleting 3) is filled at 3 instead of
// incorrectly creating maxOrdinal+1 (which would exceed replica bounds).

Copilot AI Apr 15, 2026


The function-level comment for scaleUpServingGroups still states that when partition is not set it creates new groups starting from maxOrdinal+1, but the updated logic now fills missing ordinals densely in [partition, expectedCount). Please update the comment to match the new behavior so future readers don’t assume the old append-from-max semantics.

Suggested change
- // Create missing ServingGroups for ordinals in [partition, expectedCount) using the new revision.
- // Use dense ordinals [0,expectedCount) instead of appending from maxOrdinal+1 so that a gap
- // left by rolling update (e.g. ordinals 0,1,2,4 after deleting 3) is filled at 3 instead of
- // incorrectly creating maxOrdinal+1 (which would exceed replica bounds).
+ // Create missing ServingGroups by densely filling any absent ordinals in [partition, expectedCount)
+ // using the new revision. This applies even when partition == 0: the controller fills gaps within
+ // the expected ordinal range instead of appending new groups starting from maxOrdinal+1. For example,
+ // if ordinals 0,1,2,4 exist and 3 is missing, ordinal 3 is created rather than incorrectly creating 5.

Comment on lines +762 to +763
klog.V(4).Infof("scaleUpServingGroups: missing ordinals in [%d,%d) with newRevision=%d, modelServing=%s",
partition, expectedCount, missingNewRevision, utils.GetNamespaceName(ms))

Copilot AI Apr 15, 2026


This log line is misleading: it prints with newRevision=%d but the value is missingNewRevision (a count), not the revision string. Rename the log key to something like missingCount/missingNewRevisionCount (and optionally log newRevision separately) to avoid confusion when debugging.

Suggested change
- klog.V(4).Infof("scaleUpServingGroups: missing ordinals in [%d,%d) with newRevision=%d, modelServing=%s",
- 	partition, expectedCount, missingNewRevision, utils.GetNamespaceName(ms))
+ klog.V(4).Infof("scaleUpServingGroups: missing ordinals in [%d,%d) with missingNewRevisionCount=%d, newRevision=%s, modelServing=%s",
+ 	partition, expectedCount, missingNewRevision, newRevision, utils.GetNamespaceName(ms))

Contributor


no need to change this part

if isProtected && container.Image == nginxImage {
protectedCorrect++
} else if !isProtected && container.Image == nginxAlpineImage {
updatedCorrect++
Contributor


You can't verify the index this way because of the special mechanism of the ms index (see #874).

Of course, checking the partition index is fine, so the reasonable approach should be that the indexes within the partition are normal, and for those outside the partition, there's no need to check the index; just check that the number of running pods meets expectations.
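
One possible reading of this suggestion, as a hedged sketch: check ordinals strictly inside the partition by index, and only count groups outside it. The `groupState` type and `verifyPartitionedRollout` function are hypothetical, the image strings are placeholders, and the sketch assumes `partition <= replicas`.

```go
package main

// groupState is a hypothetical, simplified view of one running ServingGroup.
type groupState struct {
	Ordinal int
	Image   string
}

// verifyPartitionedRollout checks that every ordinal in [0, partition) is
// present and runs oldImage. Groups outside the partition are not checked by
// index; the function only counts how many of them run newImage and compares
// that count with replicas - partition.
func verifyPartitionedRollout(groups []groupState, partition, replicas int, oldImage, newImage string) bool {
	protected := make(map[int]bool)
	updated := 0
	for _, g := range groups {
		if g.Ordinal >= 0 && g.Ordinal < partition {
			if g.Image != oldImage {
				return false
			}
			protected[g.Ordinal] = true
			continue
		}
		if g.Image == newImage {
			updated++
		}
	}
	return len(protected) == partition && updated == replicas-partition
}
```

Because ordinals outside the partition are never compared against fixed values, the check holds even if updated groups end up at non-contiguous ordinals.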

Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
Copilot AI review requested due to automatic review settings April 15, 2026 04:00
@katara-Jayprakash katara-Jayprakash force-pushed the ModelServing-Partition-Revision-Control branch from 41523d6 to 189ef64 Compare April 15, 2026 04:00
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.




targetOrdinal := 1
targetGroupName := fmt.Sprintf("%s-%d", modelServing.Name, targetOrdinal)
labelSelector := fmt.Sprintf("modelserving.volcano.sh/group-name=%s", targetGroupName)

Copilot AI Apr 15, 2026


In this test you hardcode the group-name label key as a string ("modelserving.volcano.sh/group-name"). The repo already defines and uses workload.GroupNameLabelKey for this label (and the same file uses it in collectRunningServingGroupStates). Using the constant here avoids drift if the label key ever changes and keeps tests consistent.

Suggested change
- labelSelector := fmt.Sprintf("modelserving.volcano.sh/group-name=%s", targetGroupName)
+ labelSelector := fmt.Sprintf("%s=%s", workload.GroupNameLabelKey, targetGroupName)

}
_, servingGroupOrdinal := utils.GetParentNameAndOrdinal(servingGroup.Name)
- isPartitionProtected := partition > 0 && index < partition
+ isPartitionProtected := partition > 0 && servingGroupOrdinal < partition

Copilot AI Apr 15, 2026


utils.GetParentNameAndOrdinal returns ordinal=-1 when the name doesn't match the expected "-" pattern. With the new logic, servingGroupOrdinal < partition would treat -1 as partition-protected, which is incorrect and could make the controller use a historical revision for an unrelated/invalid group name. Consider guarding with servingGroupOrdinal >= 0 (and optionally logging/skipping invalid names) before applying partition protection.

Suggested change
- isPartitionProtected := partition > 0 && servingGroupOrdinal < partition
+ if partition > 0 && servingGroupOrdinal < 0 {
+ 	klog.Warningf("manageRole: invalid serving group name %s for partition protection, skipping protected revision logic", servingGroup.Name)
+ }
+ isPartitionProtected := partition > 0 && servingGroupOrdinal >= 0 && servingGroupOrdinal < partition
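
The effect of the suggested guard can be seen in a small sketch that contrasts the two predicates. Both function names are hypothetical; the only assumption taken from the comment above is that an unparseable group name yields an ordinal of -1.

```go
package main

// invalidOrdinal mirrors the -1 sentinel that, per the review comment, is
// returned when a group name does not carry a parseable ordinal.
const invalidOrdinal = -1

// isProtectedUnguarded reproduces the condition before the suggestion: any
// ordinal below the partition counts as protected, including the sentinel.
func isProtectedUnguarded(ordinal, partition int) bool {
	return partition > 0 && ordinal < partition
}

// isProtectedGuarded adds the lower bound so that the sentinel (or any other
// negative ordinal) is never treated as partition-protected.
func isProtectedGuarded(ordinal, partition int) bool {
	return partition > 0 && ordinal >= 0 && ordinal < partition
}
```

For valid ordinals the two predicates agree; they differ only for negative values such as `invalidOrdinal`, which the unguarded form classifies as protected.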

…roller/model_serving_controller.go file

Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>

Development

Successfully merging this pull request may close these issues.

E2E test for ModelServing Partition Revision Control

7 participants