
Fix/modelserving partition revision regressions #840

Open
katara-Jayprakash wants to merge 7 commits into volcano-sh:main from
katara-Jayprakash:fix/modelserving-partition-revision-regressions

Conversation

@katara-Jayprakash
Member

@katara-Jayprakash katara-Jayprakash commented Mar 26, 2026

What type of PR is this?
While running the e2e tests for Model serving partition revision control (#837), I found two bugs. First, a ControllerRevision is modified after it is created, although it should be immutable once saved. Second, manageRoleReplicas always uses newRevision when scaling up, even when it should use the effectiveRevision determined by the partition rules.

What this PR does / why we need it:
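The fix for the second bug boils down to a simple selection rule, sketched below. The function name and signature are hypothetical; the real controller looks the stored revision up via c.store.GetServingGroupRevision.

```go
package main

import "fmt"

// effectiveRevisionFor sketches the selection rule this PR applies in
// manageRoleReplicas: prefer the ServingGroup's stored revision (when
// present) over the latest newRevision, so partition-protected groups
// keep running their old revision when roles are recreated or scaled.
func effectiveRevisionFor(newRevision, storedRevision string, found bool) string {
	if found && storedRevision != "" {
		return storedRevision
	}
	return newRevision
}

func main() {
	// Existing group protected by the partition: keep its old revision.
	fmt.Println(effectiveRevisionFor("rev-2", "rev-1", true)) // rev-1
	// New group with no stored revision: use the latest revision.
	fmt.Println(effectiveRevisionFor("rev-2", "", false)) // rev-2
}
```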

Which issue(s) this PR fixes:
Fixes #841

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


Copilot AI review requested due to automatic review settings March 26, 2026 09:33
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses two critical bugs related to model serving partition revision control. It enforces the immutability of ControllerRevision objects to prevent their accidental alteration after creation and rectifies the manageRoleReplicas function to correctly apply the effectiveRevision based on partition rules during scaling and pod recreation. These fixes ensure more predictable and reliable behavior in model serving deployments.

Highlights

  • ControllerRevision Immutability: Ensured that ControllerRevision objects, once created, are immutable. This prevents unintended modifications to historical revision data, maintaining the integrity of deployment history.
  • Correct Revision Usage in Role Management: Corrected the logic within the manageRoleReplicas function to properly determine and utilize the effectiveRevision for scaling operations. This ensures that partition rules are respected when scaling up or recreating pods.
  • Enhanced Test Coverage: Added new unit tests to validate both the immutability of ControllerRevision objects and the correct application of revisions during the management of role replicas.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the revision handling logic within the model serving controller. It introduces an effectiveRevision concept to ensure that when managing role replicas and recreating pods, the correct, potentially existing, revision is consistently applied. The CreateControllerRevision utility function was updated to enforce immutability for existing controller revisions; attempts to create a revision with the same key but different data will now log a warning and return the existing, unchanged revision. New tests were added to validate these behaviors, specifically confirming that recreated pods use the serving group's existing revision and that controller revisions remain immutable once created. There are two suggestions: one to remove redundant effectiveRevision calculation in the manageRole function, and another to elevate a klog.Warningf to klog.Errorf in CreateControllerRevision for better visibility of potential conflicts.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes regressions in ModelServing partition revision control by enforcing immutability of ControllerRevision snapshots and ensuring role replica management uses the correct (effective) revision for existing ServingGroups.

Changes:

  • Stop mutating apps/v1.ControllerRevision data/revision after creation; preserve the first stored payload for a given revision key.
  • Ensure manageRoleReplicas uses the ServingGroup’s effective revision (from store / ServingGroup) when recreating/scaling roles, rather than always using the latest newRevision.
  • Add targeted unit tests covering ControllerRevision immutability and ServingGroup revision usage during pod recreation.
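The immutability guard described above could look roughly like this. This is a simplified sketch: the stand-in types and the in-memory store are assumptions, while the real code operates on apps/v1 ControllerRevision objects through the Kubernetes API.

```go
package main

import (
	"bytes"
	"fmt"
)

// revision is a simplified stand-in for an apps/v1 ControllerRevision.
type revision struct {
	Name string
	Data []byte
}

// store stands in for the API server's view of existing revisions.
var store = map[string]*revision{}

// createControllerRevision mimics the immutable-create semantics this PR
// enforces: if a revision with the same name already exists, return it
// unchanged, even when the new data differs (log a warning instead of
// updating the stored payload).
func createControllerRevision(name string, data []byte) *revision {
	if existing, ok := store[name]; ok {
		if !bytes.Equal(existing.Data, data) {
			fmt.Printf("warning: revision %s exists with different data; keeping existing\n", name)
		}
		return existing
	}
	r := &revision{Name: name, Data: data}
	store[name] = r
	return r
}

func main() {
	first := createControllerRevision("ms-rev-1", []byte(`{"replicas":1}`))
	second := createControllerRevision("ms-rev-1", []byte(`{"replicas":2}`))
	// The second call returns the original, unmodified snapshot.
	fmt.Println(first == second, string(second.Data))
}
```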

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
pkg/model-serving-controller/utils/controller_revision.go Makes ControllerRevision snapshots immutable by not updating existing objects.
pkg/model-serving-controller/utils/controller_revision_test.go Adds a test asserting existing ControllerRevision payload is not overwritten.
pkg/model-serving-controller/controller/model_serving_controller.go Uses effective (ServingGroup) revision when recreating/scaling role pods.
pkg/model-serving-controller/controller/model_serving_controller_test.go Adds a test asserting recreated pods keep the ServingGroup revision.


Copilot AI review requested due to automatic review settings March 26, 2026 09:44
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.



Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.



Contributor

@LiZhenCheng9527 LiZhenCheng9527 left a comment


@FAUST-BENCHOU Is it expected that the controllerRevision will change after it has been created?

func (c *ModelServingController) manageRoleReplicas(ctx context.Context, ms *workloadv1alpha1.ModelServing, groupName string, targetRole workloadv1alpha1.Role, servingGroupOrdinal int, newRevision string) {
// TODO: add podGroup update after gang scheduler finished
effectiveRevision := newRevision
if revision, ok := c.store.GetServingGroupRevision(utils.GetNamespaceName(ms), groupName); ok && revision != "" {
Contributor


The revisions in our store contain the old ModelServing revisions. If we handle it this way, wouldn’t that mean we cannot perform a rolling update?

Contributor


I think you should retrieve the revision in this way within a servingGroup protected by Partition. In fact, it’s not just the revision; you should create the pod using the old pod configuration.

@FAUST-BENCHOU
Contributor

@FAUST-BENCHOU Is it expected that the controllerRevision will change after it has been created?

Yes, when CreateControllerRevision is executed, if a CR with the same name already exists and the data is inconsistent with the current serialization result, it will use existing.Revision++ as a protective measure.

I checked, and it's indeed not standard practice. StatefulSet uses a hash to avoid updates:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/history/controller_history.go#L126-L128

However, it's a bit different from ours. The hash in the StatefulSet's ControllerRevision name is calculated based on the "data to be written to the object," so the name changes as the content changes, allowing a new revision to be created. But ours is

	cr := &appsv1.ControllerRevision{
		ObjectMeta: metav1.ObjectMeta{
			Name:      controllerRevisionName,
			Namespace: ms.Namespace,
			Labels: map[string]string{
				ControllerRevisionLabelKey:         ms.Name,
				ControllerRevisionRevisionLabelKey: revision,
			},
			OwnerReferences: []metav1.OwnerReference{
				newModelServingOwnerRef(ms),
			},
		},
		Revision: 1, // ControllerRevision revision number
		Data: runtime.RawExtension{
			Raw: data,
		},
	}

and we never hash the data (the name only encodes the revision number),
so if the data changes we need to update the whole ControllerRevision.
The reason it doesn't hash the data like StatefulSet does is that ModelServing's control loop mechanism is different from StatefulSet's; hashing the data may require a lot of changes.

Anyway, there are two potential solutions I can think of:
1. Change Update to Delete (no revision++ needed); the controller will then create a new ControllerRevision when the data differs for the same name.
2. Calculate the hash of the final data bytes to be written, then build the name as something like Name = msName + "-" + hash; previously it was msName + "-" + revision (https://github.com/volcano-sh/kthena/pull/671/changes#diff-0f6da250c0d3486a5b856156b44a6653a9b18939893c3e1c819b0719d9c25643). The hash changes whenever the content changes, ensuring a new name every time. Only Create is needed, eliminating the need for Update/Delete for "different data with the same name". To be honest, it may need a huge change, but it's more standard anyway.
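The hash-based naming in solution 2 could be sketched as below. The msName + "-" + hash scheme and the use of FNV are assumptions modeled on how StatefulSet's controller history derives revision names; the real implementation would hash the serialized ControllerRevision data.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashRevisionName derives the ControllerRevision name from a hash of
// the serialized data, so any content change yields a new name and only
// Create is ever needed (no Update/Delete on "same name, different data").
func hashRevisionName(msName string, data []byte) string {
	h := fnv.New32a()
	h.Write(data) // hash.Hash32's Write never returns an error
	return fmt.Sprintf("%s-%x", msName, h.Sum32())
}

func main() {
	a := hashRevisionName("my-modelserving", []byte(`{"image":"v1"}`))
	b := hashRevisionName("my-modelserving", []byte(`{"image":"v2"}`))
	// Different content produces a different name.
	fmt.Println(a != b)
}
```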

@LiZhenCheng9527 @katara-Jayprakash what's your take?

effectiveRevision := newRevision
if revision, ok := c.store.GetServingGroupRevision(utils.GetNamespaceName(ms), groupName); ok && revision != "" {
effectiveRevision = revision
}
Contributor


i think rolling updates should rely on manageServingGroupRollingUpdate + current/update revision + partition in status; old revisions in the store are not the authoritative source for rolling updates.

Contributor


If the source is currentRevision or updateRevision, then once the partition has been set, currentRevision remains unchanged.

And using only the revision information from ModelServing.Status, it is also not possible to determine which serving groups need to be upgraded.

@hzxuzhonghu
Member

I did a first pass on this diff. The two fixes line up with the regressions described:

  • recreated roles use the ServingGroup's effective revision instead of always using newRevision
  • existing ControllerRevision snapshots are treated as immutable

I did not spot a blocking issue in the current change set. The main thing I would still want to confirm is the e2e coverage around rollback/recovery paths and partitioned recreation flows.

@katara-Jayprakash
Member Author

katara-Jayprakash commented Mar 30, 2026

I did a first pass on this diff. The two fixes line up with the regressions described:

  • recreated roles use the ServingGroup's effective revision instead of always using newRevision
  • existing ControllerRevision snapshots are treated as immutable

I did not spot a blocking issue in the current change set. The main thing I would still want to confirm is the e2e coverage around rollback/recovery paths and partitioned recreation flows.

Thanks, and yes, this came directly from the new e2e in #837. Could you review that PR too?

// scaleUpRoles handles Role scaling up.
// It creates new Roles with increasing indices starting from the current max index + 1.
func (c *ModelServingController) scaleUpRoles(ctx context.Context, ms *workloadv1alpha1.ModelServing, groupName string, targetRole workloadv1alpha1.Role, roleList []datastore.Role, expectedCount int, servingGroupOrdinal int, newRevision string) {
func (c *ModelServingController) scaleUpRoles(ctx context.Context, ms *workloadv1alpha1.ModelServing, groupName string, targetRole workloadv1alpha1.Role, roleList []datastore.Role, expectedCount int, servingGroupOrdinal int, effectiveRevision string) {
Member


s/effectiveRevision/revision
effectiveRevision seems redundant

},
}

groupName := utils.GenerateServingGroupName(ms.Name, 0)
Member


Could you please construct a real case? Say test-manage-role-group-rev has two replicas and partition = 1, so we can update group 1 with the new revision while group 0 still uses the old revision.
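The partition semantics such a test would pin down can be sketched as follows. The ordinal-based rule mirrors StatefulSet partitions and is an assumption about the intended ModelServing behavior, not the project's actual test code.

```go
package main

import "fmt"

// expectedRevision captures the partition rule: with partition = p,
// serving groups with ordinal >= p are updated to the new revision,
// while groups with ordinal < p stay on the old one.
func expectedRevision(ordinal, partition int, oldRev, newRev string) string {
	if ordinal >= partition {
		return newRev
	}
	return oldRev
}

func main() {
	// Two replicas, partition = 1: group 1 updates, group 0 keeps old.
	for ordinal := 0; ordinal < 2; ordinal++ {
		fmt.Printf("group %d -> %s\n", ordinal, expectedRevision(ordinal, 1, "rev-old", "rev-new"))
	}
}
```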

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.



@katara-Jayprakash katara-Jayprakash force-pushed the fix/modelserving-partition-revision-regressions branch from 62ce6a1 to 61381d6 Compare April 3, 2026 19:33
Member

@hzxuzhonghu hzxuzhonghu left a comment


/lgtm

Looks like the e2e failure is not related.

@katara-Jayprakash
Member Author

@hzxuzhonghu Thanks for the review. Could you point out what's causing the e2e failure? I'd love to fix it!

@hzxuzhonghu
Member

Filed an issue #870

Copilot AI review requested due to automatic review settings April 8, 2026 09:28
@volcano-sh-bot
Contributor

New changes are detected. LGTM label has been removed.

@volcano-sh-bot
Contributor

Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository.
Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hzxuzhonghu. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.



@LiZhenCheng9527
Contributor

This issue has been fixed by #860 .
@katara-Jayprakash Could you please check whether this PR should be closed?



Development

Successfully merging this pull request may close these issues.

bug: controller overwrites old revisions and ignores partition rules during scale-up

6 participants