
Fix/modelserving partition revision regressions #840

Open
katara-Jayprakash wants to merge 7 commits into volcano-sh:main from
katara-Jayprakash:fix/modelserving-partition-revision-regressions

Conversation

@katara-Jayprakash
Member

@katara-Jayprakash katara-Jayprakash commented Mar 26, 2026

What type of PR is this?
While running the e2e tests for Model serving partition revision control (#837), I found two bugs. First, a ControllerRevision is modified after it is created, although it should be immutable once saved. Second, manageRoleReplicas always uses newRevision when scaling up, even when it should use the effectiveRevision determined by the partition rules.

What this PR does / why we need it:
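The fix for the second bug boils down to a simple selection rule, sketched below. The function name and signature are hypothetical; the real controller looks the stored revision up via c.store.GetServingGroupRevision.

```go
package main

import "fmt"

// effectiveRevisionFor sketches the selection rule this PR applies in
// manageRoleReplicas: prefer the ServingGroup's stored revision (when
// present) over the latest newRevision, so partition-protected groups
// keep running their old revision when roles are recreated or scaled.
func effectiveRevisionFor(newRevision, storedRevision string, found bool) string {
	if found && storedRevision != "" {
		return storedRevision
	}
	return newRevision
}

func main() {
	// Existing group protected by the partition: keep its old revision.
	fmt.Println(effectiveRevisionFor("rev-2", "rev-1", true)) // rev-1
	// New group with no stored revision: use the latest revision.
	fmt.Println(effectiveRevisionFor("rev-2", "", false)) // rev-2
}
```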

Which issue(s) this PR fixes:
Fixes #841

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


Copilot AI review requested due to automatic review settings March 26, 2026 09:33
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses two critical bugs related to model serving partition revision control. It enforces the immutability of ControllerRevision objects to prevent their accidental alteration after creation and rectifies the manageRoleReplicas function to correctly apply the effectiveRevision based on partition rules during scaling and pod recreation. These fixes ensure more predictable and reliable behavior in model serving deployments.

Highlights

  • ControllerRevision Immutability: Ensured that ControllerRevision objects, once created, are immutable. This prevents unintended modifications to historical revision data, maintaining the integrity of deployment history.
  • Correct Revision Usage in Role Management: Corrected the logic within the manageRoleReplicas function to properly determine and utilize the effectiveRevision for scaling operations. This ensures that partition rules are respected when scaling up or recreating pods.
  • Enhanced Test Coverage: Added new unit tests to validate both the immutability of ControllerRevision objects and the correct application of revisions during the management of role replicas.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the revision handling logic within the model serving controller. It introduces an effectiveRevision concept to ensure that when managing role replicas and recreating pods, the correct, potentially existing, revision is consistently applied. The CreateControllerRevision utility function was updated to enforce immutability for existing controller revisions; attempts to create a revision with the same key but different data will now log a warning and return the existing, unchanged revision. New tests were added to validate these behaviors, specifically confirming that recreated pods use the serving group's existing revision and that controller revisions remain immutable once created. There are two suggestions: one to remove redundant effectiveRevision calculation in the manageRole function, and another to elevate a klog.Warningf to klog.Errorf in CreateControllerRevision for better visibility of potential conflicts.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes regressions in ModelServing partition revision control by enforcing immutability of ControllerRevision snapshots and ensuring role replica management uses the correct (effective) revision for existing ServingGroups.

Changes:

  • Stop mutating apps/v1.ControllerRevision data/revision after creation; preserve the first stored payload for a given revision key.
  • Ensure manageRoleReplicas uses the ServingGroup’s effective revision (from store / ServingGroup) when recreating/scaling roles, rather than always using the latest newRevision.
  • Add targeted unit tests covering ControllerRevision immutability and ServingGroup revision usage during pod recreation.
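The immutability guard described above could look roughly like this. This is a simplified sketch: the stand-in types and the in-memory store are assumptions, while the real code operates on apps/v1 ControllerRevision objects through the Kubernetes API.

```go
package main

import (
	"bytes"
	"fmt"
)

// revision is a simplified stand-in for an apps/v1 ControllerRevision.
type revision struct {
	Name string
	Data []byte
}

// store stands in for the API server's view of existing revisions.
var store = map[string]*revision{}

// createControllerRevision mimics the immutable-create semantics this PR
// enforces: if a revision with the same name already exists, return it
// unchanged, even when the new data differs (log a warning instead of
// updating the stored payload).
func createControllerRevision(name string, data []byte) *revision {
	if existing, ok := store[name]; ok {
		if !bytes.Equal(existing.Data, data) {
			fmt.Printf("warning: revision %s exists with different data; keeping existing\n", name)
		}
		return existing
	}
	r := &revision{Name: name, Data: data}
	store[name] = r
	return r
}

func main() {
	first := createControllerRevision("ms-rev-1", []byte(`{"replicas":1}`))
	second := createControllerRevision("ms-rev-1", []byte(`{"replicas":2}`))
	// The second call returns the original, unmodified snapshot.
	fmt.Println(first == second, string(second.Data))
}
```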

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
pkg/model-serving-controller/utils/controller_revision.go Makes ControllerRevision snapshots immutable by not updating existing objects.
pkg/model-serving-controller/utils/controller_revision_test.go Adds a test asserting existing ControllerRevision payload is not overwritten.
pkg/model-serving-controller/controller/model_serving_controller.go Uses effective (ServingGroup) revision when recreating/scaling role pods.
pkg/model-serving-controller/controller/model_serving_controller_test.go Adds a test asserting recreated pods keep the ServingGroup revision.


Copilot AI review requested due to automatic review settings March 26, 2026 09:44
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.



Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.



Contributor

@LiZhenCheng9527 LiZhenCheng9527 left a comment


@FAUST-BENCHOU Is it expected that the controllerRevision will change after it has been created?

func (c *ModelServingController) manageRoleReplicas(ctx context.Context, ms *workloadv1alpha1.ModelServing, groupName string, targetRole workloadv1alpha1.Role, servingGroupOrdinal int, newRevision string) {
// TODO: add podGroup update after gang scheduler finished
effectiveRevision := newRevision
if revision, ok := c.store.GetServingGroupRevision(utils.GetNamespaceName(ms), groupName); ok && revision != "" {
Contributor


The revisions in our store contain the old ModelServing revisions. If we handle it this way, wouldn’t that mean we cannot perform a rolling update?

Contributor


I think you should retrieve the revision in this way within a servingGroup protected by Partition. In fact, it’s not just the revision; you should create the pod using the old pod configuration.

@FAUST-BENCHOU
Contributor

@FAUST-BENCHOU Is it expected that the controllerRevision will change after it has been created?

Yes, when CreateControllerRevision is executed, if a CR with the same name already exists and the data is inconsistent with the current serialization result, it will use existing.Revision++ as a protective measure.

I checked, and it's indeed not standard practice. StatefulSet uses a hash to avoid updates:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/history/controller_history.go#L126-L128

However, it's a bit different from ours. The hash in the StatefulSet's ControllerRevision name is calculated based on the "data to be written to the object," so the name changes as the content changes, allowing a new revision to be created. But ours is

	cr := &appsv1.ControllerRevision{
		ObjectMeta: metav1.ObjectMeta{
			Name:      controllerRevisionName,
			Namespace: ms.Namespace,
			Labels: map[string]string{
				ControllerRevisionLabelKey:         ms.Name,
				ControllerRevisionRevisionLabelKey: revision,
			},
			OwnerReferences: []metav1.OwnerReference{
				newModelServingOwnerRef(ms),
			},
		},
		Revision: 1, // ControllerRevision revision number
		Data: runtime.RawExtension{
			Raw: data,
		},
	}

and we never hash the data (the name only encodes the revision number),
so if the data changes we need to update the whole ControllerRevision.
The reason it doesn't hash the data like StatefulSet does is that ModelServing's control loop mechanism is different from StatefulSet's; hashing the data may require a lot of changes.

Anyway, there are two potential solutions I can think of:
1. Change Update to Delete (no revision++ needed); the controller will then create a new ControllerRevision when the data differs for the same name.
2. Calculate the hash of the final data bytes to be written, then build the name as something like Name = msName + "-" + hash; previously it was msName + "-" + revision (https://github.com/volcano-sh/kthena/pull/671/changes#diff-0f6da250c0d3486a5b856156b44a6653a9b18939893c3e1c819b0719d9c25643). The hash changes whenever the content changes, ensuring a new name every time. Only Create is needed, eliminating the need for Update/Delete for "different data with the same name". To be honest, it may need a huge change, but it's more standard anyway.
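The hash-based naming in solution 2 could be sketched as below. The msName + "-" + hash scheme and the use of FNV are assumptions modeled on how StatefulSet's controller history derives revision names; the real implementation would hash the serialized ControllerRevision data.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashRevisionName derives the ControllerRevision name from a hash of
// the serialized data, so any content change yields a new name and only
// Create is ever needed (no Update/Delete on "same name, different data").
func hashRevisionName(msName string, data []byte) string {
	h := fnv.New32a()
	h.Write(data) // hash.Hash32's Write never returns an error
	return fmt.Sprintf("%s-%x", msName, h.Sum32())
}

func main() {
	a := hashRevisionName("my-modelserving", []byte(`{"image":"v1"}`))
	b := hashRevisionName("my-modelserving", []byte(`{"image":"v2"}`))
	// Different content produces a different name.
	fmt.Println(a != b)
}
```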

@LiZhenCheng9527 @katara-Jayprakash what's your take?

effectiveRevision := newRevision
if revision, ok := c.store.GetServingGroupRevision(utils.GetNamespaceName(ms), groupName); ok && revision != "" {
effectiveRevision = revision
}
Contributor


i think rolling updates should rely on manageServingGroupRollingUpdate + current/update revision + partition in status; old revisions in the store are not the authoritative source for rolling updates.

Contributor


If the source is currentRevision or updateRevision, then once the partition has been set, currentRevision remains unchanged.

And using only the revision information from ModelServing.Status, it is also not possible to determine which serving groups need to be upgraded.

@hzxuzhonghu
Member

I did a first pass on this diff. The two fixes line up with the regressions described:

  • recreated roles use the ServingGroup's effective revision instead of always using newRevision
  • existing ControllerRevision snapshots are treated as immutable

I did not spot a blocking issue in the current change set. The main thing I would still want to confirm is the e2e coverage around rollback/recovery paths and partitioned recreation flows.

@katara-Jayprakash
Member Author

katara-Jayprakash commented Mar 30, 2026

I did a first pass on this diff. The two fixes line up with the regressions described:

  • recreated roles use the ServingGroup's effective revision instead of always using newRevision
  • existing ControllerRevision snapshots are treated as immutable

I did not spot a blocking issue in the current change set. The main thing I would still want to confirm is the e2e coverage around rollback/recovery paths and partitioned recreation flows.

Thanks, and yes, this came directly from the new e2e in #837. Could you review that PR too?

// scaleUpRoles handles Role scaling up.
// It creates new Roles with increasing indices starting from the current max index + 1.
func (c *ModelServingController) scaleUpRoles(ctx context.Context, ms *workloadv1alpha1.ModelServing, groupName string, targetRole workloadv1alpha1.Role, roleList []datastore.Role, expectedCount int, servingGroupOrdinal int, newRevision string) {
func (c *ModelServingController) scaleUpRoles(ctx context.Context, ms *workloadv1alpha1.ModelServing, groupName string, targetRole workloadv1alpha1.Role, roleList []datastore.Role, expectedCount int, servingGroupOrdinal int, effectiveRevision string) {
Member


s/effectiveRevision/revision
effectiveRevision seems redundant

},
}

groupName := utils.GenerateServingGroupName(ms.Name, 0)
Member


Could you please construct a real case? Say test-manage-role-group-rev has two replicas and partition = 1, so we can update group 1 with the new revision while group 0 still uses the old revision.
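The partition semantics such a test would pin down can be sketched as follows. The ordinal-based rule mirrors StatefulSet partitions and is an assumption about the intended ModelServing behavior, not the project's actual test code.

```go
package main

import "fmt"

// expectedRevision captures the partition rule: with partition = p,
// serving groups with ordinal >= p are updated to the new revision,
// while groups with ordinal < p stay on the old one.
func expectedRevision(ordinal, partition int, oldRev, newRev string) string {
	if ordinal >= partition {
		return newRev
	}
	return oldRev
}

func main() {
	// Two replicas, partition = 1: group 1 updates, group 0 keeps old.
	for ordinal := 0; ordinal < 2; ordinal++ {
		fmt.Printf("group %d -> %s\n", ordinal, expectedRevision(ordinal, 1, "rev-old", "rev-new"))
	}
}
```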

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.



@katara-Jayprakash katara-Jayprakash force-pushed the fix/modelserving-partition-revision-regressions branch from 62ce6a1 to 61381d6 Compare April 3, 2026 19:33
Member

@hzxuzhonghu hzxuzhonghu left a comment


/lgtm

Looks like the e2e failure is not related.

@katara-Jayprakash
Member Author

@hzxuzhonghu Thanks for the review. Could you point out what's causing the e2e failure? I'd love to fix it!

@hzxuzhonghu
Member

Filed an issue #870

Copilot AI review requested due to automatic review settings April 8, 2026 09:28
@volcano-sh-bot
Contributor

New changes are detected. LGTM label has been removed.

@volcano-sh-bot
Contributor

Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository.
Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hzxuzhonghu. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.



@LiZhenCheng9527
Contributor

This issue has been fixed by #860 .
@katara-Jayprakash Could you please check whether this PR should be closed?



Development

Successfully merging this pull request may close these issues.

bug: controller overwrites old revisions and ignores partition rules during scale-up

6 participants