Fix/modelserving partition revision regressions #840
katara-Jayprakash wants to merge 7 commits into volcano-sh:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request addresses two critical bugs related to model serving partition revision control. It enforces the immutability of ControllerRevision snapshots and ensures role replica management uses the correct effective revision.
Code Review
This pull request refactors the revision handling logic within the model serving controller. It introduces an effectiveRevision concept to ensure that when managing role replicas and recreating pods, the correct, potentially existing, revision is consistently applied. The CreateControllerRevision utility function was updated to enforce immutability for existing controller revisions; attempts to create a revision with the same key but different data will now log a warning and return the existing, unchanged revision. New tests were added to validate these behaviors, specifically confirming that recreated pods use the serving group's existing revision and that controller revisions remain immutable once created. There are two suggestions: one to remove redundant effectiveRevision calculation in the manageRole function, and another to elevate a klog.Warningf to klog.Errorf in CreateControllerRevision for better visibility of potential conflicts.
pkg/model-serving-controller/controller/model_serving_controller.go
Pull request overview
This PR fixes regressions in ModelServing partition revision control by enforcing immutability of ControllerRevision snapshots and ensuring role replica management uses the correct (effective) revision for existing ServingGroups.
Changes:
- Stop mutating apps/v1.ControllerRevision data/revision after creation; preserve the first stored payload for a given revision key.
- Ensure manageRoleReplicas uses the ServingGroup's effective revision (from the store / ServingGroup) when recreating/scaling roles, rather than always using the latest newRevision.
- Add targeted unit tests covering ControllerRevision immutability and ServingGroup revision usage during pod recreation.
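The immutability rule described in the first bullet can be illustrated with a minimal, self-contained sketch. The `ControllerRevision` struct and the in-memory `store` below are stand-ins for `apps/v1.ControllerRevision` and the API server; this is not kthena's actual `CreateControllerRevision` signature, just the behavior the PR enforces:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// ControllerRevision is a minimal stand-in for apps/v1.ControllerRevision,
// used here only to illustrate the immutability rule.
type ControllerRevision struct {
	Name string
	Data []byte
}

// store simulates the API server's view of existing revisions, keyed by name.
var store = map[string]*ControllerRevision{}

// CreateControllerRevision returns the existing revision unchanged when one
// with the same key already exists, even if the new payload differs; it only
// logs a warning on a data mismatch, mirroring the PR's klog.Warningf path.
func CreateControllerRevision(name string, data []byte) *ControllerRevision {
	if existing, ok := store[name]; ok {
		if sha256.Sum256(existing.Data) != sha256.Sum256(data) {
			fmt.Printf("warning: revision %q already exists with different data; keeping original\n", name)
		}
		return existing
	}
	rev := &ControllerRevision{Name: name, Data: data}
	store[name] = rev
	return rev
}

func main() {
	first := CreateControllerRevision("ms-rev-1", []byte("spec-v1"))
	second := CreateControllerRevision("ms-rev-1", []byte("spec-v2"))
	// The second call must not overwrite the first stored payload.
	fmt.Println(string(second.Data) == string(first.Data))
}
```

The key property is that a second call with the same key is a no-op on the stored data, which is what the new unit test asserts.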
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/model-serving-controller/utils/controller_revision.go | Makes ControllerRevision snapshots immutable by not updating existing objects. |
| pkg/model-serving-controller/utils/controller_revision_test.go | Adds a test asserting existing ControllerRevision payload is not overwritten. |
| pkg/model-serving-controller/controller/model_serving_controller.go | Uses effective (ServingGroup) revision when recreating/scaling role pods. |
| pkg/model-serving-controller/controller/model_serving_controller_test.go | Adds a test asserting recreated pods keep the ServingGroup revision. |
LiZhenCheng9527 left a comment:
@FAUST-BENCHOU Is it expected that the controllerRevision will change after it has been created?
func (c *ModelServingController) manageRoleReplicas(ctx context.Context, ms *workloadv1alpha1.ModelServing, groupName string, targetRole workloadv1alpha1.Role, servingGroupOrdinal int, newRevision string) {
    // TODO: add podGroup update after gang scheduler finished
    effectiveRevision := newRevision
    if revision, ok := c.store.GetServingGroupRevision(utils.GetNamespaceName(ms), groupName); ok && revision != "" {
The revisions in our store contain the old ModelServing revisions. If we handle it this way, wouldn’t that mean we cannot perform a rolling update?
I think you should only retrieve the revision this way within a ServingGroup protected by Partition. In fact, it's not just the revision: you should create the pod using the old pod configuration as well.
Yes, when I checked, it's indeed not standard practice. StatefulSet uses a hash to avoid updates, but it's a bit different from ours: the hash in a StatefulSet's ControllerRevision name is calculated from the data to be written to the object, so the name changes as the content changes, allowing a new revision to be created. Ours never hashes the data (only the revision name is used). One potential solution is to calculate the hash of the final data bytes to be written and concatenate it into the name, maybe Name = msName + "-" + hash; previously it was msName + "-" + revision (https://github.com/volcano-sh/kthena/pull/671/changes#diff-0f6da250c0d3486a5b856156b44a6653a9b18939893c3e1c819b0719d9c25643). The hash changes whenever the content changes, ensuring a new name every time, so only Create is needed, eliminating the Update/Delete handling for "different data with the same name". To be honest, it may need a huge change, but it is more standard. @LiZhenCheng9527 @katara-Jayprakash, what's your take?
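As a rough sketch of that hash-based naming idea (the helper name and the 10-character truncation are illustrative choices, not kthena's actual code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// revisionName derives a ControllerRevision name from the payload itself,
// StatefulSet-style: msName + "-" + hash(data). Any change to the data
// yields a new name, so a plain Create suffices and "different data with
// the same name" can never occur.
func revisionName(msName string, data []byte) string {
	sum := sha256.Sum256(data)
	return fmt.Sprintf("%s-%s", msName, hex.EncodeToString(sum[:])[:10])
}

func main() {
	v1 := revisionName("my-model", []byte("spec-v1"))
	v2 := revisionName("my-model", []byte("spec-v2"))
	// Different content produces a different name.
	fmt.Println(v1 != v2)
}
```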
effectiveRevision := newRevision
if revision, ok := c.store.GetServingGroupRevision(utils.GetNamespaceName(ms), groupName); ok && revision != "" {
    effectiveRevision = revision
}
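In isolation, the fallback logic quoted above amounts to something like the following, with a plain map standing in for the controller's datastore lookup (`GetServingGroupRevision`); names are stand-ins, not kthena's actual types:

```go
package main

import "fmt"

// store maps a serving-group key to its stored revision; a minimal stand-in
// for the controller's datastore.
type store map[string]string

// effectiveRevision prefers the ServingGroup's already-stored revision when
// one exists, so recreated pods keep their group's revision rather than
// always jumping to the latest newRevision.
func effectiveRevision(s store, group, newRevision string) string {
	if rev, ok := s[group]; ok && rev != "" {
		return rev
	}
	return newRevision
}

func main() {
	s := store{"ms-0": "rev-old"}
	fmt.Println(effectiveRevision(s, "ms-0", "rev-new")) // existing group keeps its revision
	fmt.Println(effectiveRevision(s, "ms-1", "rev-new")) // unknown group falls back to the new revision
}
```

The debate in this thread is about when this fallback is correct: it preserves partition-protected groups, but if the store holds stale revisions it could also block a rolling update.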
I think rolling updates should rely on manageServingGroupRollingUpdate plus the current/update revision and partition in status; old revisions in the store are not the authoritative source for rolling updates.
If the source is currentRevision or updateRevision, then once the partition has been set, currentRevision remains unchanged. And using only the revision information from ModelServing.Status, it is also not possible to determine which serving groups need to be upgraded.
I did a first pass on this diff. The two fixes line up with the regressions described.
I did not spot a blocking issue in the current change set. The main thing I would still want to confirm is the e2e coverage around rollback/recovery paths and partitioned recreation flows.
Thanks, and yes, this came directly from the new e2e in #837. Could you review that PR too?
// scaleUpRoles handles Role scaling up.
// It creates new Roles with increasing indices starting from the current max index + 1.
-func (c *ModelServingController) scaleUpRoles(ctx context.Context, ms *workloadv1alpha1.ModelServing, groupName string, targetRole workloadv1alpha1.Role, roleList []datastore.Role, expectedCount int, servingGroupOrdinal int, newRevision string) {
+func (c *ModelServingController) scaleUpRoles(ctx context.Context, ms *workloadv1alpha1.ModelServing, groupName string, targetRole workloadv1alpha1.Role, roleList []datastore.Role, expectedCount int, servingGroupOrdinal int, effectiveRevision string) {
s/effectiveRevision/revision
The name effectiveRevision seems redundant here.
    },
}

groupName := utils.GenerateServingGroupName(ms.Name, 0)
Could you please construct a real case? Say test-manage-role-group-rev has two replicas and partition = 1, so we can update group 1 with the new revision while group 0 still uses the old revision.
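For reference, the StatefulSet-style partition semantics that this test case would exercise can be sketched as follows; the function and names are hypothetical, not kthena's API:

```go
package main

import "fmt"

// revisionForOrdinal applies StatefulSet-style partition semantics: groups
// with ordinal >= partition roll forward to the update revision, while
// lower ordinals keep the current revision.
func revisionForOrdinal(ordinal, partition int, currentRev, updateRev string) string {
	if ordinal >= partition {
		return updateRev
	}
	return currentRev
}

func main() {
	// Two serving groups, partition = 1: group 1 updates, group 0 stays.
	for ordinal := 0; ordinal < 2; ordinal++ {
		fmt.Printf("group %d -> %s\n", ordinal, revisionForOrdinal(ordinal, 1, "rev-old", "rev-new"))
	}
}
```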
hzxuzhonghu left a comment:
/lgtm
Looks like the e2e failure is not related.
@hzxuzhonghu thanks for the review. Could you point out what's causing the e2e failure? I'd love to fix it!
Filed an issue: #870
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver for each of the affected files.
This issue has been fixed by #860.
What type of PR is this?

What this PR does / why we need it:
When I was running the e2e tests for Model serving partition revision control (#837), I found two bugs. First, ControllerRevision is being changed after it is created (it should never change once saved). Second, manageRoleReplicas always uses newRevision when scaling up, even when it should use effectiveRevision based on partition rules.
Which issue(s) this PR fixes:
Fixes #841
Special notes for your reviewer:
Does this PR introduce a user-facing change?: