test/e2e: fix flake in WaitForModelServingReady #876
vyagh wants to merge 1 commit into volcano-sh:main
Signed-off-by: Shubham Sharma <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED
Welcome @vyagh! It looks like this is your first PR to volcano-sh/kthena 🎉
Pull request overview
Improves reliability of the WaitForModelServingReady e2e helper to reduce flakes in router/controller-manager end-to-end tests by ensuring polling behavior times out cleanly instead of aborting early on transient client errors.
Changes:
- Removes the redundant context.WithTimeout wrapper around wait.PollUntilContextTimeout.
- Treats Get() errors as retryable by returning (false, nil) from the polling condition.
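The effect of the second change can be illustrated with a minimal, stdlib-only sketch. Everything named here (pollUntilTimeout, newFlakyGet, runStrict, runTolerant) is hypothetical and only imitates the condition contract of wait.PollUntilContextTimeout; it is not the apimachinery implementation or the code in this PR. It assumes a Get() that fails twice with a transient error before succeeding.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// conditionFunc has the same shape as the polling condition in the diff:
// (true, nil) stops with success, (false, nil) retries, a non-nil error aborts.
type conditionFunc func(ctx context.Context) (bool, error)

// pollUntilTimeout is a hypothetical stand-in for a poll-with-timeout helper
// that checks the condition immediately and then once per interval.
func pollUntilTimeout(ctx context.Context, interval, timeout time.Duration, cond conditionFunc) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := cond(ctx)
		if err != nil {
			return err // any error from the condition ends polling at once
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

var errTransient = errors.New("transient API error")

// newFlakyGet simulates a Get() that fails for the first `failures` calls.
func newFlakyGet(failures int) func() (bool, error) {
	calls := 0
	return func() (bool, error) {
		calls++
		if calls <= failures {
			return false, errTransient
		}
		return true, nil
	}
}

// runStrict propagates Get errors, so the first transient failure aborts polling.
func runStrict() error {
	get := newFlakyGet(2)
	return pollUntilTimeout(context.Background(), 5*time.Millisecond, time.Second,
		func(ctx context.Context) (bool, error) { return get() })
}

// runTolerant treats Get errors as "not ready yet" and keeps polling.
func runTolerant() error {
	get := newFlakyGet(2)
	return pollUntilTimeout(context.Background(), 5*time.Millisecond, time.Second,
		func(ctx context.Context) (bool, error) {
			ready, err := get()
			if err != nil {
				return false, nil // retry instead of aborting
			}
			return ready, nil
		})
}

func main() {
	fmt.Println("strict:", runStrict())
	fmt.Println("tolerant:", runTolerant())
}
```

In this sketch the strict variant gives up on the first failed call, while the tolerant variant only fails if the resource is still not ready when the timeout expires. The trade-off is that a persistent error surfaces as a timeout rather than as the underlying error.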
  func WaitForModelServingReady(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, namespace, name string) {
      t.Log("Waiting for ModelServing to be ready...")
-     timeoutCtx, cancel := context.WithTimeout(ctx, 5*time.Minute)
-     defer cancel()
-     err := wait.PollUntilContextTimeout(timeoutCtx, 5*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
+     err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
          ms, err := kthenaClient.WorkloadV1alpha1().ModelServings(namespace).Get(ctx, name, metav1.GetOptions{})
Consider calling t.Helper() at the start of this helper function so assertion failures (e.g., require.NoError below) are reported at the caller site instead of inside utils.go, which makes debugging failing e2e tests easier.
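A minimal sketch of that suggestion, assuming a hypothetical assertion helper named requireReady. The tb interface and the recorder fake are illustration-only stand-ins so the example runs outside `go test`; in a real test the helper would receive a *testing.T, which satisfies the same two methods.

```go
package main

import "fmt"

// tb is the small subset of *testing.T that the helper needs.
// *testing.T satisfies this interface.
type tb interface {
	Helper()
	Fatalf(format string, args ...any)
}

// requireReady is a hypothetical assertion helper. Calling t.Helper() first
// tells the testing package to attribute failures to the caller's line.
func requireReady(t tb, ready bool) {
	t.Helper()
	if !ready {
		t.Fatalf("ModelServing not ready")
	}
}

// recorder is a fake tb that records calls. Unlike the real Fatalf,
// it does not stop the calling goroutine.
type recorder struct{ calls []string }

func (r *recorder) Helper() { r.calls = append(r.calls, "Helper") }

func (r *recorder) Fatalf(format string, args ...any) {
	r.calls = append(r.calls, "Fatalf: "+fmt.Sprintf(format, args...))
}

// recordedCalls runs the helper against the fake and returns what it called.
func recordedCalls(ready bool) []string {
	r := &recorder{}
	requireReady(r, ready)
	return r.calls
}

func main() {
	fmt.Println(recordedCalls(false))
	fmt.Println(recordedCalls(true))
}
```

With a real *testing.T, the failure message would then carry the file and line of the test that called the helper rather than a line inside the helper itself.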
Code Review
This pull request updates the WaitForModelServingReady function in the E2E test utilities by removing a redundant local context timeout and using the provided context directly. It also modifies the polling logic to retry on errors when retrieving the ModelServing resource instead of failing immediately. I have no feedback to provide.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fixes a flake in TestModelRoutePrefillDecodeDisaggregation by improving the reliability of the WaitForModelServingReady e2e utility.
The flake was caused by two interacting issues in test/e2e/utils/utils.go:
Which issue(s) this PR fixes:
Fixes #870
Special notes for your reviewer: