test/e2e: fix flake in WaitForModelServingReady by vyagh · Pull Request #876 · volcano-sh/kthena

vyagh · 2026-04-09T13:36:53Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
Fixes a flake in TestModelRoutePrefillDecodeDisaggregation by improving the reliability of the WaitForModelServingReady e2e utility.

The flake was caused by two interacting issues in test/e2e/utils/utils.go:

Redundant timeout: a context.WithTimeout wrapper was placed around PollUntilContextTimeout, which already manages its own timeout. This creates a race in which the outer context can expire while a condition call is in flight.
Immediate abort on error: the condition function returned (false, err). In k8s.io/apimachinery, returning a non-nil error from a polling condition terminates the loop immediately. When the context expired during a client-go call, the resulting "context deadline exceeded" error ended the poll prematurely instead of letting it time out cleanly.

Which issue(s) this PR fixes:
Fixes #870

Special notes for your reviewer:

Analysis verified against the k8s.io/[email protected] source code.
Other polling utilities in test/e2e/router/context/context.go and test/e2e/utils/chat.go were inspected. They correctly return (false, nil) for transient errors.

Signed-off-by: Shubham Sharma <[email protected]>

volcano-sh-bot · 2026-04-09T13:37:05Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

test/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

volcano-sh-bot · 2026-04-09T13:37:05Z

Welcome @vyagh! It looks like this is your first PR to volcano-sh/kthena 🎉

Copilot

Pull request overview

Improves reliability of the WaitForModelServingReady e2e helper to reduce flakes in router/controller-manager end-to-end tests by ensuring polling behavior times out cleanly instead of aborting early on transient client errors.

Changes:

Removes the redundant context.WithTimeout wrapper around wait.PollUntilContextTimeout.
Treats Get() errors as retryable by returning (false, nil) from the polling condition.


Copilot · 2026-04-09T13:39:48Z

test/e2e/utils/utils.go

 func WaitForModelServingReady(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, namespace, name string) {
 	t.Log("Waiting for ModelServing to be ready...")
-	timeoutCtx, cancel := context.WithTimeout(ctx, 5*time.Minute)
-	defer cancel()
-	err := wait.PollUntilContextTimeout(timeoutCtx, 5*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
+	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
 		ms, err := kthenaClient.WorkloadV1alpha1().ModelServings(namespace).Get(ctx, name, metav1.GetOptions{})


Consider calling t.Helper() at the start of this helper function so assertion failures (e.g., require.NoError below) are reported at the caller site instead of inside utils.go, which makes debugging failing e2e tests easier.

gemini-code-assist

Code Review

This pull request updates the WaitForModelServingReady function in the E2E test utilities by removing a redundant local context timeout and using the provided context directly. It also modifies the polling logic to retry on errors when retrieving the ModelServing resource instead of failing immediately. I have no feedback to provide.

FAUST-BENCHOU

lgtm

test/e2e: fix flake in WaitForModelServingReady

d124a41

Signed-off-by: Shubham Sharma <[email protected]>

Copilot AI review requested due to automatic review settings April 9, 2026 13:36

volcano-sh-bot added the kind/bug label Apr 9, 2026

volcano-sh-bot requested review from YaoZengzeng and hzxuzhonghu April 9, 2026 13:37

volcano-sh-bot added the size/XS label Apr 9, 2026

Copilot started reviewing on behalf of vyagh April 9, 2026 13:37

Copilot AI reviewed Apr 9, 2026


gemini-code-assist bot reviewed Apr 9, 2026


FAUST-BENCHOU reviewed Apr 9, 2026


vyagh wants to merge 1 commit into volcano-sh:main from vyagh:feat/fix-e2e-flake-870
