Skip to content

test/e2e: fix flake in WaitForModelServingReady#876

Open
vyagh wants to merge 1 commit intovolcano-sh:mainfrom
vyagh:feat/fix-e2e-flake-870
Open

test/e2e: fix flake in WaitForModelServingReady#876
vyagh wants to merge 1 commit intovolcano-sh:mainfrom
vyagh:feat/fix-e2e-flake-870

Conversation

@vyagh
Copy link
Copy Markdown

@vyagh vyagh commented Apr 9, 2026

What type of PR is this?
/kind bug

What this PR does / why we need it:
Fixes a flake in TestModelRoutePrefillDecodeDisaggregation by improving the reliability of the WaitForModelServingReady e2e utility.

The flake was caused by two interacting issues in test/e2e/utils/utils.go:

  1. Redundant timeout: a context.WithTimeout wrapper was used around PollUntilContextTimeout , which already manages its own timeout. This creates a race condition where the context could expire during a condition call.
  2. immediate Abort on Error: condition function returned (false, err). In k8s.io/apimachinery, returning a non nil error from a polling condition immediately terminates the loop. when the context expired during a client go call, the resulting "context deadline exceeded" error killed the poll prematurely instead of allowing a clean timeout.

Which issue(s) this PR fixes:
Fixes #870

Special notes for your reviewer:

  • analysis verified against k8s.io/[email protected] source code
  • other polling utilities in test/e2e/router/context/context.go and test/e2e/utils/chat.go were inspected. they correctly use return false, nil for transient errors

Copilot AI review requested due to automatic review settings April 9, 2026 13:36
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot
Copy link
Copy Markdown
Contributor

Welcome @vyagh! It looks like this is your first PR to volcano-sh/kthena 🎉

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves reliability of the WaitForModelServingReady e2e helper to reduce flakes in router/controller-manager end-to-end tests by ensuring polling behavior times out cleanly instead of aborting early on transient client errors.

Changes:

  • Removes the redundant context.WithTimeout wrapper around wait.PollUntilContextTimeout.
  • Treats Get() errors as retryable by returning (false, nil) from the polling condition.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 32 to 35
func WaitForModelServingReady(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, namespace, name string) {
t.Log("Waiting for ModelServing to be ready...")
timeoutCtx, cancel := context.WithTimeout(ctx, 5*time.Minute)
defer cancel()
err := wait.PollUntilContextTimeout(timeoutCtx, 5*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
ms, err := kthenaClient.WorkloadV1alpha1().ModelServings(namespace).Get(ctx, name, metav1.GetOptions{})
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider calling t.Helper() at the start of this helper function so assertion failures (e.g., require.NoError below) are reported at the caller site instead of inside utils.go, which makes debugging failing e2e tests easier.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the WaitForModelServingReady function in the E2E test utilities by removing a redundant local context timeout and using the provided context directly. It also modifies the polling logic to retry on errors when retrieving the ModelServing resource instead of failing immediately. I have no feedback to provide.

Copy link
Copy Markdown
Contributor

@FAUST-BENCHOU FAUST-BENCHOU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TestModelRoutePrefillDecodeDisaggregation Flake

4 participants