Skip to content

fix wait for webhook to be ready before running webhook tests#816

Merged
volcano-sh-bot merged 4 commits intovolcano-sh:mainfrom
katara-Jayprakash:test-webhook
Mar 17, 2026
Merged

fix wait for webhook to be ready before running webhook tests#816
volcano-sh-bot merged 4 commits intovolcano-sh:mainfrom
katara-Jayprakash:test-webhook

Conversation

@katara-Jayprakash
Copy link
Copy Markdown
Member

@katara-Jayprakash katara-Jayprakash commented Mar 15, 2026

What type of PR is this?

This PR fixes webhook tests that were failing with "connection refused" error in webhook_test.go e2e tests.
Screenshot 2026-03-16 at 1 31 53 AM

What this PR does / why we need it:

we need this pr to make our e2e test more stable

Which issue(s) this PR fixes:
Fixes #815

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

 NONE

Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
Copilot AI review requested due to automatic review settings March 15, 2026 20:00
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a potential race condition in the E2E test suite by ensuring that the webhook server is fully initialized and ready to receive requests before any webhook-dependent tests are executed. This enhancement improves the reliability and stability of the E2E tests, reducing intermittent failures caused by timing issues.

Highlights

  • Webhook Test Reliability: Introduced a new utility function, WaitForWebhookReady, to ensure that the webhook server is fully operational before executing related E2E tests.
  • E2E Test Setup Enhancement: Integrated the WaitForWebhookReady function into the TestWebhook E2E test suite, preventing test failures due to race conditions where tests might run before the webhook is ready.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • test/e2e/controller-manager/test_suite_test.go
    • Added a new function WaitForWebhookReady that polls the Kubernetes API for the webhook endpoint's readiness, specifically checking for available addresses.
  • test/e2e/controller-manager/webhook_test.go
    • Modified the TestWebhook function to call WaitForWebhookReady after setting up the E2E test environment, ensuring the webhook is ready before proceeding with test cases.
Activity
  • No activity has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a waiting mechanism for the webhook to become ready before executing e2e tests, which is a great way to improve test stability. I've provided a couple of suggestions to enhance the implementation of the new wait function for better clarity and maintainability.

Comment on lines +133 to +148
err := wait.PollUntilContextTimeout(ctx, 2*time.Second, 1*time.Minute, true, func(ctx context.Context) (bool, error) {
ep, err := kubeClient.CoreV1().Endpoints(kthenaNamespace).Get(ctx, "kthena-controller-manager-webhook", metav1.GetOptions{})

// try again
if err != nil {
t.Logf("webhook endpoint not ready yet: %v", err)
return false, nil
}
for _, subset := range ep.Subsets {
if len(subset.Addresses) > 0 {
return true, nil // true = we are done! webhook is ready
}
}
t.Log("webhook has no ready addresses yet")
return false, nil
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since the context ctx already has a timeout, you can use wait.PollUntilContextCancel to avoid specifying a redundant timeout. This makes the intent clearer and is consistent with other parts of this file (e.g., line 96).

Suggested change
err := wait.PollUntilContextTimeout(ctx, 2*time.Second, 1*time.Minute, true, func(ctx context.Context) (bool, error) {
ep, err := kubeClient.CoreV1().Endpoints(kthenaNamespace).Get(ctx, "kthena-controller-manager-webhook", metav1.GetOptions{})
// try again
if err != nil {
t.Logf("webhook endpoint not ready yet: %v", err)
return false, nil
}
for _, subset := range ep.Subsets {
if len(subset.Addresses) > 0 {
return true, nil // true = we are done! webhook is ready
}
}
t.Log("webhook has no ready addresses yet")
return false, nil
})
err := wait.PollUntilContextCancel(ctx, 2*time.Second, true, func(ctx context.Context) (bool, error) {
ep, err := kubeClient.CoreV1().Endpoints(kthenaNamespace).Get(ctx, "kthena-controller-manager-webhook", metav1.GetOptions{})
// try again
if err != nil {
t.Logf("webhook endpoint not ready yet: %v", err)
return false, nil
}
for _, subset := range ep.Subsets {
if len(subset.Addresses) > 0 {
return true, nil // true = we are done! webhook is ready
}
}
t.Log("webhook has no ready addresses yet")
return false, nil
})

Comment on lines +149 to +150
// you are missing this line!
require.NoError(t, err, "webhook did not become ready in time increase the waiting period")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This comment appears to be a leftover from development and can be removed. Additionally, the error message can be made more direct to clearly state the reason for the test failure.

Suggested change
// you are missing this line!
require.NoError(t, err, "webhook did not become ready in time increase the waiting period")
require.NoError(t, err, "timed out waiting for webhook to become ready")

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the reliability of the controller-manager webhook e2e tests by waiting for the webhook endpoint to become ready before executing webhook validation/mutation test cases.

Changes:

  • Update TestWebhook to wait for the webhook to be ready before running test cases.
  • Add a WaitForWebhookReady helper that polls the webhook Endpoints object until it has ready addresses.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
test/e2e/controller-manager/webhook_test.go Calls the new readiness wait helper before running webhook test cases.
test/e2e/controller-manager/test_suite_test.go Adds WaitForWebhookReady helper that polls the webhook endpoint for readiness.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +130 to +131
ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
defer cancel()
Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
@katara-Jayprakash
Copy link
Copy Markdown
Member Author

katara-Jayprakash commented Mar 15, 2026

/cc @YaoZengzeng @LiZhenCheng9527 @hzxuzhonghu I would appreciate your review sir, when you have a moment.

Copy link
Copy Markdown
Contributor

@LiZhenCheng9527 LiZhenCheng9527 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current readiness check does not actually validate the condition that caused the flake.

waitForWebhookReady polls the Endpoints object for ready addresses, but that is effectively the same readiness signal that already let the test continue before. The issue description says the controller-manager pod can be Ready while the webhook server inside it is still not accepting connections, so this new check can still pass before the webhook listener is reachable.

Please switch the probe to an actual webhook request path, for example a dry-run create against one of the webhook-protected resources, and retry only on connect: connection refused until it succeeds.

@FAUST-BENCHOU
Copy link
Copy Markdown
Contributor

FYI: #806 (comment)
seems same problem

@katara-Jayprakash
Copy link
Copy Markdown
Member Author

The current readiness check does not actually validate the condition that caused the flake.

waitForWebhookReady polls the Endpoints object for ready addresses, but that is effectively the same readiness signal that already let the test continue before. The issue description says the controller-manager pod can be Ready while the webhook server inside it is still not accepting connections, so this new check can still pass before the webhook listener is reachable.

Please switch the probe to an actual webhook request path, for example a dry-run create against one of the webhook-protected resources, and retry only on connect: connection refused until it succeeds.

thanx for you response, i will going to update this pr till tonight

Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
Copilot AI review requested due to automatic review settings March 16, 2026 14:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Stabilizes controller-manager e2e webhook tests by adding a readiness wait before executing webhook assertions, addressing intermittent “connection refused” failures (Fixes #815).

Changes:

  • Add a pre-test webhook readiness probe to reduce flakiness in webhook_test.go.
  • Implement waitForWebhookReady helper that polls the API server until webhook requests succeed.
  • Add strings import to support connection error inspection.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
test/e2e/controller-manager/webhook_test.go Adds a readiness wait before running webhook test cases.
test/e2e/controller-manager/test_suite_test.go Implements polling helper to detect when the webhook server is accepting requests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +127 to +132
func waitForWebhookReady(t *testing.T, kthenaClient *clientset.Clientset, namespace string) {
t.Helper()
t.Log("Waiting for webhook server to accept requests")

ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
defer cancel()
Comment on lines +139 to +147
_, err := kthenaClient.WorkloadV1alpha1().ModelBoosters(namespace).Create(ctx, probe, metav1.CreateOptions{DryRun: []string{"All"}})
if err != nil {
errStr := err.Error()
if strings.Contains(errStr, "connect: connection refused") {
t.Logf("Webhook not ready yet (connection refused), retrying: %v", err)
return false, nil
}
return false, err
}
probe.Namespace = namespace
probe.Name = "webhook-ready-probe-" + utils.RandomString(5)

_, err := kthenaClient.WorkloadV1alpha1().ModelBoosters(namespace).Create(ctx, probe, metav1.CreateOptions{DryRun: []string{"All"}})
Signed-off-by: katara-Jayprakash <katarajayprakash@icloud.com>
@katara-Jayprakash
Copy link
Copy Markdown
Member Author

/cc @hzxuzhonghu ptal sir, i think its ready now!

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Overall, think this can fix the flake. I am thinking about a more grace way, check the kthena ready before running any test, maybe can do that in func TestMain(m *testing.M) { after insatlled kthena.

@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot merged commit 64b0668 into volcano-sh:main Mar 17, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TestWebhook subtests fail with connection refused error when run after TestModelServingControllerManagerRestart

6 participants