Webhook Race Condition with TLS handshake error: tls: bad certificate #2560
Comments
Typically, Knative webhooks go through the following steps on startup
For this flow, the webhooks need to be constructed with the certificate controller in addition to the defaulting/validating admission controllers. If you're getting
The typical misconfiguration we see is a liveness probe timeout on the webhook deployment that is too low: K8s kills the container before it ever gets a chance to become the leader and create the certificate (e.g. vmware-tanzu/sources-for-knative#356). It's interesting to see
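To make the liveness probe point concrete, here is a rough Go sketch of the timing budget. It assumes the probe keeps failing until startup has finished, and it approximates the kubelet's behaviour as "first probe after the initial delay, restart on the Nth consecutive failure". The type, the function names, and the 30 second startup figure are hypothetical and not taken from Knative or Karpenter; only the probe field names come from Kubernetes.

```go
package main

import "fmt"

// livenessProbe mirrors the Kubernetes probe fields that matter here (seconds).
type livenessProbe struct {
	InitialDelaySeconds int
	PeriodSeconds       int
	FailureThreshold    int
}

// restartAfterSeconds approximates how long a container whose probe never
// succeeds survives: the first probe runs after the initial delay, and the
// container is restarted on the FailureThreshold-th consecutive failure.
// Probe timeouts and kubelet jitter are ignored in this approximation.
func (p livenessProbe) restartAfterSeconds() int {
	return p.InitialDelaySeconds + (p.FailureThreshold-1)*p.PeriodSeconds
}

// survivesStartup reports whether the webhook has enough time to finish
// startup (hypothetical startupSeconds: win leader election and create the
// certificate) before it would be restarted.
func survivesStartup(p livenessProbe, startupSeconds int) bool {
	return startupSeconds < p.restartAfterSeconds()
}

func main() {
	tight := livenessProbe{InitialDelaySeconds: 5, PeriodSeconds: 5, FailureThreshold: 3}
	relaxed := livenessProbe{InitialDelaySeconds: 20, PeriodSeconds: 10, FailureThreshold: 6}
	const startupSeconds = 30 // assumed for illustration, not measured

	fmt.Println("tight probe survives startup:", survivesStartup(tight, startupSeconds))
	fmt.Println("relaxed probe survives startup:", survivesStartup(relaxed, startupSeconds))
}
```

If the budget comes out smaller than the time the webhook needs to become leader, the container restarts in a loop and the certificate is never created, which matches the misconfiguration described above.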
This issue is stale because it has been open for 90 days with no
/reopen
@dprotaso: Reopened this issue. In response to this:
@njtran just following up to see if this is still an issue. I haven't had time to dig into this, but this might be a good-first-issue
Hey @dprotaso, thanks for following up! I hadn't added more to this just because in our later releases we haven't seen this problem occur. I don't have a reproduction I can give you, but I know that I saw it not uncommonly in our earlier releases. Happy to dive into this with you if you want?
Did anything change in your later releases?
Yep. Here are the webhook definitions for what I believe had the issue:
And here are the webhook definitions now, where I haven't seen the issue in a while.
Sometimes the issue has been because of an old unreachable webhook left around due to Argo CR syncing. Maybe you see something different though?
We think it might be something to do with Argo syncing some old versions of webhooks. Have you heard anything like this?
@dprotaso Is the below error a related issue? Karpenter: v0.27.1
I'm pretty sure I'm seeing this too: Karpenter: v0.27.6 I'm seeing this while building out this cluster -- everything is "new" (no old versions of eks, karpenter, etc):
Hoping to get some insight on the following issue. Happy to hop on a call or slack huddle in the knative slack to give more info.
Expected Behavior
The webhook should work without requiring a non-deterministic number of container restarts.
Actual Behavior
We use defaulting and validating webhooks for Karpenter CRDs. When first installing Karpenter, we get the following error in the webhook container logs. Even after hitting this failure, the webhook container stays ready. The issue is sometimes resolved by restarting the container a non-deterministic number of times.
The webhook is further shown to be broken when it blocks creation of the CRD because the certificate is signed by an unknown authority.
Here’s where this webhook was created in code; we started to see this issue after that change.
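For readers unfamiliar with this failure mode, the following is a minimal standalone Go sketch of it. This is not Knative or Karpenter code. It assumes the root cause is a mismatch between the CA the caller trusts (the role played by the webhook configuration's caBundle) and the CA that signed the certificate the webhook actually serves. All names (newCA, newServingCert, callWebhook, demo, "current-ca", "stale-ca") are made up for the illustration.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

func must(err error) {
	if err != nil {
		panic(err)
	}
}

// newCA returns a self-signed CA certificate and its private key.
func newCA(name string) (*x509.Certificate, *ecdsa.PrivateKey) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	must(err)
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: name},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	must(err)
	cert, err := x509.ParseCertificate(der)
	must(err)
	return cert, key
}

// newServingCert issues a loopback serving certificate signed by the given CA.
func newServingCert(ca *x509.Certificate, caKey *ecdsa.PrivateKey) tls.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	must(err)
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "webhook"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		IPAddresses:  []net.IP{net.ParseIP("127.0.0.1"), net.IPv6loopback},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	must(err)
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}
}

// callWebhook makes an HTTPS request while trusting only caBundle.
func callWebhook(url string, caBundle *x509.Certificate) error {
	pool := x509.NewCertPool()
	pool.AddCert(caBundle)
	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{RootCAs: pool}}}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// demo returns the error seen by a caller that trusts the signing CA and the
// error seen by a caller that trusts a different, stale CA.
func demo() (matchingErr, staleErr error) {
	currentCA, currentKey := newCA("current-ca")
	staleCA, _ := newCA("stale-ca")

	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
	srv.TLS = &tls.Config{Certificates: []tls.Certificate{newServingCert(currentCA, currentKey)}}
	srv.StartTLS()
	defer srv.Close()

	return callWebhook(srv.URL, currentCA), callWebhook(srv.URL, staleCA)
}

func main() {
	matchingErr, staleErr := demo()
	fmt.Println("caller trusting the signing CA:", matchingErr)
	fmt.Println("caller trusting a stale CA:", staleErr)
}
```

In this sketch the second call fails certificate verification on the caller's side, and the server side typically logs a TLS handshake error ending in "remote error: tls: bad certificate", the same pair of symptoms reported here. Whether a stale caBundle, a certificate that was never created, or something else is the actual cause in this issue is not established by the sketch.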
Steps to Reproduce the Problem
Additional Info