brupop API server TLS cert is untrusted #486

Open
jackgill opened this issue Jul 13, 2023 · 14 comments

@jackgill

As I mentioned in #478, the brupop API server on one of my EKS clusters apparently has an untrusted TLS cert:

> kubectl -n brupop-bottlerocket-aws get bottlerocketshadows
Error from server: conversion webhook for brupop.bottlerocket.aws/v1, Kind=BottlerocketShadow failed: Post "https://brupop-apiserver.brupop-bottlerocket-aws.svc:443/crdconvert?timeout=30s": x509: certificate signed by unknown authority

I installed brupop using the 1.1.0 manifest file and it is working fine on several other EKS clusters deployed using the same method.
Image I'm using:
1.1.0
Issue or Feature Request:
Looking at the PKI for brupop I see a self-signed issuer cert, but I'm not clear on how this cert is supposed to be trusted. Any advice on how to troubleshoot this issue would be appreciated.

@jackgill
Author

I found some additional information: when describing the API server pods, I noticed this event:
MountVolume.SetUp failed for volume "bottlerocket-tls-keys" : secret "brupop-apiserver-certificate" not found
I did, however, see the secret, so I suspect a race condition where the secret did not exist when the pod was first created, perhaps because cert-manager was running slowly. What's confusing is that I recreated all the pods, the event did not recur, and yet I still see the same error when trying to list bottlerocketshadows.

@gthao313
Member

@jackgill Can you share how you installed brupop? For example, did you install cert-manager first and then install brupop? I'm trying to reproduce this issue.

Usually we need cert-manager running on the EKS cluster first, and then we install brupop a few minutes later, once we have confirmed that cert-manager is running.
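
One way to make that ordering explicit in an install script is to poll until the cert-manager webhook reports an available replica before applying the brupop manifests. The following is a minimal Python sketch, not part of brupop: `wait_until`, `has_available_replicas`, and `webhook_available` are hypothetical helper names, and the `cert-manager` namespace and `cert-manager-webhook` deployment name are assumptions based on a default cert-manager install.

```python
import subprocess
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def has_available_replicas(jsonpath_output: str) -> bool:
    """kubectl prints an empty string when .status.availableReplicas is unset."""
    text = jsonpath_output.strip()
    return text.isdigit() and int(text) > 0


def webhook_available() -> bool:
    # Assumption: default namespace and deployment name for cert-manager.
    result = subprocess.run(
        [
            "kubectl", "-n", "cert-manager", "get", "deployment",
            "cert-manager-webhook",
            "-o", "jsonpath={.status.availableReplicas}",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode == 0 and has_available_replicas(result.stdout)

# Example: call wait_until(webhook_available) before `kubectl apply -f ...`.
```

An available replica does not guarantee that the webhook is already serving requests, so a check like this narrows the race rather than eliminating it.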

@gthao313
Member

@jackgill I just reproduced the same error, MountVolume.SetUp failed for volume "bottlerocket-tls-keys" : secret "brupop-apiserver-certificate" not found, by installing brupop immediately after installing cert-manager. I think the cert-manager endpoint "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s" was just not ready yet.

But this may not be the same as your situation.

@jackgill
Author

@gthao313 cert-manager was installed via helm long before brupop was deployed on the cluster. Usually we use terraform to apply the brupop manifests but I've noticed that it creates the resources out of order, which I thought might be causing this problem. So I tried installing brupop with kubectl apply -f but I still saw this issue.

@gthao313
Member

Thanks for the info. Can you check whether the secret brupop-apiserver-certificate exists on the cluster in the brupop-bottlerocket-aws namespace? If the secret isn't there, we can focus on something being wrong with the brupop deployment. Otherwise, it should be a TLS cert issue.
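
If the secret does exist, one way to narrow down a TLS trust problem is to compare the CA bundle that the CRD's conversion webhook trusts with the CA stored next to the server certificate. The sketch below is a hedged illustration rather than a brupop tool: `pem_fingerprints` and `webhook_trusts_ca` are hypothetical helpers, and it assumes the secret carries a `ca.crt` key, as cert-manager-issued secrets normally do.

```python
import base64
import hashlib
import re
import ssl
from typing import Set

# Inputs can be fetched with kubectl (both values come back base64-encoded):
#   kubectl get crd bottlerocketshadows.brupop.bottlerocket.aws \
#     -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}'
#   kubectl -n brupop-bottlerocket-aws get secret brupop-apiserver-certificate \
#     -o jsonpath='{.data.ca\.crt}'

PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def pem_fingerprints(b64_pem: str) -> Set[str]:
    """SHA-256 fingerprints of every certificate in a base64-encoded PEM bundle."""
    if not b64_pem.strip():
        return set()  # an empty caBundle means nothing was injected
    pem_text = base64.b64decode(b64_pem).decode("ascii")
    return {
        hashlib.sha256(ssl.PEM_cert_to_DER_cert(block)).hexdigest()
        for block in PEM_BLOCK.findall(pem_text)
    }


def webhook_trusts_ca(ca_bundle_b64: str, secret_ca_crt_b64: str) -> bool:
    """True if every CA certificate in the secret also appears in the caBundle."""
    trusted = pem_fingerprints(ca_bundle_b64)
    issuing = pem_fingerprints(secret_ca_crt_b64)
    return bool(issuing) and issuing <= trusted
```

A `False` result only shows that the two bundles differ or that the caBundle is empty; it does not say which side is wrong.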

@jackgill
Author

The secret is there currently. I think that it might not have been there when the pods were first created, so I tried recreating the pods now that the secret exists. However the same error remains.

@stefan-lipinski

We were facing the same problem today. In cert-manager logs:

cert-manager/certificate/customresourcedefinition/generic-inject-reconciler "msg"="unable to fetch associated certificate" "error"="Certificate.cert-manager.io "root-certificate\" not found" "certificate"={"Namespace":"brupop-bottlerocket-aws","Name":"root-certificate

In my view, line 15 of bottlerocket-update-operator.yaml

cert-manager.io/inject-ca-from: brupop-bottlerocket-aws/root-certificate

needs to be replaced by

cert-manager.io/inject-ca-from: brupop-bottlerocket-aws/brupop-selfsigned-ca

@jackgill
Author

jackgill commented Jan 16, 2024

Update: I am able to reproduce this on version 1.3.0 of brupop, installed via Helm. However, based on @stefan-lipinski's comment above, I deployed the CRD using a manifest file which I had edited to specify brupop-selfsigned-ca instead of root-certificate and this resolved the issue.

The fix may be as simple as updating the certificate name here:

pub const ROOT_CERTIFICATE_NAME: &str = "root-certificate";
However I'm not familiar enough with this codebase to determine if there are any other impacts to making that change.
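
The manifest edit described above, which swaps the certificate referenced by the `cert-manager.io/inject-ca-from` annotation, can also be scripted. This is a minimal Python sketch under the assumption that the annotation sits on its own line in the manifest; `set_inject_ca_from` is a hypothetical helper, and it performs a plain text substitution rather than parsing YAML.

```python
import re

ANNOTATION = "cert-manager.io/inject-ca-from"

# Capture the indentation and key so only the value after the colon changes.
_ANNOTATION_LINE = re.compile(
    r"^([ \t]*" + re.escape(ANNOTATION) + r":[ \t]*).*$", re.MULTILINE
)


def set_inject_ca_from(manifest: str, namespace: str, certificate: str) -> str:
    """Point every inject-ca-from annotation at namespace/certificate."""
    target = f"{namespace}/{certificate}"
    return _ANNOTATION_LINE.sub(lambda m: m.group(1) + target, manifest)
```

For example, `set_inject_ca_from(text, "brupop-bottlerocket-aws", "brupop-selfsigned-ca")` reproduces the edit that resolved the issue here; which certificate name is the right long-term value is a separate question for the maintainers.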

@v0lumehi

I just fixed the same bug. In my case I am using cert-manager and the solution was to set the annotation to the value "brupop-apiserver-certificate"

@cbgbt
Contributor

cbgbt commented Jan 2, 2025

We've been investigating this issue and the chain of trust appears correct as-implemented in Brupop today. We shouldn't need to refer to the "top-level" self-signed certificate authority for SSL to function. Even though the change proposed in #595 resolves the issue for folks who are seeing it, it doesn't seem like we actually want to merge that upstream -- there is likely some other issue that this works around.

We have a few theories about what that issue could be:

  • While reading the cert-manager docs, we found this documentation section which suggests to me that some TLS implementations may find our self-signed certificate (used by the root issuer) to be invalid.
  • Brupop 0.2.2 had an SSL issuer/ca configuration that didn't separate authorities from the server certificate in the chain of trust. Possibly this issue is seen by folks updating from a very old version of Brupop.

Can you provide us more details on your setup? Specifically:

  • What version of Kubernetes are you using? Is it EKS?
  • What version of cert-manager do you have installed in your cluster?
  • Did you see this issue after upgrading Brupop, or is this on a clean install?

@v0lumehi since you've chimed in recently, could you possibly provide a bit more information?

@cbgbt
Contributor

cbgbt commented Jan 3, 2025

Ah, I think the change mentioned in this comment by @v0lumehi is actually more in-line with how we wish to use the chain of trust:

Since it might help you: I just fixed this bug and set the value to cert-manager.io/inject-ca-from: brupop-bottlerocket-aws/brupop-apiserver-certificate in the CRD

@v0lumehi

v0lumehi commented Jan 13, 2025

Hi, if you set the annotation to the value mentioned in my comment, the problem is solved.
I have a setup in EKS and could see validation errors in the kube API logs in CloudWatch. If you set the value of the annotation as described, these errors disappear.

@v0lumehi

Unfortunately I can't remember the exact error message, but I think it was a "connection closed" or something like that when trying to call the brupop-api-agent

@stefan-lipinski

stefan-lipinski commented Jan 17, 2025

@cbgbt

We started using Bottlerocket on EKS in the second half of 2023. The initial config was

locals {
  eks_cluster_version              = "1.24"
  eks_vpc_cni_version              = "v1.11.4-eksbuild.1"
  eks_coredns_version              = "v1.8.7-eksbuild.3"
  eks_kube_proxy_version           = "v1.24.7-eksbuild.2"
  eks_node_group_version           = "1.24"
  cluster_autoscaler_version       = "v1.24.1"
  cert-manager                     = "v1.8.2"
  ingress_image_repository         = "registry.k8s.io/ingress-nginx/controller"
  ingress_image_tag                = "v1.8.2"
  load_balancer_image_tag          = "v2.4.3"
  load_balancer_helm_chart_version = "1.4.4"
  cloudwatch_agent_version         = "amazon/cloudwatch-agent:1.300026.3b189"
  fluent-bit_agent_version         = "amazon/aws-for-fluent-bit:2.31.12.20230911"
  eks_ami_type                     = "BOTTLEROCKET_x86_64"
  eks_ami_release_version          = "1.14.3-764e37e4"
}

We were facing this issue from the beginning. The error messages appeared every few seconds in the logs and looked like this:

E1023 10:22:19.116778      11 cacher.go:476] cacher (bottlerocketshadows.brupop.bottlerocket.aws): unexpected ListAndWatch error: failed to list brupop.bottlerocket.aws/v1, Kind=BottlerocketShadow: conversion webhook for brupop.bottlerocket.aws/v2, Kind=BottlerocketShadow failed: Post "https://brupop-apiserver.brupop-bottlerocket-aws.svc:443/crdconvert?timeout=30s": tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2024-10-23T10:22:19Z is after 2024-09-07T15:58:24Z; reinitializing...
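
Note that this message reports an expired certificate rather than an unknown authority. The two timestamps in it can be compared mechanically to see how long the certificate had been expired when the error was logged. The following Python sketch assumes the message keeps the `current time ... is after ...` wording shown above; `expired_for` is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

_TS = r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z"
EXPIRY = re.compile(r"current time " + _TS + r" is after " + _TS)


def _parse_utc(timestamp: str) -> datetime:
    """Parse a timestamp like 2024-10-23T10:22:19 as UTC."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def expired_for(message: str) -> Optional[timedelta]:
    """How long the certificate had been expired, or None if not an expiry error."""
    match = EXPIRY.search(message)
    if match is None:
        return None
    current, not_after = (_parse_utc(ts) for ts in match.groups())
    return current - not_after
```

For the log line above, this returns 45 days, 18:23:55, meaning the serving certificate had not been renewed for over six weeks past its expiry.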
