no service port 8443 found for service "karpenter" after migrating from cluster autoscaler #6544
Comments
We have the exact same error. Unable to even remove Karpenter at this point either.
The preview section of the Karpenter docs covers upcoming, pre-release versions of Karpenter. I suggest you use the latest released version of Karpenter, which is v0.37.0: https://karpenter.sh/v0.37. Specifically: https://karpenter.sh/v0.37/getting-started/migrating-from-cas/
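For reference, a minimal install sketch pinned to that release rather than preview. The cluster name and IAM role ARN are placeholders for your own setup, and the values shown are the commonly used ones from the getting-started guide:

    helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
      --version 0.37.0 \
      --namespace kube-system --create-namespace \
      --set settings.clusterName=<your-cluster-name> \
      --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=<karpenter-controller-role-arn>" \
      --wait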
Same problem here.
@fnmarquez Which page are you using?
EDIT 7/26: We were able to resolve this today. It turned out the main-branch CRDs had been applied to the cluster instead of the 0.37.0-tagged CRDs. This caused a lot of confusion about the requirements, but once we diff'd what was running in the cluster we were able to track it down based on the …

We hit the same issue when upgrading our existing EKS configurations; there was no previous cluster-autoscaler configuration in use. Using the 0.37 Helm chart we have …
Once I flipped the webhook value to …
Now I'm starting to go down the rabbit hole of generating a self-signed cert and updating the …
We are also facing the same issue. Karpenter v0.37.0 was running fine in an EKS 1.29 cluster. After upgrading the cluster to 1.30, this issue started occurring. We were not using webhooks before either.
Checks done:
Update: It's working now.
I installed Karpenter in a separate namespace and with a different service name than the default. I can't configure the webhook values, and it returns the error:
However, it is obvious that I did not create a service with this name in the kube-system namespace.
Ran into similar problems. We solved ours by not using the static CRDs that have the webhook configuration hardcoded. We switched to the karpenter-crd Helm chart and updated the webhook values. See: https://karpenter.sh/preview/upgrading/upgrade-guide/#crd-upgrades
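A hedged sketch of that switch, assuming Karpenter runs as the service karpenter in kube-system. The webhook value names have moved between chart releases, so verify them against the values.yaml of the karpenter-crd chart version you install:

    helm upgrade --install karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd \
      --version <karpenter-version> \
      --namespace kube-system --create-namespace \
      --set webhook.enabled=true \
      --set webhook.serviceName=karpenter \
      --set webhook.serviceNamespace=kube-system \
      --set webhook.port=8443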
The simplest way to upgrade Karpenter is to delete all of the validating and mutating webhook configurations, since Karpenter v0.37.0+ does not use any webhooks.
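A minimal sketch of that cleanup, under the assumption that your cluster still has the webhook configuration names older Karpenter releases created. List them first and delete only what actually shows up:

    # see which webhook configurations Karpenter left behind
    kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i karpenter

    # names used by older releases -- adjust to the output above
    kubectl delete validatingwebhookconfiguration validation.webhook.karpenter.sh validation.webhook.config.karpenter.sh
    kubectl delete mutatingwebhookconfiguration defaulting.webhook.karpenter.sh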
Not the easiest of upgrades if you're doing this through pipelines, though. Having the CRDs updated/replaced seems to be the better option right now. The issue we faced was that, having tried to update to the newer version, you could no longer do anything because the CRDs were still looking for the mutating webhook. It looks like an issue with the server's listener when in fact it has nothing to do with that, and you can end up down a rabbit hole.
We are also facing this same issue after upgrading from 0.32.1 to 0.37.2. We tried deleting the above-mentioned webhooks and even tried the helm upgrade flag --reuse-values to force the upgrade not to remove the service port 8443. While --reuse-values preserves the 8443 service port, it causes other strange side effects where the Karpenter ClusterRole does not upgrade correctly and some permissions end up missing. It's also worth noting that the CRD definitions for NodePool and EC2NodeClass refer to the service in the kube-system namespace despite v0.32.1 using the karpenter namespace. Would really appreciate a detailed, working solution for this upgrade, please.
I am seeing a similar issue when I try to upgrade Karpenter from v0.36.2 to v1.0.0. The Karpenter Argo sync status is broken, but I do see pods running, with the errors below:

    {"level":"ERROR","time":"2024-09-09T01:23:18.867Z","logger":"controller","message":"Reconciler error","commit":"490ef94","controller":"provisioner","error":"creating node claim, conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": service \"karpenter\" not found"}
    {"level":"ERROR","time":"2024-09-09T01:23:19.314Z","logger":"controller","message":"Reconciler error","commit":"490ef94","controller":"nodeclass.hash","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"applications"},"namespace":"","name":"applications","reconcileID":"77f1ac75-ebe2-485f-8758-4e47730b47ac","error":"conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": service \"karpenter\" not found; conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": service \"karpenter\" not found","errorCauses":[{"error":"conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": service \"karpenter\" not found"},{"error":"conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": service \"karpenter\" not found"}]}
You need to update the namespace in the CRDs to kube-system.
As @amohamedhey mentioned, if you have the conversion webhook enabled, you need to change the namespace in there:

    # ec2nodeclasses.karpenter.k8s.aws
    # nodeclaims.karpenter.sh
    # nodepools.karpenter.sh
    # ...
    conversion:
      strategy: Webhook
      webhook:
        conversionReviewVersions:
          - v1beta1
          - v1
        clientConfig:
          service:
            name: karpenter
            namespace: kube-system
            port: 8443
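If you would rather patch the live CRDs than edit them by hand, a hedged equivalent is below, assuming the default service name karpenter in kube-system. Repeat for nodepools.karpenter.sh and ec2nodeclasses.karpenter.k8s.aws, and note that chart- or Argo-managed CRDs may revert this on the next sync:

    kubectl patch crd nodeclaims.karpenter.sh --type merge -p \
      '{"spec":{"conversion":{"webhook":{"clientConfig":{"service":{"name":"karpenter","namespace":"kube-system","port":8443}}}}}}'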
There is an additional detail to the above that is important and is causing these issues when using Argo CD. The Service definition has its name variablized (https://github.com/aws/karpenter-provider-aws/blob/v0.37.0/charts/karpenter/templates/service.yaml#L4), BUT, as shown above, when the conversion block is included in the CRDs the service name is hardcoded (https://github.com/aws/karpenter-provider-aws/blob/v0.36.4/pkg/apis/crds/karpenter.sh_nodeclaims.yaml#L817). Thus when you change the template name in Argo CD (say, when you are deploying to multiple clusters), the Service name changes and then everything breaks because of the hardcoded name in the CRDs. The same issue also exists for namespace changes. I am not sure how to bring this to the attention of the devs. Probably the easiest fix for this would be to not variablize the name of the Kubernetes Service for Karpenter.
Fix for the above is to add the below var to your vars file:
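A plausible sketch of that override, assuming the chart's standard fullnameOverride value, so the rendered Service name keeps matching what the CRDs hardcode (verify the value exists in your chart version):

    # hypothetical override -- keeps the Service named "karpenter" even when the
    # Argo CD application / Helm release uses a different name
    helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
      --namespace kube-system \
      --set fullnameOverride=karpenter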
It happened to me after moving from 0.37 to 1.0.1.
2 things that fixed it for me were
What version are you using? Did you upgrade to 1.x?
We didn't run into any of these issues when we upgraded from 0.37.2 to 0.37.3 first, before migrating to v1.0.0. We have since moved on to the latest patch version, 1.0.2, as well.
@pheianox @amohamedhey @sudeepthisingi I tried updating my version from v0.36.2 to 1.0.1, and I'm seeing the following error. Do we need to install the webhooks manually? Below are the errors I'm encountering:

    {"level":"INFO","time":"2024-09-23T17:55:24.495Z","logger":"controller","message":"k8s.io/[email protected]/tools/cache/reflector.go:232: failed to list *v1.ConfigMap: Unauthorized","commit":"490ef94"}

    ❯ kubectl get clusterrole karpenter -o yaml
    apiVersion: rbac.authorization.k8s.io/v1
I had the same issue. I solved it by using the karpenter-crd chart and changing the serviceNamespace to the namespace where we installed Karpenter.
I was able to get the above issue resolved, however I'm seeing a different issue now. Any idea on the error below?

    name":"applications"},"namespace":"","name":"applications","reconcileID":"97c36b12-0155-416b-8bd1-32e0eca51423","error":"conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": tls: failed to verify certificate: x509: certificate signed by unknown authority; conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": tls: failed to verify certificate: x509: certificate signed by unknown authority","errorCauses":[{"error":"conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"},{"error":"conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"}]}
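That x509 error generally means the API server does not trust the certificate the Karpenter conversion webhook is serving, i.e. the caBundle stored on the CRDs is missing or stale. A hedged way to inspect what the CRDs currently carry (repeat for nodepools.karpenter.sh and ec2nodeclasses.karpenter.k8s.aws); if the bundle looks empty or stale, re-applying the CRDs via the karpenter-crd chart or restarting the controller, which as far as I understand re-injects the bundle, is a reasonable next step:

    kubectl get crd nodeclaims.karpenter.sh \
      -o jsonpath='{.spec.conversion.webhook.clientConfig}{"\n"}'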
It's really hard for me to understand why they made it default to …
This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity. |
Ran into the same conflict and it broke Argo CD deployments. I followed the upgrade instructions from the Karpenter documentation. ComparisonError: …
@nativo-mshivaswamy Could you please confirm which services are listed when running …?
Thanks, this helped me resolve my problem.
The problem still persists when using Kustomize and just the karpenter chart with symlinked CRDs.
All CRDs after the upgrade have the kube-system namespace for the Karpenter service, which leads to an error in both the kustomizations and the pod logs.
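A quick hedged check before applying, to see which namespace the rendered CRDs actually point at (the overlay path is a placeholder for your own layout):

    # inspect the conversion stanza in the rendered manifests
    kubectl kustomize ./overlays/karpenter | grep -A 6 'clientConfig:'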
WORKAROUND (this isn't pretty): EKS, K8s 1.30
I ran into the case where I had made my old EC2NodeClass a phantom. Because I would have had to raise an AWS ticket to remove it from etcd, I just made a new one and updated my node pools.
set service.namespace -> …
@makis10's solution is exactly it.
If you are in my situation where it's not a namespace issue, then it may be this. I use …
My CRDs are managed by the Helm chart.
The CRDs still kept trying to ping … What I needed to do was add this block, like so: …
This is because there is a way to actually change the service the conversion webhook points to, as detailed here: https://github.com/aws/karpenter-provider-aws/blob/v0.37.6/charts/karpenter-crd/templates/karpenter.k8s.aws_ec2nodeclasses.yaml#L1305. In my case I ran into this when bumping from 0.32 to 0.37.6. I was going to try disabling the webhooks as a last resort, but managed to find that instead. You can always verify what is on the Kubernetes cluster with … Just be careful of the bump to v1, as the values may not be respected (#6847).
I'm also wondering if part of the problem is also because of the usage of …
Description
Observed Behavior:
After following the instructions at https://karpenter.sh/preview/getting-started/migrating-from-cas/, when trying to create the NodePool (as outlined here) I receive the error from the title: no service port 8443 found for service "karpenter".
Expected Behavior:
A working karpenter installation
Reproduction Steps (Please include YAML):
I followed the instructions at https://karpenter.sh/preview/getting-started/migrating-from-cas/ on an existing EKS cluster.
Versions:
Kubernetes Version (kubectl version): 1.30.2