
no service port 8443 found for service "karpenter" after migrating from cluster autoscaler #6544

Open
TimSin opened this issue Jul 18, 2024 · 36 comments
Labels
question Issues that are support related questions

Comments

@TimSin

TimSin commented Jul 18, 2024

Description

Observed Behavior:

After following the instructions at https://karpenter.sh/preview/getting-started/migrating-from-cas/, when I try to create the NodePool (as outlined there) I receive the error:

Error from server: error when creating "nodepool.yaml": conversion webhook for karpenter.sh/v1beta1, Kind=NodePool failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s": no service port 8443 found for service "karpenter"

Expected Behavior:

A working karpenter installation

Reproduction Steps (Please include YAML):

I followed the instructions at https://karpenter.sh/preview/getting-started/migrating-from-cas/ on an existing EKS cluster.

Versions:

  • Chart Version: 0.37.0
  • Kubernetes Version (kubectl version): 1.30.2
TimSin added the bug (Something isn't working) and needs-triage (Issues that need to be triaged) labels on Jul 18, 2024
@rwe-dtroup

We have the exact same error. We are unable to even remove Karpenter at this point.

@engedaam
Contributor

The preview section of the Karpenter docs covers the upcoming, pre-release version of Karpenter. I suggest you use the latest released version of Karpenter, which is v0.37.0: https://karpenter.sh/v0.37. Specifically: https://karpenter.sh/v0.37/getting-started/migrating-from-cas/

engedaam added the question (Issues that are support related questions) label and removed the bug (Something isn't working) and needs-triage (Issues that need to be triaged) labels on Jul 22, 2024
@fnmarquez

Same problem here.

@engedaam
Contributor

engedaam commented Jul 23, 2024

@fnmarquez Which page are you using?

@NicholasRaymondiSpot

NicholasRaymondiSpot commented Jul 25, 2024

EDIT 7/26: We were able to resolve this today. It turned out the main-branch CRDs had been applied to the cluster instead of the 0.37.0-tagged CRDs. This caused a lot of confusion around requirements, but once we diffed what was running in the cluster we were able to track it down based on the v1 configurations.

We hit the same issue when upgrading our existing EKS configurations; there was no previous cluster-autoscaler configuration in use. With the 0.37 Helm chart we have webhook.enabled: false from the default values. These errors only started showing up in our cluster after upgrading from 0.36.2 to 0.37 and applying the latest CRDs.

>  kubectl get nodeclaim -A
Error from server: conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s": no service port 8443 found for service "karpenter"
> kubectl logs deployment/karpenter -n kube-system
{"level":"ERROR","time":"2024-07-24T23:20:37.998Z","logger":"controller","message":"Reconciler error","commit":"490ef94","controller":"provisioner","error":"creating node claim, conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": no service port 8443 found for service \"karpenter\"; creating node claim, conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": no service port 8443 found for service \"karpenter\"","errorCauses":[{"error":"creating node claim, conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": no service port 8443 found for service \"karpenter\""},{"error":"creating node claim, conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": no service port 8443 found for service \"karpenter\""}]}
{"level":"ERROR","time":"2024-07-24T23:20:42.036Z","logger":"controller","message":"Reconciler error","commit":"490ef94","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-1-2-3-4.ec2.internal"},"namespace":"","name":"ip-1-2-3-4.ec2.internal","reconcileID":"<ID>","error":"deleting nodeclaims, conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": no service port 8443 found for service \"karpenter\""}

Once I flipped the webhook value to true and added the port configuration to our deployment, we started getting different errors about certificate trust:

> kubectl get nodeclaim -A
Error from server: conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority

Now I'm starting to go down the rabbit hole of generating a self-signed cert and updating the caBundle (the base64-encoded PEM cert) in our validation.webhook.karpenter.k8s.aws ValidatingWebhookConfiguration to see if this gets us working again. So far I'm not confident that this is the right approach, but it has put a halt on our Karpenter upgrades until we can find a proper path forward.
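
(For anyone hitting the same mismatch: a minimal sketch of checking whether the in-cluster CRDs match the release tag you intend to run, by diffing against the tagged manifests. The v0.37.0 tag and the pkg/apis/crds path are assumptions; expect some server-added defaults to show up in the diff.)

kubectl get crd nodeclaims.karpenter.sh -o yaml > installed-nodeclaims.yaml
curl -sL https://raw.githubusercontent.com/aws/karpenter-provider-aws/v0.37.0/pkg/apis/crds/karpenter.sh_nodeclaims.yaml -o release-nodeclaims.yaml
diff <(yq '.spec' installed-nodeclaims.yaml) <(yq '.spec' release-nodeclaims.yaml)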

@code-crusher

code-crusher commented Aug 5, 2024

We are also facing the same issue. Karpenter v0.37.0 was running fine on an EKS 1.29 cluster. After upgrading the cluster to 1.30, this issue started to occur. We were not using webhooks before either.

> kubectl get service -n kube-system karpenter
NAME        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
karpenter   ClusterIP   172.20.116.212   <none>        8000/TCP   43s
> kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter
{"level":"ERROR","time":"2024-08-05T06:40:22.229Z","logger":"controller","message":"Reconciler error","commit":"490ef94","controller":"nodeclass.status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"XXXX-ec2nodeclass"},"namespace":"","name":"XXX-ec2nodeclass","reconcileID":"81bd9baf-5b2d-4f26-af7b-XXXXX","error":"conversion webhook for karpenter.k8s.aws/v1beta1, Kind=EC2NodeClass failed: Post \"https://karpenter.kube-system.svc:8443/?timeout=30s\": no service port 8443 found for service \"karpenter\""}

Checks done:

  1. No webhooks related to Karpenter exist in validatingwebhookconfigurations and mutatingwebhookconfigurations.
  2. Pods and Service are running.

Update: it's working now.
I purged everything related to Karpenter and re-installed, and it just worked. I'm still unsure why, but I did notice the CRDs were never deleted on the first attempt at purging -- I had to delete them manually by removing the finalizer.
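
(For reference, a minimal sketch of that manual finalizer removal; the CRD name here is just one of Karpenter's, and clearing finalizers should only be done once nothing depends on the CRD.)

kubectl patch crd nodepools.karpenter.sh --type merge -p '{"metadata":{"finalizers":[]}}'
kubectl delete crd nodepools.karpenter.sh --ignore-not-found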

@Vinaum8

Vinaum8 commented Aug 15, 2024

I installed Karpenter in a separate namespace and with a different service name than the default.

I can't configure the values for the webhook, and it returns the error:

service: karpenter not found.

However, obviously I did not create a service with this name in the kube-system namespace.

I am using version 0.37.1 and I installed Karpenter with ArgoCD.

@rcalosso

Ran into similar problems; we solved ours by not using the static CRDs, which have hardcoded webhook settings. We switched to the karpenter-crd Helm chart and updated the webhook values.

See: https://karpenter.sh/preview/upgrading/upgrade-guide/#crd-upgrades
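
(A minimal sketch of that approach, assuming Karpenter v0.37.x installed in kube-system with the default service name; the webhook value names are taken from later comments in this thread, so double-check them against your chart version's values.)

helm upgrade --install karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd \
  --version 0.37.0 \
  --namespace kube-system \
  --set webhook.enabled=true \
  --set webhook.serviceName=karpenter \
  --set webhook.serviceNamespace=kube-system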

@Vinaum8

Vinaum8 commented Aug 22, 2024

#6818

@akramincity

The simplest way to upgrade Karpenter is to delete all the validating and mutating webhooks, since Karpenter 0.37.0+ does not use any webhooks:
kubectl delete validatingwebhookconfiguration validation.webhook.config.karpenter.sh validation.webhook.karpenter.sh
kubectl delete mutatingwebhookconfigurations defaulting.webhook.karpenter.k8s.aws

@rwe-dtroup

The simplest way to upgrade Karpenter is to delete all the validating and mutating webhooks, since Karpenter 0.37.0+ does not use any webhooks: kubectl delete validatingwebhookconfiguration validation.webhook.config.karpenter.sh validation.webhook.karpenter.sh kubectl delete mutatingwebhookconfigurations defaulting.webhook.karpenter.k8s.aws

Not the easiest of upgrades if you're doing this through pipelines, though. Having the CRDs updated/replaced seems to be the better option right now.

The issue we faced was that, having tried to update to the newer version, you could no longer do anything because the CRD kept looking for the mutating webhook. It looks like an issue on the server's listener, when in fact it has nothing to do with that, and you can end up down a rabbit hole.

@gearoidrog

We are also facing this same issue after upgrading from 0.32.1 to 0.37.2.

We tried deleting the above-mentioned webhooks and even tried the helm upgrade flag --reuse-values to force the new Helm upgrade not to remove the service port 8443.

While --reuse-values preserves the service port 8443, it causes other strange side effects where the karpenter ClusterRole does not upgrade correctly and some permissions are missing.

It's also worth noting that the CRD definitions for NodePool and EC2NodeClass also refer to the service in the kube-system namespace, despite v0.32.1 using the karpenter namespace.

Would really appreciate a detailed working solution for this upgrade, please.

@akunduru9

I am seeing a similar issue when I try to upgrade Karpenter from v0.36.2 to major version 1.0.0. The Karpenter Argo sync status is broken, but I do see pods running, with the errors below.

{"level":"ERROR","time":"2024-09-09T01:23:18.867Z","logger":"controller","message":"Reconciler error","commit":"490ef94","controller":"provisioner","error":"creating node claim, conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s\": service "karpenter" not found"}

{"level":"ERROR","time":"2024-09-09T01:23:19.314Z","logger":"controller","message":"Reconciler error","commit":"490ef94","controller":"nodeclass.hash","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"applications"},"namespace":"","name":"applications","reconcileID":"77f1ac75-ebe2-485f-8758-4e47730b47ac","error":"conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s\": service "karpenter" not found; conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s\": service "karpenter" not found","errorCauses":[{"error":"conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s": service "karpenter" not found"},{"error":"conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s": service "karpenter" not found"}]}

@amohamedhey

You need to update the namespace in the CRDs to kube-system.

@piximos

piximos commented Sep 10, 2024

As @amohamedhey mentioned, if you have the conversion webhook enabled, you need to change the namespace there:

# ec2nodeclasses.karpenter.k8s.aws
# nodeclaims.karpenter.sh
# nodepools.karpenter.sh 

#...
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions:
        - v1beta1
        - v1
      clientConfig:
        service:
          name: karpenter
          namespace: kube-system
          port: 8443

@scotthesterberg

There is an additional important detail that is causing these issues when using ArgoCD: the service definition has its name templated (https://github.com/aws/karpenter-provider-aws/blob/v0.37.0/charts/karpenter/templates/service.yaml#L4), but, as shown above, when the conversion section is included in the CRDs the service name is hardcoded (https://github.com/aws/karpenter-provider-aws/blob/v0.36.4/pkg/apis/crds/karpenter.sh_nodeclaims.yaml#L817). So when you change the release name in ArgoCD (say, when you are deploying to multiple clusters), the service name changes and everything breaks because of the hardcoded name in the CRD.

The same issue also exists for namespace changes.

I am not sure how to bring this to the attention of the devs. Probably the easiest fix would be to not templatize the name of the Kubernetes service for Karpenter.
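
(A quick way to see the mismatch described above: compare the service the CRD expects with the services that actually exist. The label selector assumes the chart's standard labels.)

kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.spec.conversion.webhook.clientConfig.service}{"\n"}'
kubectl get svc -A -l app.kubernetes.io/name=karpenter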

@scotthesterberg

scotthesterberg commented Sep 11, 2024

The fix for the above is to add the value below to your Helm values file:

fullnameOverride: karpenter

@davidt-gh

It happened to me after moving from 0.37 to 1.0.1.

@gearoidrog

Two things fixed it for me:

  1. Pass this flag in the helm upgrade command:
    --set webhook.enabled=true
  2. After downloading the CRD files for NodePool, EC2NodeClass and NodeClaim, run a yq update to change the namespace to karpenter, then apply the updated YAML files (see the sketch after this list):
    yq eval '.spec.conversion.webhook.clientConfig.service.namespace = "karpenter"'
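
(A minimal end-to-end sketch of step 2, assuming the release manifests live under pkg/apis/crds at the v1.0.2 tag and that Karpenter is installed in the karpenter namespace; adjust the tag, path, and namespace to your setup.)

for crd in karpenter.sh_nodepools karpenter.sh_nodeclaims karpenter.k8s.aws_ec2nodeclasses; do
  curl -sL "https://raw.githubusercontent.com/aws/karpenter-provider-aws/v1.0.2/pkg/apis/crds/${crd}.yaml" |
    yq eval '.spec.conversion.webhook.clientConfig.service.namespace = "karpenter"' - |
    kubectl apply -f -
done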

@davidt-gh

Two things fixed it for me:

  1. Pass this flag in the helm upgrade command:
    --set webhook.enabled=true
  2. After downloading the CRD files for NodePool, EC2NodeClass and NodeClaim, run a yq update to change the namespace to karpenter, then apply the updated YAML files:
    yq eval '.spec.conversion.webhook.clientConfig.service.namespace = "karpenter"'

What version do you use? Did you upgrade to 1.x?

@sudeepthisingi

We didn't run into any of these issues when we upgraded from 0.37.2 to 0.37.3 first before migrating to v1.0.0. We also moved on to the latest patch version, 1.0.2.

@akunduru9

akunduru9 commented Sep 23, 2024

@pheianox @amohamedhey @sudeepthisingi

I tried updating my version from 0.36.2 to 1.0.1, and I'm seeing the following error. Do we need to install the webhooks manually? Below are the errors I'm encountering:

{"level":"INFO","time":"2024-09-23T17:55:24.495Z","logger":"controller","message":"k8s.io/[email protected]/tools/cache/reflector.go:232: failed to list *v1.ConfigMap: Unauthorized","commit":"490ef94"}
{"level":"ERROR","time":"2024-09-23T17:55:24.495Z","logger":"controller","message":"k8s.io/[email protected]/tools/cache/reflector.go:232: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Unauthorized","commit":"490ef94"}
I have a ClusterRole with the necessary permissions, but we're still seeing the error. Here is my ClusterRole:

❯ kubectl get clusterrole karpenter -o yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: "2024-09-23T17:43:52Z"
  labels:
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: karpenter
    app.kubernetes.io/version: 0.37.0
    argocd.argoproj.io/instance: eco-sandbox2.karpenter.karpenter
    helm.sh/chart: karpenter-0.37.0
  name: karpenter
  resourceVersion: "157980964"
  uid: d256470a-eac6-4b0f-bbba-d97a5b2d5f11
rules:
- apiGroups:
  - karpenter.k8s.aws
  resources:
  - ec2nodeclasses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - karpenter.k8s.aws
  resources:
  - ec2nodeclasses
  - ec2nodeclasses/status
  verbs:
  - patch
  - update
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - get
  - list
  - watch

Additionally, here is my conversion configuration in the CRDs:

# CRD namespace config
conversion:
  strategy: Webhook
  webhook:
    conversionReviewVersions:
    - v1beta1
    - v1
    clientConfig:
      service:
        name: karpenter
        namespace: karpenter
        port: 8443

@lzyli

lzyli commented Sep 24, 2024

I had the same issue. I solved it by using the karpenter-crd chart and changing the serviceNamespace to the namespace where we installed Karpenter.

@akunduru9

I was able to get the above issue resolved, but I'm seeing a different issue now. Any idea about the error below?

name":"applications"},"namespace":"","name":"applications","reconcileID":"97c36b12-0155-416b-8bd1-32e0eca51423","error":"conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s\": tls: failed to verify certificate: x509: certificate signed by unknown authority; conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s\": tls: failed to verify certificate: x509: certificate signed by unknown authority","errorCauses":[{"error":"conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority"},{"error":"conversion webhook for karpenter.sh/v1beta1, Kind=NodeClaim failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority"}]}

@davidt-gh

davidt-gh commented Sep 26, 2024

I had the same issue. I solved it by using the karpenter-crd chart and changing the serviceNamespace to the namespace where we installed Karpenter.

It's really hard for me to understand why they made kube-system the default; this change is making a lot of mess here.
@avielb-navina @jonathan-innis - I saw this was documented by you guys

Contributor

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

@nativo-mshivaswamy

Ran into the same conflict and it broke ArgoCD deployments.

We followed the upgrade instructions from the Karpenter documentation and upgraded from 0.34 to 0.34.7.

ComparisonError
Failed to load live state: failed to get cluster info for "https://k8s-cluster": error synchronizing cache state : failed to sync cluster https://k8s-cluster: failed to load initial state of resource EC2NodeClass.karpenter.k8s.aws: conversion webhook for karpenter.k8s.aws/v1beta1, Kind=EC2NodeClass failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s": service "karpenter" not found

@neike894

neike894 commented Oct 16, 2024

@nativo-mshivaswamy Could you please confirm which services are listed when running kubectl get svc?
I believe the service running on port 8443 was introduced starting from version 0.36.7.

@neike894

The fix for the above is to add the value below to your Helm values file:

fullnameOverride: karpenter

Thanks, this helped me resolve my problem.

@rkashasl

rkashasl commented Oct 31, 2024

The problem still persists when using Kustomize and just the karpenter chart with symlinked CRDs.

---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: karpenter
  namespace: flux-system
spec:
  interval: 17m
  type: oci
  url: oci://public.ecr.aws/karpenter/
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: karpenter
  namespace: karpenter
spec:
  install:
    createNamespace: true
    crds: CreateReplace
  upgrade:
    crds: CreateReplace
  interval: 9m
  chart:
    spec:
      chart: karpenter
      version: 1.0.6
      sourceRef:
        kind: HelmRepository
        name: karpenter
        namespace: flux-system
      interval: 14m
  values:
    priorityClassName: system-cluster-critical
    logLevel: info
    controller:
      resources:
        requests:
          cpu: 0.2
          memory: 768Mi
        limits:
          cpu: 0.2
          memory: 768Mi
    settings:
      clusterName: "${clusterName}"
      defaultInstanceProfile: "KarpenterInstanceProfile-${clusterName}"
      interruptionQueue: "KarpenterInterruptions-${clusterName}"
      featureGates:
        driftEnabled: true
    serviceAccount:
      create: false
      name: karpenter-oidc
    webhook:
      enabled: true
    fullnameOverride: karpenter

After the upgrade, all the CRDs have the kube-system namespace for the karpenter service, which leads to an error in both the Kustomizations and the pod logs:

conversion webhook for karpenter.k8s.aws/v1beta1, Kind=EC2NodeClass failed: Post "https://karpenter.kube-system.svc:8443/?timeout=30s": service "karpenter" not found

@rkashasl

rkashasl commented Nov 13, 2024

Same after upgrading to 1.0.8; it just replaced the namespace with kube-system in all the CRDs again.


Any solution?

@K8sKween

WORKAROUND (This isn't pretty):

EKS, K8s 1.30

  1. Scale up an EKS node group or something to keep your pods supported while you rip everything Karpenter out.
  2. Delete everything: CRs, CRDs, and the whole manifest.
  3. Clean out all the webhooks
    Find all webhooks
    kubectl get validatingwebhookconfigurations
    kubectl get mutatingwebhookconfigurations
    
    If it mentions karpenter, delete it.
  4. Install Karpenter in the kube-system namespace. Remember: if you are giving this pod an IAM role via a service account annotation, you may need to check the OIDC config in your IAM role trust policy to allow kube-system:karpenter (see the sketch after this comment).
    After this I was able to get everything applied and working.
  5. If, like me, you are using ArgoCD: just so you know, I installed my karpenter-crd chart separately in addition to my karpenter chart, and that might also contain them. Just worth noting.

I did run into a case where I made my old EC2NodeClass a phantom. Because I would have had to raise an AWS ticket to remove it from etcd, I just made a new one and updated my node pools.
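
(A sketch of the trust-policy check from step 4, assuming IRSA; KarpenterControllerRole is a placeholder for your actual role name.)

aws iam get-role --role-name KarpenterControllerRole \
  --query 'Role.AssumeRolePolicyDocument.Statement[].Condition' --output json
# The StringEquals condition on "<oidc-provider>:sub" should allow
# system:serviceaccount:kube-system:karpenter after the namespace move.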

@makis10

makis10 commented Nov 26, 2024

kubectl edit crd ec2nodeclasses.karpenter.k8s.aws
kubectl edit crd nodepools.karpenter.sh
kubectl edit crd nodeclaims.karpenter.sh

Then set service.namespace to the namespace where Karpenter is installed, instead of the default kube-system.
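
(A non-interactive equivalent, assuming Karpenter lives in a namespace called karpenter; the JSON patch path matches the conversion block shown earlier in this thread and only applies to CRDs that actually contain that stanza.)

for crd in ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh; do
  kubectl patch crd "$crd" --type json \
    -p '[{"op":"replace","path":"/spec/conversion/webhook/clientConfig/service/namespace","value":"karpenter"}]'
done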

@hamadycisselp360

hamadycisselp360 commented Dec 2, 2024

@makis10's solution is exactly it.
We can see by inspecting those CRDs that they expect the karpenter service to be in the wrong namespace.
After making the changes, do not forget to apply them:

kubectl get crd [the_crd]  -o yaml > latest_version_[the_crd].yaml
kubectl apply -f latest_version_[the_crd].yaml

@juiceblender

juiceblender commented Jan 3, 2025

If you are in my situation where it's not a namespace issue, then it may be this. I use Terraform, and the release name was set to controller-karpenter.

resource "helm_release" "karpenter" {
  namespace = "kube-system"

  name       = local.name # resolves to "controller-karpenter"
  chart      = "karpenter"
  repository = "oci://public.ecr.aws/karpenter"

  lifecycle {
    ignore_changes = [repository_password]
  }

  version = "0.37.6"
  ...

My CRDs are managed by the helm chart karpenter-crd.

resource "helm_release" "karpenter_crd" {
 namespace = "kube-system"

 name       = "karpenter-crd"
 chart      = "karpenter-crd"
 repository = "oci://public.ecr.aws/karpenter" 

 version = "0.37.6"

The CRDs still keep trying to ping https://karpenter.kube-system.svc:8443/?timeout=30s for validating the configuration.

What I needed to do was add this block, like so

resource "helm_release" "karpenter_crd" {
  namespace = "kube-system"

  name       = "karpenter-crd"
  chart      = "karpenter-crd"
  repository = "oci://public.ecr.aws/karpenter" 

  version = "0.37.6"

  set {
    name = "webhook.serviceName" <------- I had to add this block. 
    value = local.name 
  }
}

This is because there is a way to change the service the webhook points to, as detailed here: https://github.com/aws/karpenter-provider-aws/blob/v0.37.6/charts/karpenter-crd/templates/karpenter.k8s.aws_ec2nodeclasses.yaml#L1305. In my case I ran into this when bumping from 0.32 to 0.37.6. I was going to try disabling webhooks as a last resort, but managed to find that setting instead. You can always verify what is on the Kubernetes cluster with kubectl get crd to get the list, and then e.g. kubectl get crd/ec2nodeclasses.karpenter.k8s.aws -o yaml | less, and look for the webhook strategy service it's using.

Just be careful of the bump to v1 as the values may not be respected (#6847)

@juiceblender

I'm also wondering if part of the problem is that the usage instructions for karpenter-crd on the AWS ECR Public Gallery still reference the karpenter namespace, but I didn't find them in this repository. Not sure if someone can look into it?

helm upgrade --install karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd --version {INSERT_IMAGE_TAG} --namespace karpenter --create-namespace
