
Sometimes kube-apiserver, kube-controller-manager, and kube-scheduler are not restarted after etcd/kubelet/cri restart #10511

Open
jfroy opened this issue Mar 12, 2025 · 10 comments

Comments

jfroy (Contributor) commented Mar 12, 2025

Bug Report

Description

On my homelab cluster running Talos 1.9.4, one of my nodes sometimes goes through a sequence where etcd and kubelet exit with code 255 and cri (containerd) exits on a hangup signal. This suggests a reconfiguration rather than a crash. Leaving that aside, when this sequence happens, Talos removes the kube-apiserver, kube-controller-manager, and kube-scheduler static pods. etcd, kubelet, and cri are then restarted, and those static pods are rendered/added again.

But sometimes they are not rendered/added again, which leaves the node in a bad state. There is no additional logging to give a clue as to why.

I've attached a support bundle with a node in such a state.
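For anyone trying to spot this state, something along these lines should show it (the node address is a placeholder, and exact resource names may differ between Talos versions):

# List Talos service states on the affected node.
talosctl -n 10.1.1.1 services

# List the static pod definitions Talos has rendered for the kubelet; on a healthy
# control plane node this should include kube-apiserver, kube-controller-manager,
# and kube-scheduler.
talosctl -n 10.1.1.1 get staticpods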

Logs

kantai1: user: warning: [2025-03-11T21:43:11.74012641Z]: [talos] service[kubelet](Preparing): Creating service runner
kantai1: user: warning: [2025-03-11T21:43:11.74077941Z]: [talos] service[etcd](Preparing): Creating service runner
kantai1: user: warning: [2025-03-11T21:43:11.83241041Z]: [talos] service[ext-nvidia-driver](Running): Started task ext-nvidia-driver (PID 7647) for container ext-nvidia-driver
kantai1: user: warning: [2025-03-11T21:43:11.84115341Z]: [talos] service[kubelet](Running): Started task kubelet (PID 7649) for container kubelet
kantai1: user: warning: [2025-03-11T21:43:11.84146941Z]: [talos] service[etcd](Running): Started task etcd (PID 7650) for container etcd
kantai1: user: warning: [2025-03-11T21:43:12.07300641Z]: [talos] service[ext-zfs-service](Waiting): Waiting for file "/dev/zfs" to exist
kantai1: user: warning: [2025-03-11T21:43:13.75346441Z]: [talos] service[kubelet](Running): Health check successful
kantai1: user: warning: [2025-03-11T21:43:16.75497441Z]: [talos] service[etcd](Running): Health check successful
kantai1: user: warning: [2025-03-11T21:43:16.75840041Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
kantai1: user: warning: [2025-03-11T21:43:16.75891441Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
kantai1: user: warning: [2025-03-11T21:43:16.75892641Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
<snip>
kantai1: user: warning: [2025-03-11T21:43:32.75684141Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: task "kubelet" failed: exit code 255
kantai1: user: warning: [2025-03-11T21:43:32.75687941Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: task "etcd" failed: exit code 255
kantai1: user: warning: [2025-03-11T21:43:32.75692341Z]: [talos] service[cri](Waiting): Error running Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]), going to restart forever: signal: hangup
kantai1: user: warning: [2025-03-11T21:43:32.75726741Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
kantai1: user: warning: [2025-03-11T21:43:32.75728441Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
kantai1: user: warning: [2025-03-11T21:43:32.75729541Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
kantai1: user: warning: [2025-03-11T21:43:35.19133341Z]: [talos] updated Talos API endpoints in Kubernetes {"component": "controller-runtime", "controller": "kubeaccess.EndpointController", "endpoints": ["10.1.1.1", "10.1.1.2", "10.1.1.3"]}
kantai1: user: warning: [2025-03-11T21:43:37.75836841Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
kantai1: user: warning: [2025-03-11T21:43:37.75842541Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: failed to create task: "kubelet": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
kantai1: user: warning: [2025-03-11T21:43:37.78587441Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 16659
kantai1: user: warning: [2025-03-11T21:43:40.71106241Z]: [talos] task startAllServices (1/1): service "ext-zfs-service" to be "up"
kantai1: user: warning: [2025-03-11T21:43:41.71207641Z]: [talos] service[cri](Running): Health check successful
kantai1: user: warning: [2025-03-11T21:43:42.92725141Z]: [talos] service[etcd](Running): Started task etcd (PID 17246) for container etcd
kantai1: user: warning: [2025-03-11T21:43:42.96428241Z]: [talos] service[kubelet](Running): Started task kubelet (PID 17302) for container kubelet

Environment

  • Talos version: 1.9.4 (custom kernel)
  • Kubernetes version: v1.32.1
  • Platform: bare metal (amd64)

support.zip

smira (Member) commented Mar 12, 2025

I don't know what that is, but it's certainly not a reconfiguration.

The static pods are taken down on purpose: as etcd is down, there's no point in running them.

smira (Member) commented Mar 12, 2025

There is no message in the cri.log about the kubelet container dying, but there are tons of task exit events. Any idea what might be going on?

jfroy (Contributor, Author) commented Mar 12, 2025

> There is no message in the cri.log about the kubelet container dying, but there are tons of task exit events. Any idea what might be going on?

The node was shut down (with talosctl) for hardware maintenance/repair for a few hours, then started back up. I noticed the missing static pods maybe an hour after booting the node. I captured the support bundle and then restarted the node; it has been operating normally since.

smira (Member) commented Mar 12, 2025

Not sure what's going on here; Talos never sends SIGHUP to anything. 😕

jfroy (Contributor, Author) commented Mar 12, 2025

> Not sure what's going on here; Talos never sends SIGHUP to anything. 😕

Yeah, that's a separate investigation... This bug is specifically about Talos not rendering the static pods. Judging only from the main log, the node eventually settled, with etcd, kubelet, and cri (containerd) running, so I expected the static pods to also come back. It suggests there is a gap in the controller's logic, a race, or a missing/dropped event.

smira (Member) commented Mar 12, 2025

I don't think it's a bug anywhere: etcd never became healthy, so the static pods are not rendered.
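For what it's worth, a way to check that from the node side (address is a placeholder): the per-service view shows state, health, and recent events.

# Show etcd service state, health, and recent events.
talosctl -n 10.1.1.1 service etcd

# Tail the etcd container log for the same window.
talosctl -n 10.1.1.1 logs etcd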

smira (Member) commented Mar 12, 2025

On another look, there's some mismatch in the etcd service status.

smira (Member) commented Mar 14, 2025

We observed something similar to your symptoms with Kubernetes conformance tests, and there's a fix in main for it now.

See #10520

jfroy (Contributor, Author) commented Mar 14, 2025

> We observed something similar to your symptoms with Kubernetes conformance tests, and there's a fix in main for it now.
>
> See #10520

I read the containerd issue discussion. I assume by "fix in main", you mean the containerd patch from pkgs?

smira (Member) commented Mar 14, 2025

I think your issue is not exactly the same, since in your case containerd recovered from the crash, but the symptoms look similar: it looks like a crash that results in SIGHUP, which is probably containerd-shim dying and closing the process group, causing cascading container failures.

So the change in main is certainly not a fix; I should phrase it differently: there's some bug in containerd, and we might be one step closer to finding it.

The fix in main allows containerd to recover from such crashes (in your case it recovered, but I saw it failing to start with a corrupted state forever).
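If it happens again, a couple of commands might help narrow down the containerd side (address is a placeholder):

# Log of the "cri" containerd instance, where shim exits and task exit events show up.
talosctl -n 10.1.1.1 logs cri

# Kernel messages, in case a shim was killed (e.g. OOM) rather than crashing on its own.
talosctl -n 10.1.1.1 dmesg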
