
Sometimes kube-apiserver, kube-controller-manager, and kube-scheduler are not restarted after etcd/kubelet/cri restart #10511

Open
jfroy opened this issue Mar 12, 2025 · 10 comments

Comments

jfroy (Contributor) commented Mar 12, 2025

Bug Report

Description

On my homelab cluster running Talos 1.9.4, one of my nodes sometimes goes through a sequence where etcd and kubelet exit with code 255 and cri (containerd) exits on a hangup signal. This suggests a reconfiguration rather than a crash. Leaving that aside, when this sequence happens, Talos removes the kube-apiserver, kube-controller-manager, and kube-scheduler static pods. etcd, kubelet, and cri are then restarted, and those static pods are rendered/added again.

But sometimes they are not rendered/added again, which leaves the node in a bad state. There is no additional logging to give a clue as to why.

I've attached a support bundle with a node in such a state.
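For anyone trying to spot this state, something along these lines should show it (the node address is a placeholder, and exact resource names may differ between Talos versions):

# List Talos service states on the affected node.
talosctl -n 10.1.1.1 services

# List the static pod definitions Talos has rendered for the kubelet; on a healthy
# control plane node this should include kube-apiserver, kube-controller-manager,
# and kube-scheduler.
talosctl -n 10.1.1.1 get staticpods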

Logs

kantai1: user: warning: [2025-03-11T21:43:11.74012641Z]: [talos] service[kubelet](Preparing): Creating service runner
kantai1: user: warning: [2025-03-11T21:43:11.74077941Z]: [talos] service[etcd](Preparing): Creating service runner
kantai1: user: warning: [2025-03-11T21:43:11.83241041Z]: [talos] service[ext-nvidia-driver](Running): Started task ext-nvidia-driver (PID 7647) for container ext-nvidia-driver
kantai1: user: warning: [2025-03-11T21:43:11.84115341Z]: [talos] service[kubelet](Running): Started task kubelet (PID 7649) for container kubelet
kantai1: user: warning: [2025-03-11T21:43:11.84146941Z]: [talos] service[etcd](Running): Started task etcd (PID 7650) for container etcd
kantai1: user: warning: [2025-03-11T21:43:12.07300641Z]: [talos] service[ext-zfs-service](Waiting): Waiting for file "/dev/zfs" to exist
kantai1: user: warning: [2025-03-11T21:43:13.75346441Z]: [talos] service[kubelet](Running): Health check successful
kantai1: user: warning: [2025-03-11T21:43:16.75497441Z]: [talos] service[etcd](Running): Health check successful
kantai1: user: warning: [2025-03-11T21:43:16.75840041Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
kantai1: user: warning: [2025-03-11T21:43:16.75891441Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
kantai1: user: warning: [2025-03-11T21:43:16.75892641Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
<snip>
kantai1: user: warning: [2025-03-11T21:43:32.75684141Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: task "kubelet" failed: exit code 255
kantai1: user: warning: [2025-03-11T21:43:32.75687941Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: task "etcd" failed: exit code 255
kantai1: user: warning: [2025-03-11T21:43:32.75692341Z]: [talos] service[cri](Waiting): Error running Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]), going to restart forever: signal: hangup
kantai1: user: warning: [2025-03-11T21:43:32.75726741Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
kantai1: user: warning: [2025-03-11T21:43:32.75728441Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
kantai1: user: warning: [2025-03-11T21:43:32.75729541Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
kantai1: user: warning: [2025-03-11T21:43:35.19133341Z]: [talos] updated Talos API endpoints in Kubernetes {"component": "controller-runtime", "controller": "kubeaccess.EndpointController", "endpoints": ["10.1.1.1", "10.1.1.2", "10.1.1.3"]}
kantai1: user: warning: [2025-03-11T21:43:37.75836841Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
kantai1: user: warning: [2025-03-11T21:43:37.75842541Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: failed to create task: "kubelet": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
kantai1: user: warning: [2025-03-11T21:43:37.78587441Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 16659
kantai1: user: warning: [2025-03-11T21:43:40.71106241Z]: [talos] task startAllServices (1/1): service "ext-zfs-service" to be "up"
kantai1: user: warning: [2025-03-11T21:43:41.71207641Z]: [talos] service[cri](Running): Health check successful
kantai1: user: warning: [2025-03-11T21:43:42.92725141Z]: [talos] service[etcd](Running): Started task etcd (PID 17246) for container etcd
kantai1: user: warning: [2025-03-11T21:43:42.96428241Z]: [talos] service[kubelet](Running): Started task kubelet (PID 17302) for container kubelet

Environment

  • Talos version: 1.9.4 (custom kernel)
  • Kubernetes version: v1.32.1
  • Platform: bare metal (amd64)

support.zip

smira (Member) commented Mar 12, 2025

I don't know what that is, but it's certainly not a reconfiguration.

The static pods are taken down on purpose: as etcd is down, there's no point in running them.

smira (Member) commented Mar 12, 2025

There is no message in the cri.log about the kubelet container dying, but there are tons of task exit events. Any idea what might be going on?

jfroy (Contributor, Author) commented Mar 12, 2025

> There is no message in the cri.log about the kubelet container dying, but there are tons of task exit events. Any idea what might be going on?

The node was shut down (with talosctl) for hardware maintenance/repair for a few hours, then started back up. I noticed the missing static pods maybe an hour after booting the node. I captured the support bundle and then restarted the node; it has been operating normally since.

smira (Member) commented Mar 12, 2025

Not sure what's going on here; Talos never sends SIGHUP to anything. 😕

jfroy (Contributor, Author) commented Mar 12, 2025

> Not sure what's going on here; Talos never sends SIGHUP to anything. 😕

Yeah, that's a separate investigation... This bug is specifically about Talos not rendering the static pods. Judging only from the main log, the node eventually settled, with etcd, kubelet, and cri (containerd) running, so I expected the static pods to also come back. It suggests there is a gap in the controller's logic, a race, or a missing/dropped event.

smira (Member) commented Mar 12, 2025

I don't think it's a bug anywhere: etcd never became healthy, so the static pods are not rendered.
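For what it's worth, a way to check that from the node side (address is a placeholder): the per-service view shows state, health, and recent events.

# Show etcd service state, health, and recent events.
talosctl -n 10.1.1.1 service etcd

# Tail the etcd container log for the same window.
talosctl -n 10.1.1.1 logs etcd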

smira (Member) commented Mar 12, 2025

On another look, there's some mismatch in the etcd service status.

smira (Member) commented Mar 14, 2025

We observed something similar to your symptoms with Kubernetes conformance tests, and there's a fix in main for it now.

See #10520

jfroy (Contributor, Author) commented Mar 14, 2025

> We observed something similar to your symptoms with Kubernetes conformance tests, and there's a fix in main for it now.
>
> See #10520

I read the containerd issue discussion. I assume by "fix in main", you mean the containerd patch from pkgs?

smira (Member) commented Mar 14, 2025

I think your issue is not exactly the same, since in your case containerd recovered from the crash, but the symptoms look similar: it looks like a crash that results in SIGHUP, which is probably containerd-shim dying and closing the process group, causing cascading container failures.

So the change in main is certainly not a fix; I should phrase it differently: there's some bug in containerd, and we might be one step closer to finding it.

The fix in main allows containerd to recover from such crashes (in your case it recovered, but I saw it failing to start with a corrupted state forever).
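If it happens again, a couple of commands might help narrow down the containerd side (address is a placeholder):

# Log of the "cri" containerd instance, where shim exits and task exit events show up.
talosctl -n 10.1.1.1 logs cri

# Kernel messages, in case a shim was killed (e.g. OOM) rather than crashing on its own.
talosctl -n 10.1.1.1 dmesg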
