Sometimes kube-apiserver, kube-controller-manager, and kube-scheduler are not restarted after etcd/kubelet/cri restart
#10511
Comments
I don't know what that is, but it's certainly not a reconfiguration. The static pods are taken down on purpose: as etcd is down, there's no point in running these.
There is no message in the …
The node was shut down (with talosctl) for hardware maintenance/repair for a few hours, then started back up. I noticed the missing static pods maybe an hour after booting the node. I captured the support bundle and then restarted the node. It's been operating normally since.
Not sure what's going on here, Talos never sends SIGHUP to anything. 😕
Yeah, that's a separate investigation... This bug is specifically about Talos not rendering the static pods. Judging from the main log alone, the node eventually settled, with etcd, kubelet, and cri (containerd) running, so I expected the static pods to come back as well. It suggests there is a gap in the controller's logic, a race, or a missing/dropped event.
I don't think it's a bug anywhere, …
On another look, there's some mismatch in the etcd service status.
We observed something similar to your symptoms with Kubernetes conformance tests, and there's a fix in main for it now. See #10520
I read the containerd issue discussion. I assume by "fix in main", you mean the containerd patch from pkgs?
I think your issue is not exactly the same, as in your case … So certainly the change in … The fix in main allows containerd to recover from such crashes (in your case it recovered, but I saw it failing to start with a corrupted state forever).
Bug Report
Description
On my homelab cluster running Talos 1.9.4, sometimes on one of my nodes, etcd and kubelet exit in sequence with code 255, and cri (containerd) exits with signal hangup. This suggests a reconfiguration rather than a crash. Leaving that aside, when this sequence happens, Talos then removes the kube-apiserver, kube-controller-manager, and kube-scheduler static pods. Then etcd, kubelet, and cri are restarted, and those static pods are rendered/added again. Only sometimes they are not rendered/added, and that leaves the node in a bad state. There is no additional logging to give a clue as to why that is.
I've attached a support bundle with a node in such a state.
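For anyone reproducing this, a minimal diagnostic sketch is below. It checks whether the services came back while the static pods did not, which is the state described above. The node address is a placeholder, and the guard around `talosctl` is only so the sketch runs harmlessly where the CLI is absent; `talosctl services` and `talosctl containers -k` are standard Talos commands, but the exact output layout is not taken from this issue.

```shell
#!/usr/bin/env bash
# Hedged diagnostic sketch for the symptom in this report.
# NODE is a hypothetical address; replace it with the affected node's IP.
NODE="${NODE:-10.5.0.2}"

if command -v talosctl >/dev/null 2>&1; then
  # Service states: etcd, kubelet, and cri should all show as Running
  # once the node has settled.
  talosctl -n "$NODE" services

  # Kubernetes containers on the node: in the bad state described above,
  # kube-apiserver, kube-controller-manager, and kube-scheduler are
  # missing from this listing even though the services are up.
  talosctl -n "$NODE" containers -k
else
  # talosctl is not installed in this environment; the commands above
  # are shown for reference only.
  echo "talosctl not installed"
fi
checked=1   # marker set once the sketch has run to completion
```

Comparing the two listings makes the inconsistency visible: healthy services alongside absent static pods is the gap the controller logic is expected to close.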
Logs
Environment
support.zip