-
Notifications
You must be signed in to change notification settings - Fork 413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow administrators to guide upgrade ordering #2059
Comments
Also strongly related to this is #2035 - if anyone wants to e.g. avoid multiple upgrade disruption by scaling up new nodes - today those nodes start at the old config and so will then shortly get upgraded and rebooted. I think other kube distros tend to do upgrades by just creating a new node at the new config and deleting the old node, so they're less subject to this (but that model can't easily apply on bare metal). If we had that PR we could at least have a phase like:
In combination with the above we could allow admins to choose between in-place updates and the burst upgrade strategy at least in cloud. |
I will spawn this as a separate issue but openshift should bias the scheduler to prefer nodes with capacity that are “up to date” over nodes that might have more capacity but are “not up to date”, because a not up to date node will eventually need to be drained. That implies there should be a label or other marker on nodes that by convention indicates a node that is “not up to date and likely to be drained in the near future” (which mco would manage) and the scheduler should avoid that without completely ruining other scheduling criteria. This is a soft criteria because it is better to disrupt a workload twice than let the workload dip below the preferred availability of the cluster. |
There's a spectrum here - some organizations might want to use this capability to guide node upgrades e.g. to nodes hosting less critical services first. I think though as long as the "upgrade ordering controller" ensures that at least one control plane machine or one worker are upgradable, then the MCO can generally work fine. We would also have a requirement that for major version upgrades, the control plane is upgraded first. (It seems simpler to apply the constraint across the board that both a control plane and worker are upgradable always) The precise definition of "upgradable" today comes down to this:
There's ongoing work/design on having the MCO actually manage its |
Another important detail here is not just the ordering but controlling when any updates are applied. There are a few approaches to that today:
Another big hammer is to create lots of machineconfigpools, perhaps even to the point of a pool per node. This is a pretty extreme scenario for the MCO, I think it would probably work but create Better approaches:
|
One possible implementation using the proposed lease might look like
To put it plainly, the lease is an intent to perform or prevent some disruptive action. This exposes an interface which others may build on top of useful components or procedures. This also allows other components to have less knowledge about the MCO, and the MCO to have less knowledge of these other components. |
this sounds just great - I've been looking at the lease type since a while now and this seems kind of related and super useful |
sounds also like similar to airlock https://github.com/poseidon/fleetlock#manual-intervention |
There's clearly overlap between "upgrade weighting" (this initial proposal) and "upgrade blocking/leases". We can clearly stretch "weighting" into including "blocking" per above. It doesn't solve interaction with other components like leases would (and we have that problem in general with machineAPI). In all of this remember though it's not just about the single reboot - in the general case I think we should handle "accumulated disruption" where pods are rescheduled onto nodes that are then immediately chosen for update. Well...OTOH leases would solve at least one "extreme" case where pods are pinned to nodes and we want basically per-node locking and updating at most one controlplane/worker at a time; in that case the admin or controller puts a lease on every other node. Hmm so I guess I am leaning towards leases. |
I am very much against this ordering mechanism. It's not clear to me how a user will successfully make use of this, short of plotting out on a whiteboard all of the pods and machines and a plan for how pods will be moved and machines updated. That mental image couldn't be farther from what we are trying to do with OpenShift and Kubernetes in general. As Clayton alluded to above, all of this scheduling should be automatic and done without user intervention. Unless I'm missing something, this is an anti-pattern and we should move in a different direction. |
This is a story that came up in 4.0 planning. We need a way for users to upgrade X% of their nodes to a given release and let that soak. This is somewhat a different concern than the original issue here. I'm in fan of informing the scheduler of the kubelet version and having it prioritize the newest ones. Seems like it would be easy to do. 'Do scheduling as normal to find suitable, sort by kubelet version, pick highest in that list'. Of course, if the scheduler only finds the first-fit rather than the best-fit, then this would be a much larger change (but based on how preemption works, possibly works in the former case). |
Nothing in this proposal requires that all nodes have a priority. Unlabeled nodes could e.g. be It also doesn't require planning out how pods move - that's a strongly related but orthogonal thing. That said, this request is motivated by one request at an extreme to do exactly what you're saying - the admin wants to choose which nodes to update. In this case there isn't concern about interaction with e.g. PDBs because the pods are pinned to nodes. |
I agree that marking the node as no longer schedulable is a nice way of identifying that a node is readying undergoing maintenance, I had been looking at a situation where I actually want a pod to schedule to a node, I just dont want the node upgraded. So an ability to cordon a node from an upgrade pool, but not from scheduling is the semantic I was trying to think through. |
Another observation: Should we add a rollout strategy concept on a machine config pool? One example rollout strategy would be update by topology, so I could ensure all nodes in zone X are updated before zone Y. This would aid workloads that are zone or location aware. |
Hopefully this comment relates to this particular issue. During a cluster upgrade we (UXD) would like to be able to surface in the console the difference between a node that is 'ready and up to date' vs. a node that is 'ready and waiting to update'. Is there something we could do so that users can quickly distinguish between the two? Perhaps an annotation - waiting to update - or something like that? |
This would increase your exposure to zone-outages, wouldn't it? What if you had a workload that happened to be incompatible with the incoming MachineConfig. You could have PDBs that protect it from excessive degradation during a randomized rollout, but if it gets sqeezed into the old-MachineConfig zone A, zones B and C completely update, and then zone A has an outage, you'd be immediately thrust into an outage for that workload. |
Today the MCO arbitrarily chooses a node to update from the candidates. We want to allow admins to avoid specific nodes entirely (for as long as they want) as well as guide upgrade ordering. This replaces the defunct etcd-specific code with support for a generic annotation `machineconfiguration.openshift.io/update-order` that allows an external controller (and/or human) to do both of these. Setting it to `0` will entirely skip that node for updates. Otherwise, higher values are preferred. Closes: openshift#2059
Today the MCO arbitrarily chooses a node to update from the candidates. We want to allow admins to avoid specific nodes entirely. (Aside: This replaces the defunct etcd-specific ordering code that we didn't end up using) Add an annotation `machineconfiguration.openshift.io/hold` that allows an external controller (and/or human) to avoid specific nodes. Related: openshift#2059
This could help in preventing nodes from upgrading to a kernel that is not supported by an out-of-tree driver. These special resource nodes can only run the workload. Draining those could kill the workload since the "general" CPU nodes cannot handle them. |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /close |
@openshift-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lifecycle frozen |
Today, any admin that wants to fully control node upgrade ordering can: First, pause the pool controlling those nodes (e.g. So for example, when you upgrade the cluster to pick up a new kernel security update, the Now, if you want to manually target a node for update, it should work to The machine-config-daemon (daemonset) running on that node will then react to that annotation, and apply it (drain + reboot as needed) in the same way it would as if the node controller had changed it. |
Today the MCO makes no attempt to apply any ordering to which nodes it updates from the candidates. One problem we're thinking about is (particularly on bare metal scenarios where there might be a lot of pods on a node, and possibly pods expensive to reschedule like CNV) that it's quite possible that workloads are disrupted multiple times for an OS upgrade.
When we go to drain a node, its pods will be rescheduled across the remaining nodes...and then we will upgrade one of those, quite possibly moving one of the workload pods again etc.
One idea here is to add the minimal hooks such that a separate controller could influence this today.
If for example we supported a label
machineconfig.openshift.io/upgrade-weight=42
and the node controller picked the highest weight node, then the separate controller could also e.g. mark $number nodes which are next in the upgrade ordering as unschedulable, ensuring that the drain from the current node doesn't land on them.Without excess capacity or changing the scheduler to more strongly prefer packing nodes it seems hard to avoid multiple disruption, but the label would allow this baseline integration.
The text was updated successfully, but these errors were encountered: