New update strategy #204

Closed
thesharp opened this issue Jan 21, 2020 · 7 comments

Comments

@thesharp

Feature Request

Desired Feature

New update strategy which would download updates, but leave the actual reboot process to the user.

Example Usage

This update strategy would be useful on single-node clusters to avoid unscheduled downtime.

Other Information

The user would need a way to know that an update is ready and waiting for a reboot to be applied. Metrics seem to be an appropriate way to deliver such notifications.

@jlebon
Member

jlebon commented Jan 21, 2020

There are probably two separate but related modes:

  1. wait on reboot: anything that reboots the system (including the user) will update the machine
  2. wait on permission AND reboot: sysadmin has to "release" the update -- without this, even rebooting won't update the system

The first can be done by just not using the finalization API (coreos/rpm-ostree#1814). The second can be done by using the finalization API, but stopping short of calling rpm-ostree finalize-deployment (which actually is a hidden command right now).
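As a rough illustration of how the two modes differ, here is a small conceptual model. It does not call rpm-ostree at all; the `StagedUpdate` class and its field names are made up for this sketch and only encode the behavior described above.

```python
from dataclasses import dataclass


@dataclass
class StagedUpdate:
    """Conceptual model of a downloaded (staged) update; not an rpm-ostree API."""

    finalization_locked: bool  # True when the update was staged via the finalization API
    released: bool = False     # True once the sysadmin has "released" the update


def applied_on_reboot(update: StagedUpdate) -> bool:
    """Return True if rebooting now would boot into the staged update."""
    if not update.finalization_locked:
        # Mode 1 (wait on reboot): any reboot, by anyone, applies the update.
        return True
    # Mode 2 (wait on permission AND reboot): a reboot applies the update
    # only after an explicit release.
    return update.released
```

In mode 2 an accidental reboot leaves the machine on its current deployment, which is the property that matters for avoiding unplanned upgrades.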

I can see the usefulness of this, and we could do it. Though anything that's not automatic is kinda counter to the mission :) I think just getting the periodic update strategy in and the fleet lock stuff fleshed out would be more beneficial.

@lucab
Contributor

lucab commented Jan 22, 2020

Thanks all for the feedback, I'll reply to everything here.

The "wait on reboot" method is what Container Linux does, and it's full of corner cases resulting in unplanned/accidental upgrades. IMHO that flow should only be used in manual operating mode, i.e. by stopping/disabling Zincati first. I am not planning to have an update strategy that works that way, as there are too many inherent corner cases and races.

Regarding the "wait on permission" case, we have a bit more space for design. Current plan is to check for permission with one of these strategies:

@thesharp would one of the last two suit you, perhaps with some homemade helper (e.g. a local HTTP container with custom logic to decide when finalization is allowed)?
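To make the "homemade helper" idea concrete, here is a minimal sketch of a local HTTP service that answers permission requests. Everything in it is an assumption made for illustration: the time-of-day rule, the status codes, the JSON shape, and the port are invented here and are not the protocol Zincati actually speaks.

```python
import json
from datetime import datetime, time
from http.server import BaseHTTPRequestHandler, HTTPServer


def finalization_allowed(now: datetime) -> bool:
    """Hypothetical custom logic: allow finalization only between 02:00 and 04:00."""
    return time(2, 0) <= now.time() < time(4, 0)


class PermissionHandler(BaseHTTPRequestHandler):
    """Answers every POST with 200 (allowed) or 409 (denied)."""

    def do_POST(self):
        allowed = finalization_allowed(datetime.now())
        body = json.dumps({"allowed": allowed}).encode()
        self.send_response(200 if allowed else 409)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8080) -> HTTPServer:
    """Build the helper, bound to localhost only; call .serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), PermissionHandler)
```

Keeping the decision in a plain function such as `finalization_allowed` means the custom logic can be tested without starting the server.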

If not, I'd still try to come up with a flow which does not completely bypass Zincati finalization (for example, giving permission to reboot only if a specific filepath exists) and which does not require SSHing to each node.
Assuming you end up with a fleet of >100 nodes to upgrade your way, how were you planning to automate it? Something like: getting an alert based on the metrics, and then scheduling an SSH task on each affected node to reboot?
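The file-based idea (permission to reboot only if a specific filepath exists) could look roughly like the check below. The marker path is hypothetical; nothing in Zincati defines this location.

```python
from pathlib import Path

# Hypothetical marker path, chosen only for this sketch.
ALLOW_MARKER = Path("/run/example-updates/allow-finalization")


def reboot_permitted(marker: Path = ALLOW_MARKER) -> bool:
    """Permission is granted only while the marker exists as a regular file."""
    return marker.is_file()
```

An operator, or a configuration-management tool, would create the file to release the update and remove it to withhold permission again.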

@thesharp
Author

@lucab I think periodic will suit my needs. I would be able to schedule an update window at a suitable time/day.
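For illustration, a weekly update window could be checked with something like the function below. This is only a sketch of the idea; the parameter names and the window model (one weekday, a start hour, a length in minutes) are assumptions, not how the periodic strategy is configured.

```python
from datetime import datetime, timedelta


def in_update_window(now: datetime, weekday: int, start_hour: int, length_minutes: int) -> bool:
    """Return True if `now` falls inside the weekly window.

    weekday follows datetime.weekday(): 0 is Monday, 6 is Sunday.
    The window opens at start_hour:00 on that weekday.
    """
    # Find the most recent window start at or before `now`.
    days_back = (now.weekday() - weekday) % 7
    start = (now - timedelta(days=days_back)).replace(
        hour=start_hour, minute=0, second=0, microsecond=0
    )
    if start > now:
        start -= timedelta(days=7)
    return start <= now < start + timedelta(minutes=length_minutes)
```

Anchoring on the most recent window start also handles windows that run past midnight into the next day.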

But it also would be nice to have some way of knowing that update/reboot was done w/o monitoring server's uptime. I'm thinking maybe some sort of webhook triggering before the reboot?

> Assuming you end up with a fleet of >100 nodes to upgrade your way

That case would only be suitable in something like a home lab. It isn't suitable for large installations.

@lucab
Contributor

lucab commented Jan 22, 2020

@thesharp ack, then I'll close this one and try to get #34 done sometime soon.

> But it also would be nice to have some way of knowing that update/reboot was done w/o monitoring server's uptime. I'm thinking maybe some sort of webhook triggering before the reboot?

This is exposed as a metric with an info label. It is a bit better than a webhook, as it is always correct in spite of failed upgrades or rollbacks. The result is the graph you see in the README.
For further details, I covered all of this recently at a Prometheus meetup: https://www.youtube.com/watch?v=TJAFktlhQi4 + https://speakerdeck.com/lucab/prometheus-metrics-from-host-local-services.
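As a sketch of how a consumer might read such an info-style metric (a sample with the constant value 1 whose labels carry the data), the function below pulls the labels out of Prometheus text exposition output. The metric name `example_os_info` used in the usage note is hypothetical, and the parser is deliberately minimal: it does not handle samples with a trailing timestamp.

```python
import re

# One sample line: name{labels} value
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# One label pair: key="value", allowing backslash escapes inside the value.
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def info_labels(exposition: str, metric: str) -> dict:
    """Return the labels of the first sample of `metric` whose value is 1."""
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE.match(line)
        if m and m.group("name") == metric and float(m.group("value")) == 1:
            return dict(_LABEL.findall(m.group("labels")))
    return {}
```

Calling `info_labels(text, "example_os_info")` before and after a maintenance window and comparing the results shows whether the booted version actually changed, which is why this stays correct after a failed upgrade or a rollback.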

@thesharp
Author

@lucab you probably should link local_exporter somewhere in https://github.com/coreos/zincati/blob/master/docs/usage/metrics.md

@lucab
Contributor

lucab commented Mar 10, 2020

I stabilized metrics and tweaked the docs in #243, which will be part of the next release (0.0.9).

The periodic strategy is still a work in progress, but I already have #34 to cover that. Closing this ticket.

@lucab
Contributor

lucab commented Mar 10, 2020

I also split the file-based strategy idea out into #245. I'm not going to pursue it at this time; periodic is still higher on my list.


3 participants