Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process drain-failure nodes at the end #394

Merged
merged 7 commits into from
Oct 25, 2023

Conversation

ssheladiya
Copy link
Contributor

@ssheladiya ssheladiya commented Oct 19, 2023

Problem:
At present, if upgrade-manager comes across a node failure, it becomes stuck and fails to proceed with draining other nodes until the failed node is repaired. This may cause significant delays if the drain-failure node requires a considerable amount of time to fix

Proposal:
Upgrade-manager skip the drain-failed nodes and return to them when all the other nodes in the InstanceGroups are rotated

Changes introduced by this PR:

  • Marks instances that failed to drain with failed-drain value
  • Upgrade-manager skips failed-drain instances while selecting the target and moves on to other instances
  • Once no healthy instances are left to rotate, it tries to rotate the failed-drain instances again

controllers/upgrade.go Outdated Show resolved Hide resolved
controllers/upgrade.go Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Oct 19, 2023

Codecov Report

Merging #394 (aaa0843) into master (7c245ca) will not change coverage.
The diff coverage is 66.66%.

@@           Coverage Diff           @@
##           master     #394   +/-   ##
=======================================
  Coverage   39.09%   39.09%           
=======================================
  Files           7        7           
  Lines         931      931           
=======================================
  Hits          364      364           
  Misses        540      540           
  Partials       27       27           
Flag Coverage Δ
unittests 39.09% <66.66%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
controllers/cloud.go 65.38% <ø> (ø)
controllers/upgrade.go 46.21% <66.66%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ssheladiya ssheladiya marked this pull request as ready for review October 24, 2023 02:52
Copy link
Collaborator

@shreyas-badiger shreyas-badiger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ZihanJiang96 ZihanJiang96 merged commit e4a79bf into keikoproj:master Oct 25, 2023
4 checks passed
@shreyas-badiger shreyas-badiger mentioned this pull request Oct 30, 2023
shreyas-badiger added a commit that referenced this pull request Dec 5, 2023
* early-cordon nodes

Signed-off-by: sbadiger <[email protected]>

* early cordon

Signed-off-by: sbadiger <[email protected]>

* include context in cordon and drain functions

* cordon only drifted instances

* add unit tests

* Update aws-sdk-go-cache to v0.0.2 (#399)

Signed-off-by: Todd Ekenstam <[email protected]>
Signed-off-by: sbadiger <[email protected]>

* Process drain-failure nodes at the end (#394)

* Process drain-failures at the end
Signed-off-by: ssheladiya <[email protected]>
Signed-off-by: sbadiger <[email protected]>

* early-cordon nodes

Signed-off-by: sbadiger <[email protected]>

* early cordon

Signed-off-by: sbadiger <[email protected]>

* include context in cordon and drain functions

Signed-off-by: sbadiger <[email protected]>

* Release v1.0.8 (#400)

Signed-off-by: sbadiger <[email protected]>

* cordon only drifted instances

Signed-off-by: sbadiger <[email protected]>

* add unit tests

Signed-off-by: sbadiger <[email protected]>

* resolve merge conflicts

* update go.sum

* resolve test errors

* remove cordon as upgrade strategy

* remove space

* improve test coverage

* improve code coverage

* remove redundant code

* remove unused imports

* Update controllers/providers/kubernetes/nodes.go

Co-authored-by: Venkata Gunapati <[email protected]>

* uncordon the nodes

* error handling for uncordoning

* add tests

* handle uncordon scenario properly

* Update controllers/providers/kubernetes/nodes.go

Co-authored-by: Venkata Gunapati <[email protected]>

* fix typo

* fix lint errors

* default the feature to false

---------

Signed-off-by: sbadiger <[email protected]>
Signed-off-by: Todd Ekenstam <[email protected]>
Co-authored-by: Todd Ekenstam <[email protected]>
Co-authored-by: Siddharth Sheladiya <[email protected]>
Co-authored-by: Venkata Gunapati <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants