Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update the capacity to zero on shutdown/reset #502

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

SchSeba
Copy link
Collaborator

@SchSeba SchSeba commented Oct 2, 2023

When the device plugin is restarted, kubelet marks the resource as unhealthy, but still reports the resource as existing for a grace period (5 mins). If a pod is scheduled before the device plugin comes up, the pod create fails without a retryloop with an error message Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices <DEVICE_NAME>, which is unexpected.

This commit allow the device plugin to send an empty list of devices before the reset or shutdown

@SchSeba SchSeba force-pushed the update_devices_on_shutdown branch from b542bb8 to ee1e0e9 Compare October 2, 2023 08:01
@@ -309,6 +322,14 @@ func (rs *resourceServer) restart() error {
// Send terminate signal to ListAndWatch()
rs.termSignal <- true

// wait for the terminated signal or 5 second
Copy link
Member

@zeeke zeeke Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In resoruceServer.Stop() we do the termSignal-terminatedSignal handshake before stopping the grpcServer. n restart, we do it after rs.grpcServer.Stop(). Is it intentional?

It's not strictly related to this PR, but with these new changes I think it can raise an error on L189
stream.Send(resp)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right! sorry

@SchSeba SchSeba force-pushed the update_devices_on_shutdown branch from ee1e0e9 to 0e7b699 Compare October 2, 2023 08:25
@SchSeba
Copy link
Collaborator Author

SchSeba commented Oct 2, 2023

@zeeke please give it another look :)

@adrianchiris
Copy link
Contributor

hmm im not sure that thats what kubelet expects (being suddenly reported there are no devices when device plugin shuts
down or restarts)

will it remove entries from checkpoint file ?
what if there are pods already using resources ?

@@ -41,6 +41,7 @@ type resourceServer struct {
resourceNamePrefix string
grpcServer *grpc.Server
termSignal chan bool
terminatedSignal chan bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: termSignal and terminatedSignal only relate to ListAndWatch function. They don't relate to the resourceServer struct to terminate or see if it's terminated.

Renaming those two channels to listAndWatchStopSignal and listAndWatchFinishedSignal (or something similar) might improve a little the readability of this file.

@SchSeba
Copy link
Collaborator Author

SchSeba commented Oct 21, 2023

hmm im not sure that thats what kubelet expects (being suddenly reported there are no devices when device plugin shuts
down or restarts)
will it remove entries from checkpoint file ?
what if there are pods already using resources ?

From my tests it doesn't remove the checkpoint file and running pods continue to run (I will try to leave it down for a longer period to be sure there is no reconcile or something that will kill the running pods)

without this change, we have an issue when we take down the pod and the pod is allocated to that node the pod will not be able to start.

When the device plugin is restarted, kubelet marks the resource as unhealthy, but still reports the resource as existing for a grace period (5 mins). If a pod is scheduled before the device plugin comes up, the pod create fails without a retryloop with an error message Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices <DEVICE_NAME>, which is unexpected.

This commit allow the device plugin to send an empty list of devices before the reset or shutdown

Signed-off-by: Sebastian Sch <[email protected]>
@SchSeba
Copy link
Collaborator Author

SchSeba commented Oct 21, 2023

related to openshift/sriov-network-operator#812

@adrianchiris
Copy link
Contributor

adrianchiris commented Nov 23, 2023

@SchSeba been digging a bit into kubelet code

kubelet will report updated status to kubelet every 10 seconds[1] (every NodeStatusUpdateFrequency[2])

once device plugin plugin exits (its endpoint no longer valid), all devices are deemed unhealty[3]

[1] https://github.com/kubernetes/kubernetes/blob/d61cbac69aae97db1839bd2e0e86d68f26b353a7/pkg/kubelet/kubelet.go#L1637
[2] https://github.com/kubernetes/kubernetes/blob/d61cbac69aae97db1839bd2e0e86d68f26b353a7/staging/src/k8s.io/kubelet/config/v1beta1/types.go#L267
[3] https://github.com/kubernetes/kubernetes/blob/d61cbac69aae97db1839bd2e0e86d68f26b353a7/pkg/kubelet/nodestatus/setters.go#L342

so after at most 10 seconds node will report zero allocatable resources if plugin has exited.
if any pod is scheduled to the node before that, it will enter admission error.
setting devices explicitly as un-healthy will not help as unhealty devices is only reported to kube api during that sync loop.

what i suggest, is in sriov-network-operator after we remove device plugin to spin up a goroutine to keep deleting pods in admission error if they consume resources from device plugin until new device plugin is up or alternatively (even better option imo) once device plugin is up again clean up any pods which consume dp resources in admission error once.

LMK what you think.

@SchSeba
Copy link
Collaborator Author

SchSeba commented Dec 5, 2023

Interesting I will try to implement a POC in the operator but before that a question do you see any issue changing the number to 0 when we reboot? can that effect something?

@adrianchiris
Copy link
Contributor

  1. i do not know what the original authors of the API (grpc) intended.
  2. technically we dont achieve more reliability if we do it IMO.
  3. when device plugin exits, it will mark the resources reported as invalid (internally) IIRC. which is the same as if plugin exited.
    in both cases it only affects scheduler when node status is updated in apiserver.

@zeeke
Copy link
Member

zeeke commented Dec 6, 2023

IIUC, if the device plugin exits and leaves the resources unhealthy (for a while I guess), any deployed Pod will get an admission error. If the Pod comes from a Deployment or a Replicaset, the kube controller spawns a lot of subsequent Pods, creating a little junk:

NAME                                                 READY   STATUS                        RESTARTS   AGE    IP               NODE       NOMINATED NODE   READINESS GATES
deploy1-5686945dcc-2x9j5                              0/6     UnexpectedAdmissionError      0          43h    <none>           worker-2   <none>           <none>
deploy1-5686945dcc-4bdpz                              0/6     UnexpectedAdmissionError      0          43h    <none>           worker-2   <none>           <none>
deploy1-5686945dcc-4ht8f                              0/6     Init:ContainerStatusUnknown   0          43h    <none>           worker-2   <none>           <none>
deploy1-5686945dcc-4xv28                              0/6     UnexpectedAdmissionError      0          43h    <none>           worker-2   <none>           <none>

Setting the device count to 0 will prevent the scheduler from selecting the node, making it retry the same Pod in an exponential backoff fashion.

@coveralls
Copy link
Collaborator

coveralls commented May 24, 2024

Pull Request Test Coverage Report for Build 6377485447

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 31 of 39 (79.49%) changed or added relevant lines in 1 file are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 75.876%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/resources/server.go 31 39 79.49%
Files with Coverage Reduction New Missed Lines %
pkg/resources/server.go 2 78.52%
Totals Coverage Status
Change from base Build 6371847942: -0.2%
Covered Lines: 2013
Relevant Lines: 2653

💛 - Coveralls

@coveralls
Copy link
Collaborator

Pull Request Test Coverage Report for Build 6377734797

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 31 of 40 (77.5%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 75.867%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/resources/server.go 31 40 77.5%
Files with Coverage Reduction New Missed Lines %
pkg/resources/server.go 1 78.45%
Totals Coverage Status
Change from base Build 6371847942: -0.2%
Covered Lines: 2012
Relevant Lines: 2652

💛 - Coveralls

@coveralls
Copy link
Collaborator

Pull Request Test Coverage Report for Build 6377419645

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 31 of 41 (75.61%) changed or added relevant lines in 1 file are covered.
  • 19 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 75.876%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/resources/server.go 31 41 75.61%
Files with Coverage Reduction New Missed Lines %
pkg/resources/server.go 19 78.52%
Totals Coverage Status
Change from base Build 6371847942: -0.2%
Covered Lines: 2013
Relevant Lines: 2653

💛 - Coveralls

1 similar comment
@coveralls
Copy link
Collaborator

Pull Request Test Coverage Report for Build 6377419645

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 31 of 41 (75.61%) changed or added relevant lines in 1 file are covered.
  • 19 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 75.876%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/resources/server.go 31 41 75.61%
Files with Coverage Reduction New Missed Lines %
pkg/resources/server.go 19 78.52%
Totals Coverage Status
Change from base Build 6371847942: -0.2%
Covered Lines: 2013
Relevant Lines: 2653

💛 - Coveralls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants