Node crash causes volumes to become unavailable #740

Closed
faust64 opened this issue Dec 11, 2019 · 8 comments
Labels
wontfix This will not be worked on

Comments

faust64 commented Dec 11, 2019

Describe the bug

When one of my OpenShift nodes goes down, the (RWO/rbd) PVCs that were attached to that machine become unavailable.
Pods trying to restart on another node get stuck in ContainerCreating, with events mentioning Multi-Attach errors.

They remain stuck until I eventually reboot the faulty node, which can take hours or days (lab cluster).

Environment details

  • Image/version of Ceph CSI driver

Using the samples from the ./deploy folder in this repository.

$ oc get pods -o yaml | grep image: | sort -u
      image: quay.io/cephcsi/cephcsi:canary
      image: quay.io/k8scsi/csi-attacher:v1.2.0
      image: quay.io/k8scsi/csi-node-driver-registrar:v1.1.0
      image: quay.io/k8scsi/csi-provisioner:v1.3.0
$ oc get pods -o yaml | grep imageID: | sort -u
      imageID: quay.io/cephcsi/cephcsi@sha256:a33e8dc5f3726c438194c37ed6fb7645062f06bd3a5293e2fa321f265296ecb5
      imageID: quay.io/cephcsi/cephcsi@sha256:fa27d53692647d4186a55e3dd59ef6a5fd8c747c81dc0de0a8237522e05d04e1
      imageID: quay.io/k8scsi/csi-attacher@sha256:26fccd7a99d973845df1193b46ebdcc6ab8dc5f6e6be319750c471fce1742d13
      imageID: quay.io/k8scsi/csi-node-driver-registrar@sha256:13daf82fb99e951a4bff8ae5fc7c17c3a8fe7130be6400990d8f6076c32d4599
      imageID: quay.io/k8scsi/csi-provisioner@sha256:e615e92233248e72f046dd4f5fac40e75dd49f78805801953a7dfccf4eb09148

I tried updating to the latest tags I could find on quay today, but some of the Pods crashed, suggesting an argument being passed was no longer valid/supported. I eventually rolled back, as I noticed your copy still uses the same tags I already had.

  • helm chart version

NA

  • Kubernetes cluster version

Client Version: openshift-clients-4.2.0-201910041700
Server Version: 4.2.9
Kubernetes Version: v1.14.6+20e2756

  • Logs

Steps to reproduce

Steps to reproduce the behavior:

  1. Set up OCP (or k8s) and Ceph clusters,
  2. Deploy assets from ./deploy/(cephfs|rbd)/kubernetes/v1.14+ (to be precise, I deployed my copy before the cephfs resizer bit got added, 46 days ago),
  3. Create a PVC and start a deployment using that volume (see the sketch after this list),
  4. Pull the plug on the compute node hosting that Pod (or any violent shutdown that won't nicely terminate the CSI Pods running on the node),
  5. Wait.
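
For illustration, a minimal PVC plus Deployment along those lines could look like the sketch below (the ceph-storage storage class matches the outputs further down; the nextcloud image and object names are just placeholders):

$ oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nextcloud-demo
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ceph-storage
  resources:
    requests:
      storage: 50Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nextcloud-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nextcloud-demo
  template:
    metadata:
      labels:
        app: nextcloud-demo
    spec:
      containers:
      - name: nextcloud
        image: nextcloud
        volumeMounts:
        - name: data            # RWO rbd-backed volume that gets stuck after a node crash
          mountPath: /var/www/html
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: nextcloud-demo
EOF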

Actual results

The replacement pod is stuck in ContainerCreating, with events mentioning Multi-Attach errors.

Expected behavior

Back when CSI was not mandatory in OpenShift, when one of my nodes went down I might have seen Multi-Attach errors right when the replacement Pod started, but the outage wouldn't have exceeded 5 to 10 minutes.

Additional context

I've been noticing that issue every time I lose a node. And currently those RHCOS nodes don't behave well as KVM guests, so I see crashes like these on a daily basis.

$ oc get nodes
NAME   STATUS     ROLES    AGE   VERSION
compute1.demo   NotReady   worker   52d   v1.14.6+31a56cf75
compute2.demo   Ready      worker   52d   v1.14.6+31a56cf75
...
$ oc get pods
NAME                              READY   STATUS              RESTARTS   AGE
nextcloud-demo-17-jps7f           0/2     ContainerCreating   0          13m
nextcloud-demo-17-vvndr           2/2     Terminating         0          116m
...
$ oc describe pod nextcloud-demo-17-jps7f
...
Events:
  Type     Reason              Age               From                                                   Message
  ----     ------              ----              ----                                                   -------
  Normal   Scheduled           13m               default-scheduler                                      Successfully assigned wsweet-demo/nextcloud-demo-17-jps7f to compute5.demo
  Warning  FailedAttachVolume  13m               attachdetach-controller                                Multi-Attach error for volume "pvc-030502d0-1c09-11ea-b52c-52540069748c" Volume is already used by pod(s) nextcloud-demo-17-vvndr
  Warning  FailedMount         5s (x6 over 11m)  kubelet, compute5.demo  Unable to mount volumes for pod "nextcloud-demo-17-jps7f_wsweet-demo(99f80fdd-1c2f-11ea-9492-525400bec0a4)": timeout expired waiting for volumes to attach or mount for pod "wsweet-demo"/"nextcloud-demo-17-jps7f". list of unmounted volumes=[data]. list of unattached volumes=[apachesites data default-token-hssgl]
$ oc get pvc|grep nextc
nextcloud-demo            Bound    pvc-030502d0-1c09-11ea-b52c-52540069748c   50Gi       RWO            ceph-storage   4h52m
$ oc get pods -n csi-ceph -o wide | grep compute1
csi-cephfsplugin-njlt9                          3/3     Running   15         4d5h    10.42.253.20   compute1.demo   <none>           <none>
csi-rbdplugin-ltp58                             3/3     Running   78         52d     10.42.253.20   compute1.demo   <none>           <none>
$ oc logs -n csi-ceph csi-rbdplugin-ltp58 -c csi-rbdplugin
Error from server: Get https://10.42.253.20:10250/containerLogs/rook-ceph/csi-rbdplugin-ltp58/csi-rbdplugin: dial tcp 10.42.253.20:10250: connect: no route to host
[obviously]

...
$ virsh destroy compute1
$ virsh start compute1
...
$ oc get nodes | grep compute1
compute1.demo   Ready    worker   52d   v1.14.6+31a56cf75
$ oc logs -n csi-ceph csi-rbdplugin-ltp58 -c csi-rbdplugin
I1211 16:16:28.730546    4843 cephcsi.go:104] Driver version: canary and Git version: e4b4c70d9267d030ab3663d9a6b0fc20f03d9836
I1211 16:16:28.730741    4843 cephcsi.go:159] Starting driver type: rbd with name: rbd.csi.ceph.com
I1211 16:16:29.186593    4843 mount_linux.go:170] Cannot run systemd-run, assuming non-systemd OS
I1211 16:16:29.186626    4843 mount_linux.go:171] systemd-run failed with: exit status 1
I1211 16:16:29.186649    4843 mount_linux.go:172] systemd-run output: Failed to create bus connection: No such file or directory
I1211 16:16:29.226281    4843 server.go:118] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}
I1211 16:16:29.938617    4843 utils.go:157] ID: 1 GRPC call: /csi.v1.Identity/GetPluginInfo
I1211 16:16:29.938663    4843 utils.go:158] ID: 1 GRPC request: {}
I1211 16:16:29.940166    4843 identityserver-default.go:37] ID: 1 Using default GetPluginInfo
I1211 16:16:29.940185    4843 utils.go:163] ID: 1 GRPC response: {"name":"rbd.csi.ceph.com","vendor_version":"canary"}
I1211 16:16:29.964006    4843 utils.go:157] ID: 2 GRPC call: /csi.v1.Node/NodeGetInfo
I1211 16:16:29.964052    4843 utils.go:158] ID: 2 GRPC request: {}
I1211 16:16:29.964858    4843 nodeserver-default.go:58] ID: 2 Using default NodeGetInfo
I1211 16:16:29.964870    4843 utils.go:163] ID: 2 GRPC response: {"node_id":"compute1.demo"}
I1211 16:16:30.912315    4843 utils.go:157] ID: 3 Req-ID: 0001-0024-f980b615-746a-4e5e-b429-a364fd69838b-0000000000000003-035a230b-1c09-11ea-9aa9-0a580a8105cf GRPC call: /csi.v1.Node/NodeUnpublishVolume
I1211 16:16:30.912344    4843 utils.go:158] ID: 3 Req-ID: 0001-0024-f980b615-746a-4e5e-b429-a364fd69838b-0000000000000003-035a230b-1c09-11ea-9aa9-0a580a8105cf GRPC request: {"target_path":"/var/lib/kubelet/pods/080ab6d5-1c09-11ea-9492-525400bec0a4/volumes/kubernetes.io~csi/pvc-03165680-1c09-11ea-b52c-52540069748c/mount","volume_id":"0001-0024-f980b615-746a-4e5e-b429-a364fd69838b-0000000000000003-035a230b-1c09-11ea-9aa9-0a580a8105cf"}
I1211 16:16:30.912984    4843 utils.go:157] ID: 4 Req-ID: 0001-0024-f980b615-746a-4e5e-b429-a364fd69838b-0000000000000003-18acda65-f8f0-11e9-bcb2-0a580a83001b GRPC call: /csi.v1.Node/NodeUnpublishVolume
I1211 16:16:30.913010    4843 utils.go:158] ID: 4 Req-ID: 0001-0024-f980b615-746a-4e5e-b429-a364fd69838b-0000000000000003-18acda65-f8f0-11e9-bcb2-0a580a83001b GRPC request: {"target_path":"/var/lib/kubelet/pods/25511932-191b-11ea-a362-525400bec0a4/volumes/kubernetes.io~csi/pvc-18638f68-f8f0-11e9-8b94-52540069748c/mount","volume_id":"0001-0024-f980b615-746a-4e5e-b429-a364fd69838b-0000000000000003-18acda65-f8f0-11e9-bcb2-0a580a83001b"}
I1211 16:16:30.920191    4843 utils.go:157] ID: 5 Req-ID: 0001-0024-f980b615-746a-4e5e-b429-a364fd69838b-0000000000000003-035843a1-1c09-11ea-9aa9-0a580a8105cf GRPC call: /csi.v1.Node/NodeUnpublishVolume
I1211 16:16:30.917747    4843 nodeserver.go:434] ID: 4 Req-ID: 0001-0024-f980b615-746a-4e5e-b429-a364fd69838b-0000000000000003-18acda65-f8f0-11e9-bcb2-0a580a83001b targetPath: /var/lib/kubelet/pods/25511932-191b-11ea-a362-525400bec0a4/volumes/kubernetes.io~csi/pvc-18638f68-f8f0-11e9-8b94-52540069748c/mount has already been deleted
...
$ oc get pod | grep nextcloud
nextcloud-demo-17-jps7f           1/2     Running   0          17m
@lorenzomorandini

We have the same problem here too. If a node goes down, the new pods are stuck in ContainerCreating because the volume(s) are locked on the failed node.


Madhu-1 commented Dec 12, 2019

@humblec is this something to do with fencing? How do we handle this scenario?

CC @dillaman @nixpanic @ShyamsundarR


zhucan commented Dec 12, 2019

@faust64 @lorenzomorandini
Once a pod is marked for deletion (in this case, by the node eviction process), the a/d controller will only proceed with the detach if the pod has no running containers per the pod object. Since the node is down, it will never update the pod object, which causes a chicken-and-egg problem. Detach, in this case, requires the user to "force delete" the pod, which will cause the a/d controller to issue a detach.

There is another safe-unmount check where the a/d controller waits a max of 6 min for a clean unmount after which it considers the node lost and issues a unilateral detach. We should consider similar logic for this check.
—— from saad-ali

IDK if you follow the work of the storage SIG, but this was discussed 3 weeks ago at our F2F meeting in San Diego. There is a doc that can help you: https://docs.google.com/presentation/d/1UmZA37nFnp5HxTDtsDgRh0TRbcwtUMzc1XScf5C9Tqc/edit?usp=sharing
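
To illustrate the force delete mentioned above, a sketch using the stuck pod from the report (namespace taken from the pod events; --grace-period=0 --force are the standard kubectl/oc flags for a force delete):

$ oc delete pod nextcloud-demo-17-vvndr -n wsweet-demo --grace-period=0 --force
# once the pod object is gone, the a/d controller can issue the detach for its volume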

@dillaman

@humblec is this something to do with fencing? How do we handle this scenario?

Negative -- this is k8s behaving correctly (if not sub-optimally). Just because a node is unresponsive to k8s doesn't mean that workloads aren't still running and actively manipulating storage (e.g. imagine the management network is down but the storage network is still active).

There is really nothing for ceph-csi to do in this instance. It would be great if one day we could storage fence "dead" nodes [1], but the CSI spec doesn't provide us enough information to support that at this time (not that it would impact this issue, though, since it's just k8s).

[1] #578

@yanchicago

Does anyone have an "official" workaround? Currently, I just blacklist the watcher of the CSI volume, and the newly scheduled pod is able to attach to the volume. Someone mentioned "detaching" the volume on the previous host. Is that necessary?
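
To illustrate what blacklisting the watcher looks like on the Ceph side, roughly (the pool/image and watcher address are placeholders; newer Ceph releases rename the command to ceph osd blocklist):

$ rbd status <pool>/<csi-volume-image>      # shows the watcher left behind by the dead node
$ ceph osd blacklist add <watcher-address>  # fences that client so the new node can map the image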


stale bot commented Sep 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Sep 27, 2020

stale bot commented Oct 4, 2020

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

stale bot closed this as completed on Oct 4, 2020

faust64 commented Feb 1, 2021

Though the issue is not solved yet, the latest CSI releases do offer a workaround (tested with cephcsi:3.2.1).

You can use kubectl get volumeattachment to retrieve the list of attachments, each mapping a PersistentVolume to a Node.
You can then kubectl delete volumeattachment csi-xxxyyyzzz to release a volume, at which point the Multi-Attach errors should go away and the replacement node will be able to restart your Pod.
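
For example, with the PV from the outputs above (the volumeattachment name csi-xxxyyyzzz is a placeholder, as before):

$ oc get volumeattachment | grep pvc-030502d0-1c09-11ea-b52c-52540069748c
$ oc delete volumeattachment csi-xxxyyyzzz
# the Multi-Attach events clear and the replacement Pod can mount the volume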

Having tested this by pulling network cables out, without rebooting nodes, I could confirm that once the node gets back to Ready, its Terminating containers are properly shut down, volumes unmapped, etc.
