
Rescheduled pod in a statefulset is not allowed to attach a PVC #4371

Closed
yanchicago opened this issue Nov 25, 2019 · 15 comments

Comments


yanchicago commented Nov 25, 2019

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
When a host failed, the pod running on that host was rescheduled. But the rescheduled pod failed to attach the PVC.

Expected behavior:
The rescheduled pod should be able to attach the PVC.

How to reproduce it (minimal and precise):

  1. The pod 'testrookblk-1' is part of a statefulset and runs on the worker-02 node.
  2. The worker-02 node failed, and the pod, stuck in the Terminating state, was deleted manually.
  3. The pod 'testrookblk-1' was rescheduled onto the worker-01 node, but it failed to attach its PVC. The agent pod on worker-01 shows the log below:

2019-11-25 19:19:16.896921 I | flexdriver: calling agent to attach volume replicapool/pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6
2019-11-25 19:19:16.902071 I | flexvolume: volume attachment record rook-ceph-system/pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6 exists for pod: rook-ceph/testrookblk-1
2019-11-25 19:19:16.911820 E | flexdriver: Attach volume replicapool/pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6 failed: failed to attach volume pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6 for pod rook-ceph/testrookblk-1. Volume is already attached by pod rook-ceph/testrookblk-1. Status Pending
2019-11-25 19:19:17.485606 I | flexdriver: calling agent to attach volume replicapool/pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6
2019-11-25 19:19:17.489907 I | flexvolume: volume attachment record rook-ceph-system/pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6 exists for pod: rook-ceph/testrookblk-1
2019-11-25 19:19:17.493923 E | flexdriver: Attach volume replicapool/pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6 failed: failed to attach volume pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6 for pod rook-ceph/testrookblk-1. Volume is already attached by pod rook-ceph/testrookblk-1. Status Pending
  4. In the code (./pkg/daemon/ceph/agent/flexvolume/controller.go), only two conditions set allowAttach. A StatefulSet pod keeps the same name when rescheduled, so a StatefulSet pod that lands on a different node will never be allowed to attach.

Should a rescheduled pod (a pod with the same name in the same namespace, now running on a different node) be allowed to attach? Should the "attachment.Node == node" part of the check be removed?

    if err == nil && (attachment.PodNamespace == attachOpts.PodNamespace && attachment.PodName == attachOpts.Pod && attachment.Node == node) {

The "original" pod in the comment below cannot reliably be identified by name/namespace alone; for a StatefulSet, the rescheduled pod reuses the same name and namespace.
// Attachment is not orphaned. Original pod still exists. Don't attach.

The two conditions that currently set allowAttach are:
a) the attachment is orphaned
b) a same-named pod is starting on the same node

if err != nil {
    if !errors.IsNotFound(err) {
        return fmt.Errorf("failed to get pod CRD %s/%s. %+v", attachment.PodNamespace, attachment.PodName, err)
    }
    allowAttach = true
    logger.Infof("volume attachment record %s/%s is orphaned. Updating record with new attachment information for pod %s/%s", volumeattachObj.Namespace, volumeattachObj.Name, attachOpts.PodNamespace, attachOpts.Pod)
}
if err == nil && (attachment.PodNamespace == attachOpts.PodNamespace && attachment.PodName == attachOpts.Pod && attachment.Node == node) {
    allowAttach = true
    logger.Infof("volume attachment record %s/%s is starting on the same node. Updating record with new attachment information for pod %s/%s", volumeattachObj.Namespace, volumeattachObj.Name, attachOpts.PodNamespace, attachOpts.Pod)
}
...

} else {
    // Attachment is not orphaned. Original pod still exists. Don't attach.
    return fmt.Errorf("failed to attach volume %s for pod %s/%s. Volume is already attached by pod %s/%s. Status %+v",
        crdName, attachOpts.PodNamespace, attachOpts.Pod, attachment.PodNamespace, attachment.PodName, pod.Status.Phase)
}
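
For illustration, a minimal sketch of the relaxed condition I'm asking about (this is not the actual Rook code, just the check above with the node comparison dropped):

if err == nil && (attachment.PodNamespace == attachOpts.PodNamespace && attachment.PodName == attachOpts.Pod) {
    // Hypothetical relaxation: the record references the same pod name and
    // namespace, so allow the re-attach even though the volume was previously
    // attached on a different node.
    allowAttach = true
    logger.Infof("volume attachment record %s/%s references the same pod %s/%s; updating record with new attachment information", volumeattachObj.Namespace, volumeattachObj.Name, attachOpts.PodNamespace, attachOpts.Pod)
}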

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
  • Operator's logs, if necessary
  • Crashing pod(s) logs, if necessary

To get logs, use kubectl -n <namespace> logs <pod name>.
When pasting logs, always surround them with backticks or use the insert code button in the GitHub UI.
Read the GitHub documentation if you need help.

Environment:

  • OS (e.g. from /etc/os-release): CentOS 7
  • Kernel (e.g. uname -a): 3.10.0-957.10.1.el7.x86_64
  • Cloud provider or hardware configuration: openstack
  • Rook version (use rook version inside of a Rook Pod): v1.1.7
  • Storage backend version (e.g. for ceph do ceph -v): v14.2.4
  • Kubernetes version (use kubectl version): v1.14.3
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Tectonic
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
yanchicago added the bug label Nov 25, 2019

travisn commented Nov 25, 2019

  1. The pod 'testrookblk-1' is part of a statefulset and runs on the worker-02 node.
  2. The worker-02 node failed, and the pod, stuck in the Terminating state, was deleted manually.
  3. The pod 'testrookblk-1' was rescheduled onto the worker-01 node, but it failed to attach its PVC. The agent pod on worker-01 shows the log below:

If the rescheduled pod needs to start, the attachment CR needs to be cleaned up. How was the pod deleted manually? Can you delete the VolumeAttachment CR manually as well? This would free up the volume to be attached.

If the volume doesn't have a chance to detach properly, the flex driver will be at risk of causing corruption if the volume is mounted again before it knows the volume was detached from the previous node. This was the corruption we finally fixed in #4180.

Since the pod has the same name and was rescheduled to the new node, it seems like we could relax the node check. The pod spec should indicate that it is now assigned to the new node. The new pod with the same name can't start on the requested node unless it has already been removed from the previous node.
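
As a rough sketch (hypothetical, reusing the clientset and variables from controller.go above), before relaxing the node comparison the agent could confirm from the pod spec that the requesting pod is now assigned to this node:

newPod, err := c.context.Clientset.CoreV1().Pods(attachOpts.PodNamespace).Get(attachOpts.Pod, metav1.GetOptions{})
if err == nil && attachment.PodNamespace == attachOpts.PodNamespace &&
    attachment.PodName == attachOpts.Pod && newPod.Spec.NodeName == node {
    // Hypothetical check: the scheduler has assigned the same-named pod to this
    // node, which only happens after the previous instance was removed, so
    // re-attaching here should be reasonable.
    allowAttach = true
}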

@yanchicago
Author

@travisn The pod in terminating state was deleted with "kubectl delete". How do I find the VolumeAttachment CR?
"Since the pod has the same name and was rescheduled to the new node, it seems like we could relax the node check." I agree that allowing the new pod to attach on a different node is less risky.
"The new pod with the same name can't start on the requested node unless it has already been removed from the previous node." By "requested node", do you mean the previous node? Note: the new pod always has the same name if it's part of a StatefulSet. I think Kubernetes should guarantee that two pods with the same name can't co-exist in the same namespace.
I would appreciate a quick fix; this is critical functionality.


travisn commented Nov 25, 2019

The instances of the CRD volumes.rook.io are created by the flex driver in the same namespace as the operator (rook-ceph by default). As a workaround, try deleting the instance that corresponds to the volume with the issue. For example, confirm you found the correct attachment before deleting it:

$ kubectl -n rook-ceph get volumes.rook.io
NAME                                       AGE
pvc-ccdcd7f2-9d37-4276-bcbd-5a0e00d691b0   20s

$ kubectl -n rook-ceph get volumes pvc-ccdcd7f2-9d37-4276-bcbd-5a0e00d691b0 -o yaml
apiVersion: rook.io/v1alpha2
attachments:
- clusterName: rook-ceph
  mountDir: /var/lib/kubelet/pods/cba46244-1e52-4107-9ceb-3ceed9b7e9b8/volumes/ceph.rook.io~rook-ceph/pvc-ccdcd7f2-9d37-4276-bcbd-5a0e00d691b0
  node: minikube
  podName: wordpress-mysql-b9ddd6d4c-pxh5l
  podNamespace: default
  readOnly: false
kind: Volume
...
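
Once you have confirmed that the attachment record points at the old pod/node, deleting it should free the volume for the rescheduled pod; with the example above that would be something like kubectl -n rook-ceph delete volumes.rook.io pvc-ccdcd7f2-9d37-4276-bcbd-5a0e00d691b0 (substitute your own volume name).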

@yanchicago
Author

@travisn Thanks, the workaround did work. I'd appreciate it if the fix could be made available soon.

@yanchicago
Author

In the condition below, if the pod no longer exists on the node specified in the attachment, the attach should be allowed as well. Is that valid?

if err == nil && (attachment.PodNamespace == attachOpts.PodNamespace && attachment.PodName == attachOpts.Pod && attachment.Node != node) {                                   


travisn commented Jan 15, 2020

In the condition below, if the pod no longer exists on the node specified in the attachment, the attach should be allowed as well. Is that valid?

if err == nil && (attachment.PodNamespace == attachOpts.PodNamespace && attachment.PodName == attachOpts.Pod && attachment.Node != node) {                                   

@yanchicago If the pod doesn't exist, that is checked a few lines earlier and the attachment is allowed:

allowAttach = true
logger.Infof("volume attachment record %s/%s is orphaned. Updating record with new attachment information for pod %s/%s", volumeattachObj.Namespace, volumeattachObj.Name, attachOpts.PodNamespace, attachOpts.Pod)

Is that the check that you were looking for?

@yanchicago
Author

@travisn Unfortunately, the lookup below would also return the new pod that is currently requesting the attachment. If the lookup took into account the node/host recorded in the volume attachment record, it would work.
For a pod in a StatefulSet, the requesting pod itself satisfies the check below, so the "orphaned" volume scenario is not correctly identified.

 pod, err := c.context.Clientset.CoreV1().Pods(attachment.PodNamespace).Get(attachment.PodName, metav1.GetOptions{})
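
As a sketch of what I mean (hypothetical, building on the lookup above): if the pod returned by that Get is no longer on the node recorded in the attachment, the record could be treated as orphaned:

if err == nil && pod.Spec.NodeName != attachment.Node {
    // Hypothetical: only one pod with this name/namespace can exist at a time,
    // and it is not on the node recorded in the attachment, so the old instance
    // must be gone and the record is effectively orphaned.
    allowAttach = true
}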

Also, could you shed some light on how a "detach" request is routed to the agent? When a pod dies, does Kubernetes send the event directly to the agent, or is it routed through other components?


travisn commented Jan 16, 2020

Regarding #4180: the check for whether the pod is on the same node was added because there was corruption without it. The only way the driver can know it won't cause corruption is by blacklisting the other client so it cannot connect. That work is being done in the CSI driver in ceph/ceph-csi#578, but not in the flex driver currently.

@yanchicago
Author

Yes, I agree #4180 fixed the original corruption issue.
But it also blocks the rescheduled pod from attaching its volumes even though the old pod is gone. The "orphaned" volume check should take the node name into account.


stale bot commented Apr 23, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Apr 23, 2020
@yanchicago
Author

It's still an issue on flex-volume-based Rook clusters. Until flex volume clusters are no longer supported, is there any plan to work on this? This is related to the "multi-attach" issue; a similar situation exists in CSI-based Rook clusters.

stale bot removed the wontfix label Apr 23, 2020

yanchicago commented Apr 23, 2020

I closed it by mistake.
Similar CSI issue: ceph/ceph-csi#578

galexrt reopened this Apr 23, 2020

stale bot commented Jul 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jul 25, 2020

stale bot commented Aug 1, 2020

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

stale bot closed this as completed Aug 1, 2020

yanchicago commented Aug 1, 2020

https://github.com/rook/rook/issues/4371
The issue is the same as the above.
