
Rescheduled pod in a statefulset is not allowed to attach a PVC #4371

Closed
yanchicago opened this issue Nov 25, 2019 · 15 comments

Comments


yanchicago commented Nov 25, 2019

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
When a host failed, the pod running on that host was rescheduled. But the rescheduled pod failed to attach the PVC.

Expected behavior:
The rescheduled pod should be able to attach the PVC.

How to reproduce it (minimal and precise):

  1. The pod 'testrookblk-1' is part of a statefulset and runs on the worker-02 node.
  2. The worker-02 node failed, and the pod, stuck in the Terminating state, was deleted manually.
  3. The pod 'testrookblk-1' was rescheduled onto the worker-01 node, but it failed to attach its PVC. The agent pod on worker-01 shows the log below:

2019-11-25 19:19:16.896921 I | flexdriver: calling agent to attach volume replicapool/pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6
2019-11-25 19:19:16.902071 I | flexvolume: volume attachment record rook-ceph-system/pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6 exists for pod: rook-ceph/testrookblk-1
2019-11-25 19:19:16.911820 E | flexdriver: Attach volume replicapool/pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6 failed: failed to attach volume pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6 for pod rook-ceph/testrookblk-1. Volume is already attached by pod rook-ceph/testrookblk-1. Status Pending
2019-11-25 19:19:17.485606 I | flexdriver: calling agent to attach volume replicapool/pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6
2019-11-25 19:19:17.489907 I | flexvolume: volume attachment record rook-ceph-system/pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6 exists for pod: rook-ceph/testrookblk-1
2019-11-25 19:19:17.493923 E | flexdriver: Attach volume replicapool/pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6 failed: failed to attach volume pvc-1ee2c64e-0fb7-11ea-9e4c-fa163ed259f6 for pod rook-ceph/testrookblk-1. Volume is already attached by pod rook-ceph/testrookblk-1. Status Pending
  4. In the code (./pkg/daemon/ceph/agent/flexvolume/controller.go), only two conditions set allowAttach. A StatefulSet pod keeps the same name when rescheduled, so a StatefulSet pod that lands on a different node will never be allowed to attach.

Should a rescheduled pod (a pod with the same name in the same namespace, now running on a different node) be allowed to attach? Should the "attachment.Node == node" part of the check be removed?

    if err == nil && (attachment.PodNamespace == attachOpts.PodNamespace && attachment.PodName == attachOpts.Pod && attachment.Node == node) {

The "original" pod in the comment below cannot reliably be identified by name/namespace alone; for a StatefulSet, the rescheduled pod reuses the same name and namespace.
// Attachment is not orphaned. Original pod still exists. Don't attach.

The two conditions that currently set allowAttach are:
a) the attachment is orphaned
b) a same-named pod is starting on the same node

if err != nil {
    if !errors.IsNotFound(err) {
        return fmt.Errorf("failed to get pod CRD %s/%s. %+v", attachment.PodNamespace, attachment.PodName, err)
    }
    allowAttach = true
    logger.Infof("volume attachment record %s/%s is orphaned. Updating record with new attachment information for pod %s/%s", volumeattachObj.Namespace, volumeattachObj.Name, attachOpts.PodNamespace, attachOpts.Pod)
}
if err == nil && (attachment.PodNamespace == attachOpts.PodNamespace && attachment.PodName == attachOpts.Pod && attachment.Node == node) {
    allowAttach = true
    logger.Infof("volume attachment record %s/%s is starting on the same node. Updating record with new attachment information for pod %s/%s", volumeattachObj.Namespace, volumeattachObj.Name, attachOpts.PodNamespace, attachOpts.Pod)
}
...

} else {
    // Attachment is not orphaned. Original pod still exists. Don't attach.
    return fmt.Errorf("failed to attach volume %s for pod %s/%s. Volume is already attached by pod %s/%s. Status %+v",
        crdName, attachOpts.PodNamespace, attachOpts.Pod, attachment.PodNamespace, attachment.PodName, pod.Status.Phase)
}
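
For illustration, a minimal sketch of the relaxed condition I'm asking about (this is not the actual Rook code, just the check above with the node comparison dropped):

if err == nil && (attachment.PodNamespace == attachOpts.PodNamespace && attachment.PodName == attachOpts.Pod) {
    // Hypothetical relaxation: the record references the same pod name and
    // namespace, so allow the re-attach even though the volume was previously
    // attached on a different node.
    allowAttach = true
    logger.Infof("volume attachment record %s/%s references the same pod %s/%s; updating record with new attachment information", volumeattachObj.Namespace, volumeattachObj.Name, attachOpts.PodNamespace, attachOpts.Pod)
}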

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
  • Operator's logs, if necessary
  • Crashing pod(s) logs, if necessary

To get logs, use kubectl -n <namespace> logs <pod name>.
When pasting logs, always surround them with backticks or use the insert code button in the GitHub UI.
Read the GitHub documentation if you need help.

Environment:

  • OS (e.g. from /etc/os-release): CentOS 7
  • Kernel (e.g. uname -a): 3.10.0-957.10.1.el7.x86_64
  • Cloud provider or hardware configuration: openstack
  • Rook version (use rook version inside of a Rook Pod): v1.1.7
  • Storage backend version (e.g. for ceph do ceph -v): v14.2.4
  • Kubernetes version (use kubectl version): v1.14.3
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Tectonic
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
yanchicago added the bug label Nov 25, 2019

travisn commented Nov 25, 2019

  1. The pod 'testrookblk-1' is part of a statefulset and runs on the worker-02 node.
  2. The worker-02 node failed, and the pod, stuck in the Terminating state, was deleted manually.
  3. The pod 'testrookblk-1' was rescheduled onto the worker-01 node, but it failed to attach its PVC. The agent pod on worker-01 shows the log below:

If the rescheduled pod needs to start, the attachment CR needs to be cleaned up. How was the pod deleted manually? Can you delete the VolumeAttachment CR manually as well? This would free up the volume to be attached.

If the volume doesn't have a chance to detach properly, the flex driver will be at risk of causing corruption if the volume is mounted again before it knows the volume was detached from the previous node. This was the corruption we finally fixed in #4180.

Since the pod has the same name and was rescheduled to the new node, it seems like we could relax the node check. The pod spec should indicate that it is now assigned to the new node. The new pod with the same name can't start on the requested node unless it has already been removed from the previous node.
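
As a rough sketch (hypothetical, reusing the clientset and variables from controller.go above), before relaxing the node comparison the agent could confirm from the pod spec that the requesting pod is now assigned to this node:

newPod, err := c.context.Clientset.CoreV1().Pods(attachOpts.PodNamespace).Get(attachOpts.Pod, metav1.GetOptions{})
if err == nil && attachment.PodNamespace == attachOpts.PodNamespace &&
    attachment.PodName == attachOpts.Pod && newPod.Spec.NodeName == node {
    // Hypothetical check: the scheduler has assigned the same-named pod to this
    // node, which only happens after the previous instance was removed, so
    // re-attaching here should be reasonable.
    allowAttach = true
}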

@yanchicago
Author

@travisn The pod in terminating state was deleted with "kubectl delete". How do I find the VolumeAttachment CR?
"Since the pod has the same name and was rescheduled to the new node, it seems like we could relax the node check." I agree that allowing the new pod to attach on a different node is less risky.
"The new pod with the same name can't start on the requested node unless it has already been removed from the previous node." By "requested node", do you mean the previous node? Note: the new pod always has the same name if it's part of a StatefulSet. I think Kubernetes should guarantee that two pods with the same name can't co-exist in the same namespace.
I would appreciate a quick fix; this is critical functionality.


travisn commented Nov 25, 2019

The instances of the CRD volumes.rook.io are created by the flex driver in the same namespace as the operator (rook-ceph by default). As a workaround, try deleting the instance that corresponds to the volume with the issue. For example, confirm you found the correct attachment before deleting it:

$ kubectl -n rook-ceph get volumes.rook.io
NAME                                       AGE
pvc-ccdcd7f2-9d37-4276-bcbd-5a0e00d691b0   20s

$ kubectl -n rook-ceph get volumes pvc-ccdcd7f2-9d37-4276-bcbd-5a0e00d691b0 -o yaml
apiVersion: rook.io/v1alpha2
attachments:
- clusterName: rook-ceph
  mountDir: /var/lib/kubelet/pods/cba46244-1e52-4107-9ceb-3ceed9b7e9b8/volumes/ceph.rook.io~rook-ceph/pvc-ccdcd7f2-9d37-4276-bcbd-5a0e00d691b0
  node: minikube
  podName: wordpress-mysql-b9ddd6d4c-pxh5l
  podNamespace: default
  readOnly: false
kind: Volume
...
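
Once you have confirmed that the attachment record points at the old pod/node, deleting it should free the volume for the rescheduled pod; with the example above that would be something like kubectl -n rook-ceph delete volumes.rook.io pvc-ccdcd7f2-9d37-4276-bcbd-5a0e00d691b0 (substitute your own volume name).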

@yanchicago
Author

@travisn Thanks, the workaround did work. I'd appreciate it if the fix could be made available soon.

@yanchicago
Author

In the condition below, if the pod no longer exists on the node specified in the attachment, the attach should be allowed as well. Is that valid?

if err == nil && (attachment.PodNamespace == attachOpts.PodNamespace && attachment.PodName == attachOpts.Pod && attachment.Node != node) {                                   


travisn commented Jan 15, 2020

In the condition below, if the pod no longer exists on the node specified in the attachment, the attach should be allowed as well. Is that valid?

if err == nil && (attachment.PodNamespace == attachOpts.PodNamespace && attachment.PodName == attachOpts.Pod && attachment.Node != node) {                                   

@yanchicago If the pod doesn't exist, that is checked a few lines earlier and the attachment is allowed:

allowAttach = true
logger.Infof("volume attachment record %s/%s is orphaned. Updating record with new attachment information for pod %s/%s", volumeattachObj.Namespace, volumeattachObj.Name, attachOpts.PodNamespace, attachOpts.Pod)

Is that the check that you were looking for?

@yanchicago
Author

@travisn Unfortunately, the lookup below would also return the new pod that is currently requesting the attachment. If the lookup took into account the node/host recorded in the volume attachment record, it would work.
For a pod in a StatefulSet, the requesting pod itself satisfies the check below, so the "orphaned" volume scenario is not correctly identified.

 pod, err := c.context.Clientset.CoreV1().Pods(attachment.PodNamespace).Get(attachment.PodName, metav1.GetOptions{})
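
As a sketch of what I mean (hypothetical, building on the lookup above): if the pod returned by that Get is no longer on the node recorded in the attachment, the record could be treated as orphaned:

if err == nil && pod.Spec.NodeName != attachment.Node {
    // Hypothetical: only one pod with this name/namespace can exist at a time,
    // and it is not on the node recorded in the attachment, so the old instance
    // must be gone and the record is effectively orphaned.
    allowAttach = true
}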

Also, could you shed some light on how a "detach" request is routed to the agent? When a pod dies, does Kubernetes send the event directly to the agent, or is it routed through other components?


travisn commented Jan 16, 2020

Regarding #4180: the check for whether the pod is on the same node was added because there was corruption without it. The only way the driver can know it won't cause corruption is by blacklisting the other client so it cannot connect. That work is being done in the CSI driver in ceph/ceph-csi#578, but not in the flex driver currently.

@yanchicago
Author

Yes, I agree #4180 fixed the original corruption issue.
But it also blocks the rescheduled pod from attaching its volumes even though the old pod is gone. The "orphaned" volume check should take the node name into account.


stale bot commented Apr 23, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Apr 23, 2020
@yanchicago
Author

It's still an issue on flex-volume-based Rook clusters. Until flex volume clusters are no longer supported, is there any plan to work on this? This is related to the "multi-attach" issue; a similar situation exists in CSI-based Rook clusters.

stale bot removed the wontfix label Apr 23, 2020

yanchicago commented Apr 23, 2020

I closed it by mistake.
Similar CSI issue: ceph/ceph-csi#578

galexrt reopened this Apr 23, 2020

stale bot commented Jul 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jul 25, 2020

stale bot commented Aug 1, 2020

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

stale bot closed this as completed Aug 1, 2020

yanchicago commented Aug 1, 2020

https://github.com/rook/rook/issues/4371
The issue is the same as the above.
