
Volume not detached/migrated after node failure/shutdown #164

Closed · mrkamel opened this issue Nov 22, 2020 · 5 comments

Comments

mrkamel commented Nov 22, 2020

I have a simple echoserver pod running on one of three nodes, with a 10 GB Hetzner volume attached via the CSI driver.
When I shut down the node the pod is running on, the pod can't be migrated to another node; it gets stuck in ContainerCreating:

# kubectl get pods
NAME                          READY   STATUS              RESTARTS   AGE
echoserver-6b45d446c5-8lhdg   0/1     ContainerCreating   0          11m
echoserver-6b45d446c5-dk7cd   1/1     Terminating         0          29m

I'm using the latest driver with Kubernetes 1.19.
The log shows:

INFO[2020-11-22 13:12:01] I1122 13:12:01.154989    1010 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-bbea2704-6e0f-4230-88eb-a2803bef1600" (UniqueName: "kubernetes.io/csi/csi.hetzner.cloud^8133269") pod "echoserver-6b45d446c5-8lhdg" (UID: "d984870c-c8c7-42c2-9d90-d9e0d250a124")   component=kubelet
INFO[2020-11-22 13:12:01] E1122 13:12:01.165803    1010 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.hetzner.cloud^8133269 podName: nodeName:}" failed. No retries permitted until 2020-11-22 13:14:03.165677786 +0100 CET m=+1239.114378880 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-bbea2704-6e0f-4230-88eb-a2803bef1600\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^8133269\") pod \"echoserver-6b45d446c5-8lhdg\" (UID: \"d984870c-c8c7-42c2-9d90-d9e0d250a124\") "  component=kubelet

The Hetzner Cloud panel still shows the volume attached to the shut-down node.
When I power the shut-down node back on and let it rejoin the cluster, the pod and volume are finally migrated to the new node, but I didn't think this would be necessary.

My echoserver.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: hcloud-volumes
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echoserver
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      containers:
      - image: gcr.io/google_containers/echoserver:1.4
        imagePullPolicy: Always
        name: echoserver
        ports:
        - containerPort: 8080
        volumeMounts:
        - mountPath: "/data"
          name: my-csi-volume
      volumes:
      - name: my-csi-volume
        persistentVolumeClaim:
          claimName: csi-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: echo
spec:
  type: NodePort
  selector:
    app: echoserver
  ports:
    - port: 8080
      nodePort: 30100

Am I missing some config I'm not aware of? Or is this the intended behaviour?
Thanks in advance.
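For anyone hitting the same symptom: the stuck attach is visible in the cluster's VolumeAttachment objects. A quick check might look like this (a sketch; the pod name is taken from the example above, and the commands need a live cluster):

```shell
# List CSI VolumeAttachment objects: the ATTACHED/NODE columns show which
# node each volume is bound to, and the failed node will still be listed.
kubectl get volumeattachments

# The events of the stuck pod repeat the kubelet error quoted above.
kubectl describe pod echoserver-6b45d446c5-8lhdg
```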

@mrkamel mrkamel changed the title Volume not detached/migrated after node failure/shut-down Volume not detached/migrated after node failure/shutdown Nov 22, 2020
mrkamel (Author) commented Nov 22, 2020

OK, this seems to be Kubernetes-related rather than specific to the CSI driver:
kubernetes-retired/external-storage#838 (comment) and kubernetes/kubernetes#57531 (comment)
I wasn't aware of that.

@mrkamel mrkamel closed this as completed Nov 22, 2020
dstapp commented Jul 3, 2021

@mrkamel Have you found any workaround for this? Not having the VolumeAttachment removed when the pod enters the "Terminating" state (after a node failure), unless I delete it manually, completely defeats the purpose of self-healing imo. It's just manual healing then. That means a single-node setup would actually be less likely to fail than a multi-node setup where no node may fail, or the application breaks until I fix it manually.
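For reference, the manual intervention discussed in the linked Kubernetes issues looks roughly like this (a sketch, not an endorsement; the pod name is from the example above and the VolumeAttachment name is a placeholder):

```shell
# Force-delete the pod stuck in Terminating on the dead node, so the
# attach/detach controller knows the volume is no longer in use there.
kubectl delete pod echoserver-6b45d446c5-dk7cd --grace-period=0 --force

# Find the stale VolumeAttachment still pointing at the failed node...
kubectl get volumeattachments

# ...and delete it so the CSI driver can attach the volume elsewhere.
# (csi-<hash> stands in for the real attachment name printed above.)
kubectl delete volumeattachment csi-<hash>
```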

I understand the design concerns around forcefully detaching a volume from a node, but for certain use cases it's needed (or I'm missing something; if so, please let me know ;)).

I found some issue reports mentioning that this could probably be solved using TaintBasedEvictions, which I don't understand, since with taint-based eviction set up, the pod is simply moved to the Terminating state based on the node's taint, resulting in the same problem.

Btw, either this was changed in a more recent k8s version or it's part of the CSI driver's behaviour: running k8s on a private OpenStack deployment (OVH), the behaviour is different - but as said, that's an older k8s version.
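On the TaintBasedEvictions point: the eviction delay is tunable per pod through tolerations, which at least shortens the window before a replacement pod is scheduled. A sketch (the 60-second values are illustrative) of what could be added to the Deployment's pod template:

```yaml
# Illustrative only: evict the pod from an unreachable/not-ready node
# after 60s instead of the default 300s. Note this only moves the pod
# to Terminating sooner; it does not release the VolumeAttachment.
spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
```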

mrkamel (Author) commented Jul 4, 2021

@dprandzioch unfortunately not. I opened this while evaluating different solutions for a migration and, due to that behaviour of k8s, ended up using Docker Swarm with https://github.com/costela/docker-volume-hetzner for now, which works as desired for me. I would be interested in solutions as well, though.

s4ke (Contributor) commented Dec 6, 2022

So if I understand this correctly, this is mostly a limitation of the way k8s handles things, right? With CSI support coming to Docker Swarm, I am currently checking out what needs to be done once it is released. I was wondering whether we should invest time in checking that the CSI driver works, but if the behaviour is due to the provider, then this might be a reason to stick with costela/docker-volume-hetzner for the time being.

apricote (Member) commented
@s4ke Kubernetes explicitly handles this by forcefully detaching the volume from the server if the Kubernetes node is unreachable (after a configurable timeout). This is per se not allowed by the CSI spec, but there is a discussion about including this behaviour in the spec: container-storage-interface/spec#512
