Node crash causes volumes to become unavailable #740
Comments
We have the same problem here too. If a node goes down, the new pods are stuck in ContainerCreating because the volume(s) are locked on the failed node.
@humblec is this something to do with fencing? How should this scenario be handled?
@faust64 @lorenzomorandini There is another safe-unmount check where the attach/detach (a/d) controller waits a maximum of 6 minutes for a clean unmount, after which it considers the node lost and issues a unilateral detach. We should consider similar logic for this check. IDK if you follow the work of the storage SIG, but this was discussed 3 weeks ago at our F2F meeting in San Diego. There is a doc that can help you: https://docs.google.com/presentation/d/1UmZA37nFnp5HxTDtsDgRh0TRbcwtUMzc1XScf5C9Tqc/edit?usp=sharing
Negative -- this is k8s behaving correctly (if not sub-optimally). Just because a node is unresponsive to k8s doesn't mean that workloads aren't still running and actively manipulating storage (e.g. imagine the management network is down but the storage network is still active). There is really nothing for ceph-csi to do in this instance. It would be great if one day we could storage-fence "dead" nodes [1], but the CSI spec doesn't provide us enough information to support that at this time (not that it would impact this issue, though, since it's just k8s). [1] #578
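For anyone who understands and accepts the risk called out above (the "dead" node may still be writing), a common manual recovery on the Kubernetes side is to remove the stale pod and attachment objects so the a/d controller lets the replacement pod proceed. A minimal sketch, assuming hypothetical pod, namespace, and VolumeAttachment names:

```bash
# Force-remove the pod that was running on the lost node; once the pod object
# is gone from the API server, the attach/detach controller no longer counts
# the volume as "in use" on that node.
kubectl -n my-namespace delete pod my-pod --grace-period=0 --force

# List VolumeAttachments; the stale one shows the PV still attached to the dead node.
kubectl get volumeattachment

# Optionally delete the stale attachment by name (names look like csi-<hash>).
kubectl delete volumeattachment csi-1234567890abcdef
```

On OpenShift, `oc` accepts the same verbs. Note this only clears state on the Kubernetes side; it does not release the RBD watcher/lock on the Ceph side, which is what the blacklist workaround discussed below addresses.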
Does anyone have an "official" workaround? Currently, I just blacklist the watcher of the CSI volume, and the newly scheduled pod is then able to attach to it. Someone mentioned "detaching" the volume on the previous host. Is that necessary?
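For reference, a minimal sketch of what the "blacklist the watcher" workaround can look like, assuming hypothetical pool and image names (look up the RBD image backing your PV first; where the pool/image names are exposed depends on the ceph-csi release). Newer Ceph releases rename the command to `osd blocklist add`:

```bash
# Hypothetical pool/image names; substitute the image backing your PV.
POOL=replicapool
IMAGE=csi-vol-00000000-1111-2222-3333-444444444444

# Show the watchers on the RBD image; the stale watcher is the lost node's client.
rbd status "${POOL}/${IMAGE}"

# Blacklist the stale watcher address reported on the "Watchers:" line above.
# (On newer Ceph releases this is "ceph osd blocklist add".)
ceph osd blacklist add 10.0.0.15:0/1234567890
```

Once the stale watcher is blacklisted, the new node can map the image and the replacement pod should come out of ContainerCreating.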
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
Though the issue is not solved yet, the latest CSI releases do offer a workaround (tested with …). You can use …. Having tested by plugging network cables out, without rebooting nodes, I could confirm that once the node gets back Ready, its Terminating containers are properly shut down, volumes un-mapped, ...
Describe the bug
When one of my OpenShift nodes goes down, the (RWO/rbd) PVCs that were attached to that machine become unavailable.
Pods trying to restart on another node get stuck in ContainerCreating, with events mentioning multi-attach errors.
They remain stuck until I eventually reboot the faulty node, which can take me hours or days (lab cluster).
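A quick way to confirm this state (hypothetical namespace and pod names): the replacement pod's events show the multi-attach error while the VolumeAttachment for the PV still points at the node that went down.

```bash
# The replacement pod's events should include a "Multi-Attach error" warning.
kubectl -n my-namespace describe pod my-replacement-pod

# The stale attachment still lists the unreachable node with ATTACHED=true.
kubectl get volumeattachment

# The faulty node shows up as NotReady.
kubectl get nodes
```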
Environment details
Using the samples from the `./deploy` folder in this repository. Been trying to update to the latest tags I could find on quay today, though some of the Pods crashed, suggesting some argument passed was no longer valid / supported. I eventually rolled back, as I noticed your copy still uses the same tags I already had.
NA
Client Version: openshift-clients-4.2.0-201910041700
Server Version: 4.2.9
Kubernetes Version: v1.14.6+20e2756
Steps to reproduce
Steps to reproduce the behavior:
1. Deploy from `./deploy/(cephfs|rbd)/kubernetes/v1.14+` (to be precise, I did deploy my copy before the cephfs resizer bit got added, 46 days ago)
...
4. Plug out the compute node hosting that Pod (or any violent shutdown that won't nicely terminate the CSI Pods running on the node); see the reproduction sketch after this list.
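Since the nodes here run as KVM guests, one way to perform the "violent shutdown" step is to kill the guest from the hypervisor. A sketch, assuming a hypothetical libvirt domain name:

```bash
# Kill the guest with no graceful shutdown, so the CSI pods on it never get to clean up.
virsh destroy worker-1

# Shortly after, the node goes NotReady and the rescheduled pod sits in
# ContainerCreating with multi-attach events.
kubectl get nodes
kubectl -n my-namespace get pods -o wide
```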
Actual results
Replacement pod stuck in ContainerCreating, with events mentioning multi-attach errors.
Expected behavior
Back when CSI was not mandatory in OpenShift, when one of my nodes went down I might have seen multi-attach errors right when the replacement Pod started, but the outage wouldn't have exceeded 5 to 10 minutes.
Additional context
I've been noticing that issue every time I lose a node. And currently, those RHCOS nodes don't behave well as KVM guests, ... I'd see crashes like these on a daily basis.