Add the ability to use exclusive RBD locking to prevent inadvertent multi-node access of an RWO image #578
Comments
I want to retract the possibility of using "rbd device map --exclusive". The issue is that, when that option is used, krbd will never release the lock. Normally that is exactly what you'd want, except for the case where you are trying to perform some maintenance operation on the image (snapshot create/remove, flatten, ...).
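For context, a minimal sketch of the behaviour being described, assuming a test image `rbd/test-img` that already has the exclusive-lock feature enabled:

```sh
# Map with --exclusive: the kernel client acquires the exclusive lock up front
# and, as noted above, will not release it until the image is unmapped.
rbd device map --exclusive rbd/test-img

# While the mapping above holds the lock, maintenance operations from other
# clients that also need the lock (snapshot create/remove, flatten, ...) stall or fail.
rbd snap create rbd/test-img@before-maintenance
```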
We need to do the below steps:
@dillaman @ShyamsundarR correct me if I am wrong. @dillaman, to remove the lock we need to connect to the cluster, right? That means we need the mon endpoints, admin ID and credentials (is it possible to remove the lock without those?), as we do
It's
Negative -- that was my retraction above. If you use this, you will break the ability to have third-party tools (i.e.
Negative -- on a normal unstage, if you were using
You need to talk to the cluster, so you need all the required details to talk to the cluster. Again, not sure it matters much since (1) you wouldn't be doing this during node unstage, and (2) see above ...
It's a limitation of the fact that krbd doesn't perform any maintenance ops. If you used
This is an important item for
@humblec There is currently no agreed path forward on this issue. The RBD exclusive-lock feature cannot be used and the CSI RPCs do not provide the necessary credentials to provide a manual workaround.
@dillaman true, I am aware of the conclusion we made in our last discussions on this topic. However, I am trying to figure out what can be done here to reach a (new) solution/workaround, be it by adding support from the RBD core side or by trying possibilities from other layers. I was told that we have to bring in this extra safety measure to be assured that we don't end up with multi-node access and its consequences. @mykaul jfyi ^^
Agreed it's a great safety feature -- but it will require a CSI spec and/or external-provisioner sidecar change. There is no workaround possible w/o the credentials. The timeline for getting mirroring to work w/ the external-provisioner is several releases out per the downstream team.
It seems silly to me to tag a ceph-csi issue against v2 and priority 0 when it will 100% not land in v2 due to upstream dependencies. I think the best you can do is have an issue against the spec and/or CSI external-provisioner limitation, just to track the opening of an issue against the other external components. Actually adding support for exclusive locking is something that cannot be scheduled until after the support is merged into the upstream components.
The solution to use

Instead, it was discussed that we can add our own annotation and ensure it is set before mapping the image on the node. This will be done in NodeStage (and hence undone in NodeUnstage), to ensure we are not caught in any other Kubernetes/CO related bugs (which is what this issue is about).

Performing the above in ControllerPublish/Unpublish defeats the purpose: if the CO decides its state is correct, i.e. there are no nodes that have the image mapped, when there is actually an active image map on a node, the CO would simply invoke ControllerUnpublish to remove the older annotation and call ControllerPublish to add a new one, causing the very issue that led to this problem in the first place.

The order of operations would look like:
[A] If a node goes down without removing an annotation, then the cluster/storage/CSI admin would need to remove the annotation to clear the state, hence this solution does not provide auto recovery or connection fencing. Connection fencing is not chosen because all images mapped on a node use a common connection, for efficiency of time (connection establishment) and resources (memory footprint on the node), and hence fencing one would fence all on the node, which would be a bigger issue to deal with.

[B] There is a race between steps 1 and 5, where 2 clients can pass the "no annotation set" check (as they both read an empty key) and proceed with steps 3 and 4, ending up with both clients writing their node details and reading them back, each assuming it is the sole owner. This needs more thought.

[C] Currently NodeUnstage does not pass in secrets and other required volume attributes. There is some reasoning as to why in these issues,
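As a rough illustration only (not the agreed design), the annotation could be kept as image metadata; the key name, node name and exact ordering below are hypothetical:

```sh
# NodeStage: check that no owner annotation is set (expect "no metadata key found").
rbd image-meta get rbd/pvc-0001 csi.volume.owner

# Record this node as the owner, read it back to confirm, and only then map the image.
rbd image-meta set rbd/pvc-0001 csi.volume.owner "node-1"
rbd image-meta get rbd/pvc-0001 csi.volume.owner
rbd device map rbd/pvc-0001

# NodeUnstage: unmap and clear the owner annotation.
rbd device unmap rbd/pvc-0001
rbd image-meta remove rbd/pvc-0001 csi.volume.owner
```

Note that this get/set sequence is not atomic on its own, which is the race called out in [B].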
I think we should use the "rbd lock" commands as a base, since they provide the necessary client address such that you could automatically blacklist a node w/o the need to require storage admin intervention. The lock infrastructure does support adding annotations to the lock if that is required, but we would need to tweak the rbd CLI / APIs to expose that to the outside world if required.
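For example, a sketch of that flow with an illustrative image, lock ID and client address (newer Ceph releases spell the command `ceph osd blocklist`):

```sh
# Take a named advisory lock on the image; the lock records the client address.
rbd lock add rbd/pvc-0001 csi-node-1

# List current lockers: the output shows the lock ID, locker (e.g. client.4152)
# and its address, which is exactly what is needed for fencing.
rbd lock ls rbd/pvc-0001

# If the owning node is unresponsive, blacklist its client address and
# remove the stale lock before handing the image to another node.
ceph osd blacklist add 192.168.1.10:0/123456789
rbd lock rm rbd/pvc-0001 csi-node-1 client.4152
```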
@athanatos @batrick To follow up from our discussion last week on the need for the csi driver to detach from a pod on a failed node and allow attaching on another node, any input to add to this thread? This is in the list of critical rook issues that we need to solve.
I'm not sure I've fully wrapped my head around the proposed solution by @ShyamsundarR. Can we arrange a meeting to discuss this?
@batrick @ShyamsundarR et al., can we update this issue with the discussion that happened in the above-mentioned meeting?
Here are the meeting updates:
Going forward, the actions would be:
@batrick @dillaman Please add in case any key points are missing.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
Not stale
Still a blocking issue in the field.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
Unstale |
Fencing and recovering is proposed to the CSI specification: container-storage-interface/spec#477 Ceph-CSI can implement
Please note that the proposed spec, afaict, intends to fence based on the Kubernetes node ID, whereas connections in Ceph that can be blocklisted are IP addresses. There possibly needs to be a mapping between the two, and hence this may require additional CSI metadata to be stored etc. as this is implemented.
Hi, is there any recent design or implementation work on this feature? We have some non-replicated single-node DB instances (for various reasons), and while very unlikely, we are worried about a node becoming disconnected from Kubernetes but not from Ceph, still running pods using RBD, and the master then deciding to schedule the pod again on another node, possibly corrupting the file system or DB. If this could be solved in Ceph directly, via locking, that would be great and easy to deploy.
Exclusive locking should be usable already. rook/rook#1507 is one of the issues that has more discussion on enhancing fail-over for certain scenarios. By using CSI-Addons
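For reference, a CSI-Addons NetworkFence request looks roughly like the sketch below; the driver name, secret, CIDR and clusterID are placeholders, and the exact schema should be checked against the csi-addons documentation:

```sh
# Fence a node's client address(es) through the CSI-Addons NetworkFence CRD.
kubectl apply -f - <<'EOF'
apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: fence-worker-1
spec:
  driver: rbd.csi.ceph.com
  fenceState: Fenced
  cidrs:
    - 10.90.89.66/32
  secret:
    name: csi-rbd-secret
    namespace: ceph-csi
  parameters:
    clusterID: my-ceph-cluster
EOF
```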
I have the same issue as everyone else, and I'm a bit confused about the current status. Is NetworkFence available with ceph-csi, or only when using Rook? I see in the Rook documentation that NetworkFence only kicks in if I mark a node lost with
What I am interested in is an automatic solution that would allow the pod to respawn on a new node without the operator having to manually confirm that the node is shut down. Shouldn't this lock-out happen automatically without
One of the problems is that there are two different services that may function independently of each other. There are the Kubernetes services (by kubelet) and the Ceph services (mapped RBD images or CephFS mounts). When a Pod is started by kubelet, kubelet basically does:
While mounting the volume, a Ceph connection is created, which becomes independent of kubelet. This means that when kubelet malfunctions for some reason, the Ceph connection is still alive and working. Whenever this happens, there is also no guarantee that the container(s) are stopped (kubelet may have failed to do so). This makes it possible that, on a node where kubelet no longer works correctly, the containers are still running and doing I/O on the Ceph volume.

If a process is doing I/O on a Ceph volume, ideally no other application should be accessing that volume at the same time (without the applications cooperating with each other, e.g. using file/byte-range locks). RBD volumes have more requirements than CephFS volumes do, as an RBD image usually contains a local filesystem (like ext4). Mapping an RBD image on two nodes and mounting the local filesystem will almost certainly lead to filesystem corruption.

In order to prevent filesystem corruption, or simply application data corruption because a stale kubelet failed to stop a workload, manual intervention is required, indicating that the node has been shut down. There is not always a manual intervention needed, though. Some environments accept a timeout on the Ceph volumes (in case the network was disconnected, when the node was shut down). This timeout can sometimes even take up to 15 minutes, depending on the type of failure. Because it is not predictable how failure scenarios happen, and data corruption (of any kind) is not ideal, having the security of the operator shutting down the malfunctioning node, and adding the taint once done, is the safest solution.
You are describing the situation without NetworkFence. The whole point of NetworkFence is to make previously-unsafe scenarios safe by removing the possibility for the broken node to corrupt anything. This feature has been added, and it works. My question is: why does this "broken node" detection have to be manual? Why can't the fencing kick in when the node is NotReady for a few minutes? It has the potential to make a lot of previously-manual-only recoveries safe for automation; it just wasn't enabled -- why not? If the node is offline, the fencing is useless. Why code a feature and then not use it?
With Rook it can all be automated. In that case, Rook acts as the operator, and a human is not required. Currently Rook is the only automation that handles this, to my knowledge. CSI drivers are not the right place to check if the container orchestrator (Kubernetes, Swarm, Nomad, ...) is working correctly. Rook or other operators for a specific container orchestrator are more suitable to do (Ceph) NetworkFencing. Ideally the container orchestrator itself provides support for fencing worker nodes, but I am not aware of any plans to do so. @subhamkrai who worked on the Rook part might have more details about other implementations. Medik8s is another project that could potentially use csi-addons for NetworkFencing, but I have not followed their future plans.
The Rook documentation doesn't mention any such process. In fact the page I linked above specifically states that the taint has to be added manually:
@remram44 could you open a feature request in Rook?
Please refer to rook/rook#1507 (comment) and https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown#proposal. IMO, in general, this is a question for Kubernetes folks, not Rook. They have to decide on what counts as a broken node. Rook could have a mechanism based on some conditions to decide this, but IMO it is out of scope for Rook, which just takes care of the storage part, to mark a node in the cluster, which may have various other apps and services running on it, as out of service.
I am getting bored playing whack-a-mole with rook hanging...
First of all, the Rook documentation indicates
If all that's left is for me to write a little controller that automatically applies
Don't get me wrong, these new developments are a dream come true that have so much potential to help with ops. I am just frustrated that we are missing the last (easiest) step of making this kick in on
FYI I don't use Rook, just ceph-csi; it just looks like this ceph-csi feature is only documented in the Rook docs.
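A back-of-the-envelope sketch of such a watchdog, assuming the `node.kubernetes.io/out-of-service` taint from the Kubernetes non-graceful-shutdown KEP is what triggers the recovery; it blindly trusts the NotReady status, which is the main risk with automating the taint:

```sh
#!/bin/sh
# Naive watchdog sketch: taint NotReady nodes so the fencing/failover machinery
# can kick in. WARNING: this trusts the Node Ready condition alone, which can
# fence a node that is merely partitioned from the API server but still healthy.
while true; do
  for node in $(kubectl get nodes --no-headers | awk '$2 ~ /NotReady/ {print $1}'); do
    echo "Tainting unresponsive node: $node"
    kubectl taint nodes "$node" \
      node.kubernetes.io/out-of-service=nodeshutdown:NoExecute --overwrite
  done
  sleep 60
done
```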
Bump. Does anyone understand the current behavior of ceph-csi? Does it add the taint? Does it read the taint? |
@remram44 We have disabled the auto fencing in Rook due to a potential issue of blocklisting the wrong IP, but you could potentially do the below steps if you are not using Rook
The first 3 steps are to ensure that we blocklist the right IP, to avoid data corruption/inconsistency problems, and the last step of adding the taint is to move the workload to a new node. We are still in the design phase to make life easy for users, but it's taking some time to find the right automated solution.
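A sketch of what those manual steps could look like when driving ceph-csi directly; the pool/image name, watcher address and node name below are placeholders:

```sh
# 1. Confirm which client still holds a watch/mapping on the affected image.
rbd status replicapool/pvc-0001            # lists watchers with their client IPs

# 2. Cross-check that the watcher IP really belongs to the failed node
#    before blocklisting anything (e.g. compare with the node's addresses).
kubectl get node worker-1 -o wide

# 3. Blocklist that client address so it can no longer write to the cluster.
ceph osd blocklist add 10.90.89.66:0/123456789

# 4. Taint the node so Kubernetes detaches the volume and reschedules the workload.
kubectl taint nodes worker-1 \
  node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
```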
Ok, thanks. I will see if I can dedicate some manpower on my side to make our own automation for this. Unless there is already an effort I could join? |
Although CO systems are meant to ensure an RWO volume is attached to a single node under any (un)availability constraints, it seems prudent to have an additional layer of protection to ensure this is taken care of.
Quoting solutions presented by @dillaman here:
"The best solution is to ensure that k8s will never request the CSI to map a PV to a node when it could already be mapped on an unresponsive node. Barring that, the CSI should enable the exclusive-lock feature on RBD images for RWO volumes at provision time. At attach time, it should break any existing locks prior to mapping via "rbd lock break" (which will blacklist the previous lock owner from the storage cluster), and the map operation should use "rbd device map --exclusive" to ensure that it immediately acquires the lock and will refuse to release it until unmapped / blacklisted."
We would need to analyze what this entails in a CO environment, and what workflows may be needed to remove nodes from a blacklist (if at all needed).
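A shell-level sketch of the quoted flow, with a placeholder image name and client address; the quote above refers to breaking any stale lock (which blacklists the previous owner) before the exclusive map:

```sh
# At provision time: create the image with the exclusive-lock feature enabled.
rbd create --size 10G --image-feature layering,exclusive-lock rbd/pvc-0001

# At attach time: check for an existing lock owner left behind by a failed node.
rbd lock ls rbd/pvc-0001

# If a stale owner exists, it has to be broken first; blacklisting its client
# address (as "rbd lock break" does, per the quote above) fences it from the cluster.
ceph osd blacklist add 192.168.1.10:0/123456789

# Then map exclusively, so the lock is acquired immediately and never released
# until the image is unmapped (or this client is itself blacklisted).
rbd device map --exclusive rbd/pvc-0001
```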