
Add the ability to use exclusive RBD locking to prevent inadvertent multi-node access of an RWO image #578

Open
ShyamsundarR opened this issue Aug 29, 2019 · 45 comments
Labels: bug (Something isn't working) · component/rbd (Issues related to RBD) · dependency/csi (This depends on changes to the CSI specification) · dependency/k8s (depends on Kubernetes features) · keepalive (used to disable stale bot activity in the repo) · Priority-0 (highest priority issue)


@ShyamsundarR
Contributor

Although CO systems are meant to ensure an RWO volume is attached to a single node under any (un)availability constraints, it seems prudent to have an additional layer of protection to ensure this is taken care of.

Quoting solutions presented by @dillaman here:
"The best solution is to ensure that k8s will never request the CSI to map a PV to a node when it could already be mapped on an unresponsive node. Barring that, the CSI should enable the exclusive-lock feature on RBD images for RWO volumes at provision time. At attach time, it should break any existing locks prior to mapping via "rbd lock break" (which will blacklist the previous lock owner from the storage cluster), and the map operation should use "rbd device map --exclusive" to ensure that it immediately acquires the lock and will refuse to release it until unmapped / blacklisted."

We would need to analyze what this entails in a CO environment, and what workflows may be needed to remove nodes from a blacklist (if needed at all).
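
For concreteness, a minimal CLI sketch of the quoted approach (pool/image names and size are placeholders; the stale-lock break step at attach time is omitted, since its exact invocation is not spelled out above):

```sh
# provision time: create the image with the exclusive-lock feature enabled
rbd create --size 10G --image-feature layering,exclusive-lock replicapool/pvc-0001

# attach time: map exclusively, so the lock is acquired immediately and is
# not released until the image is unmapped (or the client is blacklisted)
rbd device map --exclusive replicapool/pvc-0001

# detach time
rbd device unmap replicapool/pvc-0001
```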

@dillaman

I want to retract the possibility of using "rbd device map --exclusive". The issue is that when that option is used, krbd will never release the lock. Normally that is exactly what you'd want, except for the case where you are trying to perform some maintenance operation on the image (snapshot create/remove, flatten, ...).

@Madhu-1
Collaborator

Madhu-1 commented Aug 29, 2019

We need to do the following steps.

During create:

  • add the --exclusive-lock feature while creating an image if it is RWO

During node stage:

  • use rbd map --exclusive

During unstage:

  • list the locks on the image
  • remove the lock
  • unmap the rbd device

@dillaman @ShyamsundarR correct me if I am wrong

@dillaman to remove the lock we need to connect to the cluster, right? That means we need mon endpoints, an admin ID and credentials (is it possible to remove the lock without those?)

If we map with --exclusive we cannot take a snapshot, right? Is this expected?

@dillaman

We need to do the following steps.
During create

* add the `--exclusive-lock` feature while creating an image if it is RWO
  During node stage

It's --image-feature layering,exclusive-lock,... (there is no --exclusive-lock option)

* use `rbd map --exclusive`

Negative -- that was my retraction above. If you use this, you will break the ability to have third-party tools (e.g. rbd snap create) function against in-use images.

  During unstage

* list the locks on the image

* remove the lock

* unmap the rbd device

Negative -- on a normal unstage, if you were using exclusive-lock, it will automatically clean up the lock. You would only need to break a lock on node stage if it's still flagged as in-use on another node. But, using exclusive-lock in such a way gets back to the point above ...

@dillaman to remove the lock we need to connect to the cluster, right? That means we need mon endpoints, an admin ID and credentials (is it possible to remove the lock without those?)

You need to talk to the cluster, so you need all the required details to talk to the cluster. Again, not sure it matters much since (1) you wouldn't be doing this during node unstage, and (2) see above ...

If we map with --exclusive we cannot take a snapshot, right? Is this expected?

It's a limitation of the fact that krbd doesn't perform any maintenance ops. If you used rbd-nbd --exclusive, you would still be able to create snapshots because it would handle the request from a remote rbd client.
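
For illustration, the rbd-nbd variant would look roughly like this (image name is a placeholder):

```sh
# userspace mapping: the exclusive lock is still held, but librbd can service
# maintenance requests (e.g. `rbd snap create`) coming from other rbd clients
rbd-nbd map --exclusive replicapool/pvc-0001
```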

@humblec
Collaborator

humblec commented Oct 4, 2019

This is an important item for v2.0.0 (#557) and we would like to solve it in this release. @ShyamsundarR @Madhu-1 @dillaman any volunteers for this task?

@humblec humblec added the release-2.0.0 (v2.0.0 release), Priority-0 (highest priority issue) and bug (Something isn't working) labels Oct 4, 2019
@dillaman

dillaman commented Oct 4, 2019

@humblec There is currently no agreed path forward on this issue. The RBD exclusive-lock feature cannot be used and the CSI RPCs do not provide the necessary credentials to provide a manual workaround.

@humblec
Collaborator

humblec commented Oct 4, 2019

@humblec There is currently no agreed path forward on this issue. The RBD exclusive-lock feature cannot be used and the CSI RPCs do not provide the necessary credentials to provide a manual workaround.

@dillaman true, I am aware of the conclusion we reached in our last discussions on this topic. However, I am trying to figure out what can be done here to reach a (new) solution/workaround, be it by adding support on the RBD core side or by trying possibilities at other layers. I was told that we have to bring in this extra safety measure to be assured that we don't end up with multi-node access and its consequences.

@mykaul jfyi ^^

@dillaman

dillaman commented Oct 4, 2019

Agreed it's a great safety feature -- but it will require a CSI spec and/or external-provisioner sidecar change. There is no workaround possible w/o the credentials. The timeline for getting mirroring to work w/ the external-provisioner is several releases out per the downstream team.

@Madhu-1
Collaborator

Madhu-1 commented Oct 9, 2019

@dillaman @humblec if possible can we create an issue in the CSI spec for the changes required? What changes do we require in the external-provisioner sidecar?

@humblec
Collaborator

humblec commented Oct 9, 2019

@dillaman sure, we can poke upstream for the changes in the external-provisioner sidecar and try our best to get it in. But we have to list out the required changes.

@Madhu-1 we can track it with an issue in the CSI repo; we need to come up with the steps we are missing to achieve it here.
Also need a volunteer 👍

@dillaman

dillaman commented Oct 9, 2019

@dillaman sure, we can poke upstream for the changes in the external-provisioner sidecar and try our best to get it in. But we have to list out the required changes.

It seems silly to me to tag a ceph-csi issue against v2 and priority 0 when it will 100% not land in v2 due to upstream dependencies. I think the best you can do is have an issue against the spec and/or CSI external-provisioner limitation just to track the opening of an issue against the other external components. Actually adding support for exclusive locking is something that cannot be scheduled until after the support is merged into the upstream components.

@ShyamsundarR
Contributor Author

The solution to use rbd lock add ... commands is not quite correct, as it provides only an advisory lock, and further is equivalent to setting an annotation on the image that we do not control.

Instead, it was discussed that we can add our own annotation and ensure it is set before mapping the image on the node. This will be done in NodeStage (and hence undone in NodeUnstage), to ensure we are not caught in any other kubernetes/CO related bugs (which is what this issue is about).

Performing the above in ControllerPublish/Unpublish defeats the purpose: if the CO decides its state is correct, i.e. there are no nodes that have the image mapped, when there is actually an active image map on a node, the CO would simply invoke ControllerUnpublish to remove the older annotation and call ControllerPublish to add a new one, causing the issue that led to this problem in the first place.

The order of operations would look like the following (a command-level sketch follows the notes below):

  • NodeStage:

    1. Check if there is an existing annotation and it is not the current node
    2. If present and not the current node, error out [A]
    3. If not present, add an annotation with the required node details
    4. Read back and check that the annotation is what was added (to prevent races from parallel staging operations on other nodes that may have added an annotation at the same time)
      • If the annotation is different (last writer won), error out [A]
    5. If the annotation has been successfully set, proceed to map [B]
  • NodeUnstage:

    • (simplified) Check and remove the annotation added [C]

[A] If a node goes down without removing an annotation, then the cluster/storage/CSI admin would need to remove the annotation to clear the state, hence this solution does not provide auto recovery or connection fencing.

Connection fencing was not chosen, as all images mapped on a node use a common connection for efficiency in time (connection establishment) and resources (memory footprint on the node); hence fencing one would fence all images on the node, which would be a bigger issue to deal with.

[B] There is a race between steps 1 and 5, where two clients can pass the no-annotation-set check (as they both read an empty key) and proceed with steps 3 and 4, ending up with both clients writing their node details and reading them back, each assuming it is the sole owner. This needs more thought.

[C] Currently NodeUnstage does not pass in secrets and other required volume attributes. There is some reasoning as to why in these issues,

  • https://github.com/container-storage-interface/spec/issues/198
  • https://github.com/container-storage-interface/spec/issues/354

but, once we clear up [A] and [B] and other assumptions in here (such as that we will not use node-nonce-based fencing), we may need to revisit [C] with the CSI spec project.
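
A rough shell sketch of the proposed NodeStage flow, purely illustrative: it assumes the "annotation" is stored as RBD image metadata (rbd image-meta) under a made-up key, and it does not close the race described in [B]:

```sh
POOL=replicapool IMAGE=pvc-0001 NODE=$(hostname)

# steps 1/2: error out if another node already claims the image        [A]
owner=$(rbd image-meta get "$POOL/$IMAGE" csi.staged-on 2>/dev/null || true)
if [ -n "$owner" ] && [ "$owner" != "$NODE" ]; then
    echo "image already staged on $owner" >&2
    exit 1
fi

# steps 3/4: claim the image, then read the claim back (last writer wins)
rbd image-meta set "$POOL/$IMAGE" csi.staged-on "$NODE"
[ "$(rbd image-meta get "$POOL/$IMAGE" csi.staged-on)" = "$NODE" ] || exit 1

# step 5: proceed to map                                               [B]
rbd map "$POOL/$IMAGE"

# NodeUnstage (simplified):                                            [C]
#   rbd unmap "$POOL/$IMAGE"
#   rbd image-meta remove "$POOL/$IMAGE" csi.staged-on
```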

@dillaman

I think we should use the "rbd lock" commands as a base, since they provide the necessary client address such that you could automatically blacklist a node without requiring storage admin intervention. The lock infrastructure does support adding annotations to the lock, but we would need to tweak the rbd CLI / APIs to expose that to the outside world if required.
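
For reference, the advisory-lock primitives being discussed (names are illustrative); the listing is what exposes the locker's client address that could then be blacklisted:

```sh
rbd lock add replicapool/pvc-0001 csi-lock-id                   # take an advisory lock
rbd lock list replicapool/pvc-0001                              # shows lock ID, locker (client.N) and its address
rbd lock remove replicapool/pvc-0001 csi-lock-id client.4123    # locker taken from the listing
```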

@travisn
Member

travisn commented Jan 16, 2020

@athanatos @batrick To follow up from our discussion last week on the need for the csi driver to detach from a pod on a failed node and allow attaching on another node, any input to add to this thread? This is in the list of critical rook issues that we need to solve.

@batrick
Member

batrick commented Jan 16, 2020

I'm not sure I've fully wrapped my head around the proposed solution by @ShyamsundarR. Can we arrange a meeting to discuss this?

@Madhu-1 Madhu-1 added the Release-2.1.0 label and removed the release-2.0.0 (v2.0.0 release) label Jan 20, 2020
@humblec
Collaborator

humblec commented Jan 28, 2020

@batrick @ShyamsundarR et al., can we update this issue with the discussion that happened in the above-mentioned meeting?

@ShyamsundarR
Contributor Author

Here are the meeting updates:

  1. To overcome the lack of secrets in the NodeUnstageVolume request for any required solution, it was proposed that we add the secret (name/namespace data) to the clusterID configuration, and read/watch said Kubernetes secrets.

    • This separates the secrets into their own namespaces (a requirement when different users administer the different ceph clusters), and needs the required RBACs for the CSI namespace to read them
    • Provides the secrets for all calls that CSI may want to make (now and in the future)
    • Brings us closer to cloud provider implementations, where ability to access secrets by the CSI driver is assumed in such environments
  2. Better understanding of the problem was needed, to establish when CSI plugins are involved in publishing the volume to a newer node. This is covered in this comment (and also extensively by other kubernetes issues around the topic)

  3. We also discussed how to fence RWX volumes, on nodes that are no longer healthy. This requires kubernetes to provide data on which node is stale or what entity needs fencing/detach.

    • RWX fencing cases could be due to active/standby RBD volumes, higher-layer clustered use of RBD volumes, and also CephFS volumes
    • (Post the meeting) the KEP that seems to be tackling this is https://github.com/kubernetes/enhancements/pull/1116
  4. Finally there is no solution for StatefulSets, based on above reading and investigations, and this also requires the same kubernetes KEP (noted above) to move forward

Going forward, the actions would be:

  • Assuming we have the required secrets, arrive at the exact Ceph RBD/CephFS feature/mechanism to fence stale nodes
  • Agree/disagree on moving secrets to CSI as proposed above
  • Follow up with upstream Kubernetes and CSI WGs to understand how best the problem is intended to be resolved

@batrick @dillaman Please add in case any key points are missing.

@stale

stale bot commented Oct 20, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix (This will not be worked on) label Oct 20, 2020
@LalitMaganti

Not stale

@yanchicago

Still a blocking issue in the field.

@stale stale bot removed the wontfix (This will not be worked on) label Oct 23, 2020
@stale

stale bot commented Jan 24, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix (This will not be worked on) label Jan 24, 2021
@dimm0

dimm0 commented Jan 24, 2021

Unstale

@stale stale bot removed the wontfix (This will not be worked on) label Jan 24, 2021
@Madhu-1 Madhu-1 added the keepalive (This label can be used to disable stale bot activity in the repo) label Mar 10, 2021
@nixpanic nixpanic added the dependency/csi (This depends on changes to the CSI specification) and dependency/k8s (depends on Kubernetes features) labels May 5, 2021
@nixpanic
Member

nixpanic commented May 5, 2021

Fencing and recovering is proposed for the CSI specification: container-storage-interface/spec#477

Ceph-CSI can implement ControllerPublish and ControllerUnpublish to do the (un)fencing of nodes and volumes once the change in the spec has been accepted, released and implemented in Kubernetes-CSI.

@ShyamsundarR
Contributor Author

Fencing and recovering is proposed for the CSI specification: container-storage-interface/spec#477

Ceph-CSI can implement ControllerPublish and ControllerUnpublish to do the (un)fencing of nodes and volumes once the change in the spec has been accepted, released and implemented in Kubernetes-CSI.

Please note that the proposed spec, afaict, intends to fence based on the Kubernetes node ID, whereas the connections in Ceph that can be blocklisted are IP addresses. There possibly needs to be a mapping between the two, which may require additional CSI metadata to be stored as this is implemented.

@baryluk

baryluk commented May 11, 2022

Hi,

is there any recent design or implementation work on this feature? We have some non-replicated single-node DB instances (for various reasons), and while it is very unlikely, we are worried about a node becoming disconnected from Kubernetes but not from Ceph, still running pods that use RBD, while the master decides to schedule the pod again on another node, possibly corrupting the filesystem or DB.

If this could be solved in ceph directly, via locking, that would be great, and easy to deploy.

@nixpanic
Member

Exclusive locking should be usable already.

rook/rook#1507 is one of the issues that has more discussion on enhancing fail-over for certain scenarios. By using CSI-Addons NetworkFence, it will become possible to (automatically) ceph osd blocklist a failed node, and have the workload Pods start somewhere else. It is required that the application (database in your case) can recover from possible incomplete writes (fencing/blocklisting is similar to power-loss on a system).
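
Under the hood, the fencing boils down to a Ceph-level blocklist entry for the failed node's client address, roughly as below (address is a placeholder; older releases spell the command `ceph osd blacklist`):

```sh
ceph osd blocklist add 192.168.121.61    # fence all Ceph clients on that node
ceph osd blocklist ls                    # verify the entry exists
ceph osd blocklist rm 192.168.121.61     # un-fence once the node has been recovered
```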

@remram44

remram44 commented Aug 4, 2023

I have the same issue as everyone else, and I'm a bit confused about the current status. Is NetworkFence available with ceph-csi or only when using Rook?

I see in the Rook documentation that NetworkFence only kicks in if I mark a node lost with node.kubernetes.io/out-of-service=nodeshutdown. However, if it is already "nodeshutdown", what do I need the network fence for? Additionally, Kubernetes will already force-delete pods if I give this manual confirmation, so there is no problem to be fixed (see Non Graceful Node Shutdown, beta in 1.26).

What I am interested in is an automatic solution that would allow the pod to respawn on a new node without the operator having to manually confirm that the node is shutdown. Shouldn't this lock-out happen automatically without out-of-service=nodeshutdown, if it is to improve the status quo?

@nixpanic
Member

nixpanic commented Aug 7, 2023

One of the problems is that there are two different services that may function independently of each other: the Kubernetes services (by kubelet) and the Ceph services (mapped RBD images or CephFS mounts). When a Pod is started by kubelet, kubelet basically does:

  1. mount the volume(s)
  2. start the container(s)

While mounting the volume, a Ceph connection is created, which becomes independent of kubelet. This means that when kubelet malfunctions for some reason, the Ceph connection is still alive and working. When this happens, there is also no guarantee that the container(s) are stopped (kubelet may have failed to do so). This makes it possible that on a node where kubelet no longer works correctly, the containers are still running and doing I/O on the Ceph volume.

If a process is doing I/O on a Ceph volume, ideally no other application should be accessing that volume at the same time (without the applications cooperating with each other, e.g. using file/byte-range locks). RBD volumes have more requirements than CephFS volumes do, as an RBD image usually contains a local filesystem (like ext4). Mapping an RBD image on two nodes and mounting the local filesystem will almost certainly lead to filesystem corruption. In order to prevent filesystem corruption, or simply application data corruption because a stale kubelet failed to stop a workload, manual intervention is required, indicating that the node has been shut down.

Manual intervention is not always needed though. Some environments accept a timeout on the Ceph volumes (in case the network was disconnected when the node was shut down). This timeout can sometimes take up to 15 minutes, depending on the type of failure. Because failure scenarios are not predictable and data corruption (of any kind) is not acceptable, having the operator shut down the malfunctioning node and add the taint once done is the safest solution.

@remram44

remram44 commented Aug 7, 2023

You are describing the situation without NetworkFence. The whole point of NetworkFence is to make previously-unsafe scenarios safe by removing the possibility for the broken node to corrupt anything. This feature has been added, and it works.

My question is why does this "broken node" detection have to be manual? Why can't the fencing kick in when the node is NotReady for a few minutes? It has the potential to make a lot of previously manual-only recoveries safe for automation; it just isn't enabled, and I don't see why not. If the node is offline, the fencing is useless. Why code a feature and then not use it?

@nixpanic
Member

nixpanic commented Aug 7, 2023

With Rook it can all be automated. In that case, Rook acts like the operator, and a human is not required. Currently Rook is the only automation that handles this, to my knowledge. CSI-drivers are not the right place to check if the container orchestrator (Kubernetes, Swarm, Nomad, ...) is working correctly. Rook or other operators for a specific container orchestrator are more suitable to do (Ceph) NetworkFencing. Ideally the container orchestrator itself provides support for fencing worker nodes, but I am not aware of any plans to do so.

@subhamkrai who worked on the Rook part might have more details about other implementations.

Medik8s is another project that could potentially use csi-addons for NetworkFencing, but I have not followed their future plans.

@remram44

remram44 commented Aug 7, 2023

The Rook documentation doesn't mention any such process. In fact the page I linked above specifically states that the taint has to be added manually:

After the taint is added to the node, Rook will automatically blocklist the node

@subhamkrai
Contributor

@remram44 could you open a feature request in Rook?

@Rakshith-R
Contributor

You are describing the situation without NetworkFence. The whole point of NetworkFence is to make previously-unsafe scenarios safe by removing the possibility for the broken node to corrupt anything. This feature has been added, and it works.

My question is why does this "broken node" detection have to be manual?

Please refer to rook/rook#1507 (comment) and https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown#proposal.

IMO, in general, this is a question for the Kubernetes folks, not Rook. They have to decide on what a broken node is.
There is no agreement yet on when a node can be considered broken. If they did have an agreement, Kubernetes itself could taint the node automatically instead of relying on the admin.

Rook could have a mechanism based on some conditions to decide this, but IMO it is out of scope for Rook, which just takes care of the storage part, to mark a node in the cluster, which may have various other apps and services running on it, as out of service.
Anything we implement at the Rook level will be a hack until Kubernetes itself comes up with a mechanism.

@jjgraham

jjgraham commented Aug 9, 2023

I am getting bored playing whack-a-mole with rook hanging...

@remram44

remram44 commented Aug 9, 2023

First of all, the Rook documentation indicates node.kubernetes.io/out-of-service=nodeshutdown while the well-known taint from Kubernetes is node.kubernetes.io/out-of-service. This implies that I should only use this taint when the node is shut down (nodeshutdown). This is a minor documentation issue, but if the whole feature is there to deal with the case where a node is out of service but NOT shut down (so it might issue requests that need to be fenced), I think correcting this would go a long way.

If all that's left is for me to write a little controller that automatically applies out-of-service taints when nodes are NotReady for a while, this is great news. All I need is for someone to confirm that I can do that with non-nodeshutdown nodes, and the combination of the three features csi-addons & ceph-csi blocklist & Kubernetes non-graceful node shutdown will make my life so much easier by allowing stateful workloads to recover without manual intervention or risk of corruption.

Don't get me wrong, these new developments are a dream come true that have so much potential to help with ops. I am just frustrated that we are missing the last (easiest) step of making this kick in on NotReady rather than out-of-service:nodeshutdown.

FYI I don't use Rook, just ceph-csi, it just looks like that ceph-csi feature is only documented in the Rook docs.
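
The "little controller" mentioned above could start out as simple as the loop below; this is purely a sketch that ignores the "NotReady for a while" grace period, and, per the follow-up comments, the node's Ceph client IP should be fenced before the taint is applied:

```sh
# naive watcher: taint any NotReady node so non-graceful shutdown handling kicks in
while true; do
    for node in $(kubectl get nodes --no-headers | awk '$2 == "NotReady" {print $1}'); do
        kubectl taint nodes "$node" --overwrite \
            node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
    done
    sleep 60
done
```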

@remram44

Bump. Does anyone understand the current behavior of ceph-csi?

Does it add the taint? Does it read the taint?

@Madhu-1
Collaborator

Madhu-1 commented Oct 18, 2024

Bump. Does anyone understand the current behavior of ceph-csi?

Does it add the taint? Does it read the taint?

@remram44 We have disabled the auto fencing in Rook due to a potential issue of blocklisting the wrong IP, but you could do the following if you are not using Rook:

  • Identify that the node is going down or out of service somehow; we cannot use the out-of-service taint here because it doesn't do anything with storage (w.r.t. cephcsi), it just removes the VolumeAttachment (VA) objects and starts moving the pods to new nodes
  • Get the right client IP of the node you want to block access for, and create the NetworkFence CR
  • Ensure that the NetworkFence is successful
  • Add the out-of-service taint on the node to move the workload to another node

The first three steps ensure that we blocklist the right IP to avoid data corruption/inconsistency problems, and the last step of adding the taint is to move the workload to a new node (a rough command-level sketch follows below).

We are still in the design phase of making life easier for users, but it is taking some time to find the right automated solution.
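
A rough command-level sketch of those manual steps outside Rook (pool/image, IP and node names are placeholders; the NetworkFence CR itself comes from CSI-Addons and is not shown here):

```sh
# step 2: find the client IP currently using the image
rbd status replicapool/csi-vol-0001     # the "Watchers:" line shows the client address

# steps 2/3: fence that IP (via a NetworkFence CR, or directly with
# `ceph osd blocklist add` as sketched earlier) and confirm it took effect
ceph osd blocklist ls

# step 4: only after fencing succeeded, move the workload to another node
kubectl taint nodes worker-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
```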

@remram44

Ok, thanks. I will see if I can dedicate some manpower on my side to make our own automation for this. Unless there is already an effort I could join?
