Add multi-attach error to RWO volumes #161
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Hi @alifelan. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with
I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Due to an issue with Kubelet, a second volume attachment may be created for an RWO volume if the first volume attachment is undergoing deletion. Once the initial operation times out, the volume is marked as uncertain in the Actual State of the World, and the multi-attach check does not prevent a second node from starting the attachment.
This causes a problem with the KubeVirt CSI Driver. The new volume won't be attached until it is released from the previous VM, and any new, unrelated volumes that we try to attach to the new VM will fail because the hotplug pod is stuck in ContainerCreating due to the multi-attach error.
Here, we introduce an initial check for volumes that are RWO (or, as in the code, non-RWX): we iterate through the available Virtual Machine Instances and check whether our current volume is still in the VolumeStatus of any of them.
Signed-off-by: Alí Felán <[email protected]>
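For illustration, here is a minimal sketch of the check described above, assuming hypothetical helper names (volumeAttachedElsewhere, rejectMultiAttach) and the kubevirt.io/api/core/v1 types; the actual PR code may be structured differently:

package service

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	kubevirtv1 "kubevirt.io/api/core/v1" // assumed KubeVirt API module path
)

// volumeAttachedElsewhere (hypothetical name) reports whether the volume named
// dvName still appears in the VolumeStatus of any VMI other than the one we are
// attaching to. A match means the volume is still attached (or detaching)
// somewhere else, so a second RWO attach must be rejected.
func volumeAttachedElsewhere(vmis []kubevirtv1.VirtualMachineInstance, targetVMIName, dvName string) bool {
	for _, vmi := range vmis {
		if vmi.Name == targetVMIName {
			continue // the target VM is allowed to already hold the volume
		}
		for _, volumeStatus := range vmi.Status.VolumeStatus {
			if volumeStatus.Name == dvName {
				return true
			}
		}
	}
	return false
}

// rejectMultiAttach (hypothetical) shows how a failed check could be turned
// into a gRPC FailedPrecondition status for ControllerPublishVolume.
func rejectMultiAttach(isRWX bool, vmis []kubevirtv1.VirtualMachineInstance, targetVMIName, dvName string) error {
	if !isRWX && volumeAttachedElsewhere(vmis, targetVMIName, dvName) {
		return status.Errorf(codes.FailedPrecondition,
			"volume %s is already attached to another node", dvName)
	}
	return nil
}

A gRPC FailedPrecondition status corresponds to the PreconditionFailed error that the CSI specification expects when a volume is already attached to a different node.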
/test all
Looks pretty good, just a few questions.
pkg/service/controller.go (Outdated)
// Returns:
//   - bool: True if the capability represents an RWX volume.
//   - bool: True if the capability represents an RWO volume.
func getCapabiltyAccessMode(cap *csi.VolumeCapability_AccessMode) (isRWX bool, isRWO bool) {
I would probably switch this to
func hasRWXCapabiltyAccessMode(cap *csi.VolumeCapability_AccessMode) (bool, error) {
	switch cap.GetMode() {
	case ...: // RWO caps
		return false, nil
	case ...: // RWX caps
		return true, nil
	}
	return false, fmt.Errorf("unknown volume capability")
}

Since we don't seem to care about the isRWO bool, and the error will allow us to detect invalid capabilities.
Sure, on it. I'd like to mention that we do care about isRWO when doing our checks in ControllerPublishVolume, but we can do a !isRWX. The returned error here will also bubble up through getAccessMode, which will now return (bool, bool, error).
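A rough sketch of what the updated getAccessMode could look like (same assumed package and imports as the sketch above; the real function's inputs and structure may differ):

// getAccessMode aggregates the per-capability results and bubbles up any
// invalid capability as an error, giving the (isRWX, isRWO, error) shape
// mentioned above.
func getAccessMode(caps []*csi.VolumeCapability) (isRWX bool, isRWO bool, err error) {
	for _, c := range caps {
		rwx, err := hasRWXCapabiltyAccessMode(c.GetAccessMode())
		if err != nil {
			return false, false, err
		}
		if rwx {
			isRWX = true
		} else {
			isRWO = true
		}
	}
	return isRWX, isRWO, nil
}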
for _, volumeStatus := range vmi.Status.VolumeStatus {
	// If the name in the status matches our PVC name, it means the volume
	// is actively attached to this other VMI.
	if volumeStatus.Name == dvName {
Do we want to be a little more explicit? The volume could be in the process of unplugging, or won't that matter in this case?
I don't think we want to be more specific, since we want to cover a volume in the unplugging process. The error we had seen comes from a volumeStatus in Detaching state, so it's being unplugged, but it cannot be released because the new hotplug pod has not started successfully.
From the VirtualMachineInstance, we can see the volume go through the following status updates when detaching:
- hotplugVolume:
attachPodName: hp-volume-5f2jw
attachPodUID: f6523b31-2a7f-4dbf-865d-8fc1f4108efb
message: Successfully attach hotplugged volume pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
to VM
name: pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
persistentVolumeClaimInfo:
accessModes:
- ReadWriteOnce
capacity:
storage: 1Gi
claimName: pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
filesystemOverhead: "0"
requests:
storage: "1073741824"
volumeMode: Block
phase: Ready
reason: VolumeReady
target: sdb
---
- hotplugVolume:
attachPodName: hp-volume-5f2jw
attachPodUID: f6523b31-2a7f-4dbf-865d-8fc1f4108efb
message: Successfully attach hotplugged volume pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
to VM
name: pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
persistentVolumeClaimInfo:
accessModes:
- ReadWriteOnce
capacity:
storage: 1Gi
claimName: pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
filesystemOverhead: "0"
requests:
storage: "1073741824"
volumeMode: Block
phase: Detaching
reason: VolumeReady
target: sdb
---
- hotplugVolume:
attachPodName: hp-volume-5f2jw
attachPodUID: f6523b31-2a7f-4dbf-865d-8fc1f4108efb
message: Deleted hotplug attachment pod hp-volume-5f2jw, for volume pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
name: pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
persistentVolumeClaimInfo:
accessModes:
- ReadWriteOnce
capacity:
storage: 1Gi
claimName: pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
filesystemOverhead: "0"
requests:
storage: "1073741824"
volumeMode: Block
phase: Detaching
reason: SuccessfulDelete
target: sdb
In our scenario, we are stuck in the Detaching phase with a VolumeReady reason. This happens because the current hotplug pod has not been deleted, since the new one is still in ContainerCreating. While we could check for a Detaching phase with a SuccessfulDelete reason, I think that is only shown temporarily in the VirtualMachineInstance status, so a simple name check should be enough (one check would fail, the follow-up would succeed; IIUC the period of this state is short).
What do you think?
Okay, sounds good. I was just making sure I fully understood the issue here. What you are saying makes sense.
This provides a function that, based on the access mode, returns whether a volume is RWX and whether it is RWO. Signed-off-by: Alí Felán <[email protected]>
/test all
I'm looking into the e2e failure, and I don't think it was caused by this commit. The failure here happened when checking the events for the pod (which, if we made it here, means the pod has finished running, i.e. everything was mounted properly). I haven't seen a scenario where the serial ID shows up later, but I guess it could happen if udev is taking some time? The other possibility I can think of is
I don't see anything pointing to the issue being caused by this PR. I'll retry the test.
/test pull-csi-driver-e2e-k8s
@alifelan: PRs from untrusted users cannot be marked as trusted with
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Fair, I cannot rerun the test. I think this is unrelated to the PR; can we rerun the test? Is this test known to be flaky?
/test pull-csi-driver-e2e-k8s
I am not aware of the test being flaky, but let's run it again and find out.
It seems like it may be flaky, probably because of udev timing out. The way I understand it, this test was added to ensure we don't succeed a
What this PR does / why we need it:
When a volume attachment cannot be deleted, KCM will mark the volume as uncertain. An uncertain volume is not considered attached to the node, and multi-attach errors are only triggered if the volume is confirmed as attached.
This leads to a problem with KubeVirt CSI that compounds. If a volume cannot be released from the previous VM (as in, the newest hotplug pod that releases the volume cannot start), KubeVirt CSI will time out waiting for the operation. Once the timeout happens, KCM marks the volume as uncertain and creates a new volume attachment for the new node. Worth noting here: the previous volume attachment is still in the cluster, it has an error, and it also has a deletion timestamp.
When KubeVirt CSI starts reconciling on the second volume attachment, it will issue a new virt addvolume command. The hotplug pod gets created, but it is never able to start due to a multi-attach error. This ends up impacting all of the future volumes that get attached (until the problem gets addressed). Not only that, but any "released" volume from the problematic VM will propagate the error to any new VM that receives that volume. This scenario is not ideal, and while there was a PR for it a year ago, it was not merged.
The CSI Specification denotes an error that should be returned by the plugin if the volume is already attached to a different node: a PreconditionFailed error. This is what this PR achieves. While it would be possible to add a map, a loop over VirtualMachineInstance × VolumeStatus is performant enough for the KubeVirt CSI Driver. A test (albeit created without a real API server) with 10,000 VirtualMachineInstances with 40 VolumeStatus entries each took ~9ms.
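A rough sketch of the kind of micro-benchmark described above (in-memory objects only, no API server; names are hypothetical and the real test may be structured differently):

package service

import (
	"fmt"
	"testing"

	kubevirtv1 "kubevirt.io/api/core/v1" // assumed KubeVirt API module path
)

// BenchmarkVolumeStatusScan approximates the scenario described above: 10,000
// VMIs with 40 VolumeStatus entries each, scanned in full (worst case: the
// volume is attached nowhere, so nothing short-circuits).
func BenchmarkVolumeStatusScan(b *testing.B) {
	vmis := make([]kubevirtv1.VirtualMachineInstance, 10000)
	for i := range vmis {
		vmis[i].Name = fmt.Sprintf("vmi-%d", i)
		vmis[i].Status.VolumeStatus = make([]kubevirtv1.VolumeStatus, 40)
		for j := range vmis[i].Status.VolumeStatus {
			vmis[i].Status.VolumeStatus[j].Name = fmt.Sprintf("pvc-%d-%d", i, j)
		}
	}
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		found := false
		for _, vmi := range vmis {
			for _, volumeStatus := range vmi.Status.VolumeStatus {
				if volumeStatus.Name == "pvc-not-present" {
					found = true
				}
			}
		}
		_ = found
	}
}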
Which issue(s) this PR fixes (optional, in fixes #<issue_number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #
Special notes for your reviewer:
This was built on top of #160; however, this change is independent of it.
Release note: