
Conversation


@alifelan alifelan commented Oct 22, 2025

What this PR does / why we need it:
When a volume attachment cannot be deleted, KCM will mark the volume as uncertain. An uncertain volume is not considered attached to the node, and multi-attach errors are only triggered when the volume is confirmed as attached.

This leads to a problem with KubeVirt CSI that compounds. If a volume cannot be released from the previous VM (as in, the newest hotplug pod that releases the volume cannot start), KubeVirt CSI will time out waiting for the operation. Once the timeout happens, KCM marks the volume as uncertain and creates a new volume attachment for the new node. Worth noting here: the previous volume attachment is still in the cluster, it has an error, and it also has a deletion timestamp.
When KubeVirt CSI starts reconciling the second volume attachment, it issues a new virt addvolume command. The hotplug pod gets created, but it is never able to start due to a multi-attach error. This ends up impacting all future volumes that get attached (until the problem is addressed). Not only that, but any "released" volume from the problematic VM will propagate the error to any new VM that receives that volume.

This scenario is not ideal, and while there was a PR for it a year ago, it was not merged.

The CSI specification defines the error a plugin should return when the volume is already attached to a different node: PreconditionFailed. That is what this PR implements.

While a lookup map could be added, a loop over VirtualMachineInstance × VolumeStatus entries is performant enough for the KubeVirt CSI Driver. A test (albeit created without a real API server) with 10,000 VirtualMachineInstances of 40 VolumeStatus entries each took ~9 ms.
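
For illustration, here is a minimal, self-contained sketch of that loop and the PreconditionFailed return; the function name, the pre-listed VMI slice, and the kubevirt.io/api/core/v1 import path are assumptions for readability, not the PR's exact code:

package sketch

import (
  "fmt"

  "google.golang.org/grpc/codes"
  "google.golang.org/grpc/status"

  kubevirtv1 "kubevirt.io/api/core/v1" // assumed KubeVirt API import
)

// errIfAttachedToOtherVMI scans the given VirtualMachineInstances and returns the
// CSI PreconditionFailed error (gRPC FAILED_PRECONDITION) when the volume's
// DataVolume/PVC name still appears in the VolumeStatus of a VMI other than the
// publish target. Illustrative sketch only.
func errIfAttachedToOtherVMI(vmis []kubevirtv1.VirtualMachineInstance, dvName, targetVMIName string) error {
  for _, vmi := range vmis {
    if vmi.Name == targetVMIName {
      continue // attaching to the requested VMI is expected
    }
    for _, volumeStatus := range vmi.Status.VolumeStatus {
      if volumeStatus.Name == dvName {
        return status.Error(codes.FailedPrecondition,
          fmt.Sprintf("volume %s is still attached to VMI %s/%s", dvName, vmi.Namespace, vmi.Name))
      }
    }
  }
  return nil
}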

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:
This was built on top of #160; however, this change is independent of it.

Release note:

The KubeVirt CSI Driver now checks for multi-attach errors on RWO volumes.

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Oct 22, 2025
@kubevirt-bot kubevirt-bot requested review from aglitke and awels October 22, 2025 06:01
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign davidvossel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot

Hi @alifelan. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Due to an issue with Kubelet, a second volume attachment may be created
for an RWO volume if the first volume attachment is undergoing deletion.
Once the initial operation times out, the volume is marked as uncertain
in the Actual State of the World, and the multi-attach check does not
prevent a second node from starting the attachment.

This ends up causing a problem with the KubeVirt CSI Driver. This new
volume won't be attached until it's released from the previous VM, and
any new and unrelated volumes that we try to attach to this new VM will
fail since the hotplug pod is stuck in ContainerCreating due to the
multi-attach error.

Here, we introduce an initial check for volumes that are RWO (or, as in
the code, non-RWX), where we iterate through the available Virtual
Machine Instances and check whether our current volume is still in the
VolumeStatus of any of them.

Signed-off-by: Alí Felán <[email protected]>
@awels
Member

awels commented Oct 22, 2025

/test all
Thanks for the PR; it is on my list to take a look at.

Member

@awels awels left a comment


Looks pretty good, just a few questions.

// Returns:
// - bool: True if the capability represents an RWX volume.
// - bool: True if the capability represents an RWO volume.
func getCapabiltyAccessMode(cap *csi.VolumeCapability_AccessMode) (isRWX bool, isRWO bool) {
Member


I would probably switch this to

func hasRWXCapabiltyAccessMode(cap *csi.VolumeCapability_AccessMode) (bool, error) {
  switch cap.GetMode() {
    ...RWO CAPS
    return false, nil
    ...RWX CAPS
    return true, nil
  }
  return false, fmt.Errorf("unknown volume capability")
}

Since we don't seem to care about the isRWO bool, and the error will allow us to detect invalid capabilities.
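
For reference, one possible shape of that suggestion, using the access-mode constants from the CSI spec's Go bindings; which modes count as RWX here is an assumption for illustration, and the csi and fmt imports are presumed already present in the file:

// Illustrative sketch; the set of modes treated as RWX is an assumption, not the driver's final code.
func hasRWXCapabiltyAccessMode(cap *csi.VolumeCapability_AccessMode) (bool, error) {
  switch cap.GetMode() {
  case csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
    csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY:
    // Single-node (RWO-style) modes.
    return false, nil
  case csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY,
    csi.VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER,
    csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER:
    // Multi-node (RWX-style) modes.
    return true, nil
  }
  return false, fmt.Errorf("unknown volume capability %v", cap.GetMode())
}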

Author


Sure, on it. I'd like to mention that we do care about isRWO when doing our checks in ControllerPublishVolume, but we can do a !isRWX. The returned error here will also bubble up through getAccessMode, which will now return (bool, bool, error).
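
Roughly, the aggregation could end up looking like this (a sketch only; the exact signature in the PR may differ):

// Sketch: collect RWX/RWO flags across all requested capabilities, surfacing
// the "unknown capability" error from the helper above.
func getAccessMode(caps []*csi.VolumeCapability) (bool, bool, error) {
  var isRWX, isRWO bool
  for _, c := range caps {
    rwx, err := hasRWXCapabiltyAccessMode(c.GetAccessMode())
    if err != nil {
      return false, false, err
    }
    if rwx {
      isRWX = true
    } else {
      isRWO = true
    }
  }
  return isRWX, isRWO, nil
}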

for _, volumeStatus := range vmi.Status.VolumeStatus {
  // If the name in the status matches our PVC name, it means the volume
  // is actively attached to this other VMI.
  if volumeStatus.Name == dvName {
Member


Do we want to be a little more explicit? The volume could be in the process of unplugging, or won't that matter in this case?

Author


I don't think we want to be more specific, since we want to cover a volume in the unplugging process. The error we had seen comes from a volumeStatus in the Detaching state, so it's being unplugged, but it cannot be released because the new hotplug pod has not started successfully.

From the VirtualMachineInstance, we can see the volume go through the following status updates while detaching:

  - hotplugVolume:
      attachPodName: hp-volume-5f2jw
      attachPodUID: f6523b31-2a7f-4dbf-865d-8fc1f4108efb
    message: Successfully attach hotplugged volume pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
      to VM
    name: pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
    persistentVolumeClaimInfo:
      accessModes:
      - ReadWriteOnce
      capacity:
        storage: 1Gi
      claimName: pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
      filesystemOverhead: "0"
      requests:
        storage: "1073741824"
      volumeMode: Block
    phase: Ready
    reason: VolumeReady
    target: sdb
---
  - hotplugVolume:
      attachPodName: hp-volume-5f2jw
      attachPodUID: f6523b31-2a7f-4dbf-865d-8fc1f4108efb
    message: Successfully attach hotplugged volume pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
      to VM
    name: pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
    persistentVolumeClaimInfo:
      accessModes:
      - ReadWriteOnce
      capacity:
        storage: 1Gi
      claimName: pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
      filesystemOverhead: "0"
      requests:
        storage: "1073741824"
      volumeMode: Block
    phase: Detaching
    reason: VolumeReady
    target: sdb
---
  - hotplugVolume:
      attachPodName: hp-volume-5f2jw
      attachPodUID: f6523b31-2a7f-4dbf-865d-8fc1f4108efb
    message: Deleted hotplug attachment pod hp-volume-5f2jw, for volume pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
    name: pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
    persistentVolumeClaimInfo:
      accessModes:
      - ReadWriteOnce
      capacity:
        storage: 1Gi
      claimName: pvc-6c23f224-bc1a-489f-9104-1762cd9cbbff
      filesystemOverhead: "0"
      requests:
        storage: "1073741824"
      volumeMode: Block
    phase: Detaching
    reason: SuccessfulDelete
    target: sdb

In our scenario, we are stuck in the Detaching phase with a VolumeReady reason. This happens because the current hotplug pod has not been deleted, since the new one is still in ContainerCreating. While we could check for a Detaching phase with a SuccessfulDelete reason, I think that state is only shown temporarily in the VirtualMachineInstance status, so a simple name check should be enough (one check would fail, the follow-up would succeed; IIUC the period of this state is short).

What do you think?

Member


Okay, sounds good. I was just making sure I fully understood the issue here. What you are saying makes sense.

This provides a function that, based on the access mode, returns whether
a volume is RWX and whether it is RWO.

Signed-off-by: Alí Felán <[email protected]>
@awels
Member

awels commented Oct 22, 2025

/test all

@alifelan
Author

I'm looking into the e2e failure, and I don't think that was caused by this commit.

The failure here happened when checking the events for the pod (and if we made it here, the pod has finished running, i.e., everything was mounted properly).

------------------------------
CreatePVC multi attach - creates 3 pvcs, attach all 3 to pod, detach all 3 from the pod Block volume mode [pvcCreation, Block]
/home/prow/go/src/github.com/kubevirt/csi-driver/e2e/create-pvc_test.go:376
  STEP: creating a pvc @ 10/22/25 20:21:19.589
  STEP: creating a pod that uses 3 PVCs @ 10/22/25 20:21:19.638
  STEP: Wait for pod to reach a completed phase @ 10/22/25 20:21:19.651
  [FAILED] in [It] - /home/prow/go/src/github.com/kubevirt/csi-driver/e2e/create-pvc_test.go:641 @ 10/22/25 20:21:54.749
  STEP: dumping k8s artifacts to /logs/artifacts/create-pvc_test.go:641 @ 10/22/25 20:21:54.749
  [FAILED] in [JustAfterEach] - /home/prow/go/src/github.com/kubevirt/csi-driver/e2e/e2e-suite_test.go:119 @ 10/22/25 20:21:54.75
• [FAILED] [35.353 seconds]
CreatePVC multi attach - creates 3 pvcs, attach all 3 to pod, detach all 3 from the pod [It] Block volume mode [pvcCreation, Block]
/home/prow/go/src/github.com/kubevirt/csi-driver/e2e/create-pvc_test.go:376
  [FAILED] Expected
      <string>: MapVolume.MapPodDevice failed for volume "pvc-eeddcd3e-4456-48c3-91d0-b0886f0f2702" : rpc error: code = Unknown desc = couldn't find device by serial id
  not to contain substring
      <string>: find device by serial id
  In [It] at: /home/prow/go/src/github.com/kubevirt/csi-driver/e2e/create-pvc_test.go:641 @ 10/22/25 20:21:54.749
  There were additional failures detected.  To view them in detail run ginkgo -vv
------------------------------

I haven't seen a scenario where the serial id shows up later, but I guess it could happen if udev is taking some time? The other possibility I can think of is ControllerPublishVolume returning success when the operation wasn't performed, but the newly introduced return statements all return errors.

I don't see anything pointing to the issue being caused by this PR. I'll retry the test.

@alifelan
Author

/test pull-csi-driver-e2e-k8s

@kubevirt-bot

@alifelan: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

In response to this:

/test pull-csi-driver-e2e-k8s

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@alifelan
Author

Fair, I cannot rerun the test. I think this is unrelated to the PR; can we rerun the test? Is this test known to be flaky?

@awels
Member

awels commented Oct 23, 2025

/test pull-csi-driver-e2e-k8s

@awels
Member

awels commented Oct 23, 2025

I am not aware of the test being flaky, but let's run it again and find out.

@alifelan
Author

It seems like it may be flaky, probably because of udev timing out.

The way I understand it, this test was added to ensure we don't succeed a ControllerPublishVolume operation before the volume is available in the VirtualMachineInstance status. Here we are correctly going through that flow, but the serial id was not yet available, maybe because udev took longer than KCM took to trigger NodeStageVolume.

