Conversation

@moadqassem
Contributor

What this PR does / why we need it:
This pull request introduces support for zone-aware topology in the KubeVirt CSI driver. This enhancement allows the CSI driver to handle storage provisioning and attachment in multi-zone Kubernetes clusters. It integrates with Kubernetes topology-aware volume provisioning to ensure volumes are created in the same zone/region as the requesting node by:

  • Implementing support for TopologyKeys to enable zone/region-specific volume scheduling and attachment
  • Modifying the Controller plugin to handle topology constraints during volume provisioning
  • Updating the Node plugin to include zone information in NodeGetInfo responses (see the sketch after this list)
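
For illustration, a minimal sketch of what a zone-aware NodeGetInfo response could look like, assuming the standard Kubernetes zone label; the NodeService type and its nodeID/zone fields are placeholder names, not the PR's actual code:

```go
package service

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// zoneTopologyKey is the well-known Kubernetes zone label; the driver may
// advertise a different, driver-specific topology key instead.
const zoneTopologyKey = "topology.kubernetes.io/zone"

// NodeService is a placeholder for the driver's node plugin; nodeID and zone
// are assumed to be discovered at startup (e.g. from the node's labels).
type NodeService struct {
	nodeID string
	zone   string
}

// NodeGetInfo reports the node ID and, when known, the node's zone so that
// topology-aware provisioning can pin volumes to the requesting node's zone.
func (n *NodeService) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	resp := &csi.NodeGetInfoResponse{NodeId: n.nodeID}
	if n.zone != "" {
		resp.AccessibleTopology = &csi.Topology{
			Segments: map[string]string{zoneTopologyKey: n.zone},
		}
	}
	return resp, nil
}
```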

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

Support zone and region-aware topologies in the KubeVirt CSI Driver

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jan 14, 2025
@kubevirt-bot

Hi @moadqassem. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@moadqassem moadqassem force-pushed the support-pv-node-affinities branch from 1de4f1d to c37d1ae on January 14, 2025 at 14:23
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jan 14, 2025
@awels
Member

awels commented Jan 14, 2025

/test all

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2025
@moadqassem moadqassem force-pushed the support-pv-node-affinities branch from c37d1ae to 8812d3f on March 30, 2025 at 16:34
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 30, 2025
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mhenriks for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@moadqassem
Contributor Author

moadqassem commented Mar 30, 2025

@awels I have fixed some test files that were failing and updated the PR to resolve a few conflicts. However, there were some other issues which I believe are not related to my PR:

I0114 23:43:26.714262   37496 create-pvc_test.go:608] Event: AttachVolume.Attach failed for volume "tenant-pv" : rpc error: code = Unknown desc = invalid volume name [FailedAttachVolume]

And

 failed to provision volume with StorageClass "kubevirt": rpc error: code = InvalidArgument desc = non-block volume with RWX access mode is not supported

I also took a quick look into the testing process. In order to add an e2e k8s test to make sure that the zone and region are respected, I would have to change a few things in the kubevirtci cluster provider (or add a new provider; I haven't looked deeply, so I'm not sure). This is where it would need to be adjusted to add the labels that point to the allowed topology: https://github.com/kubevirt/kubevirtci/blob/a291de27bb596074c79729ea6f88533555f523fd/cluster-up/cluster/ephemeral-provider-common.sh#L90

Let me know what you think.

@awels
Member

awels commented Apr 2, 2025

/test all

@awels
Member

awels commented Apr 2, 2025

Sorry, I have been very busy with other stuff; I will try to take a look at this soon.

@awels
Member

awels commented Apr 2, 2025

If you look at the testing, we do actually exclude a few tests from the k8s test suite. In particular the RWX filesystem one, since we don't support that.

@moadqassem
Contributor Author

If you look at the testing, we do actually exclude a few tests from the k8s test suite. In particular the RWX filesystem one, since we don't support that.

I see. Alright, let me check these filtering criteria.

@awels
Member

awels commented Apr 2, 2025

https://github.com/kubevirt/csi-driver/blob/main/hack/run-k8s-e2e.sh#L117-L128 is what we use; I believe the last skip is the one that excludes the RWX filesystem test.

@awels
Member

awels commented Apr 3, 2025

Okay, so I think I know what happened here: we made a change in how we identify which VM the volume is hotplugged into. Before, we used the VMI, but that proved to be problematic; now we look at the VM directly. I am not sure if that affects how you look up VMs, but it likely does.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2025
@moadqassem moadqassem force-pushed the support-pv-node-affinities branch from 1701709 to 989e1ea on May 13, 2025 at 10:29
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 13, 2025
@moadqassem
Contributor Author

moadqassem commented May 13, 2025

Okay, so I think I know what happened here: we made a change in how we identify which VM the volume is hotplugged into. Before, we used the VMI, but that proved to be problematic; now we look at the VM directly. I am not sure if that affects how you look up VMs, but it likely does.

Hmm, I don't think I am listing or trying to get any VM, as the code that I wrote runs in the DS, which in general deals with the Node reference rather than the VM itself. The problem is that those tests are failing and I can't really work out what the issue was; as you probably know, the error traces (such as log messages) might come from other controllers which are not tracked. I also noticed other PRs that have the exact same failures. Do you know how we should proceed here?

@awels
Member

awels commented May 13, 2025

/test all

@moadqassem
Contributor Author

moadqassem commented May 13, 2025

@awels After taking a deeper look, it seems that these lines of code broke the VM lookup that fetches VM info from the CSINode object: https://github.com/kubevirt/csi-driver/pull/130/files#diff-6cb93ab3a359a5bd44c54dbb81ac13c78f2051b868087906408564274b5e94bbR337-R349. These new additions changed how we fetch the VMs that the volume attachments should happen on, with respect to how the CSINode node ID is processed (a generic sketch of this kind of lookup follows).
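
For reference, a minimal sketch of the general pattern for resolving a driver's node ID from a CSINode object; the function and variable names are illustrative, not the driver's actual code:

```go
package util

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
)

// nodeIDFromCSINode returns the node ID that the given CSI driver registered
// on this CSINode, or an error if the driver is not listed. How that ID is
// then mapped to a VM is driver-specific and not shown here.
func nodeIDFromCSINode(csiNode *storagev1.CSINode, driverName string) (string, error) {
	for _, d := range csiNode.Spec.Drivers {
		if d.Name == driverName {
			return d.NodeID, nil
		}
	}
	return "", fmt.Errorf("driver %q is not registered on CSINode %q", driverName, csiNode.Name)
}
```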

Signed-off-by: moadqassem <[email protected]>
@moadqassem
Contributor Author

@awels can you please rerun the CI tests? I have tested my recent changes and was able to fix the problem.

@awels
Member

awels commented May 13, 2025

/test all

@awels
Member

awels commented May 13, 2025

/test pull-csi-driver-split-k8s-suite-k8s

@awels
Member

awels commented May 13, 2025

PR looks good. One thing I forgot to ask about: did you enable the topology tests in the k8s test suite? You can modify https://github.com/kubevirt/csi-driver/blob/main/hack/test-driver.yaml and set topology: true, and that should be all that is needed to run the topology tests (a sketch of what that could look like is below).
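
For reference, a minimal sketch of the relevant part of hack/test-driver.yaml, assuming it follows the upstream Kubernetes external-storage test driver format; the driver name and surrounding keys are illustrative:

```yaml
DriverInfo:
  Name: csi.kubevirt.io
  Capabilities:
    # Enables the zone/region topology cases in the k8s e2e storage suite.
    topology: true
```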

Signed-off-by: moadqassem <[email protected]>
@awels
Member

awels commented May 13, 2025

/test all

@awels
Member

awels commented May 14, 2025

/test all

@kubevirt-bot

@moadqassem: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-csi-driver-e2e-k8s 4b2711d link true /test pull-csi-driver-e2e-k8s
pull-csi-driver-split-e2e-k8s 4b2711d link true /test pull-csi-driver-split-e2e-k8s
pull-csi-driver-split-k8s-suite-k8s 4b2711d link true /test pull-csi-driver-split-k8s-suite-k8s

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@awels
Member

awels commented Jun 5, 2025

Looks like we are having an issue with the tests when we enable topology here. I still have not had a chance to look at why.

@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot kubevirt-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2025
@awels
Member

awels commented Sep 3, 2025

/remove-lifecycle stale

@kubevirt-bot kubevirt-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2025

Labels

dco-signoff: yes Indicates the PR's author has DCO signed all their commits.
release-note Denotes a PR that will be considered when it comes time to generate release notes.
size/L
