From aeb43d135474e1331068b2f054bcb836effd140a Mon Sep 17 00:00:00 2001 From: sushant-suse Date: Tue, 6 Jan 2026 12:07:49 +0530 Subject: [PATCH 1/6] docs(kb): for migratable RWX volume stuck in detaching loop Signed-off-by: sushant-suse --- ...able-rwx-volume-stuck-in-detaching-loop.md | 114 ++++++++++++++++++ 1 file changed, 114 insertions(+) create mode 100644 content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md diff --git a/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md b/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md new file mode 100644 index 000000000..14422e1e8 --- /dev/null +++ b/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md @@ -0,0 +1,114 @@ +--- +title: "Troubleshooting: Migratable RWX volume stuck in detaching/attaching loop" +authors: +- "Sushant Gaurav" +draft: false +date: 2026-01-06 +versions: +- ">= v1.4.2 and <= v1.7.0" +categories: +- "Migratable RWX volume" +--- + +## Applicable versions + +**Confirmed working with**: + +- Longhorn `v1.4.2` (Harvester `v1.1.x` to `v1.2.x` upgrades) + +**Potentially applicable to**: + +- Longhorn versions prior to `v1.7.0` +- Environments using Migratable RWX volumes with VM live migration + +## Symptoms + +During a VM live migration or a cluster upgrade, a volume becomes stuck in an endless loop of flipping between `attaching` and `detaching` states. Unlike standard migration hangs where the volume stays `attached`, this loop prevents the volume from being used or cleanly detached. + +**Example volume state**: The volume remains stuck in `detaching` even if no workload is running. + +```bash +$ kubectl get volume -n longhorn-system pvc-840804d8-6f11-49fd-afae-54bc5be639de +NAME STATE ROBUSTNESS NODE +pvc-840804d8-6f11-49fd-afae-54bc5be639de detaching unknown ubuntu-lh-2 +``` + +**Longhorn Manager Logs**: The logs on the volume owner node will show failures during the migration finalization phase. 
The controller is unable to find the engine to complete the switch: + +```text +level=warning msg="Failed to finalize the migration" controller=longhorn-volume error="cannot find the current engine for the switching after iterating and cleaning up all engines... all engines may be detached or in a transient state" +level=warning msg="Waiting to confirm migration until migration engine is ready" controller=longhorn-volume-attachment +``` + +**VolumeAttachment (LHVA) state**: Describing the `volumeattachments.longhorn.io` (LHVA) reveals that the `Spec.Attachment Tickets` and `Status.Attachment Ticket Statuses` are **empty**, yet the resource remains stuck due to a finalizer. + +```yaml +Name: pvc-840804d8-6f11-49fd-afae-54bc5be639de +Namespace: longhorn-system +Kind: VolumeAttachment +Metadata: + Finalizers: + longhorn.io +Spec: + Attachment Tickets: + Volume: pvc-840804d8-6f11-49fd-afae-54bc5be639de +Status: + Attachment Ticket Statuses: +``` + +## Reason + +This issue occurs when a live migration is interrupted—often by powering off the VM, a node failure, or an upgrade interruption—specifically during the "engine switching" phase. + +Longhorn expects to switch the frontend from the source engine to the destination engine. If the workload is stopped during this transition, the engines may vanish, leaving the Volume Controller unable to find a "current" engine to finalize the switch. Because the `VolumeAttachment` CR still exists and holds a finalizer, the controller enters a reconciliation loop it cannot complete, causing the flapping state. + +## Workaround + +If the workload has been shut down and the volume is stuck flapping, follow these steps to manually clear the migration metadata and "ghost" attachment. + +### 1. Clear the Migration Metadata + +Force the volume to drop the migration reference in its status subresource. This stops the controller from attempting to finalize a non-existent migration. 
```bash
kubectl patch -n longhorn-system volume <volume-name> \
  --type=merge \
  --subresource status \
  -p '{"status":{"currentMigrationNodeID":""}}'
```

### 2. Remove the VolumeAttachment Finalizer

The "ghost" LHVA prevents the volume from reaching a steady `detached` state. Manually remove the finalizer to allow the resource to be cleaned up.

```bash
kubectl patch -n longhorn-system volumeattachments.longhorn.io <volume-name> \
  --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```

### 3. Delete the Orphaned LHVA

If the resource does not disappear automatically after stripping the finalizer, delete it manually:

```bash
kubectl delete volumeattachments.longhorn.io <volume-name> -n longhorn-system
```

### 4. Verify State

Confirm the volume has transitioned to the `detached` state.

```bash
$ kubectl get volume -n longhorn-system
NAME                                       STATE      ROBUSTNESS   NODE
pvc-840804...                              detached   unknown
```

You can now safely restart the VM or workload.

## Related Information

- [KB: Troubleshooting: Migratable RWX volume migration stuck](https://longhorn.io/kb/troubleshooting-rwx-volume-migration-stuck/) - For cases where migration tickets are present and "Satisfied" but the node is stuck in pre-drain.
- [Longhorn Issue #12238](https://github.com/longhorn/longhorn/issues/12238)
- Fixed in **Longhorn v1.7.0+**, which includes more robust handling for orphaned migration engines.
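The two `kubectl patch --type=merge` commands above work through JSON merge-patch semantics (RFC 7386): patching a field with `""` overwrites its value in place, while patching it with `null` deletes the key outright, which is how the `longhorn.io` finalizer is dropped. A minimal Python sketch of those semantics, using illustrative sample objects (the `merge_patch` helper is not Longhorn or Kubernetes code):

```python
def merge_patch(target, patch):
    """Apply an RFC 7386 JSON merge-patch: a null value deletes the key,
    nested objects merge recursively, and anything else replaces the value."""
    if not isinstance(patch, dict):
        return patch
    if not isinstance(target, dict):
        target = {}
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null removes the field entirely
        else:
            result[key] = merge_patch(result.get(key), value)
    return result

# Step 1: clear the stale migration reference in the Volume status.
volume_status = {"currentMigrationNodeID": "ubuntu-lh-2", "state": "detaching"}
volume_status = merge_patch(volume_status, {"currentMigrationNodeID": ""})
print(repr(volume_status["currentMigrationNodeID"]))  # ''

# Step 2: strip the longhorn.io finalizer from the VolumeAttachment metadata.
lhva_metadata = {"name": "pvc-840804d8", "finalizers": ["longhorn.io"]}
lhva_metadata = merge_patch(lhva_metadata, {"finalizers": None})
print("finalizers" in lhva_metadata)  # False
```

Note that a merge patch cannot remove a single entry from a list; `"finalizers": null` drops the whole field, which is the intended effect here.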
From af21ef93ba5c9849f744a7ca235e55e84438b942 Mon Sep 17 00:00:00 2001 From: sushant-suse Date: Wed, 7 Jan 2026 11:05:06 +0530 Subject: [PATCH 2/6] docs: fixed grammar Signed-off-by: sushant-suse --- ...bleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md b/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md index 14422e1e8..557d6f066 100644 --- a/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md +++ b/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md @@ -68,7 +68,7 @@ If the workload has been shut down and the volume is stuck flapping, follow thes ### 1. Clear the Migration Metadata -Force the volume to drop the migration reference in its status subresource. This stops the controller from attempting to finalize a non-existent migration. +Force the volume to drop the migration reference in its status subresource. This stops the controller from attempting to finalize a nonexistent migration. 
```bash
kubectl patch -n longhorn-system volume <volume-name> \

From 631dd4ed4c565737fdc45818a48277f7745e9d53 Mon Sep 17 00:00:00 2001
From: sushant-suse
Date: Tue, 13 Jan 2026 17:43:13 +0530
Subject: [PATCH 3/6] docs: updated as per Derek's comments

Signed-off-by: sushant-suse
---
 ...able-rwx-volume-stuck-in-detaching-loop.md | 58 +++++--------------
 1 file changed, 16 insertions(+), 42 deletions(-)

diff --git a/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md b/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md
index 557d6f066..8f8c0f716 100644
--- a/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md
+++ b/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md
@@ -5,22 +5,11 @@ authors:
 draft: false
 date: 2026-01-06
 versions:
-- ">= v1.4.2 and <= v1.7.0"
+- "< v1.10.0"
 categories:
 - "Migratable RWX volume"
 ---

-## Applicable versions
-
-**Confirmed working with**:
-
-- Longhorn `v1.4.2` (Harvester `v1.1.x` to `v1.2.x` upgrades)
-
-**Potentially applicable to**:
-
-- Longhorn versions prior to `v1.7.0`
-- Environments using Migratable RWX volumes with VM live migration
-
 ## Symptoms

 During a VM live migration or a cluster upgrade, a volume becomes stuck in an endless loop of flipping between `attaching` and `detaching` states. Unlike standard migration hangs where the volume stays `attached`, this loop prevents the volume from being used or cleanly detached.
@@ -56,17 +45,26 @@ Status:
   Attachment Ticket Statuses:
-## Reason
+## Root Cause
+
+This issue occurs when a Migratable RWX live migration is interrupted, commonly due to a VM being powered off, a node failure, or an upgrade event, specifically during the engine switching phase of the migration.
+During this phase, Longhorn expects to switch the frontend from the source engine to the destination engine. If the workload is stopped while this transition is in progress, both engines may be cleaned up or enter transient states. -Longhorn expects to switch the frontend from the source engine to the destination engine. If the workload is stopped during this transition, the engines may vanish, leaving the Volume Controller unable to find a "current" engine to finalize the switch. Because the `VolumeAttachment` CR still exists and holds a finalizer, the controller enters a reconciliation loop it cannot complete, causing the flapping state. +As a result: + +- The Volume object retains a non-empty `status.currentMigrationNodeID`. +- The Volume Controller continues attempting to finalize a migration that no longer exists. +- The controller cannot identify a valid current engine. +- The volume enters an endless attach/detach reconciliation loop. + +In this scenario, the presence of a `VolumeAttachment` resource is a symptom rather than the root cause. ## Workaround -If the workload has been shut down and the volume is stuck flapping, follow these steps to manually clear the migration metadata and "ghost" attachment. +If the workload or VM has already been shut down and the volume is stuck flapping, manually clear the stale migration metadata from the Volume status. -### 1. Clear the Migration Metadata +### 1. Clear the `volume.status.currentMigrationNodeID` Force the volume to drop the migration reference in its status subresource. This stops the controller from attempting to finalize a nonexistent migration. @@ -77,25 +75,7 @@ kubectl patch -n longhorn-system volume \ -p '{"status":{"currentMigrationNodeID":""}}' ``` -### 2. Remove the VolumeAttachment Finalizer - -The "ghost" LHVA prevents the volume from reaching a steady `detached` state. Manually remove the finalizer to allow the resource to be cleaned up. 
- -```bash -kubectl patch -n longhorn-system volumeattachments.longhorn.io \ - --type=merge \ - -p '{"metadata":{"finalizers":null}}' -``` - -### 3. Delete the Orphaned LHVA - -If the resource does not disappear automatically after stripping the finalizer, delete it manually: - -```bash -kubectl delete volumeattachments.longhorn.io -n longhorn-system -``` - -### 4. Verify State +### 2. Verify State Confirm the volume has transitioned to the `detached` state. @@ -106,9 +86,3 @@ pvc-840804... detached unknown ``` You can now safely restart the VM or workload. - -## Related Information - -- [KB: Troubleshooting: Migratable RWX volume migration stuck](https://longhorn.io/kb/troubleshooting-rwx-volume-migration-stuck/) - For cases where migration tickets are present and "Satisfied" but the node is stuck in pre-drain. -- [Longhorn Issue #12238](https://github.com/longhorn/longhorn/issues/12238) -- Fixed in **Longhorn v1.7.0+**, which includes more robust handling for orphaned migration engines. From 2dc954f2c2e8834226e4ea83fb74ca5636132d7d Mon Sep 17 00:00:00 2001 From: sushant-suse Date: Wed, 14 Jan 2026 10:29:00 +0530 Subject: [PATCH 4/6] docs: updated intro and added back references Signed-off-by: sushant-suse --- ...oting-migratable-rwx-volume-stuck-in-detaching-loop.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md b/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md index 8f8c0f716..08bb82905 100644 --- a/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md +++ b/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md @@ -12,7 +12,7 @@ categories: ## Symptoms -During a VM live migration or a cluster upgrade, a volume becomes stuck in an endless loop of flipping between `attaching` and `detaching` states. 
Unlike standard migration hangs where the volume stays `attached`, this loop prevents the volume from being used or cleanly detached. +During a VM live migration or a cluster upgrade, a volume becomes stuck in an endless loop of flipping between `attaching` and `detaching` states. **Example volume state**: The volume remains stuck in `detaching` even if no workload is running. @@ -86,3 +86,9 @@ pvc-840804... detached unknown ``` You can now safely restart the VM or workload. + +## References + +- [KB: Troubleshooting: Migratable RWX volume migration stuck](https://longhorn.io/kb/troubleshooting-rwx-volume-migration-stuck/) - For cases where migration tickets are present and "Satisfied" but the node is stuck in pre-drain. +- [Longhorn Issue #12238](https://github.com/longhorn/longhorn/issues/12238), and [Longhorn Issue #11479](https://github.com/longhorn/longhorn/issues/11479). +- Fixed in **Longhorn v1.7.0+**, which includes more robust handling for orphaned migration engines. From 2bd8ebc39efe0c0bd796804daac053461833d7cc Mon Sep 17 00:00:00 2001 From: sushant-suse Date: Wed, 14 Jan 2026 18:14:14 +0530 Subject: [PATCH 5/6] docs: updated references Signed-off-by: sushant-suse --- ...shooting-migratable-rwx-volume-stuck-in-detaching-loop.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md b/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md index 08bb82905..14371c5b0 100644 --- a/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md +++ b/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md @@ -89,6 +89,5 @@ You can now safely restart the VM or workload. ## References -- [KB: Troubleshooting: Migratable RWX volume migration stuck](https://longhorn.io/kb/troubleshooting-rwx-volume-migration-stuck/) - For cases where migration tickets are present and "Satisfied" but the node is stuck in pre-drain. 
-- [Longhorn Issue #12238](https://github.com/longhorn/longhorn/issues/12238), and [Longhorn Issue #11479](https://github.com/longhorn/longhorn/issues/11479). -- Fixed in **Longhorn v1.7.0+**, which includes more robust handling for orphaned migration engines. +- [Longhorn Issue #12238](https://github.com/longhorn/longhorn/issues/12238) +- [Longhorn Issue #11479](https://github.com/longhorn/longhorn/issues/11479) From 917b568626564d553ac127184e3ba25d0d4432a4 Mon Sep 17 00:00:00 2001 From: sushant-suse Date: Thu, 15 Jan 2026 23:22:02 +0530 Subject: [PATCH 6/6] docs: updated Symptoms Signed-off-by: sushant-suse --- ...ng-migratable-rwx-volume-stuck-in-detaching-loop.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md b/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md index 14371c5b0..5d1ffafe8 100644 --- a/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md +++ b/content/kb/troubleshooting-migratable-rwx-volume-stuck-in-detaching-loop.md @@ -12,7 +12,15 @@ categories: ## Symptoms -During a VM live migration or a cluster upgrade, a volume becomes stuck in an endless loop of flipping between `attaching` and `detaching` states. +During a VM live migration or a cluster upgrade, a Migratable RWX volume may become stuck in an infinite reconciliation loop. While the volume appears to be unused, it fails to stay in a stable `detached` state, preventing any new workload from attaching to it. + +**Observed Behavior**: + +- **State Flapping**: The volume state continuously flips between `detached` and `detaching`. + - When an attach is attempted, Longhorn updates `status.currentNodeID`. + - Because a migration is internally marked as `"in-progress"` (due to stale metadata), Longhorn immediately tries to transition the volume to `detaching` to clean up, then back to `detached`. 
+- **Metadata Mismatch**: The Volume `Spec.MigrationNodeID` is empty (`""`), but `Status.CurrentMigrationNodeID` still holds the ID of a previous migration target node. +- **Missing Resources**: Associated Kubernetes `VolumeAttachment` objects have been removed, yet the Longhorn Volume object behaves as if a migration finalization is required. **Example volume state**: The volume remains stuck in `detaching` even if no workload is running.
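The metadata mismatch described in the Symptoms list above reduces to a single predicate: `spec.migrationNodeID` is empty while `status.currentMigrationNodeID` still names a node. As an illustrative sketch (not Longhorn code), the check could be run against the JSON form of a Volume object; the camelCase field names mirror the `Spec.MigrationNodeID`/`Status.CurrentMigrationNodeID` fields quoted above, and the sample objects are hypothetical:

```python
def has_stale_migration(volume: dict) -> bool:
    """True when a Volume carries leftover migration metadata:
    spec.migrationNodeID is empty, but status.currentMigrationNodeID
    still holds the ID of a previous migration target node."""
    spec_node = volume.get("spec", {}).get("migrationNodeID", "")
    status_node = volume.get("status", {}).get("currentMigrationNodeID", "")
    return spec_node == "" and status_node != ""

# Hypothetical Volume objects shaped like `kubectl get ... -o json` output.
stuck = {
    "metadata": {"name": "pvc-840804d8-6f11-49fd-afae-54bc5be639de"},
    "spec": {"migrationNodeID": ""},
    "status": {"currentMigrationNodeID": "ubuntu-lh-2", "state": "detaching"},
}
healthy = {
    "spec": {"migrationNodeID": ""},
    "status": {"currentMigrationNodeID": "", "state": "detached"},
}
print(has_stale_migration(stuck))    # True
print(has_stale_migration(healthy))  # False
```

A volume flagged by such a check would be a candidate for the `currentMigrationNodeID` clearing step described in the workaround.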