I have converted this into a discussion, I hope you don't mind.
Yes, that's intended to disable scaledown while still allowing live-migrations.
Yes, it uses the `pod-template-hash`. You are right, that did not make it into the docs yet. The way it generally works is that a controller determines whether a pod is eligible for migration (e.g. it is actually enabled and satisfies other preconditions) and then creates a migration object. That object can then be claimed by any new zeropod that is starting up; if it has the same template hash, it will claim the migration.
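A rough sketch of that claim logic, with illustrative type and field names (this is not the actual implementation, just the first-come-first-served matching described above):

```go
package main

import "fmt"

// Migration is a hypothetical stand-in for the migration object the
// controller creates for an eligible pod. Field names are illustrative.
type Migration struct {
	TemplateHash string // pod-template-hash of the source pod
	ClaimedBy    string // name of the pod that claimed it, if any
}

// tryClaim lets a starting zeropod claim the migration, but only if its
// pod-template-hash matches and no other pod claimed it first.
func tryClaim(m *Migration, podName, podHash string) bool {
	if m.ClaimedBy != "" || m.TemplateHash != podHash {
		return false
	}
	m.ClaimedBy = podName
	return true
}

func main() {
	m := &Migration{TemplateHash: "abc123"}
	fmt.Println(tryClaim(m, "web-abc123-x", "abc123")) // true: first matching pod claims it
	fmt.Println(tryClaim(m, "web-abc123-y", "abc123")) // false: already claimed, starts from scratch
	fmt.Println(tryClaim(m, "web-def456-z", "def456")) // false: template hash differs
}
```

This is also why, in the 1-deleted/2-created scenario below, exactly one new pod gets the checkpoint and the others start fresh.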
Yes, both will be live-migrated and the claim system will make sure that each pod finds a matching destination.
The one that first claims the migration (so the first one starting up) will be migrated; the rest will be created from scratch.
I think you could remove the annotation
Good question, I think some minor adjustments would be needed. Currently the system relies on the
I think this is correct; EmptyDir is still missing.
One of the main guarantees of a StatefulSet is that the pod names are static and re-used. So this already makes it kind of difficult to implement this in a way that does not completely break Kubernetes assumptions. I have played with the idea that the old container could be frozen in place and then indicate to containerd/Kubernetes that it exited. This would allow Kubernetes to recreate the pod and then the migration could progress. But it seems quite messy to me as essentially the runtime needs to lie about the container state and then after everything is done we still need to make sure everything is cleaned up properly.
If these allow the old pod to be in a stopping state while the new pod is already being created, it might work. But there would need to be a way to match up the old and new pod so we know they belong to the same
I found this project while searching for options for pod live migration. I don't think I need to scale to zero.
I'm trying to understand if this project will be useful in my case.
I read the readme, and I have a few things I don't understand about the current state of the migration feature and future possibilities.
Please don't read this as a list of feature requests, even though the bottom of the list contains "possible to implement" type questions. This is really just a list of questions; I want to understand the limitations of the technology.
1. Is it possible to use migration without scaling down pods?
What will happen if a pod is deleted while it is active?
The readme mentions that it's possible to disable scaledown by setting `zeropod.ctrox.dev/scaledown-duration=0`. Will the migration feature work at all in this case?
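For reference, here is how I understand those annotations would be combined on a Deployment's pod template. The annotation keys are from the readme; everything else (the `runtimeClassName`, the assumption that `zeropod.ctrox.dev/migrate` takes a container name, the rest of the manifest) is my guess and may be wrong:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
      annotations:
        # never scale down, but (if my question 1 is answered "yes")
        # still allow live-migration
        zeropod.ctrox.dev/scaledown-duration: "0"
        # container(s) to live-migrate; assuming this takes container names
        zeropod.ctrox.dev/migrate: web
    spec:
      runtimeClassName: zeropod  # assuming a zeropod RuntimeClass is installed
      containers:
      - name: web
        image: nginx  # placeholder image
```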
2. How does zeropod decide that it needs to migrate a pod?
The readme says that when migration is enabled, a new pod will fetch the checkpoint from the old pod.
A Reddit post mentions that zeropod uses `pod-template-hash` to check if migration is possible (should this be added to the readme?). But how does zeropod know that migration is required for a certain new pod?
For example:
2.1. Deployment has 2 pods on node N1. Node is being drained, and these pods are deleted. Will they both migrate?
2.2. One pod is deleted and 2 new ones are created. All 3 have the same revision hash. Which one will be migrated and which one will be created from scratch?
2.3. I want to forcefully recreate a pod. I delete it, and deployment creates a new pod. Will it get a migration? If yes, then how do I prevent it?
3. Could it be possible to force migration?
Let's say pod X is running on node N1. I want to migrate it to node N2.
My understanding is that currently there is no configuration for this. But could this be possible at all with current design?
It seems like relying on `pod-template-hash` prevents this. So migration could only be used to rescue pods from a node that is being drained?
4. Which parts of the pod are and aren't migrated?
Here is my current understanding.
Migrated (for all containers included in `zeropod.ctrox.dev/migrate`):
…
Recreated:
… (`zeropod.ctrox.dev/migrate`)
… (`zeropod.ctrox.dev/migrate`)
Obviously, any hostPath volumes will break, and should not be used.
Am I correct here? Am I missing some important state in the lists?
5. Could it be possible to use migration for something other than deployments?
In the readme I see that migration currently only works for deployments.
But it would be nice to migrate pods of StatefulSet applications. Will it work? Is it possible to implement?
There are also custom pod controllers that could benefit from migration. For example, Postgres Cluster from CNPG, or CloneSet (=deployment with custom fields) from OpenKruise.
Could migration work with custom controllers?