Description
Describe the bug
I am looking for a way to run a long-running job in a Kubernetes container that can withstand temporary network disconnections (each normally lasts less than 1 second). I attempted to do this using exec. However, the callback function passed to exec is never called if a network disconnect occurs.
To Reproduce
- Call `exec` with a long-running command such as sleep.
- Disconnect the connection to the node (I simulate this by scaling down or deleting the `konnectivity-agent` pods on the server).
- The callback is never invoked.
Expected behavior
An error or some other indication of the disconnect would be helpful. Even better would be a way to re-establish the connection.
If that is not possible, is there a suggested way to run these long-running jobs without them being disrupted when a network disconnect happens? For context, I am using GKE, and these disruptions happen during maintenance events.
**Example Code**
The code that I am using (this is from https://github.com/actions/runner-container-hooks/blob/main/packages/k8s/src/k8s/index.ts#L223), where neither the `resolve` nor the `reject` function is called in the callback:
    await new Promise(function (resolve, reject) {
      exec
        .exec(
          namespace,
          podName,
          containerName,
          command,
          process.stdout,
          process.stderr,
          stdin ?? null,
          false /* tty */,
          resp => {
            // kube.exec returns an error if exit code is not 0, but we can't actually get the exit code
            if (resp.status === 'Success') {
              resolve(resp.code)
            } else {
              reject(resp?.message)
            }
          }
        )
        .catch(e => reject(e)).finally(() => console.log("done"))
    })
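
One possible workaround, sketched under assumptions rather than confirmed against the library: if the promise returned by `exec` resolves to a socket-like object that emits `close` or `error` when the connection drops, the wrapper can also reject on those events instead of waiting only for the status callback. The names `execWithDisconnectDetection`, `SocketLike`, `ExecStatus`, and `StartExec` are hypothetical and only illustrate the idea. A silently half-open connection may emit neither event, in which case a timeout would still be needed.

```typescript
import { EventEmitter } from 'node:events'

// Hypothetical minimal socket shape, so the sketch does not depend on a
// specific WebSocket library.
interface SocketLike {
  once(event: 'close' | 'error', listener: (...args: unknown[]) => void): unknown
}

// Hypothetical shape of the status object passed to the exec callback.
interface ExecStatus {
  status?: string
  message?: string
}

// Hypothetical adapter: starts the exec, forwards the status callback, and
// resolves to the underlying socket (assumption about the client library).
type StartExec = (onStatus: (s: ExecStatus) => void) => Promise<SocketLike>

function execWithDisconnectDetection(startExec: StartExec): Promise<ExecStatus> {
  return new Promise<ExecStatus>((resolve, reject) => {
    startExec(status => {
      if (status.status === 'Success') {
        resolve(status)
      } else {
        reject(new Error(status.message ?? 'exec failed'))
      }
    })
      .then(socket => {
        // A promise settles only once, so these handlers are harmless if the
        // status callback has already resolved or rejected.
        socket.once('error', err =>
          reject(err instanceof Error ? err : new Error(String(err)))
        )
        socket.once('close', () =>
          reject(new Error('connection closed before exec reported a status'))
        )
      })
      .catch(reject)
  })
}

// Usage with a fake socket that drops without ever reporting a status.
const fakeSocket = new EventEmitter()
execWithDisconnectDetection(async () => fakeSocket).catch(err =>
  console.log('exec interrupted:', (err as Error).message)
)
setTimeout(() => fakeSocket.emit('close'), 10)
```

With the real client, `startExec` would wrap the `exec.exec(...)` call shown above and pass `onStatus` as its final callback argument. Whether the socket actually reports the drop in the GKE maintenance scenario is exactly what is in question here, so this only narrows the problem.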