Arguably this is desired: if the job handler takes too long to respond, it is 'dead'. It would be good to add heartbeat checks at appropriate points so that large but still-progressing tasks can run without having to arbitrarily increase the heartbeat timeout.
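A rough sketch of what "heartbeat checks at appropriate points" could look like, in case it helps the discussion. Everything here is hypothetical (`Heartbeat`, `process_items`, and the `every` parameter are illustrative names, not the project's actual API): the long task calls `beat()` at checkpoints in its own loop, so the timeout only needs to cover the gap between two checkpoints rather than the whole task.

```python
import time


class Heartbeat:
    """Hypothetical helper: records the last time the worker proved it was alive."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds allowed between two beats
        self._clock = clock         # injectable so the logic can be tested
        self.last_beat = clock()

    def beat(self):
        # Called by the worker at a checkpoint to signal progress.
        self.last_beat = self._clock()

    def is_alive(self):
        # What a supervisor (or a liveness probe) would check.
        return self._clock() - self.last_beat <= self.timeout


def process_items(items, handle, heartbeat, every=100):
    """Run handle() over items, beating after every `every` items."""
    results = []
    for count, item in enumerate(items, start=1):
        results.append(handle(item))
        if count % every == 0:
            heartbeat.beat()        # checkpoint inside the long-running task
    heartbeat.beat()                # final beat once the task completes
    return results
```

This only helps when the long task is a loop we control. A single blocking call that never returns to Python code between checkpoints would still starve the heartbeat and would need a different approach.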
We recently ran into a problem where the heartbeat was blocked and k8s thought the pod was dead. Its attempt to recover then caused a real failure, even though things were actually fine. In our case a long-running pysam call was blocking the heartbeat; more discussion here: #11558
It might be that the heartbeat just isn't a great metric for liveness, at least for some of our processes.