
Job handler delays heartbeat when scheduling large tasks #10894

Open
innovate-invent opened this issue Dec 10, 2020 · 3 comments
Comments

@innovate-invent
Contributor

Arguably this is desired: if the job handler takes too long to respond, it is considered 'dead'. It would be good to add heartbeat checks at appropriate points so that large but still-progressing tasks can complete without having to arbitrarily increase the heartbeat timeout.
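A minimal sketch of the idea, assuming a file-mtime-based heartbeat (the names `touch_heartbeat` and `schedule_collection` are illustrative, not Galaxy's actual API): refresh the heartbeat between elements of a long scheduling loop, rather than only once the whole batch is done.

```python
import os
import time


def touch_heartbeat(path):
    # Update the heartbeat file's mtime so an external monitor sees activity.
    with open(path, "a"):
        os.utime(path, None)


def schedule_collection(elements, heartbeat_path, interval=5.0):
    # Schedule a large collection element by element, refreshing the
    # heartbeat at most every `interval` seconds along the way, instead of
    # only after the entire batch has been scheduled.
    last_beat = 0.0
    touches = 0
    for element in elements:
        # ... per-element scheduling work would happen here ...
        now = time.monotonic()
        if now - last_beat >= interval:
            touch_heartbeat(heartbeat_path)
            last_beat = now
            touches += 1
    return touches
```

With this shape, a collection of any size keeps the heartbeat fresh as long as individual elements schedule quickly, so the timeout only needs to cover one element's worth of work rather than the whole map.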

@innovate-invent innovate-invent mentioned this issue Jan 27, 2021
@innovate-invent
Contributor Author

I had to disable health checks for handlers, as the heartbeat can be delayed for over 5 minutes when mapping over very large collections.

@dannon
Member

dannon commented Mar 18, 2021

We recently ran into a problem with the heartbeat being blocked and k8s thinking the pod was dead (which would then actually cause a failure by trying to recover, when things were really fine). In our case it was a long-running pysam call blocking the heartbeat -- more discussion here: #11558

It might be that the heartbeat just isn't a great metric for liveness, at least for some of our processes.

@innovate-invent
Contributor Author

Ah! I wonder if the local job runner blocks the heartbeat too.

innovate-invent pushed a commit to brinkmanlab/galaxy-container that referenced this issue Mar 23, 2021