
Job handler delays heartbeat when scheduling large tasks #10894

Open
innovate-invent opened this issue Dec 10, 2020 · 3 comments
Comments

@innovate-invent
Contributor

Arguably this is desired: if the job handler takes too long to respond, it is considered 'dead'. It would be good to add heartbeat checks at appropriate points so that large but still-progressing tasks can complete without having to arbitrarily increase the heartbeat timeout.
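A minimal sketch of the idea, assuming a file-mtime-based heartbeat (the names `touch_heartbeat` and `schedule_collection` are illustrative, not Galaxy's actual API): refresh the heartbeat between elements of a long scheduling loop, rather than only once the whole batch is done.

```python
import os
import time


def touch_heartbeat(path):
    # Update the heartbeat file's mtime so an external monitor sees activity.
    with open(path, "a"):
        os.utime(path, None)


def schedule_collection(elements, heartbeat_path, interval=5.0):
    # Schedule a large collection element by element, refreshing the
    # heartbeat at most every `interval` seconds along the way, instead of
    # only after the entire batch has been scheduled.
    last_beat = 0.0
    touches = 0
    for element in elements:
        # ... per-element scheduling work would happen here ...
        now = time.monotonic()
        if now - last_beat >= interval:
            touch_heartbeat(heartbeat_path)
            last_beat = now
            touches += 1
    return touches
```

With this shape, a collection of any size keeps the heartbeat fresh as long as individual elements schedule quickly, so the timeout only needs to cover one element's worth of work rather than the whole map.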

@innovate-invent innovate-invent mentioned this issue Jan 27, 2021
@innovate-invent
Contributor Author

I had to disable health checks for handlers, as the heartbeat can be delayed for over 5 minutes when mapping over very large collections.

@dannon
Member

dannon commented Mar 18, 2021

We recently ran into a problem with the heartbeat being blocked and k8s thinking the pod was dead (which would then actually cause a failure by trying to recover, when things were really fine). In our case it was a long-running pysam call blocking the heartbeat -- more discussion here: #11558

It might be that the heartbeat just isn't a great metric for liveness, at least for some of our processes.

@innovate-invent
Contributor Author

Ah! I wonder if the local job runner blocks the heartbeat too.

innovate-invent pushed a commit to brinkmanlab/galaxy-container that referenced this issue Mar 23, 2021