Galaxy job handlers may crash during startup #17079
Comments
wow I feel very proud, my jobs are so bad they alone crash a handler :D
The jobs I'm currently trying to run are supposed to re-use existing jobs if possible. That's the main setting I've changed and am manually enabling when starting my workflow. (Please feel free to cancel all of my jobs if it helps.)
@bgruening @sanjaysrikakulam @mira-miracoli If you need to restart a handler, then you will have to evict @hexylena's jobs from it. This is a huge problem: since @hexylena's jobs can go to any handler, Jenkins cannot run, and we also have a cron job that restarts the handlers from time to time that needs to stay disabled (use
If you don't mind, I am disabling job re-use for you to see what happens.
@hexylena Congratulations 🎉, you found the bug! Disabling job reuse fixes the problem. @bgruening @sanjaysrikakulam @mira-miracoli this means we have to disable this feature for everyone somehow.
Using the job cache can cause job handlers to crash during startup, see galaxyproject/galaxy#17079.
Just to clarify: by job re-use, do you mean the job-cache feature?
Exactly.
I don't think that's the job cache's fault, the problem is with the transaction getting closed in another thread. This would happen with any other job that crashes because of a bad TPV rule. |
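The general failure mode here, a database handle or transaction opened in one thread and then touched from another, can be illustrated with a stdlib-only sketch. This uses `sqlite3` as a stand-in for Galaxy's SQLAlchemy session (all names are illustrative, not Galaxy code):

```python
import sqlite3
import threading

# sqlite3 connections are bound to their creating thread by default
# (check_same_thread=True), much like a transaction that one thread
# opens and another thread then tries to operate on.
conn = sqlite3.connect(":memory:")
errors = []

def worker():
    try:
        conn.execute("SELECT 1")  # used from a different thread
    except sqlite3.ProgrammingError as exc:
        errors.append(str(exc))

t = threading.Thread(target=worker)
t.start()
t.join()
print(errors[0])  # the cross-thread use is rejected
```

The driver-level guard makes the problem loud; with a shared ORM session the symptom can instead surface later as a transaction that was silently closed under another thread's feet.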
This might be the actual problem:
from https://sentry.galaxyproject.org/share/issue/da111795b80240cb90069cbd314b22b2/
and the transaction error should've been fixed in https://github.com/galaxyproject/galaxy/pull/16687/files. Can you confirm you got that commit on the handler that failed?
This avoids committing the transaction and then having it fail on the next iteration. Should fix the app startup failure in galaxyproject#17079.
I think the premise that we want to do a rollback on exceptions in this method is wrong (it **may** be the correct approach in other places in the codebase, e.g. in `Tool.handle_single_execution()`). Here it prevents us from committing anything inside the with statement (as the `job_wrapper.fail` method does). Here's the simplified issue:

```shell
❯ python -i scripts/db_shell.py -c config/galaxy.yml
>>> with sa_session() as session, session.begin():
...     sa_session.execute(update(Job).where(Job.id == 1).values(state="error"))
...     sa_session.commit()
...     sa_session.execute(update(Job).where(Job.id == 1).values(state="ok"))
...     sa_session.commit()
...
<sqlalchemy.engine.cursor.LegacyCursorResult object at 0x11f1be350>
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "<string>", line 2, in execute
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1711, in execute
    conn = self._connection_for_bind(bind, close_with_result=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind
    TransactionalContext._trans_ctx_check(self)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/engine/util.py", line 199, in _trans_ctx_check
    raise exc.InvalidRequestError(
sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager. Please complete the context manager before emitting further commands.
```

It is probably still worthwhile to keep the job recovery minimal and move things such as calling the job wrapper's fail method, which does actual work, to the job handler as in galaxyproject#17083, but that's refactoring that can be done on the dev branch, and it still seems risky in the sense that we then need to be very careful to ensure we don't commit anywhere else inside the scope of the begin() statement.
Finally, I don't think it makes sense for the startup check to ever cause the boot process to fail. This isn't a misconfiguration or anything catastrophic for the remaining jobs; it places unnecessary stress on admins, can basically break at any time, and shouldn't cause a complete service failure. Fixes galaxyproject#17079.
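The interaction above can be modeled without SQLAlchemy at all. This toy transaction context manager (all names hypothetical, a sketch of the semantics only) shows why an explicit `commit()` inside a `begin()`-style block breaks the next statement, while committing step by step without an enclosing block is fine:

```python
class Transaction:
    """Toy stand-in for a session's begin() context manager."""

    def __init__(self):
        self.closed = False

    def execute(self, stmt):
        if self.closed:
            # mirrors sqlalchemy.exc.InvalidRequestError
            raise RuntimeError(
                "Can't operate on closed transaction inside context manager"
            )
        return f"executed: {stmt}"

    def commit(self):
        self.closed = True  # ends the transaction

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        if not self.closed:
            self.commit()  # the block is meant to commit exactly once, on exit


# Inside the block, an early commit() closes the transaction,
# so the next execute() fails -- the failure mode shown above.
with Transaction() as tx:
    tx.execute("UPDATE job SET state='error' WHERE id=1")
    tx.commit()
    try:
        tx.execute("UPDATE job SET state='ok' WHERE id=1")
    except RuntimeError as e:
        print(e)

# Without the enclosing block, each step gets its own transaction,
# so intermediate commits (as job_wrapper.fail would issue) are safe.
for stmt in ("UPDATE job SET state='error' WHERE id=1",
             "UPDATE job SET state='ok' WHERE id=1"):
    step = Transaction()
    print(step.execute(stmt))
    step.commit()
```

This is why dropping the `begin()` wrapper (rather than adding rollbacks) resolves the conflict between the recovery loop and methods that commit internally.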
We have that commit.

```shell
$ git branch --contains aedbb393e9fc6a3282cd020fe9e0786c10ca8937 | grep release_23.1_europe
* release_23.1_europe
```
Yes.
Thanks @mvdbeek btw, this is working :)
Describe the bug
Jobs from @hexylena crash Galaxy job handlers during startup on usegalaxy.eu.
Whenever a job submitted by @hexylena is assigned to a handler (e.g. `handler_sn06_0`), the handler fails to start with an error message. Reassigning the jobs to a handler that is already running, like this:

```shell
for job in "${jobs[@]}"; do
  gxadmin mutate restart-jobs --commit $job &&
  gxadmin mutate reassign-job-to-handler $job handler_sn06_1 --commit
done
```

does not cause any problems; the jobs just run normally.

Galaxy Version and/or server at which you observed the bug
Galaxy Version: 23.1 (minor 2.dev0)
Commit: e22e87a4c0aa35255e7605ef5a33ca82346b23a (from the usegalaxy-eu/galaxy fork)
To Reproduce
Steps to reproduce the behavior:

```shell
systemctl restart galaxy-handler@x
```
Expected behavior
Job handlers start normally regardless of which user's jobs they grab.
Screenshots
N/A
Additional context
@hexylena, if you believe something related to your account could have an influence, edit this section and add the information.
@hexylena wrote: