-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[23.1] Remove rollback from __check_jobs_at_startup
#17085
[23.1] Remove rollback from __check_jobs_at_startup
#17085
Conversation
57e8fe6
to
62c153a
Compare
I think the premise that we want to do a rollback on exceptions in this method is wrong (it **may** be correct apprach in other places in the codebase e.g. in `Tool.handle_single_execution()`). Here it prevents us from comitting anything inside the with statement (as the job_wrapper.fail method does). Here's the simplified issue: ```shell ❯ python -i scripts/db_shell.py -c config/galaxy.yml >>> with sa_session() as session, session.begin(): ... sa_session.execute(update(Job).where(Job.id == 1).values(state="error")) ... sa_session.commit() ... sa_session.execute(update(Job).where(Job.id == 1).values(state="ok")) ... sa_session.commit() ... <sqlalchemy.engine.cursor.LegacyCursorResult object at 0x11f1be350> Traceback (most recent call last): File "<stdin>", line 4, in <module> File "<string>", line 2, in execute File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1711, in execute conn = self._connection_for_bind(bind, close_with_result=True) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind TransactionalContext._trans_ctx_check(self) File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/engine/util.py", line 199, in _trans_ctx_check raise exc.InvalidRequestError( sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager. Please complete the context manager before emitting further commands. ``` It is probably still worthwhile to have the job recovery be minimal and do things such as calling the job wrapper fail method that does actual work to the job handler as in galaxyproject#17083, but that's refactoring that can be done on the dev branch and it still seems risky in the sense that we then need to be very careful in ensuring we don't commit anywhere else inside the scope of the begin() statement. Finally I don't think it makes sense that the startup check should ever cause the boot process to fail. This isn't a misconfiguration or even anything catastrophic for the remaining jobs and places unnecessary stress on admins and can basically break at any time and shouldn't cause a complete service failure. Fixes galaxyproject#17079
62c153a
to
da387d4
Compare
Test errors are unrelated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Marius!
@mvdbeek I don't think this is correct. The previous version rolled back one transaction with one job in case of an exception and did not proceed. The new version ignores the error and commits everything - including whatever was related to the exception. In other words, previous version: This version: What we need, I suppose, is this: Here's a gist that demonstrates the problem and (I think) the solution: |
This is correct. Have you read the PR description and the example given ?
Yes, this is correct (https://gist.github.com/jdavcs/1402a63c8afc9150646e809f3d0ae2d9?permalink_comment_id=4774237#gistcomment-4774237). |
Yes, I read the PR description. This logic commits all the changes unconditionally, even if some may have caused an exception - which doesn't seem right? The PR description says that the rollback prevents us from committing anything inside the |
Well, I did say
and
which I don't see acknowledged. Why would we keep doing the rollback on every startup ? |
Yes, you said it's wrong, but the justification didn't seem right to me - which is why I questioned it. I replied to your comment on the gist here. We might do a rollback on every startup because if we make a change to the database within a transaction, and something causes an exception, I suppose we don't want that change to be written to the database? We either fail fast (which is what we don't want to do), or we rollback the changes that may have led to the exception.
We are discussing it, right? |
I suppose we do, so that we don't keep on trying to commit and rollback the same exception on startup.
With fail fast you mean no try/except, or to keep running into the same exception on every startup ?
It's not changes we've made in the recover method that are leading to an exception, it's the state of the job that leads to the exception. If there's an exception recovering the job -- that's it, we're done. We're not ending up with some weird database state that we have to prevent from reaching. The example in #17079 is a missing job directory, and I have a hard time coming up with a scenario where would not want to commit changes that were made ... and ideally we can just batch that in one commit. |
OK, I think I understand now why we want to commit for each job. I guess I didn't think of the cause of the exception as something that has already happened to the job prior to the check on startup (e.g. like the missing job dir). Thank you for the detailed explanation - this was helpful! |
I think the premise that we want to do a rollback on exceptions in this method is wrong (it may be correct apprach in other places in the codebase e.g. in
Tool.handle_single_execution()
). Here it prevents us from comitting anything inside the with statement (as the job_wrapper.fail method does).Here's the simplified issue:
It is probably still worthwhile to have the job recovery be minimal and do things such as calling the job wrapper fail method that does actual work to the job handler as in
#17083, but that's refactoring that can be done on the dev branch and it still seems risky in the sense that we then need to be very careful in ensuring we don't commit anywhere else inside the scope of the begin() statement.
Finally I don't think it makes sense that the startup check should ever cause the boot process to fail. This isn't a misconfiguration or even anything catastrophic for the remaining jobs and places unnecessary stress on admins and can basically break at any time and shouldn't cause a complete service failure.
Fixes #17079
How to test the changes?
(Select all options that apply)
License