This repository has been archived by the owner on Apr 26, 2024. It is now read-only.
Emphasize the necessity of remote result storage in README #82
The current README recommends using remote result storage only if special Prefect features such as caching are used.
However, in edge cases, not using remote result storage can lead to errors that are very hard to debug, even when no Prefect feature that requires remote result persistence is in use. Consequently, the README should emphasize that remote result storage is necessary.
Explanation:
Tasks manage their own state in Prefect Cloud. Right before a task ends and returns its result, it calls the Prefect API to signal that it is Completed. For a brief moment the task is therefore still technically running but already marked as Completed. If Ray kills the worker running the task during that window, i.e. after it was marked as Completed but before it returned its result (e.g. due to an OOM, which may even be caused by another remote function on the same node), Ray reruns the task. The rerun attempts the forbidden state transition Completed -> Running. When Prefect encounters such a forbidden transition (the Abort exception caught in https://github.com/PrefectHQ/prefect/blob/main/src/prefect/engine.py#L1386), it does one final check with the Cloud API to find out whether the task has already finished. If it has, Prefect continues by retrieving the result from storage, which is empty if remote result persistence is not enabled and the task ran on another node.
This leads to a MissingResult error (if result persistence was off) or a ValueError because the storage path is not found (if result persistence is on).
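The sequence of events described above can be sketched as a small standard-library simulation. This is not Prefect or Ray code and not a reproduction of the bug: FakeApi, run_task, simulate, and the exception classes are hypothetical stand-ins invented for illustration, and each node's local storage is modeled as a plain dict. It only shows why a rerun on another node finds no result unless some shared (remote) storage exists.

```python
from enum import Enum


class State(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"


class Abort(Exception):
    """The simulated API rejected a state transition."""


class MissingResult(Exception):
    """A task is Completed but its result cannot be found."""


class WorkerKilled(Exception):
    """Stands in for the worker being killed (e.g. by an OOM)."""


class FakeApi:
    """Hypothetical stand-in for the orchestration API; tracks one task."""

    def __init__(self):
        self.state = State.PENDING

    def propose(self, new_state):
        if self.state is State.COMPLETED and new_state is State.RUNNING:
            raise Abort("forbidden transition Completed -> Running")
        self.state = new_state


def run_task(api, node_storage, shared_storage, kill_after_completed=False):
    """Run one attempt of the task on a node with its own local storage."""
    try:
        api.propose(State.RUNNING)
    except Abort:
        # Final check: if the task already finished, load its result.
        if api.state is State.COMPLETED:
            for store in (node_storage, shared_storage):
                if store is not None and "result" in store:
                    return store["result"]
            raise MissingResult("task is Completed but no result is stored")
        raise

    result = 42
    node_storage["result"] = result          # only visible on this node
    if shared_storage is not None:
        shared_storage["result"] = result    # visible from every node
    api.propose(State.COMPLETED)             # marked Completed ...
    if kill_after_completed:
        raise WorkerKilled()                 # ... but killed before returning
    return result


def simulate(shared_storage):
    """First attempt dies after Completed; the rerun lands on another node."""
    api = FakeApi()
    node_a, node_b = {}, {}
    try:
        run_task(api, node_a, shared_storage, kill_after_completed=True)
    except WorkerKilled:
        pass
    try:
        return run_task(api, node_b, shared_storage)
    except MissingResult:
        return "MissingResult"
```

Calling simulate(None) models a run without remote result storage and ends in the missing-result branch, while simulate({}) models shared storage and lets the rerun recover the stored value.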
In the wild this is extremely flaky and hard to reproduce, because the OOM has to hit at exactly the right moment. It can, however, be reproduced reliably with example code that triggers an OOM error in the on_completion hook (which runs after the task is marked Completed but before it returns).
Checklist
pre-commit checks: run pre-commit install && pre-commit run --all locally for formatting and linting.
mkdocs serve: view documentation locally.