Support durable execution with DBOS #130
Draft
Summary
This PR integrates DBOS with workflow and context to provide out-of-the-box durable execution and checkpointing.
Changes
- `DBOSWorkflow`:
  - Extends `Workflow` to start a DBOS durable workflow for each step worker, and uses durable notification to send the `StartEvent`.
  - When the `_done` step finishes, it broadcasts a message to all running DBOS workflows to signal the end. This cleanly terminates running tasks.
- `DBOSContext`:
  - Extends `Context` to provide durable execution, making each step worker's main loop a DBOS workflow (including the cancel worker). It also uses `DBOS.recv` to durably receive incoming events for each step.
  - Each `DBOSContext` needs to have a unique name and be created in a static function. This is important for DBOS to correctly find the workflow definitions during failure recovery.
  - […] `DBOS.step`.
  - `ctx.send_event` and `ctx.send_event_async` use DBOS durable send instead of in-memory queues for communicating events between steps.
  - Uses `DBOS.sleep` and `_durable_time` to ensure determinism within a step workflow.

Example
Here is a simple example of using DBOS.
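To illustrate the durable messaging idea from the Changes list, here is a standard-library sketch. It is not this PR's code and does not use the DBOS API. It persists every sent event in SQLite, so a restarted step worker resumes from the first unconsumed event instead of losing an in-memory queue. The `DurableMailbox` class, the table layout, the step name `step_a`, and the `"__done__"` sentinel are all hypothetical names chosen for this sketch; in the PR this role is played by DBOS durable send and `DBOS.recv`.

```python
import os
import sqlite3
import tempfile
from typing import List, Optional, Tuple

DONE = "__done__"  # hypothetical sentinel meaning "the workflow has ended"


class DurableMailbox:
    """Toy durable mailbox: events live in SQLite, so they survive a restart."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "step TEXT NOT NULL, payload TEXT NOT NULL, "
            "consumed INTEGER NOT NULL DEFAULT 0)"
        )
        self.conn.commit()

    def send(self, step: str, payload: str) -> None:
        # The connection context manager commits on success, rolls back on error.
        with self.conn:
            self.conn.execute(
                "INSERT INTO events (step, payload) VALUES (?, ?)", (step, payload)
            )

    def recv(self, step: str) -> Optional[str]:
        # Return the oldest unconsumed event for this step, or None if empty.
        row = self.conn.execute(
            "SELECT id, payload FROM events "
            "WHERE step = ? AND consumed = 0 ORDER BY id LIMIT 1",
            (step,),
        ).fetchone()
        if row is None:
            return None
        with self.conn:
            self.conn.execute("UPDATE events SET consumed = 1 WHERE id = ?", (row[0],))
        return row[1]


def run_step_worker(box: DurableMailbox, step: str, handled: List[str]) -> bool:
    """Drain events for one step; return True once the DONE sentinel is seen."""
    while True:
        payload = box.recv(step)
        if payload is None:
            return False  # nothing pending
        if payload == DONE:
            return True  # clean shutdown, no exception needed
        handled.append(payload.upper())


def demo() -> Tuple[List[str], bool]:
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "mailbox.db")
        handled: List[str] = []

        box = DurableMailbox(db_path)
        for payload in ("start", "tick", DONE):
            box.send("step_a", payload)
        first = box.recv("step_a")
        assert first is not None
        handled.append(first.upper())
        box.conn.close()  # simulate a crash after handling one event

        box = DurableMailbox(db_path)  # recovery: reopen the same database
        finished = run_step_worker(box, "step_a", handled)
        box.conn.close()
        return handled, finished


if __name__ == "__main__":
    print(demo())
```

This toy marks an event as consumed before it is handled, which gives at-most-once delivery. A durable execution engine would instead record the receive as part of the workflow's checkpointed history.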
Discussion
- `_done` step behavior: it currently throws a `WorkflowDone` exception. This works, but the workflow then produces an error stack trace, which can be annoying. An alternative is to use durable events to signal the end of the workflow.
- `Workflow.run` should also become a DBOS workflow, with each step as a sub-workflow. However, `_run_workflow` currently uses `asyncio.wait`, which is not deterministic, and determinism is a fundamental requirement for durable workflows. This may require changes to how it waits for the workflow to finish.
- […] `DBOS.step` functions.
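One source of the nondeterminism mentioned for `asyncio.wait` is that it returns unordered sets of done and pending tasks, so code that iterates over `done` may see results in an order that depends on scheduling. The sketch below contrasts this with `asyncio.gather`, which returns results in argument order regardless of which task finishes first. The step names and delays are made up for illustration; this shows the general property only and is not a proposal for how `_run_workflow` should change.

```python
import asyncio
from typing import List


async def step(name: str, delay: float) -> str:
    # Stand-in for a step worker that finishes after `delay` seconds.
    await asyncio.sleep(delay)
    return name


async def wait_in_completion_order() -> List[str]:
    tasks = [
        asyncio.create_task(step("a", 0.02)),
        asyncio.create_task(step("b", 0.01)),
    ]
    # asyncio.wait returns two *sets*; iteration order over `done` is undefined.
    done, _pending = await asyncio.wait(tasks, return_when=asyncio.ALL_COMPLETED)
    return [t.result() for t in done]


async def wait_in_fixed_order() -> List[str]:
    # gather preserves argument order, so a replay observes the same sequence
    # even though "b" completes before "a".
    return await asyncio.gather(step("a", 0.02), step("b", 0.01))


if __name__ == "__main__":
    print(asyncio.run(wait_in_fixed_order()))
    print(asyncio.run(wait_in_completion_order()))
```

A replay-based engine needs the second kind of behavior: the sequence of observed results must be the same on recovery as on the first run.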