Runtime Plugins #127
Conversation
Pull Request Test Coverage Report for Build 18885222297
💛 - Coveralls
Force-pushed from d8507d0 to 74e10d2
    | msg = "Error creating a Context instance: the provided payload has a wrong or old format." | ||
| raise ContextSerdeError(msg) from e | ||
|  | ||
| async def mark_in_progress(self, name: str, ev: Event, worker_id: str = "") -> None: | 
These weren't prefixed, but they seem very private, so I went ahead and deleted them.
|
|
| @functools.lru_cache(maxsize=1)
| def _warn_get_result() -> None:
It seemed a little odd to access this on the context rather than on the handler, which better represents the run (and is literally a future that resolves to this value). I only saw one small test referencing it.
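For reference, a minimal sketch of the handler-based pattern this refers to; the import paths and workflow shape are assumed from the repo layout rather than taken from this diff:

```python
import asyncio

from workflows import Workflow, step
from workflows.events import StartEvent, StopEvent


class Echo(Workflow):
    @step
    async def do_it(self, ev: StartEvent) -> StopEvent:
        return StopEvent(result="done")


async def main() -> None:
    handler = Echo().run()  # the handler represents the run...
    result = await handler  # ...and is a future that resolves to the result
    print(result)           # "done"


asyncio.run(main())
```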
Force-pushed from 74e10d2 to eb1d2b5
Force-pushed from 7d4141c to 181edb0
This is a decent cleanup. It would probably benefit from an architecture diagram to clarify the current shape and whether it matches the desired end goal.
Also, before we merge this (or before we release it), we'll need to update the API reference docs.
        
          
src/workflows/context/context.py (Outdated)
        
| broker_state = WorkflowBrokerState.from_serialized(
| self._init_snapshot, self._serializer
| )
| self._broker_run = WorkflowBroker(
Trying to picture some of the hierarchy here:
- Context
  - state_store
  - broker_run
  - broker_state
 
 
| )
| self._broker_run = WorkflowBroker(
| workflow=workflow,
| context=self,
is it weird to pass in the entire context here? Circular dependency?
They are certainly coupled, but they have the same lifetime. The context is more or less just the public interface for the broker.
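A rough sketch of the coupling being described, with all names and bodies assumed rather than taken from the implementation: the Context owns the broker, shares its lifetime, and forwards public calls to it, so the back-reference is deliberate rather than a circular import.

```python
from __future__ import annotations


class Event: ...
class Workflow: ...


class WorkflowBroker:
    def __init__(self, workflow: Workflow, context: Context) -> None:
        self.workflow = workflow
        self.context = context  # back-reference; lives exactly as long as the context

    def send_event(self, ev: Event) -> None:
        ...  # queue/lock/task management would live here


class Context:
    def __init__(self, workflow: Workflow) -> None:
        # the context is the public interface; the broker does the work
        self._broker_run = WorkflowBroker(workflow=workflow, context=self)

    def send_event(self, ev: Event) -> None:
        self._broker_run.send_event(ev)
```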
| self._broker_run = self._init_broker(workflow)
|
| async def before_start() -> None:
| if prev_broker is not None:
Didn't we just set it to None and re-initialize it above?
This is the one that may have just been nulled out, if it pre-existed. Here, the shutdown is inside before_start so that it can be awaited before the new broker is initialized.
FWIW, this would only happen if you re-use a context across multiple workflow runs.
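For illustration, a stripped-down sketch of that lifecycle (FakeBroker, start_run, and before_start here are stand-ins, not the real API): the previous broker is captured before the swap, and its shutdown is awaited inside before_start so a re-used context finishes tearing down the old run before the new one starts.

```python
from __future__ import annotations

import asyncio


class FakeBroker:
    async def shutdown(self) -> None:
        await asyncio.sleep(0)  # stand-in for draining queues / cancelling tasks


async def start_run(prev_broker: FakeBroker | None) -> FakeBroker:
    new_broker = FakeBroker()

    async def before_start() -> None:
        if prev_broker is not None:
            # awaited here so the previous run is fully stopped before the new one begins
            await prev_broker.shutdown()

    await before_start()
    return new_broker


asyncio.run(start_run(FakeBroker()))  # simulates re-using a context for a second run
```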
| accepted_events: list[tuple[str, str]] = Field(default_factory=list)
|
| # Broker log of all dispatched events in order, as serializer-encoded strings.
| broker_log: list[str] = Field(default_factory=list)
tbh this isn't used and just eats memory; we could probably delete it if we're changing this much
I'm planning more changes, so this seemed like a good checkpoint for review, at least to see how we feel about the direction. I think perhaps we should keep these changes on a branch and introduce the full change set in one go?
Force-pushed from e227cc9 to d7341f8
@adrianlyjak I went through the PR and, while pretty much everything seems legit, I am still not super sure I understand the end goals of this: I think @logan-markewich mentioned to me that we are doing it so we can make our workflows effectively durable and long-running, but I am still struggling to visualize the bigger picture and to envision how these changes are going to affect the way we use workflows/context from an end-user perspective.
It will certainly benefit from some documentation/examples for the new patterns, but I think it might be good if we also take more time to discuss architecture and design choices :) Thanks for doing all of this work though! 🙌
Force-pushed from aaccdd6 to cd0e32e
| async def before_start() -> None:
| if prev_broker is not None:
| try:
| await prev_broker.shutdown()
Shouldn't the prev_broker run shutdown before getting to this point? The lifecycle of the broker feels a little odd here
yeah, it is, but you still want the reference to the broker for getting the state, so this is just double-checking here. It also protects against someone doing something weird like capturing a context that's still running and passing it into the workflow again.
| # We do this regardless of is_running state so workflows can resume from where they left off
| for step_name, worker_data in serialized.workers.items():
| if step_name not in base_state.workers:
| continue
Isn't it a critical error here? To me this indicates that the context and workflow do not match/are out of sync?
Since this is deserialized state, it could happen if you renamed or removed a step. Debatable, but it seems like something that could easily happen in dev, and it's nice to still be able to do a best-effort deserialization.
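Roughly, that best-effort restore looks like the following sketch (the function and argument names are hypothetical): serialized worker state is only copied over for steps that still exist in the current workflow, so a renamed or removed step is skipped instead of failing the whole load.

```python
def restore_workers(
    serialized_workers: dict[str, dict],
    base_workers: dict[str, dict],
) -> dict[str, dict]:
    restored = dict(base_workers)
    for step_name, worker_data in serialized_workers.items():
        if step_name not in base_workers:
            continue  # step was renamed/removed since the state was saved
        restored[step_name] = worker_data
    return restored
```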
        
          
src/workflows/runtime/broker.py (Outdated)
        
| return new_state
|
| @property
| def _replay_ticks(self) -> list[WorkflowTick]:
I'm not sure it's ever commented/explained why we need/want replay? Why did I need to replay ticks?
Replay is a bad name. This is more or less an event-sourcing architecture, so these ticks are basically all you need to record to recreate the state. Right now it's an implementation detail from which we can re-derive the current state, rather than syncing the full state from inside the control loop.
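As a toy illustration of that event-sourcing idea (all names here are hypothetical, not the broker's actual types): the ticks are the record, and the state is re-derived by folding a transition function over them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WorkflowTick:
    event_name: str


@dataclass
class BrokerState:
    seen: list[str] = field(default_factory=list)


def apply_tick(state: BrokerState, tick: WorkflowTick) -> BrokerState:
    # pure transition function: state is never mutated in place
    return BrokerState(seen=[*state.seen, tick.event_name])


def replay(ticks: list[WorkflowTick]) -> BrokerState:
    state = BrokerState()
    for tick in ticks:
        state = apply_tick(state, tick)
    return state


print(replay([WorkflowTick("start"), WorkflowTick("step_done")]).seen)
```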
| """Wait for the next tick from the internal queue.""" | ||
| return await self.queue.get() | ||
|  | ||
| def queue_event(self, tick: WorkflowTick, delay: float | None = None) -> None: | 
nit: queue_tick ?
| add: AddWaiter[EventType]
|
|
| StepWorkerStateContextVar = ContextVar[StepWorkerContext]("step_worker")
Context vars might not work in a distributed setup? Not a blocker but seems notable
The idea would be to set this up so that it works in a distributed context.
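For context, a small sketch of how a process-local ContextVar behaves today (the names are assumed); in a distributed runtime the same lookup would need to be backed by something that travels with the task (e.g. serialized into the task payload) rather than a ContextVar.

```python
import asyncio
from contextvars import ContextVar
from dataclasses import dataclass


@dataclass
class StepWorkerContext:
    step_name: str


StepWorkerStateContextVar: ContextVar[StepWorkerContext] = ContextVar("step_worker")


async def run_step(name: str) -> str:
    # each asyncio task runs in its own copy of the context, so concurrent
    # steps do not observe each other's value
    StepWorkerStateContextVar.set(StepWorkerContext(step_name=name))
    await asyncio.sleep(0)
    return StepWorkerStateContextVar.get().step_name


async def main() -> None:
    print(await asyncio.gather(run_step("a"), run_step("b")))  # ['a', 'b']


asyncio.run(main())
```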
There is a lot of new code/concepts being introduced here 😬 It will require heavy validation (and maybe even a few beta releases?). But from what I can tell by eye, it seems like a good structure overall.
We might need a more in-depth diagram too; tbh I'm losing track of concepts as I go through the code lol. While I would learn this in time, it would be nice even for newcomers and contributors.
| from workflows.events import Event, StopEvent
|
|
| @dataclass(frozen=True)
ooc why dataclass? I know weird things can happen when you mix dataclasses and pydantic; typically I'd recommend picking one or the other.
They're not meant to be mixed right now, and these are more or less transient; I just wanted something lightweight. If we want them to be serializable, we should switch to pydantic.
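A minimal illustration of that trade-off (the class names are hypothetical): a frozen dataclass is the lightweight choice for transient, in-memory values, while a pydantic model is what we'd switch to if these ever need to be serialized.

```python
from dataclasses import dataclass

from pydantic import BaseModel


@dataclass(frozen=True)
class TransientStepResult:
    """Cheap, immutable, never persisted."""

    step_name: str
    ok: bool


class SerializableStepResult(BaseModel):
    """The pydantic equivalent we'd use if this crossed a process boundary."""

    step_name: str
    ok: bool


print(SerializableStepResult(step_name="a", ok=True).model_dump_json())
```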
Force-pushed from 27bb2fb to 66d2c49
* add some refactoring notes
* Clarify / document plugin interface better
* debug mem leak
* ugly things to workaround leaks
* clean up from leak fix
* ugly leak fixes
* remove test files
* cp
* wip
* wip
* wip
* working again
* test gha
* fix precommits
* Add llama-index-utils-workflow release
* woops
* clean up pyproject
* Update docs
* ugly toml sort fix
* oops
Force-pushed from 070fba2 to 862274b
    

This is a feature branch for refactoring the internals of workflows to better support a pluggable runtime and to be easier to extend with additional features. It mainly splits apart a lot of responsibilities that were held by the context and creates a deeper internal structure with more clearly divided responsibilities. It maintains parity with the existing public interface, with minimal changes.
Related PRs:
Original PR Notes:
This is a big one 😳 sorry! It's 100% hand coded though
First step of a refactor to facilitate better plugins for future runtime pluggability: the long-term goals are better support for distributed and/or persistent workflows by extending to external coordinators.
This refactor focuses on giving more discrete responsibilities to a few new components, and narrowing the responsibility of existing ones (namely Context):
- WorkflowBroker: a class that by and large lifts the methods from Context related to task, queue, and lock management. The long-term goal is to define and break this up even further. A small amount of runtime/starting logic in the Workflow was also moved here.
- SerializedContext: a typed intermediary pydantic model to validate and document the current serialized state (rather than passing around a plain dict). The dict interface remains unchanged to maintain compatibility.
- WorkflowBrokerState: contains the mutable/asyncio python state that parallels most of a SerializedContext.
- Context now contains a reference to the broker. Note that it also still contains the reference to the store (this was not moved to the broker). My perspective is that the state and the runtime durability will have separate needs and shouldn't be closely coupled. For this reason, I also removed the state snapshot on NOT_IN_PROGRESS internal events; I think it may be better to rethink that while it is unused, so we can focus on making the state store and runtime more configurable/extendable.
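To make the shape concrete, here is a rough sketch of how these pieces relate; the class and attribute names are taken from the description above, but the bodies and fields are illustrative stubs, not the actual implementation.

```python
from __future__ import annotations

from pydantic import BaseModel, Field


class SerializedContext(BaseModel):
    """Typed intermediary for the serialized state (previously a plain dict)."""

    accepted_events: list[tuple[str, str]] = Field(default_factory=list)
    queues: dict[str, list[str]] = Field(default_factory=dict)


class WorkflowBrokerState:
    """Mutable/asyncio runtime state paralleling SerializedContext."""

    def __init__(self, accepted_events: list[tuple[str, str]]) -> None:
        self.accepted_events = accepted_events

    @classmethod
    def from_serialized(cls, data: SerializedContext) -> WorkflowBrokerState:
        return cls(accepted_events=list(data.accepted_events))


class WorkflowBroker:
    """Task, queue, and lock management lifted out of Context."""

    def __init__(self, workflow: object, context: Context, state: WorkflowBrokerState) -> None:
        self.workflow = workflow
        self.context = context
        self.state = state


class Context:
    """Public interface: keeps the state store plus a reference to the broker."""

    def __init__(self) -> None:
        self.state_store: dict[str, object] = {}  # intentionally not moved onto the broker
        self._broker_run: WorkflowBroker | None = None
```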
The initialization of the workflow run was a little scattered before, so this part of the code is "new" (as opposed to copy-pasted).
Sidequest 1: Adds better types to the @step decorator return value, such that the _step_config attribute is typed onto the returned step function (this removes various getattr calls in the code). Also renames it to _step_config so python doesn't do name mangling.
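A condensed sketch of what typing _step_config onto the returned step function can look like (the Protocol and StepConfig here are illustrative, not the actual code): the decorator's declared return type carries the attribute, so call sites can read it directly instead of reaching for getattr.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Protocol, cast


@dataclass(frozen=True)
class StepConfig:
    num_workers: int = 1


class StepFunction(Protocol):
    """A step callable that also carries its config."""

    _step_config: StepConfig

    def __call__(self, *args: Any, **kwargs: Any) -> Awaitable[Any]: ...


def step(func: Callable[..., Awaitable[Any]]) -> StepFunction:
    # attach the config and tell the type checker the attribute now exists
    func._step_config = StepConfig()  # type: ignore[attr-defined]
    return cast(StepFunction, func)
```

Call sites can then write my_step._step_config.num_workers directly, with the attribute visible to the type checker.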