Reinforcement Learning RFC #69
Open: l45k wants to merge 1 commit into `alpha` from `leo/rl_rfc`
# Reinforcement Learning on Hypha

## Overview
As part of the SRIND challenge, we also want to add the capability to run Reinforcement Learning (RL) workloads on Hypha. Compared to the supervised or unsupervised training regime, RL works fundamentally differently. This RFC therefore highlights where the concepts diverge and why this is a challenge in the current version of Hypha, and then proposes a way to enable RL on Hypha without breaking its concepts.
## Background
While (un-)supervised ML follows the schema of data being ingested by a model for training, RL comes without any prepared training data. Instead, it needs a simulation (usually called an environment) of the process that should be solved. For example, this can be an environment for the Atari games, to which the RL agent sends actions that are then performed by the simulator; for the Atari games, an action is something like pressing a button to perform a move in the game.
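For illustration, here is a minimal sketch of this agent-environment loop using the Gymnasium API as a stand-in (Hypha does not prescribe an environment API; the random policy below is a placeholder for the trained model, and Atari environments would additionally require `ale-py`):

```python
import gymnasium as gym

# Classic control task as a stand-in for an Atari environment.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

trajectory = []  # (observation, action, reward) tuples become training data
for _ in range(1000):
    action = env.action_space.sample()  # placeholder for the model's policy
    next_obs, reward, terminated, truncated, info = env.step(action)
    trajectory.append((obs, action, reward))
    obs = next_obs
    if terminated or truncated:
        obs, info = env.reset()
```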
Thus, the training process creates its own training data, and that data is updated over time. This doesn't fit into Hypha's current architecture, where data is served from a static, preprocessed dataset. This architecture is depicted in the diagram below.
```mermaid
---
config:
  theme: redux
---
sequenceDiagram
    participant S as Scheduler
    participant W1 as Worker_1
    participant W2 as Worker_2
    participant PS as Parameter Server
    participant D as Data Node
    loop Do Work
        loop local rounds
            W1 -->> S: Request Data
            S -->> W1: Data Batch
            W1 -->> D: Request Batch
            D -->> W1: Send Batch
            W1 --> W1: Work on Batch
            W2 -->> S: Request Data
            S -->> W2: Data Batch
            W2 -->> D: Request Batch
            D -->> W2: Send Batch
            W2 --> W2: Work on Batch
        end
        W1 -->> PS: Send Gradient
        W2 -->> PS: Send Gradient
        PS -->> W1: Send Parameters
        PS -->> W2: Send Parameters
    end
```
## Proposal
To enable RL, we need two things: Data Nodes that are capable of running an environment, and multiple such Data Nodes, because producing enough samples for the workers to train on is essential. What is beneficial in this scenario is that RL models are usually quite small and don't need GPUs for inference. However, to produce enough training samples, simulating the environment and running inference need to be fast; RL Data Nodes are therefore quite heavy on CPU load. In terms of heterogeneous hardware requirements, RL is a perfect match for Hypha, since it will need both GPU- and CPU-powered workers.
To fulfill the first requirement, a special RL Data Node needs to be implemented. This Data Node runs the environment and the current model, so it also needs to be able to receive model updates. The RL Data Node will run a continuous data-generation process that should ideally run in parallel. The resulting training data will be stored in a FIFO circular buffer that provides random access to the samples for training. For efficiency, the buffer should be held in memory for fast access times.
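A minimal sketch of such a buffer (all names hypothetical; a real implementation would store tensors and serve batches to workers rather than plain Python objects):

```python
import random
import threading

class ReplayBuffer:
    """In-memory FIFO circular buffer with random-access sampling."""

    def __init__(self, capacity: int):
        self._data = [None] * capacity
        self._capacity = capacity
        self._next = 0   # next write position, wraps around (FIFO eviction)
        self._size = 0   # number of valid entries
        self._lock = threading.Lock()  # generation and serving run in parallel

    def append(self, sample) -> None:
        with self._lock:
            self._data[self._next] = sample
            self._next = (self._next + 1) % self._capacity
            self._size = min(self._size + 1, self._capacity)

    def sample(self, batch_size: int) -> list:
        with self._lock:
            indices = random.sample(range(self._size), min(batch_size, self._size))
            return [self._data[i] for i in indices]
```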
The second requirement will be satisfied by improving the Scheduler to redirect Workers' data requests to different RL Data Nodes. To do so, the Scheduler needs to balance the latency between RL Data Nodes and Workers as well as the sampling and processing speeds.
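One way such a balancing decision could look (attribute names like `latency_to` and `samples_ready` are purely hypothetical, and the weighting between the two terms is an open design question of this RFC):

```python
def pick_data_node(worker, data_nodes):
    """Hypothetical scoring rule for routing a worker's data request.

    Prefers nodes with a large backlog of generated-but-unconsumed
    samples and a low network latency to the requesting worker.
    """
    def score(node):
        # How many seconds of training the node's ready samples can feed.
        backlog = node.samples_ready / max(node.consumption_rate, 1e-9)
        return backlog - node.latency_to(worker)  # buffered data good, distance bad
    return max(data_nodes, key=score)
```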
With these changes, the data flow looks like the following.
```mermaid
---
config:
  theme: redux
---
sequenceDiagram
    participant S as Scheduler
    participant W1 as Worker_1
    participant W2 as Worker_2
    participant PS as Parameter Server
    participant D1 as RL Data Node 1
    participant D2 as RL Data Node 2
    loop Do Work
        loop local rounds
            W1 -->> S: Request Data
            S -->> W1: Data Batch from Data Node 2
            W1 -->> D2: Request Batch
            D2 -->> W1: Send Batch
            W1 --> W1: Work on Batch
            W2 -->> S: Request Data
            S -->> W2: Data Batch from Data Node 1
            W2 -->> D1: Request Batch
            D1 -->> W2: Send Batch
            W2 --> W2: Work on Batch
        end
        W1 -->> PS: Send Gradient
        W2 -->> PS: Send Gradient
        PS -->> W1: Send Parameters
        PS -->> W2: Send Parameters
        PS -->> D1: Send Parameters
        PS -->> D2: Send Parameters
    end
```
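Putting the pieces together, the RL Data Node's generation loop might look like the sketch below. All names are hypothetical: `updates` is assumed to be a queue fed by the Parameter Server's "Send Parameters" messages, `model.act` and `model.load_weights` stand in for whatever inference API the model exposes, and `buffer` is the circular buffer sketched above.

```python
def rl_data_node_loop(env, model, buffer, updates):
    """Hypothetical main loop of an RL Data Node."""
    obs, info = env.reset()
    while True:
        # Apply pending parameter updates first, so new samples are
        # generated by the freshest available policy.
        while not updates.empty():
            model.load_weights(updates.get_nowait())

        action = model.act(obs)  # small model, CPU inference is enough
        obs_next, reward, terminated, truncated, info = env.step(action)
        buffer.append((obs, action, reward, obs_next, terminated))
        obs = obs_next
        if terminated or truncated:
            obs, info = env.reset()
```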
## Conversation

I wonder whether we should model this not in the scheduler but via the different connector/bridge, much like the stochastic wiring described in the SWARM learning paper. We already have a many-reference and different selection strategies that allow us to point from one worker to many data nodes, and we would only need to extend this with a strategy that considers connection and delivery speed (latency, bandwidth, generation) to optimally connect workers with data nodes.

I think this would be possible. However, I would like to avoid mixing concepts. We decided to go with DiLoCo and a centralized scheduler. If we now start to loosen this by introducing a form of decentralized scheduling, it will complicate things more than it will help.

Well, it would be super interesting to benchmark one approach against the other, no matter what we'll use as the standard moving forward.

I fully agree. But I would rather have a working baseline and start improving from there.

Yeah, let's start with the scheduler approach then.