`rfc/2025-10-01_reinforcement_learning.md`
# Reinforcement Learning on Hypha

## Overview

As part of the SRIND challenge, we also want to add the capability to run Reinforcement
Learning (RL) workloads on Hypha. Compared to the supervised or unsupervised training regime, RL works
fundamentally differently. This RFC therefore highlights where the concepts diverge and why this is a
challenge in the current version of Hypha, and then proposes a way to enable RL on Hypha
without breaking its concepts.

## Background

While (un-)supervised ML follows the scheme of prepared data being ingested by a model for training, RL comes without
any prepared training data. Instead, it needs a simulation (usually called an environment) of the process
that should be solved. An example is the environment for the Atari games, where the RL agent sends
actions that the simulator performs. For the Atari games, this corresponds to something like pressing a button
to perform an action in the game.
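
To make this interaction pattern concrete, the following is a minimal sketch of the agent-environment loop. It uses the Gymnasium API with `CartPole-v1` standing in for an Atari game, purely as an illustrative assumption; Hypha does not prescribe a specific environment library, and the random action is a placeholder for a trained agent's policy.

```python
import gymnasium as gym

# The environment simulates the process to be solved; CartPole-v1
# stands in for an Atari game to keep the example dependency-light.
env = gym.make("CartPole-v1")

observation, info = env.reset(seed=42)
for _ in range(1000):
    # A trained agent would choose the action from its policy;
    # random sampling is only a placeholder.
    action = env.action_space.sample()

    # The simulator performs the action (the "button press") and
    # returns the next observation plus a reward signal.
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()
env.close()
```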

Thus, the training process creates its own training data, and that data is updated over time. This
doesn't fit Hypha's current architecture, where data is served from a static, preprocessed dataset.
This architecture is depicted in the diagram below.

```mermaid
---
config:
theme: redux
---
sequenceDiagram
participant S as Scheduler
participant W1 as Worker_1
participant W2 as Worker_2
participant PS as Parameter Server
participant D as Data Node
loop Do Work
loop local rounds
W1 -->> S: Request Data
S -->> W1: Data Batch
    W1 -->> D: Request Batch
D -->> W1: Send Batch
W1-->W1: Work on Batch
W2 -->> S: Request Data
S -->> W2: Data Batch
    W2 -->> D: Request Batch
    D -->> W2: Send Batch
W2-->W2: Work on Batch
end
  W1 -->> PS: Send Gradient
W2 -->> PS: Send Gradient
PS -->> W1: Send Parameters
PS -->> W2: Send Parameters
end
```

## Proposal

To enable RL, we need two things: Data Nodes that are capable of running an environment, and multiple
such Data Nodes, because generating enough samples for the Workers to train on is essential. What is beneficial in this
scenario is that RL models are usually quite small and don't need GPUs for inference. However, to
produce enough training samples, simulating the environment and running inference need to be fast, so
RL Data Nodes are quite heavy on CPU load. In terms of heterogeneous hardware requirements, RL is a perfect
match for Hypha, since it will need both GPU- and CPU-powered workers.

To fulfill the first requirement, a special RL Data Node needs to be implemented. This Data Node needs to
run the environment and the current model, so it will also need to be able to receive model updates.
The RL Data Node will run a continuous data generation process that should, where possible, be parallelized. The
resulting training data will be stored in a FIFO circular buffer, which needs to provide random
access to the samples for training. To enhance efficiency, the buffer should also be held in memory for faster
access times; a sketch of such a buffer follows below.
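
The following is a minimal sketch of such a buffer, assuming plain Python and uniform random sampling; the class and method names are hypothetical and not part of Hypha. A real implementation would additionally need synchronization between the parallel generation processes writing into the buffer and the readers serving training requests.

```python
import random
from typing import Any

class ReplayBuffer:
    """In-memory FIFO circular buffer with random access for training."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._storage: list[Any] = []
        self._cursor = 0  # write position for circular overwrite

    def add(self, sample: Any) -> None:
        # Grow until full, then overwrite the oldest sample (FIFO eviction).
        if len(self._storage) < self.capacity:
            self._storage.append(sample)
        else:
            self._storage[self._cursor] = sample
        self._cursor = (self._cursor + 1) % self.capacity

    def sample(self, batch_size: int) -> list[Any]:
        # Random access: draw a uniform batch for a Worker to train on.
        return random.sample(self._storage, batch_size)

    def __len__(self) -> int:
        return len(self._storage)
```

Since the buffer only ever holds the most recent `capacity` samples, the training data automatically tracks the current model as updates arrive.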

The second requirement will be satisfied by extending the Scheduler to redirect Worker requests for data to
different RL Data Nodes. The Scheduler therefore needs to balance the latency between RL Data Nodes and Workers as
well as the sampling and processing speeds; a sketch of such a balancing strategy is given after the diagram below.
**Comment on lines +70 to +72**

**Member:** I wonder whether we should model this not in the Scheduler but via the connector/bridge, much like the stochastic wiring described in the SWARM learning paper. We already have a one-to-many reference and different selection strategies that allow us to point from one worker to many Data Nodes; we would only need to extend this with a strategy that considers the connection and delivery speed (latency, bandwidth, generation) to optimally connect Workers with Data Nodes.

**Contributor Author:** I think this would be possible. However, I would like to avoid mixing concepts. We decided to go with DiLoCo and a centralized Scheduler. If we now start to loosen this by introducing a form of decentralized scheduling, it will complicate things more than it will help.

**Member:** Well, it would be super interesting to benchmark one approach against the other, no matter what we use as the standard moving forward.

**Contributor Author:** I fully agree. But I would rather have a working baseline and start improving from there.

**Member:** Yeah, let's start with the Scheduler approach then.

The above solution will look like the following.

```mermaid
---
config:
theme: redux
---
sequenceDiagram
participant S as Scheduler
participant W1 as Worker_1
participant W2 as Worker_2
participant PS as Parameter Server
participant D1 as RL Data Node 1
participant D2 as RL Data Node 2
loop Do Work
loop local rounds
W1 -->> S: Request Data
S -->> W1: Data Batch from Data Node 2
    W1 -->> D2: Request Batch
D2 -->> W1: Send Batch
W1-->W1: Work on Batch
W2 -->> S: Request Data
S -->> W2: Data Batch from Data Node 1
    W2 -->> D1: Request Batch
    D1 -->> W2: Send Batch
W2-->W2: Work on Batch
end
  W1 -->> PS: Send Gradient
W2 -->> PS: Send Gradient
PS -->> W1: Send Parameters
PS -->> W2: Send Parameters
PS -->> D1: Send Parameters
PS -->> D2: Send Parameters
end
```
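
To illustrate the balancing mentioned above, here is a sketch of how the Scheduler could pick an RL Data Node for a Worker's next batch, trading off measured latency against each node's sampling backlog. All names, fields, and weights are illustrative assumptions, not existing Hypha APIs.

```python
from dataclasses import dataclass

@dataclass
class DataNodeStats:
    node_id: str
    latency_ms: dict[str, float]  # measured Worker -> Data Node round-trip times
    buffer_fill: float            # fraction of the replay buffer populated
    pending_requests: int         # batches already promised to other Workers

def pick_data_node(
    worker_id: str,
    nodes: list[DataNodeStats],
    latency_weight: float = 1.0,
    backlog_weight: float = 50.0,
) -> str:
    """Return the id of the Data Node the Worker should fetch its next batch from.

    Lower score is better: prefer nodes that are close to the Worker,
    have a well-filled buffer, and are not already overloaded.
    """
    def score(node: DataNodeStats) -> float:
        latency = node.latency_ms.get(worker_id, float("inf"))
        # Penalize nodes whose buffers are nearly empty (slow sampling)
        # and nodes with a long queue of outstanding requests.
        starvation_penalty = (1.0 - node.buffer_fill) * backlog_weight
        queue_penalty = node.pending_requests * backlog_weight
        return latency_weight * latency + starvation_penalty + queue_penalty

    return min(nodes, key=score).node_id
```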