
feat: add router replay(R3) for megatron engine#1207

Open
TaoZex wants to merge 119 commits into areal-project:main from TaoZex:final_moe

Conversation

@TaoZex (Collaborator) commented Apr 18, 2026

Description

This PR implements Rollout Routing Replay (R3) for MoE models, addressing training instability caused by inference-training routing discrepancy in asynchronous RL training. R3 records expert routing indices from the inference engine and replays them during training, ensuring consistent expert selection regardless of weight staleness.

Key Changes

Core MoE Patch (router_replay_patch.py):

  • RouterReplay class (one per MoE layer) with RECORD/REPLAY_FORWARD/REPLAY_BACKWARD actions
  • patched_routing: replaces TopKRouter.routing — uses scores.gather(1, target_topk_idx) in replay mode instead of torch.topk, preserving gradient flow
  • Four monkey-patches: TransformerConfig.__init__, TopKRouter.__init__, TopKRouter.routing, MoEAlltoAllTokenDispatcher.preprocess
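A minimal sketch of the replay-mode selection described above (tensor shapes inferred from the description; this is not the actual patch code):

```python
import torch

def routing_with_replay(scores: torch.Tensor, target_topk_idx=None, topk: int = 2):
    """Sketch of the patched routing idea (shapes assumed).

    scores:          [num_tokens, num_experts] router probabilities.
    target_topk_idx: [num_tokens, topk] expert ids recorded at rollout time,
                     or None for normal top-k routing (RECORD-style path).
    """
    if target_topk_idx is None:
        # Normal path: pick the top-k experts from the current router scores.
        topk_scores, topk_idx = torch.topk(scores, k=topk, dim=1)
    else:
        # Replay path: reuse the recorded expert indices and gather their
        # scores, so expert selection matches rollout exactly while gradients
        # still flow through `scores` into the router weights.
        topk_idx = target_topk_idx
        topk_scores = scores.gather(1, topk_idx)
    return topk_scores, topk_idx

logits = torch.randn(4, 8, requires_grad=True)
scores = logits.softmax(dim=-1)
recorded = torch.randint(0, 8, (4, 2))      # stand-in for rollout indices
probs, idx = routing_with_replay(scores, recorded)
probs.sum().backward()                      # gradients reach the router logits
```

The key point is that `gather` is differentiable in `scores`, whereas forcing the indices through `torch.topk` would not reproduce the rollout's selection once weights drift.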

Data Distribution (router_replay_utils.py):

  • set_router_replay_data: 4-step pipeline — right-pad→left-align → CP split → TP/SP scatter → PP layer slice → Dense/MoE mapping
  • RouterReplayHelper: locates RouterReplay instances by (pp_rank, vp_stage)
  • Layer allocation helpers: get_num_layers_to_build, get_moe_num_layers_to_build (PP/VP aware)
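The PP layer-slice step can be illustrated with a toy helper (the even-split assumption and the name below are mine, not the PR's actual `get_num_layers_to_build` logic):

```python
def moe_layers_for_pp_rank(num_layers: int, pp_size: int, pp_rank: int) -> list:
    """Toy sketch of the PP layer-slice idea: with an even split of layers
    across pipeline ranks, return the global indices of the layers a given
    rank builds; replayed routing data is sliced to exactly these layers."""
    assert num_layers % pp_size == 0, "sketch assumes an even split"
    per_rank = num_layers // pp_size
    start = pp_rank * per_rank
    return list(range(start, start + per_rank))

# e.g. 8 MoE layers over 4 PP ranks: rank 2 builds global layers 4 and 5
layers = moe_layers_for_pp_rank(8, 4, 2)
```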

MegatronEngine Integration (megatron_engine_r3_patch.py):

  • Wraps forward_backward_batch: retrieves routed_experts via side-channel, splits per micro-batch, injects replay setup via per-instance class swap, toggles forward/backward replay mode, cleans up in finally
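The inject-then-clean-up lifecycle can be sketched as a context manager (the `DummyRouter`, `set_target`, and `clear` names are stand-ins, not the PR's API):

```python
from contextlib import contextmanager

import torch

class DummyRouter:
    """Stand-in for a per-layer RouterReplay instance (real API assumed)."""
    def __init__(self):
        self.target = None
    def set_target(self, idx):
        self.target = idx
    def clear(self):
        self.target = None

@contextmanager
def replay_scope(routers, routed_experts_chunk):
    # routed_experts_chunk: [tokens, num_moe_layers, topk] for one micro-batch
    for layer_id, r in enumerate(routers):
        r.set_target(routed_experts_chunk[:, layer_id])
    try:
        yield
    finally:
        for r in routers:
            r.clear()   # runs even if forward/backward raised, as in the PR

routers = [DummyRouter() for _ in range(2)]
chunk = torch.randint(0, 8, (6, 2, 2))
with replay_scope(routers, chunk):
    active = all(r.target is not None for r in routers)
cleaned = all(r.target is None for r in routers)
```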

Actor & Workflow Integration (actor_r3_patch.py, rlvr_r3_patch.py):

  • Actor: splits routed_experts per mini-batch, delivers via engine side-channel (bypasses pack_tensor_dict 4D incompatibility)
  • Workflow: resolve_r3_moe_config auto-resolves num_moe_layers/topk from HF config; extract_routed_experts converts SGLang numpy output to left-padded torch tensor
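The numpy-to-left-padded conversion might look roughly like this (helper name and `-1` pad value are my assumptions):

```python
import numpy as np
import torch

def to_left_padded(per_seq: list, max_len: int) -> torch.Tensor:
    """Sketch of the conversion described above: each rollout returns a
    [seq_len, num_moe_layers, topk] numpy array of expert ids; stack them
    into one left-padded tensor so valid tokens sit at the right edge,
    matching left-padded input_ids."""
    num_layers, topk = per_seq[0].shape[1:]
    out = torch.full((len(per_seq), max_len, num_layers, topk), -1, dtype=torch.long)
    for i, arr in enumerate(per_seq):
        t = torch.from_numpy(np.asarray(arr)).long()
        out[i, max_len - t.shape[0]:] = t   # pad on the left with -1
    return out

seqs = [np.zeros((3, 2, 2), dtype=np.int64), np.ones((5, 2, 2), dtype=np.int64)]
packed = to_left_padded(seqs, max_len=5)
```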

SGLang Integration (sglang_r3_patch.py, sglang_remote.py):

  • Server patch: pre-encodes routed_experts as base64 in TokenizerManager._handle_batch_output (fixes jsonable_encoder silently flattening torch.Tensor to {} when skip_tokenizer_init=True)
  • Client: decodes base64, validates num_sgl_token divisibility
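The server/client base64 round-trip can be sketched as follows (field names are illustrative, not the exact wire format):

```python
import base64

import numpy as np
import torch

def encode_routed_experts(t: torch.Tensor) -> dict:
    """Serialize a tensor to a JSON-safe dict by hand, so a generic JSON
    encoder cannot silently drop it (the failure mode described above)."""
    arr = t.cpu().numpy()
    return {
        "dtype": str(arr.dtype),
        "shape": list(arr.shape),
        "data": base64.b64encode(arr.tobytes()).decode("ascii"),
    }

def decode_routed_experts(payload: dict) -> torch.Tensor:
    """Client-side inverse: base64 -> bytes -> numpy -> torch."""
    raw = base64.b64decode(payload["data"])
    arr = np.frombuffer(raw, dtype=payload["dtype"]).reshape(payload["shape"])
    return torch.from_numpy(arr.copy())   # copy: frombuffer is read-only

t = torch.randint(0, 64, (5, 3, 2), dtype=torch.int32)
rt = decode_routed_experts(encode_routed_experts(t))
```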

Orchestrator & Config (rl_trainer.py, cli_args.py):

  • return_routed_experts=True → auto-sets enable_router_replay, resolves MoE config, forces skip_tokenizer_init=True, validates SGLang-only support
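The config coupling described above can be sketched like so (keys are assumptions, not the PR's actual `cli_args` fields):

```python
def resolve_r3_config(cfg: dict) -> dict:
    """Hypothetical sketch: enabling return_routed_experts switches on
    router replay, forces skip_tokenizer_init, and rejects non-SGLang
    inference backends with an explicit error."""
    if cfg.get("return_routed_experts"):
        if cfg.get("inference_backend") != "sglang":
            raise ValueError("return_routed_experts is only supported with SGLang")
        cfg["enable_router_replay"] = True
        cfg["skip_tokenizer_init"] = True
    return cfg

cfg = resolve_r3_config({"return_routed_experts": True, "inference_backend": "sglang"})
```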

Supported Parallelism

| Dimension | Supported | Mechanism |
| --- | --- | --- |
| TP | ✅ | `scatter_to_sequence_parallel_region` + `seq_align_to` by `tp_size` |
| PP | ✅ | `get_current_rank_layer_info` slices per PP rank's MoE layers |
| VP | ✅ | Cumulative offset by `vp_stage` in `RouterReplayHelper` |
| CP | ✅ | `split_packed_seqs_for_context_parallel` before TP scatter; `seq_align_to = tp_size * cp_size * 2` when `cp_size > 1`; `cp_local` disabled to avoid shape mismatch |
| DP | ✅ | Data flows with mini-batches; no conflict |
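The alignment rule in the table above (pad the token dimension to a multiple of the split factor so TP/CP shards divide evenly) can be sketched as follows; the helper name comes from the description, but the signature and `-1` pad value are assumptions:

```python
import torch

def seq_align_to(x: torch.Tensor, align: int, pad_value: int = -1) -> torch.Tensor:
    """Pad the leading (token) dimension of `x` up to a multiple of `align`,
    e.g. align = tp_size, or tp_size * cp_size * 2 when cp_size > 1."""
    seq_len = x.shape[0]
    pad = (-seq_len) % align
    if pad == 0:
        return x
    tail = x.new_full((pad, *x.shape[1:]), pad_value)
    return torch.cat([x, tail], dim=0)

x = torch.arange(10).unsqueeze(-1)   # 10 tokens
y = seq_align_to(x, align=8)         # e.g. tp_size=2, cp_size=2 -> 2*2*2 = 8
```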

New Metrics

All metrics are computed under the `compute_logp/r3` scope at `_compute_logp` time (training weights equal rollout weights, so there is no optimizer drift), using `stats_tracker.stat` with the `n_valid_tokens` denominator (token-weighted global mean/min/max across ranks). The denominator key `n_valid_tokens` is filtered out of the wandb/tensorboard export to avoid clutter.

| Metric | Definition | Meaning |
| --- | --- | --- |
| `rollout_train_logp_abs_diff` | mean(\|logp_train − logp_rollout\|) over valid tokens | Mean absolute logprob divergence; R3 should reduce this to BF16 kernel noise |
| `rollout_train_logp_sq_diff` | mean((logp_train − logp_rollout)²) over valid tokens | Mean squared divergence; more sensitive to outliers |
| `rollout_train_k3_kl` | mean(exp(Δ) − 1 − Δ), where Δ = logp_train − logp_rollout | Schulman k3 KL estimator: unbiased, non-negative estimator of KL(π_rollout ‖ π_train) |
| `rollout_train_extreme_frac_tau2` | F(τ=2): fraction of tokens with \|Δ\| > ln(2) | Extreme-token fraction from the Router Replay paper (Eq. 3); tokens whose importance ratio leaves [1/2, 2] |
| `rollout_train_extreme_frac_tau5` | F(τ=5): fraction of tokens with \|Δ\| > ln(5) | Same with threshold τ=5; more severe outliers |
| `r3_enabled` | 1.0 if the R3 side-channel is active, 0.0 otherwise | Binary flag for A/B comparison in wandb |

All metrics are computed under torch.no_grad() on detached tensors, with negligible overhead.
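The five divergence metrics above can be computed in one pass; a sketch (function name and argument layout are mine, not the PR's):

```python
import math

import torch

@torch.no_grad()
def r3_divergence_metrics(logp_train, logp_rollout, mask):
    """Sketch of the metric definitions in the table above, computed on
    detached tensors under no_grad; `mask` marks valid response tokens."""
    d = (logp_train - logp_rollout)[mask.bool()]
    return {
        "rollout_train_logp_abs_diff": d.abs().mean().item(),
        "rollout_train_logp_sq_diff": d.pow(2).mean().item(),
        "rollout_train_k3_kl": (d.exp() - 1 - d).mean().item(),
        "rollout_train_extreme_frac_tau2": (d.abs() > math.log(2)).float().mean().item(),
        "rollout_train_extreme_frac_tau5": (d.abs() > math.log(5)).float().mean().item(),
    }

# When training and rollout policies agree exactly, every metric is zero.
m = r3_divergence_metrics(torch.zeros(6), torch.zeros(6), torch.ones(6))
```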

Related Paper

https://arxiv.org/abs/2510.11370 (Ma et al., arXiv:2510.11370, 2025) — proposes R3 to reduce training-inference policy KL divergence and prevent MoE RL training collapse.

Related Issue

Fixes #(issue)

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Additional Context

  • Backward Compatible: return_routed_experts=False (default) → all R3 code inactive, zero overhead
  • SGLang Only: vLLM backend does not support return_routed_experts; config validation raises explicit error
  • Side-Channel Delivery: routed_experts delivered via engine._r3_pending_routed_experts to bypass pack_tensor_dict 4D incompatibility
  • Server Patch Required: sglang_r3_patch must be installed on inference server to fix torch.Tensor serialization when skip_tokenizer_init=True

@gemini-code-assist (Contributor, Bot) left a comment

Code Review

This pull request implements Router Replay (R3) to align Mixture-of-Experts (MoE) routing decisions between rollout inference and training, preventing performance degradation caused by weight staleness in RL. The changes include monkey-patches for Megatron-Core components, engine-level wrappers for micro-batch scheduling, and workflow integrations to propagate routing indices from SGLang. Feedback focuses on critical architectural issues regarding global state and thread safety, specifically the risks of patching class-level iterators and using global lists for router instances. Additionally, there are recommendations to fix potential data loss in uneven batch splitting and to optimize performance by removing GPU-CPU synchronization points in the data processing pipeline.

Review comment threads:

  • areal/engine/megatron_engine_r3_patch.py (outdated)
  • areal/engine/router_replay_patch.py (outdated)
  • areal/engine/megatron_engine_r3_patch.py (outdated)
  • areal/trainer/ppo/actor_r3_patch.py (outdated)
  • areal/engine/router_replay_utils.py (outdated)
  • areal/engine/router_replay_patch.py
@TaoZex (Collaborator, Author) commented May 7, 2026

  1. Unit test results (including test_r3_mask_alignment.py and test_router_replay.py) (screenshot)
  2. End-to-end test results (including test_router_replay_e2e.py) (screenshot)

@TaoZex (Collaborator, Author) commented May 7, 2026

R3 Metric

Comparing the metric results with router replay (R3) enabled versus disabled. The R3 metrics are tested on [Moonlight-16B-A3B-Instruct](https://huggingface.co/moonshotai/Moonlight-16B-A3B-Instruct); see the example config at examples/math/gsm8k_grpo_megatron_r3.yaml:

The first three metrics in the table are the primary evaluation metrics from the paper; additionally, I introduced two supplementary metrics to further verify the effectiveness of R3.

Let Δ = log π_train − log π_infer, T = set of valid response tokens.

rollout_train_k3_kl

$$\frac{1}{|T|}\sum_{t \in T}\bigl[e^{\Delta_t} - 1 - \Delta_t\bigr]$$

Unbiased, non-negative estimator of $\text{KL}(\pi_{\text{infer}} \,\|\, \pi_{\text{train}})$ (Schulman k3), since tokens are sampled from $\pi_{\text{infer}}$.

image
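A quick sanity check (toy distributions of my choosing) that the k3 estimator's expectation equals the true KL:

```python
import math

# For a small categorical pair, the exact expectation of
# k3 = r - 1 - log r (with r = pi_train/pi_infer) under pi_infer equals
# KL(pi_infer || pi_train): E[r] = 1, and E[-log r] is the KL by definition.
pi_infer = [0.5, 0.3, 0.2]   # rollout/inference policy (sampling distribution)
pi_train = [0.4, 0.4, 0.2]   # training policy

kl = sum(q * math.log(q / p) for q, p in zip(pi_infer, pi_train))
e_k3 = sum(q * (p / q - 1 - math.log(p / q)) for q, p in zip(pi_infer, pi_train))
```

Each per-sample term $r - 1 - \log r$ is non-negative, which is why the metric never dips below zero even on small batches.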

rollout_train_extreme_frac_tau2

$$\frac{1}{|T|}\sum_{t \in T} \mathbf{1}\bigl[\lvert\Delta_t\rvert > \ln 2\bigr]$$

Fraction of tokens whose probability ratio falls outside the range $[1/2,\,2]$.

image

rollout_train_extreme_frac_tau5

$$\frac{1}{|T|}\sum_{t \in T} \mathbf{1}\bigl[\lvert\Delta_t\rvert > \ln 5\bigr]$$

Fraction of tokens whose probability ratio falls outside the range $[1/5,\,5]$.

image

rollout_train_logp_abs_diff

$$\frac{1}{|T|}\sum_{t \in T} \lvert\Delta_t\rvert$$

Mean absolute log-probability difference between training and inference engines.

image

rollout_train_logp_sq_diff

$$\frac{1}{|T|}\sum_{t \in T} \Delta_t^2$$

Mean squared log-probability difference, more sensitive to extreme outliers.

image

@TaoZex TaoZex changed the title [WIP]feat: add router replay for megatron engine feat: add router replay(R3) for megatron engine May 7, 2026
@TaoZex (Collaborator, Author) commented May 7, 2026

@garrett4wade @rchardx @nuzant Hi guys, I'd really appreciate it if you could help review the PR code when you have some spare time; I'm looking forward to your suggestions!
