
feat: Support LoRA incremental weight synchronization on disk for FSDP and SGLang #1233

Open

TaoZex wants to merge 12 commits into areal-project:main from TaoZex:lora_incre

feat:Support LoRA incremental weight synchronization on disk for FSDP and SGLang#1233
TaoZex wants to merge 12 commits into
areal-project:mainfrom
TaoZex:lora_incre

Conversation

TaoZex (Collaborator) commented on Apr 23, 2026

Description

Implement disk-based LoRA adapter synchronization for the FSDP training engine and SGLang inference backend.

This PR fixes the FSDP save path for LoRA training: when use_lora=True, FSDP now writes adapter-only PEFT artifacts instead of saving the full HuggingFace model directory. The generated adapter directory contains adapter_model.safetensors and adapter_config.json, matching the format consumed by SGLang's existing /load_lora_adapter endpoint.
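As a point of reference, here is a minimal sketch of how such an adapter directory could be handed to a running SGLang server through /load_lora_adapter; the server address, adapter name, and path below are placeholders for illustration, not values from this PR.

```python
import requests

# Placeholder server address and adapter location (hypothetical values).
SGLANG_URL = "http://localhost:30000"
ADAPTER_DIR = "/shared/checkpoints/step_5/lora_adapter"  # holds adapter_model.safetensors + adapter_config.json

# Ask the running SGLang server to load the adapter from disk under a versioned name.
resp = requests.post(
    f"{SGLANG_URL}/load_lora_adapter",
    json={"lora_name": "policy_step_5", "lora_path": ADAPTER_DIR},
)
resp.raise_for_status()
```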

Key changes:

  • Adapter-only FSDP save path: FSDPEngine._save_model_to_hf() now branches on self.config.use_lora; LoRA mode saves only the LoRA adapter tensors, while non-LoRA mode keeps the existing full-model HuggingFace save path.
  • PEFT-compatible adapter artifacts: Added _save_lora_adapter_to_hf() to filter LoRA tensors, strip the PEFT active-adapter segment (e.g. .default.), write adapter_model.safetensors, and generate adapter_config.json with the required LoRA metadata (a sketch follows this list).
  • SGLang disk-sync compatibility: The saved adapter layout matches SGLang's existing /load_lora_adapter disk update flow, so LoRA disk synchronization works without NCCL-based distributed weight updates.
  • Network robustness: gethostip() now prefers UDP socket probing over hostname resolution and simplifies the fallback error handling (sketched below).
  • Test coverage: Added unit and torchrun E2E coverage for LoRA adapter filtering, PEFT artifact layout, SGLang disk request dispatch, versioned LoRA naming, and adapter-only FSDP saves across DP/TP configurations.
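A minimal sketch of what the adapter-only save helper could look like, assuming a PEFT-wrapped model whose LoRA keys contain a .default. active-adapter segment; the function name follows the description above, but the signature, key filtering, and config fields here are illustrative rather than the actual implementation.

```python
import json
import os

from safetensors.torch import save_file


def _save_lora_adapter_to_hf(state_dict, save_dir, lora_rank=8, lora_alpha=16):
    """Write a PEFT-style adapter directory (sketch; fields are illustrative)."""
    # Keep only LoRA tensors and drop the active-adapter segment (".default.")
    # so keys look like "...q_proj.lora_A.weight", matching a PEFT checkpoint.
    lora_tensors = {
        key.replace(".default.", "."): tensor.detach().cpu().contiguous()
        for key, tensor in state_dict.items()
        if "lora_" in key
    }
    os.makedirs(save_dir, exist_ok=True)
    save_file(lora_tensors, os.path.join(save_dir, "adapter_model.safetensors"))

    # Minimal LoRA metadata; a real adapter_config.json would mirror the training config.
    adapter_config = {
        "peft_type": "LORA",
        "r": lora_rank,
        "lora_alpha": lora_alpha,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed defaults
        "lora_dropout": 0.0,
    }
    with open(os.path.join(save_dir, "adapter_config.json"), "w") as f:
        json.dump(adapter_config, f, indent=2)
```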
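The UDP-probe idea behind the gethostip() change is a common pattern: connecting a UDP socket to a public address sends no packets but lets the OS choose the outbound interface, whose address getsockname() returns. A sketch assuming 8.8.8.8:80 as the probe target, which may differ from what the PR actually uses:

```python
import socket


def gethostip() -> str:
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect(("8.8.8.8", 80))  # UDP connect() sends no traffic
            return s.getsockname()[0]   # IP of the interface chosen for the route
    except OSError:
        # Fallback: hostname resolution (may yield 127.0.0.1 on some machines).
        return socket.gethostbyname(socket.gethostname())
```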

Related Issue

Fixes #(issue)

Type of Change

  • ✨ New feature
  • ⚡ Performance improvement
  • ✅ Test coverage improvement
  • ♻️ Refactoring

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (not applicable; no docs changed)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details

N/A

Additional Context

Files changed:

  • areal/engine/fsdp_engine.py: Add LoRA-aware HF save path; save adapter-only PEFT artifacts when use_lora=True; keep the full HF save for non-LoRA models
  • areal/utils/network.py: Prefer UDP socket probing for host IP discovery before the hostname-based fallback
  • tests/test_lora_adapter_save.py: Add direct unit tests for _save_lora_adapter_to_hf() artifact layout, key filtering, config generation, and fail-fast behavior
  • tests/test_lora_disk_sync.py: Add unit tests for LoRA disk sync metadata, SGLang /load_lora_adapter dispatch, versioned LoRA names, and distributed update rejection for LoRA
  • tests/test_lora_disk_sync_e2e.py: Add a pytest wrapper for the torchrun E2E LoRA disk sync tests with single-GPU, DP2, DP4, and DP2+TP2 coverage
  • tests/torchrun/run_lora_disk_sync.py: Add a distributed E2E script that initializes FSDP LoRA, saves adapter-only artifacts, validates PEFT files, and checks forward-pass health

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces "LoRA Delta Sync," an incremental weight update mechanism for FSDP and SGLang that reduces communication overhead by transmitting only adapter weights after an initial full sync. It also adds entropy regularization to the PPO actor. Feedback focuses on potential memory issues when collecting full model parameters on a single rank, the need for shared storage in multi-node setups for adapter files, and minor code cleanups regarding imports and logic simplification.

Review threads (outdated): areal/engine/fsdp_engine.py (4), areal/trainer/ppo/actor.py (1)
TaoZex (Collaborator, Author) commented on Apr 23, 2026

Test Content

Ran the tests and captured the results below.

  1. tests/test_lora_delta_sync.py
  [screenshot: unit test results]
  2. tests/test_lora_delta_sync_e2e.py (invokes tests/torchrun/run_lora_delta_sync.py)
  [screenshot: E2E test results]

TaoZex (Collaborator, Author) commented on Apr 23, 2026

Before this change, both the base-model and adapter parameters had to be synchronized on every update. After this change, only the adapter parameters are transmitted for steps > 1.

1. Task Reward

[screenshot: task reward curve] With incremental disk weight updates, LoRA training remains stable and the reward grows continuously.

2. Weight Synchronization Data Volume

  • Base model synchronization: 2944.40M
  [screenshot: base model sync volume]
  • Adapter model synchronization: 35.21M
  [screenshot: adapter sync volume]

Adapter-only synchronization accounts for 35.21 / (35.21 + 2944.40) = 1.18% of the full-sync volume, i.e. the overall parameter transmission volume is reduced by 98.82%.

3. Weight Synchronization Latency

[screenshot: weight synchronization latency per step]
  • Step 1 (base + adapter): 6.74 s
  • Step > 1 (adapter only): 0.35 s (average shown in the chart)

Latency drops from 6.74 s to 0.35 s, a 94.8% reduction in weight-update time.

@TaoZex TaoZex marked this pull request as ready for review April 23, 2026 10:21
TaoZex (Collaborator, Author) commented on Apr 23, 2026

@rchardx This idea is inspired by incremental weight updates. If you're interested, would you mind reviewing it or sharing your suggestions when you have time? Looking forward to your reply. Thanks~

garrett4wade (Collaborator) left a comment

Hi @TaoZex, the LoRA update is expected to go through the "disk" update mode, which already implements the /load_lora_adapter path. The critical issue we should fix is that the FSDP engine always saves the full parameters rather than only the LoRA adapters.

TaoZex (Collaborator, Author) commented on May 10, 2026

Unit test (tests/test_fsdp_engine.py):

[screenshot: unit test results]

E2E test (tests/test_lora_disk_sync_e2e.py):

[screenshot: E2E test results]

TaoZex (Collaborator, Author) commented on May 11, 2026

@garrett4wade Thank you for your code review. I have made the relevant fixes; only the adapter files are now retained. I hope you can review it again when you have time~

@github-actions github-actions Bot removed the stale label May 11, 2026
@TaoZex TaoZex requested a review from garrett4wade May 13, 2026 02:19