
feat: Support LoRA incremental weight synchronization on disk for FSDP and SGLang #1233

Open

TaoZex wants to merge 12 commits into areal-project:main from TaoZex:lora_incre

feat:Support LoRA incremental weight synchronization on disk for FSDP and SGLang#1233
TaoZex wants to merge 12 commits into
areal-project:mainfrom
TaoZex:lora_incre

Conversation

TaoZex (Collaborator) commented on Apr 23, 2026

Description

Implement disk-based LoRA adapter synchronization for the FSDP training engine and SGLang inference backend.

This PR fixes the FSDP save path for LoRA training: when use_lora=True, FSDP now writes adapter-only PEFT artifacts instead of saving the full HuggingFace model directory. The generated adapter directory contains adapter_model.safetensors and adapter_config.json, matching the format consumed by SGLang's existing /load_lora_adapter endpoint.
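As a point of reference, here is a minimal sketch of how such an adapter directory could be handed to a running SGLang server through /load_lora_adapter; the server address, adapter name, and path below are placeholders for illustration, not values from this PR.

```python
import requests

# Placeholder server address and adapter location (hypothetical values).
SGLANG_URL = "http://localhost:30000"
ADAPTER_DIR = "/shared/checkpoints/step_5/lora_adapter"  # holds adapter_model.safetensors + adapter_config.json

# Ask the running SGLang server to load the adapter from disk under a versioned name.
resp = requests.post(
    f"{SGLANG_URL}/load_lora_adapter",
    json={"lora_name": "policy_step_5", "lora_path": ADAPTER_DIR},
)
resp.raise_for_status()
```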

Key changes:

  • Adapter-only FSDP save path: FSDPEngine._save_model_to_hf() now branches on self.config.use_lora; LoRA mode saves only the LoRA adapter tensors, while non-LoRA mode keeps the existing full-model HuggingFace save path.
  • PEFT-compatible adapter artifacts: Added _save_lora_adapter_to_hf() to filter LoRA tensors, strip the PEFT active-adapter segment (e.g. .default.), write adapter_model.safetensors, and generate adapter_config.json with the required LoRA metadata (a sketch follows this list).
  • SGLang disk-sync compatibility: The saved adapter layout matches SGLang's existing /load_lora_adapter disk update flow, so LoRA disk synchronization works without NCCL-based distributed weight updates.
  • Network robustness: gethostip() now prefers UDP socket probing over hostname resolution and simplifies the fallback error handling (sketched below).
  • Test coverage: Added unit and torchrun E2E coverage for LoRA adapter filtering, PEFT artifact layout, SGLang disk request dispatch, versioned LoRA naming, and adapter-only FSDP saves across DP/TP configurations.
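A minimal sketch of what the adapter-only save helper could look like, assuming a PEFT-wrapped model whose LoRA keys contain a .default. active-adapter segment; the function name follows the description above, but the signature, key filtering, and config fields here are illustrative rather than the actual implementation.

```python
import json
import os

from safetensors.torch import save_file


def _save_lora_adapter_to_hf(state_dict, save_dir, lora_rank=8, lora_alpha=16):
    """Write a PEFT-style adapter directory (sketch; fields are illustrative)."""
    # Keep only LoRA tensors and drop the active-adapter segment (".default.")
    # so keys look like "...q_proj.lora_A.weight", matching a PEFT checkpoint.
    lora_tensors = {
        key.replace(".default.", "."): tensor.detach().cpu().contiguous()
        for key, tensor in state_dict.items()
        if "lora_" in key
    }
    os.makedirs(save_dir, exist_ok=True)
    save_file(lora_tensors, os.path.join(save_dir, "adapter_model.safetensors"))

    # Minimal LoRA metadata; a real adapter_config.json would mirror the training config.
    adapter_config = {
        "peft_type": "LORA",
        "r": lora_rank,
        "lora_alpha": lora_alpha,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed defaults
        "lora_dropout": 0.0,
    }
    with open(os.path.join(save_dir, "adapter_config.json"), "w") as f:
        json.dump(adapter_config, f, indent=2)
```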
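The UDP-probe idea behind the gethostip() change is a common pattern: connecting a UDP socket to a public address sends no packets but lets the OS choose the outbound interface, whose address getsockname() returns. A sketch assuming 8.8.8.8:80 as the probe target, which may differ from what the PR actually uses:

```python
import socket


def gethostip() -> str:
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect(("8.8.8.8", 80))  # UDP connect() sends no traffic
            return s.getsockname()[0]   # IP of the interface chosen for the route
    except OSError:
        # Fallback: hostname resolution (may yield 127.0.0.1 on some machines).
        return socket.gethostbyname(socket.gethostname())
```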

Related Issue

Fixes #(issue)

Type of Change

  • ✨ New feature
  • ⚡ Performance improvement
  • ✅ Test coverage improvement
  • ♻️ Refactoring

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (not applicable; no docs changed)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details

N/A

Additional Context

Files changed:

  • areal/engine/fsdp_engine.py: Add LoRA-aware HF save path; save adapter-only PEFT artifacts when use_lora=True; keep the full HF save for non-LoRA models
  • areal/utils/network.py: Prefer UDP socket probing for host IP discovery before the hostname-based fallback
  • tests/test_lora_adapter_save.py: Add direct unit tests for _save_lora_adapter_to_hf() artifact layout, key filtering, config generation, and fail-fast behavior
  • tests/test_lora_disk_sync.py: Add unit tests for LoRA disk sync metadata, SGLang /load_lora_adapter dispatch, versioned LoRA names, and distributed update rejection for LoRA
  • tests/test_lora_disk_sync_e2e.py: Add a pytest wrapper for the torchrun E2E LoRA disk sync tests with single-GPU, DP2, DP4, and DP2+TP2 coverage
  • tests/torchrun/run_lora_disk_sync.py: Add a distributed E2E script that initializes FSDP LoRA, saves adapter-only artifacts, validates PEFT files, and checks forward-pass health

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces "LoRA Delta Sync," an incremental weight update mechanism for FSDP and SGLang that reduces communication overhead by transmitting only adapter weights after an initial full sync. It also adds entropy regularization to the PPO actor. Feedback focuses on potential memory issues when collecting full model parameters on a single rank, the need for shared storage in multi-node setups for adapter files, and minor code cleanups regarding imports and logic simplification.

Review threads (outdated): areal/engine/fsdp_engine.py (4), areal/trainer/ppo/actor.py (1)
TaoZex (Collaborator, Author) commented on Apr 23, 2026

Test Content

Ran the tests and captured the results below.

  1. tests/test_lora_delta_sync.py
  [screenshot: unit test results]
  2. tests/test_lora_delta_sync_e2e.py (invokes tests/torchrun/run_lora_delta_sync.py)
  [screenshot: E2E test results]

TaoZex (Collaborator, Author) commented on Apr 23, 2026

Before this change, both the base-model and adapter parameters had to be synchronized on every update. After this change, only the adapter parameters are transmitted for steps > 1.

1. Task Reward

[screenshot: task reward curve] With incremental disk weight updates, LoRA training remains stable and the reward grows continuously.

2. Weight Synchronization Data Volume

  • Base model synchronization: 2944.40M
  [screenshot: base model sync volume]
  • Adapter model synchronization: 35.21M
  [screenshot: adapter sync volume]

Adapter-only synchronization accounts for 35.21 / (35.21 + 2944.40) = 1.18% of the full-sync volume, i.e. the overall parameter transmission volume is reduced by 98.82%.

3. Weight Synchronization Latency

[screenshot: weight synchronization latency per step]
  • Step 1 (base + adapter): 6.74 s
  • Step > 1 (adapter only): 0.35 s (average shown in the chart)

Latency drops from 6.74 s to 0.35 s, a 94.8% reduction in weight-update time.

@TaoZex TaoZex marked this pull request as ready for review April 23, 2026 10:21
TaoZex (Collaborator, Author) commented on Apr 23, 2026

@rchardx This idea is inspired by incremental weight updates. If you're interested, would you mind reviewing it or sharing your suggestions when you have time? Looking forward to your reply. Thanks~

garrett4wade (Collaborator) left a comment

Hi @TaoZex, the LoRA update is expected to go through the "disk" update mode, which already implements the /load_lora_adapter path. The critical issue we should fix is that the FSDP engine always saves the full parameters rather than only the LoRA adapters.

TaoZex (Collaborator, Author) commented on May 10, 2026

Unit test (tests/test_fsdp_engine.py):

[screenshot: unit test results]

E2E test (tests/test_lora_disk_sync_e2e.py):

[screenshot: E2E test results]

TaoZex (Collaborator, Author) commented on May 11, 2026

@garrett4wade Thank you for your code review. I have made the relevant fixes; only the adapter files are now retained. I hope you can review it again when you have time~

@github-actions github-actions Bot removed the stale label May 11, 2026
@TaoZex TaoZex requested a review from garrett4wade May 13, 2026 02:19