feat: Support LoRA incremental weight synchronization on disk for FSDP and SGLang #1233

TaoZex wants to merge 12 commits into
Conversation
Code Review
This pull request introduces "LoRA Delta Sync," an incremental weight update mechanism for FSDP and SGLang that reduces communication overhead by transmitting only adapter weights after an initial full sync. It also adds entropy regularization to the PPO actor. Feedback focuses on potential memory issues when collecting full model parameters on a single rank, the need for shared storage in multi-node setups for adapter files, and minor code cleanups regarding imports and logic simplification.
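Roughly, the flow the summary describes looks like the sketch below. This is an illustrative approximation only; the function and attribute names (`sync_weights`, `_full_sync_done`, `push_full_weights_to_inference`, `save_lora_adapter`, `load_lora_adapter_from_disk`) are hypothetical and not the PR's actual API.

```python
# Illustrative sketch of disk-based LoRA delta sync; all names are hypothetical.
def sync_weights(engine, sglang_client, step: int, adapter_dir: str):
    if not getattr(engine, "_full_sync_done", False):
        # First sync: transmit the full model once so the inference
        # engine starts from the same base weights.
        engine.push_full_weights_to_inference(sglang_client)
        engine._full_sync_done = True
        return

    # Later syncs: write only the LoRA adapter to shared disk ...
    adapter_path = f"{adapter_dir}/step_{step}"
    engine.save_lora_adapter(adapter_path)

    # ... and ask SGLang to load it from disk, using a versioned adapter
    # name so each training step's weights stay distinct.
    sglang_client.load_lora_adapter_from_disk(
        lora_name=f"areal_lora_step_{step}",
        lora_path=adapter_path,
    )
```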
@rchardx This idea is inspired by incremental weight updates. If you're interested, would you mind reviewing it or sharing your suggestions when you have time? Looking forward to your reply. Thanks~
@garrett4wade Thank you for the code review. I have made the relevant fixes, and now only the adapter file is retained. I hope you can review it again when you have time~








Description
Implement disk-based LoRA adapter synchronization for the FSDP training engine and SGLang inference backend.
This PR fixes the FSDP save path for LoRA training: when `use_lora=True`, FSDP now writes adapter-only PEFT artifacts instead of saving the full HuggingFace model directory. The generated adapter directory contains `adapter_model.safetensors` and `adapter_config.json`, matching the format consumed by SGLang's existing `/load_lora_adapter` endpoint.

Key changes:

- `FSDPEngine._save_model_to_hf()` now branches on `self.config.use_lora`; LoRA mode saves only LoRA adapter tensors, while non-LoRA mode keeps the existing full-model HuggingFace save path.
- New helper `_save_lora_adapter_to_hf()` filters LoRA tensors, strips the PEFT active-adapter segment such as `.default.`, writes `adapter_model.safetensors`, and generates `adapter_config.json` with the required LoRA metadata (see the sketch after this list).
- SGLang consumes the adapters through its existing `/load_lora_adapter` disk update flow, so LoRA disk synchronization works without an NCCL-based distributed weight update.
- `gethostip()` now prefers UDP socket probing before hostname resolution, with simplified fallback error handling.
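For orientation, a hand-written sketch of what such an adapter-only save can look like. It is an approximation of the behavior described above, not the PR's exact code; the function name, the `lora_config` dictionary, and the exact `adapter_config.json` fields are assumptions.

```python
import json
import os

from safetensors.torch import save_file


def save_lora_adapter(full_state_dict, lora_config, output_dir):
    """Sketch: write adapter_model.safetensors and adapter_config.json."""
    os.makedirs(output_dir, exist_ok=True)

    # Keep only LoRA tensors and strip the PEFT active-adapter segment,
    # e.g. "...lora_A.default.weight" -> "...lora_A.weight".
    adapter_tensors = {
        key.replace(".default.", "."): tensor.detach().cpu().contiguous()
        for key, tensor in full_state_dict.items()
        if "lora_" in key
    }
    if not adapter_tensors:
        # Fail fast when the model was not actually trained with LoRA.
        raise ValueError("No LoRA tensors found; is use_lora enabled?")

    save_file(adapter_tensors, os.path.join(output_dir, "adapter_model.safetensors"))

    # Minimal PEFT-style adapter config so peft / SGLang can load the directory.
    adapter_config = {
        "peft_type": "LORA",
        "r": lora_config["r"],
        "lora_alpha": lora_config["lora_alpha"],
        "target_modules": lora_config["target_modules"],
        "base_model_name_or_path": lora_config["base_model_name_or_path"],
    }
    with open(os.path.join(output_dir, "adapter_config.json"), "w") as f:
        json.dump(adapter_config, f, indent=2)
```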
Related Issue

Fixes #(issue)
Type of Change
Checklist
- `pre-commit run --all-files`
- `/review-pr` command
- `/create-pr`

Breaking Change Details
N/A
Additional Context
Files changed:
- `areal/engine/fsdp_engine.py`: write adapter-only artifacts when `use_lora=True`; keep full HF save for non-LoRA models
- `areal/utils/network.py`
- `tests/test_lora_adapter_save.py`: covers `_save_lora_adapter_to_hf()` artifact layout, key filtering, config generation, and fail-fast behavior
- `tests/test_lora_disk_sync.py`: covers `/load_lora_adapter` dispatch, versioned LoRA names, and distributed update rejection for LoRA (see the request example below)
- `tests/test_lora_disk_sync_e2e.py`
- `tests/torchrun/run_lora_disk_sync.py`
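For reference, a hedged example of how an adapter directory written to disk could be handed to a running SGLang server through its `/load_lora_adapter` HTTP endpoint. The payload field names (`lora_name`, `lora_path`) reflect SGLang's adapter-loading request as I understand it; verify against the SGLang version in use, and note that the helper name and URLs below are illustrative.

```python
import requests


def load_adapter_into_sglang(server_url: str, lora_name: str, lora_path: str):
    """Ask a running SGLang server to load a LoRA adapter directory from disk."""
    resp = requests.post(
        f"{server_url}/load_lora_adapter",
        json={"lora_name": lora_name, "lora_path": lora_path},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()


# Example usage: versioned adapter names keep each training step's weights distinct.
# load_adapter_into_sglang(
#     "http://localhost:30000", "areal_lora_step_10", "/shared/ckpt/lora/step_10"
# )
```

In a multi-node setup, the adapter directory must sit on storage that both the trainer and the SGLang server can read, which matches the shared-storage concern raised in the review above.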