[AMD] Add AMD MI350X/MI355X (gfx950) blockwise FP8 support for run_qwen3_30b_a3b #1465
Draft
JessicaJiang-123 wants to merge 7 commits into
…evert example .sh ROCm edit
…_gpus/NOSET special-cases
…per-engine 2 (eval fix) Co-authored-by: Xinyu Jiang <xinyuj2@andrew.cmu.edu>
Co-authored-with: @XinyuJiangCMU
Summary
Add a ROCm gfx950 (MI350X / MI355X) blockwise FP8 training path for the Qwen3-30B-A3B RL recipe, and make the RL weight reload run the fp8 post-processing the inference engine needs.
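As a rough illustration of how a launcher could gate the gfx950 path on the selected hardware, here is a minimal sketch. The helper name `fp8_settings` and its return shape are hypothetical and not taken from this PR. The FP8 flags and the environment variable are the ones listed under Changes. `--no-gradient-accumulation-fusion` is assumed to be the Megatron-LM spelling for disabling gradient-accumulation fusion.

```python
# Hypothetical helper: the name and structure are illustrative, not the PR's code.
GFX950_HARDWARE = {"MI350X", "MI355X"}


def fp8_settings(hardware: str) -> tuple[list[str], dict[str, str]]:
    """Return (extra CLI args, extra env vars) for the given hardware name."""
    if hardware not in GFX950_HARDWARE:
        # Other hardware keeps whatever defaults the script already uses.
        return [], {}
    args = [
        "--fp8-recipe", "blockwise",
        "--fp8-format", "e4m3",
        # Assumed flag name for turning off gradient-accumulation fusion.
        "--no-gradient-accumulation-fusion",
    ]
    env = {"NVTE_ROCM_ENABLE_FP8_BLOCK_SCALING": "1"}
    return args, env


if __name__ == "__main__":
    extra_args, extra_env = fp8_settings("MI355X")
    print(" ".join(extra_args))
    print(extra_env)
```

Returning the arguments and environment separately keeps the hardware check in one place, so the parallelism and rollout settings can be layered on the same branch.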
Changes
- `scripts/run_qwen3_30b_a3b.py`: add `MI350X`/`MI355X` as `hardware` options. On these, enable the TransformerEngine blockwise FP8 recipe (`--fp8-recipe blockwise`, `--fp8-format e4m3`, `NVTE_ROCM_ENABLE_FP8_BLOCK_SCALING=1`) and disable gradient-accumulation-fusion (ROCm has no wgrad fusion yet). Add a single-node parallel config (TP1 + sequence-parallel, PP2, CP2, EP4, `--max-tokens-per-gpu 16384`) with the CPU-offload optimizer, plus the matching rollout settings. Keep Ray from blanking HIP/CUDA visibility for the job entrypoint.
- `update_weight_from_tensor.py`: run `post_process_weights` after the RL weight update for `fp8` as well as `compressed-tensors`, so the inference engine re-applies its fp8 weight post-processing (the ROCm aiter pre-shuffle) on the freshly loaded weights each step.

Validated end-to-end on Qwen3-30B-A3B (8x MI350X): fp8 blockwise train + rollout matches the bf16 reference to ~0.04 relerr, and stays on-policy (per-step train-vs-rollout logprob abs-diff ~0.04), including under TP2 + sequence parallel (fwd/dgrad/wgrad).
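The PR does not spell out how the two validation numbers are computed. The sketch below shows one common way to define them, as an assumption: relative error as the L2 norm of the difference over the L2 norm of the reference, and the on-policy check as the mean absolute difference between per-token log-probabilities from training and rollout. Function names are hypothetical.

```python
import math


def relative_error(test: list[float], ref: list[float]) -> float:
    """Assumed definition: ||test - ref||_2 / ||ref||_2."""
    diff = math.sqrt(sum((t - r) ** 2 for t, r in zip(test, ref)))
    norm = math.sqrt(sum(r ** 2 for r in ref))
    return diff / norm


def mean_abs_logprob_diff(train_lp: list[float], rollout_lp: list[float]) -> float:
    """Assumed definition: mean of |train - rollout| over per-token log-probs."""
    total = sum(abs(a - b) for a, b in zip(train_lp, rollout_lp))
    return total / len(train_lp)


if __name__ == "__main__":
    # Toy values only; these are not measurements from the PR.
    print(relative_error([3.0, 0.0], [0.0, 4.0]))
    print(mean_abs_logprob_diff([-1.0, -2.0], [-1.5, -2.0]))
```

Under these definitions, a value near zero for either metric means the FP8 path tracks its reference closely. The actual thresholds and reduction used in the validation runs may differ.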
Related
Part of the AMD Qwen3-30B-A3B blockwise-FP8 bring-up:
- … `update_weight_from_tensor.py` change in this PR.
- … `docker/Dockerfile.rocm` from `ROCm/TransformerEngine @ v2.8_rocm` to the merged commit.