[AMD] Add AMD MI350X/MI355X (gfx950) blockwise FP8 support for run_qwen3_30b_a3b by JessicaJiang-123 · Pull Request #1465 · radixark/miles

JessicaJiang-123 · 2026-06-22T21:16:49Z

Summary

Add a ROCm gfx950 (MI350X / MI355X) blockwise FP8 training path for the Qwen3-30B-A3B RL recipe, and make the RL weight reload run the fp8 post-processing the inference engine needs.

Changes

scripts/run_qwen3_30b_a3b.py: add MI350X / MI355X as hardware options. On these, enable the TransformerEngine blockwise FP8 recipe (--fp8-recipe blockwise, --fp8-format e4m3, NVTE_ROCM_ENABLE_FP8_BLOCK_SCALING=1) and disable gradient-accumulation-fusion (ROCm has no wgrad fusion yet). Add a single-node parallel config (TP1 + sequence-parallel, PP2, CP2, EP4, --max-tokens-per-gpu 16384) with the CPU-offload optimizer, plus the matching rollout settings. Keep Ray from blanking HIP/CUDA visibility for the job entrypoint.
update_weight_from_tensor.py: run post_process_weights after the RL weight update for fp8 as well as compressed-tensors, so the inference engine re-applies its fp8 weight post-processing (the ROCm aiter pre-shuffle) on the freshly loaded weights each step.
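The hardware-dependent flag selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/run_qwen3_30b_a3b.py: the helper name `fp8_blockwise_config` and the `LaunchConfig` container are hypothetical, and the `--no-gradient-accumulation-fusion` spelling is an assumption based on Megatron-LM conventions. The FP8 flags, the token budget, and the environment variable are the ones quoted in this PR.

```python
from dataclasses import dataclass, field

# Hardware options that this sketch treats as gfx950 (names from the PR text).
GFX950_HARDWARE = {"MI350X", "MI355X"}


@dataclass
class LaunchConfig:
    """Hypothetical container for extra CLI args and environment variables."""
    args: list = field(default_factory=list)
    env: dict = field(default_factory=dict)


def fp8_blockwise_config(hardware: str) -> LaunchConfig:
    """Return the extra training args/env for a hardware option (illustrative only)."""
    cfg = LaunchConfig()
    if hardware.upper() not in GFX950_HARDWARE:
        return cfg  # other hardware: this sketch adds nothing

    # Blockwise FP8 recipe, as quoted in the PR description.
    cfg.args += ["--fp8-recipe", "blockwise", "--fp8-format", "e4m3"]
    cfg.args += ["--max-tokens-per-gpu", "16384"]
    # Assumed flag spelling; the PR only says the fusion is disabled on ROCm.
    cfg.args += ["--no-gradient-accumulation-fusion"]
    # Environment switch for the TransformerEngine ROCm block-scaling path.
    cfg.env["NVTE_ROCM_ENABLE_FP8_BLOCK_SCALING"] = "1"
    return cfg


if __name__ == "__main__":
    cfg = fp8_blockwise_config("MI355X")
    print(" ".join(cfg.args))
    print(cfg.env)
```

Keeping the args and the environment in one returned object makes it harder to enable the recipe flag without the matching environment switch.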

Validated end-to-end on Qwen3-30B-A3B (8x MI350X): fp8 blockwise training plus rollout matches the bf16 reference to a relative error of ~0.04 and stays on-policy (per-step train-vs-rollout logprob absolute difference ~0.04), including under TP2 + sequence parallel (fwd/dgrad/wgrad).
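For readers who want to reproduce this kind of check, the two metrics could be computed along the following lines. The PR does not state the exact definitions, so both are assumptions: relative error is taken as the L2 norm of the difference divided by the L2 norm of the reference, and the logprob gap as the mean absolute per-token difference. The function names are hypothetical.

```python
import math


def relative_error(test, ref):
    """Assumed definition: ||test - ref||_2 / ||ref||_2 over flat sequences."""
    num = math.sqrt(sum((t - r) ** 2 for t, r in zip(test, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    return num / den


def mean_abs_logprob_diff(train_logprobs, rollout_logprobs):
    """Assumed definition: mean |train - rollout| over the sampled tokens."""
    diffs = [abs(a - b) for a, b in zip(train_logprobs, rollout_logprobs)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Toy numbers, not measurements from this PR.
    print(relative_error([1.0, 2.0, 2.0], [1.0, 2.0, 2.1]))
    print(mean_abs_logprob_diff([-1.0, -2.0], [-1.5, -2.5]))
```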

Related

TransformerEngine: "Add Triton blockwise FP8 training path for ROCm gfx950" (ROCm/TransformerEngine#647): blockwise FP8 training kernels (gfx950, Triton).
sglang: "[AMD-miles] Make ROCm aiter fp8 weight pre-shuffle idempotent for RL weight reload" (sgl-project/sglang#28963): aiter fp8 weight-reload idempotency; paired with the update_weight_from_tensor.py change in this PR.
Docker image: bump the Transformer Engine pin in docker/Dockerfile.rocm from ROCm/TransformerEngine @ v2.8_rocm to the merged commit.
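To illustrate why the weight-update change and the idempotency fix are paired, here is a toy model of the reload flow. It is not sglang or aiter code: `ToyFp8Layer`, its methods, and the list reversal standing in for the layout pre-shuffle are all hypothetical. The sketch only shows the intended behavior: fresh weights arrive in the plain layout on every RL step, post-processing converts them once, and a repeated call leaves them unchanged.

```python
class ToyFp8Layer:
    """Toy stand-in for an inference-engine layer holding fp8 weights."""

    def __init__(self, weight):
        self.weight = list(weight)
        self.is_shuffled = False  # tracks whether the layout conversion ran

    def load_weights(self, new_weight):
        # Weights pushed from the trainer arrive in the plain layout.
        self.weight = list(new_weight)
        self.is_shuffled = False

    def post_process_weights(self):
        # Idempotent: skip if the layout has already been converted.
        if self.is_shuffled:
            return
        self.weight = self.weight[::-1]  # placeholder for the real pre-shuffle
        self.is_shuffled = True


def update_weights(layer, new_weight):
    """One RL weight update: load fresh weights, then re-run post-processing."""
    layer.load_weights(new_weight)
    layer.post_process_weights()


if __name__ == "__main__":
    layer = ToyFp8Layer([1, 2, 3])
    layer.post_process_weights()
    update_weights(layer, [4, 5, 6])
    layer.post_process_weights()  # repeated call does not re-shuffle
    print(layer.weight)
```

Without the post-processing call after each update, the engine in this toy model would read plain-layout weights as if they were shuffled; without the guard, a repeated call would undo the conversion.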



JessicaJiang-123 and others added 7 commits June 14, 2026 04:08

Disable gradient-accumulation-fusion for ROCm blockwise FP8 Qwen3-30B-A3B (first-cut)

202af78

Merge remote-tracking branch 'upstream/main' into amd-qwen3-30b-a3b-fp8

7be4c3a

Enable restore_weights_before_loading for fp8 quant in RL weight update

8bb4d13

Add AMD MI350X (gfx950) fp8 blockwise path to run_qwen3_30b_a3b.py; revert example .sh ROCm edit

bbedc7f

Clean up AMD fp8 launcher: MI350X/MI355X (gfx950), drop redundant num_gpus/NOSET special-cases

6716938

Merge remote-tracking branch 'upstream/main' into amd-qwen3-30b-a3b-fp8

a65815a

AMD MI350X fp8: max-tokens-per-gpu 16384 (+11%) and rollout-num-gpus-per-engine 2 (eval fix)
Co-authored-by: Xinyu Jiang <xinyuj2@andrew.cmu.edu>

b60ade2

JessicaJiang-123 wants to merge 7 commits into radixark:main from JessicaJiang-123:amd-qwen3-30b-a3b-fp8
