
Add paged attention NHD for aiter_backend #269

Draft
apinge wants to merge 3 commits into zejunchen-zejun:Qwen3.5_v0.5.9 from apinge:pa_nhd

@apinge apinge commented Apr 28, 2026

Motivation

Related to ROCm/aiter#2919

Modifications

Accuracy Tests

Qwen3.5-27B TP1

export SGLANG_DISABLE_CUDNN_CHECK=1
export SGLANG_USE_CUDA_IPC_TRANSPORT=1
export SGLANG_VLM_CACHE_SIZE_MB=32768
export SGLANG_USE_AITER=1
export SGLANG_ROCM_USE_AITER_LINEAR_SHUFFLE=1
export SGLANG_ROCM_USE_AITER_LINEAR_FP8HIPB=1
export SGLANG_USE_AITER_NEW_CA=false

python3 -m sglang.launch_server \
  --port 1080 \
  --model-path /models/Qwen3.5-27B \
  --tp-size 1 \
  --attention-backend aiter \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-multimodal \
  --trust-remote-code \
  --chunked-prefill-size 32768 \
  --mem-fraction-static 0.85 \
  --max-prefill-tokens 32768 \
  --max-running-requests 32 \
  --cuda-graph-bs 1 2 4 8 12 16 $(seq 24 32) \
  --disable-radix-cache \
  --context-length 262144 \
  --disable-custom-all-reduce \
  --mm-attention-backend aiter_attn \
  2>&1 | tee Qwen3.5-27B-TP1.log
python3 sglang/benchmark/gsm8k/bench_sglang.py --port 1080 --enable-thinking --tokenizer-path /models/Qwen3.5-27B --max-new-tokens 4096
100%|██████████| 200/200 [08:34<00:00,  2.57s/it]
Accuracy: 0.805
Invalid: 0.005
Latency: 514.539 s
Output throughput: 556.207 token/s

Qwen3.5-27B TP2

export SGLANG_DISABLE_CUDNN_CHECK=1
export SGLANG_USE_CUDA_IPC_TRANSPORT=1
export SGLANG_VLM_CACHE_SIZE_MB=8192
export SGLANG_USE_AITER=1
export SGLANG_ROCM_USE_AITER_LINEAR_SHUFFLE=1
export SGLANG_ROCM_USE_AITER_LINEAR_FP8HIPB=1
export SGLANG_USE_AITER_NEW_CA=false


python3 -m sglang.launch_server \
  --port 1080 \
  --model-path /models/Qwen3.5-27B \
  --tp-size 2 \
  --attention-backend aiter \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-multimodal \
  --trust-remote-code \
  --chunked-prefill-size 32768 \
  --mem-fraction-static 0.9 \
  --max-prefill-tokens 32768 \
  --max-running-requests 32 \
  --cuda-graph-bs 1 2 4 8 12 16 $(seq 24 32) \
  --disable-radix-cache \
  --disable-custom-all-reduce \
  --mm-attention-backend aiter_attn 
python3 sglang/benchmark/gsm8k/bench_sglang.py --port 1080 --enable-thinking --tokenizer-path /models/Qwen3.5-27B --max-new-tokens 4096
100%|██████████| 200/200 [04:13<00:00,  1.27s/it]
Accuracy: 0.790
Invalid: 0.005
Latency: 253.955 s
Output throughput: 1138.693 token/s
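
To put the TP1 vs TP2 accuracy gap (0.805 vs 0.790) in context, here is a rough sketch of the sampling noise expected from a 200-question run. It is not part of this PR. It uses the standard binomial standard error and treats the two runs as independent samples, which is a simplifying assumption since both runs use the same benchmark; the helper name `binomial_stderr` is made up for illustration.

```python
import math

def binomial_stderr(accuracy, n):
    """Standard error of an accuracy estimated from n pass/fail questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

N = 200                    # questions per run, from the "200/200" progress bars
ACC_TP1 = 0.805            # Qwen3.5-27B TP1, copied from the log above
ACC_TP2 = 0.790            # Qwen3.5-27B TP2, copied from the log above

se_tp1 = binomial_stderr(ACC_TP1, N)
se_tp2 = binomial_stderr(ACC_TP2, N)

# Assumption: independent runs, so the variances of the two estimates add.
se_diff = math.sqrt(se_tp1 ** 2 + se_tp2 ** 2)
gap = ACC_TP1 - ACC_TP2

print(f"TP1 stderr: {se_tp1:.4f}")
print(f"TP2 stderr: {se_tp2:.4f}")
print(f"gap: {gap:.3f}, stderr of gap: {se_diff:.4f}")
print("gap within one stderr:", abs(gap) < se_diff)
```

Under these assumptions the gap is smaller than one standard error of the difference, so the two scores are consistent with run-to-run noise at this sample size.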

Qwen3.5-27B-PTPC-compressor TP1

export SGLANG_DISABLE_CUDNN_CHECK=1
export SGLANG_USE_CUDA_IPC_TRANSPORT=1
export SGLANG_VLM_CACHE_SIZE_MB=8192
export SGLANG_USE_AITER=1
export SGLANG_ROCM_USE_AITER_LINEAR_SHUFFLE=1
export SGLANG_ROCM_USE_AITER_LINEAR_FP8HIPB=1
export SGLANG_USE_AITER_NEW_CA=false


python3 -m sglang.launch_server \
       --port 1080 \
       --model-path /models/Qwen3.5-27B-PTPC-compressor \
       --tp-size 1 \
       --attention-backend aiter \
       --reasoning-parser qwen3 \
       --tool-call-parser qwen3_coder \
       --enable-multimodal \
       --trust-remote-code \
       --chunked-prefill-size 32768 \
       --mem-fraction-static 0.9 \
       --max-prefill-tokens 32768 \
       --max-running-requests 24 \
       --disable-radix-cache \
       --context-length 262144 \
       --disable-custom-all-reduce \
       --mm-attention-backend aiter_attn \
       2>&1 | tee Qwen3.5-27B-FP8-TP2.log
python3 sglang/benchmark/gsm8k/bench_sglang.py --port 1080 --enable-thinking --tokenizer-path /models/Qwen3.5-27B --max-new-tokens 4096
100%|██████████| 200/200 [05:20<00:00,  1.60s/it]
Accuracy: 0.745
Invalid: 0.010
Latency: 320.079 s
Output throughput: 882.261 token/s

Qwen3.5-27B-PTPC-compressor TP2

export SGLANG_DISABLE_CUDNN_CHECK=1
export SGLANG_USE_CUDA_IPC_TRANSPORT=1
export SGLANG_VLM_CACHE_SIZE_MB=8192
export SGLANG_USE_AITER=1
export SGLANG_ROCM_USE_AITER_LINEAR_SHUFFLE=1
export SGLANG_ROCM_USE_AITER_LINEAR_FP8HIPB=1
export SGLANG_USE_AITER_NEW_CA=false
export HIP_VISIBLE_DEVICES=6,7
#export USE_PA=1

python3 -m sglang.launch_server \
       --port 1080 \
       --model-path /models/Qwen3.5-27B-PTPC-compressor \
       --tp-size 2 \
       --attention-backend aiter \
       --reasoning-parser qwen3 \
       --tool-call-parser qwen3_coder \
       --enable-multimodal \
       --trust-remote-code \
       --chunked-prefill-size 32768 \
       --mem-fraction-static 0.9 \
       --max-prefill-tokens 32768 \
       --max-running-requests 32 \
       --cuda-graph-bs 1 2 4 8 12 16 $(seq 24 32) \
       --disable-radix-cache \
       --context-length 262144 \
       --disable-custom-all-reduce \
       --mm-attention-backend aiter_attn
python3 sglang/benchmark/gsm8k/bench_sglang.py --port 1080 --enable-thinking --tokenizer-path /models/Qwen3.5-27B --max-new-tokens 4096
100%|██████████| 200/200 [03:29<00:00,  1.05s/it]
Accuracy: 0.795
Invalid: 0.005
Latency: 209.109 s
Output throughput: 1305.016 token/s
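
For a side-by-side view of the four runs above, a small sketch (not part of this PR) that tabulates the reported numbers and derives the TP2/TP1 output-throughput ratio per model. The values are copied from the logs in this description; the `RUNS` layout and helper names are made up for illustration.

```python
# Values copied from the GSM8K outputs in this PR description.
# The structure and helper names are illustrative, not part of the PR.
RUNS = [
    # (model, tp_size, accuracy, invalid, latency_s, output_tok_per_s)
    ("Qwen3.5-27B", 1, 0.805, 0.005, 514.539, 556.207),
    ("Qwen3.5-27B", 2, 0.790, 0.005, 253.955, 1138.693),
    ("Qwen3.5-27B-PTPC-compressor", 1, 0.745, 0.010, 320.079, 882.261),
    ("Qwen3.5-27B-PTPC-compressor", 2, 0.795, 0.005, 209.109, 1305.016),
]

def throughput(model, tp_size):
    """Look up the reported output throughput (token/s) for one run."""
    for name, tp, _acc, _inv, _lat, tok_per_s in RUNS:
        if name == model and tp == tp_size:
            return tok_per_s
    raise KeyError((model, tp_size))

def tp2_over_tp1(model):
    """Ratio of TP2 to TP1 output throughput for the given model."""
    return throughput(model, 2) / throughput(model, 1)

def format_table():
    """Render all runs as a fixed-width text table."""
    rows = [f"{'model':<30} {'TP':>2} {'acc':>6} {'invalid':>8} "
            f"{'latency_s':>10} {'tok/s':>9}"]
    for name, tp, acc, inv, lat, tps in RUNS:
        rows.append(f"{name:<30} {tp:>2} {acc:>6.3f} {inv:>8.3f} "
                    f"{lat:>10.3f} {tps:>9.3f}")
    return "\n".join(rows)

print(format_table())
for model in ("Qwen3.5-27B", "Qwen3.5-27B-PTPC-compressor"):
    print(f"{model}: TP2/TP1 output throughput = {tp2_over_tp1(model):.2f}x")
```

Note that the PTPC-compressor TP1 run was launched with --max-running-requests 24 and without a --cuda-graph-bs list, while the other three runs used 32 and an explicit list, so its ratio is not a like-for-like comparison.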

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.
