add paged attention nhd for aiter_backend by apinge · Pull Request #269 · zejunchen-zejun/sglang

apinge · 2026-04-28T11:27:35Z

Motivation

Modifications

Accuracy Tests

Qwen3.5-27B TP1

export SGLANG_DISABLE_CUDNN_CHECK=1
export SGLANG_USE_CUDA_IPC_TRANSPORT=1
export SGLANG_VLM_CACHE_SIZE_MB=32768
export SGLANG_USE_AITER=1
export SGLANG_ROCM_USE_AITER_LINEAR_SHUFFLE=1
export SGLANG_ROCM_USE_AITER_LINEAR_FP8HIPB=1
export SGLANG_USE_AITER_NEW_CA=false

python3 -m sglang.launch_server \
  --port 1080 \
  --model-path /models/Qwen3.5-27B \
  --tp-size 1 \
  --attention-backend aiter \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-multimodal \
  --trust-remote-code \
  --chunked-prefill-size 32768 \
  --mem-fraction-static 0.85 \
  --max-prefill-tokens 32768 \
  --max-running-requests 32 \
  --cuda-graph-bs 1 2 4 8 12 16 $(seq 24 32) \
  --disable-radix-cache \
  --context-length 262144 \
  --disable-custom-all-reduce \
  --mm-attention-backend aiter_attn \
  2>&1 | tee Qwen3.5-27B-TP1.log
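
The `--cuda-graph-bs` argument above relies on shell command substitution: `$(seq 24 32)` expands to the integers 24 through 32 inclusive before the launcher sees its arguments. As a minimal sketch (the helper name `expand_cuda_graph_bs` is hypothetical and not part of SGLang), the resulting list of CUDA-graph batch sizes can be reproduced like this:

```python
def expand_cuda_graph_bs(fixed, seq_first, seq_last):
    """Mimic `1 2 4 8 12 16 $(seq 24 32)`: fixed sizes plus an inclusive range.

    Hypothetical helper for illustration only; `seq A B` prints A..B inclusive,
    so the Python range needs `seq_last + 1` as its stop value.
    """
    return list(fixed) + list(range(seq_first, seq_last + 1))


# Values copied from the launch command above.
batch_sizes = expand_cuda_graph_bs([1, 2, 4, 8, 12, 16], 24, 32)
print(batch_sizes)
```

The largest captured batch size equals the `--max-running-requests 32` value used in the same command.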

python3 sglang/benchmark/gsm8k/bench_sglang.py --port 1080 --enable-thinking --tokenizer-path /models/Qwen3.5-27B --max-new-tokens 4096
100%|██████████| 200/200 [08:34<00:00,  2.57s/it]
Accuracy: 0.805
Invalid: 0.005
Latency: 514.539 s
Output throughput: 556.207 token/s

Qwen3.5-27B TP2

export SGLANG_DISABLE_CUDNN_CHECK=1
export SGLANG_USE_CUDA_IPC_TRANSPORT=1
export SGLANG_VLM_CACHE_SIZE_MB=8192
export SGLANG_USE_AITER=1
export SGLANG_ROCM_USE_AITER_LINEAR_SHUFFLE=1
export SGLANG_ROCM_USE_AITER_LINEAR_FP8HIPB=1
export SGLANG_USE_AITER_NEW_CA=false


python3 -m sglang.launch_server \
  --port 1080 \
  --model-path /models/Qwen3.5-27B \
  --tp-size 2 \
  --attention-backend aiter \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-multimodal \
  --trust-remote-code \
  --chunked-prefill-size 32768 \
  --mem-fraction-static 0.9 \
  --max-prefill-tokens 32768 \
  --max-running-requests 32 \
  --cuda-graph-bs 1 2 4 8 12 16 $(seq 24 32) \
  --disable-radix-cache \
  --disable-custom-all-reduce \
  --mm-attention-backend aiter_attn

python3 sglang/benchmark/gsm8k/bench_sglang.py --port 1080 --enable-thinking --tokenizer-path /models/Qwen3.5-27B --max-new-tokens 4096
100%|██████████| 200/200 [04:13<00:00,  1.27s/it]
Accuracy: 0.790
Invalid: 0.005
Latency: 253.955 s
Output throughput: 1138.693 token/s

Qwen3.5-27B-PTPC-compressor TP1

export SGLANG_DISABLE_CUDNN_CHECK=1
export SGLANG_USE_CUDA_IPC_TRANSPORT=1
export SGLANG_VLM_CACHE_SIZE_MB=8192
export SGLANG_USE_AITER=1
export SGLANG_ROCM_USE_AITER_LINEAR_SHUFFLE=1
export SGLANG_ROCM_USE_AITER_LINEAR_FP8HIPB=1
export SGLANG_USE_AITER_NEW_CA=false


python3 -m sglang.launch_server \
       --port 1080 \
       --model-path /models/Qwen3.5-27B-PTPC-compressor \
       --tp-size 1 \
       --attention-backend aiter \
       --reasoning-parser qwen3 \
       --tool-call-parser qwen3_coder \
       --enable-multimodal \
       --trust-remote-code \
       --chunked-prefill-size 32768 \
       --mem-fraction-static 0.9 \
       --max-prefill-tokens 32768 \
       --max-running-requests 24 \
       --disable-radix-cache \
       --context-length 262144 \
       --disable-custom-all-reduce \
       --mm-attention-backend aiter_attn \
       2>&1 | tee Qwen3.5-27B-FP8-TP2.log

python3 sglang/benchmark/gsm8k/bench_sglang.py --port 1080 --enable-thinking --tokenizer-path /models/Qwen3.5-27B --max-new-tokens 4096
100%|██████████| 200/200 [05:20<00:00,  1.60s/it]
Accuracy: 0.745
Invalid: 0.010
Latency: 320.079 s
Output throughput: 882.261 token/s

Qwen3.5-27B-PTPC-compressor TP2

export SGLANG_DISABLE_CUDNN_CHECK=1
export SGLANG_USE_CUDA_IPC_TRANSPORT=1
export SGLANG_VLM_CACHE_SIZE_MB=8192
export SGLANG_USE_AITER=1
export SGLANG_ROCM_USE_AITER_LINEAR_SHUFFLE=1
export SGLANG_ROCM_USE_AITER_LINEAR_FP8HIPB=1
export SGLANG_USE_AITER_NEW_CA=false
export HIP_VISIBLE_DEVICES=6,7
#export USE_PA=1

python3 -m sglang.launch_server \
       --port 1080 \
       --model-path /models/Qwen3.5-27B-PTPC-compressor \
       --tp-size 2 \
       --attention-backend aiter \
       --reasoning-parser qwen3 \
       --tool-call-parser qwen3_coder \
       --enable-multimodal \
       --trust-remote-code \
       --chunked-prefill-size 32768 \
       --mem-fraction-static 0.9 \
       --max-prefill-tokens 32768 \
       --max-running-requests 32 \
       --cuda-graph-bs 1 2 4 8 12 16 $(seq 24 32) \
       --disable-radix-cache \
       --context-length 262144 \
       --disable-custom-all-reduce \
       --mm-attention-backend aiter_attn

python3 sglang/benchmark/gsm8k/bench_sglang.py --port 1080 --enable-thinking --tokenizer-path /models/Qwen3.5-27B --max-new-tokens 4096
100%|██████████| 200/200 [03:29<00:00,  1.05s/it]
Accuracy: 0.795
Invalid: 0.005
Latency: 209.109 s
Output throughput: 1305.016 token/s
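
For a side-by-side view of the four GSM8K runs above, the reported figures can be collected into a small table. The following is a minimal sketch and not part of the PR: `RUNS` only transcribes the numbers logged above, and the `Run` dataclass and the `summarize` and `tp_latency_ratio` helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    model: str
    tp: int
    accuracy: float
    invalid: float
    latency_s: float
    throughput_tok_s: float


# Figures transcribed from the benchmark output above (200 GSM8K questions each).
RUNS = [
    Run("Qwen3.5-27B", 1, 0.805, 0.005, 514.539, 556.207),
    Run("Qwen3.5-27B", 2, 0.790, 0.005, 253.955, 1138.693),
    Run("Qwen3.5-27B-PTPC-compressor", 1, 0.745, 0.010, 320.079, 882.261),
    Run("Qwen3.5-27B-PTPC-compressor", 2, 0.795, 0.005, 209.109, 1305.016),
]


def summarize(runs):
    """Render the runs as a fixed-width plain-text table."""
    header = f"{'model':<30}{'TP':>3}{'acc':>8}{'invalid':>9}{'latency(s)':>12}{'tok/s':>11}"
    rows = [
        f"{r.model:<30}{r.tp:>3}{r.accuracy:>8.3f}{r.invalid:>9.3f}"
        f"{r.latency_s:>12.3f}{r.throughput_tok_s:>11.3f}"
        for r in runs
    ]
    return "\n".join([header] + rows)


def tp_latency_ratio(runs, model):
    """Return TP1 latency divided by TP2 latency for one model."""
    by_tp = {r.tp: r for r in runs if r.model == model}
    return by_tp[1].latency_s / by_tp[2].latency_s


print(summarize(RUNS))
for name in ("Qwen3.5-27B", "Qwen3.5-27B-PTPC-compressor"):
    print(f"{name}: TP1/TP2 latency ratio = {tp_latency_ratio(RUNS, name):.2f}")
```

The latency ratio is only a rough indicator: the launch commands differ between runs (for example, the PTPC-compressor TP1 run uses `--max-running-requests 24` and omits `--cuda-graph-bs`), so the TP1 and TP2 figures are not a controlled comparison.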

Benchmarking and Profiling

Checklist

- Format your code according to "Format code with pre-commit".
- Add unit tests according to "Run and add unit tests".
- Update documentation according to "Write documentations".
- Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
- Follow the SGLang code style guidance.

Review Process

1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
2. Get approvals from CODEOWNERS and other reviewers.
3. Trigger CI tests with comments or contact authorized users to do so.
   - /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
4. After green CI and required approvals, ask Merge Oncalls to merge.

apinge added 3 commits May 9, 2026 01:13

add pagged attention nhd for aiter_backend

5e0fcf1

revert unused modification

3a38ce9

remove comment

f742fe9

apinge force-pushed the pa_nhd branch from 743e3d8 to f742fe9 Compare May 9, 2026 01:15

apinge wants to merge 3 commits into zejunchen-zejun:Qwen3.5_v0.5.9 from apinge:pa_nhd
