[GDN] Fused all preprocessing into one kernel by apinge · Pull Request #244 · zejunchen-zejun/sglang

apinge · 2026-04-09T06:28:14Z

Motivation

Modifications

Accuracy Tests

Qwen3.5-27B-PTPC-compressor

Tested on MI308X

lm_eval --model local-completions \
    --model_args '{"base_url": "http://localhost:9080/v1/completions", "model": "/models/Qwen3.5-27B-PTPC-compressor", "num_concurrent": 256, "max_retries": 10, "max_gen_toks": 2048}' \
    --tasks gsm8k \
    --batch_size auto \
    --num_fewshot 5 \
    --trust_remote_code 2>&1 | tee gsm8k_eval.log

local-completions ({'base_url': 'http://localhost:9080/v1/completions', 'model': '/models/Qwen3.5-27B-PTPC-compressor', 'num_concurrent': 256, 'max_retries': 10, 'max_gen_toks': 2048}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7597|±  |0.0118|
|     |       |strict-match    |     5|exact_match|↑  |0.8537|±  |0.0097|
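
As a sanity check on the Stderr column, the sketch below recomputes the standard error of each score as a binomial proportion. It assumes the full GSM8K test split of 1319 problems was evaluated (the log only states `limit: None`), and `binomial_stderr` is a hypothetical helper written for this note, not part of lm_eval.

```python
import math

# Assumption: the full GSM8K test split (1319 problems) was evaluated,
# since the log reports limit: None.
N_SAMPLES = 1319


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of the mean of n pass/fail outcomes with pass rate p."""
    return math.sqrt(p * (1.0 - p) / n)


# (score, reported stderr) pairs copied from the table above.
reported = {
    "flexible-extract": (0.7597, 0.0118),
    "strict-match": (0.8537, 0.0097),
}

for name, (score, stderr) in reported.items():
    estimate = binomial_stderr(score, N_SAMPLES)
    print(f"{name}: reported ±{stderr:.4f}, estimated ±{estimate:.4f}")
```

Under that assumption, both estimates agree with the reported values to four decimal places, which is consistent with the run covering the whole test split.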

qwen3p5_397B_ptpc

RESULT_FILE: accuracy_test_results/offical_qwen3p5_397B_ptpc_gsm8k_results.json
GSM8K flexible extract score: 0.95

Benchmarking and Profiling

Tested on MI308X

./run_pure_text_bench.sh   
bench model: /models/Qwen3.5-27B-PTPC-compressor/
input tokens: 8000
output tokens: 500
max concurrency: 1
num prompts: 32
dataset-name: random


#Input tokens: 256000
#Output tokens: 16000

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     32        
Benchmark duration (s):                  265.83    
Total input tokens:                      256000    
Total input text tokens:                 256000    
Total generated tokens:                  16000     
Total generated tokens (retokenized):    15994     
Request throughput (req/s):              0.12      
Input token throughput (tok/s):          963.03    
Output token throughput (tok/s):         60.19     
Peak output token throughput (tok/s):    71.00     
Peak concurrent requests:                2         
Total token throughput (tok/s):          1023.22   
Concurrency:                             1.00      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   8305.94   
Median E2E Latency (ms):                 8274.94   
P90 E2E Latency (ms):                    8374.91   
P99 E2E Latency (ms):                    8659.60   
---------------Time to First Token----------------
Mean TTFT (ms):                          1119.79   
Median TTFT (ms):                        1113.78   
P99 TTFT (ms):                           1192.66   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.40     
Median TPOT (ms):                        14.36     
P99 TPOT (ms):                           15.06     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           14.40     
Median ITL (ms):                         14.32     
P95 ITL (ms):                            14.49     
P99 ITL (ms):                            15.14     
Max ITL (ms):                            484.70    
==================================================
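
The headline numbers in this log can be cross-checked against each other. The sketch below is an illustrative consistency check, not part of the benchmark script: it recomputes the throughput figures from the token totals and duration, and approximates mean E2E latency as TTFT plus one TPOT per output token after the first, which should hold roughly at concurrency 1.

```python
# Values copied from the serving benchmark log above.
duration_s = 265.83
total_input_tokens = 256_000
total_output_tokens = 16_000
num_requests = 32
mean_ttft_ms = 1119.79
mean_tpot_ms = 14.40
mean_e2e_ms = 8305.94

# Throughput is total tokens divided by wall-clock benchmark duration.
input_tps = total_input_tokens / duration_s
output_tps = total_output_tokens / duration_s
total_tps = (total_input_tokens + total_output_tokens) / duration_s

# At concurrency 1, E2E latency is roughly TTFT plus one TPOT for each
# output token after the first.
output_per_request = total_output_tokens // num_requests
estimated_e2e_ms = mean_ttft_ms + mean_tpot_ms * (output_per_request - 1)

print(f"input tok/s:  {input_tps:.2f}")
print(f"output tok/s: {output_tps:.2f}")
print(f"total tok/s:  {total_tps:.2f}")
print(f"estimated mean E2E (ms): {estimated_e2e_ms:.2f} vs reported {mean_e2e_ms:.2f}")
```

Small differences from the logged values are expected, because the duration and TPOT shown in the log are already rounded to two decimal places.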

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.

Signed-off-by: apinge <Tong.Qiu2@amd.com>

apinge marked this pull request as ready for review April 9, 2026 09:42

apinge added 5 commits April 13, 2026 11:04

add impl

24d4e92

Signed-off-by: apinge <Tong.Qiu2@amd.com>

correct comments

2c39a7b

Signed-off-by: apinge <Tong.Qiu2@amd.com>

correct if statement

51829b2

Signed-off-by: apinge <Tong.Qiu2@amd.com>

correct if statement

1cc4445

Signed-off-by: apinge <Tong.Qiu2@amd.com>

rectify if statement

3f7b5f1

Signed-off-by: apinge <Tong.Qiu2@amd.com>

apinge force-pushed the fused_gdn_zejun branch from 5c09d68 to 3f7b5f1 Compare April 13, 2026 11:04

apinge wants to merge 5 commits into zejunchen-zejun:Qwen3.5_v0.5.9 from apinge:fused_gdn_zejun
