Fix grid size in Triton decoding kernel #2134

Merged
merged 2 commits into sgl-project:main on Nov 23, 2024

Conversation

ispobock (Collaborator)

Motivation

Fix the issue mentioned in #1935.

python -m sglang.bench_one_batch --batch-size 128 --input 128 --output 128 --model meta-llama/Llama-3.1-8B-Instruct --attention-backend triton

Prefill. latency: 0.37300 s, throughput:  43925.00 token/s
Decode.  latency: 0.01022 s, throughput:  12529.66 token/s
Decode.  latency: 0.01028 s, throughput:  12455.53 token/s
Decode.  latency: 0.01024 s, throughput:  12494.67 token/s
Decode.  latency: 0.01015 s, throughput:  12611.19 token/s
Decode.  latency: 0.01016 s, throughput:  12596.69 token/s
Decode.  median latency: 0.01000 s, median throughput:  12796.35 token/s
python -m sglang.bench_offline_throughput --model meta-llama/Llama-3.1-8B-Instruct --disable-radix --num-prompt 3000 --attention-backend triton

====== Offline Throughput Benchmark Result =======
Backend:                                 engine    
Successful requests:                     3000      
Benchmark duration (s):                  60.98     
Total input tokens:                      673672    
Total generated tokens:                  581627    
Request throughput (req/s):              49.20     
Input token throughput (tok/s):          11047.56  
Output token throughput (tok/s):         9538.11   
Total token throughput (tok/s):          20585.67  
==================================================

@@ -189,11 +186,12 @@ def _decode_att_m_fwd(
logit_cap,
):
BLOCK = 32
SPLIT_K = 8
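
The changed lines set a fixed split factor for the decode kernel. As a minimal, hypothetical sketch (not the PR's actual code), SPLIT_K typically enters the launch grid of a split-K decode attention kernel like this: each (batch, head) pair is processed by SPLIT_K program instances, each reducing a slice of the KV sequence in BLOCK-sized chunks, and a second pass then combines the partial results.

```python
import math

# Hypothetical helper, for illustration only: how SPLIT_K and BLOCK could
# translate into a launch grid and per-program work for decode attention.
def decode_grid(batch_size: int, num_heads: int, max_kv_len: int,
                split_k: int = 8, block: int = 32):
    # One program instance per (batch, head, kv-split).
    grid = (batch_size, num_heads, split_k)
    # Each split reduces over roughly max_kv_len / split_k cached tokens,
    kv_per_split = math.ceil(max_kv_len / split_k)
    # iterating over them in BLOCK-sized chunks inside the kernel.
    blocks_per_split = math.ceil(kv_per_split / block)
    return grid, kv_per_split, blocks_per_split

# Example: batch 128, 32 query heads, 2000 cached tokens ->
# grid (128, 32, 8), each split scanning 250 tokens in 8 chunks of 32.
print(decode_grid(128, 32, 2000))
```
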
Member:

Is this parameter applicable to various cases?

Collaborator Author:

I tested it on ShareGPT, 8 is an optimal selection.

Collaborator Author:

That may need some tuning for different situations.
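
As a purely hypothetical illustration (not part of this PR), such tuning could take the form of a small heuristic that picks the split factor from the number of cached KV tokens a decode step has to scan:

```python
# Hypothetical heuristic, not from this PR: choose the split factor based on
# the KV length so that short and long contexts both get reasonable grids.
def pick_split_k(max_kv_len: int) -> int:
    if max_kv_len <= 2048:
        return 4   # short contexts: fewer splits, less reduction overhead
    if max_kv_len <= 8192:
        return 8   # the value found to work well on ShareGPT-like traffic
    return 16      # very long contexts: more splits to keep the GPU busy
```
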

Member:

ref https://fireworks.ai/blog/why-gpus-on-demand

| Prompt Length (Tokens) | Fireworks Latency | vLLM Latency |
|---|---|---|
| Long prompt (4000 input, 200 output) | 2117 ms (at 7.5 QPS) | 2877 ms (at 0.348 QPS) |
| Medium prompt (2000 input, 100 output) | 740 ms (at 1.33 QPS) | 1509 ms (at 0.663 QPS) |
| Short prompt (128 input, 4 output) | 43.3 ms (at 22.51 QPS) | 247 ms (at 4.056 QPS) |

Could we also tune the Medium prompt and Long prompt cases?

Member:

BTW, I don't think a 4k prompt really counts as "long", even though the blog defines it as such. In reality, some cases are around 30k-50k tokens.

Collaborator Author:

I tested the throughput (req/s) for these cases; split = 8 is also good.

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache --trust-remote-code --tp 1 --attention-backend triton
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 2000 --random-output 100 --random-range-ratio 1 --num-prompts 1000
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4000 --random-output 200 --random-range-ratio 1 --num-prompts 1000
| Prompt Length (Tokens) | Split = 4 | Split = 8 | Split = 16 | Split = 32 |
|---|---|---|---|---|
| Long prompt (4000 input, 200 output) | 5.32 | 5.34 | 5.35 | 5.32 |
| Medium prompt (2000 input, 100 output) | 14.14 | 14.15 | 14.06 | 13.94 |

Member:

The throughput looks good. How about the latency?

Collaborator Author:

Median decode latency (ms):

python3 -m sglang.bench_one_batch --batch-size 1 --input 2000 --output 100 --model meta-llama/Llama-3.1-8B-Instruct --attention-backend triton
python3 -m sglang.bench_one_batch --batch-size 1 --input 4000 --output 200 --model meta-llama/Llama-3.1-8B-Instruct --attention-backend triton
| Prompt Length (Tokens) | Split = 4 | Split = 8 | Split = 16 | Split = 32 |
|---|---|---|---|---|
| Long prompt (4000 input, 200 output) | 9.78 | 9.37 | 9.02 | 8.91 |
| Medium prompt (2000 input, 100 output) | 8.33 | 8.06 | 7.94 | 7.94 |

zhyncs (Member) commented on Nov 23, 2024:

I think the failure in Unit Test 3 is introduced by #2081 (comment).

zhyncs (Member) left a review comment:

LGTM!

zhyncs merged commit c5f8650 into sgl-project:main on Nov 23, 2024
12 of 13 checks passed