
@TianyuZhang1214
Collaborator

TianyuZhang1214 commented Oct 20, 2025

Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices

Introduction

We published an article on LMSYS titled "Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G", sharing our best practices for deploying the DeepSeek-R1 model on H20-96G hardware.
To facilitate reproduction of our experimental results and to provide access to our code, we have opened this pull request with the reproduction steps below.

Reproduction Steps

Pulling the Docker Image

To obtain the Docker image, use the following command:

docker pull ghcr.io/antgroup/sglang:h20-blog-release

The image is hosted at: https://github.com/orgs/antgroup/packages/container/package/sglang
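A container can then be started from this image along the following lines. This is only a sketch modeled on a docker run command shared later in this thread; the GPU, shared-memory, networking, and InfiniBand options are assumptions to adapt to your cluster, and the model mount path and container name are placeholders.

docker run --gpus all --ipc=host --network host --privileged \
--shm-size 512g \
-v /dev/infiniband:/dev/infiniband \
-v /path/to/DeepSeek-R1:/path/to/DeepSeek-R1 \
--name sglang_node0 \
-it ghcr.io/antgroup/sglang:h20-blog-release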

Checking Environment Variables

All environment variables are stored in the /root/env.sh file, configured for our H20 environment. Before launching SGLang, verify that these variables are suitable for your environment.
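For example, from inside a running container (a minimal sketch; the interface-related variables shown are ones the maintainers mention later in this thread as living in this file):

# Review the preconfigured variables, e.g., interface settings such as GLOO_SOCKET_IFNAME / NCCL_SOCKET_IFNAME
cat /root/env.sh

# Edit anything that does not match your environment
vim /root/env.sh

# Load the variables into the current shell if it does not already source this file
source /root/env.sh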

Launching SGLang

We recommend running four containers: two for Prefill nodes and two for Decode nodes.

1. Launching Prefill Nodes (Identical Configuration for Both Nodes)

Note:

  • Both Prefill nodes use the same launch parameters.
  • Adjust the port number if there is a conflict.
PYTHONUNBUFFERED=1 \
SGL_CHUNKED_PREFIX_CACHE_THRESHOLD=0 \
nohup python3 -m sglang.launch_server \
--trust-remote-code \
--model-path /path/to/DeepSeek-R1 \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
--host 0.0.0.0 \
--port 61001 \
--tp-size 8 \
--page-size 64 \
--attention-backend fa3 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 16384 \
--max-running-requests 768 \
--context-length 65535 \
--enable-cache-report \
--log-level info \
--load-balance-method round_robin \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
> /home/admin/logs/stdout.log 2>&1 &
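After a prefill node starts, a quick readiness check can be run against its HTTP port (a sketch assuming the standard SGLang /health and /get_model_info endpoints; adjust the port if you changed it above):

# Watch the server log until model loading completes
tail -f /home/admin/logs/stdout.log

# Probe the server
curl -s http://127.0.0.1:61001/health
curl -s http://127.0.0.1:61001/get_model_info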

2. Launching Decode Nodes

Note:

  • Set {node_rank} to 0 or 1 for the respective node.
  • Replace {decode_master_ip} with the IP address of Node 0.
  • Adjust the port number if there is a conflict.
Node 0 / Node 1 (same command; set {node_rank} accordingly)
PYTHONUNBUFFERED=1 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=96 \
ENABLE_SWAPAB=1 \
nohup python3 -m sglang.launch_server \
--model-path /path/to/DeepSeek-R1 \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
--disaggregation-bootstrap-port 9000 \
--attention-backend flashmla \
--host 0.0.0.0 \
--port 61001 \
--trust-remote-code \
--dist-init-addr {decode_master_ip}:62001 \
--nnodes 2 \
--node-rank {node_rank} \
--tp-size 16 \
--dp-size 16 \
--enable-dp-attention \
--mem-fraction-static 0.88 \
--max-running-requests 768 \
--context-length 65535 \
--log-level info \
--decode-log-interval 50 \
--page-size 64 \
--schedule-conservativeness 0.3 \
--enable-cache-report \
--moe-dense-tp-size 1 \
--enable-deepep-moe \
--enable-dp-lm-head \
--cuda-graph-max-bs 48 \
--speculative-algorithm NEXTN \
--speculative-num-steps 1 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 2 \
--init-expert-location /root/expert_workload.json \
--prefill-round-robin-balance \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--moe-a2a-backend deepep \
--deepep-mode low_latency_overlap \
--enable-single-batch-overlap \
> /home/admin/logs/stdout.log 2>&1 &
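Both the Prefill and Decode launch commands reference the RDMA devices mlx5_bond_0 through mlx5_bond_3. Before launching, it can be worth confirming that these devices are visible inside the container (a sketch; assumes the rdma-core utilities are present in the image):

# List RDMA devices visible to the container
ibv_devices

# Inspect one of the devices named in --disaggregation-ib-device
ibv_devinfo -d mlx5_bond_0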

3. Launching SGLang Router

Note:

  • Replace {decode_master_ip}, {prefill_node_0_ip}, and {prefill_node_1_ip} with the respective IP addresses.
  • Adjust the port number if there is a conflict.
nohup python3 -m sglang_router.launch_router \
--pd-disaggregation \
--mini-lb \
--host 0.0.0.0 \
--decode http://{decode_master_ip}:61001 \
--port 8000 \
--prefill http://{prefill_node_0_ip}:61001 \
--prefill http://{prefill_node_1_ip}:61001 \
> /home/admin/logs/router.log 2>&1 &
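Once the router is up, a single request can be sent through it as a smoke test before benchmarking (a sketch assuming the standard SGLang /generate API; the prompt and sampling parameters are arbitrary):

curl -s http://127.0.0.1:8000/generate \
-H "Content-Type: application/json" \
-d '{"text": "Explain the KV cache in one sentence.", "sampling_params": {"max_new_tokens": 64, "temperature": 0}}'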

Testing

1. Running the Benchmark

Note:

  • This script is designed to observe peak performance in logs. Since --request-rate is set to inf, all requests are sent at once, making TTFT and TPOT data less meaningful.
  • Replace {path-to-shareGPT} with the path to the ShareGPT dataset.
nohup python3 -m sglang.bench_serving \
--host 0.0.0.0 \
--port 8000 \
--dataset-path {path-to-shareGPT} \
--num-prompts 4096 \
--random-input 4096 \
--random-output 1536 \
--request-rate "inf" \
--max-concurrency 2048 \
--warmup-requests 0 \
--backend sglang \
--dataset-name random \
--random-range-ratio 1 \
> /home/local/workspace/bench.log 2>&1 &

2. Observing Logs

To monitor peak performance, filter logs for entries with running-req: 48:

grep -E 'Decode batch.*running-req: 48' /home/admin/logs/sglang.log

Example Output (for batch size = 48):

2025-10-29 02:27:35 INFO 8900 [ DP15 TP15 EP15 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 48, #token: 571264, token usage: 0.76, accept len: 1.91, pre-allocated usage: 0.36, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 740.64, #queue-req: 13, 
2025-10-29 02:27:35 INFO 8894 [ DP9 TP9 EP9 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 48, #token: 573312, token usage: 0.76, accept len: 1.92, pre-allocated usage: 0.36, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 744.66, #queue-req: 13, 
2025-10-29 02:27:35 INFO 8899 [ DP14 TP14 EP14 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 48, #token: 571840, token usage: 0.76, accept len: 1.91, pre-allocated usage: 0.36, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 741.26, #queue-req: 14, 
2025-10-29 02:27:40 INFO 8898 [ DP13 TP13 EP13 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 48, #token: 578688, token usage: 0.77, accept len: 1.90, pre-allocated usage: 0.35, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 719.54, #queue-req: 17, 
2025-10-29 02:27:41 INFO 8896 [ DP11 TP11 EP11 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 48, #token: 577344, token usage: 0.77, accept len: 1.91, pre-allocated usage: 0.36, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 710.05, #queue-req: 15, 
2025-10-29 02:27:41 INFO 8897 [ DP12 TP12 EP12 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 48, #token: 577792, token usage: 0.77, accept len: 1.90, pre-allocated usage: 0.36, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 707.89, #queue-req: 15, 
2025-10-29 02:27:41 INFO 8893 [ DP8 TP8 EP8 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 47, #token: 572288, token usage: 0.76, accept len: 1.91, pre-allocated usage: 0.35, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 710.46, #queue-req: 16

Related PRs

Profiling

Download the profiling files linked below, then view them in the Perfetto UI: Perfetto

Prefill

Input=4K, chunked-prefill-size=16384: h20_blog_prefill_tp8_input_4k.json.gz

Decode

running-req: 48: h20_blog_decode_ep16_bs48.json.gz

running-req: 32: h20_blog_decode_ep16_bs32.json.gz

@sourcery-ai

sourcery-ai bot commented Oct 20, 2025

Reviewer's Guide

This PR implements a new one-shot multi-head attention mode for DeepSeek-V2, enriches fused MoE Triton kernels with descriptor/TMA/filtering support, introduces Triton-based KV buffer operations in the memory pool and utils, updates config generation for down-MoE scenarios, and adds a comprehensive benchmark/tuning script for the fused MoE kernels.

Sequence diagram for one-shot MHA attention path in DeepSeek-V2

sequenceDiagram
    participant FB as ForwardBatch
    participant Attn as DeepseekV2AttentionMLA
    participant KVPool as MLATokenToKVPool
    FB->>Attn: forward_prepare(...)
    Attn->>FB: _support_mha_one_shot(...)
    alt MHA_ONE_SHOT supported
        Attn->>Attn: forward_normal_one_shot_prepare(...)
        Attn->>FB: fetch_mha_one_shot_kv_indices()
        Attn->>KVPool: get_mla_kv_buffer(...)
        KVPool-->>Attn: (kv_a, k_pe)
        Attn->>Attn: forward_normal_one_shot_core(...)
    else fallback
        Attn->>Attn: forward_normal_chunked_kv_prepare(...)
    end

Sequence diagram for fused MoE Triton kernel invocation with TMA/descriptor support

sequenceDiagram
    participant Worker as BenchmarkWorker
    participant FusedMoE as FusedMoE
    participant Kernel as TritonKernel
    Worker->>FusedMoE: benchmark(...)
    FusedMoE->>Kernel: invoke_fused_moe_kernel(..., a_desc, b_desc, filter_expert)
    Kernel-->>FusedMoE: (results)
    FusedMoE-->>Worker: (latency results)

Class diagram for new and updated DeepSeek-V2 attention and MoE classes

classDiagram
    class AttnForwardMethod {
        +MHA_CHUNKED_KV
        +MHA_ONE_SHOT
        +MLA_FUSED_ROPE
    }
    class DeepseekV2AttentionMLA {
        +kv_cache_dtype
        +forward_normal_one_shot_prepare()
        +forward_normal_one_shot_core()
        +_set_mla_kv_buffer()
        +_get_mla_kv_buffer()
        +_concat_and_cast_mha_k()
    }
    class ForwardBatch {
        +mha_one_shot_kv_indices
        +mha_one_shot
        +fetch_mha_one_shot_kv_indices()
    }
    class MLATokenToKVPool {
        +get_mla_kv_buffer()
    }
    AttnForwardMethod <|-- DeepseekV2AttentionMLA
    DeepseekV2AttentionMLA <.. ForwardBatch
    ForwardBatch <.. MLATokenToKVPool

Class diagram for Fused MoE Triton kernel and config changes

classDiagram
    class BenchmarkWorker {
        +benchmark()
        +tune()
    }
    class BestConfigTrace {
        +update()
        +total_time
        +config_dict()
    }
    class MoeRunnerConfig {
        +inplace
        +num_experts
        +num_local_experts
    }
    class FusedMoE {
        +fused_experts_impl(..., filter_expert)
    }
    class FusedMoEConfig {
        +get_config_file_name(..., down_moe)
        +get_moe_configs(..., down_moe)
        +try_get_optimal_moe_config(..., return_down_config)
    }
    BenchmarkWorker <.. BestConfigTrace
    FusedMoEConfig <.. FusedMoE

File-Level Changes

1. Add one-shot MHA method to DeepSeekV2 attention pipeline
  • Introduce AttnForwardMethod.MHA_ONE_SHOT and support predicate
  • Update backend dispatch to select one-shot mode when capacity allows
  • Implement forward_normal_one_shot_prepare/core and KV buffer set/get helpers
  • Extend ForwardBatch with mha_one_shot fields and fetch_mha_one_shot_kv_indices
  • Adjust flashinfer and flashattention backends for one-shot execution
  Files:
  • python/sglang/srt/models/deepseek_v2.py
  • python/sglang/srt/model_executor/forward_batch_info.py
  • python/sglang/srt/layers/attention/utils.py
  • python/sglang/srt/layers/attention/flashinfer_mla_backend.py
  • python/sglang/srt/layers/attention/flashattention_backend.py

2. Enhance fused MoE Triton kernels with tensor descriptor and TMA support
  • Add TensorDescriptor import path check and filter_expert flag
  • Extend invoke_fused_moe_kernel signature with a_desc, b_desc, c_sorted, filter_expert, a_use_tma, b_use_tma
  • Update fused_moe API to propagate new parameters and filter_expert logic
  • Augment config naming and try_get_optimal_moe_config to handle down_moe scenarios
  Files:
  • python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_kernels.py
  • python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py
  • python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_config.py

3. Implement low-level Triton kernels for KV buffer operations
  • Add get_mla_kv_buffer_triton kernel and wrapper in memory_pool
  • Add concat_and_cast_mha_k_kernel and triton wrapper in attention utils
  • Integrate MLATokenToKVPool.get_mla_kv_buffer method
  Files:
  • python/sglang/srt/mem_cache/memory_pool.py
  • python/sglang/srt/layers/attention/utils.py

4. Add benchmark and tuning script for fused MoE Triton kernels
  • Introduce tuning_fused_moe_triton_sep.py under benchmark/kernels
  • Implement ray-based distributed tuning, search-space generation and performance logging
  Files:
  • benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton_sep.py


@justinSmileDate

justinSmileDate commented Oct 22, 2025

Sorry to bother you, is this optimization only for the DeepSeek-R1 model? Is the DeepSeek-V3 model also available?

@TianyuZhang1214
Collaborator Author

Sorry to bother you, is this optimization only for the DeepSeek-R1 model? Is the DeepSeek-V3 model also available?

Yes, V3 is also available.

@justinSmileDate

Sorry to bother you, is this optimization only for the DeepSeek-R1 model? Is the DeepSeek-V3 model also available?

Yes, V3 is also available.

Thanks for your reply. When I try to reproduce your work, it reports a problem with the DeepGEMM library. Could you share the link to the DeepGEMM library you use? In this repository, I tried the sbo.v2.public branch to load DeepSeek-V3 for inference, but it complains that the deep_gemm.get_compile_mode() function cannot be found; indeed, the sbo.v2.public branch does not register get_compile_mode(). Although the sbo.v2.sgl branch does register get_compile_mode(), it fails the basic tests/test_fp8.py test. Could you tell me the correct steps to use it?

@JoyFuture

Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set --log-level debug, there are no sglang-related logs, only some logs from nccl, deepgemm, and the transfer engine. How can I configure it to properly print the sglang logs?

@yuan-luo
Collaborator

yuan-luo commented Oct 25, 2025

Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set --log-level debug, there are no sglang-related logs, only some logs from nccl, deepgemm, and the transfer engine. How can I configure it to properly print the sglang logs?

The log is redirected to /home/admin/logs/stdout.log.

@justinSmileDate

Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set --log-level debug, there are no sglang-related logs, only some logs from nccl, deepgemm, and the transfer engine. How can I configure it to properly print the sglang logs?

Could you please tell me if you used a Docker image or compiled from source code to get it to work successfully? I tried compiling from source code but failed.

@TianyuZhang1214
Collaborator Author

Sorry to bother you, is this optimization only for the DeepSeek-R1 model? Is the DeepSeek-V3 model also available?

Yes, V3 is also available.

Thanks for your reply. When I try to reproduce your work, it reports a problem with the DeepGEMM library. Could you share the link to the DeepGEMM library you use? In this repository, I tried the sbo.v2.public branch to load DeepSeek-V3 for inference, but it complains that the deep_gemm.get_compile_mode() function cannot be found; indeed, the sbo.v2.public branch does not register get_compile_mode(). Although the sbo.v2.sgl branch does register get_compile_mode(), it fails the basic tests/test_fp8.py test. Could you tell me the correct steps to use it?

Please use the docker image.

These features from FlashMLA-FP8, DeepEP, and DeepGEMM are still under review and require compilation from source. For DeepGEMM, you must first merge the code from PR#183 and PR#192 yourself, which can be complex until they are integrated into the main branch. The Docker image simplifies this process.
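For anyone who still prefers building from source, one way to pull those two DeepGEMM pull requests locally is via GitHub's pull-request refs (a sketch, assuming the PRs are against the upstream deepseek-ai/DeepGEMM repository; conflicts may still need manual resolution):

git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
git fetch origin pull/183/head:pr-183 pull/192/head:pr-192
git merge pr-183
git merge pr-192
./install.sh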

@JoyFuture

I’d like to reproduce the final performance reported in the blog (each node achieves 16.5k input tokens per second and 5.7k output tokens per second on 4096-token input sequences). How should I do that? I only found scripts for peak performance testing—what command did you use to benchmark the metrics shown in the blog? Could you share it? Thanks.

@JoyFuture

Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set --log-level debug, there are no sglang-related logs, only some logs from nccl, deepgemm, and the transfer engine. How can I configure it to properly print the sglang logs?

The log is redirected to /home/admin/logs/stdout.log.

Thanks so much. I found the redirected log file at /home/admin/log/sglang.log, but my startup command didn’t redirect the logs. If I want to restore printing to the terminal after startup or modify the log-saving configuration, how should I do it?

@JoyFuture

Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set --log-level debug, there are no sglang-related logs, only some logs from nccl, deepgemm, and the transfer engine. How can I configure it to properly print the sglang logs?

Could you please tell me if you used a Docker image or compiled from source code to get it to work successfully? I tried compiling from source code but failed.

I’m deploying via Docker images and it’s running fine. If you manage to compile from source and run it successfully, I’d appreciate it if you could share your steps.

@TianyuZhang1214
Collaborator Author

I’d like to reproduce the final performance reported in the blog (each node achieves 16.5k input tokens per second and 5.7k output tokens per second on 4096-token input sequences). How should I do that? I only found scripts for peak performance testing—what command did you use to benchmark the metrics shown in the blog? Could you share it? Thanks.

To reproduce the performance metrics from the blog (16.5k input tokens/s and 5.7k output tokens/s for 4096-token sequences), follow the setup in the Launching SGLang guide, using 2 Prefill nodes and 2 Decode nodes. Start by testing with the bs=32/GPU configuration as shown in the Testing section's example comments. To fully replicate the blog's results, increase the batch size to bs=48/GPU.

@zheng1

zheng1 commented Oct 28, 2025

Do you mean using sglang.bench_one_batch_server to test the batch size (bs)?
I didn’t find any batch size–related configuration in sglang.bench_serving.

Also, when you say bs=32/GPU or bs=48/GPU, does the GPU value refer to 8 GPUs or 8×4=32 GPUs in total?

I only recently started trying to reproduce DeepSeek’s performance under PD separation, so if I’ve misunderstood any of the fundamentals, please feel free to correct me.

@zheng1

zheng1 commented Oct 28, 2025

I reread your reply and would like to restate my updated understanding.
Does the line running-req: 32 in the decode log mean that the gen throughput shown there corresponds to the case when this node was processing a batch size of 32?
If we want to fully reproduce the results shown in the blog, what command should we use for the benchmark, and what command should we use to check the performance?
If possible, could you please share the exact commands you used for reproduction? Thank you very much!

@TianyuZhang1214
Collaborator Author

I reread your reply and would like to restate my updated understanding. Does the line running-req: 32 in the decode log mean that the gen throughput shown there corresponds to the case when this node was processing a batch size of 32? If we want to fully reproduce the results shown in the blog, what command should we use for the benchmark, and what command should we use to check the performance? If possible, could you please share the exact commands you used for reproduction? Thank you very much!

This test requires 4 nodes with 8 H20 GPUs each (4×8 H20).
Simply follow the Reproduction Steps in the PR description—the exact commands are provided there.

@zheng1

zheng1 commented Oct 29, 2025

I reproduced the following results on three H20 96G nodes (P1D2) using the command you provided — does this outcome meet expectations?

2025-10-29 01:58:38 INFO 54435 [ DP12 TP12 EP12 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 32, #token: 121088, token usage: 0.16, accept len: 1.97, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 727.89, #queue-req: 3,

Does this result mean that, across 2 nodes, the output throughput is
2 × 8 × 727.89 = 11,646.24 tokens per second (t/s)?

python -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4096 --random-input 4096 --random-output 4096 --host 10.0.18.1 --port 30000 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model /data01/models/DeepSeek-V3.1/ --request-rate "inf" --max-concurrency 512 --warmup-requests 0 --flush-cache

Updated:

python -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4096 --random-input 4096 --random-output 1536 --host 10.0.18.1 --port 30000 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model /data01/models/DeepSeek-V3.1/ --request-rate "inf" --max-concurrency 2048 --warmup-requests 0 --flush-cache
Input token throughput (tok/s):          15627.10
Output token throughput (tok/s):         5897.04
Total token throughput (tok/s):          21524.14
Concurrency:                             1526.01

@zheng1

zheng1 commented Oct 29, 2025

I can’t find running-req: 48 in the sglang.log.
Is that because I only have one prefill node, so the maximum running-req value on my decode nodes is 32? @TianyuZhang1214

@TianyuZhang1214
Collaborator Author

@zheng1
Thank you for your experiment — you've nearly replicated ours.

2025-10-29 01:58:38 INFO 54435 [ DP12 TP12 EP12 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 32, #token: 121088, token usage: 0.16, accept len: 1.97, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 727.89, #queue-req: 3,

Yes, the decode output is expected, congratulations!

Does this result mean that, across 2 nodes, the output throughput is 2 × 8 × 727.89 = 11,646.24 tokens per second (t/s)?

The decode throughput may vary across DP ranks, so treat 11,646.24 tokens/s as an estimate.

Input token throughput (tok/s): 15627.10 Output token throughput (tok/s): 5897.04 Total token throughput (tok/s): 21524.14 Concurrency: 1526.01

The bench_serving output is lower than expected because it includes batches with sizes below 48. Please check the results for a specific batch size, e.g., running-req: 48.

I can’t find running-req: 48 in the sglang.log.

Yes, you're using only 1 Prefill node, so the pressure on Decode is insufficient. Consider using 2 Prefill nodes.

@justinSmileDate

Thanks for your reply. I merged the suggested PRs, but I'm still encountering many problems, especially with the SBO and SwapAB GEMM features.

My installation process is as follows:

FP8 MLA:

git clone https://github.com/Wangzheee/FlashMLA.git -b fp8
cd FlashMLA && pip install -v .

DEEPEP(NVSHMEM has been installed.):

git clone https://github.com/Zqy11/DeepEP.git -b feat/overlap
cd DeepEP && python setup.py install 

DeepGEMM:

Merge PR #183 and PR #192, then:
cd DeepGEMM && ./install.sh

In fact, even after the above steps, there's still a long way to go before successfully starting the service. Are there any other pull requests that need to be merged?

@TianyuZhang1214
Collaborator Author

@justinSmileDate

In fact, even after the above steps, there's still a long way to go before successfully starting the service. Are there any other pull requests that need to be merged?

Would you mind testing with the Docker image directly?
Building from source requires a lot of tedious setup, and we simply don’t have enough time to maintain detailed step-by-step instructions — that’s precisely why we open-sourced the image to begin with.

@justinSmileDate

@justinSmileDate

In fact, even after the above steps, there's still a long way to go before successfully starting the service. Are there any other pull requests that need to be merged?

Would you mind testing with the Docker image directly? Building from source requires a lot of tedious setup, and we simply don’t have enough time to maintain detailed step-by-step instructions — that’s precisely why we open-sourced the image to begin with.

Understood, I will find a way to try your docker image.

@yangzhipeng1108

yangzhipeng1108 commented Nov 3, 2025

stdout.log
sudo docker run --gpus all --shm-size 512g --network host --privileged \
-v /dev/infiniband:/dev/infiniband -e NCCL_IB_HCA=mlx5 \
--env "GLOO_SOCKET_IFNAME=bond0" \
--env "NCCL_SOCKET_IFNAME=bond0" \
--env "NCCL_DEBUG=INFO" \
-v /mnt/share/deepseek-ai:/deepseek \
--name sglang_multinode1 \
-it --rm --ipc=host \
ghcr.io/antgroup/sglang:h20-blog-release-3

@yangzhipeng1108

Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set --log-level debug, there are no sglang-related logs, only some logs from nccl, deepgemm, and the transfer engine. How can I configure it to properly print the sglang logs?

The log is redirected to /home/admin/logs/stdout.log.

Could you please send me the operating steps?

@sjtushenhai

sjtushenhai commented Nov 3, 2025

Hi @TianyuZhang1214
Do you only support H20 GPUs? I'm trying to deploy on other types of cards and ran into the following issue:

All deep_gemm operations loaded successfully!
W1103 12:25:08.122000 32988 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1103 12:25:08.122000 32988 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
Killed

The process is killed immediately, and I'm not sure how to troubleshoot it. I checked dmesg, but there's nothing useful there either. Have you ever encountered a problem like this before?

@TianyuZhang1214
Collaborator Author

TianyuZhang1214 commented Nov 3, 2025

stdout.log sudo docker run --gpus all --shm-size 512g --network host --privileged -v /dev/infiniband:/dev/infiniband -e NCCL_IB_HCA=mlx5 --env "GLOO_SOCKET_IFNAME=bond0" --env "NCCL_SOCKET_IFNAME=bond0" --env "NCCL_DEBUG=INFO" -v /mnt/share/deepseek-ai:/deepseek --name sglang_multinode1 -it --rm --ipc=host ghcr.io/antgroup/sglang:h20-blog-release-3

@yangzhipeng1108
Please check the following error message:

/root/qiongyu.zqy/DeepEP/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
/root/qiongyu.zqy/DeepEP/nvshmem_src/src/host/topo/topo.cpp:469: [GPU 7] Peer GPU 8 is not accessible, exiting ...
/root/qiongyu.zqy/DeepEP/nvshmem_src/src/host/init/init.cu:1035: non-zero status: 3 building transport map failed

By the way, all environment variables are defined in /root/env.sh. Please modify them inside that file instead of passing them on the command line, e.g. --env "GLOO_SOCKET_IFNAME=bond0".

@TianyuZhang1214
Collaborator Author

@sjtushenhai

Do you only support H20 GPUs?

All Hopper GPUs are supported. H20 is recommended for our optimizations.

Have you ever encountered a problem like this before?

No, we haven't. We’ve only tested on Hopper GPUs—sorry about that.

@sjtushenhai

@sjtushenhai

Do you only support H20 GPUs?

All Hopper GPUs are supported. H20 is recommended for our optimizations.

Have you ever encountered a problem like this before?

No, we haven't. We’ve only tested on Hopper GPUs—sorry about that.

My environment is:

NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.9
cat /etc/modprobe.d/nvidia.conf
options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"

Does this patch require this driver version, or is it necessary to upgrade the driver version?

@justinSmileDate

Would you mind disclosing the P and D node deployment parameters and test scripts for the base (BF16+MTP) from the article Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G? This is very important to me.

@yangzhipeng1108

sudo docker run
Exception ignored in atexit callback: <function move_cutlass_compiled_cache at 0x7f10e5de5b20>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/cuda/cutlass_utils.py", line 39, in move_cutlass_compiled_cache
if not os.path.exists(cutlass.CACHE_FILE):
^^^^^^^^^^^^^^^^^^
AttributeError: module 'cutlass' has no attribute 'CACHE_FILE'
Killed

@TianyuZhang1214
Collaborator Author

@sjtushenhai

Does this patch require this driver version, or is it necessary to upgrade the driver version?

CUDA 12.9 is OK.

@TianyuZhang1214
Collaborator Author

TianyuZhang1214 commented Nov 4, 2025

@justinSmileDate

Would you mind disclosing the P and D node deployment parameters and test scripts for the base (BF16+MTP) from the article Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G? This is very important to me.

All relevant details are included in this PR. Please refer to the Launching SGLang and Testing sections in the PR description.

@TianyuZhang1214
Collaborator Author

@yangzhipeng1108

sudo docker run Exception ignored in atexit callback: <function move_cutlass_compiled_cache at 0x7f10e5de5b20> Traceback (most recent call last): File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/cuda/cutlass_utils.py", line 39, in move_cutlass_compiled_cache if not os.path.exists(cutlass.CACHE_FILE): ^^^^^^^^^^^^^^^^^^ AttributeError: module 'cutlass' has no attribute 'CACHE_FILE' Killed

We're sorry, but we haven't encountered this error before. You may need to troubleshoot and resolve it on your own.
