
@TianyuZhang1214
Collaborator

TianyuZhang1214 commented Oct 20, 2025

Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices

Introduction

We published an article on LMSYS titled "Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G", sharing our best practices for deploying the DeepSeek-R1 model on H20-96G hardware.
To facilitate reproduction of our experimental results and to provide access to our code, we have opened this pull request with the reproduction steps below.

Reproduction Steps

Pulling the Docker Image

To obtain the Docker image, use the following command:

docker pull ghcr.io/antgroup/sglang:h20-blog-release

The image is hosted at: https://github.com/orgs/antgroup/packages/container/package/sglang
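A container can then be started from this image along the following lines. This is only a sketch modeled on a docker run command shared later in this thread; the GPU, shared-memory, networking, and InfiniBand options are assumptions to adapt to your cluster, and the model mount path and container name are placeholders.

docker run --gpus all --ipc=host --network host --privileged \
--shm-size 512g \
-v /dev/infiniband:/dev/infiniband \
-v /path/to/DeepSeek-R1:/path/to/DeepSeek-R1 \
--name sglang_node0 \
-it ghcr.io/antgroup/sglang:h20-blog-release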

Checking Environment Variables

All environment variables are stored in the /root/env.sh file, configured for our H20 environment. Before launching SGLang, verify that these variables are suitable for your environment.
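For example, from inside a running container (a minimal sketch; the interface-related variables shown are ones the maintainers mention later in this thread as living in this file):

# Review the preconfigured variables, e.g., interface settings such as GLOO_SOCKET_IFNAME / NCCL_SOCKET_IFNAME
cat /root/env.sh

# Edit anything that does not match your environment
vim /root/env.sh

# Load the variables into the current shell if it does not already source this file
source /root/env.sh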

Launching SGLang

We recommend running four containers: two for Prefill nodes and two for Decode nodes.

1. Launching Prefill Nodes (Identical Configuration for Both Nodes)

Note:

  • Both Prefill nodes use the same launch parameters.
  • Adjust the port number if there is a conflict.
PYTHONUNBUFFERED=1 \
SGL_CHUNKED_PREFIX_CACHE_THRESHOLD=0 \
nohup python3 -m sglang.launch_server \
--trust-remote-code \
--model-path /path/to/DeepSeek-R1 \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
--host 0.0.0.0 \
--port 61001 \
--tp-size 8 \
--page-size 64 \
--attention-backend fa3 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 16384 \
--max-running-requests 768 \
--context-length 65535 \
--enable-cache-report \
--log-level info \
--load-balance-method round_robin \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
> /home/admin/logs/stdout.log 2>&1 &
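After a prefill node starts, a quick readiness check can be run against its HTTP port (a sketch assuming the standard SGLang /health and /get_model_info endpoints; adjust the port if you changed it above):

# Watch the server log until model loading completes
tail -f /home/admin/logs/stdout.log

# Probe the server
curl -s http://127.0.0.1:61001/health
curl -s http://127.0.0.1:61001/get_model_info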

2. Launching Decode Nodes

Note:

  • Set {node_rank} to 0 or 1 for the respective node.
  • Replace {decode_master_ip} with the IP address of Node 0.
  • Adjust the port number if there is a conflict.
Node 0 / Node 1 (same command; set {node_rank} accordingly)
PYTHONUNBUFFERED=1 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=96 \
ENABLE_SWAPAB=1 \
nohup python3 -m sglang.launch_server \
--model-path /path/to/DeepSeek-R1 \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
--disaggregation-bootstrap-port 9000 \
--attention-backend flashmla \
--host 0.0.0.0 \
--port 61001 \
--trust-remote-code \
--dist-init-addr {decode_master_ip}:62001 \
--nnodes 2 \
--node-rank {node_rank} \
--tp-size 16 \
--dp-size 16 \
--enable-dp-attention \
--mem-fraction-static 0.88 \
--max-running-requests 768 \
--context-length 65535 \
--log-level info \
--decode-log-interval 50 \
--page-size 64 \
--schedule-conservativeness 0.3 \
--enable-cache-report \
--moe-dense-tp-size 1 \
--enable-deepep-moe \
--enable-dp-lm-head \
--cuda-graph-max-bs 48 \
--speculative-algorithm NEXTN \
--speculative-num-steps 1 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 2 \
--init-expert-location /root/expert_workload.json \
--prefill-round-robin-balance \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--moe-a2a-backend deepep \
--deepep-mode low_latency_overlap \
--enable-single-batch-overlap \
> /home/admin/logs/stdout.log 2>&1 &
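Both the Prefill and Decode launch commands reference the RDMA devices mlx5_bond_0 through mlx5_bond_3. Before launching, it can be worth confirming that these devices are visible inside the container (a sketch; assumes the rdma-core utilities are present in the image):

# List RDMA devices visible to the container
ibv_devices

# Inspect one of the devices named in --disaggregation-ib-device
ibv_devinfo -d mlx5_bond_0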

3. Launching SGLang Router

Note:

  • Replace {decode_master_ip}, {prefill_node_0_ip}, and {prefill_node_1_ip} with the respective IP addresses.
  • Adjust the port number if there is a conflict.
nohup python3 -m sglang_router.launch_router \
--pd-disaggregation \
--mini-lb \
--host 0.0.0.0 \
--decode http://{decode_master_ip}:61001 \
--port 8000 \
--prefill http://{prefill_node_0_ip}:61001 \
--prefill http://{prefill_node_1_ip}:61001 \
> /home/admin/logs/router.log 2>&1 &
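Once the router is up, a single request can be sent through it as a smoke test before benchmarking (a sketch assuming the standard SGLang /generate API; the prompt and sampling parameters are arbitrary):

curl -s http://127.0.0.1:8000/generate \
-H "Content-Type: application/json" \
-d '{"text": "Explain the KV cache in one sentence.", "sampling_params": {"max_new_tokens": 64, "temperature": 0}}'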

Testing

1. Running the Benchmark

Note:

  • This script is designed to observe peak performance in logs. Since --request-rate is set to inf, all requests are sent at once, making TTFT and TPOT data less meaningful.
  • Replace {path-to-shareGPT} with the path to the ShareGPT dataset.
nohup python3 -m sglang.bench_serving \
--host 0.0.0.0 \
--port 8000 \
--dataset-path {path-to-shareGPT} \
--num-prompts 4096 \
--random-input 4096 \
--random-output 1536 \
--request-rate "inf" \
--max-concurrency 2048 \
--warmup-requests 0 \
--backend sglang \
--dataset-name random \
--random-range-ratio 1 \
> /home/local/workspace/bench.log 2>&1 &

2. Observing Logs

To monitor peak performance, filter logs for entries with running-req: 48:

grep -E 'Decode batch.*running-req: 48' /home/admin/logs/sglang.log

Example Output (for batch size = 48):

2025-10-29 02:27:35 INFO 8900 [ DP15 TP15 EP15 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 48, #token: 571264, token usage: 0.76, accept len: 1.91, pre-allocated usage: 0.36, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 740.64, #queue-req: 13, 
2025-10-29 02:27:35 INFO 8894 [ DP9 TP9 EP9 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 48, #token: 573312, token usage: 0.76, accept len: 1.92, pre-allocated usage: 0.36, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 744.66, #queue-req: 13, 
2025-10-29 02:27:35 INFO 8899 [ DP14 TP14 EP14 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 48, #token: 571840, token usage: 0.76, accept len: 1.91, pre-allocated usage: 0.36, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 741.26, #queue-req: 14, 
2025-10-29 02:27:40 INFO 8898 [ DP13 TP13 EP13 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 48, #token: 578688, token usage: 0.77, accept len: 1.90, pre-allocated usage: 0.35, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 719.54, #queue-req: 17, 
2025-10-29 02:27:41 INFO 8896 [ DP11 TP11 EP11 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 48, #token: 577344, token usage: 0.77, accept len: 1.91, pre-allocated usage: 0.36, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 710.05, #queue-req: 15, 
2025-10-29 02:27:41 INFO 8897 [ DP12 TP12 EP12 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 48, #token: 577792, token usage: 0.77, accept len: 1.90, pre-allocated usage: 0.36, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 707.89, #queue-req: 15, 
2025-10-29 02:27:41 INFO 8893 [ DP8 TP8 EP8 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 47, #token: 572288, token usage: 0.76, accept len: 1.91, pre-allocated usage: 0.35, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 710.46, #queue-req: 16

Related PRs

Profiling

Download the profiling files linked below, then view them in the Perfetto UI: Perfetto

Prefill

Input=4K, chunked-prefill-size=16384: h20_blog_prefill_tp8_input_4k.json.gz

Decode

running-req: 48: h20_blog_decode_ep16_bs48.json.gz

running-req: 32: h20_blog_decode_ep16_bs32.json.gz

@sourcery-ai

sourcery-ai bot commented Oct 20, 2025

Reviewer's Guide

This PR implements a new one-shot multi-head attention mode for DeepSeek-V2, enriches fused MoE Triton kernels with descriptor/TMA/filtering support, introduces Triton-based KV buffer operations in the memory pool and utils, updates config generation for down-MoE scenarios, and adds a comprehensive benchmark/tuning script for the fused MoE kernels.

Sequence diagram for one-shot MHA attention path in DeepSeek-V2

sequenceDiagram
    participant FB as ForwardBatch
    participant Attn as DeepseekV2AttentionMLA
    participant KVPool as MLATokenToKVPool
    FB->>Attn: forward_prepare(...)
    Attn->>FB: _support_mha_one_shot(...)
    alt MHA_ONE_SHOT supported
        Attn->>Attn: forward_normal_one_shot_prepare(...)
        Attn->>FB: fetch_mha_one_shot_kv_indices()
        Attn->>KVPool: get_mla_kv_buffer(...)
        KVPool-->>Attn: (kv_a, k_pe)
        Attn->>Attn: forward_normal_one_shot_core(...)
    else fallback
        Attn->>Attn: forward_normal_chunked_kv_prepare(...)
    end

Sequence diagram for fused MoE Triton kernel invocation with TMA/descriptor support

sequenceDiagram
    participant Worker as BenchmarkWorker
    participant FusedMoE as FusedMoE
    participant Kernel as TritonKernel
    Worker->>FusedMoE: benchmark(...)
    FusedMoE->>Kernel: invoke_fused_moe_kernel(..., a_desc, b_desc, filter_expert)
    Kernel-->>FusedMoE: (results)
    FusedMoE-->>Worker: (latency results)

Class diagram for new and updated DeepSeek-V2 attention and MoE classes

classDiagram
    class AttnForwardMethod {
        +MHA_CHUNKED_KV
        +MHA_ONE_SHOT
        +MLA_FUSED_ROPE
    }
    class DeepseekV2AttentionMLA {
        +kv_cache_dtype
        +forward_normal_one_shot_prepare()
        +forward_normal_one_shot_core()
        +_set_mla_kv_buffer()
        +_get_mla_kv_buffer()
        +_concat_and_cast_mha_k()
    }
    class ForwardBatch {
        +mha_one_shot_kv_indices
        +mha_one_shot
        +fetch_mha_one_shot_kv_indices()
    }
    class MLATokenToKVPool {
        +get_mla_kv_buffer()
    }
    AttnForwardMethod <|-- DeepseekV2AttentionMLA
    DeepseekV2AttentionMLA <.. ForwardBatch
    ForwardBatch <.. MLATokenToKVPool

Class diagram for Fused MoE Triton kernel and config changes

classDiagram
    class BenchmarkWorker {
        +benchmark()
        +tune()
    }
    class BestConfigTrace {
        +update()
        +total_time
        +config_dict()
    }
    class MoeRunnerConfig {
        +inplace
        +num_experts
        +num_local_experts
    }
    class FusedMoE {
        +fused_experts_impl(..., filter_expert)
    }
    class FusedMoEConfig {
        +get_config_file_name(..., down_moe)
        +get_moe_configs(..., down_moe)
        +try_get_optimal_moe_config(..., return_down_config)
    }
    BenchmarkWorker <.. BestConfigTrace
    FusedMoEConfig <.. FusedMoE

File-Level Changes

1. Add one-shot MHA method to DeepSeekV2 attention pipeline
  • Introduce AttnForwardMethod.MHA_ONE_SHOT and support predicate
  • Update backend dispatch to select one-shot mode when capacity allows
  • Implement forward_normal_one_shot_prepare/core and KV buffer set/get helpers
  • Extend ForwardBatch with mha_one_shot fields and fetch_mha_one_shot_kv_indices
  • Adjust flashinfer and flashattention backends for one-shot execution
  Files:
  • python/sglang/srt/models/deepseek_v2.py
  • python/sglang/srt/model_executor/forward_batch_info.py
  • python/sglang/srt/layers/attention/utils.py
  • python/sglang/srt/layers/attention/flashinfer_mla_backend.py
  • python/sglang/srt/layers/attention/flashattention_backend.py

2. Enhance fused MoE Triton kernels with tensor descriptor and TMA support
  • Add TensorDescriptor import path check and filter_expert flag
  • Extend invoke_fused_moe_kernel signature with a_desc, b_desc, c_sorted, filter_expert, a_use_tma, b_use_tma
  • Update fused_moe API to propagate new parameters and filter_expert logic
  • Augment config naming and try_get_optimal_moe_config to handle down_moe scenarios
  Files:
  • python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_kernels.py
  • python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py
  • python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_config.py

3. Implement low-level Triton kernels for KV buffer operations
  • Add get_mla_kv_buffer_triton kernel and wrapper in memory_pool
  • Add concat_and_cast_mha_k_kernel and triton wrapper in attention utils
  • Integrate MLATokenToKVPool.get_mla_kv_buffer method
  Files:
  • python/sglang/srt/mem_cache/memory_pool.py
  • python/sglang/srt/layers/attention/utils.py

4. Add benchmark and tuning script for fused MoE Triton kernels
  • Introduce tuning_fused_moe_triton_sep.py under benchmark/kernels
  • Implement ray-based distributed tuning, search-space generation and performance logging
  Files:
  • benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton_sep.py


@justinSmileDate

justinSmileDate commented Oct 22, 2025

Sorry to bother you, is this optimization only for the DeepSeek-R1 model? Is the DeepSeek-V3 model also available?

@TianyuZhang1214
Collaborator Author

Sorry to bother you, is this optimization only for the DeepSeek-R1 model? Is the DeepSeek-V3 model also available?

Yes, V3 is also available.

@justinSmileDate

Sorry to bother you, is this optimization only for the DeepSeek-R1 model? Is the DeepSeek-V3 model also available?

Yes, V3 is also available.

Thanks for your reply. When I try to reproduce your work, it reports a problem with the DeepGEMM library. Could you share the link to the DeepGEMM library you use? In this repository, I tried the sbo.v2.public branch to load DeepSeek-V3 for inference, but it complains that the deep_gemm.get_compile_mode() function cannot be found; indeed, the sbo.v2.public branch does not register get_compile_mode(). Although the sbo.v2.sgl branch does register get_compile_mode(), it fails the basic tests/test_fp8.py test. Could you tell me the correct steps to use it?

@JoyFuture

Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set --log-level debug, there are no sglang-related logs, only some logs from nccl, deepgemm, and the transfer engine. How can I configure it to properly print the sglang logs?

@yuan-luo
Collaborator

yuan-luo commented Oct 25, 2025

Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set --log-level debug, there are no sglang-related logs, only some logs from nccl, deepgemm, and the transfer engine. How can I configure it to properly print the sglang logs?

The log is redirected to /home/admin/logs/stdout.log.

@justinSmileDate

Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set --log-level debug, there are no sglang-related logs, only some logs from nccl, deepgemm, and the transfer engine. How can I configure it to properly print the sglang logs?

Could you please tell me if you used a Docker image or compiled from source code to get it to work successfully? I tried compiling from source code but failed.

@TianyuZhang1214
Collaborator Author

Sorry to bother you, is this optimization only for the DeepSeek-R1 model? Is the DeepSeek-V3 model also available?

Yes, V3 is also available.

Thanks for your reply. When I try to reproduce your work, it reports a problem with the DeepGEMM library. Could you share the link to the DeepGEMM library you use? In this repository, I tried the sbo.v2.public branch to load DeepSeek-V3 for inference, but it complains that the deep_gemm.get_compile_mode() function cannot be found; indeed, the sbo.v2.public branch does not register get_compile_mode(). Although the sbo.v2.sgl branch does register get_compile_mode(), it fails the basic tests/test_fp8.py test. Could you tell me the correct steps to use it?

Please use the docker image.

These features from FlashMLA-FP8, DeepEP, and DeepGEMM are still under review and require compilation from source. For DeepGEMM, you must first merge the code from PR#183 and PR#192 yourself, which can be complex until they are integrated into the main branch. The Docker image simplifies this process.
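For anyone who still prefers building from source, one way to pull those two DeepGEMM pull requests locally is via GitHub's pull-request refs (a sketch, assuming the PRs are against the upstream deepseek-ai/DeepGEMM repository; conflicts may still need manual resolution):

git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
git fetch origin pull/183/head:pr-183 pull/192/head:pr-192
git merge pr-183
git merge pr-192
./install.sh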

@JoyFuture

I’d like to reproduce the final performance reported in the blog (each node achieves 16.5k input tokens per second and 5.7k output tokens per second on 4096-token input sequences). How should I do that? I only found scripts for peak performance testing—what command did you use to benchmark the metrics shown in the blog? Could you share it? Thanks.

@JoyFuture

Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set --log-level debug, there are no sglang-related logs, only some logs from nccl, deepgemm, and the transfer engine. How can I configure it to properly print the sglang logs?

The log is redirected to /home/admin/logs/stdout.log.

Thanks so much. I found the redirected log file at /home/admin/log/sglang.log, but my startup command didn’t redirect the logs. If I want to restore printing to the terminal after startup or modify the log-saving configuration, how should I do it?

@JoyFuture

Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set --log-level debug, there are no sglang-related logs, only some logs from nccl, deepgemm, and the transfer engine. How can I configure it to properly print the sglang logs?

Could you please tell me if you used a Docker image or compiled from source code to get it to work successfully? I tried compiling from source code but failed.

I’m deploying via Docker images and it’s running fine. If you manage to compile from source and run it successfully, I’d appreciate it if you could share your steps.

@TianyuZhang1214
Collaborator Author

I’d like to reproduce the final performance reported in the blog (each node achieves 16.5k input tokens per second and 5.7k output tokens per second on 4096-token input sequences). How should I do that? I only found scripts for peak performance testing—what command did you use to benchmark the metrics shown in the blog? Could you share it? Thanks.

To reproduce the performance metrics from the blog (16.5k input tokens/s and 5.7k output tokens/s for 4096-token sequences), follow the setup in the Launching SGLang guide, using 2 Prefill nodes and 2 Decode nodes. Start by testing with the bs=32/GPU configuration as shown in the Testing section's example comments. To fully replicate the blog's results, increase the batch size to bs=48/GPU.

@zheng1

zheng1 commented Oct 28, 2025

Do you mean using sglang.bench_one_batch_server to test the batch size (bs)?
I didn’t find any batch size–related configuration in sglang.bench_serving.

Also, when you say bs=32/GPU or bs=48/GPU, does the GPU value refer to 8 GPUs or 8×4=32 GPUs in total?

I only recently started trying to reproduce DeepSeek’s performance under PD separation, so if I’ve misunderstood any of the fundamentals, please feel free to correct me.

@zheng1

zheng1 commented Oct 28, 2025

I reread your reply and would like to restate my updated understanding.
Does the line running-req: 32 in the decode log mean that the gen throughput shown there corresponds to the case when this node was processing a batch size of 32?
If we want to fully reproduce the results shown in the blog, what command should we use for the benchmark, and what command should we use to check the performance?
If possible, could you please share the exact commands you used for reproduction? Thank you very much!

@TianyuZhang1214
Collaborator Author

I reread your reply and would like to restate my updated understanding. Does the line running-req: 32 in the decode log mean that the gen throughput shown there corresponds to the case when this node was processing a batch size of 32? If we want to fully reproduce the results shown in the blog, what command should we use for the benchmark, and what command should we use to check the performance? If possible, could you please share the exact commands you used for reproduction? Thank you very much!

This test requires 4 nodes with 8 H20 GPUs each (4×8 H20).
Simply follow the Reproduction Steps in the PR description—the exact commands are provided there.

@zheng1

zheng1 commented Oct 29, 2025

I reproduced the following results on three H20 96G nodes (P1D2) using the command you provided — does this outcome meet expectations?

2025-10-29 01:58:38 INFO 54435 [ DP12 TP12 EP12 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 32, #token: 121088, token usage: 0.16, accept len: 1.97, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 727.89, #queue-req: 3,

Does this result mean that, across 2 nodes, the output throughput is
2 × 8 × 727.89 = 11,646.24 tokens per second (t/s)?

python -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4096 --random-input 4096 --random-output 4096 --host 10.0.18.1 --port 30000 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model /data01/models/DeepSeek-V3.1/ --request-rate "inf" --max-concurrency 512 --warmup-requests 0 --flush-cache

Updated:

python -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4096 --random-input 4096 --random-output 1536 --host 10.0.18.1 --port 30000 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model /data01/models/DeepSeek-V3.1/ --request-rate "inf" --max-concurrency 2048 --warmup-requests 0 --flush-cache
Input token throughput (tok/s):          15627.10
Output token throughput (tok/s):         5897.04
Total token throughput (tok/s):          21524.14
Concurrency:                             1526.01

@zheng1

zheng1 commented Oct 29, 2025

I can’t find running-req: 48 in the sglang.log.
Is that because I only have one prefill node, so the maximum running-req value on my decode nodes is 32? @TianyuZhang1214

@TianyuZhang1214
Collaborator Author

@zheng1
Thank you for your experiment — you've nearly replicated ours.

2025-10-29 01:58:38 INFO 54435 [ DP12 TP12 EP12 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 32, #token: 121088, token usage: 0.16, accept len: 1.97, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 727.89, #queue-req: 3,

Yes, the decode output is expected, congratulations!

Does this result mean that, across 2 nodes, the output throughput is 2 × 8 × 727.89 = 11,646.24 tokens per second (t/s)?

The decode throughput may vary across DP ranks, so treat 11,646.24 tokens/s as an estimate.

Input token throughput (tok/s): 15627.10 Output token throughput (tok/s): 5897.04 Total token throughput (tok/s): 21524.14 Concurrency: 1526.01

The bench_serving output is lower than expected because it includes batches with sizes below 48. Please check the results for a specific batch size, e.g., running-req: 48.

I can’t find running-req: 48 in the sglang.log.

Yes, you're using only 1 Prefill node, so the pressure on Decode is insufficient. Consider using 2 Prefill nodes.

@justinSmileDate

Thanks for your reply. I merged the suggested PRs, but I'm still encountering many problems, especially with the SBO and SwapAB GEMM features.

My installation process is as follows:

FP8 MLA:

git clone https://github.com/Wangzheee/FlashMLA.git -b fp8
cd FlashMLA && pip install -v .

DEEPEP(NVSHMEM has been installed.):

git clone https://github.com/Zqy11/DeepEP.git -b feat/overlap
cd DeepEP && python setup.py install 

DeepGEMM:

Merge PR #183 and PR #192, then:
cd DeepGEMM && ./install.sh

In fact, even after the above steps, there's still a long way to go before successfully starting the service. Are there any other pull requests that need to be merged?

@TianyuZhang1214
Collaborator Author

@justinSmileDate

In fact, even after the above steps, there's still a long way to go before successfully starting the service. Are there any other pull requests that need to be merged?

Would you mind testing with the Docker image directly?
Building from source requires a lot of tedious setup, and we simply don’t have enough time to maintain detailed step-by-step instructions — that’s precisely why we open-sourced the image to begin with.

@justinSmileDate

@justinSmileDate

In fact, even after the above steps, there's still a long way to go before successfully starting the service. Are there any other pull requests that need to be merged?

Would you mind testing with the Docker image directly? Building from source requires a lot of tedious setup, and we simply don’t have enough time to maintain detailed step-by-step instructions — that’s precisely why we open-sourced the image to begin with.

Understood, I will find a way to try your docker image.

@yangzhipeng1108

yangzhipeng1108 commented Nov 3, 2025

stdout.log
sudo docker run --gpus all --shm-size 512g --network host --privileged \
-v /dev/infiniband:/dev/infiniband -e NCCL_IB_HCA=mlx5 \
--env "GLOO_SOCKET_IFNAME=bond0" \
--env "NCCL_SOCKET_IFNAME=bond0" \
--env "NCCL_DEBUG=INFO" \
-v /mnt/share/deepseek-ai:/deepseek \
--name sglang_multinode1 \
-it --rm --ipc=host \
ghcr.io/antgroup/sglang:h20-blog-release-3

@yangzhipeng1108

Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set --log-level debug, there are no sglang-related logs, only some logs from nccl, deepgemm, and the transfer engine. How can I configure it to properly print the sglang logs?

The log is redirected to /home/admin/logs/stdout.log.

Could you please send me the operating steps?

@sjtushenhai

sjtushenhai commented Nov 3, 2025

Hi @TianyuZhang1214
Do you only support H20 GPUs? I'm trying to deploy on other types of cards and ran into the following issue:

All deep_gemm operations loaded successfully!
W1103 12:25:08.122000 32988 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1103 12:25:08.122000 32988 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
Killed

The process is killed immediately, and I'm not sure how to troubleshoot it. I checked dmesg, but there's nothing useful there either. Have you ever encountered a problem like this before?

@TianyuZhang1214
Collaborator Author

TianyuZhang1214 commented Nov 3, 2025

stdout.log sudo docker run --gpus all --shm-size 512g --network host --privileged -v /dev/infiniband:/dev/infiniband -e NCCL_IB_HCA=mlx5 --env "GLOO_SOCKET_IFNAME=bond0" --env "NCCL_SOCKET_IFNAME=bond0" --env "NCCL_DEBUG=INFO" -v /mnt/share/deepseek-ai:/deepseek --name sglang_multinode1 -it --rm --ipc=host ghcr.io/antgroup/sglang:h20-blog-release-3

@yangzhipeng1108
Please check the following error message:

/root/qiongyu.zqy/DeepEP/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
/root/qiongyu.zqy/DeepEP/nvshmem_src/src/host/topo/topo.cpp:469: [GPU 7] Peer GPU 8 is not accessible, exiting ...
/root/qiongyu.zqy/DeepEP/nvshmem_src/src/host/init/init.cu:1035: non-zero status: 3 building transport map failed

By the way, all environment variables are defined in /root/env.sh. Please modify them inside that file instead of passing them on the command line, e.g. --env "GLOO_SOCKET_IFNAME=bond0".

@TianyuZhang1214
Collaborator Author

@sjtushenhai

Do you only support H20 GPUs?

All Hopper GPUs are supported. H20 is recommended for our optimizations.

Have you ever encountered a problem like this before?

No, we haven't. We’ve only tested on Hopper GPUs—sorry about that.

@sjtushenhai

@sjtushenhai

Do you only support H20 GPUs?

All Hopper GPUs are supported. H20 is recommended for our optimizations.

Have you ever encountered a problem like this before?

No, we haven't. We’ve only tested on Hopper GPUs—sorry about that.

My environment is:

NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.9
cat /etc/modprobe.d/nvidia.conf
options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"

Does this patch require this driver version, or is it necessary to upgrade the driver version?

@justinSmileDate

Would you mind disclosing the P and D node deployment parameters and test scripts for the base (BF16+MTP) from the article Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G? This is very important to me.

@yangzhipeng1108

sudo docker run
Exception ignored in atexit callback: <function move_cutlass_compiled_cache at 0x7f10e5de5b20>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/cuda/cutlass_utils.py", line 39, in move_cutlass_compiled_cache
if not os.path.exists(cutlass.CACHE_FILE):
^^^^^^^^^^^^^^^^^^
AttributeError: module 'cutlass' has no attribute 'CACHE_FILE'
Killed

@TianyuZhang1214
Collaborator Author

@sjtushenhai

Does this patch require this driver version, or is it necessary to upgrade the driver version?

CUDA 12.9 is OK.

@TianyuZhang1214
Collaborator Author

TianyuZhang1214 commented Nov 4, 2025

@justinSmileDate

Would you mind disclosing the P and D node deployment parameters and test scripts for the base (BF16+MTP) from the article Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G? This is very important to me.

All relevant details are included in this PR. Please refer to the Launching SGLang and Testing sections in the PR description.

@TianyuZhang1214
Collaborator Author

@yangzhipeng1108

sudo docker run Exception ignored in atexit callback: <function move_cutlass_compiled_cache at 0x7f10e5de5b20> Traceback (most recent call last): File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/cuda/cutlass_utils.py", line 39, in move_cutlass_compiled_cache if not os.path.exists(cutlass.CACHE_FILE): ^^^^^^^^^^^^^^^^^^ AttributeError: module 'cutlass' has no attribute 'CACHE_FILE' Killed

We're sorry, but we haven't encountered this error before. You may need to troubleshoot and resolve it on your own.
