Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature, Hardware] Enable DeepseekV3 on AMD GPUs #2601

Merged
merged 27 commits into from
Jan 3, 2025

Conversation

BruceXcluding
Copy link
Contributor

@BruceXcluding BruceXcluding commented Dec 26, 2024

Motivation

  • Support DeepseekV3 on AMD Instinct MI300X GPU

Modifications

  • Add proper fix for AMD FP8 e4m3fnuz to support DeepseekV3 FP8 model
  • Bypass FlashInfer backend bmm_fp8 to cast FP8 to BF16 in MLA
  • Add AMD triton stages config

TODO

  • amd base image testing ROCm base image update #2692
  • sgl-kernel add amd backend
  • DeepseekV3 MOE config optimization
  • batch mm for FP8 optimization on rocm
  • customized block FP8 quant
  • dp attention optimization

How to run

build env

cd sglang/docker

docker build –t sglang-rocm:latest –f Dockerfile.rocm .
 
docker run -it --ipc=host \ 
               --cap-add=SYS_PTRACE \
               --network=host \ 
               --device=/dev/kfd --device=/dev/dri \
               --security-opt seccomp=unconfined \ 
               --group-add video \
               --privileged \
               -w /workspace sglang-rocm:latest 

offline:

python -m sglang.bench_one_batch --batch-size 32 --input 128 --output 32 --model /data/DeepSeek-V3-Base/ --tp 8 --trust-remote-code

Prefill. latency: 3.95045 s, throughput:   1036.84 token/s
Decode.  latency: 0.10960 s, throughput:    291.96 token/s
Decode.  latency: 0.10487 s, throughput:    305.14 token/s
Decode.  latency: 0.10468 s, throughput:    305.71 token/s
Decode.  latency: 0.10455 s, throughput:    306.07 token/s
Decode.  latency: 0.10458 s, throughput:    305.98 token/s
Decode.  median latency: 0.10458 s, median throughput:    305.98 token/s
Total. latency:  4.688 s, throughput:    928.38 token/s
Benchmark ...
Prefill. latency: 0.38250 s, throughput:  10708.55 token/s
Decode.  latency: 0.10400 s, throughput:    307.70 token/s
Decode.  latency: 0.10448 s, throughput:    306.28 token/s
Decode.  latency: 0.10446 s, throughput:    306.34 token/s
Decode.  latency: 0.10434 s, throughput:    306.70 token/s
Decode.  latency: 0.10454 s, throughput:    306.12 token/s
Decode.  median latency: 0.10429 s, median throughput:    306.83 token/s
Total. latency:  3.617 s, throughput:   1415.51 token/s

server:

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-Base --tp 8 --trust-remote-code

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8

Accuracy: 0.916
Invalid: 0.000
Latency: 246.118 s
Output throughput: 511.690 token/s

Issues

  • If you get the error like raise OutOfResources(self.metadata.shared, max_shared, "shared memory"), same with [Bug] Deepseek-v2-lite AMD MI300 run failed #2384
    Solved with python/sglang/srt/layers/attention/triton_ops/decode_attention.py +410
  • If you get an error like ImportError: cannot import name 'build_regex_from_schema' from 'outlines.fsm.json_schema', same with [Bug] SGLang v0.4.0 with AMD MI300X #2530
    Solved with downgrade vllm
  • If you get an error like `RuntimeError: [enforce fail at /app/pytorch/third_party/gloo/gloo/transport/tcp/device.cc:83] ifa != nullptr. Unable to find address for: eth0'
    Solved with ifconfig check your eth number and export GLOO_SOCKET_IFNAME=your eth

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs zhyncs added bug Something isn't working amd labels Dec 26, 2024
@carlushuang
Copy link

@HaiShaw

@HaiShaw
Copy link
Collaborator

HaiShaw commented Dec 26, 2024

@BruceXcluding Can we just add the fix to unlock v3 from the triton kernel config error first?

@zhyncs
Copy link
Member

zhyncs commented Dec 26, 2024

@BruceXcluding Can we just add the fix to unlock v3 from the triton kernel config error first?

That would be nice. I plan to release v0.4.1.post1 soon to enable users to use AMD MI300X initially.

@zhyncs
Copy link
Member

zhyncs commented Dec 26, 2024

Copy link
Collaborator

@HaiShaw HaiShaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BruceXcluding
Some to address, thanks!

@@ -37,7 +37,7 @@ ENV SGLANG_SET_CPU_AFFINITY=1
ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
ENV NCCL_MIN_NCHANNELS=112

ENV MOE_PADDING=1
ENV MOE_PADDING=0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to keep MOE_PADDING on for performance, to error it incurs we need to fix it.

docker/Dockerfile.rocm Outdated Show resolved Hide resolved
@@ -402,7 +402,7 @@ def _decode_grouped_att_m_fwd(
sm_scale,
logit_cap,
):
BLOCK = 32
BLOCK = 16 if is_hip() else 32
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not cut by half for HIP globally here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't work well in latest vllm with BLOCK 32

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part we can not take as it is - it will cost performance of all other models in large margin.

@@ -217,7 +217,7 @@ def create_weights(

# WEIGHT
weight_dtype = (
torch.float8_e4m3fn
torch.float8_e4m3fnuz if is_hip() else torch.float8_e4m3fn
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not have this, serialized weight is always OCP (torch.float8_e4m3fn)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would encounter the error "python/sglang/srt/layers/quantization/fp8_kernel.py:176:33: error: Unsupported conversion from 'f8E4M3FN' to 'f16'
accumulator += tl.dot(a, b) * a_s[:, None] * b_s[None, :]" with torch.float8_e4m3fn at w8a8_block_fp8_matmul

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check how normalize_e4m3fn_to_e4m3fnuz is used.
Basically - we do not expected non-OCP/e4m3fn dtype in the quantized model.

@@ -430,7 +432,7 @@ def get_default_config(
dtype: Optional[str],
is_marlin: bool,
) -> Dict[str, int]:
if dtype == "fp8_w8a8":
if dtype == "fp8_w8a8" and not is_hip():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

following block isn't a breaker to HIP

Copy link
Collaborator

@HaiShaw HaiShaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BruceXcluding
Also see this error below with your version of pyproject.toml:

  File "/dockerx/1226/HS/sglang/python/sglang/srt/constrained/outlines_backend.py", line 23, in <module>
    from outlines.fsm.json_schema import build_regex_from_schema
ImportError: cannot import name 'build_regex_from_schema' from 'outlines.fsm.json_schema' (/usr/local/lib/python3.12/dist-packages/outlines/fsm/json_schema.py)

@ZJLi2013
Copy link

the CI failure: PR Test / unit-test-backend-2-gpu, used a lite model 'deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct', which doesn't has fp8 block-level quant feature

@BruceXcluding BruceXcluding marked this pull request as ready for review December 27, 2024 05:56
@@ -217,7 +217,7 @@ def create_weights(

# WEIGHT
weight_dtype = (
torch.float8_e4m3fn
torch.float8_e4m3fnuz if is_hip() else torch.float8_e4m3fn
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check how normalize_e4m3fn_to_e4m3fnuz is used.
Basically - we do not expected non-OCP/e4m3fn dtype in the quantized model.

@@ -432,7 +432,7 @@ def create_weights(
from sglang.srt.layers.moe.fused_moe_triton import FusedMoeWeightScaleSupported

if self.quant_config.is_checkpoint_fp8_serialized:
params_dtype = torch.float8_e4m3fn
params_dtype = torch.float8_e4m3fnuz if is_hip() else torch.float8_e4m3fn
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same problem here - check out the previous usage from normalize_e4m3fn_to_e4m3fnuz


def is_hip() -> bool:
"""Return whether it is HIP on the AMD ROCm platform."""
return torch.version.hip is not None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return torch.version.hip is not None
return torch.cuda.is_available() and torch.version.hip

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -27,7 +27,7 @@ srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cu

# HIP (Heterogeneous-computing Interface for Portability) for AMD
# => base docker rocm/vllm-dev:20241022, not from public vllm whl
srt_hip = ["sglang[runtime_common]", "torch", "vllm==0.6.3.dev13"]
srt_hip = ["sglang[runtime_common]", "torch", "vllm==0.6.3.post2.dev1+g1ef171e0.rocm624"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What issues could occur if the image isn't updated? Minimize updating the base image whenever possible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhyncs we (AMD) will have to decide on this, so ignore it for now.

"sgl_kernel.ops.moe_align_block_size",
[
"src/sgl-kernel/csrc/moe_align_kernel.cu",
"src/sgl-kernel/csrc/sgl_kernel_ops.cu",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need to use AMD for compilation, I recommend not compiling sgl_kernel_ops.cu directly. Instead, use a separate file to avoid mixing NVIDIA and AMD's cu files, it's better to keep them separate. cc @HaiShaw @ispobock @merrymercy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any suggestions? @yzh119

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems need to compile reduce kernel here, otherwise some archs will not be imported due to No module named 'sgl_kernel.ops._kernels'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May we use is_hip there

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhyncs In case CUDA/HIP compatible kernel files, we don't use separate files (that is point of HIP), I believe that is one of the cases. We do for sure separate files for AMD specific kernels or kernel implementations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zyeric the else: case seemingly have no impact to NV side, can you be more specific?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's better to separate amd/nv kernels as 2 different backends? at this moment, moe_align_kernel is only required for amd backend, while in near future, there are ck kernels added to amd backend.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HaiShaw I think the root cause is that the import path is still sgl_kernel.ops._kernels at https://github.com/BruceXcluding/sglang/blob/main/sgl-kernel/src/sgl-kernel/ops/__init__.py#L1

Copy link

@zyeric zyeric Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current version works for me, many thanks :D

Accuracy: 0.951
Invalid: 0.000
Latency: 160.916 s
Output throughput: 869.145 token/s

@BruceXcluding BruceXcluding marked this pull request as draft December 31, 2024 01:36
@BruceXcluding
Copy link
Contributor Author

@AdjectiveAllison we are targeted to fix accuracy issue with fp8, do you see garbled output with bf16 as well? We will tune performance with config.json provided soon. Are you using MI308?

No, output on full bf16 works perfectly. I'm on an 8x mi300x machine. 192GB of vram each.

@AdjectiveAllison Can you try with the latest instruction

@@ -578,8 +578,9 @@ def _set_envs_and_config(server_args: ServerArgs):
os.environ["NCCL_NVLS_ENABLE"] = "0"
os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "4"
if "GLOO_SOCKET_IFNAME" not in os.environ:
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"
# TODO(fix socket error with gpu backend)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this commented out?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used for cpu backend or specific workstation? get RuntimeError: [enforce fail at pytorch/third_party/gloo/gloo/transport/tcp/device.cc:83] ifa != nullptr. Unable to find address for: eth0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used for multi-node tensor parallelism. Instead of using comments, we suggest adding an is_hip flag.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the value set for the GLOO_SOCKET_IFNAME environment variable should depend on the name of the network interface card in each user's system and should not be hard-coded as eth0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wufann If the user's value is not eth0, they should specify it explicitly, this applies only when no setting is provided, with eth0 as the default.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhyncs Different network interface ( "ens" ) may be used. Also they may test in a single node envrionment where IB is not configured. In that case IB should be disabled

@@ -0,0 +1,51 @@
cmake_minimum_required(VERSION 3.18)
Copy link
Member

@zhyncs zhyncs Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this, we only use CMakeLists.txt for clangd indexing, so it's not necessary.

build-backend = "setuptools.build_meta"

[project]
name = "sgl-kernel"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we refer to the setup of flash-attention or vllm compatible with NVIDIA and AMD?
https://github.com/Dao-AILab/flash-attention/blob/main/setup.py
https://github.com/vllm-project/vllm/blob/main/setup.py

@zhyncs
Copy link
Member

zhyncs commented Jan 2, 2025

Hi @BruceXcluding @HaiShaw
#2712
You can now try using moe_align_block_size_triton on AMD.

@BruceXcluding
Copy link
Contributor Author

Hi @BruceXcluding @HaiShaw #2712 You can now try using moe_align_block_size_triton on AMD.

Tested and works well. We could build sgl-kernel-amd after we add ck kernels

@HaiShaw
Copy link
Collaborator

HaiShaw commented Jan 2, 2025

Hi @BruceXcluding @HaiShaw #2712 You can now try using moe_align_block_size_triton on AMD.

Tested and works well. We could build sgl-kernel-amd after we add ck kernels

@BruceXcluding, How was the performance comparing to sgl-kernel-amd?

@zhyncs
Copy link
Member

zhyncs commented Jan 2, 2025

Hi @BruceXcluding @HaiShaw
Before releasing v0.4.1.post4 #2713, I hope the main branch has a version compatible with AMD MI300X. What minimal changes are needed to achieve this? The requirement is just to get it running, performance optimization can be done later.

@zhyncs zhyncs marked this pull request as ready for review January 2, 2025 18:42
@HaiShaw
Copy link
Collaborator

HaiShaw commented Jan 2, 2025

@zhyncs I am expecting @BruceXcluding to do the final update.
@BruceXcluding can you confirm the decode_attention.py change?

Copy link
Collaborator

@HaiShaw HaiShaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BruceXcluding thanks!

@HaiShaw HaiShaw dismissed merrymercy’s stale review January 3, 2025 00:23

had been address above

@HaiShaw HaiShaw merged commit c7ae474 into sgl-project:main Jan 3, 2025
15 checks passed
@BruceXcluding
Copy link
Contributor Author

Hi @BruceXcluding @HaiShaw Before releasing v0.4.1.post4 #2713, I hope the main branch has a version compatible with AMD MI300X. What minimal changes are needed to achieve this? The requirement is just to get it running, performance optimization can be done later.

Thanks @zhyncs @HaiShaw. we will keep the TODO list on track for performance improvement.

XiaotongJiang pushed a commit to XiaotongJiang/sglang that referenced this pull request Jan 3, 2025
Co-authored-by: root <[email protected]>
Co-authored-by: HAI <[email protected]>
Co-authored-by: Bruce Xue <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
@yiakwy-xpu-ml-framework-team
Copy link

yiakwy-xpu-ml-framework-team commented Jan 3, 2025

Hi @BruceXcluding @HaiShaw Before releasing v0.4.1.post4 #2713, I hope the main branch has a version compatible with AMD MI300X. What minimal changes are needed to achieve this? The requirement is just to get it running, performance optimization can be done later.

Thanks @zhyncs @HaiShaw. we will keep the TODO list on track for performance improvement.

Yes theoretical throughput is

4800 (memory transaction speed) / 37 * 1.8 (MTP multiplier) ~ 233 tok/gpu/sec, arrond 1868 toks/sec for 8 cards

There are spaces to improve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
amd bug Something isn't working high priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.