Binary file added _static/images/sglang.png
21 changes: 20 additions & 1 deletion index.rst
@@ -39,6 +39,7 @@
sources/lm_deploy/index.rst
sources/torchchat/index.rst
sources/torchtitan/index.rst
sources/sglang/index.rst


Select your preferences and follow the installation instructions in :doc:`Quick Ascend Environment Setup <sources/ascend/quick_install>`.
@@ -392,6 +393,24 @@
<span class="split">|</span>
<a href="sources/torchtitan/quick_start.html">快速上手</a>
</div>
</div>
</div>
<!-- Card 20 -->
<div class="box rounded-lg p-4 flex flex-col items-center">
<div class="flex items-center mb-4">
<div class="img w-16 h-16 rounded-md mr-4" style="background-image: url('_static/images/sglang.png')"></div>
<div>
<h2 class="text-lg font-semibold">SGLang</h2>
<p class="text-gray-600 desc">用于LLM和VLM的高速服务框架</p>
</div>
</div>
<div class="flex-grow"></div>
<div class="flex space-x-4 text-blue-600">
<a href="https://github.com/sgl-project/sglang">官方链接</a>
<span class="split">|</span>
<a href="sources/sglang/install.html">安装指南</a>
<span class="split">|</span>
<a href="sources/sglang/quick_start.html">快速上手</a>
</div>
</div>
</div>
</div>
8 changes: 8 additions & 0 deletions sources/sglang/index.rst
@@ -0,0 +1,8 @@
SGLang
============

.. toctree::
:maxdepth: 2

install.rst
quick_start.rst
193 changes: 193 additions & 0 deletions sources/sglang/install.rst
@@ -0,0 +1,193 @@
Installation Guide
==================

This tutorial is intended for developers using SGLang with Ascend and walks through installing SGLang in an Ascend environment. As of September 2025, the components listed below are under active development, so it is recommended to use the latest versions and to pay attention to version and device compatibility.

Installing the Ascend Environment
----------------------------------

Follow the :doc:`Quick Ascend Environment Setup guide <../ascend/quick_install>` to install the Ascend environment according to your Ascend product model and CPU architecture.

.. warning::
    CANN 8.2.RC1 or later is recommended. When installing CANN, also install the Kernel operator package and the nnal acceleration library package for the ARM platform.
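
After installing CANN, you can optionally confirm that the driver and the toolkit/nnal environments are usable. The following is a minimal sketch that assumes the default installation prefix /usr/local/Ascend; adjust the paths if you installed elsewhere.

.. code-block:: shell
    :linenos:

    # Check that the NPU driver and devices are visible
    npu-smi info

    # Load the CANN toolkit and nnal (ATB) environments
    # (paths assume the default /usr/local/Ascend prefix)
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    source /usr/local/Ascend/nnal/atb/set_env.sh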


Installing SGLang
-----------------

Method 1: Install SGLang from Source
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Creating a Python Environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: shell
:linenos:

# Create a new conda environment, and only python 3.11 is supported
conda create --name sglang_npu python=3.11
# Activate the virtual environment
conda activate sglang_npu

Installing Python Dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: shell
:linenos:

pip install attrs==24.2.0 numpy==1.26.4 scipy==1.13.1 decorator==5.1.1 psutil==6.0.0 pytest==8.3.2 pytest-xdist==3.6.1 pyyaml


Installing the MemFabric Adaptor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The MemFabric Adaptor is an alternative to the Mooncake Transfer Engine for transferring the KV cache on Ascend NPU clusters.


Currently, the MemFabric Adaptor only supports devices with the aarch64 architecture. Install it according to your actual architecture:

.. code-block:: shell
:linenos:

MF_WHL_NAME="mf_adapter-1.0.0-cp311-cp311-linux_aarch64.whl"
MEMFABRIC_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/${MF_WHL_NAME}"
wget -O "${MF_WHL_NAME}" "${MEMFABRIC_URL}" && pip install "./${MF_WHL_NAME}"


Installing torch-npu
^^^^^^^^^^^^^^^^^^^^^

Install torch-npu following the :doc:`torch-npu installation guide <../pytorch/install>`. Due to limitations of NPUGraph and Triton-Ascend, this project currently only supports torch and torch-npu 2.6.0; a more general version scheme will be provided later.

.. code-block:: shell
:linenos:

# Install torch 2.6.0 and torchvision 0.21.0 on CPU only
PYTORCH_VERSION=2.6.0
TORCHVISION_VERSION=0.21.0
pip install torch==$PYTORCH_VERSION torchvision==$TORCHVISION_VERSION --index-url https://download.pytorch.org/whl/cpu

# Install torch_npu 2.6.0 or you can just pip install torch_npu==2.6.0
PTA_VERSION="v7.1.0.2-pytorch2.6.0"
PTA_NAME="torch_npu-2.6.0.post2-cp311-cp311-manylinux_2_28_aarch64.whl"
PTA_URL="https://gitcode.com/ascend/pytorch/releases/download/${PTA_VERSION}/${PTA_NAME}"
wget -O "${PTA_NAME}" "${PTA_URL}" && pip install "./${PTA_NAME}"

After the installation completes, you can verify that torch_npu was installed successfully with the following code:

.. code-block:: python
:linenos:

import torch
# import torch_npu  # in torch 2.6.0, there is no need to import torch_npu explicitly

x = torch.randn(2, 2).npu()
y = torch.randn(2, 2).npu()
z = x.mm(y)

print(z)

If the program successfully prints the value of matrix z, the installation was successful.

Installing vLLM
^^^^^^^^^^^^^^^^

vLLM is currently still a major prerequisite on Ascend NPUs. With torch==2.6.0, vLLM v0.8.5 needs to be compiled and installed from source.

.. code-block:: shell
:linenos:

VLLM_TAG=v0.8.5
git clone --depth 1 https://github.com/vllm-project/vllm.git --branch $VLLM_TAG
cd vllm
VLLM_TARGET_DEVICE="empty" pip install -v -e .
cd ..
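
Optionally, confirm that the vLLM build is importable. A minimal sketch; the printed version should correspond to the v0.8.5 tag checked out above.

.. code-block:: shell
    :linenos:

    # Verify the editable vLLM installation
    python -c "import vllm; print(vllm.__version__)"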

Installing Triton-Ascend
^^^^^^^^^^^^^^^^^^^^^^^^^

Triton-Ascend is still updated frequently. To use the latest features, it is recommended to clone the repository and install from source; see the `installation guide <https://gitcode.com/Ascend/triton-ascend/blob/master/docs/sources/getting-started/installation.md>`_ for detailed steps.

Alternatively, install the Triton-Ascend nightly package:

.. code-block:: shell
:linenos:

pip install -i https://test.pypi.org/simple/ "triton-ascend<3.2.0rc" --pre --no-cache-dir
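
Optionally, confirm the installation. This minimal sketch assumes that Triton-Ascend exposes the standard triton Python module:

.. code-block:: shell
    :linenos:

    # Verify the Triton-Ascend installation (assumes it installs the triton module)
    python -c "import triton; print(triton.__version__)"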


Installing deep-ep and sgl-kernel-npu
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: shell
:linenos:

pip install wheel==0.45.1
git clone https://github.com/sgl-project/sgl-kernel-npu.git

# Add environment variables
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/runtime/lib64/stub:$LD_LIBRARY_PATH
source /usr/local/Ascend/ascend-toolkit/set_env.sh
cd sgl-kernel-npu

# Compile and install deep-ep, sgl-kernel-npu
bash build.sh
pip install output/deep_ep*.whl output/sgl_kernel_npu*.whl --no-cache-dir
cd ..
rm -rf sgl-kernel-npu

# Link to the deep_ep_cpp.*.so file
cd "$(pip show deep-ep | grep -E '^Location:' | awk '{print $2}')" && ln -s deep_ep/deep_ep_cpp*.so


Installing SGLang from Source
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: shell
:linenos:

# Use the latest release branch
git clone -b v0.5.3rc0 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
# Install SGLang with NPU support
pip install -e python[srt_npu]
cd ..
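
Optionally, confirm the editable installation. A minimal sketch:

.. code-block:: shell
    :linenos:

    # Verify that the sglang package resolves and report its version
    python -c "import sglang; print(sglang.__version__)"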



Method 2: Install SGLang Using a Docker Image
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note: --privileged and --network=host are required for RDMA, and RDMA is usually also a prerequisite for Ascend NPU clusters.

The following Docker commands are based on the Atlas 800I A3 machine type. If you are using an Atlas 800I A2 machine, make sure that only davinci [0-7] are mapped into the container (see the sketch after the code block below).

.. code-block:: shell
:linenos:

# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
docker build -t <image_name> -f Dockerfile.npu .

alias drun='docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
--device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
--device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
--device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \
--device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
--device=/dev/davinci_manager --device=/dev/hisi_hdc \
--volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
--volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--volume /etc/ascend_install.info:/etc/ascend_install.info \
--volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/'

# Run the docker container and start the SGLang server
drun --env "HF_TOKEN=<secret>" \
<image_name> \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend --host 0.0.0.0 --port 30000
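
For an Atlas 800I A2 machine, reduce the device mappings in the alias above to davinci 0-7; a minimal sketch with the remaining options unchanged:

.. code-block:: shell
    :linenos:

    # Atlas 800I A2: map only davinci 0-7, keeping the other options from the alias above
    alias drun='docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
        --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
        --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
        --device=/dev/davinci_manager --device=/dev/hisi_hdc \
        --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
        --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
        --volume /etc/ascend_install.info:/etc/ascend_install.info \
        --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/'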

104 changes: 104 additions & 0 deletions sources/sglang/quick_start.rst
@@ -0,0 +1,104 @@
Quick Start
===========

.. note::

    Before reading this page, make sure you have prepared the Ascend environment and SGLang by following the :doc:`installation guide <./install>`!

This tutorial helps Ascend developers quickly get started with LLM inference serving using SGLang on Ascend. See the `official documentation <https://docs.sglang.ai/>`_ for more information.

Overview
--------

SGLang is a fast serving framework for LLMs and VLMs. By co-designing the backend runtime and the frontend language, it makes interaction with models faster and more controllable.

Launching a Server with SGLang
------------------------------

The following example shows how to launch a simple conversational generation service with SGLang:

Start a server:

.. code-block:: shell
:linenos:

# Launch the SGLang server on NPU
python -m sglang.launch_server --model Qwen/Qwen2.5-0.5B-Instruct \
--device npu --port 8000 --attention-backend ascend \
--host 0.0.0.0 --trust-remote-code

After a successful launch, you will see log output similar to the following:

.. code-block:: shell
:linenos:

INFO: Started server process [89394]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:40106 - "GET /get_model_info HTTP/1.1" 200 OK
Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
INFO: 127.0.0.1:40108 - "POST /generate HTTP/1.1" 200 OK
The server is fired up and ready to roll!

Test it with curl:

.. code-block:: shell
:linenos:

curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen/qwen2.5-0.5b-instruct",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'

You will see a response similar to the following:

.. code-block:: shell
:linenos:

{"id":"3f2f1aa779b544c19f01c08b803bf4ef","object":"chat.completion","created":1759136880,"model":"qwen/qwen2.5-0.5b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The capital of France is Paris.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":36,"total_tokens":44,"completion_tokens":8,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}

Verifying Inference with SGLang
-------------------------------

The following code shows how to verify inference with SGLang:

.. code-block:: python
:linenos:

# example.py
import torch

import sglang as sgl

def main():

prompts = [
"Hello, my name is",
"The Independence Day of the United States is",
"The capital of Germany is",
"The full form of AI is",
] * 1

llm = sgl.Engine(model_path="/Qwen2.5/Qwen2.5-0.5B-Instruct", device="npu", attention_backend="ascend")

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 100}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print("===============================")
print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

if __name__ == '__main__':
main()

Run example.py and check that it produces output to verify that SGLang was installed successfully.