Add graph runner support with torch compile on CPU by CaoE · Pull Request #7843 · sgl-project/sglang

CaoE · 2025-07-08T05:48:26Z

Motivation

Inspired by mingfeima#73, this PR adds a CPU graph runner built on torch compile. It reduces Python overhead to speed up decoding on CPU.

Profiling with torch compile disabled:

Profiling with torch compile enabled:

The profiling results above show that torch compile reduces Python overhead by flattening the module call stack.
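To make the call-stack point concrete, here is a minimal sketch (not code from this PR) that uses a custom torch.compile backend to record what gets traced. TinyModel, Block, and recording_backend are hypothetical names introduced only for illustration. The nested module calls should end up in a single flat FX graph, which is the effect the profiles are meant to show.

```python
import torch
from torch import nn

captured_graphs = []


def recording_backend(gm: torch.fx.GraphModule, example_inputs):
    # Custom torch.compile backend: remember the traced graph, then run it unchanged.
    captured_graphs.append(gm)
    return gm.forward


class Block(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.proj(x))


class TinyModel(nn.Module):
    # Hypothetical stand-in for a decoder: a stack of nested module calls.
    def __init__(self, dim: int = 8, depth: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


torch.manual_seed(0)
model = TinyModel().eval()
x = torch.randn(2, 8)

compiled = torch.compile(model, backend=recording_backend)
with torch.no_grad():
    eager_out = model(x)
    compiled_out = compiled(x)

# Count the operations that were flattened into the captured graph.
op_nodes = [
    n for n in captured_graphs[0].graph.nodes if n.op not in ("placeholder", "output")
]
print(f"graphs captured: {len(captured_graphs)}, ops in first graph: {len(op_nodes)}")
```

The recording backend executes the graph as-is, so this sketch shows only the tracing step, not the speedup measured in the PR.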

Modifications

Fix the mutation errors in kernel registration on CPU.
Register fake functions for custom kernels to support torch.compile on CPU.
Convert custom parameters to native parameters of torch to avoid graph break on CPU.
Add graph runner support on CPU. Note: the CPU graph must be enabled with --enable-torch-compile.

Performance:

Torch compile enabled vs. disabled:

Accuracy:

model id	dtype	mmlu (torch compile enabled)	mmlu (torch compile disabled)
meituan--DeepSeek-R1-Channel-INT8	int8	0.87	0.871
Qwen--Qwen3-14B-FP8	fp8	0.788	0.788
meta-llama--Llama-3.2-3B-Instruct	bf16	0.607	0.607
microsoft--Phi-4-multimodal-instruct	bf16	0.689	0.688
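
The table can be checked mechanically. This short sketch copies the four rows verbatim and computes the per-model MMLU difference between the two columns; the variable names are illustrative only.

```python
# (model id, dtype, MMLU with torch compile enabled, MMLU with torch compile disabled)
rows = [
    ("meituan--DeepSeek-R1-Channel-INT8", "int8", 0.87, 0.871),
    ("Qwen--Qwen3-14B-FP8", "fp8", 0.788, 0.788),
    ("meta-llama--Llama-3.2-3B-Instruct", "bf16", 0.607, 0.607),
    ("microsoft--Phi-4-multimodal-instruct", "bf16", 0.689, 0.688),
]

# Round to three decimals, the precision reported in the table.
deltas = {
    model: round(enabled - disabled, 3) for model, _dtype, enabled, disabled in rows
}
max_abs_delta = max(abs(delta) for delta in deltas.values())

for model, delta in deltas.items():
    print(f"{model}: {delta:+.3f}")
print(f"largest absolute difference: {max_abs_delta}")  # 0.001
```

The largest gap between the two settings is 0.001 MMLU, and it goes in opposite directions for the two models where it appears.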

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.

gemini-code-assist

Summary of Changes

Hello @CaoE, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the SGLang runtime by integrating torch.compile support for CPU inference. The primary goal is to minimize Python overhead during model execution on CPU, mirroring the performance benefits previously achieved with CUDA graphs. This involves extensive modifications to enable graph capture, ensure compatibility with custom kernels and parameter types, and streamline distributed operations within a compiled environment, ultimately leading to more efficient CPU model execution.

Highlights

CPU Graph Runner Integration: Introduced a generalized GraphRunner (formerly CudaGraphRunner) to enable graph capture and replay for CPU inference, leveraging torch.compile to reduce Python overhead. This involves adapting the existing CUDA graph infrastructure to support CPU-specific optimizations.
torch.compile Compatibility for Custom Kernels: Added a new module (cpu_register_fake.py) that registers 'fake' implementations for various custom sgl_kernel CPU operations. This crucial step allows torch.compile to correctly trace the computational graph involving these operations without needing their actual execution during compilation, enabling end-to-end optimization.
Enhanced Parameter Handling for torch.compile: Extended ChannelQuantScaleParameter with necessary __tensor_flatten__, __torch_dispatch__, and copy methods. These additions ensure that custom torch.nn.Parameter subclasses are properly handled and optimized by torch.compile's graph tracing mechanism.
Distributed Operations and torch.compile Alignment: Adjusted the shm_allreduce operation and registered a fake shm_allgather to ensure seamless compatibility with torch.compile when distributed operations are performed on CPU, preventing tracing errors and enabling optimization of communication primitives.
General torch.compile Infrastructure Improvements: Included various minor fixes and enhancements across the codebase, such as adjusting torch._inductor configurations for CPU, improving memory capacity reporting for CPU, and handling context managers during compilation to prevent tracing issues and improve robustness.
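
The parameter-handling point can be sketched as follows. This is a hedged illustration of one way to turn torch.nn.Parameter subclasses into plain parameters before compilation, not the PR's actual code. TaggedParameter and to_native_parameters are hypothetical names, and the PR's ChannelQuantScaleParameter is not reproduced here.

```python
import torch
from torch import nn


class TaggedParameter(nn.Parameter):
    """Hypothetical Parameter subclass standing in for a custom quantization parameter."""


def to_native_parameters(module: nn.Module) -> int:
    """Replace every Parameter subclass in `module` with a plain nn.Parameter.

    Returns the number of parameters that were converted.
    """
    converted = 0
    for submodule in module.modules():
        # Snapshot the items first because the loop reassigns attributes.
        for name, param in list(submodule.named_parameters(recurse=False)):
            if type(param) is not nn.Parameter:
                # View the storage as a plain tensor; no data is copied.
                plain = param.data.as_subclass(torch.Tensor)
                setattr(
                    submodule,
                    name,
                    nn.Parameter(plain, requires_grad=param.requires_grad),
                )
                converted += 1
    return converted


layer = nn.Linear(4, 4)
original_weight = layer.weight.data.clone()
layer.weight = TaggedParameter(original_weight.clone(), requires_grad=False)

num_converted = to_native_parameters(layer)
print(num_converted, type(layer.weight).__name__)
```

The conversion keeps the values and the requires_grad flag, but any extra attributes stored on the subclass instance are dropped, so it only suits attributes that are no longer needed after weight loading.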

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces graph runner support for CPU to reduce Python overhead. I've identified a critical issue in the replay logic that needs to be addressed, along with a few medium-severity issues related to code clarity and potential bugs.

Alcanderian · 2025-08-26T03:41:48Z

> Hi @Alcanderian @zhyncs Can this PR be landed since the NPU graph runner has been re-landed? Recently, some AMD-GPU and NPU-related tests have failed, most of which seem unrelated to this PR. Thanks.

please fix xeon CI

CaoE · 2025-08-26T05:24:32Z

> please fix xeon CI

Fixed https://github.com/sgl-project/sglang/actions/runs/17227399712/job/48876607152?pr=7843.

CaoE · 2025-08-27T01:09:39Z

Hi @FlamingoPg @Alcanderian Can this PR be merged?

CaoE · 2025-08-27T04:50:03Z

@FlamingoPg @Alcanderian https://github.com/sgl-project/sglang/actions/runs/17231906249/job/48964007907?pr=7843 doesn't seem to be related to this PR.

CaoE · 2025-08-28T00:41:52Z

Hi @zhyncs Could you please help merge this PR? Thank you.

FlamingoPg · 2025-08-28T14:23:30Z

> Hi @zhyncs Could you please help merge this PR? Thank you.

@zhyncs LGTM, any other comment?

zhyncs · 2025-08-28T20:46:10Z

    extend_input_len_per_req: List[int]
    extend_logprob_start_len_per_req: List[int]
    bid: int
-    can_run_cuda_graph: bool


zhyncs: can we keep the old name

CaoE: Changing can_run_cuda_graph to can_run_graph simply indicates that graphs are also supported on CPU, which reduces confusion; the device determines whether it is a CUDA graph or a CPU graph. Do you think we should keep can_run_cuda_graph everywhere? Does the rename have implications we have not considered? Either way is fine for us, and I can change every can_run_graph back to can_run_cuda_graph.

zhyncs: I hope to minimize the impact on downstream forks as much as possible. New hardware changes are usually best made independently, with minimal changes to existing NVIDIA GPUs.

CaoE: Thanks for your comments. I'll change them back.

CaoE · 2025-09-02T05:35:40Z

Hi @zhyncs https://github.com/sgl-project/sglang/actions/runs/17372540433/job/49363628357?pr=7843 doesn't seem to be related to this PR. Can this PR be merged?

CaoE · 2025-09-04T13:54:29Z

Hi @zhyncs Could you please take another look? Thank you.

gemini-code-assist Bot reviewed Jul 8, 2025

CaoE force-pushed the cpu_compile branch from e815e7d to bafc35e Compare July 8, 2025 08:01

mingfeima mentioned this pull request Jul 23, 2025

[Roadmap] CPU Backend Optimization (2025 H2) #8281

Closed

2 tasks

CaoE force-pushed the cpu_compile branch 6 times, most recently from 8644e3d to e3f52c2 Compare July 29, 2025 09:02

CaoE marked this pull request as ready for review July 30, 2025 01:37

CaoE requested review from BBuf, FlamingoPg, HaiShaw, HandH1998, Ying1123, ch-wan, hnyls2002, ispobock, kushanam, merrymercy, yizhang2077 and zhyncs as code owners July 30, 2025 01:37

CaoE requested review from kssteven418, rkooo567 and xiezhq-hermann as code owners August 1, 2025 02:00

CaoE force-pushed the cpu_compile branch 3 times, most recently from cd53d18 to d97f691 Compare August 4, 2025 06:04

mingfeima added the intel label Aug 4, 2025

CaoE added 5 commits August 25, 2025 14:32

modify timeout-minutes

159a103

Merge branch 'main' into cpu_compile

3a52dcc

fix merge main

d6109c5

Merge branch 'main' into cpu_compile

1c67857

Merge branch 'main' into cpu_compile

d04bdf2

FlamingoPg approved these changes Aug 26, 2025

CaoE added 3 commits August 26, 2025 11:54

fix test_cpu_graph

a4066c4

reduce bs to shorten test time

9dedc87

Merge branch 'main' into cpu_compile

fc4118a

zhyncs reviewed Aug 28, 2025

CaoE requested a review from zhyncs August 29, 2025 01:24

zhyncs added the high priority label Aug 29, 2025

zhyncs self-assigned this Aug 29, 2025

CaoE and others added 5 commits August 29, 2025 09:59

change can_run_graph back to can_run_cuda_graph

b99506f

Merge branch 'main' into cpu_compile

3f2f32a

Merge branch 'main' into cpu_compile

fea6858

Merge branch 'main' into cpu_compile

8f02f28

Merge branch 'main' into cpu_compile

fdfa7f9

zhyncs merged commit 7577f0e into sgl-project:main Sep 8, 2025
117 of 127 checks passed

MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025

Add graph runner support with torch compile on CPU (sgl-project#7843)

8144838

lifuhuang pushed a commit that referenced this pull request Sep 10, 2025

Add graph runner support with torch compile on CPU (#7843)

7bb3dee

0826joyce pushed a commit to 0826joyce/sglang-perf-opt that referenced this pull request May 19, 2026

Add graph runner support with torch compile on CPU (sgl-project#7843)

f28adf3
