Add graph runner support with torch compile on CPU #7843

Merged: zhyncs merged 33 commits into sgl-project:main from CaoE:cpu_compile on Sep 8, 2025

CaoE (Contributor) commented Jul 8, 2025

Motivation

Inspired by mingfeima#73, this PR adds a CPU graph runner built on torch.compile to reduce Python overhead and speed up decoding on CPU.

Profiling with torch.compile disabled:
[image]

Profiling with torch.compile enabled:
[image]

The profiling results above show that torch.compile reduces Python overhead by shrinking the module call stack.

Modifications

  • Fix the mutation errors in kernel registration on CPU.
  • Register fake functions for custom kernels to support torch.compile on CPU.
  • Convert custom parameters to native torch parameters to avoid graph breaks on CPU.
  • Add graph runner support on CPU. Note: the CPU graph must be enabled with --enable-torch-compile.
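
A minimal, torch-free sketch of the capture-and-replay idea behind a graph runner, to make the last bullet concrete. It assumes (this is not taken from the PR's code) that one callable is captured per batch size and that a smaller batch is padded up to the nearest captured size before replay. ToyGraphRunner, capture_sizes, and replay are hypothetical names; in a real runner the stored callable would be the compiled model forward.

```python
from bisect import bisect_left
from typing import Callable, Dict, List

Forward = Callable[[List[float]], List[float]]


class ToyGraphRunner:
    """Hypothetical runner: capture per batch size, replay with padding."""

    def __init__(self, forward: Forward, capture_sizes: List[int]) -> None:
        self.capture_sizes = sorted(capture_sizes)
        # "Capture": a real runner would compile and warm up the model once
        # per batch size here; the toy simply stores the callable.
        self.graphs: Dict[int, Forward] = {bs: forward for bs in self.capture_sizes}

    def can_run(self, batch_size: int) -> bool:
        # Only batches that fit in the largest captured size can be replayed.
        return 0 < batch_size <= self.capture_sizes[-1]

    def replay(self, batch: List[float]) -> List[float]:
        if not self.can_run(len(batch)):
            raise ValueError("batch size not covered by any captured graph")
        # Pick the smallest captured size that fits, pad, run, then unpad.
        size = self.capture_sizes[bisect_left(self.capture_sizes, len(batch))]
        padded = batch + [0.0] * (size - len(batch))
        return self.graphs[size](padded)[: len(batch)]


runner = ToyGraphRunner(lambda xs: [2.0 * x for x in xs], capture_sizes=[1, 2, 4, 8])
result = runner.replay([1.0, 2.0, 3.0])  # padded to size 4 internally
```

Here a batch of 3 is padded to the captured size 4, run through the stored callable, and the padding is sliced off again.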

Performance:

torch.compile enabled vs. disabled:
[image]

Accuracy:

| model id | dtype | MMLU (torch compile enabled) | MMLU (torch compile disabled) |
| --- | --- | --- | --- |
| meituan--DeepSeek-R1-Channel-INT8 | int8 | 0.87 | 0.871 |
| Qwen--Qwen3-14B-FP8 | fp8 | 0.788 | 0.788 |
| meta-llama--Llama-3.2-3B-Instruct | bf16 | 0.607 | 0.607 |
| microsoft--Phi-4-multimodal-instruct | bf16 | 0.689 | 0.688 |

Checklist

gemini-code-assist Bot (Contributor) left a comment

Summary of Changes

Hello @CaoE, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the SGLang runtime by integrating torch.compile support for CPU inference. The primary goal is to minimize Python overhead during model execution on CPU, mirroring the performance benefits previously achieved with CUDA graphs. This involves extensive modifications to enable graph capture, ensure compatibility with custom kernels and parameter types, and streamline distributed operations within a compiled environment, ultimately leading to more efficient CPU model execution.

Highlights

  • CPU Graph Runner Integration: Introduced a generalized GraphRunner (formerly CudaGraphRunner) to enable graph capture and replay for CPU inference, leveraging torch.compile to reduce Python overhead. This involves adapting the existing CUDA graph infrastructure to support CPU-specific optimizations.
  • torch.compile Compatibility for Custom Kernels: Added a new module (cpu_register_fake.py) that registers 'fake' implementations for various custom sgl_kernel CPU operations. This crucial step allows torch.compile to correctly trace the computational graph involving these operations without needing their actual execution during compilation, enabling end-to-end optimization.
  • Enhanced Parameter Handling for torch.compile: Extended ChannelQuantScaleParameter with necessary __tensor_flatten__, __torch_dispatch__, and copy methods. These additions ensure that custom torch.nn.Parameter subclasses are properly handled and optimized by torch.compile's graph tracing mechanism.
  • Distributed Operations and torch.compile Alignment: Adjusted the shm_allreduce operation and registered a fake shm_allgather to ensure seamless compatibility with torch.compile when distributed operations are performed on CPU, preventing tracing errors and enabling optimization of communication primitives.
  • General torch.compile Infrastructure Improvements: Included various minor fixes and enhancements across the codebase, such as adjusting torch._inductor configurations for CPU, improving memory capacity reporting for CPU, and handling context managers during compilation to prevent tracing issues and improve robustness.
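
To illustrate why a fake implementation is enough for tracing, here is a torch-free toy analogue. A fake function computes only output metadata (here, the shape) from input metadata, so a tracer can propagate shapes through a custom op without running the kernel. The registries, the register_fake decorator, and the toy::matmul op below are hypothetical stand-ins, not the PyTorch or sgl_kernel API.

```python
from typing import Callable, Dict, List, Tuple

Shape = Tuple[int, ...]
Matrix = List[List[float]]

# Hypothetical registries: op name -> real kernel, op name -> shape-only fake.
REAL_IMPLS: Dict[str, Callable[..., Matrix]] = {}
FAKE_IMPLS: Dict[str, Callable[..., Shape]] = {}


def register_fake(name: str):
    """Decorator that records a shape-only implementation for op `name`."""
    def decorator(fn: Callable[..., Shape]) -> Callable[..., Shape]:
        FAKE_IMPLS[name] = fn
        return fn
    return decorator


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Stand-in for an expensive custom kernel (nested-list matrices)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


REAL_IMPLS["toy::matmul"] = matmul


@register_fake("toy::matmul")
def matmul_fake(a_shape: Shape, b_shape: Shape) -> Shape:
    # Only metadata is checked and produced; no arithmetic is performed.
    if a_shape[1] != b_shape[0]:
        raise ValueError("inner dimensions must match")
    return (a_shape[0], b_shape[1])


def infer_shape(name: str, *shapes: Shape) -> Shape:
    """What a tracer does at a custom op: ask the fake, never call the kernel."""
    if name not in FAKE_IMPLS:
        raise RuntimeError(f"{name} has no fake implementation; tracing stops here")
    return FAKE_IMPLS[name](*shapes)


traced_shape = infer_shape("toy::matmul", (2, 3), (3, 4))
real_out = REAL_IMPLS["toy::matmul"](
    [[1, 2, 3], [4, 5, 6]],
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
)
```

The shape predicted by the fake matches the shape of the real kernel's output, which is the property a fake implementation has to guarantee.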

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request introduces graph runner support for CPU to reduce Python overhead. I've identified a critical issue in the replay logic that needs to be addressed, along with a few medium-severity issues related to code clarity and potential bugs.

Comment thread python/sglang/srt/model_executor/cuda_graph_runner.py Outdated
Comment thread python/sglang/srt/layers/parameter.py Outdated
Comment thread python/sglang/srt/layers/attention/intel_amx_backend.py Outdated
Comment thread python/sglang/srt/models/deepseek_v2.py Outdated
Comment thread python/sglang/srt/utils.py Outdated
Comment thread sgl-kernel/python/sgl_kernel/cpu_register_fake.py Outdated
@CaoE CaoE force-pushed the cpu_compile branch 6 times, most recently from 8644e3d to e3f52c2 Compare July 29, 2025 09:02
@CaoE CaoE marked this pull request as ready for review July 30, 2025 01:37
@CaoE CaoE force-pushed the cpu_compile branch 3 times, most recently from cd53d18 to d97f691 Compare August 4, 2025 06:04
@mingfeima mingfeima added the intel label Aug 4, 2025
Alcanderian (Collaborator) commented

> Hi @Alcanderian @zhyncs Can this PR be landed since the NPU graph runner has been re-landed? Recently, some AMD-GPU and NPU-related tests have failed, most of which seem unrelated to this PR. Thanks.

Please fix the Xeon CI.

CaoE (Contributor Author) commented Aug 26, 2025

CaoE (Contributor Author) commented Aug 27, 2025

Hi @FlamingoPg @Alcanderian Can this PR be merged?

CaoE (Contributor Author) commented Aug 27, 2025

CaoE (Contributor Author) commented Aug 28, 2025

Hi @zhyncs Could you please help merge this PR? Thank you.

FlamingoPg (Collaborator) commented

> Hi @zhyncs Could you please help merge this PR? Thank you.

@zhyncs LGTM, any other comment?

extend_input_len_per_req: List[int]
extend_logprob_start_len_per_req: List[int]
bid: int
can_run_cuda_graph: bool
Collaborator

Can we keep the old name?

Contributor Author

Renaming can_run_cuda_graph to can_run_graph is only meant to signal that graphs are now also supported on CPU, which reduces confusion; the device determines whether it is a CUDA graph or a CPU graph. Do you think we should keep can_run_cuda_graph everywhere? Are there implications of the rename that we have not considered? Either way is fine for us, and I can change every can_run_graph back to can_run_cuda_graph.

Collaborator

I hope to minimize the impact on downstream forks as much as possible. New hardware changes are usually best made independently, with minimal changes to the existing NVIDIA GPU code.

Contributor Author

Thanks for your comments. I'll change them back.

Contributor Author

Modified.
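
The outcome of this thread (keep the field name can_run_cuda_graph and let the device decide which kind of graph it refers to) can be sketched as follows. ToyBatchResult and graph_kind are hypothetical names used only for illustration; only the bid and can_run_cuda_graph field names come from the diff context above.

```python
from dataclasses import dataclass


@dataclass
class ToyBatchResult:
    """Hypothetical result record that keeps the original field name."""
    bid: int
    # Name kept for compatibility with downstream forks, even when the
    # graph in question runs on CPU.
    can_run_cuda_graph: bool


def graph_kind(device: str, result: ToyBatchResult) -> str:
    """Interpret the flag per device instead of renaming the field."""
    if not result.can_run_cuda_graph:
        return "eager"
    return "cpu graph" if device == "cpu" else "cuda graph"


cpu_kind = graph_kind("cpu", ToyBatchResult(bid=0, can_run_cuda_graph=True))
gpu_kind = graph_kind("cuda", ToyBatchResult(bid=1, can_run_cuda_graph=True))
fallback = graph_kind("cpu", ToyBatchResult(bid=2, can_run_cuda_graph=False))
```

Existing readers of the flag keep working unchanged, and only device-specific code needs to know which graph mechanism is behind it.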

@CaoE CaoE requested a review from zhyncs August 29, 2025 01:24
@zhyncs zhyncs self-assigned this Aug 29, 2025
CaoE (Contributor Author) commented Sep 2, 2025

Hi @zhyncs https://github.com/sgl-project/sglang/actions/runs/17372540433/job/49363628357?pr=7843 doesn't seem to be related to this PR. Can this PR be merged?

CaoE (Contributor Author) commented Sep 4, 2025

Hi @zhyncs Could you please take another look? Thank you.

@zhyncs zhyncs merged commit 7577f0e into sgl-project:main Sep 8, 2025
117 of 127 checks passed
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
0826joyce pushed a commit to 0826joyce/sglang-perf-opt that referenced this pull request May 19, 2026

Labels

cpu cpu backend performance optimization high priority intel

5 participants