
Conversation

@bjtuwjx commented Sep 2, 2025

This RFC proposes decoupling the CUDA-related code from the PyTorch main codebase and refactoring it into an independent, modularized directory hierarchy, supported by a build optimization toolkit. Specifically, the proposal covers the following work:

- Decouple CUDA-related code from the main codebase at both the inter-file and intra-file levels, reducing the PyTorch core framework's direct dependency on CUDA (see the sketch below).
- Propose a modularized, standardized directory hierarchy and consolidate all CUDA-related code within it, as a reference for other third-party backend integrations.
- Redesign the build system to support standalone compilation of the CUDA backend, and develop a wrapped CMake toolkit to support and streamline the build process.

Click here for a preview of this RFC.
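
To make the intra-file decoupling concrete, here is a minimal hypothetical sketch; it is not taken from the RFC or from PyTorch itself, and `DeviceCopyHooks`, `CpuCopyHooks`, and `copy_bytes` are invented names:

```cpp
// Hypothetical before/after illustration of intra-file decoupling.
// Before: CPU and CUDA logic interleaved in one core source file.
//
//   void copy_bytes(void* dst, const void* src, size_t n, bool on_gpu) {
//   #ifdef USE_CUDA
//     if (on_gpu) { cudaMemcpy(dst, src, n, cudaMemcpyDefault); return; }
//   #endif
//     memcpy(dst, src, n);
//   }
//
// After: the core defines a backend-neutral interface and a CPU fallback;
// the CUDA implementation moves into the separate torch_cuda component.
#include <cstddef>
#include <cstring>

struct DeviceCopyHooks {
  virtual ~DeviceCopyHooks() = default;
  virtual void copy(void* dst, const void* src, std::size_t n) = 0;
};

// Core-side CPU implementation: compiles with no CUDA headers at all.
struct CpuCopyHooks final : DeviceCopyHooks {
  void copy(void* dst, const void* src, std::size_t n) override {
    std::memcpy(dst, src, n);
  }
};
```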


meta-cla bot commented Sep 2, 2025

Hi @bjtuwjx!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient, and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


meta-cla bot commented Sep 2, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

> ## Motivation
>
> For a long time, NVIDIA GPUs and the CUDA architecture have dominated the PyTorch ecosystem. However, as an increasing number of vendors introduce their own high-performance AI chips, the current ecosystem is revealing the following key issues:
>
> - *Code coupling*. The CUDA code is too tightly coupled with the PyTorch codebase, resulting in poor modularity and high maintenance costs.
@albanD (Contributor) commented:

Can you give some details on this? What is the more concrete impact on modularity? And how would this reduce maintenance costs?

@bjtuwjx (Author) replied:

@albanD
Thank you for your question and sorry for the late reply. The current interleaving of CPU and CUDA code within the same source files or same directories creates a significant maintenance burden and limits modularity. The impact of the proposed decoupling is concrete:

- Enhanced Modularity:
  - Physical Separation of Concerns: today, a developer working on a CPU kernel must navigate around `#ifdef USE_CUDA` guards and CUDA-specific code; conversely, a CUDA developer risks breaking CPU code. Refactoring into a separate torch_cuda component creates a strict physical and architectural boundary, allowing both sides to work independently against well-defined interfaces, a fundamental principle of clean software architecture.
  - Simplified Dependency Graph: the core PyTorch library (libtorch.so) would no longer have a direct compile-time dependency on CUDA headers or libraries. This simplifies the build process for CPU-only versions and reduces unexpected breakage from CUDA-related changes (a hedged registration sketch follows this list).
- Reduced Maintenance Cost:
  - Dedicated Testing and CI: the torch_cuda module could have its own dedicated CI pipeline. A failure in a CUDA kernel would be immediately isolated to this component, speeding up triage and resolution, and CPU-side changes could be tested independently before integration with CUDA.
  - Lower Cognitive Load: new contributors and maintainers would face a much simpler onboarding process; they wouldn't need to understand the entire monolithic codebase to contribute to a specific part (either core CPU ops or GPU ops).
  - Easier Version Management: upgrading to or testing against a new CUDA toolkit version becomes an isolated task for the torch_cuda component, without rebuilding or retesting the entire PyTorch codebase.
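
As a hedged sketch of the "Simplified Dependency Graph" point: the core can hold a registry that a separately built libtorch_cuda.so populates from a static initializer when it is loaded, so the core never links against CUDA at compile time. `HooksRegistry`, `DeviceCopyHooks`, and `CudaCopyHooks` are invented names, not PyTorch's actual hooks machinery:

```cpp
// Hypothetical sketch: load-time self-registration, so the core library
// carries no compile-time CUDA dependency. All names are invented.
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

struct DeviceCopyHooks {
  virtual ~DeviceCopyHooks() = default;
  virtual void copy(void* dst, const void* src, std::size_t n) = 0;
};

// Core side: maps a backend name to a factory for its hooks.
class HooksRegistry {
 public:
  static HooksRegistry& instance() {
    static HooksRegistry r;
    return r;
  }
  void add(const std::string& backend,
           std::function<std::unique_ptr<DeviceCopyHooks>()> factory) {
    factories_[backend] = std::move(factory);
  }

 private:
  std::unordered_map<std::string,
                     std::function<std::unique_ptr<DeviceCopyHooks>()>>
      factories_;
};

// torch_cuda side (lives only in the decoupled component): a static
// initializer runs when the shared library is loaded and registers the
// CUDA hooks, e.g.:
//
//   static const bool registered = [] {
//     HooksRegistry::instance().add("cuda", [] {
//       return std::unique_ptr<DeviceCopyHooks>(new CudaCopyHooks());
//     });
//     return true;
//   }();
```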

@albanD (Contributor) commented:

I'm a little confused by these statements.
The kernel code is already separated between CPU and CUDA, with the appropriate shared logic available via our C++ APIs.
libtorch.so already does not depend on CUDA in a CPU-only build. For CUDA builds it does, mostly for historical reasons. We could remove that dependency, but I'm not sure what the benefit would be, and it would add many new indirection layers to handle the necessary CUDA APIs in non-CUDA builds.
We already have dedicated testing and CI for CUDA.
I'm not sure how that would help testing; our test suite is not cleanly separated so as to test each component in isolation. And because most of our features are about using many of these components together, we can't really do that.


> For a long time, NVIDIA GPUs and the CUDA architecture have dominated the PyTorch ecosystem. However, as an increasing number of vendors introduce their own high-performance AI chips, the current ecosystem is revealing the following key issues:
>
> - *Code coupling*. The CUDA code is too tightly coupled with the PyTorch codebase, resulting in poor modularity and high maintenance costs.
> - *Integration effort*. Currently, different hardware backends may adopt varying integration methods into PyTorch. The integration approaches and code lack standardization and consistency, leading to a significant amount of repetitive code and substantial integration effort.
@albanD (Contributor) commented:

We have done a lot of work on the PrivateUse1 backend extension point. In particular, we have added all the asked-for extension points, built OpenReg as an in-tree testing backend for this extension, added autoload, are actively working on more documentation, and have updated many tools to use the accelerator API to enable a smooth transition for the end user.
What would the proposed RFC provide on top of this? And why would we prefer investing in this refactor rather than continuing to improve the PrivateUse1-related extension points?

@bjtuwjx (Author) replied:

@albanD
We sincerely appreciate the monumental effort put into the PrivateUse1 backend and related infrastructure like OpenReg, which provides invaluable in-tree testing capabilities. Our proposal primarily addresses a critical shortcoming of the current ecosystem: the unsustainable practice of vendors reusing the CUDA dispatch key.
We observe that many vendors (e.g., Kunlunxin XPU, MetaX MACA) are forced to choose between two suboptimal paths:

1. Reuse the CUDA key. They reuse the CUDA dispatch key and its code logic for ecosystem compatibility and to avoid the immense cost of rewriting thousands of kernels from scratch. However, this requires highly invasive modifications to the deeply entangled CUDA code within the PyTorch core to optimize for their hardware. This creates a maintenance nightmare for both the vendor and the PyTorch project, as it breaks modularity and makes the codebase increasingly unstable and difficult to work with.
2. Use PrivateUse1. This is the architecturally clean method, but it forfeits any ability to reuse existing CUDA kernel implementations, presenting a massive development barrier.

Our RFC does not seek to replace PrivateUse1. Instead, it aims to eliminate the primary reason that forces vendors to choose the first, invasive option, making PrivateUse1 the unequivocally best choice for everyone by solving the code-reuse problem.
Here is what the RFC provides:

- Solves the Kernel Reuse Problem for CUDA-like Hardware: by refactoring CUDA into a standalone torch_cuda component, we create a clean, reusable library of CUDA kernels and infrastructure. This allows vendors with CUDA-like hardware to fork, adapt, and build upon this component outside of the core PyTorch repository (see the registration sketch at the end of this comment).
- Protects the Integrity of the Core Project: this approach forcefully ends the practice of vendors making invasive modifications to the core PyTorch codebase. It draws a clear architectural boundary: the core PyTorch repository manages interfaces and CPU kernels, while hardware-specific implementations are maintained externally. This drastically reduces the maintenance burden and complexity for PyTorch core maintainers.
- Strengthens the PrivateUse1 Ecosystem: by providing a clear migration path and a reusable component, our proposal transforms PrivateUse1 from a "clean-slate-only" option into the recommended, versatile integration point for all third-party backends. It empowers vendors to choose their starting point (from scratch or from the decoupled CUDA code) while using the same, official PrivateUse1 interface.

Why invest in this refactor? Continuing to improve the PrivateUse1 hooks is essential, but it does not address the architectural debt that encourages vendors to misuse the CUDA key. This refactor attacks that problem directly. It is a strategic investment to stop the influx of invasive, third-party patches into the core codebase and to provide a scalable, maintainable model for the entire ecosystem. It ensures that the future growth of PyTorch is built on a solid architectural foundation, not on an increasingly tangled monolith.
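
As a minimal sketch of this migration path: a vendor backend built from the decoupled component could register its adapted kernels under PrivateUse1 with PyTorch's existing C++ registration API (`TORCH_LIBRARY_IMPL`). `my_vendor_add` and its placeholder body are hypothetical:

```cpp
// Sketch: an out-of-tree backend registering an adapted kernel under the
// PrivateUse1 dispatch key via PyTorch's C++ registration API.
#include <ATen/ATen.h>
#include <torch/library.h>

// Hypothetical kernel; in practice it would call code adapted from the
// decoupled torch_cuda component for the vendor's CUDA-like hardware.
at::Tensor my_vendor_add(const at::Tensor& self,
                         const at::Tensor& other,
                         const at::Scalar& alpha) {
  return at::empty_like(self);  // placeholder body for illustration
}

// Bind the kernel to aten::add.Tensor for the PrivateUse1 backend.
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl("add.Tensor", &my_vendor_add);
}
```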

> For a long time, NVIDIA GPUs and the CUDA architecture have dominated the PyTorch ecosystem. However, as an increasing number of vendors introduce their own high-performance AI chips, the current ecosystem is revealing the following key issues:
>
> - *Code coupling*. The CUDA code is too tightly coupled with the PyTorch codebase, resulting in poor modularity and high maintenance costs.
> - *Integration effort*. Currently, different hardware backends may adopt varying integration methods into PyTorch. The integration approaches and code lack standardization and consistency, leading to a significant amount of repetitive code and substantial integration effort.
> - *Code migration*. Due to the lack of an integration code specification, different hardware backends provide APIs with varying names and styles, resulting in high code migration costs for PyTorch users.
@albanD (Contributor) commented:

We are building the accelerator API to address these particular points (relatively independently of how the backend itself is implemented, in or out of tree).
How would the proposal here help, compared to continuing to extend the accelerator API?

@bjtuwjx (Author) replied:

@albanD
The Accelerator API is a critical and welcome user-facing abstraction for writing device-agnostic code. Our proposal is an implementer-facing architectural change that would synergize powerfully with it.

- Different Layers of the Stack: the Accelerator API solves the problem of "how do users write code for any accelerator?". Our RFC solves the problem of "how do implementers structure the code for each accelerator in a maintainable way?".
- How the Proposal Helps the Accelerator API (a hypothetical layering sketch follows this list):
  - Clean Foundation for Implementations: the Accelerator API needs to be implemented by each backend. A cleanly decoupled backend (like the proposed torch_cuda) is far easier to adapt to a new abstraction layer such as the Accelerator API, and the clear boundaries simplify the implementation of dispatcher logic and kernel registration.
  - Architectural Alignment: it ensures that the internal architecture reflects the elegance of the user-facing abstraction. Messy, coupled internals make evolving a clean API more difficult; clean internals make API evolution simpler and more robust.
  - Synergy: continuing to extend the Accelerator API on top of the current entangled codebase adds complexity to an already complex foundation, while refactoring the foundation simplifies the task of extending the API. This refactor is a force multiplier for the Accelerator API effort, ensuring its implementation is built on a solid, maintainable base.
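
To illustrate the layering claim, here is a hedged, fully hypothetical sketch; `AcceleratorBackend`, `active_backend`, and `accelerator_synchronize` are invented names, not the actual Accelerator API:

```cpp
// Hypothetical layering sketch (all names invented): a user-facing
// accelerator layer delegates to whichever backend implementation is
// active, so a cleanly decoupled backend plugs in behind one boundary.
#include <memory>
#include <stdexcept>

struct AcceleratorBackend {
  virtual ~AcceleratorBackend() = default;
  virtual int device_count() const = 0;
  virtual void synchronize(int device) = 0;
};

// Set once at load time by the active backend (torch_cuda, a vendor fork
// of the decoupled component, or any PrivateUse1 backend).
std::unique_ptr<AcceleratorBackend>& active_backend() {
  static std::unique_ptr<AcceleratorBackend> backend;
  return backend;
}

// Device-agnostic front end: user-facing code never names a backend.
void accelerator_synchronize(int device) {
  if (!active_backend()) {
    throw std::runtime_error("no accelerator backend registered");
  }
  active_backend()->synchronize(device);
}
```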
