
Conversation

@bjtuwjx commented Sep 2, 2025

This RFC proposes decoupling the CUDA-related code from the PyTorch main codebase and refactoring it into an independent, modularized directory hierarchy, supported by a build optimization toolkit. Specifically, the proposal covers the following work:

- Decouple CUDA-related code from the main codebase at both the inter-file and intra-file levels, reducing the PyTorch core framework's direct dependency on CUDA (see the sketch below).
- Propose a modularized, standardized directory hierarchy and consolidate all CUDA-related code within it, as a reference for other third-party backend integrations.
- Redesign the build system to support standalone compilation of the CUDA backend, and develop a wrapped CMake toolkit to support and streamline the build process.

Click here for a preview of this RFC.
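
To make the intra-file decoupling concrete, here is a minimal hypothetical sketch; it is not taken from the RFC or from PyTorch itself, and `DeviceCopyHooks`, `CpuCopyHooks`, and `copy_bytes` are invented names:

```cpp
// Hypothetical before/after illustration of intra-file decoupling.
// Before: CPU and CUDA logic interleaved in one core source file.
//
//   void copy_bytes(void* dst, const void* src, size_t n, bool on_gpu) {
//   #ifdef USE_CUDA
//     if (on_gpu) { cudaMemcpy(dst, src, n, cudaMemcpyDefault); return; }
//   #endif
//     memcpy(dst, src, n);
//   }
//
// After: the core defines a backend-neutral interface and a CPU fallback;
// the CUDA implementation moves into the separate torch_cuda component.
#include <cstddef>
#include <cstring>

struct DeviceCopyHooks {
  virtual ~DeviceCopyHooks() = default;
  virtual void copy(void* dst, const void* src, std::size_t n) = 0;
};

// Core-side CPU implementation: compiles with no CUDA headers at all.
struct CpuCopyHooks final : DeviceCopyHooks {
  void copy(void* dst, const void* src, std::size_t n) override {
    std::memcpy(dst, src, n);
  }
};
```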


meta-cla bot commented Sep 2, 2025

Hi @bjtuwjx!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient, and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


meta-cla bot commented Sep 2, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

> ## Motivation
>
> For a long time, NVIDIA GPUs and the CUDA architecture have dominated the PyTorch ecosystem. However, as an increasing number of vendors introduce their own high-performance AI chips, the current ecosystem is revealing the following key issues:
>
> - *Code coupling*. The CUDA code is too tightly coupled with the PyTorch codebase, resulting in poor modularity and high maintenance costs.
@albanD (Contributor) commented:

Can you give some details on this? What is the more concrete impact on modularity? And how would this reduce maintenance costs?

@bjtuwjx (Author) replied:

@albanD
Thank you for your question and sorry for the late reply. The current interleaving of CPU and CUDA code within the same source files or same directories creates a significant maintenance burden and limits modularity. The impact of the proposed decoupling is concrete:

- Enhanced Modularity:
  - Physical Separation of Concerns: today, a developer working on a CPU kernel must navigate around `#ifdef USE_CUDA` guards and CUDA-specific code; conversely, a CUDA developer risks breaking CPU code. Refactoring into a separate torch_cuda component creates a strict physical and architectural boundary, allowing both sides to work independently against well-defined interfaces, a fundamental principle of clean software architecture.
  - Simplified Dependency Graph: the core PyTorch library (libtorch.so) would no longer have a direct compile-time dependency on CUDA headers or libraries. This simplifies the build process for CPU-only versions and reduces unexpected breakage from CUDA-related changes (a hedged registration sketch follows this list).
- Reduced Maintenance Cost:
  - Dedicated Testing and CI: the torch_cuda module could have its own dedicated CI pipeline. A failure in a CUDA kernel would be immediately isolated to this component, speeding up triage and resolution, and CPU-side changes could be tested independently before integration with CUDA.
  - Lower Cognitive Load: new contributors and maintainers would face a much simpler onboarding process; they wouldn't need to understand the entire monolithic codebase to contribute to a specific part (either core CPU ops or GPU ops).
  - Easier Version Management: upgrading to or testing against a new CUDA toolkit version becomes an isolated task for the torch_cuda component, without rebuilding or retesting the entire PyTorch codebase.
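
As a hedged sketch of the "Simplified Dependency Graph" point: the core can hold a registry that a separately built libtorch_cuda.so populates from a static initializer when it is loaded, so the core never links against CUDA at compile time. `HooksRegistry`, `DeviceCopyHooks`, and `CudaCopyHooks` are invented names, not PyTorch's actual hooks machinery:

```cpp
// Hypothetical sketch: load-time self-registration, so the core library
// carries no compile-time CUDA dependency. All names are invented.
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

struct DeviceCopyHooks {
  virtual ~DeviceCopyHooks() = default;
  virtual void copy(void* dst, const void* src, std::size_t n) = 0;
};

// Core side: maps a backend name to a factory for its hooks.
class HooksRegistry {
 public:
  static HooksRegistry& instance() {
    static HooksRegistry r;
    return r;
  }
  void add(const std::string& backend,
           std::function<std::unique_ptr<DeviceCopyHooks>()> factory) {
    factories_[backend] = std::move(factory);
  }

 private:
  std::unordered_map<std::string,
                     std::function<std::unique_ptr<DeviceCopyHooks>()>>
      factories_;
};

// torch_cuda side (lives only in the decoupled component): a static
// initializer runs when the shared library is loaded and registers the
// CUDA hooks, e.g.:
//
//   static const bool registered = [] {
//     HooksRegistry::instance().add("cuda", [] {
//       return std::unique_ptr<DeviceCopyHooks>(new CudaCopyHooks());
//     });
//     return true;
//   }();
```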

@albanD (Contributor) commented:

I'm a little confused by these statements.
The kernel code is already separated between CPU and CUDA, with the appropriate shared logic available via our C++ APIs.
libtorch.so already does not depend on CUDA in a CPU-only build. For CUDA builds it does, mostly for historical reasons. We could remove that dependency, but I'm not sure what the benefit would be, and it would add many new indirection layers to handle the necessary CUDA APIs in non-CUDA builds.
We already have dedicated testing and CI for CUDA.
I'm not sure how that would help testing; our test suite is not cleanly separated so as to test each component in isolation. And because most of our features are about using many of these components together, we can't really do that.


> For a long time, NVIDIA GPUs and the CUDA architecture have dominated the PyTorch ecosystem. However, as an increasing number of vendors introduce their own high-performance AI chips, the current ecosystem is revealing the following key issues:
>
> - *Code coupling*. The CUDA code is too tightly coupled with the PyTorch codebase, resulting in poor modularity and high maintenance costs.
> - *Integration effort*. Currently, different hardware backends may adopt varying integration methods into PyTorch. The integration approaches and code lack standardization and consistency, leading to a significant amount of repetitive code and substantial integration effort.
@albanD (Contributor) commented:

We have done a lot of work on the PrivateUse1 backend extension point. In particular, we have added all the asked-for extension points, built OpenReg as an in-tree testing backend for this extension, added autoload, are actively working on more documentation, and have updated many tools to use the accelerator API to enable a smooth transition for the end user.
What would the proposed RFC provide on top of this? And why would we prefer investing in this refactor rather than continuing to improve the PrivateUse1-related extension points?

@bjtuwjx (Author) replied:

@albanD
We sincerely appreciate the monumental effort put into the PrivateUse1 backend and related infrastructure like OpenReg, which provides invaluable in-tree testing capabilities. Our proposal primarily addresses a critical shortcoming of the current ecosystem: the unsustainable practice of vendors reusing the CUDA dispatch key.
We observe that many vendors (e.g., Kunlunxin XPU, MetaX MACA) are forced to choose between two suboptimal paths:

1. Reuse the CUDA key. They reuse the CUDA dispatch key and its code logic for ecosystem compatibility and to avoid the immense cost of rewriting thousands of kernels from scratch. However, this requires highly invasive modifications to the deeply entangled CUDA code within the PyTorch core to optimize for their hardware. This creates a maintenance nightmare for both the vendor and the PyTorch project, as it breaks modularity and makes the codebase increasingly unstable and difficult to work with.
2. Use PrivateUse1. This is the architecturally clean method, but it forfeits any ability to reuse existing CUDA kernel implementations, presenting a massive development barrier.

Our RFC does not seek to replace PrivateUse1. Instead, it aims to eliminate the primary reason that forces vendors to choose the first, invasive option, making PrivateUse1 the unequivocally best choice for everyone by solving the code-reuse problem.
Here is what the RFC provides:

- Solves the Kernel Reuse Problem for CUDA-like Hardware: by refactoring CUDA into a standalone torch_cuda component, we create a clean, reusable library of CUDA kernels and infrastructure. This allows vendors with CUDA-like hardware to fork, adapt, and build upon this component outside of the core PyTorch repository (see the registration sketch at the end of this comment).
- Protects the Integrity of the Core Project: this approach forcefully ends the practice of vendors making invasive modifications to the core PyTorch codebase. It draws a clear architectural boundary: the core PyTorch repository manages interfaces and CPU kernels, while hardware-specific implementations are maintained externally. This drastically reduces the maintenance burden and complexity for PyTorch core maintainers.
- Strengthens the PrivateUse1 Ecosystem: by providing a clear migration path and a reusable component, our proposal transforms PrivateUse1 from a "clean-slate-only" option into the recommended, versatile integration point for all third-party backends. It empowers vendors to choose their starting point (from scratch or from the decoupled CUDA code) while using the same, official PrivateUse1 interface.

Why invest in this refactor? Continuing to improve the PrivateUse1 hooks is essential, but it does not address the architectural debt that encourages vendors to misuse the CUDA key. This refactor attacks that problem directly. It is a strategic investment to stop the influx of invasive, third-party patches into the core codebase and to provide a scalable, maintainable model for the entire ecosystem. It ensures that the future growth of PyTorch is built on a solid architectural foundation, not on an increasingly tangled monolith.
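
As a minimal sketch of this migration path: a vendor backend built from the decoupled component could register its adapted kernels under PrivateUse1 with PyTorch's existing C++ registration API (`TORCH_LIBRARY_IMPL`). `my_vendor_add` and its placeholder body are hypothetical:

```cpp
// Sketch: an out-of-tree backend registering an adapted kernel under the
// PrivateUse1 dispatch key via PyTorch's C++ registration API.
#include <ATen/ATen.h>
#include <torch/library.h>

// Hypothetical kernel; in practice it would call code adapted from the
// decoupled torch_cuda component for the vendor's CUDA-like hardware.
at::Tensor my_vendor_add(const at::Tensor& self,
                         const at::Tensor& other,
                         const at::Scalar& alpha) {
  return at::empty_like(self);  // placeholder body for illustration
}

// Bind the kernel to aten::add.Tensor for the PrivateUse1 backend.
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl("add.Tensor", &my_vendor_add);
}
```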

> For a long time, NVIDIA GPUs and the CUDA architecture have dominated the PyTorch ecosystem. However, as an increasing number of vendors introduce their own high-performance AI chips, the current ecosystem is revealing the following key issues:
>
> - *Code coupling*. The CUDA code is too tightly coupled with the PyTorch codebase, resulting in poor modularity and high maintenance costs.
> - *Integration effort*. Currently, different hardware backends may adopt varying integration methods into PyTorch. The integration approaches and code lack standardization and consistency, leading to a significant amount of repetitive code and substantial integration effort.
> - *Code migration*. Due to the lack of an integration code specification, different hardware backends provide APIs with varying names and styles, resulting in high code migration costs for PyTorch users.
@albanD (Contributor) commented:

We are building the accelerator API to address these particular points (relatively independently of how the backend itself is implemented, in or out of tree).
How would the proposal here help, compared to continuing to extend the accelerator API?

@bjtuwjx (Author) replied:

@albanD
The Accelerator API is a critical and welcome user-facing abstraction for writing device-agnostic code. Our proposal is an implementer-facing architectural change that would synergize powerfully with it.

- Different Layers of the Stack: the Accelerator API solves the problem of "how do users write code for any accelerator?". Our RFC solves the problem of "how do implementers structure the code for each accelerator in a maintainable way?".
- How the Proposal Helps the Accelerator API (a hypothetical layering sketch follows this list):
  - Clean Foundation for Implementations: the Accelerator API needs to be implemented by each backend. A cleanly decoupled backend (like the proposed torch_cuda) is far easier to adapt to a new abstraction layer such as the Accelerator API, and the clear boundaries simplify the implementation of dispatcher logic and kernel registration.
  - Architectural Alignment: it ensures that the internal architecture reflects the elegance of the user-facing abstraction. Messy, coupled internals make evolving a clean API more difficult; clean internals make API evolution simpler and more robust.
  - Synergy: continuing to extend the Accelerator API on top of the current entangled codebase adds complexity to an already complex foundation, while refactoring the foundation simplifies the task of extending the API. This refactor is a force multiplier for the Accelerator API effort, ensuring its implementation is built on a solid, maintainable base.
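
To illustrate the layering claim, here is a hedged, fully hypothetical sketch; `AcceleratorBackend`, `active_backend`, and `accelerator_synchronize` are invented names, not the actual Accelerator API:

```cpp
// Hypothetical layering sketch (all names invented): a user-facing
// accelerator layer delegates to whichever backend implementation is
// active, so a cleanly decoupled backend plugs in behind one boundary.
#include <memory>
#include <stdexcept>

struct AcceleratorBackend {
  virtual ~AcceleratorBackend() = default;
  virtual int device_count() const = 0;
  virtual void synchronize(int device) = 0;
};

// Set once at load time by the active backend (torch_cuda, a vendor fork
// of the decoupled component, or any PrivateUse1 backend).
std::unique_ptr<AcceleratorBackend>& active_backend() {
  static std::unique_ptr<AcceleratorBackend> backend;
  return backend;
}

// Device-agnostic front end: user-facing code never names a backend.
void accelerator_synchronize(int device) {
  if (!active_backend()) {
    throw std::runtime_error("no accelerator backend registered");
  }
  active_backend()->synchronize(device);
}
```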
