Conversation


@ko3n1g commented Sep 1, 2025

Local copy of #179

ko3n1g added 16 commits August 28, 2025 15:18
Signed-off-by: oliver könig <[email protected]>
@ko3n1g changed the title from "Ko3n1g/ci/build wheels" to "chore: Build and store bdist wheels" on Sep 1, 2025
@LyricZhao

Thanks! I will merge it tomorrow. Closing #179 now.

@LyricZhao

Hi @ko3n1g, may I ask why there is a DEEP_GEMM_SKIP_CUDA_BUILD flag? The CUDA extension doesn't actually build CUDA kernels; it builds the core JIT part.

@LyricZhao

BTW, is it OK to release Linux-only and CUDA 12-only wheels? We don't support other platforms or CUDA 11.

@LyricZhao

Also, ext_modules is not used, so SKIP_CUDA_BUILD has no effect.

@LyricZhao

I made some updates and lint fixes on your branch (not tested); feel free to continue working on this. Many thanks!


ko3n1g commented Sep 16, 2025

Hey @LyricZhao, sorry for the absence; I was on vacation and just returned yesterday. Picking this up again now.

Hi @ko3n1g, may I ask why there is a DEEP_GEMM_SKIP_CUDA_BUILD flag? The CUDA extension doesn't actually build CUDA kernels; it builds the core JIT part.

Does this mean we don't need to build torch/CUDA-specific wheels? In that case, I can simplify the whole logic significantly. I wasn't too familiar with the source code and assumed it needed platform-specific wheels like FA or TE.

@LyricZhao

Does this mean we don't need to build torch/cuda specific wheels?

I guess we should have torch-specific wheels and CUDA-major-version (12/13) wheels.

@LyricZhao

Sorry for my late reply; I've been very busy recently.

@ko3n1g force-pushed the ko3n1g/ci/build-wheels branch from bffc64b to bd86ac7 on September 26, 2025 12:49

ko3n1g commented Sep 26, 2025

Thanks for the fixes and linting @LyricZhao !

The DG_SKIP_CUDA_BUILD flag only exists because we don't need to build anything for the sdist package on PyPI.

I noticed the last commit accidentally (?) removed some args from the CUDA extension; I re-added them. Please let me know if this is good!

I think this PR should otherwise be good to go.

@ko3n1g requested a review from LyricZhao on September 26, 2025 12:56
@LyricZhao

Sorry, I don't get it. The current version fails to compile the CUDA extension. Also, this is not really a CUDA extension, just the JIT C++ part, and in my opinion it should always be built. So do you agree if I remove DG_SKIP_CUDA_BUILD?


LyricZhao commented Sep 28, 2025

Another thing is about DG_NO_LOCAL_VERSION:

I want to remove this as well.

  • If git is available, the downloaded wheel (for the cached codepath) must be aligned with the local version (commit ID included and no local uncommitted changes).
  • If git is not available, we must do a local build.

Do you agree with this?
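The versioning rule proposed above can be sketched roughly as follows. This is an illustrative sketch only, not DeepGEMM's actual setup.py; the constant and helper names are hypothetical:

```python
# Hypothetical sketch: append the git commit ID to the base version when git
# metadata is available, otherwise fall back to the plain release version
# (which, per the discussion, then forces a local build rather than a
# cached-wheel download). All names here are illustrative.
import subprocess

BASE_VERSION = "2.0.0"  # placeholder release version


def get_git_sha():
    """Return the short commit SHA, or None if git is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def resolve_version():
    """Local version string: '2.0.0+<sha>' with git, plain '2.0.0' without."""
    sha = get_git_sha()
    return f"{BASE_VERSION}+{sha}" if sha else BASE_VERSION
```

The `+<sha>` suffix follows PEP 440's "local version identifier" syntax, which is exactly what keeps such builds from being uploadable to PyPI.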

Signed-off-by: oliver könig <[email protected]>

ko3n1g commented Sep 28, 2025

Yes, of course, please feel free to refactor however you see best for DG!

To lay out my thoughts and motivation:

  • The flags DG_NO_LOCAL_VERSION, DG_SKIP_CUDA_BUILD, and DG_FORCE_BUILD are mostly for CI; their defaults are set so that local development doesn't change.
  • We typically don't push each commit to PyPI, only released versions (!)
  • So when you run python setup.py bdist_wheel locally (without setting any env vars), it will check whether there's a pre-built wheel at a GitHub release for, let's say, 2.0.0+$git_sha. It won't find anything, since we only create GitHub releases for actual releases (let's say 2.0.0), and will thus build the CUDA extension from scratch.
  • If you check out tag 2.0.0 and run python setup.py bdist_wheel, it will find the pre-built wheel on the GitHub release of DG 2.0.0 and install that directly. However, if you don't want to fetch the pre-built wheel, you can set DG_FORCE_BUILD=1 to force rebuilding the CUDA extension.
  • If we run pip install "deep-gemm==2.0.0", we first download the package from PyPI that contains only the Python source code (not the cpp_extension). During installation, pip will notice that deep_gemm_cpp is missing and start its build process. Here, setup.py intercepts by checking whether it can find this extension compiled for the right torch/CUDA version at the GitHub release 2.0.0 of DG. If so, it downloads and installs that one. Otherwise, it proceeds with a local build.
  • DG_NO_LOCAL_VERSION is required for CI: when we push to PyPI, we don't want the version number to be 2.0.0+$git_sha but just 2.0.0. That is the typical convention: PyPI releases follow semantic versions without a commit SHA.
  • DG_SKIP_CUDA_BUILD is also required for CI, since on PyPI we only store the sdist, which doesn't have the deep_gemm_cpp extension. The reason we don't want to push the extension to PyPI is that we will have many different wheels for CUDA/torch combinations. For each released version, PyPI allows us to push wheels only once, so if we later want to add torch 2.9 support to DeepGEMM 2.0.0, PyPI wouldn't allow us to add another torch-2.9-specific wheel to the already published 2.0.0 release. Therefore, we use PyPI as a Trojan horse that hosts only the Python code and the setup.py logic that downloads the compiled extension from GitHub. To a GitHub release, we can always add new wheels (or even remove them) as we like. This is where we store the bdists (the compiled deep_gemm_cpp extensions).
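The interception flow in the bullets above can be sketched roughly like this. It is a sketch only, not DeepGEMM's actual setup.py: the repository URL, the wheel naming scheme, and all function names are hypothetical placeholders:

```python
# Illustrative sketch (not DeepGEMM's real setup.py) of the wheel-interception
# flow: before compiling the extension, try a matching prebuilt wheel from a
# GitHub release; fall back to a local build. URL and names are hypothetical.
import os

FORCE_BUILD = os.environ.get("DG_FORCE_BUILD", "0") == "1"
SKIP_CUDA_BUILD = os.environ.get("DG_SKIP_CUDA_BUILD", "0") == "1"


def wheel_url(version, torch_ver, cuda_major):
    # Hypothetical naming scheme for wheels attached to a GitHub release.
    return (
        "https://github.com/org/repo/releases/download/"
        f"v{version}/deep_gemm_cpp-{version}+torch{torch_ver}cu{cuda_major}.whl"
    )


def try_fetch_prebuilt(version, torch_ver, cuda_major):
    """Return True if a matching prebuilt wheel was downloaded and installed."""
    if FORCE_BUILD:
        return False  # user explicitly asked for a from-scratch build
    url = wheel_url(version, torch_ver, cuda_major)
    # ... download `url` and install it; return False on any failure ...
    return False  # placeholder: assume no wheel was found in this sketch


def resolve_ext_modules(version, torch_ver, cuda_major):
    if SKIP_CUDA_BUILD:
        return []  # building the sdist for PyPI: ship no extension at all
    if try_fetch_prebuilt(version, torch_ver, cuda_major):
        return []  # prebuilt wheel installed, nothing left to compile
    return ["deep_gemm_cpp"]  # placeholder for the real Extension object
```

With both env vars unset, the default path compiles the extension locally, which matches the "defaults are set so that local development doesn't change" point above.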

I hope this makes the workflow clear :)

This is the standard procedure used by FlashAttention, mamba, groupedgemm, and others. But again, I'm not familiar with the specifics of DG, so if you believe a different approach fits DG better, please feel free to refactor, push to this branch, and merge it :)

@LyricZhao

Thanks! I get it. I will refactor and test it tomorrow and merge! Right now the env-var defaults seem suited to CI but not to users (so some errors occur).
