fix: detect GPU arch automatically for kernel building #169

Open
Dogacel wants to merge 2 commits into lightseekorg:main from Dogacel:default-kernel-arch

@Dogacel Dogacel commented May 17, 2026

Summary

Detect GPU architecture automatically while building tokenspeed-kernel.

Test Plan

  1. (Old flow) Built the kernel on GH200. However, sm100a is not compatible by default, so I get [ts] RuntimeError: Check failed: (status == cudaSuccess) is false: BatchQKApplyRotaryPosIdsCosSinCacheEnhanced failed with error code no kernel image is available for execution on the device while running gpt-oss-20b.
  2. (New flow) Cleared packages and cache, then re-ran pip install -e tokenspeed-kernel/python/ --no-build-isolation. The GPU architecture is now detected automatically without setting FLASHINFER_CUDA_ARCH_LIST or TOKENSPEED_CUDA_ARCH (setting those variables also fixes the issue).
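
For readers who want to see what automatic detection could look like, here is a minimal sketch. It is not the code from this PR. It assumes nvidia-smi is on PATH and that the driver supports the compute_cap query field. The helper names (parse_compute_caps, detect_compute_caps, resolve_arch_source) are hypothetical, and the precedence between the two environment variables is an assumption made only for illustration.

```python
import os
import subprocess


def parse_compute_caps(text):
    """Parse lines such as '9.0' into a sorted list of unique (major, minor) tuples."""
    caps = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        major, _, minor = line.partition(".")
        try:
            caps.add((int(major), int(minor or 0)))
        except ValueError:
            # Skip anything that is not a plain 'major.minor' value.
            continue
    return sorted(caps)


def detect_compute_caps():
    """Ask nvidia-smi for the compute capability of each visible GPU.

    Returns an empty list when nvidia-smi is missing or exits with an error.
    """
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            check=True,
            capture_output=True,
            text=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_compute_caps(out)


def resolve_arch_source(env=None):
    """Prefer an explicit override, otherwise fall back to detection.

    The lookup order below is an assumption, not taken from setup.py.
    """
    env = os.environ if env is None else env
    for name in ("TOKENSPEED_CUDA_ARCH", "FLASHINFER_CUDA_ARCH_LIST"):
        if env.get(name):
            return ("env", env[name])
    return ("detected", detect_compute_caps())
```

Keeping the parsing step separate from the subprocess call makes it testable on machines without a GPU. An empty detection result would still need a fallback or a clear error in a real build script.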

@Dogacel Dogacel requested a review from a team as a code owner May 17, 2026 05:42
Signed-off-by: Doğaç Eldenk <dogacel@gmail.com>
@Dogacel Dogacel force-pushed the default-kernel-arch branch from cc66eee to b738115 Compare May 17, 2026 06:15
Comment thread tokenspeed-kernel/python/setup.py Outdated
for cap in caps:
    cap = cap.strip()
    if cap:
        archs.add(self._normalize_cuda_arch(cap + "a"))
Contributor

@borontion borontion May 17, 2026

Why is there an "a" suffix here (cap + "a")? Shouldn't it be handled by _normalize_cuda_arch?

Author

@Dogacel Dogacel May 18, 2026

I am a bit confused; it seems like we can remove it.

The goal was to get sm100a, sm90a, sm120a, etc. _normalize_cuda_arch does:

        suffix = "a" if has_suffix or major >= 9 else ""
        return f"{major}{minor}{suffix}"

So it will always have the "a" suffix when major >= 9. However, when I tried to compile tokenspeed on sm80, I never succeeded, so I couldn't test that case.

So just removing it should result in identical code for supported architectures; I'll do that.
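
To make the equivalence concrete, here is a hypothetical reconstruction of the normalizer. Only its last two lines come from the snippet quoted above. The input parsing (a "major.minor" string with an optional trailing "a") is an assumption and may differ from the real _normalize_cuda_arch in setup.py.

```python
def normalize_cuda_arch(arch):
    """Turn '9.0' or '9.0a' into a compact arch string such as '90a'.

    The parsing is assumed; the suffix rule is the one quoted in this thread.
    """
    text = arch.strip().lower()
    has_suffix = text.endswith("a")
    if has_suffix:
        text = text[:-1]
    major_str, _, minor_str = text.partition(".")
    major, minor = int(major_str), int(minor_str or 0)
    # Quoted rule: keep an explicit "a", and add one for major >= 9.
    suffix = "a" if has_suffix or major >= 9 else ""
    return f"{major}{minor}{suffix}"


# With and without the manual suffix, results match for major >= 9.
for cap in ("9.0", "10.0", "12.0"):
    assert normalize_cuda_arch(cap) == normalize_cuda_arch(cap + "a")
```

Under these assumptions the two call styles differ only below major 9, where cap + "a" forces a suffix that the normalizer would not add on its own.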

Contributor

Hi, we don't plan to support sm80.

@Dogacel Dogacel requested a review from borontion May 18, 2026 02:00
Signed-off-by: Doğaç Eldenk <dogacel@gmail.com>
@Dogacel Dogacel force-pushed the default-kernel-arch branch from aa04d70 to 9044eb1 Compare May 18, 2026 14:33

3 participants