fix(build): probe CUDA toolkit layouts in a shared openinfer-build crate #343

Open
FeathBow wants to merge 1 commit into openinfer-project:main from FeathBow:fix/build-cuda-discovery

@FeathBow

Contributor

Description

Fixes #342

CUDA toolkit discovery was duplicated across build scripts, and each copy assumed the classic /usr/local/cuda layout (lib64/ for libraries, include/ for headers). Two real layouts break that assumption: conda/micromamba (libraries in lib/, headers in targets/<arch>-linux/include/) and the NVIDIA HPC SDK (cuBLAS in a sibling math_libs/<ver> tree).

This PR centralizes discovery in a shared openinfer-build crate. find_package probes several check files per root; cuda_libs probes lib64, lib, and targets/<arch>/lib plus the math_libs sibling, and emits only the directories that exist. The call sites with reproduced failures (openinfer-kernels, cuda-sys, cudart-sys) migrate to it; gdrapi-sys and libibverbs-sys keep their current behavior but now go through the same helper.
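For reviewers who want the shape of the library probe without opening the diff, here is a minimal sketch of the idea, not the crate's actual code. The names `existing_lib_dirs` and `emit_link_search` are hypothetical, and the HPC SDK handling assumes the toolkit root looks like `<sdk>/cuda/<ver>` with the math libraries at `<sdk>/math_libs/<ver>`.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not the PR's actual `cuda_libs`): list candidate
/// library directories for a toolkit root and keep only those that exist.
/// `target_dir` is the directory name under `targets/`, e.g. "x86_64-linux".
pub fn existing_lib_dirs(root: &Path, target_dir: &str) -> Vec<PathBuf> {
    let mut candidates = vec![
        root.join("lib64"), // classic /usr/local/cuda layout
        root.join("lib"),   // conda/micromamba layout
        root.join("targets").join(target_dir).join("lib"),
    ];

    // Assumption: the HPC SDK places the toolkit at <sdk>/cuda/<ver> and the
    // math libraries at <sdk>/math_libs/<ver>, so the sibling tree is found
    // two levels above the toolkit root.
    if let (Some(ver), Some(sdk)) = (root.file_name(), root.parent().and_then(Path::parent)) {
        let math = sdk.join("math_libs").join(ver);
        candidates.push(math.join("lib64"));
        candidates.push(math.join("lib"));
    }

    // Emit only directories that are actually present on disk.
    candidates.into_iter().filter(|dir| dir.is_dir()).collect()
}

/// Print one Cargo link-search directive per directory, as a build script would.
pub fn emit_link_search(dirs: &[PathBuf]) {
    for dir in dirs {
        println!("cargo:rustc-link-search=native={}", dir.display());
    }
}
```

Filtering on `is_dir()` means a root in any one layout yields only its own directories, so the linker is never handed search paths that do not exist.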

Before

  • Dual-GH200 (aarch64, sm_90), NVIDIA HPC SDK toolkit: linking openinfer-kernels fails; the only workaround was a manual LIBRARY_PATH export.
  • Single GPU (x86_64, sm_89), conda toolkit: the cuda-sys/cudart-sys build scripts panic on the header probe, taking cargo test --workspace down with them.
Error logs
# Dual-GH200 (aarch64, sm_90), HPC SDK toolkit — openinfer-kernels link stage
ld: cannot find -lcublas: No such file or directory
ld: cannot find -lcublasLt: No such file or directory

# Single GPU (x86_64, sm_89), conda toolkit — openinfer-comm-cuda-sys build script
cuda-sys build error: required header `include/cuda.h` not found.
Looked at `$CUDA_HOME` (set to ".../envs/<conda-env>") and default paths ["/usr/local/cuda"]

After

  • Dual-GH200 (aarch64, sm_90), NVIDIA HPC SDK toolkit: openinfer-kernels relinks with LIBRARY_PATH unset; the workaround is deleted.
  • Single GPU (x86_64, sm_89), conda toolkit: cuda-sys/cudart-sys build, openinfer-kernels relinks, and the Qwen3-4B golden gate runs green on the fixed tree.
  • Layout unit tests pass on both machines and on a CUDA-less host (the crate has no CUDA dependency).
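
The layout tests can run on a CUDA-less host because they only need a fake directory tree. The sketch below illustrates that style of test under stated assumptions: `find_root` is a hypothetical stand-in for the multi-check-file probe, `make_fake_conda_tree` is an illustrative fixture, and neither is taken from the PR.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the probe: return the first root that contains
/// any of the check files (paths relative to the root).
pub fn find_root(roots: &[PathBuf], check_files: &[&str]) -> Option<PathBuf> {
    roots
        .iter()
        .find(|root| check_files.iter().any(|rel| root.join(rel).is_file()))
        .cloned()
}

/// Illustrative fixture: build an empty conda-style toolkit under `base`,
/// with libraries in lib/ and headers in targets/<target_dir>/include/.
pub fn make_fake_conda_tree(base: &Path, target_dir: &str) -> std::io::Result<PathBuf> {
    let root = base.join("conda-env");
    let include = root.join("targets").join(target_dir).join("include");
    fs::create_dir_all(&include)?;
    fs::create_dir_all(root.join("lib"))?;
    fs::write(include.join("cuda.h"), b"")?; // empty placeholder header
    Ok(root)
}

#[cfg(test)]
mod layout_tests {
    use super::*;

    #[test]
    fn conda_headers_need_the_targets_check_file() {
        let base = std::env::temp_dir().join(format!("layout_sketch_{}", std::process::id()));
        let root = make_fake_conda_tree(&base, "x86_64-linux").unwrap();

        // The classic check file alone misses the conda layout.
        assert_eq!(find_root(&[root.clone()], &["include/cuda.h"]), None);

        // Adding the targets/ check file finds the same root.
        let checks = ["include/cuda.h", "targets/x86_64-linux/include/cuda.h"];
        assert_eq!(find_root(&[root.clone()], &checks), Some(root));

        fs::remove_dir_all(&base).unwrap();
    }
}
```

Because the fixture writes only empty files, a test like this exercises path logic alone and needs neither a GPU nor an installed toolkit.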
Verification logs
== buildfix verify, Dual-GH200 (aarch64, sm_90), HPC SDK toolkit ==
LIBRARY_PATH=unset
=A= openinfer-build unit tests
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s
=B= kernels relink (gate --no-run)
    Finished `release` profile [optimized] target(s) in 45.60s

== buildfix verify, Single GPU (x86_64, sm_89), conda toolkit ==
=A= openinfer-build unit tests
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
=B= cuda-sys
    Finished `release` profile [optimized] target(s) in 1.89s
=C= cudart-sys
    Finished `release` profile [optimized] target(s) in 1.23s
=D= kernels relink (gate --no-run)
    Finished `release` profile [optimized] target(s) in 38.85s
=E= golden gate full run
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 38.05s

== buildfix verify, CUDA-less dev host (macOS arm64) ==
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • My code follows the style guidelines of this project (see docs/conventions/coding-style.md).
  • I have performed a self-review of my own code.
  • I have formatted my commits according to Commitizen conventions.
  • I have run the local test suite and all tests pass (see CLAUDE.md).

Development

Successfully merging this pull request may close these issues.

build: CUDA toolkit discovery assumes one layout per build script