
Commit 310a7d5
Fix clippy
1 parent 0086ffd commit 310a7d5


71 files changed (+3911, -6 lines)

.claude/settings.local.json

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
{
  "permissions": {
    "allow": [
      "Bash(./sync-to-vast.sh:*)",
      "WebFetch(domain:patch-diff.githubusercontent.com)",
      "Bash(git push:*)",
      "Bash(git cherry-pick:*)",
      "Bash(docker exec:*)",
      "Bash(docker start:*)",
      "Bash(docker stop:*)",
      "Bash(docker rm:*)",
      "Bash(docker run:*)",
      "Bash(cargo check:*)",
      "Bash(./deploy-cuda.sh:*)",
      "Bash(bash:*)",
      "Bash(grep:*)",
      "WebFetch(domain:github.com)",
      "WebFetch(domain:raw.githubusercontent.com)",
      "WebFetch(domain:rustc-dev-guide.rust-lang.org)",
      "Bash(./run-example.sh:*)",
      "Bash(docker cp:*)",
      "WebFetch(domain:docs.nvidia.com)",
      "WebFetch(domain:reviews.llvm.org)",
      "Bash(find:*)"
    ],
    "deny": []
  }
}

BUG_REPORT_SHARED_MEMORY.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
# Rust-CUDA Critical Bug: Shared Memory Not Emitted to PTX

## Summary
The `shared_array!` macro in Rust-CUDA does not emit any shared memory declarations in the generated PTX code, causing runtime `InvalidValue` errors when trying to access the "shared" memory.

## The Bug
When using the `shared_array!` macro:
```rust
let smem = shared_array![bf16; 16384];
```

Expected PTX output (like CUDA C++):
```ptx
.extern .shared .align 16 .b8 smem[];
```

Actual PTX output:
**No shared memory declaration at all.**
## Root Cause Analysis

### 1. PTX Generation Issue
Comparing Rust-CUDA vs CUDA C++ PTX output:

**CUDA C++ (working):**
```ptx
.extern .shared .align 16 .b8 smem[];
// Later in code:
cp.async.cg.shared.global [%r355], [%rd59], 16;
```

**Rust-CUDA (broken):**
- No `.shared` declaration anywhere in PTX
- The macro creates a static, but it's being treated as local memory
- No `addrspace(3)` annotations in LLVM IR

### 2. Macro Implementation Problem
The `shared_array!` macro (in `/Users/legnitto/src/Rust-CUDA/crates/cuda_std/src/shared.rs`) creates:
```rust
#[address_space(shared)]
static SHARED: SyncWrapper = SyncWrapper(UnsafeCell::new(MaybeUninit::uninit()));
```

But this `#[address_space(shared)]` attribute is not being properly translated through the compilation pipeline:
1. LLVM IR doesn't contain `addrspace(3)` annotations
2. PTX doesn't contain `.shared` declarations
3. At runtime, the memory is actually local/register memory, not shared
### 3. Why It Causes InvalidValue
When the kernel tries to:
1. Write to what it thinks is shared memory (actually local)
2. Call `thread::sync_threads()`
3. Read from that "shared" memory

The CUDA runtime detects invalid memory access patterns and returns `InvalidValue`.

## Impact
This bug makes it **impossible** to implement any optimized GPU kernels in Rust-CUDA, including:
- Flash Attention
- Matrix multiplication (GEMM)
- Convolutions
- Any kernel requiring thread cooperation
## Reproduction
```rust
#[kernel]
pub unsafe fn test_shared() {
    let smem = shared_array![f32; 256];
    *smem.add(0) = 1.0;
    thread::sync_threads();
    let val = *smem.add(0); // InvalidValue here
}
```

## Verification
The same pattern works perfectly in CUDA C++, confirming this is a Rust-CUDA codegen bug.
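Until the codegen is fixed, the failure is at least easy to detect mechanically. A minimal sketch of a host-side regression check, assuming the compiled PTX is available as a file (the helper name and path are hypothetical):

```rust
/// Hypothetical regression check: fail fast if the kernel's PTX contains
/// no shared-memory declaration at all (the bug described above).
fn assert_shared_memory_emitted(ptx: &str) {
    assert!(
        ptx.contains(".shared"),
        "no .shared declaration in PTX: shared_array! was lowered to local memory"
    );
}

fn main() {
    // In a real build this string would come from the build output,
    // e.g. a PTX file produced by the codegen.
    let ptx = std::fs::read_to_string("kernel.ptx").expect("PTX file missing");
    assert_shared_memory_emitted(&ptx);
}
```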
## Suggested Fix
The `rustc_codegen_nvvm` backend needs to:
1. Recognize `#[address_space(shared)]` attributes
2. Emit proper `addrspace(3)` in LLVM IR
3. Generate `.shared` declarations in PTX

## Workaround
None. Shared memory is fundamental to GPU programming and cannot be worked around efficiently.

SHA2_DEBUGGING.md

Lines changed: 265 additions & 0 deletions
@@ -0,0 +1,265 @@
# SHA2 on Rust-CUDA: Debugging and Fix Documentation

## Problem Statement
The SHA2 crate compiles successfully with rust-cuda but fails at runtime with `LaunchFailed` errors when executing certain kernels (specifically SHA512).

## Investigation Process

### 1. Initial Symptom Discovery
- **Tool Used**: `deploy-cuda.sh` script to deploy and run examples on a remote CUDA machine
- **Finding**: SHA256 kernels work, but SHA512 kernels fail with `Error: LaunchFailed`

### 2. PTX Analysis
- **Tools Used**:
  - Custom bash scripts to extract and analyze PTX
  - `grep` to search for function definitions
  - `sed` and `awk` to extract specific kernel code

- **Key Findings**:
  - All 4 kernels (sha256_oneshot, sha256_incremental, sha512_oneshot, sha512_incremental) are present in PTX
  - SHA256 kernel has 9679 registers and 256 bytes of local memory
  - SHA512 kernel has only 6 registers and contains `trap` instructions
### 3. Root Cause Identification
- **Location**: `/crates/rustc_codegen_nvvm/src/builder.rs`
- **Problem**: The `atomic_load` and `atomic_store` functions immediately insert `trap` instructions
- **Code Pattern**:
```rust
fn atomic_load(...) {
    let (ty, f) = self.get_intrinsic("llvm.trap");
    self.call(ty, None, None, f, &[], None, None);
    // ... unreachable load
}
```
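For reference, a minimal sketch of the kind of source that rustc routes through the backend's `atomic_load`/`atomic_store` hooks (whether sha2 reaches them via exactly this pattern is not confirmed):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static COUNTER: AtomicU64 = AtomicU64::new(0);

fn main() {
    // The store lowers to the backend's atomic_store hook...
    COUNTER.store(7, Ordering::Release);
    // ...and the load to atomic_load, which used to emit a trap.
    let v = COUNTER.load(Ordering::Acquire);
    assert_eq!(v, 7);
}
```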
### 4. Why SHA512 Triggers This
- SHA512 uses 64-bit operations more extensively than SHA256
- Some operation in SHA512 (likely array access or integer operations) gets compiled to atomic loads/stores
- NVVM IR doesn't support native atomic loads/stores, causing the codegen to insert trap instructions
## The Fix

### Fix 1: Atomic Operations (Partial Success)
We fixed the atomic load/store operations that were generating trap instructions by emulating them with:
1. **Volatile loads/stores** - Prevent optimization and ensure the memory access happens
2. **Memory barriers** (`llvm.nvvm.membar.sys`) - Ensure proper memory ordering

This fixed *some* trap instructions but not all.
### Root Cause 2: 128-bit Integer Operations
After fixing atomic operations, we discovered that rust-cuda also generates trap instructions for **any 128-bit integer intrinsics**, including:
- `ctlz`, `cttz`, `ctpop` (bit counting operations)
- `bswap`, `bitreverse` (byte/bit reversal)
- `rotate_left`, `rotate_right` (bit rotation)

SHA512 appears to use some operations that result in 128-bit intermediate values, triggering these traps.
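For a concrete sense of what reaches these intrinsics, the corresponding `u128` methods in Rust are the usual entry points; a host-side sketch (whether sha2 hits each of these is not confirmed):

```rust
fn main() {
    let x: u128 = 0x0123_4567_89ab_cdef_0123_4567_89ab_cdef;
    let _ = x.count_ones();    // typically lowers to ctpop on i128
    let _ = x.leading_zeros(); // typically lowers to ctlz on i128
    let _ = x.swap_bytes();    // typically lowers to bswap on i128
    let _ = x.rotate_left(7);  // typically lowers to a rotate on i128
}
```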
### Implementation

#### For atomic_load:
```rust
fn atomic_load(...) -> &'ll Value {
    // Add memory fence before for acquire semantics
    match order {
        Acquire | AcqRel | SequentiallyConsistent => {
            self.call(membar_sys, ...);
        }
        _ => {}
    }

    // Volatile load
    let load = LLVMBuildLoad(...);
    LLVMSetVolatile(load, True);

    // Add fence after for sequential consistency
    if order == SequentiallyConsistent {
        self.call(membar_sys, ...);
    }

    load
}
```

#### For atomic_store:
Similar pattern: a fence before the store for release semantics, then a volatile store, then a trailing fence for sequential consistency.
## Testing Strategy

### Test Files
1. `examples/cuda/sha2_crates_io/` - Contains comprehensive tests for both SHA256 and SHA512 using the sha2 crate from crates.io:
   - One-shot API (`Sha256::digest()`)
   - Incremental API (`hasher.update()`)

### Deployment Process
1. Use `run-example.sh` for quick deployment and testing
2. Verify GPU results against the CPU implementation
3. All 4 test cases should show "✅ Results match!"
## Current Investigation (2025-08-08)

### Root Cause Identified: SHA512 Uses u128 for Block Length Tracking

After extensive debugging and tracing, we've identified the exact source of the SHA512 failure:

#### The Smoking Gun
In `sha2-0.10.8/src/core_api.rs`, the `Sha512VarCore` struct uses a u128 field:
```rust
pub struct Sha512VarCore {
    state: consts::State512,
    block_len: u128, // <-- This is the problem!
}

impl UpdateCore for Sha512VarCore {
    fn update_blocks(&mut self, blocks: &[Block<Self>]) {
        self.block_len += blocks.len() as u128; // Line 108: u128 arithmetic
        compress512(&mut self.state, blocks);
    }
}
```

Compare this with `Sha256VarCore`, which uses `block_len: u64` - this is why SHA256 works but SHA512 doesn't!
#### What Happens During Compilation

When `SHA512::digest()` is called:
1. Creates a `Sha512VarCore` with `block_len: u128 = 0`
2. Calls `update()`, which performs `self.block_len += blocks.len() as u128`
3. This triggers a cascade of u128 operations

Our debug output shows hundreds of u128 operations being generated, including:
- Basic arithmetic (add, sub, mul) - ✅ Successfully emulated
- Bitwise operations (and, or, xor, shifts) - ✅ Successfully emulated
- Comparisons (IntEQ, IntULT, IntUGE) - ✅ Work with LLVM
- **Division and remainder (udiv, urem)** - ❌ NOT IMPLEMENTED - causes traps!

#### The Missing Operations

The critical missing operations that cause the failure:
```
WARNING: Unimplemented i128 operation: udiv with 2 args, falling back to trap
WARNING: Unimplemented i128 operation: urem with 2 args, falling back to trap
```

These division operations are likely generated for:
- Calculating how many 128-byte blocks fit in the input (`length / 128`)
- Getting the remaining bytes (`length % 128`)
### Solution Implemented: Partial u128 Arithmetic Emulation

We've implemented emulation of most 128-bit integer operations using pairs of 64-bit values:

1. **Basic arithmetic**: add, sub, mul with proper carry/borrow handling
2. **Bitwise operations**: and, or, xor, not
3. **Shift operations**: shl, lshr, ashr with full support for shifts >= 64
4. **Unary operations**: neg (two's complement)

The implementation intercepts these operations in the LLVM builder macro and replaces them with emulated versions using 64-bit operations that NVVM can handle.
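As an illustration of the scheme, a minimal host-side sketch of the pair-of-u64 representation with a carry-propagating add and a shift that handles amounts >= 64 (the type and function names are illustrative, not the actual codegen helpers):

```rust
/// Illustrative pair-of-u64 stand-in for an i128 value (lo = low 64 bits).
#[derive(Clone, Copy, Debug, PartialEq)]
struct U128Parts {
    lo: u64,
    hi: u64,
}

/// 128-bit add built from two 64-bit adds plus carry propagation.
fn emulated_add(a: U128Parts, b: U128Parts) -> U128Parts {
    let (lo, carry) = a.lo.overflowing_add(b.lo);
    let hi = a.hi.wrapping_add(b.hi).wrapping_add(carry as u64);
    U128Parts { lo, hi }
}

/// 128-bit shift-left, including the >= 64 case where the low half
/// moves wholesale into the high half.
fn emulated_shl(a: U128Parts, s: u32) -> U128Parts {
    match s {
        0 => a,
        1..=63 => U128Parts {
            lo: a.lo << s,
            hi: (a.hi << s) | (a.lo >> (64 - s)),
        },
        64..=127 => U128Parts { lo: 0, hi: a.lo << (s - 64) },
        _ => U128Parts { lo: 0, hi: 0 }, // shifts >= 128 defined as 0 here
    }
}

fn main() {
    // (2^64 - 1) + 1 carries into the high half.
    let a = U128Parts { lo: u64::MAX, hi: 0 };
    let b = U128Parts { lo: 1, hi: 0 };
    assert_eq!(emulated_add(a, b), U128Parts { lo: 0, hi: 1 });

    // 1 << 64 lands in the high half.
    let one = U128Parts { lo: 1, hi: 0 };
    assert_eq!(emulated_shl(one, 64), U128Parts { lo: 0, hi: 1 });
}
```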
### Still Missing: Division and Remainder

To fully fix SHA512, we need to implement:
- `emulate_i128_udiv`: 128-bit unsigned division
- `emulate_i128_urem`: 128-bit unsigned remainder
- `emulate_i128_sdiv`: 128-bit signed division (if needed)
- `emulate_i128_srem`: 128-bit signed remainder (if needed)

These are complex operations requiring long division algorithms using 64-bit primitives; a sketch of the algorithm follows.
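The sketch below shows restoring shift-subtract long division, which computes quotient and remainder in one pass. `u128` is used host-side purely to keep the illustration short; the codegen would expand the loop body into the pair-of-u64 operations already implemented (the function name is hypothetical):

```rust
/// Restoring shift-subtract long division: walk the dividend from the
/// most significant bit down, shifting one bit at a time into a running
/// remainder and subtracting the divisor whenever it fits.
fn emulate_udiv_urem(num: u128, den: u128) -> (u128, u128) {
    assert!(den != 0, "division by zero");
    let mut quot: u128 = 0;
    let mut rem: u128 = 0;
    for i in (0..128).rev() {
        rem = (rem << 1) | ((num >> i) & 1);
        if rem >= den {
            rem -= den;
            quot |= 1 << i;
        }
    }
    (quot, rem)
}

fn main() {
    // length / 128 and length % 128, the operations SHA512 needs.
    let (q, r) = emulate_udiv_urem(1000, 128);
    assert_eq!((q, r), (7, 104));
}
```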
## Current Investigation (2025-08-07)

### New Finding: 128-bit Integer Intrinsics Generate Traps

After analyzing the PTX output and tracing through the codegen, we've discovered that **any 128-bit integer intrinsics** generate trap instructions in rust-cuda. This happens in `/crates/rustc_codegen_nvvm/src/intrinsic.rs`.

#### Code Path:
1. SHA512 operations trigger certain intrinsics (likely in internal operations)
2. These hit the `codegen_intrinsic_call` function in `intrinsic.rs`
3. For operations like `ctlz`, `cttz`, `ctpop`, `bswap`, `bitreverse`, `rotate_left`, `rotate_right`:
   - If the width is 128 bits, it calls `handle_128_bit_intrinsic`
   - This function attempts to use LLVM intrinsics like `llvm.ctpop.i128`
   - However, NVVM doesn't properly support these, leading to trap generation
#### Evidence from PTX:
```ptx
// SHA512 kernel contains multiple trap instructions
$L__BB2_6:
    setp.lt.u64 %p5, %rd2, 129;
    @%p5 bra $L__BB2_8;
    trap;

$L__BB2_8:
    trap;
```
#### The Problem Code (intrinsic.rs:27-67):
```rust
fn handle_128_bit_intrinsic<'ll>(
    b: &mut Builder<'_, 'll, '_>,
    name: Symbol,
    args: &[OperandRef<'_, &'ll Value>],
) -> &'ll Value {
    // CUDA 12+ has native __int128 support, so we can use LLVM intrinsics directly
    match name {
        sym::ctlz | sym::cttz => {
            // Tries to use llvm.ctlz.i128, etc.
            b.call_intrinsic(&llvm_name, &[args[0].immediate(), y])
        }
        // ... other 128-bit operations
        _ => {
            // Falls back to abort for unsupported ops
            b.abort_and_ret_i128()
        }
    }
}
```

The comment claims "CUDA 12+ supports 128-bit integers natively", but this is clearly not working with NVVM.
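These particular intrinsics decompose cleanly onto the 64-bit halves, so the emulation route proposed under Next Steps below should be cheap; a host-side sketch with hypothetical helper names:

```rust
/// Population count over the (lo, hi) pair: just sum the halves.
fn emulated_ctpop(lo: u64, hi: u64) -> u32 {
    lo.count_ones() + hi.count_ones()
}

/// Leading zeros: count in the high half first, spill into the low half.
fn emulated_ctlz(lo: u64, hi: u64) -> u32 {
    if hi != 0 {
        hi.leading_zeros()
    } else {
        64 + lo.leading_zeros()
    }
}

fn main() {
    let x: u128 = 1 << 70;
    let (lo, hi) = (x as u64, (x >> 64) as u64);
    assert_eq!(emulated_ctpop(lo, hi), x.count_ones());
    assert_eq!(emulated_ctlz(lo, hi), x.leading_zeros());
}
```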
The comment claims "CUDA 12+ supports 128-bit integers natively" but this is clearly not working with NVVM.
216+
217+
## Next Steps
218+
219+
### To Fix the Issue:
220+
1. ⏳ Implement proper 128-bit intrinsic emulation (similar to atomic operations)
221+
2. ⏳ Test with SHA512 to ensure it works
222+
3. ⏳ Verify no performance regression in SHA256
223+
224+
### Potential Solutions:
225+
1. **Emulate 128-bit operations using 64-bit operations** (like LLVM does for targets without native 128-bit support)
226+
2. **Use software implementations** of these intrinsics
227+
3. **Detect and warn** when 128-bit operations are used
228+
229+
### Future Improvements:
230+
1. Add proper atomic operation support for common cases (atomicAdd, atomicCAS)
231+
2. Improve error messages when CUDA features are unsupported
232+
3. Add compile-time warnings for potentially problematic operations
233+
4. Consider whether CUDA 12's __int128 support can be leveraged differently
234+
235+
## Technical Notes
236+
237+
### NVVM Limitations
238+
- NVVM IR lacks native atomic load/store instructions
239+
- Only has atomic RMW operations (add, sub, exchange, CAS)
240+
- Memory ordering must be emulated with explicit barriers
241+
242+
### Memory Barrier Types in NVVM
243+
- `llvm.nvvm.membar.sys` - System-wide memory fence
244+
- `llvm.nvvm.membar.gl` - Global memory fence
245+
- `llvm.nvvm.membar.cta` - Thread block level fence
246+
247+
### Volatile vs Atomic
248+
- Volatile: Prevents compiler optimizations, ensures memory access happens
249+
- Atomic: Provides indivisible operations with memory ordering guarantees
250+
- Our solution: Volatile + barriers ≈ Atomic (good enough for most cases)
251+
252+
## Tools and Scripts Created
253+
254+
1. **run-example.sh** - Simplified deployment script
255+
2. **check_ptx.sh** - Analyzes PTX function signatures
256+
3. **analyze_ptx.sh** - Compares PTX between working/failing kernels
257+
4. **extract_sha512.sh** - Extracts specific kernel code
258+
5. **debug_launch.sh** - Runs with debug environment variables
259+
260+
## Lessons Learned
261+
262+
1. **PTX inspection is crucial** - The `trap` instructions were the smoking gun
263+
2. **Register count differences** - Can indicate optimization/compilation issues
264+
3. **Atomic operations in CUDA** - Are more limited than CPU architectures
265+
4. **Emulation is acceptable** - Perfect atomics aren't always needed; volatile + barriers work for many cases
