# SHA2 on Rust-CUDA: Debugging and Fix Documentation

## Problem Statement
The SHA2 crate compiles successfully with rust-cuda but fails at runtime with `LaunchFailed` errors when executing certain kernels (specifically SHA512).

## Investigation Process

### 1. Initial Symptom Discovery
- **Tool Used**: `deploy-cuda.sh` script to deploy and run examples on a remote CUDA machine
- **Finding**: SHA256 kernels work, but SHA512 kernels fail with `Error: LaunchFailed`

### 2. PTX Analysis
- **Tools Used**:
  - Custom bash scripts to extract and analyze PTX
  - `grep` to search for function definitions
  - `sed` and `awk` to extract specific kernel code

- **Key Findings**:
  - All 4 kernels (sha256_oneshot, sha256_incremental, sha512_oneshot, sha512_incremental) are present in the PTX
  - The SHA256 kernel has 9679 registers and 256 bytes of local memory
  - The SHA512 kernel has only 6 registers and contains `trap` instructions

### 3. Root Cause Identification
- **Location**: `/crates/rustc_codegen_nvvm/src/builder.rs`
- **Problem**: The `atomic_load` and `atomic_store` functions immediately insert `trap` instructions
- **Code Pattern**:
  ```rust
  fn atomic_load(...) {
      let (ty, f) = self.get_intrinsic("llvm.trap");
      self.call(ty, None, None, f, &[], None, None);
      // ... unreachable load
  }
  ```

### 4. Why SHA512 Triggers This
- SHA512 uses 64-bit operations more extensively than SHA256
- Some operation in SHA512 (likely array access or integer operations) gets compiled to atomic loads/stores
- NVVM IR doesn't support native atomic loads/stores, so the codegen inserts trap instructions

## The Fix

### Fix 1: Atomic Operations (Partial Success)
We fixed the atomic load/store operations that were generating trap instructions by emulating them with:
1. **Volatile loads/stores** - Prevent optimization and guarantee the memory access happens
2. **Memory barriers** (`llvm.nvvm.membar.sys`) - Enforce memory ordering

This fixed *some* trap instructions but not all.

### Root Cause 2: 128-bit Integer Operations
After fixing atomic operations, we discovered that rust-cuda also generates trap instructions for **any 128-bit integer intrinsics**, including:
- `ctlz`, `cttz`, `ctpop` (bit counting operations)
- `bswap`, `bitreverse` (byte/bit reversal)
- `rotate_left`, `rotate_right` (bit rotation)

SHA512 appears to use operations that produce 128-bit intermediate values, triggering these traps.
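These bit-counting intrinsics decompose cleanly into two 64-bit halves, which is the lowering the codegen would need to emit instead of a trap. A minimal host-side sketch (hypothetical helper names, not the rust-cuda codegen itself), checked against native `u128`:

```rust
/// Hypothetical software lowering of 128-bit bit-counting intrinsics,
/// expressed on a (hi, lo) pair of u64 halves as NVVM would see them.
fn ctpop128(hi: u64, lo: u64) -> u32 {
    // Population count is simply the sum over both halves.
    hi.count_ones() + lo.count_ones()
}

fn ctlz128(hi: u64, lo: u64) -> u32 {
    // Leading zeros: if the high half is non-zero it decides the answer,
    // otherwise add 64 and count in the low half.
    if hi != 0 { hi.leading_zeros() } else { 64 + lo.leading_zeros() }
}

fn cttz128(hi: u64, lo: u64) -> u32 {
    // Trailing zeros: symmetric to ctlz, starting from the low half.
    if lo != 0 { lo.trailing_zeros() } else { 64 + hi.trailing_zeros() }
}

fn main() {
    let x: u128 = 0x0123_4567_89ab_cdef_0000_0000_0000_ff00;
    let (hi, lo) = ((x >> 64) as u64, x as u64);
    assert_eq!(ctpop128(hi, lo), x.count_ones());
    assert_eq!(ctlz128(hi, lo), x.leading_zeros());
    assert_eq!(cttz128(hi, lo), x.trailing_zeros());
    println!("split-halves lowering matches native u128");
}
```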

### Implementation

#### For atomic_load:
```rust
fn atomic_load(...) -> &'ll Value {
    // Memory fence before the load for acquire semantics
    match order {
        Acquire | AcqRel | SequentiallyConsistent => {
            self.call(membar_sys, ...);
        }
        _ => {}
    }

    // Volatile load
    let load = LLVMBuildLoad(...);
    LLVMSetVolatile(load, True);

    // Fence after the load for sequential consistency
    if order == SequentiallyConsistent {
        self.call(membar_sys, ...);
    }

    load
}
```

#### For atomic_store:
The same pattern, with fences for release semantics and a volatile store.
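The shape of the emulation can be illustrated in host-side Rust, with `std::sync::atomic::fence` standing in for `llvm.nvvm.membar.sys` (an illustration of the pattern only, not the codegen code):

```rust
use std::ptr;
use std::sync::atomic::{fence, Ordering};

/// Mirrors the emitted pattern: fence, volatile access, optional second
/// fence for sequential consistency. On the GPU each `fence` here is a
/// `membar.sys` instruction.
unsafe fn emulated_load(ptr: *const u64, seq_cst: bool) -> u64 {
    fence(Ordering::SeqCst); // membar.sys before the load (acquire and stronger)
    let v = ptr::read_volatile(ptr);
    if seq_cst {
        fence(Ordering::SeqCst); // second membar.sys for SeqCst loads
    }
    v
}

unsafe fn emulated_store(ptr: *mut u64, val: u64, seq_cst: bool) {
    fence(Ordering::SeqCst); // membar.sys before the store (release and stronger)
    ptr::write_volatile(ptr, val);
    if seq_cst {
        fence(Ordering::SeqCst);
    }
}

fn main() {
    let mut slot: u64 = 0;
    unsafe {
        emulated_store(&mut slot, 42, false);
        assert_eq!(emulated_load(&slot, true), 42);
    }
    println!("volatile + fence round-trip ok");
}
```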

## Testing Strategy

### Test Files
1. `examples/cuda/sha2_crates_io/` - Contains comprehensive tests for both SHA256 and SHA512 using the sha2 crate from crates.io:
   - One-shot API (`Sha256::digest()`)
   - Incremental API (`hasher.update()`)

### Deployment Process
1. Use `run-example.sh` for quick deployment and testing
2. Verify GPU results against the CPU implementation
3. All 4 test cases should show "✅ Results match!"

## Current Investigation (2025-08-08)

### Root Cause Identified: SHA512 Uses u128 for Block Length Tracking

After extensive debugging and tracing, we've identified the exact source of the SHA512 failure:

#### The Smoking Gun
In `sha2-0.10.8/src/core_api.rs`, the `Sha512VarCore` struct uses a u128 field:
```rust
pub struct Sha512VarCore {
    state: consts::State512,
    block_len: u128, // <-- This is the problem!
}

impl UpdateCore for Sha512VarCore {
    fn update_blocks(&mut self, blocks: &[Block<Self>]) {
        self.block_len += blocks.len() as u128; // Line 108: u128 arithmetic
        compress512(&mut self.state, blocks);
    }
}
```

Compare this with `Sha256VarCore`, which uses `block_len: u64` - this is why SHA256 works but SHA512 doesn't. (The u128 is no accident: SHA-512's padding encodes the message length as a 128-bit value, whereas SHA-256 uses a 64-bit length field.)
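A hypothetical mirror of this bookkeeping makes the cascade concrete: every `u128` operation below (the `+=`, the multiplies for the final bit-length computation) is something the codegen must lower. The `bit_len` formula is our illustration of the length encoding, not code from the crate:

```rust
/// Hypothetical stand-in for Sha512VarCore's length bookkeeping.
struct LenTracker {
    block_len: u128, // number of 128-byte blocks processed so far
}

impl LenTracker {
    fn update(&mut self, blocks: usize) {
        // The exact `+=` from core_api.rs that kicks off u128 codegen.
        self.block_len += blocks as u128;
    }

    /// Total message length in bits, the kind of value written into
    /// SHA-512 padding (illustrative formula, assuming full blocks of
    /// 128 bytes plus `buffered_bytes` still in the buffer).
    fn bit_len(&self, buffered_bytes: usize) -> u128 {
        (self.block_len * 128 + buffered_bytes as u128) * 8
    }
}

fn main() {
    let mut t = LenTracker { block_len: 0 };
    t.update(2); // two full 128-byte blocks
    assert_eq!(t.bit_len(5), (2 * 128 + 5) * 8);
    println!("message length in bits: {}", t.bit_len(5));
}
```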

#### What Happens During Compilation

When SHA512::digest() is called:
1. It creates a `Sha512VarCore` with `block_len: u128 = 0`
2. It calls `update()`, which performs `self.block_len += blocks.len() as u128`
3. This triggers a cascade of u128 operations

Our debug output shows hundreds of u128 operations being generated, including:
- Basic arithmetic (add, sub, mul) - ✅ Successfully emulated
- Bitwise operations (and, or, xor, shifts) - ✅ Successfully emulated
- Comparisons (IntEQ, IntULT, IntUGE) - ✅ Work with LLVM
- **Division and remainder (udiv, urem)** - ❌ NOT IMPLEMENTED - causes traps!

#### The Missing Operations

The critical missing operations that cause the failure:
```
WARNING: Unimplemented i128 operation: udiv with 2 args, falling back to trap
WARNING: Unimplemented i128 operation: urem with 2 args, falling back to trap
```

These division operations are likely generated for:
- Calculating how many 128-byte blocks fit in the input (`length / 128`)
- Getting the remaining bytes (`length % 128`)

### Solution Implemented: Partial u128 Arithmetic Emulation

We've implemented emulation of most 128-bit integer operations using pairs of 64-bit values:

1. **Basic arithmetic**: add, sub, mul with proper carry/borrow handling
2. **Bitwise operations**: and, or, xor, not
3. **Shift operations**: shl, lshr, ashr with full support for shifts >= 64
4. **Unary operations**: neg (two's complement)

The implementation intercepts these operations in the LLVM builder macro and replaces them with emulated versions built from 64-bit operations that NVVM can handle.
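The split-into-halves scheme can be sketched in plain Rust (hypothetical helpers, not the builder code), with the carry handling and the >= 64 shift case checked against native `u128`:

```rust
/// A u128 as a (hi, lo) pair of u64 halves, as the emulation represents it.
fn add128(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    // Add the low halves, then propagate the carry into the high halves.
    let (lo, carry) = a.1.overflowing_add(b.1);
    let hi = a.0.wrapping_add(b.0).wrapping_add(carry as u64);
    (hi, lo)
}

fn shl128((hi, lo): (u64, u64), s: u32) -> (u64, u64) {
    match s {
        0 => (hi, lo),
        // Small shifts: bits spill from the low half into the high half.
        1..=63 => ((hi << s) | (lo >> (64 - s)), lo << s),
        // Shifts >= 64: the whole low half moves into the high half.
        64..=127 => (lo << (s - 64), 0),
        _ => (0, 0),
    }
}

fn main() {
    let a: u128 = u64::MAX as u128; // forces a carry into the high half
    let sum = add128((0, a as u64), (0, 1));
    assert_eq!(((sum.0 as u128) << 64) | sum.1 as u128, a + 1);

    let sh = shl128((0, 3), 70); // exercises the >= 64 branch
    assert_eq!(((sh.0 as u128) << 64) | sh.1 as u128, 3u128 << 70);
    println!("add/shl emulation matches native u128");
}
```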

### Still Missing: Division and Remainder

To fully fix SHA512, we need to implement:
- `emulate_i128_udiv`: 128-bit unsigned division
- `emulate_i128_urem`: 128-bit unsigned remainder
- `emulate_i128_sdiv`: 128-bit signed division (if needed)
- `emulate_i128_srem`: 128-bit signed remainder (if needed)

These are complex operations requiring long division algorithms built from 64-bit primitives.
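A restoring shift-subtract long division over (hi, lo) pairs is one way to sketch the missing `udiv`/`urem` (hypothetical helper name; the real emulation would be emitted as IR, and a production version would use a faster algorithm):

```rust
/// Restoring long division for a u128 represented as (hi, lo) u64 halves.
/// Returns (quotient, remainder). One bit per iteration, 128 iterations.
fn udiv_urem_128(n: (u64, u64), d: (u64, u64)) -> ((u64, u64), (u64, u64)) {
    assert!(d != (0, 0), "division by zero");
    let (mut q_hi, mut q_lo) = (0u64, 0u64);
    let (mut r_hi, mut r_lo) = (0u64, 0u64);
    for i in (0..128).rev() {
        // r = (r << 1) | bit i of n
        r_hi = (r_hi << 1) | (r_lo >> 63);
        r_lo <<= 1;
        r_lo |= if i >= 64 { (n.0 >> (i - 64)) & 1 } else { (n.1 >> i) & 1 };
        // if r >= d { r -= d; set bit i of q } (tuple compare == 128-bit compare)
        if (r_hi, r_lo) >= (d.0, d.1) {
            let (lo, borrow) = r_lo.overflowing_sub(d.1);
            r_hi = r_hi.wrapping_sub(d.0).wrapping_sub(borrow as u64);
            r_lo = lo;
            if i >= 64 { q_hi |= 1 << (i - 64) } else { q_lo |= 1 << i }
        }
    }
    ((q_hi, q_lo), (r_hi, r_lo))
}

fn main() {
    // The exact operations SHA-512 padding needs: length / 128 and length % 128.
    let len: u128 = (1 << 70) + 12345;
    let n = ((len >> 64) as u64, len as u64);
    let (q, r) = udiv_urem_128(n, (0, 128));
    let q128 = ((q.0 as u128) << 64) | q.1 as u128;
    let r128 = ((r.0 as u128) << 64) | r.1 as u128;
    assert_eq!(q128, len / 128);
    assert_eq!(r128, len % 128);
    println!("q = {q128}, r = {r128}");
}
```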

## Current Investigation (2025-08-07)

### New Finding: 128-bit Integer Intrinsics Generate Traps

After analyzing the PTX output and tracing through the codegen, we've discovered that **any 128-bit integer intrinsics** generate trap instructions in rust-cuda. This happens in `/crates/rustc_codegen_nvvm/src/intrinsic.rs`.

#### Code Path:
1. SHA512 operations trigger certain intrinsics (likely in internal operations)
2. These hit the `codegen_intrinsic_call` function in `intrinsic.rs`
3. For operations like `ctlz`, `cttz`, `ctpop`, `bswap`, `bitreverse`, `rotate_left`, `rotate_right`:
   - If the width is 128 bits, it calls `handle_128_bit_intrinsic`
   - This function attempts to use LLVM intrinsics like `llvm.ctpop.i128`
   - However, NVVM doesn't properly support these, leading to trap generation

#### Evidence from PTX:
```ptx
// SHA512 kernel contains multiple trap instructions
$L__BB2_6:
    setp.lt.u64 %p5, %rd2, 129;
    @%p5 bra $L__BB2_8;
    trap;

$L__BB2_8:
    trap;
```

#### The Problem Code (intrinsic.rs:27-67):
```rust
fn handle_128_bit_intrinsic<'ll>(
    b: &mut Builder<'_, 'll, '_>,
    name: Symbol,
    args: &[OperandRef<'_, &'ll Value>],
) -> &'ll Value {
    // CUDA 12+ has native __int128 support, so we can use LLVM intrinsics directly
    match name {
        sym::ctlz | sym::cttz => {
            // Tries to use llvm.ctlz.i128, etc.
            b.call_intrinsic(&llvm_name, &[args[0].immediate(), y])
        }
        // ... other 128-bit operations
        _ => {
            // Falls back to abort for unsupported ops
            b.abort_and_ret_i128()
        }
    }
}
```

The comment claims that CUDA 12+ supports 128-bit integers natively, but this clearly does not hold when going through NVVM.

## Next Steps

### To Fix the Issue:
1. ⏳ Implement proper 128-bit intrinsic emulation (similar to the atomic operations fix)
2. ⏳ Test with SHA512 to ensure it works
3. ⏳ Verify no performance regression in SHA256

### Potential Solutions:
1. **Emulate 128-bit operations using 64-bit operations** (as LLVM does for targets without native 128-bit support)
2. **Use software implementations** of these intrinsics
3. **Detect and warn** when 128-bit operations are used

### Future Improvements:
1. Add proper atomic operation support for common cases (atomicAdd, atomicCAS)
2. Improve error messages when CUDA features are unsupported
3. Add compile-time warnings for potentially problematic operations
4. Consider whether CUDA 12's __int128 support can be leveraged differently

## Technical Notes

### NVVM Limitations
- NVVM IR lacks native atomic load/store instructions
- It only has atomic RMW operations (add, sub, exchange, CAS)
- Memory ordering must be emulated with explicit barriers
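An alternative to the volatile + barrier fix would be to express loads and stores through the RMW operations NVVM does have: a CAS that swaps a value for itself acts as a read, and an exchange acts as a write. A host-side Rust sketch of that shape (an illustration of the idea, not the implemented fix):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// An atomic load built from CAS: compare_exchange returns the current
/// value whether or not the swap happens, so CAS(addr, 0, 0) reads the
/// slot atomically (writing 0 over 0 in the rare match case is harmless).
fn load_via_cas(a: &AtomicU64) -> u64 {
    match a.compare_exchange(0, 0, Ordering::SeqCst, Ordering::SeqCst) {
        Ok(v) | Err(v) => v,
    }
}

/// An atomic store built from exchange (atom.exch on the GPU side),
/// discarding the previous value.
fn store_via_exchange(a: &AtomicU64, val: u64) {
    a.swap(val, Ordering::SeqCst);
}

fn main() {
    let a = AtomicU64::new(7);
    assert_eq!(load_via_cas(&a), 7);
    store_via_exchange(&a, 99);
    assert_eq!(load_via_cas(&a), 99);
    println!("RMW-based load/store ok");
}
```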

### Memory Barrier Types in NVVM
- `llvm.nvvm.membar.sys` - System-wide memory fence
- `llvm.nvvm.membar.gl` - Global memory fence
- `llvm.nvvm.membar.cta` - Thread-block-level fence

### Volatile vs Atomic
- Volatile: Prevents compiler optimizations, ensures the memory access happens
- Atomic: Provides indivisible operations with memory ordering guarantees
- Our solution: Volatile + barriers ≈ atomic (good enough for most cases)

## Tools and Scripts Created

1. **run-example.sh** - Simplified deployment script
2. **check_ptx.sh** - Analyzes PTX function signatures
3. **analyze_ptx.sh** - Compares PTX between working and failing kernels
4. **extract_sha512.sh** - Extracts specific kernel code
5. **debug_launch.sh** - Runs with debug environment variables

## Lessons Learned

1. **PTX inspection is crucial** - The `trap` instructions were the smoking gun
2. **Register count differences** - Can indicate optimization or compilation issues
3. **Atomic operations in CUDA** - Are more limited than on CPU architectures
4. **Emulation is acceptable** - Perfect atomics aren't always needed; volatile + barriers work for many cases