Example: count_equal/naive.cu, count_equal_2d/naive.cu
Reduces atomic contention by using a hierarchy of atomic operations:
- Level 1: Threads within a block use
atomicAdd()to a__shared__variable - Level 2: One thread per block uses
atomicAdd()to global memory
__shared__ int block_count;
if (threadIdx.x == 0) block_count = 0;
__syncthreads();
// Each thread atomically updates shared memory
if (condition) atomicAdd(&block_count, 1);
__syncthreads();
// One thread updates global memory
if (threadIdx.x == 0) atomicAdd(output, block_count);Benefit: Reduces global atomic operations from O(N) to O(grid_size), dramatically lowering memory contention.
Example: count_equal/tree_reduction.cu, tensor_reduction_1d/tree_reduction.cu
Eliminates atomic operations within blocks by using parallel pairwise summation:
__shared__ float sdata[256];
sdata[tid] = local_value;
__syncthreads();
// Parallel reduction: stride halves each iteration
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
if (tid < stride) {
sdata[tid] += sdata[tid + stride];
}
__syncthreads();
}
if (tid == 0) atomicAdd(output, sdata[0]);Benefit: O(log n) time complexity per block with no intra-block atomic contention. Typically 2-5x faster than atomic-based reduction for large blocks.
Example: tensor_reduction_1d/tree_reduction.cu
Allows each thread to process multiple elements, handling arrays larger than grid size:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
float local = 0.0f;
for (int i = tid; i < N; i += stride) {
local += input[i];
}Benefit: Handles arbitrarily large arrays without multiple kernel launches. Accumulates in fast registers before writing to shared memory.
Example: color_inversion/naive.cu, count_equal_2d/naive.cu, matrix_multiplication/naive.cu, matrx_transpose/naive.cu
Maps 2D thread blocks naturally to 2D data structures:
dim3 threadsPerBlock(16, 16); // 256 threads total
dim3 blocksPerGrid((width + 15) / 16, (height + 15) / 16);
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;Benefit: Clearer code for 2D operations, better memory coalescing for row-adjacent threads.
Example: count_equal/naive.cu, count_equal/tree_reduction.cu, tensor_reduction_1d/tree_reduction.cu
Uses fast on-chip shared memory for intermediate results:
__shared__ int blockSum; // ~100x faster than global memoryBenefit: Shared memory has much lower latency (~20-30 cycles vs ~400-800 cycles for global memory) and higher bandwidth.
Example: geglu/naive.cu
Precomputes constants at compile time to eliminate runtime calculations:
constexpr float inv_sqrt2 = 0.7071067811865475f;
// Instead of: float inv_sqrt2 = 1.0f / sqrtf(2.0f);Benefit: Saves expensive floating-point operations (division, sqrt) in every thread execution.
Example: geglu/naive.cu
Uses CUDA's hardware-accelerated special function units (SFUs):
float gelu = 0.5f * x * (1.0f + erff(x * inv_sqrt2)); // erff() uses SFUBenefit: Functions like erff(), sinf(), cosf(), expf() are hardware-accelerated on GPUs, much faster than software implementations.
Example: relu/naive.cu
Checks float sign using integer bit operations instead of floating-point comparison:
if ((__float_as_uint(input[x]) >> 31) != 0) {
output[x] = 0.0f; // Negative
return;
}Benefit: Integer operations can be cheaper than FP comparisons on some GPU architectures. The sign bit is bit 31 in IEEE 754 format.
Example: count_equal/tree_reduction.cu, tensor_reduction_1d/tree_reduction.cu
Ensures consecutive threads access consecutive memory addresses:
sdata[tid] = (i < N) ? input[i] : 0; // Threads 0-31 access input[0-31]Benefit: GPU memory systems coalesce 32-thread warps into single memory transactions, maximizing bandwidth utilization.
Example: tensor_reduction_1d/tree_reduction.cu
Accumulates intermediate results in registers (fastest memory tier):
float local = 0.0f; // Register variable
for (int i = tid; i < N; i += stride) {
local += input[i]; // Accumulate in register
}
// Only write to shared memory once
sdata[threadIdx.x] = local;Benefit: Registers have single-cycle access latency. Reduces shared memory traffic by factor of (elements per thread).
Example: All multi-threaded block operations
Ensures all threads in a block reach synchronization points before proceeding:
if (threadIdx.x == 0) shared_var = 0;
__syncthreads(); // Wait for initialization
// All threads can now safely use shared_var
atomicAdd(&shared_var, value);
__syncthreads(); // Wait for all updates
if (threadIdx.x == 0) finalizeResult();Benefit: Critical for correctness in shared memory operations. Prevents race conditions and data hazards.
Example: All reduction patterns
Minimizes global memory atomic operations by designating one thread per block:
if (threadIdx.x == 0) { // Or (threadIdx.x == 0 && threadIdx.y == 0) for 2D
atomicAdd(global_output, block_result);
}Benefit: Reduces global atomics from (threads_per_block × num_blocks) to just (num_blocks).
| Technique | Key Benefit | Performance Impact |
|---|---|---|
| Two-Level Atomic Reduction | Reduces global atomic contention | ~10-100x fewer global atomics |
| Tree Reduction | Eliminates intra-block atomics | 2-5x faster than atomic approach |
| Grid-Stride Loop | Handles large arrays efficiently | Enables processing of arbitrarily large data |
| 2D Thread Organization | Natural mapping to 2D data | Better code clarity and memory access |
| Shared Memory | Fast intermediate storage | ~100x faster than global memory |
| Compile-Time Constants | Eliminates runtime computation | Saves expensive operations per thread |
| Built-in Math Functions | Hardware acceleration | 2-10x faster than software implementations |
| Sign Bit Extraction | Faster comparisons | Minor speedup, architecture-dependent |
| Memory Coalescing | Maximizes bandwidth | Up to 32x memory throughput |
| Register Accumulation | Fastest memory tier | Reduces memory traffic significantly |
| Synchronization | Ensures correctness | Required for shared memory operations |
| Single-Thread Updates | Minimizes global operations | O(blocks) instead of O(threads) |