# SparseCompress

Efficient sparse tensor compression and inference library with bit-packed storage and sliding-window computation.

SparseCompress implements a novel approach to storing and computing with sparse neural networks:

- **Compact storage**: non-zero weights stored as a flat value array plus a bit-packed position mask
- **Memory efficiency**: up to 24x compression for 99%-sparse tensors
- **Sliding-window inference**: minimizes peak memory usage during the forward pass
- **Zero reconstruction error**: exact numerical accuracy
## Installation

```bash
pip install -r requirements.txt
```

## Quick Start

```python
from sparse_compress import SparseCompress
import numpy as np

# Initialize compressor
compressor = SparseCompress(sparsity_threshold=1e-6)

# Create a sparse weight matrix
weight = np.random.randn(1024, 512).astype(np.float32)
weight[np.random.random((1024, 512)) < 0.9] = 0  # 90% sparsity

# Compress the weight matrix
compressed = compressor.compress(weight)
print(f"Compression ratio: {compressor.compression_ratio(weight, compressed):.2f}x")

# Perform efficient matrix multiplication with a sliding window
input_batch = np.random.randn(32, 512).astype(np.float32)
output = compressor.sliding_window_matmul(compressed, input_batch, window_size=128)
```

## How It Works

### Bit-Packed Storage

- Non-zero values stored in a compact array
- Positions encoded as bit-packed mask (1 bit per element)
- Significant memory savings for sparse tensors (>80% sparsity)
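The storage scheme above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the library's internals: `compress_sketch` and `decompress_sketch` are hypothetical names, and the real `SparseCompress` may lay out its data differently.

```python
import numpy as np

def compress_sketch(tensor, threshold=1e-6):
    """Illustrative compression: non-zero values + bit-packed position mask."""
    flat = tensor.ravel()
    keep = np.abs(flat) > threshold      # boolean mask, 1 bit per element
    values = flat[keep]                  # compact array of non-zeros
    mask_bytes = np.packbits(keep)       # 8 mask bits per stored byte
    return values, mask_bytes, tensor.shape

def decompress_sketch(values, mask_bytes, shape):
    """Reconstruct the dense tensor; exact for values above the threshold."""
    n = int(np.prod(shape))
    keep = np.unpackbits(mask_bytes)[:n].astype(bool)
    flat = np.zeros(n, dtype=values.dtype)
    flat[keep] = values
    return flat.reshape(shape)

np.random.seed(0)  # for a reproducible demo
w = np.random.randn(64, 32).astype(np.float32)
w[np.random.random(w.shape) < 0.9] = 0
vals, bits, shape = compress_sketch(w)
restored = decompress_sketch(vals, bits, shape)
assert np.allclose(w, restored, atol=1e-6)  # lossless above the threshold
```

Note the trade-off in the mask: it costs 1 bit per element regardless of sparsity, which is what caps the best-case compression ratio.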
### Sliding-Window Computation

- Processes large matrices in configurable chunks
- Reduces peak memory usage during computation
- Maintains exact numerical precision
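The chunked computation can be sketched as follows. This is a hedged illustration of the idea, not the library's `sliding_window_matmul`: only `window_size` rows of the weight matrix are decompressed at a time, so peak extra memory is `window_size * n_cols` floats rather than the full dense matrix.

```python
import numpy as np

def sliding_window_matmul_sketch(values, mask_bytes, shape, x, window_size=128):
    """Compute x @ W.T from the packed form, decompressing W in row windows."""
    n_rows, n_cols = shape
    mask = np.unpackbits(mask_bytes)[: n_rows * n_cols].astype(bool).reshape(shape)
    # Row offsets into `values`: where each row's non-zeros start
    row_offsets = np.concatenate([[0], np.cumsum(mask.sum(axis=1))])
    out = np.empty((x.shape[0], n_rows), dtype=x.dtype)
    for start in range(0, n_rows, window_size):
        stop = min(start + window_size, n_rows)
        # Densify only the current window of rows
        window = np.zeros((stop - start, n_cols), dtype=values.dtype)
        window[mask[start:stop]] = values[row_offsets[start]:row_offsets[stop]]
        out[:, start:stop] = x @ window.T
    return out

# Usage: pack a small sparse matrix by hand and compare with dense matmul
np.random.seed(1)
W = np.random.randn(256, 64).astype(np.float32)
W[np.random.random(W.shape) < 0.9] = 0
flat_keep = W.ravel() != 0
vals, bits = W.ravel()[flat_keep], np.packbits(flat_keep)
x = np.random.randn(8, 64).astype(np.float32)
y = sliding_window_matmul_sketch(vals, bits, W.shape, x, window_size=64)
assert np.allclose(y, x @ W.T, atol=1e-4)
```

The "exact numerical precision" claim holds because each window is reconstructed bit-for-bit before the multiply; chunking changes memory use, not the values.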
## Performance

- Compression: 4-24x depending on sparsity level
- Memory: up to 16x reduction with sliding window
- Accuracy: zero reconstruction error
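The 4-24x range follows directly from the values-plus-mask layout. A back-of-envelope calculation (assuming float32 weights and no per-tensor header overhead, which is a simplification):

```python
# dense bytes  = N * 4
# sparse bytes = nnz * 4 + N / 8   (non-zero values + 1-bit-per-element mask)
N = 1024 * 512
for sparsity in (0.80, 0.90, 0.99):
    nnz = int(N * (1 - sparsity))
    ratio = (N * 4) / (nnz * 4 + N / 8)
    print(f"{sparsity:.0%} sparse -> {ratio:.1f}x")
# 80% sparse -> 4.3x
# 90% sparse -> 7.6x
# 99% sparse -> 24.2x
```

The mask's fixed cost of N/8 bytes is why the ratio saturates near 24x even as sparsity approaches 100%.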
## Examples

Run the examples to see SparseCompress in action:

```bash
# Run comprehensive examples and benchmarks
python example_usage.py

# Run tests
python test_sparse_compress.py
```

## Use Cases

- Training and inference with sparse neural networks
- Memory-constrained edge deployment
- Large-scale sparse matrix operations
- Efficient storage of pruned models
## API Reference

### `SparseCompress`

Main class for compression and computation operations.

- `compress(tensor)`: convert a dense tensor to the sparse format
- `decompress(sparse_tensor)`: reconstruct the full dense tensor
- `sliding_window_matmul(sparse_weight, input_tensor, window_size)`: memory-efficient matrix multiplication
- `get_sparsity_ratio(tensor)`: calculate the fraction of zero elements
- `compression_ratio(original, compressed)`: calculate space savings
### Compressed Representation

Data structure for the compressed representation.

- `values`: array of non-zero values
- `mask_bytes`: bit-packed position mask
- `shape`: original tensor dimensions
- `memory_size`: total memory usage in bytes
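A minimal sketch of such a container, assuming the fields listed above; `CompressedTensor` is a hypothetical name and the library's actual class may differ in name and layout:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CompressedTensor:  # hypothetical name for illustration
    values: np.ndarray       # non-zero values in row-major order
    mask_bytes: np.ndarray   # uint8 array, 1 bit per original element
    shape: tuple             # original tensor dimensions

    @property
    def memory_size(self) -> int:
        """Total memory usage in bytes (value array + packed mask)."""
        return self.values.nbytes + self.mask_bytes.nbytes
```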
## CUDA Acceleration

For GPU-accelerated sparse operations, use the CUDA implementation:

```python
from sparse_compress_cuda import SparseCompressCUDA
import numpy as np

# Initialize CUDA compressor
compressor = SparseCompressCUDA(sparsity_threshold=1e-6)

# Compress on GPU
weight = np.random.randn(1024, 512).astype(np.float32)
weight[np.random.random((1024, 512)) < 0.95] = 0  # 95% sparsity
compressed = compressor.compress(weight)

# Prepare for matrix multiplication (computes row offsets)
compressed = compressor.prepare_for_matmul(compressed)

# Fast sparse matmul on GPU
input_batch = np.random.randn(128, 512).astype(np.float32)
output = compressor.sparse_matmul(compressed, input_batch)

# Or use a sliding window for memory efficiency
output = compressor.sliding_window_matmul(compressed, input_batch, window_size=128)
```

### Requirements

- CUDA Toolkit 11.0+
- Numba with CUDA support:

  ```bash
  pip install numba
  ```
### Building the Raw CUDA Kernels

The raw CUDA kernels in `cuda/sparse_kernels.cu` can be compiled for maximum performance:

```bash
cd cuda
chmod +x build.sh
CUDA_ARCH=sm_80 ./build.sh  # Adjust for your GPU architecture
```

Key optimizations in the CUDA implementation:
- Parallel prefix sums using warp-level shuffle intrinsics
- Coalesced memory access patterns
- Shared memory tiling for input reuse in matmul
- Atomic operations for concurrent mask bit setting
## License

MIT