- Introduction
- Flash Attention Algorithm
- Dynamic Mask Attention (DMA) Algorithm
- Comparative Analysis
- Flash-DMA Integration
- Technical Implementation Challenges
- Optimization Strategies
- Implementation Roadmap
- API Design
- Conclusion
This document provides a comprehensive explanation of Flash Attention and Dynamic Mask Attention (DMA) algorithms, along with a detailed proposal for integrating these approaches. The goal is to combine the memory efficiency of Flash Attention with the computational efficiency of DMA to create a high-performance attention mechanism for large sequence processing.
As Transformer models continue to scale to longer sequences and larger batch sizes, the attention mechanism becomes a significant bottleneck in terms of both memory usage and computational efficiency. Flash Attention addresses the memory bottleneck by using a block-based approach that avoids materializing the full attention matrix, while Dynamic Mask Attention reduces computational complexity by selectively focusing on the most important keys for each query.
By integrating these complementary approaches, we aim to create an attention mechanism that can efficiently handle extremely long sequences while maintaining high computational throughput and numerical accuracy.
Flash Attention is built on several key innovations that distinguish it from standard attention implementations:
- Block-based Processing: Instead of computing the entire attention matrix at once, Flash Attention divides it into blocks and processes them iteratively, substantially reducing memory requirements.
- Online Softmax Algorithm: Flash Attention uses an online algorithm to compute softmax progressively as blocks are processed, maintaining numerical stability without storing the full attention matrix.
- Tiling for Shared Memory: The algorithm uses carefully designed tiling strategies to maximize data reuse in GPU shared memory, minimizing global memory accesses.
- Mixed Precision Computation: Flash Attention performs accumulation in higher precision (e.g., FP32) while storing intermediate results in lower precision (e.g., FP16/BF16).
- Log-Sum-Exp (LSE) Tracking: For numerical stability and to enable the online softmax, the algorithm maintains running LSE values (the recurrence is written out after this list).
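For reference, the online softmax and LSE updates can be written as the following standard recurrence, where S^(j) denotes the score block for the j-th key block and m^(j), ℓ^(j), O^(j) are the running row-wise max, sum of exponentials, and unnormalized output (the block-superscript notation is ours):

$$
\begin{aligned}
m^{(j)} &= \max\!\big(m^{(j-1)},\ \operatorname{rowmax}(S^{(j)})\big), &
\tilde P^{(j)} &= \exp\!\big(S^{(j)} - m^{(j)}\big),\\
\ell^{(j)} &= e^{m^{(j-1)} - m^{(j)}}\,\ell^{(j-1)} + \operatorname{rowsum}\big(\tilde P^{(j)}\big), &
O^{(j)} &= e^{m^{(j-1)} - m^{(j)}}\,O^{(j-1)} + \tilde P^{(j)} V^{(j)},
\end{aligned}
$$

with the final output $O = O^{(T)} / \ell^{(T)}$ and $\mathrm{LSE} = m^{(T)} + \log \ell^{(T)}$ after the last key block $T$.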
At a high level, Flash Attention works by processing the attention computation in blocks:
- Initialization: Set up data structures for the output and the LSE values.
- Block-wise Computation: For each block of queries:
  - For each block of keys:
    - Load query and key blocks into shared memory
    - Compute the attention scores (Q·K^T) for this block
    - Apply masking (causal, padding, etc.) if needed
    - Update running max values and exponential sums for softmax
    - Load value block into shared memory
    - Compute weighted values and update output
- Normalization: Apply final normalization using the accumulated LSE values.
The key insight is that by processing blocks in a specific order and maintaining sufficient statistics (max values and sums for softmax), Flash Attention can produce exactly the same result as standard attention without materializing the full attention matrix.
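The following NumPy sketch illustrates this equivalence numerically. It is a reference for the algorithm above, not the CUDA kernel; block size and shapes are arbitrary.

```python
# Minimal NumPy sketch: block-wise attention with running max/sum statistics
# reproduces standard softmax attention.
import numpy as np

def standard_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def blockwise_attention(Q, K, V, block_n=16):
    scale = 1.0 / np.sqrt(Q.shape[-1])
    m = np.full((Q.shape[0], 1), -np.inf)    # running row-wise max
    l = np.zeros((Q.shape[0], 1))            # running sum of exponentials
    O = np.zeros((Q.shape[0], V.shape[-1]))  # unnormalized output accumulator
    for start in range(0, K.shape[0], block_n):
        Kb, Vb = K[start:start + block_n], V[start:start + block_n]
        S = Q @ Kb.T * scale
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)
        correction = np.exp(m - m_new)       # rescale previous statistics
        l = correction * l + P.sum(axis=-1, keepdims=True)
        O = correction * O + P @ Vb
        m = m_new
    return O / l

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 64, 32))
assert np.allclose(standard_attention(Q, K, V), blockwise_attention(Q, K, V))
```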
The Flash Attention implementation in `flash_attention_fwd_kernel.h` employs several sophisticated techniques:

```cpp
Tensor sQ = make_tensor(make_smem_ptr(reinterpret_cast<Element *>(smem_)),
                        typename Kernel_traits::SmemLayoutQ{});
Tensor sK = make_tensor(sQ.data() + (Kernel_traits::Share_Q_K_smem ? 0 : size(sQ)),
                        typename Kernel_traits::SmemLayoutKV{});
Tensor sV = make_tensor(sK.data() + size(sK), typename Kernel_traits::SmemLayoutKV{});
Tensor sVt = make_tensor(sV.data(), typename Kernel_traits::SmemLayoutVtransposed{});
```
The code meticulously manages shared memory to store query, key, and value blocks. It often uses the same memory for different tensors at different phases of the computation to minimize memory usage.
The core computation happens in two phases:
- Processing blocks with masking (for causal or local attention):

```cpp
for (int masking_step = 0; masking_step < n_masking_steps; ++masking_step, --n_block) {
    // Load V for current block
    // Compute Q*K^T for current block
    // Apply masking
    // Update softmax stats
    // Compute attention output with V
}
```

- Processing remaining blocks without masking:

```cpp
for (; n_block >= n_block_min; --n_block) {
    // Similar process but without causal masking
}
```
This separation optimizes performance by avoiding unnecessary masking operations where possible.
The online softmax algorithm is a critical component that enables processing in blocks:
```cpp
masking_step == 0
    ? softmax.template softmax_rescale_o</*Is_first=*/true, /*Check_inf=*/Is_causal || Is_local>(acc_s, acc_o, params.scale_softmax_log2)
    : softmax.template softmax_rescale_o</*Is_first=*/false, /*Check_inf=*/Is_causal || Is_local>(acc_s, acc_o, params.scale_softmax_log2);
```
For each block, it:
- Updates the running maximum value
- Scales previous accumulated values if the maximum changed
- Computes normalized values for the current block
- Updates the running sum of exponentials
This allows stable softmax computation without materializing the full attention matrix.
Flash Attention achieves significant performance improvements over standard attention:
- Memory Complexity: Reduces memory usage from O(N²) to O(N), where N is the sequence length.
- Memory Bandwidth Optimization: Carefully designed to minimize HBM (high-bandwidth memory) accesses through shared memory reuse.
- Throughput: Achieves up to 3-5x speedup over standard attention implementations for long sequences.
- Scaling Efficiency: Performance gains increase with sequence length, making it particularly effective for long-sequence tasks.
- Numerical Accuracy: Produces exactly the same results as standard attention (within floating-point error margins) despite the block-based approach.
Dynamic Mask Attention (DMA) introduces a different approach to optimizing attention by focusing on reducing the computational complexity:
- Selective Key Processing: DMA processes only a subset of keys for each query, determined by a learned importance criterion.
- Importance-based Selection: A learned projection matrix transforms values to generate importance scores that determine which keys to keep.
- Top-K Filtering: For each query, only the top-k keys with the highest importance scores are used for attention computation.
- Sparse Attention Computation: By computing attention only with selected keys, DMA substantially reduces the computational complexity.
- Dynamic Per-Query Selection: Unlike static sparse patterns, the selection is dynamic and specific to each query.
The DMA algorithm consists of the following steps:
- Value Transformation: Project value states using a learned matrix to create importance scores:

```python
dt_result = matmul(value_states.transpose(-2, -3).reshape(batch_size, key_len, -1), dt_proj.T)
```

- Importance Score Generation: Apply activation function and scaling to generate scores:

```python
zero_hold_states = exp(softplus(dt_result) * A)
```

- Masking: Apply causal or other masking to the importance scores if needed:

```python
zero_hold_state = zero_hold_states[b_idx, kv_idx, q_idx, :].masked_fill(causal_mask[b_idx, 0, q_idx, :] != 0, 0)
```

- Top-K Selection: Select the most important keys based on scores:

```python
topk_values, topk_indices = torch.topk(zero_hold_state, keep_window_size, dim=-1)
dynamic_mask = torch.zeros_like(zero_hold_state)
dynamic_mask.scatter_(-1, topk_indices, topk_values)
```

- Sparse Attention Computation: Compute attention only for the selected keys:

```python
mask_indices = non_zero_mask_indices(dynamic_mask)
k_vecs = key_states[b_idx, kv_idx, mask_indices, :]
v_vecs = value_states[b_idx, kv_idx, mask_indices, :]
```

- Weighted Sum Computation: Calculate the final attention output (a consolidated single-query sketch of these steps follows this list):

```python
attn_weight = torch.sum(q_vec.unsqueeze(0) * k_vecs, dim=-1)
attn_weight = attn_weight + dynamic_mask[mask_indices]
attn_weight = F.softmax(attn_weight, dim=-1)
attn_output = torch.sum(attn_weight.unsqueeze(1) * v_vecs, dim=0)
```
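To make the flow above concrete, here is a simplified, self-contained PyTorch sketch for a single head and a single query. Parameter names (`dt_proj`, `A`, `keep_window_size`) follow the reference code, but `dt_proj` is reduced to a vector and the batching, multi-head reshaping, and GQA grouping of `dma.py` are omitted.

```python
# Simplified single-head, single-query DMA sketch (illustrative, not dma.py itself).
import torch
import torch.nn.functional as F

def dma_single_query(q, K, V, dt_proj, A, q_pos, keep_window_size=4, causal=True):
    # q: [head_dim], K/V: [key_len, head_dim], dt_proj: [head_dim], A: scalar
    key_len = K.shape[0]
    # Steps 1-2: importance scores derived from the values
    zero_hold = torch.exp(F.softplus(V @ dt_proj) * A)              # [key_len]
    # Step 3: causal masking zeroes out keys after the query position
    if causal:
        zero_hold = zero_hold.masked_fill(torch.arange(key_len) > q_pos, 0.0)
    # Step 4: top-k selection
    if key_len > keep_window_size:
        topk_vals, topk_idx = torch.topk(zero_hold, keep_window_size, dim=-1)
        dynamic_mask = torch.zeros_like(zero_hold).scatter_(-1, topk_idx, topk_vals)
    else:
        dynamic_mask = zero_hold
    # Steps 5-6: sparse attention over the selected keys only
    mask_indices = dynamic_mask.nonzero(as_tuple=True)[0]
    k_sel, v_sel = K[mask_indices], V[mask_indices]
    attn = (q * k_sel).sum(-1) + dynamic_mask[mask_indices]
    attn = F.softmax(attn, dim=-1)
    return (attn.unsqueeze(-1) * v_sel).sum(0)                      # [head_dim]
```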
The Dynamic Mask Attention implementation from `dma.py` has several notable features:
```python
dt_result = torch.matmul(value_states.transpose(-2, -3).reshape(batch_size, key_len, -1), dt_proj.T)
zero_hold_states = torch.exp(F.softplus(dt_result) * A).transpose(-1, -2).unsqueeze(-2).expand(-1, -1, query_len, -1)
```
This transformation is crucial as it determines which keys are important for each query. The `dt_proj` matrix and coefficient `A` are learned parameters that control the selection process.
```python
if key_len > keep_window_size:
    topk_values, topk_indices = torch.topk(zero_hold_state, keep_window_size, dim=-1)
    dynamic_mask = torch.zeros_like(zero_hold_state)
    dynamic_mask.scatter_(-1, topk_indices, topk_values)
else:
    dynamic_mask = zero_hold_state
```
This selective process is what gives DMA its computational advantage. By processing only `keep_window_size` keys instead of the full `key_len`, it reduces the computational complexity significantly.
```python
mask_indices = non_zero_mask_indices(dynamic_mask)
if len(mask_indices) == 0:
    continue
k_vecs = key_states[b_idx, kv_idx, mask_indices, :]    # [keep_window_size, head_dim]
v_vecs = value_states[b_idx, kv_idx, mask_indices, :]  # [keep_window_size, head_dim]
```
This sparse computation is fundamentally different from both standard attention and Flash Attention. Instead of processing all keys, it only processes the selected ones, which can be a small fraction of the total.
```python
for q_group_idx in range(num_queries_per_kv):
    h_idx = kv_idx * num_queries_per_kv + q_group_idx
    q_vec = query_states[b_idx, h_idx, q_idx, :]  # [head_dim]
    # Compute attention and output for this query
    # ...
```
DMA naturally supports multi-query attention (MQA) and grouped-query attention (GQA) where multiple query heads can share the same key-value pairs.
Dynamic Mask Attention offers distinct performance advantages:
- Computational Complexity: Reduces computation from O(N²) to O(N*k) where k is the number of selected keys (typically k << N).
- Memory Usage: The Python implementation still requires O(N²) memory for initialization, but a CUDA implementation could achieve O(N) memory usage.
- Adaptability: The key selection adapts to the content, making it more effective for diverse attention patterns compared to fixed sparse patterns.
- Scalability: Performance improvements increase with sequence length, similar to Flash Attention but through a different mechanism.
- Training Dynamics: The key selection mechanism is learned during training, allowing the model to adaptively focus on relevant information.
Here's a comparison of the three attention mechanisms:
| Feature | Standard Attention | Flash Attention | Dynamic Mask Attention |
|---|---|---|---|
| Computational Complexity | O(N²) | O(N²) | O(N*k) where k << N |
| Memory Complexity | O(N²) | O(N) | O(N²) in Python, O(N) possible in CUDA |
| Key Processing Strategy | All keys | All keys | Selected top-k keys |
| Implementation Approach | Dense matmul | Block-based tiling | Sparse selection and computation |
| Masking Support | Fixed masks | Fixed masks | Learned, dynamic masks |
| MQA/GQA Support | Requires adaptation | Specialized variants | Native support |
Standard Attention:
- Computes and stores the entire N×N attention matrix
- Memory usage: O(N²)
- Computation: O(N²D) where D is head dimension
Flash Attention:
- Never materializes the full attention matrix
- Memory usage: O(N) + O(B²) where B is block size
- Computation: Still O(N²D) but with better constants and memory locality
Dynamic Mask Attention:
- Only computes attention for selected keys
- Memory usage: O(N²) in naive implementation, O(N) in optimized version
- Computation: O(NkD) where k is the number of selected keys (see the quick cost comparison below)
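The following back-of-the-envelope Python sketch restates these asymptotic costs as concrete element counts. It is illustrative only: constant factors, batch, and head dimensions are ignored, and the block size `B` is an assumed example value.

```python
# Rough cost model for the three variants (element counts, not measured numbers).
def attention_costs(N, D, k, B=64):
    return {
        "standard": {"memory": N * N,     "compute": N * N * D},
        "flash":    {"memory": N + B * B, "compute": N * N * D},
        "dma":      {"memory": N * N,     "compute": N * k * D},  # O(N) memory possible in CUDA
    }

# Example: N = 16384 keys, D = 64, k = 1024 selected keys per query
for name, cost in attention_costs(16384, 64, 1024).items():
    print(name, cost)
```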
The integrated Flash-DMA approach offers significant performance benefits that can be quantified:
Memory Complexity:
- Standard Attention: O(B×H×N²)
- Flash Attention: O(B×H×N)
- Flash-DMA: O(B×H×N)
Where B = batch size, H = number of heads, N = sequence length
Computational Complexity:
- Standard Attention: O(B×H×N²×D)
- Flash Attention: O(B×H×N²×D)
- Flash-DMA: O(B×H×N×k×D)
Where D = head dimension, k = average number of selected keys per query
Expected Speedup Model:
- For sequence length N and selection ratio r = k/N:
  - Theoretical speedup vs. Flash Attention: ~1/r
  - Practical speedup accounting for overhead: ~1/(r + c)

Where c is an implementation-dependent constant representing overhead (estimated 0.05-0.1); the short calculation after the table below reproduces the projected values from this model.
Projected Performance:
| Sequence Length | Selection Ratio | Theoretical Speedup | Estimated Practical Speedup |
|---|---|---|---|
| 1,024 | 0.2 | 5.0× | 3.3-4.0× |
| 4,096 | 0.1 | 10.0× | 5.0-6.7× |
| 16,384 | 0.05 | 20.0× | 6.7-10.0× |
| 65,536 | 0.025 | 40.0× | 8.0-13.3× |
Note: These estimates assume efficient sparse operations implementation and may vary based on hardware and specific workloads.
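The table follows directly from the model above; the snippet below recomputes it with the overhead constant c taken at the two ends of the stated 0.05-0.1 range:

```python
# Reproduce the projected-speedup table: theoretical ~ 1/r, practical ~ 1/(r + c).
for seq_len, r in [(1024, 0.2), (4096, 0.1), (16384, 0.05), (65536, 0.025)]:
    theoretical = 1.0 / r
    practical_lo, practical_hi = 1.0 / (r + 0.1), 1.0 / (r + 0.05)
    print(f"{seq_len:>6} | r={r:<5} | {theoretical:4.1f}x | {practical_lo:.1f}-{practical_hi:.1f}x")
```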
Standard Attention:
- Best for short sequences where memory is not a constraint
- Simplest to implement and debug
- Compatible with all existing optimization techniques
Flash Attention:
- Ideal for medium to long sequences
- When memory bandwidth is the bottleneck
- When exact attention computation is required
Dynamic Mask Attention:
- Best for very long sequences where computational cost is prohibitive
- When the attention pattern is naturally sparse
- When approximate attention is acceptable
Combined Flash-DMA:
- Optimal for extremely long sequences (tens of thousands of tokens)
- When both memory and computation are constraints
- For applications requiring selective attention with efficient memory usage
The integration of Flash Attention and Dynamic Mask Attention creates a powerful combination:
- Complementary Strengths: Flash Attention optimizes memory usage, while DMA reduces computation through selective key processing.
- Extended Sequence Length Support: The combined approach could efficiently handle sequences of 100K tokens or more.
- Memory and Computation Optimization: Achieves both O(N) memory complexity and O(N*k) computational complexity.
- Hardware Efficiency: Maintains Flash Attention's optimized memory access patterns while reducing the number of operations.
- Adaptive Processing: The dynamic selection mechanism allows the model to focus computational resources on the most relevant parts of the input.
The integrated Flash-DMA approach modifies the Flash Attention algorithm in three key ways:
- Key Selection Phase: Adds a preprocessing step that determines important keys using the DMA selection mechanism.
- Sparse Block Processing: Modifies the block-based processing to only compute attention for selected keys within each block.
- Memory Management for Selected Indices: Adds efficient handling of the selected key indices.
The high-level architecture looks like this:
```
┌─────────────────┐      ┌───────────────────┐      ┌──────────────────┐
│                 │      │                   │      │                  │
│  Key Selection  │─────▶│    Sparse Block   │─────▶│      Output      │
│      Phase      │      │     Processing    │      │  Normalization   │
│                 │      │                   │      │                  │
└─────────────────┘      └───────────────────┘      └──────────────────┘
```
The key selection phase can be integrated as a preprocessing step:
```cpp
template<typename Kernel_traits, bool Is_causal, typename Params>
inline __device__ void compute_key_importance(
    const Params &params,
    Tensor& value_states,
    Tensor& dt_proj,
    Tensor& importance_scores,
    float scale_factor
) {
    // Get thread and block indices
    const int tidx = threadIdx.x;
    const int bidb = blockIdx.y;
    const int bidh = blockIdx.z;

    // Transform values using the projection matrix
    // This is equivalent to: dt_result = matmul(value_states, dt_proj.T)
    // But implemented as a block-based matrix multiplication

    // Apply softplus activation and scaling
    // This is equivalent to: importance_scores = exp(softplus(dt_result) * scale_factor)
    // But implemented in a numerically stable way

    // Handle causal masking if needed
    if (Is_causal) {
        apply_causal_mask(importance_scores, params);
    }
}
```
This would be followed by a parallel top-k selection kernel that identifies the most important keys:
```cpp
template<typename Kernel_traits, typename Params>
inline __device__ void select_top_k_keys(
    const Params &params,
    Tensor& importance_scores,
    Tensor& selected_indices,
    int keep_window_size
) {
    // Use parallel reduction to find the top-k values and their indices
    // Store the result in selected_indices
}
```
The core block processing loop would be modified to only process selected keys:
```cpp
for (int masking_step = 0; masking_step < n_masking_steps; ++masking_step, --n_block) {
    // Determine which keys in this block were selected
    int block_start = n_block * kBlockN;
    int block_end = min((n_block + 1) * kBlockN, params.seqlen_k);

    // Get indices of selected keys in this block
    Tensor block_selected_indices = get_block_selected_indices(
        selected_indices, block_start, block_end
    );

    if (block_selected_indices.size() == 0) {
        // Skip this block if no keys were selected
        continue;
    }

    // Load only selected keys and values
    load_selected_kv(block_selected_indices, tKgK, tKsK, tVgV, tVsV);

    // Compute Q*K^T only for selected keys
    compute_sparse_qk(acc_s, tSrQ, tSrK, block_selected_indices);

    // Apply masking, softmax, etc. similar to Flash Attention

    // Compute attention output with selected V
    compute_sparse_attention_output(acc_o, acc_s, tOrVt, block_selected_indices);
}
```
Efficient management of selected key indices is critical for performance:
- Global Storage: Store all selected indices in global memory, compressed to minimize space.
- Block-Level Filtering: For each block, filter the global indices to identify which ones fall within the current block.
- Shared Memory Caching: Load relevant indices for the current block into shared memory for fast access.
```cpp
__shared__ int smem_indices[MAX_SELECTED_PER_BLOCK];
__shared__ int smem_num_selected;

if (threadIdx.x == 0) {
    // Initialize counter
    smem_num_selected = 0;
}
__syncthreads();

// Each thread processes some subset of the global indices
for (int i = threadIdx.x; i < num_global_selected; i += blockDim.x) {
    int global_idx = global_selected_indices[i];
    if (global_idx >= block_start && global_idx < block_end) {
        // Use atomic add to get a unique position in the shared memory array
        int pos = atomicAdd(&smem_num_selected, 1);
        if (pos < MAX_SELECTED_PER_BLOCK) {
            smem_indices[pos] = global_idx - block_start;  // Convert to block-local index
        }
    }
}
__syncthreads();
```
Implementing sparse operations efficiently in CUDA presents several challenges:
- Irregular Memory Access: Accessing only selected elements leads to non-coalesced memory access patterns, which can significantly degrade performance.
- Sparse Matrix Multiplication: Efficiently computing Q·K^T and attention·V when only a subset of K is used requires specialized sparse matrix multiplication routines.
- Dynamic Sparsity Pattern: Unlike static sparse matrices, the sparsity pattern in DMA is determined at runtime and differs for each query.
Potential solutions include:
- Specialized Sparse Kernels: Implementing optimized CUDA kernels for the specific sparsity patterns encountered in DMA.
- Coalescing Through Reordering: Reordering selected keys to improve memory access patterns (a short sketch follows this list).
- Batched Processing: Grouping queries with similar selected keys to reduce divergence.
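As a sketch of the reordering idea: sorting each query's selected key indices makes the subsequent K/V gathers touch ascending, and often nearby, addresses without changing the attention result, since softmax attention is invariant to key order. The function and tensor names below are illustrative, not part of the existing code.

```python
# Hypothetical helper: sort selected indices per query to improve gather locality.
import torch

def reorder_selected_keys(selected_indices: torch.Tensor) -> torch.Tensor:
    # selected_indices: [batch, heads, query_len, top_k], in arbitrary top-k order.
    # Sorting along the last dim groups nearby keys; the output of attention is
    # order-invariant, so this is purely a memory-access optimization.
    return selected_indices.sort(dim=-1).values

# Example gather with sorted indices (single batch/head/query shown for brevity):
# k_sel = key_states[0, 0].index_select(0, reorder_selected_keys(idx)[0, 0, q])
```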
The core Flash-DMA integration requires specific CUDA implementation techniques:
```cpp
template<typename Kernel_traits, bool Is_causal>
__global__ void compute_and_select_keys_kernel(
    const typename Kernel_traits::Params params,
    int* selected_indices,    // Output: indices of selected keys [batch, heads, query, top_k]
    float* importance_scores  // Optional: store importance scores for debugging
) {
    // Block/thread indices
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    const int batch_id = blockIdx.y;
    const int head_id = blockIdx.z;

    // Shared memory for collaborative filtering
    __shared__ float sm_scores[BLOCK_SIZE];
    __shared__ int sm_indices[BLOCK_SIZE];

    // Each thread computes importance scores for a subset of keys
    for (int key_idx = tid; key_idx < params.seqlen_k; key_idx += BLOCK_SIZE) {
        // Project value to get importance score using dt_proj
        float score = 0.0f;
        for (int d = 0; d < params.d; d++) {
            int v_idx = batch_id * params.v_batch_stride + key_idx * params.v_row_stride +
                        head_id * params.v_head_stride + d;
            int proj_idx = head_id * params.dt_head_stride + d;
            score += params.v_ptr[v_idx] * params.dt_proj_ptr[proj_idx];
        }

        // Apply softplus and scaling
        score = log1pf(expf(score)) * params.a_coef_ptr[head_id];

        // Apply causal masking if needed
        if (Is_causal && key_idx >= params.query_positions[bid]) {
            score = -INFINITY;
        }

        // Store in shared memory
        sm_scores[tid] = score;
        sm_indices[tid] = key_idx;
        __syncthreads();

        // Parallel reduction for top-k
        for (int k = 0; k < params.keep_window_size && k < BLOCK_SIZE; k++) {
            // Find max score and its position
            float max_score = -INFINITY;
            int max_pos = -1;
            for (int i = 0; i < BLOCK_SIZE; i++) {
                if (sm_scores[i] > max_score) {
                    max_score = sm_scores[i];
                    max_pos = i;
                }
            }
            // Only thread 0 writes the result
            if (tid == 0 && max_pos >= 0) {
                int out_idx = batch_id * params.batch_stride + head_id * params.head_stride +
                              bid * params.query_stride + k;
                selected_indices[out_idx] = sm_indices[max_pos];
                if (importance_scores != nullptr) {
                    importance_scores[out_idx] = sm_scores[max_pos];
                }
            }
            __syncthreads();
            // Mark the max element as processed
            if (tid == max_pos) {
                sm_scores[tid] = -INFINITY;
            }
            __syncthreads();
        }
    }
}
```
```cpp
template<typename Kernel_traits, bool Is_causal>
__device__ void process_sparse_block(
    const typename Kernel_traits::Params params,
    const int* selected_indices,  // [batch, heads, query, top_k]
    int block_start,              // Starting key index of current block
    int block_end,                // Ending key index of current block
    Tensor& tSrQ,                 // Query in registers
    Tensor& acc_s,                // Accumulator for scores
    Tensor& acc_o                 // Accumulator for output
) {
    const int tid = threadIdx.x;
    const int batch_id = blockIdx.y;
    const int head_id = blockIdx.z;
    const int query_block = blockIdx.x;
    const int query_idx = query_block * BLOCK_SIZE + tid / 32;  // Query index for this thread

    // Find which selected keys fall into this block
    __shared__ int sm_block_indices[MAX_SELECTED_PER_BLOCK];
    __shared__ int sm_block_count;
    if (tid == 0) {
        sm_block_count = 0;
    }
    __syncthreads();

    // Each thread checks some of the selected indices
    const int idx_offset = batch_id * params.batch_stride + head_id * params.head_stride +
                           query_idx * params.query_stride;
    for (int k = tid % 32; k < params.keep_window_size; k += 32) {
        int key_idx = selected_indices[idx_offset + k];
        if (key_idx >= block_start && key_idx < block_end) {
            int pos = atomicAdd(&sm_block_count, 1);
            if (pos < MAX_SELECTED_PER_BLOCK) {
                sm_block_indices[pos] = key_idx - block_start;  // Convert to block-local index
            }
        }
    }
    __syncthreads();

    // Process only selected keys in this block
    for (int i = 0; i < sm_block_count; i++) {
        // Load key and compute attention score for selected key
        int local_key_idx = sm_block_indices[i];
        float key_val = load_key_value(params, batch_id, head_id, block_start + local_key_idx);

        // Compute score and update accumulators similar to Flash Attention
        // But only for the selected keys
        // ...
    }
}
```
CUDA threads within a warp execute in lockstep, making load balancing critical:
- Thread Divergence: When different threads process different numbers of keys, warp divergence can severely impact performance.
- Workload Distribution: Efficiently distributing the selected keys across threads and warps to maximize utilization.
- Idle Threads: Managing threads that have no keys to process in their assigned range.
Strategies to address these challenges:
- Work Stealing: Implementing work-stealing algorithms to redistribute work among threads.
- Warp-Level Primitives: Using warp-level voting and shuffle operations for efficient coordination.
- Persistent Threads: Keeping threads active and continuously assigning new work as it becomes available.
Instead of assigning fixed thread responsibilities, we implement cooperative mapping where threads dynamically process available work:
```cpp
__device__ void cooperative_sparse_processing(
    const Params& params,
    int* selected_indices,
    int num_selected
) {
    __shared__ int work_counter;
    if (threadIdx.x == 0) work_counter = 0;
    __syncthreads();

    while (true) {
        // Atomically grab the next chunk of work
        int work_idx = -1;
        if (threadIdx.x % 32 == 0) {
            work_idx = atomicAdd(&work_counter, WORK_CHUNK_SIZE);
        }
        // Broadcast result to all threads in warp
        work_idx = __shfl_sync(0xffffffff, work_idx, 0);

        if (work_idx >= num_selected) break;

        // Process this chunk of work
        int end_idx = min(work_idx + WORK_CHUNK_SIZE, num_selected);
        for (int i = work_idx + threadIdx.x % 32; i < end_idx; i += 32) {
            // Process selected_indices[i]
        }
    }
}
```
We dynamically choose between sparse and dense processing based on key selection density:
- Query Binning: Group queries based on the number of selected keys in each block

```cpp
__shared__ int sparse_queries[MAX_QUERIES_PER_BLOCK];
__shared__ int dense_queries[MAX_QUERIES_PER_BLOCK];
__shared__ int sparse_count, dense_count;

// Determine processing mode for each query
if (threadIdx.x < num_queries_in_block) {
    int query_idx = block_query_base + threadIdx.x;
    int selected_in_block = count_selected_keys_in_block(query_idx, block_idx);
    float density = (float)selected_in_block / BLOCK_SIZE;
    if (density > DENSITY_THRESHOLD) {
        int idx = atomicAdd(&dense_count, 1);
        dense_queries[idx] = query_idx;
    } else {
        int idx = atomicAdd(&sparse_count, 1);
        sparse_queries[idx] = query_idx;
    }
}
```

- Two-Phase Processing: Process dense queries first, then sparse queries

```cpp
// Process dense queries (standard Flash Attention)
for (int i = 0; i < dense_count; i++) {
    process_query_dense(dense_queries[i]);
}
// Process sparse queries (DMA approach)
for (int i = 0; i < sparse_count; i++) {
    process_query_sparse(sparse_queries[i]);
}
```
For highly variable workloads, implement dynamic redistribution:
- Work Queue System: Maintain a queue of pending work
- Persistent Threads: Keep threads active and pulling from queue
- Work Stealing: Allow idle blocks to steal work from busy ones
```cpp
// Global work queue in device memory
struct WorkQueue {
    int queue[MAX_WORK_ITEMS];
    int head;
    int tail;
};

__device__ void process_with_persistent_threads(WorkQueue* queue) {
    while (true) {
        // Atomically get next work item
        int work_idx = -1;
        if (threadIdx.x == 0) {
            if (queue->head < queue->tail) {
                work_idx = atomicAdd(&queue->head, 1);
            }
        }
        work_idx = __shfl_sync(0xffffffff, work_idx, 0);

        if (work_idx < 0 || work_idx >= queue->tail) return;

        // Process this work item
        process_work_item(queue->queue[work_idx]);
    }
}
```
These strategies ensure high GPU utilization even with irregular sparsity patterns, minimizing the impact of thread divergence and load imbalance.
Ensuring correctness and performance:
- Correctness Validation:
  - Compare outputs against standard attention for small examples (see the test sketch after this list)
  - Validate intermediate results at each stage
  - Test with various mask configurations and sequence lengths
- Performance Validation:
  - Benchmark against Flash Attention and DMA separately
  - Test with varying sequence lengths and batch sizes
  - Measure memory usage and computational throughput
- Integration Testing:
  - Verify behavior when integrated into transformer models
  - Test impact on model convergence and accuracy
  - Validate across different hardware platforms
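One possible shape for the correctness check is sketched below. Both `flash_dma_attention` (the API proposed later in this document) and `dense_dma_reference` (a dense PyTorch implementation that applies the same dynamic mask) are assumed rather than existing functions, and the tolerances are illustrative FP16 values.

```python
# Hypothetical correctness test: fused kernel vs. a dense PyTorch reference.
import torch

def test_against_dense_reference(flash_dma_attention, dense_dma_reference):
    torch.manual_seed(0)
    B, N, H, D = 2, 128, 4, 64
    q = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)
    k = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)
    v = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)
    dt_proj = torch.randn(H, H * D, device="cuda", dtype=torch.float16)
    a_coef = torch.rand(H, device="cuda", dtype=torch.float16)

    out = flash_dma_attention(q, k, v, dt_proj, a_coef, keep_window_size=32, causal=True)
    ref = dense_dma_reference(q, k, v, dt_proj, a_coef, keep_window_size=32, causal=True)
    # Loose FP16 tolerances to account for block-wise accumulation order
    torch.testing.assert_close(out, ref, atol=2e-2, rtol=2e-2)
```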
To rigorously evaluate the Flash-DMA integration, we propose a structured benchmarking framework:
| Metric | Description | Measurement Method |
|---|---|---|
| Throughput | Tokens/second processed | Time entire forward pass and divide by total tokens |
| Memory Usage | Peak memory consumption | Track GPU memory allocation high-water mark |
| Computational Efficiency | FLOPS utilization | Compare achieved vs. theoretical FLOPS |
| Sparsity Efficiency | Speedup relative to density | Measure performance across varying selection ratios |
| Scaling Efficiency | Performance vs. sequence length | Benchmark with exponentially increasing lengths |
- Synthetic Benchmarks
  - Uniform random data
  - Controlled sparsity patterns
  - Variable sequence lengths (128 to 128K)
  - Different batch sizes and head configurations
- Real-world Workloads
  - Language modeling (long document processing)
  - Vision transformers (high-resolution images)
  - Multi-modal transformers
  - Time series analysis
- Ablation Studies
  - Effect of selection ratio (k/N)
  - Impact of block sizes
  - Influence of key selection algorithms
  - Dense vs. sparse block processing thresholds
```python
import numpy as np
import torch

# WARMUP_STEPS, BENCHMARK_STEPS, batch_size, seq_len, and run_forward are assumed
# to be defined elsewhere in the benchmark harness.

def run_benchmark_suite(models, datasets, configs):
    results = {}
    for model_name, model_fn in models.items():
        for dataset_name, dataset in datasets.items():
            for config_name, config in configs.items():
                # Initialize model with configuration
                model = model_fn(**config)
                # Warm-up runs
                for _ in range(WARMUP_STEPS):
                    run_forward(model, dataset.get_batch(batch_size))
                # Timed runs
                times = []
                mem_usage = []
                for i in range(BENCHMARK_STEPS):
                    batch = dataset.get_batch(batch_size)
                    # Record currently allocated memory, then reset the peak counter
                    mem_before = torch.cuda.memory_allocated()
                    torch.cuda.reset_peak_memory_stats()
                    # Time the forward pass
                    start = torch.cuda.Event(enable_timing=True)
                    end = torch.cuda.Event(enable_timing=True)
                    start.record()
                    output = run_forward(model, batch)
                    end.record()
                    torch.cuda.synchronize()
                    # Record metrics
                    elapsed_time = start.elapsed_time(end) / 1000  # seconds
                    times.append(elapsed_time)
                    # Additional peak memory consumed during the forward pass
                    mem_used = torch.cuda.max_memory_allocated() - mem_before
                    mem_usage.append(mem_used)
                # Save results
                results[f"{model_name}_{dataset_name}_{config_name}"] = {
                    "mean_time": np.mean(times),
                    "std_time": np.std(times),
                    "tokens_per_second": batch_size * seq_len / np.mean(times),
                    "mean_memory": np.mean(mem_usage),
                    "peak_memory": np.max(mem_usage),
                }
    return results
```
- Performance Curves
  - Speedup vs. sequence length
  - Memory usage vs. sequence length
  - Throughput vs. sparsity ratio
- Profiling Integration
  - NVIDIA Nsight integration
  - Kernel execution timelines
  - Memory access pattern analysis
- Automated Regression Testing
  - CI/CD integration
  - Comparison against baseline implementations
  - Performance regression alerts
This comprehensive benchmarking framework will provide actionable insights for optimizing the Flash-DMA implementation and quantify its benefits across diverse workloads and configurations.
A user-friendly Python API for Flash-DMA:
```python
from typing import Optional, Tuple, Union

import torch


def flash_dma_attention(
    query: torch.Tensor,           # [batch_size, seq_len_q, num_heads, head_dim]
    key: torch.Tensor,             # [batch_size, seq_len_k, num_kv_heads, head_dim]
    value: torch.Tensor,           # [batch_size, seq_len_k, num_kv_heads, head_dim]
    dt_proj: torch.Tensor,         # [num_kv_heads, num_kv_heads * head_dim]
    a_coef: torch.Tensor,          # [num_kv_heads]
    keep_window_size: int = 1024,  # Number of keys to keep per query
    dropout_p: float = 0.0,
    causal: bool = False,
    softmax_scale: Optional[float] = None,
    return_attn_probs: bool = False,
) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
    """
    Compute attention using the integrated Flash-DMA approach.

    Args:
        query: Query tensor
        key: Key tensor
        value: Value tensor
        dt_proj: Projection matrix for value transformation
        a_coef: Scaling coefficient for importance scores
        keep_window_size: Number of keys to keep per query
        dropout_p: Dropout probability
        causal: Whether to apply causal masking
        softmax_scale: Scale factor for softmax (default: 1/sqrt(head_dim))
        return_attn_probs: Whether to return attention probabilities

    Returns:
        attention_output: Output tensor
        attention_probs: Attention probabilities (optional)
    """
    # Implementation
```
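A possible call into this API is shown below. The `flash_dma` module name and the concrete shapes are illustrative assumptions, not part of the proposal itself.

```python
# Hypothetical usage of the proposed API.
import torch
from flash_dma import flash_dma_attention  # assumed package/module name

B, N, H, H_kv, D = 2, 32768, 16, 4, 64
q = torch.randn(B, N, H, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, N, H_kv, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, N, H_kv, D, device="cuda", dtype=torch.bfloat16)
dt_proj = torch.randn(H_kv, H_kv * D, device="cuda", dtype=torch.bfloat16)
a_coef = torch.rand(H_kv, device="cuda", dtype=torch.bfloat16)

out = flash_dma_attention(
    q, k, v, dt_proj, a_coef,
    keep_window_size=2048,
    causal=True,
)  # -> [B, N, H, D]
```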
Configuration options to control behavior (an example configuration appears after this list):

- Selection Parameters:
  - `keep_window_size`: Number of keys to keep per query
  - `selection_method`: Algorithm for top-k selection ("exact", "approximate")
  - `min_density_threshold`: Minimum density for dense processing
- Processing Options:
  - `block_size`: Size of blocks for processing
  - `mixed_processing`: Whether to use mixed dense/sparse processing
  - `use_reordering`: Whether to reorder keys for better memory access
- Memory Management:
  - `max_sequence_length`: Maximum supported sequence length
  - `max_batch_size`: Maximum supported batch size
  - `max_selection_ratio`: Maximum ratio of keys to select (for memory allocation)
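One way these options might be grouped; the dictionary itself is hypothetical and the values are placeholders, but the option names come from the list above.

```python
# Example configuration (illustrative values).
flash_dma_config = {
    # Selection parameters
    "keep_window_size": 2048,
    "selection_method": "approximate",
    "min_density_threshold": 0.5,
    # Processing options
    "block_size": 128,
    "mixed_processing": True,
    "use_reordering": True,
    # Memory management
    "max_sequence_length": 131072,
    "max_batch_size": 8,
    "max_selection_ratio": 0.25,
}
```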
Seamless integration with existing frameworks:
- PyTorch Integration:
  - Drop-in replacement for `torch.nn.MultiheadAttention` (a module sketch follows this list)
  - Compatible with PyTorch's autograd system
  - Support for distributed training
- Hugging Face Transformers:
  - Compatible with Hugging Face attention implementations
  - Integration with popular transformer architectures
  - Support for flash-attention configuration options
- NVIDIA Optimizations:
  - Compatibility with NVIDIA's Deep Learning Examples
  - Support for TensorRT integration
  - Optimizations for different GPU architectures
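As a sketch of the drop-in-replacement idea, a minimal attention module wrapping the proposed kernel could look as follows. The module name, parameter initialization, and the `flash_dma` import are illustrative assumptions, not a finalized interface.

```python
# Hedged sketch of a drop-in attention module built around the proposed API.
import torch
import torch.nn as nn
from flash_dma import flash_dma_attention  # assumed module, see the API above

class FlashDMAAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, num_kv_heads, keep_window_size=1024, causal=True):
        super().__init__()
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, embed_dim, bias=False)
        # Learned DMA parameters (shapes follow the API above; init is illustrative)
        self.dt_proj = nn.Parameter(torch.randn(num_kv_heads, num_kv_heads * self.head_dim) * 0.02)
        self.a_coef = nn.Parameter(torch.ones(num_kv_heads))
        self.keep_window_size, self.causal = keep_window_size, causal

    def forward(self, x):  # x: [batch, seq_len, embed_dim]
        B, N, _ = x.shape
        q = self.q_proj(x).view(B, N, self.num_heads, self.head_dim)
        k = self.k_proj(x).view(B, N, self.num_kv_heads, self.head_dim)
        v = self.v_proj(x).view(B, N, self.num_kv_heads, self.head_dim)
        out = flash_dma_attention(q, k, v, self.dt_proj, self.a_coef,
                                  keep_window_size=self.keep_window_size, causal=self.causal)
        return self.o_proj(out.reshape(B, N, -1))
```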
The integrated Flash-DMA approach offers significant advantages:
- Memory Efficiency: Maintains Flash Attention's O(N) memory complexity.
- Computational Efficiency: Achieves DMA's O(N*k) computational complexity.
- Scalability: Enables efficient processing of extremely long sequences (100K tokens and beyond).
- Adaptive Processing: Focuses computational resources on the most important keys.
- Hardware Optimization: Maximizes GPU utilization through careful memory management and access patterns.
Potential areas for future research and development:
- Automatic Parameter Tuning: Dynamically adjust the number of keys to select based on sequence content and hardware capabilities.
- Multi-GPU Scaling: Extend the algorithm for efficient multi-GPU implementation to handle even longer sequences.
- Alternative Selection Criteria: Explore different mechanisms for determining key importance.
- Architecture-Specific Optimizations: Develop specialized versions for different GPU architectures.
- Integration with Other Attention Variants: Combine with other attention optimizations like linear attention or gated attention.
This integrated approach represents a significant step forward in making transformer models more efficient and capable of handling longer contexts, potentially enabling new applications in long-document processing, genomics, and other domains requiring analysis of very long sequences.