Fast-dLLM is an inference acceleration framework for diffusion-based Large Language Models (LLMs) that supports efficient inference for models such as Dream and LLaDA.
.
├── dream/ # Dream model related code
├── llada/ # LLaDA model related code
└── .gitignore # Git ignore configuration
- Fast inference support for Dream and LLaDA models
- Multiple inference optimization strategies
- Code generation and evaluation capabilities
- Interactive chat interface
- **Key-Value Cache for Block-Wise Decoding**: We propose an efficient block-wise KV Cache mechanism for Masked Diffusion Models (MDMs). By reusing attention Key-Value activations across multiple steps within each block, our approach avoids redundant computation and significantly accelerates inference. Furthermore, our DualCache extension also caches masked suffix tokens, enabling even greater speedup with negligible accuracy loss.
- **Confidence-Aware Parallel Decoding**: Instead of decoding tokens sequentially, we introduce a confidence-aware parallel decoding scheme. At each step, only tokens whose confidence exceeds a threshold are unmasked in parallel, while uncertain ones remain masked for future steps. This selective approach effectively balances decoding efficiency and output quality (a minimal sketch of this step follows this list).
- **Overall Performance**: Introducing the KV Cache mechanism yields significant speed improvements across all tasks and sequence lengths, typically achieving a 2x to 3.6x speedup over the vanilla backbone. Applied on its own, the parallel decoding strategy brings additional acceleration, often pushing speedups to 4x-6x in the evaluated settings, particularly as the generation length increases.
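The confidence-aware unmasking step can be summarized in a few lines. The sketch below is illustrative only and is not the code in `llada/` or `dream/`: the names `MASK_ID` and `parallel_decode_step` are placeholders, the logits are assumed to be precomputed for a single sequence, and `threshold` plays the role of the `--threshold` flag described later.

```python
# Minimal, illustrative sketch of confidence-aware parallel decoding for a
# masked diffusion model. All names here are placeholders, not the repo's API.
import torch

MASK_ID = -1  # placeholder id for the [MASK] token (assumption)

def parallel_decode_step(logits: torch.Tensor, tokens: torch.Tensor,
                         threshold: float = 0.9) -> torch.Tensor:
    """Unmask, in parallel, every masked position whose top-1 probability
    exceeds `threshold`; low-confidence positions stay masked for later steps."""
    probs = torch.softmax(logits, dim=-1)       # (seq_len, vocab_size)
    conf, pred = probs.max(dim=-1)              # per-position confidence and argmax
    masked = tokens == MASK_ID
    accept = masked & (conf >= threshold)
    # Commit at least one token per step so decoding always makes progress.
    if masked.any() and not accept.any():
        best = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
        accept[best] = True
    return torch.where(accept, pred, tokens)

# Toy usage: 8 masked positions and random "model" logits.
tokens = torch.full((8,), MASK_ID)
logits = torch.randn(8, 32000)
tokens = parallel_decode_step(logits, tokens, threshold=0.9)
```

With a high threshold the scheme degrades gracefully toward one-token-per-step decoding, which is how it trades decoding efficiency against output quality.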
- Clone the repository:
git clone https://github.com/your-username/fast-dllm.git
cd fast-dllm
- Install dependencies:
pip install -r requirements.txt
python llada/chat.py --gen_length 128 --steps 128 --block_size 32
Parameter descriptions:

- `--gen_length`: Maximum length of the generated text
- `--steps`: Number of sampling steps
- `--block_size`: Cache block size
- `--use_cache`: Whether to use the KV cache
- `--if_cache_position`: Whether to use the dual cache (DualCache)
- `--threshold`: Confidence threshold for parallel decoding
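For reference, the sketch below shows how these flags could be wired together with `argparse`. It is an illustration of the option set described above, not the actual implementation in `llada/chat.py`; the numeric defaults follow the example command, while the `threshold` default of 0.9 is an assumption.

```python
# Illustrative argparse wiring for the chat-demo flags listed above.
# Flag semantics follow the descriptions in this README; defaults are guesses.
import argparse

parser = argparse.ArgumentParser(description="Fast-dLLM chat demo (sketch)")
parser.add_argument("--gen_length", type=int, default=128,
                    help="maximum length of the generated text")
parser.add_argument("--steps", type=int, default=128,
                    help="number of sampling steps")
parser.add_argument("--block_size", type=int, default=32,
                    help="block size for block-wise KV caching")
parser.add_argument("--use_cache", action="store_true",
                    help="enable the block-wise KV cache")
parser.add_argument("--if_cache_position", action="store_true",
                    help="enable DualCache (also cache masked suffix tokens)")
parser.add_argument("--threshold", type=float, default=0.9,
                    help="confidence threshold for parallel decoding")
args = parser.parse_args()
```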
We also provide a web demo using Gradio. First, install Gradio:
pip install gradio
Then run the demo:
cd llada
python app.py
| Benchmark | Gen Length | LLaDA | +Cache | +Parallel | +Cache+Parallel (Fast-dLLM) |
|---|---|---|---|---|---|
| GSM8K (5-shot) | 256 | 79.3<br>6.73<br>(1×) | 79.5<br>21.23<br>(3.2×) | 79.2<br>16.53<br>(2.5×) | 78.5<br>54.4<br>(8.1×) |
| | 512 | 77.5<br>3.23<br>(1×) | 77.0<br>10.43<br>(3.3×) | 77.6<br>18.63<br>(5.8×) | 77.2<br>35.3<br>(11.0×) |
| HumanEval (0-shot) | 256 | 41.5<br>30.5<br>(1×) | 42.7<br>40.73<br>(1.3×) | 43.9<br>101.53<br>(3.3×) | 43.3<br>114.1<br>(3.7×) |
| | 512 | 43.9<br>18.4<br>(1×) | 45.7<br>29.33<br>(1.6×) | 43.3<br>57.13<br>(3.1×) | 44.5<br>73.7<br>(4.0×) |
Each cell reports accuracy (top row, in %), decoding throughput (middle row, in tokens per second), and the relative speedup over the LLaDA baseline (bottom row).
For detailed instructions on evaluating LLaDA on the GSM8K and HumanEval benchmarks, please refer to the LLaDA Evaluation Guide.
For the corresponding evaluation instructions for Dream, please refer to the Dream Evaluation Guide.
Issues and Pull Requests are welcome!
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
If you find this work useful, please cite our paper:
@misc{wu2025fastdllmtrainingfreeaccelerationdiffusion,
title={Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding},
author={Chengyue Wu and Hao Zhang and Shuchen Xue and Zhijian Liu and Shizhe Diao and Ligeng Zhu and Ping Luo and Song Han and Enze Xie},
year={2025},
eprint={2505.22618},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.22618},
}
We would like to thank the authors of LLaDA and Dream for their excellent work and open-source contributions.