Fast-dLLM

Project Page | arXiv: 2505.22618

Fast-dLLM is a training-free inference acceleration framework for diffusion-based Large Language Models (LLMs), supporting efficient inference for models such as Dream and LLaDA. This repository is the official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding".

End-to-end speedup over the vanilla LLaDA baseline

Project Structure

.
├── dream/          # Dream model related code
├── llada/          # LLaDA model related code
└── .gitignore      # Git ignore configuration

Features

  • Fast inference support for Dream and LLaDA models
  • Multiple inference optimization strategies
  • Code generation and evaluation capabilities
  • Interactive chat interface

Key Features

  1. Key-Value Cache for Block-Wise Decoding: We propose an efficient block-wise decoding KV Cache mechanism for Masked Diffusion Models (MDMs). By reusing attention Key-Value activations across multiple steps within each block, our approach avoids redundant computation and significantly accelerates inference. Furthermore, our DualCache extension also caches the masked suffix tokens, enabling even greater speedup with negligible accuracy loss. A minimal sketch of the idea follows the figure below.
KV Cache for block-wise decoding
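
To make the idea concrete, here is a minimal sketch of block-wise decoding with a prefix KV cache. It assumes a hypothetical HuggingFace-style model interface (use_cache, past_key_values, .logits) and a simple greedy, one-token-per-step unmasking rule; these names are illustrative assumptions rather than the repository's actual API, so see the code under llada/ and dream/ for the real implementations.

import torch

@torch.no_grad()
def generate_blockwise(model, prompt, mask_id, gen_length=128, block_size=32, steps_per_block=32):
    """Append gen_length masked tokens to the prompt and decode them block by block.
    The key/value activations of everything before the current block are computed
    once per block and reused for every refinement step inside that block."""
    batch = prompt.size(0)
    masks = torch.full((batch, gen_length), mask_id, dtype=prompt.dtype, device=prompt.device)
    tokens = torch.cat([prompt, masks], dim=1)
    rows = torch.arange(batch, device=tokens.device)

    for block_start in range(prompt.size(1), tokens.size(1), block_size):
        block_end = min(block_start + block_size, tokens.size(1))
        # Cache the clean prefix once; it does not change while this block is decoded.
        past_kv = model(tokens[:, :block_start], use_cache=True).past_key_values
        for _ in range(steps_per_block):
            block = tokens[:, block_start:block_end]
            if not (block == mask_id).any():
                break  # every position in this block is already unmasked
            # Re-run only the current block; the prefix is served from the cache.
            logits = model(block, past_key_values=past_kv).logits
            conf, pred = logits.softmax(dim=-1).max(dim=-1)
            conf = conf.masked_fill(block != mask_id, -1.0)
            best = conf.argmax(dim=-1)  # most confident still-masked position per sequence
            tokens[rows, block_start + best] = pred[rows, best]
    return tokens

DualCache extends this by also caching the key/value activations of the still-masked suffix to the right of the current block, so each step recomputes only the block itself.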

  2. Confidence-Aware Parallel Decoding: Instead of decoding tokens sequentially, we introduce a confidence-aware parallel decoding scheme. At each step, only tokens whose confidence exceeds a threshold are unmasked in parallel, while uncertain ones remain masked for future steps. This selective approach effectively balances decoding efficiency and output quality; a sketch of the selection rule follows the figures below.
Decoding comparison. Left: standard decoding (LLaDA). Right: confidence-aware parallel decoding.

Pseudo code for our method
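
The selection rule itself is small enough to sketch directly. The helper below unmasks every masked position whose top-1 probability clears a threshold, and falls back to the single most confident position so each step still makes progress; the function name and the 0.9 default are illustrative assumptions, not the repository's exact code.

import torch

def parallel_unmask(tokens, logits, mask_id, threshold=0.9):
    """Unmask every masked position whose top-1 probability exceeds `threshold`;
    if no position clears it, unmask the single most confident one so the
    decoding step still makes progress."""
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)                      # [batch, seq_len]
    masked = tokens == mask_id
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
    accept = masked & (conf > threshold)
    stuck = ~accept.any(dim=-1)                         # rows with nothing above threshold
    if stuck.any():
        best = conf.argmax(dim=-1)
        accept[stuck, best[stuck]] = True
    return torch.where(accept, pred, tokens)

Inside the block-wise loop sketched earlier, this call would replace the one-token-per-step update, unmasking several tokens per forward pass whenever the model is confident.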

  3. Overall Performance: Introducing the KV Cache mechanism yields significant speed improvements across all tasks and sequence lengths, typically a 2x to 3.6x speedup over the vanilla backbone. When the parallel decoding strategy is applied individually, we see additional acceleration, often pushing speedups to 4x-6x in the evaluated settings, particularly as the generation length increases.
Overall performance comparison

Installation

  1. Clone the repository:
git clone https://github.com/NVlabs/Fast-dLLM.git
cd Fast-dLLM
  2. Install dependencies:
pip install -r requirements.txt

Usage

1. Using LLaDA Model

Interactive Chat

python llada/chat.py --gen_length 128 --steps 128 --block_size 32

Parameter descriptions:

  • --gen_length: Maximum length of generated text
  • --steps: Number of sampling steps
  • --block_size: Cache block size
  • --use_cache: Whether to enable the KV cache
  • --if_cache_position: Whether to use DualCache
  • --threshold: Confidence threshold
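
For example, to enable the KV cache and confidence-aware parallel decoding together (the flag combination and the 0.9 threshold below are illustrative; check llada/chat.py for the exact argument syntax):

python llada/chat.py --gen_length 128 --steps 128 --block_size 32 --use_cache --threshold 0.9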

Web Demo

We also provide a web demo using Gradio. First, install Gradio:

pip install gradio

Then run the demo:

cd llada
python app.py

Model Evaluation

Benchmark | Gen Length | LLaDA | +Cache | +Parallel | +Cache+Parallel (Fast-dLLM)
GSM8K (5-shot) | 256 | 79.3 / 6.73 (1×) | 79.5 / 21.23 (3.2×) | 79.2 / 16.53 (2.5×) | 78.5 / 54.4 (8.1×)
GSM8K (5-shot) | 512 | 77.5 / 3.23 (1×) | 77.0 / 10.43 (3.3×) | 77.6 / 18.63 (5.8×) | 77.2 / 35.3 (11.0×)
HumanEval (0-shot) | 256 | 41.5 / 30.5 (1×) | 42.7 / 40.73 (1.3×) | 43.9 / 101.53 (3.3×) | 43.3 / 114.1 (3.7×)
HumanEval (0-shot) | 512 | 43.9 / 18.4 (1×) | 45.7 / 29.33 (1.6×) | 43.3 / 57.13 (3.1×) | 44.5 / 73.7 (4.0×)
Each cell shows accuracy (in percent) / decoding throughput (in tokens per second), with the relative speedup over the LLaDA baseline in parentheses.

For detailed evaluation instructions on the GSM8K and HumanEval benchmarks, please refer to the LLaDA Evaluation Guide.

2. Using Dream Model

For detailed evaluation instructions on the GSM8K and HumanEval benchmarks, please refer to the Dream Evaluation Guide.

Contributing

Issues and Pull Requests are welcome!

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Citation

If you find this work useful, please cite our paper:

@misc{wu2025fastdllmtrainingfreeaccelerationdiffusion,
      title={Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding}, 
      author={Chengyue Wu and Hao Zhang and Shuchen Xue and Zhijian Liu and Shizhe Diao and Ligeng Zhu and Ping Luo and Song Han and Enze Xie},
      year={2025},
      eprint={2505.22618},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.22618}, 
}

Acknowledgements

We would like to thank the authors of LLaDA and Dream for their excellent work and open-source contributions.
