Fast-dLLM

Project Page | arXiv: 2505.22618

Fast-dLLM is a training-free inference acceleration framework for diffusion-based Large Language Models (LLMs), supporting efficient inference for models such as Dream and LLaDA. This repository is the official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding".

End-to-end speedup over the vanilla LLaDA baseline

Project Structure

.
├── dream/          # Dream model related code
├── llada/          # LLaDA model related code
└── .gitignore      # Git ignore configuration

Features

  • Fast inference support for Dream and LLaDA models
  • Multiple inference optimization strategies
  • Code generation and evaluation capabilities
  • Interactive chat interface

Key Features

  1. Key-Value Cache for Block-Wise Decoding: We propose an efficient block-wise decoding KV Cache mechanism for Masked Diffusion Models (MDMs). By reusing attention Key-Value activations across multiple steps within each block, our approach avoids redundant computation and significantly accelerates inference. Furthermore, our DualCache extension also caches the masked suffix tokens, enabling even greater speedup with negligible accuracy loss. A minimal sketch of the idea follows the figure below.
KV Cache for block-wise decoding
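
To make the idea concrete, here is a minimal sketch of block-wise decoding with a prefix KV cache. It assumes a hypothetical HuggingFace-style model interface (use_cache, past_key_values, .logits) and a simple greedy, one-token-per-step unmasking rule; these names are illustrative assumptions rather than the repository's actual API, so see the code under llada/ and dream/ for the real implementations.

import torch

@torch.no_grad()
def generate_blockwise(model, prompt, mask_id, gen_length=128, block_size=32, steps_per_block=32):
    """Append gen_length masked tokens to the prompt and decode them block by block.
    The key/value activations of everything before the current block are computed
    once per block and reused for every refinement step inside that block."""
    batch = prompt.size(0)
    masks = torch.full((batch, gen_length), mask_id, dtype=prompt.dtype, device=prompt.device)
    tokens = torch.cat([prompt, masks], dim=1)
    rows = torch.arange(batch, device=tokens.device)

    for block_start in range(prompt.size(1), tokens.size(1), block_size):
        block_end = min(block_start + block_size, tokens.size(1))
        # Cache the clean prefix once; it does not change while this block is decoded.
        past_kv = model(tokens[:, :block_start], use_cache=True).past_key_values
        for _ in range(steps_per_block):
            block = tokens[:, block_start:block_end]
            if not (block == mask_id).any():
                break  # every position in this block is already unmasked
            # Re-run only the current block; the prefix is served from the cache.
            logits = model(block, past_key_values=past_kv).logits
            conf, pred = logits.softmax(dim=-1).max(dim=-1)
            conf = conf.masked_fill(block != mask_id, -1.0)
            best = conf.argmax(dim=-1)  # most confident still-masked position per sequence
            tokens[rows, block_start + best] = pred[rows, best]
    return tokens

DualCache extends this by also caching the key/value activations of the still-masked suffix to the right of the current block, so each step recomputes only the block itself.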

  2. Confidence-Aware Parallel Decoding: Instead of decoding tokens sequentially, we introduce a confidence-aware parallel decoding scheme. At each step, only tokens whose confidence exceeds a threshold are unmasked in parallel, while uncertain ones remain masked for future steps. This selective approach effectively balances decoding efficiency and output quality; a sketch of the selection rule follows the figures below.
Decoding comparison. Left: standard decoding (LLaDA). Right: confidence-aware parallel decoding.

Pseudo code for our method
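
The selection rule itself is small enough to sketch directly. The helper below unmasks every masked position whose top-1 probability clears a threshold, and falls back to the single most confident position so each step still makes progress; the function name and the 0.9 default are illustrative assumptions, not the repository's exact code.

import torch

def parallel_unmask(tokens, logits, mask_id, threshold=0.9):
    """Unmask every masked position whose top-1 probability exceeds `threshold`;
    if no position clears it, unmask the single most confident one so the
    decoding step still makes progress."""
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)                      # [batch, seq_len]
    masked = tokens == mask_id
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
    accept = masked & (conf > threshold)
    stuck = ~accept.any(dim=-1)                         # rows with nothing above threshold
    if stuck.any():
        best = conf.argmax(dim=-1)
        accept[stuck, best[stuck]] = True
    return torch.where(accept, pred, tokens)

Inside the block-wise loop sketched earlier, this call would replace the one-token-per-step update, unmasking several tokens per forward pass whenever the model is confident.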

  3. Overall Performance: Introducing the KV Cache mechanism yields significant speed improvements across all tasks and sequence lengths, typically a 2x to 3.6x speedup over the vanilla backbone. When the parallel decoding strategy is applied individually, we see additional acceleration, often pushing speedups to 4x-6x in the evaluated settings, particularly as the generation length increases.
Overall performance comparison

Installation

  1. Clone the repository:
git clone https://github.com/NVlabs/Fast-dLLM.git
cd Fast-dLLM
  2. Install dependencies:
pip install -r requirements.txt

Usage

1. Using LLaDA Model

Interactive Chat

python llada/chat.py --gen_length 128 --steps 128 --block_size 32

Parameter descriptions:

  • --gen_length: Maximum length of generated text
  • --steps: Number of sampling steps
  • --block_size: Cache block size
  • --use_cache: Whether to enable the KV cache
  • --if_cache_position: Whether to use DualCache
  • --threshold: Confidence threshold
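
For example, to enable the KV cache and confidence-aware parallel decoding together (the flag combination and the 0.9 threshold below are illustrative; check llada/chat.py for the exact argument syntax):

python llada/chat.py --gen_length 128 --steps 128 --block_size 32 --use_cache --threshold 0.9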

Web Demo

We also provide a web demo using Gradio. First, install Gradio:

pip install gradio

Then run the demo:

cd llada
python app.py

Model Evaluation

Benchmark | Gen Length | LLaDA | +Cache | +Parallel | +Cache+Parallel (Fast-dLLM)
GSM8K (5-shot) | 256 | 79.3 / 6.73 (1×) | 79.5 / 21.23 (3.2×) | 79.2 / 16.53 (2.5×) | 78.5 / 54.4 (8.1×)
GSM8K (5-shot) | 512 | 77.5 / 3.23 (1×) | 77.0 / 10.43 (3.3×) | 77.6 / 18.63 (5.8×) | 77.2 / 35.3 (11.0×)
HumanEval (0-shot) | 256 | 41.5 / 30.5 (1×) | 42.7 / 40.73 (1.3×) | 43.9 / 101.53 (3.3×) | 43.3 / 114.1 (3.7×)
HumanEval (0-shot) | 512 | 43.9 / 18.4 (1×) | 45.7 / 29.33 (1.6×) | 43.3 / 57.13 (3.1×) | 44.5 / 73.7 (4.0×)
Each cell shows accuracy (in percent) / decoding throughput (in tokens per second), with the relative speedup over the LLaDA baseline in parentheses.

For detailed evaluation instructions on the GSM8K and HumanEval benchmarks, please refer to the LLaDA Evaluation Guide.

2. Using Dream Model

For detailed evaluation instructions on the GSM8K and HumanEval benchmarks, please refer to the Dream Evaluation Guide.

Contributing

Issues and Pull Requests are welcome!

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Citation

If you find this work useful, please cite our paper:

@misc{wu2025fastdllmtrainingfreeaccelerationdiffusion,
      title={Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding}, 
      author={Chengyue Wu and Hao Zhang and Shuchen Xue and Zhijian Liu and Shizhe Diao and Ligeng Zhu and Ping Luo and Song Han and Enze Xie},
      year={2025},
      eprint={2505.22618},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.22618}, 
}

Acknowledgements

We would like to thank the authors of LLaDA and Dream for their excellent work and open-source contributions.
