This directory provides two key sets of resources:
- Warmup: Tutorials for continual pretraining and SFTing any autoregressive model on small datasets to generate text with MDLM (masked diffusion) or BD3LM (block diffusion).
- Tiny-A2D: The exact training, inference, and evaluation scripts for developing Qwen3-0.6B-diffusion-mdlm-v0.1 and Qwen3-0.6B-diffusion-bd3lm-v0.1. For detailed experimental results and reproduction instructions, please see our Tiny-A2D Report.
```
# example entry points for training / inference / evaluation
examples/a2d
├── bd3lm           # Block Discrete Denoising Diffusion Language Modeling (https://arxiv.org/abs/2503.09573)
│   ├── chat.py
│   ├── eval.sh
│   ├── pt.py
│   ├── sample.py
│   └── sft.py
├── mdlm            # Masked Diffusion Language Modeling (https://arxiv.org/abs/2406.07524)
│   ├── chat.py
│   ├── eval.sh
│   ├── pt.py
│   ├── sample.py
│   └── sft.py
└── README.md
```
(Optional) If your source AR model is not already supported in `dllm/pipelines/a2d/models`:
modify the original autoregressive modeling file to support non-causal attention (see `modeling_qwen3.py` for an example), and ensure the attention behavior is correct:

```shell
# Run A2D attention tests
pytest scripts/tests/test_attention.py -k "test_a2d"
# (Optional; required for the bd3lm trainer)
pytest scripts/tests/test_attention.py -k "test_bd3lm"
```
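Conceptually, the conversion swaps the AR model's causal (lower-triangular) attention mask for a fully bidirectional one, so a masked position can condition on tokens both before and after it. A minimal plain-Python sketch of the two mask patterns (illustrative only; `causal_mask` and `bidirectional_mask` are hypothetical names, not the repo's API):

```python
def causal_mask(n):
    """Standard AR mask: token i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Non-causal mask: every token attends to every position,
    which is what masked-diffusion training requires."""
    return [[True] * n for _ in range(n)]

mask = causal_mask(4)
assert mask[0] == [True, False, False, False]   # first token sees only itself
assert bidirectional_mask(4)[0] == [True] * 4   # non-causal: sees everything
```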
Before training, modify and save the source autoregressive models with non-causal attention.
For example, to save Qwen/Qwen3-0.6B with its original weights but with the modified non-causal attention defined in `modeling_qwen3.py`:

```shell
python dllm/pipelines/a2d/convert.py \
    --model_name_or_path "Qwen/Qwen3-0.6B" \
    --output_dir ".models/a2d/Qwen3-0.6B"
```

## Warmup: MDLM
In this section, we show toy examples of continual pretraining and SFTing Qwen/Qwen3-0.6B on small datasets to generate text with MDLM.
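During MDLM training, each sequence is corrupted by independently replacing tokens with a mask token at a sampled noise level, and the model is trained to predict only the masked positions. A minimal sketch of this corruption step (plain Python for illustration; `mdlm_corrupt` is a hypothetical helper, not the repo's implementation):

```python
import random

MASK = "[MASK]"

def mdlm_corrupt(tokens, t, rng):
    """Mask each token independently with probability t (the diffusion 'time')."""
    noisy, targets = [], []
    for tok in tokens:
        if rng.random() < t:
            noisy.append(MASK)
            targets.append(tok)    # loss is computed only at masked positions
        else:
            noisy.append(tok)
            targets.append(None)   # unmasked positions are ignored by the loss
    return noisy, targets

rng = random.Random(0)
tokens = ["First", "citizen", ":", "hear", "me", "speak"]
noisy, targets = mdlm_corrupt(tokens, t=0.5, rng=rng)
assert len(noisy) == len(tokens)
assert all(tg is None or ns == MASK for ns, tg in zip(noisy, targets))
```

At sampling time the process runs in reverse: starting from all-mask, the model iteratively fills in masked positions over a number of denoising steps.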
To train Qwen/Qwen3-0.6B on the tiny-shakespeare dataset with MDLM, run (on 1 GPU):
```shell
accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 1 \
    examples/a2d/mdlm/pt.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "Trelis/tiny-shakespeare" \
    --text_field "Text" \
    --insert_eos False \
    --max_length 128 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --output_dir ".models/a2d/Qwen3-0.6B/mdlm/tiny-shakespeare"
```

To sample from the model interactively:
```shell
# Enter a prompt (e.g., "First citizen: Before we proceed any further, hear me speak."),
# or press Enter to let the model generate text from scratch.
python -u examples/a2d/mdlm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/mdlm/tiny-shakespeare/checkpoint-final" \
    --chat_template False --remasking "random" --temperature 0.7
```

### Example of pretraining on a larger dataset (OpenWebText) in streaming mode
To train Qwen/Qwen3-0.6B on the openwebtext dataset in streaming mode with MDLM, run (on 8 GPUs):
```shell
accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/a2d/mdlm/pt.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "dylanebert/openwebtext" \
    --text_field "text" \
    --streaming True \
    --insert_eos True \
    --max_length 512 \
    --max_steps 20000 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --eval_strategy "no" \
    --output_dir ".models/a2d/Qwen3-0.6B/mdlm/openwebtext"
```

To sample from the model interactively:
```shell
# Enter a prompt (e.g., "Lebron James is"),
# or press Enter to let the model generate text from scratch.
python -u examples/a2d/mdlm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/mdlm/openwebtext/checkpoint-final" \
    --chat_template False --remasking "random" --temperature 0.7
```

To train Qwen/Qwen3-0.6B on the alpaca dataset with MDLM, run (on 8 GPUs):
```shell
accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/a2d/mdlm/sft.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "tatsu-lab/alpaca" \
    --max_length 512 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --output_dir ".models/a2d/Qwen3-0.6B/mdlm/alpaca"
```

To chat with the model:
```shell
python -u examples/a2d/mdlm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/mdlm/alpaca/checkpoint-final" --block_size 32
```

## Warmup: BD3LM
In this section, we show toy examples of continual pretraining and SFTing Qwen/Qwen3-0.6B on small datasets to generate text with BD3LM.
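BD3LM interpolates between autoregressive and diffusion generation: text is produced block by block, with bidirectional attention inside each block but causal attention across blocks. A minimal sketch of the resulting attention pattern (illustrative only; `bd3lm_mask` is a hypothetical name, not the repo's implementation):

```python
def bd3lm_mask(n, block_size):
    """Block-diffusion attention pattern: token i may attend to position j
    iff j's block is not later than i's block (bidirectional within a block,
    causal across blocks)."""
    return [[(j // block_size) <= (i // block_size) for j in range(n)]
            for i in range(n)]

m = bd3lm_mask(8, block_size=4)
assert m[0][3] is True    # within the first block: bidirectional
assert m[3][4] is False   # future blocks are hidden
assert m[4][0] is True    # all earlier blocks are fully visible
```

This is why the BD3LM commands below take a `--block_size` argument: it controls the width of each diffusion block during both training and generation.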
To train Qwen/Qwen3-0.6B on the tiny-shakespeare dataset with BD3LM, run (on 1 GPU):
```shell
accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 1 \
    examples/a2d/bd3lm/pt.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "Trelis/tiny-shakespeare" \
    --text_field "Text" \
    --insert_eos False \
    --max_length 128 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --block_size 32 \
    --output_dir ".models/a2d/Qwen3-0.6B/bd3lm/tiny-shakespeare"
```

To sample from the model interactively:
```shell
# Enter a prompt (e.g., "First citizen: Before we proceed any further, hear me speak."),
# or press Enter to let the model generate text from scratch.
python -u examples/a2d/bd3lm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/bd3lm/tiny-shakespeare/checkpoint-final" \
    --chat_template False --block_size 32 --remasking "random" --temperature 0.7
```

### Example of pretraining on a larger dataset (OpenWebText) in streaming mode
To train Qwen/Qwen3-0.6B on the openwebtext dataset in streaming mode with BD3LM, run (on 8 GPUs):
```shell
accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/a2d/bd3lm/pt.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "dylanebert/openwebtext" \
    --text_field "text" \
    --streaming True \
    --insert_eos True \
    --max_length 512 \
    --max_steps 20000 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --eval_strategy "no" \
    --block_size 32 \
    --output_dir ".models/a2d/Qwen3-0.6B/bd3lm/openwebtext"
```

To sample from the model interactively:
```shell
# Enter a prompt (e.g., "Lebron James is"),
# or press Enter to let the model generate text from scratch.
python -u examples/a2d/bd3lm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/bd3lm/openwebtext/checkpoint-final" \
    --chat_template False --block_size 32 --remasking "random" --temperature 0.7
```

To train Qwen/Qwen3-0.6B on the alpaca dataset with BD3LM, run (on 8 GPUs):
```shell
accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/a2d/bd3lm/sft.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "tatsu-lab/alpaca" \
    --max_length 512 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --block_size 32 \
    --output_dir ".models/a2d/Qwen3-0.6B/bd3lm/alpaca"
```

To chat with the model:
```shell
python -u examples/a2d/bd3lm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/bd3lm/alpaca/checkpoint-final" --block_size 32
```

Here we show the exact commands we use to train / interact with / evaluate the Tiny-A2D models:
Qwen3-0.6B-diffusion-mdlm-v0.1 and Qwen3-0.6B-diffusion-bd3lm-v0.1.
For training curves and other details, please see Tiny-A2D Report.
Read Useful tips for training and (optional) Slurm setup before training.
The Tiny-A2D models are trained purely with SFT.
To reproduce Qwen3-0.6B-diffusion-mdlm-v0.1 (with MDLM & SFT), run the command below (about 10 hours on 64 A100s):
```shell
WANDB_MODE=online sbatch --nodes=8 --gres=gpu:8 scripts/train.slurm.sh \
    --accelerate_config "zero2" \
    --script_path "examples/a2d/mdlm/sft.py" \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk+OpenCoder-LLM/opc-sft-stage1[lang:python]+OpenCoder-LLM/opc-sft-stage2[lang:python]" \
    --max_length 1024 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 2 \
    --output_dir ".models/a2d/Qwen3-0.6B/tulu-3-sft-mixture+smoltalk+opc-sft-stage1&2/epochs-10-bs-2048-len-1024"
```

To reproduce Qwen3-0.6B-diffusion-bd3lm-v0.1 (with BD3LM & SFT), run the command below (about 10 hours on 64 A100s):
```shell
WANDB_MODE=online sbatch --nodes=8 --gres=gpu:8 scripts/train.slurm.sh \
    --accelerate_config "zero2" \
    --script_path "examples/a2d/bd3lm/sft.py" \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk+OpenCoder-LLM/opc-sft-stage1[lang:python]+OpenCoder-LLM/opc-sft-stage2[lang:python]" \
    --max_length 512 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 2 \
    --block_size 32 \
    --output_dir ".models/a2d/Qwen3-0.6B/tulu-3-sft-mixture+smoltalk+opc-sft-stage1&2/epochs-10-bs-2048-len-512-bls-32"
```

To chat with the model:
```shell
python -u examples/a2d/mdlm/chat.py --model_name_or_path "dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1"
python -u examples/a2d/bd3lm/chat.py --model_name_or_path "dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1"
```

Read (optional) Evaluation setup before running evaluation.
To evaluate Qwen3-0.6B-diffusion-mdlm-v0.1 and Qwen3-0.6B-diffusion-bd3lm-v0.1 on gsm8k using 4 GPUs, run:
```shell
# Use model_args to adjust the sampler arguments for evaluation.
accelerate launch --num_processes 4 \
    dllm/pipelines/a2d/eval.py \
    --tasks "gsm8k_cot" \
    --model "a2d_mdlm" \
    --apply_chat_template \
    --num_fewshot 0 \
    --model_args "pretrained=dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1,max_new_tokens=256,steps=256,block_size=256,cfg_scale=0.0,temperature=0.0"

accelerate launch --num_processes 4 \
    dllm/pipelines/a2d/eval.py \
    --tasks "gsm8k_cot" \
    --model "a2d_bd3lm" \
    --apply_chat_template \
    --num_fewshot 0 \
    --model_args "pretrained=dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1,max_new_tokens=256,steps=256,block_size=32,cfg_scale=0.0,temperature=0.0"
```

To automatically evaluate Qwen3-0.6B-diffusion-mdlm-v0.1 and Qwen3-0.6B-diffusion-bd3lm-v0.1 on all benchmarks, run:
```shell
bash examples/a2d/mdlm/eval.sh --model_name_or_path "dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1"
bash examples/a2d/bd3lm/eval.sh --model_name_or_path "dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1"
```

| Model | GSM8K | BBH | MATH | MMLU | MMLU-Pro | Hellaswag | HumanEval | MBPP |
|---|---|---|---|---|---|---|---|---|
| Qwen3-0.6B-diffusion-mdlm-v0.1 (Reproduced) | 29.3 | 26.7 | 8.7 | 40.0 | 17.3 | 42.1 | 30.5 | 29.2 |
| Qwen3-0.6B-diffusion-bd3lm-v0.1 (Reproduced) | 46.3 | 26.6 | 12.9 | 39.1 | 13.8 | 39.3 | 46.3 | 38.2 |
| *Qwen2.5-0.5B* (Official) | 41.6 | 20.3 | 19.5 | 47.5 | 15.7 | 52.1 | 30.5 | 39.3 |
| *Qwen3-0.6B-Base* (Official) | 59.6 | 41.5 | 32.4 | 52.8 | 24.7 | 47.4 | 32.3 | 36.6 |
| Model | HumanEval | MBPP |
|---|---|---|
| Qwen2.5-Coder-0.5B-Instruct-diffusion-mdlm-v0.1 (Reproduced) | 28.1 | 23.0 |
| Qwen2.5-Coder-0.5B-Instruct-diffusion-bd3lm-v0.1 (Reproduced) | 39.0 | 33.2 |
| open-dcoder-0.5B (Official) | 20.8 | 35.2 |
| *Qwen2.5-Coder-0.5B-Instruct* (Official) | 28.0 | 52.9 |
Table 1. Results (Reproduced) are obtained using our framework, while results (Official) come from the Qwen3 Technical Report, Qwen2.5-Coder Technical Report, Qwen2.5 Blog, and Open-dLLM. Italic rows denote autoregressive models, whereas non-italic rows denote diffusion language models.