This directory provides two key sets of resources:
- Warmup: Tutorials for continual pretraining and SFTing any autoregressive model on small datasets to generate text with MDLM (masked diffusion) or BD3LM (block diffusion).
- Tiny-A2D: The exact training, inference, and evaluation scripts for developing Qwen3-0.6B-diffusion-mdlm-v0.1 and Qwen3-0.6B-diffusion-bd3lm-v0.1. For detailed experimental results and reproduction instructions, please see our Tiny-A2D Report.
```
# example entry points for training / inference / evaluation
examples/a2d
├── bd3lm           # Block Discrete Denoising Diffusion Language Modeling (https://arxiv.org/abs/2503.09573)
│   ├── chat.py
│   ├── eval.sh
│   ├── pt.py
│   ├── sample.py
│   └── sft.py
├── mdlm            # Masked Diffusion Language Modeling (https://arxiv.org/abs/2406.07524)
│   ├── chat.py
│   ├── eval.sh
│   ├── pt.py
│   ├── sample.py
│   └── sft.py
└── README.md
```
(Optional) If your source AR model is not already supported in `dllm/pipelines/a2d/models`:
modify the original autoregressive modeling file to support non-causal attention (see `modeling_qwen3.py` for an example), and ensure the attention behavior is correct:

```shell
# Run A2D attention tests
pytest scripts/tests/test_attention.py -k "test_a2d"
# (Optional; required for the bd3lm trainer)
pytest scripts/tests/test_attention.py -k "test_bd3lm"
```
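Conceptually, the conversion swaps the AR model's causal (lower-triangular) attention mask for a fully bidirectional one, so a masked position can condition on tokens both before and after it. A minimal plain-Python sketch of the two mask patterns (illustrative only; `causal_mask` and `bidirectional_mask` are hypothetical names, not the repo's API):

```python
def causal_mask(n):
    """Standard AR mask: token i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Non-causal mask: every token attends to every position,
    which is what masked-diffusion training requires."""
    return [[True] * n for _ in range(n)]

mask = causal_mask(4)
assert mask[0] == [True, False, False, False]   # first token sees only itself
assert bidirectional_mask(4)[0] == [True] * 4   # non-causal: sees everything
```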
Before training, modify and save the source autoregressive models with non-causal attention.
For example, to save Qwen/Qwen3-0.6B with its original weights but with the modified non-causal attention defined in `modeling_qwen3.py`:

```shell
python dllm/pipelines/a2d/convert.py \
    --model_name_or_path "Qwen/Qwen3-0.6B" \
    --output_dir ".models/a2d/Qwen3-0.6B"
```

## Warmup: MDLM
In this section, we show toy examples of continual pretraining and SFTing Qwen/Qwen3-0.6B on small datasets to generate text with MDLM.
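During MDLM training, each sequence is corrupted by independently replacing tokens with a mask token at a sampled noise level, and the model is trained to predict only the masked positions. A minimal sketch of this corruption step (plain Python for illustration; `mdlm_corrupt` is a hypothetical helper, not the repo's implementation):

```python
import random

MASK = "[MASK]"

def mdlm_corrupt(tokens, t, rng):
    """Mask each token independently with probability t (the diffusion 'time')."""
    noisy, targets = [], []
    for tok in tokens:
        if rng.random() < t:
            noisy.append(MASK)
            targets.append(tok)    # loss is computed only at masked positions
        else:
            noisy.append(tok)
            targets.append(None)   # unmasked positions are ignored by the loss
    return noisy, targets

rng = random.Random(0)
tokens = ["First", "citizen", ":", "hear", "me", "speak"]
noisy, targets = mdlm_corrupt(tokens, t=0.5, rng=rng)
assert len(noisy) == len(tokens)
assert all(tg is None or ns == MASK for ns, tg in zip(noisy, targets))
```

At sampling time the process runs in reverse: starting from all-mask, the model iteratively fills in masked positions over a number of denoising steps.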
To train Qwen/Qwen3-0.6B on the tiny-shakespeare dataset with MDLM, run (on 1 GPU):
```shell
accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 1 \
    examples/a2d/mdlm/pt.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "Trelis/tiny-shakespeare" \
    --text_field "Text" \
    --insert_eos False \
    --max_length 128 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --output_dir ".models/a2d/Qwen3-0.6B/mdlm/tiny-shakespeare"
```

To sample from the model interactively:
```shell
# Enter a prompt (e.g., "First citizen: Before we proceed any further, hear me speak."),
# or press Enter to let the model generate text from scratch.
python -u examples/a2d/mdlm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/mdlm/tiny-shakespeare/checkpoint-final" \
    --chat_template False --remasking "random" --temperature 0.7
```

### Example of pretraining on a larger dataset (OpenWebText) in streaming mode
To train Qwen/Qwen3-0.6B on the openwebtext dataset in streaming mode with MDLM, run (on 8 GPUs):
```shell
accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/a2d/mdlm/pt.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "dylanebert/openwebtext" \
    --text_field "text" \
    --streaming True \
    --insert_eos True \
    --max_length 512 \
    --max_steps 20000 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --eval_strategy "no" \
    --output_dir ".models/a2d/Qwen3-0.6B/mdlm/openwebtext"
```

To sample from the model interactively:
```shell
# Enter a prompt (e.g., "Lebron James is"),
# or press Enter to let the model generate text from scratch.
python -u examples/a2d/mdlm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/mdlm/openwebtext/checkpoint-final" \
    --chat_template False --remasking "random" --temperature 0.7
```

To train Qwen/Qwen3-0.6B on the alpaca dataset with MDLM, run (on 8 GPUs):
```shell
accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/a2d/mdlm/sft.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "tatsu-lab/alpaca" \
    --max_length 512 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --output_dir ".models/a2d/Qwen3-0.6B/mdlm/alpaca"
```

To chat with the model:
```shell
python -u examples/a2d/mdlm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/mdlm/alpaca/checkpoint-final" --block_size 32
```

## Warmup: BD3LM
In this section, we show toy examples of continual pretraining and SFTing Qwen/Qwen3-0.6B on small datasets to generate text with BD3LM.
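BD3LM interpolates between autoregressive and diffusion generation: text is produced block by block, with bidirectional attention inside each block but causal attention across blocks. A minimal sketch of the resulting attention pattern (illustrative only; `bd3lm_mask` is a hypothetical name, not the repo's implementation):

```python
def bd3lm_mask(n, block_size):
    """Block-diffusion attention pattern: token i may attend to position j
    iff j's block is not later than i's block (bidirectional within a block,
    causal across blocks)."""
    return [[(j // block_size) <= (i // block_size) for j in range(n)]
            for i in range(n)]

m = bd3lm_mask(8, block_size=4)
assert m[0][3] is True    # within the first block: bidirectional
assert m[3][4] is False   # future blocks are hidden
assert m[4][0] is True    # all earlier blocks are fully visible
```

This is why the BD3LM commands below take a `--block_size` argument: it controls the width of each diffusion block during both training and generation.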
To train Qwen/Qwen3-0.6B on the tiny-shakespeare dataset with BD3LM, run (on 1 GPU):
```shell
accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 1 \
    examples/a2d/bd3lm/pt.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "Trelis/tiny-shakespeare" \
    --text_field "Text" \
    --insert_eos False \
    --max_length 128 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --block_size 32 \
    --output_dir ".models/a2d/Qwen3-0.6B/bd3lm/tiny-shakespeare"
```

To sample from the model interactively:
```shell
# Enter a prompt (e.g., "First citizen: Before we proceed any further, hear me speak."),
# or press Enter to let the model generate text from scratch.
python -u examples/a2d/bd3lm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/bd3lm/tiny-shakespeare/checkpoint-final" \
    --chat_template False --block_size 32 --remasking "random" --temperature 0.7
```

### Example of pretraining on a larger dataset (OpenWebText) in streaming mode
To train Qwen/Qwen3-0.6B on the openwebtext dataset in streaming mode with BD3LM, run (on 8 GPUs):
```shell
accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/a2d/bd3lm/pt.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "dylanebert/openwebtext" \
    --text_field "text" \
    --streaming True \
    --insert_eos True \
    --max_length 512 \
    --max_steps 20000 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --eval_strategy "no" \
    --block_size 32 \
    --output_dir ".models/a2d/Qwen3-0.6B/bd3lm/openwebtext"
```

To sample from the model interactively:
```shell
# Enter a prompt (e.g., "Lebron James is"),
# or press Enter to let the model generate text from scratch.
python -u examples/a2d/bd3lm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/bd3lm/openwebtext/checkpoint-final" \
    --chat_template False --block_size 32 --remasking "random" --temperature 0.7
```

To train Qwen/Qwen3-0.6B on the alpaca dataset with BD3LM, run (on 8 GPUs):
```shell
accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/a2d/bd3lm/sft.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "tatsu-lab/alpaca" \
    --max_length 512 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --block_size 32 \
    --output_dir ".models/a2d/Qwen3-0.6B/bd3lm/alpaca"
```

To chat with the model:
```shell
python -u examples/a2d/bd3lm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/bd3lm/alpaca/checkpoint-final" --block_size 32
```

Here we show the exact commands we use to train / interact with / evaluate the Tiny-A2D models:
Qwen3-0.6B-diffusion-mdlm-v0.1 and Qwen3-0.6B-diffusion-bd3lm-v0.1.
For training curves and other details, please see Tiny-A2D Report.
Read Useful tips for training and (optional) Slurm setup before training.
The Tiny-A2D models are trained purely with SFT.
To reproduce Qwen3-0.6B-diffusion-mdlm-v0.1 (with MDLM & SFT), run the command below (about 10 hours on 64 A100s):
```shell
WANDB_MODE=online sbatch --nodes=8 --gres=gpu:8 scripts/train.slurm.sh \
    --accelerate_config "zero2" \
    --script_path "examples/a2d/mdlm/sft.py" \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk+OpenCoder-LLM/opc-sft-stage1[lang:python]+OpenCoder-LLM/opc-sft-stage2[lang:python]" \
    --max_length 1024 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 2 \
    --output_dir ".models/a2d/Qwen3-0.6B/tulu-3-sft-mixture+smoltalk+opc-sft-stage1&2/epochs-10-bs-2048-len-1024"
```

To reproduce Qwen3-0.6B-diffusion-bd3lm-v0.1 (with BD3LM & SFT), run the command below (about 10 hours on 64 A100s):
```shell
WANDB_MODE=online sbatch --nodes=8 --gres=gpu:8 scripts/train.slurm.sh \
    --accelerate_config "zero2" \
    --script_path "examples/a2d/bd3lm/sft.py" \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk+OpenCoder-LLM/opc-sft-stage1[lang:python]+OpenCoder-LLM/opc-sft-stage2[lang:python]" \
    --max_length 512 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 2 \
    --block_size 32 \
    --output_dir ".models/a2d/Qwen3-0.6B/tulu-3-sft-mixture+smoltalk+opc-sft-stage1&2/epochs-10-bs-2048-len-512-bls-32"
```

To chat with the model:
```shell
python -u examples/a2d/mdlm/chat.py --model_name_or_path "dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1"
python -u examples/a2d/bd3lm/chat.py --model_name_or_path "dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1"
```

Read (optional) Evaluation setup before running evaluation.
To evaluate Qwen3-0.6B-diffusion-mdlm-v0.1 and Qwen3-0.6B-diffusion-bd3lm-v0.1 on gsm8k using 4 GPUs, run:
```shell
# Use model_args to adjust the sampler arguments for evaluation.
accelerate launch --num_processes 4 \
    dllm/pipelines/a2d/eval.py \
    --tasks "gsm8k_cot" \
    --model "a2d_mdlm" \
    --apply_chat_template \
    --num_fewshot 0 \
    --model_args "pretrained=dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1,max_new_tokens=256,steps=256,block_size=256,cfg_scale=0.0,temperature=0.0"

accelerate launch --num_processes 4 \
    dllm/pipelines/a2d/eval.py \
    --tasks "gsm8k_cot" \
    --model "a2d_bd3lm" \
    --apply_chat_template \
    --num_fewshot 0 \
    --model_args "pretrained=dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1,max_new_tokens=256,steps=256,block_size=32,cfg_scale=0.0,temperature=0.0"
```

To automatically evaluate Qwen3-0.6B-diffusion-mdlm-v0.1 and Qwen3-0.6B-diffusion-bd3lm-v0.1 on all benchmarks, run:
```shell
bash examples/a2d/mdlm/eval.sh --model_name_or_path "dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1"
bash examples/a2d/bd3lm/eval.sh --model_name_or_path "dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1"
```

| Model | GSM8K | BBH | MATH | MMLU | MMLU-Pro | Hellaswag | HumanEval | MBPP |
|---|---|---|---|---|---|---|---|---|
| Qwen3-0.6B-diffusion-mdlm-v0.1 (Reproduced) | 29.3 | 26.7 | 8.7 | 40.0 | 17.3 | 42.1 | 30.5 | 29.2 |
| Qwen3-0.6B-diffusion-bd3lm-v0.1 (Reproduced) | 46.3 | 26.6 | 12.9 | 39.1 | 13.8 | 39.3 | 46.3 | 38.2 |
| *Qwen2.5-0.5B* (Official) | 41.6 | 20.3 | 19.5 | 47.5 | 15.7 | 52.1 | 30.5 | 39.3 |
| *Qwen3-0.6B-Base* (Official) | 59.6 | 41.5 | 32.4 | 52.8 | 24.7 | 47.4 | 32.3 | 36.6 |
| Model | HumanEval | MBPP |
|---|---|---|
| Qwen2.5-Coder-0.5B-Instruct-diffusion-mdlm-v0.1 (Reproduced) | 28.1 | 23.0 |
| Qwen2.5-Coder-0.5B-Instruct-diffusion-bd3lm-v0.1 (Reproduced) | 39.0 | 33.2 |
| open-dcoder-0.5B (Official) | 20.8 | 35.2 |
| *Qwen2.5-Coder-0.5B-Instruct* (Official) | 28.0 | 52.9 |
Table 1. Results (Reproduced) are obtained using our framework, while results (Official) come from the Qwen3 Technical Report, Qwen2.5-Coder Technical Report, Qwen2.5 Blog, and Open-dLLM. Italic rows denote autoregressive models, whereas non-italic rows denote diffusion language models.