A2D (AR-to-Diffusion)

Hugging Face Checkpoints · W&B Report

This directory provides the following resources:

Files

# example entry points for training / inference / evaluation
examples/a2d
├── bd3lm               # Block Discrete Denoising Diffusion Language Modeling (https://arxiv.org/abs/2503.09573)
│   ├── chat.py
│   ├── eval.sh
│   ├── pt.py
│   ├── sample.py
│   └── sft.py
├── mdlm                # Masked Diffusion Language Modeling (https://arxiv.org/abs/2406.07524)
│   ├── chat.py
│   ├── eval.sh
│   ├── pt.py
│   ├── sample.py
│   └── sft.py
└── README.md

Setup

(Optional) If your source AR model is not already supported in dllm/pipelines/a2d/models:

  1. Modify the original autoregressive modeling file to support non-causal attention. See modeling_qwen3.py for an example.

  2. Ensure that the attention behavior is correct by running the attention tests:

    # Run A2D attention tests
    pytest scripts/tests/test_attention.py -k "test_a2d"
    # (Optional; required for the bd3lm trainer)
    pytest scripts/tests/test_attention.py -k "test_bd3lm"

Before training, convert and save the source autoregressive model with non-causal attention. For example, to save Qwen/Qwen3-0.6B with its original weights but with the non-causal attention defined in modeling_qwen3.py:

python dllm/pipelines/a2d/convert.py --model_name_or_path "Qwen/Qwen3-0.6B" --output_dir ".models/a2d/Qwen3-0.6B"
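
To double-check that the converted checkpoint really attends bidirectionally (beyond the pytest suite above), a quick sanity check is to edit a future token and verify that the hidden states at earlier positions change; under causal attention they would not. The snippet below is a minimal illustration, assuming the conversion command above has produced .models/a2d/Qwen3-0.6B:

# Minimal bidirectional-attention sanity check (illustrative; not the repo's test suite).
import torch
from transformers import AutoModel, AutoTokenizer

path = ".models/a2d/Qwen3-0.6B"  # assumed output of convert.py above
tok = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path, trust_remote_code=True).eval()

ids_a = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt").input_ids
ids_b = ids_a.clone()
ids_b[0, -1] = (ids_b[0, -1] + 1) % model.config.vocab_size  # change only the *last* token

with torch.no_grad():
    h_a = model(input_ids=ids_a).last_hidden_state
    h_b = model(input_ids=ids_b).last_hidden_state

# Position 0 precedes the edited token; with causal attention these would match exactly.
print("non-causal attention:", not torch.allclose(h_a[0, 0], h_b[0, 0], atol=1e-5))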

Warmup: MDLM

In this section, we walk through toy examples of continual pretraining and SFT of Qwen/Qwen3-0.6B on small datasets to generate text with MDLM.

Continual Pretraining

To train Qwen/Qwen3-0.6B on the tiny-shakespeare dataset with MDLM, run (on 1 GPU):

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 1 \
    examples/a2d/mdlm/pt.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "Trelis/tiny-shakespeare" \
    --text_field "Text" \
    --insert_eos False \
    --max_length 128 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --output_dir ".models/a2d/Qwen3-0.6B/mdlm/tiny-shakespeare"
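
Under the hood, MDLM-style training masks a random fraction of tokens at a sampled noise level and scores only the masked positions with a weighted cross-entropy. The sketch below is a simplified illustration of that objective, assuming a linear masking schedule and a generic Hugging Face model with a mask token id; it is not the trainer's actual code:

# Illustrative MDLM-style loss (simplified; not the repo's trainer code).
import torch
import torch.nn.functional as F

def mdlm_loss(model, input_ids, mask_token_id):
    b, l = input_ids.shape
    # Sample a noise level t in (0, 1] per sequence, then mask each token with probability t.
    t = torch.rand(b, 1, device=input_ids.device).clamp(min=1e-3)
    is_masked = torch.rand(b, l, device=input_ids.device) < t
    noisy = torch.where(is_masked, torch.full_like(input_ids, mask_token_id), input_ids)

    logits = model(input_ids=noisy).logits  # non-causal: every position sees the whole sequence
    ce = F.cross_entropy(logits.transpose(1, 2), input_ids, reduction="none")

    # Only masked positions contribute; 1/t is the ELBO weight for a linear schedule.
    return (ce * is_masked / t).sum() / is_masked.sum().clamp(min=1)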

To sample from the model interactively:

# Enter a prompt (e.g., "First citizen: Before we proceed any further, hear me speak."),
# or press Enter to let the model generate text from scratch.
python -u examples/a2d/mdlm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/mdlm/tiny-shakespeare/checkpoint-final" \
    --chat_template False --remasking "random" --temperature 0.7
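
For intuition, generation starts from a fully masked completion and commits a subset of positions at each denoising step; with --remasking "random" that subset is chosen at random. The loop below is a schematic illustration (function and variable names are ours, not chat.py's):

# Schematic MDLM sampling loop (illustrative; not chat.py itself).
import torch

@torch.no_grad()
def mdlm_sample(model, prompt_ids, gen_len, steps, mask_token_id, temperature=0.7):
    device = prompt_ids.device
    x = torch.cat([prompt_ids, torch.full((1, gen_len), mask_token_id, device=device)], dim=1)
    masked = x[0] == mask_token_id
    per_step = max(1, int(masked.sum()) // steps)  # positions committed per step

    while masked.any():
        logits = model(input_ids=x).logits[0] / max(temperature, 1e-5)
        sampled = torch.multinomial(torch.softmax(logits, dim=-1), 1).squeeze(-1)
        # "random" remasking: commit a random subset of the still-masked positions.
        idx = masked.nonzero().squeeze(-1)
        keep = idx[torch.randperm(idx.numel(), device=device)[:per_step]]
        x[0, keep] = sampled[keep]
        masked[keep] = False
    return x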
Example of pretraining on a larger dataset (OpenWebText) in streaming mode

To train Qwen/Qwen3-0.6B on the openwebtext dataset in streaming mode with MDLM, run (on 8 GPUs):

accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/a2d/mdlm/pt.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "dylanebert/openwebtext" \
    --text_field "text" \
    --streaming True \
    --insert_eos True \
    --max_length 512 \
    --max_steps 20000 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --eval_strategy "no" \
    --output_dir ".models/a2d/Qwen3-0.6B/mdlm/openwebtext"
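
With --streaming True the corpus is read lazily rather than downloaded and tokenized up front. Roughly, the data path looks like the sketch below (an illustration built on the datasets streaming API, not the repo's actual loader); it also packs documents into max_length chunks with EOS between them, as implied by --insert_eos True:

# Rough illustration of streaming pretraining data prep (not the repo's loader).
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(".models/a2d/Qwen3-0.6B")
stream = load_dataset("dylanebert/openwebtext", split="train", streaming=True)  # split assumed

def packed_chunks(stream, max_length=512):
    buffer = []
    for example in stream:
        buffer += tok(example["text"])["input_ids"] + [tok.eos_token_id]  # --insert_eos True
        while len(buffer) >= max_length:
            yield buffer[:max_length]
            buffer = buffer[max_length:]

print(len(next(packed_chunks(stream))))  # 512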

To sample from the model interactively:

# Enter a prompt (e.g., "Lebron James is"),
# or press Enter to let the model generate text from scratch.
python -u examples/a2d/mdlm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/mdlm/openwebtext/checkpoint-final" \
    --chat_template False --remasking "random" --temperature 0.7

SFT

To train Qwen/Qwen3-0.6B on the alpaca dataset with MDLM, run (on 8 GPUs):

accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/a2d/mdlm/sft.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "tatsu-lab/alpaca" \
    --max_length 512 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --output_dir ".models/a2d/Qwen3-0.6B/mdlm/alpaca"
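
For SFT, each Alpaca record is rendered with the chat template and the loss is typically restricted to the response tokens. The helper below is a simplified illustration of that label construction (not sft.py itself); the trainer's exact masking convention may differ:

# Simplified SFT example construction (illustrative; not sft.py).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(".models/a2d/Qwen3-0.6B")

def build_example(instruction, output, max_length=512):
    messages = [{"role": "user", "content": instruction}]
    prompt_ids = tok.apply_chat_template(messages, add_generation_prompt=True)
    answer_ids = tok(output, add_special_tokens=False)["input_ids"] + [tok.eos_token_id]
    input_ids = (prompt_ids + answer_ids)[:max_length]
    # Loss is usually computed only on the response: mask prompt positions with -100.
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}

ex = build_example("Give three tips for staying healthy.", "1. Eat well. 2. Sleep. 3. Exercise.")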

To chat with the model:

python -u examples/a2d/mdlm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/mdlm/alpaca/checkpoint-final" --block_size 32

Warmup: BD3LM

In this section, we walk through toy examples of continual pretraining and SFT of Qwen/Qwen3-0.6B on small datasets to generate text with BD3LM.

Continual Pretraining

To train Qwen/Qwen3-0.6B on the tiny-shakespeare dataset with BD3LM, run (on 1 GPU):

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 1 \
    examples/a2d/bd3lm/pt.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "Trelis/tiny-shakespeare" \
    --text_field "Text" \
    --insert_eos False \
    --max_length 128 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --block_size 32 \
    --output_dir ".models/a2d/Qwen3-0.6B/bd3lm/tiny-shakespeare"
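
The key difference from MDLM is the --block_size argument: BD3LM partitions the sequence into fixed-size blocks, with bidirectional attention inside a block and causal attention across blocks. A small illustration of that mask (simplified; the repo's training-time mask may be more involved):

# Illustrative block-causal attention mask for BD3LM (not the repo's implementation).
import torch

def block_causal_mask(seq_len, block_size):
    blocks = torch.arange(seq_len) // block_size  # block index of each position
    # Position i may attend to j iff j's block is not after i's block:
    # full attention within a block, causal attention across blocks.
    return blocks[None, :] <= blocks[:, None]

print(block_causal_mask(6, 2).int())
# tensor([[1, 1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1]])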

To sample from the model interactively:

# Enter a prompt (e.g., "First citizen: Before we proceed any further, hear me speak."),
# or press Enter to let the model generate text from scratch.
python -u examples/a2d/bd3lm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/bd3lm/tiny-shakespeare/checkpoint-final" \
    --chat_template False --block_size 32 --remasking "random" --temperature 0.7
Example of pretraining on a larger dataset (OpenWebText) in streaming mode

To train Qwen/Qwen3-0.6B on the openwebtext dataset in streaming mode with BD3LM, run (on 8 GPUs):

accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/a2d/bd3lm/pt.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "dylanebert/openwebtext" \
    --text_field "text" \
    --streaming True \
    --insert_eos True \
    --max_length 512 \
    --max_steps 20000 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --eval_strategy "no" \
    --block_size 32 \
    --output_dir ".models/a2d/Qwen3-0.6B/bd3lm/openwebtext"

To sample from the model interactively:

# Enter a prompt (e.g., "Lebron James is"),
# or press Enter to let the model generate text from scratch.
python -u examples/a2d/bd3lm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/bd3lm/openwebtext/checkpoint-final" \
    --chat_template False --block_size 32 --remasking "random" --temperature 0.7

SFT

To train Qwen/Qwen3-0.6B on the alpaca dataset with BD3LM, run (on 8 GPUs):

accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/a2d/bd3lm/sft.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "tatsu-lab/alpaca" \
    --max_length 512 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --block_size 32 \
    --output_dir ".models/a2d/Qwen3-0.6B/bd3lm/alpaca"

To chat with the model:

python -u examples/a2d/bd3lm/chat.py \
    --model_name_or_path ".models/a2d/Qwen3-0.6B/bd3lm/alpaca/checkpoint-final" --block_size 32
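
At inference time BD3LM is semi-autoregressive: blocks are generated left to right, and within each block the masked positions are denoised over a few steps, much like the MDLM loop sketched earlier. A rough illustration (our names, not chat.py's):

# Rough semi-autoregressive BD3LM sampling sketch (illustrative; not chat.py).
import torch

@torch.no_grad()
def bd3lm_sample(model, prompt_ids, num_blocks, block_size, steps_per_block,
                 mask_token_id, temperature=0.7):
    x = prompt_ids
    for _ in range(num_blocks):
        # Append one fully masked block, then denoise only that block.
        x = torch.cat([x, torch.full((1, block_size), mask_token_id, device=x.device)], dim=1)
        start = x.shape[1] - block_size
        per_step = -(-block_size // steps_per_block)  # ceil, so the block fills within the budget
        for _ in range(steps_per_block):
            logits = model(input_ids=x).logits[0, start:] / max(temperature, 1e-5)
            sampled = torch.multinomial(torch.softmax(logits, dim=-1), 1).squeeze(-1)
            masked = (x[0, start:] == mask_token_id).nonzero().squeeze(-1)
            if masked.numel() == 0:
                break
            # Commit a random subset of the still-masked positions ("random" remasking).
            keep = masked[torch.randperm(masked.numel(), device=x.device)[:per_step]]
            x[0, start + keep] = sampled[keep]
    return x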

Tiny-A2D

Here we show the exact commands we use to train, interact with, and evaluate the Tiny-A2D models: Qwen3-0.6B-diffusion-mdlm-v0.1 and Qwen3-0.6B-diffusion-bd3lm-v0.1. For training curves and other details, please see the Tiny-A2D Report blog post.

Training

Read Useful tips for training and (optional) Slurm setup before training.

The Tiny-A2D models are trained purely with SFT.

To reproduce Qwen3-0.6B-diffusion-mdlm-v0.1 (with MDLM & SFT), run the command below (about 10 hours on 64 A100s):

WANDB_MODE=online sbatch --nodes=8 --gres=gpu:8 scripts/train.slurm.sh \
    --accelerate_config "zero2" \
    --script_path "examples/a2d/mdlm/sft.py" \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk+OpenCoder-LLM/opc-sft-stage1[lang:python]+OpenCoder-LLM/opc-sft-stage2[lang:python]" \
    --max_length 1024 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 2 \
    --output_dir ".models/a2d/Qwen3-0.6B/tulu-3-sft-mixture+smoltalk+opc-sft-stage1&2/epochs-10-bs-2048-len-1024"
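
For reference, the effective global batch size of this run is 8 nodes × 8 GPUs × 16 sequences per device × 2 gradient-accumulation steps = 2048 sequences, matching the bs-2048 tag in the output path; the BD3LM run below uses the same configuration.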

To reproduce Qwen3-0.6B-diffusion-bd3lm-v0.1 (with BD3LM & SFT), run the command below (about 10 hours on 64 A100s):

WANDB_MODE=online sbatch --nodes=8 --gres=gpu:8 scripts/train.slurm.sh \
    --accelerate_config "zero2" \
    --script_path "examples/a2d/bd3lm/sft.py" \
    --model_name_or_path ".models/a2d/Qwen3-0.6B" \
    --dataset_args "allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk+OpenCoder-LLM/opc-sft-stage1[lang:python]+OpenCoder-LLM/opc-sft-stage2[lang:python]" \
    --max_length 512 \
    --num_train_epochs 10 \
    --learning_rate 1e-4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 2 \
    --block_size 32 \
    --output_dir ".models/a2d/Qwen3-0.6B/tulu-3-sft-mixture+smoltalk+opc-sft-stage1&2/epochs-10-bs-2048-len-512-bls-32"

Inference

To chat with the model:

python -u examples/a2d/mdlm/chat.py --model_name_or_path "dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1"
python -u examples/a2d/bd3lm/chat.py --model_name_or_path "dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1"

Evaluation

Read (optional) Evaluation setup before running evaluation.

To evaluate Qwen3-0.6B-diffusion-mdlm-v0.1 and Qwen3-0.6B-diffusion-bd3lm-v0.1 on gsm8k using 4 GPUs, run:

# Use model_args to adjust the sampler arguments for evaluation.
accelerate launch --num_processes 4 \
    dllm/pipelines/a2d/eval.py \
    --tasks "gsm8k_cot" \
    --model "a2d_mdlm" \
    --apply_chat_template \
    --num_fewshot 0 \
    --model_args "pretrained=dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1,max_new_tokens=256,steps=256,block_size=256,cfg_scale=0.0,temperature=0.0"

accelerate launch --num_processes 4 \
    dllm/pipelines/a2d/eval.py \
    --tasks "gsm8k_cot" \
    --model "a2d_bd3lm" \
    --apply_chat_template \
    --num_fewshot 0 \
    --model_args "pretrained=dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1,max_new_tokens=256,steps=256,block_size=32,cfg_scale=0.0,temperature=0.0"
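
Note the sampler settings: for the MDLM model, block_size=256 equals max_new_tokens, so the whole completion is denoised as a single block, while the BD3LM model generates semi-autoregressively in blocks of 32; in both cases steps=256 means roughly one token is committed per denoising step.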

To automatically evaluate Qwen3-0.6B-diffusion-mdlm-v0.1 and Qwen3-0.6B-diffusion-bd3lm-v0.1 on all benchmarks, run:

bash examples/a2d/mdlm/eval.sh --model_name_or_path "dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1" 
bash examples/a2d/bd3lm/eval.sh --model_name_or_path "dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1" 

Evaluation results

| Model | GSM8K | BBH | MATH | MMLU | MMLU-Pro | HellaSwag | HumanEval | MBPP |
|---|---|---|---|---|---|---|---|---|
| Qwen3-0.6B-diffusion-mdlm-v0.1 (Reproduced) | 29.3 | 26.7 | 8.7 | 40.0 | 17.3 | 42.1 | 30.5 | 29.2 |
| Qwen3-0.6B-diffusion-bd3lm-v0.1 (Reproduced) | 46.3 | 26.6 | 12.9 | 39.1 | 13.8 | 39.3 | 46.3 | 38.2 |
| *Qwen2.5-0.5B (Official)* | 41.6 | 20.3 | 19.5 | 47.5 | 15.7 | 52.1 | 30.5 | 39.3 |
| *Qwen3-0.6B-Base (Official)* | 59.6 | 41.5 | 32.4 | 52.8 | 24.7 | 47.4 | 32.3 | 36.6 |

| Model | HumanEval | MBPP |
|---|---|---|
| Qwen2.5-Coder-0.5B-Instruct-diffusion-mdlm-v0.1 (Reproduced) | 28.1 | 23.0 |
| Qwen2.5-Coder-0.5B-Instruct-diffusion-bd3lm-v0.1 (Reproduced) | 39.0 | 33.2 |
| open-dcoder-0.5B (Official) | 20.8 | 35.2 |
| *Qwen2.5-Coder-0.5B-Instruct (Official)* | 28.0 | 52.9 |

Table 1. Results (Reproduced) are obtained using our framework, while results (Official) come from the Qwen3 Technical Report, Qwen2.5-Coder Technical Report, Qwen2.5 Blog, and Open-dLLM. Italic rows denote autoregressive models, whereas non-italic rows denote diffusion language models.