d2: Improved Techniques for Training Reasoning Diffusion Language Models

This repo is the official implementation of the paper d2: Improved Techniques for Training Reasoning Diffusion Language Models.

Code Organization

This repo provides code for three things: finetuning any-order causal LLaDA; RL post-training of any-order causal LLaDA with d2-AnyOrder and diffu-GRPO; and RL post-training of standard LLaDA with d2-StepMerge and diffu-GRPO. The code is organized as follows:

dataset: data files
diffu-grpo: code for standard LLaDA RL, including d2-StepMerge and diffu-GRPO
diffu-grpo-ao: code for any-order causal LLaDA RL, including d2-AnyOrder and diffu-GRPO
eval: evaluation code
SFT_AO: code for finetuning any-order causal LLaDA

Environment Setup

To set up the environment, run:

conda env create -f env.yml
conda activate d2
pip install datasets==4.2.0
pip install jsonlines # used in any-order finetuning
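
After installation, a quick import check can confirm that the extra packages landed in the active environment. The helper below is a minimal sketch, not part of the repo: the function name check_python_modules and the PYTHON_BIN override are assumptions introduced for illustration.

```shell
# Hypothetical helper (not part of the repo): report whether each named
# Python module can be imported by the interpreter in the active environment.
# PYTHON_BIN is an assumed override; it defaults to "python".
check_python_modules() {
    local module
    local missing=0
    for module in "$@"; do
        if "${PYTHON_BIN:-python}" -c "import ${module}" >/dev/null 2>&1; then
            echo "ok: ${module}"
        else
            echo "missing: ${module}"
            missing=1
        fi
    done
    # Non-zero exit status if at least one module failed to import.
    return "${missing}"
}
```

With the d2 environment activated, calling check_python_modules datasets jsonlines prints one status line per module and returns a non-zero status if either import fails.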

Checkpoints

We provide our finetuned any-order causal LLaDA checkpoints on Hugging Face: the checkpoint after the first finetuning stage, GuanghanWang/d2_anyorder_causal_llada_intellectsft, and the checkpoint after two rounds of finetuning, with the second round on GSM8K, GuanghanWang/d2_anyorder_causal_llada_intellectsft_gsm8k.
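
If you prefer to fetch a checkpoint ahead of time, the sketch below shows one possible way. It assumes the huggingface-cli tool from the huggingface_hub package is installed (newer releases expose the same command as hf download). The function name download_checkpoint, the DRY_RUN switch, and the ./checkpoints layout are hypothetical, and whether the training scripts read from a local directory is not stated in this README.

```shell
# Hypothetical helper (not part of the repo): download a released checkpoint
# into a local directory. Set DRY_RUN=1 to print the command instead of
# running it.
download_checkpoint() {
    local repo_id="$1"
    # Assumed local layout: ./checkpoints/<repository name without the owner>.
    local target_dir="${2:-./checkpoints/${repo_id##*/}}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "huggingface-cli download ${repo_id} --local-dir ${target_dir}"
    else
        huggingface-cli download "${repo_id}" --local-dir "${target_dir}"
    fi
}
```

For example, download_checkpoint GuanghanWang/d2_anyorder_causal_llada_intellectsft_gsm8k would place the files under ./checkpoints/d2_anyorder_causal_llada_intellectsft_gsm8k.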

Any-Order Finetuning

We provide the code for any-order causal finetuning for full transparency. For the d2-AnyOrder RL experiments, you can instead use our released any-order causal LLaDA checkpoint finetuned on GSM8K and skip to the d2-AnyOrder section.

Here, we provide the data and code for the second finetuning phase: starting from the checkpoint finetuned on 150k sequences generated by LLaDA-2.0-mini from 150k IntellectSFT prompts, we further finetune it on LLaDA-2.0-mini generated sequences from GSM8K/MATH500 prompts. You can also run any-order finetuning on the model using your own data.

Example bash scripts for any-order causal finetuning

GSM8K

cd SFT_AO
bash ./bash_scripts/finetune_gsm8k.sh > ./logs/finetune_gsm8k.log 2>&1

MATH500

cd SFT_AO
bash ./bash_scripts/finetune_math500.sh > ./logs/finetune_math500.log 2>&1
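
The commands in this README redirect output into ./logs. The shell opens the redirection target before starting the script, so the command fails immediately if that directory does not exist. The wrapper below is a minimal sketch that creates the directory first; the name run_logged is hypothetical, and whether the repo already ships a logs directory is not stated here.

```shell
# Hypothetical wrapper (not part of the repo): run a bash script and capture
# stdout and stderr in ./logs/<script name>.log, creating ./logs if needed.
run_logged() {
    local script="$1"
    local log_file="./logs/$(basename "${script}" .sh).log"
    mkdir -p ./logs
    bash "${script}" > "${log_file}" 2>&1
}
```

From inside SFT_AO, run_logged ./bash_scripts/finetune_gsm8k.sh would then write to ./logs/finetune_gsm8k.log, matching the log path used in the commands above.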

d2-AnyOrder

The code for d2-AnyOrder and the corresponding diffu-GRPO experiments is in the diffu-grpo-ao directory. The RL code is applied to our any-order causal checkpoint finetuned on GSM8K.

diffu-grpo-ao/bash_scripts contains the bash scripts we used to run the RL experiments.

Example bash scripts for running the RL experiments:

diffu-GRPO

cd diffu-grpo-ao
bash ./bash_scripts/anyorder_gsm8k_diffugrpo.sh > ./logs/anyorder_gsm8k_diffugrpo.log 2>&1

d2-AnyOrder

cd diffu-grpo-ao
bash ./bash_scripts/anyorder_gsm8k_d2anyorder.sh > ./logs/anyorder_gsm8k_d2anyorder.log 2>&1

d2-StepMerge

The code for d2-StepMerge and the corresponding diffu-GRPO experiments is in the diffu-grpo directory.

diffu-grpo/bash_scripts contains the bash scripts we used to run the RL experiments.

Example bash scripts for running the RL experiments (replace DATASET with the corresponding dataset name):

diffu-GRPO

cd diffu-grpo
bash ./bash_scripts/DATASET_diffugrpo.sh > ./logs/DATASET_diffugrpo.log 2>&1

d2-StepMerge

cd diffu-grpo
bash ./bash_scripts/DATASET_d2stepmerge.sh > ./logs/DATASET_d2stepmerge.log 2>&1
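
To make the DATASET substitution concrete, the sketch below expands the placeholder into full command lines for one method and a list of datasets. The function name print_rl_commands is hypothetical, and the dataset spelling used in the example (gsm8k) is an assumption based on the any-order script names; check diffu-grpo/bash_scripts for the actual file names.

```shell
# Hypothetical helper (not part of the repo): print the command line for each
# dataset by substituting the DATASET placeholder. Nothing is executed.
# Usage: print_rl_commands METHOD DATASET [DATASET ...]
print_rl_commands() {
    local method="$1"
    local dataset
    shift
    for dataset in "$@"; do
        echo "bash ./bash_scripts/${dataset}_${method}.sh > ./logs/${dataset}_${method}.log 2>&1"
    done
}
```

For instance, print_rl_commands d2stepmerge gsm8k prints the d2-StepMerge command with DATASET replaced by gsm8k, which you can inspect before running it from the diffu-grpo directory.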

Evaluation

The evaluation code is in the eval directory.

eval/bash_scripts contains the bash scripts we use to run the evaluation.

The following example bash scripts run the evaluation for the post-trained any-order causal LLaDA. Here, we conduct d2-AnyOrder RL on our provided any-order causal LLaDA checkpoint trained on GSM8K.

any-order causal LLaDA

cd eval
bash ./bash_scripts/eval_anyorder_gsm8k_llada.sh > ./logs/eval_anyorder_gsm8k_llada.log 2>&1

diffu-GRPO

cd eval
bash ./bash_scripts/eval_anyorder_gsm8k_diffugrpo.sh > ./logs/eval_anyorder_gsm8k_diffugrpo.log 2>&1

d2-AnyOrder

cd eval
bash ./bash_scripts/eval_anyorder_gsm8k_d2anyorder.sh > ./logs/eval_anyorder_gsm8k_d2anyorder.log 2>&1

Example bash scripts for running the evaluation for the LLaDA-8B-Instruct experiments (replace DATASET with the corresponding dataset name):

LLaDA-8B-Instruct

cd eval
bash ./bash_scripts/eval_DATASET_llada.sh > ./logs/eval_DATASET_llada.log 2>&1

diffu-GRPO

cd eval
bash ./bash_scripts/eval_DATASET_diffugrpo.sh > ./logs/eval_DATASET_diffugrpo.log 2>&1

d2-StepMerge

cd eval
bash ./bash_scripts/eval_DATASET_d2stepmerge.sh > ./logs/eval_DATASET_d2stepmerge.log 2>&1

After generating samples with the bash scripts, run parse_and_get_acc.py to parse the JSON files and compute the accuracy. Remember to change the value of directory in parse_and_get_acc.py before running it.

Acknowledgements

This repository builds on d1.

Citation

If you find this work useful, please consider citing:

@article{wang2025d2,
  title={d2: Improved techniques for training reasoning diffusion language models},
  author={Wang, Guanghan and Turok, Gilad and Schiff, Yair and Arriola, Marianne and Kuleshov, Volodymyr},
  journal={arXiv preprint arXiv:2509.21474},
  year={2025}
}
