Hello,
First, thank you for your excellent work on D2F and for your commitment to open-source development!
I'm writing to discuss a surprising result I found when evaluating D2F's planning capabilities, and I'm hoping you might have some insight.
Motivation
Inspired by the recent PC-Sampler paper, which uses countdown and sudoku to test the planning abilities of dLLMs, I wanted to evaluate how a Block-Diffusion model like D2F performs on these benchmarks. The paper highlights the strong planning skills of LLaDA, the base model for D2F.
Experiment and Results
I forked your repository and integrated the evaluation logic from PC-Sampler. While D2F performed reasonably well on countdown, its performance completely collapsed on sudoku:
Countdown (50 samples): 34% accuracy
Sudoku (50 samples): 0% accuracy
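For context, this is roughly how I score the sudoku task; it is a minimal sketch of exact-match grading, assuming each completion contains the 81 solution digits somewhere in its text (the actual PC-Sampler harness may parse differently):

```python
# Hedged sketch of the exact-match scoring I use for sudoku.
# Assumption: a prediction is correct only if it contains all 81
# solution digits, in order; the real harness may be stricter/looser.
import re

def extract_grid(text: str) -> str:
    """Pull the first 81 sudoku digits (1-9) out of a model completion."""
    digits = re.sub(r"[^1-9]", "", text)
    return digits[:81] if len(digits) >= 81 else ""

def sudoku_accuracy(predictions: list[str], solutions: list[str]) -> float:
    """Exact-match accuracy: a sample counts only if all 81 cells agree."""
    correct = sum(
        1 for pred, sol in zip(predictions, solutions)
        if extract_grid(pred) == sol
    )
    return correct / len(predictions)

# Toy check: one fully correct grid, one wrong one -> 0.5
sol = ("534678912672195348198342567859761423426853791"
       "713924856961537284287419635345286179")
preds = ["Solution: " + sol, "Solution: " + "1" * 81]
print(sudoku_accuracy(preds, [sol, sol]))  # 0.5
```

Under this all-or-nothing metric, even grids that are mostly right score zero, which makes the 0% result stark but not impossible.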
The Core Question
This zero-accuracy result on sudoku is puzzling. As shown in the PC-Sampler paper (see attached table), the base LLaDA models achieve respectable scores on this task (e.g., 23.8% accuracy with confidence-based sampling). Since D2F is fine-tuned from LLaDA, I expected it to retain at least some of this planning capability.
I have already tried tuning hyperparameters like block_size and using different sampling strategies (top-k), but the sudoku score remained at 0.
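To be concrete about what I mean by confidence-based top-k here, this is a toy NumPy-only sketch of the decoding rule I experimented with; it is not D2F's actual implementation, and the tensor names and shapes are my assumptions:

```python
# Hedged sketch of confidence-based top-k unmasking for a dLLM decode step.
# Assumption: one step commits the k still-masked positions whose best-token
# probability is highest; D2F's real block-wise scheduler differs in detail.
import numpy as np

def topk_unmask_step(logits: np.ndarray, masked: np.ndarray, k: int):
    """One decode step over a (seq_len, vocab) logit matrix.

    masked: (seq_len,) bool, True where the token is still [MASK].
    Returns (positions_to_fill, token_ids) chosen this step.
    """
    # Softmax per position to get per-slot confidences
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    confidence = probs.max(axis=-1)          # best-token probability per slot
    confidence[~masked] = -np.inf            # never rewrite committed tokens
    k = min(k, int(masked.sum()))
    positions = np.argsort(-confidence)[:k]  # k most confident masked slots
    tokens = probs[positions].argmax(axis=-1)
    return positions, tokens

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 16))
masked = np.ones(8, dtype=bool)
pos, tok = topk_unmask_step(logits, masked, k=2)
print(pos.shape, tok.shape)  # (2,) (2,)
```

Varying k (and block_size) changes how greedily the model commits tokens, but in my runs none of these settings moved the sudoku score off zero.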
This leads me to wonder:
- Is it possible that the block-diffusion architecture itself has inherent limitations for highly structured, long-horizon planning tasks like Sudoku, compared to the base LLaDA model?
- Or is it more likely that I have introduced a subtle bug in my implementation?
My modified fork is available here for review: Discrete-Diffusion-Forcing. My experiments were run on A100-80GB GPUs in the same Python environment as yours.
Any thoughts or insights you could provide would be greatly appreciated. Thank you!