
Question Regarding D2F's Performance on Planning Tasks #9

@Crys-Chen


Hello,

First, thank you for your excellent work on D2F and for your commitment to open-source development!

I'm writing to discuss a surprising result I found when evaluating D2F's planning capabilities, and I'm hoping you might have some insight.

Motivation

Inspired by the recent PC-Sampler paper, which uses countdown and sudoku to test the planning abilities of dLLMs, I wanted to evaluate how a Block-Diffusion model like D2F performs on these benchmarks. The paper highlights the strong planning skills of LLaDA, the base model for D2F.

Experiment and Results

I forked your repository and integrated the evaluation logic from PC-Sampler. While D2F performed reasonably well on countdown, its performance completely collapsed on sudoku:

Countdown (50 samples): 34% accuracy
Sudoku (50 samples): 0% accuracy

The Core Question

This zero-accuracy result on sudoku is puzzling. As shown in the PC-Sampler paper (see attached table), the base LLaDA models achieve respectable scores on this task (e.g., 23.8% accuracy with confidence-based sampling). Since D2F is fine-tuned from LLaDA, I expected it to retain at least some of this planning capability.

I have already tried tuning hyperparameters such as block_size and switching sampling strategies (top-k), but the sudoku accuracy remained at 0%.

This leads me to wonder:

  • Is it possible that the block-diffusion architecture itself has inherent limitations for highly structured, long-horizon planning tasks like Sudoku compared to the base LLaDA model?

  • Or, is it more likely that I have introduced a subtle bug in my implementation?
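To help tell these two possibilities apart, each sample could be classified as unparseable, wrong, or correct, so that a 0% score caused by answer extraction is not mistaken for a modeling failure. The following is only a minimal sketch under stated assumptions: puzzles are N×N grids with 0 marking blanks, the model's answer is taken to be the first N·N digits in its raw output, and the helper names (`parse_grid`, `is_valid_solution`, `diagnose`) are hypothetical. None of this is taken from the PC-Sampler or D2F code, whose format may differ.

```python
import math
import re


def parse_grid(text, size):
    """Naively take the first size*size digits from raw output; None if too few."""
    digits = re.findall(r"\d", text)
    if len(digits) < size * size:
        return None
    cells = [int(d) for d in digits[: size * size]]
    return [cells[r * size:(r + 1) * size] for r in range(size)]


def is_valid_solution(puzzle, grid):
    """True if givens are preserved and each row, column, and box holds 1..size."""
    size = len(puzzle)
    box = math.isqrt(size)  # assumes size is a perfect square (4, 9, ...)
    target = set(range(1, size + 1))
    # Every pre-filled cell of the puzzle must be unchanged in the answer.
    for r in range(size):
        for c in range(size):
            if puzzle[r][c] != 0 and puzzle[r][c] != grid[r][c]:
                return False
    rows = [set(row) for row in grid]
    cols = [set(grid[r][c] for r in range(size)) for c in range(size)]
    boxes = [
        set(grid[r][c] for r in range(br, br + box) for c in range(bc, bc + box))
        for br in range(0, size, box)
        for bc in range(0, size, box)
    ]
    return all(group == target for group in rows + cols + boxes)


def diagnose(puzzle, output_text):
    """Classify one sample as 'unparseable', 'wrong', or 'correct'."""
    grid = parse_grid(output_text, len(puzzle))
    if grid is None:
        return "unparseable"
    return "correct" if is_valid_solution(puzzle, grid) else "wrong"
```

If most of the 50 samples come back as "unparseable", the problem is more likely in prompt formatting or answer extraction than in the model's planning ability. The digit-based parser is deliberately crude and would misread outputs that contain other numbers before the grid.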

My modified fork is available here for review: Discrete-Diffusion-Forcing. My experiments were conducted on A100-80GB GPUs, using the same Python environment as yours.

Any thoughts or insights you could provide would be greatly appreciated. Thank you!
