Hello,
First, thank you for your excellent work on D2F and for your commitment to open-source development!
I'm writing to discuss a surprising result I found when evaluating D2F's planning capabilities, and I'm hoping you might have some insight.
Motivation
Inspired by the recent PC-Sampler paper, which uses countdown and sudoku to test the planning abilities of dLLMs, I wanted to evaluate how a Block-Diffusion model like D2F performs on these benchmarks. The paper highlights the strong planning skills of LLaDA, the base model for D2F.
Experiment and Results
I forked your repository and integrated the evaluation logic from PC-Sampler. While D2F performed reasonably well on countdown, its performance completely collapsed on sudoku:
Countdown (50 samples): 34% accuracy
Sudoku (50 samples): 0% accuracy
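For context, this is roughly how I score the sudoku task; it is a minimal sketch of exact-match grading, assuming each completion contains the 81 solution digits somewhere in its text (the actual PC-Sampler harness may parse differently):

```python
# Hedged sketch of the exact-match scoring I use for sudoku.
# Assumption: a prediction is correct only if it contains all 81
# solution digits, in order; the real harness may be stricter/looser.
import re

def extract_grid(text: str) -> str:
    """Pull the first 81 sudoku digits (1-9) out of a model completion."""
    digits = re.sub(r"[^1-9]", "", text)
    return digits[:81] if len(digits) >= 81 else ""

def sudoku_accuracy(predictions: list[str], solutions: list[str]) -> float:
    """Exact-match accuracy: a sample counts only if all 81 cells agree."""
    correct = sum(
        1 for pred, sol in zip(predictions, solutions)
        if extract_grid(pred) == sol
    )
    return correct / len(predictions)

# Toy check: one fully correct grid, one wrong one -> 0.5
sol = ("534678912672195348198342567859761423426853791"
       "713924856961537284287419635345286179")
preds = ["Solution: " + sol, "Solution: " + "1" * 81]
print(sudoku_accuracy(preds, [sol, sol]))  # 0.5
```

Under this all-or-nothing metric, even grids that are mostly right score zero, which makes the 0% result stark but not impossible.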
The Core Question
This zero-accuracy result on sudoku is puzzling. As shown in the PC-Sampler paper (see attached table), the base LLaDA models achieve respectable scores on this task (e.g., 23.8% accuracy with confidence-based sampling). Since D2F is fine-tuned from LLaDA, I expected it to retain at least some of this planning capability.
I have already tried tuning hyperparameters like block_size and using different sampling strategies (top-k), but the sudoku score remained at 0.
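To be concrete about what I mean by confidence-based top-k here, this is a toy NumPy-only sketch of the decoding rule I experimented with; it is not D2F's actual implementation, and the tensor names and shapes are my assumptions:

```python
# Hedged sketch of confidence-based top-k unmasking for a dLLM decode step.
# Assumption: one step commits the k still-masked positions whose best-token
# probability is highest; D2F's real block-wise scheduler differs in detail.
import numpy as np

def topk_unmask_step(logits: np.ndarray, masked: np.ndarray, k: int):
    """One decode step over a (seq_len, vocab) logit matrix.

    masked: (seq_len,) bool, True where the token is still [MASK].
    Returns (positions_to_fill, token_ids) chosen this step.
    """
    # Softmax per position to get per-slot confidences
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    confidence = probs.max(axis=-1)          # best-token probability per slot
    confidence[~masked] = -np.inf            # never rewrite committed tokens
    k = min(k, int(masked.sum()))
    positions = np.argsort(-confidence)[:k]  # k most confident masked slots
    tokens = probs[positions].argmax(axis=-1)
    return positions, tokens

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 16))
masked = np.ones(8, dtype=bool)
pos, tok = topk_unmask_step(logits, masked, k=2)
print(pos.shape, tok.shape)  # (2,) (2,)
```

Varying k (and block_size) changes how greedily the model commits tokens, but in my runs none of these settings moved the sudoku score off zero.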
This leads me to wonder:
- Is it possible that the block-diffusion architecture itself has inherent limitations for highly structured, long-horizon planning tasks like Sudoku, compared to the base LLaDA model?
- Or is it more likely that I have introduced a subtle bug in my implementation?
My modified fork is available here for review: Discrete-Diffusion-Forcing. My experiments were run on A100-80GB GPUs in the same Python environment as yours.
Any thoughts or insights you could provide would be greatly appreciated. Thank you!