feat(eval): add loglikelihood support for BD3LM eval harness #95
zamal-db wants to merge 2 commits into ZHZisZZ:main
Conversation
Implement Monte Carlo ELBO loglikelihood estimation for BD3LM models, enabling evaluation on likelihood-based benchmarks (ARC, HellaSwag, etc.). The key architectural difference from MDLM is `_get_logits`, which constructs the `[x_t ⊕ x_0]` input with block-diffusion attention (`M_BD | M_OBC | M_BC`) and duplicated position IDs, matching the BD3LM training procedure exactly.

- Add `_get_logits` with `[x_t ⊕ x_0]` construction and block-diffusion mask
- Add `_create_attention_mask` supporting both SDPA and flex_attention
- Add `_forward_process`, `_get_loglikelihood`, `_suffix_greedy_prediction`
- Add `loglikelihood` public API (lm-eval interface)
- Add `mc_num`, `batch_size`, `is_check_greedy` to `BD3LMEvalConfig`
- Reuse `_create_bd3lm_attention_mask` from `dllm.core.trainers.bd3lm`

Closes ZHZisZZ#93
Code Review
This pull request implements loglikelihood() for BD3LMEvalHarness, enabling evaluation of BD3LM models on likelihood-based benchmarks using a Monte Carlo ELBO estimate. A critical security concern was identified: an empty continuation string can lead to a Denial of Service vulnerability by causing a crash due to division by zero or an invalid range in the forward diffusion process; a graceful fix is recommended. Furthermore, to enhance maintainability, consider refactoring a helper function to remove statefulness by explicitly passing data as arguments, which will improve robustness and readability.
```python
    continuation_enc, device=self.device, dtype=torch.long
)

logprob = self._get_loglikelihood(context, continuation)
```
The _forward_process method crashes when the continuation (target) is empty because target_len becomes 0. This leads to an invalid range for torch.randint(1, target_len + 1) on line 164, and division/modulo by zero on lines 174 and 197. This can be triggered by malformed requests or edge cases in benchmark datasets, leading to a Denial of Service (DoS) of the evaluation process.
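For reference, the empty-continuation failure can be reproduced in isolation. This is a minimal sketch — `target_len` is a stand-in for the variable in `_forward_process`, and the guard shown is one possible fix, not the repo's exact code:

```python
import torch

target_len = 0  # what _forward_process sees for an empty continuation

# torch.randint(low, high, ...) requires low < high; with target_len == 0
# the sampling range (1, 1) is empty and torch raises RuntimeError at call time.
try:
    k = torch.randint(1, target_len + 1, (1,))
except RuntimeError as err:
    print(f"crash reproduced: {err}")

# Guarding before sampling keeps the evaluation loop alive:
if target_len > 0:
    k = torch.randint(1, target_len + 1, (1,))
else:
    k = None  # caller can short-circuit and return a fixed logprob instead
```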
```diff
-logprob = self._get_loglikelihood(context, continuation)
+logprob = self._get_loglikelihood(context, continuation) if continuation.shape[0] > 0 else 0.0
```
dllm/core/eval/bd3lm.py (Outdated)
```python
def _get_logits(
    self, batch: torch.Tensor, prompt_index: torch.Tensor
) -> torch.Tensor:
    """BD3LM forward: [x_t ⊕ x_0] with block-diffusion attention, return x_t logits."""
    b, l = batch.shape

    # [x_t ⊕ x_0]: noised first half, clean second half
    concat_input_ids = torch.cat([batch, self._x0], dim=1)  # [b, 2l]
```
The use of self._x0 to pass the clean sequence to this method makes it stateful and reliant on its callers to set this property correctly. This can be fragile and harder to maintain.
The suggested change refactors this method to accept x0 as an explicit argument, making its dependencies clear.
Please update the calls to _get_logits in _get_loglikelihood and _suffix_greedy_prediction accordingly:
- In `_get_loglikelihood`:

  ```python
  # ...
  x0 = seq.clone()
  # ...
  logits = self._get_logits(perturbed_seq, x0, prompt_index)
  # ...
  ```

- In `_suffix_greedy_prediction`:

  ```python
  # ...
  x0 = torch.cat([prefix, target]).unsqueeze(0)
  # ...
  logits = self._get_logits(seq, x0, prompt_index)[mask_index]
  ```
```diff
 def _get_logits(
-    self, batch: torch.Tensor, prompt_index: torch.Tensor
+    self, xt: torch.Tensor, x0: torch.Tensor, prompt_index: torch.Tensor
 ) -> torch.Tensor:
     """BD3LM forward: [x_t ⊕ x_0] with block-diffusion attention, return x_t logits."""
-    b, l = batch.shape
+    b, l = xt.shape
     # [x_t ⊕ x_0]: noised first half, clean second half
-    concat_input_ids = torch.cat([batch, self._x0], dim=1)  # [b, 2l]
+    concat_input_ids = torch.cat([xt, x0], dim=1)  # [b, 2l]
```
- Add guard for empty continuation (returns `0.0, False`) to prevent DoS via division-by-zero in `_forward_process` (`target_len=0`)
- Refactor `_get_logits` to accept `x0` as an explicit parameter instead of reading from `self._x0`, improving maintainability
- Rename `x0` -> `x0_clean` in `_suffix_greedy_prediction` to avoid shadowing by the greedy argmax variable

Co-authored-by: gemini-code-assist[bot] <176aborting-id@users.noreply.github.com>
Hi @zamal-db, thank you very much for your great work! I have verified your code for evaluation on arc-easy, arc-challenge, hellaswag, and piqa, and it works well on these tasks. But I found there is still an error when I try to evaluate on MMLU, and the error is a regex error.
Hey @sglucas, thanks for testing this, and great to hear arc-easy, arc-challenge, hellaswag, and piqa all work! I dug deep into the MMLU regex error and I'm pretty confident this isn't coming from this PR's code.

The 4 tasks you confirmed working (arc-easy, arc-challenge, hellaswag, piqa) all go through the exact same pipeline. So where is this regex coming from? My best guess is that a separate `lm_eval` installation is being picked up. You can check which one is imported with:

```shell
python -c "import lm_eval; print(lm_eval.__file__)"
```

It should point to the `lm_eval` used by this repo.

For MMLU specifically, the official eval scripts actually use the generative variant:

```shell
accelerate launch \
    --num_processes 8 \
    --main_process_port 39501 \
    dllm/pipelines/a2d/eval.py \
    --tasks mmlu_generative_dream \
    --model a2d_bd3lm \
    --num_fewshot 0 \
    --apply_chat_template \
    --model_args "pretrained=${model_name_or_path},max_new_tokens=3,steps=3,block_size=32,cfg_scale=0.0"
```

For likelihood-based mmlu (using our `loglikelihood()` implementation):

```shell
accelerate launch \
    --num_processes 8 \
    --main_process_port 39501 \
    dllm/pipelines/a2d/eval.py \
    --tasks mmlu \
    --model a2d_bd3lm \
    --num_fewshot 5 \
    --model_args "pretrained=${model_name_or_path},mc_num=128,batch_size=32,block_size=32,max_length=4096"
```

One small note: let me know what the `python -c` check above prints.
Summary
Implements Monte Carlo ELBO `loglikelihood()` for `BD3LMEvalHarness`, enabling BD3LM models to be evaluated on likelihood-based benchmarks (ARC-Challenge, ARC-Easy, HellaSwag, WinoGrande, PIQA, MMLU, etc.).

Addresses #93 — previously, all BD3LM evaluations were generation-based only, and `loglikelihood()` raised `NotImplementedError`.

Motivation
As noted by @lingjiechen2 in #93:
This PR fills that gap by implementing the proper BD3LM loglikelihood with the correct `[x_t ⊕ x_0]` block-diffusion attention, matching the training procedure exactly.

Mathematical Framework
BD3LM factorizes the likelihood over B blocks (Arriola et al., 2025):
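A hedged reconstruction of that factorization (the rendered equation did not survive; this follows the block-autoregressive definition, with `x^(b)` denoting the b-th block of `block_size` tokens):

```latex
\log p_\theta(x) = \sum_{b=1}^{B} \log p_\theta\!\left(x^{(b)} \mid x^{(<b)}\right),
\qquad
\log p_\theta\!\left(x^{(b)} \mid x^{(<b)}\right) \ge -\mathcal{L}_{\mathrm{ELBO}}^{(b)}
```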
Each per-block ELBO uses the absorbing-state diffusion parameterization. The MC estimator is importance-weighted, where `p_mask = k / L'` is the importance weight and `k` is the number of masked tokens.

The forward process (token-level masking) is identical to MDLM — the block structure is enforced only through the attention mask, not the masking pattern.
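A hedged sketch of the estimator in terms of the quantities just defined (`mc_num` samples, `k` masked tokens out of `L'` target tokens; the code's exact sign and normalization may differ):

```latex
\widehat{-\log p_\theta(x)}
= \frac{1}{\texttt{mc\_num}} \sum_{j=1}^{\texttt{mc\_num}}
  \frac{1}{p_{\mathrm{mask}}^{(j)}}
  \sum_{i \,:\, x^{(j)}_{t,i} = \texttt{[MASK]}}
  -\log p_\theta\!\left(x^{0}_{i} \mid x^{(j)}_{t}\right),
\qquad
p_{\mathrm{mask}}^{(j)} = \frac{k_j}{L'}
```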
Implementation
1 file changed: `dllm/core/eval/bd3lm.py` (235 insertions, 3 deletions)

The only architecturally novel method is `_get_logits()`, which differs from MDLM by:

- Input: `[x_t ⊕ x_0]` (length 2L) instead of `x_t` (length L)
- Position IDs: `[0..L-1, 0..L-1]` (duplicated)
- Output: only the `x_t` half of the logits (`logits[:, :l]`)
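The input construction described here can be sketched as follows. This is a hypothetical, simplified stand-in for `_get_logits` — the transformer and the attention-mask plumbing are stubbed out so only the `[x_t ⊕ x_0]` shape logic is shown:

```python
import torch

def get_logits_sketch(model, xt: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
    """Sketch: run the model on [x_t ⊕ x_0] and keep only the x_t half."""
    b, l = xt.shape
    concat_input_ids = torch.cat([xt, x0], dim=1)                # [b, 2l]
    # Duplicated position IDs: [0..l-1, 0..l-1]
    position_ids = torch.arange(l).repeat(2).unsqueeze(0).expand(b, -1)
    logits = model(concat_input_ids, position_ids)               # [b, 2l, vocab]
    return logits[:, :l]                                         # x_t logits only

# Toy stand-in for the transformer, just to exercise the shapes:
vocab = 16
toy_model = lambda ids, pos: torch.randn(ids.shape[0], ids.shape[1], vocab)
xt = torch.randint(0, vocab, (2, 8))
out = get_logits_sketch(toy_model, xt, xt.clone())
print(out.shape)  # torch.Size([2, 8, 16])
```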
All other methods (`_forward_process`, `_get_loglikelihood`, `_suffix_greedy_prediction`, `loglikelihood`) follow the same MC ELBO framework as `MDLMEvalHarness`.

Key design decisions:

- Reuse `_create_bd3lm_attention_mask` from `dllm.core.trainers.bd3lm` — same function used in training
- Set `self.accelerator = None` for single-GPU eval (`BaseEvalHarness` sets this to `None` when `num_processes == 1`)
- Inherit from `BaseEvalHarness` (not MDLM) — independent implementation, consistent with Dream's eval pattern
- No changes to `__init__.py` — `BD3LMEvalConfig` and `BD3LMEvalHarness` are already exported
- `A2DBD3LMEvalHarness` automatically inherits `loglikelihood()` via the class hierarchy
Usage

```shell
accelerate launch \
    --num_processes 4 \
    dllm/pipelines/a2d/eval.py \
    --tasks arc_challenge \
    --model a2d_bd3lm \
    --num_fewshot 0 \
    --model_args "pretrained=dllm-collection/Qwen3-0.6B-diffusion-bd3lm-v0.1,mc_num=128,batch_size=32,block_size=32"
```
Verification

- Imports checked (`BD3LMEvalHarness`, `A2DBD3LMEvalHarness`)
- Attention mask inspected: `x_0` tokens verified to NEVER attend to `x_t` tokens
- Config fields (`mc_num`, `batch_size`, `block_size`, `is_check_greedy`) follow MDLM conventions
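The "x_0 never attends to x_t" property can be checked with a small sketch of the mask composition. This is a hypothetical builder, assuming `M_BD` is block-diagonal over the noised half, `M_OBC` is offset block-causal from `x_t` queries to strictly earlier clean blocks, and `M_BC` is block-causal within the clean half — not the repo's `_create_bd3lm_attention_mask` itself:

```python
import torch

def build_bd3lm_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean [2L, 2L] mask over [x_t ⊕ x_0]; True = query may attend to key."""
    blk = torch.arange(seq_len) // block_size
    m_bd = blk[:, None] == blk[None, :]   # x_t -> x_t: own noised block only
    m_obc = blk[:, None] > blk[None, :]   # x_t -> x_0: strictly earlier clean blocks
    m_bc = blk[:, None] >= blk[None, :]   # x_0 -> x_0: block-causal over clean tokens
    mask = torch.zeros(2 * seq_len, 2 * seq_len, dtype=torch.bool)
    mask[:seq_len, :seq_len] = m_bd
    mask[:seq_len, seq_len:] = m_obc
    mask[seq_len:, seq_len:] = m_bc
    # mask[seq_len:, :seq_len] stays all False: x_0 never attends to x_t
    return mask

mask = build_bd3lm_mask(seq_len=8, block_size=4)
print(bool(mask[8:, :8].any()))  # False — clean tokens never see noised tokens
```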