Note
This project includes the codebase and datasets for dLLM post-training: Agentic SFT and Agentic VRPO.
It also includes the implementation of P-ReAct, adapted from SDAR's jetengine.
DLLM-Searcher decodes the tool-call region before the think region, so it always keeps thinking while waiting for the tool response.
We design a two-stage post-training pipeline, encompassing Agentic SFT and Agentic VRPO, to enhance the model's reasoning and tool-calling capabilities. Furthermore, we propose a novel agent paradigm termed P-ReAct. P-ReAct guides the model to prioritize decoding tool_call instructions, thereby allowing the model to keep thinking while waiting for the tool's response.
In this project, the code is organized as follows:
.
├── Dataroller/                  # Data collection and preparation
│   ├── scripts/
│   │   └── test.sh              # Run for data collection
│   ├── prompt.py
│   ├── react_agent.py           # ReAct agent implementation
│   ├── run.sh
│   ├── run_multi_react.py       # Multi-turn ReAct execution
│   └── tool_search.py           # Search tool definitions
│
├── dLLM_trainer/                # Training pipelines
│   ├── SFT/
│   │   └── dLLM-RL/
│   │       ├── train/           # SFT training code
│   │       ├── data/            # SFT datasets
│   │       └── sdar_sft.sh      # SFT training script
│   │
│   └── VRPO/
│       ├── my_train/            # VRPO training code
│       │   └── jetengine/
│       └── recipes/             # VRPO training configurations
│
├── my_eval/                     # Evaluation scripts
│
└── visualization/               # Visual assets
    ├── main.png
    └── diffusion_generation.gif
We recommend using separate environments for data collection, Agentic SFT training, Agentic VRPO training, and inference to avoid dependency conflicts.
conda create -n dllmeval python=3.10
conda activate dllmeval
pip install torch==2.8
wget https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.18/flash_attn-2.7.4+cu128torch2.8-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4+cu128torch2.8-cp310-cp310-linux_x86_64.whl
pip install "sglang[all]"
pip install "qwen-agent[gui,rag,code_interpreter,mcp]"

conda create --name dllm-rl python=3.10
conda activate dllm-rl
pip install torch==2.6.0
pip install --no-cache-dir \
https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/\
flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
cd dLLM_trainer/SFT/dLLM-RL
pip install -r requirements.txt

conda create -n espo python=3.11 -y
conda activate espo
pip install torch==2.6.0
pip install --no-cache-dir \
https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/\
flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
cd dLLM_trainer/VRPO
pip install -e ".[code]"
pip install wandb==0.15.12 protobuf==3.20.3

The inference environment is the same as the VRPO environment.

We use the Dataroller module to collect and process training data with the following command:
cd Dataroller
bash scripts/test.sh

We release our SFT dataset in dLLM_trainer/SFT/data/data.json and our VRPO dataset in dLLM_trainer/VRPO/data/train.jsonl.
Dataset Structure:
- Includes reasoning traces, tool calls, and search results
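For reference, a single training record plausibly looks like the sketch below. The field names (`question`, `messages`) and the `<think>`/`<tool_call>` tagging are assumptions for illustration, not the released schema:

```python
import json

# Hypothetical example record; the real schema in data.json may differ.
record = {
    "question": "Who won the 2022 FIFA World Cup?",
    "messages": [
        {"role": "user", "content": "Who won the 2022 FIFA World Cup?"},
        {
            "role": "assistant",
            "content": "<think>I should search for this.</think>"
                       '<tool_call>{"name": "search", "arguments": '
                       '{"query": "2022 FIFA World Cup winner"}}</tool_call>',
        },
        {"role": "tool", "content": "Argentina won the 2022 FIFA World Cup."},
        {"role": "assistant", "content": "Argentina."},
    ],
}

# Records serialize cleanly to JSON for the training data files.
print(json.dumps(record)[:40])
```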
Before starting SFT training, you should configure the paths in sdar.yaml and sft_sdar.py:
sdar.yaml:
model:
pretrained_model: "your_path" # Absolute path to your pretrained model
optimized_name: "optimized" # Output name for optimized model, saved under sft_sdar/ckpt

sft_sdar.py:
with open("../data/" + config.dataset.optimization_data + ".json", 'r') as f:
dataset_load = json.load(f)

After configuration, launch the SFT training as follows:
cd dLLM_trainer/SFT/dLLM-RL
bash sdar_sft.sh

Update the path configurations in dpo.yaml according to your setup:
# Example configuration structure
model:
path: "your_sft_model_path" # Path to SFT trained model
# Additional VRPO-specific configurations

Launch the VRPO training as follows:
cd dLLM_trainer/VRPO
bash run_dpo.sh

Only 11 lines of code are needed to implement complete token prefilling and confidence biasing.
# dLLM_trainer/VRPO/my_train/jetengine/engine/scheduler.py
elif 'toolcall_pre_rl' in seq.remasking_strategy:
    if seq.current_denoising_step == 0:
        # Prefill the tool-call delimiter tokens at the first denoising step
        seq_x0[your_tool_end] = 151658      # </tool_call>
        transfer_index[your_tool_end] = True
        seq_x0[your_tool_start] = 151657    # <tool_call>
        transfer_index[your_tool_start] = True
    else:
        # Bias confidence inside the tool-call region so it is decoded first
        confidence = torch.where(mask_index, seq_x0_p, -np.inf)
        confidence[your_tool_start:your_tool_end + 1] += 0.5
        _, top_indices = torch.topk(confidence, num_to_transfer)
        transfer_index[top_indices] = True

This design enables:
- Parallel reasoning and action execution
- Asynchronous tool calling
- Reduced end-to-end latency
- Improved search agent efficiency
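In isolation, the confidence-biasing step above can be sketched with NumPy. The region bounds, bias value, and array sizes here are illustrative, not the production values:

```python
import numpy as np

def pick_transfer_indices(confidence, mask, tool_start, tool_end,
                          num_to_transfer, bias=0.5):
    """Select which masked positions to decode next, biasing the tool-call region.

    Already-decoded positions (mask == False) get -inf confidence; positions in
    [tool_start, tool_end] get a constant bias so the tool-call region is
    decoded before the think region.
    """
    conf = np.where(mask, confidence, -np.inf)
    conf[tool_start:tool_end + 1] += bias
    # Descending sort by confidence, keep the top-k positions
    return np.argsort(-conf)[:num_to_transfer]

confidence = np.array([0.9, 0.2, 0.5, 0.6, 0.3, 0.8])
mask = np.array([True, True, True, True, True, False])
idx = pick_transfer_indices(confidence, mask, tool_start=2, tool_end=4,
                            num_to_transfer=3)
print(sorted(idx.tolist()))  # → [0, 2, 3]
```

Note how positions 2 and 3 beat position 1 despite lower raw confidence, because the tool-call bias lifts them above the think-region tokens.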
Run the rollout script with the trained model as follows:
cd inference/my_eval
bash run_test.sh

The LLM-as-judge prompt we use is:
prompt = '''Given a Question and its Golden Answer, verify whether the Predicted Answer is correct.
The prediction is correct if it fully aligns with the meaning and key information of the Golden Answer.
Respond with ONLY True if the prediction is correct and ONLY False otherwise.
Question: {question}
Golden Answer: {reference}
Predicted Answer: {prediction}
'''

Evaluation scripts are available in the inference/my_eval/ directory.
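As a sketch, the judge prompt can be filled in and the verdict parsed as below; `call_llm` is a hypothetical stand-in for whatever judge model client you use:

```python
JUDGE_PROMPT = '''Given a Question and its Golden Answer, verify whether the Predicted Answer is correct.
The prediction is correct if it fully aligns with the meaning and key information of the Golden Answer.
Respond with ONLY True if the prediction is correct and ONLY False otherwise.
Question: {question}
Golden Answer: {reference}
Predicted Answer: {prediction}
'''

def judge_correct(question, reference, prediction, call_llm):
    """Format the judge prompt and parse the model's True/False verdict.

    `call_llm` is a hypothetical callable: prompt string -> response string.
    """
    prompt = JUDGE_PROMPT.format(question=question,
                                 reference=reference,
                                 prediction=prediction)
    reply = call_llm(prompt).strip()
    return reply.lower().startswith("true")

# Toy judge standing in for a real LLM endpoint.
print(judge_correct("Capital of France?", "Paris", "Paris",
                    lambda p: "True"))  # → True
```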
cd inference/my_eval
python cal_acc.py

We sincerely thank the authors of the following open-source repositories for their efforts:
- Training Frameworks: TraceRL, ESPO
- Evaluation & Serving: WebSailor, R1Searcher
- Base Models: LLaDA, Dream7B, SDAR

