Hi,
thanks for your great work.
While attempting to reproduce the LLaDA-V ablation study results, I successfully executed `scripts/llada_v_pretrain.sh`. However, I then encountered several issues in the subsequent steps:
- I attempted to increase `per_device_batch_size` from 1 to 8 while running `bash scripts/train_ablation/llada_v_sft.sh`. Although GPU memory usage increased, the total training time remained unchanged at approximately 72 hours on 4 GPUs (see the batch-size sketch below).
- I observed that `grad_norm` was reported as 0.0 during training. Is this expected, or could it indicate a problem? The only modifications I made were to the training scripts; the rest of the original code is unchanged. (See the gradient check below.)
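For context, here is my mental model of why I expected the wall-clock time to drop. This is just a rough sketch; `dataset_size` and `effective_batch_size` are hypothetical placeholders, not values from the repo:

```python
# Hypothetical numbers: dataset_size and effective_batch_size are
# placeholders, not values taken from the LLaDA-V configs.
dataset_size = 1_000_000       # training samples (hypothetical)
num_gpus = 4
effective_batch_size = 256     # global batch size (hypothetical)

for per_device_batch_size in (1, 8):
    # With a fixed effective batch size, larger micro-batches mean
    # fewer gradient-accumulation steps per optimizer step.
    grad_accum_steps = effective_batch_size // (per_device_batch_size * num_gpus)
    optimizer_steps = dataset_size // effective_batch_size
    fwd_bwd_passes = optimizer_steps * grad_accum_steps
    print(f"per_device={per_device_batch_size}: "
          f"accum_steps={grad_accum_steps}, "
          f"optimizer_steps={optimizer_steps}, "
          f"fwd/bwd passes={fwd_bwd_passes}")
```

Going from 1 to 8 cuts the number of forward/backward passes by 8x while the total FLOPs stay the same, so an unchanged training time could mean the GPUs were already compute-bound at batch size 1, or that the script derives its step count some other way. I'd appreciate clarification on which it is.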
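And here is a minimal sketch of how I am double-checking the gradients independently of the trainer's logging (plain PyTorch; `total_grad_norm` is my own helper, not part of the repo):

```python
import torch

def total_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients, mirroring what
    torch.nn.utils.clip_grad_norm_ computes before clipping."""
    norms = [p.grad.detach().norm(2)
             for p in model.parameters() if p.grad is not None]
    if not norms:
        return 0.0  # no .grad at all -- backward() may not have run
    return torch.norm(torch.stack(norms), 2).item()

# Usage, inside the training step right after loss.backward():
#     print("grad norm:", total_grad_norm(model))
```

If this also prints 0.0 right after `loss.backward()`, the reported value would be real rather than a logging artifact; otherwise the 0.0 might come from where in the step the trainer logs it.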
