Help reproducing training run #8
Comments
@jxhe, looking at the training script: does this expect 16 GPUs?
The minimum hardware requirement is 6 H/A100-80G GPUs (we haven't tested this yet). For more details, please refer to here. Thank you!
Thank you very much, I will close.
@HYZ17 I will reopen this briefly. For the single node (8x A100 80GB), we get this result:
Hi, we tested the 1-node script again and there were no issues. Maybe you can check Ray's status with `ray status` to verify that it started correctly.
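A minimal single-node sketch of that check (default ports and flags, not taken from the repo's scripts):

```bash
# Restart Ray on a single node and confirm the GPUs are registered
# before submitting the training job.
ray stop                          # clear any stale Ray processes
ray start --head --port=6379      # launch the head node locally
ray status                        # should list the node's CPUs and all GPUs
```

If `ray status` reports fewer GPUs than the job requests, the job stays pending instead of failing outright.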
I also encountered this error, and Ray starts successfully. I think it is a hardware fault.
8x H100 80GB on Lambda Labs (`gpu_8x_h100_sxm5`).

Submitting from `examples/script`, I get an error that `train_ppo_qwen_base_math_lv35_new.sh` does not exist. With `--working-dir=.` from `examples/script`, it can't resolve the `openrlhf` module from inside `/tmp`.

If I adjust the script to point to the Qwen 2.5 Math 7B snapshot from `huggingface download`:
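(Roughly the following; the local directory and the `--pretrain`-style argument are illustrative assumptions, not necessarily the script's exact flags.)

```bash
# Download the base model snapshot from the Hugging Face Hub to a local directory.
huggingface-cli download Qwen/Qwen2.5-Math-7B --local-dir ./models/Qwen2.5-Math-7B

# Then point the training script's model path at that directory,
# e.g. an argument along the lines of:
#   --pretrain ./models/Qwen2.5-Math-7B
```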
And send the job from `train/` after starting the cluster:
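(Roughly the following; the dashboard address and the entrypoint path are assumptions based on my setup, not copied from the repo.)

```bash
# Submit the training script to the running Ray cluster, shipping the
# current directory as the job's working dir.
ray job submit --address="http://127.0.0.1:8265" \
  --working-dir=. \
  -- bash examples/script/train_ppo_qwen_base_math_lv35_new.sh
```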
It doesn't error, and I can see the cluster in `ray status`. But nothing ever goes into GPU memory or seems to start, and wandb never gets any information despite being logged in, the API key being set, etc.
Anyone have any idea what might be going on?
`train_ppo_qwen_base_math_lv35_new.sh` should definitely call `openrlhf` and start pretraining, but it never seems to.