[BUG] GRPO on GSM8K is stable for SGLang but unstable/collapses for vLLM #1290

@adityasoni9998

Description

Checklist

  • The error occurs when using our provided Docker image.
  • I can consistently reproduce the bug across multiple trials or random seeds.
  • If the error causes experiment abortion, I've verified that this error is the root
    cause, not a secondary error caused by peer workers.

Detailed Information

Hi team, thank you for your work on RL training infrastructure. I have been consistently seeing RL training instability when running this example with the default config. The only change I make when switching from SGLang to vLLM is on this line, where I replace sglang with vllm.

Interestingly, on both NVIDIA H100 and A6000 GPUs, SGLang works fine, with the reward on the validation split close to that given here. With vLLM on H100, however, the RL run collapses. I used the FSDP trainer backend in my experiments. Setting the enforce_eager flag in vLLM to True did not help either. Is there a specific set of flags or settings needed to make vLLM work with AReaL? My experiments were run on a relatively recent commit, ae8c792fdb5e21f77b3b9bca9c435cb6d1ddf62b.

I have attached the reward and grad norm curves for reference (the grad norm seems to explode, or at least be higher, for vLLM but not for SGLang).

[Attached images: reward curves and grad norm curves]
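For anyone who wants to put a number on the point where the grad norm starts to explode, here is a minimal sketch. It assumes the per-step grad norms have been exported from the logging tool as a plain list of floats. The function name, the window size, and the 5x threshold are all hypothetical choices for illustration, not part of AReaL.

```python
import math
from typing import Optional, Sequence


def first_divergence_step(
    grad_norms: Sequence[float],
    window: int = 10,
    factor: float = 5.0,
) -> Optional[int]:
    """Return the first step index at which the grad norm looks divergent.

    A step counts as divergent if its grad norm is non-finite (NaN/inf) or
    exceeds `factor` times the mean of the preceding `window` steps.
    Returns None if no such step exists. `window` and `factor` are
    arbitrary illustrative defaults, not tuned values.
    """
    for step in range(window, len(grad_norms)):
        value = grad_norms[step]
        if not math.isfinite(value):
            return step
        # Rolling baseline: mean of the `window` steps just before this one.
        baseline = sum(grad_norms[step - window:step]) / window
        if baseline > 0 and value > factor * baseline:
            return step
    return None
```

Calling this once on the SGLang series and once on the vLLM series gives a rough way to compare when, if at all, each run leaves its stable regime. It only flags a spike relative to recent history, so it says nothing about the cause.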

Describe the bug

See above

Expected behavior

The default config for this simple RL example should work regardless of the choice of inference backend.

Full logs

N/A

To Reproduce

Set up AReaL with vLLM or SGLang, then run: python3 examples/math/gsm8k_rl.py --config examples/math/gsm8k_grpo.yaml scheduler.type=local

Commit ID

ae8c792fdb5e21f77b3b9bca9c435cb6d1ddf62b

Environment

NVIDIA H100 and A6000 GPUs. For SGLang, installation only required uv sync --extra cuda with Python 3.12. For vLLM, we installed flash-attn from the pre-built wheel for torch 2.10: https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.8.3+cu128torch2.10-cp312-cp312-linux_x86_64.whl

Script

https://github.com/inclusionAI/AReaL/blob/main/examples/math/gsm8k_grpo.yaml

Labels: bug, stale
