[BUG] GRPO on GSM8K is stable for SGLang but unstable/collapses for vLLM

## Checklist

- [ ] The error occurs when using our provided Docker image.
- [X] I can consistently reproduce the bug across multiple trials or random seeds.
- [ ] If the error causes experiment abortion, I've verified that this error is the root
  cause, not a secondary error caused by peer workers.

## Detailed Information
Hi team, thank you for your work on RL training infrastructure. I have been consistently seeing RL training instability when running this [example](https://github.com/inclusionAI/AReaL/tree/main/examples/math) with the default [config](https://github.com/inclusionAI/AReaL/blob/main/examples/math/gsm8k_grpo.yaml). The only change I make when switching from SGLang to vLLM is on [this](https://github.com/inclusionAI/AReaL/blob/main/examples/math/gsm8k_grpo.yaml#L22) line, where I replace `sglang` with `vllm`.

Interestingly, on both NVIDIA H100 and A6000 GPUs, SGLang works fine, with the reward on the validation split close to that reported [here](https://github.com/inclusionAI/AReaL/tree/main/examples/math#hyper-parameters-for-gsm8k-finetuning-on-qwen25-15b-instruct). With vLLM on H100, however, the RL run collapses. I used the FSDP trainer backend in my experiments. I tried setting vLLM's `enforce_eager` flag to `True`, but that did not help either. Is there a specific set of flags or settings needed to make vLLM work with AReaL? My experiments were run on a relatively recent commit, `ae8c792fdb5e21f77b3b9bca9c435cb6d1ddf62b`.

I have attached the reward curves and grad norm curves for reference (the grad norm is higher, and seems to explode, for vLLM but not for SGLang).

<img width="1471" height="829" alt="Image" src="https://github.com/user-attachments/assets/013cbcc9-add9-4078-9f65-6905f8ca5a6f" />

<img width="2187" height="1050" alt="Image" src="https://github.com/user-attachments/assets/627e5466-1b82-4c71-89bd-ca7f4083c5a4" />

<img width="2186" height="1039" alt="Image" src="https://github.com/user-attachments/assets/4f066b80-9662-4ff1-b1be-4de7212a4585" />
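
To make "the grad norm seems to explode" less subjective when comparing the two backends, here is a minimal sketch of how the spikes could be counted from exported grad norm values. This is not part of AReaL: the function name `find_grad_norm_spikes`, the window size, and the ratio threshold are all illustrative assumptions, and the input is assumed to be a plain list of per-step grad norms exported from the logger.

```python
from statistics import median


def find_grad_norm_spikes(grad_norms, window=20, ratio=5.0):
    """Return the step indices whose grad norm exceeds `ratio` times the
    median of the preceding `window` steps.

    Hypothetical helper for triage: `window` and `ratio` are arbitrary
    defaults, not values taken from AReaL or from the runs above.
    """
    spikes = []
    for step in range(window, len(grad_norms)):
        # Median of the trailing window is robust to earlier isolated spikes.
        baseline = median(grad_norms[step - window:step])
        if baseline > 0 and grad_norms[step] > ratio * baseline:
            spikes.append(step)
    return spikes
```

Running this on the grad norm series from the SGLang run and from the vLLM run with the same `window` and `ratio` would give a step-level comparison of where the vLLM run starts to diverge.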

### Describe the bug
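See above: with the default config, GRPO training on GSM8K is stable with SGLang but unstable (the run collapses) with vLLM.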

### Expected behavior
The default config for this simple RL example should work regardless of the choice of inference backend.

### Full logs
N/A

## To Reproduce
Set up AReaL with vLLM or SGLang, then run `python3 examples/math/gsm8k_rl.py --config examples/math/gsm8k_grpo.yaml scheduler.type=local`

### Commit ID
`ae8c792fdb5e21f77b3b9bca9c435cb6d1ddf62b`

### Environment
NVIDIA H100 and A6000 GPUs. For SGLang, installation only required `uv sync --extra cuda` with Python 3.12. For vLLM, we additionally installed flash-attn for torch 2.10 from this pre-built wheel: https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.8.3+cu128torch2.10-cp312-cp312-linux_x86_64.whl

### Script
https://github.com/inclusionAI/AReaL/blob/main/examples/math/gsm8k_grpo.yaml

