
[Bug Fix] Use global rank instead of local rank in AsyncSGLangRollout broadcast when TP > 1 #1449



JasonZhu1313 commented May 8, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Add one-line overview of what this PR aims to achieve or accomplish.

[Bug Fix] Fix the broadcast issue in AsyncSGLangRollout by using the global rank instead of the local rank as the rank index when TP > 1

High-Level Design

Currently, the SGLang async rollout engine fails with the following error when TP > 1:


ray.exceptions.RayTaskError(ValueError): ray::TaskRunner.run() (pid=177476, ip=100.96.17.52, actor_id=033eaa4520599283cbe9e6d801000000, repr=<main_ppo.TaskRunner object at 0x7c54a9a61120>)
  File "/home/jobuser/verl/verl/trainer/main_ppo.py", line 183, in run
    trainer.fit()
  File "/home/jobuser/verl/verl/trainer/ppo/ray_trainer.py", line 871, in fit
    val_metrics = self._validate()
  File "/home/jobuser/verl/verl/trainer/ppo/ray_trainer.py", line 604, in _validate
    test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded)
  File "/home/jobuser/verl/verl/single_controller/ray/base.py", line 49, in func
    output = ray.get(output)
ray.exceptions.RayTaskError(ValueError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=196191, ip=100.96.17.52, actor_id=d9428a225808fb224e8e4a0b01000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7a6a9a0f06d0>)
  File "/home/jobuser/verl/verl/single_controller/ray/base.py", line 459, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/home/jobuser/verl/verl/single_controller/base/decorator.py", line 501, in inner
    return func(*args, **kwargs)
  File "/home/jobuser/verl/verl/workers/fsdp_workers.py", line 610, in generate_sequences
    output = self.rollout.generate_sequences_with_tools(prompts=prompts)
  File "/home/jobuser/sglang/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/jobuser/verl/verl/workers/rollout/sglang_rollout/async_sglang_rollout.py", line 584, in generate_sequences_with_tools
    [sorted_output_req_list] = broadcast_pyobj(
ValueError: not enough values to unpack (expected 1, got 0)

When TP = 2 and DP = 4, the device mesh for 8 GPUs is structured as follows:

[[0, 1], [2, 3], [4, 5], [6, 7]]

In this setup, the SGLang rollout output is collected at the first local rank within each tensor parallel (TP) group, that is, at ranks [0, 2, 4, 6] in the global (world) rank space.

Currently, SGLang's broadcast_pyobj checks whether the current rank index matches the source (src) index. However, when the local rank within a TP group is used as the rank index, the only possible values are [0, 1] (since TP = 2). Because of this mismatch, the rank == src check fails on some global ranks that hold actual data. As a result, an empty list is returned to verl, and unpacking it fails on the verl side.
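The mismatch described above can be reproduced without any GPUs. The following is a minimal sketch in plain Python, not the actual SGLang or verl code: `build_tp_groups` and `sender_ranks` are hypothetical helpers, and the sketch assumes that src for each TP group is the global rank of the group's first member.

```python
def build_tp_groups(world_size, tp_size):
    """Split global ranks 0..world_size-1 into consecutive TP groups."""
    return [list(range(start, start + tp_size))
            for start in range(0, world_size, tp_size)]


def sender_ranks(groups, use_global_rank):
    """Return the global ranks for which the `rank == src` check passes.

    Assumption: src is the global rank of the first member of each group.
    """
    senders = []
    for group in groups:
        src = group[0]
        for local_rank, global_rank in enumerate(group):
            # The rank index handed to the check: either global or TP-local.
            rank = global_rank if use_global_rank else local_rank
            if rank == src:
                senders.append(global_rank)
    return senders


groups = build_tp_groups(world_size=8, tp_size=2)
local_senders = sender_ranks(groups, use_global_rank=False)
global_senders = sender_ranks(groups, use_global_rank=True)

print(groups)          # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(local_senders)   # [0]
print(global_senders)  # [0, 2, 4, 6]
```

With the local rank as the index, only global rank 0 is recognized as a sender, so the data held on ranks 2, 4, and 6 is never treated as the source. When an empty list comes back, a statement of the form `[x] = []` raises `ValueError: not enough values to unpack (expected 1, got 0)`, which matches the last line of the traceback.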

Demonstrate the high-level design if this PR is complex.

Specific Changes

Use the global rank as the rank index so that the rank holding the actual data broadcasts it to its process group.

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

Tested a multi-turn training run; with the fix, the rollout is generated successfully:

(TaskRunner pid=1504685) validation generation end
(WorkerDict pid=1523161) WARN: rank 6 grad_norm is not finite: nan [repeated 6x across cluster]
(TaskRunner pid=1504685) [prompt] system
(TaskRunner pid=1504685) 
(TaskRunner pid=1504685)                             You are a math expert. You are given a question and you need to solve it step by step.  
(TaskRunner pid=1504685)                             `calc_gsm8k_reward` is a tool for calculating the reward of gsm8k. You should use this 
(TaskRunner pid=1504685)                             tool to calculate the reward of your answer(1.0 if your answer is correct, 0.0 if your 
(TaskRunner pid=1504685)                             answer is incorrect) before submitting it and refine your answer if necessary. Put your 
(TaskRunner pid=1504685)                             final answer in the format of `#### <answer>`.
(TaskRunner pid=1504685) 
(TaskRunner pid=1504685) # Tools
(TaskRunner pid=1504685) 
(TaskRunner pid=1504685) You may call one or more functions to assist with the user query.
(TaskRunner pid=1504685) 
(TaskRunner pid=1504685) You are provided with function signatures within <tools></tools> XML tags:
(TaskRunner pid=1504685) <tools>
(TaskRunner pid=1504685) {"type": "function", "function": {"name": "calc_gsm8k_reward", "description": "A tool for calculating the reward of gsm8k. (1.0 if your answer is correct, 0.0 if your answer is incorrect)", "parameters": {"type": "object", "properties": {"answer": {"type": "string", "description": "The model's answer to the GSM8K math problem, must be a digits", "enum": null}}, "required": ["answer"]}, "strict": false}}
(TaskRunner pid=1504685) </tools>
(TaskRunner pid=1504685) 
(TaskRunner pid=1504685) For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
(TaskRunner pid=1504685) <tool_call>
(TaskRunner pid=1504685) {"name": <function-name>, "arguments": <args-json-object>}
(TaskRunner pid=1504685) </tool_call>
(TaskRunner pid=1504685) user
(TaskRunner pid=1504685) Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? 
(TaskRunner pid=1504685)         You must use the `calc_gsm8k_reward` tool to calculate the reward 
(TaskRunner pid=1504685)         of your answer(1.0 if your answer is correct, 0.0 if your answer is incorrect) 
(TaskRunner pid=1504685)         before submitting it at least once and refine your answer if necessary. 
(TaskRunner pid=1504685)         Put your final answer in the format of `#### <answer>`.
(TaskRunner pid=1504685)     
(TaskRunner pid=1504685) assistant
(TaskRunner pid=1504685) 
(TaskRunner pid=1504685) [response] Let's calculate the number of eggs Janet sells at the farmers' market each day.
(TaskRunner pid=1504685) 
(TaskRunner pid=1504685) 1. Janet's ducks lay 16 eggs per day.
(TaskRunner pid=1504685) 2. She eats 3 eggs for breakfast.
(TaskRunner pid=1504685) 3. She uses 4 eggs to make muffins for her friends.
(TaskRunner pid=1504685) 
(TaskRunner pid=1504685) The number of eggs sold at the farmers' market is:
(TaskRunner pid=1504685) \[ \text{Eggs sold} = \text{Total eggs} - \text{Eggs eaten for breakfast} - \text{Eggs used for muffins} \]
(TaskRunner pid=1504685) \[ \text{Eggs sold} = 16 - 3 - 4 = 9 \]
(TaskRunner pid=1504685) 
(TaskRunner pid=1504685) Since she sells each fresh duck egg for $2 at the farmers' market, the daily revenue is:
(TaskRunner pid=1504685) \[ \text{Revenue} = \text{Eggs sold} \times \text{Price per egg} \]
(TaskRunner pid=1504685) \[ \text{Revenue} = 9 \times 2 = 18 \]
(TaskRunner pid=1504685) 
(TaskRunner pid=1504685) Now, let's calculate the reward for this answer using the `calc_gsm8k_reward` tool.
(TaskRunner pid=1504685) <tool_call>
(TaskRunner pid=1504685) {"name": "calc_gsm8k_reward", "arguments": "{\"answer\": \"18\"}"}
(TaskRunner pid=1504685) </tool_call>
(TaskRunner pid=1504685) tool
(TaskRunner pid=1504685) Current parsed answer='18' reward=1.0
(TaskRunner pid=1504685) assistant
(TaskRunner pid=1504685) The reward for the answer is 1.0, which means the answer is correct.
(TaskRunner pid=1504685) 
(TaskRunner pid=1504685) #### 18
(TaskRunner pid=1504685) [ground_truth] 18
(TaskRunner pid=1504685) [score] 1.0


For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@JasonZhu1313
Author

@zhaochenyang20 Could you help take a look? Thanks!

@zhaochenyang20
Collaborator

@JasonZhu1313 Let me pass the CI

vermouth1992 pushed a commit that referenced this pull request May 20, 2025
…en sgl and sgl async (#1577)

### Checklist Before Starting

- [x] Search for similar PR(s).
- Thanks to:
  - close #1558 due to mix of prs
  - close #1449 due to partial fix sgl new version issue
  - close #1300 which is part of current pr
- This pr is co-authored with @ocss884 

### What does this PR do?

> Add one-line overview of what this PR aims to achieve or accomplish. 

- bump sglang to 0.4.6.post4
- unify sglang and sglang_async `generate_sequences` API behavior,
e.g. image support
- fix warning for CUDA barrier at start of fsdp_workers

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.

---------

Co-authored-by: ocss884 <[email protected]>