[ci, tests] refactor: split GPU smoke CI into label-driven test groups by wtomin · Pull Request #233 · verl-project/verl-omni

wtomin · 2026-07-03T03:24:11Z

Summary

Refactors the GPU smoke CI pipeline so PRs can trigger smaller, right-sized GPU jobs via labels instead of always provisioning a single 8-GPU runner for the full suite.

- Split the monolithic run_gpu_smoke_tests.sh into four group scripts backed by a shared lib_gpu_smoke.sh helper library.
- Reorganized tests by GPU topology and resource needs:
  - core (2 GPU): vllm-omni rollout, diffusion agent loop, diffusers FSDP/VeOmni engines, diffusion rollout seed multi-worker, visual reward manager
  - e2e (2 GPU): DiffusionNFT trainer e2e, Qwen3-Omni Thinker GSPO LoRA e2e
  - reward-e2e (4 GPU): FlowGRPO trainer e2e, Qwen-Image online DPO trainer e2e
- Updated .github/workflows/gpu_smoke.yml with label-driven job dispatch and per-group dynamic runners (L20x1 / L20x2 / L20x4).

PR label → CI behavior

| Label | Jobs | Runner |
| --- | --- | --- |
| `ready-for-ci` | All groups, running in parallel on an 8-card server | 1× L20x8 |
| `ci-core` | core only | 1× L20x2 |
| `ci-e2e-omni` | e2e only | 1× L20x2 |
| `ci-e2e-diffusion` | reward-e2e only | 1× L20x4 |

Push events and ready-for-ci still run the full suite; granular labels let contributors run only the subset relevant to their change.
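
The label table above can be read as a simple lookup. The following is a minimal sketch of that lookup in bash, assuming a hypothetical helper named `resolve_ci_target`; the real dispatch lives in .github/workflows/gpu_smoke.yml and may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): map a PR label to
# "<test groups> <runner type>", following the label table above.
resolve_ci_target() {
  case "$1" in
    ready-for-ci)     echo "core,e2e,reward-e2e L20x8" ;;
    ci-core)          echo "core L20x2" ;;
    ci-e2e-omni)      echo "e2e L20x2" ;;
    ci-e2e-diffusion) echo "reward-e2e L20x4" ;;
    *)                return 1 ;;  # unknown label: trigger no GPU job
  esac
}
```

For example, `resolve_ci_target ci-core` prints `core L20x2`, and an unrelated label returns a non-zero status so that a caller can skip provisioning a runner.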

Motivation

The previous pipeline always used an 8-GPU runner for every ready-for-ci run, even when only a 2-GPU subset was needed. Splitting the suite into groups reduces GPU waste and shortens feedback loops for targeted PRs.

Test plan

Open a PR and add ci-core: verify only the 2-GPU core job runs
Add ci-e2e-omni: verify only the 2-GPU e2e job runs
Add ci-e2e-diffusion: verify only the 4-GPU reward e2e job runs
Add ready-for-ci: verify full suite runs sequentially on one 8-GPU runner
Confirm cleanup destroys all provisioned dynamic runners after each run

gemini-code-assist

Code Review

This pull request refactors the GPU smoke test suite by splitting the monolithic run_gpu_smoke_tests.sh script into modular, resource-specific test group scripts and introducing a shared helper library, lib_gpu_smoke.sh, for common setup and reporting. Feedback is provided regarding a potential unbound variable error in lib_gpu_smoke.sh when referencing BASH_SOURCE[1] under set -u, with a suggestion to use a fallback parameter expansion.
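
The unbound-variable concern can be illustrated with a short sketch. It assumes the library reads `BASH_SOURCE[1]` at top level to find the script that sourced it; the variable name `CALLER_SCRIPT` is hypothetical and the actual code in lib_gpu_smoke.sh may differ.

```shell
#!/usr/bin/env bash
set -u

# When this file is sourced, BASH_SOURCE[1] is the sourcing script.
# When it is executed directly there is no caller frame, so a bare
# ${BASH_SOURCE[1]} would abort under `set -u`. The fallback expansion
# below degrades to this file's own path, then to a fixed string.
CALLER_SCRIPT="${BASH_SOURCE[1]:-${BASH_SOURCE[0]:-unknown}}"
echo "smoke helpers loaded by: ${CALLER_SCRIPT}"
```

The `:-` form only supplies a default for the expansion; it does not assign to `BASH_SOURCE`, so the library behaves the same whether it is sourced or run on its own.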

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.


zhtmike · 2026-07-03T03:43:23Z

Better to keep the diffusion agent loop and the diffusers FSDP/VeOmni engines on multiple GPUs, so they are tested with fsdp/dp/sp/tp.

zhtmike · 2026-07-03T03:44:34Z

> Updated .github/workflows/gpu_smoke.yml with label-driven job dispatch and per-group dynamic runners (L20x1 / L20x2 / L20x4).

I tested this, and it does not seem to work.

zhtmike · 2026-07-03T03:46:30Z

It would be better to provide an offline test script, with a guide on how to run the tests offline.

wtomin · 2026-07-03T04:01:25Z

> Better to keep the diffusion agent loop and the diffusers FSDP/VeOmni engines on multiple GPUs, so they are tested with fsdp/dp/sp/tp.

Good suggestion! Now the core test will run on 2 GPUs.

zhtmike · 2026-07-03T04:15:32Z

Yes, it is still running sequentially. Better to allocate an 8-GPU server and run them in parallel.

wtomin · 2026-07-03T05:44:40Z

> Yes, it is still running sequentially. Better to allocate an 8-GPU server and run them in parallel.

That is feasible. I implemented it in the recent commit 717b764. It did run multiple jobs in parallel on an 8-card server; however, the logs are not straightforward to read.

You can only view the logs after all jobs are done and the logs have been uploaded as artifacts. It is a bit inconvenient, but more efficient. @zhtmike What do you think? Which one is better?
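
The trade-off described above (parallel execution, logs only readable afterwards) can be sketched as follows. This is a minimal illustration, not the code from commit 717b764: the function name `run_groups_in_parallel` and the `name:gpu_list:command` spec format are assumptions. It runs each group in the background on its own `CUDA_VISIBLE_DEVICES` set, writes one log file per group, and replays the logs inside GitHub Actions `::group::` sections once every group has finished.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run test groups concurrently on disjoint GPU sets.
# Usage: run_groups_in_parallel <log_dir> "<name>:<gpu_list>:<command>"...
run_groups_in_parallel() {
  local log_dir="$1"; shift
  local spec name rest gpus cmd pid pids="" names="" failed=0
  mkdir -p "$log_dir"
  for spec in "$@"; do
    name="${spec%%:*}"; rest="${spec#*:}"
    gpus="${rest%%:*}"; cmd="${rest#*:}"
    # Each group sees only its own GPUs and writes to its own log file.
    CUDA_VISIBLE_DEVICES="$gpus" bash -c "$cmd" >"$log_dir/$name.log" 2>&1 &
    pids="$pids $!"; names="$names $name"
  done
  # Wait for every group; remember whether any of them failed.
  for pid in $pids; do
    wait "$pid" || failed=1
  done
  # Replay the logs one group at a time so they are not interleaved.
  for name in $names; do
    echo "::group::$name"
    cat "$log_dir/$name.log"
    echo "::endgroup::"
  done
  return "$failed"
}
```

A call such as `run_groups_in_parallel logs 'core:0,1:bash core.sh' 'e2e:2,3:bash e2e.sh'` (script names are placeholders) keeps the per-group log files for artifact upload and still shows each group's output as a separate collapsible section in the job log.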

zhtmike · 2026-07-03T06:12:11Z

Better to let the user pass an arbitrary number of GPUs for local testing. For example, I only have 4 cards.

I think `CUDA_VISIBLE_DEVICES=0,1,2,3 NUM_GPUS=4 bash tests/gpu_smoke/run_gpu_smoke_tests.sh` could work locally.
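
One way to accept an arbitrary positive GPU count for local runs is sketched below. The helper name `resolve_num_gpus` is hypothetical, and falling back to counting the entries of `CUDA_VISIBLE_DEVICES` is an assumption made for illustration, not necessarily what the smoke scripts do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide how many GPUs a local smoke run may use.
# Prefers NUM_GPUS; otherwise counts the entries in CUDA_VISIBLE_DEVICES.
resolve_num_gpus() {
  local n="${NUM_GPUS:-}"
  if [ -z "$n" ] && [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
    n=$(printf '%s' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .)
  fi
  case "$n" in
    # Reject empty values, non-digits, zero, and leading zeros.
    ''|*[!0-9]*|0*)
      echo "NUM_GPUS must be a positive integer, got '${n}'" >&2
      return 1 ;;
  esac
  echo "$n"
}
```

With this sketch, `NUM_GPUS=4 resolve_num_gpus` prints `4`, and so does `CUDA_VISIBLE_DEVICES=0,1,2,3 resolve_num_gpus` when `NUM_GPUS` is unset.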

zhtmike · 2026-07-03T06:15:05Z

The skip checks are not that straightforward. Can we just not trigger those jobs from the Actions YAML?

zhtmike · 2026-07-03T06:47:52Z

LGTM. Please update the documentation on how to add new tests in this new version, and add guidance on how to run the smoke tests locally.

SamitHuang · 2026-07-03T06:49:19Z

I agree that it's better to run gpu-smoke-core, gpu-smoke-e2e, and gpu-smoke-reward-e2e in parallel. For the logging problem, can we refer to verl's CI, where multiple GPU tests are also run in parallel?
I also think the label names could be more intuitive; reward-e2e and e2e are quite confusing.

zhtmike · 2026-07-03T06:51:42Z

> I agree that it's better to run gpu-smoke-core, gpu-smoke-e2e, and gpu-smoke-reward-e2e in parallel. For the logging problem, can we refer to verl's CI, where multiple GPU tests are also run in parallel?

I think it is currently a limitation of the CI server provider.

SamitHuang · 2026-07-03T07:02:03Z

Can we automatically determine which tests should run from the PR's code diff?

wtomin · 2026-07-03T07:02:13Z

> I agree that it's better to run gpu-smoke-core, gpu-smoke-e2e, and gpu-smoke-reward-e2e in parallel. For the logging problem, can we refer to verl's CI, where multiple GPU tests are also run in parallel?

verl-project/verl uses volcengine/vemlp-github-runner to provision one elastic GPU runner (typically L20x8) per workflow. Although the YAML defines multiple downstream jobs as structurally parallel, all jobs share the single runner-label from the setup step, so they queue up and execute sequentially on the same physical machine.

One solution is to create multiple runner instances. Invoke volcengine/vemlp-github-runner@v1 multiple times during the setup phase (with different images or configurations) to generate distinct runner-labels, allowing different jobs to run on separate runners. However, this requires the veMLP backend to support a single workflow requesting multiple tasks. Besides, we don't have different images or configurations for now. I think currently we can only run multiple jobs sequentially.

> I think the label names could be more intuitive; reward-e2e and e2e are quite confusing.

reward-e2e refers to the e2e tests that use a reward model and need 4 cards. e2e refers to the e2e tests without a reward model, which need 2 cards. Any better suggestions?

wtomin · 2026-07-03T07:12:13Z

> LGTM. Please update the documentation on how to add new tests in this new version, and add guidance on how to run the smoke tests locally.

Sure. I will update the docs after the PR design is finalized.

SamitHuang · 2026-07-03T07:29:25Z

> reward-e2e refers to the e2e tests that use a reward model and need 4 cards. e2e refers to the e2e tests without a reward model, which need 2 cards. Any better suggestions?

I suggest grouping the tests by the code they exercise rather than by GPU consumption. For example:
- ci-core: all key modules (training, rollout, reward engines, weight sync manager, etc.)
- ci-e2e-omni: 1- or 2-step e2e training for Qwen3-Omni tiny GSPO, and in future fully async omni and qwen-tts
- ci-e2e-diffusion: 1- or 2-step e2e training for FlowGRPO, online DPO, and DiffusionNFT

Typically, diffusion model contributors will not touch omni-related code.
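
The earlier question about choosing tests from the PR diff could build on this grouping. The sketch below is only an illustration: the function name `labels_for_paths` is hypothetical, and the path patterns (`*/diffusion/*`, `*/omni/*`) are assumptions that do not reflect the repository's actual layout. It reads changed paths on stdin and prints the de-duplicated labels that would cover them.

```shell
#!/usr/bin/env bash
# Hypothetical mapping from changed file paths to the proposed labels.
# The path patterns are illustrative assumptions, not the real repo layout.
labels_for_paths() {
  local path
  while IFS= read -r path; do
    case "$path" in
      '')            ;;                          # ignore blank lines
      */diffusion/*) echo "ci-e2e-diffusion" ;;
      */omni/*)      echo "ci-e2e-omni" ;;
      *)             echo "ci-core" ;;           # default: core modules
    esac
  done | sort -u
}
```

It could be fed from the diff, for example `git diff --name-only origin/main...HEAD | labels_for_paths`, where the base branch name is a placeholder. Defaulting unknown paths to ci-core is a conservative choice for the sketch, since shared modules can affect every group.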

new groups and labels

6244719

gemini-code-assist Bot reviewed Jul 3, 2026


Comment thread tests/gpu_smoke/lib_gpu_smoke.sh Outdated

wtomin and others added 2 commits July 3, 2026 11:31

reduce repetitive codes

d4a626b

Update tests/gpu_smoke/lib_gpu_smoke.sh

be507a9

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>

wtomin marked this pull request as ready for review July 3, 2026 03:36

wtomin requested review from SamitHuang and zhtmike as code owners July 3, 2026 03:36

zhtmike requested a review from AndyZhou952 July 3, 2026 03:48

core use 2 gpus

163b9ce

wtomin added the ready-for-ci read for running CI label Jul 3, 2026

fix actions and labels

c2ccef6

github-actions Bot removed the ready-for-ci read for running CI label Jul 3, 2026

wtomin added the ready-for-ci read for running CI label Jul 3, 2026

setup core & e2e* reward e2e jobs in parallel

c09a7cd

wtomin added ready-for-ci read for running CI and removed ready-for-ci read for running CI labels Jul 3, 2026

safer trigger

4bb71d1

wtomin added ready-for-ci read for running CI and removed ready-for-ci read for running CI labels Jul 3, 2026

all tests running on 8xL20 in parallel

717b764

wtomin added the ready-for-ci read for running CI label Jul 3, 2026

zhtmike reviewed Jul 3, 2026


wtomin added 2 commits July 3, 2026 14:27

not to trigger irrelevant jobs

3422abb

arbitrary positive gpu nums

b104499

github-actions Bot removed the ready-for-ci read for running CI label Jul 3, 2026

docs updates

44e9cee

wtomin added the ready-for-ci read for running CI label Jul 3, 2026

change trigger label name

007c20f

github-actions Bot removed the ready-for-ci read for running CI label Jul 3, 2026

wtomin added the ci-core Run all core modules - training, rollout, reward engines, weight sync manager, etc label Jul 3, 2026

run checks if ci is enabled

41d7238

wtomin removed the ci-core Run all core modules - training, rollout, reward engines, weight sync manager, etc label Jul 3, 2026

wtomin added 2 commits July 3, 2026 17:31

revert synchronize

d07b89b

drop ci labels

6c33ced
