
[ci, tests] refactor: split GPU smoke CI into label-driven test groups#233

Open
wtomin wants to merge 15 commits into
verl-project:mainfrom
wtomin:new-ci-pipeline

Conversation

@wtomin

@wtomin wtomin commented Jul 3, 2026

Collaborator

Summary

Refactors the GPU smoke CI pipeline so PRs can trigger smaller, right-sized GPU jobs via labels instead of always provisioning a single 8-GPU runner for the full suite.

  • Split the monolithic run_gpu_smoke_tests.sh into four group scripts backed by a shared lib_gpu_smoke.sh helper library.
  • Reorganized tests by GPU topology and resource needs:
    • core (2 GPU): vllm-omni rollout, diffusion agent loop, diffusers FSDP/VeOmni engines, diffusion rollout seed multi-worker, visual reward manager
    • e2e (2 GPU): DiffusionNFT trainer e2e, Qwen3-Omni Thinker GSPO LoRA e2e
    • reward-e2e (4 GPU): FlowGRPO trainer e2e, Qwen-Image online DPO trainer e2e
  • Updated .github/workflows/gpu_smoke.yml with label-driven job dispatch and per-group dynamic runners (L20x1 / L20x2 / L20x4).
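The split into group scripts on top of a shared helper library can be pictured with a minimal sketch. The helper names `run_test` and `print_summary` are hypothetical and are not taken from lib_gpu_smoke.sh; both halves are shown in one file only so the example is self-contained.

```bash
#!/usr/bin/env bash
set -u

# --- would live in a shared helper library (names are hypothetical) ---
passed=()
failed=()

# run_test NAME CMD...: run one smoke test and record whether it passed.
run_test() {
  local name="$1"
  shift
  if "$@"; then
    passed+=("$name")
  else
    failed+=("$name")
  fi
}

# print_summary: report totals; returns non-zero if any test failed.
print_summary() {
  echo "passed: ${#passed[@]}, failed: ${#failed[@]}"
  [ "${#failed[@]}" -eq 0 ]
}

# --- would live in a group script; `true`/`false` stand in for real tests ---
run_test "always-passes" true
run_test "always-fails" false
print_summary || echo "failed tests: ${failed[*]}"
```

Keeping the bookkeeping in one place means each group script only lists its tests, which is presumably the point of the shared library.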

PR label → CI behavior

| Label | Jobs | Runner |
| --- | --- | --- |
| ready-for-ci | All groups, run in parallel on one 8-card server | 1× L20x8 |
| ci-core | core only | 1× L20x2 |
| ci-e2e-omni | e2e only | 1× L20x2 |
| ci-e2e-diffusion | reward-e2e only | 1× L20x4 |

Push events and ready-for-ci still run the full suite; granular labels let contributors run only the subset relevant to their change.
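The label-to-job mapping above can be written down as a small lookup. This is only a sketch of the dispatch rule as described in this PR, not the logic in gpu_smoke.yml; the function name `plan_for_label` and the `groups|runner` output format are assumptions made for illustration.

```bash
#!/usr/bin/env bash
set -u

# plan_for_label LABEL: print "<groups>|<runner>" for a PR label,
# or return non-zero for a label that triggers no GPU smoke job.
plan_for_label() {
  case "$1" in
    ready-for-ci)     echo "core e2e reward-e2e|L20x8" ;;
    ci-core)          echo "core|L20x2" ;;
    ci-e2e-omni)      echo "e2e|L20x2" ;;
    ci-e2e-diffusion) echo "reward-e2e|L20x4" ;;
    *)                return 1 ;;
  esac
}

# Example: show the plan for each label used in this PR.
for label in ready-for-ci ci-core ci-e2e-omni ci-e2e-diffusion; do
  echo "$label -> $(plan_for_label "$label")"
done
```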

Motivation

The previous pipeline always used an 8-GPU runner for every ready-for-ci run, even when only a 2-GPU subset was needed. Splitting into groups reduces GPU waste and shortens feedback loops for targeted PRs.

Test plan

  • Open a PR and add ci-core: verify only the 2-GPU core job runs
  • Add ci-e2e-omni: verify only the 2-GPU e2e job runs
  • Add ci-e2e-diffusion: verify only the 4-GPU reward e2e job runs
  • Add ready-for-ci: verify full suite runs sequentially on one 8-GPU runner
  • Confirm cleanup destroys all provisioned dynamic runners after each run

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request refactors the GPU smoke test suite by splitting the monolithic run_gpu_smoke_tests.sh script into modular, resource-specific test group scripts and introducing a shared helper library, lib_gpu_smoke.sh, for common setup and reporting. Feedback is provided regarding a potential unbound variable error in lib_gpu_smoke.sh when referencing BASH_SOURCE[1] under set -u, with a suggestion to use a fallback parameter expansion.
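The unbound-variable concern can be illustrated with a minimal sketch. It assumes the library wants the path of the script that sourced it; the variable names `caller_script` and `caller_dir` are illustrative and not taken from lib_gpu_smoke.sh.

```bash
#!/usr/bin/env bash
set -u

# BASH_SOURCE[1] is only set when this file has a caller (for example when it
# is sourced from another script). A bare "${BASH_SOURCE[1]}" aborts under
# `set -u` when the file is run directly, so fall back to BASH_SOURCE[0],
# and then to $0 for shells started without a script file.
caller_script="${BASH_SOURCE[1]:-${BASH_SOURCE[0]:-$0}}"
caller_dir="$(cd -- "$(dirname -- "$caller_script")" && pwd)"

echo "caller script: $caller_script"
echo "caller dir:    $caller_dir"
```

The `:-` form is exempt from the `set -u` check, which is why the fallback is safe where the plain expansion is not.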

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread tests/gpu_smoke/lib_gpu_smoke.sh Outdated
wtomin and others added 2 commits July 3, 2026 11:31
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
@wtomin wtomin marked this pull request as ready for review July 3, 2026 03:36
@wtomin wtomin requested review from SamitHuang and zhtmike as code owners July 3, 2026 03:36
@zhtmike

zhtmike commented Jul 3, 2026

Collaborator

Better to keep the diffusion agent loop and diffusers FSDP/VeOmni engine tests on multiple GPUs, so they can be tested with FSDP/DP/SP/TP.

@zhtmike

zhtmike commented Jul 3, 2026

Collaborator

Updated .github/workflows/gpu_smoke.yml with label-driven job dispatch and per-group dynamic runners (L20x1 / L20x2 / L20x4).

I have tested this, and it does not seem to work.

@zhtmike

zhtmike commented Jul 3, 2026

Collaborator

It would be better to provide an offline test script, with a guide on how to run it offline.

@zhtmike zhtmike requested a review from AndyZhou952 July 3, 2026 03:48
@wtomin

wtomin commented Jul 3, 2026

Collaborator Author

Better to keep the diffusion agent loop and diffusers FSDP/VeOmni engine tests on multiple GPUs, so they can be tested with FSDP/DP/SP/TP.

Good suggestion! Now the core test will run on 2 GPUs.

@wtomin wtomin added the ready-for-ci read for running CI label Jul 3, 2026
@github-actions github-actions Bot removed the ready-for-ci read for running CI label Jul 3, 2026
@wtomin wtomin added the ready-for-ci read for running CI label Jul 3, 2026
@zhtmike

zhtmike commented Jul 3, 2026

Collaborator

Yes, it is still running sequentially. Better to allocate an 8-GPU server and run them in parallel.

@wtomin wtomin added ready-for-ci read for running CI and removed ready-for-ci read for running CI labels Jul 3, 2026
@wtomin wtomin added ready-for-ci read for running CI and removed ready-for-ci read for running CI labels Jul 3, 2026
@wtomin wtomin added the ready-for-ci read for running CI label Jul 3, 2026
@wtomin

wtomin commented Jul 3, 2026

Collaborator Author

Yes, it is still running sequentially. Better to allocate an 8-GPU server and run them in parallel.

That is feasible. I implemented it in the recent commit 717b764. It did run multiple jobs in parallel on an 8-card server; however, the logs are not straightforward to read.
image

You can only view the logs after all jobs are done and the logs have been uploaded as an artifact. It is a bit inconvenient, but more efficient. @zhtmike What do you think? Which one is better?
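The trade-off described here (parallel groups on one 8-card server, logs readable only at the end) follows from the usual pattern of backgrounding each group with its own GPU slice and log file. A minimal sketch, not the code from commit 717b764: the `run_group` helper, the GPU slices, and the stand-in `echo` commands are all assumptions.

```bash
#!/usr/bin/env bash
set -u

log_dir="$(mktemp -d)"
pids=()
names=()

# run_group NAME DEVICES CMD...: start CMD in the background, restricted to
# the given GPUs, with all of its output captured in a per-group log file.
run_group() {
  local name="$1" devices="$2"
  shift 2
  CUDA_VISIBLE_DEVICES="$devices" "$@" > "$log_dir/$name.log" 2>&1 &
  pids+=("$!")
  names+=("$name")
}

# Stand-in commands; a real run would invoke the group scripts instead.
run_group core       0,1     bash -c 'echo "core on GPUs $CUDA_VISIBLE_DEVICES"'
run_group e2e        2,3     bash -c 'echo "e2e on GPUs $CUDA_VISIBLE_DEVICES"'
run_group reward-e2e 4,5,6,7 bash -c 'echo "reward-e2e on GPUs $CUDA_VISIBLE_DEVICES"'

# Wait for every group, then print each log as one uninterleaved block.
failed=0
for i in "${!pids[@]}"; do
  if ! wait "${pids[$i]}"; then
    failed=1
    echo "group ${names[$i]} FAILED"
  fi
  echo "===== ${names[$i]} ====="
  cat "$log_dir/${names[$i]}.log"
done
```

Writing to separate files keeps the output of concurrent groups from interleaving, at the cost of not streaming it live.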

Collaborator


Better to allow the user to specify an arbitrary number of GPUs for local tests. For example, I only have 4 cards.

Collaborator Author


I think CUDA_VISIBLE_DEVICES=0,1,2,3 NUM_GPUS=4 bash tests/gpu_smoke/run_gpu_smoke_tests.sh could work locally.
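One way a script could honour these two variables on a machine with fewer cards is sketched below. How run_gpu_smoke_tests.sh actually reads `NUM_GPUS` is not shown in this thread, so the helpers `count_visible_gpus` and `can_run_group` are hypothetical; the per-group GPU counts are the ones listed in the PR description.

```bash
#!/usr/bin/env bash
set -u

# count_visible_gpus DEVICES: count the comma-separated entries in a
# CUDA_VISIBLE_DEVICES-style string; prints 0 for an empty string.
count_visible_gpus() {
  local devices="${1:-}"
  if [ -z "$devices" ]; then
    echo 0
    return
  fi
  local IFS=','
  local -a ids
  read -r -a ids <<< "$devices"
  echo "${#ids[@]}"
}

# can_run_group NEEDED AVAILABLE: succeed if enough GPUs are available.
can_run_group() {
  [ "$2" -ge "$1" ]
}

# Honour an explicit NUM_GPUS; otherwise infer it from CUDA_VISIBLE_DEVICES.
NUM_GPUS="${NUM_GPUS:-$(count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}")}"

for spec in core:2 e2e:2 reward-e2e:4; do
  group="${spec%%:*}"
  needed="${spec##*:}"
  if can_run_group "$needed" "$NUM_GPUS"; then
    echo "would run $group ($needed GPUs)"
  else
    echo "would skip $group (needs $needed GPUs, have $NUM_GPUS)"
  fi
done
```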

@zhtmike

zhtmike commented Jul 3, 2026

Collaborator

The skip checks are not that straightforward. Can we just not trigger them from the action YAML?

@github-actions github-actions Bot removed the ready-for-ci read for running CI label Jul 3, 2026
@wtomin wtomin added the ready-for-ci read for running CI label Jul 3, 2026
@zhtmike

zhtmike commented Jul 3, 2026

Collaborator

LGTM. Please update the documentation on how to add new tests in this new version, and add guidance on how to run the smoke tests locally.

@SamitHuang

Collaborator
  1. I agree that it's better to run gpu-smoke-core, gpu-smoke-e2e, and gpu-smoke-reward-e2e in parallel. For the logging problem, can we refer to verl CI, where multiple GPU tests are also run in parallel?
  2. I think the label names could be more intuitive; reward-e2e and e2e are quite confusing.

@zhtmike

zhtmike commented Jul 3, 2026

Collaborator
  1. I agree that it's better to run gpu-smoke-core, gpu-smoke-e2e, and gpu-smoke-reward-e2e in parallel. For the logging problem, can we refer to verl CI, where multiple GPU tests are also run in parallel?

I think it is currently a limitation of the CI server provider.

@SamitHuang

Collaborator

Can we automatically determine which tests should run from the PR code diff?

@wtomin

wtomin commented Jul 3, 2026

Collaborator Author
  1. I agree that it's better to run gpu-smoke-core, gpu-smoke-e2e, and gpu-smoke-reward-e2e in parallel. For the logging problem, can we refer to verl CI, where multiple GPU tests are also run in parallel?

verl-project/verl uses volcengine/vemlp-github-runner to provision one elastic GPU runner (typically L20x8) per workflow. Although the YAML defines multiple downstream jobs as structurally parallel, all jobs share the single runner-label from the setup step, so they queue up and execute sequentially on the same physical machine.

One solution is to create multiple runner instances. Invoke volcengine/vemlp-github-runner@v1 multiple times during the setup phase (with different images or configurations) to generate distinct runner-labels, allowing different jobs to run on separate runners. However, this requires the veMLP backend to support a single workflow requesting multiple tasks. Besides, we don't have different images or configurations for now. I think currently we can only run multiple jobs sequentially.

  2. I think the label names could be more intuitive; reward-e2e and e2e are quite confusing.

reward-e2e refers to the e2e tests that use a reward model, which need 4 cards. e2e refers to the e2e tests without a reward model, which need 2 cards. Any better suggestions?

@wtomin

wtomin commented Jul 3, 2026

Collaborator Author

LGTM. Please update the documentation on how to add new tests in this new version, and add guidance on how to run the smoke tests locally.

Sure. I will update the docs after the PR design is finalized.

@SamitHuang

SamitHuang commented Jul 3, 2026

Collaborator

reward-e2e refers to the e2e tests that use a reward model, which need 4 cards. e2e refers to the e2e tests without a reward model, which need 2 cards. Any better suggestions?

I suggest grouping the tests by the area of code changed rather than by GPU consumption. For example:
ci-core: all key modules (training, rollout, reward engines, weight sync manager, etc.)
ci-e2e-omni: e2e 1- or 2-step training for qwen3omni tiny gspo, and in future fully async omni and qwen-tts
ci-e2e-diffusion: e2e 1- or 2-step training for flowgrpo, online dpo, diffusionnft

Typically, diffusion model contributors will not touch omni-related code.
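Grouping by code area would also make it possible to pick labels from the diff, as asked earlier in the thread. A rough sketch under heavy assumptions: the path patterns (`*omni*`, `*diffusion*`) are placeholders and do not reflect the real repository layout, and `labels_for_paths` is a hypothetical helper.

```bash
#!/usr/bin/env bash
set -u

# labels_for_paths: read changed file paths on stdin, one per line, and
# print the de-duplicated set of CI labels they would trigger.
# The patterns below are placeholders, not the repository's real layout.
labels_for_paths() {
  local path label
  while IFS= read -r path; do
    case "$path" in
      *omni*)      label="ci-e2e-omni" ;;
      *diffusion*) label="ci-e2e-diffusion" ;;
      *)           label="ci-core" ;;
    esac
    echo "$label"
  done | sort -u
}

# Example with made-up paths; in CI the input could come from
# `git diff --name-only origin/main...HEAD`.
printf '%s\n' 'pkg/omni/rollout.py' 'pkg/utils/logging.py' | labels_for_paths
```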

@github-actions github-actions Bot removed the ready-for-ci read for running CI label Jul 3, 2026
@wtomin wtomin added the ci-core Run all core modules - training, rollout, reward engines, weight sync manager, etc label Jul 3, 2026
@wtomin wtomin removed the ci-core Run all core modules - training, rollout, reward engines, weight sync manager, etc label Jul 3, 2026

3 participants