[ci, tests] refactor: split GPU smoke CI into label-driven test groups#233
[ci, tests] refactor: split GPU smoke CI into label-driven test groups#233wtomin wants to merge 15 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the GPU smoke test suite by splitting the monolithic run_gpu_smoke_tests.sh script into modular, resource-specific test group scripts and introducing a shared helper library, lib_gpu_smoke.sh, for common setup and reporting. Feedback is provided regarding a potential unbound variable error in lib_gpu_smoke.sh when referencing BASH_SOURCE[1] under set -u, with a suggestion to use a fallback parameter expansion.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
|
better to keep diffusion agent loop, diffusers FSDP/VeOmni engines with multiple gpu to test with fsdp/dp/sp/tp. |
I have tested with this, seems does not work. |
|
A offline test script is better to be provided with guide of how to test it offline. |
Good suggestion! Now the |
|
yes, it is still running in sequential.. better to allocated a 8-gpu server, and run them parallel. |
It is plausible. I implemented in the recent commit 717b764. It did run multiple jobs in parallel in a 8-cards server, however, the log is not straightforward. You can only view the log after all jobs are done, and the logs are uploaded to artifact. It is a bit inconvient, but more efficient. @zhtmike What do you think? Which one is better? |
There was a problem hiding this comment.
better to allow the user to input with arbitrary number of gpus for local test. Like I do have 4 cards only
There was a problem hiding this comment.
I think CUDA_VISIBLE_DEVICES=0,1,2,3 NUM_GPUS=4 bash tests/gpu_smoke/run_gpu_smoke_tests.sh could work locally.
|
The skip checks is not that straightforward. Can we just not trigger them from action yaml? |
|
LGTM, please update the documents of how to add new tests in this new version; and a guidance of how to run smoke test locally. |
|
I think it is a limitation from the CI server provider currently. |
|
can we auto determine which tests should run for the PR code diff? |
One solution is to create multiple runner instances. Invoke
|
Sure. I will update the docs after the PR design is finalized. |
I suggest to update the test group logic according to the code change perspective, instead of GPU consume. Can be: Typically, diffusion model contributors will not touch omni-related codes. |

Summary
Refactors the GPU smoke CI pipeline so PRs can trigger smaller, right-sized GPU jobs via labels instead of always provisioning a single 8-GPU runner for the full suite.
run_gpu_smoke_tests.shinto four group scripts backed by a sharedlib_gpu_smoke.shhelper library..github/workflows/gpu_smoke.ymlwith label-driven job dispatch and per-group dynamic runners (L20x1 / L20x2 / L20x4).PR label → CI behavior
ready-for-cici-coreci-e2e-omnici-e2e-diffusionPush events and
ready-for-cistill run the full suite; granular labels let contributors run only the subset relevant to their change.Motivation
The previous pipeline always used an 8-GPU runner for every
ready-for-cirun, even when only 2-GPU subset was needed. Splitting into groups reduces GPU waste and shortens feedback loops for targeted PRs.Test plan
ci-core—verify only the 2-GPU core job runsci-e2e-omni— verify only the 2-GPU e2e job runsci-e2e-diffusion— verify only the 4-GPU reward e2e job runsready-for-ci— verify full suite runs sequentially on one 8-GPU runnercleanupdestroys all provisioned dynamic runners after each run