Summary
TileLang ST full SIM CI already enables testcase-level parallelism with --jobs 64, but end-to-end wall-clock time is still dominated by a small set of long-tail testcases.
Current CI command:
bash test/tilelang_st/script/run_ci.sh -r sim -v a5 --jobs 64
The key bottleneck is no longer build parallelism or lack of testcase-level concurrency. Instead, it is the combination of:
- simulator execution being serial inside each testcase
- highly imbalanced testcase runtime distribution
- several very heavy testcases dominating the tail latency of the whole batch
Command line
bash test/tilelang_st/script/run_ci.sh -r sim -v a5 --jobs 64
Reproduction input
Relevant workflow and scripts:
.github/workflows/ci.yml
test/tilelang_st/script/run_ci.sh
test/tilelang_st/script/run_all_st.py
test/tilelang_st/script/run_st.py
Relevant logs used for analysis:
build/1_vpto-sim-validation.txt
build/run_all.log
Expected performance
The PR SIM validation path should avoid being dominated by a few long-tail testcases.
A reasonable target is:
- keep a fast PR feedback path with significantly lower wall-clock time
- preserve full-coverage SIM validation in nightly or dedicated jobs
- make heavy testcase distribution visible and manageable
Actual performance
Findings from the 64-way parallel log:
- Testcase-level parallelism is already enabled:
  - the log shows `running testcases in parallel with jobs=64`
  - all 71 testcases are queued immediately
- Single-testcase simulator execution is still serial:
  - the log repeatedly shows `[TmSim]: Run in serial mode.`
- The runtime distribution is extremely imbalanced.
Top long-tail testcases from the log:
| testcase | model runtime | approx minutes |
| --- | --- | --- |
| trowargmax | 1493.3 s | 24.89 |
| trowargmin | 1486.7 s | 24.78 |
| trowprod | 1201.5 s | 20.02 |
| trowmin | 1183.7 s | 19.73 |
| trowmax | 1144.1 s | 19.07 |
| tpartmin | 1102.6 s | 18.38 |
| tpartmax | 1091.2 s | 18.19 |
| tsels | 876.9 s | 14.62 |
| tfillpad | 859.6 s | 14.33 |
| trowsum | 851.8 s | 14.20 |
| tcolmin | 825.2 s | 13.75 |
| tcolmax | 813.8 s | 13.56 |
Additional observations:
- all 71 testcase `Model RUN TIME` values sum to about 595.86 minutes of simulator time
- testcase execution wall-clock is still dominated by the slowest tail
- increasing `--jobs` further is unlikely to help much because the bottleneck is now the slowest serial testcases rather than queue width
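The last observation can be checked with a quick bound. This is a minimal sketch, assuming an idealized scheduler with no build or startup overhead: wall-clock time cannot fall below the larger of the slowest single testcase and the total simulator time divided by the worker count. The function name `makespan_lower_bound` is invented for illustration; the numbers are the ones reported above.

```python
def makespan_lower_bound(total_seconds: float, longest_seconds: float, jobs: int) -> float:
    """Lower bound on wall-clock time for independent tasks that cannot be split."""
    return max(longest_seconds, total_seconds / jobs)


TOTAL_S = 595.86 * 60   # sum of all 71 Model RUN TIME values, from the log
LONGEST_S = 1493.3      # trowargmax, the slowest testcase in the table

for jobs in (64, 128, 256):
    bound_min = makespan_lower_bound(TOTAL_S, LONGEST_S, jobs) / 60
    print(f"jobs={jobs}: wall-clock >= {bound_min:.2f} min")
```

At `jobs=64` the average-load term is roughly 9.3 minutes, so the bound of about 24.89 minutes comes entirely from the slowest testcase, and a wider queue does not move it.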
Profiling data
Current bottleneck assessment:
Primary bottleneck
The dominant bottleneck is long-tail testcase runtime under a serial per-testcase simulator.
Secondary bottlenecks
There are script-side inefficiencies worth cleaning up, though they are not the primary reason for the current tail:
- `run_all_st.py` parallel mode launches one nested Python subprocess per testcase
- each subprocess runs the single-testcase path with `jobs=1`
- testcase output is buffered and replayed only after completion
- the selected testcase build target is computed but the shared build still passes `"all"`
Suggested optimization items
Priority 0: split CI coverage by testcase weight
- Define a lightweight PR SIM suite and move the heaviest long-tail cases to nightly or a separate job.
- Maintain an explicit allowlist or skiplist for heavy testcase groups in CI.
- Keep full-coverage SIM validation available, but decouple it from the fast feedback path.
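One way the weight-based split could be derived from measured runtimes is sketched below. The helper `split_by_weight` and the 900-second cutoff are hypothetical; the actual threshold and list format are policy decisions for the CI owners. The runtimes are a subset of the table above.

```python
from typing import Dict, List, Tuple


def split_by_weight(
    runtimes_s: Dict[str, float], heavy_threshold_s: float
) -> Tuple[List[str], List[str]]:
    """Return (pr_suite, nightly_suite), each sorted by name for stable CI config."""
    pr_suite = sorted(n for n, t in runtimes_s.items() if t < heavy_threshold_s)
    nightly_suite = sorted(n for n, t in runtimes_s.items() if t >= heavy_threshold_s)
    return pr_suite, nightly_suite


# Model runtimes in seconds, copied from the long-tail table (subset).
RUNTIMES_S = {
    "trowargmax": 1493.3,
    "trowargmin": 1486.7,
    "trowprod": 1201.5,
    "tsels": 876.9,
    "tcolmax": 813.8,
}

# Hypothetical 15-minute cutoff, chosen only to make the example concrete.
pr_suite, nightly_suite = split_by_weight(RUNTIMES_S, heavy_threshold_s=900.0)
print("PR suite:", pr_suite)
print("nightly suite:", nightly_suite)
```

Generating the lists from logged runtimes, rather than maintaining them by hand, would also keep the heavy testcase distribution visible as it drifts.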
Priority 1: split heavy testcase internals by case granularity
- Extend the batch framework so it can schedule testcase internal cases, not only testcase directories.
- Reuse the existing `run_st.py -c/--case` path for finer-grained parallel scheduling.
- Split heavy testcase families such as row/col reduce and argmax/argmin into smaller work units.
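The effect of finer work units can be illustrated with a greedy longest-processing-time-first scheduler. This is a sketch under stated assumptions: the testcase name `theavy`, the `::case` unit naming, and the per-case durations are all hypothetical, and per-case startup overhead is ignored. It does not reflect how `run_all_st.py` currently schedules work.

```python
import heapq
from typing import Dict, List, Tuple


def lpt_schedule(units: Dict[str, float], workers: int) -> Tuple[float, List[List[str]]]:
    """Assign the longest remaining unit to the least-loaded worker.

    Returns (makespan, per-worker list of unit names).
    """
    heap = [(0.0, w) for w in range(workers)]  # (current load, worker index)
    heapq.heapify(heap)
    plan: List[List[str]] = [[] for _ in range(workers)]
    for name, cost in sorted(units.items(), key=lambda kv: kv[1], reverse=True):
        load, w = heapq.heappop(heap)
        plan[w].append(name)
        heapq.heappush(heap, (load + cost, w))
    return max(load for load, _ in heap), plan


# Hypothetical heavy testcase: one 1200 s directory vs. four 300 s internal cases.
whole = {"theavy": 1200.0}
split = {f"theavy::case{i}": 300.0 for i in range(4)}

print("directory granularity:", lpt_schedule(whole, workers=4)[0], "s")
print("case granularity:", lpt_schedule(split, workers=4)[0], "s")
```

In this idealized setup the same work finishes in a quarter of the time once the scheduler can see the internal cases; the real gain depends on how evenly each testcase's cases divide and on per-case startup cost.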
Priority 2: introduce a smoke-vs-full matrix policy
- Keep one or a few representative cases per operator family for PR.
- Keep the full shape matrix for nightly or scheduled validation.
- Make the policy explicit in CI configuration and documentation.
Priority 3: clean up script-level inefficiencies
- Avoid launching one nested Python batch runner subprocess per testcase in parallel mode.
- Refactor `run_st.py` helpers to remove global `os.chdir()` dependence so in-process workers become safe.
- Write per-testcase logs to files and print concise live progress to stdout.
- Honor the selected build target instead of always building `all` in batch mode.
- Reduce repeated script copying into testcase build directories where possible.
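Two of these items, removing the `os.chdir()` dependence and writing per-testcase logs with concise live progress, could look roughly like the sketch below. It uses only the Python standard library; `run_in_dir`, `run_all`, and the placeholder command are hypothetical and are not the existing `run_st.py` helpers. The key idea is to pass `cwd=` to `subprocess.run` so the process-wide working directory is never changed and worker threads cannot interfere with each other.

```python
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Dict, List, Tuple


def run_in_dir(name: str, cmd: List[str], workdir: Path, log_dir: Path) -> Tuple[str, int]:
    """Run cmd with cwd=workdir (no os.chdir) and write its output to <log_dir>/<name>.log."""
    log_path = log_dir / f"{name}.log"
    with log_path.open("w") as log:
        proc = subprocess.run(cmd, cwd=workdir, stdout=log, stderr=subprocess.STDOUT)
    return name, proc.returncode


def run_all(jobs: Dict[str, Tuple[List[str], Path]], log_dir: Path, max_workers: int) -> Dict[str, int]:
    """Run every job in a thread pool and print one progress line as each finishes."""
    results: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_in_dir, name, cmd, workdir, log_dir)
            for name, (cmd, workdir) in jobs.items()
        ]
        for done, fut in enumerate(as_completed(futures), start=1):
            name, rc = fut.result()
            results[name] = rc
            print(f"[{done}/{len(futures)}] {name}: {'PASS' if rc == 0 else 'FAIL'}")
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        log_dir = root / "logs"
        log_dir.mkdir()
        jobs = {}
        for name in ("case_a", "case_b"):  # placeholder names, not real testcases
            workdir = root / name
            workdir.mkdir()
            # Placeholder command: each child prints the directory it runs in.
            jobs[name] = ([sys.executable, "-c", "import os; print(os.getcwd())"], workdir)
        print(run_all(jobs, log_dir, max_workers=2))
```

Threads are sufficient here because each worker spends its time waiting on a child process; full output stays in the log files while stdout carries only one line per finished testcase.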
Git commit
be690d3959ddccffe747fe5b14ceea9c9dc0a4f9