[Performance] Optimize TileLang ST SIM CI long-tail bottleneck under 64-way parallelism #301

@Zhendong404

Summary

TileLang ST full SIM CI already enables testcase-level parallelism with --jobs 64, but end-to-end wall-clock time is still dominated by a small set of long-tail testcases.

Current CI command:

bash test/tilelang_st/script/run_ci.sh -r sim -v a5 --jobs 64

The key bottleneck is no longer build parallelism or lack of testcase-level concurrency. Instead, it is the combination of:

  1. simulator execution being serial inside each testcase
  2. highly imbalanced testcase runtime distribution
  3. several very heavy testcases dominating the tail latency of the whole batch

Command line

bash test/tilelang_st/script/run_ci.sh -r sim -v a5 --jobs 64

Reproduction input

Relevant workflow and scripts:

  • .github/workflows/ci.yml
  • test/tilelang_st/script/run_ci.sh
  • test/tilelang_st/script/run_all_st.py
  • test/tilelang_st/script/run_st.py

Relevant logs used for analysis:

  • build/1_vpto-sim-validation.txt
  • build/run_all.log

Expected performance

The PR SIM validation path should avoid being dominated by a few long-tail testcases.

A reasonable target is:

  1. keep a fast PR feedback path with significantly lower wall-clock time
  2. preserve full-coverage SIM validation in nightly or dedicated jobs
  3. make heavy testcase distribution visible and manageable

Actual performance

Findings from the 64-way parallel log:

  1. Testcase-level parallelism is already enabled:
    • log shows running testcases in parallel with jobs=64
    • all 71 testcases are queued immediately
  2. Single-testcase simulator execution is still serial:
    • log repeatedly shows [TmSim]: Run in serial mode.
  3. The runtime distribution is extremely imbalanced.

Top long-tail testcases from the log:

testcase      Model RUN TIME (s)   approx. minutes
trowargmax    1493.3               24.89
trowargmin    1486.7               24.78
trowprod      1201.5               20.02
trowmin       1183.7               19.73
trowmax       1144.1               19.07
tpartmin      1102.6               18.38
tpartmax      1091.2               18.19
tsels          876.9               14.62
tfillpad       859.6               14.33
trowsum        851.8               14.20
tcolmin        825.2               13.75
tcolmax        813.8               13.56

Additional observations:

  • the Model RUN TIME values of all 71 testcases sum to 595.86 minutes (about 9.9 hours) of simulator time
  • testcase execution wall-clock is still dominated by the slowest tail
  • increasing --jobs further is unlikely to help much because the bottleneck is now the slowest serial testcases rather than queue width
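The last observation can be checked with a simple lower bound: no schedule can finish before its longest single testcase, nor before the total work divided by the number of workers. The sketch below applies that bound to the figures from the log. The function name and the extra job counts (71, 128) are illustrative only.

```python
# Lower bound on batch wall-clock time: the batch cannot finish before the
# longest single testcase, nor before total work / number of workers.
def makespan_lower_bound(longest_min, total_min, jobs):
    return max(longest_min, total_min / jobs)

LONGEST_MIN = 1493.3 / 60   # trowargmax, from the table above
TOTAL_MIN = 595.86          # sum of all 71 Model RUN TIME values

for jobs in (64, 71, 128):  # 71 and 128 are hypothetical settings
    bound = makespan_lower_bound(LONGEST_MIN, TOTAL_MIN, jobs)
    print(f"jobs={jobs:3d}  lower bound = {bound:.2f} min")
```

At --jobs 64 the work-per-worker term is only about 9.3 minutes, so the bound is already set by the 24.89-minute testcase and does not move when the job count grows.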

Profiling data

Current bottleneck assessment:

Primary bottleneck

The dominant bottleneck is long-tail testcase runtime under a serial per-testcase simulator.

Secondary bottlenecks

There are script-side inefficiencies worth cleaning up, though they are not the primary reason for the current tail:

  1. run_all_st.py parallel mode launches one nested Python subprocess per testcase
  2. each subprocess runs the single-testcase path with jobs=1
  3. testcase output is buffered and replayed only after completion
  4. the selected testcase build target is computed, but the shared build step still passes "all" as the target

Suggested optimization items

Priority 0: split CI coverage by testcase weight

  1. Define a lightweight PR SIM suite and move the heaviest long-tail cases to nightly or a separate job.
  2. Maintain an explicit allowlist or skiplist for heavy testcase groups in CI.
  3. Keep full-coverage SIM validation available, but decouple it from the fast feedback path.
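A minimal sketch of how the split could be derived from measured runtimes. The threshold value, the function name, and the placeholder testcase "example_light_case" are assumptions for illustration; the runtimes are the ones listed in the long-tail table of this issue.

```python
# Model RUN TIME in seconds, copied from the long-tail table in this issue.
MODEL_RUN_TIME_S = {
    "trowargmax": 1493.3, "trowargmin": 1486.7, "trowprod": 1201.5,
    "trowmin": 1183.7, "trowmax": 1144.1, "tpartmin": 1102.6,
    "tpartmax": 1091.2, "tsels": 876.9, "tfillpad": 859.6,
    "trowsum": 851.8, "tcolmin": 825.2, "tcolmax": 813.8,
}

HEAVY_THRESHOLD_S = 1000.0  # hypothetical cutoff, to be tuned per CI budget


def split_by_weight(all_testcases, runtimes, threshold_s):
    """Return (pr_suite, nightly_only) as sorted lists.

    Testcases without a recorded runtime are treated as light and stay
    in the PR suite.
    """
    heavy = {n for n in all_testcases if runtimes.get(n, 0.0) >= threshold_s}
    light = set(all_testcases) - heavy
    return sorted(light), sorted(heavy)


# "example_light_case" is a placeholder, not a real testcase name.
testcases = list(MODEL_RUN_TIME_S) + ["example_light_case"]
pr_suite, nightly_only = split_by_weight(testcases, MODEL_RUN_TIME_S, HEAVY_THRESHOLD_S)
print("PR suite:    ", pr_suite)
print("nightly only:", nightly_only)
```

Generating the skiplist from logged runtimes, rather than maintaining it by hand, would also make the heavy testcase distribution visible over time.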

Priority 1: split heavy testcase internals by case granularity

  1. Extend the batch framework so it can schedule testcase internal cases, not only testcase directories.
  2. Reuse the existing run_st.py -c/--case path for finer-grained parallel scheduling.
  3. Split heavy testcase families such as row/col reduce and argmax/argmin into smaller work units.
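One possible shape for case-level scheduling is sketched below: expand each testcase into (testcase, case) work units, then submit the units with the longest estimated runtime first so that heavy units do not start late. All names here are hypothetical. In particular, run_one stands in for whatever would invoke the run_st.py -c/--case path, and the case names and runtime estimates are made up for the demo.

```python
from concurrent.futures import ThreadPoolExecutor


def expand_work_units(cases_by_testcase):
    """Flatten {testcase: [case, ...]} into a list of (testcase, case) units."""
    return [(tc, case) for tc, cases in cases_by_testcase.items() for case in cases]


def order_longest_first(units, est_runtime_s):
    """Sort units by estimated runtime, longest first; unknown units go last."""
    return sorted(units, key=lambda u: est_runtime_s.get(u, 0.0), reverse=True)


def run_units(units, run_one, jobs):
    """Run run_one(testcase, case) for every unit with at most `jobs` workers."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = pool.map(lambda u: run_one(*u), units)
        return dict(zip(units, results))


# Demo with placeholder case names and a fake runner.
cases = {"trowargmax": ["case_0", "case_1"], "tsels": ["case_0"]}
estimates = {("trowargmax", "case_1"): 900.0, ("tsels", "case_0"): 300.0}
ordered = order_longest_first(expand_work_units(cases), estimates)
results = run_units(ordered, lambda tc, case: f"ran {tc}:{case}", jobs=2)
print(ordered)
```

Threads are sufficient here only under the assumption that each unit spends its time in an external simulator process; the runner itself does no heavy work in Python.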

Priority 2: introduce a smoke-vs-full matrix policy

  1. Keep one or a few representative cases per operator family for PR.
  2. Keep the full shape matrix for nightly or scheduled validation.
  3. Make the policy explicit in CI configuration and documentation.

Priority 3: clean up script-level inefficiencies

  1. Avoid launching one nested Python batch runner subprocess per testcase in parallel mode.
  2. Refactor run_st.py helpers to remove global os.chdir() dependence so in-process workers become safe.
  3. Write per-testcase logs to files and print concise live progress to stdout.
  4. Honor the selected build target instead of always building all in batch mode.
  5. Reduce repeated script copying into testcase build directories where possible.
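Items 2 and 3 can be illustrated together. The sketch below runs a command with its working directory passed explicitly through cwd= instead of a global os.chdir(), streams its output to a per-testcase log file, and prints a single progress line on completion. The function name, log naming, and progress format are assumptions, and the demo command is a stand-in rather than the real testcase invocation.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def run_with_log(name, cmd, workdir, log_dir):
    """Run cmd in workdir, write its output to <log_dir>/<name>.log.

    The working directory is passed to the child process only, so the
    parent's cwd is never changed and concurrent workers stay independent.
    """
    log_path = Path(log_dir) / f"{name}.log"
    start = time.monotonic()
    with open(log_path, "w") as log:
        proc = subprocess.run(cmd, cwd=workdir, stdout=log, stderr=subprocess.STDOUT)
    elapsed = time.monotonic() - start
    status = "PASS" if proc.returncode == 0 else "FAIL"
    # One concise line per testcase instead of replaying buffered output.
    print(f"[{status}] {name} {elapsed:.1f}s log={log_path}", flush=True)
    return proc.returncode, log_path


# Demo: a trivial child process stands in for a real testcase run.
with tempfile.TemporaryDirectory() as tmp:
    demo_cmd = [sys.executable, "-c", "print('simulated testcase output')"]
    demo_rc, demo_log = run_with_log("demo_case", demo_cmd, tmp, tmp)
    demo_text = demo_log.read_text()
```

Because output goes straight to a file, a hung or very slow testcase can be inspected while it is still running, which buffered replay does not allow.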

Git commit

be690d3959ddccffe747fe5b14ceea9c9dc0a4f9
