[Performance] Optimize TileLang ST SIM CI long-tail bottleneck under 64-way parallelism

## Summary

TileLang ST full SIM CI already enables testcase-level parallelism with `--jobs 64`, but end-to-end wall-clock time is still dominated by a small set of long-tail testcases.

Current CI command:

```bash
bash test/tilelang_st/script/run_ci.sh -r sim -v a5 --jobs 64
```

The key bottleneck is no longer build parallelism or lack of testcase-level concurrency. Instead, it is the combination of:

1. simulator execution being serial inside each testcase
2. highly imbalanced testcase runtime distribution
3. several very heavy testcases dominating the tail latency of the whole batch

## Command line

```bash
bash test/tilelang_st/script/run_ci.sh -r sim -v a5 --jobs 64
```

## Reproduction input

Relevant workflow and scripts:

- `.github/workflows/ci.yml`
- `test/tilelang_st/script/run_ci.sh`
- `test/tilelang_st/script/run_all_st.py`
- `test/tilelang_st/script/run_st.py`

Relevant logs used for analysis:

- `build/1_vpto-sim-validation.txt`
- `build/run_all.log`

## Expected performance

The PR SIM validation path should avoid being dominated by a few long-tail testcases.

A reasonable target is:

1. keep a fast PR feedback path with significantly lower wall-clock time
2. preserve full-coverage SIM validation in nightly or dedicated jobs
3. make heavy testcase distribution visible and manageable

## Actual performance

Findings from the 64-way parallel log:

1. Testcase-level parallelism is already enabled:
   - log shows `running testcases in parallel with jobs=64`
   - all 71 testcases are queued immediately
2. Single-testcase simulator execution is still serial:
   - log repeatedly shows `[TmSim]: Run in serial mode.`
3. The runtime distribution is extremely imbalanced.

Top long-tail testcases from the log:

| testcase | `Model RUN TIME` | approx. minutes |
| --- | ---: | ---: |
| `trowargmax` | 1493.3 s | 24.89 |
| `trowargmin` | 1486.7 s | 24.78 |
| `trowprod` | 1201.5 s | 20.02 |
| `trowmin` | 1183.7 s | 19.73 |
| `trowmax` | 1144.1 s | 19.07 |
| `tpartmin` | 1102.6 s | 18.38 |
| `tpartmax` | 1091.2 s | 18.19 |
| `tsels` | 876.9 s | 14.62 |
| `tfillpad` | 859.6 s | 14.33 |
| `trowsum` | 851.8 s | 14.20 |
| `tcolmin` | 825.2 s | 13.75 |
| `tcolmax` | 813.8 s | 13.56 |

Additional observations:

- all 71 testcase `Model RUN TIME` values sum to about `595.86` minutes of simulator time
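- batch wall-clock time cannot drop below the slowest single testcase (`trowargmax`, about 24.9 minutes), so it is still dominated by the tail
- increasing `--jobs` beyond 64 is unlikely to help much: with 71 testcases nearly every testcase already has its own worker, so the bottleneck is the slowest serial testcases rather than queue width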
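
The arithmetic behind the last observation can be checked with a short calculation. With ideal scheduling and indivisible testcases, the batch wall-clock cannot be lower than the larger of two values: the slowest single testcase, and the total simulator time divided by the job count. The sketch below plugs in the figures from the log. The function name is illustrative, and build and startup overhead are ignored.

```python
def makespan_lower_bound(total_s: float, longest_s: float, jobs: int) -> float:
    """Lower bound on batch wall-clock with `jobs` workers and indivisible testcases."""
    return max(longest_s, total_s / jobs)


TOTAL_S = 595.86 * 60  # sum of all 71 Model RUN TIME values, from the log
LONGEST_S = 1493.3     # trowargmax, the slowest testcase in the table

for jobs in (64, 128):
    bound_min = makespan_lower_bound(TOTAL_S, LONGEST_S, jobs) / 60
    avg_min = TOTAL_S / jobs / 60
    print(f"jobs={jobs}: lower bound {bound_min:.2f} min, average load {avg_min:.2f} min")
```

At 64 jobs the average load per worker is about 9.3 minutes, but the bound stays at the runtime of `trowargmax`, and doubling the job count does not change it.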

## Profiling data

Current bottleneck assessment:

### Primary bottleneck

The dominant bottleneck is **long-tail testcase runtime under a serial per-testcase simulator**.

### Secondary bottlenecks

There are script-side inefficiencies worth cleaning up, though they are not the primary reason for the current tail:

1. `run_all_st.py` parallel mode launches one nested Python subprocess per testcase
2. each subprocess runs the single-testcase path with `jobs=1`
3. testcase output is buffered and replayed only after completion
4. the build target for the selected testcases is computed, but the shared build step still passes `"all"`

## Suggested optimization items

### Priority 0: split CI coverage by testcase weight

1. Define a lightweight PR SIM suite and move the heaviest long-tail cases to nightly or a separate job.
2. Maintain an explicit allowlist or skiplist for heavy testcase groups in CI.
3. Keep full-coverage SIM validation available, but decouple it from the fast feedback path.
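
A minimal sketch of how such a split could be derived from measured runtimes. The runtimes are the ones from the long-tail table; the `split_by_weight` helper and the 1000 s threshold are hypothetical and not part of the existing scripts.

```python
from typing import Dict, List, Tuple

# Model runtimes in seconds, copied from the long-tail table in this issue.
RUNTIMES_S: Dict[str, float] = {
    "trowargmax": 1493.3, "trowargmin": 1486.7, "trowprod": 1201.5,
    "trowmin": 1183.7, "trowmax": 1144.1, "tpartmin": 1102.6,
    "tpartmax": 1091.2, "tsels": 876.9, "tfillpad": 859.6,
    "trowsum": 851.8, "tcolmin": 825.2, "tcolmax": 813.8,
}


def split_by_weight(
    runtimes_s: Dict[str, float], heavy_threshold_s: float
) -> Tuple[List[str], List[str]]:
    """Return (pr_suite, nightly_only), each sorted by testcase name."""
    pr_suite: List[str] = []
    nightly_only: List[str] = []
    for name, seconds in sorted(runtimes_s.items()):
        if seconds >= heavy_threshold_s:
            nightly_only.append(name)
        else:
            pr_suite.append(name)
    return pr_suite, nightly_only


# Hypothetical threshold: anything at or above 1000 s is treated as heavy.
pr_suite, nightly_only = split_by_weight(RUNTIMES_S, heavy_threshold_s=1000.0)
print("PR suite:", pr_suite)
print("Nightly only:", nightly_only)
```

Generating the skiplist from logged runtimes, rather than maintaining it by hand, would also keep the heavy-testcase distribution visible as testcases change.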

### Priority 1: split heavy testcase internals by case granularity

1. Extend the batch framework so it can schedule testcase internal cases, not only testcase directories.
2. Reuse the existing `run_st.py -c/--case` path for finer-grained parallel scheduling.
3. Split heavy testcase families such as row/col reduce and argmax/argmin into smaller work units.
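
To illustrate why case-level granularity helps, the sketch below simulates longest-first greedy scheduling over a fixed number of workers. The durations are hypothetical, and the model assumes that a testcase's runtime is the sum of its internal cases with negligible per-case startup cost, which has not been measured here.

```python
import heapq
from typing import Iterable, List


def lpt_makespan(durations_s: Iterable[float], jobs: int) -> float:
    """Simulate longest-first greedy scheduling and return the batch finish time."""
    workers: List[float] = [0.0] * jobs  # min-heap of per-worker finish times
    heapq.heapify(workers)
    for duration in sorted(durations_s, reverse=True):
        earliest = heapq.heappop(workers)  # worker that becomes free first
        heapq.heappush(workers, earliest + duration)
    return max(workers)


# Hypothetical workload: one 1200 s testcase plus six 100 s testcases.
whole = [1200.0] + [100.0] * 6
# Same workload with the heavy testcase split into eight 150 s internal cases.
split = [150.0] * 8 + [100.0] * 6

print("testcase granularity:", lpt_makespan(whole, jobs=4), "s")
print("case granularity:    ", lpt_makespan(split, jobs=4), "s")
```

In this toy workload the total work is unchanged, yet the finish time falls from the length of the heavy testcase to roughly the average load per worker.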

### Priority 2: introduce a smoke-vs-full matrix policy

1. Keep one or a few representative cases per operator family for PR.
2. Keep the full shape matrix for nightly or scheduled validation.
3. Make the policy explicit in CI configuration and documentation.

### Priority 3: clean up script-level inefficiencies

1. Avoid launching one nested Python batch runner subprocess per testcase in parallel mode.
2. Refactor `run_st.py` helpers to remove global `os.chdir()` dependence so in-process workers become safe.
3. Write per-testcase logs to files and print concise live progress to stdout.
4. Honor the selected build target instead of always building `all` in batch mode.
5. Reduce repeated script copying into testcase build directories where possible.
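
Items 2 and 3 could look roughly like the sketch below: each worker passes `cwd=` to its subprocess instead of calling `os.chdir()`, writes output straight to a per-testcase log file, and the parent prints one progress line as each testcase finishes. The `run_one` and `run_batch` helpers and the log layout are assumptions for illustration and do not reflect the current `run_all_st.py` or `run_st.py` code.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Dict, List


def run_one(name: str, cmd: List[str], workdir: Path, log_dir: Path) -> int:
    """Run one command with cwd= (no global os.chdir) and log its output to a file."""
    log_path = log_dir / f"{name}.log"
    with log_path.open("w") as log:
        proc = subprocess.run(cmd, cwd=workdir, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode


def run_batch(
    commands: Dict[str, List[str]], workdir: Path, log_dir: Path, jobs: int
) -> Dict[str, int]:
    """Run all commands with up to `jobs` workers and print live progress lines."""
    log_dir.mkdir(parents=True, exist_ok=True)
    results: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {
            pool.submit(run_one, name, cmd, workdir, log_dir): name
            for name, cmd in commands.items()
        }
        for done, future in enumerate(as_completed(futures), start=1):
            name = futures[future]
            results[name] = future.result()
            status = "PASS" if results[name] == 0 else "FAIL"
            print(f"[{done}/{len(futures)}] {name}: {status}", flush=True)
    return results
```

Threads are sufficient here because the workers only wait on child processes. A call such as `run_batch({"demo": [sys.executable, "-c", "print('ok')"]}, Path("."), Path("logs"), jobs=2)` would write `logs/demo.log` and print a single progress line.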

## Git commit

`be690d3959ddccffe747fe5b14ceea9c9dc0a4f9`

