[Perf] Reduce compilation time on kernels with multiple offloaded tasks by hughperkins · Pull Request #729 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-06-08T02:24:53Z

Once a kernel is offloaded, cfg_optimization now builds a separate control-flow
graph per offloaded task (per sub-block, with the correct parallel-execution
flag) instead of one whole-kernel CFG spanning all tasks. Store-to-load
forwarding and dead-store elimination then run scoped to each task.

This is semantics-preserving: each offloaded task is a separate device launch,
so cross-task register forwarding is impossible anyway, and global memory
(fields, external tensors, and the global-temporary buffer that carries values
between tasks) is treated conservatively as live-in/live-out of every task by
the existing CFG boundary seeding (reaching-def start-node seed + live-var
final-node seed). The win is compile time: the reaching-definition /
live-variable dataflow becomes roughly linear in total IR rather than
super-linear in the combined whole-kernel IR, which blows up for kernels that
pack many stages into one @qd.kernel.
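
As a rough illustration of the compile-time argument above, here is a toy cost model. The PR only says the analysis is super-linear; the quadratic exponent, the task sizes, and every function name below are assumptions made for the sketch, not measurements from Quadrants.

```python
# Toy cost model (assumption: dataflow cost grows as n**2 in statement count).
def analysis_cost(num_stmts, exponent=2.0):
    """Hypothetical cost of running the dataflow analysis on one CFG."""
    return num_stmts ** exponent


def whole_kernel_cost(task_sizes, exponent=2.0):
    """One CFG spanning every task: cost depends on the combined size."""
    return analysis_cost(sum(task_sizes), exponent)


def per_task_cost(task_sizes, exponent=2.0):
    """One CFG per task: costs add up task by task."""
    return sum(analysis_cost(n, exponent) for n in task_sizes)


# Illustrative kernel: 20 offloaded tasks of 100 statements each.
task_sizes = [100] * 20
whole = whole_kernel_cost(task_sizes)
split = per_task_cost(task_sizes)
ratio = whole / split
print(f"whole-kernel: {whole:.0f}, per-task: {split:.0f}, ratio: {ratio:.0f}x")
```

Under this assumed model, with a fixed task size the per-task total grows in proportion to the number of tasks, while the whole-kernel cost grows with the square of the combined size.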

Gated by CompileConfig::cfg_optimization_per_task (default true; env
QD_CFG_OPTIMIZATION_PER_TASK / qd.init kwarg). Pre-offload IR, function bodies,
and the real-matrix path fall back to the whole-kernel CFG unchanged.

Issue: #
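
The gating above (default true, overridable by the QD_CFG_OPTIMIZATION_PER_TASK environment variable or a qd.init kwarg) could be resolved along the following lines. This is a minimal sketch, not the project's code: the helper name, the precedence (kwarg over environment over default), and the set of strings treated as false are all assumptions.

```python
import os

ENV_VAR = "QD_CFG_OPTIMIZATION_PER_TASK"
# Assumption: these spellings disable the flag; the real parser may differ.
_FALSE_STRINGS = {"0", "false", "off", "no"}


def resolve_per_task_flag(kwarg=None, environ=None):
    """Hypothetical resolver: explicit kwarg, then env var, then default True."""
    environ = os.environ if environ is None else environ
    if kwarg is not None:
        return bool(kwarg)
    raw = environ.get(ENV_VAR)
    if raw is None:
        return True  # default stated in the PR description
    return raw.strip().lower() not in _FALSE_STRINGS


# Example: disable the per-task path through the environment.
print(resolve_per_task_flag(environ={ENV_VAR: "0"}))
```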

The first cut built a separate CFG per offloaded sub-block (prologue/body/epilogue). That dropped the offloaded for-body's implicit-loop `continue` edges, which are wired by visit(OffloadedStmt) rather than visit(Block), and so wrongly dead-store-eliminated a global store preceding a `continue` (test_cfg_continue regressed).

Instead, build one CFG per offloaded task by temporarily moving the single OffloadedStmt into a throwaway wrapper block, running it through the normal Block -> OffloadedStmt construction, and then moving it back. The per-task CFG is then byte-for-byte the slice the whole-kernel CFG would build for that task (continue wiring, prologue/body/epilogue chaining, parallel-execution flag), so correctness is preserved while the dataflow analyses stay per-task.

Also revert the build_cfg(root_in_parallel_for) signature change (no longer needed, since visit(OffloadedStmt) sets the body's parallel flag itself) and fall back to the whole-kernel CFG when QD_DUMP_CFG is requested so dumping still shows the full graph.
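
To see why a missing `continue` edge leads to a wrongly eliminated global store, here is a minimal sketch of backward live-variable analysis on a toy CFG. The node names, the tuple-based statement encoding, and the `dead_stores` helper are hypothetical and are not Quadrants' CFG classes; the only idea taken from the PR is that global memory is seeded as live at the final node.

```python
def dead_stores(nodes, edges, exit_live):
    """Return (node, index) of stores whose value is never read afterwards.

    nodes: name -> list of ("store" | "load", variable)
    edges: name -> list of successor names; "EXIT" is the final node
    exit_live: variables seeded as live at EXIT (e.g. global memory)
    """
    live_in = {name: set() for name in nodes}
    live_in["EXIT"] = set(exit_live)

    def live_out(name):
        out = set()
        for succ in edges.get(name, []):
            out |= live_in[succ]
        return out

    # Iterate the backward dataflow equations to a fixed point.
    changed = True
    while changed:
        changed = False
        for name, stmts in nodes.items():
            live = live_out(name)
            for op, var in reversed(stmts):
                if op == "store":
                    live.discard(var)
                else:
                    live.add(var)
            if live != live_in[name]:
                live_in[name] = live
                changed = True

    # A store is dead if its variable is not live immediately after it.
    dead = []
    for name, stmts in nodes.items():
        live = live_out(name)
        for idx in range(len(stmts) - 1, -1, -1):
            op, var = stmts[idx]
            if op == "store":
                if var not in live:
                    dead.append((name, idx))
                live.discard(var)
            else:
                live.add(var)
    return sorted(dead)


# Loop body: x = 1; if cond: continue; x = 2   (x is a global)
nodes = {"head": [("store", "x")], "tail": [("store", "x")]}
with_continue = {"head": ["tail", "EXIT"], "tail": ["EXIT"]}
without_continue = {"head": ["tail"], "tail": ["EXIT"]}

dead_with_edge = dead_stores(nodes, with_continue, exit_live={"x"})
dead_without_edge = dead_stores(nodes, without_continue, exit_live={"x"})
print(dead_with_edge, dead_without_edge)
```

With the `continue` edge present, the first store can reach the exit where `x` is live, so nothing is eliminated. Without the edge, the analysis only sees the path through the second store and marks the first one dead.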

Profiling showed that the super-linear reaching-definition / store-to-load analyses run mostly in the pre-offload phase, on the monolithic kernel IR before any offloaded tasks exist, so per-task scoping alone barely helped (the post-offload CFG was already tiny). Under cfg_optimization_per_task, skip the whole-kernel cfg_optimization entirely when the IR is not yet offloaded (no OffloadedStmt tasks), keeping only the cheap dead-alloca cleanup, and let the post-offload per-task CFG perform store-to-load forwarding and dead-store elimination once tasks exist.

cfg_optimization is an optimization, not a correctness pass, so dropping it pre-offload is safe; the only thing lost is cross-task forwarding/DSE on the monolithic IR, which is invalid across separate device launches anyway. CFG dumping (QD_DUMP_CFG) still forces the whole-kernel path.

github-actions · 2026-06-08T02:56:22Z

Total: 4 file(s) changed, +63 -2 code lines.

…orization tests)

The previous commit dropped the whole-kernel CFG for ALL non-offloaded IR under cfg_optimization_per_task, which over-reached: full_simplify is also called on standalone, never-offloaded blocks (unit tests, function bodies), where its store-to-load forwarding and dead-store elimination must still run. die() cannot remove a dead store (it is side-effecting), so dropping cfg_optimization there left a stray store and regressed Half2Vectorization.{Ndarray,GlobalTemporary,Field} (each +1 statement).

Scope the skip to the compile_to_offloads pre-offload phases (simplify_I, simplify_II, pre/post_autodiff): the monolithic-kernel calls whose super-linear CFG analysis dominates compile time and that are redundant because the post-offload per-task CFG (simplify_III onward) redoes intra-task forwarding/DSE once tasks exist. Any other non-offloaded caller falls through to the whole-kernel CFG, restoring the prior behavior. This is a no-op for the qipc graph kernel (it only hits simplify_I/II pre-offload), so the compile-time win is unchanged.
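
The resulting decision can be summarized in a small sketch. The function, its parameters, the string return values, and the way "pre/post_autodiff" is expanded into two phase names are illustrative assumptions; the real logic lives in the C++ pass and also handles function bodies and the real-matrix path, which are omitted here.

```python
# Phase names follow the commit message; the exact identifiers are assumed.
PRE_OFFLOAD_PHASES = {"simplify_I", "simplify_II", "pre_autodiff", "post_autodiff"}


def choose_cfg_mode(per_task, is_offloaded, phase, dump_cfg=False):
    """Hypothetical summary of which CFG path a cfg_optimization call takes."""
    if dump_cfg or not per_task:
        return "whole_kernel"  # dumping or a disabled flag keeps the old path
    if is_offloaded:
        return "per_task"  # one CFG per offloaded task
    if phase in PRE_OFFLOAD_PHASES:
        return "skip"  # only the cheap dead-alloca cleanup runs
    return "whole_kernel"  # standalone blocks, unit tests, other callers


print(choose_cfg_mode(True, False, "simplify_I"))
print(choose_cfg_mode(True, False, "unit_test"))
print(choose_cfg_mode(True, True, "simplify_III"))
```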

github-actions · 2026-06-08T04:55:52Z

Total: 4 file(s) changed, +67 -2 code lines.

github-actions · 2026-06-08T05:50:15Z

Diff coverage: 0% · 0 lines, 0 missing

hughperkins added 3 commits June 7, 2026 17:33

hughperkins temporarily deployed to publish_pypi June 8, 2026 03:11 — with GitHub Actions Inactive

hughperkins temporarily deployed to publish_pypi June 8, 2026 11:07 — with GitHub Actions Inactive

hughperkins temporarily deployed to publish_pypi June 8, 2026 11:43 — with GitHub Actions Inactive

hughperkins temporarily deployed to publish_pypi June 8, 2026 11:47 — with GitHub Actions Inactive

hughperkins wants to merge 4 commits into main from desk8/cfg-per-offloaded-task
