[Perf] Reduce compilation time on kernels with multiple offloaded tasks#729

Draft: hughperkins wants to merge 4 commits into main from desk8/cfg-per-offloaded-task


@hughperkins

Collaborator

Once a kernel is offloaded, cfg_optimization now builds a separate control-flow
graph per offloaded task (per sub-block, with the correct parallel-execution
flag) instead of one whole-kernel CFG spanning all tasks. Store-to-load
forwarding and dead-store elimination then run scoped to each task.

This is semantics-preserving: each offloaded task is a separate device launch,
so cross-task register forwarding is impossible anyway, and global memory
(fields, external tensors, and the global-temporary buffer that carries values
between tasks) is treated conservatively as live-in/live-out of every task by
the existing CFG boundary seeding (reaching-def start-node seed + live-var
final-node seed). The win is compile time: the reaching-definition / live-
variable dataflow becomes ~linear in total IR rather than super-linear in the
combined whole-kernel IR, which blows up for kernels that pack many stages into
one @qd.kernel.
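
A rough cost model makes the compile-time argument concrete. The sketch below assumes, purely for illustration, that the dataflow cost of one CFG grows quadratically with its statement count; the PR only says "super-linear", so the exponent and the function names here are hypothetical.

```python
from typing import Sequence


def dataflow_cost(num_stmts: int, exponent: float = 2.0) -> float:
    """Assumed cost of running the dataflow analyses over one CFG."""
    return float(num_stmts) ** exponent


def whole_kernel_cost(task_sizes: Sequence[int], exponent: float = 2.0) -> float:
    # One CFG spanning every task: the cost applies to the combined IR size.
    return dataflow_cost(sum(task_sizes), exponent)


def per_task_cost(task_sizes: Sequence[int], exponent: float = 2.0) -> float:
    # One CFG per task: the cost applies to each task separately, then adds up.
    return sum(dataflow_cost(n, exponent) for n in task_sizes)


task_sizes = [100] * 20  # 20 offloaded tasks of 100 statements each
whole = whole_kernel_cost(task_sizes)  # 2000 ** 2
split = per_task_cost(task_sizes)      # 20 * 100 ** 2
ratio = whole / split
```

Under this assumption and with equally sized tasks, the per-task cost grows linearly with the number of tasks, and the whole-kernel cost is larger by a factor equal to the task count.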

Gated by CompileConfig::cfg_optimization_per_task (default true; settable via
the QD_CFG_OPTIMIZATION_PER_TASK environment variable or a qd.init kwarg).
Pre-offload IR, function bodies, and the real-matrix path fall back to the
whole-kernel CFG unchanged.

Issue: #

The first cut built a separate CFG per offloaded sub-block (prologue/body/
epilogue). That dropped the offloaded for-body's implicit-loop `continue`
edges -- which are wired by visit(OffloadedStmt), not visit(Block) -- and so
wrongly dead-store-eliminated a global store preceding a `continue`
(test_cfg_continue regressed).
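
The regression described above can be reproduced in a toy backward liveness analysis. This is a minimal sketch, not the project's CFG code: the node names, the `find_dead_stores` helper, and the modelling of `continue` as an edge to the final node are all assumptions made for illustration. Global variables are seeded as live at the final node, so a store is only kept if some path from it reaches that node.

```python
from typing import Dict, List, Set, Tuple

Stmt = Tuple[str, str]  # ("store" | "load", variable name)


def find_dead_stores(
    nodes: Dict[str, List[Stmt]],
    edges: Dict[str, List[str]],
    final: str,
    live_at_exit: Set[str],
) -> Set[Tuple[str, int]]:
    """Return (node, statement index) pairs for stores that are never read."""
    live_in: Dict[str, Set[str]] = {name: set() for name in nodes}

    def live_out(name: str) -> Set[str]:
        # The final node is seeded with the variables assumed live on exit.
        live = set(live_at_exit) if name == final else set()
        for succ in edges.get(name, []):
            live |= live_in[succ]
        return live

    # Iterate the backward transfer function until nothing changes.
    changed = True
    while changed:
        changed = False
        for name, stmts in nodes.items():
            live = live_out(name)
            for op, var in reversed(stmts):
                if op == "store":
                    live.discard(var)
                else:
                    live.add(var)
            if live != live_in[name]:
                live_in[name] = live
                changed = True

    # A store is dead if its variable is not live immediately after it.
    dead: Set[Tuple[str, int]] = set()
    for name, stmts in nodes.items():
        live = live_out(name)
        for index in range(len(stmts) - 1, -1, -1):
            op, var = stmts[index]
            if op == "store":
                if var not in live:
                    dead.add((name, index))
                live.discard(var)
            else:
                live.add(var)
    return dead


# Loop body: "then" stores to global g and then hits `continue`.
nodes = {"entry": [], "then": [("store", "g")], "rest": [("store", "h")], "exit": []}
with_continue = {"entry": ["then", "rest"], "then": ["exit"], "rest": ["exit"]}
without_continue = {"entry": ["then", "rest"], "rest": ["exit"]}  # edge dropped

dead_ok = find_dead_stores(nodes, with_continue, "exit", {"g", "h"})
dead_bug = find_dead_stores(nodes, without_continue, "exit", {"g", "h"})
```

With the `continue` edge present, no store is reported dead. With the edge dropped, the store to `g` has no path to the final node and is wrongly reported as dead.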

Instead, build one CFG per offloaded task by temporarily moving the single
OffloadedStmt into a throwaway wrapper block and running it through the normal
Block -> OffloadedStmt construction, then moving it back. The per-task CFG is
then byte-for-byte the slice the whole-kernel CFG would build for that task
(continue wiring, prologue/body/epilogue chaining, parallel-execution flag),
so correctness is preserved while the dataflow analyses stay per-task.
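
The move-into-a-wrapper-and-back pattern can be sketched as follows. This is a simplified illustration in Python rather than the actual C++ pass: `Block`, `build_cfgs_per_task`, and the `build_cfg` callback are hypothetical stand-ins, and tasks are represented as plain strings.

```python
from typing import Callable, List, Optional


class Block:
    """Hypothetical stand-in for an IR block that owns its statements."""

    def __init__(self, statements: List[Optional[str]]) -> None:
        self.statements = statements


def build_cfgs_per_task(
    root: Block, build_cfg: Callable[[Block], object]
) -> List[object]:
    """Build one CFG per top-level task by borrowing each task in turn."""
    cfgs: List[object] = []
    for index in range(len(root.statements)):
        # Move the task out of the root and into a throwaway wrapper block.
        task = root.statements[index]
        root.statements[index] = None  # slot stays empty while borrowed
        wrapper = Block([task])
        try:
            # The wrapper goes through the same construction path a normal
            # block would, so the task is handled exactly as in a full build.
            cfgs.append(build_cfg(wrapper))
        finally:
            # Always move the task back, even if CFG construction raises.
            root.statements[index] = wrapper.statements.pop()
    return cfgs


root = Block(["task_a", "task_b"])
cfgs = build_cfgs_per_task(root, lambda block: tuple(block.statements))
```

The `try`/`finally` mirrors the requirement that the root IR is left unchanged after the per-task CFGs are built.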

Also revert the build_cfg(root_in_parallel_for) signature change (no longer
needed -- visit(OffloadedStmt) sets the body's parallel flag itself) and fall
back to the whole-kernel CFG when QD_DUMP_CFG is requested so dumping still
shows the full graph.

Profiling showed the super-linear reaching-definition / store-to-load analyses
run mostly in the pre-offload phase, on the monolithic kernel IR before any
offloaded tasks exist -- so per-task scoping alone barely helped (the
post-offload cfg was already tiny). Under cfg_optimization_per_task, skip the
whole-kernel cfg_optimization entirely when the IR is not yet offloaded (no
OffloadedStmt tasks), keeping only the cheap dead-alloca cleanup, and let the
post-offload per-task cfg perform store-to-load forwarding + dead-store
elimination once tasks exist.

cfg_optimization is an optimization, not a correctness pass, so dropping it
pre-offload is safe; the only thing lost is cross-task forwarding/DSE on the
monolithic IR, which is invalid across separate device launches anyway. CFG
dumping (QD_DUMP_CFG) still forces the whole-kernel path.

…orization tests)

The previous commit skipped the whole-kernel cfg for ALL non-offloaded IR under
cfg_optimization_per_task, which over-reached: full_simplify is also called on
standalone, never-offloaded blocks (unit tests, function bodies), where its
store-to-load forwarding + dead-store elimination must still run. die() cannot
remove a dead store (side-effecting), so dropping cfg there left a stray store and
regressed Half2Vectorization.{Ndarray,GlobalTemporary,Field} (each +1 statement).

Limit the skip to the compile_to_offloads pre-offload phases (simplify_I,
simplify_II, pre/post_autodiff) -- the monolithic-kernel calls whose super-linear
cfg dominates compile time and that are redundant because the post-offload per-task
cfg (simplify_III onward) redoes intra-task forwarding/DSE once tasks exist. Any
other non-offloaded caller falls through to the whole-kernel cfg, restoring the
prior behavior. No-op for the qipc graph kernel (only hits simplify_I/II pre-offload),
so the compile-time win is unchanged.
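
Taken together, the commits above amount to a small decision procedure. The sketch below is one reading of that logic, not the actual implementation: `CfgMode`, `choose_cfg_mode`, and the exact phase-name spellings (`pre_autodiff`, `post_autodiff`) are assumptions, and the function-body and real-matrix fallbacks are omitted for brevity.

```python
from enum import Enum
from typing import Optional


class CfgMode(Enum):
    WHOLE_KERNEL = "whole_kernel"  # one CFG spanning the whole IR
    PER_TASK = "per_task"          # one CFG per offloaded task
    SKIP = "skip"                  # no cfg pass; only dead-alloca cleanup


# Assumed spellings of the compile_to_offloads pre-offload phases.
PRE_OFFLOAD_PHASES = frozenset(
    {"simplify_I", "simplify_II", "pre_autodiff", "post_autodiff"}
)


def choose_cfg_mode(
    *,
    per_task_enabled: bool,
    is_offloaded: bool,
    phase: Optional[str] = None,
    dump_cfg: bool = False,
) -> CfgMode:
    # Dumping the CFG, or disabling the flag, keeps the original behavior.
    if dump_cfg or not per_task_enabled:
        return CfgMode.WHOLE_KERNEL
    # Once tasks exist, forwarding and DSE run scoped to each task.
    if is_offloaded:
        return CfgMode.PER_TASK
    # Monolithic pre-offload phases skip the expensive whole-kernel pass.
    if phase in PRE_OFFLOAD_PHASES:
        return CfgMode.SKIP
    # Standalone, never-offloaded blocks still get the whole-kernel cfg.
    return CfgMode.WHOLE_KERNEL


mode_pre = choose_cfg_mode(per_task_enabled=True, is_offloaded=False, phase="simplify_I")
mode_post = choose_cfg_mode(per_task_enabled=True, is_offloaded=True, phase="simplify_III")
mode_standalone = choose_cfg_mode(per_task_enabled=True, is_offloaded=False)
```

The last branch is the case restored by this commit: a non-offloaded caller outside the listed phases falls through to the whole-kernel cfg.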