[Perf] Reduce compilation time on kernels with multiple offloaded tasks #729
Draft
hughperkins wants to merge 4 commits into
Once a kernel is offloaded, cfg_optimization now builds a separate control-flow graph per offloaded task (per sub-block, with the correct parallel-execution flag) instead of one whole-kernel CFG spanning all tasks. Store-to-load forwarding and dead-store elimination then run scoped to each task.

This is semantics-preserving: each offloaded task is a separate device launch, so cross-task register forwarding is impossible anyway, and global memory (fields, external tensors, and the global-temporary buffer that carries values between tasks) is treated conservatively as live-in/live-out of every task by the existing CFG boundary seeding (reaching-def start-node seed + live-var final-node seed). The win is compile time: the reaching-definition / live-variable dataflow becomes ~linear in total IR rather than super-linear in the combined whole-kernel IR, which blows up for kernels that pack many stages into one @qd.kernel.

Gated by CompileConfig::cfg_optimization_per_task (default true; env QD_CFG_OPTIMIZATION_PER_TASK / qd.init kwarg). Pre-offload IR, function bodies, and the real-matrix path fall back to the whole-kernel CFG unchanged.
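To make the boundary-seeding argument concrete, here is a minimal Python sketch of dead-store elimination scoped to one task at a time. It is a toy model, not the pass in this PR: the statement encoding, `eliminate_dead_stores`, and `optimize_per_task` are hypothetical names, tasks are straight-line code with no branches, and the "live-out seed" is simply every global address the task mentions.

```python
from typing import List, Tuple

# Hypothetical toy IR: (op, space, name), with op in {"store", "load"}
# and space in {"local", "global"}. Not the real IR of this project.
Stmt = Tuple[str, str, str]


def eliminate_dead_stores(task: List[Stmt]) -> List[Stmt]:
    """Backward liveness over one straight-line task."""
    # Conservative seed: every global address is assumed live at task exit,
    # because a later task (a separate launch) may read it.
    live = {(space, name) for _, space, name in task if space == "global"}
    kept: List[Stmt] = []
    for op, space, name in reversed(task):
        addr = (space, name)
        if op == "store":
            if addr not in live:
                continue  # nobody reads this value before it is overwritten or the task ends
            live.discard(addr)  # this store satisfies the later reads
        else:
            live.add(addr)  # a load makes the address live above this point
        kept.append((op, space, name))
    kept.reverse()
    return kept


def optimize_per_task(tasks: List[List[Stmt]]) -> List[List[Stmt]]:
    # Each task is analysed on its own; no state is shared between tasks.
    return [eliminate_dead_stores(task) for task in tasks]


task0 = [
    ("store", "local", "a"),   # overwritten before any load: removable
    ("store", "local", "a"),
    ("load", "local", "a"),
    ("store", "global", "tmp"),  # read by task1, kept via the live-out seed
]
task1 = [
    ("load", "global", "tmp"),
    ("store", "global", "out"),
    ("store", "local", "b"),   # never read inside task1: removable
]
optimized = optimize_per_task([task0, task1])
```

In this model the store to `tmp` in `task0` survives even though `task0` never reads it, which is the property the conservative seed is meant to guarantee; the work done is the sum of the per-task passes rather than one pass over the concatenated IR.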
The first cut built a separate CFG per offloaded sub-block (prologue/body/epilogue). That dropped the offloaded for-body's implicit-loop `continue` edges -- which are wired by visit(OffloadedStmt), not visit(Block) -- and so wrongly dead-store-eliminated a global store preceding a `continue` (test_cfg_continue regressed).

Instead, build one CFG per offloaded task by temporarily moving the single OffloadedStmt into a throwaway wrapper block and running it through the normal Block -> OffloadedStmt construction, then moving it back. The per-task CFG is then byte-for-byte the slice the whole-kernel CFG would build for that task (continue wiring, prologue/body/epilogue chaining, parallel-execution flag), so correctness is preserved while the dataflow analyses stay per-task.

Also revert the build_cfg(root_in_parallel_for) signature change (no longer needed -- visit(OffloadedStmt) sets the body's parallel flag itself) and fall back to the whole-kernel CFG when QD_DUMP_CFG is requested so dumping still shows the full graph.
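The move-into-a-wrapper-and-back step can be sketched in Python as a context manager. This is an illustration of the pattern only, under assumed names: `Block`, `isolated_in_wrapper`, and `for_each_task` are hypothetical stand-ins, and plain strings stand in for OffloadedStmt nodes.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator, List


class Block:
    """Hypothetical stand-in for an IR block holding top-level statements."""

    def __init__(self, statements=None):
        self.statements: List[Any] = list(statements or [])


@contextmanager
def isolated_in_wrapper(root: Block, index: int) -> Iterator[Block]:
    """Temporarily move root.statements[index] into a throwaway wrapper block."""
    stmt = root.statements[index]
    wrapper = Block([stmt])
    root.statements[index] = None  # placeholder keeps sibling positions stable
    try:
        yield wrapper
    finally:
        # Always move the statement back, even if the visitor raised.
        root.statements[index] = stmt


def for_each_task(root: Block, visit: Callable[[Block], Any]) -> List[Any]:
    """Run `visit` once per task, each time on a wrapper holding only that task."""
    results = []
    for i in range(len(root.statements)):
        with isolated_in_wrapper(root, i) as wrapper:
            results.append(visit(wrapper))
    return results


root = Block(["task_a", "task_b", "task_c"])
seen = for_each_task(root, lambda wrapper: list(wrapper.statements))
```

The point of the wrapper is that the visitor sees an ordinary block containing one task, so whatever construction logic normally runs for a block with offloaded children is reused unchanged; the `finally` clause is what keeps the root intact afterwards.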
Profiling showed the super-linear reaching-definition / store-to-load analyses run mostly in the pre-offload phase, on the monolithic kernel IR before any offloaded tasks exist -- so per-task scoping alone barely helped (the post-offload cfg was already tiny).

Under cfg_optimization_per_task, skip the whole-kernel cfg_optimization entirely when the IR is not yet offloaded (no OffloadedStmt tasks), keeping only the cheap dead-alloca cleanup, and let the post-offload per-task cfg perform store-to-load forwarding + dead-store elimination once tasks exist.

cfg_optimization is an optimization, not a correctness pass, so dropping it pre-offload is safe; the only thing lost is cross-task forwarding/DSE on the monolithic IR, which is invalid across separate device launches anyway. CFG dumping (QD_DUMP_CFG) still forces the whole-kernel path.
…orization tests)
The previous commit ditched the whole-kernel cfg for ALL non-offloaded IR under
cfg_optimization_per_task, which over-reached: full_simplify is also called on
standalone, never-offloaded blocks (unit tests, function bodies), where its
store-to-load forwarding + dead-store elimination must still run. die() cannot
remove a dead store (side-effecting), so dropping cfg there left a stray store and
regressed Half2Vectorization.{Ndarray,GlobalTemporary,Field} (each +1 statement).
Scope the ditch to the compile_to_offloads pre-offload phases (simplify_I,
simplify_II, pre/post_autodiff) -- the monolithic-kernel calls whose super-linear
cfg dominates compile time and that are redundant because the post-offload per-task
cfg (simplify_III onward) redoes intra-task forwarding/DSE once tasks exist. Any
other non-offloaded caller falls through to the whole-kernel cfg, restoring the
prior behavior. No-op for the qipc graph kernel (only hits simplify_I/II pre-offload),
so the compile-time win is unchanged.
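The resulting choice between skipping, per-task, and whole-kernel CFG work can be summarised as a small decision function. The Python below is a sketch of the rules as described in these commit messages, not the actual code: `choose_cfg_strategy`, its parameters, and the exact phase-name strings (in particular the spelling of the two autodiff phases) are assumptions, and the function-body and real-matrix fallbacks are left out.

```python
# Assumed spellings of the compile_to_offloads pre-offload phases.
PRE_OFFLOAD_PHASES = frozenset(
    {"simplify_I", "simplify_II", "pre_autodiff", "post_autodiff"}
)


def choose_cfg_strategy(
    per_task_enabled: bool,
    has_offloaded_tasks: bool,
    phase: str,
    dump_cfg_requested: bool = False,
) -> str:
    """Return "whole_kernel", "per_task", or "skip" for one simplify call."""
    # Dumping wants the full graph; a disabled flag restores the old behavior.
    if dump_cfg_requested or not per_task_enabled:
        return "whole_kernel"
    # Offloaded IR: analyse each task on its own.
    if has_offloaded_tasks:
        return "per_task"
    # Monolithic kernel IR in a known pre-offload phase: skip the expensive
    # CFG work, since the post-offload per-task run redoes it.
    if phase in PRE_OFFLOAD_PHASES:
        return "skip"
    # Standalone, never-offloaded blocks (unit tests, function bodies)
    # still need forwarding and dead-store elimination.
    return "whole_kernel"
```

The last branch is the fix from this commit: a caller outside the named pre-offload phases gets the whole-kernel CFG back instead of losing dead-store elimination.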