[Graph] Add qd.checkpoint #725
First slice of the qd.checkpoint primitive (design doc in perso_hugh/doc/qipc/reentrant.md, sections 5.1-5.2). This adds the Python API, AST recognition, validation, and per-kernel metadata list -- no IR, runtime, or cross-backend changes yet. Every checkpoint body still runs unconditionally; the docs page calls this out as experimental. Following slices wire checkpoint_id end-to-end (ForLoopConfig -> OffloadedTask, mirroring stream_parallel_group_id), build per-checkpoint CUDA IF conditional nodes on SM 9.0+ / CUDA 12.4+, auto-insert yield-check kernels, and add the GraphStatus host API with step.resume(from_checkpoint=).
Mirrors the stream_parallel_group_id propagation chain for a new
checkpoint_id field tagging the enclosing qd.checkpoint() scope on
each parallel for-loop:
ForLoopConfig.checkpoint_id (-1 = no checkpoint)
-> FrontendForStmt.checkpoint_id
-> RangeForStmt / StructForStmt .checkpoint_id (lower_ast.cpp)
-> OffloadedStmt.checkpoint_id (offload.cpp)
-> OffloadedTask.checkpoint_id (codegen_cuda/amdgpu)
Driven from Python by new ASTBuilder.begin_checkpoint() /
end_checkpoint() exposed via export_lang.cpp; the AST transformer
calls them around the with-block body and asserts the C++ counter
agrees with the Python list index.
Also folds the new field into the offline-cache key emit list, the
clone() methods, and the QD_IO_DEF / QD_STMT_DEF_FIELDS sets so the
fastcache stays content-addressable.
Pure plumbing -- no behaviour change. Existing tests cover regression.
Slice 1c will be the first consumer (GraphManager builds per-cp_id
IF conditional nodes on CUDA 12.4+).
… SM 9.0+)
Adds the runtime side of qd.checkpoint() on CUDA 12.4+. Each contiguous
run of OffloadedTasks sharing the same checkpoint_id is wrapped in a
CUDA graph IF conditional node, preceded by a small gate kernel that
sets the IF handle from `cp_id >= *resume_point`. Tasks with cp_id == -1
remain top-level siblings as before.
- New gate kernel `_qd_checkpoint_if_gate` ships as a pre-built
fatbin (sm_90/100/120) generated by
scripts/build_checkpoint_gate_fatbin.py. Matches the
graph_do_while_cond fatbin pattern so the user doesn't need
libcudadevrt at runtime.
- CachedGraph now owns a `resume_point_dev_ptr` int32 device
scalar (zero-initialised) when the kernel uses checkpoints. The
launch-path memcpy resets it to 0 each call; slice 2's host API
will set it to `from_checkpoint=` when resuming.
- GraphManager exposes num_checkpoints_on_last_call() for tests to
confirm the IF nodes were actually emitted. Wired through
KernelLauncher / Program to Python via the same chain as
num_nodes_on_last_call.
- Tests assert the IF count for both flat-checkpoint and
inside-graph_do_while kernels; behavioural assertions hold on
every backend (no behaviour change yet, slice 1d adds yield).
Falls back to "no-op CUDA graph build" on pre-SM-9.0 (caller does
the host-side flat launch) mirroring the existing graph_do_while
fallback. Slice 4/5 will add the indirect-dispatch alternative for
AMDGPU/Metal/Vulkan/older CUDA.
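The grouping rule described in this commit can be illustrated with a small host-side model. This is a minimal sketch in plain Python, not the GraphManager code: `plan_graph_nodes` and `gate_open` are hypothetical names, and the node tuples are an invented stand-in for CUDA graph nodes. It only shows how contiguous runs of equal `checkpoint_id` collapse into one gated node and how the `cp_id >= *resume_point` predicate decides whether a body runs.

```python
from itertools import groupby

NO_CHECKPOINT = -1  # tasks outside any qd.checkpoint() scope


def plan_graph_nodes(task_cp_ids):
    """Group contiguous tasks with the same checkpoint_id into one node each."""
    nodes = []
    index = 0
    for cp_id, run in groupby(task_cp_ids):
        count = len(list(run))
        task_indices = list(range(index, index + count))
        index += count
        if cp_id == NO_CHECKPOINT:
            # Ungated tasks stay top-level siblings, one node per task.
            nodes.extend(("task", cp_id, [i]) for i in task_indices)
        else:
            # One gated IF node wraps the whole contiguous run.
            nodes.append(("if", cp_id, task_indices))
    return nodes


def gate_open(cp_id, resume_point):
    """Model of the gate predicate: the body runs when cp_id >= resume_point."""
    return cp_id >= resume_point


# Hypothetical task list: cp_id 1 appears in two separate runs.
nodes = plan_graph_nodes([-1, 0, 0, 1, -1, 1])
num_if_nodes = sum(1 for kind, _, _ in nodes if kind == "if")
```

In this toy input the two non-contiguous runs of `cp_id == 1` produce two separate IF nodes, which is the kind of count a test built on `num_checkpoints_on_last_call()` would presumably observe.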
…es + arg-id plumbing
…into GraphManager
…h for yield_signal readback
…_checkpoint=) host API
…t resume_point at end of WHILE iter
…now takes resume_point arg)
…o checkpoints materialise as distinct offloaded tasks
…heckpoints; document offloader limitation
… iteration
The CPU host-branch gating treated `ctx.resume_from_checkpoint` as if it applied for the entire `kernel.resume(...)` launch, but `from_checkpoint` is meant to skip cp_ids only on the FIRST iteration of a resumed `graph_do_while` body. The CUDA-native cond-with-yield kernel resets `*resume_point = 0` between iterations for the same reason (`graph_do_while_cond.cu`); the CPU emulation now mirrors that by zeroing `ctx.resume_from_checkpoint` after each completed iteration.
Found while probing slice 6 on the local 5090 box: cp 0 sat before a yielding cp 1 inside `graph_do_while`, and a `resume(from_checkpoint=1)` launch incorrectly skipped cp 0 on every iteration instead of just the first, so counter A was wrong (1 vs expected 3).
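The first-iteration-only rule can be checked with a toy model. The sketch below is plain Python with hypothetical names (`run_resumed_do_while`, `clear_after_first_iter`); it does not reproduce the launcher or the exact counter values from the bug report, only the difference between clearing and not clearing the resume point between iterations.

```python
def run_resumed_do_while(num_iters, resume_from, clear_after_first_iter=True):
    """Count how often each checkpoint body runs across a resumed loop."""
    runs = {0: 0, 1: 0}
    resume_point = resume_from
    for _ in range(num_iters):
        for cp_id in (0, 1):
            if cp_id < resume_point:
                continue  # skipped: treated as already completed
            runs[cp_id] += 1
        if clear_after_first_iter:
            resume_point = 0  # later iterations run every checkpoint again
    return runs


# Resume from cp 1 and let the loop body run three times.
fixed = run_resumed_do_while(num_iters=3, resume_from=1)
buggy = run_resumed_do_while(num_iters=3, resume_from=1, clear_after_first_iter=False)
```

With the reset in place, cp 0 is skipped once and then runs on the remaining two iterations; without it, cp 0 never runs during the resumed launch.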
…en CPU tests
Two fixes after running the slice 6 CPU fallback end-to-end on a local x64 host:
1. `KernelLauncher` was resetting `last_yield_cp_id_on_last_call_` at the top of every LLVM launch, including the aux kernels that `ndarray.to_numpy()` triggers between a yielding launch and the user's `GraphStatus.yielded` check. Restrict the reset to launches that have at least one resolved `checkpoint_yield_on=` arg, so non-graph aux kernels can't clobber the value. Mirrors the CUDA path, where only `GraphManager::launch_cached_graph` touches the field.
2. Tests in `test_checkpoint.py` and `test_checkpoint_resume_offset.py` were skipping the yield/resume behavioural cases on every non-CUDA-native backend. Slice 6 now implements the same contract on CPU, so add a `_supports_checkpoint_yield_resume()` predicate that admits CUDA-native (slice 1d) and x64 (slice 6), and route the behavioural skips through it. Pure introspection assertions that only exist on CUDA (e.g. `get_graph_num_checkpoints_on_last_call`) keep using `_is_checkpoint_if_path_native`.
Locally on the 5090 box: 54 passed, 8847 deselected (no skips on the yield/resume cases for either x64 or cuda).
Reflect that slice 6 landed CPU/x64 support for the checkpoint yield/resume contract:
- Backend support table: CPU now reads "host-branch gating" / "implemented (host-branch gating)" instead of "runs unconditionally" / "not yet (slices 4-6)". Remaining "not yet" entries collapse to slices 4-5.
- Experimental status block: notes that CPU emulates the same contract via host-branch gating in `KernelLauncher` (same Python API, no device IF nodes).
- Yield-mechanism section: now titled "CUDA SM 9.0+ and CPU/x64", describes the device-kernel vs host-branch lowerings side by side and calls out the WHILE early-exit equivalence.
- Host-side yield/resume loop: explicit that backends without the gate still return GraphStatus but always with `yielded=False`.
- Authoring tip about `for _ in range(1):` now references "checkpoint gate" rather than the CUDA-specific "IF gate".
- New "Backend coverage notes" subsection summarising what each backend does today, including the CPU/x64 prototyping angle.
Builds one HIP graph per contiguous run of same-cp_id offloaded tasks (plus one per cp_id<0 unconditional batch), then has the launcher iterate batches with host-branch gating that mirrors slice 6's CPU contract:
- `resume_point` from `ctx.resume_from_checkpoint` skips cp_ids strictly below it on this launch.
- `yield_signal` (-1 / cp_id) tracks first-yielder-wins; once set, later cp_id>=0 batches are skipped.
- After each yielding batch, a stream sync + D2H of the user's `yield_on` flag observes the yield, updates the launcher's `last_yield_cp_id_on_last_call_`, and memsets the flag back to 0 on device.
This is the CPU launcher's logic ported to HIP-graph launches. ROCm 7.2's HIP has neither conditional graph nodes nor indirect dispatch, so the design's "gate kernel + indirect dispatch" recipe from reentrant.md §6.2 is replaced with the equivalent host-orchestrated sub-graph launches. Pinned-host yield flag (§10) stays a future enhancement.
Non-checkpoint kernels continue down the original single-graph path unchanged (detected up-front in `try_launch`); contact area is just the new sub_graph_execs / batch_cp_ids fields on `CachedGraph` and the `launch_cached_checkpoint_graph` helper. `graph_do_while + checkpoint` still falls through to the streaming launcher today (next commit).
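The three gating rules in this commit message can be modelled on the host in a few lines. The following is a rough sketch in plain Python under stated assumptions: `launch_batches`, `NO_YIELD`, and the one-element lists standing in for `yield_on` buffers are all hypothetical, and the flag clear follows this commit's description only.

```python
NO_YIELD = -1


def launch_batches(batches, resume_point, yield_flags):
    """Run (cp_id, body) batches in order with host-side gating.

    yield_flags maps cp_id -> a one-element list standing in for the user's
    yield_on buffer; a body requests a yield by writing non-zero into it.
    """
    yield_signal = NO_YIELD
    executed = []
    for cp_id, body in batches:
        # cp_id < 0 batches are unconditional; gated ones obey both rules.
        if cp_id >= 0 and (cp_id < resume_point or yield_signal != NO_YIELD):
            continue
        body()
        executed.append(cp_id)
        flag = yield_flags.get(cp_id)
        if flag is not None and flag[0] != 0:
            yield_signal = cp_id  # first yielder wins
            flag[0] = 0  # models the memset back to 0 described above
    return executed, yield_signal


flags = {1: [0]}
state = {"ready": False}


def cp1_body():
    if not state["ready"]:
        flags[1][0] = 1  # request a yield instead of doing the work


batches = [(-1, lambda: None), (0, lambda: None), (1, cp1_body), (2, lambda: None)]
first = launch_batches(batches, 0, flags)
state["ready"] = True  # host-side fix-up between launches
second = launch_batches(batches, first[1], flags)
```

The first launch stops gated work at cp 1; the second, resumed from the yielding cp_id, skips cp 0, re-runs cp 1, and continues to cp 2.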
…ost gating
Plumbs the slice-6 CPU host-branch gating into the AMDGPU streaming launcher so `graph_do_while + qd.checkpoint(yield_on=...)` works on AMDGPU even though HIP 7.2 has neither conditional graph nodes nor indirect dispatch.
- `launch_llvm_kernel` now mirrors the CUDA / sub-graph resolver and populates `ctx.checkpoint_yield_on_dev_ptrs` from the per-cp arg-id table the frontend pushes through `LaunchContextBuilder`.
- `launch_offloaded_tasks` reads `ctx.resume_from_checkpoint` into a local `resume_point`, tracks `yield_signal` per launch, and skips cp_id>=0 tasks when `cp_id < resume_point` or `yield_signal != -1`. After the last task of a yielding checkpoint, a stream-sync + D2H of the user's flag is the host equivalent of the device-side yield-check kernel; non-zero clears the flag back on device and records the first-yielder via `graph_manager_.set_last_yield_cp_id_on_last_call`.
- `launch_offloaded_tasks_with_do_while` breaks on yield (avoids the spin-forever the CPU launcher comment in `runtime/cpu/kernel_launcher.cpp` calls out) and clears `ctx.resume_from_checkpoint` between iters so `from_checkpoint=cp` applies only to the first WHILE iter.
- `GraphManager` gains a `set_last_yield_cp_id_on_last_call` setter so the streaming and sub-graph paths can both feed the same field the Python `GraphStatus` surface reads.
Test predicates: `_supports_checkpoint_yield_resume_in_while_loop()` now returns True on AMDGPU (was False after the first slice-4 commit), so all four `test_resume_offset_*` cases and `test_checkpoint_yield_exits_graph_do_while_early` run on AMDGPU.
…oint
Slice 4's design used "indirect dispatch" for non-CUDA-12.4+ backends to
gate per-checkpoint kernel launches. Vulkan + Metal have
vkCmdDispatchIndirect / dispatchThreadgroupsIndirect respectively, but
the GFX runtime today records and submits a single command list per
launch_kernel call rather than running inside a pre-recorded graph, so
"skip a launch" is naturally implemented at the host task-loop level
instead of via indirect dispatch.
Changes (mirror the CPU slice 6 / AMDGPU slice 4 contract):
- `TaskAttributes` gains `checkpoint_id` (propagated from
`OffloadedStmt::checkpoint_id` in `spirv_codegen.cpp`'s
serial/range_for/struct_for emit paths). Persisted via QD_IO_DEF so
offline-cache hits keep the gating wired.
- `GfxRuntime::launch_kernel`'s task loop reads
`host_ctx.resume_from_checkpoint` and tracks `yield_signal`
in-launch. Tasks whose `cp_id` is < `resume_point` (first iter only)
or that follow an observed yield are skipped before the
pipeline-bind + dispatch record.
- After the last task in a same-cp_id run, the runtime flushes +
wait_idles + readbacks the user's `yield_on=` flag through
`Device::readback_data`, sets `last_yield_cp_id_on_last_call_` /
`yield_signal` on non-zero, and clears the flag with `upload_data`
before re-opening the cmdlist for subsequent tasks. Stalling here
is acceptable for slice 4; pinned-host yield flags stay future
enhancement per reentrant.md §10.
- `GfxRuntime::last_yield_cp_id_on_last_call()` exposes the field;
`gfx::KernelLauncher::get_graph_last_yield_cp_id_on_last_call()`
routes through, so Python `GraphStatus.yielded` becomes accurate on
vulkan / metal.
- `launch_offloaded_tasks_with_do_while` breaks on yield + clears
`ctx.resume_from_checkpoint` between iters, matching the CPU /
AMDGPU implementations.
Test predicates `_supports_checkpoint_yield_resume{,_in_while_loop}`
now return True for vulkan + metal so 27 tests cover those backends.
Build + run pending on cluster (Vulkan) and macOS host (Metal).
PR 725's AI bot 'feature factorization' check flagged kernel.py for accreting ~80 lines of checkpoint-feature-specific blocks across __call__ / launch_kernel. The PR already follows the extract-to-module pattern for graph_status.py / checkpoint.py / checkpoint_transformer.py; this commit applies the same treatment to the kernel.py side.
New file python/quadrants/lang/kernel_checkpoint.py exposes free functions that Kernel delegates to via one-liner calls:
- validate_resume_cookie
- translate_user_label_to_internal_cp_id
- init_yield_on_arg_id_table
- maybe_record_yield_on_arg
- forward_yield_on_table_to_ctx
- maybe_build_graph_status
kernel.py shrinks by ~48 lines net (68 deletions, 20 insertions). All existing tests still pass (35/35 in test_checkpoint + test_resume_offset on x64); pre-commit clean. No behaviour change.
Also rewrap the @qd.kernel(checkpoints=...) docstring param in kernel_impl.py from 80c to 120c -- AI bot 'line wrapping' check flagged the 6-line description as wrapping at 80 instead of the project-wide 120-char convention.
Two failures on PR 725 / commit e939301 ('extract qd.checkpoint plumbing into kernel_checkpoint.py'):
1. Linux test_api.py: 'kernel_checkpoint' leaked into dir(quadrants) because 'from quadrants.lang import kernel_checkpoint as _checkpoint_helpers' in kernel.py attaches the submodule as an attribute of the parent package, which then bubbles up through 'from quadrants.lang import *' in quadrants/__init__.py. Other internal modules (kernel_impl, misc, ops, etc.) are masked via an explicit exclusion list in quadrants/lang/__init__.py's __all__ filter; add kernel_checkpoint to that list. Verified locally: 'kernel_checkpoint' in dir(quadrants) -> False after the fix.
2. Line wrapping (AI bot): three pre-existing comment runs in runtime/cpu/kernel_launcher.cpp and runtime/cuda/graph_manager.cpp wrapped at 76-84 chars instead of the project-wide 120-char limit. Reflowed those three runs.
53/53 test_api + test_checkpoint + test_resume_offset tests pass on x64; pre-commit clean.
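The masking mechanism in item 1 can be pictured with a stand-in namespace. This is only an illustrative sketch: the real filter in quadrants/lang/__init__.py is not shown in this PR text, so `public_names`, `fake_lang`, and the leading-underscore rule are assumptions made for the example.

```python
import types


def public_names(namespace, excluded):
    """Build an __all__-style list, hiding private and excluded names."""
    return sorted(
        name
        for name in namespace
        if not name.startswith("_") and name not in excluded
    )


# Stand-in for a package namespace after a submodule import attached
# 'kernel_checkpoint' as an attribute of the parent package.
fake_lang = {
    "kernel": object(),
    "checkpoint": object(),
    "kernel_checkpoint": types.ModuleType("kernel_checkpoint"),
    "_internal": object(),
}

leaky = public_names(fake_lang, excluded=set())
exported = public_names(fake_lang, excluded={"kernel_checkpoint"})
```

Without the exclusion entry the submodule name shows up in the exported list, which is the kind of leak the API test caught.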
- quadrants/ir/frontend_ir.h:27 (checkpoint_id field docstring): 8 lines wrapping at ~90c -> 7 lines packing to ~119c
- quadrants/ir/frontend_ir.h:1092 (begin_checkpoint docstring): 7 lines wrapping at ~88c -> 6 lines packing to ~118c
- quadrants/program/kernel_launcher.h:26 (get_graph_num_checkpoints_on_last_call docstring): 4 lines wrapping at ~89c -> 3 lines packing to ~120c
Used rewrap-comments-120c skill (find_underwrapped.py --diff). All other reported runs are #include blocks (structural, can't reflow).
| > **Experimental.** `qd.checkpoint`, `qd.GraphStatus`, and `kernel.resume(from_checkpoint=...)` are experimental APIs. The shape of the public surface (the context-manager signature, the `@qd.kernel(checkpoints=True)` flag, the `GraphStatus` fields, the host-side resume loop, the error messages, and the cross-backend lowering details) may change in any future release without a deprecation cycle. | ||
| `qd.checkpoint` lets a graph kernel pause partway through, surface a reason to the host, let the host fix things up, and resume from where it paused on the next launch. An example use-case is an algorithm implemented as a graph that may need to allocate additional memory partway through, where the graph operations are in-place, and therefore not idempotent, and therefore for which simply retrying the whole graph from the start is not an option. |
As I already mentioned, "graph operations" is not a standard term: you cannot use it without defining what it means. Either define it or reformulate this sentence to avoid the terminology.
I would suggest defining in parentheses what "idempotent" means in programming. It is not really common knowledge, and understanding what it means in this context is critical.

I would argue that idempotent is a standard programming term. It is not compiler specific. It is not physics simulation specific. I've seen the term used by many engineers in my previous company, during standard PRs.

- replaced 'graph operations' with 'operations in the graph'
- replaced 'not idempotent' with 'cannot be rerun'

'cannot be rerun' without altering / corrupting the output
| The framework never writes into your `yield_on` buffer — you own it end-to-end. That means: | ||
| - Before the **first** launch, initialise it to `0` (a freshly allocated `qd.ndarray` is not guaranteed to be zeroed). | ||
| - Before each **resume** launch, reset it to `0` (otherwise the body of the same checkpoint sees the stale non-zero value and yields again on the same condition, looping forever). |
Could be worth using some

wouldn't that be non-ascii? 🤔
| from_checkpoint=status.checkpoint) | ||
| ``` | ||
| ### Restrictions |
Could you explain / be more explicit about what happens during resume?
Stating clearly that the entire checkpoint block is re-executed, and that it is the user's responsibility to ensure idempotent behaviour when a checkpoint is needed? Because if the state is altered during the checkpoint block, resuming is not going to save you, I guess? Whatever the answer, it should be very clear in the doc.
Beyond that, how does checkpointing work under the hood? Does it snapshot all the input data by copy before yielding, or does it just return as is? If no copy is made, this means that resuming must be done "right away", without further altering the data in between, otherwise it is some kind of undefined behaviour.
Another important point: what if I don't want to resume in such a case, and I just want to move on to another kernel and continue from there? Is that supported, or must resume happen?
I think it is essential to clarify all these points in the documentation.

checkpoint does NOT require idempotent behavior. This is the entire purpose of checkpoint: to be able to interrupt and resume graphs that are NOT idempotent.

ah, the checkpoint block itself. right.

the checkpoint block itself actually does not so much require idempotence, as requiring that it is atomic: it either succeeds completely, or fails without changing anything.

as an example, in the case of allocation issues, the checkpoint block looks like:
- do we have enough memory available?
- no: exit now
- yes: ok, let's proceed with running the sort etc
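The allocation example above can be written out as a plain-Python toy, to make the "atomic, not idempotent" distinction concrete. Nothing here uses the qd API: `sort_checkpoint`, the list-based `yield_flag`, and the doubling `capacity` are hypothetical stand-ins for a checkpoint body, the user-owned `yield_on` buffer, and the host-side fix-up.

```python
def sort_checkpoint(data, scratch_capacity, yield_flag):
    """Atomic checkpoint body: either finish the sort or change nothing."""
    if scratch_capacity < len(data):
        yield_flag[0] = 1  # not enough memory: exit now, data untouched
        return
    data.sort()  # in-place work happens only once the precondition holds


data = [3, 1, 2]
yield_flag = [0]
capacity = 1
launches = 0
while True:
    yield_flag[0] = 0  # the caller owns the flag and resets it per launch
    sort_checkpoint(data, capacity, yield_flag)
    launches += 1
    if yield_flag[0] == 0:
        break
    capacity *= 2  # host-side fix-up, e.g. grow the scratch allocation
```

Each failed attempt leaves `data` exactly as it was, so re-entering the same block after the host grows the capacity is safe even though the eventual sort is an in-place mutation.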
added 'resume where' section

> the checkpoint block itself actually does not so much require idempotence, as requiring that it is atomic: it either succeeds completely, or fails without changing anything.

Yeah, this is exactly what I meant by « ensure idempotent behaviour when checkpoint is needed »

"fails without changing anything" I feel is not idempotent? Idempotent means that calling the function multiple times is identical in effect to calling it once. But if it fails the first time, it would only be idempotent if it always failed thereafter I feel?
Integrate origin/main's nested graph_do_while feature (#728) with the checkpointing work on hp/graph-checkpoint.
Key reconciliations:
- GraphRegionTag (ir.h): add checkpoint_id alongside graph_do_while_level_id and stream_parallel_group_id; update constructors and operator==.
- frontend_ir.{h,cpp}: ForLoopConfig/FrontendForStmt carry checkpoint_id; ASTBuilder stamps GraphRegionTag with the 3-arg constructor.
- offload.cpp: propagate checkpoint_id through assemble_serial_statements / push_serial_statement so serial side-effecting tasks inherit the checkpoint id while pure-only tasks keep -1.
- gen_offline_cache_key.cpp: emit both graph_do_while_level_id and checkpoint_id for OffloadedStmt and other statements.
- python/quadrants/lang/kernel.py: re-integrate GraphDoWhileLevel dataclass / graph_do_while_levels / _graph_do_while_level_stack from main with hp/graph-checkpoint's checkpoint metadata + fast-cache load/store paths and launch_kernel.
- function_def_transformer.py: allow qd.checkpoint With blocks inside qd.graph_do_while bodies (new _is_checkpoint_with helper).
- cuda/amdgpu/cpu graph_manager + kernel_launcher: combine nested-GDW and checkpoint plumbing, dropping qipc-integration-only bits (graph_parallel, add_empty_node, pre-Hopper flat graph, cp_id_storage).
- cpu/kernel_launcher.cpp: drop accidental *flag = 0; re-introduction so the user-owned yield_on flag is not cleared by the runtime.
Tests:
- All checkpoint and graph_do_while tests pass on x64 and cuda.
- Full x64 test suite: 4174 passed, 2053 skipped (1 pre-existing perf-flake test_concurrent_streams_with_events, unrelated).
prepare_checkpoint_launch_state and finalize_checkpoint_readback already exist in checkpoint_launch.cpp and are declared in runtime.h, but runtime.cpp still inlined the identical logic instead of calling them. Replace both inline blocks with calls to the existing helpers (-94 / +7) to silence the CI feature-factorization check and keep the launch_kernel hot path focused.