[Graph] Add qd.checkpoint#725

Open
hughperkins wants to merge 95 commits into main from hp/graph-checkpoint

@hughperkins hughperkins commented Jun 5, 2026

Collaborator

No description provided.

hughperkins and others added 30 commits June 4, 2026 13:10
First slice of the qd.checkpoint primitive (design doc in
perso_hugh/doc/qipc/reentrant.md, sections 5.1-5.2). This adds the
Python API, AST recognition, validation, and per-kernel metadata
list -- no IR, runtime, or cross-backend changes yet. Every
checkpoint body still runs unconditionally; the docs page calls
this out as experimental.

Following slices wire checkpoint_id end-to-end (ForLoopConfig ->
OffloadedTask, mirroring stream_parallel_group_id), build
per-checkpoint CUDA IF conditional nodes on SM 9.0+ / CUDA 12.4+,
auto-insert yield-check kernels, and add the GraphStatus host API
with step.resume(from_checkpoint=).

Mirrors the stream_parallel_group_id propagation chain for a new
checkpoint_id field tagging the enclosing qd.checkpoint() scope on
each parallel for-loop:

  ForLoopConfig.checkpoint_id (-1 = no checkpoint)
    -> FrontendForStmt.checkpoint_id
      -> RangeForStmt / StructForStmt .checkpoint_id  (lower_ast.cpp)
        -> OffloadedStmt.checkpoint_id                (offload.cpp)
          -> OffloadedTask.checkpoint_id              (codegen_cuda/amdgpu)

Driven from Python by new ASTBuilder.begin_checkpoint() /
end_checkpoint() exposed via export_lang.cpp; the AST transformer
calls them around the with-block body and asserts the C++ counter
agrees with the Python list index.

Also folds the new field into the offline-cache key emit list, the
clone() methods, and the QD_IO_DEF / QD_STMT_DEF_FIELDS sets so the
fastcache stays content-addressable.

Pure plumbing -- no behaviour change. Existing tests cover regression.
Slice 1c will be the first consumer (GraphManager builds per-cp_id
IF conditional nodes on CUDA 12.4+).
… SM 9.0+)

Adds the runtime side of qd.checkpoint() on CUDA 12.4+. Each contiguous
run of OffloadedTasks sharing the same checkpoint_id is wrapped in a
CUDA graph IF conditional node, preceded by a small gate kernel that
sets the IF handle from `cp_id >= *resume_point`. Tasks with cp_id == -1
remain top-level siblings as before.
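
A minimal sketch of the grouping and gating rule described above, in plain Python rather than the actual GraphManager code. The function names `group_contiguous_runs` and `tasks_selected_for_launch` are hypothetical; only the `cp_id >= *resume_point` comparison and the "cp_id == -1 stays unconditional" rule come from the commit message.

```python
from itertools import groupby


def group_contiguous_runs(cp_ids):
    """Return (cp_id, [task indices]) for each contiguous run of equal cp_id."""
    runs = []
    for cp_id, run in groupby(enumerate(cp_ids), key=lambda pair: pair[1]):
        runs.append((cp_id, [index for index, _ in run]))
    return runs


def tasks_selected_for_launch(cp_ids, resume_point=0):
    """Apply the gate: a checkpoint run executes only if cp_id >= resume_point."""
    selected = []
    for cp_id, indices in group_contiguous_runs(cp_ids):
        # cp_id == -1 marks tasks outside any checkpoint; they are never gated.
        if cp_id == -1 or cp_id >= resume_point:
            selected.extend(indices)
    return selected


if __name__ == "__main__":
    task_cp_ids = [-1, 0, 0, 1, -1, 2]
    print(group_contiguous_runs(task_cp_ids))
    print(tasks_selected_for_launch(task_cp_ids, resume_point=1))
```

In this toy model each run returned by `group_contiguous_runs` with a non-negative id corresponds to one conditional node, and a fresh launch (resume point 0) selects every task.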

  - New gate kernel `_qd_checkpoint_if_gate` ships as a pre-built
    fatbin (sm_90/100/120) generated by
    scripts/build_checkpoint_gate_fatbin.py. Matches the
    graph_do_while_cond fatbin pattern so the user doesn't need
    libcudadevrt at runtime.
  - CachedGraph now owns a `resume_point_dev_ptr` int32 device
    scalar (zero-initialised) when the kernel uses checkpoints. The
    launch-path memcpy resets it to 0 each call; slice 2's host API
    will set it to `from_checkpoint=` when resuming.
  - GraphManager exposes num_checkpoints_on_last_call() for tests to
    confirm the IF nodes were actually emitted. Wired through
    KernelLauncher / Program to Python via the same chain as
    num_nodes_on_last_call.
  - Tests assert the IF count for both flat-checkpoint and
    inside-graph_do_while kernels; behavioural assertions hold on
    every backend (no behaviour change yet, slice 1d adds yield).

Falls back to "no-op CUDA graph build" on pre-SM-9.0 (caller does
the host-side flat launch) mirroring the existing graph_do_while
fallback. Slice 4/5 will add the indirect-dispatch alternative for
AMDGPU/Metal/Vulkan/older CUDA.
…o checkpoints materialise as distinct offloaded tasks
… iteration

The CPU host-branch gating treated `ctx.resume_from_checkpoint` as if it
applied for the entire `kernel.resume(...)` launch, but `from_checkpoint`
is meant to skip cp_ids only on the FIRST iteration of a resumed
`graph_do_while` body. The CUDA-native cond-with-yield kernel resets
`*resume_point = 0` between iterations for the same reason
(`graph_do_while_cond.cu`); the CPU emulation now mirrors that by
zeroing `ctx.resume_from_checkpoint` after each completed iteration.

Found while probing slice 6 on the local 5090 box: cp 0 sat before a
yielding cp 1 inside `graph_do_while`, and a `resume(from_checkpoint=1)`
launch incorrectly skipped cp 0 on every iteration instead of just the
first, so counter A was wrong (1 vs expected 3).
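
The bug can be illustrated with a toy model in plain Python. This is only a sketch: the function name, the three-iteration loop, and the starting counter value of 1 are assumptions chosen so the numbers line up with the 1-versus-3 figures quoted above; the real test may be shaped differently.

```python
def run_resumed_loop(num_iters, resume_from, reset_after_first_iter):
    """Toy model of a resumed graph_do_while body holding cp 0 and cp 1."""
    counter_a = 1  # assumption: cp 0 ran once before the original yield
    resume_point = resume_from
    for _ in range(num_iters):
        for cp_id in (0, 1):
            if cp_id < resume_point:
                continue  # skipped by the resume gate
            if cp_id == 0:
                counter_a += 1  # cp 0 bumps counter A each time it runs
        if reset_after_first_iter:
            # The fix: from_checkpoint only applies to the first iteration.
            resume_point = 0
    return counter_a


if __name__ == "__main__":
    print("without reset:", run_resumed_loop(3, 1, reset_after_first_iter=False))
    print("with reset:", run_resumed_loop(3, 1, reset_after_first_iter=True))
```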
…en CPU tests

Two fixes after running the slice 6 CPU fallback end-to-end on a local
x64 host:

1. `KernelLauncher` was resetting `last_yield_cp_id_on_last_call_` at
   the top of every LLVM launch, including the aux kernels that
   `ndarray.to_numpy()` triggers between a yielding launch and the
   user's `GraphStatus.yielded` check. Restrict the reset to launches
   that have at least one resolved `checkpoint_yield_on=` arg, so
   non-graph aux kernels can't clobber the value. Mirrors the CUDA
   path, where only `GraphManager::launch_cached_graph` touches the
   field.
2. Tests in `test_checkpoint.py` and `test_checkpoint_resume_offset.py`
   were skipping the yield/resume behavioural cases on every
   non-CUDA-native backend. Slice 6 now implements the same contract
   on CPU, so add a `_supports_checkpoint_yield_resume()` predicate
   that admits CUDA-native (slice 1d) and x64 (slice 6), and route the
   behavioural skips through it. Pure introspection assertions that
   only exist on CUDA (e.g. `get_graph_num_checkpoints_on_last_call`)
   keep using `_is_checkpoint_if_path_native`.

Locally on the 5090 box: 54 passed, 8847 deselected (no skips on the
yield/resume cases for either x64 or cuda).

Reflect that slice 6 landed CPU/x64 support for the checkpoint
yield/resume contract:

- Backend support table: CPU now reads "host-branch gating" / "implemented
  (host-branch gating)" instead of "runs unconditionally" / "not yet
  (slices 4-6)". Remaining "not yet" entries collapse to slices 4-5.
- Experimental status block: notes that CPU emulates the same contract
  via host-branch gating in `KernelLauncher` (same Python API, no
  device IF nodes).
- Yield-mechanism section: now titled "CUDA SM 9.0+ and CPU/x64",
  describes the device-kernel vs host-branch lowerings side by side and
  calls out the WHILE early-exit equivalence.
- Host-side yield/resume loop: explicit that backends without the gate
  still return GraphStatus but always with `yielded=False`.
- Authoring tip about `for _ in range(1):` now references "checkpoint
  gate" rather than the CUDA-specific "IF gate".
- New "Backend coverage notes" subsection summarising what each backend
  does today, including the CPU/x64 prototyping angle.

Builds one HIP graph per contiguous run of same-cp_id offloaded tasks
(plus one per cp_id<0 unconditional batch), then has the launcher iterate
batches with host-branch gating that mirrors slice 6's CPU contract:

- `resume_point` from `ctx.resume_from_checkpoint` skips cp_ids strictly
  below it on this launch.
- `yield_signal` (-1 / cp_id) tracks first-yielder-wins; once set, later
  cp_id>=0 batches are skipped.
- After each yielding batch, a stream sync + D2H of the user's `yield_on`
  flag observes the yield, updates the launcher's
  `last_yield_cp_id_on_last_call_`, and memsets the flag back to 0 on
  device. This is the CPU launcher's logic ported to HIP-graph launches.
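
The host-branch gating contract in the list above can be sketched as a plain Python loop. This is a simplified model, not the launcher code: `launch_batches` is a hypothetical name, and each batch body is modelled as a callable that returns whether it raised the yield flag, standing in for the stream sync and device-to-host readback.

```python
def launch_batches(batches, resume_point=0):
    """Run (cp_id, body) batches with host-side gating.

    Returns the cp_ids that actually executed and the first yielding cp_id
    (or -1 if nothing yielded).
    """
    yield_signal = -1  # -1 means no checkpoint has yielded yet
    executed = []
    for cp_id, body in batches:
        gated = cp_id >= 0  # cp_id < 0 batches are unconditional
        if gated and (cp_id < resume_point or yield_signal != -1):
            continue  # below the resume point, or after an observed yield
        raised_flag = body()
        executed.append(cp_id)
        if gated and raised_flag:
            yield_signal = cp_id  # first yielder wins
    return executed, yield_signal


if __name__ == "__main__":
    demo = [
        (-1, lambda: False),
        (0, lambda: False),
        (1, lambda: True),   # yields
        (2, lambda: True),   # would yield, but is skipped on a fresh launch
        (-1, lambda: False),
    ]
    print(launch_batches(demo))
    print(launch_batches(demo, resume_point=2))
```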

ROCm 7.2's HIP has neither conditional graph nodes nor indirect
dispatch, so the design's "gate kernel + indirect dispatch" recipe from
reentrant.md §6.2 is replaced with the equivalent host-orchestrated
sub-graph launches. Pinned-host yield flag (§10) stays a future
enhancement.

Non-checkpoint kernels continue down the original single-graph path
unchanged (detected up-front in `try_launch`); contact area is just the
new sub_graph_execs / batch_cp_ids fields on `CachedGraph` and the
`launch_cached_checkpoint_graph` helper.

`graph_do_while + checkpoint` still falls through to the streaming
launcher today (next commit).
…ost gating

Plumbs the slice-6 CPU host-branch gating into the AMDGPU streaming
launcher so `graph_do_while + qd.checkpoint(yield_on=...)` works on
AMDGPU even though HIP 7.2 has neither conditional graph nodes nor
indirect dispatch.

- `launch_llvm_kernel` now mirrors the CUDA / sub-graph resolver and
  populates `ctx.checkpoint_yield_on_dev_ptrs` from the per-cp arg-id
  table the frontend pushes through `LaunchContextBuilder`.
- `launch_offloaded_tasks` reads `ctx.resume_from_checkpoint` into a
  local `resume_point`, tracks `yield_signal` per launch, and skips
  cp_id>=0 tasks when `cp_id < resume_point` or `yield_signal != -1`.
  After the last task of a yielding checkpoint, a stream-sync + D2H of
  the user's flag is the host equivalent of the device-side yield-check
  kernel; non-zero clears the flag back on device and records the
  first-yielder via `graph_manager_.set_last_yield_cp_id_on_last_call`.
- `launch_offloaded_tasks_with_do_while` breaks on yield (avoiding the
  spin-forever case that the CPU launcher comment in
  `runtime/cpu/kernel_launcher.cpp` calls out) and clears
  `ctx.resume_from_checkpoint` between iterations so `from_checkpoint=cp`
  applies only to the first WHILE iteration.
- `GraphManager` gains a `set_last_yield_cp_id_on_last_call` setter so
  the streaming and sub-graph paths can both feed the same field the
  Python `GraphStatus` surface reads.

Test predicates: `_supports_checkpoint_yield_resume_in_while_loop()`
now returns True on AMDGPU (was False after the first slice-4 commit),
so all four `test_resume_offset_*` cases and
`test_checkpoint_yield_exits_graph_do_while_early` run on AMDGPU.
…oint

Slice 4's design used "indirect dispatch" for non-CUDA-12.4+ backends to
gate per-checkpoint kernel launches. Vulkan + Metal have
vkCmdDispatchIndirect / dispatchThreadgroupsIndirect respectively, but
the GFX runtime today records and submits a single command list per
launch_kernel call rather than running inside a pre-recorded graph, so
"skip a launch" is naturally implemented at the host task-loop level
instead of via indirect dispatch.

Changes (mirror the CPU slice 6 / AMDGPU slice 4 contract):
- `TaskAttributes` gains `checkpoint_id` (propagated from
  `OffloadedStmt::checkpoint_id` in `spirv_codegen.cpp`'s
  serial/range_for/struct_for emit paths). Persisted via QD_IO_DEF so
  offline-cache hits keep the gating wired.
- `GfxRuntime::launch_kernel`'s task loop reads
  `host_ctx.resume_from_checkpoint` and tracks `yield_signal`
  in-launch. Tasks whose `cp_id` is < `resume_point` (first iter only)
  or that follow an observed yield are skipped before the
  pipeline-bind + dispatch record.
- After the last task in a same-cp_id run, the runtime flushes, waits for
  idle, and reads back the user's `yield_on=` flag through
  `Device::readback_data`, sets `last_yield_cp_id_on_last_call_` /
  `yield_signal` on non-zero, and clears the flag with `upload_data`
  before re-opening the cmdlist for subsequent tasks. Stalling here
  is acceptable for slice 4; pinned-host yield flags remain a future
  enhancement per reentrant.md §10.
- `GfxRuntime::last_yield_cp_id_on_last_call()` exposes the field;
  `gfx::KernelLauncher::get_graph_last_yield_cp_id_on_last_call()`
  routes through, so Python `GraphStatus.yielded` becomes accurate on
  vulkan / metal.
- `launch_offloaded_tasks_with_do_while` breaks on yield + clears
  `ctx.resume_from_checkpoint` between iters, matching the CPU /
  AMDGPU implementations.

Test predicates `_supports_checkpoint_yield_resume{,_in_while_loop}`
now return True for vulkan + metal so 27 tests cover those backends.
Build + run pending on cluster (Vulkan) and macOS host (Metal).
@hughperkins hughperkins added the awaiting review pass New PR or review comments addressed label Jun 17, 2026

PR 725's AI bot 'feature factorization' check flagged kernel.py for
accreting ~80 lines of checkpoint-feature-specific blocks across
__call__ / launch_kernel. The PR already follows the extract-to-module
pattern for graph_status.py / checkpoint.py / checkpoint_transformer.py;
this commit applies the same treatment to the kernel.py side.

New file python/quadrants/lang/kernel_checkpoint.py exposes free
functions that Kernel delegates to via one-liner calls:

  - validate_resume_cookie
  - translate_user_label_to_internal_cp_id
  - init_yield_on_arg_id_table
  - maybe_record_yield_on_arg
  - forward_yield_on_table_to_ctx
  - maybe_build_graph_status

kernel.py shrinks by ~48 lines net (68 deletions, 20 insertions). All
existing tests still pass (35/35 in test_checkpoint + test_resume_offset
on x64); pre-commit clean. No behaviour change.

Also rewrap the @qd.kernel(checkpoints=...) docstring param in
kernel_impl.py from 80c to 120c -- AI bot 'line wrapping' check
flagged the 6-line description as wrapping at 80 instead of the
project-wide 120-char convention.

Two failures on PR 725 / commit e939301 ('extract qd.checkpoint
plumbing into kernel_checkpoint.py'):

1. Linux test_api.py: 'kernel_checkpoint' leaked into dir(quadrants)
   because 'from quadrants.lang import kernel_checkpoint as
   _checkpoint_helpers' in kernel.py attaches the submodule as an
   attribute of the parent package, which then bubbles up through
   'from quadrants.lang import *' in quadrants/__init__.py. Other
   internal modules (kernel_impl, misc, ops, etc.) are masked via an
   explicit exclusion list in quadrants/lang/__init__.py's __all__
   filter; add kernel_checkpoint to that list. Verified locally:
   'kernel_checkpoint' in dir(quadrants) -> False after the fix.

2. Line wrapping (AI bot): three pre-existing comment runs in
   runtime/cpu/kernel_launcher.cpp and runtime/cuda/graph_manager.cpp
   wrapped at 76-84 chars instead of the project-wide 120-char limit.
   Reflowed those three runs.
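
The leak described in item 1 can be reproduced with a throwaway package. The sketch below is an assumption about the general shape only: the package names, the `helpers` module, and the `_EXCLUDED` list are hypothetical stand-ins, not the actual quadrants sources. It relies on standard Python import behaviour: importing a submodule binds it as an attribute on its parent package, so a `dir()`-based `__all__` picks it up unless it is filtered out.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_pkg(root, name, excluded):
    """Write a toy package laid out like the one described above."""
    lang = Path(root) / name / "lang"
    lang.mkdir(parents=True)
    # The top-level package re-exports whatever lang lists in __all__.
    (Path(root) / name / "__init__.py").write_text(f"from {name}.lang import *\n")
    # lang/__init__.py derives __all__ from dir(), minus an exclusion list.
    (lang / "__init__.py").write_text(
        f"from {name}.lang import kernel\n"
        f"_EXCLUDED = {excluded!r}\n"
        "__all__ = [n for n in dir() if not n.startswith('_') and n not in _EXCLUDED]\n"
    )
    # kernel.py imports a sibling submodule under a private alias.
    (lang / "kernel.py").write_text(f"from {name}.lang import helpers as _helpers\n")
    (lang / "helpers.py").write_text("def noop():\n    return None\n")


def exported_names(name, excluded):
    """Build the toy package, import it, and list its public attributes."""
    with tempfile.TemporaryDirectory() as tmp:
        build_pkg(tmp, name, excluded)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(name)
            return [n for n in dir(module) if not n.startswith("_")]
        finally:
            sys.path.remove(tmp)


if __name__ == "__main__":
    print("helpers" in exported_names("leaky_pkg", ["kernel"]))
    print("helpers" in exported_names("fixed_pkg", ["kernel", "helpers"]))
```

The private alias `_helpers` only hides the name inside `kernel.py`; the submodule is still attached to the `lang` package under its real name, which is why the exclusion list is needed.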

53/53 test_api + test_checkpoint + test_resume_offset tests pass on
x64; pre-commit clean.

- quadrants/ir/frontend_ir.h:27 (checkpoint_id field docstring): 8 lines
  wrapping at ~90c -> 7 lines packing to ~119c
- quadrants/ir/frontend_ir.h:1092 (begin_checkpoint docstring): 7 lines
  wrapping at ~88c -> 6 lines packing to ~118c
- quadrants/program/kernel_launcher.h:26 (get_graph_num_checkpoints_on_last_call
  docstring): 4 lines wrapping at ~89c -> 3 lines packing to ~120c

Used rewrap-comments-120c skill (find_underwrapped.py --diff). All
other reported runs are #include blocks (structural, can't reflow).

Comment thread docs/source/user_guide/graph.md Outdated

> **Experimental.** `qd.checkpoint`, `qd.GraphStatus`, and `kernel.resume(from_checkpoint=...)` are experimental APIs. The shape of the public surface (the context-manager signature, the `@qd.kernel(checkpoints=True)` flag, the `GraphStatus` fields, the host-side resume loop, the error messages, and the cross-backend lowering details) may change in any future release without a deprecation cycle.

`qd.checkpoint` lets a graph kernel pause partway through, surface a reason to the host, let the host fix things up, and resume from where it paused on the next launch. An example use-case is an algorithm implemented as a graph that may need to allocate additional memory partway through, where the graph operations are in-place, and therefore not idempotent, and therefore for which simply retrying the whole graph from the start is not an option.

Contributor

As I already mentioned, "graph operations" is not a standard term. Either define what it means or reformulate the sentence to avoid this terminology.

I would suggest defining in parentheses what "idempotent" means in programming. It is not really common knowledge, and understanding what it means in this context is critical.

Collaborator Author

I would argue that idempotent is a standard programming term. It is not compiler specific. It is not physics simulation specific. I've seen the term used by many engineers at my previous company, during standard PRs.

Collaborator Author

  • replaced 'graph operations' with 'operations in the graph'
  • replaced 'not idempotent' with 'cannot be rerun'

@duburcqa duburcqa Jun 18, 2026

Contributor

'cannot be rerun' without altering / corrupting the output

Comment thread docs/source/user_guide/graph.md
Comment thread docs/source/user_guide/graph.md
The framework never writes into your `yield_on` buffer — you own it end-to-end. That means:

- Before the **first** launch, initialise it to `0` (a freshly allocated `qd.ndarray` is not guaranteed to be zeroed).
- Before each **resume** launch, reset it to `0` (otherwise the body of the same checkpoint sees the stale non-zero value and yields again on the same condition, looping forever).

Contributor

Could be worth using some ⚠️ tag.

Collaborator Author

Wouldn't that be non-ASCII? 🤔

Contributor

No, it uses :warning:

from_checkpoint=status.checkpoint)
```

### Restrictions

Contributor

Could you explain / be more explicit about what happens during resume?

For example, stating clearly that the entire checkpoint block is re-executed, and that it is the user's responsibility to ensure idempotent behaviour when a checkpoint is needed? Because if the state is altered during the checkpoint block, resuming is not going to save you, I guess? Whatever the answer, it should be very clear in the doc.

Beyond that, how does checkpointing work under the hood? Does it snapshot all the input data by copy before yielding, or does it just return as-is? If no copy is made, this means that resuming must be done "right away", without further altering the data in between; otherwise it is some kind of undefined behaviour.

Another important point: what if I don't want to resume in such a case and I just want to move on to another kernel and continue like this? Is that supported, or must resume happen?

I think it is essential to clarify all these points in the documentation.

Collaborator Author

checkpoint does NOT require idempotent behavior. This is the entire purpose of checkpoint: to be able to interrupt and resume graphs that are NOT idempotent.

Collaborator Author

Ah, the checkpoint block itself. Right.

@hughperkins hughperkins Jun 18, 2026

Collaborator Author

The checkpoint block itself does not so much require idempotence as atomicity: it either succeeds completely, or fails without changing anything.

@hughperkins hughperkins Jun 18, 2026

Collaborator Author

As an example, in the case of allocation issues, the checkpoint block looks like:

  • do we have enough memory available?
    • no: exit now
    • yes: ok, let's proceed with running the sort etc
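
A rough sketch of that pattern together with the host-side resume loop, written as a plain-Python simulation rather than real kernel code. Everything here (`State`, `checkpoint_body`, `run_until_done`, the doubling policy) is hypothetical; the only points taken from this thread are that the block either yields before changing anything or runs to completion, and that the host resets the user-owned yield flag before each launch.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    capacity: int
    items: list = field(default_factory=list)
    yield_flag: int = 0  # stands in for the user-owned yield_on buffer


def checkpoint_body(state, needed):
    """Atomic block: yield before touching data, or do all of the work."""
    if len(state.items) + needed > state.capacity:
        state.yield_flag = 1  # ask the host for more memory; nothing else changed
        return False
    state.items.extend(range(needed))  # in-place work, not safe to run twice
    return True


def run_until_done(state, needed):
    """Host loop: reset the flag, launch, fix up on yield, then resume."""
    launches = 0
    while True:
        state.yield_flag = 0  # the host owns the flag and clears it every time
        launches += 1
        if checkpoint_body(state, needed):
            return launches
        state.capacity *= 2  # host-side fix-up (assumed policy: double capacity)


if __name__ == "__main__":
    state = State(capacity=2)
    print(run_until_done(state, needed=5), state.items, state.capacity)
```

Because a yielding launch leaves `items` untouched, re-running the block after the fix-up performs the in-place work exactly once.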

Collaborator Author

added 'resume where' section

Contributor

The checkpoint block itself does not so much require idempotence as atomicity: it either succeeds completely, or fails without changing anything.

Yeah, this is exactly what I meant by « ensure idempotent behaviour when checkpoint is needed »
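
The two properties being compared in this thread can be contrasted with a small sketch. The functions are hypothetical illustrations, not part of the PR: an operation is idempotent if applying it twice has the same effect as applying it once, whereas an atomic operation either completes fully or leaves its data untouched, and may still change the result every time it succeeds.

```python
def clamp_negatives(values):
    """Idempotent: applying it a second time changes nothing further."""
    return [max(v, 0) for v in values]


def increment_all(values):
    """Not idempotent: every application changes the result again."""
    return [v + 1 for v in values]


def atomic_append(buf, item, capacity):
    """Atomic: appends fully, or fails without touching buf.

    It is not idempotent, since two successful calls append twice.
    """
    if len(buf) >= capacity:
        return False
    buf.append(item)
    return True


if __name__ == "__main__":
    data = [-2, 3]
    print(clamp_negatives(clamp_negatives(data)) == clamp_negatives(data))
    print(increment_all(increment_all(data)) == increment_all(data))
    buf = []
    print(atomic_append(buf, 7, capacity=2), atomic_append(buf, 7, capacity=2), buf)
```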

Collaborator Author

"fails without changing anything" I feel is not idempotent? Idempotent means that calling the function multiple times is identical in effect to calling it once. But if it fails the first time, it would only be idempotent if it always failed thereafter I feel?

@duburcqa duburcqa removed the awaiting review pass New PR or review comments addressed label Jun 18, 2026
Integrate origin/main's nested graph_do_while feature (#728) with
the checkpointing work on hp/graph-checkpoint.

Key reconciliations:
- GraphRegionTag (ir.h): add checkpoint_id alongside graph_do_while_level_id
  and stream_parallel_group_id; update constructors and operator==.
- frontend_ir.{h,cpp}: ForLoopConfig/FrontendForStmt carry checkpoint_id;
  ASTBuilder stamps GraphRegionTag with the 3-arg constructor.
- offload.cpp: propagate checkpoint_id through assemble_serial_statements
  / push_serial_statement so serial side-effecting tasks inherit the
  checkpoint id while pure-only tasks keep -1.
- gen_offline_cache_key.cpp: emit both graph_do_while_level_id and
  checkpoint_id for OffloadedStmt and other statements.
- python/quadrants/lang/kernel.py: re-integrate GraphDoWhileLevel
  dataclass / graph_do_while_levels / _graph_do_while_level_stack from
  main with hp/graph-checkpoint's checkpoint metadata + fast-cache
  load/store paths and launch_kernel.
- function_def_transformer.py: allow qd.checkpoint With blocks inside
  qd.graph_do_while bodies (new _is_checkpoint_with helper).
- cuda/amdgpu/cpu graph_manager + kernel_launcher: combine nested-GDW
  and checkpoint plumbing, dropping qipc-integration-only bits
  (graph_parallel, add_empty_node, pre-Hopper flat graph, cp_id_storage).
- cpu/kernel_launcher.cpp: drop accidental *flag = 0; re-introduction so
  the user-owned yield_on flag is not cleared by the runtime.

Tests:
- All checkpoint and graph_do_while tests pass on x64 and cuda.
- Full x64 test suite: 4174 passed, 2053 skipped (1 pre-existing
  perf-flake test_concurrent_streams_with_events, unrelated).

@hughperkins hughperkins added the awaiting review pass New PR or review comments addressed label Jun 18, 2026
prepare_checkpoint_launch_state and finalize_checkpoint_readback already
exist in checkpoint_launch.cpp and are declared in runtime.h, but
runtime.cpp still inlined the identical logic instead of calling them.
Replace both inline blocks with calls to the existing helpers
(-94 / +7) to silence the CI feature-factorization check and keep the
launch_kernel hot path focused.

Labels

awaiting review pass New PR or review comments addressed qipc needed by qipc

2 participants