[Graph] Add qd.checkpoint #725
base: main
Changes from 90 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,14 +4,11 @@ Graphs reduce kernel launch overhead by capturing a sequence of GPU operations i | |
|
|
||
| ## Backend support | ||
|
|
||
| Both features run on every backend. They are *hardware accelerated* on CUDA (via CUDA graphs) and AMDGPU (via HIP graphs); `graph_do_while` additionally requires CUDA SM 9.0+ / Hopper for its hardware-accelerated path. On other backends, `graph=True` is silently ignored and the kernel runs via the normal launch path, and `graph_do_while` falls back to a host-side do-while loop that copies the condition value GPU → host each iteration (causing a pipeline stall — see [Caveats](#caveats)). | ||
|
|
||
| | Feature | `qd.cuda` SM 9.0+ | `qd.cuda` < SM 9.0 | `qd.amdgpu` | `qd.metal` | `qd.vulkan` | `qd.cpu` | | ||
|
hughperkins marked this conversation as resolved.
|
||
| | --- | --- | --- | --- | --- | --- | --- | | ||
| | `graph=True` | hardware accelerated | hardware accelerated | hardware accelerated | runs (no acceleration) | runs (no acceleration) | runs (no acceleration) | | ||
| | `graph_do_while` | hardware accelerated | host fallback | host fallback | host fallback | host fallback | host fallback | | ||
|
|
||
| AMDGPU `graph_do_while` falls back to the host-side loop because HIP does not currently expose conditional / while graph nodes (as of ROCm 7.2). | ||
| | `qd.graph_do_while` | hardware accelerated | host fallback | host fallback | host fallback | host fallback | host fallback | | ||
| | `qd.checkpoint` | GPU-side | GPU-side | GPU-side | GPU-side | GPU-side | host-side | | ||
|
|
||
| ## Basic usage | ||
|
|
||
|
|
@@ -154,3 +151,90 @@ Note: the basic `graph=True` path (without `graph_do_while`) does **not** stall | |
| Therefore, on unsupported platforms, you might consider creating a second implementation that works differently, e.g.: | ||
| - use a fixed number of loop iterations, so that kernel launches do not depend on GPU data; perhaps combined with: | ||
| - make each kernel 'short-circuit' (exit quickly) if the task has already been completed, to avoid running the GPU more than necessary | ||
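A minimal pure-Python sketch of the two suggestions above, with no Quadrants calls. The names `newton_step` and `solve_fixed`, and the square-root problem itself, are hypothetical and chosen only to illustrate the pattern: the host always issues the same number of steps, and each step returns immediately once a `done` flag has been set.

```python
def newton_step(state: dict, tol: float = 1e-9) -> None:
    """One 'kernel' of a Newton iteration for sqrt(target) (hypothetical stand-in)."""
    if state["done"]:
        return  # short-circuit: converged on an earlier step, so do no work
    x, target = state["x"], state["target"]
    state["x"] = 0.5 * (x + target / x)
    state["work"] += 1  # counts only the steps that did real work
    if abs(state["x"] ** 2 - target) < tol:
        state["done"] = True


def solve_fixed(target: float, max_iters: int = 32) -> dict:
    """Issue a fixed number of steps; the loop bound never reads 'device' data."""
    state = {"x": target, "target": target, "done": False, "work": 0}
    for _ in range(max_iters):
        newton_step(state)
    return state


result = solve_fixed(2.0)
```

The trade-off is that all `max_iters` steps are still launched; the short-circuit only makes the surplus ones cheap.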
|
|
||
| ## Checkpoints with `qd.checkpoint` *(experimental)* | ||
|
Collaborator
Author
Note: after the backwards-incompatible disaster that was the algorithms.md changes for qipc, which kept changing 😅, I think I'd like to mark things as 'experimental' for a few weeks/months, until we are confident the API is stable.
Collaborator
Author
Note to AI: this is not a request to you; it is an observation for other human reviewers.
Contributor
If checkpoint is a graph-specific API, it should be prefixed by 'graph_', as for all the other functions.
Contributor
Now that I think about it, we should probably just have a new 'qd.graph.' submodule. That would make everything both simpler and less confusing.
Collaborator
Author
Seems reasonable. But perhaps not in this PR?
Contributor
In that case, make sure it is tracked somewhere. Still, it is weird to address this in another PR, since this is literally the PR that introduces this function. But I can understand that you want to move faster and that qipc is already relying on this one. |
||
|
|
||
| > **Experimental.** `qd.checkpoint`, `qd.GraphStatus`, and `kernel.resume(from_checkpoint=...)` are experimental APIs. The shape of the public surface (the context-manager signature, the `@qd.kernel(checkpoints=True)` flag, the `GraphStatus` fields, the host-side resume loop, the error messages, and the cross-backend lowering details) may change in any future release without a deprecation cycle. | ||
|
|
||
| `qd.checkpoint` lets a graph kernel pause partway through, surface a reason to the host, let the host fix things up, and resume from where it paused on the next launch. An example use case is an algorithm implemented as a graph that may need to allocate additional memory partway through, where the graph operations are in-place and therefore not idempotent, so simply retrying the whole graph from the start is not an option. | ||
|
Contributor
As I already mentioned, "graph operations" is not a standard term: you cannot use it without defining what it means, so either define it or reformulate this sentence to avoid the terminology. I would also suggest defining in parentheses what "idempotent" means in programming. It is not really common knowledge, and understanding what it means in this context is critical.
Collaborator
Author
I would argue that idempotent is a standard programming term. It is not compiler-specific. It is not physics-simulation-specific. I've seen the term used by many engineers at my previous company, during standard PRs.
Contributor
'cannot be rerun' without altering / corrupting the output |
||
|
|
||
| To use checkpoints: | ||
|
|
||
| 1. Decorate the kernel with `@qd.kernel(graph=True, checkpoints=True)`. | ||
| 2. Place `with qd.checkpoint(cp_id, yield_on=flag):` around any section of the body where you want to be able to pause and resume. | ||
|
|
||
| ```python | ||
| from enum import IntEnum | ||
|
|
||
| class Stage(IntEnum): | ||
|     SIM = 0 | ||
|
|
||
| @qd.kernel(graph=True, checkpoints=True) | ||
| def step( | ||
|     arr: qd.types.ndarray(qd.f32, ndim=1), | ||
|     overflow_flag: qd.types.ndarray(qd.i32, ndim=0), | ||
|     newton_cond: qd.types.ndarray(qd.i32, ndim=0), | ||
| ): | ||
|     while qd.graph_do_while(newton_cond): | ||
|         for i in range(arr.shape[0]): | ||
|             # ... | ||
|             pass | ||
|         with qd.checkpoint(Stage.SIM, yield_on=overflow_flag): | ||
|             for i in range(arr.shape[0]): | ||
|                 # ... | ||
|                 pass | ||
|         for i in range(arr.shape[0]): | ||
|             # ... | ||
|             pass | ||
| ``` | ||
|
hughperkins marked this conversation as resolved.
|
||
|
|
||
| The `cp_id` argument is the label you'll use to identify the checkpoint from the host (in `GraphStatus.checkpoint` and `kernel.resume(from_checkpoint=...)`). It must be an int literal or an `IntEnum` value; the framework preserves the value as-is, so `qd.checkpoint(Stage.SIM, ...)` round-trips as `Stage.SIM` rather than the raw int. Labels must be unique within a kernel. | ||
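To make "round-trips as `Stage.SIM`" and "unique within a kernel" concrete, here is a small plain-Python illustration. It is not the framework's validation code: `check_labels` and the `Stage.CONTACT` member are hypothetical, and treating `Stage.SIM` and the literal `0` as a clash is an assumption based on the two having the same integer value.

```python
from enum import IntEnum


class Stage(IntEnum):
    SIM = 0
    CONTACT = 1  # hypothetical second label, for illustration only


def check_labels(labels):
    """Reject non-int or duplicate labels; return the labels unchanged otherwise."""
    seen = set()
    for label in labels:
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(label, bool) or not isinstance(label, int):
            raise TypeError(f"label must be an int or IntEnum value, got {label!r}")
        if int(label) in seen:
            raise ValueError(f"duplicate checkpoint label: {label!r}")
        seen.add(int(label))
    return list(labels)


labels = check_labels([Stage.SIM, Stage.CONTACT])
```

Because the enum member is kept as-is rather than converted to `int`, host code can compare a returned label with `is Stage.SIM` instead of a magic number.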
|
|
||
| ### Yield mechanism | ||
|
|
||
| When the body of a checkpoint writes a non-zero value into `yield_on[()]`: | ||
|
|
||
| 1. Everything after the yielding checkpoint in the same launch is skipped. | ||
| 2. `qd.checkpoint` will exit any surrounding `qd.graph_do_while`. | ||
|
|
||
| The framework never writes into your `yield_on` buffer — you own it end-to-end. That means: | ||
|
duburcqa marked this conversation as resolved.
|
||
|
|
||
| - Before the **first** launch, initialise it to `0` (a freshly allocated `qd.ndarray` is not guaranteed to be zeroed). | ||
| - Before each **resume** launch, reset it to `0` (otherwise the body of the same checkpoint sees the stale non-zero value and yields again on the same condition, looping forever). | ||
|
Contributor
Could be worth using some
Collaborator
Author
wouldn't that be non-ASCII? 🤔
Contributor
No, it uses |
||
|
|
||
| ### Host-side yield / resume loop | ||
|
|
||
| Kernels annotated with `checkpoints=True` return a `qd.GraphStatus` from every launch (including from `kernel.resume(...)`). The status carries two fields: | ||
|
|
||
| - `status.yielded` — `True` iff a checkpoint's `yield_on=` flag was non-zero during this launch. | ||
| - `status.checkpoint` — the `cp_id` label of the yielding checkpoint (or `None` when `yielded` is `False`). | ||
|
|
||
| Resume by calling `kernel.resume(..., from_checkpoint=label)`. Everything before `label` in source order is skipped on the resume launch; everything from `label` onward runs normally. The canonical host loop: | ||
|
|
||
| ```python | ||
| overflow_flag[()] = 0 # initialise before the first launch | ||
| status = step(arr, overflow_flag, newton_cond) | ||
|
duburcqa marked this conversation as resolved.
|
||
| while status.yielded: | ||
|     handle_overflow_for(status.checkpoint, ...) | ||
|     overflow_flag[()] = 0 # clear before resume, otherwise the same checkpoint yields again | ||
|
duburcqa marked this conversation as resolved.
|
||
|     status = step.resume(arr, overflow_flag, newton_cond, | ||
|                          from_checkpoint=status.checkpoint) | ||
| ``` | ||
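The control flow of the host loop above can be tried without a GPU using a pure-Python model. Everything here is a hypothetical stand-in (`MockKernel`, `MockStatus`, `run_to_completion`, and the capacity check that plays the role of the overflow condition); it does not call Quadrants and only mirrors the behaviour described in this section. As a simplification, a stale non-zero flag makes the model yield before the checkpoint body does any work.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Stage(IntEnum):
    SIM = 0


@dataclass
class MockStatus:
    """Mirrors the two documented fields; not the real qd.GraphStatus."""
    yielded: bool
    checkpoint: Optional[Stage]


class MockKernel:
    """Models one launch as: pre-work, then the SIM checkpoint, then post-work."""

    def __init__(self, capacity: int, needed: int) -> None:
        self.capacity = capacity   # host-managed buffer size
        self.needed = needed       # size the SIM stage requires
        self.flag = 0              # plays the role of the 0-d yield_on array
        self.log: List[str] = []   # records which stages actually ran

    def __call__(self) -> MockStatus:
        self.log.append("pre")     # in-place work that must not be repeated
        return self._from_sim()

    def resume(self, from_checkpoint: Stage) -> MockStatus:
        assert from_checkpoint is Stage.SIM
        return self._from_sim()    # everything before the checkpoint is skipped

    def _from_sim(self) -> MockStatus:
        if self.needed > self.capacity:
            self.flag = 1          # the body raises the flag and changes nothing else
        if self.flag != 0:
            return MockStatus(True, Stage.SIM)  # later stages are skipped
        self.log.append("sim")
        self.log.append("post")
        return MockStatus(False, None)


def run_to_completion(kernel: MockKernel, max_launches: int = 8) -> int:
    kernel.flag = 0                                # initialise before the first launch
    status = kernel()
    launches = 1
    while status.yielded:
        if launches >= max_launches:
            raise RuntimeError("still yielding; was the flag cleared?")
        kernel.capacity = max(kernel.capacity, kernel.needed)  # host-side fix-up
        kernel.flag = 0                            # clear before resuming
        status = kernel.resume(from_checkpoint=status.checkpoint)
        launches += 1
    return launches


kernel = MockKernel(capacity=4, needed=16)
launches = run_to_completion(kernel)
```

In this model the pre-work stage appears exactly once in `kernel.log` even though two launches were needed, which is the property that distinguishes resuming from simply retrying the whole graph.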
|
|
||
| ### Restrictions | ||
|
Contributor
Could you explain / be more explicit about what happens during resume? For example, state clearly that the entire checkpoint block is re-executed, and that it is the user's responsibility to ensure idempotent behaviour when a checkpoint is needed. Because if the state is altered during the checkpoint block, resuming is not going to save you, I guess? Whatever the answer, it should be very clear in the doc. Beyond that, how does checkpointing work under the hood? Does it snapshot all the input data by copy before yielding, or does it just return as-is? If no copy is made, this means that resuming must be done "right away", without further altering the data in between; otherwise it is some kind of undefined behaviour. Another important point: what if I don't want to resume in such a case, and just want to move on to another kernel and continue from there? Is that supported, or must resume happen? I think it is essential to clarify all these points in the documentation.
Collaborator
Author
checkpoint does NOT require idempotent behavior. This is the entire purpose of checkpoint: to be able to interrupt and resume graphs that are NOT idempotent.
Collaborator
Author
ah, the checkpoint block itself. right.
Collaborator
Author
the checkpoint block itself does not so much require idempotence as atomicity: it either succeeds completely, or fails without changing anything.
Collaborator
Author
as an example, in the case of allocation issues, the checkpoint block looks like:
Collaborator
Author
added 'resume where' section
Contributor
Yeah, this is exactly what I meant by « ensure idempotent behaviour when checkpoint is needed »
Collaborator
Author
"fails without changing anything" is not idempotent, I feel? Idempotent means that calling the function multiple times is identical in effect to calling it once. But if it fails the first time, it would only be idempotent if it always failed thereafter, I feel? |
||
|
|
||
| - Must be used inside `@qd.kernel(graph=True, checkpoints=True)`. Without the flag, `qd.checkpoint(...)` raises `QuadrantsSyntaxError` at compile time with a fix-it pointing at `checkpoints=True`. | ||
| - `cp_id` must be an int literal or an `IntEnum` value, and must be unique across the kernel. | ||
| - `yield_on=` must be a kernel parameter that is a 0-d `qd.types.ndarray(qd.i32, ndim=0)`; expressions are not supported. | ||
| - Checkpoints cannot be nested inside other checkpoints. Checkpoints inside a `qd.graph_do_while` body are fine. | ||
| - The body of a `with qd.checkpoint(...)` block cannot contain bare top-level statements (assignments, augmented assignments, or bare call/expression statements). Every top-level statement must be inside a `for`-loop (or other control-flow construct). A docstring as the first statement is allowed. Bare statements raise `QuadrantsSyntaxError` at compile time with a fix-it pointing at the explicit one-iteration `for`-wrap: | ||
|
|
||
| ```python | ||
| with qd.checkpoint(0, yield_on=flag): | ||
|     for _ in range(1): | ||
|         c[()] = c[()] + 1 | ||
|     for i in range(arr.shape[0]): | ||
|         arr[i] = arr[i] + 1 | ||
| ``` | ||
|
|
||
| The restriction is by design: each top-level statement inside a checkpoint becomes its own GPU task / graph node, so silently wrapping bare statements would hide a sequence of N field writes ballooning into N kernel launches. Forcing the user to write the `for`-wrap themselves keeps the lowering visible and gives a single obvious place to fuse multiple writes into one task by sharing a single wrapper. | ||
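The bare-statement rule can be approximated outside the compiler with Python's standard `ast` module. The sketch below is not the Quadrants implementation: `bare_statement_lines` and its helpers are hypothetical, and it only assumes the rule as stated above (assignments, augmented assignments, and bare expression statements are rejected at the top level of a `with ...checkpoint(...)` body, with a leading docstring allowed).

```python
import ast
from typing import List

# Statement kinds the text describes as "bare" at the top level of a checkpoint body.
BARE = (ast.Assign, ast.AugAssign, ast.Expr)


def _is_checkpoint(expr: ast.expr) -> bool:
    """True for a call whose attribute name is `checkpoint`, e.g. qd.checkpoint(...)."""
    return (isinstance(expr, ast.Call)
            and isinstance(expr.func, ast.Attribute)
            and expr.func.attr == "checkpoint")


def _is_docstring(stmt: ast.stmt) -> bool:
    return (isinstance(stmt, ast.Expr)
            and isinstance(stmt.value, ast.Constant)
            and isinstance(stmt.value.value, str))


def bare_statement_lines(source: str) -> List[int]:
    """Return line numbers of bare top-level statements inside checkpoint blocks."""
    bad = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.With):
            continue
        if not any(_is_checkpoint(item.context_expr) for item in node.items):
            continue
        for index, stmt in enumerate(node.body):
            if index == 0 and _is_docstring(stmt):
                continue  # a docstring as the first statement is allowed
            if isinstance(stmt, BARE):
                bad.append(stmt.lineno)
    return bad


GOOD = '''with qd.checkpoint(0, yield_on=flag):
    for _ in range(1):
        c[()] = c[()] + 1
    for i in range(arr.shape[0]):
        arr[i] = arr[i] + 1
'''

BAD = '''with qd.checkpoint(0, yield_on=flag):
    """Docstrings are allowed."""
    c[()] = c[()] + 1
    for i in range(arr.shape[0]):
        arr[i] = arr[i] + 1
'''

good_lines = bare_statement_lines(GOOD)
bad_lines = bare_statement_lines(BAD)
```

The source is only parsed, never executed, so `qd`, `flag`, `c`, and `arr` do not need to exist for the check to run.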