Genesis-Embodied-AI · hughperkins · Jun 4, 2026
diff --git a/docs/source/user_guide/graph.md b/docs/source/user_guide/graph.md
@@ -4,14 +4,11 @@ Graphs reduce kernel launch overhead by capturing a sequence of GPU operations i
 
 ## Backend support
 
-Both features run on every backend. They are *hardware accelerated* on CUDA (via CUDA graphs) and AMDGPU (via HIP graphs); `graph_do_while` additionally requires CUDA SM 9.0+ / Hopper for its hardware-accelerated path. On other backends, `graph=True` is silently ignored and the kernel runs via the normal launch path, and `graph_do_while` falls back to a host-side do-while loop that copies the condition value GPU → host each iteration (causing a pipeline stall — see [Caveats](#caveats)).
-
 | Feature | `qd.cuda` SM 9.0+ | `qd.cuda` < SM 9.0 | `qd.amdgpu` | `qd.metal` | `qd.vulkan` | `qd.cpu` |
 | --- | --- | --- | --- | --- | --- | --- |
 | `graph=True` | hardware accelerated | hardware accelerated | hardware accelerated | runs (no acceleration) | runs (no acceleration) | runs (no acceleration) |
-| `graph_do_while` | hardware accelerated | host fallback | host fallback | host fallback | host fallback | host fallback |
-
-AMDGPU `graph_do_while` falls back to the host-side loop because HIP does not currently expose conditional / while graph nodes (as of ROCm 7.2).
+| `qd.graph_do_while` | hardware accelerated | host fallback | host fallback | host fallback | host fallback | host fallback |
+| `qd.checkpoint` | GPU-side | GPU-side | GPU-side | GPU-side | GPU-side | host-side |
 
 ## Basic usage
 
@@ -154,3 +151,90 @@ Note: the basic `graph=True` path (without `graph_do_while`) does **not** stall
 Therefore on unsupported platforms, you might consider creating a second implementation, which works differently. e.g.:
 - fixed number of loop iterations, so no dependency on gpu data for kernel launch; combined perhaps with:
 - make each kernel 'short-circuit', exit quickly, if the task has already been completed; to avoid running the GPU more than necessary
+
+## Checkpoints with `qd.checkpoint` *(experimental)*
+
+> **Experimental.** `qd.checkpoint`, `qd.GraphStatus`, and `kernel.resume(from_checkpoint=...)` are experimental APIs. The shape of the public surface (the context-manager signature, the `@qd.kernel(checkpoints=True)` flag, the `GraphStatus` fields, the host-side resume loop, the error messages, and the cross-backend lowering details) may change in any future release without a deprecation cycle.
+
+`qd.checkpoint` lets a graph kernel pause partway through, surface a reason to the host, let the host fix things up, and resume from where it paused on the next launch. An example use case is an algorithm implemented as a graph that may need to allocate additional memory partway through: because the graph's operations are in-place, they are not idempotent, so simply retrying the whole graph from the start is not an option.
+
+To use checkpoints:
+
+1. Decorate the kernel with `@qd.kernel(graph=True, checkpoints=True)`.
+2. Place `with qd.checkpoint(cp_id, yield_on=flag):` around any section of the body where you want to be able to pause and resume.
+
+```python
+from enum import IntEnum
+
+class Stage(IntEnum):
+    SIM = 0
+
+@qd.kernel(graph=True, checkpoints=True)
+def step(
+    arr: qd.types.ndarray(qd.f32, ndim=1),
+    overflow_flag: qd.types.ndarray(qd.i32, ndim=0),
+    newton_cond: qd.types.ndarray(qd.i32, ndim=0),
+):
+    while qd.graph_do_while(newton_cond):
+        for i in range(arr.shape[0]):
+            # ...
+            pass
+        with qd.checkpoint(Stage.SIM, yield_on=overflow_flag):
+            for i in range(arr.shape[0]):
+                # ...
+                pass
+        for i in range(arr.shape[0]):
+            # ...
+            pass
+```
+
+The `cp_id` argument is the label you'll use to identify the checkpoint from the host (in `GraphStatus.checkpoint` and `kernel.resume(from_checkpoint=...)`). It must be an int literal or an `IntEnum` value; the framework preserves the value as-is, so `qd.checkpoint(Stage.SIM, ...)` round-trips as `Stage.SIM` rather than the raw int. Labels must be unique within a kernel.
+
+### Yield mechanism
+
+When the body of a checkpoint writes a non-zero value into `yield_on[()]`:
+
+1. Everything after the yielding checkpoint in the same launch is skipped.
+2. Any surrounding `qd.graph_do_while` loop is exited.
+
+The framework never writes into your `yield_on` buffer — you own it end-to-end. That means:
+
+- Before the **first** launch, initialise it to `0` (a freshly allocated `qd.ndarray` is not guaranteed to be zeroed).
+- Before each **resume** launch, reset it to `0` (otherwise the body of the same checkpoint sees the stale non-zero value and yields again on the same condition, looping forever).
+
+### Host-side yield / resume loop
+
+Kernels annotated with `checkpoints=True` return a `qd.GraphStatus` from every launch (including from `kernel.resume(...)`). The status carries two fields:
+
+- `status.yielded` — `True` iff a checkpoint's `yield_on=` flag was non-zero during this launch.
+- `status.checkpoint` — the `cp_id` label of the yielding checkpoint (or `None` when `yielded` is `False`).
+
+Resume by calling `kernel.resume(..., from_checkpoint=label)`. Everything before `label` in source order is skipped on the resume launch; everything from `label` onward runs normally. The canonical host loop:
+
+```python
+overflow_flag[()] = 0  # initialise before the first launch
+status = step(arr, overflow_flag, newton_cond)
+while status.yielded:
+    handle_overflow_for(status.checkpoint, ...)
+    overflow_flag[()] = 0  # clear before resume, otherwise the same checkpoint yields again
+    status = step.resume(arr, overflow_flag, newton_cond,
+                         from_checkpoint=status.checkpoint)
+```
+
+### Restrictions
+
+- Must be used inside `@qd.kernel(graph=True, checkpoints=True)`. Without the flag, `qd.checkpoint(...)` raises `QuadrantsSyntaxError` at compile time with a fix-it pointing at `checkpoints=True`.
+- `cp_id` must be an int literal or an `IntEnum` value, and must be unique across the kernel.
+- `yield_on=` must be a kernel parameter that is a 0-d `qd.types.ndarray(qd.i32, ndim=0)`; expressions are not supported.
+- Checkpoints cannot be nested inside other checkpoints. Checkpoints inside a `qd.graph_do_while` body are fine.
+- The body of a `with qd.checkpoint(...)` block cannot contain bare top-level statements (assignments, augmented assignments, or bare call/expression statements). Every top-level statement must be inside a `for`-loop (or other control-flow construct). A docstring as the first statement is allowed. Bare statements raise `QuadrantsSyntaxError` at compile time with a fix-it pointing at the explicit one-iteration `for`-wrap:
+
+  ```python
+  with qd.checkpoint(0, yield_on=flag):
+      for _ in range(1):
+          c[()] = c[()] + 1
+      for i in range(arr.shape[0]):
+          arr[i] = arr[i] + 1
+  ```
+
+The restriction is by design: each top-level statement inside a checkpoint becomes its own GPU task / graph node, so silently wrapping bare statements would hide a sequence of N field writes ballooning into N kernel launches. Forcing the user to write the `for`-wrap themselves keeps the lowering visible and gives a single obvious place to fuse multiple writes into one task by sharing a single wrapper.
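
Reviewer sketch (not part of the patch): the yield / resume contract documented in `graph.md` above can be modelled in plain Python without a GPU. `FakeStep`, `FakeStatus`, `run_to_completion`, and the capacity-doubling fix-up are hypothetical stand-ins invented for illustration. Only the `yielded` / `checkpoint` fields and the `resume(..., from_checkpoint=...)` call shape are taken from the documentation, and the real lowering will differ.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Stage(IntEnum):
    SIM = 0


@dataclass
class FakeStatus:
    """Hypothetical stand-in for qd.GraphStatus (same two documented fields)."""
    yielded: bool
    checkpoint: Optional[Stage]


class FakeStep:
    """Toy model of a kernel shaped as: pre -> checkpoint(Stage.SIM) -> post."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.trace: List[str] = []

    def _run(self, flag: List[int], needed: int, skip_pre: bool) -> FakeStatus:
        if not skip_pre:
            self.trace.append("pre")  # work before the checkpoint
        self.trace.append("sim")      # body of the checkpoint
        if self.capacity < needed:
            flag[0] = 1               # the body itself raises the yield flag
        if flag[0] != 0:
            # Everything after the yielding checkpoint is skipped.
            return FakeStatus(True, Stage.SIM)
        self.trace.append("post")     # work after the checkpoint
        return FakeStatus(False, None)

    def __call__(self, flag: List[int], needed: int) -> FakeStatus:
        return self._run(flag, needed, skip_pre=False)

    def resume(self, flag: List[int], needed: int, *, from_checkpoint: int) -> FakeStatus:
        if from_checkpoint != Stage.SIM:
            raise RuntimeError(f"unknown checkpoint {from_checkpoint!r}")
        # Everything before the label is skipped on a resume launch.
        return self._run(flag, needed, skip_pre=True)


def run_to_completion(step: FakeStep, needed: int) -> FakeStatus:
    flag = [0]                        # initialise before the first launch
    status = step(flag, needed)
    while status.yielded:
        step.capacity *= 2            # hypothetical host-side fix-up
        flag[0] = 0                   # clear before resume, or it yields again
        status = step.resume(flag, needed, from_checkpoint=status.checkpoint)
    return status


step = FakeStep(capacity=4)
final = run_to_completion(step, needed=10)
print(step.trace)
```

In this model `pre` runs once, the checkpoint body re-runs on every resume, and `post` runs only on the launch where the flag stays zero.
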
diff --git a/python/quadrants/lang/__init__.py b/python/quadrants/lang/__init__.py
@@ -35,9 +35,11 @@
         "enums",
         "exception",
         "expr",
+        "graph_status",
         "impl",
         "inspect",
         "kernel_arguments",
+        "kernel_checkpoint",
         "kernel_impl",
         "matrix",
         "mesh",

diff --git a/python/quadrants/lang/_quadrants_callable.py b/python/quadrants/lang/_quadrants_callable.py
@@ -96,6 +96,48 @@ def __init__(self, fn: Callable, wrapper: Callable) -> None:
     def __call__(self, *args, **kwargs):
         return self.wrapper.__call__(*args, **kwargs)
 
+    def resume(self, *args, from_checkpoint, **kwargs):
+        """Continues a paused graph kernel from the checkpoint labelled ``from_checkpoint``.
+
+        .. warning::
+
+            **Experimental.** ``kernel.resume`` is part of the experimental ``qd.checkpoint`` surface; the signature
+            (in particular the ``from_checkpoint=`` kwarg) and behaviour may change in any future release without a
+            deprecation cycle.
+
+        Use only on ``@qd.kernel(graph=True, checkpoints=True)`` kernels with at least one
+        ``qd.checkpoint(cp_id, yield_on=flag)`` block. ``from_checkpoint`` is a ``cp_id`` label (typically an
+        ``IntEnum`` value, often ``status.checkpoint`` from the previous launch): everything before that label in
+        source order is skipped on this launch, and execution continues from there. The host loop pattern is::
+
+            from enum import IntEnum
+
+            class Stage(IntEnum):
+                SIM = 0
+
+            overflow_flag[()] = 0  # initialise before the first launch
+            status = step(arr, overflow_flag, newton_cond)
+            while status.yielded:
+                handle(status.checkpoint, ...)
+                overflow_flag[()] = 0  # the framework never clears your yield_on flag
+                status = step.resume(arr, overflow_flag, newton_cond,
+                                     from_checkpoint=status.checkpoint)
+
+        Returns the same ``GraphStatus`` shape as the plain call.
+
+        Raises ``RuntimeError`` if ``from_checkpoint`` is not an int, if the kernel has no ``yield_on=`` checkpoint, or
+        if ``from_checkpoint`` does not match any declared ``cp_id`` in the kernel.
+        """
+        if not isinstance(from_checkpoint, int):
+            raise RuntimeError(
+                f"from_checkpoint= must be an int or IntEnum value matching a `qd.checkpoint(cp_id=...)` label in "
+                f"the kernel (typically `status.checkpoint` from the previous launch's GraphStatus); "
+                f"got {from_checkpoint!r}."
+            )
+        # Smuggle the resume cookie past the AST-mapped kwargs path; `Kernel.__call__` pops it before anything else
+        # looks at kwargs.
+        return self.wrapper.__call__(*args, _qd_from_checkpoint=from_checkpoint, **kwargs)
+
     def __get__(self, instance, owner):
         if instance is None:
             return self
@@ -125,3 +167,7 @@ def __setattr__(self, k: str, v: Any) -> None:
     def grad(self, *args, **kwargs) -> "Kernel":
         assert self.quadrants_callable._adjoint is not None
         return self.quadrants_callable._adjoint(self.instance, *args, **kwargs)
+
+    def resume(self, *args, from_checkpoint, **kwargs):
+        """Bound-method form of `QuadrantsCallable.resume` (see that docstring)."""
+        return self.quadrants_callable.resume(self.instance, *args, from_checkpoint=from_checkpoint, **kwargs)
diff --git a/python/quadrants/lang/ast/ast_transformer.py b/python/quadrants/lang/ast/ast_transformer.py
@@ -26,6 +26,9 @@
     get_decorator,
 )
 from quadrants.lang.ast.ast_transformers.call_transformer import CallTransformer
+from quadrants.lang.ast.ast_transformers.checkpoint_transformer import (
+    CheckpointTransformer,
+)
 from quadrants.lang.ast.ast_transformers.function_def_transformer import (
     FunctionDefTransformer,
 )
@@ -1362,6 +1365,13 @@ def _is_graph_do_while_call(node: ast.expr) -> str | None:
                 return node.args[0].id
         return None
 
+    @staticmethod
+    def _is_checkpoint_call(node: ast.expr, global_vars: dict):
+        """Thin forwarding wrapper around ``CheckpointTransformer.is_checkpoint_call``; the actual logic lives in module
+        ``ast_transformers/checkpoint_transformer.py`` to keep this file from growing per-feature. Returns a
+        ``CheckpointCallInfo`` or ``None``."""
+        return CheckpointTransformer.is_checkpoint_call(node, global_vars)
+
     @staticmethod
     def build_While(ctx: ASTTransformerFuncContext, node: ast.While) -> None:
         if node.orelse:
@@ -1575,15 +1585,32 @@ def build_With(ctx: ASTTransformerFuncContext, node: ast.With) -> None:
             raise QuadrantsSyntaxError("'with ... as ...' is not supported in Quadrants kernels")
         if not isinstance(item.context_expr, ast.Call):
             raise QuadrantsSyntaxError("'with' in Quadrants kernels requires a call expression")
+
+        checkpoint_info = ASTTransformer._is_checkpoint_call(item.context_expr, ctx.global_vars)
+        if checkpoint_info is not None:
+            return ASTTransformer._build_checkpoint_with(ctx, node, checkpoint_info)
+
         if not FunctionDefTransformer._is_stream_parallel_with(node, ctx.global_vars):
-            raise QuadrantsSyntaxError("'with' in Quadrants kernels only supports qd.stream_parallel()")
+            raise QuadrantsSyntaxError(
+                "'with' in Quadrants kernels only supports qd.stream_parallel() or qd.checkpoint()"
+            )
         if not ctx.is_kernel:
             raise QuadrantsSyntaxError("qd.stream_parallel() can only be used inside @qd.kernel, not @qd.func")
         ctx.ast_builder.begin_stream_parallel()
         build_stmts(ctx, node.body)
         ctx.ast_builder.end_stream_parallel()
         return None
 
+    @staticmethod
+    def _build_checkpoint_with(
+        ctx: ASTTransformerFuncContext,
+        node: ast.With,
+        info,
+    ) -> None:
+        """Thin forwarding wrapper around ``CheckpointTransformer.build_checkpoint_with``; the actual logic lives in
+        ``ast_transformers/checkpoint_transformer.py``."""
+        return CheckpointTransformer.build_checkpoint_with(ctx, node, info, build_stmts)
+
     @staticmethod
     def build_Pass(ctx: ASTTransformerFuncContext, node: ast.Pass) -> None:
         return None