Genesis-Embodied-AI · hughperkins · Jun 4, 2026
diff --git a/docs/source/user_guide/graph.md b/docs/source/user_guide/graph.md
@@ -4,14 +4,11 @@ Graphs reduce kernel launch overhead by capturing a sequence of GPU operations i
 
 ## Backend support
 
-Both features run on every backend. They are *hardware accelerated* on CUDA (via CUDA graphs) and AMDGPU (via HIP graphs); `graph_do_while` additionally requires CUDA SM 9.0+ / Hopper for its hardware-accelerated path. On other backends, `graph=True` is silently ignored and the kernel runs via the normal launch path, and `graph_do_while` falls back to a host-side do-while loop that copies the condition value GPU → host each iteration (causing a pipeline stall — see [Caveats](#caveats)).
-
 | Feature | `qd.cuda` SM 9.0+ | `qd.cuda` < SM 9.0 | `qd.amdgpu` | `qd.metal` | `qd.vulkan` | `qd.cpu` |
 | --- | --- | --- | --- | --- | --- | --- |
 | `graph=True` | hardware accelerated | hardware accelerated | hardware accelerated | runs (no acceleration) | runs (no acceleration) | runs (no acceleration) |
 | `graph_do_while` | hardware accelerated | host fallback | host fallback | host fallback | host fallback | host fallback |
-
-AMDGPU `graph_do_while` falls back to the host-side loop because HIP does not currently expose conditional / while graph nodes (as of ROCm 7.2).
+| `qd.checkpoint` (skip + `yield_on=`) | GPU-side | GPU-side | GPU-side | GPU-side | GPU-side | host-side |
 
 ## Basic usage
 
@@ -154,3 +151,65 @@ Note: the basic `graph=True` path (without `graph_do_while`) does **not** stall
 Therefore on unsupported platforms, you might consider creating a second implementation, which works differently. e.g.:
 - fixed number of loop iterations, so no dependency on gpu data for kernel launch; combined perhaps with:
 - make each kernel 'short-circuit', exit quickly, if the task has already been completed; to avoid running the GPU more than necessary
+
+## Checkpoints with `qd.checkpoint` *(experimental)*
+
+`qd.checkpoint()` marks a section of a graph kernel as a *skippable, optionally yieldable stage*. A typical use case is an algorithm implemented as a graph that may need additional memory allocated part-way through: the graph operations are in-place, so simply retrying the whole graph from the start is not an option. `qd.checkpoint` lets the kernel stop at a point in the graph, surface the reason to the host, let the host fix things up, and resume from that point on the next launch.
+
+```python
+@qd.kernel(graph=True)
+def step(
+    arr: qd.types.ndarray(qd.f32, ndim=1),
+    overflow_flag: qd.types.ndarray(qd.i32, ndim=0),
+    newton_cond: qd.types.ndarray(qd.i32, ndim=0),
+):
+    while qd.graph_do_while(newton_cond):
+        with qd.checkpoint():                       # cp_id 0
+            for i in range(arr.shape[0]):
+                # ...
+                pass
+        with qd.checkpoint(yield_on=overflow_flag): # cp_id 1 (can yield)
+            for i in range(arr.shape[0]):
+                # ...
+                pass
+        with qd.checkpoint():                       # cp_id 2
+            for i in range(arr.shape[0]):
+                # ...
+                pass
+```
+
+Each `with qd.checkpoint(...)` block is assigned a `cp_id` in declaration order, starting at 0. Use the `cp_id` to identify which checkpoint yielded and which checkpoint to resume from; see [Host-side yield / resume loop](#host-side-yield--resume-loop) below.
+
+### Yield mechanism
+
+If `yield_on=foo` is supplied, the body may write a non-zero value into `foo[()]` (for example, when a pre-allocated buffer is too small) to signal "the host needs to handle something before this checkpoint can complete". When that happens:
+
+1. The framework records the checkpoint that yielded (first yielder in declaration order wins).
+2. Every later checkpoint in the same launch is skipped.
+3. Any surrounding `qd.graph_do_while` loop is exited.
+4. `foo[()]` is reset to `0`.
+
+### Host-side yield / resume loop
+
+Kernels with at least one `yield_on=` checkpoint return a `qd.GraphStatus` from every launch (and from `kernel.resume(...)`). The status carries two fields:
+
+- `status.yielded` — `True` iff some `yield_on=` flag was non-zero during this launch.
+- `status.checkpoint` — `cp_id` of the first (in declaration order) checkpoint that fired its flag, or `None` when `yielded` is `False`.
+
+Resume by calling `kernel.resume(..., from_checkpoint=status.checkpoint)`. Every `qd.checkpoint` with `cp_id < from_checkpoint` is skipped on the resume launch; the rest run normally. The canonical host loop:
+
+```python
+status = step(arr, overflow_flag, newton_cond)
+while status.yielded:
+    handle_overflow_for(status.checkpoint, ...)
+    status = step.resume(arr, overflow_flag, newton_cond,
+                         from_checkpoint=status.checkpoint)
+```
+
+Kernels with `qd.checkpoint()` but no `yield_on=` keep their previous return contract (typically `None`) — the `GraphStatus` surface is opt-in via `yield_on=`.
+
+### Restrictions
+
+- Must be used inside `@qd.kernel(graph=True)`.
+- `yield_on=` (when supplied) must be the bare name of a kernel parameter annotated as `qd.types.ndarray(qd.i32, ndim=0)` (a 0-d `i32` ndarray); expressions are not supported.
+- Checkpoints cannot be nested inside other checkpoints.
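
Reviewer note: the yield / resume rules documented above can be modelled in plain Python, which may help when reasoning about the host loop without a GPU. This is only a sketch of the documented semantics, not the Quadrants implementation. `ToyStatus`, `toy_launch`, the stage functions, and the `state` dict are all hypothetical names, and for simplicity every stage shares one flag and there is no `qd.graph_do_while` loop.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToyStatus:
    """Hypothetical stand-in for qd.GraphStatus (illustration only)."""
    yielded: bool
    checkpoint: Optional[int] = None


def toy_launch(stages: List[Callable[[Dict], None]], state: Dict,
               from_checkpoint: int = 0) -> ToyStatus:
    """Run stages in declaration order, modelling the documented skip/yield rules."""
    for cp_id, body in enumerate(stages):
        if cp_id < from_checkpoint:
            continue  # skipped on a resume launch
        body(state)
        if state["flag"] != 0:
            state["flag"] = 0              # the flag is reset after a yield
            return ToyStatus(True, cp_id)  # every later stage is skipped
    return ToyStatus(False)


def prepare(state: Dict) -> None:
    state["log"].append("prepare")


def fill(state: Dict) -> None:
    if state["needed"] > state["capacity"]:
        state["flag"] = 1  # buffer too small: ask the host for help, do no work
        return
    state["log"].append("fill")


def finish(state: Dict) -> None:
    state["log"].append("finish")


state = {"needed": 5, "capacity": 2, "flag": 0, "log": []}
stages = [prepare, fill, finish]  # cp_id 0, 1, 2

status = toy_launch(stages, state)
while status.yielded:
    state["capacity"] *= 2  # the host "fixes things up"
    status = toy_launch(stages, state, from_checkpoint=status.checkpoint)
```

In this model the first stage runs exactly once, because every resume launch skips all stages with a `cp_id` below the one that yielded.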
diff --git a/python/quadrants/lang/_quadrants_callable.py b/python/quadrants/lang/_quadrants_callable.py
@@ -95,6 +95,32 @@ def __init__(self, fn: Callable, wrapper: Callable) -> None:
     def __call__(self, *args, **kwargs):
         return self.wrapper.__call__(*args, **kwargs)
 
+    def resume(self, *args, from_checkpoint: int, **kwargs):
+        """Re-launches the kernel, skipping every ``qd.checkpoint`` with ``cp_id < from_checkpoint``.
+
+        Use only on kernels decorated with ``@qd.kernel(graph=True)`` that contain at least
+        one ``qd.checkpoint(yield_on=...)`` block. The host loop pattern is::
+
+            status = step(arr, overflow_flag, newton_cond)
+            while status.yielded:
+                handle_overflow_for(status.checkpoint, ...)
+                status = step.resume(arr, overflow_flag, newton_cond,
+                                     from_checkpoint=status.checkpoint)
+
+        Returns the same ``GraphStatus`` shape as the plain call.
+
+        Raises ``RuntimeError`` if invoked on a kernel without any ``yield_on=`` checkpoint
+        (there is no resume_point slot to write to, so the call would be a no-op).
+        """
+        if isinstance(from_checkpoint, bool) or not isinstance(from_checkpoint, int) or from_checkpoint < 0:
+            raise RuntimeError(
+                f"from_checkpoint= must be a non-negative integer (typically `status.checkpoint` "
+                f"from the previous launch's GraphStatus); got {from_checkpoint!r}."
+            )
+        # Smuggle the resume cookie past the AST-mapped kwargs path; `Kernel.__call__` pops it
+        # before anything else looks at kwargs.
+        return self.wrapper.__call__(*args, _qd_from_checkpoint=from_checkpoint, **kwargs)
+
     def __get__(self, instance, owner):
         if instance is None:
             return self
@@ -124,3 +150,7 @@ def __setattr__(self, k: str, v: Any) -> None:
     def grad(self, *args, **kwargs) -> "Kernel":
         assert self.quadrants_callable._adjoint is not None
         return self.quadrants_callable._adjoint(self.instance, *args, **kwargs)
+
+    def resume(self, *args, from_checkpoint: int, **kwargs):
+        """Bound-method form of `QuadrantsCallable.resume` (see that docstring)."""
+        return self.quadrants_callable.resume(self.instance, *args, from_checkpoint=from_checkpoint, **kwargs)
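
Reviewer note: the reserved-keyword forwarding used by `resume` can be illustrated with toy classes. This is a minimal sketch, not the real `Kernel` or `QuadrantsCallable`: `ToyKernel` and `ToyCallable` are hypothetical, and the default of `0` for a plain launch is an assumption made for the example.

```python
class ToyKernel:
    """Hypothetical stand-in for the kernel wrapper."""

    def __call__(self, *args, **kwargs):
        # Pop the reserved key first so it never reaches argument mapping.
        # Defaulting to 0 on a plain launch is an assumption of this sketch.
        from_checkpoint = kwargs.pop("_qd_from_checkpoint", 0)
        return {"args": args, "kwargs": kwargs, "from_checkpoint": from_checkpoint}


class ToyCallable:
    """Hypothetical stand-in for the user-facing callable."""

    def __init__(self, wrapper):
        self.wrapper = wrapper

    def __call__(self, *args, **kwargs):
        return self.wrapper(*args, **kwargs)

    def resume(self, *args, from_checkpoint, **kwargs):
        if not isinstance(from_checkpoint, int) or from_checkpoint < 0:
            raise RuntimeError(
                f"from_checkpoint= must be a non-negative integer; got {from_checkpoint!r}."
            )
        # Forward the resume point under a reserved keyword name.
        return self.wrapper(*args, _qd_from_checkpoint=from_checkpoint, **kwargs)


step = ToyCallable(ToyKernel())
plain = step(1, 2, scale=3)
resumed = step.resume(1, 2, scale=3, from_checkpoint=1)
```

Because the reserved key is popped before anything else inspects `kwargs`, user keyword arguments pass through unchanged on both paths.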
diff --git a/python/quadrants/lang/ast/ast_transformer.py b/python/quadrants/lang/ast/ast_transformer.py
@@ -1362,6 +1362,42 @@ def _is_graph_do_while_call(node: ast.expr) -> str | None:
                 return node.args[0].id
         return None
 
+    @staticmethod
+    def _is_checkpoint_call(node: ast.expr) -> tuple[bool, str | None]:
+        """If *node* is a ``qd.checkpoint(...)`` call return ``(True, yield_on_arg_name)``; otherwise
+        ``(False, None)``. ``yield_on_arg_name`` is ``None`` when the user wrote
+        ``qd.checkpoint()`` with no ``yield_on`` kwarg.
+
+        Validates the call shape (no positional args, only ``yield_on=`` as a bare ``ast.Name``)
+        and raises ``QuadrantsSyntaxError`` for misuse so the user gets a clear message at the
+        ``with`` site rather than a vague "not stream_parallel" error later.
+        """
+        if not isinstance(node, ast.Call):
+            return False, None
+        func = node.func
+        is_checkpoint = (isinstance(func, ast.Attribute) and func.attr == "checkpoint") or (
+            isinstance(func, ast.Name) and func.id == "checkpoint"
+        )
+        if not is_checkpoint:
+            return False, None
+        if node.args:
+            raise QuadrantsSyntaxError(
+                "qd.checkpoint() takes no positional arguments; use qd.checkpoint(yield_on=flag) instead"
+            )
+        yield_on_name: str | None = None
+        for kw in node.keywords:
+            if kw.arg != "yield_on":
+                raise QuadrantsSyntaxError(
+                    f"qd.checkpoint() got unexpected keyword argument {kw.arg!r}; only 'yield_on' is supported"
+                )
+            if not isinstance(kw.value, ast.Name):
+                raise QuadrantsSyntaxError(
+                    "qd.checkpoint(yield_on=...) must be the bare name of a kernel parameter "
+                    "(e.g. `yield_on=overflow_flag`); expressions are not supported"
+                )
+            yield_on_name = kw.value.id
+        return True, yield_on_name
+
     @staticmethod
     def build_While(ctx: ASTTransformerFuncContext, node: ast.While) -> None:
         if node.orelse:
@@ -1575,15 +1611,112 @@ def build_With(ctx: ASTTransformerFuncContext, node: ast.With) -> None:
             raise QuadrantsSyntaxError("'with ... as ...' is not supported in Quadrants kernels")
         if not isinstance(item.context_expr, ast.Call):
             raise QuadrantsSyntaxError("'with' in Quadrants kernels requires a call expression")
+
+        is_checkpoint, yield_on_name = ASTTransformer._is_checkpoint_call(item.context_expr)
+        if is_checkpoint:
+            return ASTTransformer._build_checkpoint_with(ctx, node, yield_on_name)
+
         if not FunctionDefTransformer._is_stream_parallel_with(node, ctx.global_vars):
-            raise QuadrantsSyntaxError("'with' in Quadrants kernels only supports qd.stream_parallel()")
+            raise QuadrantsSyntaxError(
+                "'with' in Quadrants kernels only supports qd.stream_parallel() or qd.checkpoint()"
+            )
         if not ctx.is_kernel:
             raise QuadrantsSyntaxError("qd.stream_parallel() can only be used inside @qd.kernel, not @qd.func")
         ctx.ast_builder.begin_stream_parallel()
         build_stmts(ctx, node.body)
         ctx.ast_builder.end_stream_parallel()
         return None
 
+    @staticmethod
+    def _build_checkpoint_with(
+        ctx: ASTTransformerFuncContext,
+        node: ast.With,
+        yield_on_name: str | None,
+    ) -> None:
+        """Handles ``with qd.checkpoint(yield_on=arg):`` blocks.
+
+        Slice 1a: validates the use-site (kernel must be graph=True, no nesting, yield_on must be a kernel
+        parameter) and records the checkpoint's ``yield_on`` arg on the kernel object. Walks the body
+        transparently -- for-loops inside the ``with`` become normal top-level for-loops in the kernel's
+        frontend IR. The ``cp_id`` is assigned by declaration order (list index in
+        ``kernel.checkpoint_yield_on_args``).
+
+        Later slices wire ``cp_id`` through ForLoopConfig → OffloadedTask so the GraphManager can wrap
+        each checkpoint's body kernels in an IF conditional node and insert the yield-check kernel.
+        """
+        if not ctx.is_kernel:
+            raise QuadrantsSyntaxError("qd.checkpoint() can only be used inside @qd.kernel, not @qd.func")
+        kernel = ctx.global_context.current_kernel
+        if not kernel.use_graph:
+            raise QuadrantsSyntaxError("qd.checkpoint() requires @qd.kernel(graph=True)")
+        if getattr(ctx, "_in_checkpoint", False):
+            raise QuadrantsSyntaxError(
+                "qd.checkpoint() cannot be nested inside another qd.checkpoint(); checkpoints in the "
+                "same kernel must be flat siblings (a checkpoint inside qd.graph_do_while is fine)"
+            )
+        if yield_on_name is not None:
+            arg_names = [m.name for m in kernel.arg_metas]
+            if yield_on_name not in arg_names:
+                raise QuadrantsSyntaxError(
+                    f"qd.checkpoint(yield_on={yield_on_name!r}) does not match any parameter of kernel "
+                    f"{kernel.func.__name__!r}. Available parameters: {arg_names}"
+                )
+
+        # Auto-wrap bare top-level statements in the checkpoint body in a one-iteration
+        # `for` loop. The offloader's pending-serial bucket loses the surrounding
+        # `checkpoint_id` and emits such statements as `serial` tasks with `cp_id == -1`,
+        # meaning they would run unconditionally even when the checkpoint is skipped -- a
+        # silent correctness bug. The fix is to lower them as `range_for` tasks instead by
+        # wrapping each bare statement in `for _ in range(1): <stmt>`. We target the specific
+        # statement kinds known to hit the footgun (Assign / AugAssign / AnnAssign /
+        # non-docstring Expr) and leave everything else (For, While, If, With, Pass,
+        # docstring) untouched so they keep working transparently; nested
+        # `with qd.checkpoint(...)` in particular still falls through to the existing
+        # nested-checkpoint check at the start of this method.
+        new_body: list[ast.stmt] = []
+        for i, stmt in enumerate(node.body):
+            needs_wrap = isinstance(stmt, (ast.Assign, ast.AugAssign, ast.AnnAssign))
+            if not needs_wrap and isinstance(stmt, ast.Expr):
+                is_docstring = i == 0 and isinstance(stmt.value, ast.Constant)
+                needs_wrap = not is_docstring
+            if needs_wrap:
+                wrapped = ast.For(
+                    target=ast.Name(id="_", ctx=ast.Store()),
+                    iter=ast.Call(
+                        func=ast.Name(id="range", ctx=ast.Load()),
+                        args=[ast.Constant(value=1)],
+                        keywords=[],
+                    ),
+                    body=[stmt],
+                    orelse=[],
+                )
+                ast.copy_location(wrapped, stmt)
+                ast.fix_missing_locations(wrapped)
+                new_body.append(wrapped)
+            else:
+                new_body.append(stmt)
+        node.body = new_body
+
+        kernel.checkpoint_yield_on_args.append(yield_on_name)
+        # Hand control to the C++ ASTBuilder so that every for-loop emitted by `build_stmts`
+        # below is tagged with this checkpoint's `cp_id` on its `ForLoopConfig.checkpoint_id`.
+        # The C++ counter is the source of truth for cp_id; we cross-check it against the
+        # Python list index so a future refactor that misaligns the two surfaces immediately.
+        cpp_cp_id = ctx.ast_builder.begin_checkpoint()
+        py_cp_id = len(kernel.checkpoint_yield_on_args) - 1
+        assert cpp_cp_id == py_cp_id, (
+            f"C++ ASTBuilder.begin_checkpoint() returned cp_id={cpp_cp_id} but Python "
+            f"kernel.checkpoint_yield_on_args index expected {py_cp_id}; these counters "
+            f"must stay in lockstep so the GraphManager (slice 1c) can index yield_on by cp_id"
+        )
+        ctx._in_checkpoint = True
+        try:
+            build_stmts(ctx, node.body)
+        finally:
+            ctx._in_checkpoint = False
+            ctx.ast_builder.end_checkpoint()
+        return None
+
     @staticmethod
     def build_Pass(ctx: ASTTransformerFuncContext, node: ast.Pass) -> None:
         return None
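
Reviewer note: the auto-wrap step in `_build_checkpoint_with` can be tried in isolation with the standard `ast` module. The sketch below copies the wrapping logic into a free function, `wrap_bare_statements` (a hypothetical name), and applies it to ordinary Python source rather than a Quadrants kernel. It shows only the AST rewrite, not the offloader behaviour that motivates it.

```python
import ast

WRAPPED_KINDS = (ast.Assign, ast.AugAssign, ast.AnnAssign)


def wrap_bare_statements(body):
    """Wrap bare assignments and non-docstring expressions in `for _ in range(1):`."""
    new_body = []
    for i, stmt in enumerate(body):
        needs_wrap = isinstance(stmt, WRAPPED_KINDS)
        if not needs_wrap and isinstance(stmt, ast.Expr):
            is_docstring = i == 0 and isinstance(stmt.value, ast.Constant)
            needs_wrap = not is_docstring
        if needs_wrap:
            wrapped = ast.For(
                target=ast.Name(id="_", ctx=ast.Store()),
                iter=ast.Call(
                    func=ast.Name(id="range", ctx=ast.Load()),
                    args=[ast.Constant(value=1)],
                    keywords=[],
                ),
                body=[stmt],
                orelse=[],
            )
            ast.copy_location(wrapped, stmt)
            ast.fix_missing_locations(wrapped)
            new_body.append(wrapped)
        else:
            new_body.append(stmt)  # For, While, If, With, Pass, docstring: untouched
    return new_body


source = "x = 1\nfor i in range(3):\n    x += i\nfoo(x)\n"
tree = ast.parse(source)
tree.body = wrap_bare_statements(tree.body)

# Run the rewritten module to confirm the wrap does not change behaviour.
calls = []
exec(compile(tree, "<sketch>", "exec"), {"foo": calls.append})
print(ast.unparse(tree))
```

Only the top-level statements are rewritten: the `x += i` inside the existing `for` loop is left alone, matching the "bare top-level statements" wording in the comment above.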

diff --git a/python/quadrants/lang/ast/ast_transformers/function_def_transformer.py b/python/quadrants/lang/ast/ast_transformers/function_def_transformer.py
@@ -504,6 +504,12 @@ def build_FunctionDef(
 
         if ctx.is_kernel:
             FunctionDefTransformer._validate_stream_parallel_exclusivity(node.body, ctx.global_vars)
+            kernel = ctx.global_context.current_kernel
+            if kernel is not None:
+                # Reset before walking the body so re-materialisations (e.g. when a templated kernel is compiled
+                # with a different argument shape) start from an empty list. Mirrors how `graph_do_while_arg`
+                # gets overwritten unconditionally during AST traversal.
+                kernel.checkpoint_yield_on_args = []
 
         with ctx.variable_scope_guard():
             build_stmts(ctx, node.body)
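
Reviewer note: a toy model of why the list is reset before each walk of the kernel body. `ToyKernel` and `materialize` are hypothetical names, and the sketch assumes that the C++ side restarts its `cp_id` counter at 0 for each materialisation, which this diff does not show.

```python
class ToyKernel:
    """Hypothetical stand-in holding only the per-kernel checkpoint list."""

    def __init__(self):
        self.checkpoint_yield_on_args = []


def materialize(kernel, yield_on_names, reset=True):
    """Record one entry per checkpoint and return the Python-side cp_ids."""
    if reset:
        kernel.checkpoint_yield_on_args = []
    cp_ids = []
    for name in yield_on_names:
        kernel.checkpoint_yield_on_args.append(name)
        cp_ids.append(len(kernel.checkpoint_yield_on_args) - 1)
    return cp_ids


names = [None, "overflow_flag", None]

with_reset = ToyKernel()
first = materialize(with_reset, names)
second = materialize(with_reset, names)  # e.g. recompiled for a new argument shape

without_reset = ToyKernel()
materialize(without_reset, names, reset=False)
stale = materialize(without_reset, names, reset=False)
```

Without the reset, the second pass yields indices 3, 4, 5. Under the stated assumption, those would no longer match a counter that restarted at 0, so the lockstep assertion in `_build_checkpoint_with` would fail.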

diff --git a/python/quadrants/lang/graph_status.py b/python/quadrants/lang/graph_status.py
@@ -0,0 +1,50 @@
+"""Plain-Python container returned from graph kernels that contain ``qd.checkpoint(yield_on=...)``.
+
+Lives in its own module (with no Quadrants-internal imports) so it can be imported safely from
+both ``kernel.py`` and ``misc.py`` without re-introducing the circular import chain that
+``misc.py -> impl.py -> kernel.py`` would create.
+
+Re-exported via ``qd.lang.misc`` (and therefore as ``qd.GraphStatus``) for the user-facing
+canonical import path.
+"""
+
+from __future__ import annotations
+
+
+class GraphStatus:
+    """Result returned by a graph kernel that contains ``qd.checkpoint(yield_on=...)`` blocks.
+
+    Returned from ``kernel(...)`` and ``kernel.resume(..., from_checkpoint=cp)`` whenever the
+    kernel was decorated with ``@qd.kernel(graph=True)`` and contains at least one checkpoint
+    that declared a ``yield_on=`` parameter. Read ``status.yielded`` to decide whether to keep
+    running the host loop, and ``status.checkpoint`` to find out which checkpoint asked the
+    host to handle something.
+
+    Canonical usage (mirrors the qipc re-entrant pattern; see ``graph.md``)::
+
+        status = step(arr, overflow_flag, newton_cond)
+        while status.yielded:
+            handle_overflow_for(status.checkpoint, ...)
+            status = step.resume(arr, overflow_flag, newton_cond,
+                                 from_checkpoint=status.checkpoint)
+
+    Attributes:
+        yielded: ``True`` iff one of the kernel's ``yield_on=`` checkpoints fired its flag on
+            the most recent launch. ``False`` means the kernel completed normally and the host
+            loop should exit.
+        checkpoint: ``cp_id`` of the checkpoint whose ``yield_on=`` flag was non-zero (or
+            ``None`` when ``yielded`` is ``False``). Pass it to ``kernel.resume(...,
+            from_checkpoint=cp)`` to skip every checkpoint with a lower ``cp_id`` on the next
+            launch.
+    """
+
+    __slots__ = ("yielded", "checkpoint")
+
+    def __init__(self, yielded: bool, checkpoint: int | None):
+        self.yielded = yielded
+        self.checkpoint = checkpoint
+
+    def __repr__(self) -> str:
+        if self.yielded:
+            return f"GraphStatus(yielded=True, checkpoint={self.checkpoint})"
+        return "GraphStatus(yielded=False)"