[Graph] Add qd.checkpoint by hughperkins · Pull Request #725 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-06-05T18:22:41Z

No description provided.

First slice of the qd.checkpoint primitive (design doc in perso_hugh/doc/qipc/reentrant.md, sections 5.1-5.2). This adds the Python API, AST recognition, validation, and per-kernel metadata list -- no IR, runtime, or cross-backend changes yet. Every checkpoint body still runs unconditionally; the docs page calls this out as experimental. Following slices wire checkpoint_id end-to-end (ForLoopConfig -> OffloadedTask, mirroring stream_parallel_group_id), build per-checkpoint CUDA IF conditional nodes on SM 9.0+ / CUDA 12.4+, auto-insert yield-check kernels, and add the GraphStatus host API with step.resume(from_checkpoint=).

Mirrors the stream_parallel_group_id propagation chain for a new checkpoint_id field tagging the enclosing qd.checkpoint() scope on each parallel for-loop:

1. ForLoopConfig.checkpoint_id (-1 = no checkpoint)
2. FrontendForStmt.checkpoint_id
3. RangeForStmt / StructForStmt .checkpoint_id (lower_ast.cpp)
4. OffloadedStmt.checkpoint_id (offload.cpp)
5. OffloadedTask.checkpoint_id (codegen_cuda/amdgpu)

Driven from Python by new ASTBuilder.begin_checkpoint() / end_checkpoint() exposed via export_lang.cpp; the AST transformer calls them around the with-block body and asserts the C++ counter agrees with the Python list index.

Also folds the new field into the offline-cache key emit list, the clone() methods, and the QD_IO_DEF / QD_STMT_DEF_FIELDS sets so the fastcache stays content-addressable.

Pure plumbing -- no behaviour change. Existing tests cover regression. Slice 1c will be the first consumer (GraphManager builds per-cp_id IF conditional nodes on CUDA 12.4+).

…lkit limit)

… SM 9.0+)

Adds the runtime side of qd.checkpoint() on CUDA 12.4+. Each contiguous run of OffloadedTasks sharing the same checkpoint_id is wrapped in a CUDA graph IF conditional node, preceded by a small gate kernel that sets the IF handle from `cp_id >= *resume_point`. Tasks with cp_id == -1 remain top-level siblings as before.

- New gate kernel `_qd_checkpoint_if_gate` ships as a pre-built fatbin (sm_90/100/120) generated by scripts/build_checkpoint_gate_fatbin.py. Matches the graph_do_while_cond fatbin pattern so the user doesn't need libcudadevrt at runtime.
- CachedGraph now owns a `resume_point_dev_ptr` int32 device scalar (zero-initialised) when the kernel uses checkpoints. The launch-path memcpy resets it to 0 each call; slice 2's host API will set it to `from_checkpoint=` when resuming.
- GraphManager exposes num_checkpoints_on_last_call() for tests to confirm the IF nodes were actually emitted. Wired through KernelLauncher / Program to Python via the same chain as num_nodes_on_last_call.
- Tests assert the IF count for both flat-checkpoint and inside-graph_do_while kernels; behavioural assertions hold on every backend (no behaviour change yet, slice 1d adds yield).

Falls back to "no-op CUDA graph build" on pre-SM-9.0 (caller does the host-side flat launch) mirroring the existing graph_do_while fallback. Slice 4/5 will add the indirect-dispatch alternative for AMDGPU/Metal/Vulkan/older CUDA.
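To make the grouping rule above concrete, here is a minimal host-side Python model of it. It is only a sketch: the function names are hypothetical and not part of quadrants, and it models the `cp_id >= *resume_point` predicate and the "contiguous run" wrapping in plain Python rather than with CUDA graph nodes.

```python
from itertools import groupby

NO_CHECKPOINT = -1  # cp_id used for tasks outside any qd.checkpoint() scope


def group_contiguous_runs(task_cp_ids):
    """Split a task list into contiguous runs that share one checkpoint_id."""
    runs = []
    index = 0
    for cp_id, members in groupby(task_cp_ids):
        count = len(list(members))
        runs.append((cp_id, list(range(index, index + count))))
        index += count
    return runs


def gate_open(cp_id, resume_point):
    """Gate predicate quoted in the commit message: run when cp_id >= resume_point."""
    return cp_id >= resume_point


def tasks_that_run(task_cp_ids, resume_point=0):
    """Indices of tasks that execute for a given resume point (hypothetical helper)."""
    executed = []
    for cp_id, task_indices in group_contiguous_runs(task_cp_ids):
        # Tasks with cp_id == -1 are unconditional; checkpointed runs are gated.
        if cp_id == NO_CHECKPOINT or gate_open(cp_id, resume_point):
            executed.extend(task_indices)
    return executed


task_cp_ids = [-1, 0, 0, -1, 1, 2, 2]
runs = group_contiguous_runs(task_cp_ids)
fresh = tasks_that_run(task_cp_ids, resume_point=0)
resumed = tasks_that_run(task_cp_ids, resume_point=1)
```

With `resume_point=0` every task runs; with `resume_point=1` only the run tagged cp_id 0 is skipped, while the unconditional tasks still execute.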

…es + arg-id plumbing

…with-yield

…into GraphManager

… early-exit

…h for yield_signal readback

…e test header

…_checkpoint=) host API

…rcular import

…) end-to-end

…t resume_point at end of WHILE iter

…now takes resume_point arg)

…o checkpoints materialise as distinct offloaded tasks

…heckpoints; document offloader limitation

…ckpoint coverage

…resume/yield

… iteration

The CPU host-branch gating treated `ctx.resume_from_checkpoint` as if it applied for the entire `kernel.resume(...)` launch, but `from_checkpoint` is meant to skip cp_ids only on the FIRST iteration of a resumed `graph_do_while` body. The CUDA-native cond-with-yield kernel resets `*resume_point = 0` between iterations for the same reason (`graph_do_while_cond.cu`); the CPU emulation now mirrors that by zeroing `ctx.resume_from_checkpoint` after each completed iteration.

Found while probing slice 6 on the local 5090 box: cp 0 sat before a yielding cp 1 inside `graph_do_while`, and a `resume(from_checkpoint=1)` launch incorrectly skipped cp 0 on every iteration instead of just the first, so counter A was wrong (1 vs expected 3).
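The difference between the two behaviours can be sketched with a small pure-Python simulation. This is an illustrative model only: `run_do_while` and its parameters are hypothetical names, no yields are modelled, and it does not reproduce the exact counter values quoted above, which depend on the preceding launch.

```python
def run_do_while(num_iters, cp_ids, resume_from_checkpoint, reset_after_first_iter):
    """Return the (iteration, cp_id) pairs whose checkpoint body executes."""
    executed = []
    resume_point = resume_from_checkpoint
    for iteration in range(num_iters):
        for cp_id in cp_ids:
            if cp_id < resume_point:
                continue  # gated out: below the resume point
            executed.append((iteration, cp_id))
        if reset_after_first_iter:
            # Fixed behaviour: from_checkpoint only applies to the first iteration.
            resume_point = 0
    return executed


buggy = run_do_while(3, [0, 1], 1, reset_after_first_iter=False)
fixed = run_do_while(3, [0, 1], 1, reset_after_first_iter=True)

cp0_runs_buggy = sum(1 for _, cp_id in buggy if cp_id == 0)
cp0_runs_fixed = sum(1 for _, cp_id in fixed if cp_id == 0)
```

In the buggy variant cp 0 never runs during the resumed launch; in the fixed variant it is skipped only on the first iteration and runs on the remaining two.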

…en CPU tests

Two fixes after running the slice 6 CPU fallback end-to-end on a local x64 host:

1. `KernelLauncher` was resetting `last_yield_cp_id_on_last_call_` at the top of every LLVM launch, including the aux kernels that `ndarray.to_numpy()` triggers between a yielding launch and the user's `GraphStatus.yielded` check. Restrict the reset to launches that have at least one resolved `checkpoint_yield_on=` arg, so non-graph aux kernels can't clobber the value. Mirrors the CUDA path, where only `GraphManager::launch_cached_graph` touches the field.
2. Tests in `test_checkpoint.py` and `test_checkpoint_resume_offset.py` were skipping the yield/resume behavioural cases on every non-CUDA-native backend. Slice 6 now implements the same contract on CPU, so add a `_supports_checkpoint_yield_resume()` predicate that admits CUDA-native (slice 1d) and x64 (slice 6), and route the behavioural skips through it. Pure introspection assertions that only exist on CUDA (e.g. `get_graph_num_checkpoints_on_last_call`) keep using `_is_checkpoint_if_path_native`.

Locally on the 5090 box: 54 passed, 8847 deselected (no skips on the yield/resume cases for either x64 or cuda).

Reflect that slice 6 landed CPU/x64 support for the checkpoint yield/resume contract:

- Backend support table: CPU now reads "host-branch gating" / "implemented (host-branch gating)" instead of "runs unconditionally" / "not yet (slices 4-6)". Remaining "not yet" entries collapse to slices 4-5.
- Experimental status block: notes that CPU emulates the same contract via host-branch gating in `KernelLauncher` (same Python API, no device IF nodes).
- Yield-mechanism section: now titled "CUDA SM 9.0+ and CPU/x64", describes the device-kernel vs host-branch lowerings side by side and calls out the WHILE early-exit equivalence.
- Host-side yield/resume loop: explicit that backends without the gate still return GraphStatus but always with `yielded=False`.
- Authoring tip about `for _ in range(1):` now references "checkpoint gate" rather than the CUDA-specific "IF gate".
- New "Backend coverage notes" subsection summarising what each backend does today, including the CPU/x64 prototyping angle.

Builds one HIP graph per contiguous run of same-cp_id offloaded tasks (plus one per cp_id<0 unconditional batch), then has the launcher iterate batches with host-branch gating that mirrors slice 6's CPU contract:

- `resume_point` from `ctx.resume_from_checkpoint` skips cp_ids strictly below it on this launch.
- `yield_signal` (-1 / cp_id) tracks first-yielder-wins; once set, later cp_id>=0 batches are skipped.
- After each yielding batch, a stream sync + D2H of the user's `yield_on` flag observes the yield, updates the launcher's `last_yield_cp_id_on_last_call_`, and memsets the flag back to 0 on device.

This is the CPU launcher's logic ported to HIP-graph launches. ROCm 7.2's HIP has neither conditional graph nodes nor indirect dispatch, so the design's "gate kernel + indirect dispatch" recipe from reentrant.md §6.2 is replaced with the equivalent host-orchestrated sub-graph launches. Pinned-host yield flag (§10) stays a future enhancement.

Non-checkpoint kernels continue down the original single-graph path unchanged (detected up-front in `try_launch`); contact area is just the new sub_graph_execs / batch_cp_ids fields on `CachedGraph` and the `launch_cached_checkpoint_graph` helper. `graph_do_while + checkpoint` still falls through to the streaming launcher today (next commit).

…ntial paths)

…ost gating

Plumbs the slice-6 CPU host-branch gating into the AMDGPU streaming launcher so `graph_do_while + qd.checkpoint(yield_on=...)` works on AMDGPU even though HIP 7.2 has neither conditional graph nodes nor indirect dispatch.

- `launch_llvm_kernel` now mirrors the CUDA / sub-graph resolver and populates `ctx.checkpoint_yield_on_dev_ptrs` from the per-cp arg-id table the frontend pushes through `LaunchContextBuilder`.
- `launch_offloaded_tasks` reads `ctx.resume_from_checkpoint` into a local `resume_point`, tracks `yield_signal` per launch, and skips cp_id>=0 tasks when `cp_id < resume_point` or `yield_signal != -1`. After the last task of a yielding checkpoint, a stream-sync + D2H of the user's flag is the host equivalent of the device-side yield-check kernel; non-zero clears the flag back on device and records the first-yielder via `graph_manager_.set_last_yield_cp_id_on_last_call`.
- `launch_offloaded_tasks_with_do_while` breaks on yield (avoids the spin-forever the CPU launcher comment in `runtime/cpu/kernel_launcher.cpp` calls out) and clears `ctx.resume_from_checkpoint` between iters so `from_checkpoint=cp` applies only to the first WHILE iter.
- `GraphManager` gains a `set_last_yield_cp_id_on_last_call` setter so the streaming and sub-graph paths can both feed the same field the Python `GraphStatus` surface reads.

Test predicates: `_supports_checkpoint_yield_resume_in_while_loop()` now returns True on AMDGPU (was False after the first slice-4 commit), so all four `test_resume_offset_*` cases and `test_checkpoint_yield_exits_graph_do_while_early` run on AMDGPU.

…oint

Slice 4's design used "indirect dispatch" for non-CUDA-12.4+ backends to gate per-checkpoint kernel launches. Vulkan + Metal have vkCmdDispatchIndirect / dispatchThreadgroupsIndirect respectively, but the GFX runtime today records and submits a single command list per launch_kernel call rather than running inside a pre-recorded graph, so "skip a launch" is naturally implemented at the host task-loop level instead of via indirect dispatch.

Changes (mirror the CPU slice 6 / AMDGPU slice 4 contract):

- `TaskAttributes` gains `checkpoint_id` (propagated from `OffloadedStmt::checkpoint_id` in `spirv_codegen.cpp`'s serial/range_for/struct_for emit paths). Persisted via QD_IO_DEF so offline-cache hits keep the gating wired.
- `GfxRuntime::launch_kernel`'s task loop reads `host_ctx.resume_from_checkpoint` and tracks `yield_signal` in-launch. Tasks whose `cp_id` is < `resume_point` (first iter only) or that follow an observed yield are skipped before the pipeline-bind + dispatch record.
- After the last task in a same-cp_id run, the runtime flushes + wait_idles + readbacks the user's `yield_on=` flag through `Device::readback_data`, sets `last_yield_cp_id_on_last_call_` / `yield_signal` on non-zero, and clears the flag with `upload_data` before re-opening the cmdlist for subsequent tasks. Stalling here is acceptable for slice 4; pinned-host yield flags stay future enhancement per reentrant.md §10.
- `GfxRuntime::last_yield_cp_id_on_last_call()` exposes the field; `gfx::KernelLauncher::get_graph_last_yield_cp_id_on_last_call()` routes through, so Python `GraphStatus.yielded` becomes accurate on vulkan / metal.
- `launch_offloaded_tasks_with_do_while` breaks on yield + clears `ctx.resume_from_checkpoint` between iters, matching the CPU / AMDGPU implementations.

Test predicates `_supports_checkpoint_yield_resume{,_in_while_loop}` now return True for vulkan + metal so 27 tests cover those backends. Build + run pending on cluster (Vulkan) and macOS host (Metal).

github-actions · 2026-06-17T17:20:46Z

Total: 70 file(s) changed, +11169 -1149 code lines.

PR 725's AI bot 'feature factorization' check flagged kernel.py for accreting ~80 lines of checkpoint-feature-specific blocks across __call__ / launch_kernel. The PR already follows the extract-to-module pattern for graph_status.py / checkpoint.py / checkpoint_transformer.py; this commit applies the same treatment to the kernel.py side.

New file python/quadrants/lang/kernel_checkpoint.py exposes free functions that Kernel delegates to via one-liner calls:

- validate_resume_cookie
- translate_user_label_to_internal_cp_id
- init_yield_on_arg_id_table
- maybe_record_yield_on_arg
- forward_yield_on_table_to_ctx
- maybe_build_graph_status

kernel.py shrinks by ~48 lines net (68 deletions, 20 insertions). All existing tests still pass (35/35 in test_checkpoint + test_resume_offset on x64); pre-commit clean. No behaviour change.

Also rewrap the @qd.kernel(checkpoints=...) docstring param in kernel_impl.py from 80c to 120c -- AI bot 'line wrapping' check flagged the 6-line description as wrapping at 80 instead of the project-wide 120-char convention.

github-actions · 2026-06-17T17:53:57Z

Diff coverage: 90% · 913 lines, 93 missing

github-actions · 2026-06-17T18:35:29Z

Diff coverage: 90% · 927 lines, 93 missing

Two failures on PR 725 / commit e939301 ('extract qd.checkpoint plumbing into kernel_checkpoint.py'):

1. Linux test_api.py: 'kernel_checkpoint' leaked into dir(quadrants) because 'from quadrants.lang import kernel_checkpoint as _checkpoint_helpers' in kernel.py attaches the submodule as an attribute of the parent package, which then bubbles up through 'from quadrants.lang import *' in quadrants/__init__.py. Other internal modules (kernel_impl, misc, ops, etc.) are masked via an explicit exclusion list in quadrants/lang/__init__.py's __all__ filter; add kernel_checkpoint to that list. Verified locally: 'kernel_checkpoint' in dir(quadrants) -> False after the fix.
2. Line wrapping (AI bot): three pre-existing comment runs in runtime/cpu/kernel_launcher.cpp and runtime/cuda/graph_manager.cpp wrapped at 76-84 chars instead of the project-wide 120-char limit. Reflowed those three runs.

53/53 test_api + test_checkpoint + test_resume_offset tests pass on x64; pre-commit clean.
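The leak described in the first failure is standard Python import behaviour and can be reproduced outside quadrants. The sketch below builds a throwaway package on disk; the package and module names (`leaky_pkg`, `fixed_pkg`, `helpers`) and the `_EXCLUDED` list are made up for illustration, and the `__all__` filter is only an assumption about the general shape of such a filter, not a copy of the one in quadrants/lang/__init__.py.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def public_names(pkg_name, excluded):
    """Build a throwaway package, import it, and return its public dir() entries."""
    with tempfile.TemporaryDirectory() as tmp:
        lang = Path(tmp) / pkg_name / "lang"
        lang.mkdir(parents=True)
        # The top-level package re-exports whatever lang lists in __all__.
        (lang.parent / "__init__.py").write_text(f"from {pkg_name}.lang import *\n")
        (lang / "helpers.py").write_text("VALUE = 1\n")
        # Importing a submodule under a private alias still binds lang.helpers.
        (lang / "kernel.py").write_text(f"from {pkg_name}.lang import helpers as _helpers\n")
        (lang / "__init__.py").write_text(
            f"from {pkg_name}.lang import kernel\n"
            f"_EXCLUDED = {sorted(excluded)!r}\n"
            "__all__ = [n for n in dir() if not n.startswith('_') and n not in _EXCLUDED]\n"
        )
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            pkg = importlib.import_module(pkg_name)
            return sorted(n for n in dir(pkg) if not n.startswith("_"))
        finally:
            sys.path.remove(tmp)


leaky = public_names("leaky_pkg", excluded={"kernel"})
fixed = public_names("fixed_pkg", excluded={"kernel", "helpers"})
```

The private alias `_helpers` only hides the name inside `kernel.py`; the import system still sets `helpers` as an attribute on the parent package, which is why it has to be masked explicitly.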

github-actions · 2026-06-17T19:57:18Z

Total: 71 file(s) changed, +11186 -1150 code lines.

github-actions · 2026-06-17T20:27:17Z

Diff coverage: 92% · 927 lines, 76 missing

- quadrants/ir/frontend_ir.h:27 (checkpoint_id field docstring): 8 lines wrapping at ~90c -> 7 lines packing to ~119c
- quadrants/ir/frontend_ir.h:1092 (begin_checkpoint docstring): 7 lines wrapping at ~88c -> 6 lines packing to ~118c
- quadrants/program/kernel_launcher.h:26 (get_graph_num_checkpoints_on_last_call docstring): 4 lines wrapping at ~89c -> 3 lines packing to ~120c

Used rewrap-comments-120c skill (find_underwrapped.py --diff). All other reported runs are #include blocks (structural, can't reflow).

github-actions · 2026-06-17T21:47:20Z

Total: 71 file(s) changed, +11186 -1150 code lines.

github-actions · 2026-06-17T22:47:40Z

Diff coverage: 92% · 927 lines, 76 missing

duburcqa · 2026-06-18T08:12:05Z

+
+> **Experimental.** `qd.checkpoint`, `qd.GraphStatus`, and `kernel.resume(from_checkpoint=...)` are experimental APIs. The shape of the public surface (the context-manager signature, the `@qd.kernel(checkpoints=True)` flag, the `GraphStatus` fields, the host-side resume loop, the error messages, and the cross-backend lowering details) may change in any future release without a deprecation cycle.
+
+`qd.checkpoint` lets a graph kernel pause partway through, surface a reason to the host, let the host fix things up, and resume from where it paused on the next launch. An example use-case is an algorithm implemented as a graph that may need to allocate additional memory partway through, where the graph operations are in-place, and therefore not idempotent, and therefore for which simply retrying the whole graph from the start is not an option.


As I already mentioned, "graph operations" is not a standard term. You cannot use it without defining what it means; alternatively, reformulate this sentence to avoid the term.

I would suggest to define in parentheses what "idempotent" means in programming. It is not really common knowledge and understanding what it means in this context is critical.

I would argue that idempotent is a standard programming term. It is not compiler specific. It is not physics simulation specific. I've seen the term used by many engineers in my previous company, during standard PRs.

replaced 'graph operations' with 'operations in the graph'

replaced 'not idempotent' with 'cannot be rerun'

'cannot be rerun' without altering / corrupting the output

duburcqa · 2026-06-18T08:17:45Z

+The framework never writes into your `yield_on` buffer — you own it end-to-end. That means:
+
+- Before the **first** launch, initialise it to `0` (a freshly allocated `qd.ndarray` is not guaranteed to be zeroed).
+- Before each **resume** launch, reset it to `0` (otherwise the body of the same checkpoint sees the stale non-zero value and yields again on the same condition, looping forever).


Could be worth using some ⚠️ tag.

wouldn't that be non-ASCII? 🤔

No, it uses :warning:
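For readers following this thread, the ownership rule for `yield_on` can be illustrated with a toy host loop. Everything here is a stand-in written for illustration: `Status` and `FakeKernel` are hypothetical classes that only imitate the `yielded` / `checkpoint` fields and the `resume(from_checkpoint=...)` call mentioned in the docs under review, and a one-element list plays the role of the `yield_on` buffer. No quadrants API is used.

```python
from dataclasses import dataclass


@dataclass
class Status:
    """Hypothetical stand-in for the status object returned by a launch."""
    yielded: bool
    checkpoint: int


class FakeKernel:
    """Toy kernel with one checkpoint (id 0) that yields while the buffer is too small."""

    def __init__(self, needed):
        self.needed = needed

    def __call__(self, buf, yield_on):
        return self.resume(buf, yield_on, from_checkpoint=0)

    def resume(self, buf, yield_on, from_checkpoint):
        if yield_on[0] != 0:
            return Status(True, 0)  # stale flag: the checkpoint yields again
        if len(buf) < self.needed:
            yield_on[0] = 1         # user code in the checkpoint body requests a yield
            return Status(True, 0)
        return Status(False, -1)


def run_to_completion(kernel, buf, yield_on, reset_flag=True, max_launches=10):
    yield_on[0] = 0                        # initialise before the first launch
    status = kernel(buf, yield_on)
    launches = 1
    while status.yielded and launches < max_launches:
        buf.extend([0] * len(buf))         # host fixes things up (doubles the buffer)
        if reset_flag:
            yield_on[0] = 0                # reset before each resume launch
        status = kernel.resume(buf, yield_on, from_checkpoint=status.checkpoint)
        launches += 1
    return status, launches


good_status, good_launches = run_to_completion(FakeKernel(needed=8), [0, 0], [0])
bad_status, bad_launches = run_to_completion(FakeKernel(needed=8), [0, 0], [0], reset_flag=False)
```

With the reset in place the toy kernel finishes after three launches; without it the stale flag keeps the loop yielding until the `max_launches` guard stops it.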

duburcqa · 2026-06-18T08:26:18Z

+                         from_checkpoint=status.checkpoint)
+```
+
+### Restrictions


Could you explain / be more explicit about what happens during resume?

Stating clearly that the entire checkpoint block is re-executed, and that it is the user's responsibility to ensure idempotent behaviour when a checkpoint is needed? Because if the state is altered during the checkpoint block, resuming is not going to save you I guess? Whatever the answer, it should be very clear in the doc.

Beyond that, how does checkpointing work under the hood? Does it snapshot all the input data by copy before yielding, or does it just return as-is? If no copy is made, this means that resuming must be done "right away", without further altering the data in between, otherwise it is some kind of undefined behaviour.

Another important point: what if I don't want to resume in such a case and I just want to move on to another kernel and continue like this? Is it supported, or must resume happen?

I think it is essential to clarify all these points in the documentation.

checkpoint does NOT require idempotent behavior. This is the entire purpose of checkpoint: to be able to interrupt and resume graphs that are NOT idempotent.

ah, the checkpoint block itself. right.

the checkpoint block itself actually does not so much require idempotence, as requiring that it is atomic: it either succeeds completely, or fails without changing anything.

as an example, in the case of allocation issues, the checkpoint block looks like:

- do we have enough memory available?
- no: exit now
- yes: ok, let's proceed with running the sort etc
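The check-then-proceed shape described above can be sketched in plain Python. This is a hypothetical illustration, not quadrants code: `checkpoint_body`, the list used as storage, and the capacity numbers are all made up. It shows a body that is atomic (it either completes or changes nothing) without being idempotent (running it twice successfully is not the same as running it once).

```python
def checkpoint_body(storage, capacity, new_items):
    """Append new_items in place and re-sort, or change nothing at all.

    Returns True on success and False when the caller should yield to the host.
    """
    if len(storage) + len(new_items) > capacity:
        return False           # "no: exit now", nothing has been touched
    storage.extend(new_items)  # "yes: proceed", in-place and not idempotent
    storage.sort()
    return True


storage = [5, 1]
before = list(storage)

# First attempt: not enough room, so the body bails out before mutating anything.
ok_first = checkpoint_body(storage, capacity=3, new_items=[4, 2])
unchanged_after_failure = storage == before

# Host grows the capacity, then the same body is re-run from the top.
ok_retry = checkpoint_body(storage, capacity=4, new_items=[4, 2])
after_retry = list(storage)

# Running it once more succeeds again but changes the result: not idempotent.
ok_again = checkpoint_body(storage, capacity=8, new_items=[4, 2])
after_again = list(storage)
```

Because the failed attempt leaves `storage` untouched, re-executing the whole body on resume is safe even though the successful path mutates state in place.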

added 'resume where' section

> the checkpoint block itself actually does not so much require idempotence, as requiring that it is atomic: it either succeeds completely, or fails without changing anything.

Yeah, this is exactly what I meant by « ensure idempotent behaviour when checkpoint is needed »

"fails without changing anything" I feel is not idempotent? Idempotent means that calling the function multiple times is identical in effect to calling it once. But if it fails the first time, it would only be idempotent if it always failed thereafter I feel?

Integrate origin/main's nested graph_do_while feature (#728) with the checkpointing work on hp/graph-checkpoint.

Key reconciliations:

- GraphRegionTag (ir.h): add checkpoint_id alongside graph_do_while_level_id and stream_parallel_group_id; update constructors and operator==.
- frontend_ir.{h,cpp}: ForLoopConfig/FrontendForStmt carry checkpoint_id; ASTBuilder stamps GraphRegionTag with the 3-arg constructor.
- offload.cpp: propagate checkpoint_id through assemble_serial_statements / push_serial_statement so serial side-effecting tasks inherit the checkpoint id while pure-only tasks keep -1.
- gen_offline_cache_key.cpp: emit both graph_do_while_level_id and checkpoint_id for OffloadedStmt and other statements.
- python/quadrants/lang/kernel.py: re-integrate GraphDoWhileLevel dataclass / graph_do_while_levels / _graph_do_while_level_stack from main with hp/graph-checkpoint's checkpoint metadata + fast-cache load/store paths and launch_kernel.
- function_def_transformer.py: allow qd.checkpoint With blocks inside qd.graph_do_while bodies (new _is_checkpoint_with helper).
- cuda/amdgpu/cpu graph_manager + kernel_launcher: combine nested-GDW and checkpoint plumbing, dropping qipc-integration-only bits (graph_parallel, add_empty_node, pre-Hopper flat graph, cp_id_storage).
- cpu/kernel_launcher.cpp: drop accidental *flag = 0; re-introduction so the user-owned yield_on flag is not cleared by the runtime.

Tests:

- All checkpoint and graph_do_while tests pass on x64 and cuda.
- Full x64 test suite: 4174 passed, 2053 skipped (1 pre-existing perf-flake test_concurrent_streams_with_events, unrelated).

github-actions · 2026-06-18T15:36:33Z

Total: 72 file(s) changed, +11368 -1167 code lines.

github-actions · 2026-06-18T16:32:39Z

Diff coverage: 91% · 943 lines, 82 missing

prepare_checkpoint_launch_state and finalize_checkpoint_readback already exist in checkpoint_launch.cpp and are declared in runtime.h, but runtime.cpp still inlined the identical logic instead of calling them. Replace both inline blocks with calls to the existing helpers (-94 / +7) to silence the CI feature-factorization check and keep the launch_kernel hot path focused.

github-actions · 2026-06-18T20:46:10Z

Total: 72 file(s) changed, +11301 -1167 code lines.

github-actions · 2026-06-18T21:50:01Z

Diff coverage: 91% · 943 lines, 82 missing

hughperkins and others added 30 commits June 4, 2026 13:10

[Graph] Add checkpoint IF-gate kernel source (slice 1c, part 1/N)

1af49f0

[Graph] Drop sm_110 from checkpoint gate fatbin script (CUDA 12.9 too…

1e18653

…lkit limit)

[Graph] Slice 1d (part 1): yield-check + cond-with-yield kernel sourc…

b4612a0

…es + arg-id plumbing

[Graph] Slice 1d (part 2): regenerate fatbins for yield-check + cond-…

e138cfd

…with-yield

[Graph] Slice 1d (part 3): wire yield-check kernel + cond-with-yield …

4492b35

…into GraphManager

[Graph] Slice 1d (part 4): tests for yield-check / yield-race / WHILE…

79ebf57

… early-exit

[Graph] Slice 1d (fix): route first launch through launch_cached_grap…

28a2f29

…h for yield_signal readback

[Graph] Slice 1d (tests): correct yield-first-wins semantics

5dedc4e

[Graph] Slice 1d (docs): document yield mechanism in graph.md + updat…

977f3f0

…e test header

[Graph] Slice 2 (impl): GraphStatus return value + kernel.resume(from…

bf906d7

…_checkpoint=) host API

[Graph] Slice 2 (fix): move GraphStatus to its own module to avoid ci…

dd37b28

…rcular import

[Graph] Slice 2 (tests): GraphStatus return + resume(from_checkpoint=…

788f607

…) end-to-end

[Graph] Slice 2 (docs): GraphStatus + kernel.resume() user guide

37ed4e4

[Graph] Slice 3: port qipc test_resume_offset.cu scenarios A+B + rese…

4d5bc25

…t resume_point at end of WHILE iter

[Graph] Slice 3: regenerate condition kernel fatbin (cond-with-yield …

d7fe3f6

…now takes resume_point arg)

[Graph] Slice 3 (fix): rewrite resume-offset tests to use for-loops s…

7123030

…o checkpoints materialise as distinct offloaded tasks

[Graph] Slice 3 (fix): use for _ in range(1) for scalar work inside c…

c685f24

…heckpoints; document offloader limitation

[Graph] Slice 7 (docs polish): backend-support table now lists qd.che…

0c94986

…ckpoint coverage

[Graph] Slice 6: CPU fallback - host-branch gating for qd.checkpoint …

e35e9cb

…resume/yield

[Graph] Slice 4 (AMDGPU): open up yield/resume tests on amdgpu (seque…

22f8d26

…ntial paths)

[Docs] Update graph.md for slice 4/5 (AMDGPU + Vulkan + Metal coverage)

840ac3d

hughperkins added 3 commits June 17, 2026 12:31

Merge branch 'main' into hp/graph-checkpoint

f21617d

address review commetns

67da464

precoommit

a398139

hughperkins added the awaiting review pass New PR or review comments addressed label Jun 17, 2026

duburcqa reviewed Jun 18, 2026

Comment thread docs/source/user_guide/graph.md

duburcqa reviewed Jun 18, 2026

Comment thread docs/source/user_guide/graph.md

duburcqa reviewed Jun 18, 2026

duburcqa removed the awaiting review pass New PR or review comments addressed label Jun 18, 2026

hughperkins added 2 commits June 18, 2026 14:31

address comments

28f630b

add Resume where setion

1cf0343

hughperkins added the awaiting review pass New PR or review comments addressed label Jun 18, 2026

hughperkins added 2 commits June 18, 2026 15:14

precommit

35e49d4


