restore: persist descriptor refs in job_info, not RestoreDetails by kev-cao · Pull Request #171336 · cockroachdb/cockroach

kev-cao · 2026-06-02T03:12:04Z

At ~1M descriptors, persisting full marshaled descriptor protos into
RestoreDetails.{Table,Type,Schema,Database,Function}Descs (and from
there into the system.job_info payload row) pushes the row past the
64 MiB raft command size limit, so execution can't even start
ingesting. #170974 already removed planning-side population of these
fields; this PR removes execution-side population too.

Each restored descriptor's (ID, Version) tuple is persisted to a
dedicated system.job_info row keyed by descriptor type
(restore_table_desc_refs, restore_type_desc_refs, ...). Per-row
size is bounded by kv.raft.command.max_size independently of the
payload, so at 1M descs each row is on the order of ~16 MiB — well
under the limit. Descriptor bodies are read from KV at Resume time
via the batched descsCol.MutableByID().Descs(ctx, ids) pattern that
prefetchDescriptors already uses.

A new cluster version V26_3_DescriptorIDsInRestoreDetails gates the
behavior change. Below the gate the new info-key rows are written
and the legacy slices are populated, so a restore job created on a
new binary can still be driven by a node on the old binary during a
mixed-version window. Once the gate is active only the info-key rows
are written. Readers always prefer the info-key rows and fall back to
the legacy slices, which covers jobs created before any new code ran.

The early commits scaffold protos, helpers, and the gate; the middle
commits hoist a single descriptor fetch in doResume and migrate
each reader (the rekey list, stats notification, dropDescriptors,
setDescriptorsOffline, etc.); the last three commits flip the gate
on and fix two paths the gate broke (the revlog pipeline, which read
descriptors from the now-nil slice, and the download job's
OnFailOrCancel, which looked up info-key rows under the wrong job
ID).

Informs: #170974

Epic: none

Release note: None

trunk-io · 2026-06-02T03:12:08Z

Merging to master in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

blathers-crl · 2026-06-02T03:12:11Z

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

cockroach-teamcity · 2026-06-02T03:12:34Z

This change is

kev-cao · 2026-06-02T04:21:04Z

Claude (Opus 4.7):

Splitting this in two for easier review. This PR stays open as the
umbrella / end-to-end view; the two child PRs are what actually land:

restore: scaffold + dual-write descriptor refs to job_info #171340 — write side: scaffold (proto, info-key constants, CV, helpers)
and dual-write (ID, Version) tuples to dedicated system.job_info
rows. Purely additive — no reader consumes them yet, no user-visible
behavior change.
restore: migrate readers + gate legacy descriptor writes #171341 — read side + gate flip: hoist a single prefetchDescriptors
call in doResume, migrate every reader off
RestoreDetails.{Type}Descs, gate the legacy writes on
V26_3_DescriptorIDsInRestoreDetails, copy descriptor refs into the
download job, and add TestRestoreDescriptorRefs. Stacked on restore: scaffold + dual-write descriptor refs to job_info #171340
— its diff currently includes the bottom two commits from restore: scaffold + dual-write descriptor refs to job_info #171340 and
will shrink once that lands.

This umbrella PR will not be merged. After both children merge I'll
close it.

Add the building blocks to persist restored descriptor `(ID, Version)` tuples in dedicated `system.job_info` rows instead of inlining the full descriptor payloads on `RestoreDetails`. The latter exceeds the 64 MiB raft command limit on the `system.job_info` row write at ~1M descriptors, blocking RESTORE during job resumption. This commit is foundation-only — no behavior change. It adds: - `RestoreDescRef` and `RestoreDescRefs` protos in `backuppb`. - Five per-type info-key constants (`restoreTableDescRefsKey`, etc.). - Read/write helpers (`getDescRefs`, `writeDescRefs`, `tableDescRefs`, `typeDescRefs`, `schemaDescRefs`, `databaseDescRefs`, `functionDescRefs`, `allDescRefs`) that prefer the info-key row and fall back to the legacy `details.{Type}Descs` slice for jobs created before the info-key writes existed. - The `V26_3_DescriptorIDsInRestoreDetails` cluster version that later commits will use to gate the legacy descriptor-slice writes off once the cluster has finalized the upgrade. Subsequent commits dual-write the info-key rows, hoist a single descriptor fetch in `doResume`, migrate readers off the legacy slices, and gate the legacy write on the new CV. Informs: #170669 Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

createImportingDescriptors and publishDescriptors both persist the restore's descriptor lists onto RestoreDetails and call SetDetails. At ~1M descriptors, that payload exceeds the 64 MiB raft command limit and the restore stalls. Alongside the existing legacy slice population, write a slim (ID, Version) tuple list per descriptor type to a dedicated system.job_info row keyed by restore{Table,Type,Schema,Database,Function}DescRefsKey, via the helpers introduced in the previous commit. Each row is bounded by kv.raft.command.max_size in isolation, so the per-type rows stay well under the limit even at the scale that breaks the combined payload. No reader consumes the info-key rows yet; this commit is a behavior- preserving dual-write. Subsequent commits will hoist a single descriptor fetch in doResume, migrate readers off the legacy slices, and finally gate the legacy writes off above V26_3_DescriptorIDsInRestoreDetails. Informs: #170669 Release note: None

Today three readers — createRestoreFlows (rekey list, pkIDs), remapAndFilterRelevantStatistics (missing-stats warnings), and notifyStatsRefresherOfNewTables — each pull table descriptors back out of details.TableDescs. That field is the very thing we want to stop populating to escape the 64 MiB raft command limit at high descriptor counts, so the readers need a different source. Fetch the freshly-written OFFLINE descriptors from KV once in doResume, right after createImportingDescriptors, and thread the resulting nstree.Catalog into each reader via a new restoredTableDescs helper. publishDescriptors keeps its own fresh fetch because it needs txn-bound mutable copies to write back. The online-restore download path runs in a different goroutine and fetches the published catalog on demand at its call site. No semantic change: previously the readers consumed the OFFLINE descriptors via the snapshot baked into RestoreDetails during createImportingDescriptors; now they consume the same descriptors via a fresh KV read at the same point in the job lifecycle. Informs: #170669 Release note: None

Migrate restore-job readers off the legacy RestoreDetails.{Table,Type,Schema,Database,Function}Descs slices to the new (ID, Version) tuples persisted under dedicated system.job_info rows. Descriptor-body readers (publishDescriptors, protectRestoreTargets, the empty-restore fast path, ReportResults) now source their data from either the catalog hoisted in doResume or per-type ref helpers; ID-only readers (dropDescriptors, removeExistingTypeBackReferences, getUndroppedTablesFromRestore, checkRestoredTableDescriptorVersions, setDescriptorsOffline) drive directly off the refs. The revlog entry points (BuildRekeys, SplitAndScatterRevlogSpans, RunRevlogFinalMerge, RestoreFromRevisionLog) also read OFFLINE table descriptors from RestoreDetails.TableDescs. Take a []catalog.TableDescriptor parameter on each entry point so the caller in restore_job.go can thread the hoisted catalog (offlineDescs) through via restoredTableDescs. The planning-side validation call in restore_planning.go passes nil since planning has never had access to the materialized table descriptors; it remains a sanity check on DescriptorRewrites and the codec. The legacy slices are still dual-written by the writers in createImportingDescriptors and publishDescriptors; gating those writes on the new cluster version comes in a follow-up commit. With this commit the info-key rows are the source of truth for everything the restore job reads. Informs: #170669 Release note: None

createImportingDescriptors and publishDescriptors previously dual-wrote descriptor bodies to RestoreDetails.{Table,Type,Schema,Database,Function}Descs alongside the new (ID, Version) info-key rows. Skip the legacy slice writes once the cluster has crossed V26_3_DescriptorIDsInRestoreDetails; on the new code path the info-key rows are the source of truth. The legacy fields and their dual-write are still needed in mixed-version clusters so that older binaries resuming the job can drive it from the RestoreDetails payload. The fields themselves will be removed in 27.1. maybeWriteDownloadJob creates a separate download job that inherits the link job's RestoreDetails by value. The descriptor-ref helpers (tableDescRefs etc.) look up info-key rows in system.job_info keyed by the calling job's ID, so the download job's OnFailOrCancel path found no rows under its own ID and fell back to the now-nil legacy slices, leaving cleanup with an empty descriptor set. Copy each restore_*_desc_refs row from the link job to the download job at creation time so the download job has the same source of truth. Add TestRestoreDescriptorRefs to lock down behavior on both sides of the gate: below the gate, the legacy slices must still be populated and the helpers must succeed; at the gate, the legacy slices must be nil and the info-key rows must be the source of truth. A second subtest seeds a job_info table with no descriptor-ref rows but populated legacy slices, simulating a job created on the prior binary and resumed on the new one, to exercise the helpers' legacy-fallback branch. Migrate TestProtectRestoreTargets to read through databaseDescRefs / tableDescRefs so it works on both sides of the gate. Informs: #170669 Release note: None

kev-cao requested review from a team as code owners June 2, 2026 03:12

kev-cao requested review from andrew-r-thomas and removed request for a team June 2, 2026 03:12

kev-cao force-pushed the restore/remove-descs-from-planning branch from 7ebec85 to 61973b7 Compare June 2, 2026 03:49

kev-cao mentioned this pull request Jun 2, 2026

restore: WriteDescriptors batch exceeds Raft command size limit at 100k descriptors #171338

Open

kev-cao force-pushed the restore/remove-descs-from-planning branch from 61973b7 to 9d0d904 Compare June 2, 2026 03:57

kev-cao changed the title ~~backup: persist restore descriptor refs in job_info, not RestoreDetails~~ restore: persist descriptor refs in job_info, not RestoreDetails Jun 2, 2026

kev-cao added the do-not-merge bors won't merge a PR with this label. label Jun 2, 2026

kev-cao force-pushed the restore/remove-descs-from-planning branch from 9d0d904 to 75a02f7 Compare June 2, 2026 04:19

This was referenced Jun 2, 2026

restore: scaffold + dual-write descriptor refs to job_info #171340

Closed

restore: migrate readers + gate legacy descriptor writes #171341

Closed

kev-cao and others added 5 commits June 2, 2026 00:24

kev-cao force-pushed the restore/remove-descs-from-planning branch from 75a02f7 to c1c74b6 Compare June 2, 2026 04:26

kev-cao closed this by deleting the head repository Jun 3, 2026

spilchen mentioned this pull request Jun 3, 2026

sql/catalog: support batched descriptor writes with deferred validation #171400

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

restore: persist descriptor refs in job_info, not RestoreDetails#171336

restore: persist descriptor refs in job_info, not RestoreDetails#171336
kev-cao wants to merge 5 commits into
cockroachdb:masterfrom
kev-cao:restore/remove-descs-from-planning

kev-cao commented Jun 2, 2026

Uh oh!

trunk-io Bot commented Jun 2, 2026

Uh oh!

blathers-crl Bot commented Jun 2, 2026

Uh oh!

cockroach-teamcity commented Jun 2, 2026

Uh oh!

kev-cao commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kev-cao commented Jun 2, 2026

Uh oh!

trunk-io Bot commented Jun 2, 2026

Uh oh!

blathers-crl Bot commented Jun 2, 2026

Uh oh!

cockroach-teamcity commented Jun 2, 2026

Uh oh!

kev-cao commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants