Skip to content

restore: persist descriptor refs in job_info, not RestoreDetails#171336

Closed
kev-cao wants to merge 5 commits into
cockroachdb:masterfrom
kev-cao:restore/remove-descs-from-planning
Closed

restore: persist descriptor refs in job_info, not RestoreDetails#171336
kev-cao wants to merge 5 commits into
cockroachdb:masterfrom
kev-cao:restore/remove-descs-from-planning

Conversation

@kev-cao

@kev-cao kev-cao commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

At ~1M descriptors, persisting full marshaled descriptor protos into
RestoreDetails.{Table,Type,Schema,Database,Function}Descs (and from
there into the system.job_info payload row) pushes the row past the
64 MiB raft command size limit, so execution can't even start
ingesting. #170974 already removed planning-side population of these
fields; this PR removes execution-side population too.

Each restored descriptor's (ID, Version) tuple is persisted to a
dedicated system.job_info row keyed by descriptor type
(restore_table_desc_refs, restore_type_desc_refs, ...). Per-row
size is bounded by kv.raft.command.max_size independently of the
payload, so at 1M descs each row is on the order of ~16 MiB — well
under the limit. Descriptor bodies are read from KV at Resume time
via the batched descsCol.MutableByID().Descs(ctx, ids) pattern that
prefetchDescriptors already uses.

A new cluster version V26_3_DescriptorIDsInRestoreDetails gates the
behavior change. Below the gate the new info-key rows are written
and the legacy slices are populated, so a restore job created on a
new binary can still be driven by a node on the old binary during a
mixed-version window. Once the gate is active only the info-key rows
are written. Readers always prefer the info-key rows and fall back to
the legacy slices, which covers jobs created before any new code ran.

The early commits scaffold protos, helpers, and the gate; the middle
commits hoist a single descriptor fetch in doResume and migrate
each reader (the rekey list, stats notification, dropDescriptors,
setDescriptorsOffline, etc.); the last three commits flip the gate
on and fix two paths the gate broke (the revlog pipeline, which read
descriptors from the now-nil slice, and the download job's
OnFailOrCancel, which looked up info-key rows under the wrong job
ID).

Informs: #170974

Epic: none

Release note: None

@kev-cao kev-cao requested review from a team as code owners June 2, 2026 03:12
@kev-cao kev-cao requested review from andrew-r-thomas and removed request for a team June 2, 2026 03:12
@trunk-io

trunk-io Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@blathers-crl

blathers-crl Bot commented Jun 2, 2026

Copy link
Copy Markdown

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity

Copy link
Copy Markdown
Member

This change is Reviewable

@kev-cao kev-cao force-pushed the restore/remove-descs-from-planning branch from 7ebec85 to 61973b7 Compare June 2, 2026 03:49
@kev-cao kev-cao force-pushed the restore/remove-descs-from-planning branch from 61973b7 to 9d0d904 Compare June 2, 2026 03:57
@kev-cao kev-cao changed the title backup: persist restore descriptor refs in job_info, not RestoreDetails restore: persist descriptor refs in job_info, not RestoreDetails Jun 2, 2026
@kev-cao kev-cao added the do-not-merge bors won't merge a PR with this label. label Jun 2, 2026
@kev-cao kev-cao force-pushed the restore/remove-descs-from-planning branch from 9d0d904 to 75a02f7 Compare June 2, 2026 04:19
@kev-cao

kev-cao commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Claude (Opus 4.7):

Splitting this in two for easier review. This PR stays open as the
umbrella / end-to-end view; the two child PRs are what actually land:

This umbrella PR will not be merged. After both children merge I'll
close it.

kev-cao and others added 5 commits June 2, 2026 00:24
Add the building blocks to persist restored descriptor `(ID, Version)`
tuples in dedicated `system.job_info` rows instead of inlining the full
descriptor payloads on `RestoreDetails`. The latter exceeds the 64 MiB
raft command limit on the `system.job_info` row write at ~1M
descriptors, blocking RESTORE during job resumption.

This commit is foundation-only — no behavior change. It adds:

- `RestoreDescRef` and `RestoreDescRefs` protos in `backuppb`.
- Five per-type info-key constants (`restoreTableDescRefsKey`, etc.).
- Read/write helpers (`getDescRefs`, `writeDescRefs`,
  `tableDescRefs`, `typeDescRefs`, `schemaDescRefs`,
  `databaseDescRefs`, `functionDescRefs`, `allDescRefs`) that prefer
  the info-key row and fall back to the legacy
  `details.{Type}Descs` slice for jobs created before the info-key
  writes existed.
- The `V26_3_DescriptorIDsInRestoreDetails` cluster version that
  later commits will use to gate the legacy descriptor-slice writes
  off once the cluster has finalized the upgrade.

Subsequent commits dual-write the info-key rows, hoist a single
descriptor fetch in `doResume`, migrate readers off the legacy
slices, and gate the legacy write on the new CV.

Informs: #170669

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
createImportingDescriptors and publishDescriptors both persist the
restore's descriptor lists onto RestoreDetails and call SetDetails. At
~1M descriptors, that payload exceeds the 64 MiB raft command limit and
the restore stalls.

Alongside the existing legacy slice population, write a slim
(ID, Version) tuple list per descriptor type to a dedicated
system.job_info row keyed by restore{Table,Type,Schema,Database,Function}DescRefsKey,
via the helpers introduced in the previous commit. Each row is bounded
by kv.raft.command.max_size in isolation, so the per-type rows stay
well under the limit even at the scale that breaks the combined
payload.

No reader consumes the info-key rows yet; this commit is a behavior-
preserving dual-write. Subsequent commits will hoist a single
descriptor fetch in doResume, migrate readers off the legacy slices,
and finally gate the legacy writes off above V26_3_DescriptorIDsInRestoreDetails.

Informs: #170669

Release note: None
Today three readers — createRestoreFlows (rekey list, pkIDs),
remapAndFilterRelevantStatistics (missing-stats warnings), and
notifyStatsRefresherOfNewTables — each pull table descriptors back out
of details.TableDescs. That field is the very thing we want to stop
populating to escape the 64 MiB raft command limit at high descriptor
counts, so the readers need a different source.

Fetch the freshly-written OFFLINE descriptors from KV once in doResume,
right after createImportingDescriptors, and thread the resulting
nstree.Catalog into each reader via a new restoredTableDescs helper.
publishDescriptors keeps its own fresh fetch because it needs txn-bound
mutable copies to write back. The online-restore download path runs in
a different goroutine and fetches the published catalog on demand at
its call site.

No semantic change: previously the readers consumed the OFFLINE
descriptors via the snapshot baked into RestoreDetails during
createImportingDescriptors; now they consume the same descriptors via
a fresh KV read at the same point in the job lifecycle.

Informs: #170669

Release note: None
Migrate restore-job readers off the legacy
RestoreDetails.{Table,Type,Schema,Database,Function}Descs slices to the
new (ID, Version) tuples persisted under dedicated system.job_info
rows. Descriptor-body readers (publishDescriptors, protectRestoreTargets,
the empty-restore fast path, ReportResults) now source their data from
either the catalog hoisted in doResume or per-type ref helpers; ID-only
readers (dropDescriptors, removeExistingTypeBackReferences,
getUndroppedTablesFromRestore, checkRestoredTableDescriptorVersions,
setDescriptorsOffline) drive directly off the refs.

The revlog entry points (BuildRekeys, SplitAndScatterRevlogSpans,
RunRevlogFinalMerge, RestoreFromRevisionLog) also read OFFLINE table
descriptors from RestoreDetails.TableDescs. Take a
[]catalog.TableDescriptor parameter on each entry point so the caller
in restore_job.go can thread the hoisted catalog (offlineDescs) through
via restoredTableDescs. The planning-side validation call in
restore_planning.go passes nil since planning has never had access to
the materialized table descriptors; it remains a sanity check on
DescriptorRewrites and the codec.

The legacy slices are still dual-written by the writers in
createImportingDescriptors and publishDescriptors; gating those writes
on the new cluster version comes in a follow-up commit. With this
commit the info-key rows are the source of truth for everything the
restore job reads.

Informs: #170669

Release note: None
createImportingDescriptors and publishDescriptors previously dual-wrote
descriptor bodies to RestoreDetails.{Table,Type,Schema,Database,Function}Descs
alongside the new (ID, Version) info-key rows. Skip the legacy slice
writes once the cluster has crossed V26_3_DescriptorIDsInRestoreDetails;
on the new code path the info-key rows are the source of truth. The
legacy fields and their dual-write are still needed in mixed-version
clusters so that older binaries resuming the job can drive it from the
RestoreDetails payload. The fields themselves will be removed in 27.1.

maybeWriteDownloadJob creates a separate download job that inherits the
link job's RestoreDetails by value. The descriptor-ref helpers
(tableDescRefs etc.) look up info-key rows in system.job_info keyed by
the calling job's ID, so the download job's OnFailOrCancel path found
no rows under its own ID and fell back to the now-nil legacy slices,
leaving cleanup with an empty descriptor set. Copy each
restore_*_desc_refs row from the link job to the download job at
creation time so the download job has the same source of truth.

Add TestRestoreDescriptorRefs to lock down behavior on both sides of
the gate: below the gate, the legacy slices must still be populated
and the helpers must succeed; at the gate, the legacy slices must be
nil and the info-key rows must be the source of truth. A second
subtest seeds a job_info table with no descriptor-ref rows but
populated legacy slices, simulating a job created on the prior binary
and resumed on the new one, to exercise the helpers' legacy-fallback
branch. Migrate TestProtectRestoreTargets to read through
databaseDescRefs / tableDescRefs so it works on both sides of the gate.

Informs: #170669

Release note: None
@kev-cao kev-cao force-pushed the restore/remove-descs-from-planning branch from 75a02f7 to c1c74b6 Compare June 2, 2026 04:26
@kev-cao kev-cao closed this by deleting the head repository Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge bors won't merge a PR with this label.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants