restore: scaffold + dual-write descriptor refs to job_info #171340
Closed
kev-cao wants to merge 2 commits into
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
This was referenced Jun 2, 2026
Add the building blocks to persist restored descriptor `(ID, Version)`
tuples in dedicated `system.job_info` rows instead of inlining the full
descriptor payloads on `RestoreDetails`. The latter exceeds the 64 MiB
raft command limit on the `system.job_info` row write at ~1M
descriptors, blocking RESTORE during job resumption.
This commit is foundation-only — no behavior change. It adds:
- `RestoreDescRef` and `RestoreDescRefs` protos in `backuppb`.
- Five per-type info-key constants (`restoreTableDescRefsKey`, etc.).
- Read/write helpers (`getDescRefs`, `writeDescRefs`,
`tableDescRefs`, `typeDescRefs`, `schemaDescRefs`,
`databaseDescRefs`, `functionDescRefs`, `allDescRefs`) that prefer
the info-key row and fall back to the legacy
`details.{Type}Descs` slice for jobs created before the info-key
writes existed.
- The `V26_3_DescriptorIDsInRestoreDetails` cluster version that
later commits will use to gate the legacy descriptor-slice writes
off once the cluster has finalized the upgrade.
Subsequent commits dual-write the info-key rows, hoist a single
descriptor fetch in `doResume`, migrate readers off the legacy
slices, and gate the legacy write on the new CV.
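
The "prefer the info-key row, fall back to the legacy slice" behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `DescRef`, `legacyDesc`, `infoStore`, and `descRefs` are hypothetical stand-ins for `RestoreDescRef`, the inlined descriptors on `RestoreDetails`, the `system.job_info` rows, and the real helpers.

```go
package main

import "fmt"

// DescRef is a hypothetical stand-in for the RestoreDescRef proto.
type DescRef struct {
	ID      uint32
	Version uint64
}

// legacyDesc is a hypothetical stand-in for a full descriptor inlined on
// RestoreDetails; Payload represents the marshaled descriptor body.
type legacyDesc struct {
	ID      uint32
	Version uint64
	Payload []byte
}

// infoStore is a hypothetical in-memory stand-in for system.job_info rows,
// keyed by info key.
type infoStore map[string][]DescRef

// descRefs returns the refs stored under key if that row exists. Otherwise it
// derives the refs from the legacy slice, which is what a job created before
// the info-key writes existed would have.
func descRefs(store infoStore, key string, legacy []legacyDesc) []DescRef {
	if refs, ok := store[key]; ok {
		return refs
	}
	refs := make([]DescRef, 0, len(legacy))
	for _, d := range legacy {
		refs = append(refs, DescRef{ID: d.ID, Version: d.Version})
	}
	return refs
}

func main() {
	legacy := []legacyDesc{{ID: 104, Version: 2, Payload: []byte("full descriptor")}}

	// Old job: no info-key row, so the legacy slice is used.
	fmt.Println(descRefs(infoStore{}, "restore_table_desc_refs", legacy))

	// New job: the info-key row is present and takes precedence.
	store := infoStore{"restore_table_desc_refs": {{ID: 104, Version: 3}}}
	fmt.Println(descRefs(store, "restore_table_desc_refs", legacy))
}
```

The sketch treats "row absent" as the only fallback trigger; how the real helpers distinguish an absent row from an empty one is not stated here.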
Informs: #170669
Release note: None
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
createImportingDescriptors and publishDescriptors both persist the
restore's descriptor lists onto RestoreDetails and call SetDetails. At
~1M descriptors, that payload exceeds the 64 MiB raft command limit and
the restore stalls.
Alongside the existing legacy slice population, write a slim
(ID, Version) tuple list per descriptor type to a dedicated
system.job_info row keyed by restore{Table,Type,Schema,Database,Function}DescRefsKey,
via the helpers introduced in the previous commit. Each row is bounded
by kv.raft.command.max_size in isolation, so the per-type rows stay
well under the limit even at the scale that breaks the combined
payload.
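
As a rough illustration of the dual-write described above, the sketch below populates the legacy slices and, alongside them, one slim ref list per descriptor type. All types and the `dualWrite` function are hypothetical; the real code writes through the job-info storage inside a transaction, which is replaced here by a plain map. The key string `restore_type_desc_refs` is assumed by analogy with `restore_table_desc_refs`.

```go
package main

import "fmt"

// descriptor is a hypothetical stand-in for a full descriptor proto.
type descriptor struct {
	ID      uint32
	Version uint64
	Payload []byte
}

// descRef is a hypothetical stand-in for RestoreDescRef.
type descRef struct {
	ID      uint32
	Version uint64
}

// restoreDetails models only two of the five legacy slices, for brevity.
type restoreDetails struct {
	TableDescs []descriptor
	TypeDescs  []descriptor
}

// jobInfo is a hypothetical in-memory stand-in for system.job_info rows.
type jobInfo map[string][]descRef

const (
	tableRefsKey = "restore_table_desc_refs"
	typeRefsKey  = "restore_type_desc_refs" // assumed name
)

// toRefs drops the payload and keeps only the (ID, Version) tuple.
func toRefs(descs []descriptor) []descRef {
	refs := make([]descRef, 0, len(descs))
	for _, d := range descs {
		refs = append(refs, descRef{ID: d.ID, Version: d.Version})
	}
	return refs
}

// dualWrite keeps the legacy slices populated and also writes one row per
// descriptor type, so no reader has to change yet.
func dualWrite(details *restoreDetails, info jobInfo, tables, types []descriptor) {
	details.TableDescs = tables // legacy write, unchanged
	details.TypeDescs = types   // legacy write, unchanged
	info[tableRefsKey] = toRefs(tables)
	info[typeRefsKey] = toRefs(types)
}

func main() {
	var details restoreDetails
	info := jobInfo{}
	tables := []descriptor{{ID: 104, Version: 1, Payload: []byte("table")}}
	types := []descriptor{{ID: 105, Version: 1, Payload: []byte("type")}}
	dualWrite(&details, info, tables, types)
	fmt.Println(len(details.TableDescs), info[tableRefsKey], info[typeRefsKey])
}
```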
No reader consumes the info-key rows yet; this commit is a behavior-
preserving dual-write. Subsequent commits will hoist a single
descriptor fetch in doResume, migrate readers off the legacy slices,
and finally gate the legacy writes off above V26_3_DescriptorIDsInRestoreDetails.
Informs: #170669
Release note: None
dac8bb2 to ab15b16
At ~1M descriptors, persisting full marshaled descriptor protos into
`RestoreDetails.{Table,Type,Schema,Database,Function}Descs` (and from
there into the `system.job_info` payload row) pushes the row past the
64 MiB raft command size limit, so execution can't even start
ingesting. #170974 already removed planning-side population of these
fields; this work removes execution-side population.
This PR is the first half of that effort — the write side. It
adds the building blocks and starts populating the new info-key rows
alongside the legacy slices, without changing what any reader
consumes. No user-visible behavior change.
What lands here:
- `RestoreDescRef` + `RestoreDescRefs` protos in `backuppb`.
- Per-type info-key constants (`restore_table_desc_refs`, ...).
- `V26_3_DescriptorIDsInRestoreDetails` cluster version. Unused on
  this PR; the read-side migration PR will gate the legacy-write skip
  on it.
- Read/write helpers (`tableDescRefs`, ..., `allDescRefs`) that prefer
  the info-key row and fall back to the legacy slice for jobs created
  before any new code ran. Dead code until the read-side PR consumes
  them.
- Dual-writes in `createImportingDescriptors` and `publishDescriptors`
  that persist a slim `(ID, Version)` list per descriptor type to its
  dedicated `system.job_info` row in the existing `DescsTxn`. Each row
  is bounded by `kv.raft.command.max_size` independently (~16 MiB at
  1M descs).
The read-side migration + gate flip that actually delivers the
storage win lands separately as a stacked follow-up to this PR.
Informs: #170669
Epic: none
Release note: None