Skip to content

restore: scaffold + dual-write descriptor refs to job_info#171340

Closed
kev-cao wants to merge 2 commits into
cockroachdb:masterfrom
kev-cao:restore/scaffold-desc-refs
Closed

restore: scaffold + dual-write descriptor refs to job_info#171340
kev-cao wants to merge 2 commits into
cockroachdb:masterfrom
kev-cao:restore/scaffold-desc-refs

Conversation

@kev-cao

@kev-cao kev-cao commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

At ~1M descriptors, persisting full marshaled descriptor protos into
RestoreDetails.{Table,Type,Schema,Database,Function}Descs (and from
there into the system.job_info payload row) pushes the row past the
64 MiB raft command size limit, so execution can't even start
ingesting. #170974 already removed planning-side population of these
fields; this work removes execution-side population.

This PR is the first half of that effort — the write side. It
adds the building blocks and starts populating the new info-key rows
alongside the legacy slices, without changing what any reader
consumes. No user-visible behavior change.

What lands here:

  • RestoreDescRef + RestoreDescRefs protos in backuppb.
  • Five per-type info-key constants (restore_table_desc_refs, ...).
  • V26_3_DescriptorIDsInRestoreDetails cluster version. Unused on
    this PR; the read-side migration PR will gate the legacy-write skip
    on it.
  • Read/write helpers (tableDescRefs, ..., allDescRefs) that prefer
    the info-key row and fall back to the legacy slice for jobs created
    before any new code ran. Dead code until the read-side PR consumes
    them.
  • Dual-write call sites in createImportingDescriptors and
    publishDescriptors that persist a slim (ID, Version) list per
    descriptor type to its dedicated system.job_info row in the
    existing DescsTxn. Each row is bounded by kv.raft.command.max_size
    independently (~16 MiB at 1M descs).

The read-side migration + gate flip that actually delivers the
storage win lands separately as a stacked follow-up to this PR.

Informs: #170669

Epic: none

Release note: None

@kev-cao kev-cao requested review from a team as code owners June 2, 2026 04:20
@kev-cao kev-cao requested review from msbutler and removed request for a team June 2, 2026 04:20
@trunk-io

trunk-io Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@blathers-crl

blathers-crl Bot commented Jun 2, 2026

Copy link
Copy Markdown

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity

Copy link
Copy Markdown
Member

This change is Reviewable

kev-cao and others added 2 commits June 2, 2026 00:24
Add the building blocks to persist restored descriptor `(ID, Version)`
tuples in dedicated `system.job_info` rows instead of inlining the full
descriptor payloads on `RestoreDetails`. The latter exceeds the 64 MiB
raft command limit on the `system.job_info` row write at ~1M
descriptors, blocking RESTORE during job resumption.

This commit is foundation-only — no behavior change. It adds:

- `RestoreDescRef` and `RestoreDescRefs` protos in `backuppb`.
- Five per-type info-key constants (`restoreTableDescRefsKey`, etc.).
- Read/write helpers (`getDescRefs`, `writeDescRefs`,
  `tableDescRefs`, `typeDescRefs`, `schemaDescRefs`,
  `databaseDescRefs`, `functionDescRefs`, `allDescRefs`) that prefer
  the info-key row and fall back to the legacy
  `details.{Type}Descs` slice for jobs created before the info-key
  writes existed.
- The `V26_3_DescriptorIDsInRestoreDetails` cluster version that
  later commits will use to gate the legacy descriptor-slice writes
  off once the cluster has finalized the upgrade.

Subsequent commits dual-write the info-key rows, hoist a single
descriptor fetch in `doResume`, migrate readers off the legacy
slices, and gate the legacy write on the new CV.

Informs: #170669

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
createImportingDescriptors and publishDescriptors both persist the
restore's descriptor lists onto RestoreDetails and call SetDetails. At
~1M descriptors, that payload exceeds the 64 MiB raft command limit and
the restore stalls.

Alongside the existing legacy slice population, write a slim
(ID, Version) tuple list per descriptor type to a dedicated
system.job_info row keyed by restore{Table,Type,Schema,Database,Function}DescRefsKey,
via the helpers introduced in the previous commit. Each row is bounded
by kv.raft.command.max_size in isolation, so the per-type rows stay
well under the limit even at the scale that breaks the combined
payload.

No reader consumes the info-key rows yet; this commit is a behavior-
preserving dual-write. Subsequent commits will hoist a single
descriptor fetch in doResume, migrate readers off the legacy slices,
and finally gate the legacy writes off above V26_3_DescriptorIDsInRestoreDetails.

Informs: #170669

Release note: None
@kev-cao kev-cao force-pushed the restore/scaffold-desc-refs branch from dac8bb2 to ab15b16 Compare June 2, 2026 04:26
@kev-cao kev-cao closed this by deleting the head repository Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants