Want ability to take a crucible dataset out of provisioning pool #3480

Closed
askfongjojo opened this issue Jul 3, 2023 · 4 comments · Fixed by #7912
Labels
customer For any bug reports or feature requests tied to customer requests

Comments

@askfongjojo

Similar to #2483 (the ability to take a sled out of provisioning pool), there are multiple circumstances (e.g. physical disk failure, crucible-agent failing to come up #3416) that cause a crucible dataset to be non-functional. Regardless of whether the logic for picking crucible is fully randomized, we need a support tool to mark a crucible as "bad".

The tool (probably an API) will be used by Oxide support initially. Eventually, we'd want a heart-beat mechanism in place to make the decision of when to take a crucible out of or back into the provisioning pool.

@askfongjojo askfongjojo added this to the MVP milestone Jul 3, 2023
@askfongjojo
Author

This API can also be triggered optionally by the "take sled out of provisioning pool" feature, since taking a sled out will mean not provisioning to its disks in most cases.

@askfongjojo
Author

Volume/region background clean-up also needs to skip crucible zones that have already been removed (see #4331).

@morlandi7 morlandi7 modified the milestones: MVP, 5 Nov 7, 2023
@morlandi7 morlandi7 modified the milestones: 5, 6 Dec 5, 2023
@morlandi7 morlandi7 added the known issue To include in customer documentation and training label Dec 18, 2023
@morlandi7 morlandi7 modified the milestones: 6, 7 Jan 25, 2024
@askfongjojo askfongjojo removed the known issue To include in customer documentation and training label Mar 9, 2024
@morlandi7 morlandi7 modified the milestones: 7, 8 Mar 12, 2024
@davepacheco
Collaborator

@sunshowers I think this issue was resolved as of #5032. Is that right?

@davepacheco
Collaborator

Answering my own question: we discussed this briefly in today's update call. This ticket is different from #5032 in two ways: it covers not just an underlying database facility, but an API and tool that can be used by support (at least) to control this behavior; and it covers dataset-level granularity, whereas #5032 only covers sled-level.

@morlandi7 morlandi7 modified the milestones: 8, 9 May 13, 2024
@morlandi7 morlandi7 modified the milestones: 9, 10 Jul 17, 2024
@morlandi7 morlandi7 removed this from the 10 milestone Aug 13, 2024
@twinfees twinfees added the customer For any bug reports or feature requests tied to customer requests label Mar 27, 2025
jmpesp added a commit to jmpesp/omicron that referenced this issue Apr 3, 2025
Recent customer issues have highlighted problems with storage
accounting, namely that while there are quotas and reservations for
individual Crucible regions, there's nothing set for the whole Crucible
dataset. Crucible _could_ end up using the whole disk, or some large
fraction of it, such that other users of the same U2 could be starved
out.

This commit adds a buffer to each zpool that the Crucible region
allocation query will not allocate into. This overhead will be set to
250G initially (see oxidecomputer#7875 for reasoning) but could
also be modified with omdb.

Part of this commit's changes include using a CTE with
`regions_hard_delete`, which is much more efficient than the previous
for loop but has the effect of overwriting `size_used` for all datasets,
which will undo any manual changes to this column that were made to prevent
allocation for particular datasets / pools. Because of this, this commit
also adds a `no_provision` flag for a Crucible dataset: if it is set,
then the region allocation query will not allocate into that dataset.
This flag can be toggled with omdb.

Part of the upgrade to R14 will include a support procedure to address
cases where the addition of the control plane storage buffer of 250G
causes a Crucible dataset to be "overprovisioned", which requires manually
requesting region replacements to reduce the size allocated for a
particular Crucible dataset. This commit adds an omdb command to show
all overprovisioned crucible datasets, and changes the region listing
command so it can list regions for a particular dataset.

Fixes oxidecomputer#3480
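
For readers following the thread, the allocation rules described in the commit message above can be sketched roughly as below. This is an illustrative Python model only, not the actual region allocation query or omdb code. The names `size_used` and `no_provision` come from the commit message; `pool_total_bytes`, `buffer_bytes`, `can_allocate`, and `is_overprovisioned` are hypothetical, and treating "250G" as 250 GiB is an assumption. The definition of "overprovisioned" used here (allocated size exceeds the pool size minus the buffer) is likewise a guess at the intent.

```python
from dataclasses import dataclass

# Assumption: "250G" is read as 250 GiB; the source does not state the unit.
DEFAULT_BUFFER_BYTES = 250 * 1024**3


@dataclass
class CrucibleDataset:
    name: str
    pool_total_bytes: int        # hypothetical: total size of the backing zpool
    size_used: int               # bytes already allocated to regions
    no_provision: bool = False   # when set, never allocate into this dataset
    buffer_bytes: int = DEFAULT_BUFFER_BYTES  # hypothetical per-pool buffer


def can_allocate(ds: CrucibleDataset, region_bytes: int) -> bool:
    """Return True if a region of region_bytes may be placed in ds."""
    if ds.no_provision:
        return False
    # The buffer is space the allocator must leave untouched.
    return ds.size_used + region_bytes <= ds.pool_total_bytes - ds.buffer_bytes


def is_overprovisioned(ds: CrucibleDataset) -> bool:
    """Return True if existing allocations already eat into the buffer."""
    return ds.size_used > ds.pool_total_bytes - ds.buffer_bytes


def overprovisioned(datasets: list[CrucibleDataset]) -> list[str]:
    """Names of datasets that would need region replacement to shrink."""
    return [ds.name for ds in datasets if is_overprovisioned(ds)]
```

Under this model, a dataset that was filled close to capacity before the buffer existed shows up as overprovisioned and also rejects new regions, while a dataset with `no_provision` set rejects new regions regardless of free space.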