Want ability to take a crucible dataset out of provisioning pool #3480
Comments
This API can also be triggered optionally by the "take sled out of provisioning pool" feature, since taking a sled out will mean not provisioning to its disks in most cases.
Volume/region background clean-up also needs to skip crucible zones that are already removed - #4331.
@sunshowers I think this issue was resolved as of #5032. Is that right?
Answering my own question: we discussed this briefly in today's update call. This ticket is different from #5032 in two ways: it covers not just an underlying database facility, but an API and tool that can be used by support (at least) to control this behavior; and it covers dataset-level granularity, whereas #5032 only covers sled-level.
Recent customer issues have highlighted problems with storage accounting: while there are quotas and reservations for individual Crucible regions, nothing is set for the whole Crucible dataset. Crucible _could_ end up using the whole disk, or some large fraction of it, such that other users of the same U2 could be starved out.

This commit adds a buffer to each zpool that the Crucible region allocation query will not allocate into. This overhead will be set to 250G initially (see oxidecomputer#7875 for reasoning) but could also be modified with omdb.

Part of this commit's changes include using a CTE with `regions_hard_delete`, which is much more efficient than the previous for loop but has the effect of overwriting `size_used` for all datasets. That undoes any case where this column value was manually set to prevent allocation for particular datasets / pools. Because of this, this commit also adds a `no_provision` flag for a Crucible dataset: if it is set, then the region allocation query will not allocate into that dataset. This flag can be toggled with omdb.

The upgrade to R14 will include a support procedure for the case where adding the 250G control plane storage buffer causes a Crucible dataset to become "overprovisioned", which requires manually requesting region replacements to reduce the size allocated for that Crucible dataset. This commit adds an omdb command to show all overprovisioned crucible datasets, and changes the region listing command so it can list regions for a particular dataset.

Fixes oxidecomputer#3480
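The allocation rules described in the commit message above can be sketched in a few lines of Rust. This is only an illustration, not the actual implementation: per the commit message, the real logic lives in the region allocation query. The `DatasetInfo` struct, its field names, and the functions below are all hypothetical, and reading "250G" as 250 GiB is an assumption.

```rust
/// Hypothetical, simplified view of a Crucible dataset (illustration only;
/// these are not the real database model's field names).
#[derive(Debug, Clone)]
pub struct DatasetInfo {
    pub id: u32,
    /// Total size of the zpool backing this dataset, in bytes.
    pub pool_total_bytes: u64,
    /// Bytes already accounted to Crucible regions (the `size_used` idea).
    pub size_used_bytes: u64,
    /// When set, the allocator must never pick this dataset.
    pub no_provision: bool,
}

pub const GIB: u64 = 1 << 30;

/// Assumption: the "250G" buffer is interpreted here as 250 GiB.
pub const STORAGE_BUFFER_BYTES: u64 = 250 * GIB;

/// Space Crucible may use once the control plane buffer is held back.
fn allocatable_bytes(d: &DatasetInfo, buffer: u64) -> u64 {
    d.pool_total_bytes.saturating_sub(buffer)
}

/// A dataset is eligible for a new region only if it is not flagged
/// `no_provision` and the region fits below the buffer line.
pub fn can_allocate(d: &DatasetInfo, region_bytes: u64, buffer: u64) -> bool {
    if d.no_provision {
        return false;
    }
    match d.size_used_bytes.checked_add(region_bytes) {
        Some(total) => total <= allocatable_bytes(d, buffer),
        None => false, // arithmetic overflow: treat as "does not fit"
    }
}

/// A dataset is "overprovisioned" when existing usage already eats into
/// the buffer, e.g. right after the buffer is first introduced.
pub fn is_overprovisioned(d: &DatasetInfo, buffer: u64) -> bool {
    d.size_used_bytes > allocatable_bytes(d, buffer)
}

/// IDs of all overprovisioned datasets, in input order.
pub fn overprovisioned(datasets: &[DatasetInfo], buffer: u64) -> Vec<u32> {
    datasets
        .iter()
        .filter(|d| is_overprovisioned(d, buffer))
        .map(|d| d.id)
        .collect()
}
```

In this sketch, `no_provision` is checked independently of `size_used_bytes`, which mirrors the motivation given in the commit message: a flag survives recomputation of the usage column, whereas a manually inflated usage value does not.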
Similar to #2483 (the ability to take a sled out of provisioning pool), there are multiple circumstances (e.g. physical disk failure, crucible-agent failing to come up #3416) that cause a crucible dataset to be non-functional. Regardless of whether the logic for picking crucible is fully randomized, we need a support tool to mark a crucible as "bad".
The tool (probably an API) will be used by Oxide support initially. Eventually, we'd want a heart-beat mechanism in place to make the decision of when to take a crucible out of or back into the provisioning pool.