`instance_start` sagas can fail for a variety of reasons. Some of these are internal errors that we need not expose to the user, such as "we made you a VMM, but then we tried to ensure your instance there and it turned out the VMM had just disappeared in the intervening seconds". However, a substantial number of instance start failures are directly relevant to the user.
In particular, the `instance_start` saga is what's responsible for performing sled resource allocation (also referred to as instance placement[^1]) in order to actually find a sled on which the instance can live. Instance placement is influenced by a number of user-provided parameters: the resources (vCPUs and memory) requested by the instance, its affinity and anti-affinity constraints, and the presence of local disks attached to that instance (soon; see #9499). These factors can make the control plane unable to find a sled capable of hosting that instance: there may not be a sled with enough free memory and vCPUs for the instance, or affinity and/or anti-affinity constraints may render all sleds with sufficient capacity ineligible for that instance. An instance may be unable to start for reasons that are transient (e.g. sleds may be down for maintenance, temporarily reducing the available resources) or permanent (you have constructed an unsatisfiable system of affinity and anti-affinity constraints).
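For intuition, here's a minimal sketch of the kind of filtering that can leave the control plane with no eligible sled. This is not Nexus's actual placement code; the types and the `place` function are hypothetical stand-ins:

```rust
/// Hypothetical, simplified stand-in for a sled under consideration.
struct SledCandidate {
    free_vcpus: u32,
    free_memory_gib: u32,
    /// Names of instances already resident here, for anti-affinity checks.
    resident_instances: Vec<String>,
}

/// Requested resources and constraints for the instance being placed.
struct PlacementRequest {
    vcpus: u32,
    memory_gib: u32,
    /// Instances this one must not share a sled with.
    anti_affinity: Vec<String>,
}

/// Returns the index of the first sled satisfying every constraint, if any.
/// A `None` here is a failed start attempt: possibly transient (a sled with
/// capacity is down) or permanent (the constraints are unsatisfiable).
fn place(sleds: &[SledCandidate], req: &PlacementRequest) -> Option<usize> {
    sleds.iter().position(|sled| {
        sled.free_vcpus >= req.vcpus
            && sled.free_memory_gib >= req.memory_gib
            && !sled
                .resident_instances
                .iter()
                .any(|i| req.anti_affinity.contains(i))
    })
}

fn main() {
    let sleds = vec![SledCandidate {
        free_vcpus: 8,
        free_memory_gib: 32,
        resident_instances: vec!["db-primary".to_string()],
    }];
    let req = PlacementRequest {
        vcpus: 4,
        memory_gib: 16,
        anti_affinity: vec!["db-primary".to_string()],
    };
    // The only sled with capacity hosts an anti-affine instance, so
    // there is no valid placement and the start attempt must fail.
    assert_eq!(place(&sleds, &req), None);
}
```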
When an instance is started via the `/v1/instances/{instance}/start` API endpoint, or by clicking on the button in the web console that does that, any error that prevented the instance from starting is currently bubbled up to the user.[^2] This is good. However, there are also some cases in which these errors do not make it into the light holes in a real life human person's face:
- The instance failed for `$SOME_REASON` and the control plane attempted to automatically restart it, but failed to do so. In this case, the instance will appear to come to rest in the `Stopped` state (should unwinding instance-start sagas put the instance in `Stopped` or `Failed`? #7727), and the control plane will not attempt to restart it again, which either is or is not correct (`SagaUnwound` VMMs don't reincarnate as expected (#6669 is busted), #9174). This makes you sad, because you never asked for your instance to be stopped and you don't know why it is `Stopped`. See "many VMs on dogfood in state 'stopped' with intended state 'running'" (#9177) for an example of a real life human person being sad about this.
- The instance was asked to be started by an `instance_create` request with `"start": true`, but the client did not wait around for the start saga to complete, or the Nexus responsible for handling that request died. This is a bit of a weird one. Normally, the `project_create_instance` function does await the completion of the `instance_start` saga it runs after the instance is created, so the error should bubble up there:
  `omicron/nexus/src/app/instance.rs`, lines 689 to 714 at 71d9385:

  ```rust
  let instance_id = saga_outputs
      .lookup_node_output::<Uuid>("instance_id")
      .map_err(|e| Error::internal_error(&format!("{:#}", &e)))
      .internal_context("looking up output from instance create saga")?;

  // If the caller asked to start the instance, kick off that saga.
  // There's a window in which the instance is stopped and can be deleted,
  // so this is not guaranteed to succeed, and its result should not
  // affect the result of the attempt to create the instance.
  if params.start {
      let lookup = LookupPath::new(opctx, &self.db_datastore)
          .instance_id(instance_id);
      let start_result = self
          .instance_start(
              opctx,
              &lookup,
              instance_start::Reason::AutoStart,
          )
          .await;
      if let Err(e) = start_result {
          info!(self.log, "failed to start newly-created instance";
              "instance_id" => %instance_id,
              "error" => ?e,
          );
      }
  }
  ```
  However, if the client times out the request and hangs up, the start saga will still proceed in the background. Similarly, if Nexus manages to kick off the start saga but vanishes into the ether before that saga has completed, another Nexus will continue to execute it in the background. In either of those cases, should the start saga that's running in the background fail, any information about why the instance did not start goes directly to `/dev/null`.[^3]
- Also, @jmpesp is working on making the common case worse, too. We have been discussing making all instance start operations asynchronous (in the sense that the HTTP API call returns as soon as the `instance_start` saga has been kicked off, not in the sense of being a Rust `async fn`, which it already is). This may be necessary because starting an instance with local storage takes A While, and we would like the client to not get too sad about that. However, changing the API behavior to always do the actual work of starting in the background means that, well, starting the instance will now happen in the background in this case, too. So should we make this change, the error also ends up disappearing into the void (see the sketch after this list).
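All three cases above reduce to the same shape: the saga's error is consumed by a log statement on a detached task and never surfaced anywhere the user can see it. Here is a minimal sketch of that shape using plain `tokio` rather than Nexus's actual saga machinery; `run_instance_start_saga` and the error type are hypothetical:

```rust
use std::fmt;

/// Hypothetical error type standing in for whatever made the saga fail.
#[derive(Debug)]
struct StartError(String);

impl fmt::Display for StartError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "instance failed to start: {}", self.0)
    }
}

/// Hypothetical stand-in for executing the `instance_start` saga.
async fn run_instance_start_saga() -> Result<(), StartError> {
    Err(StartError(
        "no sled satisfies the placement constraints".to_string(),
    ))
}

#[tokio::main]
async fn main() {
    // The request handler returns to the client immediately; the saga
    // keeps running, detached, in the background.
    let detached = tokio::spawn(async {
        if let Err(e) = run_instance_start_saga().await {
            // The error's only destination is the server-side log.
            // Nothing records it anywhere the user can later query.
            eprintln!("(log only) {e}");
        }
    });

    // In the real system nothing joins the detached work on the client's
    // behalf; we await it here only so this demo prints before exiting.
    let _ = detached.await;
}
```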
ALL OF THIS MAKES ME FEEL VERY SAD AND ANGRY AND I WOULD LIKE IT TO NOT BE THAT WAY
Footnotes
[^1]: At least by me. I'm not sure whether anyone else has used this terminology or not. Whatever.
[^2]: Or the program that called the API, which I consider to be a form of user.
[^3]: Technically, it does not go to `/dev/null`; it goes to `/pool/ext/b93f880e-c55b-4d6c-9a16-939d84b628fc/crypt/debug/oxz_nexus_470fbf4d-0178-45ee-a422-136fa5f4a158/oxide-nexus:default.log.1766096101`, which from the user's perspective may as well be `/dev/null`. And while Oxide support knows that this is a real place that actually exists, it's pretty hard to get it back out from there.