
instance start failures (and other instance events?) should be reported to the user #9185


Description

@hawkw

instance_start sagas can fail for a variety of reasons. Some of these are internal errors that we need not expose to the user, such as "we made you a VMM but then we tried to ensure your instance there and it turned out the VMM had just disappeared in the intervening seconds". However, a substantial number of instance start failures are directly relevant to the user.

In particular, the instance_start saga is what's responsible for performing sled resource allocation (also referred to as instance placement1) in order to actually find a sled on which the instance can live. Instance placement is influenced by a number of user-provided parameters: the resources (vCPUs and memory) requested by the instance, its affinity and anti-affinity constraints, and the presence of local disks attached to that instance (soon, see #9499). These factors can leave the control plane unable to find a sled capable of hosting the instance: there may not be a sled with enough free memory and vCPUs, or affinity and/or anti-affinity constraints may render every sled with sufficient capacity ineligible. An instance may be unable to start for reasons that are transient (e.g. sleds may be down for maintenance, temporarily reducing the available resources) or permanent (e.g. you have constructed an unsatisfiable system of affinity and anti-affinity constraints).
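For concreteness, here's a deliberately simplified sketch of the kind of feasibility check placement has to do. To be clear, none of these types or names (PlacementRequest, SledCandidate, place) are the actual Nexus code or data model; this just illustrates how capacity plus anti-affinity constraints can rule out every sled, whether transiently or permanently:

    /// Hypothetical request parameters (not the real Nexus data model).
    struct PlacementRequest {
        vcpus: u32,
        memory_gib: u64,
        anti_affinity_groups: Vec<String>,
    }

    /// Hypothetical view of a sled's remaining capacity.
    struct SledCandidate {
        id: u32,
        free_vcpus: u32,
        free_memory_gib: u64,
        /// Anti-affinity groups already represented on this sled.
        anti_affinity_groups: Vec<String>,
    }

    /// Returns the first sled that can host the instance, or `None` if
    /// capacity and anti-affinity constraints together rule out every sled.
    fn place(req: &PlacementRequest, sleds: &[SledCandidate]) -> Option<u32> {
        sleds
            .iter()
            .find(|sled| {
                sled.free_vcpus >= req.vcpus
                    && sled.free_memory_gib >= req.memory_gib
                    && !sled
                        .anti_affinity_groups
                        .iter()
                        .any(|g| req.anti_affinity_groups.contains(g))
            })
            .map(|sled| sled.id)
    }

    fn main() {
        // One sled with 2 free vCPUs; the instance wants 4. Placement fails,
        // and that failure is exactly the kind of error that should reach the
        // user rather than only a log file.
        let sleds = [SledCandidate {
            id: 1,
            free_vcpus: 2,
            free_memory_gib: 8,
            anti_affinity_groups: vec!["web".to_string()],
        }];
        let req = PlacementRequest {
            vcpus: 4,
            memory_gib: 16,
            anti_affinity_groups: Vec::new(),
        };
        assert_eq!(place(&req, &sleds), None);
    }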

When an instance is started via the /v1/instances/{instance}/start API endpoint, or by clicking on the button in the web console that does that, any error that prevented the instance from starting is currently bubbled up to the user.2 This is good. However, there are also some cases in which these errors do not make it into the light holes in a real life human person's face:

  • The instance failed for $SOME_REASON and the control plane attempted to automatically restart it, but failed to do so.

    In this case, the instance will appear to come to rest in the Stopped state (see #7727, "should unwinding instance-start sagas put the instance in Stopped or Failed?"), and the control plane will not attempt to restart it again, which either is or is not correct (see #9174, "SagaUnwound VMMs don't reincarnate as expected (#6669 is busted)"). This makes you sad, because you never asked for your instance to be stopped and you don't know why it is Stopped. See #9177, "many VMs on dogfood in state 'stopped' with intended state 'running'", for an example of a real life human person being sad about this.

  • The instance was asked to be started by an instance_create request with "start": true, but the client did not wait around for the start saga to complete, or the Nexus responsible for handling that request died.

    This is a bit of a weird one. Normally, the project_create_instance function does await the completion of the instance_start saga it runs after the instance is created, so the error should bubble up there:

    let instance_id = saga_outputs
        .lookup_node_output::<Uuid>("instance_id")
        .map_err(|e| Error::internal_error(&format!("{:#}", &e)))
        .internal_context("looking up output from instance create saga")?;

    // If the caller asked to start the instance, kick off that saga.
    // There's a window in which the instance is stopped and can be deleted,
    // so this is not guaranteed to succeed, and its result should not
    // affect the result of the attempt to create the instance.
    if params.start {
        let lookup = LookupPath::new(opctx, &self.db_datastore)
            .instance_id(instance_id);

        let start_result = self
            .instance_start(
                opctx,
                &lookup,
                instance_start::Reason::AutoStart,
            )
            .await;
        if let Err(e) = start_result {
            info!(self.log, "failed to start newly-created instance";
                "instance_id" => %instance_id,
                "error" => ?e);
        }
    }

    However, if the client times out the request and hangs up, the start saga will still proceed in the background. Similarly, if the Nexus handling the request manages to kick off the start saga but vanishes into the ether before that saga completes, another Nexus will continue executing it in the background. In either of those cases, should the backgrounded start saga fail, any information about why the instance did not start goes directly to /dev/null.3 (The sketch after this list illustrates why a detached saga's error has nowhere else to go.)

  • Also, @jmpesp is working on making the common case worse. We have been discussing making all instance start operations asynchronous (in the sense that the HTTP API call returns as soon as the instance_start saga has been kicked off, not in the sense of being a Rust async fn, which it already is). This may be necessary because starting an instance with local storage takes A While and we would like the client to not get too sad about that. However, changing the API behavior to always do the actual work of starting in the background means that, well, starting the instance will now happen in the background in this case, too. So the error also ends up disappearing into the void should we make this change.
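To make the last two bullets concrete: here's a minimal, purely hypothetical sketch of the difference between awaiting the start work and detaching it. None of this is real Nexus code (it assumes a plain tokio runtime and a stand-in run_start_saga function); the point is just that once the work is detached, its error can only be logged, which from the user's point of view is indistinguishable from discarding it:

    use std::fmt;

    // Hypothetical stand-in for running the instance_start saga to completion.
    #[derive(Debug)]
    struct StartError(String);

    impl fmt::Display for StartError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            write!(f, "instance start failed: {}", self.0)
        }
    }

    async fn run_start_saga() -> Result<(), StartError> {
        Err(StartError("no sled had enough free vCPUs".to_string()))
    }

    #[tokio::main]
    async fn main() {
        // Synchronous-style endpoint: the handler awaits the saga, so its
        // error can be mapped into the HTTP response and shown to the caller.
        if let Err(e) = run_start_saga().await {
            println!("returned to the client: {e}");
        }

        // Background-style: the client hung up, or the API returns
        // immediately. The saga runs detached, and its error is only logged.
        let detached = tokio::spawn(async {
            if let Err(e) = run_start_saga().await {
                eprintln!("only logged, never shown to the user: {e}");
            }
        });
        let _ = detached.await;
    }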

ALL OF THIS MAKES ME FEEL VERY SAD AND ANGRY AND I WOULD LIKE IT TO NOT BE THAT WAY

Footnotes

  1. At least by me. I'm not sure whether anyone else has used this terminology or not. Whatever.

  2. Or the program that called the API, which I consider to be a form of user.

  3. Technically, it does not go to /dev/null, it goes to /pool/ext/b93f880e-c55b-4d6c-9a16-939d84b628fc/crypt/debug/oxz_nexus_470fbf4d-0178-45ee-a422-136fa5f4a158/oxide-nexus:default.log.1766096101, which from the user's perspective may as well be /dev/null. And while Oxide support knows that this is a real place that actually exists, it's pretty hard to get it back out from there.
