
instance start failures (and other instance events?) should be reported to the user #9185


Description

@hawkw

instance_start sagas can fail for a variety of reasons. Some of these are internal errors that we need not expose to the user, such as "we made you a VMM but then we tried to ensure your instance there and it turned out the VMM had just disappeared in the intervening seconds". However, a substantial number of instance start failures are directly relevant to the user.

In particular, the instance_start saga is what's responsible for performing sled resource allocation (also referred to as instance placement1) in order to actually find a sled on which the instance can live. Instance placement is influenced by a number of user-provided parameters: the resources (vCPUs and memory) requested by the instance, its affinity and anti-affinity constraints, and the presence of local disks attached to that instance (soon, see #9499). These factors can leave the control plane unable to find a sled capable of hosting the instance: there may not be a sled with enough free memory and vCPUs, or affinity and/or anti-affinity constraints may render every sled with sufficient capacity ineligible. An instance may be unable to start for reasons that are transient (e.g. sleds may be down for maintenance, temporarily reducing the available resources) or permanent (e.g. you have constructed an unsatisfiable system of affinity and anti-affinity constraints).
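For concreteness, here's a deliberately simplified sketch of the kind of feasibility check placement has to do. To be clear, none of these types or names (PlacementRequest, SledCandidate, place) are the actual Nexus code or data model; this just illustrates how capacity plus anti-affinity constraints can rule out every sled, whether transiently or permanently:

    /// Hypothetical request parameters (not the real Nexus data model).
    struct PlacementRequest {
        vcpus: u32,
        memory_gib: u64,
        anti_affinity_groups: Vec<String>,
    }

    /// Hypothetical view of a sled's remaining capacity.
    struct SledCandidate {
        id: u32,
        free_vcpus: u32,
        free_memory_gib: u64,
        /// Anti-affinity groups already represented on this sled.
        anti_affinity_groups: Vec<String>,
    }

    /// Returns the first sled that can host the instance, or `None` if
    /// capacity and anti-affinity constraints together rule out every sled.
    fn place(req: &PlacementRequest, sleds: &[SledCandidate]) -> Option<u32> {
        sleds
            .iter()
            .find(|sled| {
                sled.free_vcpus >= req.vcpus
                    && sled.free_memory_gib >= req.memory_gib
                    && !sled
                        .anti_affinity_groups
                        .iter()
                        .any(|g| req.anti_affinity_groups.contains(g))
            })
            .map(|sled| sled.id)
    }

    fn main() {
        // One sled with 2 free vCPUs; the instance wants 4. Placement fails,
        // and that failure is exactly the kind of error that should reach the
        // user rather than only a log file.
        let sleds = [SledCandidate {
            id: 1,
            free_vcpus: 2,
            free_memory_gib: 8,
            anti_affinity_groups: vec!["web".to_string()],
        }];
        let req = PlacementRequest {
            vcpus: 4,
            memory_gib: 16,
            anti_affinity_groups: Vec::new(),
        };
        assert_eq!(place(&req, &sleds), None);
    }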

When an instance is started via the /v1/instances/{instance}/start API endpoint, or by clicking on the button in the web console that does that, any error that prevented the instance from starting is currently bubbled up to the user.2 This is good. However, there are also some cases in which these errors do not make it into the light holes in a real life human person's face:

  • The instance failed for $SOME_REASON and the control plane attempted to automatically restart it, but failed to do so.

    In this case, the instance will appear to come to rest in the Stopped state (see #7727, "should unwinding instance-start sagas put the instance in Stopped or Failed?"), and the control plane will not attempt to restart it again, which either is or is not correct (see #9174, "SagaUnwound VMMs don't reincarnate as expected (#6669 is busted)"). This makes you sad, because you never asked for your instance to be stopped and you don't know why it is Stopped. See #9177, "many VMs on dogfood in state 'stopped' with intended state 'running'", for an example of a real life human person being sad about this.

  • The instance was asked to be started by an instance_create request with "start": true, but the client did not wait around for the start saga to complete, or the Nexus responsible for handling that request died.

    This is a bit of a weird one. Normally, the project_create_instance function does await the completion of the instance_start saga it runs after the instance is created, so the error should bubble up there:

    let instance_id = saga_outputs
        .lookup_node_output::<Uuid>("instance_id")
        .map_err(|e| Error::internal_error(&format!("{:#}", &e)))
        .internal_context("looking up output from instance create saga")?;

    // If the caller asked to start the instance, kick off that saga.
    // There's a window in which the instance is stopped and can be deleted,
    // so this is not guaranteed to succeed, and its result should not
    // affect the result of the attempt to create the instance.
    if params.start {
        let lookup = LookupPath::new(opctx, &self.db_datastore)
            .instance_id(instance_id);

        let start_result = self
            .instance_start(
                opctx,
                &lookup,
                instance_start::Reason::AutoStart,
            )
            .await;
        if let Err(e) = start_result {
            info!(self.log, "failed to start newly-created instance";
                "instance_id" => %instance_id,
                "error" => ?e);
        }
    }

    However, if the client times out the request and hangs up, the start saga will still proceed in the background. Similarly, if the Nexus handling the request manages to kick off the start saga but vanishes into the ether before that saga completes, another Nexus will continue executing it in the background. In either of those cases, should the backgrounded start saga fail, any information about why the instance did not start goes directly to /dev/null.3 (The sketch after this list illustrates why a detached saga's error has nowhere else to go.)

  • Also, @jmpesp is working on making the common case worse. We have been discussing making all instance start operations asynchronous (in the sense that the HTTP API call returns as soon as the instance_start saga has been kicked off, not in the sense of being a Rust async fn, which it already is). This may be necessary because starting an instance with local storage takes A While and we would like the client to not get too sad about that. However, changing the API behavior to always do the actual work of starting in the background means that, well, starting the instance will now happen in the background in this case, too. So the error also ends up disappearing into the void should we make this change.
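To make the last two bullets concrete: here's a minimal, purely hypothetical sketch of the difference between awaiting the start work and detaching it. None of this is real Nexus code (it assumes a plain tokio runtime and a stand-in run_start_saga function); the point is just that once the work is detached, its error can only be logged, which from the user's point of view is indistinguishable from discarding it:

    use std::fmt;

    // Hypothetical stand-in for running the instance_start saga to completion.
    #[derive(Debug)]
    struct StartError(String);

    impl fmt::Display for StartError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            write!(f, "instance start failed: {}", self.0)
        }
    }

    async fn run_start_saga() -> Result<(), StartError> {
        Err(StartError("no sled had enough free vCPUs".to_string()))
    }

    #[tokio::main]
    async fn main() {
        // Synchronous-style endpoint: the handler awaits the saga, so its
        // error can be mapped into the HTTP response and shown to the caller.
        if let Err(e) = run_start_saga().await {
            println!("returned to the client: {e}");
        }

        // Background-style: the client hung up, or the API returns
        // immediately. The saga runs detached, and its error is only logged.
        let detached = tokio::spawn(async {
            if let Err(e) = run_start_saga().await {
                eprintln!("only logged, never shown to the user: {e}");
            }
        });
        let _ = detached.await;
    }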

ALL OF THIS MAKES ME FEEL VERY SAD AND ANGRY AND I WOULD LIKE IT TO NOT BE THAT WAY

Footnotes

  1. At least by me. I'm not sure whether anyone else has used this terminology or not. Whatever.

  2. Or the program that called the API, which I consider to be a form of user.

  3. Technically, it does not go to /dev/null, it goes to /pool/ext/b93f880e-c55b-4d6c-9a16-939d84b628fc/crypt/debug/oxz_nexus_470fbf4d-0178-45ee-a422-136fa5f4a158/oxide-nexus:default.log.1766096101, which from the user's perspective may as well be /dev/null. And while Oxide support knows that this is a real place that actually exists, it's pretty hard to get it back out from there.
