If an MGS client attempts to start an SP update while one is in progress, they get this message:
ERRO failed to start update, error: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "1e0505fb-f138-4f24-b342-b2193702e8f1", "content-length": "421", "date": "Mon, 07 Apr 2025 21:48:42 GMT"}; value: Error { error_code: Some("UpdateInProgress"), message: "updating SP SpIdentifier { typ: Sled, slot: 15 } failed: failed to send update message to SP: Error response from SP: update still in progress (InProgress(UpdateInProgressStatus { id: UpdateId([163, 107, 251, 2, 223, 59, 78, 15, 154, 198, 5, 205, 130, 72, 5, 91]), bytes_received: 27384, total_size: 655616 }))", request_id: "1e0505fb-f138-4f24-b342-b2193702e8f1" }, update_id: a3bbdd60-1937-4305-b2d9-5296e43b110f, part_number: 913-0000019, serial_number: BRM27230037, sp_type: Sled, sp_slot: 15, component: sp, firmware_slot: 0, artifact_kind: gimlet_sp, artifact_hash: c9cb6c6d2b3fd9e198074b4160119caa21ca88632b218420a570725ffd0b8616
but obviously this is brittle. It'd be better to do this programmatically. I'd suggest we have MGS use the HttpResponseError stuff introduced in dropshot 0.15.0 to provide a structured error response that we can parse on the client and see what happened.
If an MGS client attempts to start an SP update while one is in progress, they get this message:
(this is an example logged by the MgsUpdateDriver while working on #7977)
When implementing #7977 I saw this and concluded (maybe erroneously) that this case could only be programmatically identified by looking at the error message, so that's what I did:
omicron/nexus/mgs-updates/src/driver_update.rs
Lines 220 to 226 in 1d3958c
but obviously this is brittle. It'd be better to do this programmatically. I'd suggest we have MGS use the
HttpResponseErrorstuff introduced in dropshot 0.15.0 to provide a structured error response that we can parse on the client and see what happened.I would also change this response from a 503 to a 400-level error for the reasons we document here. It's unusual I think to interpret the specific cause of a 503 -- it generally just means "your request is fine but I can't do it right now -- retry the same thing shortly". I think it's more accurate in this case to say "your request is not valid for the system's current state" -- a 400-level response of some kind -- and have the response say what's wrong with it.