Skip to content

SP/MGS: want better way to identify failure due to update in progress #8013

@davepacheco

Description

@davepacheco

If an MGS client attempts to start an SP update while one is in progress, they get this message:

ERRO failed to start update, error: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "1e0505fb-f138-4f24-b342-b2193702e8f1", "content-length": "421", "date": "Mon, 07 Apr 2025 21:48:42 GMT"}; value: Error { error_code: Some("UpdateInProgress"), message: "updating SP SpIdentifier { typ: Sled, slot: 15 } failed: failed to send update message to SP: Error response from SP: update still in progress (InProgress(UpdateInProgressStatus { id: UpdateId([163, 107, 251, 2, 223, 59, 78, 15, 154, 198, 5, 205, 130, 72, 5, 91]), bytes_received: 27384, total_size: 655616 }))", request_id: "1e0505fb-f138-4f24-b342-b2193702e8f1" }, update_id: a3bbdd60-1937-4305-b2d9-5296e43b110f, part_number: 913-0000019, serial_number: BRM27230037, sp_type: Sled, sp_slot: 15, component: sp, firmware_slot: 0, artifact_kind: gimlet_sp, artifact_hash: c9cb6c6d2b3fd9e198074b4160119caa21ca88632b218420a570725ffd0b8616

(this is an example logged by the MgsUpdateDriver while working on #7977)

When implementing #7977 I saw this and concluded (maybe erroneously) that this case could only be programmatically identified by looking at the error message, so that's what I did:

// TODO We need a better way to identify this error.
let chain = InlineErrorChain::new(&error);
let message = chain.to_string();
if !message.contains("update still in progress") {
error!(log, "failed to start update"; chain);
return Err(ApplyUpdateError::UpdateStartError(error));
}

but obviously this is brittle. It'd be better to do this programmatically. I'd suggest we have MGS use the HttpResponseError stuff introduced in dropshot 0.15.0 to provide a structured error response that we can parse on the client and see what happened.

I would also change this response from a 503 to a 400-level error for the reasons we document here. It's unusual I think to interpret the specific cause of a 503 -- it generally just means "your request is fine but I can't do it right now -- retry the same thing shortly". I think it's more accurate in this case to say "your request is not valid for the system's current state" -- a 400-level response of some kind -- and have the response say what's wrong with it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions