Contributor

@karencfv karencfv commented Jan 9, 2026

This is the second PR for #9412

This commit adds a list of unhealthy zpools to the sled agent inventory's health monitor. In a follow-up PR, this information will be added to the DB.

Successful unhealthy zpool retrieval:

$ zpool list -Hpo health,name
FAULTED fakepool1
FAULTED fakepool2
ONLINE  rpool

$ curl -H "api-version: 15.0.0"  http://[::1]:54963/inventory | jq
<...>
  "health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [
          {
            "fmri": "svc:/site/fake-service2:default",
            "zone": "global"
          }
        ],
        "errors": [],
        "time_of_status": "2026-01-12T07:15:06.913644164Z"
      }
    },
    "unhealthy_zpools": {
      "ok": {
        "zpools": [
          "fakepool1",
          "fakepool2"
        ],
        "errors": [],
        "time_of_status": "2026-01-12T07:15:06.888312628Z"
      }
    }
  }

Response contains errors:

$ curl -H "api-version: 15.0.0"  http://[::1]:54963/inventory | jq
<...>
  "health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [],
        "errors": [],
        "time_of_status": "2026-01-12T07:10:00.642206104Z"
      }
    },
    "unhealthy_zpools": {
      "ok": {
        "zpools": [],
        "errors": [
          "Failed to parse output: Unrecognized zpool 'health': fakepool1",
          "Failed to parse output: Unrecognized zpool 'health': fakepool2",
          "Failed to parse output: Unrecognized zpool 'health': rpool"
        ],
        "time_of_status": "2026-01-12T07:10:00.591985515Z"
      }
    }
  }
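
For readers following along, the parsing behaviour behind the two responses above can be sketched roughly as follows. This is a minimal std-only sketch, not the code in this PR: `ParsedZpoolsSketch` and `parse_unhealthy_zpools` are hypothetical names, and the set of non-`ONLINE` health states is an assumption based on the usual zpool documentation. The "Unrecognized zpool 'health'" error string copies the format shown in the response above; the "missing pool name" error is invented for the sketch.

```rust
/// Hypothetical result type for this sketch; it is not the type added in this PR.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ParsedZpoolsSketch {
    pub unhealthy: Vec<String>,
    pub errors: Vec<String>,
}

/// Parses the output of `zpool list -Hpo health,name`.
/// Each non-empty line is expected to be `<health><whitespace><name>`.
pub fn parse_unhealthy_zpools(output: &str) -> ParsedZpoolsSketch {
    let mut parsed = ParsedZpoolsSketch::default();
    for line in output.lines() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        // Split on the first whitespace only, so the rest of the line is the name.
        let (health, name) = match line.split_once(char::is_whitespace) {
            Some((health, name)) => (health, name.trim()),
            None => {
                // Hypothetical error text for a line with a single field.
                parsed
                    .errors
                    .push(format!("Failed to parse output: missing pool name: {line}"));
                continue;
            }
        };
        match health {
            // Healthy pools are not reported.
            "ONLINE" => {}
            // Assumed set of non-healthy states.
            "DEGRADED" | "FAULTED" | "OFFLINE" | "REMOVED" | "UNAVAIL" => {
                parsed.unhealthy.push(name.to_string())
            }
            other => parsed.errors.push(format!(
                "Failed to parse output: Unrecognized zpool 'health': {other}"
            )),
        }
    }
    parsed
}
```

In this sketch, feeding the columns in `name,health` order makes the pool name land in the health position, so every line ends up in `errors` and `unhealthy` stays empty.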

@karencfv karencfv marked this pull request as ready for review January 12, 2026 07:29
pub struct UnhealthyZpoolsResult {
    pub zpools: Vec<String>,
    pub errors: Vec<String>,
    pub time_of_status: Option<DateTime<Utc>>,
Contributor

Does this need to be optional? If we have a result, it must have been collected at some time, right?

Contributor Author

This is meant for when this runs on non-illumos environments.

https://github.com/oxidecomputer/omicron/pull/9615/files#diff-07cfb91c0643fac6671e8b81317c92cbe806ab07e2af6fdce16db75315245e32R446-R453

The idea is that there is no time of status because no command was ever run.

Contributor Author

Also, there is a point in time where the health monitor may not have run yet: #9589 (comment)
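
To make the two `None` cases concrete, here is a minimal sketch of how a consumer might treat the optional timestamp. It is an illustration under assumptions, not the code in this PR: `UnhealthyZpoolsSketch`, `not_yet_checked`, `checked_at`, and `describe` are hypothetical names, and `std::time::SystemTime` stands in for `DateTime<Utc>` so the example needs no external crates.

```rust
use std::time::SystemTime;

/// Hypothetical stand-in for the result type discussed in this thread.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct UnhealthyZpoolsSketch {
    pub zpools: Vec<String>,
    pub errors: Vec<String>,
    /// `None` means no command was ever run: either the host is not illumos,
    /// or the health monitor has not completed its first check yet.
    pub time_of_status: Option<SystemTime>,
}

impl UnhealthyZpoolsSketch {
    /// State reported before any check has run.
    pub fn not_yet_checked() -> Self {
        Self::default()
    }

    /// State reported after a check actually ran at `when`.
    pub fn checked_at(zpools: Vec<String>, errors: Vec<String>, when: SystemTime) -> Self {
        Self { zpools, errors, time_of_status: Some(when) }
    }

    /// Distinguishes "never checked" from "checked and found nothing".
    pub fn describe(&self) -> String {
        match self.time_of_status {
            None => "no zpool health check has run".to_string(),
            Some(_) => format!(
                "{} unhealthy zpool(s), {} error(s)",
                self.zpools.len(),
                self.errors.len()
            ),
        }
    }
}
```

The point of the `Option` in this sketch is that an empty `zpools` list with a timestamp means "checked, all healthy", while an empty list without one means "nothing is known yet".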

#[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)]
#[serde(rename_all = "snake_case")]
pub struct UnhealthyZpoolsResult {
    pub zpools: Vec<String>,
Contributor Author

I went back and forth between just having a list of unhealthy zpools and associating each zpool with its state. In the end I went with listing the zpools only, but I'm not convinced. We'll be including the health check information in the support bundle, and it'd be useful for support to be able to see what state each zpool is in. Thoughts? @davepacheco @jgallagher
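
For concreteness, the alternative of pairing each zpool with its state might look roughly like the sketch below. Everything here is hypothetical and std-only: `ZpoolHealthSketch`, `UnhealthyZpoolSketch`, and `unhealthy_entry` are names made up for illustration, and the list of states is an assumption based on the usual zpool documentation rather than anything decided in this PR.

```rust
use std::str::FromStr;

/// Hypothetical enum of non-healthy pool states (assumed set).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZpoolHealthSketch {
    Degraded,
    Faulted,
    Offline,
    Removed,
    Unavailable,
}

impl FromStr for ZpoolHealthSketch {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "DEGRADED" => Ok(Self::Degraded),
            "FAULTED" => Ok(Self::Faulted),
            "OFFLINE" => Ok(Self::Offline),
            "REMOVED" => Ok(Self::Removed),
            "UNAVAIL" => Ok(Self::Unavailable),
            other => Err(format!("Unrecognized zpool 'health': {other}")),
        }
    }
}

/// Hypothetical entry that keeps the state next to the pool name.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct UnhealthyZpoolSketch {
    pub name: String,
    pub health: ZpoolHealthSketch,
}

/// Returns `Ok(None)` for healthy pools, `Ok(Some(..))` for unhealthy ones,
/// and `Err(..)` when the health string is not recognized.
pub fn unhealthy_entry(
    health: &str,
    name: &str,
) -> Result<Option<UnhealthyZpoolSketch>, String> {
    if health == "ONLINE" {
        return Ok(None);
    }
    let health = health.parse::<ZpoolHealthSketch>()?;
    Ok(Some(UnhealthyZpoolSketch { name: name.to_string(), health }))
}
```

With a shape like this, `zpools` would become a list of `{ name, health }` entries instead of bare names, which is the extra detail a support bundle reader would see.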
