Skip to content

Conversation

@karencfv
Copy link
Contributor

@karencfv karencfv commented Jan 5, 2026

This commit adds the existing health monitor information (services in maintenance) to the database.

Follow up to #9434

With services in maintenance

$ ./target/debug/omdb db inventory collections show latest --db-url "postgresql://root@[::1]:62900/omicron?sslmode=disable"
<...>

HEALTH MONITOR
    2 SMF services in maintenance at 2026-01-08T03:07:53.862Z
        FMRI                                ZONE   
        svc:/site/fake-service:default      global 
        svc:/site/fake-service2:default     global 

Without services in maintenance

$ ./target/debug/omdb db inventory collections show latest --db-url "postgresql://root@[::1]:48291/omicron?sslmode=disable"
<...>

HEALTH MONITOR
    0 SMF services in maintenance at 2026-01-08T02:51:49.416Z

Health monitor isn't running for whatever reason

$ ./target/debug/omdb db inventory collections show latest --db-url "postgresql://root@[::1]:64930/omicron?sslmode=disable"
<...>

HEALTH MONITOR
    no data on SMF services in maintenance has been collected

Errors when parsing svcs response

$ ./target/debug/omdb db inventory collections show latest --db-url "postgresql://root@[::1]:64930/omicron?sslmode=disable"
<...>

HEALTH MONITOR
    2 SMF services in maintenance at 2026-01-08T03:34:20.198Z
        FMRI                                ZONE   
        svc:/site/fake-service:default      global 
        svc:/site/fake-service2:default     global 

    found errors when retrieving services in maintenance:
        fake parsing error
        another fake parsing error

Same error but there are no services in maintenance

$ ./target/debug/omdb db inventory collections show latest --db-url "postgresql://root@[::1]:64930/omicron?sslmode=disable"
<...>

HEALTH MONITOR
    0 SMF services in maintenance at 2026-01-08T03:37:52.758Z

    found errors when retrieving services in maintenance:
        fake parsing error
        another fake parsing error

An error was returned when attempting to run svcs. There may or may not be services in maintenance

$ ./target/debug/omdb db inventory collections show latest --db-url "postgresql://root@[::1]:64930/omicron?sslmode=disable"
<...>

HEALTH MONITOR
    failed to retrieve SMF services in maintenance: contract error: fake execution error: other error

Closes: #9516

Comment on lines +1021 to +1032
#[derive(Queryable, Clone, Debug, Selectable, Insertable)]
#[diesel(table_name = inv_health_monitor_svc_in_maintenance)]
pub struct InvSvcInMaintenance {
pub inv_collection_id: DbTypedUuid<CollectionKind>,
pub sled_id: DbTypedUuid<SledKind>,
pub id: DbTypedUuid<SvcInMaintenanceKind>,
pub fmri: Option<String>,
pub zone: Option<String>,
pub error_messages: Vec<String>,
pub svcs_cmd_error: Option<String>,
pub time_of_status: Option<DateTime<Utc>>,
}
Copy link
Contributor Author

@karencfv karencfv Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with flattening the SvcsInMaintenanceResult struct instead of creating two tables and joining them. It seemed like unnecessary complexity. This means that all of the rows from each collection will have the same time_of_status, error_messages, and svcs_cmd_error, but it doesn't really feel like a big deal.

pub struct SvcsInMaintenanceResult {
    pub services: Vec<SvcInMaintenance>,
    pub errors: Vec<String>,
    pub time_of_status: Option<DateTime<Utc>>,
}
pub struct SvcInMaintenance {
    pub fmri: String,
    pub zone: String,
}

Open to suggestions if there's a better way to go on about this!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this because it seems to mix "Rust Vec becomes array of TEXT" and "Rust Vec becomes multiple rows" within the same type. (If we have multiple services entries we'll have multiple rows, but if we have multiple errors entries we'll only have one row.) I think we should pick one or the other; personally I shy away from SQL array because I've found them kinda difficult to work with, so I probably would have made this three tables:

  • a toplevel table that always has exactly one row per collection; this would record either time_of_status or svcs_cmd_error (with a CHECK constraint that we have one or the other, IIUC?).
  • a table for services
  • a table for errors

But I'm not sure if I'm overrotating on bad experiences with SQL arrays. @davepacheco might have an opinion here?

I guess the current model also introduces a possible invalid state it'd be nice to avoid: what if we have multiple rows from one collection that ought to have the same time_of_status / errors values, but don't? Not super likely to be a problem in practice, but sometimes we've been burned by adding this kind of structure and then later realizing it's actually kinda painful to deal with.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also avoid arrays in SQL, also possibly for not-good reasons. I usually use a separate table like you've said here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spoke about this at the update watercooler. Will update the PR with 3 tables

@karencfv karencfv marked this pull request as ready for review January 8, 2026 06:29
Comment on lines +1021 to +1032
#[derive(Queryable, Clone, Debug, Selectable, Insertable)]
#[diesel(table_name = inv_health_monitor_svc_in_maintenance)]
pub struct InvSvcInMaintenance {
pub inv_collection_id: DbTypedUuid<CollectionKind>,
pub sled_id: DbTypedUuid<SledKind>,
pub id: DbTypedUuid<SvcInMaintenanceKind>,
pub fmri: Option<String>,
pub zone: Option<String>,
pub error_messages: Vec<String>,
pub svcs_cmd_error: Option<String>,
pub time_of_status: Option<DateTime<Utc>>,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this because it seems to mix "Rust Vec becomes array of TEXT" and "Rust Vec becomes multiple rows" within the same type. (If we have multiple services entries we'll have multiple rows, but if we have multiple errors entries we'll only have one row.) I think we should pick one or the other; personally I shy away from SQL array because I've found them kinda difficult to work with, so I probably would have made this three tables:

  • a toplevel table that always has exactly one row per collection; this would record either time_of_status or svcs_cmd_error (with a CHECK constraint that we have one or the other, IIUC?).
  • a table for services
  • a table for errors

But I'm not sure if I'm overrotating on bad experiences with SQL arrays. @davepacheco might have an opinion here?

I guess the current model also introduces a possible invalid state it'd be nice to avoid: what if we have multiple rows from one collection that ought to have the same time_of_status / errors values, but don't? Not super likely to be a problem in practice, but sometimes we've been burned by adding this kind of structure and then later realizing it's actually kinda painful to deal with.

// made and any parsing errors we may have collected.
Ok(svcs)
if svcs.services.is_empty()
&& svcs.time_of_status.is_some() =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would time_of_status be None if we have an Ok(svcs) value for smf_services_in_mainenance? Doesn't that mean we got a status back (and therefore there must be some time at which it was collected)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would only really happen when testing or running the simulated omicron on a machine that isn't running illumos.

#[cfg(not(target_os = "illumos"))]
pub async fn in_maintenance(
log: &Logger,
) -> Result<SvcsInMaintenanceResult, ExecutionError> {
info!(log, "OS not illumos, will not check state of SMF services");
let svcs_result = SvcsInMaintenanceResult::new();
Ok(svcs_result)
}
}

impl SvcsInMaintenanceResult {
pub fn new() -> Self {
Self { services: vec![], errors: vec![], time_of_status: None }
}

I can add a comment explaining this. I didn't want to add a row because it could be confusing to see just a bunch of rows with just IDs and could throw people off, but it probably doesn't really matter much either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change SvcsInMaintenanceResult::new() to fill in now() instead, and make it non-optional? Realizing that's basically the same thing I'm asking on #9615 (comment) for the zpool checks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that when I was first working on adding the type. It's basically to be able to tell if the svcs command ran or not. If it ran there's a time of status, if it didn't then there is none. Then again, this would only happen on non-illumos environments, so probably not super important to have a differentiator here? 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then again, this would only happen on non-illumos environments, so probably not super important to have a differentiator here?

Yeah, I don't think it's worth making the type more complex here just to give us this bit of information. There are plenty of other places where non-illumos systems just fill in some default / dummy data, and I think that's the right choice.

Copy link
Contributor Author

@karencfv karencfv Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, actually, what I said previously isn't quite true. There is a moment where we create a new SvcsInMaintenanceResult, but we haven't ran the svcs command yet in HealthMonitorHandle::new()

pub fn spawn(log: Logger) -> Self {
// Spawn a task to retrieve information about services in maintenance
info!(log, "Starting SMF service health poller");
let (smf_services_in_maintenance_tx, smf_services_in_maintenance_rx) =
watch::channel(Ok(SvcsInMaintenanceResult::new()));
tokio::spawn(async move {
poll_smf_services_in_maintenance(
log,
smf_services_in_maintenance_tx,
)
.await
});

If the health monitor stopped working at that moment, we would record a time_of_status without services or errors and it would appear as if there are no services in maintenance. In reality, this could be false as we haven't ran the svcs command yet.

Due to this I think I'd like to keep the time_of_status as an option. Unless there's a better way to express that the health monitor hasn't run?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[inventory] Add svcs in maintenance types to DB

3 participants