-
Notifications
You must be signed in to change notification settings - Fork 66
[inventory] Add svcs in maintenance to DB #9589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| #[derive(Queryable, Clone, Debug, Selectable, Insertable)] | ||
| #[diesel(table_name = inv_health_monitor_svc_in_maintenance)] | ||
| pub struct InvSvcInMaintenance { | ||
| pub inv_collection_id: DbTypedUuid<CollectionKind>, | ||
| pub sled_id: DbTypedUuid<SledKind>, | ||
| pub id: DbTypedUuid<SvcInMaintenanceKind>, | ||
| pub fmri: Option<String>, | ||
| pub zone: Option<String>, | ||
| pub error_messages: Vec<String>, | ||
| pub svcs_cmd_error: Option<String>, | ||
| pub time_of_status: Option<DateTime<Utc>>, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with flattening the SvcsInMaintenanceResult struct instead of creating two tables and joining them. It seemed like unnecessary complexity. This means that all of the rows from each collection will have the same time_of_status, error_messages, and svcs_cmd_error, but it doesn't really feel like a big deal.
pub struct SvcsInMaintenanceResult {
pub services: Vec<SvcInMaintenance>,
pub errors: Vec<String>,
pub time_of_status: Option<DateTime<Utc>>,
}pub struct SvcInMaintenance {
pub fmri: String,
pub zone: String,
}Open to suggestions if there's a better way to go on about this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this because it seems to mix "Rust Vec becomes array of TEXT" and "Rust Vec becomes multiple rows" within the same type. (If we have multiple services entries we'll have multiple rows, but if we have multiple errors entries we'll only have one row.) I think we should pick one or the other; personally I shy away from SQL array because I've found them kinda difficult to work with, so I probably would have made this three tables:
- a toplevel table that always has exactly one row per collection; this would record either
time_of_statusorsvcs_cmd_error(with a CHECK constraint that we have one or the other, IIUC?). - a table for
services - a table for
errors
But I'm not sure if I'm overrotating on bad experiences with SQL arrays. @davepacheco might have an opinion here?
I guess the current model also introduces a possible invalid state it'd be nice to avoid: what if we have multiple rows from one collection that ought to have the same time_of_status / errors values, but don't? Not super likely to be a problem in practice, but sometimes we've been burned by adding this kind of structure and then later realizing it's actually kinda painful to deal with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also avoid arrays in SQL, also possibly for not-good reasons. I usually use a separate table like you've said here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spoke about this at the update watercooler. Will update the PR with 3 tables
| #[derive(Queryable, Clone, Debug, Selectable, Insertable)] | ||
| #[diesel(table_name = inv_health_monitor_svc_in_maintenance)] | ||
| pub struct InvSvcInMaintenance { | ||
| pub inv_collection_id: DbTypedUuid<CollectionKind>, | ||
| pub sled_id: DbTypedUuid<SledKind>, | ||
| pub id: DbTypedUuid<SvcInMaintenanceKind>, | ||
| pub fmri: Option<String>, | ||
| pub zone: Option<String>, | ||
| pub error_messages: Vec<String>, | ||
| pub svcs_cmd_error: Option<String>, | ||
| pub time_of_status: Option<DateTime<Utc>>, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this because it seems to mix "Rust Vec becomes array of TEXT" and "Rust Vec becomes multiple rows" within the same type. (If we have multiple services entries we'll have multiple rows, but if we have multiple errors entries we'll only have one row.) I think we should pick one or the other; personally I shy away from SQL array because I've found them kinda difficult to work with, so I probably would have made this three tables:
- a toplevel table that always has exactly one row per collection; this would record either
time_of_statusorsvcs_cmd_error(with a CHECK constraint that we have one or the other, IIUC?). - a table for
services - a table for
errors
But I'm not sure if I'm overrotating on bad experiences with SQL arrays. @davepacheco might have an opinion here?
I guess the current model also introduces a possible invalid state it'd be nice to avoid: what if we have multiple rows from one collection that ought to have the same time_of_status / errors values, but don't? Not super likely to be a problem in practice, but sometimes we've been burned by adding this kind of structure and then later realizing it's actually kinda painful to deal with.
| // made and any parsing errors we may have collected. | ||
| Ok(svcs) | ||
| if svcs.services.is_empty() | ||
| && svcs.time_of_status.is_some() => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would time_of_status be None if we have an Ok(svcs) value for smf_services_in_mainenance? Doesn't that mean we got a status back (and therefore there must be some time at which it was collected)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would only really happen when testing or running the simulated omicron on a machine that isn't running illumos.
omicron/illumos-utils/src/svcs.rs
Lines 44 to 52 in 7a5f735
| #[cfg(not(target_os = "illumos"))] | |
| pub async fn in_maintenance( | |
| log: &Logger, | |
| ) -> Result<SvcsInMaintenanceResult, ExecutionError> { | |
| info!(log, "OS not illumos, will not check state of SMF services"); | |
| let svcs_result = SvcsInMaintenanceResult::new(); | |
| Ok(svcs_result) | |
| } | |
| } |
omicron/illumos-utils/src/svcs.rs
Lines 64 to 67 in 7a5f735
| impl SvcsInMaintenanceResult { | |
| pub fn new() -> Self { | |
| Self { services: vec![], errors: vec![], time_of_status: None } | |
| } |
I can add a comment explaining this. I didn't want to add a row because it could be confusing to see just a bunch of rows with just IDs and could throw people off, but it probably doesn't really matter much either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we change SvcsInMaintenanceResult::new() to fill in now() instead, and make it non-optional? Realizing that's basically the same thing I'm asking on #9615 (comment) for the zpool checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that when I was first working on adding the type. It's basically to be able to tell if the svcs command ran or not. If it ran there's a time of status, if it didn't then there is none. Then again, this would only happen on non-illumos environments, so probably not super important to have a differentiator here? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then again, this would only happen on non-illumos environments, so probably not super important to have a differentiator here?
Yeah, I don't think it's worth making the type more complex here just to give us this bit of information. There are plenty of other places where non-illumos systems just fill in some default / dummy data, and I think that's the right choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, actually, what I said previously isn't quite true. There is a moment where we create a new SvcsInMaintenanceResult, but we haven't ran the svcs command yet in HealthMonitorHandle::new()
omicron/sled-agent/health-monitor/src/handle.rs
Lines 32 to 45 in e6961bc
| pub fn spawn(log: Logger) -> Self { | |
| // Spawn a task to retrieve information about services in maintenance | |
| info!(log, "Starting SMF service health poller"); | |
| let (smf_services_in_maintenance_tx, smf_services_in_maintenance_rx) = | |
| watch::channel(Ok(SvcsInMaintenanceResult::new())); | |
| tokio::spawn(async move { | |
| poll_smf_services_in_maintenance( | |
| log, | |
| smf_services_in_maintenance_tx, | |
| ) | |
| .await | |
| }); |
If the health monitor stopped working at that moment, we would record a time_of_status without services or errors and it would appear as if there are no services in maintenance. In reality, this could be false as we haven't ran the svcs command yet.
Due to this I think I'd like to keep the time_of_status as an option. Unless there's a better way to express that the health monitor hasn't run?
This commit adds the existing health monitor information (services in maintenance) to the database.
Follow up to #9434
With services in maintenance
Without services in maintenance
Health monitor isn't running for whatever reason
Errors when parsing
svcsresponseSame error but there are no services in maintenance
An error was returned when attempting to run
svcs. There may or may not be services in maintenanceCloses: #9516