Reduce oximeter's reliance on async channels #8663

Merged: 4 commits, Jul 23, 2025

Conversation

bnaecker

  • Use `watch` channels to communicate between `oximeter`'s tasks, instead of `mpsc`. This reduces the amount of async code and avoids the question of how to handle full queues.
  • Use a sync `Mutex` instead of an async one internally, since creating a collection task and most operations on it are synchronous (through the `watch` channels).
  • This is all intended to help with the responsiveness issues in #8595 ("Oximeter frequently hit SQL query timeout errors on rack2"). In that case, a slow task inserting data into the database can wedge `oximeter` entirely once the internal queues fill up. By moving to `watch` channels, we can at least ask questions about `oximeter`'s state even if the tasks actually doing collections or database insertions are stuck or slow.
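
To illustrate the difference described in the list above, here is a minimal std-only sketch. It is not the code in this PR: the PR uses tokio's `watch` and `mpsc` channels, while `LatestSlot` and `rejected_by_full_queue` are hypothetical names made up for this example. The sketch contrasts a bounded queue whose consumer is stuck with a slot that only ever holds the latest value.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::sync::Mutex;

/// Hypothetical stand-in for a watch-style channel: it stores only the
/// latest value plus a version counter, so it can never be "full".
struct LatestSlot<T> {
    state: Mutex<(u64, T)>, // (version, value)
}

impl<T: Clone> LatestSlot<T> {
    fn new(initial: T) -> Self {
        Self { state: Mutex::new((0, initial)) }
    }

    /// Overwrite the stored value and bump the version.
    /// This never waits for a reader to catch up.
    fn send(&self, value: T) {
        let mut guard = self.state.lock().unwrap();
        guard.0 += 1;
        guard.1 = value;
    }

    /// Return the current version and a copy of the latest value.
    fn latest(&self) -> (u64, T) {
        let guard = self.state.lock().unwrap();
        (guard.0, guard.1.clone())
    }
}

/// Push `n` requests into a bounded queue that nobody drains and count
/// how many are rejected because the queue is full.
fn rejected_by_full_queue(capacity: usize, n: usize) -> usize {
    // `_rx` stays alive but is never read, modelling a stuck consumer.
    let (tx, _rx) = sync_channel::<usize>(capacity);
    (0..n)
        .filter(|i| matches!(tx.try_send(*i), Err(TrySendError::Full(_))))
        .count()
}

fn main() {
    // Bounded queue, stuck consumer: everything past capacity is rejected.
    println!("rejected: {}", rejected_by_full_queue(2, 5)); // prints "rejected: 3"

    // Latest-value slot, no reader at all: every send succeeds.
    let slot = LatestSlot::new(0usize);
    for i in 1..=5 {
        slot.send(i);
    }
    println!("latest: {:?}", slot.latest()); // prints "latest: (5, 5)"
}
```

With the bounded queue, the sender has to decide what to do with each rejected request. With the latest-value slot that question does not arise, at the cost of intermediate values being overwritten.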

@bnaecker bnaecker requested review from jgallagher and smklein July 22, 2025 21:45
@smklein smklein left a comment

looks good, just a few questions

@@ -1017,29 +991,6 @@ mod tests {
        logctx.cleanup_successful();
    }

    #[tokio::test]
    async fn test_delete_nonexistent_producer_succeeds() {

Does this test no longer work with the new APIs?

(I know there's no return code, but might not be bad to assert that it doesn't panic)

@bnaecker

It does work, but I thought it was not very useful since the method is now infallible. I can put it back, I don't feel very strongly either way.
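
For reference, a "does not panic" check for an infallible delete could look roughly like the sketch below. `Registry` and its methods are hypothetical stand-ins written for this example, not the agent's real types, and the real test is an async `#[tokio::test]`.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the agent's set of registered producers.
struct Registry {
    producers: Mutex<BTreeMap<u32, String>>,
}

impl Registry {
    fn new() -> Self {
        Self { producers: Mutex::new(BTreeMap::new()) }
    }

    fn register(&self, id: u32, address: &str) {
        self.producers.lock().unwrap().insert(id, address.to_string());
    }

    /// Infallible: removing an ID that was never registered is a no-op.
    fn delete_producer(&self, id: u32) {
        self.producers.lock().unwrap().remove(&id);
    }

    fn len(&self) -> usize {
        self.producers.lock().unwrap().len()
    }
}

fn main() {
    let registry = Registry::new();
    registry.register(1, "[::1]:8001");

    // Deleting an unknown producer must not panic or disturb other entries.
    registry.delete_producer(42);
    assert_eq!(registry.len(), 1);

    // Deleting a known producer removes it.
    registry.delete_producer(1);
    assert_eq!(registry.len(), 0);
}
```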

@bnaecker

I ran an ad-hoc test locally, trying to reproduce behavior similar to what we see in #8595 on Dogfood. I installed Omicron locally on my dev machine and manually ran `svcadm disable clickhouse` to stop the ClickHouse server inside the zone. Oximeter pretty quickly started spewing errors like:

01:05:26.212Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    error = Native protocol error: TCP connection to server disconnected
    file = oximeter/collector/src/results_sink.rs:92
01:05:26.306Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    error = Failed to check out connection to database: No backends found for this service
    file = oximeter/collector/src/results_sink.rs:92

and

01:06:01.207Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    error = Failed to check out connection to database: Backends exist, but none are online
    file = oximeter/collector/src/results_sink.rs:92
01:06:01.208Z DEBG oximeter (oximeter-agent): inserting 3 samples into database
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
01:06:01.208Z DEBG oximeter (oximeter-agent): reporting oximeter self-collection statistics
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    producer_id = 2af08f47-49d7-4f39-8f12-ec3c45a8ba0f
01:06:01.208Z DEBG oximeter (oximeter-agent): sent timer-based collection request to the collection task
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    producer_id = 6dd2f5ed-3c5a-45c9-bef8-2e9bf292c41f
01:06:01.208Z ERRO oximeter (oximeter-agent): timer-based collection request queue is full! This may indicate that the producer has a sampling interval that is too fast for the amount of data it generates
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    file = oximeter/collector/src/collection_task.rs:833
    interval = 1s
    producer_id = 6dd2f5ed-3c5a-45c9-bef8-2e9bf292c41f

During this time, we can continue to query oximeter for its state, including the list of producers and the details of each one:

bnaecker@shale : ~/omicron $ ./target/release/omdb oximeter list-producers
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
Collector ID: 41135a29-4099-4397-839e-a6f448aa2fc0

Last refresh: 2025-07-23 01:05:46.355663135 UTC

ID                                   ADDRESS                       INTERVAL
0e57b087-018c-4295-8481-23c34d30979b [fd00:1122:3344:101::b]:48066 10s
29713898-16f1-4fd6-bed9-4e01de20af3a [fd00:1122:3344:101::a]:57064 10s
2af08f47-49d7-4f39-8f12-ec3c45a8ba0f [fd00:1122:3344:101::1]:8001  1s
6dd2f5ed-3c5a-45c9-bef8-2e9bf292c41f [fd00:1122:3344:101::2]:4677  1s
a6a01025-4ae1-4596-9bee-9ccbd7107ced [fd00:1122:3344:101::2]:57179 10s
b2e70484-434b-4174-a934-db7b248d2a14 [fd00:1122:3344:101::2]:8001  1s
ea33faeb-fcd5-4a94-adc7-5102f00d45e7 [fd00:1122:3344:101::1]:49427 30s
ec490bdb-2452-44c6-b471-193f459634c6 [fd00:1122:3344:101::2]:59655 30s
f611dcdd-57fd-44de-8168-487895849270 [fd00:1122:3344:101::c]:38203 10s
bnaecker@shale : ~/omicron $ ./target/release/omdb oximeter producer-details 0e57b087-018c-4295-8481-23c34d30979b
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223

          ID: 0e57b087-018c-4295-8481-23c34d30979b
     Address: [fd00:1122:3344:101::b]:48066
  Registered: 2025-07-23T00:57:31.344Z
     Updated: 2025-07-23T00:57:31.344Z
    Interval: 10s
   Successes: 50
    Failures: 0

Last success:
  Started at: 2025-07-23T01:05:41.356Z
  Queued for: 9.57µs
    Duration: 1.889616ms
     Samples: 5

Last failure: None

Eventually, those errors become visible in the producer details:

bnaecker@shale : ~/omicron $ ./target/release/omdb oximeter producer-details 2af08f47-49d7-4f39-8f12-ec3c45a8ba0f
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223

          ID: 2af08f47-49d7-4f39-8f12-ec3c45a8ba0f
     Address: [fd00:1122:3344:101::1]:8001
  Registered: 2025-07-23T00:57:46.345Z
     Updated: 2025-07-23T00:57:46.345Z
    Interval: 1s
   Successes: 472
    Failures: 152

Last success:
  Started at: 2025-07-23T01:08:09.360Z
  Queued for: 9.61µs
    Duration: 2.100948ms
     Samples: 26

Last failure:
  Started at: 2025-07-23T01:08:06.077Z
  Queued for: 0ns
    Duration: 0ns
      Reason: collections in progress

Restarting the ClickHouse service with `svcadm restart clickhouse` resolved those errors, as oximeter is now able to insert into the database again:

01:08:06.020Z DEBG oximeter (oximeter-agent): unrolling 36 total samples
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    id = 0b794764-a8c5-46dc-8e53-4429385d1d37
01:08:06.020Z DEBG oximeter (oximeter-agent): collecting from producer
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    producer_id = ea33faeb-fcd5-4a94-adc7-5102f00d45e7
01:08:06.024Z DEBG oximeter (oximeter-agent): inserted rows into table
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    id = 0b794764-a8c5-46dc-8e53-4429385d1d37
    n_rows = 46
    table_name = fields_string
01:08:06.025Z DEBG oximeter (oximeter-agent): inserted rows into table
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    id = 0b794764-a8c5-46dc-8e53-4429385d1d37
    n_rows = 10
    table_name = fields_u16
01:08:06.026Z DEBG oximeter (oximeter-agent): inserted rows into table
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    id = 0b794764-a8c5-46dc-8e53-4429385d1d37
    n_rows = 38
    table_name = fields_uuid

So oximeter is much more responsive here, even when the database disappears. This isn't an exact repro for the timed-out queries in #8595, but it still gives some more confidence in the changes here.
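
The pattern behind that responsiveness can be sketched with std threads. This is an illustration under assumptions, not oximeter's code: `Status` and `query_while_insert_is_stuck` are hypothetical names, and a sleeping thread stands in for the stalled database insert. The point is that the shared status sits behind a sync `Mutex` that is held only for short updates, so a status query does not wait on the slow work.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical per-producer counters, similar in spirit to the
/// "Successes" / "Failures" fields in the output above.
#[derive(Clone, Debug, Default, PartialEq)]
struct Status {
    successes: u64,
    failures: u64,
}

/// Start a worker that stalls for `stall`, query the status while it is
/// stalled, and return the snapshot plus how long the query waited.
fn query_while_insert_is_stuck(stall: Duration) -> (Status, Duration) {
    let status = Arc::new(Mutex::new(Status::default()));
    let (entered_tx, entered_rx) = mpsc::channel();

    let worker = {
        let status = Arc::clone(&status);
        thread::spawn(move || {
            // Record the outcome; the lock guard is dropped at the end of
            // this statement, before the slow section begins.
            status.lock().unwrap().failures += 1;
            entered_tx.send(()).unwrap();
            // Stand-in for a database insert that is not making progress.
            thread::sleep(stall);
        })
    };

    // Wait until the worker is inside its slow section, then query.
    entered_rx.recv().unwrap();
    let started = Instant::now();
    let snapshot = status.lock().unwrap().clone();
    let waited = started.elapsed();

    worker.join().unwrap();
    (snapshot, waited)
}

fn main() {
    let (snapshot, waited) = query_while_insert_is_stuck(Duration::from_millis(300));
    println!("status while worker was stuck: {snapshot:?} (query took {waited:?})");
}
```

If the worker instead held the lock across the whole stalled section, the query would block for the full stall, which is the kind of wedging this PR is trying to avoid.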

- Don't store producer endpoint info
- Use watch instead of notify for shutdown
- Consume self in shutdown method
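
As a rough illustration of the last item, a shutdown method that consumes `self` might be shaped like the sketch below. Everything here is an assumption made for the example: `TaskHandle` is a hypothetical type, a std thread stands in for the async task, and a std `mpsc` channel stands in for the shutdown signal (the commit above uses a `watch` channel for that).

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical handle to a background collection loop.
struct TaskHandle {
    shutdown_tx: mpsc::Sender<()>,
    handle: JoinHandle<u32>,
}

impl TaskHandle {
    /// Spawn a loop that "collects" once per `interval` until told to stop.
    fn spawn(interval: Duration) -> Self {
        let (shutdown_tx, shutdown_rx) = mpsc::channel();
        let handle = thread::spawn(move || {
            let mut ticks: u32 = 0;
            loop {
                match shutdown_rx.recv_timeout(interval) {
                    // No shutdown signal within the interval: do one collection.
                    Err(RecvTimeoutError::Timeout) => ticks += 1,
                    // Shutdown requested, or the handle was dropped: stop.
                    Ok(()) | Err(RecvTimeoutError::Disconnected) => return ticks,
                }
            }
        });
        Self { shutdown_tx, handle }
    }

    /// Takes `self` by value, so the handle cannot be used after shutdown.
    /// Returns how many collections ran.
    fn shutdown(self) -> u32 {
        // The send only fails if the loop has already exited.
        let _ = self.shutdown_tx.send(());
        self.handle.join().expect("collection loop panicked")
    }
}

fn main() {
    let task = TaskHandle::spawn(Duration::from_secs(10));
    let started = Instant::now();
    let ticks = task.shutdown();
    println!("stopped after {ticks} collections in {:?}", started.elapsed());
    // task.shutdown(); // would not compile: `task` was moved by the call above
}
```

Because `shutdown` takes `self`, the compiler rejects any later use of the handle, so a double shutdown is ruled out at compile time rather than needing a runtime check.
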
@bnaecker bnaecker requested review from jgallagher and smklein July 23, 2025 18:09
@bnaecker bnaecker merged commit e770728 into main Jul 23, 2025
16 checks passed
@bnaecker bnaecker deleted the oximeter-watch-channels branch July 23, 2025 23:07