Reduce `oximeter`'s reliance on async channels #8663

bnaecker · 2025-07-22T20:49:02Z

- Use `watch` channels to communicate between `oximeter`'s tasks, instead of `mpsc`. This reduces the amount of async code and avoids questions of how to handle full queues.
- Use a sync `Mutex` instead of an async one internally, since creating a collection task and most operations on it are synchronous (through the `watch` channels).
- This is all intended to help with the responsiveness issues described in #8595 ("Oximeter frequently hit SQL query timeout errors on rack2"). In that case, a slow task inserting data into the database can wedge `oximeter` entirely once the queues used internally have filled up. By moving to `watch` channels, we can at least ask questions about `oximeter`'s state even if the tasks actually doing collections or database insertions are stuck or slow.
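To make the difference from a queue concrete, here is a minimal std-only sketch of the "latest value" semantics that a watch channel provides. It is not tokio's API and not the code in this PR: `Slot`, `send`, `borrow`, and `wait_changed` are hypothetical names chosen for illustration, and the real change uses async `watch` channels rather than a `Condvar`.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical latest-value slot: a std-only stand-in for the semantics of a
/// watch channel. This is an illustration, not tokio's API.
pub struct Slot<T> {
    inner: Mutex<(u64, T)>, // (version, most recent value)
    changed: Condvar,
}

impl<T: Clone> Slot<T> {
    pub fn new(initial: T) -> Arc<Self> {
        Arc::new(Self {
            inner: Mutex::new((0, initial)),
            changed: Condvar::new(),
        })
    }

    /// Overwrite the stored value. There is only ever one value, so this
    /// cannot report "full" and does not wait for a slow reader to catch up.
    pub fn send(&self, value: T) {
        let mut guard = self.inner.lock().unwrap();
        guard.0 += 1;
        guard.1 = value;
        self.changed.notify_all();
    }

    /// Read the latest version and value without waiting for a change.
    pub fn borrow(&self) -> (u64, T) {
        let guard = self.inner.lock().unwrap();
        (guard.0, guard.1.clone())
    }

    /// Block until the version differs from `seen`, then return the new pair.
    pub fn wait_changed(&self, seen: u64) -> (u64, T) {
        let mut guard = self.inner.lock().unwrap();
        while guard.0 == seen {
            guard = self.changed.wait(guard).unwrap();
        }
        (guard.0, guard.1.clone())
    }
}
```

A writer can call `send` as often as it likes; a reader that falls behind skips intermediate values and only sees the most recent one. That trade-off (losing history in exchange for never filling up) is what makes this shape a fit for state and requests, and a poor fit for data that must not be dropped.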


smklein

looks good, just a few questions

smklein · 2025-07-22T21:50:56Z

oximeter/collector/src/agent.rs

@@ -1017,29 +991,6 @@ mod tests {
        logctx.cleanup_successful();
    }

-    #[tokio::test]
-    async fn test_delete_nonexistent_producer_succeeds() {


Does this test no longer work with the new APIs?

(I know there's no return code, but might not be bad to assert that it doesn't panic)

It does work, but I thought it was not very useful since the method is now infallible. I can put it back, I don't feel very strongly either way.
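For reference, a minimal sketch of what "infallible" means for such a method, and of the property a restored test could still check. `Producers`, `register`, `delete`, and `len` are hypothetical stand-ins, not the agent's actual types or methods.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the agent's table of producers, keyed by ID.
pub struct Producers {
    tasks: Mutex<BTreeMap<u64, String>>,
}

impl Producers {
    pub fn new() -> Self {
        Self { tasks: Mutex::new(BTreeMap::new()) }
    }

    /// Record a producer and the address it is reachable at.
    pub fn register(&self, id: u64, address: &str) {
        self.tasks.lock().unwrap().insert(id, address.to_string());
    }

    /// Infallible: deleting an ID that was never registered is a no-op,
    /// so there is no error to return.
    pub fn delete(&self, id: u64) {
        self.tasks.lock().unwrap().remove(&id);
    }

    /// Number of producers currently registered.
    pub fn len(&self) -> usize {
        self.tasks.lock().unwrap().len()
    }
}
```

A test in the spirit of the removed one would register a producer, delete an unknown ID, and assert that the call returns without panicking and that the table is unchanged.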


bnaecker · 2025-07-23T03:03:04Z

I ran an ad-hoc test locally, trying to repro similar behavior to what we see in #8595 on Dogfood. I installed Omicron locally on my dev machine, and manually ran `svcadm disable clickhouse` to stop the ClickHouse server inside the zone. Oximeter pretty quickly started spewing errors like:

01:05:26.212Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    error = Native protocol error: TCP connection to server disconnected
    file = oximeter/collector/src/results_sink.rs:92
01:05:26.306Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    error = Failed to check out connection to database: No backends found for this service
    file = oximeter/collector/src/results_sink.rs:92

and

01:06:01.207Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    error = Failed to check out connection to database: Backends exist, but none are online
    file = oximeter/collector/src/results_sink.rs:92
01:06:01.208Z DEBG oximeter (oximeter-agent): inserting 3 samples into database
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
01:06:01.208Z DEBG oximeter (oximeter-agent): reporting oximeter self-collection statistics
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    producer_id = 2af08f47-49d7-4f39-8f12-ec3c45a8ba0f
01:06:01.208Z DEBG oximeter (oximeter-agent): sent timer-based collection request to the collection task
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    producer_id = 6dd2f5ed-3c5a-45c9-bef8-2e9bf292c41f
01:06:01.208Z ERRO oximeter (oximeter-agent): timer-based collection request queue is full! This may indicate that the producer has a sampling interval that is too fast for the amount of data it generates
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    file = oximeter/collector/src/collection_task.rs:833
    interval = 1s
    producer_id = 6dd2f5ed-3c5a-45c9-bef8-2e9bf292c41f
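The "queue is full" error in the last log entry is the generic behavior of a bounded queue whose consumer has stopped draining it. A small std-only sketch of that failure mode follows, using `std::sync::mpsc::sync_channel` as a stand-in for the async queue in the collector; `count_rejected` is a hypothetical helper for illustration.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Try to enqueue `n` requests on a bounded queue of `capacity` that nobody
/// is draining, and return how many were rejected because the queue was full.
pub fn count_rejected(capacity: usize, n: usize) -> usize {
    // Keep the receiver alive but never read from it: a stuck consumer.
    let (tx, _rx) = sync_channel::<usize>(capacity);
    let mut rejected = 0;
    for i in 0..n {
        match tx.try_send(i) {
            Ok(()) => {}
            // The queue already holds `capacity` items, so this one is refused.
            Err(TrySendError::Full(_)) => rejected += 1,
            // Not reachable here, since `_rx` outlives the loop.
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    rejected
}
```

With a capacity of 4 and 10 attempted sends, the first 4 are accepted and the remaining 6 are rejected. The sender then has to decide whether to drop, retry, or block, which is the question a latest-value channel sidesteps.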

During this time, we can continue to query oximeter for its state, including the list of producers and the details from those:

bnaecker@shale : ~/omicron $ ./target/release/omdb oximeter list-producers
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
Collector ID: 41135a29-4099-4397-839e-a6f448aa2fc0

Last refresh: 2025-07-23 01:05:46.355663135 UTC

ID                                   ADDRESS                       INTERVAL
0e57b087-018c-4295-8481-23c34d30979b [fd00:1122:3344:101::b]:48066 10s
29713898-16f1-4fd6-bed9-4e01de20af3a [fd00:1122:3344:101::a]:57064 10s
2af08f47-49d7-4f39-8f12-ec3c45a8ba0f [fd00:1122:3344:101::1]:8001  1s
6dd2f5ed-3c5a-45c9-bef8-2e9bf292c41f [fd00:1122:3344:101::2]:4677  1s
a6a01025-4ae1-4596-9bee-9ccbd7107ced [fd00:1122:3344:101::2]:57179 10s
b2e70484-434b-4174-a934-db7b248d2a14 [fd00:1122:3344:101::2]:8001  1s
ea33faeb-fcd5-4a94-adc7-5102f00d45e7 [fd00:1122:3344:101::1]:49427 30s
ec490bdb-2452-44c6-b471-193f459634c6 [fd00:1122:3344:101::2]:59655 30s
f611dcdd-57fd-44de-8168-487895849270 [fd00:1122:3344:101::c]:38203 10s
bnaecker@shale : ~/omicron $ ./target/release/omdb oximeter producer-details 0e57b087-018c-4295-8481-23c34d30979b
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223

          ID: 0e57b087-018c-4295-8481-23c34d30979b
     Address: [fd00:1122:3344:101::b]:48066
  Registered: 2025-07-23T00:57:31.344Z
     Updated: 2025-07-23T00:57:31.344Z
    Interval: 10s
   Successes: 50
    Failures: 0

Last success:
  Started at: 2025-07-23T01:05:41.356Z
  Queued for: 9.57µs
    Duration: 1.889616ms
     Samples: 5

Last failure: None

Eventually, those errors become visible in the producer details:

bnaecker@shale : ~/omicron $ ./target/release/omdb oximeter producer-details 2af08f47-49d7-4f39-8f12-ec3c45a8ba0f
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223

          ID: 2af08f47-49d7-4f39-8f12-ec3c45a8ba0f
     Address: [fd00:1122:3344:101::1]:8001
  Registered: 2025-07-23T00:57:46.345Z
     Updated: 2025-07-23T00:57:46.345Z
    Interval: 1s
   Successes: 472
    Failures: 152

Last success:
  Started at: 2025-07-23T01:08:09.360Z
  Queued for: 9.61µs
    Duration: 2.100948ms
     Samples: 26

Last failure:
  Started at: 2025-07-23T01:08:06.077Z
  Queued for: 0ns
    Duration: 0ns
      Reason: collections in progress

Restarting the ClickHouse service with `svcadm restart clickhouse` resolved those errors, since oximeter was able to insert into the database again:

01:08:06.020Z DEBG oximeter (oximeter-agent): unrolling 36 total samples
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    id = 0b794764-a8c5-46dc-8e53-4429385d1d37
01:08:06.020Z DEBG oximeter (oximeter-agent): collecting from producer
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    producer_id = ea33faeb-fcd5-4a94-adc7-5102f00d45e7
01:08:06.024Z DEBG oximeter (oximeter-agent): inserted rows into table
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    id = 0b794764-a8c5-46dc-8e53-4429385d1d37
    n_rows = 46
    table_name = fields_string
01:08:06.025Z DEBG oximeter (oximeter-agent): inserted rows into table
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    id = 0b794764-a8c5-46dc-8e53-4429385d1d37
    n_rows = 10
    table_name = fields_u16
01:08:06.026Z DEBG oximeter (oximeter-agent): inserted rows into table
    collector_id = 41135a29-4099-4397-839e-a6f448aa2fc0
    collector_ip = fd00:1122:3344:101::d
    id = 0b794764-a8c5-46dc-8e53-4429385d1d37
    n_rows = 38
    table_name = fields_uuid

So oximeter is much more responsive here, even when the database disappears. This isn't an exact repro for the timed-out queries in #8595, but it still gives some more confidence in the changes here.


bnaecker requested review from jgallagher and smklein July 22, 2025 21:45

smklein reviewed Jul 22, 2025


jgallagher reviewed Jul 23, 2025


Review feedback

6bf3984

- Don't store producer endpoint info
- Use watch instead of notify for shutdown
- Consume self in shutdown method
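A rough std-only sketch of the last two items, under stated assumptions: `TaskHandle`, `spawn`, and `shutdown` are hypothetical names, and a `Mutex<bool>` plus `Condvar` stands in for the stored value of a watch channel. The actual collection task is async and is not reproduced here.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical handle to a background task.
pub struct TaskHandle {
    shutdown: Arc<(Mutex<bool>, Condvar)>,
    worker: JoinHandle<()>,
}

impl TaskHandle {
    pub fn spawn() -> Self {
        let shutdown = Arc::new((Mutex::new(false), Condvar::new()));
        let signal = Arc::clone(&shutdown);
        let worker = thread::spawn(move || {
            let (flag, changed) = &*signal;
            let mut stop = flag.lock().unwrap();
            // The flag is stored state, so a shutdown requested before this
            // thread starts waiting is still observed by the first check.
            while !*stop {
                stop = changed.wait(stop).unwrap();
            }
        });
        Self { shutdown, worker }
    }

    /// Takes `self` by value: after shutdown the handle no longer exists, so
    /// using it again is a compile-time error rather than a runtime one.
    pub fn shutdown(self) -> thread::Result<()> {
        let (flag, changed) = &*self.shutdown;
        *flag.lock().unwrap() = true;
        changed.notify_all();
        self.worker.join()
    }
}
```

Calling `TaskHandle::spawn().shutdown()` returns `Ok(())` once the worker has seen the flag and exited, regardless of whether the request arrives before or after the worker begins waiting.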

bnaecker requested review from jgallagher and smklein July 23, 2025 18:09

bnaecker added 2 commits July 23, 2025 14:24

clippy

4dc3cec

Merge branch 'main' into oximeter-watch-channels

aa57143

jgallagher approved these changes Jul 23, 2025


bnaecker merged commit e770728 into main Jul 23, 2025
16 checks passed

bnaecker deleted the oximeter-watch-channels branch July 23, 2025 23:07
