-
Notifications
You must be signed in to change notification settings - Fork 48
Reduce oximeter
's reliance on async channels
#8663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Use `watch` channels to communicate between `oximeter`'s tasks, instead of `mpsc`. This reduces async and avoids questions of how to handle full queues. - Use a sync `Mutex` instead of an async one internally, since creating a collection task and most operations on it are synchronous (through the watch channels). - This is all intended to help with the responsiveness issues around #8595. In that case, a slow task inserting data into the database can wedge `oximeter` entirely, since the queues used internally have filled up. By moving to `watch` channels, we can at least ask questions about `oximeter`'s state even if the tasks actually doing collections or database insertions are stuck or slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, just a few questions
@@ -1017,29 +991,6 @@ mod tests { | |||
logctx.cleanup_successful(); | |||
} | |||
|
|||
#[tokio::test] | |||
async fn test_delete_nonexistent_producer_succeeds() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this test no longer work with the new APIs?
(I know there's no return code, but might not be bad to assert that it doesn't panic)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does work, but I thought it was not very useful since the method is now infallible. I can put it back, I don't feel very strongly either way.
I ran an ad-hoc test locally, trying to repro similar behavior to what we see in #8595 on Dogfood. I installed Omicron locally on my dev machine, and manually ran
and
During this time, we can continue to query oximeter for its state, including the list of producers and the details from those:
Eventually, those errors become visible in the producer details:
Restarting the ClickHouse service with
So |
- Don't store producer endpoint info - Use watch instead of notify for shutdown - Consume self in shutdown method
watch
channels to communicate betweenoximeter
's tasks, instead ofmpsc
. This reduces async and avoids questions of how to handle full queues.Mutex
instead of an async one internally, since creating a collection task and most operations on it are synchronous (through the watch channels).oximeter
entirely, since the queues used internally have filled up. By moving towatch
channels, we can at least ask questions aboutoximeter
's state even if the tasks actually doing collections or database insertions are stuck or slow.