RFC: User-specified epoll flags #6084
Comments
This seems reasonable enough. I think the main question here is how the new
Pushing against the POC patch approach, as it does not align with my future goals for this library.
@Nerdy5k what are your goals, and how does this impact your ability to use this library?
I want to keep the metal io approach as much as possible without delegating to separate api workers.
This doesn't force you to change how you use tokio or mio. It just opens up new options for others who are currently using shared-nothing.
@Nerdy5k could you elaborate on what you mean here? We aren't talking about changing the innards of tokio in any way that modifies existing behavior, merely adding a new way to construct registered sockets. This doesn't impact the current IO approach; it just allows a new way to interface with it. I'm not sure what you mean by "separate API workers". I suspect this is the result of confusion?
…ied epoll flags WIP fix for #6084. This currently only adds support for TcpListener.
I put up the POC here: #6089
The blog post uses level-triggered notification, which allows the code to perform 1 Can you address this?
Sure! You bring up a good point here: while there are valid reasons to use This slipped my mind earlier in the convo, but was one of the reasons that I crafted the patch this way, with users controlling their own flags including interests. Thanks for reminding me of this; I need to add some notes to the documentation regarding this case.
The problem
When building shared-nothing systems at scale, load balancing new connections is a significant challenge.
To summarize that blog post, SO_REUSEPORT can often introduce new sources of tail latency because it splits new connections into per-socket queues regardless of whether the workers behind those queues are currently busy. A worker parked on epoll_wait is generally an excellent candidate for a new connection compared to a worker currently handling existing connections, as a worker that is already looking for work clearly has the capacity to accept it. Note that even eBPF-based SO_REUSEPORT load balancing isn't ideal here, as an eBPF program can't really be that smart. Therefore, it tends to be better for latency to load balance with epoll. Unfortunately, this can't be done with the vanilla set of options: normally, if multiple epoll instances watch the same socket, all of them get the notification, leading to a thundering herd.
Fortunately, the EPOLLEXCLUSIVE flag resolves this issue by ensuring that only one waiting epoll instance with the flag set for the particular interest will get the notification. EPOLLEXCLUSIVE is, as a result, extraordinarily useful for at-scale shared-nothing systems. It isn't always the best approach depending on how sensitive a system is to even load balancing vs TTFB, but it's an important element of any shared-nothing toolbox.
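As a rough illustration of the difference (a sketch written for this discussion, not code from the blog post or from tokio), the hypothetical helper count_woken below forks several workers that each block in epoll_wait on their own epoll instance, all registered against one shared listening socket. It then delivers a single connection and counts how many workers woke up. The helper names, the loopback address, the worker limit, and the 200 ms settling delay are all assumptions made for the example.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_WORKERS 16 /* arbitrary cap for this sketch */

/* Hypothetical helper: report a failed call and bail out. */
void check_or_die(int ok, const char *what)
{
    if (!ok) {
        perror(what);
        _exit(2);
    }
}

/* Hypothetical helper: returns how many of `workers` processes were woken
 * by ONE incoming connection when each registers the shared listener with
 * EPOLLIN | extra_flags on its own epoll instance. */
int count_woken(int workers, uint32_t extra_flags)
{
    assert(workers > 0 && workers <= MAX_WORKERS);
    pid_t pids[MAX_WORKERS];

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    check_or_die(lfd >= 0, "socket");

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */
    socklen_t len = sizeof addr;
    check_or_die(bind(lfd, (struct sockaddr *)&addr, len) == 0, "bind");
    check_or_die(listen(lfd, 16) == 0, "listen");
    check_or_die(getsockname(lfd, (struct sockaddr *)&addr, &len) == 0,
                 "getsockname");

    for (int i = 0; i < workers; i++) {
        pids[i] = fork();
        check_or_die(pids[i] >= 0, "fork");
        if (pids[i] == 0) {
            /* Worker: private epoll instance, shared listening socket. */
            int epfd = epoll_create1(0);
            check_or_die(epfd >= 0, "epoll_create1");
            struct epoll_event ev = { .events = EPOLLIN | extra_flags };
            check_or_die(epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev) == 0,
                         "epoll_ctl");
            struct epoll_event out;
            int n = epoll_wait(epfd, &out, 1, 1000); /* 1 s timeout */
            _exit(n == 1 ? 1 : 0); /* exit status 1 means "woken" */
        }
    }

    /* Crude synchronization: give every worker time to block. */
    poll(NULL, 0, 200);

    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    check_or_die(cfd >= 0, "socket");
    check_or_die(connect(cfd, (struct sockaddr *)&addr, sizeof addr) == 0,
                 "connect");

    int woken = 0;
    for (int i = 0; i < workers; i++) {
        int status = 0;
        check_or_die(waitpid(pids[i], &status, 0) == pids[i], "waitpid");
        if (WIFEXITED(status) && WEXITSTATUS(status) == 1)
            woken++;
    }
    close(cfd);
    close(lfd);
    return woken;
}
```

Calling count_woken(4, 0) should report all four workers, since every level-triggered epoll instance sees the pending connection. count_woken(4, EPOLLEXCLUSIVE) typically reports one on a kernel that supports the flag (Linux 4.5 or later), although epoll_ctl(2) only promises that one or more instances receive the event, and a worker that registers after the connection has already arrived will still see it.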
At Cloudflare, we have services which use tokio in both shared-nothing and work-stealing configurations and make extensive use of EPOLLEXCLUSIVE and other atypical epoll flags. Based on our experience serving diverse types of traffic at scale, we think that allowing users to leverage custom epoll flags would make tokio a significantly more powerful toolkit for users working on shared-nothing systems.
The solution
I have a POC patch, which I can push later, that adds a new from_std variant to several types (currently just the TCP and AF_UNIX stream listeners). The variant lets the caller specify the exact set of epoll flags to use when registering the socket with our epoll descriptor. If we made this fallible, it wouldn't block the use of io_uring or similar in the future, as we could just document that this only works if you are using epoll. We could potentially do that only with AsyncFd, or with the listener types as I implemented in the POC, or both.
We could also try to add EPOLLEXCLUSIVE as a new IO interest that users can specify, but this has all of the issues of the POC approach I took, while being more complicated for us to implement and less flexible for users. For that reason, I'd recommend something along the lines of the first option.
If this RFC is accepted, I can take responsibility for the implementation of this.
Because Mio exposes the raw fd of the epoll instance, it can be bypassed entirely for the purposes of implementing this functionality in Tokio. As a result, Mio support is not a prerequisite for Tokio having this functionality.
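At the system-call level, registering against a raw epoll descriptor with caller-chosen flags could look roughly like the sketch below. This is not the POC patch: register_with_flags and probe_flags are hypothetical helpers, and epfd simply stands in for the raw epoll fd mentioned above. The point it illustrates is why a fallible API is a natural fit: the kernel itself rejects some flag combinations, so the error has to reach the caller.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register `fd` on an existing epoll instance with
 * exactly the caller's flags. Returns 0 on success or a negative errno,
 * so callers can surface flag sets the kernel refuses. */
int register_with_flags(int epfd, int fd, uint32_t events)
{
    struct epoll_event ev = { .events = events };
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) != 0)
        return -errno;
    return 0;
}

/* Hypothetical helper: try a flag set against a throwaway epoll instance
 * and pipe, then clean up. Returns what register_with_flags returned. */
int probe_flags(uint32_t events)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -errno;
    int epfd = epoll_create1(0);
    int rc = epfd < 0 ? -errno : register_with_flags(epfd, fds[0], events);
    if (epfd >= 0)
        close(epfd);
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

On a kernel that supports the flag, probe_flags(EPOLLIN | EPOLLEXCLUSIVE) should succeed, while probe_flags(EPOLLIN | EPOLLRDHUP | EPOLLEXCLUSIVE) should fail with -EINVAL, because epoll_ctl(2) allows only a small set of event bits alongside EPOLLEXCLUSIVE. Letting users pick the full flag set, interests included, keeps that decision in their hands.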