Allow using an app-provided thread #1923

Open
DemiMarie opened this issue Aug 18, 2021 · 21 comments · May be fixed by #4616
Labels
Area: API Area: Core Related to the shared, core protocol logic external Proposed by non-MSFT feature request A request for new functionality
Milestone

Comments

@DemiMarie

Describe the feature you'd like supported

It would be nice if MsQuic allowed apps to provide their own threads, and perform event polling themselves.

Proposed solution

See above.

Additional context

In some environments, such as Lua and Node.js, all callbacks must eventually be run on a single thread. This currently requires marshaling them back to the main thread, which is less efficient than if MsQuic could integrate into the built-in event loop. Other environments, such as Rust with Tokio, already provide their own high-performance event loops, and having to use a separate thread for QUIC would require additional locking.

@nibanks nibanks added Area: API Area: Core Related to the shared, core protocol logic external Proposed by non-MSFT feature request A request for new functionality labels Aug 18, 2021
@nibanks nibanks added this to the Future milestone Aug 18, 2021
@nibanks
Member

nibanks commented Aug 18, 2021

We've discussed possibly supporting this, but never came to any hard conclusion. How would you actually use this if we added support? There is significant work involved, and we wouldn't want to do this unless something would definitely use it.

@DemiMarie
Author

We've discussed possibly supporting this, but never came to any hard conclusion. How would you actually use this if we added support? There is significant work involved, and we wouldn't want to do this unless something would definitely use it.

I don’t have any particular plans myself, and am not in a position where I am likely to use MsQuic in the near future. That said, I would not be surprised if some people have ruled out MsQuic as an option because of this without filing an issue.

@nibanks
Member

nibanks commented Aug 19, 2021

That said, I would not be surprised if some people have ruled out MsQuic as an option because of this without filing an issue.

Perhaps. There is a reason we went with this model of owning the threads in MsQuic though. There is a lot of complexity involved in implementing a performant parallelized networking layer, and by owning the threads in MsQuic we can do all the hard work internally and the apps get it for free. If we add support for this Issue, I do expect apps that use this model to have a significant performance decrease from those that do not.

@DemiMarie
Author

That said, I would not be surprised if some people have ruled out MsQuic as an option because of this without filing an issue.

Perhaps. There is a reason we went with this model of owning the threads in MsQuic though. There is a lot of complexity involved in implementing a performant parallelized networking layer, and by owning the threads in MsQuic we can do all the hard work internally and the apps get it for free. If we add support for this Issue, I do expect apps that use this model to have a significant performance decrease from those that do not.

Is the current model compatible with Node.js, for example, or would that require marshalling?

@nibanks
Member

nibanks commented Aug 19, 2021

Is the current model compatible with Node.js, for example, or would that require marshalling?

I have no experience with Node.js so I cannot answer that.

@thhous-msft
Contributor

I would actually be slightly scared to put the QUIC workers on the UI thread in Node.js. Unlike the current TCP and UDP implementations, which do a TINY amount of work at the user mode level, QUIC is very computationally expensive, including encryption, synchronous DNS lookup, and all the timing requirements for the protocol. I highly suspect running QUIC directly on the UI thread would start to cause the UI to lag. And because you'd be limited to a single thread, you'd lose a lot of perf there as well.

Very little in Node currently is computationally expensive, and anything that is usually is marshalled to a separate thread in some way.

@DemiMarie
Author

I would actually be slightly scared to put the QUIC workers on the UI thread in Node.js. Unlike the current TCP and UDP implementations, which do a TINY amount of work at the user mode level, QUIC is very computationally expensive, including encryption, synchronous DNS lookup, and all the timing requirements for the protocol. I highly suspect running QUIC directly on the UI thread would start to cause the UI to lag. And because you'd be limited to a single thread, you'd lose a lot of perf there as well.

How does that compare to the current TLS client and server? Also, my understanding is that Node.js usually scales by having multiple instances of the server running, or by using multiple contexts. So using 2x the CPU for less than 2x performance is not guaranteed to be a win.

@nibanks
Member

nibanks commented Aug 19, 2021

How does that compare to the current TLS client and server?

Again, I don't know how that's currently done in Node, but TLS is very expensive, so I'd assume it's never done on a blocking thread.

Also, my understanding is that Node.js usually scales by having multiple instances of the server running, or by using multiple contexts. So using 2x the CPU for less than 2x performance is not guaranteed to be a win.

MsQuic scales threads with processor count. Additionally, RSS (receive side scaling) uses dedicated threads per processor to match the NIC's processor receive indications, so it does scale very well, especially on multi-NUMA node machines.

@DemiMarie
Author

That said, I would not be surprised if some people have ruled out MsQuic as an option because of this without filing an issue.

Perhaps. There is a reason we went with this model of owning the threads in MsQuic though. There is a lot of complexity involved in implementing a performant parallelized networking layer, and by owning the threads in MsQuic we can do all the hard work internally and the apps get it for free. If we add support for this Issue, I do expect apps that use this model to have a significant performance decrease from those that do not.

At a minimum, I would like to be able to integrate my own code into MsQuic’s event loop somehow. I might need to handle HTTP/1.1 and HTTP/2 traffic as well, for instance.

@nibanks
Member

nibanks commented Dec 18, 2021

At a minimum, I would like to be able to integrate my own code into MsQuic’s event loop somehow. I might need to handle HTTP/1.1 and HTTP/2 traffic as well, for instance.

@DemiMarie we're doing work on refactoring how scheduling works, and would be happy to take inputs and suggestions. We refactored the QUIC worker thread so that it can be run by another thread:

//
// General purpose execution context abstraction layer. Used for driving worker
// loops.
//

typedef struct CXPLAT_EXECUTION_CONTEXT CXPLAT_EXECUTION_CONTEXT;

//
// Returns FALSE when it's time to cleanup.
//
typedef
_IRQL_requires_max_(PASSIVE_LEVEL)
BOOLEAN
(*CXPLAT_EXECUTION_FN)(
    _Inout_ CXPLAT_EXECUTION_CONTEXT* Context,
    _Inout_ uint64_t* TimeNowUs,    // The current time, in microseconds.
    _In_ CXPLAT_THREAD_ID ThreadID  // The current thread ID.
    );

typedef struct CXPLAT_EXECUTION_CONTEXT {

    void* Context;
    CXPLAT_EXECUTION_FN Callback;
    uint64_t NextTimeUs;
    BOOLEAN Ready;

} CXPLAT_EXECUTION_CONTEXT;

And usage:

// TODO - Add synchronization around this stuff.
uint32_t ExecutionContextCount = 0;
CXPLAT_EXECUTION_CONTEXT* ExecutionContexts[8];

void CxPlatAddExecutionContext(CXPLAT_EXECUTION_CONTEXT* Context)
{
    CXPLAT_FRE_ASSERT(ExecutionContextCount < ARRAYSIZE(ExecutionContexts));
    ExecutionContexts[ExecutionContextCount] = Context;
    ExecutionContextCount++;
}

BOOLEAN CxPlatRunExecutionContexts(_In_ CXPLAT_THREAD_ID ThreadID)
{
    if (ExecutionContextCount == 0) {
        return FALSE;
    }

    uint64_t TimeNow = CxPlatTimeUs64();
    uint32_t i = 0;
    while (i < ExecutionContextCount) {
        CXPLAT_EXECUTION_CONTEXT* Context = ExecutionContexts[i];
        if ((Context->Ready || Context->NextTimeUs <= TimeNow) &&
            !Context->Callback(Context->Context, &TimeNow, ThreadID)) {
            // Remove the context by moving the last entry into this slot, then
            // revisit the slot so the moved entry is not skipped on this pass.
            ExecutionContexts[i] = ExecutionContexts[--ExecutionContextCount];
            continue;
        }
        i++;
    }

    return TRUE;
}

With this model exposed to the API, it would allow the app's thread to drive the execution contexts. The complexity comes in trying to keep things like RSS and CID-based routing working effectively.
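For readers following along, here is a minimal, self-contained sketch of what "the app's thread drives the execution contexts" could look like. It is not MsQuic's API: `EXEC_CONTEXT`, `AddContext`, `RunContexts`, and `DriveFromAppThread` are hypothetical stand-ins modeled on the snippet above, the thread ID parameter is dropped, and the clock is faked. Unlike the snippet above, `RunContexts` here returns false once no contexts remain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-ins for the MsQuic-internal types quoted above.
typedef bool (*EXEC_FN)(void* Context, uint64_t* TimeNowUs);

typedef struct EXEC_CONTEXT {
    void* Context;
    EXEC_FN Callback;    // Returns false when the context should be removed.
    uint64_t NextTimeUs; // Next time the context wants to run.
    bool Ready;          // Set when the context has work queued.
} EXEC_CONTEXT;

#define MAX_CONTEXTS 8
static EXEC_CONTEXT* Contexts[MAX_CONTEXTS];
static uint32_t ContextCount = 0;

static void AddContext(EXEC_CONTEXT* Context)
{
    assert(ContextCount < MAX_CONTEXTS);
    Contexts[ContextCount++] = Context;
}

// Runs every ready or expired context once. Returns false when none remain.
static bool RunContexts(uint64_t TimeNowUs)
{
    uint32_t i = 0;
    while (i < ContextCount) {
        EXEC_CONTEXT* Context = Contexts[i];
        if ((Context->Ready || Context->NextTimeUs <= TimeNowUs) &&
            !Context->Callback(Context->Context, &TimeNowUs)) {
            Contexts[i] = Contexts[--ContextCount]; // Swap-remove, revisit slot.
            continue;
        }
        i++;
    }
    return ContextCount != 0;
}

// Example work item: counts down and asks to be removed at zero.
static bool CountDown(void* Context, uint64_t* TimeNowUs)
{
    int* Remaining = (int*)Context;
    (void)TimeNowUs;
    return --(*Remaining) > 0;
}

// The application's own loop. A real loop would also poll its other I/O
// and timers between calls to RunContexts.
int DriveFromAppThread(void)
{
    int Remaining = 3;
    EXEC_CONTEXT Work = { &Remaining, CountDown, 0, true };
    uint64_t FakeTimeUs = 0; // Hypothetical clock, advanced by hand.
    int Iterations = 0;

    AddContext(&Work);
    while (RunContexts(FakeTimeUs)) {
        Iterations++;
        FakeTimeUs += 1000;
    }
    return Iterations;
}
```

The point of the sketch is only the control flow: the library exposes "run what is ready" as a function, and the app decides which thread calls it and when. It says nothing about how RSS or CID-based routing would be preserved.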

@DemiMarie
Author

@nibanks so one thought I had is to allow the MsQuic event loop to handle other things as well, such as pollable file descriptors on Unix and I/O completion ports and waitable events on Windows. The latter will require using undocumented NT kernel APIs, but I imagine it would not be too hard for you to work around that problem.

As far as RSS and CID-based routing, what are the tricky parts? Would it be possible to decouple the networking code from the state machine, as Quinn does? Would there be a performance penalty in doing so?

@nibanks
Member

nibanks commented Mar 18, 2022

@nibanks so one thought I had is to allow the MsQuic event loop to handle other things as well, such as pollable file descriptors on Unix and I/O completion ports and waitable events on Windows. The latter will require using undocumented NT kernel APIs, but I imagine it would not be too hard for you to work around that problem.

@DemiMarie yes we've thought about designs both where msquic handles everything and where we expose interfaces such that the app can handle everything. Both have complexities, mostly originating from the fact that there is no single, easy pattern that works cross-platform. Just for the datapath layer, epoll, kqueue, iocp, etc. all have slight differences that complicate things.

As far as RSS and CID-based routing, what are the tricky parts? Would it be possible to decouple the networking code from the state machine, as Quinn does? Would there be a performance penalty in doing so?

Anything is possible, but we have to balance complexity and performance. Unlike any other QUIC stack that I know, MsQuic is designed to align RSS all the way up from the NIC even into the application thread; all on the same CPU (if everything is used properly). This is very complicated and difficult to achieve, and providing for a generic interface that other threads could control will make it more difficult.

That isn't to say we don't want to go there. We want to figure out a good way to do this, but still haven't quite achieved it yet.

@nibanks nibanks moved this to Should be written in MsQuic Walkthroughs May 8, 2023
@nibanks nibanks pinned this issue Aug 18, 2023
@bwoebi

bwoebi commented Sep 24, 2023

I just stumbled upon this issue, and it might be worth adding my 2 cents:
I really like the API surface of MsQuic; it feels more complete and usable than any other QUIC implementation I've encountered.

However, I'm at a loss as to how I would integrate it with PHP (via FFI). The PHP model generally requires PHP code to be invoked only from a single thread (and then to use multi-processing if needed for scaling horizontally).
Then, in addition, one generally wants to schedule timers and other I/O on the same thread.
But ultimately all these event loops are doing is "notify me when there's some event waiting for this file descriptor". The easiest way to integrate would thus be being able to choose an executor model where I can just give it my UDP socket handle; it then tells me via callbacks when I should start and stop polling for readability/writability, and I notify the MsQuic executor when the socket becomes ready.

I would like it if MsQuic did not fully decouple the networking, as I definitely appreciate it trying to optimize the networking, setting socket options, etc. I would just need the small task of socket I/O readiness to be abstracted away.
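To make the request concrete, a rough sketch of such a readiness-driven executor follows. Every name in it (`POLL_CALLBACKS`, `EXECUTOR`, `ExecutorInit`, `ExecutorQueueSend`, `ExecutorOnWritable`, `APP_LOOP`) is hypothetical and none of this exists in MsQuic; it assumes the library always wants read notifications and only wants write notifications while sends are pending.

```c
#include <stdbool.h>
#include <stddef.h>

// All names below are hypothetical; none of them exist in MsQuic.
typedef int SOCKET_HANDLE;

// Callbacks the app hands to the library so it can request polling changes.
typedef struct POLL_CALLBACKS {
    void* AppContext;
    void (*SetReadInterest)(void* AppContext, SOCKET_HANDLE Socket, bool Enable);
    void (*SetWriteInterest)(void* AppContext, SOCKET_HANDLE Socket, bool Enable);
} POLL_CALLBACKS;

typedef struct EXECUTOR {
    POLL_CALLBACKS Callbacks;
    SOCKET_HANDLE Socket;
    unsigned PendingSends;
} EXECUTOR;

void ExecutorInit(EXECUTOR* Exec, SOCKET_HANDLE Socket, POLL_CALLBACKS Callbacks)
{
    Exec->Callbacks = Callbacks;
    Exec->Socket = Socket;
    Exec->PendingSends = 0;
    // Assumption: the endpoint always wants to hear about incoming datagrams.
    Callbacks.SetReadInterest(Callbacks.AppContext, Socket, true);
}

// Library side: a send was queued, so writability now matters.
void ExecutorQueueSend(EXECUTOR* Exec)
{
    if (Exec->PendingSends++ == 0) {
        Exec->Callbacks.SetWriteInterest(
            Exec->Callbacks.AppContext, Exec->Socket, true);
    }
}

// App side: called from the app's event loop when the socket is writable.
void ExecutorOnWritable(EXECUTOR* Exec)
{
    // A real library would send a datagram here; this sketch only tracks state.
    if (Exec->PendingSends > 0 && --Exec->PendingSends == 0) {
        Exec->Callbacks.SetWriteInterest(
            Exec->Callbacks.AppContext, Exec->Socket, false);
    }
}

// Stand-in for the app's event loop registration state.
typedef struct APP_LOOP {
    bool WatchRead;
    bool WatchWrite;
} APP_LOOP;

static void AppSetRead(void* Ctx, SOCKET_HANDLE Socket, bool Enable)
{
    (void)Socket;
    ((APP_LOOP*)Ctx)->WatchRead = Enable;
}

static void AppSetWrite(void* Ctx, SOCKET_HANDLE Socket, bool Enable)
{
    (void)Socket;
    ((APP_LOOP*)Ctx)->WatchWrite = Enable;
}

// Returns a bitmask: 1 = read watched, 2 = write watched while a send was
// pending, 4 = write still watched after the send drained.
int DemoReadinessModel(void)
{
    APP_LOOP Loop = { false, false };
    POLL_CALLBACKS Callbacks = { &Loop, AppSetRead, AppSetWrite };
    EXECUTOR Exec;

    ExecutorInit(&Exec, 42, Callbacks);
    ExecutorQueueSend(&Exec);
    bool WatchedWhilePending = Loop.WatchWrite;
    ExecutorOnWritable(&Exec);

    return (Loop.WatchRead ? 1 : 0) |
           (WatchedWhilePending ? 2 : 0) |
           (Loop.WatchWrite ? 4 : 0);
}
```

In this shape the library still owns the socket options and the actual send/receive calls; the app only owns the question of when the descriptor is ready.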

@nibanks
Member

nibanks commented Sep 24, 2023

It's definitely a goal to be able to allow the app thread to drive the execution. It's still a work in progress though. Thanks for the feedback!

@nibanks nibanks linked a pull request Oct 16, 2024 that will close this issue
@redbaron
Contributor

What is being described/requested here is the sans-io model:

By externalising all IO and timers, the library effectively becomes just a state machine. One benefit is ease of porting the library to other platforms: it simply doesn't contain any platform-specific code anymore. We have currently ruled out msquic for one of our projects because plugging in IO for console platforms requires maintaining a fork.

@nibanks
Member

nibanks commented Oct 23, 2024

As you can see from the recently linked draft PR, we're actively working on exposing a way for external control of the execution.

@redbaron
Contributor

I had a look; it is very early work and it is hard to see how it will shape up. It might allow better control over threading, but it looks like it retains a lot of responsibility for IO in msquic, which still makes porting a hassle.

An ideal sans-io interface should accept the time delta since the last poll and a vector of (socket_handle, buffer) tuples, with the library completely unaware of how each one was received; all it knows is that the given socket handle sent us the given buffer. It then returns a vector of (socket_handle, buffer) tuples that it would like the app to send, plus the minimal time delta after which it expects to be polled again to process timeouts.

Realistic payloads passed in and out of a sans-io library are likely to be more complicated, with enums for connection created and closed, reporting of IO errors, support for multiple buffers per socket for scatter/gather IO, etc.
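A toy version of that interface might look like the sketch below. It only illustrates the shape of the call (elapsed time and input datagrams in, output datagrams and next-poll delta out); the names `DATAGRAM`, `POLL_RESULT`, `STATE_MACHINE`, and `SansIoPoll` are hypothetical, and the "protocol" is a placeholder that acknowledges each input and tracks an idle timeout, not QUIC.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical types; this is a sketch of the interface shape, not of QUIC.
typedef uint32_t IO_HANDLE;

typedef struct DATAGRAM {
    IO_HANDLE Handle;    // Opaque identifier owned by the app's IO layer.
    const uint8_t* Data;
    size_t Length;
} DATAGRAM;

#define MAX_OUTGOING 4

typedef struct POLL_RESULT {
    DATAGRAM Outgoing[MAX_OUTGOING]; // What the library wants the app to send.
    size_t OutgoingCount;
    uint64_t NextPollDeltaUs;        // Poll again no later than this.
} POLL_RESULT;

typedef struct STATE_MACHINE {
    uint64_t IdleUs;        // Time since the last received datagram.
    uint64_t IdleTimeoutUs;
    size_t ReceivedBytes;
} STATE_MACHINE;

static const uint8_t AckPayload[] = { 'a', 'c', 'k' };

// Pure function of (state, elapsed time, inputs): no sockets, no clocks.
void SansIoPoll(
    STATE_MACHINE* Sm,
    uint64_t ElapsedUs,
    const DATAGRAM* Incoming,
    size_t IncomingCount,
    POLL_RESULT* Result)
{
    Result->OutgoingCount = 0;
    Sm->IdleUs = (IncomingCount > 0) ? 0 : Sm->IdleUs + ElapsedUs;

    for (size_t i = 0; i < IncomingCount; i++) {
        Sm->ReceivedBytes += Incoming[i].Length;
        if (Result->OutgoingCount < MAX_OUTGOING) {
            // Placeholder reply addressed to the handle the input came from.
            DATAGRAM Reply = { Incoming[i].Handle, AckPayload, sizeof(AckPayload) };
            Result->Outgoing[Result->OutgoingCount++] = Reply;
        }
    }

    Result->NextPollDeltaUs =
        (Sm->IdleUs >= Sm->IdleTimeoutUs) ? 0 : Sm->IdleTimeoutUs - Sm->IdleUs;
}

// Feeds one datagram, then polls again after 10 seconds of silence.
// Returns the remaining idle time in whole seconds, or -1 on a mismatch.
int DemoSansIo(void)
{
    STATE_MACHINE Sm = { 0, 30000000, 0 };
    const uint8_t Hello[] = { 1, 2, 3, 4, 5 };
    DATAGRAM In = { 7, Hello, sizeof(Hello) };
    POLL_RESULT Out;

    SansIoPoll(&Sm, 1000, &In, 1, &Out);
    if (Out.OutgoingCount != 1 || Out.Outgoing[0].Handle != 7 ||
        Out.NextPollDeltaUs != 30000000) {
        return -1;
    }

    SansIoPoll(&Sm, 10000000, NULL, 0, &Out);
    if (Out.OutgoingCount != 0) {
        return -1;
    }
    return (int)(Out.NextPollDeltaUs / 1000000);
}
```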

@nibanks
Member

nibanks commented Oct 23, 2024

That model assumes you have socket handles, which does not always hold with XDP and DPDK.

@DemiMarie
Author

Indeed so. RSS alignment (which really helps performance) is another factor.

@redbaron
Contributor

I don't follow. The socket handle doesn't have to point to an actual kernel socket. Call it io_handle: just an identifier that the IO layer outside of msquic can use to work out how to send bytes there, and that msquic uses to identify which QUIC endpoint the bytes belong to. I am not familiar with XDP or DPDK, but surely they have a notion of source:port, dst:port even if they craft raw packets including all of the IP headers; io_handle can be mapped to these network tuples.

Same for RSS: because all IO is externalised, msquic gives up control over it, and it is up to the IO layer to choose which CPU to run IO on. If required, there can be an msquic instance per CPU for a shared-nothing architecture. With msquic acting just as a state machine, the app has full flexibility in how and when to drive it.
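As an illustration of the io_handle idea, the sketch below keeps a small table in the app's IO layer that maps opaque handles to network tuples, so the state machine never sees a socket. The names (`NET_TUPLE`, `HANDLE_TABLE`, `HandleForTuple`, `TupleForHandle`) are hypothetical, and the tuple is simplified to IPv4 addresses and ports.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical IO-layer bookkeeping; IPv4-only for brevity.
typedef uint32_t IO_HANDLE;
#define INVALID_IO_HANDLE UINT32_MAX
#define MAX_HANDLES 16

typedef struct NET_TUPLE {
    uint32_t SrcAddr;
    uint32_t DstAddr;
    uint16_t SrcPort;
    uint16_t DstPort;
} NET_TUPLE;

typedef struct HANDLE_TABLE {
    NET_TUPLE Tuples[MAX_HANDLES];
    uint32_t Count;
} HANDLE_TABLE;

static int TupleEquals(const NET_TUPLE* A, const NET_TUPLE* B)
{
    return A->SrcAddr == B->SrcAddr && A->DstAddr == B->DstAddr &&
           A->SrcPort == B->SrcPort && A->DstPort == B->DstPort;
}

// Receive path: find (or assign) the handle for the tuple a packet arrived on.
IO_HANDLE HandleForTuple(HANDLE_TABLE* Table, const NET_TUPLE* Tuple)
{
    for (uint32_t i = 0; i < Table->Count; i++) {
        if (TupleEquals(&Table->Tuples[i], Tuple)) {
            return i;
        }
    }
    if (Table->Count == MAX_HANDLES) {
        return INVALID_IO_HANDLE;
    }
    Table->Tuples[Table->Count] = *Tuple;
    return Table->Count++;
}

// Send path: recover the tuple the IO layer needs to build or route a packet.
const NET_TUPLE* TupleForHandle(const HANDLE_TABLE* Table, IO_HANDLE Handle)
{
    return (Handle < Table->Count) ? &Table->Tuples[Handle] : NULL;
}

// Returns 1 if handles are stable per tuple and map back correctly.
int DemoHandleTable(void)
{
    HANDLE_TABLE Table = { .Count = 0 };
    NET_TUPLE A = { 0x0A000001, 0x0A000002, 4433, 50000 };
    NET_TUPLE B = { 0x0A000001, 0x0A000003, 4433, 50001 };

    IO_HANDLE Ha = HandleForTuple(&Table, &A);
    IO_HANDLE Hb = HandleForTuple(&Table, &B);
    IO_HANDLE HaAgain = HandleForTuple(&Table, &A);
    const NET_TUPLE* Back = TupleForHandle(&Table, Hb);

    return (Ha == HaAgain && Ha != Hb && Back != NULL &&
            Back->DstPort == 50001) ? 1 : 0;
}
```

With one such table per CPU, each paired with its own state-machine instance, nothing would need to be shared across cores; whether that matches MsQuic's RSS alignment in practice is exactly the open question in this thread.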

@sekoosay12

Jump

Projects
Status: Should be written

6 participants