IO uring improvements #14158
Conversation
TL;DR: this PR consists of 3 parts:
f2149e2 to fbde7c8 (force-push)
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this in D88172809.
xingbowang left a comment:
Wow. This is a tricky bug to catch. This is a massive improvement on the robustness of IO Uring. I really appreciate the comment message. It helped provide a lot of context. Thank you!
    if (read_again) {
      Slice tmp_slice;
      req->status =
          Read(req->offset + req_wrap->finished_len,
               req->len - req_wrap->finished_len, options, &tmp_slice,
               req->scratch + req_wrap->finished_len, dbg);
      req->result =
          Slice(req->scratch, req_wrap->finished_len + tmp_slice.size());
    }
Why are we calling synchronous Read API here directly?
@xingbowang synchronous re-reading was added to properly handle the corner case where io_uring returns cqe->res == 0. Please see the original PR and the comment by Jens in this PR for context.
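The fallback being discussed here can be illustrated with plain POSIX pread(). The helper below is a hypothetical sketch, not the RocksDB implementation: when the async path reports 0 bytes, a synchronous re-read distinguishes a true EOF from a spurious zero-sized completion, and it also retries on kernel interrupts (EINTR).

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <unistd.h>

// Hypothetical helper (not RocksDB code): returns bytes placed in buf,
// or -1 on error. `async_result` is what the async completion reported.
ssize_t ReadWithZeroFallback(int fd, char* buf, size_t len, off_t offset,
                             ssize_t async_result) {
  if (async_result != 0) return async_result;  // normal completion
  // The async path claimed 0 bytes; re-read synchronously to confirm
  // whether this is really EOF or a completion that raced with another
  // buffered read.
  ssize_t n;
  do {
    n = pread(fd, buf, len, offset);
  } while (n < 0 && errno == EINTR);  // retry on kernel interrupt
  return n;  // a 0 here really is EOF
}
```

A nonzero async result passes through untouched; only the ambiguous zero triggers the second, synchronous read.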
@axboe given that this change (issuing a 2nd read when the 1st one returned 0, to differentiate between EOF and a buffered cached read that raced with another request/user) was added 5 years ago, I'm wondering whether anything has changed since then and whether we could get the correct behavior by default using some of the io_uring initialization flags/constructs?
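Opportunistically enabling newer setup flags like the ones mentioned here typically follows a try-and-fall-back pattern: request the optional flags, and drop them if the kernel rejects the setup. The sketch below simulates the init call so it is self-contained (real code would call io_uring_queue_init_params() from liburing); the flag constants are defined locally for illustration and mirror the values in <linux/io_uring.h>.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Local stand-ins for the kernel's setup flags (illustrative values).
constexpr uint32_t kSetupSingleIssuer = 1u << 12;  // IORING_SETUP_SINGLE_ISSUER
constexpr uint32_t kSetupDeferTaskrun = 1u << 13;  // IORING_SETUP_DEFER_TASKRUN

// Simulated kernel: pretends it only understands flags in `supported`.
int FakeRingInit(uint32_t flags, uint32_t supported) {
  return (flags & ~supported) ? -EINVAL : 0;
}

// Try the fastest configuration first, then progressively drop the
// optional flags until init succeeds; returns the accepted flag set.
uint32_t InitRingWithFallback(uint32_t supported) {
  uint32_t flags = kSetupSingleIssuer | kSetupDeferTaskrun;
  while (FakeRingInit(flags, supported) == -EINVAL && flags != 0) {
    if (flags & kSetupDeferTaskrun) {
      flags &= ~kSetupDeferTaskrun;  // DEFER_TASKRUN requires SINGLE_ISSUER
    } else {
      flags &= ~kSetupSingleIssuer;  // older kernels: no optional flags
    }
  }
  return flags;
}
```

Dropping IORING_SETUP_DEFER_TASKRUN before IORING_SETUP_SINGLE_ISSUER matters, since the former requires the latter.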
anand1976 left a comment:
LGTM. I noticed that Poll() doesn't have logic to resubmit the request if bytes_read < requested length. Maybe something to keep in mind as a todo.
Good catch @anand1976! Yes, we can perhaps address this in a followup.
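The suggested follow-up, resubmitting when a completion returns fewer bytes than requested, could look roughly like the loop below. All names are illustrative (this is not RocksDB's Poll()), and the async source is simulated, returning at most kChunk bytes per completion, so the sketch is self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

constexpr size_t kChunk = 4;  // simulated per-completion byte cap

struct ReadRequest {
  const std::string* file;  // stand-in for the underlying file
  size_t offset = 0;
  size_t len = 0;
  size_t finished_len = 0;  // bytes accumulated across completions
  char* scratch = nullptr;
};

// Simulated completion: copies at most kChunk bytes, like a short read.
size_t FakeCompletion(ReadRequest& req) {
  size_t pos = req.offset + req.finished_len;
  if (pos >= req.file->size()) return 0;  // EOF
  size_t n = std::min({kChunk, req.len - req.finished_len,
                       req.file->size() - pos});
  std::memcpy(req.scratch + req.finished_len, req.file->data() + pos, n);
  return n;
}

// Poll-style loop: keep resubmitting while the request is short of
// req.len; stop on EOF. Returns the total bytes read.
size_t PollUntilComplete(ReadRequest& req) {
  while (req.finished_len < req.len) {
    size_t n = FakeCompletion(req);  // real code: reap cqe, maybe resubmit
    if (n == 0) break;               // EOF: return what we have
    req.finished_len += n;
  }
  return req.finished_len;
}
```

The accumulation into `finished_len` mirrors the pattern visible in the diff above, where the result slice spans the bytes finished so far plus the latest read.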
f549d46 to b229b05 (force-push)
xingbowang left a comment:
LGTM. Thank you again for improving this.
anand1976 left a comment:
The new error handling logic LGTM! Had a couple of minor suggestions, but will leave it to you.
@mszeszko-meta merged this pull request in 5a06787.
Summary

PosixRandomAccessFile::MultiRead was introduced in Dec 2019 in #5881. Two years later, we introduced the PosixRandomAccessFile::ReadAsync API in #9578, which reused the same PosixFileSystem IO ring as the MultiRead API, consequently writing to the very same ring's submission queue (without waiting!). This 'shared ring' design is problematic: sequentially interleaving ReadAsync and MultiRead API calls on the very same thread might result in reading 'unknown' events in MultiRead, leading to Bad cqe data errors (falsely perceived as corruption) - which, for some services (running on local flash), is in itself a hard blocker for adopting RocksDB async prefetching ('async IO'), which heavily relies on the ReadAsync API. This change solves the problem by maintaining separate thread-local IO rings for async reads and multi reads, ensuring correct execution. In addition, we add more robust error handling in the form of retries on kernel interrupts and draining the queue when the process is under memory pressure. Separately, we enhance the performance aspect by explicitly marking the rings as written to / read from by a single thread (IORING_SETUP_SINGLE_ISSUER [if available]) and deferring task work until the application intends to process completions (IORING_SETUP_DEFER_TASKRUN [if available]). See https://man7.org/linux/man-pages/man2/io_uring_setup.2.html for reference.
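The core of the fix, one thread-local ring per API so ReadAsync and MultiRead on the same thread never share a submission queue, can be sketched as follows. Ring, RingUse, and the helper are illustrative placeholders, not RocksDB's actual types.

```cpp
#include <cassert>
#include <cstdint>

struct Ring {          // placeholder for a real io_uring instance
  uint64_t submitted = 0;
};

enum class RingUse { kMultiRead, kAsyncRead };

// Each thread lazily owns two independent rings, one per use case, so
// MultiRead can never reap a completion belonging to ReadAsync.
Ring* GetThreadLocalRing(RingUse use) {
  thread_local Ring multi_read_ring;
  thread_local Ring async_read_ring;
  return use == RingUse::kMultiRead ? &multi_read_ring : &async_read_ring;
}
```

Repeated calls with the same RingUse on one thread return the same ring, while the two APIs always get distinct rings, which is exactly the property the shared-ring design violated.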
Benchmark

TL;DR
There's no evident advantage of using io_uring_submit (relative to the proposed io_uring_submit_and_wait) across batches of size 10, 250 and 1000, simulating batch sizes significantly below, close to, and 4x above kIoUringDepth. io_uring_submit might be more appealing if (at least) one of the IOs is slow (which was NOT the case during the benchmark). More notably, with this PR, switching from io_uring_submit_and_wait to io_uring_submit can be done with a single-line change thanks to the implemented guardrails (we can follow up with an optional config for true ring semantics [if needed]).
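The 'single-line change' guardrail can be sketched like this: the drain loop collects every submitted completion regardless of whether the submit call blocked, so the two strategies are interchangeable at one call site. The ring and both submit variants are simulated here so the example is self-contained; real code would use io_uring_submit() / io_uring_submit_and_wait() from liburing.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

struct FakeRing {
  std::queue<int> completions;  // stand-in for the completion queue
};

// Simulated submit variants: one "blocks" until all n complete, the
// other returns immediately.
size_t SubmitAndWait(FakeRing&, size_t n) { return n; }
size_t SubmitOnly(FakeRing&, size_t) { return 0; }

size_t ProcessBatch(FakeRing& ring, size_t batch_size, bool wait) {
  for (size_t i = 0; i < batch_size; ++i) ring.completions.push(1);
  // The single-line choice between the two submit strategies:
  size_t ready_now = wait ? SubmitAndWait(ring, batch_size)
                          : SubmitOnly(ring, batch_size);
  (void)ready_now;  // the guardrail below does not depend on it
  // Guardrail: drain until every submitted request has completed, so the
  // loop is correct for either submit variant.
  size_t reaped = 0;
  while (reaped < batch_size && !ring.completions.empty()) {
    ring.completions.pop();  // real code: io_uring_peek_cqe / wait_cqe
    ++reaped;
  }
  return reaped;
}
```

Both variants drain the same number of completions, which is what makes the swap a one-line edit rather than a control-flow change.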
Compilation

Create DB
LSM
MultiReadRandom (with caching disabled)
Each run was preceded by OS page cache cleanup with echo 1 | sudo tee /proc/sys/vm/drop_caches.

Specs