
Conversation

@mszeszko-meta (Contributor) commented Dec 1, 2025

Summary

PosixRandomAccessFile::MultiRead was introduced in Dec 2019 in #5881. Two years later, #9578 added the PosixRandomAccessFile::ReadAsync API, which reused the same PosixFileSystem IO ring as the MultiRead API and therefore wrote to the very same ring's submission queue (without waiting!). This shared-ring design is problematic: sequentially interleaving ReadAsync and MultiRead API calls on the very same thread can cause MultiRead to consume 'unknown' completion events, leading to 'Bad cqe data' errors that are falsely perceived as corruption. For some services running on local flash, this in itself is a hard blocker for adopting RocksDB async prefetching ('async IO'), which relies heavily on the ReadAsync API.

This change solves the problem by maintaining separate thread-local IO rings for async reads and multi reads, ensuring correct execution. In addition, we add more robust error handling in the form of retries on kernel interrupts and draining of the completion queue when the process is under memory pressure. Separately, we improve performance by explicitly marking each ring as written to / read from by a single thread (IORING_SETUP_SINGLE_ISSUER, if available) and by deferring task work until the application intends to process completions (IORING_SETUP_DEFER_TASKRUN, if available). See https://man7.org/linux/man-pages/man2/io_uring_setup.2.html for reference.
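
For illustration, the per-purpose thread-local ring setup with the optional flags could look roughly like the sketch below. This is not the exact code in this PR; the queue depth value and the CreateThreadLocalRing name are placeholders.

#include <liburing.h>

// Illustrative queue depth; the value here is a placeholder.
constexpr unsigned kIoUringDepth = 256;

// Create one ring for the calling thread, preferring the single-issuer /
// defer-taskrun setup flags and falling back to plain setup on older kernels.
static struct io_uring* CreateThreadLocalRing() {
  auto* ring = new struct io_uring();
  unsigned flags = 0;
#if defined(IORING_SETUP_SINGLE_ISSUER)
  flags |= IORING_SETUP_SINGLE_ISSUER;
#endif
#if defined(IORING_SETUP_DEFER_TASKRUN)
  flags |= IORING_SETUP_DEFER_TASKRUN;
#endif
  int ret = io_uring_queue_init(kIoUringDepth, ring, flags);
  if (ret < 0 && flags != 0) {
    // Older kernels reject unknown setup flags; retry without them.
    ret = io_uring_queue_init(kIoUringDepth, ring, 0);
  }
  if (ret < 0) {
    delete ring;
    return nullptr;
  }
  return ring;
}

// Separate rings so MultiRead never consumes ReadAsync completions
// (and vice versa) when both run on the same thread.
static thread_local struct io_uring* multi_read_ring = CreateThreadLocalRing();
static thread_local struct io_uring* read_async_ring = CreateThreadLocalRing();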

Benchmark

TLDR
There is no evident advantage to io_uring_submit (relative to the proposed io_uring_submit_and_wait) across batches of size 10, 250 and 1000, which simulate batch sizes significantly below, close to, and roughly 4x above kIoUringDepth. io_uring_submit might be more appealing if at least one of the IOs is slow (which was NOT the case during the benchmark). More notably, with this PR, switching from io_uring_submit_and_wait to io_uring_submit is a single-line change thanks to the implemented guardrails (we can follow up with an optional config for true ring semantics if needed).
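
For reference, the single-line difference between the two submission strategies looks roughly like this sketch; ring and num_reqs are assumed to be set up by the caller.

// Batched path used in this PR: submit every queued SQE and block until
// at least num_reqs completions are available.
int ret = io_uring_submit_and_wait(&ring, num_reqs);

// Alternative discussed above: submit without waiting, then reap
// completions one by one, which may help when a single slow IO should
// not delay handing back the fast ones.
// int ret = io_uring_submit(&ring);
// struct io_uring_cqe* cqe = nullptr;
// io_uring_wait_cqe(&ring, &cqe);
// /* process cqe->res */
// io_uring_cqe_seen(&ring, cqe);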

Compilation

DEBUG_LEVEL=0 make db_bench

Create DB

./db_bench \
    --db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
    --benchmarks=fillseq \
    --num=2500000 \
    --key_size=100 \
    --value_size=6144 \
    --compression_type=LZ4 \
    --block_size=16384 \
    --seed=1723056275

LSM

  • L0: 2 files, L1: 5, L2: 49, L3: 79
  • Each file is roughly 35 MB in size

MultiReadRandom (with caching disabled)

Each run was preceded by OS page cache cleanup with echo 1 | sudo tee /proc/sys/vm/drop_caches.

./db_bench \
    --use_existing_db=true \
    --db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
    --compression_type=LZ4 \
    --benchmarks=multireadrandom \
    --num=<N> \
    --batch_size=<B> \
    --io_uring_enabled=true \
    --async_io=false \
    --optimize_multiget_for_io=false \
    --threads=4 \
    --cache_size=0 \
    --use_direct_reads=true \
    --use_direct_io_for_flush_and_compaction=true \
    --cache_index_and_filter_blocks=false \
    --pin_l0_filter_and_index_blocks_in_cache=false \
    --pin_top_level_index_and_filter=false \
    --prepopulate_block_cache=0 \
    --row_cache_size=0 \
    --use_blob_cache=false \
    --use_compressed_secondary_cache=false
                          B=10; N=100,000     B=250; N=80,000     B=1,000; N=20,000
baseline                  31.5 (± 0.4) us/op  17.5 (± 0.5) us/op  13.5 (± 0.4) us/op
io_uring_submit_and_wait  31.5 (± 0.6) us/op  17.7 (± 0.4) us/op  13.6 (± 0.4) us/op
io_uring_submit           31.5 (± 0.6) us/op  17.5 (± 0.5) us/op  13.4 (± 0.45) us/op

Specs

Property         Value
RocksDB version  10.9.0
Date             Tue Dec 9 15:57:03 2025
CPU              56 * Intel Sapphire Rapids (T10 SPR)
Kernel version   6.9.0-0_fbk12_0_g28f2d09ad102

@mszeszko-meta (Contributor, Author) commented Dec 1, 2025

TLDR; this PR consists of 3 parts:

  1. Correctness for (sequentially) interleaving PosixRandomAccessFile::MultiRead and PosixRandomAccessFile::ReadAsync operations within the very same thread.
  2. Reliability: enhanced error handling (e.g. -EINTR, -EAGAIN, -ENOMEM); a minimal sketch follows this list.
  3. Performance: IO ring flags (if available).
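
A minimal sketch of the item-2 retry/drain idea, assuming liburing; SubmitWithRetries is an illustrative helper name, not the PR's actual code.

#include <cerrno>
#include <liburing.h>

// Submit queued SQEs and wait for wait_nr completions, retrying on
// transient errors. Draining already-completed CQEs on -EAGAIN/-ENOMEM
// mirrors the "drain the queue under memory pressure" idea above.
static int SubmitWithRetries(struct io_uring* ring, unsigned wait_nr) {
  while (true) {
    int ret = io_uring_submit_and_wait(ring, wait_nr);
    if (ret >= 0) {
      return ret;  // number of SQEs submitted
    }
    if (ret == -EINTR) {
      continue;  // interrupted by a signal: simply retry
    }
    if (ret == -EAGAIN || ret == -ENOMEM) {
      // Kernel is short on resources: reap whatever has completed so far
      // to free ring slots, then try submitting again.
      struct io_uring_cqe* cqe = nullptr;
      while (io_uring_peek_cqe(ring, &cqe) == 0 && cqe != nullptr) {
        // ... record cqe->res for the corresponding request ...
        io_uring_cqe_seen(ring, cqe);
        cqe = nullptr;
      }
      continue;
    }
    return ret;  // any other error is propagated to the caller
  }
}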

@mszeszko-meta marked this pull request as ready for review December 2, 2025 17:08
meta-codesync bot commented Dec 2, 2025

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this in D88172809.

@xingbowang (Contributor) left a comment

Wow. This is a tricky bug to catch, and a massive improvement to the robustness of io_uring usage. I really appreciate the comment message; it provided a lot of context. Thank you!

Comment on lines 858 to 866
if (read_again) {
  Slice tmp_slice;
  req->status =
      Read(req->offset + req_wrap->finished_len,
           req->len - req_wrap->finished_len, options, &tmp_slice,
           req->scratch + req_wrap->finished_len, dbg);
  req->result =
      Slice(req->scratch, req_wrap->finished_len + tmp_slice.size());
}
Contributor left a comment

Why are we calling the synchronous Read API here directly?

@mszeszko-meta (Contributor, Author) commented Dec 5, 2025

@xingbowang synchronous re-reading has been added to properly handle the corner case when IO uring returns cqe->res == 0. Please see the original PR and the comment by Jens in this PR for context.

@mszeszko-meta (Contributor, Author) commented Dec 5, 2025

@axboe given that this change (issuing a 2nd read when the 1st one returned 0, to differentiate between EOF and a buffered cached read that raced with another request/user) was added 5 years ago, I'm wondering if anything has changed since then and whether we can now get the correct behavior by default by using some of the io_uring initialization flags / constructs?

@anand1976 (Contributor) left a comment

LGTM. I noticed that Poll() doesn't have logic to resubmit the request if bytes_read < requested length. Maybe something to keep in mind as a todo.

@mszeszko-meta (Contributor, Author) commented Dec 4, 2025

> LGTM. I noticed that Poll() doesn't have logic to resubmit the request if bytes_read < requested length. Maybe something to keep in mind as a todo.

Good catch @anand1976! Yes, we can perhaps address this in a follow-up.
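
For context, such a follow-up could look roughly like the sketch below, inside the Poll() completion handling. This is not part of the merged change; ring, fd, cqe, req and req_wrap are assumed from the surrounding code, and cqe->res is assumed to be non-negative (errors handled elsewhere).

// After obtaining a cqe whose user_data maps back to req_wrap / req:
size_t bytes_read = static_cast<size_t>(cqe->res);
req_wrap->finished_len += bytes_read;
if (bytes_read > 0 && req_wrap->finished_len < req->len) {
  // Short read: queue another SQE for the unread tail instead of
  // reporting the request as complete.
  struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
  if (sqe != nullptr) {
    io_uring_prep_read(sqe, fd, req->scratch + req_wrap->finished_len,
                       static_cast<unsigned>(req->len - req_wrap->finished_len),
                       req->offset + req_wrap->finished_len);
    io_uring_sqe_set_data(sqe, req_wrap);
    io_uring_submit(ring);
  }
}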

@xingbowang (Contributor) left a comment

LGTM. Thank you again for improving this.

meta-codesync bot commented Dec 9, 2025

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this in D88172809.

meta-codesync bot commented Dec 10, 2025

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this in D88172809.

@anand1976 (Contributor) left a comment

The new error handling logic LGTM! Had a couple of minor suggestions, but will leave it to you.

meta-codesync bot commented Dec 12, 2025

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this in D88172809.

meta-codesync bot commented Dec 12, 2025

@mszeszko-meta merged this pull request in 5a06787.
