
Conversation

@mszeszko-meta (Contributor) commented Dec 1, 2025

Summary

PosixRandomAccessFile::MultiRead was introduced in Dec 2019 in #5881. Two years later, #9578 added the PosixRandomAccessFile::ReadAsync API, which reused the same PosixFileSystem IO ring as the MultiRead API and therefore wrote to the very same ring's submission queue (without waiting!). This shared-ring design is problematic: sequentially interleaving ReadAsync and MultiRead API calls on the very same thread can cause MultiRead to consume 'unknown' completion events, leading to 'Bad cqe data' errors that are falsely perceived as corruption. For some services running on local flash, this in itself is a hard blocker for adopting RocksDB async prefetching ('async IO'), which relies heavily on the ReadAsync API.

This change solves the problem by maintaining separate thread-local IO rings for async reads and multi reads, ensuring correct execution. In addition, we add more robust error handling in the form of retries on kernel interrupts and draining of the completion queue when the process is under memory pressure. Separately, we improve performance by explicitly marking each ring as written to / read from by a single thread (IORING_SETUP_SINGLE_ISSUER, if available) and by deferring task work until the application intends to process completions (IORING_SETUP_DEFER_TASKRUN, if available). See https://man7.org/linux/man-pages/man2/io_uring_setup.2.html for reference.
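
For illustration, the per-purpose thread-local ring setup with the optional flags could look roughly like the sketch below. This is not the exact code in this PR; the queue depth value and the CreateThreadLocalRing name are placeholders.

#include <liburing.h>

// Illustrative queue depth; the value here is a placeholder.
constexpr unsigned kIoUringDepth = 256;

// Create one ring for the calling thread, preferring the single-issuer /
// defer-taskrun setup flags and falling back to plain setup on older kernels.
static struct io_uring* CreateThreadLocalRing() {
  auto* ring = new struct io_uring();
  unsigned flags = 0;
#if defined(IORING_SETUP_SINGLE_ISSUER)
  flags |= IORING_SETUP_SINGLE_ISSUER;
#endif
#if defined(IORING_SETUP_DEFER_TASKRUN)
  flags |= IORING_SETUP_DEFER_TASKRUN;
#endif
  int ret = io_uring_queue_init(kIoUringDepth, ring, flags);
  if (ret < 0 && flags != 0) {
    // Older kernels reject unknown setup flags; retry without them.
    ret = io_uring_queue_init(kIoUringDepth, ring, 0);
  }
  if (ret < 0) {
    delete ring;
    return nullptr;
  }
  return ring;
}

// Separate rings so MultiRead never consumes ReadAsync completions
// (and vice versa) when both run on the same thread.
static thread_local struct io_uring* multi_read_ring = CreateThreadLocalRing();
static thread_local struct io_uring* read_async_ring = CreateThreadLocalRing();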

Benchmark

TLDR
There is no evident advantage to io_uring_submit (relative to the proposed io_uring_submit_and_wait) across batches of size 10, 250 and 1000, which simulate batch sizes significantly below, close to, and roughly 4x above kIoUringDepth. io_uring_submit might be more appealing if at least one of the IOs is slow (which was NOT the case during the benchmark). More notably, with this PR, switching from io_uring_submit_and_wait to io_uring_submit is a single-line change thanks to the implemented guardrails (we can follow up with an optional config for true ring semantics if needed).
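
For reference, the single-line difference between the two submission strategies looks roughly like this sketch; ring and num_reqs are assumed to be set up by the caller.

// Batched path used in this PR: submit every queued SQE and block until
// at least num_reqs completions are available.
int ret = io_uring_submit_and_wait(&ring, num_reqs);

// Alternative discussed above: submit without waiting, then reap
// completions one by one, which may help when a single slow IO should
// not delay handing back the fast ones.
// int ret = io_uring_submit(&ring);
// struct io_uring_cqe* cqe = nullptr;
// io_uring_wait_cqe(&ring, &cqe);
// /* process cqe->res */
// io_uring_cqe_seen(&ring, cqe);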

Compilation

DEBUG_LEVEL=0 make db_bench

Create DB

./db_bench \
    --db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
    --benchmarks=fillseq \
    --num=2500000 \
    --key_size=100 \
    --value_size=6144 \
    --compression_type=LZ4 \
    --block_size=16384 \
    --seed=1723056275

LSM

  • L0: 2 files, L1: 5, L2: 49, L3: 79
  • Each file is roughly 35 MB in size

MultiReadRandom (with caching disabled)

Each run was preceded by OS page cache cleanup with echo 1 | sudo tee /proc/sys/vm/drop_caches.

./db_bench \
    --use_existing_db=true \
    --db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
    --compression_type=LZ4 \
    --benchmarks=multireadrandom \
    --num=<N> \
    --batch_size=<B> \
    --io_uring_enabled=true \
    --async_io=false \
    --optimize_multiget_for_io=false \
    --threads=4 \
    --cache_size=0 \
    --use_direct_reads=true \
    --use_direct_io_for_flush_and_compaction=true \
    --cache_index_and_filter_blocks=false \
    --pin_l0_filter_and_index_blocks_in_cache=false \
    --pin_top_level_index_and_filter=false \
    --prepopulate_block_cache=0 \
    --row_cache_size=0 \
    --use_blob_cache=false \
    --use_compressed_secondary_cache=false
                          B=10; N=100,000     B=250; N=80,000     B=1,000; N=20,000
baseline                  31.5 (± 0.4) us/op  17.5 (± 0.5) us/op  13.5 (± 0.4) us/op
io_uring_submit_and_wait  31.5 (± 0.6) us/op  17.7 (± 0.4) us/op  13.6 (± 0.4) us/op
io_uring_submit           31.5 (± 0.6) us/op  17.5 (± 0.5) us/op  13.4 (± 0.45) us/op

Specs

Property         Value
RocksDB version  10.9.0
Date             Tue Dec 9 15:57:03 2025
CPU              56 * Intel Sapphire Rapids (T10 SPR)
Kernel version   6.9.0-0_fbk12_0_g28f2d09ad102

@mszeszko-meta (Contributor, Author) commented Dec 1, 2025

TLDR; this PR consists of 3 parts:

  1. Correctness for (sequentially) interleaving PosixRandomAccessFile::MultiRead and PosixRandomAccessFile::ReadAsync operations within the very same thread.
  2. Reliability: enhanced error handling (e.g. -EINTR, -EAGAIN, -ENOMEM); a minimal sketch follows this list.
  3. Performance: IO ring flags (if available).
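
A minimal sketch of the item-2 retry/drain idea, assuming liburing; SubmitWithRetries is an illustrative helper name, not the PR's actual code.

#include <cerrno>
#include <liburing.h>

// Submit queued SQEs and wait for wait_nr completions, retrying on
// transient errors. Draining already-completed CQEs on -EAGAIN/-ENOMEM
// mirrors the "drain the queue under memory pressure" idea above.
static int SubmitWithRetries(struct io_uring* ring, unsigned wait_nr) {
  while (true) {
    int ret = io_uring_submit_and_wait(ring, wait_nr);
    if (ret >= 0) {
      return ret;  // number of SQEs submitted
    }
    if (ret == -EINTR) {
      continue;  // interrupted by a signal: simply retry
    }
    if (ret == -EAGAIN || ret == -ENOMEM) {
      // Kernel is short on resources: reap whatever has completed so far
      // to free ring slots, then try submitting again.
      struct io_uring_cqe* cqe = nullptr;
      while (io_uring_peek_cqe(ring, &cqe) == 0 && cqe != nullptr) {
        // ... record cqe->res for the corresponding request ...
        io_uring_cqe_seen(ring, cqe);
        cqe = nullptr;
      }
      continue;
    }
    return ret;  // any other error is propagated to the caller
  }
}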

@mszeszko-meta marked this pull request as ready for review December 2, 2025 17:08
meta-codesync bot commented Dec 2, 2025

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this in D88172809.

@xingbowang (Contributor) left a comment

Wow. This is a tricky bug to catch, and a massive improvement to the robustness of io_uring usage. I really appreciate the comment message; it provided a lot of context. Thank you!

Comment on lines 858 to 866
if (read_again) {
  Slice tmp_slice;
  req->status =
      Read(req->offset + req_wrap->finished_len,
           req->len - req_wrap->finished_len, options, &tmp_slice,
           req->scratch + req_wrap->finished_len, dbg);
  req->result =
      Slice(req->scratch, req_wrap->finished_len + tmp_slice.size());
}
Contributor left a comment

Why are we calling the synchronous Read API here directly?

@mszeszko-meta (Contributor, Author) commented Dec 5, 2025

@xingbowang synchronous re-reading has been added to properly handle the corner case when IO uring returns cqe->res == 0. Please see the original PR and the comment by Jens in this PR for context.

@mszeszko-meta (Contributor, Author) commented Dec 5, 2025

@axboe given that this change (issuing a 2nd read when the 1st one returned 0, to differentiate between EOF and a buffered cached read that raced with another request/user) was added 5 years ago, I'm wondering if anything has changed since then and whether we can now get the correct behavior by default by using some of the io_uring initialization flags / constructs?

@anand1976 (Contributor) left a comment

LGTM. I noticed that Poll() doesn't have logic to resubmit the request if bytes_read < requested length. Maybe something to keep in mind as a todo.

@mszeszko-meta (Contributor, Author) commented Dec 4, 2025

> LGTM. I noticed that Poll() doesn't have logic to resubmit the request if bytes_read < requested length. Maybe something to keep in mind as a todo.

Good catch @anand1976! Yes, we can perhaps address this in a follow-up.
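
For context, such a follow-up could look roughly like the sketch below, inside the Poll() completion handling. This is not part of the merged change; ring, fd, cqe, req and req_wrap are assumed from the surrounding code, and cqe->res is assumed to be non-negative (errors handled elsewhere).

// After obtaining a cqe whose user_data maps back to req_wrap / req:
size_t bytes_read = static_cast<size_t>(cqe->res);
req_wrap->finished_len += bytes_read;
if (bytes_read > 0 && req_wrap->finished_len < req->len) {
  // Short read: queue another SQE for the unread tail instead of
  // reporting the request as complete.
  struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
  if (sqe != nullptr) {
    io_uring_prep_read(sqe, fd, req->scratch + req_wrap->finished_len,
                       static_cast<unsigned>(req->len - req_wrap->finished_len),
                       req->offset + req_wrap->finished_len);
    io_uring_sqe_set_data(sqe, req_wrap);
    io_uring_submit(ring);
  }
}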

@xingbowang (Contributor) left a comment

LGTM. Thank you again for improving this.

meta-codesync bot commented Dec 9, 2025

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this in D88172809.

meta-codesync bot commented Dec 10, 2025

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this in D88172809.

@anand1976 (Contributor) left a comment

The new error handling logic LGTM! Had a couple of minor suggestions, but will leave it to you.

meta-codesync bot commented Dec 12, 2025

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this in D88172809.

meta-codesync bot commented Dec 12, 2025

@mszeszko-meta merged this pull request in 5a06787.
