
Conversation

@SF-Zhou SF-Zhou commented Nov 11, 2025

Motivation

We encountered severely degraded write performance when running RocksDB on HDDs with sync enabled.

Implementation

To resolve this, we developed two interconnected features:

  1. Direct I/O for WAL: Added an option to open WAL files with O_DIRECT, ensuring all writes are block-aligned.
  2. WAL preallocation: To handle file growth and metadata-sync requirements safely under O_DIRECT, we implemented a preallocation strategy. Before writing a batch, the writer preallocates a large block (default 1 MiB) of zero-padded space and issues a sync to persist the file's metadata. Subsequent batch writes land inside this preallocated range, so they require no further metadata updates and therefore no additional metadata syncs; see the sketch after this list.
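
Below is a minimal sketch of this preallocate-then-write scheme using raw POSIX calls. It is illustrative only: the constants, function names, and error handling are assumptions, not the actual RocksDB implementation.

```cpp
#include <fcntl.h>   // open, O_DIRECT
#include <unistd.h>  // pwrite, fsync, fdatasync

#include <cstdlib>   // posix_memalign, free
#include <cstring>   // memset

constexpr size_t kAlign = 4096;             // typical O_DIRECT alignment
constexpr size_t kPreallocBlock = 1 << 20;  // 1 MiB, the default noted above

// Open the WAL with O_DIRECT so writes bypass the page cache.
int OpenWalDirect(const char* path) {
  return open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
}

// Extend the file by one zero-padded block and fsync once so the new file
// size (metadata) is durable. Writes that stay inside
// [offset, offset + kPreallocBlock) afterwards change no metadata.
bool Preallocate(int fd, off_t offset) {
  void* zeros = nullptr;
  if (posix_memalign(&zeros, kAlign, kPreallocBlock) != 0) return false;
  memset(zeros, 0, kPreallocBlock);
  const ssize_t n = pwrite(fd, zeros, kPreallocBlock, offset);
  free(zeros);
  // One metadata sync per 1 MiB of WAL instead of one per batch.
  return n == static_cast<ssize_t>(kPreallocBlock) && fsync(fd) == 0;
}

// A batch write inside the preallocated range: buffer, length, and offset
// must all be multiples of kAlign for O_DIRECT. fdatasync suffices here
// because the file size and allocation were already persisted.
bool WriteBatchAligned(int fd, const void* buf, size_t len, off_t offset) {
  if (pwrite(fd, buf, len, offset) != static_cast<ssize_t>(len)) return false;
  return fdatasync(fd) == 0;
}
```

On the db_bench side the feature is toggled with --use_direct_io_for_wal, as in the runs below.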

Result

Our tests confirm that Direct I/O combined with WAL preallocation yields a dramatic improvement over sync alone: in the fillseq runs below, throughput rises from 39 ops/sec to 6,795 ops/sec, roughly a 170x speedup.

# HDD model: Seagate ST18000NM004J

# disable use_direct_io_for_wal.
./db_bench --benchmarks=fillseq --db=/storage/data1/rocksdb --num=10000 --value_size=128 --sync=true --use_direct_io_for_wal=false
Set seed to 1762841673105496 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 10.9.0
Date:       Tue Nov 11 14:14:33 2025
CPU:        112 * Intel(R) Xeon(R) Gold 5420+
CPUCache:   53760 KB
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     128 bytes each (64 bytes after compression)
Entries:    10000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    1.4 MB (estimated)
FileSize:   0.8 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/storage/data1/rocksdb]
fillseq      :   25555.892 micros/op 39 ops/sec 255.559 seconds 10000 operations;    0.0 MB/s

# enable use_direct_io_for_wal.
./db_bench --benchmarks=fillseq --db=/storage/data1/rocksdb --num=10000 --value_size=128 --sync=true --use_direct_io_for_wal=true
Set seed to 1762841968828058 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 10.9.0
Date:       Tue Nov 11 14:19:29 2025
CPU:        112 * Intel(R) Xeon(R) Gold 5420+
CPUCache:   53760 KB
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     128 bytes each (64 bytes after compression)
Entries:    10000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    1.4 MB (estimated)
FileSize:   0.8 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/storage/data1/rocksdb]
fillseq      :     147.164 micros/op 6795 ops/sec 1.472 seconds 10000 operations;    0.9 MB/s
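
For completeness, here is a hedged sketch of how an application (rather than db_bench) might enable the feature. The commented-out option names are assumptions that mirror the db_bench flags in this PR; the actual DBOptions fields may differ from what lands in the final API.

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Assumed option names mirroring the db_bench flags in this PR; verify
  // against the merged API before relying on them.
  // options.use_direct_io_for_wal = true;
  // options.wal_preallocate_block_size = 1 << 20;  // hypothetical knob

  rocksdb::DB* db = nullptr;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/storage/data1/rocksdb", &db);
  if (!s.ok()) return 1;

  rocksdb::WriteOptions wo;
  wo.sync = true;  // the configuration under test: synced WAL writes
  db->Put(wo, "key", "value");
  delete db;
  return 0;
}
```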

@meta-cla meta-cla bot added the CLA Signed label Nov 11, 2025
@SF-Zhou SF-Zhou changed the title Add direct I/O support for WAL writes WIP: Add direct I/O support for WAL writes Nov 12, 2025
@SF-Zhou SF-Zhou marked this pull request as draft November 12, 2025 05:20
@SF-Zhou SF-Zhou changed the title WIP: Add direct I/O support for WAL writes Add direct I/O support for WAL writes Nov 12, 2025
@SF-Zhou SF-Zhou force-pushed the copilot/add-direct-io-support-wal branch from 56fe0af to 9e3d20a on November 12, 2025 15:41
@SF-Zhou SF-Zhou marked this pull request as ready for review November 12, 2025 15:42
SF-Zhou and others added 3 commits November 13, 2025 11:25
* Initial plan

* Add comprehensive unit tests for WAL direct I/O pre-allocation

Add 11 new unit tests covering:
- Basic direct I/O with pre-allocation
- Various block sizes (64KB, 256KB, 1MB, 4MB)
- Multiple flush cycles
- Synchronous writes with pre-allocation
- WAL recovery with pre-allocation
- Zero block size (default behavior)
- Concurrent writes (thread safety)
- Interaction with memtable flushes
- Edge cases (empty writes, boundary conditions)
- Close during active pre-allocation

Co-authored-by: SF-Zhou <[email protected]>

* Format code

* Apply suggestion from @Copilot

Co-authored-by: Copilot <[email protected]>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: SF-Zhou <[email protected]>
Co-authored-by: SF-Zhou <[email protected]>
Co-authored-by: Copilot <[email protected]>
