mm: improve write performance with RWF_DONTCACHE#1329
Open
vfsci-bot[bot] wants to merge 2 commits into vfs.base.ci from
Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
pages with the dropbehind flag set (i.e., pages dirtied via
RWF_DONTCACHE writes).
Increment the counter alongside WB_RECLAIMABLE in
folio_account_dirtied() when the folio has the dropbehind flag set, and
decrement it in folio_clear_dirty_for_io() and folio_account_cleaned().
Also decrement it when a non-DONTCACHE lookup atomically clears the
dropbehind flag on a dirty folio in __filemap_get_folio_mpol(), using
folio_test_clear_dropbehind() to prevent concurrent lookups from
double-decrementing the counter, and guarding the decrement with
mapping_can_writeback() to match the increment path.
The counter will be used by the writeback flusher to determine how many
pages to write back when expediting writeback for IOCB_DONTCACHE writes,
without flushing the entire BDI's dirty pages.
Suggested-by: Jan Kara <jack@suse.cz>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
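The accounting rules described above can be sketched in userspace C. This is only an illustrative model, not the patch itself: `sketch_wb`, `sketch_folio`, and the `sketch_*` functions are hypothetical stand-ins for the kernel structures and helpers, and the `mapping_can_writeback()` guard is left out. The point it tries to show is why a test-and-clear on the flag keeps two racing lookups from both decrementing the counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not the kernel's bdi_writeback or folio. */
struct sketch_wb {
    atomic_long dontcache_dirty;    /* models WB_DONTCACHE_DIRTY */
};

struct sketch_folio {
    atomic_bool dropbehind;         /* models the dropbehind flag */
    bool dirty;
    struct sketch_wb *wb;
};

/* Dirtying path: count the folio only if it carries the flag. */
static void sketch_account_dirtied(struct sketch_folio *f)
{
    f->dirty = true;
    if (atomic_load(&f->dropbehind))
        atomic_fetch_add(&f->wb->dontcache_dirty, 1);
}

/* Cleaning path: undo the increment if the flag is still set. */
static void sketch_account_cleaned(struct sketch_folio *f)
{
    assert(f->dirty);
    f->dirty = false;
    if (atomic_load(&f->dropbehind))
        atomic_fetch_sub(&f->wb->dontcache_dirty, 1);
}

/*
 * Non-DONTCACHE lookup: atomic_exchange() plays the role of a
 * test-and-clear. Only the caller that sees the true -> false
 * transition decrements, so repeated or racing lookups cannot
 * decrement twice.
 */
static void sketch_lookup_clears_dropbehind(struct sketch_folio *f)
{
    bool was_set = atomic_exchange(&f->dropbehind, false);

    if (was_set && f->dirty)
        atomic_fetch_sub(&f->wb->dontcache_dirty, 1);
}
```

In this model, dirtying a dropbehind folio raises the counter to 1, the first lookup drops it back to 0, and neither a second lookup nor the later cleaning step touches it again.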
…cking
The IOCB_DONTCACHE writeback path in generic_write_sync() calls
filemap_flush_range() on every write, submitting writeback inline in
the writer's context. Lock contention profiling with perf shows that
the performance problem is not lock contention but the writeback
submission work itself: walking the page tree and submitting I/O blocks
the writer for milliseconds, inflating p99.9 latency from 23ms
(buffered) to 93ms (dontcache).
Replace the inline filemap_flush_range() call with a flusher kick that
drains dirty pages in the background. This moves writeback submission
completely off the writer's hot path.
To avoid flushing unrelated buffered dirty data, add a dedicated
WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
write back. The flusher writes back that many pages from the oldest dirty
inodes (not restricted to dontcache-specific inodes). This helps
preserve I/O batching while limiting the scope of expedited writeback.
Like WB_start_all, the WB_start_dontcache bit coalesces multiple
DONTCACHE writes into a single flusher wakeup without per-write
allocations. Use test_and_clear_bit() to atomically consume the kick
request before reading the dirty counter and starting writeback, so that
concurrent DONTCACHE writes during writeback can re-set the bit and
schedule a follow-up flusher run.
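The coalescing and consume-then-read ordering described above can be modelled with C11 atomics. This is a rough sketch under stated assumptions, not the kernel code: `sketch_flusher`, `SKETCH_START_DONTCACHE`, `sketch_kick()`, and `sketch_check_start_dontcache()` are hypothetical names, and waking the flusher only when the bit was newly set is an assumption modelled on the description of WB_start_all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_START_DONTCACHE (1u << 0)    /* hypothetical state bit */

struct sketch_flusher {
    atomic_uint state;              /* models wb->state */
    atomic_long dontcache_dirty;    /* models WB_DONTCACHE_DIRTY */
};

/*
 * Writer side: set the kick bit. Returns true only when this call
 * newly set it, i.e. when a wakeup would be worth issuing. Further
 * kicks while the bit is set coalesce into the pending request.
 */
static bool sketch_kick(struct sketch_flusher *fl)
{
    unsigned int old = atomic_fetch_or(&fl->state, SKETCH_START_DONTCACHE);

    return !(old & SKETCH_START_DONTCACHE);
}

/*
 * Flusher side: consume the request first, then read the counter.
 * A writer that dirties pages after the clear sets the bit again,
 * so its pages are picked up by a follow-up run. Returns the number
 * of pages to write back, or 0 when no kick was pending.
 */
static long sketch_check_start_dontcache(struct sketch_flusher *fl)
{
    unsigned int old = atomic_fetch_and(&fl->state, ~SKETCH_START_DONTCACHE);
    long nr;

    if (!(old & SKETCH_START_DONTCACHE))
        return 0;

    nr = atomic_load(&fl->dontcache_dirty);
    assert(nr >= 0);    /* the model never lets the counter go negative */
    return nr;
}
```

With 128 pages counted, two back-to-back kicks yield one pending request; the flusher's first check returns 128 and its second returns 0 until another writer kicks again.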
Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
rather than wb_stat() (which reads only the global counter) to ensure
small writes below the percpu batch threshold are visible to the flusher.
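To illustrate why the choice of read matters, here is a simplified, single-threaded model of a batched per-CPU counter. It is not the kernel's percpu_counter: the struct, `SKETCH_NR_CPUS`, and the batch size of 8 are made up for the example, and locking and preemption are ignored. It only shows how small deltas can sit in per-CPU slots where a global-only read misses them.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4
#define SKETCH_BATCH   8    /* hypothetical batch threshold */

struct sketch_percpu_counter {
    long global;                    /* folded total */
    long local[SKETCH_NR_CPUS];     /* per-CPU deltas not yet folded */
};

/* Add to this CPU's slot; fold into the global once the batch fills. */
static void sketch_counter_add(struct sketch_percpu_counter *c, int cpu,
                               long delta)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);

    long v = c->local[cpu] + delta;

    if (v >= SKETCH_BATCH || v <= -SKETCH_BATCH) {
        c->global += v;
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = v;
    }
}

/* Cheap read: global part only (the behaviour attributed to wb_stat()). */
static long sketch_counter_read(const struct sketch_percpu_counter *c)
{
    return c->global;
}

/* Exact read: global plus every per-CPU delta (like wb_stat_sum()). */
static long sketch_counter_sum(const struct sketch_percpu_counter *c)
{
    long sum = c->global;

    for (int i = 0; i < SKETCH_NR_CPUS; i++)
        sum += c->local[i];
    return sum;
}
```

After adding 3 pages on CPU 0 and 5 on CPU 1, the cheap read still reports 0 while the summed read reports 8, which is the case the commit message is concerned with.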
In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
inside the unlocked_inode_to_wb_begin/end section for correct cgroup
writeback domain targeting, but defer the wb_wakeup() call until after
the section ends, since wb_wakeup() uses spin_unlock_irq(), which would
unconditionally re-enable interrupts while the i_pages xa_lock may still
be held under irqsave during a cgroup writeback switch.
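The ordering constraint above (record the request inside the section, wake up only after leaving it) can be sketched as follows. This is a toy model, not the patch: `sketch_kick_wb`, `section_held`, and the `sketch_kick_*` functions are hypothetical, the begin/end section is reduced to a boolean, and waking only when the bit was newly set is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_KICK_BIT (1u << 0)   /* hypothetical state bit */

struct sketch_kick_wb {
    atomic_uint state;
    bool section_held;  /* models being inside the begin/end section */
    int wakeups;
};

/* Models a wakeup that must never run while the section is held. */
static void sketch_kick_wakeup(struct sketch_kick_wb *wb)
{
    assert(!wb->section_held);
    wb->wakeups++;
}

static void sketch_kick_writeback(struct sketch_kick_wb *wb)
{
    bool need_wakeup;

    wb->section_held = true;        /* begin: wb association is stable */
    need_wakeup = !(atomic_fetch_or(&wb->state, SKETCH_KICK_BIT) &
                    SKETCH_KICK_BIT);
    wb->section_held = false;       /* end: safe to take other locks */

    /* Deferred: only the decision was made inside the section. */
    if (need_wakeup)
        sketch_kick_wakeup(wb);
}
```

Carrying just a boolean out of the section keeps the wakeup's own locking away from whatever the section may be holding, at the cost of waking slightly later.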
Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
visibility.
dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
xfs on NVMe, fio io_uring):
Buffered and direct I/O paths are unaffected by this patchset. All
improvements are confined to the dontcache path:
Single-stream throughput (MB/s):
                              Before    After   Change
  seq-write/dontcache            298      897    +201%
  rand-write/dontcache           131      236     +80%
Tail latency improvements (seq-write/dontcache):
p99: 135,266 us -> 23,986 us (-82%)
p99.9: 8,925,479 us -> 28,443 us (-99.7%)
Multi-writer (4 jobs, sequential write):
                                Before    After   Change
  dontcache aggregate (MB/s)     2,529    4,532     +79%
  dontcache p99 (us)             8,553    1,002     -88%
  dontcache p99.9 (us)         109,314    1,057     -99%
Dontcache multi-writer throughput now matches buffered (4,532 vs
4,616 MB/s).
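The Change columns in these tables are consistent with the usual relative change, (after - before) / before, rounded to the nearest percent. That reading is an inference from the published numbers rather than something the benchmark tooling states, and `sketch_pct_change()` below is a hypothetical helper for re-deriving the figures.

```c
#include <assert.h>

/* Relative change from before to after, in percent. */
static double sketch_pct_change(double before, double after)
{
    assert(before != 0.0);  /* undefined for a zero baseline */
    return (after - before) / before * 100.0;
}
```

For example, the seq-write/dontcache row (298 to 897 MB/s) works out to roughly +201%, and the multi-writer p99 row (8,553 to 1,002 us) to roughly -88%.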
32-file write (Axboe test):
                                Before    After   Change
  dontcache aggregate (MB/s)     1,548    3,499    +126%
  dontcache p99 (us)            10,170      602     -94%
  Peak dirty pages (MB)          1,837      213     -88%
Dontcache now reaches 81% of buffered throughput (was 35%).
Competing writers (dontcache vs buffered, separate files):
                       Before    After
  buffered writer         868      433  MB/s
  dontcache writer        415      433  MB/s
  Aggregate             1,284      866  MB/s
Previously the buffered writer starved the dontcache writer 2:1.
With per-bdi_writeback tracking, both writers now receive equal
bandwidth. The aggregate matches the buffered-vs-buffered baseline
(863 MB/s), indicating fair sharing regardless of I/O mode.
The dontcache writer's p99.9 latency collapsed from 119 ms to
33 ms (-73%), eliminating the severe periodic stalls seen in the
baseline. Both writers now share identical latency profiles,
matching the buffered-vs-buffered pattern.
The per-bdi_writeback dirty tracking dramatically reduces peak dirty
pages in dontcache workloads, with the 32-file test dropping from
1.8 GB to 213 MB. Dontcache sequential write throughput triples and
multi-writer throughput reaches parity with buffered I/O, with tail
latencies collapsing by 1-2 orders of magnitude.
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Series: https://patchwork.kernel.org/project/linux-fsdevel/list/?series=1090088
Submitter: Jeff Layton
Version: 6
Patches: 2/2
Message-ID: <20260505-dontcache-v6-0-66463805dd6a@kernel.org>
Base: vfs.base.ci
Lore: https://lore.kernel.org/linux-fsdevel/20260505-dontcache-v6-0-66463805dd6a@kernel.org
Automated by ml2pr