fs: Deferred inode reclaim #1300

Open
vfsci-bot[bot] wants to merge 4 commits into vfs.base.ci from pw/1087657/vfs.base.ci

vfsci-bot[bot] commented Apr 29, 2026

Series: https://patchwork.kernel.org/project/linux-fsdevel/list/?series=1087657
Submitter: Jan Kara
Version: 1
Patches: 4/4
Message-ID: <20260429174850.18223-1-jack@suse.cz>
Base: vfs.base.ci
Lore: https://lore.kernel.org/linux-fsdevel/20260429174850.18223-1-jack@suse.cz


Automated by ml2pr

jankara added 4 commits April 29, 2026 19:26
When an inode has dirtied timestamps, we currently call sync_lazytime() on
the last iput(). This is done because an inode with any dirty bit set is
not inserted into the LRU, and dirty timestamps expire only after many (12
by default) hours, so these inodes would sit outside of LRU aging for a
really long time. However, this can result in doing IO and consequently
GFP_NOFAIL allocations from dentry reclaim, making MM complain. A sample
trace for ext4 is:

prune_dcache_sb
shrink_dentry_list
__dentry_kill
iput
sync_lazytime
__mark_inode_dirty
ext4_dirty_inode
__ext4_mark_inode_dirty
ext4_reserve_inode_write
ext4_get_inode_loc
bdev_getblk
__filemap_get_folio_mpol

Avoid this dirtying on the last iput() by moving unused inodes to the
beginning of the b_dirty_time list and clobbering dirtied_time_when
instead, so that they get written during the next periodic writeback.

Signed-off-by: Jan Kara <jack@suse.cz>
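
The idea in the commit message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the patch itself: `toy_inode`, `toy_list`, and every `toy_*` function are hypothetical names, time is counted in plain seconds, and the toy writeback scan is assumed to start at the front of the list and stop at the first unexpired entry. The real b_dirty_time handling in the kernel may order and scan the list differently.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed expiry interval: the 12 hour default mentioned above, in seconds. */
#define TOY_EXPIRE_INTERVAL (12UL * 60UL * 60UL)

/* Hypothetical stand-in for an inode sitting on a dirty-time list. */
struct toy_inode {
    unsigned long dirtied_time_when;   /* when the timestamps were dirtied */
    struct toy_inode *prev, *next;
};

/* Circular doubly linked list with a sentinel head. */
struct toy_list {
    struct toy_inode head;
};

static void toy_list_init(struct toy_list *l)
{
    l->head.prev = l->head.next = &l->head;
}

static void toy_add_head(struct toy_list *l, struct toy_inode *i)
{
    i->next = l->head.next;
    i->prev = &l->head;
    l->head.next->prev = i;
    l->head.next = i;
}

static void toy_add_tail(struct toy_list *l, struct toy_inode *i)
{
    i->prev = l->head.prev;
    i->next = &l->head;
    l->head.prev->next = i;
    l->head.prev = i;
}

static void toy_unlink(struct toy_inode *i)
{
    assert(i->prev != NULL && i->next != NULL);   /* must be on a list */
    i->prev->next = i->next;
    i->next->prev = i->prev;
    i->prev = i->next = NULL;
}

/*
 * Last-reference path in the toy model: do no IO. Backdate the dirtied
 * time so the inode already counts as expired and move it to the front,
 * where the periodic scan below will find it first.
 */
static void toy_expire_on_last_iput(struct toy_list *l, struct toy_inode *i,
                                    unsigned long now)
{
    toy_unlink(i);
    i->dirtied_time_when = now - TOY_EXPIRE_INTERVAL;   /* unsigned wraparound is fine */
    toy_add_head(l, i);
}

/* Periodic writeback in the toy model: count expired inodes at the front. */
static size_t toy_count_expired(const struct toy_list *l, unsigned long now)
{
    size_t n = 0;

    for (const struct toy_inode *i = l->head.next; i != &l->head; i = i->next) {
        if (now - i->dirtied_time_when < TOY_EXPIRE_INTERVAL)
            break;
        n++;
    }
    return n;
}
```

The point of the model is that the last-reference path only touches list pointers and a timestamp, so the allocation-heavy work is left to the periodic scan.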
Reclaim of some inodes is rather complex, requiring running transactions
or doing other IO. Consequently, filesystems end up doing GFP_NOFAIL
allocations from kswapd or even direct reclaim, which is problematic
because forward progress of these allocations isn't guaranteed. Add
infrastructure for marking inodes whose reclaim is difficult, and offload
reclaim of such inodes to a workqueue so that kswapd is not blocked by
difficult inode reclaim.

Signed-off-by: Jan Kara <jack@suse.cz>
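
A rough userspace sketch of the split described above: easy inodes are reclaimed inline, while inodes marked as difficult are only queued for a worker. All names here (`toy_inode`, `reclaim_is_hard`, `toy_prune`, `toy_deferred_worker`) are hypothetical and chosen for illustration; the actual flag, list, and workqueue plumbing in the series is not shown in this page and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_DEFERRED 64   /* arbitrary capacity for the toy queue */

/* Hypothetical inode with a "reclaim is difficult" mark. */
struct toy_inode {
    int ino;
    bool reclaim_is_hard;   /* assumed to be set by the filesystem */
    bool freed;
};

/* Stand-in for the deferred list that a workqueue item would drain. */
struct toy_deferred {
    struct toy_inode *items[TOY_MAX_DEFERRED];
    size_t count;
};

static void toy_evict(struct toy_inode *i)
{
    i->freed = true;   /* the real eviction may run transactions and do IO */
}

/*
 * Shrinker-side path: reclaim easy inodes inline and only queue the hard
 * ones, so the caller never waits on their eviction.
 * Returns the number of inodes freed inline.
 */
static size_t toy_prune(struct toy_inode *lru, size_t n, struct toy_deferred *q)
{
    size_t freed = 0;

    for (size_t k = 0; k < n; k++) {
        struct toy_inode *i = &lru[k];

        if (i->reclaim_is_hard) {
            assert(q->count < TOY_MAX_DEFERRED);
            q->items[q->count++] = i;
            continue;
        }
        toy_evict(i);
        freed++;
    }
    return freed;
}

/* Worker-side path: evict everything queued. Returns the number evicted. */
static size_t toy_deferred_worker(struct toy_deferred *q)
{
    size_t done = 0;

    while (q->count > 0) {
        toy_evict(q->items[--q->count]);
        done++;
    }
    return done;
}
```

In this model the worker runs outside the reclaim path, which is where blocking allocations are less problematic.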
Deferring difficult inode reclaim from prune_icache_sb() to a workqueue
removes the natural feedback loop of blocking tasks in direct reclaim
until they make space for new allocations. This can result in the list
of deferred inodes growing without bound and possibly pushing the
machine into a reclaim storm or OOM.

Add a throttling mechanism that slows down tasks in
mark_inode_reclaim_deferred() if the list of deferred inodes to reclaim
grows over a limit. We measure the average time it takes to reclaim an
inode on the deferred list and block tasks proportionally to that.

Signed-off-by: Jan Kara <jack@suse.cz>
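
One possible shape for such throttling, as a sketch only: the commit message does not give the formula, so the exponentially weighted moving average (weight 1/8) and the "excess entries times average reclaim time" delay below are assumptions. The `toy_throttle` structure and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical throttle state for the deferred reclaim list. */
struct toy_throttle {
    uint64_t avg_reclaim_ns;   /* running average time to reclaim one inode */
    uint64_t nr_deferred;      /* current length of the deferred list */
    uint64_t limit;            /* length above which callers are slowed down */
};

/*
 * Worker side: fold one measured reclaim time into the running average.
 * The 7/8 + 1/8 weighting is an assumption made for this sketch.
 */
static void toy_record_reclaim(struct toy_throttle *t, uint64_t took_ns)
{
    assert(t != NULL);
    if (t->avg_reclaim_ns == 0)
        t->avg_reclaim_ns = took_ns;
    else
        t->avg_reclaim_ns = (7 * t->avg_reclaim_ns + took_ns) / 8;
}

/*
 * Caller side: how long a task deferring another inode should sleep.
 * No delay while the list is within the limit; beyond it, the delay grows
 * with the number of excess entries, scaled by the average reclaim time.
 */
static uint64_t toy_throttle_delay_ns(const struct toy_throttle *t)
{
    assert(t != NULL);
    if (t->nr_deferred <= t->limit)
        return 0;
    return (t->nr_deferred - t->limit) * t->avg_reclaim_ns;
}
```

Scaling the delay by the measured reclaim time restores a form of feedback: the slower the worker drains the list, the longer producers wait.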
When we have to free preallocations during inode eviction, we need to
load block bitmaps and run a transaction to modify them. This takes time
and also requires GFP_NOFAIL allocations. Mark inodes with preallocated
blocks as needing offloading of inode reclaim to a workqueue so that we
don't block reclaim for long and potentially deadlock the MM subsystem.

Signed-off-by: Jan Kara <jack@suse.cz>