Details
- Type: Technical task
- Resolution: Unresolved
- Priority: Minor
Description
When writing a large file into Lustre with WBC enabled (aging_keep flush mode), the write trapped into an endless loop:
dd if=/dev/zero of=/mnt/lustre/tdir/tfile bs=1M count=4096
cat /proc/22735/stack
[<0>] balance_dirty_pages+0x426/0xcd0
[<0>] balance_dirty_pages_ratelimited+0x2af/0x3b0
[<0>] generic_perform_write+0x16a/0x1b0
[<0>] __generic_file_write_iter+0xfa/0x1c0
[<0>] generic_file_write_iter+0xab/0x150
[<0>] memfs_file_write_iter+0xd7/0x180 [lustre]
[<0>] new_sync_write+0x124/0x170
[<0>] vfs_write+0xa5/0x1a0
[<0>] ksys_write+0x4f/0xb0
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca
[<0>] 0xffffffffffffffff
The reason is that, due to the rate-limit mechanism in the Linux kernel, @balance_dirty_pages() makes the current writing process try to write out some dirty pages, but the pages are pinned in MemFS and are not reclaimable.
We found that, on a client with 96G of memory, the write traps into the endless loop once the write size is larger than 8G.
There are two possible solutions.
One solution is to disable dirty page accounting for the BDI:
sb->s_bdi->capabilities |= BDI_CAP_NO_ACCT_DIRTY;

void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
	struct inode *inode = mapping->host;
	struct backing_dev_info *bdi = inode_to_bdi(inode);
	struct bdi_writeback *wb = NULL;
	int ratelimit;
	int *p;

	if (!bdi_cap_account_dirty(bdi))
		return;
	...
This way, balance_dirty_pages() is never triggered, and the client can write as many cache pages as possible before reaching the page cache limits in MemFS.
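For context, a minimal sketch of where such a flag could be set, assuming a fill_super-style mount hook; the function name memfs_fill_super() is hypothetical and not taken from any actual patch, while super_setup_bdi() and BDI_CAP_NO_ACCT_DIRTY are the standard kernel helpers on kernels that still carry this capability flag:

```c
/* Illustrative sketch: disable dirty-page accounting for the
 * filesystem's BDI so that balance_dirty_pages_ratelimited()
 * returns early for its pages. memfs_fill_super() is a
 * hypothetical mount-time hook. */
static int memfs_fill_super(struct super_block *sb, void *data, int silent)
{
	int rc;

	/* Allocate a per-filesystem backing_dev_info. */
	rc = super_setup_bdi(sb);
	if (rc)
		return rc;

	/* Opt out of kernel dirty-page accounting and throttling:
	 * the pinned MemFS pages cannot be reclaimed by the kernel
	 * writeback path anyway. */
	sb->s_bdi->capabilities |= BDI_CAP_NO_ACCT_DIRTY;

	return 0;
}
```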
The other solution is this:
when the inode is written out in @balance_dirty_pages->wb_start_background_writeback(), the client assimilates the cache pages from MemFS into Lustre. Once assimilated, the pages in Lustre are reclaimable, and the dirty pages can be written out to the Lustre backend.
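The second approach could be sketched as a hook in the writeback path; wbc_inode_assimilate() and memfs_writepages() below are hypothetical names invented for illustration (only generic_writepages() and the writeback_control plumbing are standard kernel interfaces):

```c
/* Illustrative sketch of the second approach: when writeback starts
 * for an inode, first assimilate its cache pages from MemFS into
 * Lustre so they become reclaimable, then let normal writeback push
 * the dirty data to the Lustre backend.
 * wbc_inode_assimilate() is a hypothetical helper. */
static int memfs_writepages(struct address_space *mapping,
			    struct writeback_control *wbc)
{
	struct inode *inode = mapping->host;
	int rc;

	/* Move (assimilate) the pinned MemFS pages into the Lustre
	 * page cache; after this the pages are owned by Lustre and
	 * are reclaimable. */
	rc = wbc_inode_assimilate(inode);
	if (rc)
		return rc;

	/* Now the dirty pages can be written out to the backend. */
	return generic_writepages(mapping, wbc);
}
```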