Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
Description
At least two nodes append-write to a shared file in Lustre with memcg enabled.
The writing processes hang with the following call trace:
[root@ouzo2 margu]$ cat /proc/2593160/stack
[<0>] balance_dirty_pages+0x2ee/0xd10
[<0>] balance_dirty_pages_ratelimited_flags+0x27a/0x380
[<0>] generic_perform_write+0x150/0x210
[<0>] vvp_io_write_start+0x516/0xc00 [lustre]
[<0>] cl_io_start+0x5a/0x110 [obdclass]
[<0>] cl_io_loop+0x97/0x1f0 [obdclass]
[<0>] ll_file_io_generic+0x4d2/0xe50 [lustre]
[<0>] do_file_write_iter+0x3e9/0x5d0 [lustre]
[<0>] vfs_write+0x2cb/0x410
[<0>] ksys_write+0x5f/0xe0
[<0>] do_syscall_64+0x5c/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
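For context, the workload is essentially concurrent O_APPEND writers on one shared Lustre file, run inside a memory cgroup so that balance_dirty_pages() throttling is exercised. A minimal reproducer sketch along these lines (file path, write size, and cgroup setup are illustrative, not taken from the original report):

/* append_writer.c - illustrative sketch: run one copy on each client node
 * against the same Lustre file, from inside a memory-limited cgroup. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/lustre/shared.dat";
	char buf[1 << 16];		/* 64 KiB per write, arbitrary */
	int fd;

	memset(buf, 'a', sizeof(buf));
	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (;;) {			/* keep appending until the hang reproduces */
		if (write(fd, buf, sizeof(buf)) < 0) {
			perror("write");
			break;
		}
	}
	close(fd);
	return 0;
}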
After analyzing the core dump of the hung system, we found that the bdi_writeback structure (wb) corresponding to the memcg has dirty I/O (state 5: WB_registered | WB_has_dirty_io) and dirty pages, but the dirty pages are never written out, so the writer loops forever in balance_dirty_pages().
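For reference, wb->state is a bitmask over enum wb_state (include/linux/backing-dev-defs.h; the exact set of bits varies across kernel versions), so on recent kernels a value of 5 decodes to WB_registered | WB_has_dirty_io:

/* include/linux/backing-dev-defs.h (recent kernels, abridged) */
enum wb_state {
	WB_registered,		/* bdi_register() was done */
	WB_writeback_running,	/* writeback is in progress */
	WB_has_dirty_io,	/* dirty inodes on ->b_{dirty|io|more_io} */
	WB_start_all,		/* nr_pages == 0 (all) work pending */
};

/* wb->state holds (1 << WB_*) flags, so:
 *   5 == (1 << WB_registered) | (1 << WB_has_dirty_io)
 */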
This is a bug in the Lustre memcg code. In OSC, writeback of dirty pages stops as soon as the unstable-page count for an OSC reaches zero. This is incorrect: the dirty pages should still be written back so that the wb stat accounting is updated and the writing process can continue instead of looping forever.
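A minimal sketch of the problematic control flow, assuming the description above (all names below are hypothetical, not the actual Lustre symbols):

/* Illustrative sketch only: the types and functions here are hypothetical
 * stand-ins, not the real Lustre code. */
struct osc_stats {
	long nr_unstable;	/* pages written but not yet committed on the OST */
	long nr_dirty;		/* dirty pages still awaiting writeback */
};

/* Hypothetical helper that writes the dirty pages out and updates the
 * per-memcg wb dirty/writeback counters as the pages complete. */
long osc_write_dirty_pages(struct osc_stats *osc);

/* Per-memcg writeback entry point for one OSC (sketch). */
long osc_memcg_writeback(struct osc_stats *osc)
{
	/*
	 * Buggy behaviour: bail out as soon as the unstable-page count is
	 * zero.  The dirty pages are never written out, the per-memcg wb
	 * statistics never drop, and the throttled writer spins forever in
	 * balance_dirty_pages().
	 */
	if (osc->nr_unstable == 0)
		return 0;		/* incorrect early return */

	/*
	 * Correct behaviour: always write back the dirty pages so the wb
	 * accounting is updated and the writer can make progress.
	 */
	return osc_write_dirty_pages(osc);
}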
Attachments
Issue Links
- is related to LU-19043 Check vm.dirty_ratio at client mount time (Open)