[LU-15427] Lustre client hangs under memory pressure Created: 10/Jan/22  Updated: 10/Jan/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.8
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Joe Frith Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

RHEL 7 - Kernel 3.10.0-1160.42.2.el7.x86_64


Epic/Theme: patch
Severity: 2

Description

Our compute nodes have 384GB of memory and 192GB of swap space. When an application uses a large amount of memory (all 384GB of RAM and part of the 192GB of swap), many processes reading from or writing to Lustre enter the D state, hang, and never recover. We see the trace below in syslog (a reproducer sketch follows the trace). Note: swap never fills up completely.

Jan  3 18:26:03 spool0121 kernel: INFO: task kswapd0:510 blocked for more than 120 seconds.
Jan  3 18:26:03 spool0121 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  3 18:26:03 spool0121 kernel: kswapd0         D ffffa0f83caf85e0     0   510      2 0x00000000
Jan  3 18:26:03 spool0121 kernel: Call Trace:
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3387480>] ? bit_wait+0x50/0x50
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3389179>] schedule+0x29/0x70
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3386e41>] schedule_timeout+0x221/0x2d0
Jan  3 18:26:03 spool0121 kernel: [<ffffffffc152111c>] ? cl_io_slice_add+0x5c/0x190 [obdclass]
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2d06992>] ? ktime_get_ts64+0x52/0xf0
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3387480>] ? bit_wait+0x50/0x50
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3388a2d>] io_schedule_timeout+0xad/0x130
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3388ac8>] io_schedule+0x18/0x20
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3387491>] bit_wait_io+0x11/0x50
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3386fb7>] __wait_on_bit+0x67/0x90
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dbd3c1>] wait_on_page_bit+0x81/0xa0
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2cc7010>] ? wake_bit_function+0x40/0x40
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd380b>] shrink_page_list+0x9eb/0xc30
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd2853>] ? isolate_lru_pages.isra.47+0xd3/0x190
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd4066>] shrink_inactive_list+0x1b6/0x5c0
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dcd77e>] ? release_pages+0x24e/0x430
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd4b45>] shrink_lruvec+0x375/0x730
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2cd2c60>] ? task_rq_unlock+0x20/0x20
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd5d36>] mem_cgroup_shrink_node_zone+0xa6/0x170
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2e428c3>] mem_cgroup_soft_limit_reclaim+0x1e3/0x4b0
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd60e0>] balance_pgdat+0x2e0/0x5e0
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd6553>] kswapd+0x173/0x430
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2cc6f50>] ? wake_up_atomic_t+0x30/0x30
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd63e0>] ? balance_pgdat+0x5e0/0x5e0
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2cc5e61>] kthread+0xd1/0xe0
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2cc5d90>] ? insert_kthread_work+0x40/0x40
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3395ddd>] ret_from_fork_nospec_begin+0x7/0x21
Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2cc5d90>] ? insert_kthread_work+0x40/0x40
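
For completeness, below is a minimal reproducer sketch of the kind of workload under which we see the hang: a large anonymous allocation held resident while buffered writes dirty Lustre page cache. It is an illustration only, not a confirmed test case; the mount point /mnt/lustre, the file name, and the size constants are assumptions and need to be scaled to the actual node (384GB RAM, 192GB swap).

#!/usr/bin/env python3
# Hypothetical reproducer sketch (assumptions: Lustre client mounted at
# /mnt/lustre; file name and size constants chosen for illustration only).
# It combines a large resident anonymous allocation with buffered writes to
# a Lustre file, the same mix of memory pressure and dirty page-cache I/O
# present when the hang is observed.
import os

LUSTRE_FILE = "/mnt/lustre/pressure_test.dat"  # assumption: path on the Lustre mount
ALLOC_GB = 8    # grow toward the node's RAM size to force reclaim
WRITE_GB = 4    # buffered data written to Lustre while memory is scarce

def main():
    # Hold a large anonymous allocation so kswapd has to reclaim page cache.
    hog = bytearray(ALLOC_GB * 1024**3)
    for i in range(0, len(hog), 4096):
        hog[i] = 1  # touch every page so the allocation stays resident

    # Dirty Lustre pages with plain buffered writes while the hog is held.
    chunk = b"\0" * (1 << 20)  # 1 MiB per write
    with open(LUSTRE_FILE, "wb") as f:
        for _ in range(WRITE_GB * 1024):
            f.write(chunk)

    os.remove(LUSTRE_FILE)

if __name__ == "__main__":
    main()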

