Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: Lustre 2.12.8
- Component/s: None
- Environment: RHEL 7 - Kernel 3.10.0-1160.42.2.el7.x86_64
- Severity: 2
Description
Our compute nodes have 384GB of memory and 192GB of swap space. When an application uses a lot of memory (all of the 384GB and some of the 192GB of swap), many processes reading from or writing to Lustre enter the D state, hang, and never recover. We see the following in syslog. Note: swap never fills up completely.
Jan 3 18:26:03 spool0121 kernel: INFO: task kswapd0:510 blocked for more than 120 seconds.
Jan 3 18:26:03 spool0121 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 18:26:03 spool0121 kernel: kswapd0 D ffffa0f83caf85e0 0 510 2 0x00000000
Jan 3 18:26:03 spool0121 kernel: Call Trace:
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa3387480>] ? bit_wait+0x50/0x50
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa3389179>] schedule+0x29/0x70
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa3386e41>] schedule_timeout+0x221/0x2d0
Jan 3 18:26:03 spool0121 kernel: [<ffffffffc152111c>] ? cl_io_slice_add+0x5c/0x190 [obdclass]
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2d06992>] ? ktime_get_ts64+0x52/0xf0
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa3387480>] ? bit_wait+0x50/0x50
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa3388a2d>] io_schedule_timeout+0xad/0x130
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa3388ac8>] io_schedule+0x18/0x20
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa3387491>] bit_wait_io+0x11/0x50
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa3386fb7>] __wait_on_bit+0x67/0x90
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2dbd3c1>] wait_on_page_bit+0x81/0xa0
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2cc7010>] ? wake_bit_function+0x40/0x40
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2dd380b>] shrink_page_list+0x9eb/0xc30
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2dd2853>] ? isolate_lru_pages.isra.47+0xd3/0x190
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2dd4066>] shrink_inactive_list+0x1b6/0x5c0
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2dcd77e>] ? release_pages+0x24e/0x430
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2dd4b45>] shrink_lruvec+0x375/0x730
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2cd2c60>] ? task_rq_unlock+0x20/0x20
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2dd5d36>] mem_cgroup_shrink_node_zone+0xa6/0x170
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2e428c3>] mem_cgroup_soft_limit_reclaim+0x1e3/0x4b0
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2dd60e0>] balance_pgdat+0x2e0/0x5e0
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2dd6553>] kswapd+0x173/0x430
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2cc6f50>] ? wake_up_atomic_t+0x30/0x30
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2dd63e0>] ? balance_pgdat+0x5e0/0x5e0
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2cc5e61>] kthread+0xd1/0xe0
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2cc5d90>] ? insert_kthread_work+0x40/0x40
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa3395ddd>] ret_from_fork_nospec_begin+0x7/0x21
Jan 3 18:26:03 spool0121 kernel: [<ffffffffa2cc5d90>] ? insert_kthread_work+0x40/0x40
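To help capture the state of an affected node while the hang is occurring, here is a minimal diagnostic sketch (not part of the original report; it assumes only the standard /proc interfaces on RHEL 7 and needs root to read /proc/<pid>/stack). It prints current swap usage from /proc/meminfo and lists tasks stuck in uninterruptible (D) sleep together with the top of their kernel stacks:

#!/usr/bin/env python3
# Sketch only: report swap usage and D-state tasks from /proc.
import os

def read_first_line(path):
    try:
        with open(path) as f:
            return f.readline()
    except OSError:
        return ""  # process exited or permission denied

# Swap usage from /proc/meminfo (values are in kB).
meminfo = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, _, rest = line.partition(":")
        meminfo[key] = rest.split()[0]
print("SwapTotal kB:", meminfo.get("SwapTotal"),
      "SwapFree kB:", meminfo.get("SwapFree"))

# Walk /proc and print tasks in uninterruptible sleep (state 'D').
for pid in filter(str.isdigit, os.listdir("/proc")):
    stat = read_first_line(f"/proc/{pid}/stat")
    if not stat:
        continue
    # The state field follows the comm field, which is wrapped in
    # parentheses and may itself contain spaces, so split after ')'.
    comm_end = stat.rfind(")")
    state = stat[comm_end + 2:comm_end + 3]
    if state == "D":
        comm = stat[stat.find("(") + 1:comm_end]
        print(f"PID {pid} ({comm}) is in D state")
        # /proc/<pid>/stack is readable by root only; may be empty otherwise.
        stack = read_first_line(f"/proc/{pid}/stack")
        if stack:
            print("  top of stack:", stack.strip())

Running this periodically while a job fills memory should show whether the stuck tasks are blocked in wait_on_page_bit / bit_wait_io, matching the kswapd0 trace above.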