Details
Bug
Resolution: Fixed
Major
Lustre 2.10.3
Description
Hello,
When a userspace process allocates memory aggressively, the OOM killer sometimes does not manage to kick in because Lustre is still trying to free its memory.
I am not sure whether it deadlocked or whether there are simply too many locks for it to free, but the machine had been in this state for more than 12 hours before it was manually crashed.
This is CentOS 7.4 with kernel 3.10.0-693.5.2.el7.x86_64.
The machine still responds to pings when it is in this state.
Here is one of the kernel task stacks:
[223483.032862] [<ffffffff81196b27>] ? putback_inactive_pages+0x117/0x2d0
[223483.050260] [<ffffffff81196f0a>] ? shrink_inactive_list+0x22a/0x5d0
[223483.062319] [<ffffffff811979a5>] shrink_lruvec+0x385/0x730
[223483.073571] [<ffffffffc085ee07>] ? ldlm_cli_pool_shrink+0x67/0x100 [ptlrpc]
[223483.086214] [<ffffffff81197dc6>] shrink_zone+0x76/0x1a0
[223483.096773] [<ffffffff811982d0>] do_try_to_free_pages+0xf0/0x4e0
[223483.108086] [<ffffffff811987bc>] try_to_free_pages+0xfc/0x180
[223483.119023] [<ffffffff8169fbcb>] __alloc_pages_slowpath+0x457/0x724
[223483.130417] [<ffffffff8118cdb5>] __alloc_pages_nodemask+0x405/0x420
[223483.141673] [<ffffffff811d081a>] alloc_page_interleave+0x3a/0xa0
[223483.152526] [<ffffffff811d4133>] alloc_pages_vma+0x143/0x200
[223483.162848] [<ffffffff811c37a0>] ? end_swap_bio_write+0x80/0x80
[223483.173345] [<ffffffff811c44ad>] read_swap_cache_async+0xed/0x160
[223483.183938] [<ffffffff811c45c8>] swapin_readahead+0xa8/0x110
[223483.193933] [<ffffffff811b22cb>] handle_mm_fault+0xadb/0xfa0
[223483.203823] [<ffffffff816b00b4>] __do_page_fault+0x154/0x450
[223483.213621] [<ffffffff816b03e5>] do_page_fault+0x35/0x90
[223483.222983] [<ffffffff816ac608>] page_fault+0x28/0x30
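In case it helps with reading traces like the one above, here is a minimal Python sketch that splits a kernel stack dump into individual frames. It is only an illustration: the function name `parse_frames` and the dictionary keys are made up for this example, and the pattern assumes the frame layout seen in this trace (timestamp, address, optional `?`, `symbol+offset/size`, optional module in brackets). A `?` marks an entry that the kernel found on the stack but could not confirm as part of the call chain, so those frames are flagged as not reliable.

```python
import re

# One frame looks like:
#   [223483.073571] [<ffffffffc085ee07>] ? ldlm_cli_pool_shrink+0x67/0x100 [ptlrpc]
# The layout is assumed from the trace in this report, not from a formal spec.
FRAME_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"            # dmesg timestamp
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"          # return address
    r"(?P<unreliable>\?\s+)?"                # '?' = unverified stack entry
    r"(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"           # module name, e.g. [ptlrpc]
)


def parse_frames(text):
    """Return one dict per stack frame found in text (hypothetical helper)."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append({
            "timestamp": float(m.group("ts")),
            "address": m.group("addr"),
            "symbol": m.group("symbol"),
            "offset": int(m.group("offset"), 16),
            "size": int(m.group("size"), 16),
            "module": m.group("module"),          # None for core kernel symbols
            "reliable": m.group("unreliable") is None,
        })
    return frames
```

Filtering the result on `frame["module"] == "ptlrpc"` picks out the Lustre frames, which in this trace is only `ldlm_cli_pool_shrink`. The parser works the same whether the dump has one frame per line or all frames on a single line.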
Please let me know if you need more information.
Regards.
Jacek Tomaka