Lustre / LU-13212

Lustre client hangs machine under memory pressure

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.10.3

    Description

      Hello,

      When a userspace process goes crazy with memory allocation, the OOM killer sometimes fails to kick in because Lustre is still trying to free its memory.
      I am not sure whether it deadlocked or there are simply too many locks it is trying to free, but the machine had been in this state for more than 12 hours before it was manually crashed.
      This is CentOS 7.4 with kernel 3.10.0-693.5.2.el7.x86_64.
      The machine still responds to pings while it is in this state.

      Here is one of the kernel task stacks:

      [223483.032862]  [<ffffffff81196b27>] ? putback_inactive_pages+0x117/0x2d0
      [223483.050260]  [<ffffffff81196f0a>] ? shrink_inactive_list+0x22a/0x5d0
      [223483.062319]  [<ffffffff811979a5>] shrink_lruvec+0x385/0x730
      [223483.073571]  [<ffffffffc085ee07>] ? ldlm_cli_pool_shrink+0x67/0x100 [ptlrpc]
      [223483.086214]  [<ffffffff81197dc6>] shrink_zone+0x76/0x1a0
      [223483.096773]  [<ffffffff811982d0>] do_try_to_free_pages+0xf0/0x4e0
      [223483.108086]  [<ffffffff811987bc>] try_to_free_pages+0xfc/0x180
      [223483.119023]  [<ffffffff8169fbcb>] __alloc_pages_slowpath+0x457/0x724
      [223483.130417]  [<ffffffff8118cdb5>] __alloc_pages_nodemask+0x405/0x420
      [223483.141673]  [<ffffffff811d081a>] alloc_page_interleave+0x3a/0xa0
      [223483.152526]  [<ffffffff811d4133>] alloc_pages_vma+0x143/0x200
      [223483.162848]  [<ffffffff811c37a0>] ? end_swap_bio_write+0x80/0x80
      [223483.173345]  [<ffffffff811c44ad>] read_swap_cache_async+0xed/0x160
      [223483.183938]  [<ffffffff811c45c8>] swapin_readahead+0xa8/0x110
      [223483.193933]  [<ffffffff811b22cb>] handle_mm_fault+0xadb/0xfa0
      [223483.203823]  [<ffffffff816b00b4>] __do_page_fault+0x154/0x450
      [223483.213621]  [<ffffffff816b03e5>] do_page_fault+0x35/0x90
      [223483.222983]  [<ffffffff816ac608>] page_fault+0x28/0x30
      

      Please let me know if you need more information.
      Regards.
      Jacek Tomaka


          Activity

            [LU-13212] Lustre client hangs machine under memory pressure
            pjones Peter Jones added a comment -

            Landed for 2.15


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43281/
            Subject: LU-13212 osc: fall back to vmalloc for large RPCs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 037a9e2cf6d5b8d6fdbcde02c1c22e22272c5c07
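The patch subject describes the approach: a large physically-contiguous kmalloc() for an RPC descriptor is what drags the caller into direct reclaim, so the fix is to fail that attempt fast and fall back to vmalloc(), which only needs page-sized contiguity (this is essentially what the kernel's kvmalloc() helper does). A userspace sketch of the same fallback pattern, with hypothetical helper names and a made-up size threshold standing in for the kernel's behavior:

```c
#include <stdlib.h>

/* Model: contiguous allocations above this size "fail fast", the way
 * kmalloc(..., __GFP_NORETRY) would give up instead of entering
 * direct reclaim. Threshold is illustrative only. */
#define CONTIG_LIMIT (4 * 4096)

static void *try_contig_alloc(size_t size)
{
	if (size > CONTIG_LIMIT)
		return NULL;		/* models __GFP_NORETRY giving up */
	return malloc(size);
}

static void *vmalloc_like(size_t size)
{
	void *buf;
	/* page-aligned, no physical-contiguity requirement in the model */
	if (posix_memalign(&buf, 4096, size) != 0)
		return NULL;
	return buf;
}

/* Hypothetical name; sketches the kvmalloc-style fallback the patch
 * subject ("fall back to vmalloc for large RPCs") refers to. */
void *rpc_buf_alloc(size_t size)
{
	void *buf = try_contig_alloc(size);

	if (buf == NULL)
		buf = vmalloc_like(size);
	return buf;
}
```

The point of the design is that the caller never blocks in reclaim on behalf of a large buffer: either the cheap contiguous path succeeds immediately, or the allocation is satisfied from the non-contiguous path.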


            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43281
            Subject: LU-13212 osc: fall back to vmalloc for large RPCs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: af1bc8b2f62b47e79633768ef4a5182a737de6ae

            lflis Lukasz Flis added a comment -

            I am not sure if it's the same issue, but we've seen the problem of tasks getting hung on memory allocation instead of being killed by the OOM killer.

            We can trigger the problem on 2.10 (also ddn22) and 2.12.5 simply by running a process that allocates 18GB of RAM inside a Singularity container, itself inside a mem cgroup with a memory limit of 16GB.

            [632523.103572] allocator-rss R running task 0 2310 2258 0x00000002
            [632523.137483] Call Trace:
            [632523.150045] [<ffffffff82b06142>] ? ktime_get_ts64+0x52/0xf0
            [632523.177593] [<ffffffffc074fc05>] ? ktime_get_seconds+0x25/0x40 [libcfs]
            [632523.209245] [<ffffffffc0e33c35>] ? osc_cache_shrink_count+0x15/0x90 [osc]
            [632523.241671] [<ffffffffc0e1f162>] ? osc_cache_shrink+0x42/0x60 [osc]
            [632523.271743] [<ffffffff82bd15ce>] ? shrink_slab+0xae/0x340
            [632523.297716] [<ffffffff82bd495a>] ? do_try_to_free_pages+0x3ca/0x520
            [632523.327818] [<ffffffff82bd4bac>] ? try_to_free_pages+0xfc/0x180
            [632523.356911] [<ffffffff82c81c1e>] ? free_more_memory+0xae/0x100
            [632523.385397] [<ffffffff82c82f8b>] ? __getblk+0x15b/0x300
            [632523.410757] [<ffffffffc05c715c>] ? squashfs_read_data+0x15c/0x620 [squashfs]
            [632523.444063] [<ffffffffc05c78e6>] ? squashfs_cache_get+0x2c6/0x3c0 [squashfs]
            [632523.477907] [<ffffffffc05c78da>] ? squashfs_cache_get+0x2ba/0x3c0 [squashfs]
            [632523.511713] [<ffffffffc05c7ec8>] ? squashfs_read_metadata+0x58/0x130 [squashfs]
            [632523.546580] [<ffffffffc05c7ff1>] ? squashfs_get_datablock+0x21/0x30 [squashfs]
            [632523.581404] [<ffffffffc05c9272>] ? squashfs_readpage+0x8a2/0xc30 [squashfs]
            [632523.616332] [<ffffffff82bcb0a8>] ? __do_page_cache_readahead+0x248/0x260
            [632523.648564] [<ffffffff82bcb691>] ? ra_submit+0x21/0x30
            [632523.674147] [<ffffffff82bc0045>] ? filemap_fault+0x105/0x490
            [632523.701357] [<ffffffff82bec15a>] ? __do_fault.isra.61+0x8a/0x100
            [632523.730618] [<ffffffff82bec70c>] ? do_read_fault.isra.63+0x4c/0x1b0
            [632523.760999] [<ffffffff82bf11ba>] ? handle_pte_fault+0x22a/0xe20
            [632523.789538] [<ffffffff82be94cc>] ? __get_user_pages+0x16c/0x7a0
            [632523.817942] [<ffffffff82bf3ecd>] ? handle_mm_fault+0x39d/0x9b0
            [632523.846014] [<ffffffff83188653>] ? __do_page_fault+0x213/0x500
            [632523.874110] [<ffffffff83188975>] ? do_page_fault+0x35/0x90
            [632523.900554] [<ffffffff83184778>] ? page_fault+0x28/0x30

             

            For some reason, running the same allocator outside of the container but inside the same cgroup with the 16GB limit always ends in a successful OOM kill.
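The reproducer described above can be sketched as a trivial allocator that keeps memory resident until the cgroup limit is exceeded. Sizes are parameters here, not anything from the kernel; run it inside a memory cgroup whose limit is below `total` (e.g. total=18GB, limit=16GB) to exercise the reclaim/OOM path:

```c
#include <stdlib.h>
#include <string.h>

/* Allocate and touch `total` bytes in `chunk`-sized pieces. The
 * memset forces the pages to be faulted in so they count as RSS;
 * the memory is intentionally leaked, since the reproducer is meant
 * to hold it until the OOM killer (hopefully) intervenes. */
int allocate_rss(size_t total, size_t chunk)
{
	size_t done = 0;

	while (done < total) {
		char *p = malloc(chunk);

		if (p == NULL)
			return -1;
		memset(p, 0xab, chunk);
		done += chunk;
	}
	return 0;
}
```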

            Tomaka Jacek Tomaka (Inactive) added a comment - edited

            Here are some more stack traces (not necessarily from the same instance of the incident):

            #23 [ffff881e9d09b620] shrink_slab at ffffffff81195389
            #24 [ffff881e9d09b6c0] do_try_to_free_pages at ffffffff811985a2
            #25 [ffff881e9d09b738] try_to_free_pages at ffffffff811987bc
            #26 [ffff881e9d09b7d0] __alloc_pages_slowpath at ffffffff8169fbcb
            #27 [ffff881e9d09b8c0] __alloc_pages_nodemask at ffffffff8118cdb5
            #28 [ffff881e9d09b970] alloc_pages_current at ffffffff811d1078
            #29 [ffff881e9d09b9b8] __get_free_pages at ffffffff8118761e
            #30 [ffff881e9d09b9c8] kmalloc_order_trace at ffffffff811dca2e
            #31 [ffff881e9d09ba10] __kmalloc at ffffffff811e05c1
            #32 [ffff881e9d09ba50] osc_build_rpc at ffffffffc0a3588e [osc]
            #33 [ffff881e9d09bb10] osc_io_unplug0 at ffffffffc0a4e380 [osc]
            #34 [ffff881e9d09bc50] osc_io_unplug at ffffffffc0a50780 [osc]
            #35 [ffff881e9d09bc60] brw_queue_work at ffffffffc0a2b881 [osc]
            #36 [ffff881e9d09bc80] work_interpreter at ffffffffc08531d7 [ptlrpc]
            #37 [ffff881e9d09bca8] ptlrpc_check_set at ffffffffc084ff58 [ptlrpc]
            #38 [ffff881e9d09bd48] ptlrpc_check_set at ffffffffc08518fb [ptlrpc]
            #39 [ffff881e9d09bd68] ptlrpcd_check at ffffffffc087e74b [ptlrpc]
            #40 [ffff881e9d09bdb8] ptlrpcd at ffffffffc087eb69 [ptlrpc]
            #41 [ffff881e9d09bec8] kthread at ffffffff810b099f
            #42 [ffff881e9d09bf50] ret_from_fork at ffffffff816b5018
            
            
             #4 [ffff881d986f3a78] native_queued_spin_lock_slowpath at ffffffff810fa336
             #5 [ffff881d986f3a80] queued_spin_lock_slowpath at ffffffff8169e6bf
             #6 [ffff881d986f3a90] _raw_spin_lock at ffffffff816abc50
             #7 [ffff881d986f3aa0] osc_cache_shrink_count at ffffffffc0a3e9e5 [osc]
             #8 [ffff881d986f3ab0] osc_cache_shrink at ffffffffc0a2b152 [osc]
             #9 [ffff881d986f3ae0] shrink_slab at ffffffff81195389
            #10 [ffff881d986f3b80] do_try_to_free_pages at ffffffff811985a2
            #11 [ffff881d986f3bf8] try_to_free_pages at ffffffff811987bc
            #12 [ffff881d986f3c90] __alloc_pages_slowpath at ffffffff8169fbcb
            #13 [ffff881d986f3d80] __alloc_pages_nodemask at ffffffff8118cdb5
            #14 [ffff881d986f3e30] copy_process at ffffffff8108511d
            #15 [ffff881d986f3ec0] do_fork at ffffffff81086a61
            #16 [ffff881d986f3f38] sys_clone at ffffffff81086d76
            #17 [ffff881d986f3f48] stub_clone at ffffffff816b5419
            
            
            
             #4 [ffff8801736af938] osq_lock at ffffffff810fa6e5
             #5 [ffff8801736af948] __mutex_lock_slowpath at ffffffff816a837a
             #6 [ffff8801736af9a8] mutex_lock at ffffffff816a77ef
             #7 [ffff8801736af9c0] ldlm_pools_shrink at ffffffffc0842cb4 [ptlrpc]
             #8 [ffff8801736afa08] ldlm_pools_cli_shrink at ffffffffc084308b [ptlrpc]
             #9 [ffff8801736afa18] shrink_slab at ffffffff81195389
            #10 [ffff8801736afab8] do_try_to_free_pages at ffffffff811985a2
            #11 [ffff8801736afb30] try_to_free_pages at ffffffff811987bc
            #12 [ffff8801736afbc8] __alloc_pages_slowpath at ffffffff8169fbcb
            #13 [ffff8801736afcb8] __alloc_pages_nodemask at ffffffff8118cdb5
            #14 [ffff8801736afd68] copy_process at ffffffff8108511d
            #15 [ffff8801736afdf8] do_fork at ffffffff81086a61
            #16 [ffff8801736afe70] kernel_thread at ffffffff81086d16
            #17 [ffff8801736afe80] kthreadd at ffffffff810b1351
            #18 [ffff8801736aff50] ret_from_fork at ffffffff816b5018
            
            
             #4 [ffff881eea3935c0] prune_super at ffffffff81203768
             #5 [ffff881eea3935f0] shrink_slab at ffffffff81195389
             #6 [ffff881eea393690] do_try_to_free_pages at ffffffff811985a2
             #7 [ffff881eea393708] try_to_free_pages at ffffffff811987bc
             #8 [ffff881eea3937a0] __alloc_pages_slowpath at ffffffff8169fbcb
             #9 [ffff881eea393890] __alloc_pages_nodemask at ffffffff8118cdb5
            #10 [ffff881eea393940] alloc_pages_current at ffffffff811d1078
            #11 [ffff881eea393988] new_slab at ffffffff811dbdfc
            #12 [ffff881eea3939c0] ___slab_alloc at ffffffff811dd68c
            #13 [ffff881eea393a98] __slab_alloc at ffffffff816a118e
            #14 [ffff881eea393ad8] kmem_cache_alloc at ffffffff811df623
            #15 [ffff881eea393b18] ptlrpc_request_cache_alloc at ffffffffc084a1b7 [ptlrpc]
            #16 [ffff881eea393b30] ptlrpc_request_alloc_internal at ffffffffc084a2c5 [ptlrpc]
            #17 [ffff881eea393b68] ptlrpc_request_alloc at ffffffffc084a6f3 [ptlrpc]
            #18 [ffff881eea393b78] ptlrpc_connect_import at ffffffffc087a29e [ptlrpc]
            #19 [ffff881eea393c30] ptlrpc_request_handle_notconn at ffffffffc0854128 [ptlrpc]
            #20 [ffff881eea393c50] after_reply at ffffffffc084f662 [ptlrpc]
            #21 [ffff881eea393ca8] ptlrpc_check_set at ffffffffc0850cd4 [ptlrpc]
            #22 [ffff881eea393d48] ptlrpc_check_set at ffffffffc08518fb [ptlrpc]
            #23 [ffff881eea393d68] ptlrpcd_check at ffffffffc087e74b [ptlrpc]
            #24 [ffff881eea393db8] ptlrpcd at ffffffffc087eb0b [ptlrpc]
            #25 [ffff881eea393ec8] kthread at ffffffff810b099f
            #26 [ffff881eea393f50] ret_from_fork at ffffffff816b5018
            
            
             #4 [ffff880b26186fa8] native_queued_spin_lock_slowpath at ffffffff810fa336
             #5 [ffff880b26186fb0] queued_spin_lock_slowpath at ffffffff8169e6bf
             #6 [ffff880b26186fc0] _raw_spin_lock at ffffffff816abc50
             #7 [ffff880b26186fd0] osc_cache_shrink_count at ffffffffc0a3e9e5 [osc]
             #8 [ffff880b26186fe0] osc_cache_shrink at ffffffffc0a2b152 [osc]
             #9 [ffff880b26187010] shrink_slab at ffffffff81195389
            #10 [ffff880b261870b0] do_try_to_free_pages at ffffffff811985a2
            #11 [ffff880b26187128] try_to_free_pages at ffffffff811987bc
            #12 [ffff880b261871c0] __alloc_pages_slowpath at ffffffff8169fbcb
            #13 [ffff880b261872b0] __alloc_pages_nodemask at ffffffff8118cdb5
            #14 [ffff880b26187360] alloc_pages_current at ffffffff811d1078
            #15 [ffff880b261873a8] new_slab at ffffffff811dbdfc
            #16 [ffff880b261873e0] ___slab_alloc at ffffffff811dd68c
            #17 [ffff880b261874b8] __slab_alloc at ffffffff816a118e
            #18 [ffff880b261874f8] kmem_cache_alloc at ffffffff811df623
            #19 [ffff880b26187538] ptlrpc_request_cache_alloc at ffffffffc084a1b7 [ptlrpc]
            #20 [ffff880b26187550] ptlrpc_request_alloc_internal at ffffffffc084a2c5 [ptlrpc]
            #21 [ffff880b26187588] ptlrpc_request_alloc_pool at ffffffffc084a70e [ptlrpc]
            #22 [ffff880b26187598] osc_brw_prep_request at ffffffffc0a32393 [osc]
            #23 [ffff880b26187670] osc_build_rpc at ffffffffc0a35c82 [osc]
            #24 [ffff880b26187730] osc_io_unplug0 at ffffffffc0a4e657 [osc]
            #25 [ffff880b26187870] osc_cache_writeback_range at ffffffffc0a57500 [osc]
            #26 [ffff880b261879c8] osc_io_fsync_start at ffffffffc0a44438 [osc]
            #27 [ffff880b26187a10] cl_io_start at ffffffffc0615eb5 [obdclass]
            #28 [ffff880b26187a38] lov_io_call at ffffffffc0aa7565 [lov]
            #29 [ffff880b26187a68] lov_io_start at ffffffffc0aa7726 [lov]
            #30 [ffff880b26187a88] cl_io_start at ffffffffc0615eb5 [obdclass]
            #31 [ffff880b26187ab0] cl_io_loop at ffffffffc061827e [obdclass]
            #32 [ffff880b26187b28] cl_sync_file_range at ffffffffc0b591db [lustre]
            #33 [ffff880b26187b80] ll_writepages at ffffffffc0b7a4b7 [lustre]
            #34 [ffff880b26187bb8] do_writepages at ffffffff8118f02e
            #35 [ffff880b26187bc8] __writeback_single_inode at ffffffff8122d8e0
            #36 [ffff880b26187c08] writeback_sb_inodes at ffffffff8122e524
            #37 [ffff880b26187cb0] __writeback_inodes_wb at ffffffff8122e88f
            #38 [ffff880b26187cf8] wb_writeback at ffffffff8122f0c3
            #39 [ffff880b26187d70] bdi_writeback_workfn at ffffffff8122f46c
            #40 [ffff880b26187e20] process_one_work at ffffffff810a882a
            #41 [ffff880b26187e68] worker_thread at ffffffff810a94f6
            #42 [ffff880b26187ec8] kthread at ffffffff810b099f
            #43 [ffff880b26187f50] ret_from_fork at ffffffff816b5018
             #4 [ffff881fb3c3b9c8] native_queued_spin_lock_slowpath at ffffffff810fa336
             #5 [ffff881fb3c3b9d0] queued_spin_lock_slowpath at ffffffff8169e6bf
             #6 [ffff881fb3c3b9e0] _raw_spin_lock at ffffffff816abc50
             #7 [ffff881fb3c3b9f0] osc_cache_shrink_count at ffffffffc0a3e9e5 [osc]
             #8 [ffff881fb3c3ba00] osc_cache_shrink at ffffffffc0a2b152 [osc]
             #9 [ffff881fb3c3ba30] shrink_slab at ffffffff81195389
            #10 [ffff881fb3c3bad0] do_try_to_free_pages at ffffffff811985a2
            #11 [ffff881fb3c3bb48] try_to_free_pages at ffffffff811987bc
            #12 [ffff881fb3c3bbe0] __alloc_pages_slowpath at ffffffff8169fbcb
            #13 [ffff881fb3c3bcd0] __alloc_pages_nodemask at ffffffff8118cdb5
            #14 [ffff881fb3c3bd80] alloc_pages_current at ffffffff811d1078
            #15 [ffff881fb3c3bdc8] __get_free_pages at ffffffff8118761e
            #16 [ffff881fb3c3bdd8] kmalloc_order_trace at ffffffff811dca2e
            #17 [ffff881fb3c3be20] __kmalloc at ffffffff811e05c1
            #18 [ffff881fb3c3be60] seq_buf_alloc at ffffffff81225f07
            #19 [ffff881fb3c3be78] seq_read at ffffffff8122645e
            #20 [ffff881fb3c3bee8] proc_reg_read at ffffffff812702cd
            #21 [ffff881fb3c3bf08] vfs_read at ffffffff81200b1c
            #22 [ffff881fb3c3bf38] sys_read at ffffffff812019df
            #23 [ffff881fb3c3bf80] tracesys at ffffffff816b52ce (via system_call)
            
            
            
            
            #4 [ffff881fba2ef838] osq_lock at ffffffff810fa6e5
             #5 [ffff881fba2ef848] __mutex_lock_slowpath at ffffffff816a837a
             #6 [ffff881fba2ef8a8] mutex_lock at ffffffff816a77ef
             #7 [ffff881fba2ef8c0] ldlm_pools_shrink at ffffffffc0842cb4 [ptlrpc]
             #8 [ffff881fba2ef908] ldlm_pools_cli_shrink at ffffffffc084308b [ptlrpc]
             #9 [ffff881fba2ef918] shrink_slab at ffffffff81195389
            #10 [ffff881fba2ef9b8] do_try_to_free_pages at ffffffff811985a2
            #11 [ffff881fba2efa30] try_to_free_pages at ffffffff811987bc
            #12 [ffff881fba2efac8] __alloc_pages_slowpath at ffffffff8169fbcb
            #13 [ffff881fba2efbb8] __alloc_pages_nodemask at ffffffff8118cdb5
            #14 [ffff881fba2efc68] alloc_pages_current at ffffffff811d1078
            #15 [ffff881fba2efcb0] __page_cache_alloc at ffffffff81182927
            #16 [ffff881fba2efce8] filemap_fault at ffffffff81184ec0
            #17 [ffff881fba2efd48] ext4_filemap_fault at ffffffffc0331156 [ext4]
            #18 [ffff881fba2efd70] __do_fault at ffffffff811ad0d2
            #19 [ffff881fba2efdd0] do_read_fault at ffffffff811ad57b
            #20 [ffff881fba2efe28] handle_mm_fault at ffffffff811b1e81
            #21 [ffff881fba2efec0] __do_page_fault at ffffffff816b00b4
            #22 [ffff881fba2eff20] do_page_fault at ffffffff816b03e5
            #23 [ffff881fba2eff50] page_fault at ffffffff816ac608
            

            I have the crash dump, so I can upload it if you are interested.
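A pattern visible in several of the traces above is that threads in direct reclaim queue up on Lustre's own shrinker locks (the osc_cache_shrink_count spinlock, the ldlm_pools_shrink mutex), serializing reclaim instead of making progress. One common mitigation for this class of problem (not necessarily what Lustre adopted) is to make the shrinker's count callback non-blocking: if the lock is contended, report zero freeable objects and let reclaim move on. A userspace sketch using pthreads, with hypothetical names:

```c
#include <pthread.h>

/* Hypothetical shrinker "count" callback. If the cache lock is
 * contended (another thread is already scanning or reclaiming),
 * refuse to block: report 0 freeable objects so direct reclaim
 * skips this shrinker instead of queueing on the lock. */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long cached_objects = 1024;

unsigned long cache_shrink_count(void)
{
	unsigned long count;

	if (pthread_mutex_trylock(&cache_lock) != 0)
		return 0;		/* contended: claim nothing freeable */
	count = cached_objects;
	pthread_mutex_unlock(&cache_lock);
	return count;
}
```

The trade-off is that a contended shrinker temporarily under-reports freeable memory, but that is usually preferable to every reclaiming thread spinning on one lock, which is what the stacks above show.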


            People

              Assignee: adilger Andreas Dilger
              Reporter: Tomaka Jacek Tomaka (Inactive)
              Votes: 1
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: