[LU-2115] ldlm_bl_xx thread hangs under high memory pressure Created: 09/Oct/12  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Hiroya Nozaki Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

It happened with FEFS, which is based on Lustre 1.8.5, in the RIKEN K computer environment.
MDSx1, OSSx2592, OSTx5184, Clientx84672


Severity: 3
Rank (Obsolete): 5110

 Description   

Ever since checking the patch for Bugzilla 24320 (https://bugzilla.lustre.org/show_bug.cgi?id=24320), I have doubted whether that patch is enough.

The reason is this. Suppose all ldlm_bl_xx threads try to create new threads and fail due to lack of memory. All of them will then call try_to_free_pages from the kmalloc that failed to allocate slab memory, and try_to_free_pages will finally call __ldlm_bl_to_thread with LDLM_SYNC via shrink_slab(). This case can end in a deadlock: every ldlm_bl_xx thread is waiting for __ldlm_bl_to_thread to return, but there is no ldlm_bl_xx thread left to handle the blocking requests.
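
The condition described above can be illustrated with a small userspace model. This is only a sketch, not Lustre code: struct bl_pool_model and both functions are hypothetical names, and the model assumes that a thread whose allocation fails always ends up waiting on a synchronous request.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model, not Lustre code. */
struct bl_pool_model {
	int total;    /* number of ldlm_bl_xx threads in the pool */
	int waiting;  /* threads blocked on a synchronous submit */
};

/* A queued blocking request is served only if some thread is not waiting. */
static bool bl_pool_can_progress(const struct bl_pool_model *p)
{
	return p->waiting < p->total;
}

/*
 * One thread tries to spawn a helper, the allocation fails, and direct
 * reclaim makes it submit a synchronous request and wait for the result.
 */
static void bl_spawn_fails_and_waits(struct bl_pool_model *p)
{
	assert(p->waiting < p->total);
	p->waiting++;
}
```

With a pool of three threads, progress is still possible after two of them have failed and started waiting, but not after the third one has done the same.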

So I think we should set PF_MEMALLOC before, and clear it after, the call to cfs_kernel_thread/cfs_create_thread, to prevent ldlm_bl_xx threads from calling __ldlm_bl_to_thread().
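
The idea can be sketched in userspace as follows. This is a model under stated assumptions, not a patch: task_flags stands in for current->flags, the PF_MEMALLOC value here is a placeholder rather than the kernel's, and every model_* function and bl_create_thread_guarded() is a hypothetical name. It relies on the kernel behaviour that a task with PF_MEMALLOC set does not enter direct reclaim from its own allocations.

```c
#include <assert.h>
#include <stddef.h>

#define PF_MEMALLOC 0x1u        /* placeholder value for this model only */

static unsigned int task_flags; /* stands in for current->flags */
static int shrinker_calls;      /* counts entries into the shrinker path */

/* In the kernel this is where __ldlm_bl_to_thread(LDLM_SYNC) would be reached. */
static void model_shrink_slab(void)
{
	shrinker_calls++;
}

/* The allocation always fails here; reclaim is skipped under PF_MEMALLOC. */
static void *model_alloc_under_pressure(void)
{
	if (!(task_flags & PF_MEMALLOC))
		model_shrink_slab();
	return NULL;
}

/* Stands in for cfs_kernel_thread/cfs_create_thread. */
static int model_create_thread(void)
{
	return model_alloc_under_pressure() != NULL ? 0 : -1;
}

/* Set PF_MEMALLOC around thread creation, then restore the previous state. */
static int bl_create_thread_guarded(void)
{
	unsigned int saved = task_flags & PF_MEMALLOC;
	int rc;

	task_flags |= PF_MEMALLOC;
	rc = model_create_thread();
	if (!saved)
		task_flags &= ~PF_MEMALLOC;
	assert((task_flags & PF_MEMALLOC) == saved);
	return rc;
}
```

Saving and restoring the previous flag state, instead of clearing it unconditionally, keeps the guard safe when the caller already runs with PF_MEMALLOC set.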



 Comments   
Comment by Hiroya Nozaki [ 10/Oct/12 ]

I know you are all really busy, but could someone check whether this is true?
I know ldlm_bl_work_item has the variable blwi_mem_pressure, but I think that when every ldlm_bl_xx thread picks up a blwi and fails to create a new thread at the same time, that variable won't help.

Comment by Andreas Dilger [ 09/Jan/20 ]

Close old ticket.
