[LU-5349] Deadlock in mdc_close() Created: 15/Jul/14 Updated: 29/Sep/14 Resolved: 29/Sep/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Bruno Travouillon (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL6 w/ patched kernel |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14915 |
| Description |
|
We had to crash/dump one of our Lustre clients because of a deadlock in mdc_close(). PID 5231 was waiting for a lock that it already owned; in fact, many processes were waiting on this lock. In the backtrace of the process we can see two calls to mdc_close(); the second is due to the system reclaiming memory.

crash> bt 5231
PID: 5231  TASK: ffff881518308b00  CPU: 2  COMMAND: "code2"
 #0 [ffff88171cb43188] schedule at ffffffff81528a52
 #1 [ffff88171cb43250] __mutex_lock_slowpath at ffffffff8152a20e
 #2 [ffff88171cb432c0] mutex_lock at ffffffff8152a0ab            <=== Requires a new lock
 #3 [ffff88171cb432e0] mdc_close at ffffffffa09176db [mdc]
 #4 [ffff88171cb43330] lmv_close at ffffffffa0b9bcb8 [lmv]
 #5 [ffff88171cb43380] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
 #6 [ffff88171cb43400] ll_md_real_close at ffffffffa0a81afa [lustre]
 #7 [ffff88171cb43430] ll_clear_inode at ffffffffa0a92dee [lustre]
 #8 [ffff88171cb43470] clear_inode at ffffffff811a626c
 #9 [ffff88171cb43490] dispose_list at ffffffff811a6340
#10 [ffff88171cb434d0] shrink_icache_memory at ffffffff811a6694
#11 [ffff88171cb43530] shrink_slab at ffffffff81138b7a
#12 [ffff88171cb43590] zone_reclaim at ffffffff8113b77e
#13 [ffff88171cb436b0] get_page_from_freelist at ffffffff8112d8dc
#14 [ffff88171cb437e0] __alloc_pages_nodemask at ffffffff8112f443
#15 [ffff88171cb43920] alloc_pages_current at ffffffff811680ca
#16 [ffff88171cb43950] __vmalloc_area_node at ffffffff81159696
#17 [ffff88171cb439b0] __vmalloc_node at ffffffff8115953d
#18 [ffff88171cb43a10] vmalloc at ffffffff8115985c
#19 [ffff88171cb43a20] cfs_alloc_large at ffffffffa03b4b1e [libcfs]
#20 [ffff88171cb43a30] null_alloc_repbuf at ffffffffa06c4961 [ptlrpc]
#21 [ffff88171cb43a60] sptlrpc_cli_alloc_repbuf at ffffffffa06b2355 [ptlrpc]
#22 [ffff88171cb43a90] ptl_send_rpc at ffffffffa068432c [ptlrpc]
#23 [ffff88171cb43b50] ptlrpc_send_new_req at ffffffffa067879b [ptlrpc]
#24 [ffff88171cb43bc0] ptlrpc_set_wait at ffffffffa067ddb6 [ptlrpc]
#25 [ffff88171cb43c60] ptlrpc_queue_wait at ffffffffa067e0df [ptlrpc]  <=== PID has the lock
#26 [ffff88171cb43c80] mdc_close at ffffffffa0917714 [mdc]
#27 [ffff88171cb43cd0] lmv_close at ffffffffa0b9bcb8 [lmv]
#28 [ffff88171cb43d20] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
#29 [ffff88171cb43da0] ll_md_real_close at ffffffffa0a81afa [lustre]
#30 [ffff88171cb43dd0] ll_md_close at ffffffffa0a81d8a [lustre]
#31 [ffff88171cb43e80] ll_file_release at ffffffffa0a8233b [lustre]
#32 [ffff88171cb43ec0] __fput at ffffffff8118ad55
#33 [ffff88171cb43f10] fput at ffffffff8118ae95
#34 [ffff88171cb43f20] filp_close at ffffffff811861bd
#35 [ffff88171cb43f50] sys_close at ffffffff81186295
#36 [ffff88171cb43f80] system_call_fastpath at ffffffff8100b072
    RIP: 00002adaacdf26d0  RSP: 00007fff9665e238  RFLAGS: 00010246
    RAX: 0000000000000003  RBX: ffffffff8100b072  RCX: 0000000000002261
    RDX: 00000000044a24b0  RSI: 0000000000000001  RDI: 0000000000000005
    RBP: 0000000000000000  R8:  00002adaad0ac560  R9:  0000000000000001
    R10: 00000000000004fd  R11: 0000000000000246  R12: 00000000000004fc
    R13: 00000000ffffffff  R14: 00000000044a23d0  R15: 00000000ffffffff
    ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b

We have recursive locking here, which is not permitted. |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 15/Jul/14 ] |
|
Maybe we can avoid this self-deadlock by detecting, in the prolog of the ll_clear_inode() routine, that the current thread is already the owner of class_exp2obd(ll_i2sbi(inode)
| Comment by Jodi Levi (Inactive) [ 15/Jul/14 ] |
|
Bruno, |
| Comment by Bruno Faccini (Inactive) [ 22/Jul/14 ] |
|
Jodi: it looks like yes. I will try to implement my fix idea in http://review.whamcloud.com/11183. Bruno, any help finding a way to reproduce the same situation is welcome ... |
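The detection idea can be sketched in plain userspace C: remember which thread holds the close mutex so that a re-entrant caller (e.g. the memory-reclaim path) bails out instead of blocking on itself. All names here are hypothetical, this is not the actual patch, and a real kernel version would need proper memory ordering for the owner fields.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t close_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t close_owner;          /* thread currently inside close */
static bool close_owner_set;

/* Returns true if the lock was taken; false on re-entry by the owner,
 * letting the caller skip the close work instead of deadlocking. */
bool close_lock_enter(void)
{
        if (close_owner_set && pthread_equal(close_owner, pthread_self()))
                return false;
        pthread_mutex_lock(&close_lock);
        close_owner = pthread_self();
        close_owner_set = true;
        return true;
}

void close_lock_exit(void)
{
        close_owner_set = false;
        pthread_mutex_unlock(&close_lock);
}
```

A re-entrant close would then be deferred or skipped rather than blocking, which is roughly what the prolog check described above would do.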
| Comment by Oleg Drokin [ 22/Jul/14 ] |
|
The most important part of the failure here is that I just realized vmalloc() does not take a GFP mask, so we cannot tell it not to dive into filesystems for memory reclaim. The real fix here would be to change OBD_VMALLOC to use __vmalloc() instead, with GFP_NOFS, like we do for regular allocations. (We also need to add __GFP_ZERO, of course, to pre-zero the memory for us.) |
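A sketch of the change Oleg describes, for illustration only (the real macro is Lustre's OBD_VMALLOC and its details may differ; the three-argument __vmalloc() signature shown is the 2.6-era kernel one):

```c
/* Before (sketch): plain vmalloc() always allocates with GFP_KERNEL,
 * so direct reclaim may recurse into filesystem code, as in the
 * backtrace above. */
ptr = vmalloc(size);
if (ptr != NULL)
        memset(ptr, 0, size);

/* After (sketch): __vmalloc() takes an explicit GFP mask.  GFP_NOFS
 * keeps reclaim out of filesystem code, and __GFP_ZERO pre-zeroes
 * the memory, replacing the memset(). */
ptr = __vmalloc(size, GFP_NOFS | __GFP_ZERO, PAGE_KERNEL);
```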
| Comment by Andreas Dilger [ 22/Jul/14 ] |
|
How long has __vmalloc() existed? I've never seen that before, and we've had similar problems to this in the past that could have been fixed in a similar manner. |
| Comment by Oleg Drokin [ 22/Jul/14 ] |
|
I just did some git research and it has been there since forever, basically. |
| Comment by Bruno Faccini (Inactive) [ 22/Jul/14 ] |
|
Oleg: thanks for this hint! I agree it looks more elegant than my own idea, and since it is much more restrictive it should handle other deadlock possibilities beyond just this in-flight RPC serialization mutex ... Concerning the details of the fix, though: I don't see any current way to specify a combination of allocation flags within our vmalloc()/vzalloc()-based set of macros. Should we implement that now? |
| Comment by Oleg Drokin [ 23/Jul/14 ] |
|
Basically we need to do the same thing we do for the OBD_ALLOC code - just use GFP_NOFS by default (since we are a filesystem, that's a safe bet). |
| Comment by Bruno Faccini (Inactive) [ 23/Jul/14 ] |
|
I just pushed http://review.whamcloud.com/11190 so that vmalloc[_node]()-based allocations no longer use __GFP_FS by default. I also found that this will allow a real NUMA node parameter to be passed! |
| Comment by Bruno Faccini (Inactive) [ 23/Jul/14 ] |
|
Hmm, too bad: __vmalloc_node() is not exported by the kernel ... So I am stuck if I also want to fix cfs_cpt_vzalloc(), where vzalloc_node() uses __GFP_FS by default, while still forwarding a NUMA node specification, since I think only __vmalloc_node() would permit both ... I just pushed patch-set #2 of http://review.whamcloud.com/11190 accounting for this case. |
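The trade-off can be sketched as follows (illustrative only; 2.6-era signatures, and the actual patch may have chosen differently):

```c
/*
 * __vmalloc_node() would allow both a custom GFP mask and a NUMA node,
 * but it is not exported to modules.  The exported alternatives each
 * give up one of the two:
 */
void *a = vzalloc_node(size, node);   /* keeps NUMA placement, but
                                         allocates with GFP_KERNEL,
                                         i.e. __GFP_FS is set       */

void *b = __vmalloc(size, GFP_NOFS | __GFP_ZERO,
                    PAGE_KERNEL);     /* reclaim-safe GFP mask, but
                                         no explicit node parameter */
```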
| Comment by Andreas Dilger [ 23/Jul/14 ] |
|
The other question here is how many OSTs are in this filesystem, and whether you are using wide striping. I'm trying to figure out why this allocation used vmalloc() instead of kmalloc(), and whether there is a separate bug to be addressed to reduce the allocation size. |
| Comment by Bruno Travouillon (Inactive) [ 23/Jul/14 ] |
|
There are several Lustre filesystems mounted on this client:
This Lustre client is a login node, with many users working interactively. You can find the outputs of sar -B and sar -R in the attached file. Hope this helps. |
| Comment by Bruno Faccini (Inactive) [ 07/Aug/14 ] |
|
Patch http://review.whamcloud.com/#/c/11183/ has been abandoned. |
| Comment by Bruno Faccini (Inactive) [ 03/Sep/14 ] |
|
Master patch http://review.whamcloud.com/11190 has landed. |
| Comment by Sebastien Buisson (Inactive) [ 03/Sep/14 ] |
|
Bruno, why does the b2_5 version lack the __GFP_ZERO flag in the call to __vmalloc() (the __OBD_VMALLOC_VERBOSE macro)? |
| Comment by Bruno Faccini (Inactive) [ 03/Sep/14 ] |
|
Nice catch, I forgot to mention why! It is because b2_5 uses vmalloc() where master used vzalloc(), and I wanted to challenge my future reviewers on this ... |
| Comment by Oleg Drokin [ 24/Sep/14 ] |
|
Also, __GFP_ZERO is not really needed in the b2_5 patch because we explicitly zero the allocation with memset() afterwards anyway. |
| Comment by Peter Jones [ 29/Sep/14 ] |
|
Landed for 2.5.4 and 2.7 |