Details
Type: Bug
Resolution: Fixed
Priority: Minor
Affects Version/s: Lustre 2.4.3
Fix Version/s: None
Environment: RHEL6 w/ patched kernel
Severity: 3
Rank: 14915
Description
We had to crash/dump one of our Lustre clients because of a deadlock in mdc_close(): PID 5231 was waiting for a mutex it already owned. Incidentally, a lot of other processes were waiting for this lock as well.
In the backtrace of the process we can see two calls to mdc_close(); the second one was triggered by the system reclaiming memory.
crash> bt 5231
PID: 5231  TASK: ffff881518308b00  CPU: 2  COMMAND: "code2"
 #0 [ffff88171cb43188] schedule at ffffffff81528a52
 #1 [ffff88171cb43250] __mutex_lock_slowpath at ffffffff8152a20e
 #2 [ffff88171cb432c0] mutex_lock at ffffffff8152a0ab            <=== Requires a new lock
 #3 [ffff88171cb432e0] mdc_close at ffffffffa09176db [mdc]
 #4 [ffff88171cb43330] lmv_close at ffffffffa0b9bcb8 [lmv]
 #5 [ffff88171cb43380] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
 #6 [ffff88171cb43400] ll_md_real_close at ffffffffa0a81afa [lustre]
 #7 [ffff88171cb43430] ll_clear_inode at ffffffffa0a92dee [lustre]
 #8 [ffff88171cb43470] clear_inode at ffffffff811a626c
 #9 [ffff88171cb43490] dispose_list at ffffffff811a6340
#10 [ffff88171cb434d0] shrink_icache_memory at ffffffff811a6694
#11 [ffff88171cb43530] shrink_slab at ffffffff81138b7a
#12 [ffff88171cb43590] zone_reclaim at ffffffff8113b77e
#13 [ffff88171cb436b0] get_page_from_freelist at ffffffff8112d8dc
#14 [ffff88171cb437e0] __alloc_pages_nodemask at ffffffff8112f443
#15 [ffff88171cb43920] alloc_pages_current at ffffffff811680ca
#16 [ffff88171cb43950] __vmalloc_area_node at ffffffff81159696
#17 [ffff88171cb439b0] __vmalloc_node at ffffffff8115953d
#18 [ffff88171cb43a10] vmalloc at ffffffff8115985c
#19 [ffff88171cb43a20] cfs_alloc_large at ffffffffa03b4b1e [libcfs]
#20 [ffff88171cb43a30] null_alloc_repbuf at ffffffffa06c4961 [ptlrpc]
#21 [ffff88171cb43a60] sptlrpc_cli_alloc_repbuf at ffffffffa06b2355 [ptlrpc]
#22 [ffff88171cb43a90] ptl_send_rpc at ffffffffa068432c [ptlrpc]
#23 [ffff88171cb43b50] ptlrpc_send_new_req at ffffffffa067879b [ptlrpc]
#24 [ffff88171cb43bc0] ptlrpc_set_wait at ffffffffa067ddb6 [ptlrpc]
#25 [ffff88171cb43c60] ptlrpc_queue_wait at ffffffffa067e0df [ptlrpc]   <=== PID has the lock
#26 [ffff88171cb43c80] mdc_close at ffffffffa0917714 [mdc]
#27 [ffff88171cb43cd0] lmv_close at ffffffffa0b9bcb8 [lmv]
#28 [ffff88171cb43d20] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
#29 [ffff88171cb43da0] ll_md_real_close at ffffffffa0a81afa [lustre]
#30 [ffff88171cb43dd0] ll_md_close at ffffffffa0a81d8a [lustre]
#31 [ffff88171cb43e80] ll_file_release at ffffffffa0a8233b [lustre]
#32 [ffff88171cb43ec0] __fput at ffffffff8118ad55
#33 [ffff88171cb43f10] fput at ffffffff8118ae95
#34 [ffff88171cb43f20] filp_close at ffffffff811861bd
#35 [ffff88171cb43f50] sys_close at ffffffff81186295
#36 [ffff88171cb43f80] system_call_fastpath at ffffffff8100b072
    RIP: 00002adaacdf26d0  RSP: 00007fff9665e238  RFLAGS: 00010246
    RAX: 0000000000000003  RBX: ffffffff8100b072  RCX: 0000000000002261
    RDX: 00000000044a24b0  RSI: 0000000000000001  RDI: 0000000000000005
    RBP: 0000000000000000   R8: 00002adaad0ac560   R9: 0000000000000001
    R10: 00000000000004fd  R11: 0000000000000246  R12: 00000000000004fc
    R13: 00000000ffffffff  R14: 00000000044a23d0  R15: 00000000ffffffff
    ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b
We have recursive locking here, which is not permitted: the kernel mutex taken in mdc_close() is not recursive, so the thread ends up blocked on a lock it will never release.
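
To make the failure mode concrete, here is a minimal userspace sketch of the same pattern. It is illustrative only, not Lustre code, and all names (close_handle, reclaim_memory, rpc_lock) are hypothetical. A thread takes a non-recursive mutex, and a callback invoked while the lock is still held (standing in for the reclaim path shrink_slab() -> shrink_icache_memory() -> ll_clear_inode() -> mdc_close()) tries to take the same mutex again. A default pthread mutex on Linux behaves like the kernel mutex here: the second lock attempt blocks forever.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t rpc_lock = PTHREAD_MUTEX_INITIALIZER;

static void reclaim_memory(void);

/* Stands in for the outer mdc_close() (frames #25-#26): take the RPC
 * lock, then do work that can trigger memory reclaim while holding it. */
static void close_handle(int from_reclaim)
{
        printf("close_handle(%d): acquiring rpc_lock\n", from_reclaim);
        pthread_mutex_lock(&rpc_lock);  /* re-entrant call blocks here,
                                         * like frame #2 (mutex_lock)  */
        if (!from_reclaim)
                reclaim_memory();       /* vmalloc() under memory pressure */
        pthread_mutex_unlock(&rpc_lock);
}

/* Stands in for shrink_icache_memory() -> ll_clear_inode(): reclaim
 * re-enters the close path on the very same thread. */
static void reclaim_memory(void)
{
        close_handle(1);
}

int main(void)
{
        close_handle(0);                /* deadlocks: never returns */
        return 0;
}

A common way to break this kind of cycle in kernel code is to keep allocations made under the lock from recursing into the filesystem, e.g. by allocating with GFP_NOFS so that direct reclaim skips the inode-cache shrinker; whether that is how this particular ticket was fixed is not shown here.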