We had to crash and dump one of our Lustre clients because of a deadlock in mdc_close(). PID 5231 was waiting for a lock that it already owned. Many other processes were also waiting for this lock.
The backtrace of the process shows two calls to mdc_close(). The second is triggered by memory reclaim: the vmalloc() for the reply buffer, made while the first call holds the lock, enters zone_reclaim, which evicts a Lustre inode and calls mdc_close() again.
crash> bt 5231
PID: 5231 TASK: ffff881518308b00 CPU: 2 COMMAND: "code2"
#0 [ffff88171cb43188] schedule at ffffffff81528a52
#1 [ffff88171cb43250] __mutex_lock_slowpath at ffffffff8152a20e
#2 [ffff88171cb432c0] mutex_lock at ffffffff8152a0ab <=== Tries to take the lock again
#3 [ffff88171cb432e0] mdc_close at ffffffffa09176db [mdc]
#4 [ffff88171cb43330] lmv_close at ffffffffa0b9bcb8 [lmv]
#5 [ffff88171cb43380] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
#6 [ffff88171cb43400] ll_md_real_close at ffffffffa0a81afa [lustre]
#7 [ffff88171cb43430] ll_clear_inode at ffffffffa0a92dee [lustre]
#8 [ffff88171cb43470] clear_inode at ffffffff811a626c
#9 [ffff88171cb43490] dispose_list at ffffffff811a6340
#10 [ffff88171cb434d0] shrink_icache_memory at ffffffff811a6694
#11 [ffff88171cb43530] shrink_slab at ffffffff81138b7a
#12 [ffff88171cb43590] zone_reclaim at ffffffff8113b77e
#13 [ffff88171cb436b0] get_page_from_freelist at ffffffff8112d8dc
#14 [ffff88171cb437e0] __alloc_pages_nodemask at ffffffff8112f443
#15 [ffff88171cb43920] alloc_pages_current at ffffffff811680ca
#16 [ffff88171cb43950] __vmalloc_area_node at ffffffff81159696
#17 [ffff88171cb439b0] __vmalloc_node at ffffffff8115953d
#18 [ffff88171cb43a10] vmalloc at ffffffff8115985c
#19 [ffff88171cb43a20] cfs_alloc_large at ffffffffa03b4b1e [libcfs]
#20 [ffff88171cb43a30] null_alloc_repbuf at ffffffffa06c4961 [ptlrpc]
#21 [ffff88171cb43a60] sptlrpc_cli_alloc_repbuf at ffffffffa06b2355 [ptlrpc]
#22 [ffff88171cb43a90] ptl_send_rpc at ffffffffa068432c [ptlrpc]
#23 [ffff88171cb43b50] ptlrpc_send_new_req at ffffffffa067879b [ptlrpc]
#24 [ffff88171cb43bc0] ptlrpc_set_wait at ffffffffa067ddb6 [ptlrpc]
#25 [ffff88171cb43c60] ptlrpc_queue_wait at ffffffffa067e0df [ptlrpc] <=== PID 5231 already holds the lock
#26 [ffff88171cb43c80] mdc_close at ffffffffa0917714 [mdc]
#27 [ffff88171cb43cd0] lmv_close at ffffffffa0b9bcb8 [lmv]
#28 [ffff88171cb43d20] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
#29 [ffff88171cb43da0] ll_md_real_close at ffffffffa0a81afa [lustre]
#30 [ffff88171cb43dd0] ll_md_close at ffffffffa0a81d8a [lustre]
#31 [ffff88171cb43e80] ll_file_release at ffffffffa0a8233b [lustre]
#32 [ffff88171cb43ec0] __fput at ffffffff8118ad55
#33 [ffff88171cb43f10] fput at ffffffff8118ae95
#34 [ffff88171cb43f20] filp_close at ffffffff811861bd
#35 [ffff88171cb43f50] sys_close at ffffffff81186295
#36 [ffff88171cb43f80] system_call_fastpath at ffffffff8100b072
RIP: 00002adaacdf26d0 RSP: 00007fff9665e238 RFLAGS: 00010246
RAX: 0000000000000003 RBX: ffffffff8100b072 RCX: 0000000000002261
RDX: 00000000044a24b0 RSI: 0000000000000001 RDI: 0000000000000005
RBP: 0000000000000000 R8: 00002adaad0ac560 R9: 0000000000000001
R10: 00000000000004fd R11: 0000000000000246 R12: 00000000000004fc
R13: 00000000ffffffff R14: 00000000044a23d0 R15: 00000000ffffffff
ORIG_RAX: 0000000000000003 CS: 0033 SS: 002b
This is recursive locking: the same task tries to take a mutex that it already holds, which is not permitted.
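The re-entrance pattern can be illustrated with a small user-space sketch. This is not Lustre code: close_lock, allocate_reply_buffer and close_handle are hypothetical stand-ins for the mutex, the vmalloc() call and mdc_close(), and the second acquisition is made non-blocking only so that the sketch reports the problem instead of hanging as the real client did.

```python
import threading

# Hypothetical stand-in for the mutex taken in mdc_close(); non-recursive,
# like a kernel mutex.
close_lock = threading.Lock()


def allocate_reply_buffer(reclaim):
    """Stand-in for the allocation: under memory pressure it runs reclaim."""
    return reclaim()


def close_handle(depth=0):
    """Stand-in for mdc_close(): take the lock, then allocate while holding it."""
    # Non-blocking attempt so the sketch returns instead of deadlocking.
    if not close_lock.acquire(blocking=False):
        return f"depth {depth}: would block forever, lock already held by this thread"
    try:
        if depth == 0:
            # Reclaim evicts an inode, which re-enters the close path
            # while the lock is still held by the outer call.
            return allocate_reply_buffer(lambda: close_handle(depth + 1))
        return f"depth {depth}: closed"
    finally:
        close_lock.release()


result = close_handle()
print(result)
```

In this sketch the nested call at depth 1 is the one that fails to get the lock, which corresponds to frames #2 and #3 in the backtrace, while the outer call corresponds to frame #26.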