Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.4.3
    • Labels: None
    • Environment: RHEL6 w/ patched kernel
    • Severity: 3
    • 14915

    Description

      We had to crash/dump one of our Lustre clients because of a deadlock in mdc_close(). PID 5231 was waiting for a mutex that it already owned, and many other processes were also waiting for this lock.

      In the backtrace of the process we can see two calls to mdc_close(): the original one from the close() system call, and a second one entered while the system was reclaiming memory.

      crash> bt 5231
      PID: 5231   TASK: ffff881518308b00  CPU: 2   COMMAND: "code2"
       #0 [ffff88171cb43188] schedule at ffffffff81528a52
       #1 [ffff88171cb43250] __mutex_lock_slowpath at ffffffff8152a20e
       #2 [ffff88171cb432c0] mutex_lock at ffffffff8152a0ab                  <=== Requires a new lock
       #3 [ffff88171cb432e0] mdc_close at ffffffffa09176db [mdc]
       #4 [ffff88171cb43330] lmv_close at ffffffffa0b9bcb8 [lmv]
       #5 [ffff88171cb43380] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
       #6 [ffff88171cb43400] ll_md_real_close at ffffffffa0a81afa [lustre]
       #7 [ffff88171cb43430] ll_clear_inode at ffffffffa0a92dee [lustre]
       #8 [ffff88171cb43470] clear_inode at ffffffff811a626c
       #9 [ffff88171cb43490] dispose_list at ffffffff811a6340
      #10 [ffff88171cb434d0] shrink_icache_memory at ffffffff811a6694
      #11 [ffff88171cb43530] shrink_slab at ffffffff81138b7a
      #12 [ffff88171cb43590] zone_reclaim at ffffffff8113b77e
      #13 [ffff88171cb436b0] get_page_from_freelist at ffffffff8112d8dc
      #14 [ffff88171cb437e0] __alloc_pages_nodemask at ffffffff8112f443
      #15 [ffff88171cb43920] alloc_pages_current at ffffffff811680ca
      #16 [ffff88171cb43950] __vmalloc_area_node at ffffffff81159696
      #17 [ffff88171cb439b0] __vmalloc_node at ffffffff8115953d
      #18 [ffff88171cb43a10] vmalloc at ffffffff8115985c
      #19 [ffff88171cb43a20] cfs_alloc_large at ffffffffa03b4b1e [libcfs]
      #20 [ffff88171cb43a30] null_alloc_repbuf at ffffffffa06c4961 [ptlrpc]
      #21 [ffff88171cb43a60] sptlrpc_cli_alloc_repbuf at ffffffffa06b2355 [ptlrpc]
      #22 [ffff88171cb43a90] ptl_send_rpc at ffffffffa068432c [ptlrpc]
      #23 [ffff88171cb43b50] ptlrpc_send_new_req at ffffffffa067879b [ptlrpc]
      #24 [ffff88171cb43bc0] ptlrpc_set_wait at ffffffffa067ddb6 [ptlrpc]
      #25 [ffff88171cb43c60] ptlrpc_queue_wait at ffffffffa067e0df [ptlrpc]   <=== PID has the lock
      #26 [ffff88171cb43c80] mdc_close at ffffffffa0917714 [mdc]
      #27 [ffff88171cb43cd0] lmv_close at ffffffffa0b9bcb8 [lmv]
      #28 [ffff88171cb43d20] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
      #29 [ffff88171cb43da0] ll_md_real_close at ffffffffa0a81afa [lustre]
      #30 [ffff88171cb43dd0] ll_md_close at ffffffffa0a81d8a [lustre]
      #31 [ffff88171cb43e80] ll_file_release at ffffffffa0a8233b [lustre]
      #32 [ffff88171cb43ec0] __fput at ffffffff8118ad55
      #33 [ffff88171cb43f10] fput at ffffffff8118ae95
      #34 [ffff88171cb43f20] filp_close at ffffffff811861bd
      #35 [ffff88171cb43f50] sys_close at ffffffff81186295
      #36 [ffff88171cb43f80] system_call_fastpath at ffffffff8100b072
          RIP: 00002adaacdf26d0  RSP: 00007fff9665e238  RFLAGS: 00010246
          RAX: 0000000000000003  RBX: ffffffff8100b072  RCX: 0000000000002261
          RDX: 00000000044a24b0  RSI: 0000000000000001  RDI: 0000000000000005
          RBP: 0000000000000000   R8: 00002adaad0ac560   R9: 0000000000000001
          R10: 00000000000004fd  R11: 0000000000000246  R12: 00000000000004fc
          R13: 00000000ffffffff  R14: 00000000044a23d0  R15: 00000000ffffffff
          ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b
      

      This is recursive locking on the same mutex, which is not permitted.
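
      Reduced to its essentials, the pattern looks like the sketch below (placeholder names and illustrative gfp flags; not the actual mdc code):

      #include <linux/mutex.h>
      #include <linux/vmalloc.h>
      #include <linux/gfp.h>

      /* Sketch only: close_rpc_mutex and repsize are placeholders. */
      static DEFINE_MUTEX(close_rpc_mutex);

      static void *close_repbuf_alloc(unsigned long repsize)
      {
              void *repbuf;

              /* mdc_close() serializes close RPCs on a mutex: the lock
               * already held at frame #25 in the backtrace. */
              mutex_lock(&close_rpc_mutex);

              /* The reply buffer is then allocated with filesystem reclaim
               * allowed (__GFP_FS set).  Under memory pressure the allocator
               * enters shrink_icache_memory() -> clear_inode(), which
               * re-enters mdc_close() and blocks on the same mutex
               * (frame #2): a self-deadlock.
               *
               * The fix direction discussed below is to clear __GFP_FS for
               * this allocation, e.g.:
               *      __vmalloc(repsize, GFP_NOFS | __GFP_HIGHMEM | __GFP_ZERO,
               *                PAGE_KERNEL);
               */
              repbuf = __vmalloc(repsize, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);

              mutex_unlock(&close_rpc_mutex);
              return repbuf;
      }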

      Attachments

        Activity

          [LU-5349] Deadlock in mdc_close()
          pjones Peter Jones added a comment -

          Landed for 2.5.4 and 2.7

          green Oleg Drokin added a comment -

          Also, __GFP_ZERO is not really needed in the b2_5 patch because we explicitly zero the allocation with memset() afterwards anyway.

          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          I forgot to mention why, nice catch! It is because b2_5 uses vmalloc() whereas master uses vzalloc(), and I wanted to challenge my future reviewers about this ...

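          For context, a minimal sketch of the two variants being compared (illustrative only; the helper names and gfp flags are placeholders, not the actual __OBD_VMALLOC_VERBOSE code):

          #include <linux/vmalloc.h>
          #include <linux/string.h>
          #include <linux/gfp.h>

          /* master: vzalloc() already returns zero-filled memory. */
          static void *alloc_zeroed_master(unsigned long size)
          {
                  return vzalloc(size);
          }

          /* b2_5: plain __vmalloc() followed by an explicit memset() yields
           * the same zero-filled buffer, so also passing __GFP_ZERO would
           * be redundant. */
          static void *alloc_zeroed_b2_5(unsigned long size)
          {
                  void *buf = __vmalloc(size, GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);

                  if (buf != NULL)
                          memset(buf, 0, size);
                  return buf;
          }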

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Bruno, why does the b2_5 version lack the __GFP_ZERO flag in the call to __vmalloc() (__OBD_VMALLOC_VERBOSE macro)?

          bfaccini Bruno Faccini (Inactive) added a comment -

          Master patch http://review.whamcloud.com/11190 has landed. The b2_5 version is now at http://review.whamcloud.com/11739.

          bfaccini Bruno Faccini (Inactive) added a comment -

          Patch http://review.whamcloud.com/#/c/11183/ has been abandoned.

          bruno.travouillon Bruno Travouillon (Inactive) added a comment - - edited

          There are several Lustre filesystems mounted on this client:

          • a Lustre 2.4 filesystem on the same LNET with 1 MDT and 480 OSTs. We do not use wide striping.
          • a Lustre 2.1 filesystem on another LNET with 1 MDT and 224 OSTs.
          • a Lustre 2.1 filesystem on another LNET with 1 MDT and 56 OSTs.
          • a Lustre 2.1 filesystem on another LNET with 1 MDT and 48 OSTs.

          This Lustre client is a login node, with many users working interactively.

          You can find in the attached file the outputs of sar -B and sar -R.

          Hope this helps.


          adilger Andreas Dilger added a comment -

          The other question here is how many OSTs are in this filesystem, and whether you are using wide striping. I'm trying to figure out why this was using vmalloc() instead of kmalloc(), and whether there is a separate bug to be addressed to reduce the allocation size.

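          For a rough sense of scale, a back-of-the-envelope illustration (it assumes the close reply buffer is sized for the filesystem-wide maximum LOV EA, which grows with the OST count; exact thresholds and additional reply fields vary by version):

          /* Illustration only: 32-byte lov_mds_md_v1 header plus 24 bytes
           * per lov_ost_data_v1 entry. */
          #include <stdio.h>

          int main(void)
          {
                  unsigned int osts = 480;                  /* OST count reported above */
                  unsigned int max_easize = 32 + osts * 24; /* 11552 bytes, ~11.3 KiB */

                  /* The reply also carries the mdt body and other fields,
                   * which can push the buffer past the point where the
                   * large-allocation helper falls back from kmalloc to
                   * vmalloc (cfs_alloc_large() in the backtrace above). */
                  printf("maximum LOV EA for %u OSTs: %u bytes\n", osts, max_easize);
                  return 0;
          }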

          bfaccini Bruno Faccini (Inactive) added a comment -

          Hmm, too bad: __vmalloc_node() is not exported by the kernel. So I am stuck if I also want to fix cfs_cpt_vzalloc() for the fact that vzalloc_node() uses __GFP_FS by default, while still forwarding a NUMA node specification, since I think only __vmalloc_node() would allow both.
          So what should I do in cfs_cpt_vzalloc(): call __vmalloc() and forget about the specified node (which may imply performance issues for the NUMA-aware ptlrpcds that use it), or leave it as is until the kernel exports a suitable entry point, and assume that for now no cfs_cpt_vzalloc() call occurs during any filesystem operation?

          I just pushed patch-set #2 of http://review.whamcloud.com/11190 assuming this last case.

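          A minimal sketch of the trade-off described above (the function name and flags are placeholders, not the actual libcfs code):

          #include <linux/vmalloc.h>
          #include <linux/gfp.h>

          static void *cpt_vzalloc_sketch(unsigned long size, int node)
          {
                  /* Option 1: keep the NUMA node hint, but inherit
                   * vzalloc_node()'s default flags, which include __GFP_FS
                   * (the behaviour this ticket wants to avoid). */
                  return vzalloc_node(size, node);

                  /* Option 2: control the gfp flags (clear __GFP_FS) by
                   * calling __vmalloc() directly, at the cost of losing the
                   * node hint, since __vmalloc_node() is not exported:
                   *
                   *      return __vmalloc(size, GFP_NOFS | __GFP_HIGHMEM | __GFP_ZERO,
                   *                       PAGE_KERNEL);
                   */
          }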

          bfaccini Bruno Faccini (Inactive) added a comment -

          I just pushed http://review.whamcloud.com/11190 so that vmalloc[_node]()-based allocations no longer use __GFP_FS by default. I also found that this will enable a real NUMA node parameter setting!


          People

            Assignee:
            bfaccini Bruno Faccini (Inactive)
            Reporter:
            bruno.travouillon Bruno Travouillon (Inactive)
            Votes:
            0
            Watchers:
            9
