[LU-5349] Deadlock in mdc_close() Created: 15/Jul/14  Updated: 29/Sep/14  Resolved: 29/Sep/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug Priority: Minor
Reporter: Bruno Travouillon (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL6 w/ patched kernel


Attachments: HTML File report_for_support    
Severity: 3
Rank (Obsolete): 14915

 Description   

We had to crash/dump one of our Lustre clients because of a deadlock in mdc_close(). PID 5231 was waiting for a lock that it already owned, and many other processes were also waiting on this lock.

In the backtrace of the process, mdc_close() appears twice: the outer call (frame #26) took the lock before sending the close RPC, and the inner call (frame #3) was entered while the system was reclaiming memory during that RPC's reply-buffer allocation.

crash> bt 5231
PID: 5231   TASK: ffff881518308b00  CPU: 2   COMMAND: "code2"
 #0 [ffff88171cb43188] schedule at ffffffff81528a52
 #1 [ffff88171cb43250] __mutex_lock_slowpath at ffffffff8152a20e
 #2 [ffff88171cb432c0] mutex_lock at ffffffff8152a0ab                  <=== Requires a new lock
 #3 [ffff88171cb432e0] mdc_close at ffffffffa09176db [mdc]
 #4 [ffff88171cb43330] lmv_close at ffffffffa0b9bcb8 [lmv]
 #5 [ffff88171cb43380] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
 #6 [ffff88171cb43400] ll_md_real_close at ffffffffa0a81afa [lustre]
 #7 [ffff88171cb43430] ll_clear_inode at ffffffffa0a92dee [lustre]
 #8 [ffff88171cb43470] clear_inode at ffffffff811a626c
 #9 [ffff88171cb43490] dispose_list at ffffffff811a6340
#10 [ffff88171cb434d0] shrink_icache_memory at ffffffff811a6694
#11 [ffff88171cb43530] shrink_slab at ffffffff81138b7a
#12 [ffff88171cb43590] zone_reclaim at ffffffff8113b77e
#13 [ffff88171cb436b0] get_page_from_freelist at ffffffff8112d8dc
#14 [ffff88171cb437e0] __alloc_pages_nodemask at ffffffff8112f443
#15 [ffff88171cb43920] alloc_pages_current at ffffffff811680ca
#16 [ffff88171cb43950] __vmalloc_area_node at ffffffff81159696
#17 [ffff88171cb439b0] __vmalloc_node at ffffffff8115953d
#18 [ffff88171cb43a10] vmalloc at ffffffff8115985c
#19 [ffff88171cb43a20] cfs_alloc_large at ffffffffa03b4b1e [libcfs]
#20 [ffff88171cb43a30] null_alloc_repbuf at ffffffffa06c4961 [ptlrpc]
#21 [ffff88171cb43a60] sptlrpc_cli_alloc_repbuf at ffffffffa06b2355 [ptlrpc]
#22 [ffff88171cb43a90] ptl_send_rpc at ffffffffa068432c [ptlrpc]
#23 [ffff88171cb43b50] ptlrpc_send_new_req at ffffffffa067879b [ptlrpc]
#24 [ffff88171cb43bc0] ptlrpc_set_wait at ffffffffa067ddb6 [ptlrpc]
#25 [ffff88171cb43c60] ptlrpc_queue_wait at ffffffffa067e0df [ptlrpc]   <=== PID has the lock
#26 [ffff88171cb43c80] mdc_close at ffffffffa0917714 [mdc]
#27 [ffff88171cb43cd0] lmv_close at ffffffffa0b9bcb8 [lmv]
#28 [ffff88171cb43d20] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
#29 [ffff88171cb43da0] ll_md_real_close at ffffffffa0a81afa [lustre]
#30 [ffff88171cb43dd0] ll_md_close at ffffffffa0a81d8a [lustre]
#31 [ffff88171cb43e80] ll_file_release at ffffffffa0a8233b [lustre]
#32 [ffff88171cb43ec0] __fput at ffffffff8118ad55
#33 [ffff88171cb43f10] fput at ffffffff8118ae95
#34 [ffff88171cb43f20] filp_close at ffffffff811861bd
#35 [ffff88171cb43f50] sys_close at ffffffff81186295
#36 [ffff88171cb43f80] system_call_fastpath at ffffffff8100b072
    RIP: 00002adaacdf26d0  RSP: 00007fff9665e238  RFLAGS: 00010246
    RAX: 0000000000000003  RBX: ffffffff8100b072  RCX: 0000000000002261
    RDX: 00000000044a24b0  RSI: 0000000000000001  RDI: 0000000000000005
    RBP: 0000000000000000   R8: 00002adaad0ac560   R9: 0000000000000001
    R10: 00000000000004fd  R11: 0000000000000246  R12: 00000000000004fc
    R13: 00000000ffffffff  R14: 00000000044a23d0  R15: 00000000ffffffff
    ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b

We have recursive locking here, which is not permitted.



 Comments   
Comment by Bruno Faccini (Inactive) [ 15/Jul/14 ]

Maybe we can avoid this self-deadlock by detecting, in the prolog of the ll_clear_inode() routine, that the current thread is already the owner of class_exp2obd(ll_i2sbi(inode)->ll_md_exp)->u.cli.cl_close_lock->rpcl_mutex?
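
For illustration, here is a minimal sketch of that detection idea (this is not the code from the abandoned patch http://review.whamcloud.com/11183): a hypothetical rpcl_owner field recorded next to rpcl_mutex would let ll_clear_inode() notice that it is being re-entered by the thread that already holds the close-RPC mutex and skip the synchronous close in that case.

#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/types.h>

/*
 * Conceptual sketch only, not the abandoned patch itself.  The rpcl_owner
 * field is hypothetical (it does not exist in the real struct mdc_rpc_lock);
 * it only illustrates how self-ownership of the close-RPC mutex could be
 * detected cheaply in the prolog of ll_clear_inode().
 */
struct close_rpc_lock_sketch {
	struct mutex		 rpcl_mutex;
	struct task_struct	*rpcl_owner;	/* written only while rpcl_mutex is held */
};

static void close_lock_get(struct close_rpc_lock_sketch *lck)
{
	mutex_lock(&lck->rpcl_mutex);
	lck->rpcl_owner = current;
}

static void close_lock_put(struct close_rpc_lock_sketch *lck)
{
	lck->rpcl_owner = NULL;
	mutex_unlock(&lck->rpcl_mutex);
}

/* Would be checked in ll_clear_inode() before issuing a synchronous close. */
static bool close_lock_held_by_me(struct close_rpc_lock_sketch *lck)
{
	/* Only the owning thread can ever see its own task pointer here. */
	return lck->rpcl_owner == current;
}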

Comment by Jodi Levi (Inactive) [ 15/Jul/14 ]

Bruno,
Can you comment on whether this is happening in 2.6?

Comment by Bruno Faccini (Inactive) [ 22/Jul/14 ]

Jodi: Looks like yes.

I am trying to implement my fix idea in http://review.whamcloud.com/11183. But Bruno, any help finding a way to reproduce the same situation would be welcome ...

Comment by Oleg Drokin [ 22/Jul/14 ]

The most important part of the failure here is that I just realized vmalloc() does not take a GFP mask, so we cannot tell it not to dive into filesystems for memory reclaim.
This opens up endless possibilities for deadlocks and the like.

The real fix here would be to change OBD_VMALLOC to use __vmalloc() with GFP_NOFS, like we already do for regular kmalloc-based allocations. (We also need to add __GFP_ZERO, of course, to pre-zero the memory for us.)
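
A minimal sketch of that idea, assuming an RHEL6-era kernel where __vmalloc() takes (size, gfp_mask, prot); the helper name obd_vmalloc_nofs() is illustrative, not the actual OBD_VMALLOC macro from the patch:

#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/gfp.h>

/*
 * Illustrative replacement for a plain vmalloc() call: __vmalloc() accepts a
 * GFP mask, so GFP_NOFS keeps reclaim from re-entering the filesystem and
 * __GFP_ZERO pre-zeroes the buffer (what vzalloc() would otherwise do).
 */
static inline void *obd_vmalloc_nofs(unsigned long size)
{
	return __vmalloc(size, GFP_NOFS | __GFP_ZERO, PAGE_KERNEL);
}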

Comment by Andreas Dilger [ 22/Jul/14 ]

How long has __vmalloc() existed? I've never seen that before, and we've had similar problems to this in the past that could have been fixed in a similar manner.

Comment by Oleg Drokin [ 22/Jul/14 ]

I just did some git research and it has basically been there forever.
2.6.12-rc2 (where the history of Linus's repo on GitHub starts) already has it.

Comment by Bruno Faccini (Inactive) [ 22/Jul/14 ]

Oleg: thanks for the hint! I agree it looks more elegant than my own idea, and since it is much more restrictive it should cover all the other deadlock possibilities, not only this in-flight RPC serialization mutex ...

But concerning the details of the fix, I don't see any current way to specify a combination of allocation flags within our vmalloc()/vzalloc()-based set of macros. Should we implement one now?

Comment by Oleg Drokin [ 23/Jul/14 ]

Basically we need to do the same thing we do for the OBD_ALLOC code - just use GFP_NOFS by default (since we are a filesystem, that's a safe bet).
So far there has been no need for OBD_VMALLOC_GFP, so I am not sure now is the time to introduce it.

Comment by Bruno Faccini (Inactive) [ 23/Jul/14 ]

I just pushed http://review.whamcloud.com/11190 in order to have vmalloc[_node]()-based allocations no longer use __GFP_FS by default. I also found that this will allow a real NUMA node parameter to be passed!

Comment by Bruno Faccini (Inactive) [ 23/Jul/14 ]

Hmm, too bad: __vmalloc_node() is not exported by the kernel ... So I am stuck if I also want to fix cfs_cpt_vzalloc() regarding the fact that __GFP_FS is used by default by vzalloc_node(), while still forwarding a NUMA node specification, since I think only __vmalloc_node() would permit that ...
So what should I do in cfs_cpt_vzalloc(): call __vmalloc() and forget about the specified node (but this may imply performance issues for the NUMA-aware ptlrpcds which use it ...), or leave it as it is until the kernel exports a suitable entry point and assume that, for the moment, no cfs_cpt_vzalloc() call occurs during any file-system operation?

I just pushed patch-set #2 of http://review.whamcloud.com/11190 assuming the latter.
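
For reference, a sketch of the first option (drop the node hint and use plain __vmalloc()); patch-set #2 takes the second option instead, and cfs_cpt_vzalloc_nofs() is just an illustrative name, not an existing libcfs function:

#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/gfp.h>

/*
 * Sketch of the rejected alternative: since __vmalloc_node() is not
 * exported, fall back to __vmalloc() and give up NUMA placement.  Reclaim
 * can no longer recurse into the filesystem because GFP_NOFS clears
 * __GFP_FS, but NUMA-aware users (e.g. per-CPT ptlrpcd threads) lose
 * locality.
 */
static inline void *cfs_cpt_vzalloc_nofs(unsigned long size, int node)
{
	(void)node;	/* node hint intentionally ignored in this fallback */
	return __vmalloc(size, GFP_NOFS | __GFP_ZERO, PAGE_KERNEL);
}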

Comment by Andreas Dilger [ 23/Jul/14 ]

The other question here is how many OSTs are in this filesystem, and whether you are using wide striping. I'm trying to figure out why this was using vmalloc() instead of kmalloc(), and whether there is a separate bug to be addressed to reduce the allocation size.

Comment by Bruno Travouillon (Inactive) [ 23/Jul/14 ]

There are several Lustre filesystems mounted on this client:

  • a Lustre 2.4 filesystem on the same LNET with 1 MDT and 480 OSTs. We do not use wide striping.
  • a Lustre 2.1 filesystem on another LNET with 1 MDT and 224 OSTs.
  • a Lustre 2.1 filesystem on another LNET with 1 MDT and 56 OSTs.
  • a Lustre 2.1 filesystem on another LNET with 1 MDT and 48 OSTs.

This Lustre client is a login node, with many users working interactively.

You can find in the attached file the outputs of sar -B and sar -R.

Hope this helps.

Comment by Bruno Faccini (Inactive) [ 07/Aug/14 ]

Patch http://review.whamcloud.com/#/c/11183/ has been abandoned.

Comment by Bruno Faccini (Inactive) [ 03/Sep/14 ]

Master patch http://review.whamcloud.com/11190 has landed.
b2_5 version is now at http://review.whamcloud.com/11739.

Comment by Sebastien Buisson (Inactive) [ 03/Sep/14 ]

Bruno, why does the b2_5 version lack the __GFP_ZERO flag in the call to __vmalloc() (in the __OBD_VMALLOC_VERBOSE macro)?

Comment by Bruno Faccini (Inactive) [ 03/Sep/14 ]

I forgot to mention why, nice catch! It is because b2_5 uses vmalloc() where master used vzalloc(), and I wanted to challenge my future reviewers with this ...

Comment by Oleg Drokin [ 24/Sep/14 ]

Also, __GFP_ZERO is not really needed in the b2_5 patch because we explicitly zero the allocation with memset() afterwards anyway.
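
For illustration only (the alloc_repbuf_* helpers below are hypothetical, not the patches' macros), the two variants being discussed end up with the same zeroed buffer:

#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/gfp.h>
#include <linux/string.h>

/* master-style: the allocator pre-zeroes the buffer via __GFP_ZERO */
static inline void *alloc_repbuf_zeroed(unsigned long size)
{
	return __vmalloc(size, GFP_NOFS | __GFP_ZERO, PAGE_KERNEL);
}

/* b2_5-style: zeroed explicitly afterwards, so __GFP_ZERO would be redundant */
static inline void *alloc_repbuf_memset(unsigned long size)
{
	void *buf = __vmalloc(size, GFP_NOFS, PAGE_KERNEL);

	if (buf != NULL)
		memset(buf, 0, size);
	return buf;
}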

Comment by Peter Jones [ 29/Sep/14 ]

Landed for 2.5.4 and 2.7
