[LU-4861] App hung - deadlock in cl_lock_mutex_get along cl_glimpse_lock path Created: 03/Apr/14  Updated: 12/Aug/14  Resolved: 20/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1
Fix Version/s: Lustre 2.6.0, Lustre 2.5.3

Type: Bug Priority: Critical
Reporter: Ann Koehler (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre 2.5.1 on both clients and servers.


Issue Links:
Related
is related to LU-5225 Client is evicted by multiple OSTs on... Resolved
Severity: 3
Rank (Obsolete): 13411

 Description   

We have several occurrences of applications hanging. Stack traces show the application processes waiting in cl_lock_mutex_get/mutex_lock on a code path through cl_glimpse_lock. All the dumps I've looked at show one of the processes calling osc_ldlm_completion_ast along the way. Two processes are deadlocked on two cl_lock.cll_guard mutexes; all other application processes are waiting for one of these two mutexes.

> crash> bt -F | grep -A 1 '#2'
>  #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
>     ffff88083f505c48: [ccc_object_kmem] [cl_lock_kmem]

(The symbolic bt -F output above shows that the second stack slot in frame #2 is a cl_lock; in the raw bt -f output below, that second value is the address of the cl_lock whose cll_guard mutex the task is waiting for.)
> crash> foreach growfiles bt -f | grep -A 1 '#2' | grep -v mutex_lock
>     ffff88083f533c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70
>     ffff8807eb48fd08: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083b4cfb98: ffff88083f684d98 ffff88083bbf5b70
>     ffff8807ea2e1aa8: ffff88083f684d98 ffff88083bbf5b70
>     ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083ff5fc28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833821d08: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833751c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083f5f1c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083e157c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833749c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083dfcbc28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083bd65c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833755c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833801c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083fd31c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff8807ed5a3c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff8807e0117c28: ffff8808350cdef8 ffff88083ba19ed0

crash> struct cl_lock.cll_guard ffff88083bbf5b70 | grep owner
    owner = 0xffff88083fe0c7f0
crash> ps | grep ffff88083fe0c7f0
   5548      1    3 ffff88083fe0c7f0  UN   0.0    3120   1576  growfiles

crash> struct cl_lock.cll_guard ffff88083ba19ed0 | grep owner
    owner = 0xffff8808336497f0
crash> ps | grep ffff8808336497f0
   5543      1   12 ffff8808336497f0  UN   0.0    3120   1576  growfiles

> crash> for 5543 bt -f | grep -A 1 '#2'
>  #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
>     ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70 
> crash> for 5548 bt -f | grep -A 1 '#2'
>  #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
>     ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0 

So a deadlock exists between pids 5543 and 5548. All other growfiles tasks are waiting for one of these two pids.
                           Owner     Waiter
cl_lock ffff88083bbf5b70   5548      5543
cl_lock ffff88083ba19ed0   5543      5548
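
For illustration only, this cycle is the classic ABBA lock-ordering deadlock. The sketch below is a hypothetical, standalone analogue using plain pthread mutexes and is NOT Lustre code: lock_a and lock_b stand in for the cll_guard mutexes of the two cl_locks above, and the two threads play the roles of pids 5548 and 5543.

/* Hypothetical ABBA sketch (plain pthreads, not Lustre code). */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;  /* "ffff88083bbf5b70" */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;  /* "ffff88083ba19ed0" */

static void *thread_5548(void *arg)
{
        pthread_mutex_lock(&lock_a);    /* owns A ...                      */
        sleep(1);                       /* widen the race window           */
        pthread_mutex_lock(&lock_b);    /* ... blocks waiting for B        */
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return NULL;
}

static void *thread_5543(void *arg)
{
        pthread_mutex_lock(&lock_b);    /* owns B ...                      */
        sleep(1);
        pthread_mutex_lock(&lock_a);    /* ... blocks waiting for A: stuck */
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, thread_5548, NULL);
        pthread_create(&t2, NULL, thread_5543, NULL);
        pthread_join(t1, NULL);         /* never returns once both block   */
        pthread_join(t2, NULL);
        return 0;
}

The full backtraces below show where each of the two pids is blocked.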

> crash> bt
> PID: 5548   TASK: ffff88083fe0c7f0  CPU: 3   COMMAND: "growfiles"
>  #0 [ffff88083f505a68] schedule at ffffffff8144e6b7
>  #1 [ffff88083f505bd0] __mutex_lock_slowpath at ffffffff8144fb0e
>  #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
>  #3 [ffff88083f505c60] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
>  #4 [ffff88083f505c90] lov_lock_enqueue at ffffffffa07c077f [lov]
>  #5 [ffff88083f505d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
>  #6 [ffff88083f505d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
>  #7 [ffff88083f505dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
>  #8 [ffff88083f505e20] cl_glimpse_lock at ffffffffa089089f [lustre]
>  #9 [ffff88083f505e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
> #10 [ffff88083f505ed0] ll_file_seek at ffffffffa083d988 [lustre]
> #11 [ffff88083f505f30] vfs_llseek at ffffffff81155eea
> #12 [ffff88083f505f40] sys_lseek at ffffffff8115604e
> #13 [ffff88083f505f80] system_call_fastpath at ffffffff814589ab

> PID: 5543   TASK: ffff8808336497f0  CPU: 12  COMMAND: "growfiles"
>  #0 [ffff88083c4897c8] schedule at ffffffff8144e6b7
>  #1 [ffff88083c489930] __mutex_lock_slowpath at ffffffff8144fb0e
>  #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
>  #3 [ffff88083c4899c0] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
>  #4 [ffff88083c4899f0] osc_ldlm_completion_ast at ffffffffa072ea6f [osc]
>  #5 [ffff88083c489a40] ldlm_lock_match at ffffffffa04a1477 [ptlrpc]
>  #6 [ffff88083c489b20] osc_enqueue_base at ffffffffa07128f0 [osc]
>  #7 [ffff88083c489bb0] osc_lock_enqueue at ffffffffa072ccb6 [osc]
>  #8 [ffff88083c489c40] cl_enqueue_try at ffffffffa03abffb [obdclass]
>  #9 [ffff88083c489c90] lov_lock_enqueue at ffffffffa07c01d2 [lov]
> #10 [ffff88083c489d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #11 [ffff88083c489d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
> #12 [ffff88083c489dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
> #13 [ffff88083c489e20] cl_glimpse_lock at ffffffffa089089f [lustre]
> #14 [ffff88083c489e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
> #15 [ffff88083c489ed0] ll_file_seek at ffffffffa083d988 [lustre]
> #16 [ffff88083c489f30] vfs_llseek at ffffffff81155eea
> #17 [ffff88083c489f40] sys_lseek at ffffffff8115604e
> #18 [ffff88083c489f80] system_call_fastpath at ffffffff814589ab
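
Both stacks enter the client through the same syscall: an lseek() on a Lustre file that needs the up-to-date file size (presumably SEEK_END), which drives sys_lseek -> vfs_llseek -> ll_file_seek -> cl_glimpse_size0 -> cl_glimpse_lock. For reference only, the userspace side of that path is sketched below; the file path is made up and this is not a deterministic reproducer -- the deadlock also needs a second task racing on the same object's cl_locks.

/* Hypothetical illustration of the syscall that enters the hung path;
 * not a reliable reproducer. The path /lustre/testfile is made up. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/lustre/testfile", O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* SEEK_END forces the client to learn the current file size,
         * which on a Lustre client means taking a glimpse lock. */
        off_t end = lseek(fd, 0, SEEK_END);
        printf("size = %lld\n", (long long)end);
        close(fd);
        return 0;
}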

The version of Lustre is 2.5.1 with some additional patches; in particular, the patch for LU-3027 (change 7841) has been reverted. The patch from LU-4558 (change 9876) is NOT included.



 Comments   
Comment by Ann Koehler (Inactive) [ 03/Apr/14 ]

Uploaded dump to ftp.whamcloud.com:/uploads/LU-4861/bug810344_cllock_deadlock.tgz

This ticket is primarily "for your information". It seems like there's a reasonable chance that the cause of this deadlock is related to the root cause of LU-4591, so I'm passing on the dump in case it helps solve LU-4591. In case it's not related, we're also working the issue through Xyratex in parallel.

Comment by Jinshan Xiong (Inactive) [ 04/Apr/14 ]

Thanks for the coredump, I will take a look.

Comment by Ann Koehler (Inactive) [ 22/Apr/14 ]

FYI, we've seen this same deadlock again with the patch http://review.whamcloud.com/9881 from LU-4558 applied.

Comment by Patrick Farrell (Inactive) [ 20/May/14 ]

I hit this bug last night during stress testing of current, unmodified master on SLES11SP3.

Most recent commit on the branch being tested:
commit 864fc9daac267819f5e3bdebef6cdac4c6325626
Author: Mikhail Pershin <mike.pershin@intel.com>
Date: Tue May 13 18:34:03 2014 +0400

LU-2059 mgs: don't fail on missing params log
-----------------------------------------

Uploaded dump to ftp.whamcloud.com:/uploads/LU-4861/master_140520_dump_upload.tar.gz

Note there is also a master_140520_dump.tar.gz in that same directory, which is a failed transfer.

Comment by Jinshan Xiong (Inactive) [ 29/May/14 ]

I will take a look at this issue.

Comment by Jinshan Xiong (Inactive) [ 04/Jun/14 ]

Please try patch http://review.whamcloud.com/10581

Comment by Patrick Farrell (Inactive) [ 04/Jun/14 ]

We'll try the patch. Separately, because Jinshan asked, the logs from the dump I uploaded are here:
ftp.whamcloud.com:/uploads/LU-4861/logs.tar.gz

Comment by Patrick Farrell (Inactive) [ 05/Jun/14 ]

Jinshan - We've given this patch a bit of testing and seen no issues. We don't have a clear reproducer, so we can't say for sure that the bug is fixed, but we're pulling this into our branch of Lustre so it will see more exposure over the next few weeks.

Comment by Peter Jones [ 20/Jun/14 ]

Landed for 2.6
