[LU-4861] App hung - deadlock in cl_lock_mutex_get along cl_glimpse_lock path
Created: 03/Apr/14  Updated: 12/Aug/14  Resolved: 20/Jun/14

| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ann Koehler (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | Lustre 2.5.1 on both clients and servers. |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 13411 |
| Description |
|
Have several occurrences of applications hanging. Stack traces show the application processes waiting in cl_lock_mutex_get/mutex_lock on a code path through cl_glimpse_lock. All the dumps I've looked at show one of the processes calling osc_ldlm_completion_ast along the way. Two processes are deadlocked on two cl_lock.cll_guard mutexes. All other app processes are waiting for one of these two mutexes.
> crash> bt -F | grep -A 1 '#2'
> #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
> ffff88083f505c48: [ccc_object_kmem] [cl_lock_kmem]
(the second stack value, resolved to cl_lock_kmem, is the address of the cl_lock)
> crash> foreach growfiles bt -f | grep -A 1 '#2' | grep -v mutex_lock
> ffff88083f533c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70
> ffff8807eb48fd08: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083b4cfb98: ffff88083f684d98 ffff88083bbf5b70
> ffff8807ea2e1aa8: ffff88083f684d98 ffff88083bbf5b70
> ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083ff5fc28: ffff8808350cdef8 ffff88083ba19ed0
> ffff880833821d08: ffff8808350cdef8 ffff88083ba19ed0
> ffff880833751c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083f5f1c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083e157c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff880833749c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083dfcbc28: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083bd65c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff880833755c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff880833801c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083fd31c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff8807ed5a3c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff8807e0117c28: ffff8808350cdef8 ffff88083ba19ed0
crash> struct cl_lock.cll_guard ffff88083bbf5b70 | grep owner
owner = 0xffff88083fe0c7f0
crash> ps | grep ffff88083fe0c7f0
5548 1 3 ffff88083fe0c7f0 UN 0.0 3120 1576 growfiles
crash> struct cl_lock.cll_guard ffff88083ba19ed0 | grep owner
owner = 0xffff8808336497f0
crash> ps | grep ffff8808336497f0
5543 1 12 ffff8808336497f0 UN 0.0 3120 1576 growfiles
> crash> for 5543 bt -f | grep -A 1 '#2'
> #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
> ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70
> crash> for 5548 bt -f | grep -A 1 '#2'
> #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
> ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0
So a deadlock exists between pids 5543 and 5548. All other growfiles tasks are waiting for one of these two pids.

| cl_lock | Owner | Waiter |
| ffff88083bbf5b70 | 5548 | 5543 |
| ffff88083ba19ed0 | 5543 | 5548 |
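The summary above is a textbook lock-order (ABBA) inversion on the two cl_lock.cll_guard mutexes: each task owns one guard and blocks in cl_lock_mutex_get() trying to take the other. A minimal, self-contained userspace sketch of that pattern follows (pthread mutexes; guard_a/guard_b and the thread names are hypothetical stand-ins keyed to the pids above, not Lustre code):

/*
 * Toy model of the inversion: guard_a/guard_b stand in for the two
 * cl_lock.cll_guard mutexes; the threads play the roles of pids 5548
 * and 5543.  Depending on scheduling, both joins hang forever.
 */
#include <pthread.h>

static pthread_mutex_t guard_a = PTHREAD_MUTEX_INITIALIZER; /* cl_lock ffff88083bbf5b70 */
static pthread_mutex_t guard_b = PTHREAD_MUTEX_INITIALIZER; /* cl_lock ffff88083ba19ed0 */

static void *task_5548(void *unused)    /* lov_lock_enqueue path */
{
        pthread_mutex_lock(&guard_a);   /* owns ffff88083bbf5b70 */
        pthread_mutex_lock(&guard_b);   /* blocks: 5543 owns it */
        pthread_mutex_unlock(&guard_b);
        pthread_mutex_unlock(&guard_a);
        return NULL;
}

static void *task_5543(void *unused)    /* osc_ldlm_completion_ast path */
{
        pthread_mutex_lock(&guard_b);   /* owns ffff88083ba19ed0 */
        pthread_mutex_lock(&guard_a);   /* blocks: 5548 owns it */
        pthread_mutex_unlock(&guard_a);
        pthread_mutex_unlock(&guard_b);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, task_5548, NULL);
        pthread_create(&t2, NULL, task_5543, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}

Any single, consistent acquisition order for the two guards (or dropping one and retrying) makes this pattern impossible; the full backtraces below show where each of the two enqueue paths blocks.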
> crash> bt
> PID: 5548 TASK: ffff88083fe0c7f0 CPU: 3 COMMAND: "growfiles"
> #0 [ffff88083f505a68] schedule at ffffffff8144e6b7
> #1 [ffff88083f505bd0] __mutex_lock_slowpath at ffffffff8144fb0e
> #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
> #3 [ffff88083f505c60] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
> #4 [ffff88083f505c90] lov_lock_enqueue at ffffffffa07c077f [lov]
> #5 [ffff88083f505d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #6 [ffff88083f505d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
> #7 [ffff88083f505dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
> #8 [ffff88083f505e20] cl_glimpse_lock at ffffffffa089089f [lustre]
> #9 [ffff88083f505e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
> #10 [ffff88083f505ed0] ll_file_seek at ffffffffa083d988 [lustre]
> #11 [ffff88083f505f30] vfs_llseek at ffffffff81155eea
> #12 [ffff88083f505f40] sys_lseek at ffffffff8115604e
> #13 [ffff88083f505f80] system_call_fastpath at ffffffff814589ab
> PID: 5543 TASK: ffff8808336497f0 CPU: 12 COMMAND: "growfiles"
> #0 [ffff88083c4897c8] schedule at ffffffff8144e6b7
> #1 [ffff88083c489930] __mutex_lock_slowpath at ffffffff8144fb0e
> #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
> #3 [ffff88083c4899c0] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
> #4 [ffff88083c4899f0] osc_ldlm_completion_ast at ffffffffa072ea6f [osc]
> #5 [ffff88083c489a40] ldlm_lock_match at ffffffffa04a1477 [ptlrpc]
> #6 [ffff88083c489b20] osc_enqueue_base at ffffffffa07128f0 [osc]
> #7 [ffff88083c489bb0] osc_lock_enqueue at ffffffffa072ccb6 [osc]
> #8 [ffff88083c489c40] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #9 [ffff88083c489c90] lov_lock_enqueue at ffffffffa07c01d2 [lov]
> #10 [ffff88083c489d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #11 [ffff88083c489d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
> #12 [ffff88083c489dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
> #13 [ffff88083c489e20] cl_glimpse_lock at ffffffffa089089f [lustre]
> #14 [ffff88083c489e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
> #15 [ffff88083c489ed0] ll_file_seek at ffffffffa083d988 [lustre]
> #16 [ffff88083c489f30] vfs_llseek at ffffffff81155eea
> #17 [ffff88083c489f40] sys_lseek at ffffffff8115604e
> #18 [ffff88083c489f80] system_call_fastpath at ffffffff814589ab
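For context, the top of both stacks (sys_lseek -> vfs_llseek -> ll_file_seek -> cl_glimpse_size0 -> cl_glimpse_lock) is nothing more exotic than an lseek to end-of-file on a Lustre file, which (absent a cached lock covering the size) glimpses the current size from the OSTs. A hypothetical, minimal illustration of that trigger follows; it exercises the same path but does not by itself reproduce the race:

/* lseek(SEEK_END) on a Lustre file needs an up-to-date size, which is
 * what sends ll_file_seek() through cl_glimpse_size0()/cl_glimpse_lock()
 * in the backtraces above (frames #12/#17: sys_lseek).  Illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc < 2)
                return 1;

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                return 1;

        off_t end = lseek(fd, 0, SEEK_END);
        printf("current size: %lld\n", (long long)end);
        close(fd);
        return 0;
}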
The version of Lustre is 2.5.1 with some additional patches; in particular, the LU-3027 patch (7841) has been reverted. The patch from |
| Comments |
| Comment by Ann Koehler (Inactive) [ 03/Apr/14 ] |
|
Uploaded dump to ftp.whamcloud.com:/uploads/. This ticket is primarily "for your information". It seems like there's a reasonable chance that the cause of the deadlock is related to the root cause of |
| Comment by Jinshan Xiong (Inactive) [ 04/Apr/14 ] |
|
Thanks for the coredump, I will take a look. |
| Comment by Ann Koehler (Inactive) [ 22/Apr/14 ] |
|
FYI, we've seen this same deadlock again with the patch http://review.whamcloud.com/9881 from |
| Comment by Patrick Farrell (Inactive) [ 20/May/14 ] |
|
I hit this bug last night during stress testing of current, unmodified master on SLES11SP3. Most recent commit on the branch being tested: Uploaded dump to ftp.whamcloud.com:/uploads/. Note there is also a master_140520_dump.tar.gz in that same directory, which is a failed transfer. |
| Comment by Jinshan Xiong (Inactive) [ 29/May/14 ] |
|
I will take a look at this issue. |
| Comment by Jinshan Xiong (Inactive) [ 04/Jun/14 ] |
|
Please try patch: http://review.whamcloud.com/10581 |
| Comment by Patrick Farrell (Inactive) [ 04/Jun/14 ] |
|
We'll try the patch. Separately, because Jinshan asked, the logs from the dump I uploaded are here: |
| Comment by Patrick Farrell (Inactive) [ 05/Jun/14 ] |
|
Jinshan - We've given this patch a bit of testing and seen no issues. We don't have a clear reproducer, so we can't say for sure that the bug is fixed, but we're pulling this into our branch of Lustre so it will see more exposure over the next few weeks. |
| Comment by Peter Jones [ 20/Jun/14 ] |
|
Landed for 2.6 |