Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: Lustre 2.5.1
- Fix Version: None
- Environment: Lustre 2.5.1 on both clients and servers.
- Severity: 3
- Rank: 13411
Description
We have seen several occurrences of applications hanging. Stack traces show the application processes waiting in cl_lock_mutex_get/mutex_lock on the code path through cl_glimpse_lock. All of the dumps I've looked at show one of the processes calling osc_ldlm_completion_ast along the way. Two processes are deadlocked on two cl_lock.cll_guard mutexes; all other application processes are waiting on one of these two mutexes.
> crash> bt -F | grep -A 1 '#2'
> #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
> ffff88083f505c48: [ccc_object_kmem] [cl_lock_kmem]
(the second value on each stack line above is the address of the cl_lock)
> crash> foreach growfiles bt -f | grep -A 1 '#2' | grep -v mutex_lock
> ffff88083f533c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70
> ffff8807eb48fd08: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083b4cfb98: ffff88083f684d98 ffff88083bbf5b70
> ffff8807ea2e1aa8: ffff88083f684d98 ffff88083bbf5b70
> ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083ff5fc28: ffff8808350cdef8 ffff88083ba19ed0
> ffff880833821d08: ffff8808350cdef8 ffff88083ba19ed0
> ffff880833751c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083f5f1c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083e157c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff880833749c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083dfcbc28: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083bd65c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff880833755c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff880833801c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff88083fd31c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff8807ed5a3c28: ffff8808350cdef8 ffff88083ba19ed0
> ffff8807e0117c28: ffff8808350cdef8 ffff88083ba19ed0
> crash> struct cl_lock.cll_guard ffff88083bbf5b70 | grep owner
> owner = 0xffff88083fe0c7f0
> crash> ps | grep ffff88083fe0c7f0
> 5548 1 3 ffff88083fe0c7f0 UN 0.0 3120 1576 growfiles
> crash> struct cl_lock.cll_guard ffff88083ba19ed0 | grep owner
> owner = 0xffff8808336497f0
> crash> ps | grep ffff8808336497f0
> 5543 1 12 ffff8808336497f0 UN 0.0 3120 1576 growfiles
> crash> for 5543 bt -f | grep -A 1 '#2'
> #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
> ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70
> crash> for 5548 bt -f | grep -A 1 '#2'
> #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
> ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0
So a deadlock exists between pids 5543 and 5548. All other growfiles tasks are waiting for one of these two pids.
cl_lock                 Owner  Waiter
ffff88083bbf5b70        5548   5543
ffff88083ba19ed0        5543   5548
> crash> bt
> PID: 5548 TASK: ffff88083fe0c7f0 CPU: 3 COMMAND: "growfiles"
> #0 [ffff88083f505a68] schedule at ffffffff8144e6b7
> #1 [ffff88083f505bd0] __mutex_lock_slowpath at ffffffff8144fb0e
> #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
> #3 [ffff88083f505c60] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
> #4 [ffff88083f505c90] lov_lock_enqueue at ffffffffa07c077f [lov]
> #5 [ffff88083f505d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #6 [ffff88083f505d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
> #7 [ffff88083f505dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
> #8 [ffff88083f505e20] cl_glimpse_lock at ffffffffa089089f [lustre]
> #9 [ffff88083f505e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
> #10 [ffff88083f505ed0] ll_file_seek at ffffffffa083d988 [lustre]
> #11 [ffff88083f505f30] vfs_llseek at ffffffff81155eea
> #12 [ffff88083f505f40] sys_lseek at ffffffff8115604e
> #13 [ffff88083f505f80] system_call_fastpath at ffffffff814589ab
> PID: 5543 TASK: ffff8808336497f0 CPU: 12 COMMAND: "growfiles"
> #0 [ffff88083c4897c8] schedule at ffffffff8144e6b7
> #1 [ffff88083c489930] __mutex_lock_slowpath at ffffffff8144fb0e
> #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
> #3 [ffff88083c4899c0] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
> #4 [ffff88083c4899f0] osc_ldlm_completion_ast at ffffffffa072ea6f [osc]
> #5 [ffff88083c489a40] ldlm_lock_match at ffffffffa04a1477 [ptlrpc]
> #6 [ffff88083c489b20] osc_enqueue_base at ffffffffa07128f0 [osc]
> #7 [ffff88083c489bb0] osc_lock_enqueue at ffffffffa072ccb6 [osc]
> #8 [ffff88083c489c40] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #9 [ffff88083c489c90] lov_lock_enqueue at ffffffffa07c01d2 [lov]
> #10 [ffff88083c489d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #11 [ffff88083c489d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
> #12 [ffff88083c489dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
> #13 [ffff88083c489e20] cl_glimpse_lock at ffffffffa089089f [lustre]
> #14 [ffff88083c489e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
> #15 [ffff88083c489ed0] ll_file_seek at ffffffffa083d988 [lustre]
> #16 [ffff88083c489f30] vfs_llseek at ffffffff81155eea
> #17 [ffff88083c489f40] sys_lseek at ffffffff8115604e
> #18 [ffff88083c489f80] system_call_fastpath at ffffffff814589ab
The version of Lustre is 2.5.1 with some additional patches; in particular, the patch from LU-3027 (change 7841) has been reverted. The patch from LU-4558 (change 9876) is NOT included.
Attachments
Issue Links
- is related to: LU-5225 Client is evicted by multiple OSTs on all OSSs (Resolved)