Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: Lustre 2.5.1
- Fix Version/s: None
- Environment: Lustre 2.5.1 on both clients and servers.
- Severity: 3
- Rank: 13411
Description
We have several occurrences of applications hanging. Stack traces show the application processes waiting in cl_lock_mutex_get/mutex_lock on a code path through cl_glimpse_lock. All the dumps I've looked at show one of the processes calling osc_ldlm_completion_ast along the way. Two processes are deadlocked on two cl_lock.cll_guard mutexes; all other application processes are waiting for one of these two mutexes.
> crash> bt -F | grep -A 1 '#2'
> #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
>     ffff88083f505c48: [ccc_object_kmem] [cl_lock_kmem] addr(cl_lock)
> crash> foreach growfiles bt -f | grep -A 1 '#2' | grep -v mutex_lock
>     ffff88083f533c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70
>     ffff8807eb48fd08: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083b4cfb98: ffff88083f684d98 ffff88083bbf5b70
>     ffff8807ea2e1aa8: ffff88083f684d98 ffff88083bbf5b70
>     ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083ff5fc28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833821d08: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833751c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083f5f1c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083e157c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833749c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083dfcbc28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083bd65c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833755c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833801c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083fd31c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff8807ed5a3c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff8807e0117c28: ffff8808350cdef8 ffff88083ba19ed0
> crash> struct cl_lock.cll_guard ffff88083bbf5b70 | grep owner
>     owner = 0xffff88083fe0c7f0
> crash> ps | grep ffff88083fe0c7f0
>     5548  1  3  ffff88083fe0c7f0  UN  0.0  3120  1576  growfiles
> crash> struct cl_lock.cll_guard ffff88083ba19ed0 | grep owner
>     owner = 0xffff8808336497f0
> crash> ps | grep ffff8808336497f0
>     5543  1  12  ffff8808336497f0  UN  0.0  3120  1576  growfiles
> crash> for 5543 bt -f | grep -A 1 '#2'
> #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
>     ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70
> crash> for 5548 bt -f | grep -A 1 '#2'
> #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
>     ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0

So a deadlock exists between pids 5543 and 5548. All other growfiles tasks are waiting for one of these two pids.

                          Owner  Waiter
cl_lock ffff88083bbf5b70  5548   5543
cl_lock ffff88083ba19ed0  5543   5548

> crash> bt
> PID: 5548  TASK: ffff88083fe0c7f0  CPU: 3  COMMAND: "growfiles"
> #0 [ffff88083f505a68] schedule at ffffffff8144e6b7
> #1 [ffff88083f505bd0] __mutex_lock_slowpath at ffffffff8144fb0e
> #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
> #3 [ffff88083f505c60] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
> #4 [ffff88083f505c90] lov_lock_enqueue at ffffffffa07c077f [lov]
> #5 [ffff88083f505d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #6 [ffff88083f505d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
> #7 [ffff88083f505dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
> #8 [ffff88083f505e20] cl_glimpse_lock at ffffffffa089089f [lustre]
> #9 [ffff88083f505e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
> #10 [ffff88083f505ed0] ll_file_seek at ffffffffa083d988 [lustre]
> #11 [ffff88083f505f30] vfs_llseek at ffffffff81155eea
> #12 [ffff88083f505f40] sys_lseek at ffffffff8115604e
> #13 [ffff88083f505f80] system_call_fastpath at ffffffff814589ab
>
> PID: 5543  TASK: ffff8808336497f0  CPU: 12  COMMAND: "growfiles"
> #0 [ffff88083c4897c8] schedule at ffffffff8144e6b7
> #1 [ffff88083c489930] __mutex_lock_slowpath at ffffffff8144fb0e
> #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
> #3 [ffff88083c4899c0] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
> #4 [ffff88083c4899f0] osc_ldlm_completion_ast at ffffffffa072ea6f [osc]
> #5 [ffff88083c489a40] ldlm_lock_match at ffffffffa04a1477 [ptlrpc]
> #6 [ffff88083c489b20] osc_enqueue_base at ffffffffa07128f0 [osc]
> #7 [ffff88083c489bb0] osc_lock_enqueue at ffffffffa072ccb6 [osc]
> #8 [ffff88083c489c40] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #9 [ffff88083c489c90] lov_lock_enqueue at ffffffffa07c01d2 [lov]
> #10 [ffff88083c489d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #11 [ffff88083c489d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
> #12 [ffff88083c489dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
> #13 [ffff88083c489e20] cl_glimpse_lock at ffffffffa089089f [lustre]
> #14 [ffff88083c489e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
> #15 [ffff88083c489ed0] ll_file_seek at ffffffffa083d988 [lustre]
> #16 [ffff88083c489f30] vfs_llseek at ffffffff81155eea
> #17 [ffff88083c489f40] sys_lseek at ffffffff8115604e
> #18 [ffff88083c489f80] system_call_fastpath at ffffffff814589ab
The version of Lustre is 2.5.1 with some additional patches; in particular, LU-3027 (patch 7841) has been reverted. The patch from LU-4558 (patch 9876) is NOT included.
Attachments
Issue Links
- is related to: LU-5225 Client is evicted by multiple OSTs on all OSSs (Resolved)