Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: Lustre 2.5.1
- Fix Version: None
- Environment: Lustre 2.5.1 on both clients and servers.
- Severity: 3
- Rank: 13411
Description
We have seen several occurrences of applications hanging. Stack traces show the application processes waiting in cl_lock_mutex_get/mutex_lock on a code path through cl_glimpse_lock. All the dumps I've looked at show one of the processes calling osc_ldlm_completion_ast along the way. Two processes are deadlocked on two cl_lock.cll_guard mutexes; all other application processes are waiting for one of these two mutexes.
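For context, this is the classic AB-BA pattern: each of the two stuck processes owns one cl_lock.cll_guard mutex and is sleeping in mutex_lock() on the other, so every subsequent glimpse on either lock piles up behind them. A minimal user-space sketch of the same wait graph (pthreads; the thread and mutex names are hypothetical stand-ins, not the Lustre code paths):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for the two cl_lock.cll_guard mutexes (hypothetical names). */
static pthread_mutex_t guard_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t guard_b = PTHREAD_MUTEX_INITIALIZER;

/* Like one glimpse thread: takes guard_a, then wants guard_b. */
static void *thread_ab(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&guard_a);
    sleep(1);                        /* widen the race window */
    pthread_mutex_lock(&guard_b);    /* blocks forever once thread_ba owns guard_b */
    pthread_mutex_unlock(&guard_b);
    pthread_mutex_unlock(&guard_a);
    return NULL;
}

/* Like the other glimpse thread: takes guard_b, then wants guard_a. */
static void *thread_ba(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&guard_b);
    sleep(1);
    pthread_mutex_lock(&guard_a);    /* circular wait: deadlock */
    pthread_mutex_unlock(&guard_a);
    pthread_mutex_unlock(&guard_b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, thread_ab, NULL);
    pthread_create(&t2, NULL, thread_ba, NULL);
    /* Any further thread that takes either guard (the other growfiles
     * tasks) now blocks forever behind one of the two owners. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("never reached\n");
    return 0;
}

The crash analysis below bears this out: two owners, each waiting on the other's guard, with the rest of the growfiles tasks queued behind them.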
> crash> bt -F | grep -A 1 '#2'
> #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
>     ffff88083f505c48: [ccc_object_kmem] [cl_lock_kmem] addr(cl_lock)
>
> crash> foreach growfiles bt -f | grep -A 1 '#2' | grep -v mutex_lock
>     ffff88083f533c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70
>     ffff8807eb48fd08: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083b4cfb98: ffff88083f684d98 ffff88083bbf5b70
>     ffff8807ea2e1aa8: ffff88083f684d98 ffff88083bbf5b70
>     ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083ff5fc28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833821d08: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833751c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083f5f1c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083e157c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833749c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083dfcbc28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083bd65c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833755c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff880833801c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff88083fd31c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff8807ed5a3c28: ffff8808350cdef8 ffff88083ba19ed0
>     ffff8807e0117c28: ffff8808350cdef8 ffff88083ba19ed0

crash> struct cl_lock.cll_guard ffff88083bbf5b70 | grep owner
    owner = 0xffff88083fe0c7f0
crash> ps | grep ffff88083fe0c7f0
    5548   1   3  ffff88083fe0c7f0  UN  0.0  3120  1576  growfiles
crash> struct cl_lock.cll_guard ffff88083ba19ed0 | grep owner
    owner = 0xffff8808336497f0
crash> ps | grep ffff8808336497f0
    5543   1  12  ffff8808336497f0  UN  0.0  3120  1576  growfiles

> crash> for 5543 bt -f | grep -A 1 '#2'
> #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
>     ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70
> crash> for 5548 bt -f | grep -A 1 '#2'
> #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
>     ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0

So a deadlock exists between pids 5543 and 5548. All other growfiles tasks are waiting for one of these two pids.
                             Owner   Waiter
  cl_lock ffff88083bbf5b70    5548     5543
  cl_lock ffff88083ba19ed0    5543     5548

> crash> bt
> PID: 5548  TASK: ffff88083fe0c7f0  CPU: 3   COMMAND: "growfiles"
> #0 [ffff88083f505a68] schedule at ffffffff8144e6b7
> #1 [ffff88083f505bd0] __mutex_lock_slowpath at ffffffff8144fb0e
> #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
> #3 [ffff88083f505c60] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
> #4 [ffff88083f505c90] lov_lock_enqueue at ffffffffa07c077f [lov]
> #5 [ffff88083f505d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #6 [ffff88083f505d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
> #7 [ffff88083f505dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
> #8 [ffff88083f505e20] cl_glimpse_lock at ffffffffa089089f [lustre]
> #9 [ffff88083f505e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
> #10 [ffff88083f505ed0] ll_file_seek at ffffffffa083d988 [lustre]
> #11 [ffff88083f505f30] vfs_llseek at ffffffff81155eea
> #12 [ffff88083f505f40] sys_lseek at ffffffff8115604e
> #13 [ffff88083f505f80] system_call_fastpath at ffffffff814589ab
>
> PID: 5543  TASK: ffff8808336497f0  CPU: 12  COMMAND: "growfiles"
> #0 [ffff88083c4897c8] schedule at ffffffff8144e6b7
> #1 [ffff88083c489930] __mutex_lock_slowpath at ffffffff8144fb0e
> #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
> #3 [ffff88083c4899c0] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
> #4 [ffff88083c4899f0] osc_ldlm_completion_ast at ffffffffa072ea6f [osc]
> #5 [ffff88083c489a40] ldlm_lock_match at ffffffffa04a1477 [ptlrpc]
> #6 [ffff88083c489b20] osc_enqueue_base at ffffffffa07128f0 [osc]
> #7 [ffff88083c489bb0] osc_lock_enqueue at ffffffffa072ccb6 [osc]
> #8 [ffff88083c489c40] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #9 [ffff88083c489c90] lov_lock_enqueue at ffffffffa07c01d2 [lov]
> #10 [ffff88083c489d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
> #11 [ffff88083c489d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
> #12 [ffff88083c489dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
> #13 [ffff88083c489e20] cl_glimpse_lock at ffffffffa089089f [lustre]
> #14 [ffff88083c489e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
> #15 [ffff88083c489ed0] ll_file_seek at ffffffffa083d988 [lustre]
> #16 [ffff88083c489f30] vfs_llseek at ffffffff81155eea
> #17 [ffff88083c489f40] sys_lseek at ffffffff8115604e
> #18 [ffff88083c489f80] system_call_fastpath at ffffffff814589ab
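Reading the traces together with the owner/waiter table above: pid 5548 owns the cll_guard of cl_lock ffff88083bbf5b70 and blocks in cl_lock_mutex_get() on ffff88083ba19ed0 (called from lov_lock_enqueue), while pid 5543 owns ffff88083ba19ed0 and blocks on ffff88083bbf5b70 from osc_ldlm_completion_ast, reached through ldlm_lock_match during its own enqueue. The two paths therefore take the same pair of guard mutexes in opposite order. Purely as an illustration of the ordering problem (this is NOT the patch being tested below), the textbook way to make a two-guard acquisition cycle-free is to impose a single fixed order, e.g. by address:

#include <pthread.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: taking two guard mutexes in
 * ascending address order means no thread can hold the "higher" guard while
 * waiting for the "lower" one, which breaks the AB-BA cycle seen above.
 */
static void guards_lock_ordered(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a == b) {                          /* same lock: take it once */
        pthread_mutex_lock(a);
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {     /* normalize: lower address first */
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

static void guards_unlock_ordered(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);               /* unlock order does not matter */
    if (a != b)
        pthread_mutex_unlock(b);
}

Whether the actual fix uses an ordering rule or instead avoids re-taking the second guard from the completion AST is not shown here; the point is only that the two stacks above cannot both follow any single consistent ordering.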
The version of Lustre is 2.5.1 with some additional patches; in particular, the patch from LU-3027 (7841) has been reverted. The patch from LU-4558 (9876) is NOT included.
Attachments
Issue Links
- is related to LU-5225 "Client is evicted by multiple OSTs on all OSSs" (Resolved)
Jinshan - We've given this patch a bit of testing and seen no issues. We don't have a clear reproducer, so we can't say for certain that the bug is fixed, but we're pulling this into our branch of Lustre so it will see more exposure over the next few weeks.