LU-4861: App hung - deadlock in cl_lock_mutex_get along cl_glimpse_lock path


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 2.5.1
    • None
    • Environment: Lustre 2.5.1 on both clients and servers.
    • 3
    • 13411

    Description

      We have several occurrences of applications hanging. Stack traces show the application processes waiting in cl_lock_mutex_get/mutex_lock on a code path through cl_glimpse_lock. All the dumps I've looked at show one of the processes calling osc_ldlm_completion_ast along the way. Two processes are deadlocked on two cl_lock.cll_guard mutexes. All other app processes are waiting for one of these two mutexes.

      > crash> bt -F | grep -A 1 '#2'
      >  #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
      >     ffff88083f505c48: [ccc_object_kmem] [cl_lock_kmem]
      
                                               addr(cl_lock)
      > crash> foreach growfiles bt -f | grep -A 1 '#2' | grep -v mutex_lock
      >     ffff88083f533c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70
      >     ffff8807eb48fd08: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083b4cfb98: ffff88083f684d98 ffff88083bbf5b70
      >     ffff8807ea2e1aa8: ffff88083f684d98 ffff88083bbf5b70
      >     ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083ff5fc28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff880833821d08: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff880833751c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083f5f1c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083e157c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff880833749c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083dfcbc28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083bd65c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff880833755c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff880833801c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083fd31c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff8807ed5a3c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff8807e0117c28: ffff8808350cdef8 ffff88083ba19ed0
      
      crash> struct cl_lock.cll_guard ffff88083bbf5b70 | grep owner
          owner = 0xffff88083fe0c7f0
      crash> ps | grep ffff88083fe0c7f0
         5548      1    3 ffff88083fe0c7f0  UN   0.0    3120   1576  growfiles
      
      crash> struct cl_lock.cll_guard ffff88083ba19ed0 | grep owner
          owner = 0xffff8808336497f0
      crash> ps | grep ffff8808336497f0
         5543      1   12 ffff8808336497f0  UN   0.0    3120   1576  growfiles
      
      > crash> for 5543 bt -f | grep -A 1 '#2'
      >  #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
      >     ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70 
      > crash> for 5548 bt -f | grep -A 1 '#2'
      >  #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
      >     ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0 
      
      So a deadlock exists between pids 5543 and 5548. All other growfiles tasks are waiting for one of these two pids.
                                 Owner     Waiter
      cl_lock ffff88083bbf5b70   5548      5543
      cl_lock ffff88083ba19ed0   5543      5548
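
      For illustration only, here is the same ABBA pattern in miniature as a plain user-space pthreads program (not Lustre code; the thread and mutex names are made up). Each thread takes the two mutexes in the opposite order, and both then block forever in pthread_mutex_lock(), just as the two growfiles pids block in mutex_lock() on the cll_guard mutexes:

      /*
       * User-space analogue of the ABBA deadlock summarized above.
       * Illustration only, not Lustre code; all names are made up.
       * t_5548 holds lock_a and then wants lock_b; t_5543 holds lock_b
       * and then wants lock_a. Both block forever in pthread_mutex_lock().
       */
      #include <pthread.h>
      #include <unistd.h>

      static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* cf. cl_lock ffff88083bbf5b70 */
      static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* cf. cl_lock ffff88083ba19ed0 */

      static void *t_5548(void *arg)
      {
              pthread_mutex_lock(&lock_a);    /* owner of A */
              sleep(1);                       /* widen the race window */
              pthread_mutex_lock(&lock_b);    /* waits forever: B is held by the other thread */
              pthread_mutex_unlock(&lock_b);
              pthread_mutex_unlock(&lock_a);
              return NULL;
      }

      static void *t_5543(void *arg)
      {
              pthread_mutex_lock(&lock_b);    /* owner of B */
              sleep(1);
              pthread_mutex_lock(&lock_a);    /* waits forever: A is held by the other thread */
              pthread_mutex_unlock(&lock_a);
              pthread_mutex_unlock(&lock_b);
              return NULL;
      }

      int main(void)
      {
              pthread_t t1, t2;

              pthread_create(&t1, NULL, t_5548, NULL);
              pthread_create(&t2, NULL, t_5543, NULL);
              pthread_join(t1, NULL);         /* never returns once the deadlock forms */
              pthread_join(t2, NULL);
              return 0;
      }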
      
      > crash> bt
      > PID: 5548   TASK: ffff88083fe0c7f0  CPU: 3   COMMAND: "growfiles"
      >  #0 [ffff88083f505a68] schedule at ffffffff8144e6b7
      >  #1 [ffff88083f505bd0] __mutex_lock_slowpath at ffffffff8144fb0e
      >  #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
      >  #3 [ffff88083f505c60] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
      >  #4 [ffff88083f505c90] lov_lock_enqueue at ffffffffa07c077f [lov]
      >  #5 [ffff88083f505d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
      >  #6 [ffff88083f505d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
      >  #7 [ffff88083f505dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
      >  #8 [ffff88083f505e20] cl_glimpse_lock at ffffffffa089089f [lustre]
      >  #9 [ffff88083f505e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
      > #10 [ffff88083f505ed0] ll_file_seek at ffffffffa083d988 [lustre]
      > #11 [ffff88083f505f30] vfs_llseek at ffffffff81155eea
      > #12 [ffff88083f505f40] sys_lseek at ffffffff8115604e
      > #13 [ffff88083f505f80] system_call_fastpath at ffffffff814589ab
      
      > PID: 5543   TASK: ffff8808336497f0  CPU: 12  COMMAND: "growfiles"
      >  #0 [ffff88083c4897c8] schedule at ffffffff8144e6b7
      >  #1 [ffff88083c489930] __mutex_lock_slowpath at ffffffff8144fb0e
      >  #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
      >  #3 [ffff88083c4899c0] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
      >  #4 [ffff88083c4899f0] osc_ldlm_completion_ast at ffffffffa072ea6f [osc]
      >  #5 [ffff88083c489a40] ldlm_lock_match at ffffffffa04a1477 [ptlrpc]
      >  #6 [ffff88083c489b20] osc_enqueue_base at ffffffffa07128f0 [osc]
      >  #7 [ffff88083c489bb0] osc_lock_enqueue at ffffffffa072ccb6 [osc]
      >  #8 [ffff88083c489c40] cl_enqueue_try at ffffffffa03abffb [obdclass]
      >  #9 [ffff88083c489c90] lov_lock_enqueue at ffffffffa07c01d2 [lov]
      > #10 [ffff88083c489d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
      > #11 [ffff88083c489d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
      > #12 [ffff88083c489dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
      > #13 [ffff88083c489e20] cl_glimpse_lock at ffffffffa089089f [lustre]
      > #14 [ffff88083c489e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
      > #15 [ffff88083c489ed0] ll_file_seek at ffffffffa083d988 [lustre]
      > #16 [ffff88083c489f30] vfs_llseek at ffffffff81155eea
      > #17 [ffff88083c489f40] sys_lseek at ffffffff8115604e
      > #18 [ffff88083c489f80] system_call_fastpath at ffffffff814589ab
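
      Both backtraces enter the client through sys_lseek -> ll_file_seek and then take the glimpse path (cl_glimpse_size0 -> cl_glimpse_lock -> cl_lock_request) because the seek needs an up-to-date file size. A minimal user-space sketch of that trigger shape follows; the mount point, file name, and use of SEEK_END are assumptions for illustration, not the actual growfiles workload:

      /*
       * Sketch of the syscall shape seen in both stacks (assumption: a
       * size-dependent lseek on a Lustre file; the path name is made up).
       * SEEK_END forces the client to fetch the current file size, which is
       * the sys_lseek -> ll_file_seek -> cl_glimpse_size0 -> cl_glimpse_lock
       * path shown above.
       */
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
              const char *path = argc > 1 ? argv[1] : "/mnt/lustre/testfile";
              int fd = open(path, O_RDONLY);
              off_t end;

              if (fd < 0) {
                      perror("open");
                      return 1;
              }
              end = lseek(fd, 0, SEEK_END);   /* needs the file size -> glimpse */
              printf("apparent size: %lld bytes\n", (long long)end);
              close(fd);
              return 0;
      }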
      

      The version of Lustre is 2.5.1 with some additional patches; in particular, the patch from LU-3027 (change 7841) has been reverted. The patch from LU-4558 (change 9876) is NOT included.

            People

              Assignee: Jinshan Xiong (Inactive)
              Reporter: Ann Koehler (Inactive)