[LU-4861] App hung - deadlock in cl_lock_mutex_get along cl_glimpse_lock path

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 2.5.1
    • Labels: None
    • Environment: Lustre 2.5.1 on both clients and servers.
    • Severity: 3

    Description

      We have several occurrences of applications hanging. Stack traces show the application processes waiting in cl_lock_mutex_get/mutex_lock on a code path through cl_glimpse_lock. All the dumps I've looked at show one of the processes calling osc_ldlm_completion_ast along the way. Two processes are deadlocked on two cl_lock.cll_guard mutexes; all other app processes are waiting for one of these two mutexes.

      > crash> bt -F | grep -A 1 '#2'
      >  #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
      >     ffff88083f505c48: [ccc_object_kmem] [cl_lock_kmem]
      
      (With bt -F, stack words that fall inside slab objects are shown as their cache names; the second word, from cl_lock_kmem, is the address of the cl_lock whose cll_guard mutex the task is blocked on.)
      > crash> foreach growfiles bt -f | grep -A 1 '#2' | grep -v mutex_lock
      >     ffff88083f533c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70
      >     ffff8807eb48fd08: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083b4cfb98: ffff88083f684d98 ffff88083bbf5b70
      >     ffff8807ea2e1aa8: ffff88083f684d98 ffff88083bbf5b70
      >     ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083ff5fc28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff880833821d08: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff880833751c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083f5f1c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083e157c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff880833749c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083dfcbc28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083bd65c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff880833755c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff880833801c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff88083fd31c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff8807ed5a3c28: ffff8808350cdef8 ffff88083ba19ed0
      >     ffff8807e0117c28: ffff8808350cdef8 ffff88083ba19ed0
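      
      As a quick aside on reducing that output: if the same foreach output is saved to a plain file (call it waiters.txt, a hypothetical name), a small shell pipeline can tally how many tasks are queued on each cl_lock, since the last word of every stack line is the cl_lock address per the annotation above. A minimal sketch:
      
          # Count waiters per cl_lock address (last field of each stack line).
          # "waiters.txt" is a hypothetical capture of the foreach output; the
          # first grep drops the "--" separators that grep -A inserts.
          grep -E '^ *ffff' waiters.txt | awk '{ print $NF }' | sort | uniq -c | sort -rn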
      
      crash> struct cl_lock.cll_guard ffff88083bbf5b70 | grep owner
          owner = 0xffff88083fe0c7f0
      crash> ps | grep ffff88083fe0c7f0
         5548      1    3 ffff88083fe0c7f0  UN   0.0    3120   1576  growfiles
      
      crash> struct cl_lock.cll_guard ffff88083ba19ed0 | grep owner
          owner = 0xffff8808336497f0
      crash> ps | grep ffff8808336497f0
         5543      1   12 ffff8808336497f0  UN   0.0    3120   1576  growfiles
      
      > crash> for 5543 bt -f | grep -A 1 '#2'
      >  #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
      >     ffff88083c4899a8: ffff88083f684d98 ffff88083bbf5b70 
      > crash> for 5548 bt -f | grep -A 1 '#2'
      >  #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
      >     ffff88083f505c48: ffff8808350cdef8 ffff88083ba19ed0 
      
      So a deadlock exists between pids 5543 and 5548. All other growfiles tasks are waiting for one of these two pids.
                                 Owner     Waiter
      cl_lock ffff88083bbf5b70   5548      5543
      cl_lock ffff88083ba19ed0   5543      5548
      
      > crash> bt
      > PID: 5548   TASK: ffff88083fe0c7f0  CPU: 3   COMMAND: "growfiles"
      >  #0 [ffff88083f505a68] schedule at ffffffff8144e6b7
      >  #1 [ffff88083f505bd0] __mutex_lock_slowpath at ffffffff8144fb0e
      >  #2 [ffff88083f505c40] mutex_lock at ffffffff8144f533
      >  #3 [ffff88083f505c60] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
      >  #4 [ffff88083f505c90] lov_lock_enqueue at ffffffffa07c077f [lov]
      >  #5 [ffff88083f505d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
      >  #6 [ffff88083f505d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
      >  #7 [ffff88083f505dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
      >  #8 [ffff88083f505e20] cl_glimpse_lock at ffffffffa089089f [lustre]
      >  #9 [ffff88083f505e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
      > #10 [ffff88083f505ed0] ll_file_seek at ffffffffa083d988 [lustre]
      > #11 [ffff88083f505f30] vfs_llseek at ffffffff81155eea
      > #12 [ffff88083f505f40] sys_lseek at ffffffff8115604e
      > #13 [ffff88083f505f80] system_call_fastpath at ffffffff814589ab
      
      > PID: 5543   TASK: ffff8808336497f0  CPU: 12  COMMAND: "growfiles"
      >  #0 [ffff88083c4897c8] schedule at ffffffff8144e6b7
      >  #1 [ffff88083c489930] __mutex_lock_slowpath at ffffffff8144fb0e
      >  #2 [ffff88083c4899a0] mutex_lock at ffffffff8144f533
      >  #3 [ffff88083c4899c0] cl_lock_mutex_get at ffffffffa03aa046 [obdclass]
      >  #4 [ffff88083c4899f0] osc_ldlm_completion_ast at ffffffffa072ea6f [osc]
      >  #5 [ffff88083c489a40] ldlm_lock_match at ffffffffa04a1477 [ptlrpc]
      >  #6 [ffff88083c489b20] osc_enqueue_base at ffffffffa07128f0 [osc]
      >  #7 [ffff88083c489bb0] osc_lock_enqueue at ffffffffa072ccb6 [osc]
      >  #8 [ffff88083c489c40] cl_enqueue_try at ffffffffa03abffb [obdclass]
      >  #9 [ffff88083c489c90] lov_lock_enqueue at ffffffffa07c01d2 [lov]
      > #10 [ffff88083c489d30] cl_enqueue_try at ffffffffa03abffb [obdclass]
      > #11 [ffff88083c489d80] cl_enqueue_locked at ffffffffa03aceef [obdclass]
      > #12 [ffff88083c489dc0] cl_lock_request at ffffffffa03adb0e [obdclass]
      > #13 [ffff88083c489e20] cl_glimpse_lock at ffffffffa089089f [lustre]
      > #14 [ffff88083c489e80] cl_glimpse_size0 at ffffffffa0890d4d [lustre]
      > #15 [ffff88083c489ed0] ll_file_seek at ffffffffa083d988 [lustre]
      > #16 [ffff88083c489f30] vfs_llseek at ffffffff81155eea
      > #17 [ffff88083c489f40] sys_lseek at ffffffff8115604e
      > #18 [ffff88083c489f80] system_call_fastpath at ffffffff814589ab
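      
      For readers less familiar with this class of hang: the two stacks above form a textbook ABBA lock-ordering inversion on the two cll_guard mutexes. Each task holds the mutex the other one needs, so neither can make progress, and every other glimpse enqueue piles up behind them. The fragment below is not Lustre code, just a minimal illustrative pthread sketch of the same shape (task and lock names chosen to mirror the pids and cl_locks above):
      
          /* Illustrative only: two threads take the same pair of mutexes in
           * opposite order, the way pid 5548 holds one cll_guard and waits on
           * the other while pid 5543 does the reverse.
           * Build with: cc -pthread abba.c -o abba */
          #include <pthread.h>
          #include <stdio.h>
          #include <unistd.h>
          
          static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* cf. cl_lock ...5b70 */
          static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* cf. cl_lock ...9ed0 */
          
          static void *task_5548(void *arg)
          {
              (void)arg;
              pthread_mutex_lock(&lock_a);   /* owns A ...                    */
              sleep(1);                      /* give the peer time to grab B  */
              pthread_mutex_lock(&lock_b);   /* ... then waits forever on B   */
              pthread_mutex_unlock(&lock_b);
              pthread_mutex_unlock(&lock_a);
              return NULL;
          }
          
          static void *task_5543(void *arg)
          {
              (void)arg;
              pthread_mutex_lock(&lock_b);   /* owns B ...                    */
              sleep(1);
              pthread_mutex_lock(&lock_a);   /* ... then waits forever on A   */
              pthread_mutex_unlock(&lock_a);
              pthread_mutex_unlock(&lock_b);
              return NULL;
          }
          
          int main(void)
          {
              pthread_t t1, t2;
          
              pthread_create(&t1, NULL, task_5548, NULL);
              pthread_create(&t2, NULL, task_5543, NULL);
              pthread_join(t1, NULL);        /* never returns: ABBA deadlock  */
              pthread_join(t2, NULL);
              puts("unreached");
              return 0;
          }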
      

      The version of Lustre is 2.5.1 with some additional patches; in particular, the patch from LU-3027 (change 7841) has been reverted. The patch from LU-4558 (change 9876) is NOT included.

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.6

            paf Patrick Farrell (Inactive) added a comment -

            Jinshan - We've given this patch a bit of testing, and seen no issues. We don't have a clear reproducer, so we can't say the bug is fixed for sure, but we're pulling this in to our branch of Lustre so it will see more exposure over the next few weeks.

            paf Patrick Farrell (Inactive) added a comment -

            We'll try the patch. Separately, because Jinshan asked, the logs from the dump I uploaded are here:
            ftp.whamcloud.com:/uploads/LU-4861/logs.tar.gz

            jay Jinshan Xiong (Inactive) added a comment -

            Please try patch: http://review.whamcloud.com/10581

            jay Jinshan Xiong (Inactive) added a comment -

            I will take a look at this issue.

            paf Patrick Farrell (Inactive) added a comment -

            I hit this bug last night during stress testing of current, unmodified master on SLES11SP3.

            Most recent commit on the branch being tested:
            commit 864fc9daac267819f5e3bdebef6cdac4c6325626
            Author: Mikhail Pershin <mike.pershin@intel.com>
            Date: Tue May 13 18:34:03 2014 +0400

            LU-2059 mgs: don't fail on missing params log
            -----------------------------------------

            Uploaded dump to ftp.whamcloud.com:/uploads/LU-4861/master_140520_dump_upload.tar.gz

            Note there is also a master_140520_dump.tar.gz in that same directory, which is a failed transfer.

            amk Ann Koehler (Inactive) added a comment -

            FYI, we've seen this same deadlock again with the patch http://review.whamcloud.com/9881 from LU-4558 applied.

            jay Jinshan Xiong (Inactive) added a comment -

            Thanks for the coredump, I will take a look.

            amk Ann Koehler (Inactive) added a comment -

            Uploaded dump to ftp.whamcloud.com:/uploads/LU-4861/bug810344_cllock_deadlock.tgz

            This ticket is primarily "for your information". Seems like there's a reasonable chance that the cause of the deadlock is related to the root cause of LU-4591. I'm passing on the dump in case it helps solve LU-4591, and in case it's not related, we're working the issue through Xyratex in parallel.

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: amk Ann Koehler (Inactive)
              Votes: 0
              Watchers: 9
