LU-7825

ldlm_lock.c:810:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_writers > 0

    Description

      The error happened during soak testing of build '20160224' (b2_8 RC2) (see:
      https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20150224). DNE is enabled.
      MDSes were formatted using ldiskfs, OSTs using zfs. MDSes are configured in an active-active HA failover configuration.

      Sequence of events:

      • 2016-02-27 02:04:02,121:fsmgmt.fsmgmt:INFO mds_failover just completed (lola-10 ---> lola-11)
      • Feb 27 02:06:44 lola-10 kernel: Lustre: soaked-MDT0005: Recovery over after 2:42, of 16 clients 14 recovered and 2 were evicted.
      • Feb 27 02:12:06 lola-10 kernel: Lustre: soaked-MDT0004: Recovery over after 8:02, of 16 clients 11 recovered and 5 were evicted.
      • 2016-02-27 02:12:58 lola-9 (different HA pair) crashed

      The error reads as:

      <0>LustreError: 5003:0:(ldlm_lock.c:810:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_writers > 0 ) failed: 
      <0>LustreError: 5003:0:(ldlm_lock.c:810:ldlm_lock_decref_internal_nolock()) LBUG
      <4>Pid: 5003, comm: mdt02_007
      <4>
      <4>Call Trace:
      <4> [<ffffffffa0748875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa0748e77>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0a2ef0f>] ldlm_lock_decref_internal_nolock+0x17f/0x180 [ptlrpc]
      <4> [<ffffffffa0a3102d>] ldlm_lock_decref_internal+0x4d/0xa80 [ptlrpc]
      <4> [<ffffffffa083f935>] ? class_handle2object+0x95/0x190 [obdclass]
      <4> [<ffffffffa0a325a0>] ldlm_lock_decref_and_cancel+0x80/0x150 [ptlrpc]
      <4> [<ffffffffa1164c67>] mdt_object_unlock+0xa7/0x2e0 [mdt]
      <4> [<ffffffffa11867ca>] mdt_reint_rename_or_migrate+0xf3a/0x2600 [mdt]
      <4> [<ffffffffa0ab7bdd>] ? null_alloc_rs+0xcd/0x320 [ptlrpc]
      <4> [<ffffffffa0876cbc>] ? upcall_cache_get_entry+0x29c/0x880 [obdclass]
      <4> [<ffffffffa087bbf0>] ? lu_ucred+0x20/0x30 [obdclass]
      <4> [<ffffffffa0a7d100>] ? lustre_pack_reply_v2+0x180/0x280 [ptlrpc]
      <4> [<ffffffffa117d50f>] ? ucred_set_jobid+0x5f/0x70 [mdt]
      <4> [<ffffffffa1187ec3>] mdt_reint_rename+0x13/0x20 [mdt]
      <4> [<ffffffffa118118d>] mdt_reint_rec+0x5d/0x200 [mdt]
      <4> [<ffffffffa116cddb>] mdt_reint_internal+0x62b/0x9f0 [mdt]
      <4> [<ffffffffa116d63b>] mdt_reint+0x6b/0x120 [mdt]
      <4> [<ffffffffa0ae0c2c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
      <4> [<ffffffffa0a8dc61>] ptlrpc_main+0xd21/0x1800 [ptlrpc]
      <4> [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
      <4> [<ffffffffa0a8cf40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      <4> [<ffffffff8109e78e>] kthread+0x9e/0xc0
      <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
      <4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 5003, comm: mdt02_007 Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff81529c9c>] ? panic+0xa7/0x16f
      <4> [<ffffffffa0748ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0a2ef0f>] ? ldlm_lock_decref_internal_nolock+0x17f/0x180 [ptlrpc]
      <4> [<ffffffffa0a3102d>] ? ldlm_lock_decref_internal+0x4d/0xa80 [ptlrpc]
      <4> [<ffffffffa083f935>] ? class_handle2object+0x95/0x190 [obdclass]
      <4> [<ffffffffa0a325a0>] ? ldlm_lock_decref_and_cancel+0x80/0x150 [ptlrpc]
      <4> [<ffffffffa1164c67>] ? mdt_object_unlock+0xa7/0x2e0 [mdt]
      <4> [<ffffffffa11867ca>] ? mdt_reint_rename_or_migrate+0xf3a/0x2600 [mdt]
      <4> [<ffffffffa0ab7bdd>] ? null_alloc_rs+0xcd/0x320 [ptlrpc]
      <4> [<ffffffffa0876cbc>] ? upcall_cache_get_entry+0x29c/0x880 [obdclass]
      <4> [<ffffffffa087bbf0>] ? lu_ucred+0x20/0x30 [obdclass]
      <4> [<ffffffffa0a7d100>] ? lustre_pack_reply_v2+0x180/0x280 [ptlrpc]
      <4> [<ffffffffa117d50f>] ? ucred_set_jobid+0x5f/0x70 [mdt]
      <4> [<ffffffffa1187ec3>] ? mdt_reint_rename+0x13/0x20 [mdt]
      <4> [<ffffffffa118118d>] ? mdt_reint_rec+0x5d/0x200 [mdt]
      <4> [<ffffffffa116cddb>] ? mdt_reint_internal+0x62b/0x9f0 [mdt]
      <4> [<ffffffffa116d63b>] ? mdt_reint+0x6b/0x120 [mdt]
      <4> [<ffffffffa0ae0c2c>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc]
      <4> [<ffffffffa0a8dc61>] ? ptlrpc_main+0xd21/0x1800 [ptlrpc]
      <4> [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
      <4> [<ffffffffa0a8cf40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      <4> [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
      <4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      Attached are the messages and console logs of MDS nodes lola-9 and lola-10, and also vmcore-dmesg.txt from lola-9.
      Crash file will be saved separately.

      Attachments

        1. console-lola-10.log.bz2
          392 kB
        2. console-lola-9.log.bz2
          608 kB
        3. lola-9-vmcore-dmesg.txt.bz2
          34 kB
        4. messages-lola-10.log.bz2
          310 kB
        5. messages-lola-9.log.bz2
          270 kB

        Activity

          jgmitter Joseph Gmitter (Inactive) added a comment - - edited

          Landed to master and b2_8. Is present in the 2.8.0 release.


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18707/
          Subject: LU-7825 mdt: release parent lock correctly for rename
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 30ece848c046dda5c450dc49c6b146360c077a22

          gerrit Gerrit Updater added a comment -

          wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18707
          Subject: LU-7825 mdt: release parent lock correctly for rename
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 6a240c713c30cd5b167d32e5c2a163f6b18d8ef6

          di.wang Di Wang added a comment -

          Hmm, it looks like the lock is not released correctly in the error handling path of mdt_reint_rename_internal(). I will cook a patch.

          heckes Frank Heckes (Inactive) added a comment -

          The crash file has been saved at lhn.hpdd.intel.com:/scratch/crashdumps/lu-7825/lola-9/127.0.0.1-2016-02-27-02\:12\:58/.

          People

            di.wang Di Wang
            heckes Frank Heckes (Inactive)