Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8592

MDS crashed with ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 )

Details

    • 3
    • 9223372036854775807

    Description

      Error happened during soak testing of build '20160902' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160902)
      Configuration reads as:
      4 MDS with 1 MDT / MDS, formatted with ldiskfs and configured pairwise in active-active HA configuration
      6 OSS with 4 OST / OSS formatted with ldiskfs and configured pairwise in active-active HA configuration
      DNE is enabled

      Sequence of events

      • 2016-09-06 02:51:28,201:fsmgmt.fsmgmt:INFO triggering fault mds_failover ( lola-8 (mdt-0) --> lola-9)
      • 2016-09-06 03:41:33,479:fsmgmt.fsmgmt:INFO mds_failover just completed (mdt-0 failed back to lola-8)
      • 2016-09-06 03:41:17 MDS lola-11 crashed with error message:
        <0>LustreError: 6666:0:(lu_object.h:716:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: 
        <0>LustreError: 6666:0:(lu_object.h:716:lu_object_get()) LBUG
        <4>Pid: 6666, comm: mdt03_002
        <4>
        <4>Call Trace:
        <4> [<ffffffffa07f0875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
        <4> [<ffffffffa07f0e77>] lbug_with_loc+0x47/0xb0 [libcfs]
        <4> [<ffffffffa1203071>] mdt_remote_object_lock+0x491/0x4a0 [mdt]
        <4> [<ffffffffa12298a0>] mdt_reint_open+0x2b90/0x3180 [mdt]
        <4> [<ffffffffa1211ead>] mdt_reint_rec+0x5d/0x200 [mdt]
        <4> [<ffffffffa11fd5db>] mdt_reint_internal+0x62b/0xa50 [mdt]
        <4> [<ffffffffa11fdbf6>] mdt_intent_reint+0x1f6/0x440 [mdt]
        <4> [<ffffffffa11fb8be>] mdt_intent_policy+0x4be/0xc70 [mdt]
        <4> [<ffffffffa0ab77c7>] ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
        <4> [<ffffffffa0ae2c27>] ldlm_handle_enqueue0+0x807/0x14d0 [ptlrpc]
        <4> [<ffffffffa0b68b21>] tgt_enqueue+0x61/0x230 [ptlrpc]
        <4> [<ffffffffa0b69ccc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
        <4> [<ffffffffa0b16501>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
        <4> [<ffffffffa0b157d0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
        <4> [<ffffffff810a138e>] kthread+0x9e/0xc0
        <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
        <4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
        <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
        <4>
        <0>Kernel panic - not syncing: LBUG
        <4>Pid: 6666, comm: mdt03_002 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
        <4>Call Trace:
        <4> [<ffffffff81539407>] ? panic+0xa7/0x16f
        <4> [<ffffffffa07f0ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
        <4> [<ffffffffa1203071>] ? mdt_remote_object_lock+0x491/0x4a0 [mdt]
        <4> [<ffffffffa12298a0>] ? mdt_reint_open+0x2b90/0x3180 [mdt]
        <4> [<ffffffffa1211ead>] ? mdt_reint_rec+0x5d/0x200 [mdt]
        <4> [<ffffffffa11fd5db>] ? mdt_reint_internal+0x62b/0xa50 [mdt]
        <4> [<ffffffffa11fdbf6>] ? mdt_intent_reint+0x1f6/0x440 [mdt]
        <4> [<ffffffffa11fb8be>] ? mdt_intent_policy+0x4be/0xc70 [mdt]
        <4> [<ffffffffa0ab77c7>] ? ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
        <4> [<ffffffffa0ae2c27>] ? ldlm_handle_enqueue0+0x807/0x14d0 [ptlrpc]
        <4> [<ffffffffa0b68b21>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
        <4> [<ffffffffa0b69ccc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc]
        <4> [<ffffffffa0b16501>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc]
        <4> [<ffffffffa0b157d0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
        <4> [<ffffffff810a138e>] ? kthread+0x9e/0xc0
        <4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
        <4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
        <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
        

        Attached files:
        console, message logs of all MDS nodes; vmcore-dmesg.txt of lola-11.
        crash dump is available.

      Attachments

        1. console-lola-10.log.bz2
          54 kB
        2. console-lola-11.log.bz2
          54 kB
        3. console-lola-8.log.bz2
          94 kB
        4. console-lola-9.log.bz2
          63 kB
        5. lola-11-vmcore-dmesg.txt.bz2
          29 kB
        6. message-lola-10.log.bz2
          188 kB
        7. message-lola-11.log.bz2
          183 kB
        8. message-lola-8.log.bz2
          398 kB
        9. message-lola-9.log.bz2
          405 kB

        Activity

          [LU-8592] MDS crashed with ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 )
          pjones Peter Jones added a comment -

          Landed for 2.9

          pjones Peter Jones added a comment - Landed for 2.9

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22438/
          Subject: LU-8592 mdt: hold mdt_device::mdt_md_root until service stop
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 1bf196aaf5adb9a1c67886c0cd6a780ec6838040

          gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22438/ Subject: LU-8592 mdt: hold mdt_device::mdt_md_root until service stop Project: fs/lustre-release Branch: master Current Patch Set: Commit: 1bf196aaf5adb9a1c67886c0cd6a780ec6838040
          heckes Frank Heckes (Inactive) added a comment - - edited

          Installed build containing http://review.whamcloud.com/22438 patchset #4 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160916)
          Test session is running since 2016-09-17. No occurrence of this bug so far, but due to LU-8580 (happens with frequency of 2 hours) it's difficult to say whether your fix resolves the problem.

          heckes Frank Heckes (Inactive) added a comment - - edited Installed build containing http://review.whamcloud.com/22438 patchset #4 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160916 ) Test session is running since 2016-09-17. No occurrence of this bug so far, but due to LU-8580 (happens with frequency of 2 hours) it's difficult to say whether your fix resolves the problem.

          Sure, I'll do. Could take till Wednesday before I start, as I have to reproduce an error for EE-3.1 first.

          heckes Frank Heckes (Inactive) added a comment - Sure, I'll do. Could take till Wednesday before I start, as I have to reproduce an error for EE-3.1 first.

          Frank,

          Would you please to try this patch?
          Thanks!

          yong.fan nasf (Inactive) added a comment - Frank, Would you please to try this patch? Thanks!

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/22438
          Subject: LU-8592 dne: cache remote ibits lock properly
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 8bae476f5258e78be6cc7a58a19be8f95fc59209

          gerrit Gerrit Updater added a comment - Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/22438 Subject: LU-8592 dne: cache remote ibits lock properly Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 8bae476f5258e78be6cc7a58a19be8f95fc59209

          People

            yong.fan nasf (Inactive)
            heckes Frank Heckes (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: