LU-8592: MDS crashed with ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 )

Details


    Description

      The error happened during soak testing of build '20160902' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160902).
      The configuration is as follows:
      4 MDS nodes with 1 MDT per MDS, formatted with ldiskfs and configured pairwise in an active-active HA configuration
      6 OSS nodes with 4 OSTs per OSS, formatted with ldiskfs and configured pairwise in an active-active HA configuration
      DNE is enabled

      Sequence of events

      • 2016-09-06 02:51:28,201:fsmgmt.fsmgmt:INFO triggering fault mds_failover ( lola-8 (mdt-0) --> lola-9)
      • 2016-09-06 03:41:33,479:fsmgmt.fsmgmt:INFO mds_failover just completed (mdt-0 failed back to lola-8)
      • 2016-09-06 03:41:17 MDS lola-11 crashed with error message:
        <0>LustreError: 6666:0:(lu_object.h:716:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: 
        <0>LustreError: 6666:0:(lu_object.h:716:lu_object_get()) LBUG
        <4>Pid: 6666, comm: mdt03_002
        <4>
        <4>Call Trace:
        <4> [<ffffffffa07f0875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
        <4> [<ffffffffa07f0e77>] lbug_with_loc+0x47/0xb0 [libcfs]
        <4> [<ffffffffa1203071>] mdt_remote_object_lock+0x491/0x4a0 [mdt]
        <4> [<ffffffffa12298a0>] mdt_reint_open+0x2b90/0x3180 [mdt]
        <4> [<ffffffffa1211ead>] mdt_reint_rec+0x5d/0x200 [mdt]
        <4> [<ffffffffa11fd5db>] mdt_reint_internal+0x62b/0xa50 [mdt]
        <4> [<ffffffffa11fdbf6>] mdt_intent_reint+0x1f6/0x440 [mdt]
        <4> [<ffffffffa11fb8be>] mdt_intent_policy+0x4be/0xc70 [mdt]
        <4> [<ffffffffa0ab77c7>] ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
        <4> [<ffffffffa0ae2c27>] ldlm_handle_enqueue0+0x807/0x14d0 [ptlrpc]
        <4> [<ffffffffa0b68b21>] tgt_enqueue+0x61/0x230 [ptlrpc]
        <4> [<ffffffffa0b69ccc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
        <4> [<ffffffffa0b16501>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
        <4> [<ffffffffa0b157d0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
        <4> [<ffffffff810a138e>] kthread+0x9e/0xc0
        <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
        <4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
        <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
        <4>
        <0>Kernel panic - not syncing: LBUG
        <4>Pid: 6666, comm: mdt03_002 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
        <4>Call Trace:
        <4> [<ffffffff81539407>] ? panic+0xa7/0x16f
        <4> [<ffffffffa07f0ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
        <4> [<ffffffffa1203071>] ? mdt_remote_object_lock+0x491/0x4a0 [mdt]
        <4> [<ffffffffa12298a0>] ? mdt_reint_open+0x2b90/0x3180 [mdt]
        <4> [<ffffffffa1211ead>] ? mdt_reint_rec+0x5d/0x200 [mdt]
        <4> [<ffffffffa11fd5db>] ? mdt_reint_internal+0x62b/0xa50 [mdt]
        <4> [<ffffffffa11fdbf6>] ? mdt_intent_reint+0x1f6/0x440 [mdt]
        <4> [<ffffffffa11fb8be>] ? mdt_intent_policy+0x4be/0xc70 [mdt]
        <4> [<ffffffffa0ab77c7>] ? ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
        <4> [<ffffffffa0ae2c27>] ? ldlm_handle_enqueue0+0x807/0x14d0 [ptlrpc]
        <4> [<ffffffffa0b68b21>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
        <4> [<ffffffffa0b69ccc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc]
        <4> [<ffffffffa0b16501>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc]
        <4> [<ffffffffa0b157d0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
        <4> [<ffffffff810a138e>] ? kthread+0x9e/0xc0
        <4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
        <4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
        <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
        

        Attached files:
        console and message logs of all MDS nodes; vmcore-dmesg.txt of lola-11.
        A crash dump is available.
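
        For reference, the failing check is the LASSERT() in lu_object_get(); a simplified sketch of that helper (based on lustre/include/lu_object.h, shown here only to illustrate what the assertion means) looks like:

        static inline void lu_object_get(struct lu_object *o)
        {
                /* Taking a new reference is only legal while the object is
                 * still live; loh_ref == 0 means it is already being freed. */
                LASSERT(atomic_read(&o->lo_header->loh_ref) > 0);
                atomic_inc(&o->lo_header->loh_ref);
        }

        So the crash means the mdt_reint_open() -> mdt_remote_object_lock() path tried to take a reference on an object whose last reference had already been dropped.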

      Attachments

        1. console-lola-10.log.bz2
          54 kB
        2. console-lola-11.log.bz2
          54 kB
        3. console-lola-8.log.bz2
          94 kB
        4. console-lola-9.log.bz2
          63 kB
        5. lola-11-vmcore-dmesg.txt.bz2
          29 kB
        6. message-lola-10.log.bz2
          188 kB
        7. message-lola-11.log.bz2
          183 kB
        8. message-lola-8.log.bz2
          398 kB
        9. message-lola-9.log.bz2
          405 kB

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.9


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22438/
          Subject: LU-8592 mdt: hold mdt_device::mdt_md_root until service stop
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 1bf196aaf5adb9a1c67886c0cd6a780ec6838040

          heckes Frank Heckes (Inactive) added a comment - - edited

          Installed a build containing http://review.whamcloud.com/22438 patch set #4 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160916).
          The test session has been running since 2016-09-17. There has been no occurrence of this bug so far, but due to LU-8580 (which happens roughly every 2 hours) it is difficult to say whether your fix resolves the problem.


          heckes Frank Heckes (Inactive) added a comment -

          Sure, I'll do that. It could take until Wednesday before I start, as I have to reproduce an error for EE-3.1 first.


          yong.fan nasf (Inactive) added a comment -

          Frank,

          Would you please try this patch?
          Thanks!


          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/22438
          Subject: LU-8592 dne: cache remote ibits lock properly
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 8bae476f5258e78be6cc7a58a19be8f95fc59209

          yong.fan nasf (Inactive) added a comment - - edited

          The issue may be related to the patch http://review.whamcloud.com/#/c/19041. That patch caches metadata attributes on the remote MDT. To invalidate the cached attributes, it makes the remote ibits lock callback data reference the cached object. That reference is released when mdt_remote_blocking_ast() is triggered.

          According to the logs on lola-11, just before the ASSERTION, the OSP detected an exception:

          <3>LustreError: 167-0: soaked-MDT0000-osp-MDT0003: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
          <3>LustreError: 32471:0:(ldlm_resource.c:878:ldlm_resource_complain()) soaked-MDT0000-osp-MDT0003: namespace resource [0x2c00013a4:0x97e0:0x0].0x0 (ffff88078ba0f2c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
          

          That means the connection from MDT3 to MDT0 was evicted by MDT0, and the resulting IMP_EVENT_INVALIDATE event triggered ldlm_namespace_cleanup(). Unfortunately, at that time, some upper-layer user was still referencing the resource [0x2c00013a4:0x97e0:0x0]. It is suspected that this resource (object) was cleaned up by force, which left the related object's reference count wrong, so the subsequent mdt_remote_object_lock() hit the ASSERTION. Currently, I have no detailed scenario to reproduce the failure, but I will make a patch to strengthen the related logic and check what happens.
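
          To make the suspected sequence easier to follow, here is a small, purely hypothetical user-space model (not Lustre code; all names are made up) of how a forced cleanup that ignores an outstanding pin can lead a later "get" to trip a reference-count assertion:

          /* Hypothetical user-space model of the suspected race; not Lustre code. */
          #include <assert.h>
          #include <stdio.h>

          struct object {
                  int refcount;                     /* models lo_header->loh_ref */
          };

          static struct object *object_get(struct object *o)
          {
                  assert(o->refcount > 0);          /* models LASSERT() in lu_object_get() */
                  o->refcount++;
                  return o;
          }

          static void object_put(struct object *o)
          {
                  if (--o->refcount == 0)
                          printf("object freed\n");
          }

          int main(void)
          {
                  struct object cached = { .refcount = 1 };    /* cache holds one reference */
                  struct object *pin = object_get(&cached);    /* lock callback data pins it */

                  /* Normal path: the blocking AST would drop the pin here.
                   * Modelled failure: eviction-driven forced cleanup drops both
                   * references even though the callback data still points at it. */
                  object_put(&cached);
                  object_put(&cached);

                  /* Later, a new lock attempt takes a reference through the stale
                   * pointer and the refcount assertion fires, as in the crash. */
                  object_get(pin);
                  return 0;
          }

          In the real crash the equivalent of that last step would be mdt_remote_object_lock() calling lu_object_get() on the force-cleaned object.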


          heckes Frank Heckes (Inactive) added a comment -

          The crash dump has been saved to lhn.hpdd.intel.com:/scratch/crashdumps/lu-8592/lola-11/27.0.0.1-2016-09-06-03:41:17


          People

            yong.fan nasf (Inactive)
            heckes Frank Heckes (Inactive)
            Votes:
            0
            Watchers:
            8
