Lustre / LU-8510

ASSERTION( dt->do_ops->do_invalidate ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.9.0
    • Fix Version: Lustre 2.9.0
    • Environment: CentOS Linux 7/x86_64
    • Severity: 3

    Description

      The following call stack occurred during autotesting on Maloo for http://review.whamcloud.com/#/c/20546/. My new test stands up 3 MDTs with non-consecutive indices and a couple of OSTs. The method I am using to start the "custom" filesystem seems to be consistent with how other tests start their "custom" filesystems.

      Link to the Maloo test session results is https://testing.hpdd.intel.com/test_sessions/4599d8d8-6108-11e6-906c-5254006e85c2.

      The LBUG is preventing the filesystem from coming up. Any suggestions?

       LustreError: 21374:0:(dt_object.h:2633:dt_invalidate()) ASSERTION( dt->do_ops->do_invalidate ) failed:
       LustreError: 21374:0:(dt_object.h:2633:dt_invalidate()) LBUG
       Pid: 21374, comm: mdt00_002 
      
       Call Trace:
        [<ffffffffa05e67d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
        [<ffffffffa05e6d75>] lbug_with_loc+0x45/0xc0 [libcfs]
        [<ffffffffa0ea8fcf>] lod_object_unlock+0x39f/0x440 [lod]
        [<ffffffffa0f11e1b>] mdd_object_unlock+0x3b/0xd0 [mdd]
        [<ffffffffa0ddbb62>] mdt_unlock_slaves+0x1a2/0x3c0 [mdt]
        [<ffffffffa0de3c72>] mdt_md_create+0xb52/0xba0 [mdt]
        [<ffffffffa0de3e2b>] mdt_reint_create+0x16b/0x350 [mdt]
        [<ffffffffa0de5330>] mdt_reint_rec+0x80/0x210 [mdt]
        [<ffffffffa0dc7d62>] mdt_reint_internal+0x5b2/0x9b0 [mdt]
        [<ffffffffa0dd3077>] mdt_reint+0x67/0x140 [mdt]
        [<ffffffffa0a69aa5>] tgt_request_handle+0x915/0x1320 [ptlrpc]
        [<ffffffffa0a15c5b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
        [<ffffffffa0a13818>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
        [<ffffffffa05f1957>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
        [<ffffffffa0a19d10>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
        [<ffffffffa0a19270>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
        [<ffffffff810a5aef>] kthread+0xcf/0xe0
        [<ffffffff810a5a20>] ? kthread+0x0/0xe0
        [<ffffffff816469d8>] ret_from_fork+0x58/0x90
        [<ffffffff810a5a20>] ? kthread+0x0/0xe0
      
       Kernel panic - not syncing: LBUG
      

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.9


          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22017/
          Subject: LU-8510 dne: set osd_obj_ea_ops::dt_invalidate
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 13364590a8c9ef64320f62b9937c01aaa6b6fa85


          heckes Frank Heckes (Inactive) added a comment -

          The soak test was executed with el6.7 build (https://build.hpdd.intel.com/job/lustre-master/3431/ tag 2.8.57)
          heckes Frank Heckes (Inactive) added a comment - - edited

          The same error also happened during soak testing of '20160902' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20160902)
          Test cluster configuration:
          4 MDS with 1 MDT / MDS, backend FS formatted with ldiskfs, in active-active HA configuration (affected node pair lola-[8,9])
          6 OSS with 4 OST / OSS, backend FS formatted with zfs, in active-active HA configuration

          The error message is the same apart from addresses (see attached file vmcore-dmesg.txt)

          Sequence of events

          • 2016-09-05 15:48:29 Node lola-8 crashed before the (injected-fault) failover of lola-9's resources (mdt11) to lola-8
          • 2016-09-05 15:55:04 lola-8 became available before the failover took place
          • 2016-09-05 16:25:38,987:fsmgmt.fsmgmt:INFO triggering fault mds_failover (lola-9)
          • 2016-09-05 16:35:39,398: mdt-1 successfully mounted on lola-8, but stalled in recovery due to missing MDT/MGT of lola-8
          • The MDT/MGT can't be mounted on the primary node (active-active HA configuration) anymore. The error message reads:
            [root@lola-8 ~]# date ; mount -t lustre -o rw,user_xattr /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0 ; date
            Tue Sep  6 02:04:13 PDT 2016
            mount.lustre: mount /dev/mapper/360080e50002ff4f00000026952013088p1 at /mnt/soaked-mdt0 failed: Input/output error
            Is the MGS running?
            Tue Sep  6 02:05:14 PDT 2016
            [root@lola-8 ~]# lctl debug_kernel /tmp/lustre-log-20160906-020514-mgs-mount-fails
            

            (Double-checked HW; IB and disk resources are operational and sane)
            After a manual umount of mdt-1 and a reboot of node lola-8, mdt-0 and mdt-1 could be mounted, recovery completed within 2 minutes, and the
            resource could be switched back to the primary node lola-9 again.
            This symptom is possibly a different bug that happens only by chance due to the node crash.

          Attached files: messages, console log, and vmcore-dmesg.txt of the affected node lola-8, plus a debug log (mask -1) containing debug information for the time interval while executing the mount command above.
          A crash dump file exists and has been stored to lhn.hpdd.intel.com:/scratch/crashdumps/lu-8510/lola-8/127.0.0.1-2016-09-05-15:48:29.


          Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/22017
          Subject: LU-8510 dne: set osd_obj_ea_ops::dt_invalidate
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: a9fae446db68cb1c34f2db949c875f30d5e93980

          pjones Peter Jones added a comment -

          Bobijam

          Could you please assist with this issue?

          Thanks

          Peter


          People

            Assignee: bobijam Zhenyu Xu
            Reporter: dinatale2 Giuseppe Di Natale (Inactive)
            Votes: 0
            Watchers: 5
