Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.7.0
    • Lustre 2.7.0
    • OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each and three clients. Lustre master tag 2.6.90 build #2734
    • 3
    • 16512

    Description

      While running test 3.3.2 from the LFSCK Phase 3 test plan, both MDSs crashed during LFSCK.

      Test 3.3.2 calls for creating a number of subdirectories and creating a variety of objects in each subdirectory including files, multiply linked files, and local and remote subdirectories; local and remote in terms of the MDS. The test plan calls for setting fail_loc to 1603 so that all objects created will have no linkEA. Creating the objects work with this failure injected, but the both MDSs crash when ‘lctl lfsck_start’ is called.

      On the main MDS, with index 0, we call ‘lctl lfsck_start -M scratch-MDT0000 -A -c -C --reset --type namespace’. LFSCK starts and soon after the MDSs crash with the following on their consoles:

      Message from syslogd@mds01-ib at Nov 12 12:44:24 ...
       kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: 
      
      Message from syslogd@mds01-ib at Nov 12 12:44:24 ...
       kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG
      

      From the crash dmesg log on MDS1, we see:

      <6>Lustre: *** cfs_fail_loc=1603, val=0***
      <6>Lustre: Skipped 97 previous similar messages
      <0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: 
      <0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG
      <4>Pid: 14961, comm: lfsck
      <4>
      <4>Call Trace:
      <4> [<ffffffffa044c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa044ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0ed5998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
      <4> [<ffffffffa0e961c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa0e8b1c3>] lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa0e8e577>] lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa0e8fd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa0e9146e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 14961, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_6
      4 #1
      <4>Call Trace:
      <4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
      <4> [<ffffffffa044ceeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0ed5998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsc
      k]
      <4> [<ffffffffa0e961c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa0e8b1c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa0e8e577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa0e8fd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa0e9146e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      From the dmesg log from the crash on the second MDS looks a little different with an error from lfsck_namespace_rebuild_linkea():

      <6>Lustre: *** cfs_fail_loc=1603, val=0***
      <6>Lustre: *** cfs_fail_loc=1603, val=0***
      <6>Lustre: Skipped 6 previous similar messages
      <0>LustreError: 6700:0:(lfsck_namespace.c:1932:lfsck_namespace_rebuild_linkea())
       ASSERTION( !dt_object_remote(obj) ) failed: 
      <0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_op
      s ) failed: 
      <0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) LBUG
      <4>Pid: 6698, comm: lfsck
      <4>
      <4>Call Trace:
      <4> [<ffffffffa083b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa083be97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa09a1998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
      <4> [<ffffffffa09621c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa09571c3>] lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa095a577>] lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa095bd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa095d46e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 6698, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64
       #1
      <4>Call Trace:
      <4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
      <4> [<ffffffffa083beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa09a1998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsc
      k]
      <4> [<ffffffffa09621c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa09571c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa095a577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa095bd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa095d46e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Attachments

        Issue Links

          Activity

            [LU-5914] LFSCK: dt_lookup()) LBUG

            Patch landed to Master.

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master.

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13801/
            Subject: LU-5914 lfsck: dt_index_try before dt_lookup
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5bae33a95e6330af35e9749356f31a7569ec1e09

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13801/ Subject: LU-5914 lfsck: dt_index_try before dt_lookup Project: fs/lustre-release Branch: master Current Patch Set: Commit: 5bae33a95e6330af35e9749356f31a7569ec1e09

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13801
            Subject: LU-5914 lfsck: dt_index_try before dt_lookup
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5942806bc97612b8d87560e70fde05111aa1aba1

            gerrit Gerrit Updater added a comment - Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13801 Subject: LU-5914 lfsck: dt_index_try before dt_lookup Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 5942806bc97612b8d87560e70fde05111aa1aba1

            I believe we just hit this on soak testing on Lola-8
            LustreError: 4782:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: ^M
            LustreError: 4782:0:(dt_object.h:2700:dt_lookup()) LBUG^M
            Pid: 4782, comm: lfsck_namespace^M
            ^M
            Call Trace:^M
            [<ffffffffa055c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
            [<ffffffffa055ce97>] lbug_with_loc+0x47/0xb0 [libcfs]^M
            [<ffffffffa0fd6344>] lfsck_namespace_check_exist+0x364/0x410 [lfsck]^M
            [<ffffffffa0fd9551>] lfsck_namespace_assistant_handler_p1+0xb41/0x2250 [lfsck]^M
            [<ffffffff810623a9>] ? find_busiest_queue+0x69/0x150^M
            [<ffffffffa0fc96e7>] lfsck_assistant_engine+0x497/0x1df0 [lfsck]^M
            [<ffffffff81061d00>] ? default_wake_function+0x0/0x20^M
            [<ffffffffa0fc9250>] ? lfsck_assistant_engine+0x0/0x1df0 [lfsck]^M
            [<ffffffff8109abf6>] kthread+0x96/0xa0^M
            [<ffffffff8100c20a>] child_rip+0xa/0x20^M
            [<ffffffff8109ab60>] ? kthread+0x0/0xa0^M
            [<ffffffff8100c200>] ? child_rip+0x0/0x20^M
            ^M
            We are running 2.6.94, with some patches from Li Wei, see the lola soak page

            cliffw Cliff White (Inactive) added a comment - I believe we just hit this on soak testing on Lola-8 LustreError: 4782:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: ^M LustreError: 4782:0:(dt_object.h:2700:dt_lookup()) LBUG^M Pid: 4782, comm: lfsck_namespace^M ^M Call Trace:^M [<ffffffffa055c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] ^M [<ffffffffa055ce97>] lbug_with_loc+0x47/0xb0 [libcfs] ^M [<ffffffffa0fd6344>] lfsck_namespace_check_exist+0x364/0x410 [lfsck] ^M [<ffffffffa0fd9551>] lfsck_namespace_assistant_handler_p1+0xb41/0x2250 [lfsck] ^M [<ffffffff810623a9>] ? find_busiest_queue+0x69/0x150^M [<ffffffffa0fc96e7>] lfsck_assistant_engine+0x497/0x1df0 [lfsck] ^M [<ffffffff81061d00>] ? default_wake_function+0x0/0x20^M [<ffffffffa0fc9250>] ? lfsck_assistant_engine+0x0/0x1df0 [lfsck] ^M [<ffffffff8109abf6>] kthread+0x96/0xa0^M [<ffffffff8100c20a>] child_rip+0xa/0x20^M [<ffffffff8109ab60>] ? kthread+0x0/0xa0^M [<ffffffff8100c200>] ? child_rip+0x0/0x20^M ^M We are running 2.6.94, with some patches from Li Wei, see the lola soak page

            Patch landed to Master.

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master.

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13188/
            Subject: LU-5914 lfsck: dt_lookup() LBUG
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4950f42fb6286f4cf5e6092593acabde76a8f093

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13188/ Subject: LU-5914 lfsck: dt_lookup() LBUG Project: fs/lustre-release Branch: master Current Patch Set: Commit: 4950f42fb6286f4cf5e6092593acabde76a8f093

            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/13188
            Subject: LU-5914 lfsck: dt_lookup() LBUG
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 01ebcf55122923e3c89ddde42766cb76eca142ef

            gerrit Gerrit Updater added a comment - Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/13188 Subject: LU-5914 lfsck: dt_lookup() LBUG Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 01ebcf55122923e3c89ddde42766cb76eca142ef

            Lai, you are right. The "LASSERT(!dt_object_remote(obj));" in lfsck_namespace_rebuild_linkea() is wrong, should be removed. But I still not sure how the "dt_lookup()) LBUG" can be triggered.

            yong.fan nasf (Inactive) added a comment - Lai, you are right. The "LASSERT(!dt_object_remote(obj));" in lfsck_namespace_rebuild_linkea() is wrong, should be removed. But I still not sure how the "dt_lookup()) LBUG" can be triggered.
            laisiyao Lai Siyao added a comment -
            #20 [ffff88102c4b7a48] lfsck_namespace_rebuild_linkea at ffffffffa0ece540 [lfsck]
            #21 [ffff88102c4b7ab8] lfsck_namespace_striped_dir_rescan at ffffffffa0f07690 [lfsck]
            #22 [ffff88102c4b7c58] lfsck_namespace_assistant_handler_p1 at ffffffffa0ed8ec3 [lfsck]
            #23 [ffff88102c4b7d78] lfsck_assistant_engine at ffffffffa0ec81c7 [lfsck]
            #24 [ffff88102c4b7ee8] kthread at ffffffff8109abf6
            #25 [ffff88102c4b7f48] kernel_thread at ffffffff8100c20a
            

            backtrace of MDS2 vmcore shows it firstly asserted on this, I doubt there is race in normal operation and lfsck, so an object can be either remote or local during lfsck, and it caused assert. A glance of the code shows the locking of objects is loose, which may be the cause.

            Fanyong, what's your opinion?

            laisiyao Lai Siyao added a comment - #20 [ffff88102c4b7a48] lfsck_namespace_rebuild_linkea at ffffffffa0ece540 [lfsck] #21 [ffff88102c4b7ab8] lfsck_namespace_striped_dir_rescan at ffffffffa0f07690 [lfsck] #22 [ffff88102c4b7c58] lfsck_namespace_assistant_handler_p1 at ffffffffa0ed8ec3 [lfsck] #23 [ffff88102c4b7d78] lfsck_assistant_engine at ffffffffa0ec81c7 [lfsck] #24 [ffff88102c4b7ee8] kthread at ffffffff8109abf6 #25 [ffff88102c4b7f48] kernel_thread at ffffffff8100c20a backtrace of MDS2 vmcore shows it firstly asserted on this, I doubt there is race in normal operation and lfsck, so an object can be either remote or local during lfsck, and it caused assert. A glance of the code shows the locking of objects is loose, which may be the cause. Fanyong, what's your opinion?

            I've just run the script for LFSCk Phase 3 test 3.3.2 with lustre-master build 2768 and hit this LBUG again.

            I've uploaded vmcore, dmesg, and the script to uploads/LU-5914 where _mds1 are logs from MDS1 and _mds2 are logs from MDS2. There were two MDSs with two MDTs each with three OSSs each with two OSTs.

            jamesanunez James Nunez (Inactive) added a comment - I've just run the script for LFSCk Phase 3 test 3.3.2 with lustre-master build 2768 and hit this LBUG again. I've uploaded vmcore, dmesg, and the script to uploads/ LU-5914 where _mds1 are logs from MDS1 and _mds2 are logs from MDS2. There were two MDSs with two MDTs each with three OSSs each with two OSTs.

            People

              yong.fan nasf (Inactive)
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: