Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.7.0
    • Lustre 2.7.0
    • OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each and three clients. Lustre master tag 2.6.90 build #2734
    • 3
    • 16512

    Description

      While running test 3.3.2 from the LFSCK Phase 3 test plan, both MDSs crashed during LFSCK.

      Test 3.3.2 calls for creating a number of subdirectories and creating a variety of objects in each subdirectory including files, multiply linked files, and local and remote subdirectories; local and remote in terms of the MDS. The test plan calls for setting fail_loc to 1603 so that all objects created will have no linkEA. Creating the objects work with this failure injected, but the both MDSs crash when ‘lctl lfsck_start’ is called.

      On the main MDS, with index 0, we call ‘lctl lfsck_start -M scratch-MDT0000 -A -c -C --reset --type namespace’. LFSCK starts and soon after the MDSs crash with the following on their consoles:

      Message from syslogd@mds01-ib at Nov 12 12:44:24 ...
       kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: 
      
      Message from syslogd@mds01-ib at Nov 12 12:44:24 ...
       kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG
      

      From the crash dmesg log on MDS1, we see:

      <6>Lustre: *** cfs_fail_loc=1603, val=0***
      <6>Lustre: Skipped 97 previous similar messages
      <0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: 
      <0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG
      <4>Pid: 14961, comm: lfsck
      <4>
      <4>Call Trace:
      <4> [<ffffffffa044c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa044ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0ed5998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
      <4> [<ffffffffa0e961c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa0e8b1c3>] lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa0e8e577>] lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa0e8fd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa0e9146e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 14961, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_6
      4 #1
      <4>Call Trace:
      <4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
      <4> [<ffffffffa044ceeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0ed5998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsc
      k]
      <4> [<ffffffffa0e961c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa0e8b1c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa0e8e577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa0e8fd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa0e9146e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      From the dmesg log from the crash on the second MDS looks a little different with an error from lfsck_namespace_rebuild_linkea():

      <6>Lustre: *** cfs_fail_loc=1603, val=0***
      <6>Lustre: *** cfs_fail_loc=1603, val=0***
      <6>Lustre: Skipped 6 previous similar messages
      <0>LustreError: 6700:0:(lfsck_namespace.c:1932:lfsck_namespace_rebuild_linkea())
       ASSERTION( !dt_object_remote(obj) ) failed: 
      <0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_op
      s ) failed: 
      <0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) LBUG
      <4>Pid: 6698, comm: lfsck
      <4>
      <4>Call Trace:
      <4> [<ffffffffa083b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa083be97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa09a1998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
      <4> [<ffffffffa09621c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa09571c3>] lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa095a577>] lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa095bd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa095d46e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 6698, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64
       #1
      <4>Call Trace:
      <4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
      <4> [<ffffffffa083beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa09a1998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsc
      k]
      <4> [<ffffffffa09621c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa09571c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa095a577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa095bd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa095d46e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Attachments

        Issue Links

          Activity

            [LU-5914] LFSCK: dt_lookup()) LBUG

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13801
            Subject: LU-5914 lfsck: dt_index_try before dt_lookup
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5942806bc97612b8d87560e70fde05111aa1aba1

            gerrit Gerrit Updater added a comment - Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13801 Subject: LU-5914 lfsck: dt_index_try before dt_lookup Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 5942806bc97612b8d87560e70fde05111aa1aba1

            I believe we just hit this on soak testing on Lola-8
            LustreError: 4782:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: ^M
            LustreError: 4782:0:(dt_object.h:2700:dt_lookup()) LBUG^M
            Pid: 4782, comm: lfsck_namespace^M
            ^M
            Call Trace:^M
            [<ffffffffa055c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
            [<ffffffffa055ce97>] lbug_with_loc+0x47/0xb0 [libcfs]^M
            [<ffffffffa0fd6344>] lfsck_namespace_check_exist+0x364/0x410 [lfsck]^M
            [<ffffffffa0fd9551>] lfsck_namespace_assistant_handler_p1+0xb41/0x2250 [lfsck]^M
            [<ffffffff810623a9>] ? find_busiest_queue+0x69/0x150^M
            [<ffffffffa0fc96e7>] lfsck_assistant_engine+0x497/0x1df0 [lfsck]^M
            [<ffffffff81061d00>] ? default_wake_function+0x0/0x20^M
            [<ffffffffa0fc9250>] ? lfsck_assistant_engine+0x0/0x1df0 [lfsck]^M
            [<ffffffff8109abf6>] kthread+0x96/0xa0^M
            [<ffffffff8100c20a>] child_rip+0xa/0x20^M
            [<ffffffff8109ab60>] ? kthread+0x0/0xa0^M
            [<ffffffff8100c200>] ? child_rip+0x0/0x20^M
            ^M
            We are running 2.6.94, with some patches from Li Wei, see the lola soak page

            cliffw Cliff White (Inactive) added a comment - I believe we just hit this on soak testing on Lola-8 LustreError: 4782:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: ^M LustreError: 4782:0:(dt_object.h:2700:dt_lookup()) LBUG^M Pid: 4782, comm: lfsck_namespace^M ^M Call Trace:^M [<ffffffffa055c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] ^M [<ffffffffa055ce97>] lbug_with_loc+0x47/0xb0 [libcfs] ^M [<ffffffffa0fd6344>] lfsck_namespace_check_exist+0x364/0x410 [lfsck] ^M [<ffffffffa0fd9551>] lfsck_namespace_assistant_handler_p1+0xb41/0x2250 [lfsck] ^M [<ffffffff810623a9>] ? find_busiest_queue+0x69/0x150^M [<ffffffffa0fc96e7>] lfsck_assistant_engine+0x497/0x1df0 [lfsck] ^M [<ffffffff81061d00>] ? default_wake_function+0x0/0x20^M [<ffffffffa0fc9250>] ? lfsck_assistant_engine+0x0/0x1df0 [lfsck] ^M [<ffffffff8109abf6>] kthread+0x96/0xa0^M [<ffffffff8100c20a>] child_rip+0xa/0x20^M [<ffffffff8109ab60>] ? kthread+0x0/0xa0^M [<ffffffff8100c200>] ? child_rip+0x0/0x20^M ^M We are running 2.6.94, with some patches from Li Wei, see the lola soak page

            Patch landed to Master.

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master.

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13188/
            Subject: LU-5914 lfsck: dt_lookup() LBUG
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4950f42fb6286f4cf5e6092593acabde76a8f093

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13188/ Subject: LU-5914 lfsck: dt_lookup() LBUG Project: fs/lustre-release Branch: master Current Patch Set: Commit: 4950f42fb6286f4cf5e6092593acabde76a8f093

            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/13188
            Subject: LU-5914 lfsck: dt_lookup() LBUG
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 01ebcf55122923e3c89ddde42766cb76eca142ef

            gerrit Gerrit Updater added a comment - Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/13188 Subject: LU-5914 lfsck: dt_lookup() LBUG Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 01ebcf55122923e3c89ddde42766cb76eca142ef

            Lai, you are right. The "LASSERT(!dt_object_remote(obj));" in lfsck_namespace_rebuild_linkea() is wrong, should be removed. But I still not sure how the "dt_lookup()) LBUG" can be triggered.

            yong.fan nasf (Inactive) added a comment - Lai, you are right. The "LASSERT(!dt_object_remote(obj));" in lfsck_namespace_rebuild_linkea() is wrong, should be removed. But I still not sure how the "dt_lookup()) LBUG" can be triggered.
            laisiyao Lai Siyao added a comment -
            #20 [ffff88102c4b7a48] lfsck_namespace_rebuild_linkea at ffffffffa0ece540 [lfsck]
            #21 [ffff88102c4b7ab8] lfsck_namespace_striped_dir_rescan at ffffffffa0f07690 [lfsck]
            #22 [ffff88102c4b7c58] lfsck_namespace_assistant_handler_p1 at ffffffffa0ed8ec3 [lfsck]
            #23 [ffff88102c4b7d78] lfsck_assistant_engine at ffffffffa0ec81c7 [lfsck]
            #24 [ffff88102c4b7ee8] kthread at ffffffff8109abf6
            #25 [ffff88102c4b7f48] kernel_thread at ffffffff8100c20a
            

            backtrace of MDS2 vmcore shows it firstly asserted on this, I doubt there is race in normal operation and lfsck, so an object can be either remote or local during lfsck, and it caused assert. A glance of the code shows the locking of objects is loose, which may be the cause.

            Fanyong, what's your opinion?

            laisiyao Lai Siyao added a comment - #20 [ffff88102c4b7a48] lfsck_namespace_rebuild_linkea at ffffffffa0ece540 [lfsck] #21 [ffff88102c4b7ab8] lfsck_namespace_striped_dir_rescan at ffffffffa0f07690 [lfsck] #22 [ffff88102c4b7c58] lfsck_namespace_assistant_handler_p1 at ffffffffa0ed8ec3 [lfsck] #23 [ffff88102c4b7d78] lfsck_assistant_engine at ffffffffa0ec81c7 [lfsck] #24 [ffff88102c4b7ee8] kthread at ffffffff8109abf6 #25 [ffff88102c4b7f48] kernel_thread at ffffffff8100c20a backtrace of MDS2 vmcore shows it firstly asserted on this, I doubt there is race in normal operation and lfsck, so an object can be either remote or local during lfsck, and it caused assert. A glance of the code shows the locking of objects is loose, which may be the cause. Fanyong, what's your opinion?

            I've just run the script for LFSCk Phase 3 test 3.3.2 with lustre-master build 2768 and hit this LBUG again.

            I've uploaded vmcore, dmesg, and the script to uploads/LU-5914 where _mds1 are logs from MDS1 and _mds2 are logs from MDS2. There were two MDSs with two MDTs each with three OSSs each with two OSTs.

            jamesanunez James Nunez (Inactive) added a comment - I've just run the script for LFSCk Phase 3 test 3.3.2 with lustre-master build 2768 and hit this LBUG again. I've uploaded vmcore, dmesg, and the script to uploads/ LU-5914 where _mds1 are logs from MDS1 and _mds2 are logs from MDS2. There were two MDSs with two MDTs each with three OSSs each with two OSTs.

            Lai, Yes, I can still reproduce this error/crash with a recent build of lustre-master; build # 2760.

            jamesanunez James Nunez (Inactive) added a comment - Lai, Yes, I can still reproduce this error/crash with a recent build of lustre-master; build # 2760.
            laisiyao Lai Siyao added a comment -

            I tested in a small scale for test plan 3.3.2, but couldn't reproduce. Could you verify it still occurs on latest master? and is there script for test plan to download?

            laisiyao Lai Siyao added a comment - I tested in a small scale for test plan 3.3.2, but couldn't reproduce. Could you verify it still occurs on latest master? and is there script for test plan to download?
            di.wang Di Wang added a comment -

            Hmm, I am not sure. Fan yong and I did discuss sth about different local and remote object creation process cause the dt_index_ops is not being assigned properly for remote object in LFSCK use cases. I am not sure he made the patch or not.

            For this LBUG, it seems dt_try_as_dir is missing before dt_lookup, though I did not check the code carefully.

            di.wang Di Wang added a comment - Hmm, I am not sure. Fan yong and I did discuss sth about different local and remote object creation process cause the dt_index_ops is not being assigned properly for remote object in LFSCK use cases. I am not sure he made the patch or not. For this LBUG, it seems dt_try_as_dir is missing before dt_lookup, though I did not check the code carefully.

            People

              yong.fan nasf (Inactive)
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: