Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.7.0
    • Lustre 2.7.0
    • OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each and three clients. Lustre master tag 2.6.90 build #2734
    • 3
    • 16512

    Description

      While running test 3.3.2 from the LFSCK Phase 3 test plan, both MDSs crashed during LFSCK.

      Test 3.3.2 calls for creating a number of subdirectories and creating a variety of objects in each subdirectory including files, multiply linked files, and local and remote subdirectories; local and remote in terms of the MDS. The test plan calls for setting fail_loc to 1603 so that all objects created will have no linkEA. Creating the objects work with this failure injected, but the both MDSs crash when ‘lctl lfsck_start’ is called.

      On the main MDS, with index 0, we call ‘lctl lfsck_start -M scratch-MDT0000 -A -c -C --reset --type namespace’. LFSCK starts and soon after the MDSs crash with the following on their consoles:

      Message from syslogd@mds01-ib at Nov 12 12:44:24 ...
       kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: 
      
      Message from syslogd@mds01-ib at Nov 12 12:44:24 ...
       kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG
      

      From the crash dmesg log on MDS1, we see:

      <6>Lustre: *** cfs_fail_loc=1603, val=0***
      <6>Lustre: Skipped 97 previous similar messages
      <0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: 
      <0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG
      <4>Pid: 14961, comm: lfsck
      <4>
      <4>Call Trace:
      <4> [<ffffffffa044c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa044ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0ed5998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
      <4> [<ffffffffa0e961c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa0e8b1c3>] lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa0e8e577>] lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa0e8fd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa0e9146e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 14961, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_6
      4 #1
      <4>Call Trace:
      <4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
      <4> [<ffffffffa044ceeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0ed5998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsc
      k]
      <4> [<ffffffffa0e961c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa0e8b1c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa0e8e577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa0e8fd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa0e9146e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      From the dmesg log from the crash on the second MDS looks a little different with an error from lfsck_namespace_rebuild_linkea():

      <6>Lustre: *** cfs_fail_loc=1603, val=0***
      <6>Lustre: *** cfs_fail_loc=1603, val=0***
      <6>Lustre: Skipped 6 previous similar messages
      <0>LustreError: 6700:0:(lfsck_namespace.c:1932:lfsck_namespace_rebuild_linkea())
       ASSERTION( !dt_object_remote(obj) ) failed: 
      <0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_op
      s ) failed: 
      <0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) LBUG
      <4>Pid: 6698, comm: lfsck
      <4>
      <4>Call Trace:
      <4> [<ffffffffa083b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa083be97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa09a1998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
      <4> [<ffffffffa09621c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa09571c3>] lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa095a577>] lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa095bd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa095d46e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 6698, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64
       #1
      <4>Call Trace:
      <4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
      <4> [<ffffffffa083beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa09a1998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsc
      k]
      <4> [<ffffffffa09621c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa09571c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa095a577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa095bd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa095d46e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Attachments

        Issue Links

          Activity

            [LU-5914] LFSCK: dt_lookup()) LBUG

            Lai, you are right. The "LASSERT(!dt_object_remote(obj));" in lfsck_namespace_rebuild_linkea() is wrong, should be removed. But I still not sure how the "dt_lookup()) LBUG" can be triggered.

            yong.fan nasf (Inactive) added a comment - Lai, you are right. The "LASSERT(!dt_object_remote(obj));" in lfsck_namespace_rebuild_linkea() is wrong, should be removed. But I still not sure how the "dt_lookup()) LBUG" can be triggered.
            laisiyao Lai Siyao added a comment -
            #20 [ffff88102c4b7a48] lfsck_namespace_rebuild_linkea at ffffffffa0ece540 [lfsck]
            #21 [ffff88102c4b7ab8] lfsck_namespace_striped_dir_rescan at ffffffffa0f07690 [lfsck]
            #22 [ffff88102c4b7c58] lfsck_namespace_assistant_handler_p1 at ffffffffa0ed8ec3 [lfsck]
            #23 [ffff88102c4b7d78] lfsck_assistant_engine at ffffffffa0ec81c7 [lfsck]
            #24 [ffff88102c4b7ee8] kthread at ffffffff8109abf6
            #25 [ffff88102c4b7f48] kernel_thread at ffffffff8100c20a
            

            backtrace of MDS2 vmcore shows it firstly asserted on this, I doubt there is race in normal operation and lfsck, so an object can be either remote or local during lfsck, and it caused assert. A glance of the code shows the locking of objects is loose, which may be the cause.

            Fanyong, what's your opinion?

            laisiyao Lai Siyao added a comment - #20 [ffff88102c4b7a48] lfsck_namespace_rebuild_linkea at ffffffffa0ece540 [lfsck] #21 [ffff88102c4b7ab8] lfsck_namespace_striped_dir_rescan at ffffffffa0f07690 [lfsck] #22 [ffff88102c4b7c58] lfsck_namespace_assistant_handler_p1 at ffffffffa0ed8ec3 [lfsck] #23 [ffff88102c4b7d78] lfsck_assistant_engine at ffffffffa0ec81c7 [lfsck] #24 [ffff88102c4b7ee8] kthread at ffffffff8109abf6 #25 [ffff88102c4b7f48] kernel_thread at ffffffff8100c20a backtrace of MDS2 vmcore shows it firstly asserted on this, I doubt there is race in normal operation and lfsck, so an object can be either remote or local during lfsck, and it caused assert. A glance of the code shows the locking of objects is loose, which may be the cause. Fanyong, what's your opinion?

            I've just run the script for LFSCk Phase 3 test 3.3.2 with lustre-master build 2768 and hit this LBUG again.

            I've uploaded vmcore, dmesg, and the script to uploads/LU-5914 where _mds1 are logs from MDS1 and _mds2 are logs from MDS2. There were two MDSs with two MDTs each with three OSSs each with two OSTs.

            jamesanunez James Nunez (Inactive) added a comment - I've just run the script for LFSCk Phase 3 test 3.3.2 with lustre-master build 2768 and hit this LBUG again. I've uploaded vmcore, dmesg, and the script to uploads/ LU-5914 where _mds1 are logs from MDS1 and _mds2 are logs from MDS2. There were two MDSs with two MDTs each with three OSSs each with two OSTs.

            Lai, Yes, I can still reproduce this error/crash with a recent build of lustre-master; build # 2760.

            jamesanunez James Nunez (Inactive) added a comment - Lai, Yes, I can still reproduce this error/crash with a recent build of lustre-master; build # 2760.
            laisiyao Lai Siyao added a comment -

            I tested in a small scale for test plan 3.3.2, but couldn't reproduce. Could you verify it still occurs on latest master? and is there script for test plan to download?

            laisiyao Lai Siyao added a comment - I tested in a small scale for test plan 3.3.2, but couldn't reproduce. Could you verify it still occurs on latest master? and is there script for test plan to download?
            di.wang Di Wang added a comment -

            Hmm, I am not sure. Fan yong and I did discuss sth about different local and remote object creation process cause the dt_index_ops is not being assigned properly for remote object in LFSCK use cases. I am not sure he made the patch or not.

            For this LBUG, it seems dt_try_as_dir is missing before dt_lookup, though I did not check the code carefully.

            di.wang Di Wang added a comment - Hmm, I am not sure. Fan yong and I did discuss sth about different local and remote object creation process cause the dt_index_ops is not being assigned properly for remote object in LFSCK use cases. I am not sure he made the patch or not. For this LBUG, it seems dt_try_as_dir is missing before dt_lookup, though I did not check the code carefully.

            Di, I thought there was a recent patch in this area already?

            adilger Andreas Dilger added a comment - Di, I thought there was a recent patch in this area already?

            Lai,
            Would you be able to comment on this one?
            Thank you!

            jlevi Jodi Levi (Inactive) added a comment - Lai, Would you be able to comment on this one? Thank you!

            People

              yong.fan nasf (Inactive)
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: