Lustre / LU-5914

LFSCK: dt_lookup() LBUG

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.7.0
    • Environment: OpenSFS cluster with two MDSs, each with one MDT; three OSSs, each with two OSTs; and three clients. Lustre master tag 2.6.90, build #2734.
    • Severity: 3
    • Rank (Obsolete): 16512

    Description

      While running test 3.3.2 from the LFSCK Phase 3 test plan, both MDSs crashed during LFSCK.

      Test 3.3.2 calls for creating a number of subdirectories and populating each one with a variety of objects: regular files, multiply linked files, and both local and remote subdirectories (local and remote with respect to the MDS). The test plan calls for setting fail_loc to 0x1603 so that every object is created without a linkEA. Creating the objects works with this failure injected, but both MDSs crash when ‘lctl lfsck_start’ is called.
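
      For anyone trying to reproduce this, the steps above can be sketched roughly as follows. This is only a dry-run helper that builds the command list; it does not execute anything. The mount point, directory names, object counts, and the remote MDT index are hypothetical and are not taken from the test plan. It also assumes that the fail_loc value 1603 is hexadecimal and that it is set on the MDS nodes before the objects are created from a client.

```python
def build_repro_commands(mount="/mnt/scratch", n_dirs=2, remote_mdt=1):
    """Build (but do not run) the commands for the reproduction.

    mount, n_dirs and remote_mdt are illustrative assumptions.
    """
    # Run on the MDSs: inject the failure so new objects get no linkEA.
    # The report gives the value as 1603; treating it as hex is an assumption.
    cmds = ["lctl set_param fail_loc=0x1603"]

    # Run on a client: create the mix of objects in each subdirectory.
    for i in range(n_dirs):
        d = f"{mount}/dir{i}"
        cmds += [
            f"mkdir {d}",
            f"touch {d}/file",                                # regular file
            f"ln {d}/file {d}/hardlink",                      # multiply linked file
            f"mkdir {d}/local_subdir",                        # local subdirectory
            f"lfs mkdir -i {remote_mdt} {d}/remote_subdir",   # remote subdirectory
        ]

    # Run on the MDS with index 0: the command quoted in this report.
    cmds.append(
        "lctl lfsck_start -M scratch-MDT0000 -A -c -C --reset --type namespace"
    )
    return cmds
```

      Printing the result of build_repro_commands() gives a checklist that can be run by hand on the appropriate nodes.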

      On the main MDS (index 0), we call ‘lctl lfsck_start -M scratch-MDT0000 -A -c -C --reset --type namespace’. LFSCK starts, and soon afterward both MDSs crash with the following on their consoles:

      Message from syslogd@mds01-ib at Nov 12 12:44:24 ...
       kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: 
      
      Message from syslogd@mds01-ib at Nov 12 12:44:24 ...
       kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG
      

      From the crash dmesg log on MDS1, we see:

      <6>Lustre: *** cfs_fail_loc=1603, val=0***
      <6>Lustre: Skipped 97 previous similar messages
      <0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: 
      <0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG
      <4>Pid: 14961, comm: lfsck
      <4>
      <4>Call Trace:
      <4> [<ffffffffa044c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa044ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0ed5998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
      <4> [<ffffffffa0e961c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa0e8b1c3>] lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa0e8e577>] lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa0e8fd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa0e9146e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 14961, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
      <4> [<ffffffffa044ceeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0ed5998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
      <4> [<ffffffffa0e961c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa0e8b1c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa0e8e577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa0e8fd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa0e9146e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
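
      When reading a trace like the one above, it can help to separate the frames the kernel marks with "?" (addresses found on the stack but not confirmed by the unwinder) from the reliable ones. The following is a minimal sketch of such a filter; the function names and the choice to skip libcfs frames are my own assumptions, not part of any Lustre tooling.

```python
import re

# Matches lines such as:
#   <4> [<ffffffffa0ed5998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

def parse_frames(lines):
    """Return one dict per stack frame found in the given dmesg lines."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append({
                "symbol": m.group("symbol"),
                # Frames without a [module] suffix belong to the core kernel.
                "module": m.group("module") or "kernel",
                "reliable": m.group("unreliable") is None,
            })
    return frames

def first_caller(frames, skip_modules=("libcfs",)):
    """Return the first reliable frame outside the LBUG reporting code."""
    for frame in frames:
        if frame["reliable"] and frame["module"] not in skip_modules:
            return frame
    return None

sample = [
    "<4> [<ffffffffa044c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]",
    "<4> [<ffffffffa044ce97>] lbug_with_loc+0x47/0xb0 [libcfs]",
    "<4> [<ffffffffa0ed5998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]",
    "<4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]",
    "<4> [<ffffffff8109abf6>] kthread+0x96/0xa0",
]
```

      On this sample, first_caller(parse_frames(sample)) returns the lfsck_namespace_verify_stripe_slave frame. dt_lookup() itself does not appear in the trace, presumably because it is an inline function in dt_object.h.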
      

      The dmesg log from the crash on the second MDS looks a little different, with an additional assertion failure in lfsck_namespace_rebuild_linkea():

      <6>Lustre: *** cfs_fail_loc=1603, val=0***
      <6>Lustre: *** cfs_fail_loc=1603, val=0***
      <6>Lustre: Skipped 6 previous similar messages
      <0>LustreError: 6700:0:(lfsck_namespace.c:1932:lfsck_namespace_rebuild_linkea()) ASSERTION( !dt_object_remote(obj) ) failed: 
      <0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: 
      <0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) LBUG
      <4>Pid: 6698, comm: lfsck
      <4>
      <4>Call Trace:
      <4> [<ffffffffa083b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa083be97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa09a1998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
      <4> [<ffffffffa09621c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa09571c3>] lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa095a577>] lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa095bd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa095d46e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 6698, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
      <4> [<ffffffffa083beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa09a1998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
      <4> [<ffffffffa09621c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
      <4> [<ffffffffa09571c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck]
      <4> [<ffffffffa095a577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck]
      <4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa095bd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa095d46e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
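
      Note that the two assertions on the second MDS come from different PIDs (6700 and 6698), and only PID 6698 goes on to LBUG in the log shown. A small sketch for grouping LustreError lines by PID makes this easier to see in longer logs. The regular expression below is written against the lines in this report only, and it assumes wrapped lines have already been rejoined.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   <0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) LBUG
ERR_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)"
)

def errors_by_pid(lines):
    """Group (function, source line, message) tuples by the reporting PID."""
    grouped = defaultdict(list)
    for line in lines:
        m = ERR_RE.search(line)
        if m:
            grouped[int(m.group("pid"))].append(
                (m.group("func"), int(m.group("line")), m.group("msg").strip())
            )
    return dict(grouped)

sample_errors = [
    "<0>LustreError: 6700:0:(lfsck_namespace.c:1932:lfsck_namespace_rebuild_linkea())"
    " ASSERTION( !dt_object_remote(obj) ) failed: ",
    "<0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup())"
    " ASSERTION( dt->do_index_ops ) failed: ",
    "<0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) LBUG",
]
```

      Calling errors_by_pid(sample_errors) yields one entry for PID 6700 (the lfsck_namespace_rebuild_linkea() assertion) and two entries for PID 6698 (the dt_lookup() assertion followed by the LBUG).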
      

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: jamesanunez James Nunez (Inactive)