Lustre / LU-5892

lfsck_needs_scan_dir() LBUG

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.7.0
    • Environment: OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each, and three clients. Lustre master tag 2.6.90.
    • Severity: 3
    • Rank (Obsolete): 16466

    Description

      I’ve run into a problem while executing the LFSCK Phase 3 test plan at https://jira.hpdd.intel.com/browse/LU-4836
      Test 3.3.2 calls for creating a number of subdirectories and a variety of objects in each subdirectory, including local and remote subdirectories, where local and remote are relative to the MDS. The test plan calls for setting fail_loc to 1502 so that all objects created will have no linkEA. Creating files and local subdirectories works with this failure injected, but creating remote directories does not. In the following, “rdir-1” is the remote directory:

      Create local subdirectories
      Create remote subdirectories
      error on LL_IOC_LMV_SETSTRIPE '/lustre/scratch/test_dir/sdir-0/rdir-1' (3): No data available
      error: mkdir: create stripe dir '/lustre/scratch/test_dir/sdir-0/rdir-1' failed
      status        script            Total(sec) E(xcluded) S(low) 
      ------------------------------------------------------------------------------------
      
      touch: missing file operand
      Try `touch --help' for more information.
      

      I cannot remove the remote directory:

      # rm -rf /lustre/scratch/test_dir/sdir-0/rdir-1 
      [root@c13 tests]# ls /lustre/scratch/test_dir/sdir-0/rdir-1 
      ls: cannot access /lustre/scratch/test_dir/sdir-0/rdir-1: No such file or directory
      [root@c13 tests]# ls /lustre/scratch/test_dir
      sdir-0
      [root@c13 tests]# ls /lustre/scratch/test_dir/sdir-0/
      ls: cannot access /lustre/scratch/test_dir/sdir-0/rdir-1: No such file or directory
      rdir-1
      

      So, I figured I’d run LFSCK, since it is supposed to correct these errors, but LFSCK crashes the node. On MDS1, I reset fail_loc to zero and ran ‘lctl lfsck_start’:

      # lctl set_param fail_loc=0
      
      # lctl lfsck_start -A -M scratch-MDT0000 -c -C --reset --type namespace
      Started LFSCK on the device scratch-MDT0000: scrub namespace
      [root@mds01 ~]# 
      Message from syslogd@mds01-ib at Nov 10 08:11:20 ...
       kernel:LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) ASSERTION( depth > 0 ) failed: 
      
      Message from syslogd@mds01-ib at Nov 10 08:11:20 ...
       kernel:LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) LBUG
      

      From the crash dmesg on MDS1:

      <0>LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) ASSERTION( depth > 0 ) failed: 
      <0>LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) LBUG
      <4>Pid: 17451, comm: lfsck
      <4>
      <4>Call Trace:
      <4> [<ffffffffa06f2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa06f2e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa19106f7>] lfsck_exec_oit+0x7f7/0xb80 [lfsck]
      <4> [<ffffffffa078904a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa1911d35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa191346e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa19129b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 17451, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
      <4> [<ffffffffa06f2eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa19106f7>] ? lfsck_exec_oit+0x7f7/0xb80 [lfsck]
      <4> [<ffffffffa078904a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
      <4> [<ffffffffa1911d35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
      <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa191346e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa19129b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
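
      For triage, the LustreError lines in the dump above can be reduced to their source location and message. The following is a minimal sketch, not part of Lustre or its tooling: the regular expression, the function name parse_lustre_errors and the field names are assumptions made for illustration, based only on the message layout visible in this ticket.

```python
import re

# Matches console messages shaped like the ones in this ticket:
#   LustreError: PID:N:(file.c:LINE:function()) message
# The group names are labels chosen for this sketch only.
LUSTRE_MSG = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)"
)


def parse_lustre_errors(text):
    """Return one dict per LustreError line found in text."""
    events = []
    for raw in text.splitlines():
        m = LUSTRE_MSG.search(raw)
        if m is None:
            continue  # stack-trace frames and other lines are skipped
        event = m.groupdict()
        event["pid"] = int(event["pid"])
        event["line"] = int(event["line"])
        event["msg"] = event["msg"].strip()
        events.append(event)
    return events


# Two lines copied from the MDS1 crash dmesg in this ticket.
sample = """\
<0>LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) ASSERTION( depth > 0 ) failed:
<0>LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) LBUG
"""

for e in parse_lustre_errors(sample):
    print(e["func"], e["file"], e["line"], e["msg"])
```

      Both lines resolve to lfsck_needs_scan_dir() at lfsck_engine.c:232, which is the assertion site named in the issue title.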
      

      On the second MDS, dmesg contains:

      Lustre: *** cfs_fail_loc=1502, val=0***
      LustreError: 1426:0:(osd_handler.c:4546:osd_index_ea_insert()) scratch-MDT0001-osd: add [0x3c0000400:0xea61:0x0] error: rc = -61
      LustreError: 1426:0:(osd_handler.c:2495:osd_object_destroy()) scratch-MDT0001-osd: delete inode [0x3c0000400:0xea61:0x0]: rc = -61
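
      The rc = -61 in these messages appears to be the same error the client reported as “No data available”: on Linux, errno 61 is ENODATA. A minimal sketch for decoding such return codes follows. The helper name decode_rc and the pattern are illustrative assumptions, and the errno table is platform-specific, so the decoded name is only meaningful when run on Linux.

```python
import errno
import os
import re

# Lustre log lines in this ticket end with "rc = <negative errno>".
RC_PATTERN = re.compile(r"rc = (-?\d+)")


def decode_rc(line):
    """Return (rc, errno name, description) for a log line, or None."""
    m = RC_PATTERN.search(line)
    if m is None:
        return None
    rc = int(m.group(1))
    code = abs(rc)  # kernel code returns errno values negated
    name = errno.errorcode.get(code, "UNKNOWN")
    return rc, name, os.strerror(code)


# Line copied from the MDS2 dmesg in this ticket.
line = (
    "LustreError: 1426:0:(osd_handler.c:4546:osd_index_ea_insert()) "
    "scratch-MDT0001-osd: add [0x3c0000400:0xea61:0x0] error: rc = -61"
)

print(decode_rc(line))
```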
      

          People

            Assignee: nasf (yong.fan) (Inactive)
            Reporter: James Nunez (jamesanunez) (Inactive)