[LU-5892] lfsck_needs_scan_dir() LBUG Created: 10/Nov/14  Updated: 23/Nov/14  Resolved: 23/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: lfsck
Environment:

OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each and three clients. Lustre master tag 2.6.90.


Severity: 3
Rank (Obsolete): 16466

 Description   

I’ve run into a problem while executing the LFSCK Phase 3 test plan at https://jira.hpdd.intel.com/browse/LU-4836
Test 3.3.2 calls for creating a number of subdirectories and then creating a variety of objects in each one, including local and remote subdirectories (“local” and “remote” relative to the MDS). The test plan calls for setting fail_loc to 1502 so that all objects created will have no linkEA. Creating files and local subdirectories works with this failure injected, but creating remote directories does not. In the following, “rdir-1” is the remote directory:

Create local subdirectories
Create remote subdirectories
error on LL_IOC_LMV_SETSTRIPE '/lustre/scratch/test_dir/sdir-0/rdir-1' (3): No data available
error: mkdir: create stripe dir '/lustre/scratch/test_dir/sdir-0/rdir-1' failed
status        script            Total(sec) E(xcluded) S(low) 
------------------------------------------------------------------------------------

touch: missing file operand
Try `touch --help' for more information.

I cannot remove the remote directory:

# rm -rf /lustre/scratch/test_dir/sdir-0/rdir-1 
[root@c13 tests]# ls /lustre/scratch/test_dir/sdir-0/rdir-1 
ls: cannot access /lustre/scratch/test_dir/sdir-0/rdir-1: No such file or directory
[root@c13 tests]# ls /lustre/scratch/test_dir
sdir-0
[root@c13 tests]# ls /lustre/scratch/test_dir/sdir-0/
ls: cannot access /lustre/scratch/test_dir/sdir-0/rdir-1: No such file or directory
rdir-1

So, I figured I’d run LFSCK since it is supposed to correct these errors, but LFSCK crashes the node. On MDS1, I reset the fail_loc to zero and ran ‘lctl lfsck_start’:

# lctl set_param fail_loc=0

# lctl lfsck_start -A -M scratch-MDT0000 -c -C --reset --type namespace
Started LFSCK on the device scratch-MDT0000: scrub namespace
[root@mds01 ~]# 
Message from syslogd@mds01-ib at Nov 10 08:11:20 ...
 kernel:LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) ASSERTION( depth > 0 ) failed: 

Message from syslogd@mds01-ib at Nov 10 08:11:20 ...
 kernel:LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) LBUG

From the crash dmesg on MDS1:

<0>LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) ASSERTION( depth > 0 ) failed: 
<0>LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) LBUG
<4>Pid: 17451, comm: lfsck
<4>
<4>Call Trace:
<4> [<ffffffffa06f2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa06f2e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa19106f7>] lfsck_exec_oit+0x7f7/0xb80 [lfsck]
<4> [<ffffffffa078904a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa1911d35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa191346e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa19129b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 17451, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
<4> [<ffffffffa06f2eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa19106f7>] ? lfsck_exec_oit+0x7f7/0xb80 [lfsck]
<4> [<ffffffffa078904a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa1911d35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa191346e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa19129b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

On the second MDS, dmesg contains:

Lustre: *** cfs_fail_loc=1502, val=0***
LustreError: 1426:0:(osd_handler.c:4546:osd_index_ea_insert()) scratch-MDT0001-osd: add [0x3c0000400:0xea61:0x0] error: rc = -61
LustreError: 1426:0:(osd_handler.c:2495:osd_object_destroy()) scratch-MDT0001-osd: delete inode [0x3c0000400:0xea61:0x0]: rc = -61
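
The rc = -61 in these messages and the "No data available" text in the client-side mkdir error appear to be the same error: on Linux, errno 61 is ENODATA. A minimal sketch for decoding such return codes, assuming a Linux/glibc userspace (rc_to_message is a hypothetical helper name, not a Lustre function):

```c
#include <errno.h>
#include <string.h>

/*
 * Kernel-style return codes are negated errno values, so flip the sign
 * before looking up the message. Non-negative values are not errors.
 * The errno numbering is platform-specific; 61 == ENODATA holds on Linux.
 */
const char *rc_to_message(int rc)
{
    return rc < 0 ? strerror(-rc) : "no error";
}
```

With this helper, rc_to_message(-61) should give the same "No data available" string that the mkdir error printed on the client.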


 Comments   
Comment by Jodi Levi (Inactive) [ 10/Nov/14 ]

Fan Yong,
Can you look into this one?
Thank you!

Comment by James Nunez (Inactive) [ 10/Nov/14 ]

I've uploaded the vmcore from the MDS that crashed at uploads/LU-5892 on the ftp site.

Comment by nasf (Inactive) [ 11/Nov/14 ]

James,
To generate an MDT-object without linkEA, you need to inject the failure stub "#define OBD_FAIL_LFSCK_LINKEA_CRASH 0x1603". The failure stub 0x1502 instead causes the MDT-object to have no LMV EA. That is why you saw the message "osd_index_ea_insert()) scratch-MDT0001-osd: add [0x3c0000400:0xea61:0x0] error: rc = -61".

As for the LBUG(), I will make a patch to fix it.

Comment by nasf (Inactive) [ 11/Nov/14 ]

Here is the patch:
http://review.whamcloud.com/12670

Comment by Gerrit Updater [ 23/Nov/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12670/
Subject: LU-5892 lfsck: remove improper LASSERT in lfsck_needs_scan_dir
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4db8abadae3f8a393fe0d25e07575305ae3876da
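
The patch subject describes the assertion as improper. The general point can be illustrated with a toy model: a scanner that walks inconsistent on-disk state should treat damage as input rather than as a programming error. The real lfsck_needs_scan_dir() is not shown in this ticket, so everything below (struct toy_dir and both functions) is hypothetical and is not the Lustre code or the merged patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory object; NOT a Lustre structure. */
struct toy_dir {
    const struct toy_dir *parent; /* NULL at the root */
    bool consistent;              /* false models a damaged directory */
    bool needs_scan;              /* true if this directory must be scanned */
};

/*
 * Strict variant: walks toward the root and assumes that damage can only
 * be found above the starting directory. The assert aborts the program
 * when the starting directory itself is damaged (depth == 0).
 */
bool toy_needs_scan_strict(const struct toy_dir *dir)
{
    int depth = 0;

    for (; dir != NULL; dir = dir->parent, depth++) {
        if (!dir->consistent) {
            assert(depth > 0);
            return false;
        }
        if (dir->needs_scan)
            return true;
    }
    return false;
}

/*
 * Tolerant variant: same walk, but a damaged directory at any depth,
 * including the starting one, simply ends the walk with "no scan".
 */
bool toy_needs_scan_tolerant(const struct toy_dir *dir)
{
    for (; dir != NULL; dir = dir->parent) {
        if (!dir->consistent)
            return false;
        if (dir->needs_scan)
            return true;
    }
    return false;
}
```

In this model, calling toy_needs_scan_strict() on a directory whose own consistent flag is false trips the assertion, while toy_needs_scan_tolerant() returns false and lets the caller continue.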
