Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.7.0
-
OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each and three clients. Lustre master tag 2.6.90.
-
3
-
16466
Description
I’ve run into a problem while executing the LFSCK Phase 3 test plan at https://jira.hpdd.intel.com/browse/LU-4836
Test 3.3.2 calls for creating a number of subdirectories and then creating a variety of objects in each one, including local and remote subdirectories, where local and remote are relative to the MDS. The test plan calls for setting fail_loc to 1502 so that all objects created will have no linkEA. Creating files and local subdirectories works with this failure injected, but creating remote directories does not. In the following, “rdir-1” is the remote directory:
Create local subdirectories
Create remote subdirectories
error on LL_IOC_LMV_SETSTRIPE '/lustre/scratch/test_dir/sdir-0/rdir-1' (3): No data available
error: mkdir: create stripe dir '/lustre/scratch/test_dir/sdir-0/rdir-1' failed
status script Total(sec) E(xcluded) S(low)
------------------------------------------------------------------------------------
touch: missing file operand
Try `touch --help' for more information.
I cannot remove the remote directory:
# rm -rf /lustre/scratch/test_dir/sdir-0/rdir-1
[root@c13 tests]# ls /lustre/scratch/test_dir/sdir-0/rdir-1
ls: cannot access /lustre/scratch/test_dir/sdir-0/rdir-1: No such file or directory
[root@c13 tests]# ls /lustre/scratch/test_dir
sdir-0
[root@c13 tests]# ls /lustre/scratch/test_dir/sdir-0/
ls: cannot access /lustre/scratch/test_dir/sdir-0/rdir-1: No such file or directory
rdir-1
So, I figured I’d run LFSCK, since it is supposed to correct these errors, but LFSCK crashes the node. On MDS1, I reset fail_loc to zero and ran ‘lctl lfsck_start’:
# lctl set_param fail_loc=0
# lctl lfsck_start -A -M scratch-MDT0000 -c -C --reset --type namespace
Started LFSCK on the device scratch-MDT0000: scrub namespace
[root@mds01 ~]#
Message from syslogd@mds01-ib at Nov 10 08:11:20 ...
kernel:LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) ASSERTION( depth > 0 ) failed:
Message from syslogd@mds01-ib at Nov 10 08:11:20 ...
kernel:LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) LBUG
From the crash dmesg on MDS1:
<0>LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) ASSERTION( depth > 0 ) failed:
<0>LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) LBUG
<4>Pid: 17451, comm: lfsck
<4>
<4>Call Trace:
<4> [<ffffffffa06f2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa06f2e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa19106f7>] lfsck_exec_oit+0x7f7/0xb80 [lfsck]
<4> [<ffffffffa078904a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa1911d35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa191346e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa19129b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 17451, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
<4> [<ffffffffa06f2eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa19106f7>] ? lfsck_exec_oit+0x7f7/0xb80 [lfsck]
<4> [<ffffffffa078904a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa1911d35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa191346e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa19129b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
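For anyone triaging similar crashes, the LustreError lines in the trace above can be split into their parts mechanically. The following is only a sketch: the regular expression is inferred from the lines quoted in this ticket, not from any Lustre format specification, and the helper name `parse_lustre_error` and the field name `second_id` are made up for illustration (the meaning of the second number is not interpreted here).

```python
import re

# Inferred from the lines in this ticket:
#   LustreError: <pid>:<n>:(<file>:<line>:<function>()) <message>
# "second_id" is a placeholder name; its meaning is not assumed.
LUSTRE_ERROR = re.compile(
    r"LustreError: (?P<pid>\d+):(?P<second_id>\d+):"
    r"\((?P<file>[^:()]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)


def parse_lustre_error(text):
    """Return a dict of fields for a LustreError line, or None if no match."""
    match = LUSTRE_ERROR.search(text)
    if match is None:
        return None
    record = match.groupdict()
    record["pid"] = int(record["pid"])
    record["line"] = int(record["line"])
    record["msg"] = record["msg"].strip()
    return record


sample = (
    "<0>LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) "
    "ASSERTION( depth > 0 ) failed:"
)
record = parse_lustre_error(sample)
print(record["file"], record["line"], record["func"], record["msg"])
```

Using `search` rather than `match` lets the same pattern handle both the `<0>` dmesg prefix and the `kernel:` syslog prefix seen earlier.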
On the second MDS, dmesg contains:
Lustre: *** cfs_fail_loc=1502, val=0***
LustreError: 1426:0:(osd_handler.c:4546:osd_index_ea_insert()) scratch-MDT0001-osd: add [0x3c0000400:0xea61:0x0] error: rc = -61
LustreError: 1426:0:(osd_handler.c:2495:osd_object_destroy()) scratch-MDT0001-osd: delete inode [0x3c0000400:0xea61:0x0]: rc = -61
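The rc = -61 on MDT0001 appears to be the same error the client reported as "No data available": on Linux, errno 61 is ENODATA. A small sketch for decoding such kernel-style negative return codes follows; `describe_rc` is a hypothetical helper name, and the numeric values and messages come from the platform the script runs on, so it should be run on a Linux host to match the logs above.

```python
import errno
import os


def describe_rc(rc):
    """Map a kernel-style return code (negative errno) to 'NAME: message'."""
    code = -rc if rc < 0 else rc
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# The value logged by osd_index_ea_insert() and osd_object_destroy() above.
print(describe_rc(-61))
```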