[LU-5892] lfsck_needs_scan_dir() LBUG Created: 10/Nov/14 Updated: 23/Nov/14 Resolved: 23/Nov/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Nunez (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | lfsck | ||
| Environment: |
OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each and three clients. Lustre master tag 2.6.90. |
||
| Severity: | 3 |
| Rank (Obsolete): | 16466 |
| Description |
|
I’ve run into a problem while executing the LFSCK Phase 3 test plan at https://jira.hpdd.intel.com/browse/LU-4836

Create local subdirectories
Create remote subdirectories
error on LL_IOC_LMV_SETSTRIPE '/lustre/scratch/test_dir/sdir-0/rdir-1' (3): No data available
error: mkdir: create stripe dir '/lustre/scratch/test_dir/sdir-0/rdir-1' failed
status script Total(sec) E(xcluded) S(low)
------------------------------------------------------------------------------------
touch: missing file operand
Try `touch --help' for more information.

I cannot remove the remote directory:

# rm -rf /lustre/scratch/test_dir/sdir-0/rdir-1
[root@c13 tests]# ls /lustre/scratch/test_dir/sdir-0/rdir-1
ls: cannot access /lustre/scratch/test_dir/sdir-0/rdir-1: No such file or directory
[root@c13 tests]# ls /lustre/scratch/test_dir
sdir-0
[root@c13 tests]# ls /lustre/scratch/test_dir/sdir-0/
ls: cannot access /lustre/scratch/test_dir/sdir-0/rdir-1: No such file or directory
rdir-1

So, I figured I’d run LFSCK since it is supposed to correct these errors, but LFSCK crashes the node. On MDS1, I reset the fail_loc to zero and ran ‘lctl lfsck_start’:

# lctl set_param fail_loc=0
# lctl lfsck_start -A -M scratch-MDT0000 -c -C --reset --type namespace
Started LFSCK on the device scratch-MDT0000: scrub namespace
[root@mds01 ~]#
Message from syslogd@mds01-ib at Nov 10 08:11:20 ...
kernel:LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) ASSERTION( depth > 0 ) failed:

Message from syslogd@mds01-ib at Nov 10 08:11:20 ...
kernel:LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) LBUG

From the crash dmesg on MDS1:

<0>LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) ASSERTION( depth > 0 ) failed:
<0>LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) LBUG
<4>Pid: 17451, comm: lfsck
<4>
<4>Call Trace:
<4> [<ffffffffa06f2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa06f2e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa19106f7>] lfsck_exec_oit+0x7f7/0xb80 [lfsck]
<4> [<ffffffffa078904a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa1911d35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa191346e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa19129b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 17451, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
<4> [<ffffffffa06f2eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa19106f7>] ? lfsck_exec_oit+0x7f7/0xb80 [lfsck]
<4> [<ffffffffa078904a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa1911d35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa191346e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa19129b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

On the second MDS, dmesg contains:

Lustre: *** cfs_fail_loc=1502, val=0***
LustreError: 1426:0:(osd_handler.c:4546:osd_index_ea_insert()) scratch-MDT0001-osd: add [0x3c0000400:0xea61:0x0] error: rc = -61
LustreError: 1426:0:(osd_handler.c:2495:osd_object_destroy()) scratch-MDT0001-osd: delete inode [0x3c0000400:0xea61:0x0]: rc = -61 |
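For triage, the LustreError lines quoted above can be split into their visible parts: thread PID, source file, line number, function, message, and any trailing `rc = N` value. The following is only a minimal sketch based on the log lines shown in this ticket; the names `parse_lustre_error`, `LustreErrorRecord` and `errno_name` are hypothetical helpers, not part of Lustre or its tools, and the second numeric field after the PID is matched but left unnamed because the report does not say what it means.

```python
import errno
import re
from typing import NamedTuple, Optional

# Pattern inferred from the lines quoted in this ticket, e.g.
#   LustreError: 17451:0:(lfsck_engine.c:232:lfsck_needs_scan_dir()) LBUG
# It is an assumption that other LustreError lines follow the same shape.
LUSTRE_ERROR = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[^:()]+):(?P<line>\d+):(?P<func>\w+)\(\)\)"
    r"\s*(?P<msg>.*)"
)
RC = re.compile(r"rc = (-?\d+)")


class LustreErrorRecord(NamedTuple):
    pid: int
    file: str
    line: int
    func: str
    msg: str
    rc: Optional[int]  # None when the message carries no "rc = N"


def parse_lustre_error(text: str) -> Optional[LustreErrorRecord]:
    """Return the parsed record, or None if the line does not match."""
    m = LUSTRE_ERROR.search(text)
    if m is None:
        return None
    msg = m.group("msg").strip()
    rc_match = RC.search(msg)
    return LustreErrorRecord(
        pid=int(m.group("pid")),
        file=m.group("file"),
        line=int(m.group("line")),
        func=m.group("func"),
        msg=msg,
        rc=int(rc_match.group(1)) if rc_match else None,
    )


def errno_name(rc: int) -> str:
    """Map a negative kernel-style return code to its errno symbol."""
    return errno.errorcode.get(abs(rc), "UNKNOWN")


if __name__ == "__main__":
    sample = (
        "LustreError: 1426:0:(osd_handler.c:4546:osd_index_ea_insert()) "
        "scratch-MDT0001-osd: add [0x3c0000400:0xea61:0x0] error: rc = -61"
    )
    record = parse_lustre_error(sample)
    print(record.func, record.rc, errno_name(record.rc))
```

On Linux, errno 61 is ENODATA, whose message is "No data available". That matches the LL_IOC_LMV_SETSTRIPE error seen on the client, so the `rc = -61` lines on the second MDS and the client-side failure appear to report the same error.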
| Comments |
| Comment by Jodi Levi (Inactive) [ 10/Nov/14 ] |
|
Fan Yong, |
| Comment by James Nunez (Inactive) [ 10/Nov/14 ] |
|
I've uploaded the vmcore from the MDS that crashed at uploads/ |
| Comment by nasf (Inactive) [ 11/Nov/14 ] |
|
James, As for the LBUG(), I will make a patch to fix it. |
| Comment by nasf (Inactive) [ 11/Nov/14 ] |
|
Here is the patch: |
| Comment by Gerrit Updater [ 23/Nov/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12670/ |