Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.7.0
-
OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each and three clients. Lustre master tag 2.6.90 build #2734
-
3
-
16512
Description
While running test 3.3.2 from the LFSCK Phase 3 test plan, both MDSs crashed during LFSCK.
Test 3.3.2 calls for creating a number of subdirectories and creating a variety of objects in each subdirectory, including files, multiply linked files, and local and remote subdirectories, where local and remote are relative to the MDS. The test plan calls for setting fail_loc to 1603 so that all objects created will have no linkEA. Creating the objects works with this failure injected, but both MDSs crash when ‘lctl lfsck_start’ is called.
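The setup described above can be sketched as a dry-run command list. This is only an illustration, not the test plan's actual script: the mount path, directory and file names, object counts, the remote MDT index, and the choice to set fail_loc on the MDS nodes are all assumptions, and the value 1603 is assumed to be hexadecimal (0x1603). The final lfsck_start command is the one quoted in this report.

```python
def build_repro_steps(root="/mnt/scratch/lfsck_3_3_2", n_dirs=2, remote_mdt=1):
    """Return (host_role, command) pairs; nothing is executed.

    All paths, names and counts are hypothetical placeholders.
    """
    # Assumption: fail_loc is set on the MDS nodes and 1603 is a hex value.
    steps = [("mds", "lctl set_param fail_loc=0x1603")]
    for i in range(n_dirs):
        d = f"{root}/dir{i}"
        steps += [
            ("client", f"mkdir -p {d}"),
            ("client", f"touch {d}/file"),                 # regular file
            ("client", f"ln {d}/file {d}/hardlink"),       # multiply linked file
            ("client", f"mkdir {d}/local_subdir"),         # subdirectory on the same MDT
            # Assumption: a remote subdirectory is created on another MDT by index.
            ("client", f"lfs mkdir -i {remote_mdt} {d}/remote_subdir"),
        ]
    # Command quoted in this report, run on the MDS with index 0.
    steps.append(
        ("mds0",
         "lctl lfsck_start -M scratch-MDT0000 -A -c -C --reset --type namespace")
    )
    return steps


if __name__ == "__main__":
    for role, cmd in build_repro_steps():
        print(f"[{role}] {cmd}")
```

Printing the list rather than executing it keeps the sketch safe to run anywhere; on a real cluster each command would be issued on the node named in the first field.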
On the main MDS, with index 0, we call ‘lctl lfsck_start -M scratch-MDT0000 -A -c -C --reset --type namespace’. LFSCK starts and soon after the MDSs crash with the following on their consoles:
Message from syslogd@mds01-ib at Nov 12 12:44:24 ...
kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed:
Message from syslogd@mds01-ib at Nov 12 12:44:24 ...
kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG
From the crash dmesg log on MDS1, we see:
<6>Lustre: *** cfs_fail_loc=1603, val=0***
<6>Lustre: Skipped 97 previous similar messages
<0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed:
<0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG
<4>Pid: 14961, comm: lfsck
<4>
<4>Call Trace:
<4> [<ffffffffa044c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa044ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0ed5998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
<4> [<ffffffffa0e961c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
<4> [<ffffffffa0e8b1c3>] lfsck_open_dir+0xa3/0x380 [lfsck]
<4> [<ffffffffa0e8e577>] lfsck_exec_oit+0x677/0xb80 [lfsck]
<4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa0e8fd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa0e9146e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 14961, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
<4> [<ffffffffa044ceeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0ed5998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
<4> [<ffffffffa0e961c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
<4> [<ffffffffa0e8b1c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck]
<4> [<ffffffffa0e8e577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck]
<4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa0e8fd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa0e9146e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
The dmesg log from the crash on the second MDS looks a little different, with an additional assertion failure from lfsck_namespace_rebuild_linkea():
<6>Lustre: *** cfs_fail_loc=1603, val=0***
<6>Lustre: *** cfs_fail_loc=1603, val=0***
<6>Lustre: Skipped 6 previous similar messages
<0>LustreError: 6700:0:(lfsck_namespace.c:1932:lfsck_namespace_rebuild_linkea()) ASSERTION( !dt_object_remote(obj) ) failed:
<0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed:
<0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) LBUG
<4>Pid: 6698, comm: lfsck
<4>
<4>Call Trace:
<4> [<ffffffffa083b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa083be97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa09a1998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
<4> [<ffffffffa09621c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
<4> [<ffffffffa09571c3>] lfsck_open_dir+0xa3/0x380 [lfsck]
<4> [<ffffffffa095a577>] lfsck_exec_oit+0x677/0xb80 [lfsck]
<4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa095bd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa095d46e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 6698, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
<4> [<ffffffffa083beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa09a1998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
<4> [<ffffffffa09621c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
<4> [<ffffffffa09571c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck]
<4> [<ffffffffa095a577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck]
<4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa095bd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa095d46e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
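To compare the two crash logs, it can help to pull out just the assertion failures. The following is a minimal sketch, not part of any Lustre tooling: the function names and regular expressions are assumptions written against the message layout seen in the logs above. It splits the dmesg text on the `<N>` log-level prefixes, so it works whether the records are on one line or one per line.

```python
import re

# A kernel log-level prefix such as <0>, <4> or <6>.
LEVEL_PREFIX = re.compile(r"<(\d)>")

# Matches e.g. "6698:0:(dt_object.h:2700:dt_lookup()) ASSERTION( expr ) failed"
ASSERTION = re.compile(
    r"(?P<pid>\d+):0:\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"ASSERTION\( (?P<expr>.+?) \) failed"
)


def split_records(dmesg_text):
    """Split dmesg text into (level, message) pairs."""
    parts = LEVEL_PREFIX.split(dmesg_text)
    # parts = [text before first prefix, level, message, level, message, ...]
    return [(int(parts[i]), parts[i + 1].strip())
            for i in range(1, len(parts), 2)]


def find_assertions(dmesg_text):
    """Return one dict per failed ASSERTION, in log order."""
    found = []
    for _level, message in split_records(dmesg_text):
        match = ASSERTION.search(message)
        if match:
            found.append(match.groupdict())
    return found


if __name__ == "__main__":
    sample = (
        "<6>Lustre: *** cfs_fail_loc=1603, val=0*** "
        "<0>LustreError: 6700:0:(lfsck_namespace.c:1932:"
        "lfsck_namespace_rebuild_linkea()) ASSERTION( !dt_object_remote(obj) ) failed: "
        "<0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) "
        "ASSERTION( dt->do_index_ops ) failed: "
        "<0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) LBUG"
    )
    for hit in find_assertions(sample):
        print(f"pid {hit['pid']}: {hit['func']}() at "
              f"{hit['file']}:{hit['line']}: {hit['expr']}")
```

Applied to the second MDS log, this separates the lfsck_namespace_rebuild_linkea() assertion (pid 6700) from the dt_lookup() assertion (pid 6698) that precedes the LBUG.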