[LU-5914] LFSCK: dt_lookup()) LBUG Created: 12/Nov/14 Updated: 06/Mar/15 Resolved: 20/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James Nunez (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB, lfsck, soak | ||
| Environment: |
OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each and three clients. Lustre master tag 2.6.90 build #2734 |
||
| Severity: | 3 |
| Rank (Obsolete): | 16512 |
| Description |
|
While running test 3.3.2 from the LFSCK Phase 3 test plan, both MDSs crashed during LFSCK. Test 3.3.2 calls for creating a number of subdirectories and creating a variety of objects in each subdirectory including files, multiply linked files, and local and remote subdirectories; local and remote in terms of the MDS. The test plan calls for setting fail_loc to 1603 so that all objects created will have no linkEA. Creating the objects work with this failure injected, but the both MDSs crash when ‘lctl lfsck_start’ is called. On the main MDS, with index 0, we call ‘lctl lfsck_start -M scratch-MDT0000 -A -c -C --reset --type namespace’. LFSCK starts and soon after the MDSs crash with the following on their consoles: Message from syslogd@mds01-ib at Nov 12 12:44:24 ... kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: Message from syslogd@mds01-ib at Nov 12 12:44:24 ... kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG From the crash dmesg log on MDS1, we see: <6>Lustre: *** cfs_fail_loc=1603, val=0*** <6>Lustre: Skipped 97 previous similar messages <0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: <0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG <4>Pid: 14961, comm: lfsck <4> <4>Call Trace: <4> [<ffffffffa044c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] <4> [<ffffffffa044ce97>] lbug_with_loc+0x47/0xb0 [libcfs] <4> [<ffffffffa0ed5998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck] <4> [<ffffffffa0e961c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck] <4> [<ffffffffa0e8b1c3>] lfsck_open_dir+0xa3/0x380 [lfsck] <4> [<ffffffffa0e8e577>] lfsck_exec_oit+0x677/0xb80 [lfsck] <4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld] <4> [<ffffffffa0e8fd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck] <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90 <4> [<ffffffffa0e9146e>] lfsck_master_engine+0xabe/0x1390 [lfsck] <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20 <4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck] <4> [<ffffffff8109abf6>] kthread+0x96/0xa0 <4> [<ffffffff8100c20a>] child_rip+0xa/0x20 <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0 <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20 <4> <0>Kernel panic - not syncing: LBUG <4>Pid: 14961, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_6 4 #1 <4>Call Trace: <4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f <4> [<ffffffffa044ceeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs] <4> [<ffffffffa0ed5998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsc k] <4> [<ffffffffa0e961c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck] <4> [<ffffffffa0e8b1c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck] <4> [<ffffffffa0e8e577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck] <4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld] <4> [<ffffffffa0e8fd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck] <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90 <4> [<ffffffffa0e9146e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck] <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20 <4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck] <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0 <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20 <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0 <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20 From the dmesg log from the crash on the second MDS looks a little different with an error from lfsck_namespace_rebuild_linkea(): <6>Lustre: *** cfs_fail_loc=1603, val=0*** <6>Lustre: *** cfs_fail_loc=1603, val=0*** <6>Lustre: Skipped 6 previous similar messages <0>LustreError: 6700:0:(lfsck_namespace.c:1932:lfsck_namespace_rebuild_linkea()) ASSERTION( !dt_object_remote(obj) ) failed: <0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_op s ) failed: <0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) LBUG <4>Pid: 6698, comm: lfsck <4> <4>Call Trace: <4> [<ffffffffa083b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] <4> [<ffffffffa083be97>] lbug_with_loc+0x47/0xb0 [libcfs] <4> [<ffffffffa09a1998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck] <4> [<ffffffffa09621c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck] <4> [<ffffffffa09571c3>] lfsck_open_dir+0xa3/0x380 [lfsck] <4> [<ffffffffa095a577>] lfsck_exec_oit+0x677/0xb80 [lfsck] <4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld] <4> [<ffffffffa095bd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck] <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90 <4> [<ffffffffa095d46e>] lfsck_master_engine+0xabe/0x1390 [lfsck] <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20 <4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck] <4> [<ffffffff8109abf6>] kthread+0x96/0xa0 <4> [<ffffffff8100c20a>] child_rip+0xa/0x20 <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0 <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20 <4> <0>Kernel panic - not syncing: LBUG <4>Pid: 6698, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64 #1 <4>Call Trace: <4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f <4> [<ffffffffa083beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs] <4> [<ffffffffa09a1998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsc k] <4> [<ffffffffa09621c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck] <4> [<ffffffffa09571c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck] <4> [<ffffffffa095a577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck] <4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld] <4> [<ffffffffa095bd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck] <4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90 <4> [<ffffffffa095d46e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck] <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20 <4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck] <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0 <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20 <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0 <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20 |
| Comments |
| Comment by Jodi Levi (Inactive) [ 13/Nov/14 ] |
|
Lai, |
| Comment by Andreas Dilger [ 13/Nov/14 ] |
|
Di, I thought there was a recent patch in this area already? |
| Comment by Di Wang [ 13/Nov/14 ] |
|
Hmm, I am not sure. Fan yong and I did discuss sth about different local and remote object creation process cause the dt_index_ops is not being assigned properly for remote object in LFSCK use cases. I am not sure he made the patch or not. For this LBUG, it seems dt_try_as_dir is missing before dt_lookup, though I did not check the code carefully. |
| Comment by Lai Siyao [ 21/Nov/14 ] |
|
I tested in a small scale for test plan 3.3.2, but couldn't reproduce. Could you verify it still occurs on latest master? and is there script for test plan to download? |
| Comment by James Nunez (Inactive) [ 05/Dec/14 ] |
|
Lai, Yes, I can still reproduce this error/crash with a recent build of lustre-master; build # 2760. |
| Comment by James Nunez (Inactive) [ 05/Dec/14 ] |
|
I've just run the script for LFSCk Phase 3 test 3.3.2 with lustre-master build 2768 and hit this LBUG again. I've uploaded vmcore, dmesg, and the script to uploads/ |
| Comment by Lai Siyao [ 09/Dec/14 ] |
#20 [ffff88102c4b7a48] lfsck_namespace_rebuild_linkea at ffffffffa0ece540 [lfsck] #21 [ffff88102c4b7ab8] lfsck_namespace_striped_dir_rescan at ffffffffa0f07690 [lfsck] #22 [ffff88102c4b7c58] lfsck_namespace_assistant_handler_p1 at ffffffffa0ed8ec3 [lfsck] #23 [ffff88102c4b7d78] lfsck_assistant_engine at ffffffffa0ec81c7 [lfsck] #24 [ffff88102c4b7ee8] kthread at ffffffff8109abf6 #25 [ffff88102c4b7f48] kernel_thread at ffffffff8100c20a backtrace of MDS2 vmcore shows it firstly asserted on this, I doubt there is race in normal operation and lfsck, so an object can be either remote or local during lfsck, and it caused assert. A glance of the code shows the locking of objects is loose, which may be the cause. Fanyong, what's your opinion? |
| Comment by nasf (Inactive) [ 09/Dec/14 ] |
|
Lai, you are right. The "LASSERT(!dt_object_remote(obj));" in lfsck_namespace_rebuild_linkea() is wrong, should be removed. But I still not sure how the "dt_lookup()) LBUG" can be triggered. |
| Comment by Gerrit Updater [ 26/Dec/14 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/13188 |
| Comment by Gerrit Updater [ 31/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13188/ |
| Comment by Jodi Levi (Inactive) [ 05/Jan/15 ] |
|
Patch landed to Master. |
| Comment by Cliff White (Inactive) [ 18/Feb/15 ] |
|
I believe we just hit this on soak testing on Lola-8 |
| Comment by Gerrit Updater [ 19/Feb/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13801 |
| Comment by Gerrit Updater [ 20/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13801/ |
| Comment by Jodi Levi (Inactive) [ 20/Feb/15 ] |
|
Patch landed to Master. |