[LU-5914] LFSCK: dt_lookup()) LBUG Created: 12/Nov/14  Updated: 06/Mar/15  Resolved: 20/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Blocker
Reporter: James Nunez (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: MB, lfsck, soak
Environment:

OpenSFS cluster with two MDSs with one MDT each, three OSSs with two OSTs each and three clients. Lustre master tag 2.6.90 build #2734


Severity: 3
Rank (Obsolete): 16512

 Description   

While running test 3.3.2 from the LFSCK Phase 3 test plan, both MDSs crashed during LFSCK.

Test 3.3.2 calls for creating a number of subdirectories and creating a variety of objects in each subdirectory including files, multiply linked files, and local and remote subdirectories; local and remote in terms of the MDS. The test plan calls for setting fail_loc to 1603 so that all objects created will have no linkEA. Creating the objects work with this failure injected, but the both MDSs crash when ‘lctl lfsck_start’ is called.

On the main MDS, with index 0, we call ‘lctl lfsck_start -M scratch-MDT0000 -A -c -C --reset --type namespace’. LFSCK starts and soon after the MDSs crash with the following on their consoles:

Message from syslogd@mds01-ib at Nov 12 12:44:24 ...
 kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: 

Message from syslogd@mds01-ib at Nov 12 12:44:24 ...
 kernel:LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG

From the crash dmesg log on MDS1, we see:

<6>Lustre: *** cfs_fail_loc=1603, val=0***
<6>Lustre: Skipped 97 previous similar messages
<0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: 
<0>LustreError: 14961:0:(dt_object.h:2700:dt_lookup()) LBUG
<4>Pid: 14961, comm: lfsck
<4>
<4>Call Trace:
<4> [<ffffffffa044c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa044ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0ed5998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
<4> [<ffffffffa0e961c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
<4> [<ffffffffa0e8b1c3>] lfsck_open_dir+0xa3/0x380 [lfsck]
<4> [<ffffffffa0e8e577>] lfsck_exec_oit+0x677/0xb80 [lfsck]
<4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa0e8fd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa0e9146e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 14961, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_6
4 #1
<4>Call Trace:
<4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
<4> [<ffffffffa044ceeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0ed5998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsc
k]
<4> [<ffffffffa0e961c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
<4> [<ffffffffa0e8b1c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck]
<4> [<ffffffffa0e8e577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck]
<4> [<ffffffffa0a1b04a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa0e8fd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa0e9146e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0e909b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

From the dmesg log from the crash on the second MDS looks a little different with an error from lfsck_namespace_rebuild_linkea():

<6>Lustre: *** cfs_fail_loc=1603, val=0***
<6>Lustre: *** cfs_fail_loc=1603, val=0***
<6>Lustre: Skipped 6 previous similar messages
<0>LustreError: 6700:0:(lfsck_namespace.c:1932:lfsck_namespace_rebuild_linkea())
 ASSERTION( !dt_object_remote(obj) ) failed: 
<0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_op
s ) failed: 
<0>LustreError: 6698:0:(dt_object.h:2700:dt_lookup()) LBUG
<4>Pid: 6698, comm: lfsck
<4>
<4>Call Trace:
<4> [<ffffffffa083b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa083be97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa09a1998>] lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsck]
<4> [<ffffffffa09621c5>] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
<4> [<ffffffffa09571c3>] lfsck_open_dir+0xa3/0x380 [lfsck]
<4> [<ffffffffa095a577>] lfsck_exec_oit+0x677/0xb80 [lfsck]
<4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa095bd35>] lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa095d46e>] lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 6698, comm: lfsck Not tainted 2.6.32-431.29.2.el6_lustre.gd99708b.x86_64
 #1
<4>Call Trace:
<4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
<4> [<ffffffffa083beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa09a1998>] ? lfsck_namespace_verify_stripe_slave+0x408/0xa30 [lfsc
k]
<4> [<ffffffffa09621c5>] ? lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
<4> [<ffffffffa09571c3>] ? lfsck_open_dir+0xa3/0x380 [lfsck]
<4> [<ffffffffa095a577>] ? lfsck_exec_oit+0x677/0xb80 [lfsck]
<4> [<ffffffffa035704a>] ? fld_cache_lookup+0x3a/0x1e0 [fld]
<4> [<ffffffffa095bd35>] ? lfsck_master_oit_engine+0x12b5/0x1f30 [lfsck]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa095d46e>] ? lfsck_master_engine+0xabe/0x1390 [lfsck]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa095c9b0>] ? lfsck_master_engine+0x0/0x1390 [lfsck]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20


 Comments   
Comment by Jodi Levi (Inactive) [ 13/Nov/14 ]

Lai,
Would you be able to comment on this one?
Thank you!

Comment by Andreas Dilger [ 13/Nov/14 ]

Di, I thought there was a recent patch in this area already?

Comment by Di Wang [ 13/Nov/14 ]

Hmm, I am not sure. Fan yong and I did discuss sth about different local and remote object creation process cause the dt_index_ops is not being assigned properly for remote object in LFSCK use cases. I am not sure he made the patch or not.

For this LBUG, it seems dt_try_as_dir is missing before dt_lookup, though I did not check the code carefully.

Comment by Lai Siyao [ 21/Nov/14 ]

I tested in a small scale for test plan 3.3.2, but couldn't reproduce. Could you verify it still occurs on latest master? and is there script for test plan to download?

Comment by James Nunez (Inactive) [ 05/Dec/14 ]

Lai, Yes, I can still reproduce this error/crash with a recent build of lustre-master; build # 2760.

Comment by James Nunez (Inactive) [ 05/Dec/14 ]

I've just run the script for LFSCk Phase 3 test 3.3.2 with lustre-master build 2768 and hit this LBUG again.

I've uploaded vmcore, dmesg, and the script to uploads/LU-5914 where _mds1 are logs from MDS1 and _mds2 are logs from MDS2. There were two MDSs with two MDTs each with three OSSs each with two OSTs.

Comment by Lai Siyao [ 09/Dec/14 ]
#20 [ffff88102c4b7a48] lfsck_namespace_rebuild_linkea at ffffffffa0ece540 [lfsck]
#21 [ffff88102c4b7ab8] lfsck_namespace_striped_dir_rescan at ffffffffa0f07690 [lfsck]
#22 [ffff88102c4b7c58] lfsck_namespace_assistant_handler_p1 at ffffffffa0ed8ec3 [lfsck]
#23 [ffff88102c4b7d78] lfsck_assistant_engine at ffffffffa0ec81c7 [lfsck]
#24 [ffff88102c4b7ee8] kthread at ffffffff8109abf6
#25 [ffff88102c4b7f48] kernel_thread at ffffffff8100c20a

backtrace of MDS2 vmcore shows it firstly asserted on this, I doubt there is race in normal operation and lfsck, so an object can be either remote or local during lfsck, and it caused assert. A glance of the code shows the locking of objects is loose, which may be the cause.

Fanyong, what's your opinion?

Comment by nasf (Inactive) [ 09/Dec/14 ]

Lai, you are right. The "LASSERT(!dt_object_remote(obj));" in lfsck_namespace_rebuild_linkea() is wrong, should be removed. But I still not sure how the "dt_lookup()) LBUG" can be triggered.

Comment by Gerrit Updater [ 26/Dec/14 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/13188
Subject: LU-5914 lfsck: dt_lookup() LBUG
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 01ebcf55122923e3c89ddde42766cb76eca142ef

Comment by Gerrit Updater [ 31/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13188/
Subject: LU-5914 lfsck: dt_lookup() LBUG
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4950f42fb6286f4cf5e6092593acabde76a8f093

Comment by Jodi Levi (Inactive) [ 05/Jan/15 ]

Patch landed to Master.

Comment by Cliff White (Inactive) [ 18/Feb/15 ]

I believe we just hit this on soak testing on Lola-8
LustreError: 4782:0:(dt_object.h:2700:dt_lookup()) ASSERTION( dt->do_index_ops ) failed: ^M
LustreError: 4782:0:(dt_object.h:2700:dt_lookup()) LBUG^M
Pid: 4782, comm: lfsck_namespace^M
^M
Call Trace:^M
[<ffffffffa055c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
[<ffffffffa055ce97>] lbug_with_loc+0x47/0xb0 [libcfs]^M
[<ffffffffa0fd6344>] lfsck_namespace_check_exist+0x364/0x410 [lfsck]^M
[<ffffffffa0fd9551>] lfsck_namespace_assistant_handler_p1+0xb41/0x2250 [lfsck]^M
[<ffffffff810623a9>] ? find_busiest_queue+0x69/0x150^M
[<ffffffffa0fc96e7>] lfsck_assistant_engine+0x497/0x1df0 [lfsck]^M
[<ffffffff81061d00>] ? default_wake_function+0x0/0x20^M
[<ffffffffa0fc9250>] ? lfsck_assistant_engine+0x0/0x1df0 [lfsck]^M
[<ffffffff8109abf6>] kthread+0x96/0xa0^M
[<ffffffff8100c20a>] child_rip+0xa/0x20^M
[<ffffffff8109ab60>] ? kthread+0x0/0xa0^M
[<ffffffff8100c200>] ? child_rip+0x0/0x20^M
^M
We are running 2.6.94, with some patches from Li Wei, see the lola soak page

Comment by Gerrit Updater [ 19/Feb/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13801
Subject: LU-5914 lfsck: dt_index_try before dt_lookup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5942806bc97612b8d87560e70fde05111aa1aba1

Comment by Gerrit Updater [ 20/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13801/
Subject: LU-5914 lfsck: dt_index_try before dt_lookup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5bae33a95e6330af35e9749356f31a7569ec1e09

Comment by Jodi Levi (Inactive) [ 20/Feb/15 ]

Patch landed to Master.

Generated at Sat Feb 10 01:55:36 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.