[LU-5233] 2.6 DNE stress testing: (lod_object.c:930:lod_declare_attr_set()) ASSERTION( lo->ldo_stripe ) failed Created: 19/Jun/14  Updated: 26/Jun/14  Resolved: 26/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Critical
Reporter: Patrick Farrell (Inactive) Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: HB, dne2

Issue Links:
Related
is related to LU-5204 2.6 DNE stress testing: EINVAL when a... Resolved
Severity: 3
Rank (Obsolete): 14584

 Description   

On the same system as LU-5204 (with OST38/0026 still not reachable from MDS1/MDT0), we hit this LBUG on MDS1 during stress testing:

<0>LustreError: 26714:0:(lod_object.c:930:lod_declare_attr_set()) ASSERTION( lo->ldo_stripe ) failed:
<0>LustreError: 26714:0:(lod_object.c:930:lod_declare_attr_set()) LBUG
<4>Pid: 26714, comm: mdt02_089
<4>
<4>Call Trace:
<4> [<ffffffffa0c55895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0c55e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa15d70e0>] lod_declare_attr_set+0x600/0x660 [lod]
<4> [<ffffffffa16338b8>] mdd_declare_object_initialize+0xa8/0x290 [mdd]
<4> [<ffffffffa1635018>] mdd_create+0xb88/0x1870 [mdd]
<4> [<ffffffffa1506217>] mdt_reint_create+0xcf7/0xed0 [mdt]
<4> [<ffffffffa1500a81>] mdt_reint_rec+0x41/0xe0 [mdt]
<4> [<ffffffffa14e5e93>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
<4> [<ffffffffa14e671b>] mdt_reint+0x6b/0x120 [mdt]
<4> [<ffffffffa103a2ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
<4> [<ffffffffa0fe9d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
<4> [<ffffffffa0fe9000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
<4> [<ffffffff8109aee6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
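
For context, the assertion fires because lod_declare_attr_set() was reached with lo->ldo_stripe == NULL, i.e. the object's stripe sub-objects were never set up. A minimal, hypothetical sketch of that pattern (simplified names; not the lod_object.c source and not necessarily the eventual fix):

	#include <stdio.h>
	#include <stddef.h>

	/* Hypothetical, simplified stand-in for the lod object: the per-stripe
	 * sub-objects live in an array that may never have been populated. */
	struct lod_obj_sketch {
		int    ldo_stripenr;   /* number of stripes */
		void **ldo_stripe;     /* per-stripe sub-objects; NULL here => LBUG */
	};

	int declare_attr_set_sketch(struct lod_obj_sketch *lo)
	{
		int i;

		/* Crash path: LASSERT(lo->ldo_stripe) takes the whole MDS down.
		 * A defensive alternative is to bail out (or return an error)
		 * when the stripes simply are not there. */
		if (lo->ldo_stripe == NULL || lo->ldo_stripenr == 0)
			return 0;

		for (i = 0; i < lo->ldo_stripenr; i++) {
			/* declare the attr change against each stripe sub-object */
		}
		return 0;
	}

	int main(void)
	{
		struct lod_obj_sketch lo = { .ldo_stripenr = 0, .ldo_stripe = NULL };

		/* the real code asserts here; the sketch just returns */
		printf("declare returned %d on an unstriped object\n",
		       declare_attr_set_sketch(&lo));
		return 0;
	}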

Additionally, we had the following stuck thread:
<3>INFO: task mdt01_020:26426 blocked for more than 120 seconds.
<3> Not tainted 2.6.32-431.5.1.el6.x86_64 #1
<3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>mdt01_020 D 000000000000000a 0 26426 2 0x00000000
<4> ffff880ffa4d7af0 0000000000000046 0000000000000000 ffffffffa0c6bd75
<4> 0000000100000000 ffffc9003aa25030 0000000000000246 0000000000000246
<4> ffff88100aaae638 ffff880ffa4d7fd8 000000000000fbc8 ffff88100aaae638
<4>Call Trace:
<4> [<ffffffffa0c6bd75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
<4> [<ffffffffa0d225db>] lu_object_find_at+0xab/0x350 [obdclass]
<4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0d22896>] lu_object_find+0x16/0x20 [obdclass]
<4> [<ffffffffa14e2ea6>] mdt_object_find+0x56/0x170 [mdt]
<4> [<ffffffffa14e4d2b>] mdt_intent_policy+0x75b/0xca0 [mdt]
<4> [<ffffffffa0f8e899>] ldlm_lock_enqueue+0x369/0x930 [ptlrpc]
<4> [<ffffffffa0fb7d8f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
<4> [<ffffffffa1039f02>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
<4> [<ffffffffa103a2ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
<4> [<ffffffffa0fe9d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
<4> [<ffffffffa0fe9000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
<4> [<ffffffff8109aee6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

This thread had been stuck for some time before the LBUG. In every one of these hung-task reports it is stuck in a rather odd spot in cfs_hash_bd_lookup_intent:
	match = intent_add ? NULL : hnode;
	hlist_for_each(ehnode, hhead) {
		if (!cfs_hash_keycmp(hs, key, ehnode))
			continue;

Specifically, it reports as stuck on the cfs_hash_keycmp() line. It's not clear to me how a thread could get stuck there; I may be missing some operation it performs as part of that comparison.
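
To clarify what "stuck on that line" can mean: hlist_for_each() is only a pointer-chasing for-loop over the bucket chain, so every stack sample taken while the thread walks the chain lands inside this loop body. One speculative way it could never terminate is the bucket chain being corrupted into a cycle; the stand-alone user-space sketch below (a minimal re-implementation of the hlist walk, hypothetical and not the libcfs source) shows that shape. This is only a guess, not a diagnosis.

	#include <stdio.h>

	struct hlist_node { struct hlist_node *next; };
	struct hlist_head { struct hlist_node *first; };

	/* same shape as the kernel's hlist_for_each(): walk until next is NULL */
	#define hlist_for_each(pos, head) \
		for ((pos) = (head)->first; (pos); (pos) = (pos)->next)

	int main(void)
	{
		struct hlist_node a, b;
		struct hlist_head bucket = { .first = &a };
		struct hlist_node *ehnode;
		unsigned long iters = 0;

		a.next = &b;
		b.next = &a;            /* hypothetical corruption: cycle back to 'a' */

		hlist_for_each(ehnode, &bucket) {
			if (++iters > 1000000)  /* a real thread would spin here forever */
				break;
		}
		printf("walked %lu nodes before bailing out\n", iters);
		return 0;
	}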

I'll make the dump available shortly.



 Comments   
Comment by Patrick Farrell (Inactive) [ 19/Jun/14 ]

MDS dump will be here in < 10 minutes:
ftp.cray.com
u: anonymous
p: anonymous

Then:
cd outbound/LU-5233/
And then the file is:
mds001_mdt000_LU5233.tar.gz

Comment by Patrick Farrell (Inactive) [ 19/Jun/14 ]

There was also a client stuck waiting on a reply from MDS001/MDT000 before the MDS crashed (obviously there were many timeouts after the crash, but this was before it), and the times match roughly with those of the stuck MDS thread. The stuck thread is probably a separate issue from the LBUG, but I don't want to split them apart until we're further along.

Here's the client bug information:
At 23:33:48, MDS0 died with an LBUG. (LU-5233)

One of the client nodes got stuck before that. This is a thread refusing to exit because it's stuck in Lustre (many other client threads were also stuck behind this one on the MDC rpc lock in mdc_close):
console-20140618:2014-06-18T23:07:16.160830-05:00 c0-0c1s4n2 <node_health:5.1> APID:1236942 (Application_Exited_Check) WARNING: Stack trace for process 13769:
console-20140618:2014-06-18T23:07:16.261778-05:00 c0-0c1s4n2 <node_health:5.1> APID:1236942 (Application_Exited_Check) STACK:
ptlrpc_set_wait+0x2e5/0x8c0 [ptlrpc];
ptlrpc_queue_wait+0x8b/0x230 [ptlrpc];
mdc_close+0x1ed/0xa50 [mdc];
lmv_close+0x242/0x5b0 [lmv];
ll_close_inode_openhandle+0x2fa/0x10a0 [lustre];
ll_md_real_close+0xb0/0x210 [lustre];
ll_file_release+0x68c/0xb60 [lustre];
fput+0xe2/0x200;
filp_close+0x63/0x90;
put_files_struct+0x84/0xe0;
exit_files+0x53/0x70;
do_exit+0x1ec/0x990;
do_group_exit+0x4c/0xc0;
get_signal_to_deliver+0x243/0x490;
do_notify_resume+0xe0/0x7f0;
int_signal+0x12/0x17;
0x20061a87;
0xffffffffffffffff;

The client is waiting for a ptlrpc reply. I strongly suspect this corresponds to the stuck thread messages on the MDS.
Unfortunately, by the time the node was dumped, the client had given up waiting and all of its tasks had exited (and the dk log is empty), so there's no way to confirm from the client side.
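
The shape of the hang, as I read it: mdc_close() goes through ptlrpc_queue_wait()/ptlrpc_set_wait(), i.e. it is a synchronous RPC, so the exiting task blocks until the MDS answers; if the MDS service thread that picked up the request is itself stuck, no reply is ever sent and the task cannot finish closing its files. A toy pthread sketch of that shape (purely illustrative, nothing to do with the real ptlrpc code):

	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <unistd.h>

	/* Toy illustration: the closing task waits for a reply flag that the
	 * "server" side never sets, so it blocks indefinitely. */
	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t  reply_cv = PTHREAD_COND_INITIALIZER;
	static bool reply_arrived;      /* never set: stands in for the stuck MDS thread */

	static void *client_close(void *arg)
	{
		(void)arg;
		pthread_mutex_lock(&lock);
		while (!reply_arrived)          /* analogue of waiting in ptlrpc_set_wait() */
			pthread_cond_wait(&reply_cv, &lock);
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t tid;

		pthread_create(&tid, NULL, client_close, NULL);
		sleep(2);
		printf("client thread still blocked; no reply will ever arrive\n");
		/* exit without joining: in the real system the task stays stuck
		 * until the MDS replies or the client is evicted */
		return 0;
	}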

The first stuck thread messages on the MDS come here:

Jun 18 23:16:36 galaxy-esf-mds001 kernel: INFO: task mdt01_020:26426 blocked for more than 120 seconds.
<4>Call Trace:
<4> [<ffffffffa0c6bd75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
<4> [<ffffffffa0d21fc4>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
<4> [<ffffffffa0d225db>] lu_object_find_at+0xab/0x350 [obdclass]
<4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0d22896>] lu_object_find+0x16/0x20 [obdclass]
<4> [<ffffffffa14e2ea6>] mdt_object_find+0x56/0x170 [mdt]
<4> [<ffffffffa14e4d2b>] mdt_intent_policy+0x75b/0xca0 [mdt]
<4> [<ffffffffa0f8e899>] ldlm_lock_enqueue+0x369/0x930 [ptlrpc]
<4> [<ffffffffa0fb7d8f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
<4> [<ffffffffa1039f02>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
<4> [<ffffffffa103a2ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
<4> [<ffffffffa0fe9d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
<4> [<ffffffffa0fe9000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
<4> [<ffffffff8109aee6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

These messages repeat up until the LBUG (always from the same task).

The stuck thread message from the client comes on task exit, so it had already been stuck for some time by then. The first stuck thread message on the MDS (stuck for 600 seconds) comes 9 minutes or so after the client reports a stuck thread, so the time frames line up reasonably well.

Without digging through data structures on the MDS I can't be sure, but it seems likely the stuck thread on the MDS is the cause of the problem on the client.

Comment by Jodi Levi (Inactive) [ 20/Jun/14 ]

Di,
Can you please have a look at this one and complete an initial assessment to determine if this should be a blocker for 2.6?

Comment by Di Wang [ 21/Jun/14 ]

Jodi:

Yes, since it is an LBUG it could probably be a blocker, or at least a critical one. But I think I know the reason; I will cook up a patch soon.

Comment by Di Wang [ 21/Jun/14 ]

http://review.whamcloud.com/10772

Comment by Jodi Levi (Inactive) [ 26/Jun/14 ]

Patch landed to Master. Please reopen ticket if there is more work needed.
