
2.6 DNE stress testing: (lod_object.c:930:lod_declare_attr_set()) ASSERTION( lo->ldo_stripe ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0
    • Labels: None

    Description

      On the same system as LU-5204 (with OST38/0026 still not reachable from MDS1/MDT0), we hit this LBUG on MDS1 during stress testing:

      <0>LustreError: 26714:0:(lod_object.c:930:lod_declare_attr_set()) ASSERTION( lo->ldo_stripe ) failed:
      <0>LustreError: 26714:0:(lod_object.c:930:lod_declare_attr_set()) LBUG
      <4>Pid: 26714, comm: mdt02_089
      <4>
      <4>Call Trace:
      <4> [<ffffffffa0c55895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa0c55e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa15d70e0>] lod_declare_attr_set+0x600/0x660 [lod]
      <4> [<ffffffffa16338b8>] mdd_declare_object_initialize+0xa8/0x290 [mdd]
      <4> [<ffffffffa1635018>] mdd_create+0xb88/0x1870 [mdd]
      <4> [<ffffffffa1506217>] mdt_reint_create+0xcf7/0xed0 [mdt]
      <4> [<ffffffffa1500a81>] mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa14e5e93>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
      <4> [<ffffffffa14e671b>] mdt_reint+0x6b/0x120 [mdt]
      <4> [<ffffffffa103a2ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      <4> [<ffffffffa0fe9d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      <4> [<ffffffffa0fe9000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      <4> [<ffffffff8109aee6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
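
      For context, the failed check is a hard LASSERT-style assertion: when it trips, the thread dumps its stack and the node goes down (the trace above shows libcfs_debug_dumpstack and lbug_with_loc), rather than just failing the single request. Below is a minimal, self-contained sketch of that pattern; it is illustrative only, and all demo_* names are hypothetical stand-ins, not the lod_object.c source.

      /*
       * Illustrative only: a self-contained sketch of the assertion pattern
       * behind this LBUG.  All demo_* names are hypothetical; the real check
       * is LASSERT(lo->ldo_stripe) in lod_object.c:lod_declare_attr_set().
       */
      #include <stdio.h>
      #include <stdlib.h>

      struct demo_lod_object {
          int   ldo_stripenr;   /* number of stripes */
          void *ldo_stripe;     /* per-stripe sub-objects; NULL if never set up */
      };

      /* Mimics an LASSERT(): on failure, report the location and abort
       * (the real LBUG path dumps the stack and takes down the node). */
      #define demo_assert(cond)                                             \
          do {                                                              \
              if (!(cond)) {                                                \
                  fprintf(stderr, "%s:%d:%s() ASSERTION( %s ) failed\n",    \
                          __FILE__, __LINE__, __func__, #cond);             \
                  abort();                                                  \
              }                                                             \
          } while (0)

      /* A declare-attr-set step that assumes striping is always populated. */
      static int demo_declare_attr_set(struct demo_lod_object *lo)
      {
          demo_assert(lo->ldo_stripe != NULL);   /* trips if stripes not set */
          /* ... would declare the attribute update on each stripe ... */
          return 0;
      }

      int main(void)
      {
          /* A striped object whose stripe array was never initialized. */
          struct demo_lod_object lo = { .ldo_stripenr = 2, .ldo_stripe = NULL };

          return demo_declare_attr_set(&lo);
      }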

      Additionally, the following thread had been stuck for some time before the LBUG:
      <3>INFO: task mdt01_020:26426 blocked for more than 120 seconds.
      <3> Not tainted 2.6.32-431.5.1.el6.x86_64 #1
      <3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      <6>mdt01_020 D 000000000000000a 0 26426 2 0x00000000
      <4> ffff880ffa4d7af0 0000000000000046 0000000000000000 ffffffffa0c6bd75
      <4> 0000000100000000 ffffc9003aa25030 0000000000000246 0000000000000246
      <4> ffff88100aaae638 ffff880ffa4d7fd8 000000000000fbc8 ffff88100aaae638
      <4>Call Trace:
      <4> [<ffffffffa0c6bd75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
      <4> [<ffffffffa0d225db>] lu_object_find_at+0xab/0x350 [obdclass]
      <4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0d22896>] lu_object_find+0x16/0x20 [obdclass]
      <4> [<ffffffffa14e2ea6>] mdt_object_find+0x56/0x170 [mdt]
      <4> [<ffffffffa14e4d2b>] mdt_intent_policy+0x75b/0xca0 [mdt]
      <4> [<ffffffffa0f8e899>] ldlm_lock_enqueue+0x369/0x930 [ptlrpc]
      <4> [<ffffffffa0fb7d8f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
      <4> [<ffffffffa1039f02>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
      <4> [<ffffffffa103a2ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      <4> [<ffffffffa0fe9d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      <4> [<ffffffffa0fe9000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      <4> [<ffffffff8109aee6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

      In all of these instances, the thread is stuck in a rather odd spot in cfs_hash_bd_lookup_intent:
      match = intent_add ? NULL : hnode;
      hlist_for_each(ehnode, hhead) {
              if (!cfs_hash_keycmp(hs, key, ehnode))
                      continue;

      Specifically, it reports as being stuck on the cfs_hash_keycmp() line. It's not clear to me how a thread could get stuck there; I may be missing some operation it does as part of that.
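
      Note that in the hung-task trace this frame is printed with a "? " prefix, meaning the unwinder could not confirm it is part of the live call chain; the loop above only walks a single hash chain, so the actual wait is more plausibly further down the stack (e.g. in lu_object_find_at, the first unquestioned frame). For reference, here is a minimal, self-contained sketch of this kind of keyed chain scan; it is illustrative only, and the demo_* names are hypothetical, not the libcfs cfs_hash code.

      /*
       * Illustrative only: a hypothetical sketch of a keyed hash-chain scan
       * of the kind quoted above (demo_* names are not the libcfs source).
       */
      #include <stddef.h>
      #include <string.h>

      struct demo_node {
          struct demo_node *next;      /* next entry on this hash chain */
          char              key[16];   /* key stored in the node */
      };

      /* Compare a lookup key against the key embedded in a chain node. */
      static int demo_keycmp(const char *key, const struct demo_node *node)
      {
          return strcmp(key, node->key) == 0;
      }

      /* Walk one bucket's chain looking for a node with a matching key. */
      static struct demo_node *demo_bucket_lookup(struct demo_node *head,
                                                  const char *key)
      {
          struct demo_node *node;

          for (node = head; node != NULL; node = node->next) {
              if (!demo_keycmp(key, node))
                  continue;        /* key mismatch, keep walking the chain */
              return node;         /* match found */
          }
          return NULL;             /* no such key in this bucket */
      }

      int main(void)
      {
          struct demo_node b = { .next = NULL, .key = "bar" };
          struct demo_node a = { .next = &b,   .key = "foo" };

          return demo_bucket_lookup(&a, "bar") == &b ? 0 : 1;
      }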

      I'll make the dump available shortly.

          Activity


            jlevi Jodi Levi (Inactive) added a comment -

            Patch landed to Master. Please reopen ticket if there is more work needed.
            di.wang Di Wang added a comment -

            http://review.whamcloud.com/10772
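
            Purely as an illustration of the general shape such a fix can take (an assumption, not a summary of the patch in the review above), the usual alternative to a hard assertion is to (re)load the striping before use and fail the single operation instead of taking the whole MDS down:

            /*
             * Hypothetical sketch only -- NOT the change in the review above.
             * It shows one generic way to turn a hard assertion into a
             * recoverable error path.
             */
            #include <errno.h>
            #include <stddef.h>

            struct demo_lod_object {
                void *ldo_stripe;   /* NULL until striping has been loaded */
            };

            /* Stand-in for "read the layout"; may legitimately find nothing. */
            static int demo_load_striping(struct demo_lod_object *lo)
            {
                /* ... fetch layout from disk; leave ldo_stripe NULL if absent ... */
                return 0;
            }

            static int demo_declare_attr_set(struct demo_lod_object *lo)
            {
                int rc;

                if (lo->ldo_stripe == NULL) {
                    rc = demo_load_striping(lo);
                    if (rc != 0)
                        return rc;           /* propagate the load failure */
                    if (lo->ldo_stripe == NULL)
                        return -ENOENT;      /* still no striping: fail the op */
                }

                /* ... declare the attribute update on each stripe ... */
                return 0;
            }

            int main(void)
            {
                struct demo_lod_object lo = { .ldo_stripe = NULL };

                /* Returns -ENOENT instead of asserting when striping is absent. */
                return demo_declare_attr_set(&lo) == -ENOENT ? 0 : 1;
            }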
            di.wang Di Wang added a comment -

            Jodi:

            Yes, since it is an LBUG, it could probably be a blocker, or at least a critical one. But I think I know the reason; I will cook a patch soon.

            jlevi Jodi Levi (Inactive) added a comment -

            Di,
            Can you please have a look at this one and complete an initial assessment to determine if this should be a blocker for 2.6?

            paf Patrick Farrell (Inactive) added a comment -

            There was also a client which was stuck waiting on a reply from MDS001/MDT000 before the MDS crashed (there were obviously many timeouts after the crash, but this one was before it), and the times match roughly with those for the stuck thread. The stuck thread is probably a separate issue from the LBUG, but I don't want to separate them until we're further along.

            Here's the client bug information:
            At 23:33:48, MDS0 died with an LBUG. (LU-5233)

            One of the client nodes got stuck before that. This is a thread refusing to exit because it is stuck in Lustre (many other client threads were also stuck behind this one, waiting for the MDC rpc lock in mdc_close; see the sketch below):
            console-20140618:2014-06-18T23:07:16.160830-05:00 c0-0c1s4n2 <node_health:5.1> APID:1236942 (Application_Exited_Check) WARNING: Stack trace for process 13769:
            console-20140618:2014-06-18T23:07:16.261778-05:00 c0-0c1s4n2 <node_health:5.1> APID:1236942 (Application_Exited_Check) STACK:
            ptlrpc_set_wait+0x2e5/0x8c0 [ptlrpc];
            ptlrpc_queue_wait+0x8b/0x230 [ptlrpc];
            mdc_close+0x1ed/0xa50 [mdc];
            lmv_close+0x242/0x5b0 [lmv];
            ll_close_inode_openhandle+0x2fa/0x10a0 [lustre];
            ll_md_real_close+0xb0/0x210 [lustre];
            ll_file_release+0x68c/0xb60 [lustre];
            fput+0xe2/0x200;
            filp_close+0x63/0x90;
            put_files_struct+0x84/0xe0;
            exit_files+0x53/0x70;
            do_exit+0x1ec/0x990;
            do_group_exit+0x4c/0xc0;
            get_signal_to_deliver+0x243/0x490;
            do_notify_resume+0xe0/0x7f0;
            int_signal+0x12/0x17;
            0x20061a87;
            0xffffffffffffffff;

            The client is waiting for a ptlrpc reply. I strongly suspect this corresponds to the stuck thread messages on the MDS.
            Unfortunately, by the time the node was dumped, the client had given up waiting and all of the tasks had exited (and the dk log is empty), so there's no way to confirm from the client side.
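
            As background on why the other client threads pile up behind this one: the close path serializes on a per-client rpc lock around the synchronous request, so a single unanswered reply blocks every later closer. A minimal sketch of that single-flight pattern follows; it is illustrative only, and the demo_* names are hypothetical, not the mdc source.

            /*
             * Illustrative only: a hypothetical sketch of the "rpc lock"
             * serialization described above.  One thread holds a client-wide
             * mutex across a blocking request/reply; if the reply never comes,
             * every later caller queues behind it.
             */
            #include <pthread.h>
            #include <stdio.h>

            static pthread_mutex_t demo_rpc_lock = PTHREAD_MUTEX_INITIALIZER;

            /* Stand-in for sending the close RPC and waiting for the reply. */
            static int demo_send_and_wait(int handle)
            {
                printf("close(%d): waiting for reply\n", handle);
                /* ... a ptlrpc_queue_wait()-style blocking wait goes here ... */
                return 0;
            }

            /* Close path that serializes on the shared rpc lock. */
            static int demo_close(int handle)
            {
                int rc;

                pthread_mutex_lock(&demo_rpc_lock);    /* later closers block here */
                rc = demo_send_and_wait(handle);       /* first closer blocks here */
                pthread_mutex_unlock(&demo_rpc_lock);

                return rc;
            }

            int main(void)
            {
                return demo_close(42);
            }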

            The first stuck thread messages on the MDS come here:

            Jun 18 23:16:36 galaxy-esf-mds001 kernel: INFO: task mdt01_020:26426 blocked for more than 120 seconds.
            <4>Call Trace:
            <4> [<ffffffffa0c6bd75>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
            <4> [<ffffffffa0d21fc4>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
            <4> [<ffffffffa0d225db>] lu_object_find_at+0xab/0x350 [obdclass]
            <4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
            <4> [<ffffffffa0d22896>] lu_object_find+0x16/0x20 [obdclass]
            <4> [<ffffffffa14e2ea6>] mdt_object_find+0x56/0x170 [mdt]
            <4> [<ffffffffa14e4d2b>] mdt_intent_policy+0x75b/0xca0 [mdt]
            <4> [<ffffffffa0f8e899>] ldlm_lock_enqueue+0x369/0x930 [ptlrpc]
            <4> [<ffffffffa0fb7d8f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
            <4> [<ffffffffa1039f02>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
            <4> [<ffffffffa103a2ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
            <4> [<ffffffffa0fe9d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
            <4> [<ffffffffa0fe9000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
            <4> [<ffffffff8109aee6>] kthread+0x96/0xa0
            <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
            <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
            <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

            These messages are repeated up until the LBUG (always for the same task).

            The stuck thread message from the client comes at task exit, so it had already been stuck for some amount of time. The first stuck thread message on the MDS (stuck for 600 seconds) comes roughly 9 minutes after the client reports a stuck thread, so the time frames line up reasonably well.

            Without digging through data structures on the MDS I can't be sure, but it seems likely the stuck thread on the MDS is the cause of the problem on the client.


            paf Patrick Farrell (Inactive) added a comment -

            The MDS dump will be here in < 10 minutes:
            ftp.cray.com
            u: anonymous
            p: anonymous

            Then:
            cd outbound/LU-5233/
            And then the file is:
            mds001_mdt000_LU5233.tar.gz


            People

              Assignee: di.wang Di Wang
              Reporter: paf Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 3
