[LU-4458] Interop 2.5.0<->2.6 failure on test suite recovery-small test_9 Created: 08/Jan/14  Updated: 19/Jan/15  Resolved: 19/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: HB
Environment:

server: lustre-master build # 1823 RHEL6 ldiskfs
client: 2.5.0


Issue Links:
Related
is related to LU-793 Reconnections should not be refused w... Resolved
Severity: 3
Rank (Obsolete): 12221

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/3ba4558e-77f3-11e3-a6a3-52540035b04c.

The sub-test test_9 failed with the following error:

test failed to respond and timed out

Found D process on OST:

22:51:46:Lustre: DEBUG MARKER: == recovery-small test 9: pause bulk on OST (bug 1420) == 22:48:54 (1389077334)
22:51:47:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x214
22:51:47:LustreError: 2046:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 214 sleeping for 20000000ms
22:51:47:INFO: task ll_ost_io00_002:2046 blocked for more than 120 seconds.
22:51:47:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
22:51:47:ll_ost_io00_0 D 0000000000000001     0  2046      2 0x00000080
22:51:47: ffff8802f8759a60 0000000000000046 ffff8802f8759ac0 0000000016734040
22:57:44: ffffffffa0566ab0 ffff880316255389 0000004e359ea090 ffffffffa053c044
22:57:44: ffff8803167345f8 ffff8802f8759fd8 000000000000fb88 ffff8803167345f8
22:57:44:Call Trace:
22:57:44: [<ffffffff8150f3f2>] schedule_timeout+0x192/0x2e0
22:57:44: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
22:57:45: [<ffffffffa0520d0f>] __cfs_fail_timeout_set+0xcf/0x150 [libcfs]
22:57:45: [<ffffffffa0eaaec9>] cfs_fail_timeout_set.clone.2+0x29/0x30 [ptlrpc]
22:57:45: [<ffffffffa0eae94b>] tgt_brw_write+0x34b/0x1550 [ptlrpc]
22:57:45: [<ffffffffa0525921>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
22:57:45: [<ffffffffa0eb0fea>] tgt_handle_request0+0x2ea/0x1490 [ptlrpc]
22:57:45: [<ffffffffa0525921>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
22:57:45: [<ffffffffa0eb25ca>] tgt_request_handle+0x43a/0x980 [ptlrpc]
22:57:45: [<ffffffffa0e65725>] ptlrpc_main+0xd25/0x1970 [ptlrpc]
22:57:45: [<ffffffffa0e64a00>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
22:57:46: [<ffffffff81096a36>] kthread+0x96/0xa0
22:57:46: [<ffffffff8100c0ca>] child_rip+0xa/0x20
22:57:46: [<ffffffff810969a0>] ? kthread+0x0/0xa0
22:57:46: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
22:57:46:INFO: task ll_ost_io00_002:2046 blocked for more than 120 seconds.
22:57:46:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
22:57:46:ll_ost_io00_0 D 0000000000000001     0  2046      2 0x00000080
22:57:46: ffff8802f8759a60 0000000000000046 ffff8802f8759ac0 0000000016734040
22:57:46: ffffffffa0566ab0 ffff880316255389 0000004e359ea090 ffffffffa053c044
22:57:46: ffff8803167345f8 ffff8802f8759fd8 000000000000fb88 ffff8803167345f8
22:57:46:Call Trace:
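
As a reading aid for the console log above, the following minimal sketch converts the sleep reported by `__cfs_fail_timeout_set()` into seconds and compares it with the 120-second hung-task threshold that the kernel prints. The helper name `parse_fail_timeout` and the regular expression are illustrative assumptions, not part of Lustre; the log line and the threshold are copied from the output above.

```python
import re

# Log line copied verbatim from the OST console output in this ticket.
LINE = ("22:51:47:LustreError: 2046:0:(fail.c:133:__cfs_fail_timeout_set()) "
        "cfs_fail_timeout id 214 sleeping for 20000000ms")

# Taken from the "blocked for more than 120 seconds" kernel message.
HUNG_TASK_THRESHOLD_S = 120


def parse_fail_timeout(line):
    """Hypothetical helper: return (fail_id, sleep_seconds) or None.

    Assumes the id is printed in hex, which matches fail_loc=0x214
    appearing as "id 214" in the log.
    """
    m = re.search(r"cfs_fail_timeout id ([0-9a-fA-F]+) sleeping for (\d+)ms", line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2)) / 1000.0


fail_id, sleep_s = parse_fail_timeout(LINE)
print("fail id:", hex(fail_id))
print("sleep (s):", sleep_s)
print("exceeds hung-task threshold:", sleep_s > HUNG_TASK_THRESHOLD_S)
```

Read this way, the requested sleep is several orders of magnitude longer than the hung-task threshold, which is consistent with the `ll_ost_io00_002` thread being reported in D state while the test times out.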


 Comments   
Comment by Jodi Levi (Inactive) [ 15/Jan/14 ]

Mike,
Can you please have a look at this one?
Thank you!

Comment by Mikhail Pershin [ 11/Apr/14 ]

pause_bulk() was changed in 2.6, and this change now affects compatibility.

Comment by Mikhail Pershin [ 21/Apr/14 ]

The LU-793 patch to b2_5 should fix this issue.

Comment by Andreas Dilger [ 14/Nov/14 ]

This test is still failing on average once or twice a day:
https://testing.hpdd.intel.com/test_sets/654abbae-6b53-11e4-88ff-5254006e85c2
https://testing.hpdd.intel.com/test_sets/4b1792be-6b87-11e4-be53-5254006e85c2
https://testing.hpdd.intel.com/test_sets/40bc9864-6baf-11e4-88ff-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7c70a83e-6c07-11e4-909d-5254006e85c2

If these failures are a different bug, then this one should be closed and a new one opened.

Comment by Gerrit Updater [ 30/Dec/14 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13205
Subject: LU-4458 test: take fail_val into account for pause_bulk()
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: c5ae6409e1d830e8ae4445468f2f9def654c8108

Comment by Gerrit Updater [ 16/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13205/
Subject: LU-4458 test: take fail_val into account for pause_bulk()
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: ae75822728635169a1e23c5d71df3ad122dff0a3

Comment by Jodi Levi (Inactive) [ 19/Jan/15 ]

Patch landed to Master.
