[LU-10670] sanity-flr test 43 timeout
Created: 15/Feb/18
Updated: 17/Jul/18
Resolved: 27/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug
Priority: Blocker
Reporter: Mikhail Pershin
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10250 replay-single test_74: hang and time... Open
is related to LU-10448 policy to pick a primary for mirrored... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

https://testing.hpdd.intel.com/test_sets/713fb70e-119d-11e8-a6ad-52540065bddc

It fails very often:

Error: 'Timeout occurred after 227 mins, last suite running was sanity-flr, restarting cluster to continue tests' 
Failure Rate: 41.18% of most recent 17 runs, 22 skipped (all branches)

On a client:

[10077.749514] Lustre: DEBUG MARKER: == sanity-flr test 43: mirror pick on write ========================================================== 12:14:55 (1518610495)
[10320.098013] INFO: task dd:23892 blocked for more than 120 seconds.
[10320.114074] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10320.116709] dd              D ffff88007b96dee0     0 23892  23675 0x00000080
[10320.119330] Call Trace:
[10320.125475]  [<ffffffff810c6632>] ? default_wake_function+0x12/0x20
[10320.150782]  [<ffffffff810bc2d8>] ? __wake_up_common+0x58/0x90
[10320.154162]  [<ffffffff816ab8a9>] schedule+0x29/0x70
[10320.170306]  [<ffffffff816a92b9>] schedule_timeout+0x239/0x2c0
[10320.176336]  [<ffffffffc09f5e88>] ? ptlrpc_set_add_new_req+0xd8/0x150 [ptlrpc]
[10320.178829]  [<ffffffffc0bd50c0>] ? osc_io_ladvise_end+0x50/0x50 [osc]
[10320.181237]  [<ffffffffc0a25ffb>] ? ptlrpcd_add_req+0x22b/0x300 [ptlrpc]
[10320.183701]  [<ffffffffc09fbe99>] ? ptlrpc_request_bufs_pack+0x1d9/0x480 [ptlrpc]
[10320.186106]  [<ffffffff816abc5d>] wait_for_completion+0xfd/0x140
[10320.188437]  [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[10320.190651]  [<ffffffffc0bd5284>] osc_io_setattr_end+0xc4/0x180 [osc]
[10320.192955]  [<ffffffffc0bd63d0>] ? osc_io_setattr_start+0x260/0x700 [osc]
[10320.195231]  [<ffffffffc0c28490>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[10320.197659]  [<ffffffffc0832e8d>] cl_io_end+0x5d/0x150 [obdclass]
[10320.199802]  [<ffffffffc0c2856b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[10320.202033]  [<ffffffffc0c28bc5>] lov_io_call.isra.5+0x85/0x140 [lov]
[10320.204170]  [<ffffffffc0c28cb6>] lov_io_end+0x36/0xb0 [lov]
[10320.206291]  [<ffffffffc0832e8d>] cl_io_end+0x5d/0x150 [obdclass]
[10320.208353]  [<ffffffffc083551f>] cl_io_loop+0x13f/0xc70 [obdclass]
[10320.210509]  [<ffffffffc0cd1460>] cl_setattr_ost+0x250/0x3c0 [lustre]
[10320.212550]  [<ffffffffc0cab495>] ll_setattr_raw+0x1165/0x1270 [lustre]
[10320.214631]  [<ffffffffc0cab60c>] ll_setattr+0x6c/0xd0 [lustre]
[10320.217542]  [<ffffffff81220fc1>] notify_change+0x2c1/0x420
[10320.228621]  [<ffffffff812b45b6>] ? security_inode_need_killpriv+0x16/0x20
[10320.230605]  [<ffffffff81200ad5>] do_truncate+0x75/0xc0
[10320.232485]  [<ffffffff81211d97>] do_last+0x627/0x12c0
[10320.234244]  [<ffffffff81212af2>] path_openat+0xc2/0x490
[10320.236065]  [<ffffffff811af746>] ? do_read_fault.isra.44+0xe6/0x130
[10320.237871]  [<ffffffff8121508b>] do_filp_open+0x4b/0xb0
[10320.239642]  [<ffffffff8122233a>] ? __alloc_fd+0x8a/0x130
[10320.241313]  [<ffffffff81201bc3>] do_sys_open+0xf3/0x1f0
[10320.243068]  [<ffffffff816b8945>] ? system_call_after_swapgs+0x172/0x214
[10320.244820]  [<ffffffff81201cde>] SyS_open+0x1e/0x20
[10320.246469]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[10320.248096]  [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
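
In the trace above, frames prefixed with "?" are ones the kernel's stack unwinder marks as unreliable. Reading only the unmarked frames gives the blocked path: dd opens the file with truncation (do_truncate), which goes through ll_setattr and cl_setattr_ost down to osc_io_setattr_end, where it sits in wait_for_completion. A minimal sketch for pulling those frames out of a pasted trace follows; the helper name reliable_frames and the regular expression are illustrative assumptions, not part of Lustre or its test framework.

```python
import re

# Matches one call-trace line such as:
#   [10320.190651]  [<ffffffffc0bd5284>] osc_io_setattr_end+0xc4/0x180 [osc]
# Group 1 captures the optional "?" marker, group 2 the symbol,
# group 3 the optional module name in trailing brackets.
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def reliable_frames(trace):
    """Return (symbol, module) for every frame not marked with '?'.

    Hypothetical helper for reading the log; frames without a module
    suffix are reported as belonging to "kernel".
    """
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.search(line)
        if m and m.group(1) is None:
            frames.append((m.group(2), m.group(3) or "kernel"))
    return frames


# A few lines copied from the trace in this ticket.
SAMPLE = """\
[10320.186106]  [<ffffffff816abc5d>] wait_for_completion+0xfd/0x140
[10320.188437]  [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[10320.190651]  [<ffffffffc0bd5284>] osc_io_setattr_end+0xc4/0x180 [osc]
[10320.192955]  [<ffffffffc0bd63d0>] ? osc_io_setattr_start+0x260/0x700 [osc]
[10320.230605]  [<ffffffff81200ad5>] do_truncate+0x75/0xc0
"""

if __name__ == "__main__":
    for symbol, module in reliable_frames(SAMPLE):
        print(f"{symbol} [{module}]")
```

Run on the full trace, this should leave the chain from system_call_fastpath up to schedule with the "?" entries dropped, which makes the setattr wait easier to spot.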


Comments
Comment by Jian Yu [ 15/Feb/18 ]

This is a regression introduced by patch https://review.whamcloud.com/30711 for LU-10448. The failure occurred 8 times in one day and is now affecting patch review testing on the master branch.

Comment by Gerrit Updater [ 15/Feb/18 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/31315
Subject: LU-10670 test: make sanity-flr test_43 more reliable
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cc39d8716708fdec6656aa46bb9a36fe8e770c98

Comment by Bob Glossman (Inactive) [ 21/Feb/18 ]

Another on master:
https://testing.hpdd.intel.com/test_sets/4ff42be0-16cd-11e8-bd00-52540065bddc

Comment by Patrick Farrell (Inactive) [ 21/Feb/18 ]

Another on master:
https://testing.hpdd.intel.com/test_sessions/c96f08e7-6ef5-4ac4-bc20-dd5f9ef8010a

Comment by Gerrit Updater [ 27/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31315/
Subject: LU-10670 test: make sanity-flr test_43 more reliable
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9af57d0bdce9949dc3fe91817263758b57efbe9b

Comment by Peter Jones [ 27/Feb/18 ]

Landed for 2.11

Comment by Patrick Farrell (Inactive) [ 02/Mar/18 ]

Looks like a hit with this fix in:
https://testing.hpdd.intel.com/test_sessions/0ecb0a78-c3ff-423c-8fdd-0255a7b3a203

Comment by Zhenyu Xu [ 02/Mar/18 ]

The build's parent is e528677e1630093362394ae36d725c321d0da4f2, which does not have this fix.

Comment by Patrick Farrell (Inactive) [ 02/Mar/18 ]

Ah, OK, I will ask him to rebase.
