[LU-11127] sanity-flr test_34b: @@@@@@ FAIL: can\'t put import for osc into FULL state after 40 sec, have REPLAY_WAIT Created: 07/Jul/18  Updated: 06/Aug/18  Resolved: 06/Aug/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Major
Reporter: Mikhail Pershin Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The failure rate for sanity-flr.sh test_34b is about 10%. The test always fails with the same error, shown below:

[12008.812898] Lustre: DEBUG MARKER: trevis-17vm1.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0001-osc-ffff926024c37800.ost_server_uuid 40
[12049.499056] Lustre: DEBUG MARKER: /usr/sbin/lctl mark rpc test_34b: @@@@@@ FAIL: can\'t put import for osc.lustre-OST0001-osc-ffff926024c37800.ost_server_uuid into FULL state after 40 sec, have REPLAY_WAIT
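As a rough sanity check on the two markers above, the kernel timestamps can be compared to confirm that the full 40-second wait elapsed before the failure was reported. This is a minimal Python sketch, not part of the Lustre test framework; the helper name `marker_timestamp` and the regular expression are illustrative assumptions about the dmesg-style line format.

```python
import re

# Assumption: lines follow the dmesg-style "[seconds.micros] message" layout.
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def marker_timestamp(line):
    """Return the leading kernel timestamp of a log line as a float (hypothetical helper)."""
    match = TIMESTAMP_RE.match(line)
    if match is None:
        raise ValueError("no timestamp found in line: %r" % line)
    return float(match.group(1))


# The two marker lines quoted in the description, copied verbatim.
start_line = (
    r"[12008.812898] Lustre: DEBUG MARKER: trevis-17vm1.trevis.whamcloud.com: "
    r"executing wait_import_state FULL "
    r"osc.lustre-OST0001-osc-ffff926024c37800.ost_server_uuid 40"
)
fail_line = (
    r"[12049.499056] Lustre: DEBUG MARKER: /usr/sbin/lctl mark rpc test_34b: "
    r"@@@@@@ FAIL: can\'t put import for "
    r"osc.lustre-OST0001-osc-ffff926024c37800.ost_server_uuid "
    r"into FULL state after 40 sec, have REPLAY_WAIT"
)

# Seconds between the start of the wait and the failure marker.
elapsed = marker_timestamp(fail_line) - marker_timestamp(start_line)
print("elapsed: %.3f s" % elapsed)
```

The difference is a little over 40 seconds, which is consistent with the import still being in REPLAY_WAIT when the timeout expired rather than the test aborting early.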

Examples:
https://testing.whamcloud.com/sub_tests/82aec676-7dd9-11e8-8b8a-52540065bddc
https://testing.whamcloud.com/sub_tests/08e9d360-7ead-11e8-8fe6-52540065bddc
https://testing.whamcloud.com/sub_tests/20d551f2-7f07-11e8-97ff-52540065bddc
https://testing.whamcloud.com/sub_tests/1ca17f50-7fea-11e8-b441-52540065bddc



 Comments   
Comment by James Nunez (Inactive) [ 13/Jul/18 ]

This test started failing on July 2, 2018 and is failing only in DNE testing.

Comment by Mikhail Pershin [ 22/Jul/18 ]

It is failing in a lot of runs:

Failure Rate: 22.41% of most recent 58 runs, 42 skipped (all branches)
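The percentage above can be turned back into an approximate failure count. The count is not stated in this ticket; the short Python sketch below only infers it from the two reported numbers, on the assumption that the rate is failures divided by the 58 counted runs.

```python
total_runs = 58        # "most recent 58 runs" from the report
reported_rate = 22.41  # percent, from the report

# Inferred, not reported: the whole-number failure count closest to the rate.
failures = round(total_runs * reported_rate / 100)
rate = 100 * failures / total_runs

print(failures, round(rate, 2))
```

Under that assumption the figure corresponds to 13 failed runs, roughly double the "about 10%" rate given in the description.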

Comment by Andreas Dilger [ 01/Aug/18 ]

There isn't anything in test_34b() that seems any different than test_34a(), but 34a has not had any failures.

This might be related to stopping and restarting the OSTs twice in quick succession under DNE (e.g. MDT reconnections are slow, or the previous restart has reset at_min). One option would be to increase the minimum time that _wait_osc_import_state() waits when there are multiple MDTs, to compensate.
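The suggestion in this comment can be modelled as a polling loop whose timeout grows with the number of MDTs. The Python sketch below is only an illustration of that idea: it is not the `_wait_osc_import_state()` shell function from the test framework, and it is not the change made by the patch that was eventually merged, which this ticket does not describe. The names `wait_import_state`, `make_fake_import`, `mdt_count` and the 10-second `extra_per_mdt` value are all hypothetical.

```python
import time


def wait_import_state(read_state, expected="FULL", base_timeout=40,
                      mdt_count=1, extra_per_mdt=10,
                      sleep=time.sleep, poll_interval=1):
    """Poll read_state() until it returns `expected` or the timeout expires.

    Hypothetical model: the timeout grows by extra_per_mdt seconds for
    every MDT beyond the first. Returns (reached, seconds_waited, last_state).
    """
    timeout = base_timeout + extra_per_mdt * max(mdt_count - 1, 0)
    waited = 0
    state = read_state()
    while state != expected and waited < timeout:
        sleep(poll_interval)
        waited += poll_interval
        state = read_state()
    return state == expected, waited, state


def make_fake_import(polls_in_replay_wait):
    """Return a fake state reader that reports REPLAY_WAIT for the first N polls."""
    calls = {"n": 0}

    def read_state():
        calls["n"] += 1
        return "REPLAY_WAIT" if calls["n"] <= polls_in_replay_wait else "FULL"

    return read_state


def no_sleep(seconds):
    """Stand-in for time.sleep so the simulation finishes instantly."""


# An import that needs about 45 seconds to leave REPLAY_WAIT.
single_mdt = wait_import_state(make_fake_import(45), sleep=no_sleep)
two_mdts = wait_import_state(make_fake_import(45), mdt_count=2, sleep=no_sleep)
print(single_mdt)
print(two_mdts)
```

With the default 40-second limit the simulated import is still in REPLAY_WAIT when the wait gives up, matching the failure in the description; with the timeout extended for a second MDT the same import reaches FULL after 45 seconds.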

Comment by Gerrit Updater [ 01/Aug/18 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32917
Subject: LU-11127 tests: stop running sanity-flr test 34b
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2e588819a7d1291622326ccafcf163ad4bc86492

Comment by James Nunez (Inactive) [ 01/Aug/18 ]

Uploaded patch https://review.whamcloud.com/32917 to stop running test 34b in case we need to employ these drastic measures!

Comment by Gerrit Updater [ 02/Aug/18 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/32922
Subject: LU-11127 test: sanity-flr OST not recovery fast enough
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: acb4dd8dac495e554f3b86645a65bc4de80cbe87

Comment by Gerrit Updater [ 06/Aug/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32922/
Subject: LU-11127 test: sanity-flr OST not recovery fast enough
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5c8dfe2bdf8313b6bbb2055dd22290f328d94549

Comment by Peter Jones [ 06/Aug/18 ]

Looks like we don't!
