[LU-13167] sanity-flr test_33b: read mirror too slow without ost2, from 1 to 4 Created: 22/Jan/20  Updated: 06/Jul/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.5, Lustre 2.12.6, Lustre 2.12.9
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Jian Yu
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for liuying <emoly.liu@intel.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/69b56d06-3c3b-11ea-b0f4-52540065bddc

test_33b failed with the following error:

onyx-66vm8: == rpc test complete, duration -o sec ================================================================ 07:30:01 (1579591801)
onyx-66vm8: onyx-66vm8.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-OST0001.recovery_status 1475
onyx-66vm8: *.lustre-OST0001.recovery_status status: COMPLETE
 sanity-flr test_33b: @@@@@@ FAIL: read mirror too slow without ost2, from 1 to 4 

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-flr test_33b - read mirror too slow without ost2, from 1 to 4



 Comments   
Comment by Jian Yu [ 30/Jan/20 ]

The failure also occurred on Lustre b2_12 branch:
https://testing.whamcloud.com/test_sets/d96923d8-4341-11ea-86b2-52540065bddc

Comment by Artem Blagodarenko (Inactive) [ 22/Jan/22 ]

+1 https://testing.whamcloud.com/test_sets/e8b442a3-09c2-45a4-9860-348fc1ea9015

Comment by Andreas Dilger [ 21/Apr/23 ]

This test fails intermittently, but steadily over time. I suspect, as with many tests that measure performance while running in a VM, that variability in system performance may be causing this. However, I also don't want to disable the test completely or ignore all of its errors (with error_not_in_vm), because we very rarely run these tests on real hardware, so there would be no way to detect if the test started failing regularly.

Ideally, the test framework could monitor the failure rate (e.g. 1/30 failures on a branch) and consider it a real failure only if a patch failed more often than that. It might be possible to work something into test-framework.sh that takes "error_not_in_vm" with some failure rate 1/R: if the subtest failed, it would automatically set ONLY_REPEAT for the current test and re-run it R times to ensure it wasn't failing more frequently. That would still have some chance of spurious failures or missed issues, but it would usually avoid the need for gratuitous test restarts, and would be more likely to catch changes in intermittent failures than developers are...

Generated at Sat Feb 10 02:58:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.