[LU-16632] sanity test_56xh: 'lfs migrate -W' too slow Created: 10/Mar/23  Updated: 14/Nov/23  Resolved: 29/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13482 add verbose stats to lfs_migrate and ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b7260618-4120-4ad7-ad92-b6aae2bb0332

test_56xh failed with the following error:

'lfs migrate -W' too slow in VM (80 > 2 * 25 2)

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/92966 - 4.18.0-348.7.1.el8_5.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/92966 - 4.18.0-348.23.1.el8_lustre.x86_64

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_56xh - 'lfs migrate -W' too slow in VM (80 > 2 * 25 2)



 Comments   
Comment by Andreas Dilger [ 10/Mar/23 ]

The "lfs migrate -W" option and corresponding test were added in LU-13482.

We can't reliably measure performance inside the VMs, so we likely need to increase the margin of error allowed for this test. It currently checks:

       (( elapsed <= size_mb * 150 / 100 )) ||
               error "'lfs migrate -W' too slow in VM ($elapsed > 2 * $size_mb 2)"

The error message claims a 2x allowance, but the check only permits 1.5x. Both the check and the message should be fixed to use "* 350 / 100", and the stray "2" at the end of the error message should be removed (the test took 80s vs 25s without throttling, i.e. 3.2x longer).
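A hedged sketch of the corrected check, following the suggestion above (variable names match the quoted test snippet; the exact patch may differ, and the error-reporting here is simplified to an echo for illustration):

```shell
#!/bin/bash
# Hypothetical sketch of the widened check: allow 3.5x the baseline
# time (350/100 in integer arithmetic) and make the message match.
check_migrate_time() {
	local elapsed=$1 size_mb=$2

	if (( elapsed <= size_mb * 350 / 100 )); then
		echo "ok"
	else
		echo "'lfs migrate -W' too slow in VM ($elapsed > 3.5 * $size_mb)"
	fi
}

check_migrate_time 80 25    # 80 <= 87, within the 3.5x margin -> ok
check_migrate_time 100 25   # 100 > 87, would report the error
```

With the values from this report (80s observed vs 25s baseline, a 3.2x slowdown), the widened 3.5x margin passes where the original 1.5x check failed.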

I wonder if this is exacerbated by the addition of many sleeps, which gives the VM more chances to reschedule this thread under contention, or by inaccuracies in the nanosecond clock that hurt the rate calculations?

Should the algorithm be designed to "catch up" in those cases (possibly exceeding the bandwidth cap if it was previously starved, or at least not going to sleep), or is the intent that it should never exceed the bandwidth limit even when previous intervals were slower than necessary (notwithstanding that its writes are already quite bursty)?

Comment by Tim Day [ 11/Mar/23 ]

Rescheduling seems more likely to me than clock inaccuracies. Looking at a handful of past runs shows that the test very reliably hits 25s for the migrate job. Not much variability.


I think it's preferable that the bandwidth cap is respected as much as possible, even if the migrate falls behind a bit. I think it makes more sense from a user perspective. I agree that the algorithm could be made less bursty in its writes.
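The strict-cap behavior preferred here could be sketched as follows. This is a hypothetical illustration, not Lustre's actual `lfs migrate` code; the function name and the MB/ms units are invented for the example:

```shell
#!/bin/bash
# Hypothetical strict-cap throttle: after each chunk, compute how long
# to sleep so that total bytes written never exceed limit * elapsed
# time. If an interval ran slow (e.g. the VM descheduled the thread),
# we do NOT catch up by bursting; the cap is respected and the copy
# simply falls behind, matching the behavior preferred above.
throttle_delay_ms() {
	local written_mb=$1 elapsed_ms=$2 limit_mb_per_s=$3
	# earliest time (ms) at which written_mb fits under the cap
	local target_ms=$(( written_mb * 1000 / limit_mb_per_s ))

	if (( target_ms > elapsed_ms )); then
		echo $(( target_ms - elapsed_ms ))  # sleep to stay under cap
	else
		echo 0  # behind schedule: proceed immediately, no burst
	fi
}

throttle_delay_ms 10 100 50   # 10MB in 100ms at a 50MB/s cap -> 100
throttle_delay_ms 10 500 50   # already behind schedule -> 0
```

The "catch up" alternative from the earlier comment would instead track the deficit and allow later intervals to exceed the instantaneous cap until the average rate recovers.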

Comment by Gerrit Updater [ 11/Mar/23 ]

"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50262
Subject: LU-16632 tests: more margin of error for 56xh
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2227fdcd8901262f09ccb669aed8fb3b4addc1a5

Comment by Gerrit Updater [ 28/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50262/
Subject: LU-16632 tests: more margin of error for sanity/56xh
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 36cbba150bce9e2890c8b462ec2ce4af2d6353a5

Comment by Peter Jones [ 29/Mar/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:28:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.