[LU-16632] sanity test_56xh: 'lfs migrate -W' too slow Created: 10/Mar/23 Updated: 14/Nov/23 Resolved: 29/Mar/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for S Buisson <sbuisson@ddn.com>. This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b7260618-4120-4ad7-ad92-b6aae2bb0332 test_56xh failed with the following error: 'lfs migrate -W' too slow in VM (80 > 2 * 25 2) |
| Comments |
| Comment by Andreas Dilger [ 10/Mar/23 ] |
|
The "lfs migrate -W" option and corresponding test were added in a recent patch. We can't reliably measure performance inside the VMs, so we likely need to increase the margin of error allowed for this test. It currently checks:
(( elapsed <= size_mb * 150 / 100 )) ||
error "'lfs migrate -W' too slow in VM ($elapsed > 2 * $size_mb 2)"
The error message says the threshold is 2x the baseline time, but the check only allows 1.5x. The check and message should be fixed to use "* 350 / 100", along with removing the stray "2" at the end of the error message (the test took 80s vs. 25s without the throttle, which is 3.2x longer). I wonder if this is exacerbated by the addition of many sleeps, which gives the VM more chances to reschedule this thread under contention, or possibly by inaccuracies in the nanosecond clock that hurt the rate calculations. I also wonder whether the algorithm should be designed to "catch up" in those cases (possibly exceeding the bandwidth cap if it was previously starved, or at least not going to sleep in those cases), or whether the intent is that it should never exceed the bandwidth limit even if previous intervals were slower than necessary (notwithstanding the fact that it is very bursty with its writes)? |
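A minimal sketch of the corrected check described above, as a standalone function (the function name and PASS/FAIL output are illustrative, not the actual sanity.sh code). It widens the margin to 3.5x the unthrottled time and makes the error message agree with the multiplier actually tested:

```shell
#!/bin/bash
# Hypothetical rewrite of the test_56xh timing check: allow up to
# 3.5x the unthrottled migrate time before failing, wide enough to
# absorb VM scheduling noise, and print the same multiplier that
# the arithmetic actually uses.
check_migrate_time() {
	local elapsed=$1 size_mb=$2

	# 350 / 100 == 3.5x margin (integer arithmetic, truncates down)
	if (( elapsed <= size_mb * 350 / 100 )); then
		echo "PASS"
	else
		echo "FAIL: 'lfs migrate -W' too slow in VM ($elapsed > 3.5 * $size_mb)"
	fi
}

check_migrate_time 80 25   # 80 <= 87, so this passes with the wider margin
check_migrate_time 90 25   # 90 > 87, so this fails
```

With the original "* 150 / 100" margin the observed 80s run fails against a 37s limit; with 3.5x the limit becomes 87s, so the same run passes while a genuinely stuck migrate still trips the check.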
| Comment by Tim Day [ 11/Mar/23 ] |
|
Rescheduling seems more likely to me than clock inaccuracies. Looking at a handful of past runs shows that the test very reliably hits 25s for the migrate job. Not much variability.
I think it's preferable that the bandwidth cap is respected as much as possible, even if the migrate falls behind a bit. I think it makes more sense from a user perspective. I agree that the algorithm could be made less bursty in its writes. |
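The "respect the cap, don't burst to catch up" behavior described above can be sketched as a pacing loop (hypothetical illustration, not the actual lfs implementation; `bw_mbs`, `chunk_mb`, and `write_chunk` are assumed names). Each chunk is scheduled against an absolute deadline derived from the bandwidth cap; if the process was starved and is already past its deadline, it simply skips the sleep instead of writing faster, so the rate never exceeds the cap:

```shell
#!/bin/bash
# Pacing loop that never exceeds a bandwidth cap, even after being
# starved by the scheduler. All names here are illustrative.
bw_mbs=10        # assumed bandwidth cap, MiB/s
chunk_mb=1       # assumed write granularity, MiB
written=0
write_chunk() { :; }   # stand-in for the actual chunk I/O

start=$(date +%s%N)
for i in 1 2 3 4 5; do
	write_chunk
	written=$((written + chunk_mb))

	# Absolute time by which $written MiB *should* be complete.
	deadline=$((start + written * 1000000000 / bw_mbs))
	now=$(date +%s%N)

	# Ahead of schedule: sleep until the deadline. Behind schedule
	# (e.g. the VM rescheduled us): just continue, no catch-up burst.
	if (( now < deadline )); then
		sleep $(awk "BEGIN{printf \"%.3f\", ($deadline - $now) / 1e9}")
	fi
done
elapsed_ms=$(( ($(date +%s%N) - start) / 1000000 ))
echo "wrote ${written}MiB in ${elapsed_ms}ms"
```

Because the deadline is absolute rather than per-interval, a slow interval is never "paid back" with a faster one, which matches the preference stated above and also smooths out the burstiness.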
| Comment by Gerrit Updater [ 11/Mar/23 ] |
|
"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50262 |
| Comment by Gerrit Updater [ 28/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50262/ |
| Comment by Peter Jones [ 29/Mar/23 ] |
|
Landed for 2.16 |