[LU-13138] sanity test 101d fails with 'readahead 283s > no-readahead 134s 500M' Created: 15/Jan/20 Updated: 13/Dec/23 Resolved: 12/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ppc, ubuntu | ||
| Environment: |
Ubuntu or PPC clients |
||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
sanity test_101d fails with 'readahead 283s > no-readahead 134s 500M'. The first time this test failed with this error was on 25 FEB 2019 while testing a patch for The first time this test failed with this error in non-patch review testing was on 02 MARCH 2019 for Ubuntu 18.04 client testing at https://testing.whamcloud.com/test_sets/02946220-3e27-11e9-a256-52540065bddc. And again for Ubuntu 18.04 client testing on 05 MAY 2019 at https://testing.whamcloud.com/test_sets/9cbca45c-705e-11e9-bd0e-52540065bddc. Since that time, we’ve seen about 20 failures in branch testing for Ubuntu and PPC testing. There is one case where this test fails for RHEL clients in branch testing where the dd time output does not match what the test is reporting; see the suite_log for the failure at https://testing.whamcloud.com/test_sets/55424ea8-34d8-11ea-80b4-52540065bddc. The test reports that reading the 500M file with read ahead turned off took about 134 seconds and the dd output confirms this Disable read-ahead Reading the test file /mnt/lustre/f101d.sanity with read-ahead disabled 500+0 records in 500+0 records out 524288000 bytes (524 MB) copied, 133.321 s, 3.9 MB/s Cancel LRU locks on lustre client to flush the client cache Next we turn on read ahead and read the same file. In this case, dd reports that the read took about 17 seconds, but the test reports that the read took 283 seconds Enable read-ahead with 40MB 0 Reading the test file /mnt/lustre/f101d.sanity with read-ahead enabled 500+0 records in 500+0 records out 524288000 bytes (524 MB) copied, 16.516 s, 31.7 MB/s read-ahead disabled time read 134 read-ahead enabled time read 283 Looking at do_and_time() in test-framework.sh, we see 6834 do_and_time () {
6835 local cmd=$1
6836 local rc
6837
6838 SECONDS=0
6839 eval '$cmd'
6840
6841 [ ${PIPESTATUS[0]} -eq 0 ] || rc=1
6842
6843 echo $SECONDS
6844 return $rc
6845 }
The way this function is used, the return code of do_and_time() is saved as the time dd takes. Although I don’t think this is the issue, an improvement would be for do_and_time() to only echo/return one value. |
| Comments |
| Comment by Andreas Dilger [ 29/Apr/20 ] |
|
Also a problem is that "$SECONDS" is a global value in the shell, and should not be changed. do_and_time() should just read "$SECONDS" before the test starts, and then again after it is finished, and return the difference. That said, this test would probably be better to parse the output of "dd" to get the actual bandwidth, since the 1s granularity of $SECONDS is probably not accurate enough, as can be seen with the following test failure: Reading the test file /mnt/lustre/f101d.sanity with read-ahead disabled 20480+0 records in 20480+0 records out 83886080 bytes (84 MB, 80 MiB) copied, 5.52689 s, 15.2 MB/s Cancel LRU locks on lustre client to flush the client cache Enable read-ahead with 40MB Reading the test file /mnt/lustre/f101d.sanity with read-ahead enabled 20480+0 records in 20480+0 records out 83886080 bytes (84 MB, 80 MiB) copied, 4.09317 s, 20.5 MB/s read-ahead disabled time read 5 read-ahead enabled time read 5 sanity test_101d: @@@@@@ FAIL: readahead 5s > no-readahead 5s 80M |
| Comment by Gerrit Updater [ 30/Apr/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38427 |
| Comment by Gerrit Updater [ 07/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38427/ |
| Comment by Yang Sheng [ 19/Jun/20 ] |
|
Other instance: |
| Comment by Andreas Dilger [ 04/Jan/21 ] |
|
+1 on Ubuntu: |
| Comment by Andreas Dilger [ 09/Jan/21 ] |
|
+1 on Ubuntu https://testing.whamcloud.com/test_sets/93206725-9b65-4264-b6ae-87ac957f5c20 |
| Comment by James Nunez (Inactive) [ 29/Jan/21 ] |
|
RHEL8.3 server/client failure at https://testing.whamcloud.com/test_sets/0af23296-c754-4951-8e2e-40383febe41b |
| Comment by Li Xi [ 05/Mar/21 ] |
|
+1 on ES5.2 https://testing.whamcloud.com/test_sets/59e0506b-8004-473f-aa44-753b27cc82f5 |
| Comment by Qian Yingjin [ 01/Apr/21 ] |
|
+1 on ES6 https://testing.whamcloud.com/test_sets/52982bf3-3774-4a63-a141-0ed0bfdb86e2 |
| Comment by James A Simmons [ 12/Jan/22 ] |
|
My testing on Ubuntu20.04 HWE shows double the performance when read ahead is enabled I see the same behaviour for RHEL7 as well. Thanks to the work of Wang Shilong. We can close this ticket but it can be reopened if someone reports this problem again. |
| Comment by Gerrit Updater [ 21/Nov/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53188 |
| Comment by Gerrit Updater [ 13/Dec/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53188/ |