[LU-13138] sanity test 101d fails with 'readahead 283s > no-readahead 134s 500M' Created: 15/Jan/20  Updated: 13/Dec/23  Resolved: 12/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: ppc, ubuntu
Environment:

Ubuntu or PPC clients


Issue Links:
Duplicate
is duplicated by LU-13698 sanity test_101d: readahead 5.23576s ... Resolved
Related
is related to LU-12748 parallel readahead needs to be optimi... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test_101d fails with 'readahead 283s > no-readahead 134s 500M'. The first time this test failed with this error was on 25 FEB 2019 while testing a patch for LU-11157 with the first failure at https://testing.whamcloud.com/test_sets/9377a2ee-3967-11e9-b88b-52540065bddc. These failures for this patch are a mix of RHEL and other client testing and look like they are different from the branch testing failures.

The first time this test failed with this error in non-patch review testing was on 02 MARCH 2019 for Ubuntu 18.04 client testing at https://testing.whamcloud.com/test_sets/02946220-3e27-11e9-a256-52540065bddc. And again for Ubuntu 18.04 client testing on 05 MAY 2019 at https://testing.whamcloud.com/test_sets/9cbca45c-705e-11e9-bd0e-52540065bddc. Since that time, we’ve seen about 20 failures in branch testing for Ubuntu and PPC testing.

There is one case where this test fails for RHEL clients in branch testing where the dd time output does not match what the test is reporting; see the suite_log for the failure at https://testing.whamcloud.com/test_sets/55424ea8-34d8-11ea-80b4-52540065bddc. The test reports that reading the 500M file with read ahead turned off took about 134 seconds and the dd output confirms this

Disable read-ahead
Reading the test file /mnt/lustre/f101d.sanity with read-ahead disabled
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 133.321 s, 3.9 MB/s
Cancel LRU locks on lustre client to flush the client cache

Next we turn on read ahead and read the same file. In this case, dd reports that the read took about 17 seconds, but the test reports that the read took 283 seconds

Enable read-ahead with 40MB
0
Reading the test file /mnt/lustre/f101d.sanity with read-ahead enabled
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 16.516 s, 31.7 MB/s
read-ahead disabled time read 134
read-ahead enabled  time read 283

Looking at do_and_time() in test-framework.sh, we see

6834 do_and_time () {
6835     local cmd=$1
6836     local rc
6837 
6838     SECONDS=0
6839     eval '$cmd'
6840 
6841     [ ${PIPESTATUS[0]} -eq 0 ] || rc=1
6842 
6843     echo $SECONDS
6844     return $rc
6845 }

The way this function is used, the return code of do_and_time() is saved as the time dd takes. Although I don’t think this is the issue, an improvement would be for do_and_time() to only echo/return one value.



 Comments   
Comment by Andreas Dilger [ 29/Apr/20 ]

Also a problem is that "$SECONDS" is a global value in the shell, and should not be changed. do_and_time() should just read "$SECONDS" before the test starts, and then again after it is finished, and return the difference.

That said, this test would probably be better to parse the output of "dd" to get the actual bandwidth, since the 1s granularity of $SECONDS is probably not accurate enough, as can be seen with the following test failure:
https://testing.whamcloud.com/test_sets/3366f64a-d49b-4851-8f07-fdffba254fc1

Reading the test file /mnt/lustre/f101d.sanity with read-ahead disabled
20480+0 records in
20480+0 records out
83886080 bytes (84 MB, 80 MiB) copied, 5.52689 s, 15.2 MB/s
Cancel LRU locks on lustre client to flush the client cache
Enable read-ahead with 40MB
Reading the test file /mnt/lustre/f101d.sanity with read-ahead enabled
20480+0 records in
20480+0 records out
83886080 bytes (84 MB, 80 MiB) copied, 4.09317 s, 20.5 MB/s
read-ahead disabled time read 5
read-ahead enabled time read 5
 sanity test_101d: @@@@@@ FAIL: readahead 5s > no-readahead 5s 80M
Comment by Gerrit Updater [ 30/Apr/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38427
Subject: LU-13138 tests: measure 'dd' time more accurately
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6a996fa0585de83cca8a8fd3f3c9b86545af17e4

Comment by Gerrit Updater [ 07/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38427/
Subject: LU-13138 tests: measure 'dd' time more accurately
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 43ebfad490a02613ada66c9ac79e80d1d9feaac5

Comment by Yang Sheng [ 19/Jun/20 ]

Other instance:
https://testing.whamcloud.com/test_sessions/43fe4d5d-90df-4962-985e-604656b4bb1b

Comment by Andreas Dilger [ 04/Jan/21 ]

+1 on Ubuntu:
https://testing.whamcloud.com/test_sets/9e3458e2-41d5-49f5-b7ee-277b5132492d

Comment by Andreas Dilger [ 09/Jan/21 ]

+1 on Ubuntu https://testing.whamcloud.com/test_sets/93206725-9b65-4264-b6ae-87ac957f5c20

Comment by James Nunez (Inactive) [ 29/Jan/21 ]

RHEL8.3 server/client failure at https://testing.whamcloud.com/test_sets/0af23296-c754-4951-8e2e-40383febe41b

Comment by Li Xi [ 05/Mar/21 ]

+1 on ES5.2 https://testing.whamcloud.com/test_sets/59e0506b-8004-473f-aa44-753b27cc82f5

Comment by Qian Yingjin [ 01/Apr/21 ]

+1 on ES6 https://testing.whamcloud.com/test_sets/52982bf3-3774-4a63-a141-0ed0bfdb86e2

Comment by James A Simmons [ 12/Jan/22 ]

My testing on Ubuntu20.04 HWE shows double the performance when read ahead is enabled I see the same behaviour for RHEL7 as well. Thanks to the work of Wang Shilong. We can close this ticket but it can be reopened if someone reports this problem again.

Comment by Gerrit Updater [ 21/Nov/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53188
Subject: LU-13138 tests: improve sanity/101d 'dd' parsing
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b32d4a2900e3b7ca7366f8b0f4583e4440152612

Comment by Gerrit Updater [ 13/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53188/
Subject: LU-13138 tests: improve sanity/101d 'dd' parsing
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2d9b57dfc45d4ee247363f89e69021827916edee

Generated at Sat Feb 10 02:58:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.