[LU-9069] ZFS sanity: FAIL: test_255a Speedup with willread is less than XX%, got YY% Created: 31/Jan/17  Updated: 01/Dec/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Casper Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: always_except, zfs
Environment:

onyx-64 thru 67
Full Group test, RHEL7.3/ZFS
master branch, b3499


Attachments: Zip Archive sanity_2017-01-31_2138.zip    
Issue Links:
Related
is related to LU-6254 Fix OFD/OSD prefetch for osd-ldiskfs ... Open
is related to LU-10314 implement ladvise dontneed for ZFS Open
is related to LU-8565 sanity test 255a fails with ‘Speedup ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity: FAIL: test_255a Speedup with willread is less than XX%, got YY%

The test expects reads with the willread hint to show a speedup over uncached reads comparable to cached reads, but the willread speedup falls well short (a simplified sketch of the measurement appears below):

From test_log:

Iter 1/10: cache speedup: 228%
Iter 1/10: ladvise speedup: 24%

Iter 2/10: cache speedup: 147%
Iter 2/10: ladvise speedup: 20%

Iter 3/10: cache speedup: 170%
Iter 3/10: ladvise speedup: 10%

Related to LU-8565, but that issue was attributed to VM slowness. This issue is also seen on physical machines.

Info required for matching: sanity 255a
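
For context, here is a simplified, client-side sketch of what the test measures. It is not the actual sanity.sh test_255a code: the real test also drops caches on each OSS and uses the test-framework helpers, and the file path and size below are placeholders.

# Time an uncached read, a repeat (cached) read, and a read primed with the
# willread hint, then report the speedups the way the test log does.
FILE=${FILE:-/mnt/lustre/f255a.tmp}       # placeholder path
SIZE_MB=100

dd if=/dev/urandom of=$FILE bs=1M count=$SIZE_MB conv=fsync

drop_caches() {
    sync
    echo 3 > /proc/sys/vm/drop_caches     # drops the Linux page cache only (run as root)
}

timed_read() {
    local start=$(date +%s.%N)
    dd if=$FILE of=/dev/null bs=1M 2>/dev/null
    echo "$(date +%s.%N) - $start" | bc
}

drop_caches
t_uncached=$(timed_read)                  # cold read
t_cached=$(timed_read)                    # same data, now cached

drop_caches
lfs ladvise -a willread -s 0 -l $((SIZE_MB * 1048576)) $FILE   # ask the OSTs to prefetch
t_willread=$(timed_read)

echo "cache speedup:   $(echo "100 * ($t_uncached - $t_cached) / $t_cached" | bc)%"
echo "ladvise speedup: $(echo "100 * ($t_uncached - $t_willread) / $t_willread" | bc)%"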



 Comments   
Comment by Andreas Dilger [ 03/Mar/17 ]

It turns out that all of these real-world test failures occur only with ZFS backends. That implies a problem with ladvise not prefetching the data properly for ZFS, since cached reads run much faster but ladvise reads sometimes run no faster at all with ZFS.

Looking at the output more closely, it appears that the cases where ZFS passes do so only because the cached speedup is below the threshold, so the willread check is skipped:
https://testing.hpdd.intel.com/sub_tests/64172ba6-f9df-11e6-8b38-5254006e85c2

Iter 1/10: reading without willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 1/10: Reading again without willread hint
Iter 1/10: reading with willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 1/10: cache speedup:  6.79256/6.53006=4%
Iter 1/10: ladvise speedup:  6.59651/6.53006=1%
Iter 2/10: reading without willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 2/10: Reading again without willread hint
Iter 2/10: reading with willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 2/10: cache speedup:  7.24601/6.77954=6%
Iter 2/10: ladvise speedup:  6.651/6.77954=-1%
Iter 3/10: reading without willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 3/10: Reading again without willread hint
Iter 3/10: reading with willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 3/10: cache speedup:  6.82836/6.4925=5%
Iter 3/10: ladvise speedup:  6.39501/6.4925=-1%
Iter 4/10: reading without willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 4/10: Reading again without willread hint
Iter 4/10: reading with willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 4/10: cache speedup:  6.80656/6.68438=1%
Iter 4/10: ladvise speedup:  6.52108/6.68438=-2%
Iter 5/10: reading without willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 5/10: Reading again without willread hint
Iter 5/10: reading with willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 5/10: cache speedup:  7.23196/6.44395=12%
Iter 5/10: ladvise speedup:  6.67935/6.44395=3%
Iter 6/10: reading without willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 6/10: Reading again without willread hint
Iter 6/10: reading with willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 6/10: cache speedup:  6.90188/6.73113=2%
Iter 6/10: ladvise speedup:  6.61056/6.73113=-1%
Iter 7/10: reading without willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 7/10: Reading again without willread hint
Iter 7/10: reading with willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 7/10: cache speedup:  6.93889/6.67545=3%
Iter 7/10: ladvise speedup:  6.4859/6.67545=-2%
Iter 8/10: reading without willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 8/10: Reading again without willread hint
Iter 8/10: reading with willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 8/10: cache speedup:  7.03663/6.86692=2%
Iter 8/10: ladvise speedup:  6.60757/6.86692=-3%
Iter 9/10: reading without willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 9/10: Reading again without willread hint
Iter 9/10: reading with willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 9/10: cache speedup:  7.18227/6.64533=8%
Iter 9/10: ladvise speedup:  6.56245/6.64533=-1%
Iter 10/10: reading without willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 10/10: Reading again without willread hint
Iter 10/10: reading with willread hint
CMD: trevis-34vm4 echo 3 > /proc/sys/vm/drop_caches
Iter 10/10: cache speedup:  6.96441/6.69985=3%
Iter 10/10: ladvise speedup:  6.46659/6.69985=-3%
Speedup with cache is less than 20% (4%), skipping check of speedup with willread:  -1%

So it looks like the real problem is that "echo 3 > /proc/sys/vm/drop_caches" doesn't actually flush the cache on ZFS servers (the ARC is separate from the Linux page cache), so the "uncached speed" measurement is invalid.
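
This is easy to confirm on an OSS: a quick check of the ARC size before and after drop_caches, assuming the usual ZFS-on-Linux arcstats location.

# Compare the ARC size before and after dropping the Linux page cache.
# If "size" barely changes, the file data is still cached server-side and
# the "uncached" timing in test_255a is really a warm-cache timing.
awk '$1 == "size" {print "ARC size before:", $3}' /proc/spl/kstat/zfs/arcstats
sync
echo 3 > /proc/sys/vm/drop_caches
awk '$1 == "size" {print "ARC size after: ", $3}' /proc/spl/kstat/zfs/arcstats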

What is needed is to implement the "dontneed" ladvise advice for ZFS so that the test can use it to flush this file from the server ARC cache and run a valid test.
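
Once that is in place, the test could evict just this file from the server cache instead of running drop_caches on every OSS; roughly as follows (a sketch, assuming the dontneed advice gets wired up for the ZFS OSD as LU-10314 proposes, with a placeholder path):

# Before the "uncached" timing: push the file's data out of the server-side
# cache (ARC on ZFS, page cache on ldiskfs) and out of the client cache.
FILE=/mnt/lustre/f255a.tmp                       # placeholder path
lfs ladvise -a dontneed -s 0 -l $(stat -c %s $FILE) $FILE
cancel_lru_locks osc                             # sanity.sh test-framework helper for the client cache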

Comment by Andreas Dilger [ 03/Mar/17 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/24907
Subject: LU-9069 tests: improve output of sanity test_255a
Project: fs/lustre-release
Branch: master
Current Patch Set: 5
Commit: 99c7ca0110affd39c33afda073d09e6c1335ac81

Comment by Gerrit Updater [ 01/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24907/
Subject: LU-9069 tests: improve output of sanity test_255a
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bdd470ff9765cb81d94b888ee1067e1c420303bc
