[LU-9557] sanity-quota test 0 fails with 'SLOW IO for quota_usr (user): XX KB/sec' Created: 24/May/17  Updated: 26/May/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-quota test_0 fails with

'SLOW IO for quota_usr (user): 426 KB/sec' 

The last thing seen in the test_log from test 0 is as expected

running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
 [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/f0.sanity-quota-0] [count=10] [conv=fsync]
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.427085 s, 24.6 MB/s
 sanity-quota test_0: @@@@@@ FAIL: SLOW IO for quota_usr (user): 426 KB/sec 

The subroutine that is failing is test_quota_performance() in sanity-quota.sh:

 423 test_quota_performance() {
 424         local TESTFILE="$DIR/$tdir/$tfile-0"
 425         local size=$1 # in MB
 426         local stime=$(date +%s)
 427         $RUNAS $DD of=$TESTFILE count=$size conv=fsync ||
 428                 quota_error u $TSTUSR "write failure"
 429         local etime=$(date +%s)
 430         delta=$((etime - stime))
 431         if [ $delta -gt 0 ]; then
 432             rate=$((size * 1024 / delta))
 433             if [ $(facet_fstype $SINGLEMDS) = "zfs" ]; then
 434                 # LU-2872 - see LU-2887 for fix
 435                 [ $rate -gt 64 ] ||
 436                         error "SLOW IO for $TSTUSR (user): $rate KB/sec"
 437             else
 438                 [ $rate -gt 1024 ] ||
 439                         error "SLOW IO for $TSTUSR (user): $rate KB/sec"
 440             fi
 441         fi
 442         rm -f $TESTFILE
 443 }

This test started failing with this error message starting May 12, 2017 and is only failing, so far, in review-dne-part-2. The only patch that modified sanity-quota and landed close to the time this test started to fail was reviewed at https://review.whamcloud.com/26411 and has change-ID I4f22c1e9.

There is nothing interesting in the console logs on any node.

Here are all the failures so far:
https://testing.hpdd.intel.com/test_sets/0ae25452-373d-11e7-b0a8-5254006e85c2
https://testing.hpdd.intel.com/test_sets/0f17788e-379e-11e7-a0ec-5254006e85c2
https://testing.hpdd.intel.com/test_sets/933c6c4a-3d26-11e7-b556-5254006e85c2
https://testing.hpdd.intel.com/test_sets/9d856dfc-3fe5-11e7-b3f6-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7a15e686-4098-11e7-bc6c-5254006e85c2
https://testing.hpdd.intel.com/test_sets/6ba8b6ae-40a8-11e7-91f4-5254006e85c2



 Comments   
Comment by Andreas Dilger [ 26/May/17 ]

It's strange that this test is complaining about running too slow, when "dd" is reporting a much higher speed:

10485760 bytes (10 MB) copied, 0.427085 s, 24.6 MB/s
 sanity-quota test_0: @@@@@@ FAIL: SLOW IO for quota_usr (user): 426 KB/sec 

"dd" reports 24.6MB/s, the script calculates 426 KB/s?

It would be useful to improve the error message to include stime, etime, and delta. As a side note, it is better to use the bash built-in "$SECONDS" variable to get stime and then calculate delta=$((SECONDS - stime)) directly to avoid two extra forks to call date.

Generated at Sat Feb 10 02:27:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.