[LU-9140] sanityn test_77j test_77k: The read / write rate exceeds x% of preset rate Created: 21/Feb/17  Updated: 19/Sep/18  Resolved: 19/Oct/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.2

Type: Bug Priority: Critical
Reporter: James Casper Assignee: Qian Yingjin (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

trevis-38vm3 thru 6, Full Group test,
RHEL7.3/ldiskfs, branch master, v2.9.53.19.gdad9ce4, b3534,
SLES 12.2 client


Issue Links:
Duplicate
is duplicated by LU-9308 sanityn test_77j: write rate exceeds ... Resolved
Related
is related to LU-9227 Changing rate of a TBF rule loses con... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sets/16f600cc-f85e-11e6-b8e5-5254006e85c2

From test_log:

Limited write rate: 20, read rate: 5
Verify the write rate is under TBF control
Write speed is 4.91903
Verify the read rate is under TBF control
Read speed is 19.5095
 sanityn test_77j: @@@@@@ FAIL: The read rate (19.5095) exceeds 110% of preset rate (5) 

Info required for matching: sanityn 77j
Info required for matching: sanityn 77k



 Comments   
Comment by Bob Glossman (Inactive) [ 05/Mar/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/344078ec-0173-11e7-9255-5254006e85c2

Comment by Emoly Liu [ 15/Mar/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/c5cefc54-08c9-11e7-8c9f-5254006e85c2

Comment by Emoly Liu [ 15/Mar/17 ]

qian, could you please have a look at this issue? test_77j was introduced by LU-5620 and now its failure rate is about "6.00% of last 100 runs, 0 skipped (all branches) ".

Comment by Qian Yingjin (Inactive) [ 16/Mar/17 ]

It is a very strange bug. I can not reproduce it in my test environment. I was thought it caused by the CPT > 1 as the rate limit per server would reach CPT * rate. But I Checked the failure log of the testbed all with CPT = 1. Maybe the reproduce of this failure is dependent with other test cases?

Comment by James Casper [ 16/Mar/17 ]

Seen again with 2.9.54 tag:

https://testing.hpdd.intel.com/test_sessions/93d0b46a-0a02-11e7-9053-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/1a6bc6e8-0a05-11e7-9053-5254006e85c2

We run the full group on 7 configs, and this seems to move around those configs (failed 2 of 7 times on .53 and .54).

Comment by Qian Yingjin (Inactive) [ 20/Mar/17 ]

I have ran the test more than 100 times for a Lustre configure with 7 OSTs, but it does not reproduce...

 

error()
{
 echo "$@"
 exit 1
}
pushd lustre-release/lustre/tests
for ((i=1; i<=100; i++))
do
 ONLY="77a 77b 77c 77d 77e 77f 77i 77j" RUNAS_ID=1000 sh sanityn.sh
 [ $? -ne 0 ] && error "FAILED $i testing"
done
popd

 

Comment by Sebastien Buisson (Inactive) [ 04/Apr/17 ]

+1 on master
https://testing.hpdd.intel.com/test_sets/f9d4cb54-1897-11e7-b742-5254006e85c2

Comment by James A Simmons [ 10/Apr/17 ]

Still broken

Comment by Jian Yu [ 12/Apr/17 ]

This is blocking patch review testing on master branch:
https://testing.hpdd.intel.com/test_sets/9d3acb4c-1db4-11e7-9de9-5254006e85c2
https://testing.hpdd.intel.com/test_sets/864f7644-1f6c-11e7-8920-5254006e85c2
https://testing.hpdd.intel.com/test_sets/a5d1ef7e-1f08-11e7-9073-5254006e85c2
https://testing.hpdd.intel.com/test_sets/1cbc0492-1ef2-11e7-b742-5254006e85c2

Comment by Andreas Dilger [ 13/Apr/17 ]

This is failing 5-10x per day so is causing some significant interference with new patch testing.

I suspect that the problem may relate to running in a VM, or possibly because there are so few RPCs?

Qian, please review one of the test results above (they should all be run with -1 debugging, e.g. https://testing.hpdd.intel.com/test_logs/8ce9c52c-1f6c-11e7-8920-5254006e85c2/show_text ) to see why NRS is not handling this properly.

Comment by Gerrit Updater [ 13/Apr/17 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/26615
Subject: LU-9140 tests: disable sanityn test_77j and test_77k
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6225bfd533457ca364ff5bde613f0cf86f130f38

Comment by Qian Yingjin (Inactive) [ 14/Apr/17 ]

Hi Andreas, I am sorry that the debugging log in NRS TBF is not enough for identifiing the problem. It needs to add new debug log to see what happend in NRS TBF.

Comment by Bob Glossman (Inactive) [ 14/Apr/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/1ff0c23e-20db-11e7-b742-5254006e85c2

Comment by Andreas Dilger [ 17/Apr/17 ]

Qian, can you please submit a patch that adds the required debugging to the test and/or the kernel.

Comment by Gerrit Updater [ 18/Apr/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26615/
Subject: LU-9140 tests: disable sanityn test_77j and test_77k
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e9b0c8b55c100e1d86093636309b385a913e927e

Comment by Li Xi (Inactive) [ 18/Apr/17 ]

Although it is still unclear to me how this problem happens, I am wondering whether we should use a special RPC type to check the rate of RPC. The software stack and also cache stack of read/write is too deap, and the speed of read/write could be affected by a lot of factors. So in order to check the speed of RPC control, a RPC type (e.g. "lfs advise") could be used.

Comment by Gerrit Updater [ 18/Apr/17 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/26701
Subject: LU-9140 nrs: add some debug log for NRS TBF
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ad151f5adee39af6b786f36ccc7263c43dcfbfb5

Comment by Peter Jones [ 05/May/17 ]

Qian

Do the results from the run with the denig patch give you a fuller picture as to what the problem is here?

Peter

Comment by Qian Yingjin (Inactive) [ 08/May/17 ]

Hi Peter,
Yes, I think the debug patch https://review.whamcloud.com/26701 can help to identify the problem. I used the similar debug log to find out the failure reason of LU-9227. Maybe the merged patch of LU-9227 have already solved the problem.

Comment by Andreas Dilger [ 08/May/17 ]

This subtest is still failing for patches that include the LU-9227 fix:

https://testing.hpdd.intel.com/sub_tests/a2b0f3f4-33d4-11e7-b0a8-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/abff3518-3278-11e7-8847-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/fcf81bb0-32f5-11e7-814a-5254006e85c2

We'll land the debug patch and try to catch some failures with the added debugging information.

Comment by Gerrit Updater [ 09/May/17 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch https://review.whamcloud.com/26701/
Subject: LU-9140 nrs: add some debug log for NRS TBF
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e0deeb237ae111e959e916ae910d15272ed4c120

Comment by James Nunez (Inactive) [ 12/May/17 ]

There are at least two test sessions that have the new debug information in them. I found the messages in the OST debug_logs for
https://testing.hpdd.intel.com/sub_tests/9951bf06-36e4-11e7-b0a8-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/9fd5daee-3700-11e7-8847-5254006e85c2

Comment by Qian Yingjin (Inactive) [ 15/May/17 ]

After analyzed the debug log, I found that the total time to write 100MB data is about 200s (becaused of network congestion, I think), so the measured I/O performance by dd is nearly 0.5MB/s, but the output of dd is used 'kB/s' as the unit autoatically, so the number is about 500. I think we can fix this bug in the sanityn.sh tbf_verify by adjusting the output result by dd according to its unit 'MB/s', 'GB/s' or 'kB/s'.

Comment by Andreas Dilger [ 15/May/17 ]

Would it be better to just compute the expected runtime for I/O based on the TBF limit, and then compare the actual runtime to expected, instead of parsing the output of "dd"?

The reason the IO is so slow is because these tests are run in a VM that may be using a shared disk also used by another VM guest.

Comment by Gerrit Updater [ 15/May/17 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/27110
Subject: LU-9140 nrs: adjust measured IOPS according to dd output
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1f9030a3624cabf633913edc84376486defa1719

Comment by Qian Yingjin (Inactive) [ 15/May/17 ]

Hi Andreas,

I just finished a patch to ajust the measured IOPS according to the unit of dd output.

The patch above is more intuitive; while your suggestion is much simpler.

If you think the comparson by using runtime is much better, I will push a new patch sooner. 

Comment by Gerrit Updater [ 24/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27110/
Subject: LU-9140 nrs: calculate the measured rate according to dd runtime
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5fb7d63e7d5ccd05f39b74b5c79d1856798477c7

Comment by Peter Jones [ 24/May/17 ]

Landed for 2.10

Comment by Sarah Liu [ 22/Jun/17 ]

It looks like the issue hasn't been resolved. there is syntax error

https://testing.hpdd.intel.com/test_sets/cb4e36e6-52c9-11e7-bacf-5254006e85c2
tag 2.9.59

test log

trevis-3vm6: 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0449807 s, 23.3 MB/s
trevis-3vm5: 1+0 records in
trevis-3vm5: 1+0 records out
trevis-3vm5: 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0186742 s, 56.2 MB/s
Limited write rate: 20, read rate: 5
Verify the write rate is under TBF control
(standard_in) 1: illegal character: M
(standard_in) 1: syntax error
Write runtime is MiB) s, speed is  IOPS
(standard_in) 1: syntax error
/usr/lib64/lustre/tests/sanityn.sh: line 3163: [: -eq: unary operator expected
 sanityn test_77j: @@@@@@ FAIL: The write rate () exceeds 110% of preset rate (20) 
Comment by Qian Yingjin (Inactive) [ 22/Jun/17 ]

From the test log above, the format of dd output has been changed, so the parsing result is incorrect. I will make a patch sooner.

Comment by Gerrit Updater [ 29/Jun/17 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/27878
Subject: LU-9140 nrs: measure the runtime of dd directly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8e5988cdf728577db247afd98525a6bc1bcb7ad2

Comment by Bob Glossman (Inactive) [ 14/Sep/17 ]

another on b2_10:
https://testing.hpdd.intel.com/test_sets/a9d32abe-9903-11e7-b778-5254006e85c2

Comment by Bob Glossman (Inactive) [ 14/Sep/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/bb51f0ae-991c-11e7-ba20-5254006e85c2

Comment by Bob Glossman (Inactive) [ 12/Oct/17 ]

more on master:
https://testing.hpdd.intel.com/test_sets/bf2cc154-af15-11e7-943d-5254006e85c2
https://testing.hpdd.intel.com/test_sets/2c690a30-af38-11e7-9eeb-5254006e85c2

Comment by Jian Yu [ 16/Oct/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/1da465c8-addd-11e7-bb19-5254006e85c2

Comment by Gerrit Updater [ 19/Oct/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27878/
Subject: LU-9140 nrs: measure the runtime of dd directly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 66abf6ffe9f032aaa7c271ee2bd1f9b569058bd5

Comment by Peter Jones [ 19/Oct/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 19/Oct/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29686
Subject: LU-9140 nrs: measure the runtime of dd directly
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: bf27231ae6224598e2f04b2d937fd83c159dc58c

Comment by Gerrit Updater [ 26/Oct/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29686/
Subject: LU-9140 nrs: measure the runtime of dd directly
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 53604724c3e5dd2032303e22511f7be230d4feae

Generated at Sat Feb 10 02:23:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.