[LU-9140] sanityn test_77j test_77k: The read / write rate exceeds x% of preset rate Created: 21/Feb/17 Updated: 19/Sep/18 Resolved: 19/Oct/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0, Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Casper | Assignee: | Qian Yingjin (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
trevis-38vm3 thru 6, Full Group test, |
||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
https://testing.hpdd.intel.com/test_sets/16f600cc-f85e-11e6-b8e5-5254006e85c2 From test_log: Limited write rate: 20, read rate: 5 Verify the write rate is under TBF control Write speed is 4.91903 Verify the read rate is under TBF control Read speed is 19.5095 sanityn test_77j: @@@@@@ FAIL: The read rate (19.5095) exceeds 110% of preset rate (5) Info required for matching: sanityn 77j |
| Comments |
| Comment by Bob Glossman (Inactive) [ 05/Mar/17 ] |
|
another on master: |
| Comment by Emoly Liu [ 15/Mar/17 ] |
|
+1 on master: |
| Comment by Emoly Liu [ 15/Mar/17 ] |
|
qian, could you please have a look at this issue? test_77j was introduced by |
| Comment by Qian Yingjin (Inactive) [ 16/Mar/17 ] |
|
It is a very strange bug. I can not reproduce it in my test environment. I was thought it caused by the CPT > 1 as the rate limit per server would reach CPT * rate. But I Checked the failure log of the testbed all with CPT = 1. Maybe the reproduce of this failure is dependent with other test cases? |
| Comment by James Casper [ 16/Mar/17 ] |
|
Seen again with 2.9.54 tag: https://testing.hpdd.intel.com/test_sessions/93d0b46a-0a02-11e7-9053-5254006e85c2 We run the full group on 7 configs, and this seems to move around those configs (failed 2 of 7 times on .53 and .54). |
| Comment by Qian Yingjin (Inactive) [ 20/Mar/17 ] |
|
I have ran the test more than 100 times for a Lustre configure with 7 OSTs, but it does not reproduce...
error()
{
echo "$@"
exit 1
}
pushd lustre-release/lustre/tests
for ((i=1; i<=100; i++))
do
ONLY="77a 77b 77c 77d 77e 77f 77i 77j" RUNAS_ID=1000 sh sanityn.sh
[ $? -ne 0 ] && error "FAILED $i testing"
done
popd
|
| Comment by Sebastien Buisson (Inactive) [ 04/Apr/17 ] |
|
+1 on master |
| Comment by James A Simmons [ 10/Apr/17 ] |
|
Still broken |
| Comment by Jian Yu [ 12/Apr/17 ] |
|
This is blocking patch review testing on master branch: |
| Comment by Andreas Dilger [ 13/Apr/17 ] |
|
This is failing 5-10x per day so is causing some significant interference with new patch testing. I suspect that the problem may relate to running in a VM, or possibly because there are so few RPCs? Qian, please review one of the test results above (they should all be run with -1 debugging, e.g. https://testing.hpdd.intel.com/test_logs/8ce9c52c-1f6c-11e7-8920-5254006e85c2/show_text ) to see why NRS is not handling this properly. |
| Comment by Gerrit Updater [ 13/Apr/17 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/26615 |
| Comment by Qian Yingjin (Inactive) [ 14/Apr/17 ] |
|
Hi Andreas, I am sorry that the debugging log in NRS TBF is not enough for identifiing the problem. It needs to add new debug log to see what happend in NRS TBF. |
| Comment by Bob Glossman (Inactive) [ 14/Apr/17 ] |
|
another on master: |
| Comment by Andreas Dilger [ 17/Apr/17 ] |
|
Qian, can you please submit a patch that adds the required debugging to the test and/or the kernel. |
| Comment by Gerrit Updater [ 18/Apr/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26615/ |
| Comment by Li Xi (Inactive) [ 18/Apr/17 ] |
|
Although it is still unclear to me how this problem happens, I am wondering whether we should use a special RPC type to check the rate of RPC. The software stack and also cache stack of read/write is too deap, and the speed of read/write could be affected by a lot of factors. So in order to check the speed of RPC control, a RPC type (e.g. "lfs advise") could be used. |
| Comment by Gerrit Updater [ 18/Apr/17 ] |
|
Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/26701 |
| Comment by Peter Jones [ 05/May/17 ] |
|
Qian Do the results from the run with the denig patch give you a fuller picture as to what the problem is here? Peter |
| Comment by Qian Yingjin (Inactive) [ 08/May/17 ] |
|
Hi Peter, |
| Comment by Andreas Dilger [ 08/May/17 ] |
|
This subtest is still failing for patches that include the https://testing.hpdd.intel.com/sub_tests/a2b0f3f4-33d4-11e7-b0a8-5254006e85c2 We'll land the debug patch and try to catch some failures with the added debugging information. |
| Comment by Gerrit Updater [ 09/May/17 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch https://review.whamcloud.com/26701/ |
| Comment by James Nunez (Inactive) [ 12/May/17 ] |
|
There are at least two test sessions that have the new debug information in them. I found the messages in the OST debug_logs for |
| Comment by Qian Yingjin (Inactive) [ 15/May/17 ] |
|
After analyzed the debug log, I found that the total time to write 100MB data is about 200s (becaused of network congestion, I think), so the measured I/O performance by dd is nearly 0.5MB/s, but the output of dd is used 'kB/s' as the unit autoatically, so the number is about 500. I think we can fix this bug in the sanityn.sh tbf_verify by adjusting the output result by dd according to its unit 'MB/s', 'GB/s' or 'kB/s'. |
| Comment by Andreas Dilger [ 15/May/17 ] |
|
Would it be better to just compute the expected runtime for I/O based on the TBF limit, and then compare the actual runtime to expected, instead of parsing the output of "dd"? The reason the IO is so slow is because these tests are run in a VM that may be using a shared disk also used by another VM guest. |
| Comment by Gerrit Updater [ 15/May/17 ] |
|
Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/27110 |
| Comment by Qian Yingjin (Inactive) [ 15/May/17 ] |
|
Hi Andreas, I just finished a patch to ajust the measured IOPS according to the unit of dd output. The patch above is more intuitive; while your suggestion is much simpler. If you think the comparson by using runtime is much better, I will push a new patch sooner. |
| Comment by Gerrit Updater [ 24/May/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27110/ |
| Comment by Peter Jones [ 24/May/17 ] |
|
Landed for 2.10 |
| Comment by Sarah Liu [ 22/Jun/17 ] |
|
It looks like the issue hasn't been resolved. there is syntax error https://testing.hpdd.intel.com/test_sets/cb4e36e6-52c9-11e7-bacf-5254006e85c2 test log trevis-3vm6: 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0449807 s, 23.3 MB/s trevis-3vm5: 1+0 records in trevis-3vm5: 1+0 records out trevis-3vm5: 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0186742 s, 56.2 MB/s Limited write rate: 20, read rate: 5 Verify the write rate is under TBF control (standard_in) 1: illegal character: M (standard_in) 1: syntax error Write runtime is MiB) s, speed is IOPS (standard_in) 1: syntax error /usr/lib64/lustre/tests/sanityn.sh: line 3163: [: -eq: unary operator expected sanityn test_77j: @@@@@@ FAIL: The write rate () exceeds 110% of preset rate (20) |
| Comment by Qian Yingjin (Inactive) [ 22/Jun/17 ] |
|
From the test log above, the format of dd output has been changed, so the parsing result is incorrect. I will make a patch sooner. |
| Comment by Gerrit Updater [ 29/Jun/17 ] |
|
Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/27878 |
| Comment by Bob Glossman (Inactive) [ 14/Sep/17 ] |
|
another on b2_10: |
| Comment by Bob Glossman (Inactive) [ 14/Sep/17 ] |
|
another on master: |
| Comment by Bob Glossman (Inactive) [ 12/Oct/17 ] |
|
more on master: |
| Comment by Jian Yu [ 16/Oct/17 ] |
|
+1 on master: |
| Comment by Gerrit Updater [ 19/Oct/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27878/ |
| Comment by Peter Jones [ 19/Oct/17 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 19/Oct/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29686 |
| Comment by Gerrit Updater [ 26/Oct/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29686/ |