[LU-9887] sanity-lfsck test_9a: FAIL: (4) Got speed 952, expected less than 144 Created: 17/Aug/17 Updated: 07/Jan/19 Resolved: 01/Dec/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.10.6 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/742f02b4-837a-11e7-b90b-5254006e85c2. The sub-test test_9a failed with the following error: (4) Got speed 952, expected less than 144 This might be a dup of Info required for matching: sanity-lfsck 9a |
| Comments |
| Comment by Bob Glossman (Inactive) [ 17/Aug/17 ] |
|
think this fail may be showing up now due to recent landing of " Before that tests 9a & 9b were probably always skipped. |
| Comment by Gerrit Updater [ 17/Aug/17 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/28588 |
| Comment by Bob Glossman (Inactive) [ 18/Aug/17 ] |
|
more on master: |
| Comment by Andreas Dilger [ 18/Aug/17 ] |
|
One minor nit to fix when this test is repaired - it is adding 20% margin to both the time and the calculated speed, which is why it expects "less than 144" instead of "less than 120" for the actual speed: local RUN_TIME1=10
local TIME_DIFF=2
local MAX_SPEED=$((BASE_SPEED1 * (RUN_TIME1 + TIME_DIFF) / \
RUN_TIME1 * 12 / 10))
Either the TIME_DIFF or the 12 / 10 should be removed. Also, to improve the calculations, rather than using the RUN_TIME1 and RUN_TIME2 values to calculate the elapsed time, it would be better to record the start and end times for each step: $START_LAYOUT -r -s $BASE_SPEED1 || error "(2) Fail to start LFSCK!" local START_TIME1=$SECONDS sleep 10 STATUS=$($SHOW_LAYOUT | awk '/^status/ { print $2 }') local RUN_TIME1=$((SECONDS - START_TIME)) I put the start/end times both after the commands for consistency, since SSH can sometimes take a noticeable time on the VMs. |
| Comment by nasf (Inactive) [ 19/Aug/17 ] |
|
The issue is related with div_u64() and do_div(). Originally, we use do_div() in LFSCK for calculating the LFSCK scanning speed. Recently, it is replaced by div_u64() via the patch https://review.whamcloud.com/26466. But the main difference between the two functions are: The do_div(a, b) returns the remainder, the quotient is put in the dividend(@a). Both of them use 64-bits dividend(@a) and 32-bits divisor(@b). For LFSCK case, do_div() is enough. I will make patch to fix it. |
| Comment by Gerrit Updater [ 19/Aug/17 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/28617 |
| Comment by James A Simmons [ 19/Aug/17 ] |
|
That was done to avoid https://jira.hpdd.intel.com/browse/LU-6174. I guess we need to look at math64.h to see what the proper function is. |
| Comment by nasf (Inactive) [ 19/Aug/17 ] |
|
But div_u64() also uses 32-bits input parameter as the divisor, the return value of div_u64() is 64 bits. |
| Comment by nasf (Inactive) [ 19/Aug/17 ] |
|
On the other hand, for LFSCK, the divisor is the run time with second unit. It is impossible that the LFSCK run time exceeds 32-bit seconds. So do_div() is enough for LFSCK. |
| Comment by James A Simmons [ 19/Aug/17 ] |
|
True but if you want to do that you need to add a comment saying its okay to truncate the result. What will happen in the future is some one will look at the code and assume do_div() is wrong due to the truncate issue. So I ask you place a comment stating this. Personally I like to see things done is proper way i.e use a correct div64 function instead of reverting to do_div() but its not a hard requirement. |
| Comment by nasf (Inactive) [ 19/Aug/17 ] |
|
simmonsja, https://review.whamcloud.com/28617 is refreshed, the new version uses dir_u64(), but gets the result from the function return value. Please check. Thanks! |
| Comment by Gerrit Updater [ 20/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28588/ |
| Comment by Gerrit Updater [ 30/Sep/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28617/ |
| Comment by Peter Jones [ 30/Sep/17 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 02/Oct/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29293 |
| Comment by Gerrit Updater [ 02/Oct/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29294 |
| Comment by Gerrit Updater [ 11/Oct/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29293/ |
| Comment by Bob Glossman (Inactive) [ 13/Oct/17 ] |
|
still seeing fails on master after the landing of https://review.whamcloud.com/28588 and https://review.whamcloud.com/28617: |
| Comment by Gerrit Updater [ 25/Oct/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29294/ |
| Comment by nasf (Inactive) [ 24/Nov/17 ] |
This is a different issue that is caused by calculation error. As you can, the diff is (145 - 144) / 144, it can be ignored in our VM test environment.
|
| Comment by Peter Jones [ 24/Nov/17 ] |
|
So we need a new ticket to track making this test more robust? |
| Comment by nasf (Inactive) [ 24/Nov/17 ] |
|
Currently, we allow some test error range for lfsck speed. If we want to make the test more robust, then either enlarge such error range or test more large data set. But there is no absolute solution for that. |
| Comment by Peter Jones [ 24/Nov/17 ] |
|
So is the current test actually telling us anything useful? It sounds like you are saying that the failures for this test are because the failure threshold is too low. If that is the case, we should either raise the threshold to reduce these failures or else remove the test. As things stand it is failing quite often but just being assumed to be fine. |
| Comment by nasf (Inactive) [ 24/Nov/17 ] |
|
Raising the threshold will be the most simple solution. I will push a patch soon with this ticket number. |
| Comment by Gerrit Updater [ 24/Nov/17 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30247 |
| Comment by nasf (Inactive) [ 24/Nov/17 ] |
|
More patch for this ticket. |
| Comment by Gerrit Updater [ 01/Dec/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30247/ |