[LU-9887] sanity-lfsck test_9a: FAIL: (4) Got speed 952, expected less than 144 Created: 17/Aug/17  Updated: 07/Jan/19  Resolved: 01/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.10.6

Type: Bug Priority: Critical
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-9894 lfsck always prints that it started a... Resolved
Related
is related to LU-6174 do_div() silently truncates divisor t... Resolved
is related to LU-9295 sanity-lfsck tests 9a and 9b skipped ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/742f02b4-837a-11e7-b90b-5254006e85c2.

The sub-test test_9a failed with the following error:

(4) Got speed 952, expected less than 144

This might be a dup of LU-8877, but those haven't been reported for quite a while.
Creating a new Jira ticket for recent instances. Will let somebody else decide if they are dups.

Info required for matching: sanity-lfsck 9a
Info required for matching: sanity-lfsck 9b



 Comments   
Comment by Bob Glossman (Inactive) [ 17/Aug/17 ]

I think this failure may be showing up now because of the recent landing of "LU-9295 test: skip sanity-lfsck 9 less than 2.8"

Before that tests 9a & 9b were probably always skipped.

Comment by Gerrit Updater [ 17/Aug/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/28588
Subject: LU-9887 tests: stop running sanity-lfsck tests 9a,9b
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 584667e902fda338b0f0a7fdbdb85edbfe320441

Comment by Bob Glossman (Inactive) [ 18/Aug/17 ]

more on master:
https://testing.hpdd.intel.com/test_sets/66b2ad28-839f-11e7-b33c-5254006e85c2
https://testing.hpdd.intel.com/test_sets/cb972bee-83a7-11e7-b33c-5254006e85c2

Comment by Andreas Dilger [ 18/Aug/17 ]

One minor nit to fix when this test is repaired - it is adding 20% margin to both the time and the calculated speed, which is why it expects "less than 144" instead of "less than 120" for the actual speed:

        local RUN_TIME1=10
        local TIME_DIFF=2
        local MAX_SPEED=$((BASE_SPEED1 * (RUN_TIME1 + TIME_DIFF) / \
                           RUN_TIME1 * 12 / 10))

Either the TIME_DIFF or the 12 / 10 should be removed.
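
The double margin can be checked with shell integer arithmetic. This is a minimal sketch that assumes BASE_SPEED1=100; that value is not shown in the ticket, but it is implied by the "less than 120" figure. The other values come from the snippet above.

```shell
# Assumption: BASE_SPEED1=100 (implied by "less than 120", not shown in the ticket).
BASE_SPEED1=100
RUN_TIME1=10
TIME_DIFF=2

# Current formula: the TIME_DIFF margin and the 12/10 factor are both applied.
BOTH=$((BASE_SPEED1 * (RUN_TIME1 + TIME_DIFF) / RUN_TIME1 * 12 / 10))
# Only the TIME_DIFF margin.
TIME_ONLY=$((BASE_SPEED1 * (RUN_TIME1 + TIME_DIFF) / RUN_TIME1))
# Only the 12/10 factor.
FACTOR_ONLY=$((BASE_SPEED1 * 12 / 10))

echo "both=$BOTH time_only=$TIME_ONLY factor_only=$FACTOR_ONLY"
```

Under that assumption, either margin alone yields a limit of 120, while applying both yields 144.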

Also, to improve the calculations, rather than using the RUN_TIME1 and RUN_TIME2 values to calculate the elapsed time, it would be better to record the start and end times for each step:

        $START_LAYOUT -r -s $BASE_SPEED1 || error "(2) Fail to start LFSCK!"
        local START_TIME1=$SECONDS

        sleep 10
        STATUS=$($SHOW_LAYOUT | awk '/^status/ { print $2 }')
        local RUN_TIME1=$((SECONDS - START_TIME1))

I put the start/end times both after the commands for consistency, since SSH can sometimes take a noticeable time on the VMs.

Comment by nasf (Inactive) [ 19/Aug/17 ]

The issue is related to div_u64() and do_div(). Originally, LFSCK used do_div() to calculate the LFSCK scanning speed. Recently, it was replaced by div_u64() via the patch https://review.whamcloud.com/26466. The main difference between the two functions is:

do_div(a, b) returns the remainder; the quotient is stored in the dividend (@a).
div_u64(a, b) returns the quotient; neither the dividend (@a) nor the divisor (@b) is changed.

Both of them take a 64-bit dividend (@a) and a 32-bit divisor (@b). For the LFSCK case, do_div() is enough. I will make a patch to fix it.
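
The difference in calling convention can be illustrated with shell stand-ins. This is only a sketch: do_div_like and div_u64_like are hypothetical helpers that mimic the conventions described above, the numbers are made up, and the ticket does not show the actual LFSCK call site. It shows one plausible way such a swap can leave an undivided value in the speed variable, namely when the return value is discarded.

```shell
# Hypothetical shell stand-ins for the kernel helpers; they only mimic
# the calling conventions described above.

# do_div style: the quotient overwrites the named variable,
# the remainder is left in DO_DIV_REM.
do_div_like() {
    eval "DO_DIV_REM=\$(( $1 % $2 ))"
    eval "$1=\$(( $1 / $2 ))"
}

# div_u64 style: the arguments are untouched, the quotient is the result.
div_u64_like() {
    echo $(( $1 / $2 ))
}

CHECKED=1005   # made-up number of scanned objects
RUN_TIME=10    # made-up run time in seconds

# do_div convention: the speed is updated in place.
SPEED_A=$CHECKED
do_div_like SPEED_A "$RUN_TIME"

# div_u64 called as if it were do_div: the result is thrown away,
# so the "speed" is still the raw count.
SPEED_B=$CHECKED
div_u64_like "$SPEED_B" "$RUN_TIME" > /dev/null

# div_u64 used correctly: take the quotient from the result.
SPEED_C=$(div_u64_like "$CHECKED" "$RUN_TIME")

echo "in-place=$SPEED_A discarded=$SPEED_B returned=$SPEED_C rem=$DO_DIV_REM"
```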

Comment by Gerrit Updater [ 19/Aug/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/28617
Subject: LU-9887 lfsck: repalce div_u64 with do_div
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cc198dafc29269a06fc3d72aebb8e73357e32144

Comment by James A Simmons [ 19/Aug/17 ]

That was done to avoid https://jira.hpdd.intel.com/browse/LU-6174. I guess we need to look at math64.h to see what the proper function is.

Comment by nasf (Inactive) [ 19/Aug/17 ]

But div_u64() also takes a 32-bit divisor; only its return value is 64 bits.

Comment by nasf (Inactive) [ 19/Aug/17 ]

On the other hand, for LFSCK, the divisor is the run time in seconds. The LFSCK run time cannot exceed what fits in a 32-bit count of seconds, so do_div() is enough for LFSCK.

Comment by James A Simmons [ 19/Aug/17 ]

True, but if you want to do that you need to add a comment saying it's okay to truncate the result. Otherwise, someone will look at the code in the future and assume do_div() is wrong because of the truncation issue. So I ask that you add a comment stating this. Personally I like to see things done the proper way, i.e. using a correct div64 function instead of reverting to do_div(), but it's not a hard requirement.

Comment by nasf (Inactive) [ 19/Aug/17 ]

simmonsja, https://review.whamcloud.com/28617 is refreshed. The new version uses div_u64(), but takes the result from the function's return value. Please check. Thanks!

Comment by Gerrit Updater [ 20/Aug/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28588/
Subject: LU-9887 tests: ignore error sanity-lfsck test 9a,b
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6754b09335508ca4d977d10d1d05b5befd1a8aad

Comment by Gerrit Updater [ 30/Sep/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28617/
Subject: LU-9887 lfsck: calculate LFSCK speed properly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cf800c062c8c6424c442509139297095f8a708db

Comment by Peter Jones [ 30/Sep/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 02/Oct/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29293
Subject: LU-9887 tests: ignore error sanity-lfsck test 9a,b
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 917d9a4021fc0ccb911fbb2b0e261a781b91f2be

Comment by Gerrit Updater [ 02/Oct/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29294
Subject: LU-9887 lfsck: calculate LFSCK speed properly
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 8eb4db6c1125511d870cd848fd8cd5eba9e944eb

Comment by Gerrit Updater [ 11/Oct/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29293/
Subject: LU-9887 tests: ignore error sanity-lfsck test 9a,b
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 2efab9c82e806dc53b98fcb157aaed60af79a799

Comment by Bob Glossman (Inactive) [ 13/Oct/17 ]

still seeing fails on master after the landing of https://review.whamcloud.com/28588 and https://review.whamcloud.com/28617:
https://testing.hpdd.intel.com/test_sets/8ab76200-afbd-11e7-8d8d-5254006e85c2

Comment by Gerrit Updater [ 25/Oct/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29294/
Subject: LU-9887 lfsck: calculate LFSCK speed properly
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 0f14db83ab0fe0b505e3eabb7b51619cd42e5155

Comment by nasf (Inactive) [ 24/Nov/17 ]

still seeing fails on master after the landing of https://review.whamcloud.com/28588 and https://review.whamcloud.com/28617:
https://testing.hpdd.intel.com/test_sets/8ab76200-afbd-11e7-8d8d-5254006e85c2

This is a different issue, caused by calculation error. As you can see, the difference is (145 - 144) / 144, which is under 1% and can be ignored in our VM test environment.

sanity-lfsck test_9b: @@@@@@ FAIL: (10) Speed 145, expected < 144
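
For scale, the overshoot in this failure can be computed with integer arithmetic. The sketch below also shows one possible percentage-based check; within_tolerance is a hypothetical helper and the 1% allowance is an arbitrary illustration, not the change made for this ticket.

```shell
# Hypothetical helper: succeed if ACTUAL <= LIMIT * (100 + PCT) / 100.
# Usage: within_tolerance ACTUAL LIMIT PCT
within_tolerance() {
    [ $(( $1 * 100 )) -le $(( $2 * (100 + $3) )) ]
}

# Overshoot of the reported failure, in hundredths of a percent.
OVERSHOOT=$(( (145 - 144) * 10000 / 144 ))
echo "overshoot: $OVERSHOOT hundredths of a percent"

within_tolerance 145 144 1 && echo "145 vs 144: within 1%"
within_tolerance 952 144 1 || echo "952 vs 144: outside 1%"
```

With these numbers, the 145 result would pass a 1% allowance, while the original 952 result would still be reported as a failure.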

Comment by Peter Jones [ 24/Nov/17 ]

So we need a new ticket to track making this test more robust?

Comment by nasf (Inactive) [ 24/Nov/17 ]

Currently, we allow some error range for the LFSCK speed in the test. If we want to make the test more robust, we can either enlarge that error range or test with a larger data set. But there is no absolute solution for that.

Comment by Peter Jones [ 24/Nov/17 ]

So is the current test actually telling us anything useful? It sounds like you are saying that the failures for this test are because the failure threshold is too low. If that is the case, we should either raise the threshold to reduce these failures or else remove the test. As things stand it is failing quite often but just being assumed to be fine.

Comment by nasf (Inactive) [ 24/Nov/17 ]

Raising the threshold will be the simplest solution. I will push a patch soon with this ticket number.

Comment by Gerrit Updater [ 24/Nov/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30247
Subject: LU-9887 tests: adjust lfsck speek test error range
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 644ae8cc3930cc504b7cfd24c745c0b78b6481d4

Comment by nasf (Inactive) [ 24/Nov/17 ]

One more patch for this ticket.

Comment by Gerrit Updater [ 01/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30247/
Subject: LU-9887 tests: adjust lfsck speek test error range
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0457798e95e3628d7b4f0894fdc2dd13c2dd23f6
