[LU-9887] sanity-lfsck test_9a: FAIL: (4) Got speed 952, expected less than 144 - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.10.6
Affects Version/s: Lustre 2.11.0
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/742f02b4-837a-11e7-b90b-5254006e85c2.

The sub-test test_9a failed with the following error:

(4) Got speed 952, expected less than 144

This might be a dup of LU-8877, but those haven't been reported for quite a while.
Creating a new Jira ticket for recent instances. Will let somebody else decide if they are dups.

Info required for matching: sanity-lfsck 9a
Info required for matching: sanity-lfsck 9b

Issue Links

duplicates
LU-9894 lfsck always prints that it started all types of lfsck (Resolved)

is related to
LU-6174 do_div() silently truncates divisor to uint32_t (Resolved)
LU-9295 sanity-lfsck tests 9a and 9b skipped due to uninitialized server version (Resolved)

Activity

nasf (Inactive) added a comment - 19/Aug/17 3:11 PM

On the other hand, for LFSCK the divisor is the run time in seconds. An LFSCK run cannot last longer than a 32-bit count of seconds, so do_div() is sufficient for LFSCK.

nasf (Inactive) added a comment - 19/Aug/17 3:10 PM

But div_u64() also takes a 32-bit divisor; only its return value is 64 bits.

James A Simmons added a comment - 19/Aug/17 3:02 PM

That was done to avoid https://jira.hpdd.intel.com/browse/LU-6174. I guess we need to look at math64.h to see what the proper function is.

Gerrit Updater added a comment - 19/Aug/17 1:46 PM

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/28617
Subject: LU-9887 lfsck: repalce div_u64 with do_div
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cc198dafc29269a06fc3d72aebb8e73357e32144

nasf (Inactive) added a comment - 19/Aug/17 1:44 PM

The issue is related to div_u64() and do_div(). Originally, LFSCK used do_div() to calculate its scanning speed. Recently, that was replaced by div_u64() via the patch https://review.whamcloud.com/26466. The main difference between the two functions is:

do_div(a, b) returns the remainder; the quotient is stored in the dividend (@a).
div_u64(a, b) returns the quotient; neither the dividend (@a) nor the divisor (@b) is changed.

Both take a 64-bit dividend (@a) and a 32-bit divisor (@b). For the LFSCK case, do_div() is enough. I will make a patch to fix it.

Andreas Dilger added a comment - 18/Aug/17 5:09 PM

One minor nit to fix when this test is repaired - it is adding 20% margin to both the time and the calculated speed, which is why it expects "less than 144" instead of "less than 120" for the actual speed:

        local RUN_TIME1=10
        local TIME_DIFF=2
        local MAX_SPEED=$((BASE_SPEED1 * (RUN_TIME1 + TIME_DIFF) / \
                           RUN_TIME1 * 12 / 10))

Either the TIME_DIFF or the 12 / 10 should be removed.

Also, to improve the calculations, rather than using the RUN_TIME1 and RUN_TIME2 values to calculate the elapsed time, it would be better to record the start and end times for each step:

        $START_LAYOUT -r -s $BASE_SPEED1 || error "(2) Fail to start LFSCK!"
        local START_TIME1=$SECONDS

        sleep 10
        STATUS=$($SHOW_LAYOUT | awk '/^status/ { print $2 }')
        local RUN_TIME1=$((SECONDS - START_TIME1))

I put the start/end times both after the commands for consistency, since SSH can sometimes take a noticeable time on the VMs.

Bob Glossman (Inactive) added a comment - 18/Aug/17 2:31 AM

more on master:
https://testing.hpdd.intel.com/test_sets/66b2ad28-839f-11e7-b33c-5254006e85c2
https://testing.hpdd.intel.com/test_sets/cb972bee-83a7-11e7-b33c-5254006e85c2

Gerrit Updater added a comment - 17/Aug/17 8:08 PM

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/28588
Subject: LU-9887 tests: stop running sanity-lfsck tests 9a,9b
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 584667e902fda338b0f0a7fdbdb85edbfe320441

Bob Glossman (Inactive) added a comment - 17/Aug/17 7:51 PM

I think this failure may be showing up now due to the recent landing of "LU-9295 test: skip sanity-lfsck 9 less than 2.8".

Before that, tests 9a and 9b were probably always skipped.

People

Assignee: nasf (Inactive)
Reporter: Maloo
Votes: 0
Watchers: 8

Dates

Created: 17/Aug/17 7:45 PM
Updated: 07/Jan/19 6:52 PM
Resolved: 01/Dec/17 5:28 AM