Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9887

sanity-lfsck test_9a: FAIL: (4) Got speed 952, expected less than 144

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.6
    • Lustre 2.11.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/742f02b4-837a-11e7-b90b-5254006e85c2.

      The sub-test test_9a failed with the following error:

      (4) Got speed 952, expected less than 144
      

      This might be a dup of LU-8877, but those haven't been reported for quite a while.
      Creating a new Jira ticket for recent instances. Will let somebody else decide if they are dups.

      Info required for matching: sanity-lfsck 9a
      Info required for matching: sanity-lfsck 9b

      Attachments

        Issue Links

          Activity

            [LU-9887] sanity-lfsck test_9a: FAIL: (4) Got speed 952, expected less than 144

            On the other hand, for LFSCK, the divisor is the run time with second unit. It is impossible that the LFSCK run time exceeds 32-bit seconds. So do_div() is enough for LFSCK.

            yong.fan nasf (Inactive) added a comment - On the other hand, for LFSCK, the divisor is the run time with second unit. It is impossible that the LFSCK run time exceeds 32-bit seconds. So do_div() is enough for LFSCK.

            But div_u64() also uses 32-bits input parameter as the divisor, the return value of div_u64() is 64 bits.

            yong.fan nasf (Inactive) added a comment - But div_u64() also uses 32-bits input parameter as the divisor, the return value of div_u64() is 64 bits.

            That was done to avoid https://jira.hpdd.intel.com/browse/LU-6174. I guess we need to look at math64.h to see what the proper function is.

            simmonsja James A Simmons added a comment - That was done to avoid https://jira.hpdd.intel.com/browse/LU-6174 . I guess we need to look at math64.h to see what the proper function is.

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/28617
            Subject: LU-9887 lfsck: repalce div_u64 with do_div
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cc198dafc29269a06fc3d72aebb8e73357e32144

            gerrit Gerrit Updater added a comment - Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/28617 Subject: LU-9887 lfsck: repalce div_u64 with do_div Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: cc198dafc29269a06fc3d72aebb8e73357e32144

            The issue is related with div_u64() and do_div(). Originally, we use do_div() in LFSCK for calculating the LFSCK scanning speed. Recently, it is replaced by div_u64() via the patch https://review.whamcloud.com/26466. But the main difference between the two functions are:

            The do_div(a, b) returns the remainder, the quotient is put in the dividend(@a).
            The div_u64(a, b) returns the quotient, both the dividend(@a) and divisor(@b) will NOT be changed.

            Both of them use 64-bits dividend(@a) and 32-bits divisor(@b). For LFSCK case, do_div() is enough. I will make patch to fix it.

            yong.fan nasf (Inactive) added a comment - The issue is related with div_u64() and do_div(). Originally, we use do_div() in LFSCK for calculating the LFSCK scanning speed. Recently, it is replaced by div_u64() via the patch https://review.whamcloud.com/26466 . But the main difference between the two functions are: The do_div(a, b) returns the remainder, the quotient is put in the dividend(@a). The div_u64(a, b) returns the quotient, both the dividend(@a) and divisor(@b) will NOT be changed. Both of them use 64-bits dividend(@a) and 32-bits divisor(@b). For LFSCK case, do_div() is enough. I will make patch to fix it.

            One minor nit to fix when this test is repaired - it is adding 20% margin to both the time and the calculated speed, which is why it expects "less than 144" instead of "less than 120" for the actual speed:

                    local RUN_TIME1=10
                    local TIME_DIFF=2
                    local MAX_SPEED=$((BASE_SPEED1 * (RUN_TIME1 + TIME_DIFF) / \
                                       RUN_TIME1 * 12 / 10))
            

            Either the TIME_DIFF or the 12 / 10 should be removed.

            Also, to improve the calculations, rather than using the RUN_TIME1 and RUN_TIME2 values to calculate the elapsed time, it would be better to record the start and end times for each step:

                    $START_LAYOUT -r -s $BASE_SPEED1 || error "(2) Fail to start LFSCK!"
                    local START_TIME1=$SECONDS
            
                    sleep 10
                    STATUS=$($SHOW_LAYOUT | awk '/^status/ { print $2 }')
                    local RUN_TIME1=$((SECONDS - START_TIME))
            

            I put the start/end times both after the commands for consistency, since SSH can sometimes take a noticeable time on the VMs.

            adilger Andreas Dilger added a comment - One minor nit to fix when this test is repaired - it is adding 20% margin to both the time and the calculated speed, which is why it expects " less than 144" instead of " less than 120 " for the actual speed: local RUN_TIME1=10 local TIME_DIFF=2 local MAX_SPEED=$((BASE_SPEED1 * (RUN_TIME1 + TIME_DIFF) / \ RUN_TIME1 * 12 / 10)) Either the TIME_DIFF or the 12 / 10 should be removed. Also, to improve the calculations, rather than using the RUN_TIME1 and RUN_TIME2 values to calculate the elapsed time, it would be better to record the start and end times for each step: $START_LAYOUT -r -s $BASE_SPEED1 || error "(2) Fail to start LFSCK!" local START_TIME1=$SECONDS sleep 10 STATUS=$($SHOW_LAYOUT | awk '/^status/ { print $2 }' ) local RUN_TIME1=$((SECONDS - START_TIME)) I put the start/end times both after the commands for consistency, since SSH can sometimes take a noticeable time on the VMs.
            bogl Bob Glossman (Inactive) added a comment - more on master: https://testing.hpdd.intel.com/test_sets/66b2ad28-839f-11e7-b33c-5254006e85c2 https://testing.hpdd.intel.com/test_sets/cb972bee-83a7-11e7-b33c-5254006e85c2

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/28588
            Subject: LU-9887 tests: stop running sanity-lfsck tests 9a,9b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 584667e902fda338b0f0a7fdbdb85edbfe320441

            gerrit Gerrit Updater added a comment - James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/28588 Subject: LU-9887 tests: stop running sanity-lfsck tests 9a,9b Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 584667e902fda338b0f0a7fdbdb85edbfe320441

            think this fail may be showing up now due to recent landing of "LU-9295 test: skip sanity-lfsck 9 less than 2.8"

            Before that tests 9a & 9b were probably always skipped.

            bogl Bob Glossman (Inactive) added a comment - think this fail may be showing up now due to recent landing of " LU-9295 test: skip sanity-lfsck 9 less than 2.8" Before that tests 9a & 9b were probably always skipped.

            People

              yong.fan nasf (Inactive)
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: