Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9887

sanity-lfsck test_9a: FAIL: (4) Got speed 952, expected less than 144

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.6
    • Lustre 2.11.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/742f02b4-837a-11e7-b90b-5254006e85c2.

      The sub-test test_9a failed with the following error:

      (4) Got speed 952, expected less than 144
      

      This might be a dup of LU-8877, but those haven't been reported for quite a while.
      Creating a new Jira ticket for recent instances. Will let somebody else decide if they are dups.

      Info required for matching: sanity-lfsck 9a
      Info required for matching: sanity-lfsck 9b

      Attachments

        Issue Links

          Activity

            [LU-9887] sanity-lfsck test_9a: FAIL: (4) Got speed 952, expected less than 144
            pjones Peter Jones added a comment -

            Landed for 2.11

            pjones Peter Jones added a comment - Landed for 2.11

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28617/
            Subject: LU-9887 lfsck: calculate LFSCK speed properly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: cf800c062c8c6424c442509139297095f8a708db

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28617/ Subject: LU-9887 lfsck: calculate LFSCK speed properly Project: fs/lustre-release Branch: master Current Patch Set: Commit: cf800c062c8c6424c442509139297095f8a708db

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28588/
            Subject: LU-9887 tests: ignore error sanity-lfsck test 9a,b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6754b09335508ca4d977d10d1d05b5befd1a8aad

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28588/ Subject: LU-9887 tests: ignore error sanity-lfsck test 9a,b Project: fs/lustre-release Branch: master Current Patch Set: Commit: 6754b09335508ca4d977d10d1d05b5befd1a8aad

            simmonsja, https://review.whamcloud.com/28617 is refreshed, the new version uses dir_u64(), but gets the result from the function return value. Please check. Thanks!

            yong.fan nasf (Inactive) added a comment - simmonsja , https://review.whamcloud.com/28617 is refreshed, the new version uses dir_u64(), but gets the result from the function return value. Please check. Thanks!
            simmonsja James A Simmons added a comment - - edited

            True but if you want to do that you need to add a comment saying its okay to truncate the result. What will happen in the future is some one will look at the code and assume do_div() is wrong due to the truncate issue. So I ask you place a comment stating this. Personally I like to see things done is proper way i.e use a correct div64 function instead of reverting to do_div() but its not a hard requirement.

            simmonsja James A Simmons added a comment - - edited True but if you want to do that you need to add a comment saying its okay to truncate the result. What will happen in the future is some one will look at the code and assume do_div() is wrong due to the truncate issue. So I ask you place a comment stating this. Personally I like to see things done is proper way i.e use a correct div64 function instead of reverting to do_div() but its not a hard requirement.

            On the other hand, for LFSCK, the divisor is the run time with second unit. It is impossible that the LFSCK run time exceeds 32-bit seconds. So do_div() is enough for LFSCK.

            yong.fan nasf (Inactive) added a comment - On the other hand, for LFSCK, the divisor is the run time with second unit. It is impossible that the LFSCK run time exceeds 32-bit seconds. So do_div() is enough for LFSCK.

            But div_u64() also uses 32-bits input parameter as the divisor, the return value of div_u64() is 64 bits.

            yong.fan nasf (Inactive) added a comment - But div_u64() also uses 32-bits input parameter as the divisor, the return value of div_u64() is 64 bits.

            That was done to avoid https://jira.hpdd.intel.com/browse/LU-6174. I guess we need to look at math64.h to see what the proper function is.

            simmonsja James A Simmons added a comment - That was done to avoid https://jira.hpdd.intel.com/browse/LU-6174 . I guess we need to look at math64.h to see what the proper function is.

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/28617
            Subject: LU-9887 lfsck: repalce div_u64 with do_div
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cc198dafc29269a06fc3d72aebb8e73357e32144

            gerrit Gerrit Updater added a comment - Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/28617 Subject: LU-9887 lfsck: repalce div_u64 with do_div Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: cc198dafc29269a06fc3d72aebb8e73357e32144

            The issue is related with div_u64() and do_div(). Originally, we use do_div() in LFSCK for calculating the LFSCK scanning speed. Recently, it is replaced by div_u64() via the patch https://review.whamcloud.com/26466. But the main difference between the two functions are:

            The do_div(a, b) returns the remainder, the quotient is put in the dividend(@a).
            The div_u64(a, b) returns the quotient, both the dividend(@a) and divisor(@b) will NOT be changed.

            Both of them use 64-bits dividend(@a) and 32-bits divisor(@b). For LFSCK case, do_div() is enough. I will make patch to fix it.

            yong.fan nasf (Inactive) added a comment - The issue is related with div_u64() and do_div(). Originally, we use do_div() in LFSCK for calculating the LFSCK scanning speed. Recently, it is replaced by div_u64() via the patch https://review.whamcloud.com/26466 . But the main difference between the two functions are: The do_div(a, b) returns the remainder, the quotient is put in the dividend(@a). The div_u64(a, b) returns the quotient, both the dividend(@a) and divisor(@b) will NOT be changed. Both of them use 64-bits dividend(@a) and 32-bits divisor(@b). For LFSCK case, do_div() is enough. I will make patch to fix it.

            One minor nit to fix when this test is repaired - it is adding 20% margin to both the time and the calculated speed, which is why it expects "less than 144" instead of "less than 120" for the actual speed:

                    local RUN_TIME1=10
                    local TIME_DIFF=2
                    local MAX_SPEED=$((BASE_SPEED1 * (RUN_TIME1 + TIME_DIFF) / \
                                       RUN_TIME1 * 12 / 10))
            

            Either the TIME_DIFF or the 12 / 10 should be removed.

            Also, to improve the calculations, rather than using the RUN_TIME1 and RUN_TIME2 values to calculate the elapsed time, it would be better to record the start and end times for each step:

                    $START_LAYOUT -r -s $BASE_SPEED1 || error "(2) Fail to start LFSCK!"
                    local START_TIME1=$SECONDS
            
                    sleep 10
                    STATUS=$($SHOW_LAYOUT | awk '/^status/ { print $2 }')
                    local RUN_TIME1=$((SECONDS - START_TIME))
            

            I put the start/end times both after the commands for consistency, since SSH can sometimes take a noticeable time on the VMs.

            adilger Andreas Dilger added a comment - One minor nit to fix when this test is repaired - it is adding 20% margin to both the time and the calculated speed, which is why it expects " less than 144" instead of " less than 120 " for the actual speed: local RUN_TIME1=10 local TIME_DIFF=2 local MAX_SPEED=$((BASE_SPEED1 * (RUN_TIME1 + TIME_DIFF) / \ RUN_TIME1 * 12 / 10)) Either the TIME_DIFF or the 12 / 10 should be removed. Also, to improve the calculations, rather than using the RUN_TIME1 and RUN_TIME2 values to calculate the elapsed time, it would be better to record the start and end times for each step: $START_LAYOUT -r -s $BASE_SPEED1 || error "(2) Fail to start LFSCK!" local START_TIME1=$SECONDS sleep 10 STATUS=$($SHOW_LAYOUT | awk '/^status/ { print $2 }' ) local RUN_TIME1=$((SECONDS - START_TIME)) I put the start/end times both after the commands for consistency, since SSH can sometimes take a noticeable time on the VMs.

            People

              yong.fan nasf (Inactive)
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: