
[LU-16420] sanity test_51d: FAIL: stripecount=3: OST 1 has more objects vs. OST 0

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.16.0
    • Fix Version/s: Lustre 2.16.0
    • Components: None
    • Severity: 3

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/19ce9196-c2db-40c4-9140-ebea855a919f

      test_51d failed with the following error:

      OST0 has 265 objects, 86 are index 0
      OST1 has 322 objects, 145 are index 0
      OST2 has 320 objects, 85 are index 0
      OST3 has 322 objects, 86 are index 0
      OST4 has 265 objects, 87 are index 0
      OST5 has 323 objects, 143 are index 0
      OST6 has 324 objects, 88 are index 0
      OST7 has 323 objects, 88 are index 0
      
      'stripecount=3:  OST 1 has more objects vs. OST 0  (341 > 197 x 5/4)'
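
      For context, the failure message reflects a max/min imbalance check: the test counts the objects allocated on each OST and fails if the most-used OST holds more than 5/4 as many objects as the least-used one. The following is only a minimal sketch of that kind of check, with illustrative variable names, not the actual sanity.sh test_51d code:
      {noformat}
      # Sketch of the imbalance check implied by "341 > 197 x 5/4"
      # (illustrative values and names, not the real test_51d code): fail if
      # the busiest OST has more than 5/4 the objects of the least-used OST.
      MAX_OBJS=341   # objects counted on the most-used OST
      MIN_OBJS=197   # objects counted on the least-used OST
      if (( MAX_OBJS * 4 > MIN_OBJS * 5 )); then
              echo "FAIL: OST has more objects vs. OST 0 ($MAX_OBJS > $MIN_OBJS x 5/4)"
      fi
      {noformat}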
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_51d - 'stripecount=3: OST 1 has more objects vs. OST 0 (341 > 197 x 5/4)'

    Attachments

    Issue Links

    Activity

            [LU-16420] sanity test_51d: FAIL: stripecount=3: OST 1 has more objects vs. OST 0
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-16877 [ LU-16877 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is duplicated by LU-16902 [ LU-16902 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.16.0 [ 15190 ]
            Assignee Original: WC Triage [ wc-triage ] New: Andreas Dilger [ adilger ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51268/
            Subject: LU-16420 tests: move overstriping test to sanity-pfl
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0a21aa0623b87c643d4b56b9bd45fdcc1e566d4b

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51268
            Subject: LU-16420 tests: move overstriping test to sanity-pfl
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b05e0394254ca95c0e516acec1a12c945fea2f42

            adilger Andreas Dilger made changes -
            Priority Original: Minor [ 4 ] New: Major [ 3 ]

            adilger Andreas Dilger added a comment -

            Note that a passing run has almost perfect distribution:

            OST0 has 302 objects, 101 are index 0
            OST1 has 302 objects, 101 are index 0
            OST2 has 302 objects, 100 are index 0
            OST3 has 302 objects, 100 are index 0
            OST4 has 302 objects, 100 are index 0
            OST5 has 302 objects, 100 are index 0
            OST6 has 302 objects, 100 are index 0
            

            unlike a failing run, which (AFAICS) always has two OSTs with fewer objects, while the OSTs immediately following them have more first objects, which is what causes this test to fail:

            OST0 has 265 objects, 86 are index 0
            OST1 has 322 objects, 145 are index 0
            OST2 has 320 objects, 85 are index 0
            OST3 has 322 objects, 86 are index 0
            OST4 has 265 objects, 87 are index 0
            OST5 has 323 objects, 143 are index 0
            OST6 has 324 objects, 88 are index 0
            OST7 has 323 objects, 88 are index 0
            

            I also see that the passing results are with 7 OSTs (review-ldiskfs-*, 0/176 failed) and the failing ones are with 8 OSTs (review-dne-part-1, 65/269 runs failed in the past week). That lends some credence to "phase-shifted MDT creates" causing the problem, since creates between MDTs are not coordinated, for performance reasons.
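
            For reference, per-OST tallies like the ones above can be produced from lfs getstripe output. This is only a sketch that assumes the standard obdidx table format and an illustrative directory path, not the exact commands test_51d runs:
            {noformat}
            # Sketch (illustrative path, not the exact test code): count how many
            # objects each OST holds and how many files have that OST as stripe
            # index 0, matching the "OSTn has X objects, Y are index 0" lines.
            lfs getstripe --recursive /mnt/lustre/d51d.sanity | awk '
                /lmm_stripe_offset:/ { first[$2]++ }        # OST holding stripe 0
                NF == 4 && $1 ~ /^[0-9]+$/ { objs[$1]++ }   # one row per obdidx entry
                END {
                    for (i in objs)
                        printf "OST%d has %d objects, %d are index 0\n",
                               i, objs[i], first[i] + 0
                }'
            {noformat}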

            adilger Andreas Dilger made changes -
            Description Original: failure message only New: added the per-OST object counts from the failed run (as shown in the Description above)
            adilger Andreas Dilger added a comment - edited

            It seems possible that patch https://review.whamcloud.com/50532 "LU-13748 mdt: remove LASSERT in mdt_dump_lmm()" has caused this test to start failing much more often than before. It was failing only occasionally on master before 2023-05-22, with the exception that the above patch was failing test_51d regularly before it landed on 2023-05-19.

            The code change in that patch is very unlikely to be the cause (it just removes an LASSERT), but the added test_27Ch(), which creates two fully overstriped files, appears to affect test_51d() in some obscure way. I don't really know why this is the case, since it runs about 20 minutes earlier. The patch stopped failing in testing once the overstriped file and directory were removed at the end of the test, though I also don't know why that helped (it could also have been a coincidence that all of the test runs passed in that case).

            Since I don't think the code change itself is causing the test failures, I'm instead going to try skipping the new test and/or moving it later in the test series to see whether this avoids the problem. It would also be good to understand exactly why this test is causing the problem.

            Some unsupported theories include:

            • lingering exhaustion of objects on some OST after overstriping consumed all objects (but for 20 minutes?)
            • create thread stuck or in a bad state (but that wouldn't result in just an imbalance?)
            • MDS delete thread is still deleting OST objects from overstriped files, somehow affecting create performance?
            • creating two overstriped files (on two different MDTs) somehow got the OST starting index "in sync" between the MDTs, imbalancing the starting OST for new files just enough to trigger the failure?
            • MDS is somehow stuck in QOS allocation mode, even after the overstriped files have been removed?
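
            To make the suspected interaction concrete, here is an illustrative sketch of creating and removing a fully overstriped file of the kind test_27Ch() is described as creating; the paths, counts, and parameter name are assumptions, not the actual test code:
            {noformat}
            # Illustrative sketch (assumed paths/counts, not the actual test_27Ch
            # code): create a fully overstriped file, i.e. more stripes than OSTs,
            # then remove it, the cleanup that made the patch stop failing.
            OSTCOUNT=$(lfs df /mnt/lustre | grep -c OST)
            mkdir -p /mnt/lustre/d27Ch
            lfs setstripe -C $((OSTCOUNT * 2)) /mnt/lustre/d27Ch/overstriped
            lfs getstripe /mnt/lustre/d27Ch/overstriped   # typically two stripes per OST
            rm -rf /mnt/lustre/d27Ch
            # If the "stuck in QOS allocation mode" theory applied, the allocator
            # threshold on the MDS could be inspected (assumed parameter name):
            #   lctl get_param lod.*.qos_threshold_rr
            {noformat}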

            People

              Assignee: Andreas Dilger
              Reporter: Maloo