Details

    • Technical task
    • Resolution: Fixed
    • Critical
    • Lustre 2.6.0
    • Lustre 2.6.0
    • None
    • 8490

    Description

      Create the test plan for all of LFSCK Phase II and attach to the Jira ticket so that the test engineers know how to test the feature for the release.

      Attachments

        Activity

          [LU-3423] Create LFSCK II Test Plan and attach to Jira ticket

          Here are the results for the LFSCK Phase II test plan. I plan to add past results to this document and clean up the presentation of the results.

          jamesanunez James Nunez (Inactive) added a comment

          James' questions have been answered and the test plan does not need to be updated at this point.

          jlevi Jodi Levi (Inactive) added a comment

          The new option "-c" is tested in sanity-lfsck.sh test_14 and test_18; the new option "-o" is tested in sanity-lfsck.sh test_18. Both are part of test plan item 1.1).

          The LFSCK II test plan only needs to cover layout LFSCK. Namespace LFSCK does not need to be covered because that was already done in LFSCK 1.5.

          Currently, LFSCK uses a single async OUT RPC to repair an inconsistency, except for orphan handling. This means that whether the inconsistency is a dangling reference, an unmatched MDT-OST pair, multiple references, or an inconsistent owner, the type should not affect repair performance much. So I selected the dangling reference case in lfsck-performance.sh, because it is easy to simulate. I do not think we need to test the performance of every kind of inconsistency unless we have a strong requirement and enough time to do that.

          yong.fan nasf (Inactive) added a comment

          With recent patch landings to LFSCK, there are a few more options to choose from. Should we incorporate some of these into the existing test plan?

          All of the new options need to be tested, but for any of the existing tests that revolve around performance, do we want to:
          Create lost OST-objects (-c)?
          Handle orphan objects (-o)?
          What type should we run: namespace, layout, or both?

          Since XATTR_NAME_FID does not exist, in test 2.2, is setting fail_loc to OBD_LFSCK_UNMATCHED_PAIR* or OBD_LFSCK_INVALID_PFID just as good or will repairing different failures cause dramatically different performance results? Is LFSCK_DANGLING preferred?

          jamesanunez James Nunez (Inactive) added a comment

          The test plan needs to be updated based on changes made to the functionality since the plan was written. James Nunez has the details. I will ask him to provide them in this ticket.

          jlevi Jodi Levi (Inactive) added a comment

          Sorry if the mention of the internal links in the Test Plan Template caused confusion. We will avoid using those links in any actual test plans. Please let me know if you see any of these links in actual test plans, and I will get them corrected.
          Thank you!

          jlevi Jodi Levi (Inactive) added a comment
          spitzcor Cory Spitz added a comment -

          Thanks, Jodi. The attached plan has embedded links to things like https://wiki.hpdd.intel.com/display/ENG/Regression+Test+Suites+and+Failover+Test+Suites, which we don't have access to.


          Test plan is attached to ticket. Will update test plan as needed but will close this ticket as complete.

          jlevi Jodi Levi (Inactive) added a comment
          jlevi Jodi Levi (Inactive) added a comment (edited) - Link to Lustre Feature Test Plan Template: https://wiki.hpdd.intel.com/display/PMP/Lustre+Feature+Test+Plan+Template
          yong.fan nasf (Inactive) added a comment (edited)

          LFSCK 2 test plan (ldiskfs only)
          ****************

          1. Correctness
          ----------------
          1.1) sanity-lfsck on Maloo with commit message "Test-Parameters: envdefinitions=ENABLE_QUOTA=yes mdtcount=2 testlist=sanity-lfsck". All test cases should pass.
          1.2) sanity-scrub on Maloo with commit message "Test-Parameters: envdefinitions=ENABLE_QUOTA=yes testlist=sanity-scrub". All test cases should pass.
          1.3) normal acc-sm tests on Maloo. All test cases should pass except for some known master failures.

          2. Performance
          ----------------
          The file set to be tested should be generated under the following conditions:
          A) Create 'L' test root directories, where 'L' is the MDT count; test root dir-X is located on MDT-X.
          B) Set the default stripe size to 64KB and the default stripe count to 'M'.
          C) Create 'N' sub-directories under each test root directory.
          D) Under each sub-directory, generate 100K normal files; each file contains 64 * 'M' KB of data.

          2.1) LFSCK against a healthy 2.x system, as a routine consistency check.
          2.1.1) Create the above test file sets with Lustre-2.6.
          2.1.2) Measure the highest LFSCK speed (full speed, without other workload) for different file sets: 'N' = 2, 4, 8, 16; with different stripe counts: 'M' = 1, 2, 4; and with different MDT counts: 'L' = 1, 2, 4.

          2.2) LFSCK against a lustre-2.x system with inconsistent layout OST-objects.
          2.2.1) On the OSS, set fail_loc to skip setting XATTR_NAME_FID, to simulate MDT-OST inconsistency.
          2.2.2) Create the above test file sets with Lustre-2.6.
          2.2.3) Measure the highest LFSCK speed (full speed, without other workload) for different file sets: 'N' = 2, 4, 8, 16; with different stripe counts: 'M' = 1, 2, 4; and with different MDT counts: 'L' = 1, 2, 4.
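
The parameter sweep in 2.1.2) and 2.2.3) can be enumerated with a short script. This is only a planning sketch, not part of lfsck-performance.sh: the function name is hypothetical, and it assumes that "100K" means 100,000 files per sub-directory.

```python
from itertools import product

FILES_PER_SUBDIR = 100_000  # assumption: "100K" in the plan means 100,000
STRIPE_SIZE_KB = 64         # default stripe size from condition B)


def performance_matrix(n_values=(2, 4, 8, 16),
                       m_values=(1, 2, 4),
                       l_values=(1, 2, 4)):
    """Yield one entry per (N, M, L) combination listed in the plan."""
    for n, m, l in product(n_values, m_values, l_values):
        yield {
            "N": n,  # sub-directories per test root directory
            "M": m,  # stripe count
            "L": l,  # MDT count (one test root directory per MDT)
            "total_files": l * n * FILES_PER_SUBDIR,
            "file_size_kb": STRIPE_SIZE_KB * m,
        }


if __name__ == "__main__":
    runs = list(performance_matrix())
    print(len(runs), "configurations per scenario")
    largest = max(runs, key=lambda r: r["total_files"] * r["file_size_kb"])
    print("largest file set:", largest)
```

Each of the two scenarios (2.1 and 2.2) would run the full set of combinations, so the sketch is mainly useful for estimating how many runs and how many files the plan implies.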

          3. Impact of LFSCK on small-file create performance
          ----------------
          Measure how much routine LFSCK affects normal small-file create performance. Generate the test file set as described in section 2 with N = 16, M = 4, L = 1.
          3.1) Run LFSCK at full speed on the file set. At the same time, use 'C' threads to create 512K small files in parallel (or 256K files if LFSCK runs too fast); each file is 64KB and single-striped. Each thread creates 512K / 'C' files under its own private directory.
          3.2) Measure the create performance with different LFSCK speed limits. The 3.1) result gives the highest LFSCK speed under the small-file create workload; call it 'S'. Then repeat the test with LFSCK speed limit = (1/4)'S', (1/2)'S', (3/4)'S'.
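
The arithmetic in 3.1) and 3.2) can be sketched as follows. The helper names are hypothetical, "512K" is assumed to mean 512,000 files, and the value used for 'S' in the usage lines is a made-up placeholder, not a measured result.

```python
def files_per_thread(total_files, threads):
    """Split the small-file create workload evenly across 'C' threads."""
    if threads <= 0 or total_files % threads != 0:
        raise ValueError("total_files must divide evenly across threads")
    return total_files // threads


def speed_limits(full_speed):
    """Return the (1/4)S, (1/2)S and (3/4)S speed limits for step 3.2)."""
    return [full_speed * k // 4 for k in (1, 2, 3)]


if __name__ == "__main__":
    total = 512_000  # assumption: "512K" means 512,000 files
    print(files_per_thread(total, threads=8))
    print(speed_limits(2000))  # 2000 is a placeholder for the measured 'S'
```

Integer division is used for the limits on the assumption that a speed limit is given as a whole number; if 'S' is not a multiple of 4 the results are rounded down.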

          4. Scale test
          ----------------
          Run LFSCK on more MDTs ('L' = 16) and OSTs ('M' = 16) for MDT-OST consistency verification.
          4.1) Verify whether there are correctness issues at this scale.
          4.2) Verify whether the LFSCK mechanism is usable at large scale, for example whether it becomes very slow.

          5. Resource requirements
          ----------------
          5.1) Test 1 can be done locally and on Maloo.
          5.2) Tests 2/3 need at least 4 MDS nodes, 2 OSS nodes, and 1 client.
          5.3) Test 4 can use the same hardware as tests 2/3, but it is better to use more real servers.
          5.4) Each OSS node needs at least 1TB of storage.
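
As a rough cross-check of 5.2) and 5.4), the data volume of a section 2 file set can be compared against the minimum OSS storage. This is a back-of-the-envelope sketch with hypothetical helper names. It assumes 100K = 100,000 files, 1KB = 1024 bytes, 1TB = 10^12 bytes, and exactly 2 OSS nodes, and it ignores filesystem overhead and the extra small files created in section 3.

```python
KB = 1024       # assumption: the plan's KB is 1024 bytes
TB = 10 ** 12   # assumption: 1TB is 10^12 bytes


def dataset_bytes(l, n, m, files_per_subdir=100_000, stripe_size_kb=64):
    """Total file data for L root dirs, N sub-dirs each, stripe count M."""
    file_bytes = stripe_size_kb * m * KB
    return l * n * files_per_subdir * file_bytes


def fits_on_oss(l, n, m, oss_nodes=2, bytes_per_oss=TB):
    """Check the data volume against the minimum OSS storage in 5.2)/5.4)."""
    return dataset_bytes(l, n, m) <= oss_nodes * bytes_per_oss


if __name__ == "__main__":
    # Largest section 2 configuration: L = 4, N = 16, M = 4.
    print(dataset_bytes(4, 16, 4), fits_on_oss(4, 16, 4))
    # Section 3 configuration: L = 1, N = 16, M = 4.
    print(dataset_bytes(1, 16, 4), fits_on_oss(1, 16, 4))
```

Under these assumptions the largest configuration needs about 1.68 * 10^12 bytes of file data, which fits within 2 OSS nodes of 1TB each but leaves limited headroom.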


          People

            yong.fan nasf (Inactive)
            jlevi Jodi Levi (Inactive)
            Votes: 0
            Watchers: 8
