Details

    • Type: Bug
    • Resolution: Done
    • Priority: Minor
    • Lustre 2.8.0

    Description

      I think the best approach here is to run the DNE-on-ZFS testing a few times to see which test scripts are passing regularly, and then move the passing scripts (or passing with minimal additions to ALWAYS_EXCEPT) to review-zfs-part-1 running with 4 MDTs and leave the failing tests (if not too many) in review-zfs-part-2 running with 1 MDT until we can fix the remaining test issues. As tests pass they can be moved into review-zfs-part-1 until it gets too large, and then just add the remaining few tests to ALWAYS_EXCEPT and set review-zfs-part-2 to also run with 4 MDTs and rebalance the tests to run with approximately the same time.

      That way we can start basic DNE-on-ZFS testing ASAP and migrate the remaining ZFS tests to DNE without having to skip many of the tests.

      Of course, if DNE-on-ZFS is mostly or completely passing today then this incremental migration to DNE testing can be skipped.
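
      As a rough sketch of the split described above (the group names follow the existing review-zfs-part-1/2 sessions; the variable names are the standard test-framework knobs, and the exception list shown is a placeholder, not a real set of failing subtests):

      # review-zfs-part-1 config: scripts that already pass with DNE on ZFS
      MDSCOUNT=4
      FSTYPE=zfs
      # minimal additions for the handful of subtests still failing in this config
      # (hypothetical subtest numbers; SANITY_EXCEPT feeds sanity.sh's ALWAYS_EXCEPT)
      SANITY_EXCEPT="17a 42b"

      # review-zfs-part-2 config: scripts still failing with DNE, kept on a single MDT
      MDSCOUNT=1
      FSTYPE=zfs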

      Activity

            [LU-7009] Testing DNE on ZFS

            There are multiple test groups that run in a DNE environment, namely review-dne-*part{1,2,3,4}. Thus, this ticket can be closed.

            jamesanunez James Nunez (Inactive) added a comment

            James, I think this work is done now? We are always testing review-dne-zfs-partX for each patch. I think this can be closed.

            adilger Andreas Dilger added a comment

            James, I'm not sure if this is already a duplicate of another ticket you are using to track this work?

            adilger Andreas Dilger added a comment

            Rather than doing "full" test runs (which we know have permanent failures even with ldiskfs), it would be better to run review-dne-part-1 and review-dne-part-2 with mdtfilesystemtype=zfs and ostfilesystemtype=zfs (or review-zfs-part-1 and review-zfs-part-2 with mdscount=4), so that we can have an apples-to-apples comparison of tests that are failing only with ZFS.

            I see that there is also a regular replay-single test failure, but other than that and the two failures mentioned above, I think there are enough tests that ARE passing that we should create a new test session review-zfs that has only the failing DNE ZFS tests, and change the main ZFS test sessions to have multiple MDTs/MDS. It would be good to do this now that we are past the 2.10 feature freeze and before the release so that we have high confidence in this code and there are no (further?) regressions.

            adilger Andreas Dilger added a comment
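
            For reference, a local equivalent of such a session might be run as follows (a sketch only, assuming the standard FSTYPE/MDSCOUNT environment knobs and the auster runner from lustre/tests; the suite selection is illustrative):

            # run the named suites against a ZFS backend with 4 MDTs
            FSTYPE=zfs MDSCOUNT=4 bash auster -r -v sanity replay-single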

            We are running the 'full' test group with ZFS + DNE. Some recent results are at:

            https://testing.hpdd.intel.com/test_sessions/9803525a-f453-11e6-b655-5254006e85c2
            https://testing.hpdd.intel.com/test_sessions/d38b0f30-0262-11e7-a4d2-5254006e85c2
            https://testing.hpdd.intel.com/test_sessions/d5adbff2-096e-11e7-b5b2-5254006e85c2

            There are many failures, but there are a few that consistently fail:
            sanity-lfsck test_5e (LU-8840)
            sanity-quota test_12b (Ticket ??)

            jamesanunez James Nunez (Inactive) added a comment

            I'm reluctant to break "full" runs more than they already are today. Also, history shows that tests running on master get far less attention than those running on review. Rather, I think the existing testing patch can be expanded until it starts passing or skipping (for only the ZFS+DNE config) all the tests currently being run for review-zfs-part-{1,2}, and then landed like any other patch.

            Since it would only be adding exceptions for the ZFS+DNE tests, it couldn't cause breakage when it lands. It should include full runs of those tests, so it should be OK to enable the DNE config afterward, as long as we leave a non-DNE test session running for the excepted or skipped tests, as we do with review-ldiskfs today.

            adilger Andreas Dilger added a comment
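
            A sketch of what a ZFS+DNE-only exception could look like inside a test script (the subtest number and ticket reference are hypothetical; skip(), facet_fstype(), error() and run_test come from test-framework.sh):

            test_900() {
                    # only skip for the ZFS + DNE combination; all other configs still run it
                    if [ "$(facet_fstype mds1)" = zfs ] && [ "$MDSCOUNT" -gt 1 ]; then
                            skip "fails with DNE on ZFS, see LU-xxxx" && return 0
                    fi
                    touch $DIR/$tfile || error "touch $DIR/$tfile failed"
            }
            run_test 900 "example of a ZFS+DNE-only exception"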
            pjones Peter Jones added a comment -

            That's a good point Andreas. We'd need to land either fixes for the bugs or else temporarily disable the affected tests while the fix was being worked on and then make the TEI request. Should we start by introducing that change on full test runs first and then rolling out to master after a few successful runs?


            I'd rather this not be closed until we are actually testing DNE+ZFS automatically via autotest.

            Per the original comment, if there are tests that are already passing, then they can be moved into an autotest config file separate from the tests that are currently failing and/or disabled. If enough tests are passing, we could just rearrange review-zfs-part-{1,2} and change the config for one of them to start testing with MDSCOUNT=2.

            It makes sense to start doing this as soon as possible to avoid regressions while the existing bugs are fixed or disabled.

            adilger Andreas Dilger added a comment
            pjones Peter Jones added a comment -

            This testing is complete and tickets were created for the issues found.


            Hit this issue when testing on shadow:

            19:31:31:Loading modules from /usr/lib64/lustre
            19:31:31:detected 2 online CPUs by sysfs
            19:31:31:Force libcfs to create 2 CPU partitions
            19:31:31:debug=vfstrace rpctrace dlmtrace neterror ha config 		      ioctl super lfsck
            19:31:31:subsystem_debug=all -lnet -lnd -pinger
            19:31:33:Formatting mgs, mds, osts
            19:31:33:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
            19:31:35:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
            19:31:35:Format mgs: lustre-mgs/mgs
            19:31:35:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
            19:31:36:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
            19:31:36:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
            19:31:37:CMD: shadow-50vm3 grep -c /mnt/mgs' ' /proc/mounts
            19:31:37:CMD: shadow-50vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
            19:31:37:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
            19:31:38:CMD: shadow-50vm3 ! zpool list -H lustre-mgs >/dev/null 2>&1 ||
            19:31:38:			grep -q ^lustre-mgs/ /proc/mounts ||
            19:31:38:			zpool export  lustre-mgs
            19:31:39:CMD: shadow-50vm3 mkfs.lustre --mgs --param=sys.timeout=20 --backfstype=zfs --device-size=2097152 --reformat lustre-mgs/mgs /dev/lvm-Role_MDS/P1
            19:31:39:shadow-50vm3: 
            19:31:40:shadow-50vm3: mkfs.lustre FATAL: unhandled fs type 5 'zfs'
            19:31:40:shadow-50vm3: 
            19:31:40:shadow-50vm3: mkfs.lustre FATAL: unable to prepare backend (22)
            19:31:40:shadow-50vm3: mkfs.lustre: exiting with 22 (Invalid argument)
            20:31:24:********** Timeout by autotest system **********
            

            Looks like a TEI issue.

            di.wang Di Wang (Inactive) added a comment
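
            The repeated "line 3157: [: too many arguments" messages in the log above are the classic symptom of an unquoted variable expanding to several words inside a [ ... ] test. A minimal illustration of that failure mode (not the actual test-framework.sh code):

            val="zfs extra"
            [ $val = zfs ] && echo match        # -> bash: [: too many arguments
            [ "$val" = zfs ] && echo match      # quoted form evaluates cleanly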

            People

              jamesanunez James Nunez (Inactive)
              di.wang Di Wang (Inactive)
              Votes: 0
              Watchers: 11
