[LU-7009] Testing DNE on ZFS Created: 14/Aug/15 Updated: 15/Nov/19 Resolved: 15/Nov/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Di Wang | Assignee: | James Nunez (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I think the best approach here is to run the DNE-on-ZFS testing a few times to see which test scripts pass regularly, then move the passing scripts (or those passing with minimal additions to ALWAYS_EXCEPT) to review-zfs-part-1 running with 4 MDTs, and leave the failing tests (if not too many) in review-zfs-part-2 running with 1 MDT until we can fix the remaining test issues. As tests start passing they can be moved into review-zfs-part-1 until it gets too large; at that point, add the remaining few tests to ALWAYS_EXCEPT, set review-zfs-part-2 to also run with 4 MDTs, and rebalance the tests so that both sessions run for approximately the same time. That way we can start basic DNE-on-ZFS testing ASAP and migrate the remaining ZFS tests to DNE without having to skip many of the tests. Of course, if DNE-on-ZFS is mostly or completely passing today then this incremental migration to DNE testing can be skipped. |
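The sorting step described above (scripts that pass on every run versus scripts that fail at least once) could be sketched roughly as follows. The input format, the script names `script-a`/`script-b`/`script-c`, and the results themselves are hypothetical placeholders, not autotest output; the sketch only illustrates the idea of assigning consistently passing scripts to part-1 and the rest to part-2.

```shell
#!/bin/bash
# Hypothetical results from repeated runs, one "<script> <PASS|FAIL>" per line.
# These names and outcomes are placeholders, not real test data.
results="script-a PASS
script-a PASS
script-b PASS
script-b FAIL
script-c PASS"

# Read "<script> <status>" lines on stdin and print each script once,
# tagged part-1 if it passed every run, part-2 if it failed at least once.
classify() {
    awk '{ seen[$1] = 1; if ($2 != "PASS") bad[$1] = 1 }
         END { for (s in seen) print s, ((s in bad) ? "part-2" : "part-1") }' | sort
}

echo "$results" | classify
```

With the placeholder data, `script-b` is the only one assigned to part-2 because it failed one of its two runs.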
| Comments |
| Comment by Gerrit Updater [ 28/Aug/15 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/16122 |
| Comment by Andreas Dilger [ 01/Sep/15 ] |
|
See also https://testing.hpdd.intel.com/test_sessions/7f7566d8-d281-11e4-a20f-5254006e85c2 which has results from http://review.whamcloud.com/14148 but back in March. Not sure if those test results are all still available, but at least it provides one more data point for this testing. |
| Comment by Di Wang [ 01/Sep/15 ] |
|
Met this issue when testing on shadow:
19:31:31:Loading modules from /usr/lib64/lustre
19:31:31:detected 2 online CPUs by sysfs
19:31:31:Force libcfs to create 2 CPU partitions
19:31:31:debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck
19:31:31:subsystem_debug=all -lnet -lnd -pinger
19:31:33:Formatting mgs, mds, osts
19:31:33:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
19:31:35:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
19:31:35:Format mgs: lustre-mgs/mgs
19:31:35:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
19:31:36:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
19:31:36:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
19:31:37:CMD: shadow-50vm3 grep -c /mnt/mgs' ' /proc/mounts
19:31:37:CMD: shadow-50vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
19:31:37:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
19:31:38:CMD: shadow-50vm3 ! zpool list -H lustre-mgs >/dev/null 2>&1 ||
19:31:38: grep -q ^lustre-mgs/ /proc/mounts ||
19:31:38: zpool export lustre-mgs
19:31:39:CMD: shadow-50vm3 mkfs.lustre --mgs --param=sys.timeout=20 --backfstype=zfs --device-size=2097152 --reformat lustre-mgs/mgs /dev/lvm-Role_MDS/P1
19:31:39:shadow-50vm3:
19:31:40:shadow-50vm3: mkfs.lustre FATAL: unhandled fs type 5 'zfs'
19:31:40:shadow-50vm3:
19:31:40:shadow-50vm3: mkfs.lustre FATAL: unable to prepare backend (22)
19:31:40:shadow-50vm3: mkfs.lustre: exiting with 22 (Invalid argument)
20:31:24:********** Timeout by autotest system **********
Looks like a TEI issue. |
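For context on the repeated `[: too many arguments` message: bash prints it when an unquoted variable inside a `[ ... ]` test expands to more than one word. The code at line 3157 of test-framework.sh is not shown in this log, so the following is only a generic sketch using a hypothetical variable `fs_opts`; it is not the framework's code and may not be the actual cause here.

```shell
#!/bin/bash
# Hypothetical variable for illustration; not taken from test-framework.sh.
fs_opts="zfs mirror"

# Unquoted: the shell word-splits $fs_opts, so `[` receives
# `zfs mirror = zfs` and bash reports "[: too many arguments".
unquoted_msg=$( [ $fs_opts = "zfs" ] 2>&1 )
unquoted_rc=$?

# Quoted: `[` receives a single left-hand operand and compares normally.
[ "$fs_opts" = "zfs" ]
quoted_rc=$?

echo "unquoted: rc=$unquoted_rc msg=$unquoted_msg"
echo "quoted:   rc=$quoted_rc"
```

The unquoted form fails with an error status rather than a plain "false", which is why such a test can silently take the wrong branch while still letting the script continue.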
| Comment by Peter Jones [ 22/Sep/15 ] |
|
This testing is complete and tickets were created for the issues found |
| Comment by Andreas Dilger [ 22/Sep/15 ] |
|
I'd rather this not be closed until we are actually testing DNE+ZFS automatically via autotest. Per the original comment, if there are tests that are already passing then they can be moved into an autotest config file separate from the tests that are currently failing and/or disabled. If enough tests are passing, we could just rearrange review-zfs-part-{1,2} and change the config for one of them to start testing with MDSCOUNT=2. It makes sense to start doing this as soon as possible to avoid regressions while the existing bugs are fixed or disabled. |
| Comment by Peter Jones [ 22/Sep/15 ] |
|
That's a good point Andreas. We'd need to land either fixes for the bugs or else temporarily disable the affected tests while the fix was being worked on and then make the TEI request. Should we start by introducing that change on full test runs first and then rolling out to master after a few successful runs? |
| Comment by Andreas Dilger [ 22/Sep/15 ] |
|
I'm reluctant to break "full" runs more than they already are today. Also, history shows that tests running on master get far less attention than those running on review. Rather, I think the existing testing patch can be expanded until it starts passing or skipping (for only the ZFS+DNE config) all the tests currently being run for review-zfs-part-{1,2} and then landed like any other patch. Since it would only be adding exceptions for the ZFS+DNE tests, it couldn't cause breakage when it lands. The patch should include full runs of those tests, so it should be OK to enable the DNE config afterward, as long as we leave a non-DNE test running for the excepted or skipped tests, as we do with review-ldiskfs today. |
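A minimal sketch of "adding exceptions for only the ZFS+DNE config" might look like the following. The function name `build_except`, the list `ZFS_DNE_EXCEPT`, and the test numbers in it are hypothetical placeholders; the real test scripts obtain the backend type and MDS count from the test framework rather than from function arguments.

```shell
#!/bin/bash
# Hypothetical list of tests to skip only on ZFS with multiple MDTs.
# The test numbers are placeholders, not known failures.
ZFS_DNE_EXCEPT="900a 901"

# Print the exception list for a given backend type and MDS count.
# $1 = backend fstype, $2 = MDS count, $3 = existing exception list.
build_except() {
    local fstype=$1 mdscount=$2 except=$3
    if [ "$fstype" = "zfs" ] && [ "$mdscount" -gt 1 ]; then
        # Append the ZFS+DNE-only exceptions, adding a separator if needed.
        except="${except:+$except }$ZFS_DNE_EXCEPT"
    fi
    echo "$except"
}

build_except zfs 4 "42a"      # ZFS with DNE: extra exceptions appended
build_except zfs 1 "42a"      # ZFS without DNE: list unchanged
build_except ldiskfs 4 "42a"  # ldiskfs with DNE: list unchanged
```

Because the extra exceptions apply only when both conditions hold, the existing single-MDT ZFS and ldiskfs configurations keep running the full test list.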
| Comment by James Nunez (Inactive) [ 17/Mar/17 ] |
|
We are running the 'full' test group with ZFS + DNE. Some recent results are at: https://testing.hpdd.intel.com/test_sessions/9803525a-f453-11e6-b655-5254006e85c2 There are many failures, but there are a few that consistently fail: |
| Comment by Andreas Dilger [ 27/Apr/17 ] |
|
Rather than doing "full" test runs (which we know have permanent failures even with ldiskfs), it would be better to run review-dne-part-1 and review-dne-part-2 with mdtfilesystemtype=zfs and ostfilesystemtype=zfs (or review-zfs-part-1 and review-zfs-part-2 with mdscount=4), so that we can have an apples-to-apples comparison of tests that are failing only with ZFS. I see that there is also a regular replay-single test failure, but other than that and the two failures mentioned above, I think there are enough tests that ARE passing that we should create a new test session review-zfs that has only the failing DNE ZFS tests, and change the main ZFS test sessions to have multiple MDTs/MDS. It would be good to do this now that we are past the 2.10 feature freeze and before the release so that we have high confidence in this code and there are no (further?) regressions. |
| Comment by Andreas Dilger [ 04/Aug/18 ] |
|
James, not sure whether this already duplicates another ticket you are using to track this work? |
| Comment by Andreas Dilger [ 15/Nov/19 ] |
|
James, I think this work is done now? We are always testing review-dne-zfs-partX for each patch. I think this can be closed. |
| Comment by James Nunez (Inactive) [ 15/Nov/19 ] |
|
There are multiple test groups that run in a DNE environment, namely review-dne-*. Thus, this ticket can be closed. |