[LU-7009] Testing DNE on ZFS Created: 14/Aug/15  Updated: 15/Nov/19  Resolved: 15/Nov/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Di Wang Assignee: James Nunez (Inactive)
Resolution: Done Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7191 sanity test_27z: FAIL: O/300000400/d2... Resolved
is related to LU-6831 The ticket for tracking all DNE2 bugs Reopened
is related to LU-7192 conf-sanity test_32c: ZFS test failur... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I think the best approach here is to run the DNE-on-ZFS testing a few times to see which test scripts are passing regularly, and then move the passing scripts (or those passing with minimal additions to ALWAYS_EXCEPT) to review-zfs-part-1 running with 4 MDTs, leaving the failing tests (if not too many) in review-zfs-part-2 running with 1 MDT until we can fix the remaining test issues. As tests pass they can be moved into review-zfs-part-1 until it gets too large; at that point, add the remaining few tests to ALWAYS_EXCEPT, set review-zfs-part-2 to also run with 4 MDTs, and rebalance the tests so both parts take approximately the same time.

That way we can start basic DNE-on-ZFS testing ASAP and migrate the remaining ZFS tests to DNE without having to skip many of the tests.

Of course, if DNE-on-ZFS is mostly or completely passing today then this incremental migration to DNE testing can be skipped.
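[Editor's note] For concreteness, a minimal sketch of the two proposed configurations, assuming the MDSCOUNT and FSTYPE environment variables honored by test-framework.sh; the suite name is only an example:

# review-zfs-part-1 style run: DNE on ZFS with 4 MDTs
MDSCOUNT=4 FSTYPE=zfs bash sanity.sh

# review-zfs-part-2 style run: single MDT until the remaining issues are fixed
MDSCOUNT=1 FSTYPE=zfs bash sanity.sh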



 Comments   
Comment by Gerrit Updater [ 28/Aug/15 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/16122
Subject: LU-7009 test: test DNE on ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fe93134b19331e9300f339c4584390b37fced368

Comment by Andreas Dilger [ 01/Sep/15 ]

See also https://testing.hpdd.intel.com/test_sessions/7f7566d8-d281-11e4-a20f-5254006e85c2, which has results from http://review.whamcloud.com/14148, though from back in March. I'm not sure those test results are all still available, but it at least provides one more data point for this testing.

Comment by Di Wang [ 01/Sep/15 ]

Met this issue when testing on shadow:

19:31:31:Loading modules from /usr/lib64/lustre
19:31:31:detected 2 online CPUs by sysfs
19:31:31:Force libcfs to create 2 CPU partitions
19:31:31:debug=vfstrace rpctrace dlmtrace neterror ha config 		      ioctl super lfsck
19:31:31:subsystem_debug=all -lnet -lnd -pinger
19:31:33:Formatting mgs, mds, osts
19:31:33:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
19:31:35:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
19:31:35:Format mgs: lustre-mgs/mgs
19:31:35:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
19:31:36:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
19:31:36:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
19:31:37:CMD: shadow-50vm3 grep -c /mnt/mgs' ' /proc/mounts
19:31:37:CMD: shadow-50vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
19:31:37:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
19:31:38:CMD: shadow-50vm3 ! zpool list -H lustre-mgs >/dev/null 2>&1 ||
19:31:38:			grep -q ^lustre-mgs/ /proc/mounts ||
19:31:38:			zpool export  lustre-mgs
19:31:39:CMD: shadow-50vm3 mkfs.lustre --mgs --param=sys.timeout=20 --backfstype=zfs --device-size=2097152 --reformat lustre-mgs/mgs /dev/lvm-Role_MDS/P1
19:31:39:shadow-50vm3: 
19:31:40:shadow-50vm3: mkfs.lustre FATAL: unhandled fs type 5 'zfs'
19:31:40:shadow-50vm3: 
19:31:40:shadow-50vm3: mkfs.lustre FATAL: unable to prepare backend (22)
19:31:40:shadow-50vm3: mkfs.lustre: exiting with 22 (Invalid argument)
20:31:24:********** Timeout by autotest system **********

Looks like a TEI issue.
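[Editor's note] The repeated "[: too many arguments" messages above are the generic bash failure mode when an unquoted variable expands to multiple words inside [ ]; a minimal reproduction of the pattern (not the actual test-framework.sh line 3157):

# Minimal reproduction of "[: too many arguments" (generic pattern,
# not the actual test-framework.sh code):
opts="a b c"
[ $opts = "a b c" ]       # unquoted: becomes [ a b c = "a b c" ] -> error
[ "$opts" = "a b c" ]     # quoted: stays a single word, compares correctly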

Comment by Peter Jones [ 22/Sep/15 ]

This testing is complete and tickets were created for the issues found

Comment by Andreas Dilger [ 22/Sep/15 ]

I'd rather this not be closed until we are actually testing DNE+ZFS automatically via autotest.

Per the original comment, if there are tests that are already passing then they can be moved into an autotest config file separate from the tests that are currently failing and/or disabled. If enough tests are passing, we could just rearrange review-zfs-part-{1,2} and change the config for one of them to start testing with MDSCOUNT=2.

It makes sense to start doing this as soon as possible to avoid regressions while the existing bugs are fixed or disabled.

Comment by Peter Jones [ 22/Sep/15 ]

That's a good point Andreas. We'd need either to land fixes for the bugs or to temporarily disable the affected tests while the fixes are being worked on, and then make the TEI request. Should we start by introducing that change on full test runs first and then roll it out to master after a few successful runs?

Comment by Andreas Dilger [ 22/Sep/15 ]

I'm reluctant to break "full" runs more than they already are today. Also, history shows that tests running on master get far less attention than those running on review. Rather, I think the existing testing patch can be expanded until it starts passing or skipping (for only the ZFS+DNE config) all the tests currently being run for review-zfs-part-{1,2}, and then landed like any other patch.

Since it would only be adding exceptions for the ZFS+DNE tests, it couldn't cause breakage when it lands. It should include full runs of those tests, so it should be OK to enable the DNE config afterward, as long as we leave a non-DNE test running for the excepted or skipped tests, as we do with review-ldiskfs today.
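[Editor's note] A minimal sketch of such ZFS+DNE-only exceptions in a test script, assuming the FSTYPE, MDSCOUNT, and ALWAYS_EXCEPT variables used by the Lustre test scripts; the test numbers are placeholders, not real failures:

# Skip these tests only for the ZFS+DNE configuration; every other
# configuration keeps running them (placeholder test IDs):
if [ "$FSTYPE" = "zfs" ] && [ "$MDSCOUNT" -gt 1 ]; then
    ALWAYS_EXCEPT="$ALWAYS_EXCEPT 33a 51b"
fi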

Comment by James Nunez (Inactive) [ 17/Mar/17 ]

We are running the 'full' test group with ZFS + DNE. Some recent results are at:

https://testing.hpdd.intel.com/test_sessions/9803525a-f453-11e6-b655-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/d38b0f30-0262-11e7-a4d2-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/d5adbff2-096e-11e7-b5b2-5254006e85c2

There are many failures, but a few tests fail consistently:
sanity-lfsck test_5e (LU-8840)
sanity-quota test_12b (Ticket ??)

Comment by Andreas Dilger [ 27/Apr/17 ]

Rather than doing "full" test runs (which we know have permanent failures even with ldiskfs), it would be better to run review-dne-part-1 and review-dne-part-2 with mdtfilesystemtype=zfs and ostfilesystemtype=zfs (or review-zfs-part-1 and review-zfs-part-2 with mdscount=4), so that we can have an apples-to-apples comparison of tests that are failing only with ZFS.

I see that there is also a regular replay-single test failure, but other than that and the two failures mentioned above, I think there are enough tests that ARE passing that we should create a new test session review-zfs that has only the failing DNE ZFS tests, and change the main ZFS test sessions to have multiple MDTs/MDS. It would be good to do this now that we are past the 2.10 feature freeze and before the release so that we have high confidence in this code and there are no (further?) regressions.
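[Editor's note] A sketch of that apples-to-apples comparison as manual runs, assuming the auster test runner from lustre/tests along with the MDSCOUNT and FSTYPE variables; the -r/-v flags and the suite choice are illustrative:

# Identical test list on two backends; failures that appear only in
# the second run are ZFS-specific:
MDSCOUNT=4 FSTYPE=ldiskfs bash auster -rv replay-single
MDSCOUNT=4 FSTYPE=zfs     bash auster -rv replay-single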

Comment by Andreas Dilger [ 04/Aug/18 ]

James, not sure if this is already a duplicate of another ticket you are using to track this work?

Comment by Andreas Dilger [ 15/Nov/19 ]

James, I think this work is done now? We are always testing review-dne-zfs-partX for each patch. I think this can be closed.

Comment by James Nunez (Inactive) [ 15/Nov/19 ]

There are multiple test groups that run in a DNE environment, namely review-dne-part-{1,2,3,4}. Thus, this ticket can be closed.
