Details

    • Type: Bug
    • Resolution: Done
    • Priority: Minor
    • Lustre 2.8.0

    Description

      I think the best approach here is to run the DNE-on-ZFS testing a few times to see which test scripts are passing regularly, and then move the passing scripts (or passing with minimal additions to ALWAYS_EXCEPT) to review-zfs-part-1 running with 4 MDTs and leave the failing tests (if not too many) in review-zfs-part-2 running with 1 MDT until we can fix the remaining test issues. As tests pass they can be moved into review-zfs-part-1 until it gets too large, and then just add the remaining few tests to ALWAYS_EXCEPT and set review-zfs-part-2 to also run with 4 MDTs and rebalance the tests to run with approximately the same time.

      That way we can start basic DNE-on-ZFS testing ASAP and migrate the remaining ZFS tests to DNE without having to skip many of the tests.

      Of course, if DNE-on-ZFS is mostly or completely passing today then this incremental migration to DNE testing can be skipped.
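
      As a rough sketch of the split described above (the group names follow the existing review-zfs-part-1/2 sessions; the variable names are the standard test-framework knobs, and the exception list shown is a placeholder, not a real set of failing subtests):

      # review-zfs-part-1 config: scripts that already pass with DNE on ZFS
      MDSCOUNT=4
      FSTYPE=zfs
      # minimal additions for the handful of subtests still failing in this config
      # (hypothetical subtest numbers; SANITY_EXCEPT feeds sanity.sh's ALWAYS_EXCEPT)
      SANITY_EXCEPT="17a 42b"

      # review-zfs-part-2 config: scripts still failing with DNE, kept on a single MDT
      MDSCOUNT=1
      FSTYPE=zfs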

      Activity

            [LU-7009] Testing DNE on ZFS

            There are multiple test groups that run in a DNE environment, namely review-dne-*part{1,2,3,4}. Thus, this ticket can be closed.

            jamesanunez James Nunez (Inactive) added a comment

            James, I think this work is done now? We are always testing review-dne-zfs-partX for each patch. I think this can be closed.

            adilger Andreas Dilger added a comment

            James, I'm not sure if this is already a duplicate of another ticket you are using to track this work?

            adilger Andreas Dilger added a comment

            Rather than doing "full" test runs (which we know have permanent failures even with ldiskfs), it would be better to run review-dne-part-1 and review-dne-part-2 with mdtfilesystemtype=zfs and ostfilesystemtype=zfs (or review-zfs-part-1 and review-zfs-part-2 with mdscount=4), so that we can have an apples-to-apples comparison of tests that are failing only with ZFS.

            I see that there is also a regular replay-single test failure, but other than that and the two failures mentioned above, I think there are enough tests that ARE passing that we should create a new test session review-zfs that has only the failing DNE ZFS tests, and change the main ZFS test sessions to have multiple MDTs/MDS. It would be good to do this now that we are past the 2.10 feature freeze and before the release so that we have high confidence in this code and there are no (further?) regressions.

            adilger Andreas Dilger added a comment
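
            For reference, a local equivalent of such a session might be run as follows (a sketch only, assuming the standard FSTYPE/MDSCOUNT environment knobs and the auster runner from lustre/tests; the suite selection is illustrative):

            # run the named suites against a ZFS backend with 4 MDTs
            FSTYPE=zfs MDSCOUNT=4 bash auster -r -v sanity replay-single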

            We are running the 'full' test group with ZFS + DNE. Some recent results are at:

            https://testing.hpdd.intel.com/test_sessions/9803525a-f453-11e6-b655-5254006e85c2
            https://testing.hpdd.intel.com/test_sessions/d38b0f30-0262-11e7-a4d2-5254006e85c2
            https://testing.hpdd.intel.com/test_sessions/d5adbff2-096e-11e7-b5b2-5254006e85c2

            There are many failures, but there are a few that consistently fail:
            sanity-lfsck test_5e (LU-8840)
            sanity-quota test_12b (Ticket ??)

            jamesanunez James Nunez (Inactive) added a comment

            I'm reluctant to break "full" runs more than they already are today. Also, history shows that tests running on master get far less attention than those running on review. Rather, I think the existing testing patch can be expanded until it starts passing or skipping (for only the ZFS+DNE config) all the tests currently being run for review-zfs-part-{1,2}, and then landed like any other patch.

            Since it would only be adding exceptions for the ZFS+DNE tests, it couldn't cause breakage when it lands. It should include full runs of those tests, so it should be OK to enable the DNE config afterward, as long as we leave a non-DNE test session running for the excepted or skipped tests, as we do with review-ldiskfs today.

            adilger Andreas Dilger added a comment
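
            A sketch of what a ZFS+DNE-only exception could look like inside a test script (the subtest number and ticket reference are hypothetical; skip(), facet_fstype(), error() and run_test come from test-framework.sh):

            test_900() {
                    # only skip for the ZFS + DNE combination; all other configs still run it
                    if [ "$(facet_fstype mds1)" = zfs ] && [ "$MDSCOUNT" -gt 1 ]; then
                            skip "fails with DNE on ZFS, see LU-xxxx" && return 0
                    fi
                    touch $DIR/$tfile || error "touch $DIR/$tfile failed"
            }
            run_test 900 "example of a ZFS+DNE-only exception"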
            pjones Peter Jones added a comment -

            That's a good point Andreas. We'd need to land either fixes for the bugs or else temporarily disable the affected tests while the fix was being worked on and then make the TEI request. Should we start by introducing that change on full test runs first and then rolling out to master after a few successful runs?


            I'd rather this not be closed until we are actually testing DNE+ZFS automatically via autotest.

            Per the original comment, if there are tests that are already passing, then they can be moved into an autotest config file separate from the tests that are currently failing and/or disabled. If enough tests are passing, we could just rearrange review-zfs-part-{1,2} and change the config for one of them to start testing with MDSCOUNT=2.

            It makes sense to start doing this as soon as possible to avoid regressions while the existing bugs are fixed or disabled.

            adilger Andreas Dilger added a comment
            pjones Peter Jones added a comment -

            This testing is complete and tickets were created for the issues found.


            Hit this issue when testing on shadow:

            19:31:31:Loading modules from /usr/lib64/lustre
            19:31:31:detected 2 online CPUs by sysfs
            19:31:31:Force libcfs to create 2 CPU partitions
            19:31:31:debug=vfstrace rpctrace dlmtrace neterror ha config 		      ioctl super lfsck
            19:31:31:subsystem_debug=all -lnet -lnd -pinger
            19:31:33:Formatting mgs, mds, osts
            19:31:33:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
            19:31:35:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
            19:31:35:Format mgs: lustre-mgs/mgs
            19:31:35:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
            19:31:36:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
            19:31:36:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
            19:31:37:CMD: shadow-50vm3 grep -c /mnt/mgs' ' /proc/mounts
            19:31:37:CMD: shadow-50vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
            19:31:37:/usr/lib64/lustre/tests/test-framework.sh: line 3157: [: too many arguments
            19:31:38:CMD: shadow-50vm3 ! zpool list -H lustre-mgs >/dev/null 2>&1 ||
            19:31:38:			grep -q ^lustre-mgs/ /proc/mounts ||
            19:31:38:			zpool export  lustre-mgs
            19:31:39:CMD: shadow-50vm3 mkfs.lustre --mgs --param=sys.timeout=20 --backfstype=zfs --device-size=2097152 --reformat lustre-mgs/mgs /dev/lvm-Role_MDS/P1
            19:31:39:shadow-50vm3: 
            19:31:40:shadow-50vm3: mkfs.lustre FATAL: unhandled fs type 5 'zfs'
            19:31:40:shadow-50vm3: 
            19:31:40:shadow-50vm3: mkfs.lustre FATAL: unable to prepare backend (22)
            19:31:40:shadow-50vm3: mkfs.lustre: exiting with 22 (Invalid argument)
            20:31:24:********** Timeout by autotest system **********
            

            Looks like a TEI issue.

            di.wang Di Wang (Inactive) added a comment
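
            The repeated "line 3157: [: too many arguments" messages in the log above are the classic symptom of an unquoted variable expanding to several words inside a [ ... ] test. A minimal illustration of that failure mode (not the actual test-framework.sh code):

            val="zfs extra"
            [ $val = zfs ] && echo match        # -> bash: [: too many arguments
            [ "$val" = zfs ] && echo match      # quoted form evaluates cleanly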

            People

              jamesanunez James Nunez (Inactive)
              di.wang Di Wang (Inactive)
              Votes: 0
              Watchers: 11
