[LU-5706] conf-sanity test_57a: @@@@@@ FAIL: OST registration from failnode should fail Created: 04/Oct/14 Updated: 27/Oct/14 Resolved: 27/Oct/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | zfs | ||
| Severity: | 3 | ||||
| Rank (Obsolete): | 15995 | ||||
| Description |
|
This issue was created by maloo for Bruno Faccini <bfaccini62@gmail.com>. This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/637ff760-4ac1-11e4-a839-5254006e85c2. Recent master review builds seem to hit this failure quite frequently. A recent master change may have introduced a regression. |
| Comments |
| Comment by Jian Yu [ 04/Oct/14 ] |
|
Since 2014-10-02, the failure has been preventing Lustre b2_5 patches from passing review testing with FSTYPE=zfs:
== conf-sanity test 57a: initial registration from failnode should fail (should return errs) == 15:21:43 (1412263303)
CMD: shadow-42vm4 /usr/sbin/lctl get_param nis
shadow-42vm4: error: get_param: /proc/{fs,sys}/{lnet,lustre}/nis: Found no match
CMD: shadow-42vm3 grep -c /mnt/mds1' ' /proc/mounts
CMD: shadow-42vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-42vm3 ! zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
grep -q ^lustre-mdt1/ /proc/mounts ||
zpool export lustre-mdt1
CMD: shadow-42vm3 tunefs.lustre --quiet --writeconf lustre-mdt1/mdt1
shadow-42vm3:
shadow-42vm3: tunefs.lustre FATAL: Device lustre-mdt1/mdt1 has not been formatted with mkfs.lustre
shadow-42vm3: tunefs.lustre: exiting with 19 (No such device)
|
| Comment by nasf (Inactive) [ 05/Oct/14 ] |
|
Another failure instance: https://testing.hpdd.intel.com/test_sets/c54fa3f0-4c01-11e4-bb84-5254006e85c2 |
| Comment by James Nunez (Inactive) [ 06/Oct/14 ] |
|
Another failure on review-zfs: https://testing.hpdd.intel.com/test_sets/47c826f8-4bef-11e4-bb84-5254006e85c2 |
| Comment by Peter Jones [ 06/Oct/14 ] |
|
Nathaniel Could you please look into this one? Thanks Peter |
| Comment by Oleg Drokin [ 06/Oct/14 ] |
|
I think this definitely was introduced by https://jira.hpdd.intel.com/browse/LU-4749 as there's a huge uptick in these failures since the end of September in master: https://testing.hpdd.intel.com/sub_tests/query?commit=Update+results&page=1&sub_test[query_bugs]=&sub_test[status]=FAIL&sub_test[sub_test_script_id]=dc46d338-6c5a-11e0-b32b-52540025f9af&test_node[architecture_type_id]=&test_node[distribution_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=&test_node[os_type_id]=&test_node_network[network_type_id]=&test_session[query_date]=&test_session[query_recent_period]=&test_session[test_group]=&test_session[test_host]=&test_session[user_id]=&test_set[test_set_script_id]=7f66aa20-3db2-11e0-80c0-52540025f9af&utf8=%E2%9C%93 page 2 does not show, but page 3 starts in 2013... |
| Comment by Andreas Dilger [ 07/Oct/14 ] |
|
Oleg, have you reverted http://review.whamcloud.com/11956 from master and http://review.whamcloud.com/12196 from b2_5 yet? This is causing frequent test failures on review-zfs tests. |
| Comment by Li Wei (Inactive) [ 07/Oct/14 ] |
|
I don't think this was introduced by my [...]
== conf-sanity test 57a: initial registration from failnode should fail (should return errs) == 09:56:58 (1412675818)
CMD: shadow-42vm4 /usr/sbin/lctl get_param nis
shadow-42vm4: error: get_param: /proc/{fs,sys}/{lnet,lustre}/nis: Found no match
[...]
This resulted in an empty string in "NID", which was then used to generate the following command line: [...]
CMD: shadow-42vm4 tunefs.lustre --failnode= lustre-ost1/ost1
checking for existing Lustre data: found
Read previous values:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: zfs
Flags: 0x2
(OST )
Persistent mount opts:
Parameters: mgsnode=10.1.5.248@tcp sys.timeout=20
Permanent disk data:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: zfs
Flags: 0x42
(OST update )
Persistent mount opts:
Parameters: mgsnode=10.1.5.248@tcp sys.timeout=20 failover.node=Yi:
Writing lustre-ost1/ost1 properties
lustre:version=1
lustre:flags=66
lustre:index=0
lustre:fsname=lustre
lustre:svname=lustre-OST0000
lustre:mgsnode=10.1.5.248@tcp
lustre:sys.timeout=20
lustre:failover.node=Yi:
[...]
Notice that the empty "failnode" argument wasn't caught by tunefs.lustre. (Instead, convert_hostnames() returned garbage when given an empty string---a problem that needs to be fixed as well.) |
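The failure mode described in this comment, an empty NID flowing into `--failnode=`, could also be guarded against on the script side. The following is only a minimal sketch under stated assumptions, not the fix that landed: `build_failnode_cmd` is a hypothetical helper name that does not come from this ticket, and the NID in the usage note is an illustrative value. The only real command it references is the `tunefs.lustre --failnode=` invocation shown in the log above.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not part of the Lustre test scripts):
# print the tunefs.lustre command line only when the NID is non-empty.
build_failnode_cmd() {
    local nid="$1"
    local dev="$2"

    # An empty NID is what produced "--failnode= lustre-ost1/ost1" in the log.
    if [ -z "$nid" ]; then
        echo "error: empty NID, refusing to build --failnode" >&2
        return 1
    fi

    echo "tunefs.lustre --failnode=$nid $dev"
}
```

For example, `build_failnode_cmd "192.168.0.1@tcp" lustre-ost1/ost1` prints the command line, while `build_failnode_cmd "" lustre-ost1/ost1` prints an error to stderr and returns non-zero. This catches only the completely empty string; it does not validate the NID format.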
| Comment by Li Wei (Inactive) [ 07/Oct/14 ] |
|
As to why "nis" did not exist, could it be that test_57a() has been (incorrectly) depending on prior tests to leave modules loaded? Consider these:
Now the question is if "excepting tests: 32newtarball 56 export 76a 59 64 69" started to appear around the time Oleg suspected. |
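If the test does depend on an earlier test leaving modules loaded, an explicit check before querying "nis" would make that dependency visible. The sketch below is an assumption about how such a check might look, not code from the test framework: `lnet_loaded` is a hypothetical helper, and it relies only on the fact that each line of /proc/modules begins with the module name followed by a space.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not taken from this ticket):
# succeed only if the lnet module is listed in the given module list.
# Defaults to /proc/modules, where each line starts with the module name.
lnet_loaded() {
    local modlist="${1:-/proc/modules}"

    # Match "lnet" as a whole module name at the start of a line.
    grep -q '^lnet ' "$modlist"
}
```

A test could then call `lnet_loaded || echo "lnet not loaded" >&2` before running `lctl get_param nis`, so that a missing module is reported directly instead of surfacing later as an empty NID.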
| Comment by Oleg Drokin [ 07/Oct/14 ] |
|
Andreas: 12196 has not landed on b2_5 yet. I was under the impression that there were no b2_5 failures outside of select patches that include it, but apparently there are still some failures, so some other recently landed change must be causing this bug. Li Wei, test 54a has been skipped for zfs since Oct 2012, so it cannot be the trigger. |
| Comment by Li Wei (Inactive) [ 07/Oct/14 ] |
|
Oleg, I wasn't implying 54a to be the problem, but one has to know which was the last not-skipped test before 57a. As to 56, I've no idea how it got into the exception list too. (We should really change exception lists only with Git commits.) |
| Comment by Andreas Dilger [ 08/Oct/14 ] |
|
This is causing a large number of review-zfs test failures at this point (20 yesterday), so either the original patch should be reverted, or this test disabled since we are wasting our time. This isn't the only problem with review-zfs, but only about 15% of these test sessions are passing right now. |
| Comment by Nathaniel Clark [ 08/Oct/14 ] |
| Comment by Li Wei (Inactive) [ 13/Oct/14 ] |
|
The "56" in the exception list was indeed added by Autotest. Minh helped me remove that exception last Saturday. Now it looks like this failure is gone (three recent review-zfs runs): https://testing.hpdd.intel.com/test_sets/aa8077d8-5224-11e4-a79f-5254006e85c2 This buys us more time to cook http://review.whamcloud.com/12236. |
| Comment by Nathaniel Clark [ 27/Oct/14 ] |
|
patch landed to master |