[LU-5706] conf-sanity test_57a: @@@@@@ FAIL: OST registration from failnode should fail Created: 04/Oct/14  Updated: 27/Oct/14  Resolved: 27/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.4
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Nathaniel Clark
Resolution: Fixed Votes: 0
Labels: zfs

Issue Links:
Related
Severity: 3
Rank (Obsolete): 15995

 Description   

This issue was created by maloo for Bruno Faccini <bfaccini62@gmail.com>


This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/637ff760-4ac1-11e4-a839-5254006e85c2.

Recent master review builds seem to hit this failure quite frequently. A recent master change may have introduced a regression.



 Comments   
Comment by Jian Yu [ 04/Oct/14 ]

Since 2014-10-02, the failure has been preventing Lustre b2_5 patches from passing review testing with FSTYPE=zfs:
https://testing.hpdd.intel.com/test_sets/359d7a4e-4a75-11e4-95b1-5254006e85c2
https://testing.hpdd.intel.com/test_sets/97cdd432-4b03-11e4-8d48-5254006e85c2
https://testing.hpdd.intel.com/test_sets/8709c0fc-4b0d-11e4-b999-5254006e85c2
https://testing.hpdd.intel.com/test_sets/40a1872c-4b98-11e4-b01b-5254006e85c2
https://testing.hpdd.intel.com/test_sets/1636091a-4bb9-11e4-b01b-5254006e85c2
https://testing.hpdd.intel.com/test_sets/e865aef8-4bc4-11e4-b13d-5254006e85c2
https://testing.hpdd.intel.com/test_sets/55c870b0-4be9-11e4-b01b-5254006e85c2

== conf-sanity test 57a: initial registration from failnode should fail (should return errs) == 15:21:43 (1412263303)
CMD: shadow-42vm4 /usr/sbin/lctl get_param nis
shadow-42vm4: error: get_param: /proc/{fs,sys}/{lnet,lustre}/nis: Found no match
CMD: shadow-42vm3 grep -c /mnt/mds1' ' /proc/mounts
CMD: shadow-42vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: shadow-42vm3 ! zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
			grep -q ^lustre-mdt1/ /proc/mounts ||
			zpool export  lustre-mdt1
CMD: shadow-42vm3 tunefs.lustre --quiet --writeconf lustre-mdt1/mdt1
shadow-42vm3: 
shadow-42vm3: tunefs.lustre FATAL: Device lustre-mdt1/mdt1 has not been formatted with mkfs.lustre
shadow-42vm3: tunefs.lustre: exiting with 19 (No such device)
Comment by nasf (Inactive) [ 05/Oct/14 ]

Another failure instance:

https://testing.hpdd.intel.com/test_sets/c54fa3f0-4c01-11e4-bb84-5254006e85c2

Comment by James Nunez (Inactive) [ 06/Oct/14 ]

Another failure on review-zfs: https://testing.hpdd.intel.com/test_sets/47c826f8-4bef-11e4-bb84-5254006e85c2

Comment by Peter Jones [ 06/Oct/14 ]

Nathaniel

Could you please look into this one?

Thanks

Peter

Comment by Oleg Drokin [ 06/Oct/14 ]

I think this definitely was introduced by https://jira.hpdd.intel.com/browse/LU-4749 as there's a huge uptick in these failures since end of September in master: https://testing.hpdd.intel.com/sub_tests/query?commit=Update+results&page=1&sub_test[query_bugs]=&sub_test[status]=FAIL&sub_test[sub_test_script_id]=dc46d338-6c5a-11e0-b32b-52540025f9af&test_node[architecture_type_id]=&test_node[distribution_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=&test_node[os_type_id]=&test_node_network[network_type_id]=&test_session[query_date]=&test_session[query_recent_period]=&test_session[test_group]=&test_session[test_host]=&test_session[user_id]=&test_set[test_set_script_id]=7f66aa20-3db2-11e0-80c0-52540025f9af&utf8=%E2%9C%93

page 2 does not show, but page 3 starts in 2013...

Comment by Andreas Dilger [ 07/Oct/14 ]

Oleg, have you reverted http://review.whamcloud.com/11956 from master and http://review.whamcloud.com/12196 from b2_5 yet? This is causing frequent test failures on review-zfs tests.

Comment by Li Wei (Inactive) [ 07/Oct/14 ]

I don't think this was introduced by my LU-4749 patch. As Yu Jian suggested early on, errors started to occur from the very beginning (taken from a recent Maloo report):

== conf-sanity test 57a: initial registration from failnode should fail (should return errs) == 09:56:58 (1412675818)
CMD: shadow-42vm4 /usr/sbin/lctl get_param nis
shadow-42vm4: error: get_param: /proc/{fs,sys}/{lnet,lustre}/nis: Found no match
[...]

This resulted in an empty string in "NID", which was then used to generate the following command line:

[...]
CMD: shadow-42vm4 tunefs.lustre --failnode= lustre-ost1/ost1
checking for existing Lustre data: found

   Read previous values:
Target:     lustre-OST0000
Index:      0
Lustre FS:  lustre
Mount type: zfs
Flags:      0x2
              (OST )
Persistent mount opts: 
Parameters: mgsnode=10.1.5.248@tcp sys.timeout=20


   Permanent disk data:
Target:     lustre-OST0000
Index:      0
Lustre FS:  lustre
Mount type: zfs
Flags:      0x42
              (OST update )
Persistent mount opts: 
Parameters: mgsnode=10.1.5.248@tcp sys.timeout=20 failover.node=Yi:

Writing lustre-ost1/ost1 properties
  lustre:version=1
  lustre:flags=66
  lustre:index=0
  lustre:fsname=lustre
  lustre:svname=lustre-OST0000
  lustre:mgsnode=10.1.5.248@tcp
  lustre:sys.timeout=20
  lustre:failover.node=Yi:
[...]

Notice that the empty "failnode" argument wasn't caught by tunefs.lustre. (Instead, convert_hostnames() returned garbage when given an empty string---a problem that needs to be fixed as well.)
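
A guard for the empty-NID case described above can be sketched as follows. This is only an illustration under stated assumptions, not the change made in the eventual patch: build_failnode_cmd is a hypothetical helper name, and the NID in the usage note is an arbitrary example value. The helper prints the command line instead of running it, so a caller can stop before tunefs.lustre ever sees an empty "--failnode=".

```shell
#!/bin/bash
# Hypothetical helper (not part of the Lustre test framework): build the
# tunefs.lustre command line only when the NID is non-empty.
# It prints the command and does not execute it.
build_failnode_cmd() {
	local nid="$1"
	local device="$2"

	if [ -z "$nid" ]; then
		echo "error: empty NID, refusing to build --failnode argument" >&2
		return 1
	fi
	echo "tunefs.lustre --failnode=$nid $device"
}
```

With an example NID such as 192.168.0.10@tcp the helper prints the full command; with an empty first argument it returns a non-zero status and prints nothing on stdout.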

Comment by Li Wei (Inactive) [ 07/Oct/14 ]

As to why "nis" did not exist, could it be that test_57a() has been (incorrectly) depending on prior tests to leave modules loaded? Consider these:

Now the question is if "excepting tests: 32newtarball 56 export 76a 59 64 69" started to appear around the time Oleg suspected.
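
If the module-loading hypothesis above holds, a test could check for the lnet module explicitly instead of relying on earlier tests. The following is a minimal sketch under that assumption: module_listed is a hypothetical helper name, and it only inspects lsmod-style text on stdin (module name in the first column), so it makes no claim about how the test framework actually loads modules.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if the module named in $1 appears
# in the first column of lsmod-style text read from stdin, fail otherwise.
module_listed() {
	awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}
```

A caller might run "lsmod | module_listed lnet" and, if it fails, load the modules before querying "nis".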

Comment by Oleg Drokin [ 07/Oct/14 ]

Andreas: 12196 has not landed on b2_5 yet. I was under the impression that there were no b2_5 failures outside of the select patches that include it, but apparently there are still some failures, so some other recently landed change must be causing this bug.

LiWei, test 54a has been skipped for zfs since Oct 2012, so it cannot be the trigger.
Test 56 is not in the always-exclude list, in b2_5 at least, so I wonder if it is also being excluded manually in the testing system via a TEI ticket?

Comment by Li Wei (Inactive) [ 07/Oct/14 ]

Oleg, I wasn't implying 54a to be the problem, but one has to know which was the last not-skipped test before 57a. As to 56, I've no idea how it got into the exception list too. (We should really change exception lists only with Git commits.)

Comment by Andreas Dilger [ 08/Oct/14 ]

This is causing a large number of review-zfs test failures at this point (20 yesterday), so either the original patch should be reverted, or this test disabled since we are wasting our time. This isn't the only problem with review-zfs, but only about 15% of these test sessions are passing right now.

Comment by Nathaniel Clark [ 08/Oct/14 ]

http://review.whamcloud.com/12236

Comment by Li Wei (Inactive) [ 13/Oct/14 ]

The "56" in the exception list was indeed added by Autotest. Minh helped me remove that exception last Saturday. Now it looks like this failure is gone (the three most recent review-zfs runs):

https://testing.hpdd.intel.com/test_sets/aa8077d8-5224-11e4-a79f-5254006e85c2
https://testing.hpdd.intel.com/test_sets/3d7c7ca8-5211-11e4-a79f-5254006e85c2
https://testing.hpdd.intel.com/test_sets/2a42a2de-51d0-11e4-88de-5254006e85c2

This buys us more time to cook http://review.whamcloud.com/12236.

Comment by Nathaniel Clark [ 27/Oct/14 ]

patch landed to master
