[LU-1728] bogus test_smoke failures Created: 09/Aug/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Isaac Huang (Inactive) Assignee: Isaac Huang (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 10108

 Description   

Recently I've seen some bogus test_smoke failures, e.g.:
https://maloo.whamcloud.com/test_sets/abbe60f8-e087-11e1-a388-52540035b04c
https://maloo.whamcloud.com/test_sets/75af9aba-de64-11e1-8269-52540035b04c

Both were caused by PDSH failing to connect to a test node, e.g.:
pdsh@fat-intel-3vm2: fat-intel-3vm7: connect: Connection refused

The connection failure is most likely due to a previous test taking down a node, because in every case the node PDSH couldn't connect to was an MDS node.

Two potential problems here:

  • It seems that some MDS tests ran before test_smoke. I haven't verified this, but if it's true, the test order must be corrected: test_smoke is the sanity test for LNet, so it should run before any test that uses Lustre networking.
  • The connect failure seems to come from test_smoke() -> lst_prepare() -> lst_cleanup_all() running "do_rpc_nodes $list lst_cleanup", where list=$(comma_list $(nodes_list)). From the test log:
    /usr/sbin/lst add_group c 10.10.4.86@tcp 10.10.4.87@tcp
    /usr/sbin/lst add_group s 10.10.4.93@tcp
    10.10.4.86@tcp are added to session
    10.10.4.87@tcp are added to session
    10.10.4.93@tcp are added to session
    LST itself seemed to run fine, so it looks like lst_cleanup_all() was trying to clean up a node that does NOT participate in the LST test. That cleanup is unnecessary.
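A minimal sketch of the narrower cleanup suggested above, assuming the cleanup list could be built from only the nodes added with "lst add_group" instead of from nodes_list. Here do_rpc_nodes is a stub that prints instead of calling pdsh, and LST_CLIENTS, LST_SERVERS, lst_participants and lst_cleanup_participants are hypothetical names, not part of the Lustre test framework. The host names are placeholders; the real lst commands take NIDs.

```shell
#!/bin/bash
# Hypothetical sketch: clean up only the nodes that join the LST session.

# Stub standing in for the test framework's do_rpc_nodes (node list, then
# command); it only prints what would be run so the sketch runs standalone.
do_rpc_nodes() {
    echo "would run '$2' on: $1"
}

# Hypothetical: the nodes passed to "lst add_group c ..." and "lst add_group s ...".
LST_CLIENTS="client1 client2"
LST_SERVERS="server1"

# Build a de-duplicated, comma-separated list of LST participants only.
lst_participants() {
    printf '%s\n' $LST_CLIENTS $LST_SERVERS | sort -u | paste -sd, -
}

# Run lst_cleanup on the participants rather than on every test node.
lst_cleanup_participants() {
    do_rpc_nodes "$(lst_participants)" lst_cleanup
}

lst_cleanup_participants
```

With this approach, a node that was taken down by an earlier test but is not part of the LST groups is never contacted, so PDSH has no reason to report a connect failure for it.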


 Comments   
Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
