|
Recently I've seen some bogus test_smoke failures, e.g.:
https://maloo.whamcloud.com/test_sets/abbe60f8-e087-11e1-a388-52540035b04c
https://maloo.whamcloud.com/test_sets/75af9aba-de64-11e1-8269-52540035b04c
Both were caused by PDSH failing to connect to a test node, e.g.:
pdsh@fat-intel-3vm2: fat-intel-3vm7: connect: Connection refused
The connection failure is most likely due to a previous test taking down a node, because every time it was an MDS node that PDSH couldn't connect to.
Two potential problems here:
- It seems that some MDS tests had run before test_smoke. I haven't verified this, but if it's true the test order must be corrected. test_smoke is the sanity test for LNet, so it should run before any tests that could use Lustre networking.
- The connect failure seemed to come from test_smoke() > lst_prepare() > lst_cleanup_all() doing "do_rpc_nodes $list lst_cleanup", where list=$(comma_list $(nodes_list)). From the test log:
/usr/sbin/lst add_group c 10.10.4.86@tcp 10.10.4.87@tcp
/usr/sbin/lst add_group s 10.10.4.93@tcp
10.10.4.86@tcp are added to session
10.10.4.87@tcp are added to session
10.10.4.93@tcp are added to session
LST itself seemed to run fine, so it looks like lst_cleanup_all() was trying to clean up a node that was NOT going to participate in the LST test. This is NOT necessary at all.
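One possible way to restrict the cleanup to the LST participants is sketched below. This is only an illustration, not the actual fix: nids_to_hosts is a hypothetical helper that does not exist in the test framework, and it assumes each NID has the form host@network, as in the log above.

```sh
#!/bin/sh
# Hypothetical helper (not part of the test framework): strip the
# "@network" suffix from each NID and join the hosts with commas.
# The body runs in a subshell so its variables do not leak out.
nids_to_hosts() (
    hosts=""
    for nid in "$@"; do
        # ${nid%@*} drops the shortest trailing "@..." part, e.g. "@tcp"
        hosts="${hosts:+$hosts,}${nid%@*}"
    done
    echo "$hosts"
)

# NIDs taken from the add_group lines in the test log above
lst_nodes=$(nids_to_hosts 10.10.4.86@tcp 10.10.4.87@tcp 10.10.4.93@tcp)
echo "$lst_nodes"

# Assumption: inside lst_cleanup_all() this list, rather than
# $(comma_list $(nodes_list)), would be passed on:
#   do_rpc_nodes "$lst_nodes" lst_cleanup
```

With a list built this way, a node that is down but is not part of the LST session would never be contacted by the cleanup step.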
|