[LU-5076] Test failure on test suite conf-sanity, subtest test_46a test failed to respond and timed out Created: 17/May/14  Updated: 10/Jun/14  Resolved: 10/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Minh Diep
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
  Duplicate
  Related: is related to LU-5064 "sanity-scrub test_13: ls should fail" (Resolved)
Severity: 3
Rank (Obsolete): 14010

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/0638f47c-dd56-11e3-8e9b-52540035b04c.

The sub-test test_46a failed with the following error:

Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000240000400-0x0000000280000400):1:ost
Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0006_UUID seen on new nid 10.10.4.199@tcp when existing nid 10.10.4.203@tcp is already connected
Lustre: Skipped 3 previous similar messages
Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000300000400-0x0000000340000400):4:ost
Lustre: Skipped 2 previous similar messages
Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0006_UUID seen on new nid 10.10.4.199@tcp when existing nid 10.10.4.203@tcp is already connected
Lustre: Skipped 6 previous similar messages
Lustre: lustre-MDT0000: already connected client lustre-MDT0000-lwp-OST0000_UUID (at 10.10.4.199@tcp) with handle 0x2fb538c53b7cc26b. Rejecting client with the same UUID trying to reconnect with handle 0x4f578b0725086d9c
Lustre: Skipped 62 previous similar messages
Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0006_UUID seen on new nid 10.10.4.199@tcp when existing nid 10.10.4.203@tcp is already connected
Lustre: Skipped 12 previous similar messages
Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0006_UUID seen on new nid 10.10.4.199@tcp when existing nid 10.10.4.203@tcp is already connected
Lustre: Skipped 24 previous similar messages
LustreError: 11-0: lustre-OST0006-osc-MDT0000: Communicating with 10.10.4.199@tcp, operation ost_connect failed with -11.
LustreError: Skipped 94 previous similar messages
Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0006_UUID seen on new nid 10.10.4.199@tcp when existing nid 10.10.4.203@tcp is already connected
Lustre: Skipped 50 previous similar messages
Lustre: lustre-MDT0000: already connected client lustre-MDT0000-lwp-OST0001_UUID (at 10.10.4.199@tcp) with handle 0x2fb538c53b7cc33d. Rejecting client with the same UUID trying to reconnect with handle 0x4f578b0725086f01
Lustre: Skipped 306 previous similar messages
Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0006_UUID seen on new nid 10.10.4.199@tcp when existing nid 10.10.4.203@tcp is already connected
Lustre: Skipped 102 previous similar messages
LustreError: 11-0: lustre-OST0006-osc-MDT0000: Communicating with 10.10.4.199@tcp, operation ost_connect failed with -11.
LustreError: Skipped 120 previous similar messages
Lustre: lustre-MDT0000: already connected client lustre-MDT0000-lwp-OST0000_UUID (at 10.10.4.199@tcp) with handle 0x2fb538c53b7cc26b. Rejecting client with the same UUID trying to reconnect with handle 0x4f578b0725086d9c
Lustre: Skipped 364 previous similar messages
Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0006_UUID seen on new nid 10.10.4.199@tcp when existing nid 10.10.4.203@tcp is already connected
Lustre: Skipped 120 previous similar messages
LustreError: 11-0: lustre-OST0006-osc-MDT0000: Communicating with 10.10.4.199@tcp, operation ost_connect failed with -11.
test failed to respond and timed out

This failure is a bit strange. According to the syslog on MDS0:

Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0006_UUID seen on new nid 10.10.4.199@tcp when existing nid 10.10.4.203@tcp is already connected

But the IP of the OSS should be 10.10.4.199; I do not know where this 10.10.4.203 comes from, so I am not sure whether this is a TEI ticket. If someone confirms that it is, please close this one. Thanks.
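One way to start narrowing this down (a hedged sketch; it assumes shell access to the MDS and that the lab DNS has reverse entries for the test nodes, with the MDT device name taken from the log above):

    # Reverse-resolve the unexpected address to see which node or
    # cluster it belongs to (assumes the lab DNS knows the test nodes):
    host 10.10.4.203

    # On the MDS, list the UUID recorded for each connected export NID;
    # the lwp-OST UUIDs from the log should appear under the NID that
    # actually owns them:
    lctl get_param mdt.lustre-MDT0000.exports.*.uuid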

Info required for matching: conf-sanity 46a



 Comments   
Comment by Andreas Dilger [ 20/May/14 ]

It would be worthwhile to track down which VM cluster this other IP address belongs to, and why it thinks it should be connecting to this MDS.

Separately, one option to avoid such problems is to use a more unique $NAME variable for each test cluster (e.g. the hostname of the master test node instead of always "lustre"), so that clients and servers cannot connect to the wrong system under test.
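A minimal sketch of that idea, assuming the usual lustre/tests/cfg/$NAME.sh convention where FSNAME defaults to "lustre"; the hashing below is illustrative, not an existing test-framework option, since Lustre filesystem names are limited to 8 characters and a raw hostname would usually be too long:

    # Derive a short per-cluster fsname from the master test node's
    # hostname instead of the shared default "lustre", so stale nodes
    # from an old cluster config cannot connect to a new run.
    # cksum keeps the suffix numeric and within the 8-character limit.
    suffix=$(hostname -s | cksum | awk '{print $1 % 100000}')
    FSNAME=${FSNAME:-"t${suffix}"}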

Comment by Andreas Dilger [ 10/Jun/14 ]

Closing this as a duplicate of TEI-1993. There are two possible fixes in the test infrastructure:

  • fix the test system not to leave test nodes running after changing the cluster config
  • fix the test system to assign more unique filesystem names for tests, so that old servers do not think they should be connecting to new servers