Details
Type: Bug
Resolution: Duplicate
Priority: Minor
Description
I had set up 2 servers with Multi-Rail (ib0 and ib1) like this:
srv1: 10.151.26.196@o2ib (ib0), 10.151.26.195@o2ib (ib1)
srv2: 10.151.26.197@o2ib (ib1), 10.151.26.198@o2ib (ib0)
srv1 was rebooted and came up with both interfaces.
Then srv2 was rebooted and came up with only 1 interface (ib1).
AFTER REBOOT CONFIG:
srv1 ~ # lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 10.151.26.196@o2ib
          status: up
          interfaces:
              0: ib0
        - nid: 10.151.26.195@o2ib
          status: up
          interfaces:
              0: ib1
---------------------------------
srv2 ~ # lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 10.151.26.197@o2ib
          status: up
          interfaces:
              0: ib1
But srv1 still thinks srv2 has 2 interfaces:
srv1 # lnetctl peer show
...
    - primary nid: 10.151.26.197@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 10.151.26.197@o2ib
          state: NA
        - nid: 10.151.26.198@o2ib
          state: NA
....
srv1 ~ # lnetctl discover 10.151.26.197@o2ib
manage:
    - discover:
          errno: -1
          descr: failed to discover 10.151.26.197@o2ib: Connection timed out
[ 2623.243967] LNet: 270:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.151.26.198@o2ib - queue depth reduced from 63 to 42 to allow for qp creation
[ 2623.283462] LNet: 270:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 1813 previous similar messages
[ 2741.589327] Lustre: 17563:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1551901661/real 1551901663] req@ffff882ba16f9500 x1627284088955520/t0(0) o13->nbp16-OST000d-osc-MDT0000@10.151.26.197@o2ib:7/4 lens 224/368 e 0 to 1 dl 1551902116 ref 1 fl Rpc:eX/2/ffffffff rc -11/-1
[ 2741.676417] Lustre: 17563:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 114 previous similar messages
[ 2741.706242] Lustre: nbp16-OST000d-osc-MDT0000: Connection to nbp16-OST000d (at 10.151.26.197@o2ib) was lost; in progress operations using this service will wait for recovery to complete
So srv1 keeps trying to connect to the alternate NID on srv2 (10.151.26.198@o2ib), even though that NID is not even configured on srv2 anymore.
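
To make the mismatch explicit, here is a minimal sketch (not part of Lustre or lnetctl) that compares the peer NIDs srv1 has recorded for srv2 against the NIDs srv2 actually has configured. The function names `extract_nids` and `stale_peer_nids` are hypothetical, and the script assumes the two command outputs have been captured as plain text in the form shown above.

```python
import re

# Matches "- nid: <NID>" or "nid: <NID>" lines, but not "primary nid: <NID>".
NID_LINE = re.compile(r"^\s*-?\s*nid:\s*(\S+)\s*$", re.MULTILINE)


def extract_nids(lnetctl_output):
    """Return the set of NIDs listed on 'nid:' lines of lnetctl output."""
    return set(NID_LINE.findall(lnetctl_output))


def stale_peer_nids(peer_show_text, remote_net_show_text):
    """NIDs the local node lists for a peer that the peer no longer has."""
    known_to_local = extract_nids(peer_show_text)
    configured_on_remote = extract_nids(remote_net_show_text)
    return sorted(known_to_local - configured_on_remote)


# Text copied from this report: "lnetctl peer show" on srv1 (srv2's entry only)
PEER_SHOW_SRV1 = """\
- primary nid: 10.151.26.197@o2ib
  Multi-Rail: True
  peer ni:
    - nid: 10.151.26.197@o2ib
      state: NA
    - nid: 10.151.26.198@o2ib
      state: NA
"""

# Text copied from this report: "lnetctl net show" on srv2 after its reboot
NET_SHOW_SRV2 = """\
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 10.151.26.197@o2ib
          status: up
          interfaces:
              0: ib1
"""

if __name__ == "__main__":
    for nid in stale_peer_nids(PEER_SHOW_SRV1, NET_SHOW_SRV2):
        print("stale peer NID on srv1:", nid)
```

With the outputs from this report, the only NID reported as stale is 10.151.26.198@o2ib, which is the one srv1 keeps trying to reach.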