[LU-12049] Multirail - server trying to connect unconfigured nid Created: 06/Mar/19 Updated: 06/Jan/21 Resolved: 06/Jan/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Mahmoud Hanafi | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I had set up 2 servers with multirail (ib0 and ib1) like this:
srv1
10.151.26.196@o2ib (ib0)
10.151.26.195@o2ib (ib1)
srv2
10.151.26.197@o2ib (ib1)
10.151.26.198@o2ib (ib0)
srv1 was rebooted and it came up with 2 interfaces.
AFTER REBOOT CONFIG:
srv1 ~ # lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 10.151.26.196@o2ib
          status: up
          interfaces:
              0: ib0
        - nid: 10.151.26.195@o2ib
          status: up
          interfaces:
              0: ib1
---------------------------------
srv2 ~ # lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 10.151.26.197@o2ib
          status: up
          interfaces:
              0: ib1
But srv1 still thinks srv2 should have 2 interfaces.
srv1 # lnetctl peer show
...
    - primary nid: 10.151.26.197@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 10.151.26.197@o2ib
          state: NA
        - nid: 10.151.26.198@o2ib
          state: NA
....
srv1 ~ # lnetctl discover 10.151.26.197@o2ib
manage:
    - discover:
          errno: -1
          descr: failed to discover 10.151.26.197@o2ib: Connection timed out
[ 2623.243967] LNet: 270:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.151.26.198@o2ib - queue depth reduced from 63 to 42 to allow for qp creation
[ 2623.283462] LNet: 270:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 1813 previous similar messages
[ 2741.589327] Lustre: 17563:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1551901661/real 1551901663] req@ffff882ba16f9500 x1627284088955520/t0(0) o13->nbp16-OST000d-osc-MDT0000@10.151.26.197@o2ib:7/4 lens 224/368 e 0 to 1 dl 1551902116 ref 1 fl Rpc:eX/2/ffffffff rc -11/-1
[ 2741.676417] Lustre: 17563:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 114 previous similar messages
[ 2741.706242] Lustre: nbp16-OST000d-osc-MDT0000: Connection to nbp16-OST000d (at 10.151.26.197@o2ib) was lost; in progress operations using this service will wait for recovery to complete
So srv1 keeps trying to connect to the alternate nid on srv2, even though that nid is not even configured. |
| Comments |
| Comment by Peter Jones [ 07/Mar/19 ] |
|
Mahmoud
Could you please clarify which Lustre version you are using?
Amir
Could you please advise?
Thanks
Peter |
| Comment by Mahmoud Hanafi [ 07/Mar/19 ] |
|
This happens once a peer has been discovered as having 2 nids and that peer is restarted with only a single nid. Clients and servers that had discovered it with 2 nids are able to rediscover that it only has one nid now. |
| Comment by Amir Shehata (Inactive) [ 07/Mar/19 ] |
|
Yes, there is a current issue with the way reboots are handled. Discovery uses a sequence number to check whether the information it's getting is out of date. That algorithm, however, doesn't work if the node reboots, changes, and comes back up: the sequence number gets reset, so all updates are deemed out of date. I have a fix for that on the multi-rail branch as part of the MR Routing/UDSP work.
4965bc886f792067046e7c25ec7b3c80888093eb LU-11478 lnet: misleading discovery seqno. |
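The failure mode described in the comment above can be illustrated with a minimal sketch. This is not LNet code: `PeerRecord`, `apply_update`, and the sequence values 5 and 1 are hypothetical names and numbers chosen for illustration, and the comparison rule is an assumption based only on the description that updates with a reset sequence number are deemed out of date.

```python
from dataclasses import dataclass


@dataclass
class PeerRecord:
    """Hypothetical cached view of a remote peer (not an LNet structure)."""
    nids: list
    last_seqno: int = 0


def apply_update(record, seqno, nids):
    """Accept a discovery update only if its seqno is newer than the cached one."""
    if seqno <= record.last_seqno:
        return False  # treated as out of date and dropped
    record.last_seqno = seqno
    record.nids = list(nids)
    return True


# srv2 as first discovered: two NIDs, with an assumed cached seqno of 5.
cached = PeerRecord(
    nids=["10.151.26.197@o2ib", "10.151.26.198@o2ib"],
    last_seqno=5,
)

# srv2 reboots with a single NID and its counter restarts (assumed value 1).
accepted = apply_update(cached, seqno=1, nids=["10.151.26.197@o2ib"])
```

In this sketch the post-reboot update is rejected, so the cached record keeps listing 10.151.26.198@o2ib, which matches the stale `lnetctl peer show` output in the description.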
| Comment by Mahmoud Hanafi [ 06/Jan/21 ] |
|
please close we have picked up |