[LU-12049] Multirail - server trying to connect unconfigured nid Created: 06/Mar/19  Updated: 06/Jan/21  Resolved: 06/Jan/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Mahmoud Hanafi Assignee: Amir Shehata (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File srv1.debug.gz     File srv2.debug.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I had set up 2 servers with multirail (ib0 and ib1) like this:

srv1
10.151.26.196@o2ib (ib0)
10.151.26.195@o2ib (ib1)

srv2
10.151.26.197@o2ib (ib1)
10.151.26.198@o2ib (ib0)

srv1 was rebooted and came up with 2 interfaces.
Then srv2 was rebooted and came up with only 1 interface.

AFTER REBOOT CONFIG:

srv1 ~ # lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 10.151.26.196@o2ib
          status: up
          interfaces:
              0: ib0
        - nid: 10.151.26.195@o2ib
          status: up
          interfaces:
              0: ib1
---------------------------------
srv2 ~ # lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 10.151.26.197@o2ib
          status: up
          interfaces:
              0: ib1

But srv1 still thinks srv2 has 2 interfaces.

srv1 # lnetctl peer show
...
    - primary nid: 10.151.26.197@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 10.151.26.197@o2ib
          state: NA
        - nid: 10.151.26.198@o2ib
          state: NA
....

srv1 ~ # lnetctl discover 10.151.26.197@o2ib
manage:
    - discover:
          errno: -1
          descr: failed to discover 10.151.26.197@o2ib: Connection timed out
[ 2623.243967] LNet: 270:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.151.26.198@o2ib - queue depth reduced from 63 to 42  to allow for qp creation
[ 2623.283462] LNet: 270:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 1813 previous similar messages
[ 2741.589327] Lustre: 17563:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1551901661/real 1551901663]  req@ffff882ba16f9500 x1627284088955520/t0(0) o13->nbp16-OST000d-osc-MDT0000@10.151.26.197@o2ib:7/4 lens 224/368 e 0 to 1 dl 1551902116 ref 1 fl Rpc:eX/2/ffffffff rc -11/-1
[ 2741.676417] Lustre: 17563:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 114 previous similar messages
[ 2741.706242] Lustre: nbp16-OST000d-osc-MDT0000: Connection to nbp16-OST000d (at 10.151.26.197@o2ib) was lost; in progress operations using this service will wait for recovery to complete

So srv1 keeps trying to connect to the alternate NID on srv2, even though that NID is not configured.
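
The mismatch can be made explicit by comparing the NIDs srv1 has cached for srv2 (from `lnetctl peer show` on srv1) with the NIDs srv2 actually has configured (from `lnetctl net show` on srv2). The following is only an illustrative sketch with the values copied by hand from the output above; the variable names are hypothetical and nothing here queries LNet.

```python
# NIDs srv1 has cached for srv2, copied from `lnetctl peer show` on srv1.
cached_peer_nids = {"10.151.26.197@o2ib", "10.151.26.198@o2ib"}

# NIDs srv2 has configured after its reboot, copied from `lnetctl net show` on srv2.
configured_nids = {"10.151.26.197@o2ib"}

# Anything cached on srv1 but not configured on srv2 is a stale peer NID.
stale_nids = sorted(cached_peer_nids - configured_nids)
print(stale_nids)
```

The only stale entry is the one srv1 keeps trying to reach in the kiblnd_create_conn() messages.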



 Comments   
Comment by Peter Jones [ 07/Mar/19 ]

Mahmoud

Could you please clarify which Lustre version you are using?

Amir

Could you please advise?

Thanks

Peter

Comment by Mahmoud Hanafi [ 07/Mar/19 ]

This happens once a peer has been discovered as having 2 NIDs and that peer is then restarted with only a single NID. Clients and servers that had discovered it with 2 NIDs are unable to rediscover that it now has only one NID.

Comment by Amir Shehata (Inactive) [ 07/Mar/19 ]

Yes, there is currently an issue with the way reboots are handled. Discovery uses a sequence number to check whether the information it is getting is out of date. That algorithm, however, doesn't work if the node reboots and comes back up with a changed configuration: the sequence number gets reset, so all subsequent updates are deemed out of date. I have a fix for that on the multi-rail branch as part of the MR Routing/UDSP work.

  
4965bc886f792067046e7c25ec7b3c80888093eb LU-11478 lnet: misleading discovery seqno.
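
The failure mode described in this comment can be sketched with a toy model. This is not LNet's code: `PeerEntry`, `apply_update`, the starting value of the counter, and the sequence numbers 5 and 1 are all assumptions made for illustration. It only shows why a "newer sequence number wins" check rejects valid updates after the sender's counter restarts.

```python
from dataclasses import dataclass, field


@dataclass
class PeerEntry:
    """Cached view of a remote peer (hypothetical structure, not LNet's)."""
    nids: set = field(default_factory=set)
    last_seqno: int = 0


def apply_update(entry, seqno, nids):
    """Accept an update only if its sequence number is newer than the cached one."""
    if seqno <= entry.last_seqno:
        return False  # treated as out of date; cached NIDs stay unchanged
    entry.last_seqno = seqno
    entry.nids = set(nids)
    return True


# srv1's cache of srv2 before the reboot: two NIDs, learned at seqno 5 (illustrative value).
cached = PeerEntry()
accepted_before = apply_update(
    cached, 5, ["10.151.26.197@o2ib", "10.151.26.198@o2ib"]
)

# srv2 reboots with one interface and its counter restarts (assumed here to restart at 1).
accepted_after = apply_update(cached, 1, ["10.151.26.197@o2ib"])

print(accepted_before, accepted_after)
print(sorted(cached.nids))
```

In this model the post-reboot update is dropped, so the cache keeps 10.151.26.198@o2ib, which matches the symptom reported in the description.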
Comment by Mahmoud Hanafi [ 06/Jan/21 ]

Please close; we have picked up LU-11478.
