Lustre / LU-12049

Multirail - server trying to connect unconfigured nid


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor

    Description

      I set up 2 servers with Multi-Rail (ib0 and ib1) like this:

      srv1
      10.151.26.196@o2ib (ib0)
      10.151.26.195@o2ib (ib1)
      
      srv2
      10.151.26.197@o2ib (ib1)
      10.151.26.198@o2ib (ib0)
      

      srv1 was rebooted and came up with both interfaces.
      Then srv2 was rebooted and came up with only 1 interface.

      AFTER REBOOT CONFIG:

      srv1 ~ # lnetctl net show
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
          - net type: o2ib
            local NI(s):
              - nid: 10.151.26.196@o2ib
                status: up
                interfaces:
                    0: ib0
              - nid: 10.151.26.195@o2ib
                status: up
                interfaces:
                    0: ib1
      ---------------------------------
      srv2 ~ # lnetctl net show
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
          - net type: o2ib
            local NI(s):
              - nid: 10.151.26.197@o2ib
                status: up
                interfaces:
                    0: ib1
      

      But srv1 still thinks srv2 has 2 interfaces.

      srv1 # lnetctl peer show
      ...
          - primary nid: 10.151.26.197@o2ib
            Multi-Rail: True
            peer ni:
              - nid: 10.151.26.197@o2ib
                state: NA
              - nid: 10.151.26.198@o2ib
                state: NA
      ....
      
      srv1 ~ # lnetctl discover 10.151.26.197@o2ib
      manage:
          - discover:
                errno: -1
                descr: failed to discover 10.151.26.197@o2ib: Connection timed out
      
      [ 2623.243967] LNet: 270:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.151.26.198@o2ib - queue depth reduced from 63 to 42  to allow for qp creation
      [ 2623.283462] LNet: 270:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 1813 previous similar messages
      [ 2741.589327] Lustre: 17563:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1551901661/real 1551901663]  req@ffff882ba16f9500 x1627284088955520/t0(0) o13->nbp16-OST000d-osc-MDT0000@10.151.26.197@o2ib:7/4 lens 224/368 e 0 to 1 dl 1551902116 ref 1 fl Rpc:eX/2/ffffffff rc -11/-1
      [ 2741.676417] Lustre: 17563:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 114 previous similar messages
      [ 2741.706242] Lustre: nbp16-OST000d-osc-MDT0000: Connection to nbp16-OST000d (at 10.151.26.197@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      

      So srv1 keeps trying to connect to the alternate NID on srv2 (10.151.26.198@o2ib), even though that NID is not configured on srv2.
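
      To show the mismatch in isolation, here is a minimal sketch that compares the peer NIDs srv1 lists for srv2 against the NIDs srv2 reports locally. It only parses text copied from the outputs quoted in this report. The helper names (nids, stale_peer_nids) are hypothetical and are not part of lnetctl or Lustre.

```python
import re

# Abridged copies of the command output quoted in this report.
SRV2_NET_SHOW = """\
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 10.151.26.197@o2ib
          status: up
"""

SRV1_PEER_SHOW = """\
    - primary nid: 10.151.26.197@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 10.151.26.197@o2ib
          state: NA
        - nid: 10.151.26.198@o2ib
          state: NA
"""

# Matches list entries of the form "- nid: <NID>"; "- primary nid:" lines
# do not match because "primary" sits between the dash and "nid:".
NID_LINE = re.compile(r"^\s*-\s*nid:\s*(\S+)\s*$", re.MULTILINE)


def nids(text):
    """Return the set of NIDs found on '- nid:' lines (hypothetical helper)."""
    return set(NID_LINE.findall(text))


def stale_peer_nids(peer_show, net_show):
    """NIDs listed for the peer that the peer itself does not report."""
    return sorted(nids(peer_show) - nids(net_show))


if __name__ == "__main__":
    print(stale_peer_nids(SRV1_PEER_SHOW, SRV2_NET_SHOW))
```

      For the outputs above, the only NID in srv1's peer entry that srv2 does not report is 10.151.26.198@o2ib, the same NID that appears in the kiblnd_create_conn() messages.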

      Attachments

        1. srv1.debug.gz
          5.93 MB
        2. srv2.debug.gz
          366 kB

          People

            ashehata Amir Shehata (Inactive)
            mhanafi Mahmoud Hanafi
            Votes: 0
            Watchers: 5
