Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12053

intermittent ping failures with MR/DD

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.13.0
    • Lustre 2.11.0
    • None
    • 3
    • 9223372036854775807

    Description

      I have three nodes - client, server, router. The LNet configuration approximates what we have on a typical Cray XC.

      client:

      sles15c01:/tmp # lctl list_nids
      192.168.2.22@tcp99
      192.168.2.22@tcp1
      sles15c01:/tmp # lctl show_route
      net                tcp hops 4294967295 gw                192.168.2.20@tcp1 up pri 0
      sles15c01:/tmp #
      

      Server:

      sles15s01:/tmp # lctl list_nids
      192.168.2.21@tcp
      sles15s01:/tmp # lctl show_route
      net               tcp1 hops 4294967295 gw                 192.168.2.20@tcp up pri 0
      net              tcp99 hops 4294967295 gw                 192.168.2.20@tcp down pri 0
      sles15s01:/tmp #
      

      Router:

      sles15build01:/tmp # lctl list_nids
      192.168.2.20@tcp99
      192.168.2.20@tcp1
      192.168.2.20@tcp
      sles15build01:/tmp #
      

      All nodes are Multi-Rail aware:

      sles15build01:/tmp # pdsh -w sles15build01,sles15s01,sles15c01 "lnetctl export | grep Multi"
      sles15c01:       Multi-Rail: True
      sles15c01:       Multi-Rail: True
      sles15c01:       Multi-Rail: True
      sles15s01:       Multi-Rail: True
      sles15s01:       Multi-Rail: True
      sles15build01:       Multi-Rail: True
      sles15build01:       Multi-Rail: True
      sles15build01:/tmp #
      

      Ping issued from server to client:

      sles15s01:/tmp # lctl ping 192.168.2.22@tcp99
      failed to ping 192.168.2.22@tcp99: Input/output error
      

      Server:

      00000400:00000200:0.0:1531255380.403698:0:29617:0:(lib-move.c:3251:LNetGet()) LNetGet -> 12345-192.168.2.22@tcp99
      00000400:00000001:0.0:1531255380.403703:0:29617:0:(lib-move.c:2187:lnet_send()) Process entered
      00000400:00000001:0.0:1531255380.403705:0:29617:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
      ...
      00000400:00000200:0.0:1531255380.403760:0:29617:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp99 <?>
      00000400:00000200:0.0:1531255380.403808:0:29617:0:(lib-move.c:1412:lnet_find_route_locked()) ffff89c1c69f0200 Route not alive
      ...
      00000400:00000200:0.0:1531255380.403817:0:29617:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp1 <?>
      ...
      00000400:00000200:0.0:1531255380.403867:0:29617:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.21@tcp:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp99:192.168.2.20@tcp) : GET
      

      What's happening here is:
      1. Destination is 192.168.2.22@tcp99 (See the LNetGet())
      2. Attempt to find route to @tcp99, but the route is "down" (see output of lctl show_route above)
      3. Because of MR/DD we know that the destination peer also has the nid 192.168.2.22@tcp1
      4. Attempt to find route to @tcp1, this succeeds.
      5. Send message to 192.168.2.22@tcp1 via gateway 192.168.2.20@tcp

      Router:

      00000400:00000200:18.0:1531255419.276466:0:13162:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.20@tcp) <- 192.168.2.21@tcp : GET - routed
      00000400:00000001:18.0:1531255419.276510:0:13162:0:(lib-move.c:2187:lnet_send()) Process entered
      00000400:00000001:18.0:1531255419.276512:0:13162:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
      00000400:00000200:18.0:1531255419.276521:0:13162:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp1)
      00000400:00000001:18.0:1531255419.276559:0:13162:0:(api-ni.c:1024:lnet_get_net_locked()) Process leaving (rc=18446638301376926080 : -105772332625536 : ffff9fccf5afd180)
      00000400:00000200:18.0:1531255419.276592:0:13162:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9fd2ec4dee00 (192.168.2.22@tcp99:tcp99:192.168.2.22@tcp99)
      00000400:00000200:18.0:1531255419.276598:0:13162:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9fd2ec4dee00
      00000400:00000200:18.0:1531255419.276606:0:13162:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.20@tcp99:<?>) -> 192.168.2.22@tcp99(192.168.2.22@tcp1:192.168.2.22@tcp99) : GET
      

      What's happening here is:
      1. Receive message on 192.168.2.20@tcp from 192.168.2.21@tcp that is destined for 192.168.2.22@tcp1
      2. Because of MR/DD we know destination peer also has the nid 192.168.2.22@tcp99
      3. Send message to 192.168.2.22@tcp99 over the local 192.168.2.20@tcp99 interface

      Client:

      00000400:00000200:1.0:1531255341.131514:0:26294:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp99(192.168.2.22@tcp99) <- 192.168.2.21@tcp : GET - for me
      00000400:00000001:1.0:1531255341.131544:0:26294:0:(lib-move.c:2187:lnet_send()) Process entered
      00000400:00000001:1.0:1531255341.131545:0:26294:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
      00000400:00000200:1.0:1531255341.131550:0:26294:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp)
      00000400:00000200:1.0:1531255341.131572:0:26294:0:(lib-move.c:1673:lnet_select_pathway()) Got best_ni ffff9ab328829a00 (192.168.2.22@tcp99) from explicit src_nid 192.168.2.22@tcp99
      00000400:00000001:1.0:1531255341.131575:0:26294:0:(peer.c:647:lnet_find_peer_ni_locked()) Process leaving (rc=18446632693002343936 : -111380707207680 : ffff9ab328831600)
      00000400:00000200:1.0:1531255341.131579:0:26294:0:(lib-move.c:1706:lnet_select_pathway()) best_lpni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp) is not local. Finding gw
      00000400:00000200:1.0:1531255341.131582:0:26294:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.21@tcp <?>
      00000400:00000200:1.0:1531255341.131592:0:26294:0:(lib-move.c:1719:lnet_select_pathway()) Found best_gw ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:<?>)
      00000400:00000200:1.0:1531255341.131600:0:26294:0:(lib-move.c:1740:lnet_select_pathway()) @@@ peer for best_gw ffff9ab329c81000  peer 192.168.2.20@tcp1 state 0x89 health adhy
      00000400:00000200:1.0:1531255341.131607:0:26294:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to tcp99 192.168.2.21@tcp <?>
      00000400:02000400:1.0:1531255341.131615:0:26294:0:(lib-move.c:1953:lnet_select_pathway()) No route to 192.168.2.21@tcp from 192.168.2.22@tcp99
      00000400:00020000:1.0:1531255341.131623:0:26294:0:(lib-move.c:2370:lnet_parse_get()) 192.168.2.22@tcp99: Unable to send REPLY for GET from 12345-192.168.2.21@tcp: -113
      

      What's happening here is:
      1. Receive message on 192.168.2.22@tcp99 from 192.168.2.21@tcp
      2. Lookup peer_ni based on based on the reply destination
      3. Determine the best local interface to send the reply is the one that we received the message on (192.168.2.22@tcp99)
      4. Try to find route to tcp0
      5. ???
      6. Try to find route to tcp99
      7. No route to tcp99 so fail the send.

      I need to figure out what is happening at step 5 there.

      Here's the successful case.

      lctl ping issued from server to client:

      sles15s01:/tmp # lctl ping 192.168.2.22@tcp99
      12345-0@lo
      12345-192.168.2.22@tcp99
      12345-192.168.2.22@tcp1
      sles15s01:/tmp #
      

      Server:

      00000400:00000200:1.0:1531255388.760890:0:29655:0:(lib-move.c:3251:LNetGet()) LNetGet -> 12345-192.168.2.22@tcp99
      00000400:00000001:1.0:1531255388.760895:0:29655:0:(lib-move.c:2187:lnet_send()) Process entered
      00000400:00000001:1.0:1531255388.760897:0:29655:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
      00000400:00000200:1.0:1531255388.760958:0:29655:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp99 <?>
      00000400:00000200:1.0:1531255388.760974:0:29655:0:(lib-move.c:1410:lnet_find_route_locked()) Considering lp ffff89c1c69f0200 (192.168.2.20@tcp)
      00000400:00000200:1.0:1531255388.760977:0:29655:0:(lib-move.c:1412:lnet_find_route_locked()) ffff89c1c69f0200 Route not alive
      00000400:00000200:1.0:1531255388.760985:0:29655:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp1 <?>
      00000400:00000200:1.0:1531255388.761029:0:29655:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.21@tcp:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp99:192.168.2.20@tcp) : GET
      

      Router:

      00000400:00000200:18.0:1531255427.633633:0:13162:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.20@tcp) <- 192.168.2.21@tcp : GET - routed
      00000400:00000001:18.0:1531255427.633677:0:13162:0:(lib-move.c:2187:lnet_send()) Process entered
      00000400:00000001:18.0:1531255427.633678:0:13162:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
      00000400:00000200:18.0:1531255427.633688:0:13162:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp1)
      00000400:00000200:18.0:1531255427.633764:0:13162:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp99)
      00000400:00000200:18.0:1531255427.633770:0:13162:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9fcd86856000
      00000400:00000200:18.0:1531255427.633778:0:13162:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.20@tcp1:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp1:192.168.2.22@tcp1) : GET
      

      Client:

      00000400:00000200:4.0:1531255349.488613:0:26298:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.22@tcp1) <- 192.168.2.21@tcp : GET - for me
      00000400:00000001:4.0:1531255349.488647:0:26298:0:(lib-move.c:2187:lnet_send()) Process entered
      00000400:00000001:4.0:1531255349.488648:0:26298:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
      00000400:00000200:4.0:1531255349.488655:0:26298:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp)
      00000400:00000200:4.0:1531255349.488683:0:26298:0:(lib-move.c:1673:lnet_select_pathway()) Got best_ni ffff9ab328829200 (192.168.2.22@tcp1) from explicit src_nid 192.168.2.22@tcp1
      00000400:00000001:4.0:1531255349.488686:0:26298:0:(peer.c:647:lnet_find_peer_ni_locked()) Process leaving (rc=18446632693002343936 : -111380707207680 : ffff9ab328831600)
      00000400:00000200:4.0:1531255349.488692:0:26298:0:(lib-move.c:1706:lnet_select_pathway()) best_lpni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp) is not local. Finding gw
      00000400:00000200:4.0:1531255349.488695:0:26298:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.21@tcp <?>
      00000400:00000200:4.0:1531255349.488706:0:26298:0:(lib-move.c:1719:lnet_select_pathway()) Found best_gw ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:<?>)
      00000400:00000200:4.0:1531255349.488716:0:26298:0:(lib-move.c:1740:lnet_select_pathway()) @@@ peer for best_gw ffff9ab329c81000  peer 192.168.2.20@tcp1 state 0x89 health adhy
      00000400:00000200:4.0:1531255349.488731:0:26298:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:192.168.2.20@tcp1)
      00000400:00000200:4.0:1531255349.488736:0:26298:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9ab329c81000
      00000400:00000200:4.0:1531255349.488744:0:26298:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.22@tcp1(192.168.2.22@tcp1:192.168.2.22@tcp1) -> 192.168.2.21@tcp(192.168.2.21@tcp:192.168.2.20@tcp1) : REPLY
      

      So the key difference in the successful case is that when forwarding the GET to the client the router sends the message over its @tcp1 interface rather than its @tcp99 interface. Since the client wants to send the REPLY over the same interface it received the message on it is able to to do successfully because it has a route defined for that interface.

      Attachments

        Activity

          [LU-12053] intermittent ping failures with MR/DD
          spitzcor Cory Spitz added a comment -

          pjones, this ticket can be marked resolved for Lustre 2.13.0.

          spitzcor Cory Spitz added a comment - pjones , this ticket can be marked resolved for Lustre 2.13.0.
          hornc Chris Horn added a comment -

          ashehata pjones I think this can be resolved with the merge of MR routing

          hornc Chris Horn added a comment - ashehata pjones I think this can be resolved with the merge of MR routing

          Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/34625/
          Subject: LU-12053 lnet: look up MR peers routes
          Project: fs/lustre-release
          Branch: multi-rail
          Current Patch Set:
          Commit: 52eef817974304d0107caaec023a2ed341c164e0

          gerrit Gerrit Updater added a comment - Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/34625/ Subject: LU-12053 lnet: look up MR peers routes Project: fs/lustre-release Branch: multi-rail Current Patch Set: Commit: 52eef817974304d0107caaec023a2ed341c164e0

          Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34970
          Subject: LU-12053 lnet: look up MR peers routes
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 8b9927b88ead8b25f3651cd90f8b904db1d21314

          gerrit Gerrit Updater added a comment - Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34970 Subject: LU-12053 lnet: look up MR peers routes Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 8b9927b88ead8b25f3651cd90f8b904db1d21314

          Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34625
          Subject: LU-12053 lnet: look up MR peers routes
          Project: fs/lustre-release
          Branch: multi-rail
          Current Patch Set: 1
          Commit: e0ee3410e8e7eecdba4f3e137ab32efe826916c2

          gerrit Gerrit Updater added a comment - Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34625 Subject: LU-12053 lnet: look up MR peers routes Project: fs/lustre-release Branch: multi-rail Current Patch Set: 1 Commit: e0ee3410e8e7eecdba4f3e137ab32efe826916c2
          hornc Chris Horn added a comment -

          Attached dk logs from router, client and server that show what may be the same or a similar issue.

          I had the following patches applied on top of 2.12

          05b3fa2bd (origin/task/test-mr-fixes) LU-11664 lnet: push router interface updates
          111818b07 LU-11297 lnet: set gw sensitivity from lnetctl
          a878b193c LU-11297 lnet: handle router health off
          c1b835d57 LU-11641 lnet: handle discovery off
          fb4f70ba5 LU-11470 lnet: drop all rule
          064e16551 LU-11478 lnet: misleading discovery seqno.
          ff3a0e631 LU-11477 lnet: handle health for incoming messages
          162357d38 LU-11475 lnet: allow deleting router primary_nid
          2a8c65bb3 LU-11300 lnet: consider alive_router_check_interval
          68ea55280 LU-11378 lnet: MR aware gateway selection
          7f8e99ac6 LU-11299 lnet: use discovery for routing
          a4224f8d8 LU-11299 lnet: modify lnd notification mechanism
          72aa3920f LU-11299 lnet: Cleanup rcd
          1ae265b66 LU-11300 lnet: simplify lnet_handle_local_failure()
          214a223ff LU-11300 lnet: router aliveness
          0affc079c LU-11300 lnet: peer aliveness
          9d33c4296 LU-11300 lnet: Cache the routing feature
          3e518dc0b LU-11300 lnet: cache ni status
          5b83c8c8e LU-11300 lnet: configure lnet router senstivity
          6cbd670a6 LU-11300 lnet: router sensitivity
          e05da588a LU-11551 lnet: Do not allow deleting of router nis
          e6b298711 LU-11299 lnet: lnet_add/del_route()
          18106c476 LU-11298 lnet: use peer for gateway
          7fda6b7d1 LU-11292 lnet: Discover routers on first use
          bb6cc75fe LU-11515 lnet: monitor thread wakeup algorithm
          9040e2159 LU-10153 lnet: remove route add restriction
          2519eacfc LU-11816 lnet: setup health timeout defaults
          

          Here's the LNet config:
          Client:

          sles15c01:/sys/kernel/debug/lnet # lnetctl net show
          net:
              - net type: lo
                local NI(s):
                  - nid: 0@lo
                    status: up
              - net type: tcp1
                local NI(s):
                  - nid: 192.168.2.22@tcp1
                    status: up
                    interfaces:
                        0: eth0
              - net type: tcp99
                local NI(s):
                  - nid: 192.168.2.22@tcp99
                    status: up
                    interfaces:
                        0: eth0
          sles15c01:/sys/kernel/debug/lnet # lnetctl route show
          route:
              - net: tcp
                gateway: 192.168.2.20@tcp1
          sles15c01:/sys/kernel/debug/lnet # lnetctl peer show
          peer:
              - primary nid: 192.168.2.20@tcp
                Multi-Rail: True
                peer ni:
                  - nid: 192.168.2.20@tcp1
                    state: up
                  - nid: 192.168.2.20@tcp
                    state: up
                  - nid: 192.168.2.26@tcp99
                    state: up
              - primary nid: 192.168.2.21@tcp
                Multi-Rail: True
                peer ni:
                  - nid: 192.168.2.21@tcp
                    state: NA
          sles15c01:/sys/kernel/debug/lnet #
          

          Server:

          sles15s01:/sys/module/lnet/parameters # lnetctl net show
          net:
              - net type: lo
                local NI(s):
                  - nid: 0@lo
                    status: up
              - net type: tcp
                local NI(s):
                  - nid: 192.168.2.21@tcp
                    status: up
                    interfaces:
                        0: eth0
          sles15s01:/sys/module/lnet/parameters # lnetctl route show
          route:
              - net: tcp1
                gateway: 192.168.2.20@tcp
              - net: tcp99
                gateway: 192.168.2.20@tcp
          sles15s01:/sys/module/lnet/parameters # lnetctl peer show
          peer:
              - primary nid: 192.168.2.20@tcp
                Multi-Rail: True
                peer ni:
                  - nid: 192.168.2.20@tcp
                    state: up
                  - nid: 192.168.2.20@tcp1
                    state: up
                  - nid: 192.168.2.26@tcp99
                    state: up
              - primary nid: 192.168.2.22@tcp1
                Multi-Rail: True
                peer ni:
                  - nid: 192.168.2.22@tcp99
                    state: NA
                  - nid: 192.168.2.22@tcp1
                    state: NA
          sles15s01:/sys/module/lnet/parameters #
          

          Router:

          sles15build01:/tmp # lnetctl net show
          net:
              - net type: lo
                local NI(s):
                  - nid: 0@lo
                    status: up
              - net type: tcp
                local NI(s):
                  - nid: 192.168.2.20@tcp
                    status: up
                    interfaces:
                        0: eth0
              - net type: tcp1
                local NI(s):
                  - nid: 192.168.2.20@tcp1
                    status: up
                    interfaces:
                        0: eth0
              - net type: tcp99
                local NI(s):
                  - nid: 192.168.2.26@tcp99
                    status: up
                    interfaces:
                        0: eth1
          sles15build01:/tmp # lnetctl route show
          sles15build01:/tmp # lnetctl peer show
          peer:
              - primary nid: 192.168.2.22@tcp1
                Multi-Rail: True
                peer ni:
                  - nid: 192.168.2.22@tcp1
                    state: up
                  - nid: 192.168.2.22@tcp99
                    state: up
              - primary nid: 192.168.2.21@tcp
                Multi-Rail: True
                peer ni:
                  - nid: 192.168.2.21@tcp
                    state: up
          sles15build01:/tmp #
          
          hornc Chris Horn added a comment - Attached dk logs from router, client and server that show what may be the same or a similar issue. I had the following patches applied on top of 2.12 05b3fa2bd (origin/task/test-mr-fixes) LU-11664 lnet: push router interface updates 111818b07 LU-11297 lnet: set gw sensitivity from lnetctl a878b193c LU-11297 lnet: handle router health off c1b835d57 LU-11641 lnet: handle discovery off fb4f70ba5 LU-11470 lnet: drop all rule 064e16551 LU-11478 lnet: misleading discovery seqno. ff3a0e631 LU-11477 lnet: handle health for incoming messages 162357d38 LU-11475 lnet: allow deleting router primary_nid 2a8c65bb3 LU-11300 lnet: consider alive_router_check_interval 68ea55280 LU-11378 lnet: MR aware gateway selection 7f8e99ac6 LU-11299 lnet: use discovery for routing a4224f8d8 LU-11299 lnet: modify lnd notification mechanism 72aa3920f LU-11299 lnet: Cleanup rcd 1ae265b66 LU-11300 lnet: simplify lnet_handle_local_failure() 214a223ff LU-11300 lnet: router aliveness 0affc079c LU-11300 lnet: peer aliveness 9d33c4296 LU-11300 lnet: Cache the routing feature 3e518dc0b LU-11300 lnet: cache ni status 5b83c8c8e LU-11300 lnet: configure lnet router senstivity 6cbd670a6 LU-11300 lnet: router sensitivity e05da588a LU-11551 lnet: Do not allow deleting of router nis e6b298711 LU-11299 lnet: lnet_add/del_route() 18106c476 LU-11298 lnet: use peer for gateway 7fda6b7d1 LU-11292 lnet: Discover routers on first use bb6cc75fe LU-11515 lnet: monitor thread wakeup algorithm 9040e2159 LU-10153 lnet: remove route add restriction 2519eacfc LU-11816 lnet: setup health timeout defaults Here's the LNet config: Client: sles15c01:/sys/kernel/debug/lnet # lnetctl net show net: - net type: lo local NI(s): - nid: 0@lo status: up - net type: tcp1 local NI(s): - nid: 192.168.2.22@tcp1 status: up interfaces: 0: eth0 - net type: tcp99 local NI(s): - nid: 192.168.2.22@tcp99 status: up interfaces: 0: eth0 sles15c01:/sys/kernel/debug/lnet # lnetctl route show route: - net: tcp gateway: 192.168.2.20@tcp1 sles15c01:/sys/kernel/debug/lnet # lnetctl peer show peer: - primary nid: 192.168.2.20@tcp Multi-Rail: True peer ni: - nid: 192.168.2.20@tcp1 state: up - nid: 192.168.2.20@tcp state: up - nid: 192.168.2.26@tcp99 state: up - primary nid: 192.168.2.21@tcp Multi-Rail: True peer ni: - nid: 192.168.2.21@tcp state: NA sles15c01:/sys/kernel/debug/lnet # Server: sles15s01:/sys/module/lnet/parameters # lnetctl net show net: - net type: lo local NI(s): - nid: 0@lo status: up - net type: tcp local NI(s): - nid: 192.168.2.21@tcp status: up interfaces: 0: eth0 sles15s01:/sys/module/lnet/parameters # lnetctl route show route: - net: tcp1 gateway: 192.168.2.20@tcp - net: tcp99 gateway: 192.168.2.20@tcp sles15s01:/sys/module/lnet/parameters # lnetctl peer show peer: - primary nid: 192.168.2.20@tcp Multi-Rail: True peer ni: - nid: 192.168.2.20@tcp state: up - nid: 192.168.2.20@tcp1 state: up - nid: 192.168.2.26@tcp99 state: up - primary nid: 192.168.2.22@tcp1 Multi-Rail: True peer ni: - nid: 192.168.2.22@tcp99 state: NA - nid: 192.168.2.22@tcp1 state: NA sles15s01:/sys/module/lnet/parameters # Router: sles15build01:/tmp # lnetctl net show net: - net type: lo local NI(s): - nid: 0@lo status: up - net type: tcp local NI(s): - nid: 192.168.2.20@tcp status: up interfaces: 0: eth0 - net type: tcp1 local NI(s): - nid: 192.168.2.20@tcp1 status: up interfaces: 0: eth0 - net type: tcp99 local NI(s): - nid: 192.168.2.26@tcp99 status: up interfaces: 0: eth1 sles15build01:/tmp # lnetctl route show sles15build01:/tmp # lnetctl peer show peer: - primary nid: 192.168.2.22@tcp1 Multi-Rail: True peer ni: - nid: 192.168.2.22@tcp1 state: up - nid: 192.168.2.22@tcp99 state: up - primary nid: 192.168.2.21@tcp Multi-Rail: True peer ni: - nid: 192.168.2.21@tcp state: up sles15build01:/tmp #

          People

            ashehata Amir Shehata (Inactive)
            hornc Chris Horn
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: