LU-12053

intermittent ping failures with MR/DD

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.13.0
    • Affects Version/s: Lustre 2.11.0
    • Labels: None
    • Severity: 3

    Description

      I have three nodes: a client, a server, and a router. The LNet configuration approximates what we have on a typical Cray XC.

      Client:

      sles15c01:/tmp # lctl list_nids
      192.168.2.22@tcp99
      192.168.2.22@tcp1
      sles15c01:/tmp # lctl show_route
      net                tcp hops 4294967295 gw                192.168.2.20@tcp1 up pri 0
      sles15c01:/tmp #
      

      Server:

      sles15s01:/tmp # lctl list_nids
      192.168.2.21@tcp
      sles15s01:/tmp # lctl show_route
      net               tcp1 hops 4294967295 gw                 192.168.2.20@tcp up pri 0
      net              tcp99 hops 4294967295 gw                 192.168.2.20@tcp down pri 0
      sles15s01:/tmp #
      

      Router:

      sles15build01:/tmp # lctl list_nids
      192.168.2.20@tcp99
      192.168.2.20@tcp1
      192.168.2.20@tcp
      sles15build01:/tmp #
      

      All nodes are Multi-Rail aware:

      sles15build01:/tmp # pdsh -w sles15build01,sles15s01,sles15c01 "lnetctl export | grep Multi"
      sles15c01:       Multi-Rail: True
      sles15c01:       Multi-Rail: True
      sles15c01:       Multi-Rail: True
      sles15s01:       Multi-Rail: True
      sles15s01:       Multi-Rail: True
      sles15build01:       Multi-Rail: True
      sles15build01:       Multi-Rail: True
      sles15build01:/tmp #
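
The configuration above can be restated as a small data model, which makes the asymmetry in the client's routes easier to see. This is only an illustrative sketch: the dictionaries and the helpers `net_of` and `usable_local_nets` are hypothetical names introduced here, not LNet code, and the model assumes that a route is usable only from the local network its gateway NID sits on.

```python
# Hypothetical model of the three-node setup described above; not LNet code.
NIDS = {
    "client": ["192.168.2.22@tcp99", "192.168.2.22@tcp1"],
    "server": ["192.168.2.21@tcp"],
    "router": ["192.168.2.20@tcp99", "192.168.2.20@tcp1", "192.168.2.20@tcp"],
}

# (remote net, gateway NID, state) as printed by `lctl show_route` above.
ROUTES = {
    "client": [("tcp", "192.168.2.20@tcp1", "up")],
    "server": [("tcp1", "192.168.2.20@tcp", "up"),
               ("tcp99", "192.168.2.20@tcp", "down")],
    "router": [],
}


def net_of(nid):
    """Return the network part of a NID, e.g. 'tcp99' for '192.168.2.22@tcp99'."""
    return nid.split("@", 1)[1]


def usable_local_nets(node, remote_net):
    """Local nets from which `node` has an 'up' route to `remote_net`."""
    return sorted({net_of(gw) for net, gw, state in ROUTES[node]
                   if net == remote_net and state == "up"})


# Which of the client's NIDs can be used as the source of a message to tcp?
for nid in NIDS["client"]:
    ok = net_of(nid) in usable_local_nets("client", "tcp")
    print(f"client {nid}: route to tcp {'available' if ok else 'missing'}")
```

In this model the client can reach the server's network only from its @tcp1 interface, which is the condition that matters in the traces that follow.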
      

      Ping issued from server to client:

      sles15s01:/tmp # lctl ping 192.168.2.22@tcp99
      failed to ping 192.168.2.22@tcp99: Input/output error
      

      Server:

      00000400:00000200:0.0:1531255380.403698:0:29617:0:(lib-move.c:3251:LNetGet()) LNetGet -> 12345-192.168.2.22@tcp99
      00000400:00000001:0.0:1531255380.403703:0:29617:0:(lib-move.c:2187:lnet_send()) Process entered
      00000400:00000001:0.0:1531255380.403705:0:29617:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
      ...
      00000400:00000200:0.0:1531255380.403760:0:29617:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp99 <?>
      00000400:00000200:0.0:1531255380.403808:0:29617:0:(lib-move.c:1412:lnet_find_route_locked()) ffff89c1c69f0200 Route not alive
      ...
      00000400:00000200:0.0:1531255380.403817:0:29617:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp1 <?>
      ...
      00000400:00000200:0.0:1531255380.403867:0:29617:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.21@tcp:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp99:192.168.2.20@tcp) : GET
      

      What's happening here is:
      1. The destination is 192.168.2.22@tcp99 (see the LNetGet() line).
      2. Attempt to find a route to @tcp99, but the route is "down" (see the output of lctl show_route above).
      3. Because of MR/DD, we know that the destination peer also has the NID 192.168.2.22@tcp1.
      4. Attempt to find a route to @tcp1; this succeeds.
      5. Send the message to 192.168.2.22@tcp1 via gateway 192.168.2.20@tcp.
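
The server-side steps above can be sketched as a fallback over the peer's known NIDs. This is a simplified model under stated assumptions, not the logic of `lnet_select_pathway()`: the function name `pick_destination` is hypothetical, and the model simply tries the requested NID first and then the peer's other NIDs in order.

```python
def pick_destination(requested_nid, peer_nids, routes):
    """Return (dest_nid, gateway) for the first peer NID with a live route.

    The requested NID is tried first, then the peer's other NIDs.
    Returns None when no candidate has an 'up' route.
    """
    candidates = [requested_nid] + [n for n in peer_nids if n != requested_nid]
    for nid in candidates:
        net = nid.split("@", 1)[1]
        for route_net, gateway, state in routes:
            if route_net == net and state == "up":
                return nid, gateway
    return None


# Routes as reported by `lctl show_route` on the server.
server_routes = [("tcp1", "192.168.2.20@tcp", "up"),
                 ("tcp99", "192.168.2.20@tcp", "down")]
client_nids = ["192.168.2.22@tcp99", "192.168.2.22@tcp1"]

print(pick_destination("192.168.2.22@tcp99", client_nids, server_routes))
```

With the route to tcp99 down, the model falls through to 192.168.2.22@tcp1 via 192.168.2.20@tcp, matching the TRACE line in the server log.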

      Router:

      00000400:00000200:18.0:1531255419.276466:0:13162:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.20@tcp) <- 192.168.2.21@tcp : GET - routed
      00000400:00000001:18.0:1531255419.276510:0:13162:0:(lib-move.c:2187:lnet_send()) Process entered
      00000400:00000001:18.0:1531255419.276512:0:13162:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
      00000400:00000200:18.0:1531255419.276521:0:13162:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp1)
      00000400:00000001:18.0:1531255419.276559:0:13162:0:(api-ni.c:1024:lnet_get_net_locked()) Process leaving (rc=18446638301376926080 : -105772332625536 : ffff9fccf5afd180)
      00000400:00000200:18.0:1531255419.276592:0:13162:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9fd2ec4dee00 (192.168.2.22@tcp99:tcp99:192.168.2.22@tcp99)
      00000400:00000200:18.0:1531255419.276598:0:13162:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9fd2ec4dee00
      00000400:00000200:18.0:1531255419.276606:0:13162:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.20@tcp99:<?>) -> 192.168.2.22@tcp99(192.168.2.22@tcp1:192.168.2.22@tcp99) : GET
      

      What's happening here is:
      1. Receive a message on 192.168.2.20@tcp from 192.168.2.21@tcp that is destined for 192.168.2.22@tcp1.
      2. Because of MR/DD, we know the destination peer also has the NID 192.168.2.22@tcp99.
      3. Send the message to 192.168.2.22@tcp99 over the local 192.168.2.20@tcp99 interface.
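
A rough model of the router behaviour described above: the router treats every NID of the destination peer as a candidate, so the NID named in the message header does not pin the outgoing interface. The class `RouterModel` is hypothetical, and the strict round-robin alternation is an assumption made only to illustrate why the failure is intermittent; the real selection criteria are not modelled here.

```python
from itertools import count


class RouterModel:
    """Toy model of a Multi-Rail router choosing a next hop (not LNet code)."""

    def __init__(self, local_nids):
        # Map each locally attached net to the router's NID on that net.
        self.local = {nid.split("@", 1)[1]: nid for nid in local_nids}
        self._seq = count()

    def forward(self, hdr_dest, peer_nids):
        """Return (local interface, next-hop NID) for a routed message.

        hdr_dest is deliberately ignored: any NID of the peer that sits on
        a locally attached net is considered (assumed round-robin).
        """
        reachable = [n for n in peer_nids if n.split("@", 1)[1] in self.local]
        chosen = reachable[next(self._seq) % len(reachable)]
        return self.local[chosen.split("@", 1)[1]], chosen


router = RouterModel(["192.168.2.20@tcp99", "192.168.2.20@tcp1", "192.168.2.20@tcp"])
client_nids = ["192.168.2.22@tcp99", "192.168.2.22@tcp1"]

# Two messages addressed to the same header destination take different paths.
for _ in range(2):
    print(router.forward("192.168.2.22@tcp1", client_nids))
```

The first message leaves over @tcp99 and the second over @tcp1, even though both were addressed to 192.168.2.22@tcp1.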

      Client:

      00000400:00000200:1.0:1531255341.131514:0:26294:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp99(192.168.2.22@tcp99) <- 192.168.2.21@tcp : GET - for me
      00000400:00000001:1.0:1531255341.131544:0:26294:0:(lib-move.c:2187:lnet_send()) Process entered
      00000400:00000001:1.0:1531255341.131545:0:26294:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
      00000400:00000200:1.0:1531255341.131550:0:26294:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp)
      00000400:00000200:1.0:1531255341.131572:0:26294:0:(lib-move.c:1673:lnet_select_pathway()) Got best_ni ffff9ab328829a00 (192.168.2.22@tcp99) from explicit src_nid 192.168.2.22@tcp99
      00000400:00000001:1.0:1531255341.131575:0:26294:0:(peer.c:647:lnet_find_peer_ni_locked()) Process leaving (rc=18446632693002343936 : -111380707207680 : ffff9ab328831600)
      00000400:00000200:1.0:1531255341.131579:0:26294:0:(lib-move.c:1706:lnet_select_pathway()) best_lpni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp) is not local. Finding gw
      00000400:00000200:1.0:1531255341.131582:0:26294:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.21@tcp <?>
      00000400:00000200:1.0:1531255341.131592:0:26294:0:(lib-move.c:1719:lnet_select_pathway()) Found best_gw ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:<?>)
      00000400:00000200:1.0:1531255341.131600:0:26294:0:(lib-move.c:1740:lnet_select_pathway()) @@@ peer for best_gw ffff9ab329c81000  peer 192.168.2.20@tcp1 state 0x89 health adhy
      00000400:00000200:1.0:1531255341.131607:0:26294:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to tcp99 192.168.2.21@tcp <?>
      00000400:02000400:1.0:1531255341.131615:0:26294:0:(lib-move.c:1953:lnet_select_pathway()) No route to 192.168.2.21@tcp from 192.168.2.22@tcp99
      00000400:00020000:1.0:1531255341.131623:0:26294:0:(lib-move.c:2370:lnet_parse_get()) 192.168.2.22@tcp99: Unable to send REPLY for GET from 12345-192.168.2.21@tcp: -113
      

      What's happening here is:
      1. Receive message on 192.168.2.22@tcp99 from 192.168.2.21@tcp
      2. Look up the peer_ni based on the reply destination
      3. Determine that the best local interface for sending the reply is the one on which we received the message (192.168.2.22@tcp99)
      4. Try to find route to tcp0
      5. ???
      6. Try to find route to tcp99
      7. No route to tcp99 so fail the send.

      I need to figure out what is happening at step 5 there.
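
The observable outcome on the client can be approximated as a route lookup restricted to the network the GET arrived on. This sketch collapses steps 4 through 6 into a single lookup and does not explain step 5; `send_reply` is a hypothetical helper, and the value 113 is taken from the "-113" in the client log.

```python
EHOSTUNREACH = 113  # matches the "-113" reported in the client log


def send_reply(received_on, reply_dest, routes):
    """Return the gateway used for the REPLY, or a negative errno.

    The source NID is pinned to the NID the GET arrived on, so only
    gateways on that same local net are acceptable.
    """
    src_net = received_on.split("@", 1)[1]
    dest_net = reply_dest.split("@", 1)[1]
    for route_net, gateway, state in routes:
        gw_net = gateway.split("@", 1)[1]
        if route_net == dest_net and gw_net == src_net and state == "up":
            return gateway
    return -EHOSTUNREACH


# Route as reported by `lctl show_route` on the client.
client_routes = [("tcp", "192.168.2.20@tcp1", "up")]

print(send_reply("192.168.2.22@tcp99", "192.168.2.21@tcp", client_routes))
print(send_reply("192.168.2.22@tcp1", "192.168.2.21@tcp", client_routes))
```

A GET that arrives on @tcp99 yields -113 in this model, while one that arrives on @tcp1 finds the gateway 192.168.2.20@tcp1.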

      Here's the successful case.

      lctl ping issued from server to client:

      sles15s01:/tmp # lctl ping 192.168.2.22@tcp99
      12345-0@lo
      12345-192.168.2.22@tcp99
      12345-192.168.2.22@tcp1
      sles15s01:/tmp #
      

      Server:

      00000400:00000200:1.0:1531255388.760890:0:29655:0:(lib-move.c:3251:LNetGet()) LNetGet -> 12345-192.168.2.22@tcp99
      00000400:00000001:1.0:1531255388.760895:0:29655:0:(lib-move.c:2187:lnet_send()) Process entered
      00000400:00000001:1.0:1531255388.760897:0:29655:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
      00000400:00000200:1.0:1531255388.760958:0:29655:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp99 <?>
      00000400:00000200:1.0:1531255388.760974:0:29655:0:(lib-move.c:1410:lnet_find_route_locked()) Considering lp ffff89c1c69f0200 (192.168.2.20@tcp)
      00000400:00000200:1.0:1531255388.760977:0:29655:0:(lib-move.c:1412:lnet_find_route_locked()) ffff89c1c69f0200 Route not alive
      00000400:00000200:1.0:1531255388.760985:0:29655:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp1 <?>
      00000400:00000200:1.0:1531255388.761029:0:29655:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.21@tcp:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp99:192.168.2.20@tcp) : GET
      

      Router:

      00000400:00000200:18.0:1531255427.633633:0:13162:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.20@tcp) <- 192.168.2.21@tcp : GET - routed
      00000400:00000001:18.0:1531255427.633677:0:13162:0:(lib-move.c:2187:lnet_send()) Process entered
      00000400:00000001:18.0:1531255427.633678:0:13162:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
      00000400:00000200:18.0:1531255427.633688:0:13162:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp1)
      00000400:00000200:18.0:1531255427.633764:0:13162:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp99)
      00000400:00000200:18.0:1531255427.633770:0:13162:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9fcd86856000
      00000400:00000200:18.0:1531255427.633778:0:13162:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.20@tcp1:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp1:192.168.2.22@tcp1) : GET
      

      Client:

      00000400:00000200:4.0:1531255349.488613:0:26298:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.22@tcp1) <- 192.168.2.21@tcp : GET - for me
      00000400:00000001:4.0:1531255349.488647:0:26298:0:(lib-move.c:2187:lnet_send()) Process entered
      00000400:00000001:4.0:1531255349.488648:0:26298:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
      00000400:00000200:4.0:1531255349.488655:0:26298:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp)
      00000400:00000200:4.0:1531255349.488683:0:26298:0:(lib-move.c:1673:lnet_select_pathway()) Got best_ni ffff9ab328829200 (192.168.2.22@tcp1) from explicit src_nid 192.168.2.22@tcp1
      00000400:00000001:4.0:1531255349.488686:0:26298:0:(peer.c:647:lnet_find_peer_ni_locked()) Process leaving (rc=18446632693002343936 : -111380707207680 : ffff9ab328831600)
      00000400:00000200:4.0:1531255349.488692:0:26298:0:(lib-move.c:1706:lnet_select_pathway()) best_lpni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp) is not local. Finding gw
      00000400:00000200:4.0:1531255349.488695:0:26298:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.21@tcp <?>
      00000400:00000200:4.0:1531255349.488706:0:26298:0:(lib-move.c:1719:lnet_select_pathway()) Found best_gw ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:<?>)
      00000400:00000200:4.0:1531255349.488716:0:26298:0:(lib-move.c:1740:lnet_select_pathway()) @@@ peer for best_gw ffff9ab329c81000  peer 192.168.2.20@tcp1 state 0x89 health adhy
      00000400:00000200:4.0:1531255349.488731:0:26298:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:192.168.2.20@tcp1)
      00000400:00000200:4.0:1531255349.488736:0:26298:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9ab329c81000
      00000400:00000200:4.0:1531255349.488744:0:26298:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.22@tcp1(192.168.2.22@tcp1:192.168.2.22@tcp1) -> 192.168.2.21@tcp(192.168.2.21@tcp:192.168.2.20@tcp1) : REPLY
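
The send-side TRACE records from the failing and successful runs can be compared mechanically. The sketch below pulls the six NID fields out of a `lnet_select_pathway()` TRACE line; the field names (`hdr_src`, `tx_ni`, `src_hint`, `hdr_dest`, `dest_hint`, `next_hop`) are labels chosen here from how the values line up in the logs in this ticket, not names taken from the LNet source. The receive-side `lnet_parse()` TRACE lines use a different layout and are not matched.

```python
import re
from collections import namedtuple

SendTrace = namedtuple(
    "SendTrace",
    "hdr_src tx_ni src_hint hdr_dest dest_hint next_hop msg_type",
)

# A NID-like token: anything without whitespace, parentheses, or a colon.
_NID = r"[^\s(:)]+"
TRACE_RE = re.compile(
    rf"TRACE: ({_NID})\(({_NID}):({_NID})\) -> ({_NID})\(({_NID}):({_NID})\) : (\w+)"
)


def parse_send_trace(line):
    """Parse a send-side TRACE line; return None if the line does not match."""
    m = TRACE_RE.search(line)
    return SendTrace(*m.groups()) if m else None


# Router TRACE lines copied from the failing and successful runs.
failed = ("TRACE: 192.168.2.21@tcp(192.168.2.20@tcp99:<?>) -> "
          "192.168.2.22@tcp99(192.168.2.22@tcp1:192.168.2.22@tcp99) : GET")
ok = ("TRACE: 192.168.2.21@tcp(192.168.2.20@tcp1:<?>) -> "
      "192.168.2.22@tcp1(192.168.2.22@tcp1:192.168.2.22@tcp1) : GET")

for label, line in (("failed", failed), ("ok", ok)):
    t = parse_send_trace(line)
    print(label, t.tx_ni, "->", t.next_hop)
```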
      

      So the key difference in the successful case is that, when forwarding the GET to the client, the router sends the message over its @tcp1 interface rather than its @tcp99 interface. Because the client wants to send the REPLY over the same interface on which it received the message, it is able to do so successfully: it has a route defined for that interface.
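
Putting the pieces together, the intermittent behaviour can be reproduced in a toy simulation. Everything here is a sketch under assumptions: the router is assumed to alternate strictly between the client's two NIDs, the client is assumed to reply only from the NID the GET arrived on, and `reply_possible` and `simulate` are hypothetical helpers rather than LNet functions.

```python
# Route as reported by `lctl show_route` on the client.
CLIENT_ROUTES = [("tcp", "192.168.2.20@tcp1", "up")]
CLIENT_NIDS = ["192.168.2.22@tcp99", "192.168.2.22@tcp1"]
SERVER_NID = "192.168.2.21@tcp"


def net_of(nid):
    """Return the network part of a NID."""
    return nid.split("@", 1)[1]


def reply_possible(arrival_nid, reply_dest):
    """True if the client has an 'up' route to reply_dest's net whose
    gateway is on the same net as the NID the GET arrived on."""
    return any(
        net == net_of(reply_dest)
        and state == "up"
        and net_of(gw) == net_of(arrival_nid)
        for net, gw, state in CLIENT_ROUTES
    )


def simulate(pings):
    """Return (arrival NID, ping succeeded) for each simulated ping."""
    results = []
    for i in range(pings):
        # Assumed alternation of the router's next-hop choice.
        arrival = CLIENT_NIDS[i % len(CLIENT_NIDS)]
        results.append((arrival, reply_possible(arrival, SERVER_NID)))
    return results


for arrival, ok in simulate(4):
    print(f"GET delivered to {arrival}: {'ping ok' if ok else 'ping failed'}")
```

In this model every ping delivered to the client's @tcp99 NID fails and every ping delivered to @tcp1 succeeds, which is the pattern seen in the two runs above.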

          People

            ashehata Amir Shehata (Inactive)
            hornc Chris Horn