Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.11.0
-
None
-
3
Description
I have three nodes - client, server, router. The LNet configuration approximates what we have on a typical Cray XC.
client:
sles15c01:/tmp # lctl list_nids
192.168.2.22@tcp99
192.168.2.22@tcp1
sles15c01:/tmp # lctl show_route
net tcp hops 4294967295 gw 192.168.2.20@tcp1 up pri 0
sles15c01:/tmp #
Server:
sles15s01:/tmp # lctl list_nids
192.168.2.21@tcp
sles15s01:/tmp # lctl show_route
net tcp1 hops 4294967295 gw 192.168.2.20@tcp up pri 0
net tcp99 hops 4294967295 gw 192.168.2.20@tcp down pri 0
sles15s01:/tmp #
Router:
sles15build01:/tmp # lctl list_nids
192.168.2.20@tcp99
192.168.2.20@tcp1
192.168.2.20@tcp
sles15build01:/tmp #
All nodes are Multi-Rail aware:
sles15build01:/tmp # pdsh -w sles15build01,sles15s01,sles15c01 "lnetctl export | grep Multi"
sles15c01: Multi-Rail: True
sles15c01: Multi-Rail: True
sles15c01: Multi-Rail: True
sles15s01: Multi-Rail: True
sles15s01: Multi-Rail: True
sles15build01: Multi-Rail: True
sles15build01: Multi-Rail: True
sles15build01:/tmp #
Ping issued from server to client:
sles15s01:/tmp # lctl ping 192.168.2.22@tcp99
failed to ping 192.168.2.22@tcp99: Input/output error
Server:
00000400:00000200:0.0:1531255380.403698:0:29617:0:(lib-move.c:3251:LNetGet()) LNetGet -> 12345-192.168.2.22@tcp99
00000400:00000001:0.0:1531255380.403703:0:29617:0:(lib-move.c:2187:lnet_send()) Process entered
00000400:00000001:0.0:1531255380.403705:0:29617:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
...
00000400:00000200:0.0:1531255380.403760:0:29617:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp99 <?>
00000400:00000200:0.0:1531255380.403808:0:29617:0:(lib-move.c:1412:lnet_find_route_locked()) ffff89c1c69f0200 Route not alive
...
00000400:00000200:0.0:1531255380.403817:0:29617:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp1 <?>
...
00000400:00000200:0.0:1531255380.403867:0:29617:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.21@tcp:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp99:192.168.2.20@tcp) : GET
What's happening here is:
1. Destination is 192.168.2.22@tcp99 (See the LNetGet())
2. Attempt to find route to @tcp99, but the route is "down" (see output of lctl show_route above)
3. Because of MR/DD we know that the destination peer also has the nid 192.168.2.22@tcp1
4. Attempt to find route to @tcp1, this succeeds.
5. Send message to 192.168.2.22@tcp1 via gateway 192.168.2.20@tcp
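The fallback in steps 2 to 4 can be sketched as a small model. This is only an illustration built from the show_route output and the log above, not the actual LNet code; the names Route, net_of and pick_destination are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    net: str       # remote net this route reaches, e.g. "tcp99"
    gateway: str   # gateway NID used to reach that net
    alive: bool    # "up" / "down" in lctl show_route


def net_of(nid: str) -> str:
    """Return the net part of a NID such as 192.168.2.22@tcp99."""
    return nid.split("@", 1)[1]


def pick_destination(peer_nids, routes):
    """Return (dest_nid, gateway) for the first peer NID that has a
    live route, or None if no NID of the peer is reachable."""
    for nid in peer_nids:
        for route in routes:
            if route.net == net_of(nid) and route.alive:
                return nid, route.gateway
    return None


# Server routes as shown by "lctl show_route" on sles15s01.
SERVER_ROUTES = [
    Route("tcp1", "192.168.2.20@tcp", alive=True),
    Route("tcp99", "192.168.2.20@tcp", alive=False),
]

# Requested NID first, then the other NID known for the same peer.
CLIENT_NIDS = ["192.168.2.22@tcp99", "192.168.2.22@tcp1"]

print(pick_destination(CLIENT_NIDS, SERVER_ROUTES))
```

With the route to tcp99 down, the model picks 192.168.2.22@tcp1 via 192.168.2.20@tcp, which matches the final TRACE line in the server log.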
Router:
00000400:00000200:18.0:1531255419.276466:0:13162:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.20@tcp) <- 192.168.2.21@tcp : GET - routed
00000400:00000001:18.0:1531255419.276510:0:13162:0:(lib-move.c:2187:lnet_send()) Process entered
00000400:00000001:18.0:1531255419.276512:0:13162:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
00000400:00000200:18.0:1531255419.276521:0:13162:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp1)
00000400:00000001:18.0:1531255419.276559:0:13162:0:(api-ni.c:1024:lnet_get_net_locked()) Process leaving (rc=18446638301376926080 : -105772332625536 : ffff9fccf5afd180)
00000400:00000200:18.0:1531255419.276592:0:13162:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9fd2ec4dee00 (192.168.2.22@tcp99:tcp99:192.168.2.22@tcp99)
00000400:00000200:18.0:1531255419.276598:0:13162:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9fd2ec4dee00
00000400:00000200:18.0:1531255419.276606:0:13162:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.20@tcp99:<?>) -> 192.168.2.22@tcp99(192.168.2.22@tcp1:192.168.2.22@tcp99) : GET
What's happening here is:
1. Receive message on 192.168.2.20@tcp from 192.168.2.21@tcp that is destined for 192.168.2.22@tcp1
2. Because of MR/DD we know destination peer also has the nid 192.168.2.22@tcp99
3. Send message to 192.168.2.22@tcp99 over the local 192.168.2.20@tcp99 interface
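To make the router's options concrete, here is a rough sketch that lists every (local interface, peer NID) pair the router could use to reach the client directly. It is a toy model based on the list_nids output above, not LNet's selection logic, and it deliberately does not model how one candidate is chosen over another; eligible_paths and net_of are hypothetical names.

```python
def net_of(nid: str) -> str:
    """Return the net part of a NID such as 192.168.2.20@tcp1."""
    return nid.split("@", 1)[1]


def eligible_paths(local_nids, peer_nids):
    """Pair each peer NID with the local NID on the same net, if any."""
    local_by_net = {net_of(nid): nid for nid in local_nids}
    return [
        (local_by_net[net_of(peer)], peer)
        for peer in peer_nids
        if net_of(peer) in local_by_net
    ]


# NIDs from "lctl list_nids" on the router and on the client.
ROUTER_NIDS = ["192.168.2.20@tcp99", "192.168.2.20@tcp1", "192.168.2.20@tcp"]
CLIENT_NIDS = ["192.168.2.22@tcp99", "192.168.2.22@tcp1"]

for local, peer in eligible_paths(ROUTER_NIDS, CLIENT_NIDS):
    print(f"{local} -> {peer}")
```

Both the tcp99 and the tcp1 pair are candidates, so the router is free to deliver the GET to a NID other than the one the server addressed. In the log above it ends up on the tcp99 pair.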
Client:
00000400:00000200:1.0:1531255341.131514:0:26294:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp99(192.168.2.22@tcp99) <- 192.168.2.21@tcp : GET - for me
00000400:00000001:1.0:1531255341.131544:0:26294:0:(lib-move.c:2187:lnet_send()) Process entered
00000400:00000001:1.0:1531255341.131545:0:26294:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
00000400:00000200:1.0:1531255341.131550:0:26294:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp)
00000400:00000200:1.0:1531255341.131572:0:26294:0:(lib-move.c:1673:lnet_select_pathway()) Got best_ni ffff9ab328829a00 (192.168.2.22@tcp99) from explicit src_nid 192.168.2.22@tcp99
00000400:00000001:1.0:1531255341.131575:0:26294:0:(peer.c:647:lnet_find_peer_ni_locked()) Process leaving (rc=18446632693002343936 : -111380707207680 : ffff9ab328831600)
00000400:00000200:1.0:1531255341.131579:0:26294:0:(lib-move.c:1706:lnet_select_pathway()) best_lpni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp) is not local. Finding gw
00000400:00000200:1.0:1531255341.131582:0:26294:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.21@tcp <?>
00000400:00000200:1.0:1531255341.131592:0:26294:0:(lib-move.c:1719:lnet_select_pathway()) Found best_gw ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:<?>)
00000400:00000200:1.0:1531255341.131600:0:26294:0:(lib-move.c:1740:lnet_select_pathway()) @@@ peer for best_gw ffff9ab329c81000 peer 192.168.2.20@tcp1 state 0x89 health adhy
00000400:00000200:1.0:1531255341.131607:0:26294:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to tcp99 192.168.2.21@tcp <?>
00000400:02000400:1.0:1531255341.131615:0:26294:0:(lib-move.c:1953:lnet_select_pathway()) No route to 192.168.2.21@tcp from 192.168.2.22@tcp99
00000400:00020000:1.0:1531255341.131623:0:26294:0:(lib-move.c:2370:lnet_parse_get()) 192.168.2.22@tcp99: Unable to send REPLY for GET from 12345-192.168.2.21@tcp: -113
What's happening here is:
1. Receive message on 192.168.2.22@tcp99 from 192.168.2.21@tcp
2. Look up the peer_ni based on the reply destination
3. Determine that the best local interface for sending the reply is the one the message was received on (192.168.2.22@tcp99)
4. Try to find route to tcp0
5. ???
6. Try to find route to tcp99
7. No route to tcp99 so fail the send.
I need to figure out what is happening at step 5 there.
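The observable outcome on the client (leaving step 5 aside) can be reproduced with a minimal model in which the reply must leave through the interface the GET arrived on, and a route is usable only if its gateway sits on that interface's net. This is an assumption drawn from the log and the client's show_route output, not a description of the real lnet_select_pathway() logic; reply_gateway and net_of are hypothetical names.

```python
def net_of(nid: str) -> str:
    """Return the net part of a NID such as 192.168.2.22@tcp99."""
    return nid.split("@", 1)[1]


# The client's only route, from "lctl show_route" on sles15c01:
# net tcp is reached through the gateway 192.168.2.20@tcp1.
CLIENT_ROUTES = [{"net": "tcp", "gateway": "192.168.2.20@tcp1"}]

# Value seen in the client log; 113 is EHOSTUNREACH on Linux.
NO_ROUTE = -113


def reply_gateway(recv_nid, dest_nid, routes):
    """Return the gateway for a reply sent from recv_nid to dest_nid,
    or NO_ROUTE if no gateway is reachable from recv_nid's net."""
    src_net = net_of(recv_nid)
    for route in routes:
        reaches_dest = route["net"] == net_of(dest_nid)
        gw_on_src_net = net_of(route["gateway"]) == src_net
        if reaches_dest and gw_on_src_net:
            return route["gateway"]
    return NO_ROUTE


SERVER = "192.168.2.21@tcp"
print(reply_gateway("192.168.2.22@tcp99", SERVER, CLIENT_ROUTES))
print(reply_gateway("192.168.2.22@tcp1", SERVER, CLIENT_ROUTES))
```

Under this model a GET received on 192.168.2.22@tcp99 cannot be answered, because the only gateway is on tcp1, while a GET received on 192.168.2.22@tcp1 can be answered through 192.168.2.20@tcp1.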
Here's the successful case.
lctl ping issued from server to client:
sles15s01:/tmp # lctl ping 192.168.2.22@tcp99
12345-0@lo
12345-192.168.2.22@tcp99
12345-192.168.2.22@tcp1
sles15s01:/tmp #
Server:
00000400:00000200:1.0:1531255388.760890:0:29655:0:(lib-move.c:3251:LNetGet()) LNetGet -> 12345-192.168.2.22@tcp99
00000400:00000001:1.0:1531255388.760895:0:29655:0:(lib-move.c:2187:lnet_send()) Process entered
00000400:00000001:1.0:1531255388.760897:0:29655:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
00000400:00000200:1.0:1531255388.760958:0:29655:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp99 <?>
00000400:00000200:1.0:1531255388.760974:0:29655:0:(lib-move.c:1410:lnet_find_route_locked()) Considering lp ffff89c1c69f0200 (192.168.2.20@tcp)
00000400:00000200:1.0:1531255388.760977:0:29655:0:(lib-move.c:1412:lnet_find_route_locked()) ffff89c1c69f0200 Route not alive
00000400:00000200:1.0:1531255388.760985:0:29655:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp1 <?>
00000400:00000200:1.0:1531255388.761029:0:29655:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.21@tcp:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp99:192.168.2.20@tcp) : GET
Router:
00000400:00000200:18.0:1531255427.633633:0:13162:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.20@tcp) <- 192.168.2.21@tcp : GET - routed
00000400:00000001:18.0:1531255427.633677:0:13162:0:(lib-move.c:2187:lnet_send()) Process entered
00000400:00000001:18.0:1531255427.633678:0:13162:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
00000400:00000200:18.0:1531255427.633688:0:13162:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp1)
00000400:00000200:18.0:1531255427.633764:0:13162:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp99)
00000400:00000200:18.0:1531255427.633770:0:13162:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9fcd86856000
00000400:00000200:18.0:1531255427.633778:0:13162:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.20@tcp1:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp1:192.168.2.22@tcp1) : GET
Client:
00000400:00000200:4.0:1531255349.488613:0:26298:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.22@tcp1) <- 192.168.2.21@tcp : GET - for me
00000400:00000001:4.0:1531255349.488647:0:26298:0:(lib-move.c:2187:lnet_send()) Process entered
00000400:00000001:4.0:1531255349.488648:0:26298:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
00000400:00000200:4.0:1531255349.488655:0:26298:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp)
00000400:00000200:4.0:1531255349.488683:0:26298:0:(lib-move.c:1673:lnet_select_pathway()) Got best_ni ffff9ab328829200 (192.168.2.22@tcp1) from explicit src_nid 192.168.2.22@tcp1
00000400:00000001:4.0:1531255349.488686:0:26298:0:(peer.c:647:lnet_find_peer_ni_locked()) Process leaving (rc=18446632693002343936 : -111380707207680 : ffff9ab328831600)
00000400:00000200:4.0:1531255349.488692:0:26298:0:(lib-move.c:1706:lnet_select_pathway()) best_lpni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp) is not local. Finding gw
00000400:00000200:4.0:1531255349.488695:0:26298:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.21@tcp <?>
00000400:00000200:4.0:1531255349.488706:0:26298:0:(lib-move.c:1719:lnet_select_pathway()) Found best_gw ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:<?>)
00000400:00000200:4.0:1531255349.488716:0:26298:0:(lib-move.c:1740:lnet_select_pathway()) @@@ peer for best_gw ffff9ab329c81000 peer 192.168.2.20@tcp1 state 0x89 health adhy
00000400:00000200:4.0:1531255349.488731:0:26298:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:192.168.2.20@tcp1)
00000400:00000200:4.0:1531255349.488736:0:26298:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9ab329c81000
00000400:00000200:4.0:1531255349.488744:0:26298:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.22@tcp1(192.168.2.22@tcp1:192.168.2.22@tcp1) -> 192.168.2.21@tcp(192.168.2.21@tcp:192.168.2.20@tcp1) : REPLY
So the key difference in the successful case is that, when forwarding the GET to the client, the router sends the message over its @tcp1 interface rather than its @tcp99 interface. The client wants to send the REPLY over the same interface it received the message on, and it is able to do so because it has a route defined for that interface.
pjones, this ticket can be marked resolved for Lustre 2.13.0.