[LU-12053] intermittent ping failures with MR/DD Created: 08/Mar/19 Updated: 25/Nov/19 Resolved: 25/Nov/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I have three nodes - client, server, router. The LNet configuration approximates what we have on a typical Cray XC. client: sles15c01:/tmp # lctl list_nids 192.168.2.22@tcp99 192.168.2.22@tcp1 sles15c01:/tmp # lctl show_route net tcp hops 4294967295 gw 192.168.2.20@tcp1 up pri 0 sles15c01:/tmp # Server: sles15s01:/tmp # lctl list_nids 192.168.2.21@tcp sles15s01:/tmp # lctl show_route net tcp1 hops 4294967295 gw 192.168.2.20@tcp up pri 0 net tcp99 hops 4294967295 gw 192.168.2.20@tcp down pri 0 sles15s01:/tmp # Router: sles15build01:/tmp # lctl list_nids 192.168.2.20@tcp99 192.168.2.20@tcp1 192.168.2.20@tcp sles15build01:/tmp # All nodes are Multi-Rail aware: sles15build01:/tmp # pdsh -w sles15build01,sles15s01,sles15c01 "lnetctl export | grep Multi" sles15c01: Multi-Rail: True sles15c01: Multi-Rail: True sles15c01: Multi-Rail: True sles15s01: Multi-Rail: True sles15s01: Multi-Rail: True sles15build01: Multi-Rail: True sles15build01: Multi-Rail: True sles15build01:/tmp # Ping issued from server to client: sles15s01:/tmp # lctl ping 192.168.2.22@tcp99 failed to ping 192.168.2.22@tcp99: Input/output error Server: 00000400:00000200:0.0:1531255380.403698:0:29617:0:(lib-move.c:3251:LNetGet()) LNetGet -> 12345-192.168.2.22@tcp99 00000400:00000001:0.0:1531255380.403703:0:29617:0:(lib-move.c:2187:lnet_send()) Process entered 00000400:00000001:0.0:1531255380.403705:0:29617:0:(lib-move.c:1567:lnet_select_pathway()) Process entered ... 00000400:00000200:0.0:1531255380.403760:0:29617:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp99 <?> 00000400:00000200:0.0:1531255380.403808:0:29617:0:(lib-move.c:1412:lnet_find_route_locked()) ffff89c1c69f0200 Route not alive ... 00000400:00000200:0.0:1531255380.403817:0:29617:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp1 <?> ... 00000400:00000200:0.0:1531255380.403867:0:29617:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.21@tcp:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp99:192.168.2.20@tcp) : GET What's happening here is: Router: 00000400:00000200:18.0:1531255419.276466:0:13162:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.20@tcp) <- 192.168.2.21@tcp : GET - routed 00000400:00000001:18.0:1531255419.276510:0:13162:0:(lib-move.c:2187:lnet_send()) Process entered 00000400:00000001:18.0:1531255419.276512:0:13162:0:(lib-move.c:1567:lnet_select_pathway()) Process entered 00000400:00000200:18.0:1531255419.276521:0:13162:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp1) 00000400:00000001:18.0:1531255419.276559:0:13162:0:(api-ni.c:1024:lnet_get_net_locked()) Process leaving (rc=18446638301376926080 : -105772332625536 : ffff9fccf5afd180) 00000400:00000200:18.0:1531255419.276592:0:13162:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9fd2ec4dee00 (192.168.2.22@tcp99:tcp99:192.168.2.22@tcp99) 00000400:00000200:18.0:1531255419.276598:0:13162:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9fd2ec4dee00 00000400:00000200:18.0:1531255419.276606:0:13162:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.20@tcp99:<?>) -> 192.168.2.22@tcp99(192.168.2.22@tcp1:192.168.2.22@tcp99) : GET What's happening here is: Client: 00000400:00000200:1.0:1531255341.131514:0:26294:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp99(192.168.2.22@tcp99) <- 192.168.2.21@tcp : GET - for me 00000400:00000001:1.0:1531255341.131544:0:26294:0:(lib-move.c:2187:lnet_send()) Process entered 00000400:00000001:1.0:1531255341.131545:0:26294:0:(lib-move.c:1567:lnet_select_pathway()) Process entered 00000400:00000200:1.0:1531255341.131550:0:26294:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp) 00000400:00000200:1.0:1531255341.131572:0:26294:0:(lib-move.c:1673:lnet_select_pathway()) Got best_ni ffff9ab328829a00 (192.168.2.22@tcp99) from explicit src_nid 192.168.2.22@tcp99 00000400:00000001:1.0:1531255341.131575:0:26294:0:(peer.c:647:lnet_find_peer_ni_locked()) Process leaving (rc=18446632693002343936 : -111380707207680 : ffff9ab328831600) 00000400:00000200:1.0:1531255341.131579:0:26294:0:(lib-move.c:1706:lnet_select_pathway()) best_lpni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp) is not local. Finding gw 00000400:00000200:1.0:1531255341.131582:0:26294:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.21@tcp <?> 00000400:00000200:1.0:1531255341.131592:0:26294:0:(lib-move.c:1719:lnet_select_pathway()) Found best_gw ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:<?>) 00000400:00000200:1.0:1531255341.131600:0:26294:0:(lib-move.c:1740:lnet_select_pathway()) @@@ peer for best_gw ffff9ab329c81000 peer 192.168.2.20@tcp1 state 0x89 health adhy 00000400:00000200:1.0:1531255341.131607:0:26294:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to tcp99 192.168.2.21@tcp <?> 00000400:02000400:1.0:1531255341.131615:0:26294:0:(lib-move.c:1953:lnet_select_pathway()) No route to 192.168.2.21@tcp from 192.168.2.22@tcp99 00000400:00020000:1.0:1531255341.131623:0:26294:0:(lib-move.c:2370:lnet_parse_get()) 192.168.2.22@tcp99: Unable to send REPLY for GET from 12345-192.168.2.21@tcp: -113 What's happening here is: I need to figure out what is happening at step 5 there. Here's the successful case. lctl ping issued from server to client: sles15s01:/tmp # lctl ping 192.168.2.22@tcp99 12345-0@lo 12345-192.168.2.22@tcp99 12345-192.168.2.22@tcp1 sles15s01:/tmp # Server: 00000400:00000200:1.0:1531255388.760890:0:29655:0:(lib-move.c:3251:LNetGet()) LNetGet -> 12345-192.168.2.22@tcp99 00000400:00000001:1.0:1531255388.760895:0:29655:0:(lib-move.c:2187:lnet_send()) Process entered 00000400:00000001:1.0:1531255388.760897:0:29655:0:(lib-move.c:1567:lnet_select_pathway()) Process entered 00000400:00000200:1.0:1531255388.760958:0:29655:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp99 <?> 00000400:00000200:1.0:1531255388.760974:0:29655:0:(lib-move.c:1410:lnet_find_route_locked()) Considering lp ffff89c1c69f0200 (192.168.2.20@tcp) 00000400:00000200:1.0:1531255388.760977:0:29655:0:(lib-move.c:1412:lnet_find_route_locked()) ffff89c1c69f0200 Route not alive 00000400:00000200:1.0:1531255388.760985:0:29655:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp1 <?> 00000400:00000200:1.0:1531255388.761029:0:29655:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.21@tcp:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp99:192.168.2.20@tcp) : GET Router: 00000400:00000200:18.0:1531255427.633633:0:13162:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.20@tcp) <- 192.168.2.21@tcp : GET - routed 00000400:00000001:18.0:1531255427.633677:0:13162:0:(lib-move.c:2187:lnet_send()) Process entered 00000400:00000001:18.0:1531255427.633678:0:13162:0:(lib-move.c:1567:lnet_select_pathway()) Process entered 00000400:00000200:18.0:1531255427.633688:0:13162:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp1) 00000400:00000200:18.0:1531255427.633764:0:13162:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp99) 00000400:00000200:18.0:1531255427.633770:0:13162:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9fcd86856000 00000400:00000200:18.0:1531255427.633778:0:13162:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.20@tcp1:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp1:192.168.2.22@tcp1) : GET Client: 00000400:00000200:4.0:1531255349.488613:0:26298:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.22@tcp1) <- 192.168.2.21@tcp : GET - for me 00000400:00000001:4.0:1531255349.488647:0:26298:0:(lib-move.c:2187:lnet_send()) Process entered 00000400:00000001:4.0:1531255349.488648:0:26298:0:(lib-move.c:1567:lnet_select_pathway()) Process entered 00000400:00000200:4.0:1531255349.488655:0:26298:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp) 00000400:00000200:4.0:1531255349.488683:0:26298:0:(lib-move.c:1673:lnet_select_pathway()) Got best_ni ffff9ab328829200 (192.168.2.22@tcp1) from explicit src_nid 192.168.2.22@tcp1 00000400:00000001:4.0:1531255349.488686:0:26298:0:(peer.c:647:lnet_find_peer_ni_locked()) Process leaving (rc=18446632693002343936 : -111380707207680 : ffff9ab328831600) 00000400:00000200:4.0:1531255349.488692:0:26298:0:(lib-move.c:1706:lnet_select_pathway()) best_lpni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp) is not local. Finding gw 00000400:00000200:4.0:1531255349.488695:0:26298:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.21@tcp <?> 00000400:00000200:4.0:1531255349.488706:0:26298:0:(lib-move.c:1719:lnet_select_pathway()) Found best_gw ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:<?>) 00000400:00000200:4.0:1531255349.488716:0:26298:0:(lib-move.c:1740:lnet_select_pathway()) @@@ peer for best_gw ffff9ab329c81000 peer 192.168.2.20@tcp1 state 0x89 health adhy 00000400:00000200:4.0:1531255349.488731:0:26298:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:192.168.2.20@tcp1) 00000400:00000200:4.0:1531255349.488736:0:26298:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9ab329c81000 00000400:00000200:4.0:1531255349.488744:0:26298:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.22@tcp1(192.168.2.22@tcp1:192.168.2.22@tcp1) -> 192.168.2.21@tcp(192.168.2.21@tcp:192.168.2.20@tcp1) : REPLY So the key difference in the successful case is that when forwarding the GET to the client the router sends the message over its @tcp1 interface rather than its @tcp99 interface. Since the client wants to send the REPLY over the same interface it received the message on it is able to to do successfully because it has a route defined for that interface. |
| Comments |
| Comment by Chris Horn [ 08/Mar/19 ] |
|
Attached dk logs from router, client and server that show what may be the same or a similar issue. I had the following patches applied on top of 2.12 05b3fa2bd (origin/task/test-mr-fixes) LU-11664 lnet: push router interface updates 111818b07 LU-11297 lnet: set gw sensitivity from lnetctl a878b193c LU-11297 lnet: handle router health off c1b835d57 LU-11641 lnet: handle discovery off fb4f70ba5 LU-11470 lnet: drop all rule 064e16551 LU-11478 lnet: misleading discovery seqno. ff3a0e631 LU-11477 lnet: handle health for incoming messages 162357d38 LU-11475 lnet: allow deleting router primary_nid 2a8c65bb3 LU-11300 lnet: consider alive_router_check_interval 68ea55280 LU-11378 lnet: MR aware gateway selection 7f8e99ac6 LU-11299 lnet: use discovery for routing a4224f8d8 LU-11299 lnet: modify lnd notification mechanism 72aa3920f LU-11299 lnet: Cleanup rcd 1ae265b66 LU-11300 lnet: simplify lnet_handle_local_failure() 214a223ff LU-11300 lnet: router aliveness 0affc079c LU-11300 lnet: peer aliveness 9d33c4296 LU-11300 lnet: Cache the routing feature 3e518dc0b LU-11300 lnet: cache ni status 5b83c8c8e LU-11300 lnet: configure lnet router senstivity 6cbd670a6 LU-11300 lnet: router sensitivity e05da588a LU-11551 lnet: Do not allow deleting of router nis e6b298711 LU-11299 lnet: lnet_add/del_route() 18106c476 LU-11298 lnet: use peer for gateway 7fda6b7d1 LU-11292 lnet: Discover routers on first use bb6cc75fe LU-11515 lnet: monitor thread wakeup algorithm 9040e2159 LU-10153 lnet: remove route add restriction 2519eacfc LU-11816 lnet: setup health timeout defaults Here's the LNet config: sles15c01:/sys/kernel/debug/lnet # lnetctl net show
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
- net type: tcp1
local NI(s):
- nid: 192.168.2.22@tcp1
status: up
interfaces:
0: eth0
- net type: tcp99
local NI(s):
- nid: 192.168.2.22@tcp99
status: up
interfaces:
0: eth0
sles15c01:/sys/kernel/debug/lnet # lnetctl route show
route:
- net: tcp
gateway: 192.168.2.20@tcp1
sles15c01:/sys/kernel/debug/lnet # lnetctl peer show
peer:
- primary nid: 192.168.2.20@tcp
Multi-Rail: True
peer ni:
- nid: 192.168.2.20@tcp1
state: up
- nid: 192.168.2.20@tcp
state: up
- nid: 192.168.2.26@tcp99
state: up
- primary nid: 192.168.2.21@tcp
Multi-Rail: True
peer ni:
- nid: 192.168.2.21@tcp
state: NA
sles15c01:/sys/kernel/debug/lnet #
Server: sles15s01:/sys/module/lnet/parameters # lnetctl net show
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
- net type: tcp
local NI(s):
- nid: 192.168.2.21@tcp
status: up
interfaces:
0: eth0
sles15s01:/sys/module/lnet/parameters # lnetctl route show
route:
- net: tcp1
gateway: 192.168.2.20@tcp
- net: tcp99
gateway: 192.168.2.20@tcp
sles15s01:/sys/module/lnet/parameters # lnetctl peer show
peer:
- primary nid: 192.168.2.20@tcp
Multi-Rail: True
peer ni:
- nid: 192.168.2.20@tcp
state: up
- nid: 192.168.2.20@tcp1
state: up
- nid: 192.168.2.26@tcp99
state: up
- primary nid: 192.168.2.22@tcp1
Multi-Rail: True
peer ni:
- nid: 192.168.2.22@tcp99
state: NA
- nid: 192.168.2.22@tcp1
state: NA
sles15s01:/sys/module/lnet/parameters #
Router: sles15build01:/tmp # lnetctl net show
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
- net type: tcp
local NI(s):
- nid: 192.168.2.20@tcp
status: up
interfaces:
0: eth0
- net type: tcp1
local NI(s):
- nid: 192.168.2.20@tcp1
status: up
interfaces:
0: eth0
- net type: tcp99
local NI(s):
- nid: 192.168.2.26@tcp99
status: up
interfaces:
0: eth1
sles15build01:/tmp # lnetctl route show
sles15build01:/tmp # lnetctl peer show
peer:
- primary nid: 192.168.2.22@tcp1
Multi-Rail: True
peer ni:
- nid: 192.168.2.22@tcp1
state: up
- nid: 192.168.2.22@tcp99
state: up
- primary nid: 192.168.2.21@tcp
Multi-Rail: True
peer ni:
- nid: 192.168.2.21@tcp
state: up
sles15build01:/tmp #
|
| Comment by Gerrit Updater [ 09/Apr/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34625 |
| Comment by Gerrit Updater [ 28/May/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34970 |
| Comment by Gerrit Updater [ 07/Jun/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/34625/ |
| Comment by Chris Horn [ 26/Jul/19 ] |
|
ashehata pjones I think this can be resolved with the merge of MR routing |
| Comment by Cory Spitz [ 26/Aug/19 ] |
|
pjones, this ticket can be marked resolved for Lustre 2.13.0. |