[LU-12053] intermittent ping failures with MR/DD Created: 08/Mar/19  Updated: 25/Nov/19  Resolved: 25/Nov/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File LU-12053.dklogs.tar.bz2    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I have three nodes - client, server, router. The LNet configuration approximates what we have on a typical Cray XC.

client:

sles15c01:/tmp # lctl list_nids
192.168.2.22@tcp99
192.168.2.22@tcp1
sles15c01:/tmp # lctl show_route
net                tcp hops 4294967295 gw                192.168.2.20@tcp1 up pri 0
sles15c01:/tmp #

Server:

sles15s01:/tmp # lctl list_nids
192.168.2.21@tcp
sles15s01:/tmp # lctl show_route
net               tcp1 hops 4294967295 gw                 192.168.2.20@tcp up pri 0
net              tcp99 hops 4294967295 gw                 192.168.2.20@tcp down pri 0
sles15s01:/tmp #

Router:

sles15build01:/tmp # lctl list_nids
192.168.2.20@tcp99
192.168.2.20@tcp1
192.168.2.20@tcp
sles15build01:/tmp #

All nodes are Multi-Rail aware:

sles15build01:/tmp # pdsh -w sles15build01,sles15s01,sles15c01 "lnetctl export | grep Multi"
sles15c01:       Multi-Rail: True
sles15c01:       Multi-Rail: True
sles15c01:       Multi-Rail: True
sles15s01:       Multi-Rail: True
sles15s01:       Multi-Rail: True
sles15build01:       Multi-Rail: True
sles15build01:       Multi-Rail: True
sles15build01:/tmp #

Ping issued from server to client:

sles15s01:/tmp # lctl ping 192.168.2.22@tcp99
failed to ping 192.168.2.22@tcp99: Input/output error

Server:

00000400:00000200:0.0:1531255380.403698:0:29617:0:(lib-move.c:3251:LNetGet()) LNetGet -> 12345-192.168.2.22@tcp99
00000400:00000001:0.0:1531255380.403703:0:29617:0:(lib-move.c:2187:lnet_send()) Process entered
00000400:00000001:0.0:1531255380.403705:0:29617:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
...
00000400:00000200:0.0:1531255380.403760:0:29617:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp99 <?>
00000400:00000200:0.0:1531255380.403808:0:29617:0:(lib-move.c:1412:lnet_find_route_locked()) ffff89c1c69f0200 Route not alive
...
00000400:00000200:0.0:1531255380.403817:0:29617:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp1 <?>
...
00000400:00000200:0.0:1531255380.403867:0:29617:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.21@tcp:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp99:192.168.2.20@tcp) : GET

What's happening here is:
1. Destination is 192.168.2.22@tcp99 (See the LNetGet())
2. Attempt to find route to @tcp99, but the route is "down" (see output of lctl show_route above)
3. Because of MR/DD we know that the destination peer also has the nid 192.168.2.22@tcp1
4. Attempt to find route to @tcp1, this succeeds.
5. Send message to 192.168.2.22@tcp1 via gateway 192.168.2.20@tcp

Router:

00000400:00000200:18.0:1531255419.276466:0:13162:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.20@tcp) <- 192.168.2.21@tcp : GET - routed
00000400:00000001:18.0:1531255419.276510:0:13162:0:(lib-move.c:2187:lnet_send()) Process entered
00000400:00000001:18.0:1531255419.276512:0:13162:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
00000400:00000200:18.0:1531255419.276521:0:13162:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp1)
00000400:00000001:18.0:1531255419.276559:0:13162:0:(api-ni.c:1024:lnet_get_net_locked()) Process leaving (rc=18446638301376926080 : -105772332625536 : ffff9fccf5afd180)
00000400:00000200:18.0:1531255419.276592:0:13162:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9fd2ec4dee00 (192.168.2.22@tcp99:tcp99:192.168.2.22@tcp99)
00000400:00000200:18.0:1531255419.276598:0:13162:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9fd2ec4dee00
00000400:00000200:18.0:1531255419.276606:0:13162:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.20@tcp99:<?>) -> 192.168.2.22@tcp99(192.168.2.22@tcp1:192.168.2.22@tcp99) : GET

What's happening here is:
1. Receive message on 192.168.2.20@tcp from 192.168.2.21@tcp that is destined for 192.168.2.22@tcp1
2. Because of MR/DD we know destination peer also has the nid 192.168.2.22@tcp99
3. Send message to 192.168.2.22@tcp99 over the local 192.168.2.20@tcp99 interface

Client:

00000400:00000200:1.0:1531255341.131514:0:26294:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp99(192.168.2.22@tcp99) <- 192.168.2.21@tcp : GET - for me
00000400:00000001:1.0:1531255341.131544:0:26294:0:(lib-move.c:2187:lnet_send()) Process entered
00000400:00000001:1.0:1531255341.131545:0:26294:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
00000400:00000200:1.0:1531255341.131550:0:26294:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp)
00000400:00000200:1.0:1531255341.131572:0:26294:0:(lib-move.c:1673:lnet_select_pathway()) Got best_ni ffff9ab328829a00 (192.168.2.22@tcp99) from explicit src_nid 192.168.2.22@tcp99
00000400:00000001:1.0:1531255341.131575:0:26294:0:(peer.c:647:lnet_find_peer_ni_locked()) Process leaving (rc=18446632693002343936 : -111380707207680 : ffff9ab328831600)
00000400:00000200:1.0:1531255341.131579:0:26294:0:(lib-move.c:1706:lnet_select_pathway()) best_lpni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp) is not local. Finding gw
00000400:00000200:1.0:1531255341.131582:0:26294:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.21@tcp <?>
00000400:00000200:1.0:1531255341.131592:0:26294:0:(lib-move.c:1719:lnet_select_pathway()) Found best_gw ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:<?>)
00000400:00000200:1.0:1531255341.131600:0:26294:0:(lib-move.c:1740:lnet_select_pathway()) @@@ peer for best_gw ffff9ab329c81000  peer 192.168.2.20@tcp1 state 0x89 health adhy
00000400:00000200:1.0:1531255341.131607:0:26294:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to tcp99 192.168.2.21@tcp <?>
00000400:02000400:1.0:1531255341.131615:0:26294:0:(lib-move.c:1953:lnet_select_pathway()) No route to 192.168.2.21@tcp from 192.168.2.22@tcp99
00000400:00020000:1.0:1531255341.131623:0:26294:0:(lib-move.c:2370:lnet_parse_get()) 192.168.2.22@tcp99: Unable to send REPLY for GET from 12345-192.168.2.21@tcp: -113

What's happening here is:
1. Receive message on 192.168.2.22@tcp99 from 192.168.2.21@tcp
2. Lookup peer_ni based on based on the reply destination
3. Determine the best local interface to send the reply is the one that we received the message on (192.168.2.22@tcp99)
4. Try to find route to tcp0
5. ???
6. Try to find route to tcp99
7. No route to tcp99 so fail the send.

I need to figure out what is happening at step 5 there.

Here's the successful case.

lctl ping issued from server to client:

sles15s01:/tmp # lctl ping 192.168.2.22@tcp99
12345-0@lo
12345-192.168.2.22@tcp99
12345-192.168.2.22@tcp1
sles15s01:/tmp #

Server:

00000400:00000200:1.0:1531255388.760890:0:29655:0:(lib-move.c:3251:LNetGet()) LNetGet -> 12345-192.168.2.22@tcp99
00000400:00000001:1.0:1531255388.760895:0:29655:0:(lib-move.c:2187:lnet_send()) Process entered
00000400:00000001:1.0:1531255388.760897:0:29655:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
00000400:00000200:1.0:1531255388.760958:0:29655:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp99 <?>
00000400:00000200:1.0:1531255388.760974:0:29655:0:(lib-move.c:1410:lnet_find_route_locked()) Considering lp ffff89c1c69f0200 (192.168.2.20@tcp)
00000400:00000200:1.0:1531255388.760977:0:29655:0:(lib-move.c:1412:lnet_find_route_locked()) ffff89c1c69f0200 Route not alive
00000400:00000200:1.0:1531255388.760985:0:29655:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.22@tcp1 <?>
00000400:00000200:1.0:1531255388.761029:0:29655:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.21@tcp:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp99:192.168.2.20@tcp) : GET

Router:

00000400:00000200:18.0:1531255427.633633:0:13162:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.20@tcp) <- 192.168.2.21@tcp : GET - routed
00000400:00000001:18.0:1531255427.633677:0:13162:0:(lib-move.c:2187:lnet_send()) Process entered
00000400:00000001:18.0:1531255427.633678:0:13162:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
00000400:00000200:18.0:1531255427.633688:0:13162:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp1)
00000400:00000200:18.0:1531255427.633764:0:13162:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9fcd86856000 (192.168.2.22@tcp1:tcp1:192.168.2.22@tcp99)
00000400:00000200:18.0:1531255427.633770:0:13162:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9fcd86856000
00000400:00000200:18.0:1531255427.633778:0:13162:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.21@tcp(192.168.2.20@tcp1:<?>) -> 192.168.2.22@tcp1(192.168.2.22@tcp1:192.168.2.22@tcp1) : GET

Client:

00000400:00000200:4.0:1531255349.488613:0:26298:0:(lib-move.c:2663:lnet_parse()) TRACE: 192.168.2.22@tcp1(192.168.2.22@tcp1) <- 192.168.2.21@tcp : GET - for me
00000400:00000001:4.0:1531255349.488647:0:26298:0:(lib-move.c:2187:lnet_send()) Process entered
00000400:00000001:4.0:1531255349.488648:0:26298:0:(lib-move.c:1567:lnet_select_pathway()) Process entered
00000400:00000200:4.0:1531255349.488655:0:26298:0:(lib-move.c:1607:lnet_select_pathway()) Got peer_ni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp)
00000400:00000200:4.0:1531255349.488683:0:26298:0:(lib-move.c:1673:lnet_select_pathway()) Got best_ni ffff9ab328829200 (192.168.2.22@tcp1) from explicit src_nid 192.168.2.22@tcp1
00000400:00000001:4.0:1531255349.488686:0:26298:0:(peer.c:647:lnet_find_peer_ni_locked()) Process leaving (rc=18446632693002343936 : -111380707207680 : ffff9ab328831600)
00000400:00000200:4.0:1531255349.488692:0:26298:0:(lib-move.c:1706:lnet_select_pathway()) best_lpni ffff9ab328831600 (192.168.2.21@tcp:tcp:192.168.2.21@tcp) is not local. Finding gw
00000400:00000200:4.0:1531255349.488695:0:26298:0:(lib-move.c:1396:lnet_find_route_locked()) Looking for route to NULL 192.168.2.21@tcp <?>
00000400:00000200:4.0:1531255349.488706:0:26298:0:(lib-move.c:1719:lnet_select_pathway()) Found best_gw ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:<?>)
00000400:00000200:4.0:1531255349.488716:0:26298:0:(lib-move.c:1740:lnet_select_pathway()) @@@ peer for best_gw ffff9ab329c81000  peer 192.168.2.20@tcp1 state 0x89 health adhy
00000400:00000200:4.0:1531255349.488731:0:26298:0:(lib-move.c:2000:lnet_select_pathway()) Considering lpni ffff9ab329c81000 (192.168.2.20@tcp1:tcp1:192.168.2.20@tcp1)
00000400:00000200:4.0:1531255349.488736:0:26298:0:(lib-move.c:2043:lnet_select_pathway()) Set best_lpni to ffff9ab329c81000
00000400:00000200:4.0:1531255349.488744:0:26298:0:(lib-move.c:2172:lnet_select_pathway()) TRACE: 192.168.2.22@tcp1(192.168.2.22@tcp1:192.168.2.22@tcp1) -> 192.168.2.21@tcp(192.168.2.21@tcp:192.168.2.20@tcp1) : REPLY

So the key difference in the successful case is that when forwarding the GET to the client the router sends the message over its @tcp1 interface rather than its @tcp99 interface. Since the client wants to send the REPLY over the same interface it received the message on it is able to to do successfully because it has a route defined for that interface.



 Comments   
Comment by Chris Horn [ 08/Mar/19 ]

Attached dk logs from router, client and server that show what may be the same or a similar issue.

I had the following patches applied on top of 2.12

05b3fa2bd (origin/task/test-mr-fixes) LU-11664 lnet: push router interface updates
111818b07 LU-11297 lnet: set gw sensitivity from lnetctl
a878b193c LU-11297 lnet: handle router health off
c1b835d57 LU-11641 lnet: handle discovery off
fb4f70ba5 LU-11470 lnet: drop all rule
064e16551 LU-11478 lnet: misleading discovery seqno.
ff3a0e631 LU-11477 lnet: handle health for incoming messages
162357d38 LU-11475 lnet: allow deleting router primary_nid
2a8c65bb3 LU-11300 lnet: consider alive_router_check_interval
68ea55280 LU-11378 lnet: MR aware gateway selection
7f8e99ac6 LU-11299 lnet: use discovery for routing
a4224f8d8 LU-11299 lnet: modify lnd notification mechanism
72aa3920f LU-11299 lnet: Cleanup rcd
1ae265b66 LU-11300 lnet: simplify lnet_handle_local_failure()
214a223ff LU-11300 lnet: router aliveness
0affc079c LU-11300 lnet: peer aliveness
9d33c4296 LU-11300 lnet: Cache the routing feature
3e518dc0b LU-11300 lnet: cache ni status
5b83c8c8e LU-11300 lnet: configure lnet router senstivity
6cbd670a6 LU-11300 lnet: router sensitivity
e05da588a LU-11551 lnet: Do not allow deleting of router nis
e6b298711 LU-11299 lnet: lnet_add/del_route()
18106c476 LU-11298 lnet: use peer for gateway
7fda6b7d1 LU-11292 lnet: Discover routers on first use
bb6cc75fe LU-11515 lnet: monitor thread wakeup algorithm
9040e2159 LU-10153 lnet: remove route add restriction
2519eacfc LU-11816 lnet: setup health timeout defaults

Here's the LNet config:
Client:

sles15c01:/sys/kernel/debug/lnet # lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp1
      local NI(s):
        - nid: 192.168.2.22@tcp1
          status: up
          interfaces:
              0: eth0
    - net type: tcp99
      local NI(s):
        - nid: 192.168.2.22@tcp99
          status: up
          interfaces:
              0: eth0
sles15c01:/sys/kernel/debug/lnet # lnetctl route show
route:
    - net: tcp
      gateway: 192.168.2.20@tcp1
sles15c01:/sys/kernel/debug/lnet # lnetctl peer show
peer:
    - primary nid: 192.168.2.20@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.20@tcp1
          state: up
        - nid: 192.168.2.20@tcp
          state: up
        - nid: 192.168.2.26@tcp99
          state: up
    - primary nid: 192.168.2.21@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.21@tcp
          state: NA
sles15c01:/sys/kernel/debug/lnet #

Server:

sles15s01:/sys/module/lnet/parameters # lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 192.168.2.21@tcp
          status: up
          interfaces:
              0: eth0
sles15s01:/sys/module/lnet/parameters # lnetctl route show
route:
    - net: tcp1
      gateway: 192.168.2.20@tcp
    - net: tcp99
      gateway: 192.168.2.20@tcp
sles15s01:/sys/module/lnet/parameters # lnetctl peer show
peer:
    - primary nid: 192.168.2.20@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.20@tcp
          state: up
        - nid: 192.168.2.20@tcp1
          state: up
        - nid: 192.168.2.26@tcp99
          state: up
    - primary nid: 192.168.2.22@tcp1
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.22@tcp99
          state: NA
        - nid: 192.168.2.22@tcp1
          state: NA
sles15s01:/sys/module/lnet/parameters #

Router:

sles15build01:/tmp # lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 192.168.2.20@tcp
          status: up
          interfaces:
              0: eth0
    - net type: tcp1
      local NI(s):
        - nid: 192.168.2.20@tcp1
          status: up
          interfaces:
              0: eth0
    - net type: tcp99
      local NI(s):
        - nid: 192.168.2.26@tcp99
          status: up
          interfaces:
              0: eth1
sles15build01:/tmp # lnetctl route show
sles15build01:/tmp # lnetctl peer show
peer:
    - primary nid: 192.168.2.22@tcp1
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.22@tcp1
          state: up
        - nid: 192.168.2.22@tcp99
          state: up
    - primary nid: 192.168.2.21@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.2.21@tcp
          state: up
sles15build01:/tmp #
Comment by Gerrit Updater [ 09/Apr/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34625
Subject: LU-12053 lnet: look up MR peers routes
Project: fs/lustre-release
Branch: multi-rail
Current Patch Set: 1
Commit: e0ee3410e8e7eecdba4f3e137ab32efe826916c2

Comment by Gerrit Updater [ 28/May/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34970
Subject: LU-12053 lnet: look up MR peers routes
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8b9927b88ead8b25f3651cd90f8b904db1d21314

Comment by Gerrit Updater [ 07/Jun/19 ]

Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/34625/
Subject: LU-12053 lnet: look up MR peers routes
Project: fs/lustre-release
Branch: multi-rail
Current Patch Set:
Commit: 52eef817974304d0107caaec023a2ed341c164e0

Comment by Chris Horn [ 26/Jul/19 ]

ashehata pjones I think this can be resolved with the merge of MR routing

Comment by Cory Spitz [ 26/Aug/19 ]

pjones, this ticket can be marked resolved for Lustre 2.13.0.

Generated at Sat Feb 10 02:49:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.