[LU-13764] lp_dc_error is not cleared when new discovery ping is sent as response to incoming push Created: 08/Jul/20  Updated: 07/Aug/20  Resolved: 07/Aug/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This bug can result in routes being marked down when they are in fact up:

Ping fails: lp_dc_error is set, and lp_state is set with FORCE_PING

00000400:00000200:25.0:1594238418.911990:0:44689:0:(peer.c:3029:lnet_peer_send_ping()) peer 601@gni99
00000400:00000200:25.0:1594238418.911990:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x30275 rc 0
00000400:00000200:25.0:1594238420.912144:0:44689:0:(peer.c:3260:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:25.0:1594238420.912146:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x30875
00000400:00000200:25.0:1594238420.912147:0:44689:0:(peer.c:2523:lnet_discovery_event_handler()) Received event: 6
00000400:00000200:25.0:1594238420.912148:0:44689:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff905fbb363210
00000400:00000200:25.0:1594238420.912150:0:44689:0:(peer.c:2954:lnet_peer_ping_failed()) peer 601@gni99:-113
00000400:00000200:25.0:1594238420.912150:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x32075 rc -113
00000400:00000200:25.0:1594238420.912152:0:44689:0:(peer.c:3161:lnet_peer_discovery_error()) Discovery error 601@gni99: -113
00000400:00000200:25.0:1594238420.912153:0:44689:0:(peer.c:1938:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 601@gni99
00000400:00000200:25.0:1594238420.912154:0:44689:0:(router.c:576:lnet_router_discovery_complete()) 601@gni99: Router discovery failed -113

Push arrives

00000400:00000200:0.0:1594238425.096430:0:43969:0:(api-ni.c:2085:lnet_push_target_event_handler()) type 2 status 0 unlinked 1
00000400:00000200:0.0:1594238425.096432:0:43969:0:(peer.c:2065:lnet_peer_push_event()) peer 601@gni99(ffff90348c6c8000) is MR
00000400:00000200:0.0:1594238425.096433:0:43969:0:(peer.c:2142:lnet_peer_push_event()) Received Push 601@gni99 1

Push is processed:

00000400:00000200:25.0:1594238425.096441:0:44689:0:(peer.c:3260:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:25.0:1594238425.096445:0:44689:0:(api-ni.c:2075:lnet_push_target_post()) posted push target ffff9064ad063000
00000400:00000200:25.0:1594238425.096449:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x220f5
00000400:00000200:25.0:1594238425.096452:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 10.16.100.10@o2ib10 status 363512030
00000400:00000200:25.0:1594238425.096453:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni99 status 363512030
00000400:00000200:25.0:1594238425.096454:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni status 363512030
00000400:00000200:25.0:1594238425.096455:0:44689:0:(peer.c:2726:lnet_peer_merge_data()) peer 601@gni99 (ffff90348c6c8000): 0
00000400:00000200:25.0:1594238425.096456:0:44689:0:(peer.c:2921:lnet_peer_data_present()) peer 601@gni99(ffff90348c6c8000): 0. state = 0x22175
00000400:00000200:25.0:1594238425.096458:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x22175 rc 1

Peer is marked with FORCE_PING, so a ping is sent:

00000400:00000200:25.0:1594238425.096459:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x22175
00000400:00000200:25.0:1594238425.096462:0:44689:0:(lib-move.c:5068:LNetGet()) LNetGet -> 12345-10.16.100.10@o2ib10
...
00000400:00000200:25.0:1594238425.096475:0:44689:0:(lib-move.c:1855:lnet_handle_send()) TRACE: 10.16.100.60@o2ib10(10.16.100.60@o2ib10:<?>) -> 10.16.100.10@o2ib10(10.16.100.10@o2ib10:10.16.100.10@o2ib10) <?> : GET try# 0
00000800:00000200:25.0:1594238425.096477:0:44689:0:(o2iblnd_cb.c:1663:kiblnd_send()) sending 0 bytes in 0 frags to 12345-10.16.100.10@o2ib10
00000800:00000200:25.0:1594238425.096480:0:44689:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff9034adc34a80] -> 10.16.100.10@o2ib10 (2) version: 12
00000800:00000200:25.0:1594238425.096481:0:44689:0:(o2iblnd_cb.c:1543:kiblnd_launch_tx()) conn[ffff90345d3e1e00] (36)++
00000800:00000200:25.0:1594238425.096482:0:44689:0:(o2iblnd_cb.c:1316:kiblnd_queue_tx_locked()) conn[ffff90345d3e1e00] (37)++
00000800:00000200:25.0:1594238425.096485:0:44689:0:(o2iblnd_cb.c:1549:kiblnd_launch_tx()) conn[ffff90345d3e1e00] (38)--
00000400:00000200:25.0:1594238425.096486:0:44689:0:(peer.c:3029:lnet_peer_send_ping()) peer 601@gni99
00000400:00000200:25.0:1594238425.096486:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20375 rc 0

Ping reply is processed:

00000400:00000200:25.0:1594238425.096580:0:44689:0:(peer.c:3260:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:25.0:1594238425.096581:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x201f5
00000400:00000200:25.0:1594238425.096582:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 10.16.100.10@o2ib10 status 363512030
00000400:00000200:25.0:1594238425.096583:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni99 status 363512030
00000400:00000200:25.0:1594238425.096584:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni status 363512030
00000400:00000200:25.0:1594238425.096584:0:44689:0:(peer.c:2726:lnet_peer_merge_data()) peer 601@gni99 (ffff90348c6c8000): 0
00000400:00000200:25.0:1594238425.096585:0:44689:0:(peer.c:2921:lnet_peer_data_present()) peer 601@gni99(ffff90348c6c8000): 0. state = 0x20175
00000400:00000200:25.0:1594238425.096586:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20175 rc 1
00000400:00000200:25.0:1594238425.096587:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20175
00000400:00000200:25.0:1594238425.096587:0:44689:0:(peer.c:3180:lnet_peer_discovered()) peer 601@gni99
00000400:00000200:25.0:1594238425.096588:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20115 rc 0
00000400:00000200:25.0:1594238425.096589:0:44689:0:(peer.c:1938:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 601@gni99

lp_dc_error was never cleared, so it looks like discovery failed, and route is marked down:

00000400:00000200:25.0:1594238425.096590:0:44689:0:(router.c:576:lnet_router_discovery_complete()) 601@gni99: Router discovery failed -113


 Comments   
Comment by Amir Shehata (Inactive) [ 08/Jul/20 ]

FORCE_PING should be cleared when the ping fails. This will not trigger another PING until the next time the router checker decides to re-discover the router. In which case the error will be cleared. Would that resolve the issue?

UPDATE: After discussing it with Chris, seems like we should clear the error when the discovery completes successfully. This will prevent the error from bringing down the route.

Normally, the discovery FSM marks the peer with an error and flags it for rediscovery, such that any action on that peer will trigger it to be rediscovered. The expected actions was a send messages to the peer. However, now that we're using discovery not only to actively discover, but to actively push updates to peers when the internal state changes, this scenario is possible and the FSM needs to be updated to take the new flow into consideration.

Comment by Gerrit Updater [ 13/Jul/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39348
Subject: LU-13764 lnet: Clear lp_dc_error when discovery completes
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 56df66290ca67494d71d9354e0f06686683a0608

Comment by Gerrit Updater [ 07/Aug/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39348/
Subject: LU-13764 lnet: Clear lp_dc_error when discovery completes
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d3fa46705225f3f17aa10144ab688008d5230daa

Comment by Cory Spitz [ 07/Aug/20 ]

I believe that this can be resolved (but I don't want to do it myself in case there is a specific protocol to follow that I'm not aware of).

Comment by Peter Jones [ 07/Aug/20 ]

Yes I agree - thanks Cory. No special protocol (other than setting fix version)

Generated at Sat Feb 10 03:04:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.