[LU-13764] lp_dc_error is not cleared when new discovery ping is sent as response to incoming push Created: 08/Jul/20 Updated: 07/Aug/20 Resolved: 07/Aug/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This bug can result in routes being marked down when they are in fact up:

Ping fails: lp_dc_error is set, and lp_state is set with FORCE_PING

00000400:00000200:25.0:1594238418.911990:0:44689:0:(peer.c:3029:lnet_peer_send_ping()) peer 601@gni99
00000400:00000200:25.0:1594238418.911990:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x30275 rc 0
00000400:00000200:25.0:1594238420.912144:0:44689:0:(peer.c:3260:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:25.0:1594238420.912146:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x30875
00000400:00000200:25.0:1594238420.912147:0:44689:0:(peer.c:2523:lnet_discovery_event_handler()) Received event: 6
00000400:00000200:25.0:1594238420.912148:0:44689:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff905fbb363210
00000400:00000200:25.0:1594238420.912150:0:44689:0:(peer.c:2954:lnet_peer_ping_failed()) peer 601@gni99:-113
00000400:00000200:25.0:1594238420.912150:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x32075 rc -113
00000400:00000200:25.0:1594238420.912152:0:44689:0:(peer.c:3161:lnet_peer_discovery_error()) Discovery error 601@gni99: -113
00000400:00000200:25.0:1594238420.912153:0:44689:0:(peer.c:1938:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 601@gni99
00000400:00000200:25.0:1594238420.912154:0:44689:0:(router.c:576:lnet_router_discovery_complete()) 601@gni99: Router discovery failed -113

Push arrives

00000400:00000200:0.0:1594238425.096430:0:43969:0:(api-ni.c:2085:lnet_push_target_event_handler()) type 2 status 0 unlinked 1
00000400:00000200:0.0:1594238425.096432:0:43969:0:(peer.c:2065:lnet_peer_push_event()) peer 601@gni99(ffff90348c6c8000) is MR
00000400:00000200:0.0:1594238425.096433:0:43969:0:(peer.c:2142:lnet_peer_push_event()) Received Push 601@gni99 1

Push is processed:

00000400:00000200:25.0:1594238425.096441:0:44689:0:(peer.c:3260:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:25.0:1594238425.096445:0:44689:0:(api-ni.c:2075:lnet_push_target_post()) posted push target ffff9064ad063000
00000400:00000200:25.0:1594238425.096449:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x220f5
00000400:00000200:25.0:1594238425.096452:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 10.16.100.10@o2ib10 status 363512030
00000400:00000200:25.0:1594238425.096453:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni99 status 363512030
00000400:00000200:25.0:1594238425.096454:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni status 363512030
00000400:00000200:25.0:1594238425.096455:0:44689:0:(peer.c:2726:lnet_peer_merge_data()) peer 601@gni99 (ffff90348c6c8000): 0
00000400:00000200:25.0:1594238425.096456:0:44689:0:(peer.c:2921:lnet_peer_data_present()) peer 601@gni99(ffff90348c6c8000): 0. state = 0x22175
00000400:00000200:25.0:1594238425.096458:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x22175 rc 1

Peer is marked with FORCE_PING, so a ping is sent:

00000400:00000200:25.0:1594238425.096459:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x22175
00000400:00000200:25.0:1594238425.096462:0:44689:0:(lib-move.c:5068:LNetGet()) LNetGet -> 12345-10.16.100.10@o2ib10 ...
00000400:00000200:25.0:1594238425.096475:0:44689:0:(lib-move.c:1855:lnet_handle_send()) TRACE: 10.16.100.60@o2ib10(10.16.100.60@o2ib10:<?>) -> 10.16.100.10@o2ib10(10.16.100.10@o2ib10:10.16.100.10@o2ib10) <?> : GET try# 0
00000800:00000200:25.0:1594238425.096477:0:44689:0:(o2iblnd_cb.c:1663:kiblnd_send()) sending 0 bytes in 0 frags to 12345-10.16.100.10@o2ib10
00000800:00000200:25.0:1594238425.096480:0:44689:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff9034adc34a80] -> 10.16.100.10@o2ib10 (2) version: 12
00000800:00000200:25.0:1594238425.096481:0:44689:0:(o2iblnd_cb.c:1543:kiblnd_launch_tx()) conn[ffff90345d3e1e00] (36)++
00000800:00000200:25.0:1594238425.096482:0:44689:0:(o2iblnd_cb.c:1316:kiblnd_queue_tx_locked()) conn[ffff90345d3e1e00] (37)++
00000800:00000200:25.0:1594238425.096485:0:44689:0:(o2iblnd_cb.c:1549:kiblnd_launch_tx()) conn[ffff90345d3e1e00] (38)--
00000400:00000200:25.0:1594238425.096486:0:44689:0:(peer.c:3029:lnet_peer_send_ping()) peer 601@gni99
00000400:00000200:25.0:1594238425.096486:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20375 rc 0

Ping reply is processed:

00000400:00000200:25.0:1594238425.096580:0:44689:0:(peer.c:3260:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:25.0:1594238425.096581:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x201f5
00000400:00000200:25.0:1594238425.096582:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 10.16.100.10@o2ib10 status 363512030
00000400:00000200:25.0:1594238425.096583:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni99 status 363512030
00000400:00000200:25.0:1594238425.096584:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni status 363512030
00000400:00000200:25.0:1594238425.096584:0:44689:0:(peer.c:2726:lnet_peer_merge_data()) peer 601@gni99 (ffff90348c6c8000): 0
00000400:00000200:25.0:1594238425.096585:0:44689:0:(peer.c:2921:lnet_peer_data_present()) peer 601@gni99(ffff90348c6c8000): 0. state = 0x20175
00000400:00000200:25.0:1594238425.096586:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20175 rc 1
00000400:00000200:25.0:1594238425.096587:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20175
00000400:00000200:25.0:1594238425.096587:0:44689:0:(peer.c:3180:lnet_peer_discovered()) peer 601@gni99
00000400:00000200:25.0:1594238425.096588:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20115 rc 0
00000400:00000200:25.0:1594238425.096589:0:44689:0:(peer.c:1938:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 601@gni99

lp_dc_error was never cleared, so it looks like discovery failed, and route is marked down:

00000400:00000200:25.0:1594238425.096590:0:44689:0:(router.c:576:lnet_router_discovery_complete()) 601@gni99: Router discovery failed -113 |
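The sequence above can be modeled outside the kernel. The following standalone C sketch is illustrative only: it is not Lustre source, and only the lp_dc_error field and the FORCE_PING flag are taken from the trace; everything else is a hypothetical simplification of the discovery FSM. It shows how the stale error from the first, failed ping survives the second, successful ping and still drives the "Router discovery failed -113" verdict at completion:

#include <stdio.h>
#include <stdbool.h>

#define EHOSTUNREACH	113
#define FORCE_PING	0x01

/* Simplified stand-in for struct lnet_peer: only the two fields that
 * matter for this bug. */
struct peer {
	int lp_dc_error;	/* last discovery error */
	int lp_state;		/* discovery flags */
};

/* First ping fails: record the error and request a forced re-ping
 * (cf. lnet_peer_ping_failed()/lnet_peer_discovery_error() above). */
static void ping_fails(struct peer *lp)
{
	lp->lp_dc_error = -EHOSTUNREACH;
	lp->lp_state |= FORCE_PING;
}

/* The push wakes discovery, FORCE_PING is still set, and a new ping is
 * sent; this time it succeeds.  In the buggy flow nothing resets
 * lp_dc_error here or at completion. */
static void reping_succeeds(struct peer *lp, bool clear_stale_error)
{
	lp->lp_state &= ~FORCE_PING;
	if (clear_stale_error)		/* the behaviour the fix introduces */
		lp->lp_dc_error = 0;
}

/* Completion (cf. lnet_router_discovery_complete()) judges the gateway
 * purely by lp_dc_error. */
static void discovery_completes(const struct peer *lp)
{
	if (lp->lp_dc_error)
		printf("Router discovery failed %d (routes marked down)\n",
		       lp->lp_dc_error);
	else
		printf("Router discovery succeeded (routes stay up)\n");
}

int main(void)
{
	struct peer lp = { 0, 0 };

	/* Buggy ordering from the trace: stale -113 outlives the good ping. */
	ping_fails(&lp);
	reping_succeeds(&lp, false);
	discovery_completes(&lp);

	/* With the stale error cleared once the re-ping succeeds, the route
	 * stays up. */
	ping_fails(&lp);
	reping_succeeds(&lp, true);
	discovery_completes(&lp);

	return 0;
}

Running the model prints the "failed -113" verdict for the first pass and the success verdict for the second, which is the difference between the trace above and the intended behaviour.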
| Comments |
| Comment by Amir Shehata (Inactive) [ 08/Jul/20 ] |
|
FORCE_PING should be cleared when the ping fails. That would prevent another ping from being triggered until the next time the router checker decides to re-discover the router, at which point the error would be cleared. Would that resolve the issue?

UPDATE: After discussing it with Chris, it seems we should clear the error when discovery completes successfully. That will prevent a stale error from bringing down the route. Normally, the discovery FSM marks the peer with an error and flags it for rediscovery, so that any action on that peer triggers it to be rediscovered. The expected action was a message being sent to the peer. However, now that discovery is used not only to actively discover peers but also to actively push updates to them when internal state changes, this scenario is possible, and the FSM needs to be updated to take the new flow into consideration. |
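For illustration, here is a minimal sketch of where such a clear could live, assuming it goes in the success path seen in the trace (lnet_peer_discovered()). The body below is a simplified assumption, not the content of the merged patch, and the state-flag handling in particular is abbreviated.

/* Illustrative sketch only; not the actual change merged under
 * https://review.whamcloud.com/39348. */
static int
lnet_peer_discovered(struct lnet_peer *lp)
{
	lp->lp_state |= LNET_PEER_DISCOVERED;
	lp->lp_state &= ~LNET_PEER_DISCOVERING;

	/* Discovery completed successfully: drop any error left over from
	 * an earlier failed ping so lnet_router_discovery_complete() does
	 * not report the gateway as failed and mark its routes down. */
	lp->lp_dc_error = 0;

	return 0;
}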
| Comment by Gerrit Updater [ 13/Jul/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39348 |
| Comment by Gerrit Updater [ 07/Aug/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39348/ |
| Comment by Cory Spitz [ 07/Aug/20 ] |
|
I believe that this can be resolved (but I don't want to do it myself in case there is a specific protocol to follow that I'm not aware of). |
| Comment by Peter Jones [ 07/Aug/20 ] |
|
Yes, I agree - thanks Cory. There is no special protocol (other than setting the fix version). |