Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
None
-
3
-
9223372036854775807
Description
This bug can result in routes being marked down when they are in fact up:
Ping fails: lp_dc_error is set, and lp_state is set with FORCE_PING
00000400:00000200:25.0:1594238418.911990:0:44689:0:(peer.c:3029:lnet_peer_send_ping()) peer 601@gni99 00000400:00000200:25.0:1594238418.911990:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x30275 rc 0 00000400:00000200:25.0:1594238420.912144:0:44689:0:(peer.c:3260:lnet_peer_discovery_wait_for_work()) woken: 0 00000400:00000200:25.0:1594238420.912146:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x30875 00000400:00000200:25.0:1594238420.912147:0:44689:0:(peer.c:2523:lnet_discovery_event_handler()) Received event: 6 00000400:00000200:25.0:1594238420.912148:0:44689:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff905fbb363210 00000400:00000200:25.0:1594238420.912150:0:44689:0:(peer.c:2954:lnet_peer_ping_failed()) peer 601@gni99:-113 00000400:00000200:25.0:1594238420.912150:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x32075 rc -113 00000400:00000200:25.0:1594238420.912152:0:44689:0:(peer.c:3161:lnet_peer_discovery_error()) Discovery error 601@gni99: -113 00000400:00000200:25.0:1594238420.912153:0:44689:0:(peer.c:1938:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 601@gni99 00000400:00000200:25.0:1594238420.912154:0:44689:0:(router.c:576:lnet_router_discovery_complete()) 601@gni99: Router discovery failed -113
Push arrives
00000400:00000200:0.0:1594238425.096430:0:43969:0:(api-ni.c:2085:lnet_push_target_event_handler()) type 2 status 0 unlinked 1 00000400:00000200:0.0:1594238425.096432:0:43969:0:(peer.c:2065:lnet_peer_push_event()) peer 601@gni99(ffff90348c6c8000) is MR 00000400:00000200:0.0:1594238425.096433:0:43969:0:(peer.c:2142:lnet_peer_push_event()) Received Push 601@gni99 1
Push is processed:
00000400:00000200:25.0:1594238425.096441:0:44689:0:(peer.c:3260:lnet_peer_discovery_wait_for_work()) woken: 0 00000400:00000200:25.0:1594238425.096445:0:44689:0:(api-ni.c:2075:lnet_push_target_post()) posted push target ffff9064ad063000 00000400:00000200:25.0:1594238425.096449:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x220f5 00000400:00000200:25.0:1594238425.096452:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 10.16.100.10@o2ib10 status 363512030 00000400:00000200:25.0:1594238425.096453:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni99 status 363512030 00000400:00000200:25.0:1594238425.096454:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni status 363512030 00000400:00000200:25.0:1594238425.096455:0:44689:0:(peer.c:2726:lnet_peer_merge_data()) peer 601@gni99 (ffff90348c6c8000): 0 00000400:00000200:25.0:1594238425.096456:0:44689:0:(peer.c:2921:lnet_peer_data_present()) peer 601@gni99(ffff90348c6c8000): 0. state = 0x22175 00000400:00000200:25.0:1594238425.096458:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x22175 rc 1
Peer is marked with FORCE_PING, so a ping is sent:
00000400:00000200:25.0:1594238425.096459:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x22175 00000400:00000200:25.0:1594238425.096462:0:44689:0:(lib-move.c:5068:LNetGet()) LNetGet -> 12345-10.16.100.10@o2ib10 ... 00000400:00000200:25.0:1594238425.096475:0:44689:0:(lib-move.c:1855:lnet_handle_send()) TRACE: 10.16.100.60@o2ib10(10.16.100.60@o2ib10:<?>) -> 10.16.100.10@o2ib10(10.16.100.10@o2ib10:10.16.100.10@o2ib10) <?> : GET try# 0 00000800:00000200:25.0:1594238425.096477:0:44689:0:(o2iblnd_cb.c:1663:kiblnd_send()) sending 0 bytes in 0 frags to 12345-10.16.100.10@o2ib10 00000800:00000200:25.0:1594238425.096480:0:44689:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff9034adc34a80] -> 10.16.100.10@o2ib10 (2) version: 12 00000800:00000200:25.0:1594238425.096481:0:44689:0:(o2iblnd_cb.c:1543:kiblnd_launch_tx()) conn[ffff90345d3e1e00] (36)++ 00000800:00000200:25.0:1594238425.096482:0:44689:0:(o2iblnd_cb.c:1316:kiblnd_queue_tx_locked()) conn[ffff90345d3e1e00] (37)++ 00000800:00000200:25.0:1594238425.096485:0:44689:0:(o2iblnd_cb.c:1549:kiblnd_launch_tx()) conn[ffff90345d3e1e00] (38)-- 00000400:00000200:25.0:1594238425.096486:0:44689:0:(peer.c:3029:lnet_peer_send_ping()) peer 601@gni99 00000400:00000200:25.0:1594238425.096486:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20375 rc 0
Ping reply is processed:
00000400:00000200:25.0:1594238425.096580:0:44689:0:(peer.c:3260:lnet_peer_discovery_wait_for_work()) woken: 0 00000400:00000200:25.0:1594238425.096581:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x201f5 00000400:00000200:25.0:1594238425.096582:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 10.16.100.10@o2ib10 status 363512030 00000400:00000200:25.0:1594238425.096583:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni99 status 363512030 00000400:00000200:25.0:1594238425.096584:0:44689:0:(peer.c:2659:lnet_peer_merge_data()) cur lpni 601@gni status 363512030 00000400:00000200:25.0:1594238425.096584:0:44689:0:(peer.c:2726:lnet_peer_merge_data()) peer 601@gni99 (ffff90348c6c8000): 0 00000400:00000200:25.0:1594238425.096585:0:44689:0:(peer.c:2921:lnet_peer_data_present()) peer 601@gni99(ffff90348c6c8000): 0. state = 0x20175 00000400:00000200:25.0:1594238425.096586:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20175 rc 1 00000400:00000200:25.0:1594238425.096587:0:44689:0:(peer.c:3375:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20175 00000400:00000200:25.0:1594238425.096587:0:44689:0:(peer.c:3180:lnet_peer_discovered()) peer 601@gni99 00000400:00000200:25.0:1594238425.096588:0:44689:0:(peer.c:3394:lnet_peer_discovery()) peer 601@gni99(ffff90348c6c8000) state 0x20115 rc 0 00000400:00000200:25.0:1594238425.096589:0:44689:0:(peer.c:1938:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 601@gni99
lp_dc_error was never cleared, so it looks like discovery failed, and route is marked down:
00000400:00000200:25.0:1594238425.096590:0:44689:0:(router.c:576:lnet_router_discovery_complete()) 601@gni99: Router discovery failed -113