Details
Bug
Resolution: Fixed
Minor
Description
The connection is failing because of ARP flux; however, the peer NI health is never decremented because the failure is classified as a "local" one:
00000800:00020000:1.0:1615483922.587888:0:5629:0:(o2iblnd_cb.c:2933:kiblnd_rejected()) 10.12.2.4@o2ib41 rejected: consumer defined fatal error
00000800:00000200:1.0:1615483922.587890:0:5629:0:(o2iblnd_cb.c:2313:kiblnd_connreq_done()) 10.12.2.4@o2ib41: active(1), version(12), status(-111)
00000800:00000200:1.0:1615483922.587892:0:5629:0:(o2iblnd.c:420:kiblnd_unlink_peer_locked()) peer_ni[ffff8953de6a8600] -> 10.12.2.4@o2ib41 (2)--
00000400:00000200:1.0:1615483922.587894:0:5629:0:(router.c:1720:lnet_notify()) 10.12.2.53@o2ib41 notifying 10.12.2.4@o2ib41: down
00000800:00000100:1.0:1615483922.587896:0:5629:0:(o2iblnd_cb.c:2294:kiblnd_peer_connect_failed()) Deleting messages for 10.12.2.4@o2ib41: connection failed
00000400:00000200:1.0:1615483922.587898:0:5629:0:(lib-msg.c:1011:lnet_is_health_check()) health check = 1, status = -111, hstatus = 2
00000400:00000200:1.0:1615483922.587899:0:5629:0:(lib-msg.c:860:lnet_health_check()) health check: 10.12.2.53@o2ib41->10.12.2.4@o2ib41: GET: LOCAL_DROPPED
00000400:00000200:1.0:1615483922.587901:0:5629:0:(lib-msg.c:479:lnet_handle_local_failure()) ni 10.12.2.53@o2ib41 added to recovery queue. Health = 900
00000400:00000200:1.0:1615483922.587903:0:5629:0:(lib-msg.c:641:lnet_resend_msg_locked()) 10.12.2.53@o2ib41->10.12.2.4@o2ib41:GET:LOCAL_DROPPED - queuing msg (ffff895f4c9171d8) for resend
It would be better to categorize this failure as REMOTE_DROPPED, so that the health of the peer NI, rather than that of the local NI, is decremented.
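The proposed reclassification can be illustrated with a small standalone model. This is only a sketch, not the actual LNet or o2iblnd code: the enum, the struct, and the functions classify_connect_failure() and apply_health() are hypothetical names introduced here. The maximum health of 1000 and the step of 100 are assumptions chosen to match the "Health = 900" value in the log above. The status -111 in the log corresponds to -ECONNREFUSED on Linux.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the health statuses; not the real LNet enum. */
enum health_status {
	HS_OK = 0,
	HS_LOCAL_DROPPED,
	HS_REMOTE_DROPPED,
};

/* Assumed values: one failure takes the health from 1000 to 900. */
#define HEALTH_MAX  1000
#define HEALTH_STEP 100

/* Health of the local NI and of the peer NI for one connection. */
struct ni_health {
	int local;
	int peer;
};

/*
 * Classify a failed connection attempt.
 * status:           0 on success, negative errno on failure.
 * rejected_by_peer: true when the peer actively rejected the request,
 *                   as in the kiblnd_rejected() log entry.
 */
enum health_status classify_connect_failure(int status, bool rejected_by_peer)
{
	if (status == 0)
		return HS_OK;
	/* A rejection comes from the remote side, so blame the peer NI. */
	if (rejected_by_peer)
		return HS_REMOTE_DROPPED;
	return HS_LOCAL_DROPPED;
}

/* Decrement the health of whichever side the status blames, clamped at 0. */
void apply_health(struct ni_health *h, enum health_status hs)
{
	int *target;

	if (hs == HS_LOCAL_DROPPED)
		target = &h->local;
	else if (hs == HS_REMOTE_DROPPED)
		target = &h->peer;
	else
		return;

	*target = (*target > HEALTH_STEP) ? *target - HEALTH_STEP : 0;
}
```

In this model, calling classify_connect_failure(-ECONNREFUSED, true) yields HS_REMOTE_DROPPED, and apply_health() then lowers only the peer value, leaving the local NI at its maximum. With the classification seen in the log, the local NI is the one penalized and the peer NI health never changes.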
This issue was seen with Lustre version 2.12.4.3_cray_44_g2942581