[LU-14540] Connection failure does not cause peer NI health to decrement Created: 19/Mar/21 Updated: 15/Jul/21 Resolved: 06/Apr/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Connection is failing because of ARP flux, however the peer NI health is never decremented because the failure is classified as a "local" one:

00000800:00020000:1.0:1615483922.587888:0:5629:0:(o2iblnd_cb.c:2933:kiblnd_rejected()) 10.12.2.4@o2ib41 rejected: consumer defined fatal error
00000800:00000200:1.0:1615483922.587890:0:5629:0:(o2iblnd_cb.c:2313:kiblnd_connreq_done()) 10.12.2.4@o2ib41: active(1), version(12), status(-111)
00000800:00000200:1.0:1615483922.587892:0:5629:0:(o2iblnd.c:420:kiblnd_unlink_peer_locked()) peer_ni[ffff8953de6a8600] -> 10.12.2.4@o2ib41 (2)--
00000400:00000200:1.0:1615483922.587894:0:5629:0:(router.c:1720:lnet_notify()) 10.12.2.53@o2ib41 notifying 10.12.2.4@o2ib41: down
00000800:00000100:1.0:1615483922.587896:0:5629:0:(o2iblnd_cb.c:2294:kiblnd_peer_connect_failed()) Deleting messages for 10.12.2.4@o2ib41: connection failed
00000400:00000200:1.0:1615483922.587898:0:5629:0:(lib-msg.c:1011:lnet_is_health_check()) health check = 1, status = -111, hstatus = 2
00000400:00000200:1.0:1615483922.587899:0:5629:0:(lib-msg.c:860:lnet_health_check()) health check: 10.12.2.53@o2ib41->10.12.2.4@o2ib41: GET: LOCAL_DROPPED
00000400:00000200:1.0:1615483922.587901:0:5629:0:(lib-msg.c:479:lnet_handle_local_failure()) ni 10.12.2.53@o2ib41 added to recovery queue. Health = 900
00000400:00000200:1.0:1615483922.587903:0:5629:0:(lib-msg.c:641:lnet_resend_msg_locked()) 10.12.2.53@o2ib41->10.12.2.4@o2ib41:GET:LOCAL_DROPPED - queuing msg (ffff895f4c9171d8) for resend

It would be better to categorize this failure as REMOTE_DROPPED.

This issue was seen with Lustre version 2.12.4.3_cray_44_g2942581 |
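To illustrate the reclassification suggested in the description, here is a minimal standalone sketch. It is not the LNet or o2iblnd code and not the content of the patch: the enum, struct, function names, the `peer_rejected` flag, and the health step of 100 are all hypothetical, chosen only to mirror the values visible in the log (status -111, which is `-ECONNREFUSED` on Linux, and a health value of 900). The idea is that a connection attempt rejected by the peer should count against the peer NI rather than the local NI.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for health statuses; not the real LNet enum. */
enum health_status {
	HS_OK,
	HS_LOCAL_DROPPED,
	HS_REMOTE_DROPPED,
};

/* Hypothetical health counters for one local NI / peer NI pair. */
struct ni_health {
	int local;
	int peer;
};

#define HEALTH_MAX  1000	/* assumed starting health */
#define HEALTH_STEP 100		/* assumed decrement per failure */

/*
 * Classify a failed active connection attempt.
 * A rejection (or a refused connection) originates at the peer,
 * so it is attributed to the remote side; anything else stays local.
 */
static enum health_status
classify_connect_failure(int status, bool peer_rejected)
{
	if (status == 0)
		return HS_OK;
	if (peer_rejected || status == -ECONNREFUSED)
		return HS_REMOTE_DROPPED;
	return HS_LOCAL_DROPPED;
}

/* Decrement the health of whichever side the status blames. */
static void
apply_health(struct ni_health *h, enum health_status hs)
{
	int *target = NULL;

	if (hs == HS_LOCAL_DROPPED)
		target = &h->local;
	else if (hs == HS_REMOTE_DROPPED)
		target = &h->peer;

	if (target != NULL) {
		*target -= HEALTH_STEP;
		if (*target < 0)
			*target = 0;
	}
}
```

Under these assumptions, feeding the failure from the log (`status = -ECONNREFUSED`, rejected by the peer) through `classify_connect_failure()` and `apply_health()` lowers the peer counter and leaves the local counter untouched, which is the opposite of the behavior reported above.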
| Comments |
| Comment by Gerrit Updater [ 19/Mar/21 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/42114 |
| Comment by Gerrit Updater [ 06/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42114/ |
| Comment by Peter Jones [ 06/Apr/21 ] |
|
Landed for 2.15 |