[LU-14540] Connection failure does not cause peer NI health to decrement Created: 19/Mar/21  Updated: 15/Jul/21  Resolved: 06/Apr/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug
Priority: Minor
Reporter: Chris Horn
Assignee: Chris Horn
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
Severity: 3

Description

The connection is failing because of ARP flux. However, the peer NI health is never decremented because the failure is classified as a "local" one:

00000800:00020000:1.0:1615483922.587888:0:5629:0:(o2iblnd_cb.c:2933:kiblnd_rejected()) 10.12.2.4@o2ib41 rejected: consumer defined fatal error
00000800:00000200:1.0:1615483922.587890:0:5629:0:(o2iblnd_cb.c:2313:kiblnd_connreq_done()) 10.12.2.4@o2ib41: active(1), version(12), status(-111)
00000800:00000200:1.0:1615483922.587892:0:5629:0:(o2iblnd.c:420:kiblnd_unlink_peer_locked()) peer_ni[ffff8953de6a8600] -> 10.12.2.4@o2ib41 (2)--
00000400:00000200:1.0:1615483922.587894:0:5629:0:(router.c:1720:lnet_notify()) 10.12.2.53@o2ib41 notifying 10.12.2.4@o2ib41: down
00000800:00000100:1.0:1615483922.587896:0:5629:0:(o2iblnd_cb.c:2294:kiblnd_peer_connect_failed()) Deleting messages for 10.12.2.4@o2ib41: connection failed
00000400:00000200:1.0:1615483922.587898:0:5629:0:(lib-msg.c:1011:lnet_is_health_check()) health check = 1, status = -111, hstatus = 2
00000400:00000200:1.0:1615483922.587899:0:5629:0:(lib-msg.c:860:lnet_health_check()) health check: 10.12.2.53@o2ib41->10.12.2.4@o2ib41: GET: LOCAL_DROPPED
00000400:00000200:1.0:1615483922.587901:0:5629:0:(lib-msg.c:479:lnet_handle_local_failure()) ni 10.12.2.53@o2ib41 added to recovery queue. Health = 900
00000400:00000200:1.0:1615483922.587903:0:5629:0:(lib-msg.c:641:lnet_resend_msg_locked()) 10.12.2.53@o2ib41->10.12.2.4@o2ib41:GET:LOCAL_DROPPED - queuing msg (ffff895f4c9171d8) for resend

It would be better to categorize this failure as REMOTE_DROPPED, so that the peer NI health is decremented rather than the local NI health.
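
A minimal sketch of the proposed classification, written as standalone C rather than actual o2iblnd code. The enum and function names are hypothetical stand-ins, not LNet identifiers. The only mapping taken from this ticket is that status -111 (-ECONNREFUSED on Linux) should count as a remote drop; treating every other failure as a local drop is an assumption made for illustration.

```c
#include <errno.h>

/* Hypothetical stand-ins for the LNet health statuses; not the real enum. */
enum sketch_hstatus {
	SKETCH_OK = 0,
	SKETCH_LOCAL_DROPPED,
	SKETCH_REMOTE_DROPPED,
};

/*
 * Hypothetical helper: choose a health status for a finished connection
 * attempt, given the negative errno reported for it.
 */
enum sketch_hstatus sketch_classify_connect_failure(int status)
{
	if (status == 0)
		return SKETCH_OK;

	/* The peer refused the connection: blame the peer NI. */
	if (status == -ECONNREFUSED)
		return SKETCH_REMOTE_DROPPED;

	/* Assumption: all other failures stay classified as local. */
	return SKETCH_LOCAL_DROPPED;
}
```

With this mapping, the status(-111) reported by kiblnd_connreq_done() in the log above would be attributed to the peer NI instead of the local NI.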

This issue was seen with Lustre version 2.12.4.3_cray_44_g2942581.



Comments
Comment by Gerrit Updater [ 19/Mar/21 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/42114
Subject: LU-14540 o2iblnd: Use REMOTE_DROPPED for ECONNREFUSED
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5affb30c70beb8bb371e2c417f64e53b14853081

Comment by Gerrit Updater [ 06/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42114/
Subject: LU-14540 o2iblnd: Use REMOTE_DROPPED for ECONNREFUSED
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f9d837b479232bfc4f271f23cd3729ca67cb6c1d

Comment by Peter Jones [ 06/Apr/21 ]

Landed for 2.15
