Connection failure does not cause peer NI health to decrement

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0

    Description

      The connection is failing because of ARP flux; however, the peer NI health is never decremented because the failure is classified as a "local" one:

      00000800:00020000:1.0:1615483922.587888:0:5629:0:(o2iblnd_cb.c:2933:kiblnd_rejected()) 10.12.2.4@o2ib41 rejected: consumer defined fatal error
      00000800:00000200:1.0:1615483922.587890:0:5629:0:(o2iblnd_cb.c:2313:kiblnd_connreq_done()) 10.12.2.4@o2ib41: active(1), version(12), status(-111)
      00000800:00000200:1.0:1615483922.587892:0:5629:0:(o2iblnd.c:420:kiblnd_unlink_peer_locked()) peer_ni[ffff8953de6a8600] -> 10.12.2.4@o2ib41 (2)--
      00000400:00000200:1.0:1615483922.587894:0:5629:0:(router.c:1720:lnet_notify()) 10.12.2.53@o2ib41 notifying 10.12.2.4@o2ib41: down
      00000800:00000100:1.0:1615483922.587896:0:5629:0:(o2iblnd_cb.c:2294:kiblnd_peer_connect_failed()) Deleting messages for 10.12.2.4@o2ib41: connection failed
      00000400:00000200:1.0:1615483922.587898:0:5629:0:(lib-msg.c:1011:lnet_is_health_check()) health check = 1, status = -111, hstatus = 2
      00000400:00000200:1.0:1615483922.587899:0:5629:0:(lib-msg.c:860:lnet_health_check()) health check: 10.12.2.53@o2ib41->10.12.2.4@o2ib41: GET: LOCAL_DROPPED
      00000400:00000200:1.0:1615483922.587901:0:5629:0:(lib-msg.c:479:lnet_handle_local_failure()) ni 10.12.2.53@o2ib41 added to recovery queue. Health = 900
      00000400:00000200:1.0:1615483922.587903:0:5629:0:(lib-msg.c:641:lnet_resend_msg_locked()) 10.12.2.53@o2ib41->10.12.2.4@o2ib41:GET:LOCAL_DROPPED - queuing msg (ffff895f4c9171d8) for resend
      

      It would be better to categorize this failure as REMOTE_DROPPED.
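
The proposed reclassification can be sketched in a few lines of C. This is a minimal standalone illustration, not the merged patch: the enum `health_status` and the helpers `classify_connect_failure()` and `health_status_name()` are hypothetical names invented here, and the only facts taken from the log above are that the connect attempt ended with status -111 (ECONNREFUSED on Linux) and that it was counted as LOCAL_DROPPED.

```c
#include <errno.h>

/* Hypothetical health classes, named after the statuses seen in the log.
 * These are NOT the LNet definitions, only stand-ins for illustration. */
enum health_status {
    HS_OK,
    HS_LOCAL_DROPPED,   /* failure is charged to the local NI */
    HS_REMOTE_DROPPED   /* failure is charged to the peer NI */
};

/* Hypothetical helper: classify a finished active connect attempt by its
 * status (0 on success, negative errno on failure). */
enum health_status classify_connect_failure(int status)
{
    if (status == 0)
        return HS_OK;

    /* A refused connection was rejected by the remote end, so it should
     * count against the peer NI rather than the local NI. */
    if (status == -ECONNREFUSED)
        return HS_REMOTE_DROPPED;

    /* Assumption for this sketch: every other failure keeps the existing
     * "local" classification. */
    return HS_LOCAL_DROPPED;
}

/* Hypothetical helper: printable name for a health class. */
const char *health_status_name(enum health_status hs)
{
    switch (hs) {
    case HS_OK:             return "OK";
    case HS_LOCAL_DROPPED:  return "LOCAL_DROPPED";
    case HS_REMOTE_DROPPED: return "REMOTE_DROPPED";
    }
    return "UNKNOWN";
}
```

With this mapping, the status(-111) case from the log would be reported as REMOTE_DROPPED, which is what allows the peer NI health, rather than only the local NI health, to be decremented.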

      This issue was seen with Lustre version 2.12.4.3_cray_44_g2942581

        Activity

          [LU-14540] Connection failure does not cause peer NI health to decrement
          pjones Peter Jones added a comment -

          Landed for 2.15


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42114/
          Subject: LU-14540 o2iblnd: Use REMOTE_DROPPED for ECONNREFUSED
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: f9d837b479232bfc4f271f23cd3729ca67cb6c1d

          gerrit Gerrit Updater added a comment -

          Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/42114
          Subject: LU-14540 o2iblnd: Use REMOTE_DROPPED for ECONNREFUSED
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 5affb30c70beb8bb371e2c417f64e53b14853081

          People

            hornc Chris Horn