Lustre / LU-16451

kfilnd: Enhance TN state machine to handle peer in "failed" state


Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0

    Description

      If a send request (immediate or bulk) fails with the EHOSTUNREACH error number, it means the cxi retry handler failed to deliver the message, which indicates an issue with the peer or the fabric.

      If a kfilnd transaction (TN) fails with the EHOSTUNREACH error number, transition the peer to a new "failed" state.

      When a peer is in this failed state, require a HELLO handshake to complete before sending any more packets to that peer.
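      The gating rule can be sketched as follows. This assumes (as a design choice of the sketch, not something the issue states) that messages for a failed peer are held locally rather than failed outright, and that at most one HELLO is in flight per peer. All names here (`peer_prepare_send()`, `peer_hello_completed()`, `enum send_action`, the `hello_pending` and `held_msgs` fields) are hypothetical and do not come from kfilnd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical peer states; not kfilnd's real identifiers. */
enum peer_state {
	PEER_STATE_UPTODATE,
	PEER_STATE_FAILED,
};

struct peer {
	enum peer_state state;
	bool hello_pending; /* a HELLO request is already outstanding */
	size_t held_msgs;   /* messages waiting for the HELLO to complete */
};

/* What the caller should do with a message bound for this peer. */
enum send_action {
	SEND_NOW,            /* post the message to the fabric */
	SEND_HOLD,           /* queue it; a HELLO is already in flight */
	SEND_HOLD_AND_HELLO, /* queue it and start a HELLO */
};

/* A failed peer gets nothing but a single HELLO until the handshake
 * completes, so only one message per failed peer consumes fabric resources. */
static enum send_action peer_prepare_send(struct peer *p)
{
	assert(p != NULL);
	if (p->state != PEER_STATE_FAILED)
		return SEND_NOW;

	p->held_msgs++;
	if (p->hello_pending)
		return SEND_HOLD;
	p->hello_pending = true;
	return SEND_HOLD_AND_HELLO;
}

/* A completed HELLO clears the failed state; returns how many held
 * messages the caller may now post. */
static size_t peer_hello_completed(struct peer *p)
{
	size_t released;

	assert(p != NULL);
	p->state = PEER_STATE_UPTODATE;
	p->hello_pending = false;
	released = p->held_msgs;
	p->held_msgs = 0;
	return released;
}
```

      For a failed peer, the first send returns SEND_HOLD_AND_HELLO and later sends return SEND_HOLD; once peer_hello_completed() runs, the held messages are released and new sends go out immediately. A real implementation would also need a timeout path that fails the held messages if the HELLO never completes.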

      The goal is to minimize the number of outstanding messages (which consume cxi resources) until either the peer recovers or the timeouts expire.



    People

      Assignee: Chris Horn
      Reporter: Chris Horn
