Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9120

LNet Network Health Feature

    XMLWordPrintable

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.12.0
    • Labels:
      None
    • Rank (Obsolete):
      9223372036854775807

      Description

      LNet Multi-Rail has implemented the ability for multiple interfaces to be used on the same LNet network or across multiple LNet networks. The LNet Health feature adds the ability to resend messages across different interfaces when interface or network failures are detected. This allows LNet to mitigate communication failures before passing the failures to upper layers for further error handling. To accomplish this, LNet Health depends on health information reported by the underlying fabrics such as MLX and OPA.
      LNet Health will monitor three different types of failures:

      • local interface failures as reported by the underlying fabric
      • remote interface failures as reported by the remote fabric
      • network timeouts.
        Each one of these classes of failures are dealt with separately at the LNet layer. The implementation of this health feature at the LNet layer allows LNet to retransmit messages across different types of interfaces. For example if a peer has both MLX and OPA interfaces and a transmit error is detected on one of them then LNet can retransmit the message on the other available interface.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                ashehata Amir Shehata
                Reporter:
                ashehata Amir Shehata
              • Votes:
                0 Vote for this issue
                Watchers:
                16 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: