Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0

    Description

      LNet Multi-Rail allows multiple interfaces to be used on the same LNet network or across multiple LNet networks. The LNet Health feature adds the ability to resend messages across different interfaces when interface or network failures are detected. This allows LNet to mitigate communication failures before passing them to upper layers for further error handling. To accomplish this, LNet Health depends on health information reported by the underlying fabrics, such as MLX and OPA.
      LNet Health monitors three types of failures:

      • local interface failures as reported by the underlying fabric
      • remote interface failures as reported by the remote fabric
      • network timeouts

      Each of these classes of failure is dealt with separately at the LNet layer. Implementing the health feature at the LNet layer allows LNet to retransmit messages across different types of interfaces. For example, if a peer has both MLX and OPA interfaces and a transmit error is detected on one of them, LNet can retransmit the message on the other available interface.
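The retransmit behavior described above can be illustrated with a minimal Python sketch. It is not the LNet implementation, which lives in kernel C code. Every name here (`Failure`, `Interface`, `send_with_failover`, the `transmit` callback) is hypothetical and introduced only for illustration. The sketch also simplifies by handling all three failure classes the same way, whereas the feature as described deals with each class separately.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Failure(Enum):
    """The three failure classes named in the description (names are hypothetical)."""
    LOCAL_INTERFACE = auto()   # reported by the underlying local fabric
    REMOTE_INTERFACE = auto()  # reported by the remote fabric
    NETWORK_TIMEOUT = auto()   # no completion within the timeout


@dataclass
class Interface:
    name: str
    fabric: str           # e.g. "MLX" or "OPA"
    healthy: bool = True  # simplified stand-in for real health tracking


def send_with_failover(message, interfaces, transmit):
    """Try each healthy interface in order and return the name of the one that worked.

    `transmit(interface, message)` is a hypothetical callback that returns
    None on success or a Failure value on error.
    """
    failures = []
    for iface in interfaces:
        if not iface.healthy:
            continue  # skip interfaces already marked as failed
        result = transmit(iface, message)
        if result is None:
            return iface.name
        # Assumption: any failure marks the interface unhealthy and
        # triggers a resend on the next available interface.
        iface.healthy = False
        failures.append((iface.name, result.name))
    # Only after every interface has been tried is the error passed upward.
    raise ConnectionError(f"all interfaces failed: {failures}")
```

In this sketch, a peer with one MLX and one OPA interface would see a message that fails on the MLX interface retried on the OPA interface, and the caller only sees an error if both fail.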

          Activity

            [LU-9120] LNet Network Health Feature
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-13510 [ LU-13510 ]
            sharmaso Sonia Sharma (Inactive) made changes -
            Link New: This issue is related to LU-11422 [ LU-11422 ]
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            ashehata Amir Shehata (Inactive) made changes -
            Link New: This issue is related to LU-11273 [ LU-11273 ]
            ashehata Amir Shehata (Inactive) made changes -
            Link New: This issue is related to LU-11272 [ LU-11272 ]
            ashehata Amir Shehata (Inactive) made changes -
            Link New: This issue is related to LU-11271 [ LU-11271 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.12.0 [ 13495 ]
            jgmitter Joseph Gmitter (Inactive) made changes -
            Link Original: This issue is related to LUDOC-396 [ LUDOC-396 ]
            jgmitter Joseph Gmitter (Inactive) made changes -
            Link New: This issue is blocked by LUDOC-396 [ LUDOC-396 ]
            ashehata Amir Shehata (Inactive) made changes -
            Remote Link New: This issue links to "Page (HPDD Community Wiki)" [ 22829 ]

            People

              ashehata Amir Shehata (Inactive)
              ashehata Amir Shehata (Inactive)