Details

    • New Feature
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.0

    Description

      LNet Multi-Rail allows multiple interfaces to be used on the same LNet network or across multiple LNet networks. The LNet Health feature adds the ability to resend messages over a different interface when an interface or network failure is detected. This lets LNet mitigate communication failures before passing them to upper layers for further error handling. To accomplish this, LNet Health depends on health information reported by the underlying fabrics, such as MLX and OPA.
      LNet Health monitors three types of failure:

      • local interface failures as reported by the underlying fabric
      • remote interface failures as reported by the remote fabric
      • network timeouts

      Each of these classes of failure is dealt with separately at the LNet layer. Implementing the health feature at the LNet layer allows LNet to retransmit messages across different types of interfaces. For example, if a peer has both MLX and OPA interfaces and a transmit error is detected on one of them, LNet can retransmit the message on the other available interface.
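
      The retransmit behaviour described above can be illustrated with a small sketch. This is not the LNet implementation, only a toy model under stated assumptions: each interface carries a numeric health value, a failed send lowers that value by a sensitivity amount, and the sender retries on whichever interface is currently healthiest until a retry budget runs out. The names Interface, send_with_failover, MAX_HEALTH, sensitivity and retry_count are hypothetical and chosen for illustration; the transmit callback stands in for the fabric reporting success or failure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_HEALTH = 1000  # hypothetical value for a fully healthy interface


@dataclass
class Interface:
    name: str                  # illustrative label, e.g. "mlx0" or "opa0"
    health: int = MAX_HEALTH   # lowered whenever a send on this interface fails


def send_with_failover(
    interfaces: List[Interface],
    message: bytes,
    transmit: Callable[[Interface, bytes], bool],
    sensitivity: int = 100,    # assumed penalty per failure
    retry_count: int = 2,      # assumed number of resends after the first try
) -> Optional[str]:
    """Try the healthiest interface first and resend elsewhere on failure.

    Returns the name of the interface that delivered the message, or None
    once the retry budget is exhausted, at which point the failure would be
    handed to the upper layer.
    """
    for _ in range(retry_count + 1):
        # Pick the interface with the highest health value (first one wins ties).
        ni = max(interfaces, key=lambda n: n.health)
        if transmit(ni, message):
            return ni.name
        # Penalise the failed interface so the next attempt prefers another one.
        ni.health = max(0, ni.health - sensitivity)
    return None


if __name__ == "__main__":
    mlx, opa = Interface("mlx0"), Interface("opa0")
    # Simulate a fabric where every send on mlx0 fails.
    used = send_with_failover([mlx, opa], b"ping", lambda ni, m: ni.name != "mlx0")
    print(used, mlx.health, opa.health)
```

      In this model the first attempt goes to mlx0, fails, and reduces its health, so the resend is routed to opa0. A real implementation would also need to restore health over time and to treat local, remote and timeout failures differently, which this sketch leaves out.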

          Activity

            [LU-9120] LNet Network Health Feature

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32771/
            Subject: LU-9120 lnet: timeout delayed REPLYs and ACKs
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: a57fa1176e74ea549e0d63cd8753b97561dd8bbf

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32861/
            Subject: LU-9120 lnet: sysfs functions for module params
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 5169827bf79071d47d2b6d76e110fa412ff1fb38

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32770/
            Subject: LU-9120 lnet: calculate the lnd timeout
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 84f3af43c4bdeb1744736f44cd746dd4b6e8fa6d

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32769/
            Subject: LU-9120 lnet: add retry count
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 20e23980eae2341c04688b6409442673516cb2c0

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32767/
            Subject: LU-9120 lnet: handle remote errors in LNet
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 76fad19c2deaa72b5b70eff4bf9d84e20a42a74e

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32766/
            Subject: LU-9120 lnet: handle socklnd tx failure
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 25c1cb2c4d6f4430c8e1be915f5e8742ba16a94c

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32765/
            Subject: LU-9120 lnet: handle o2iblnd tx failure
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 8cf835e425d845da2ad3a787898a2b3001f75114

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32764/
            Subject: LU-9120 lnet: handle local ni failure
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 70616605dd44be37068f4e1a4745a2f8b90eb1f5

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32763/
            Subject: LU-9120 lnet: add monitor thread
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: b01e6fce1c988139b5fe59484c7568362992f37b

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32762/
            Subject: LU-9120 lnet: add lnet_health_sensitivity
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 63cf744d0fdf72fc5ac7e154ec60c4a08139acc4

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32761/
            Subject: LU-9120 lnet: add health value per ni
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: d54afb86116c0640d7a201571b337042c87a3e40

            People

              ashehata Amir Shehata (Inactive)
              Votes: 0
              Watchers: 17