Details

    • New Feature
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.0

    Description

      LNet Multi-Rail allows multiple interfaces to be used on the same LNet network or across multiple LNet networks. The LNet Health feature adds the ability to resend messages over a different interface when an interface or network failure is detected. This allows LNet to mitigate communication failures before passing them to upper layers for further error handling. To accomplish this, LNet Health depends on health information reported by the underlying fabrics, such as MLX and OPA.
      LNet Health monitors three types of failure:

      • local interface failures as reported by the underlying fabric
      • remote interface failures as reported by the remote fabric
      • network timeouts

      Each of these classes of failure is dealt with separately at the LNet layer. Implementing the health feature at the LNet layer allows LNet to retransmit messages across different types of interfaces. For example, if a peer has both MLX and OPA interfaces and a transmit error is detected on one of them, LNet can retransmit the message on the other available interface.
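
      The retransmit behaviour described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the LNet implementation: the names Interface, pick_interface and send_with_health are hypothetical, the maximum health value of 1000 and the sensitivity of 100 are made-up numbers, and all three failure classes are penalised identically here for brevity, whereas LNet handles each class separately. The only ideas taken from this ticket are a health value per interface, a sensitivity that controls how much a failure costs, and retrying on another interface before reporting the error upward.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Failure(Enum):
    LOCAL = auto()    # local interface failure reported by the fabric
    REMOTE = auto()   # remote interface failure reported by the remote fabric
    TIMEOUT = auto()  # network timeout


@dataclass
class Interface:
    name: str
    health: int = 1000  # assumed maximum health value (illustrative only)


def pick_interface(interfaces, exclude=()):
    """Return the healthiest interface whose name is not in `exclude`, or None."""
    candidates = [i for i in interfaces if i.name not in exclude]
    return max(candidates, key=lambda i: i.health, default=None)


def send_with_health(interfaces, transmit, sensitivity=100):
    """Try each interface at most once, healthiest first.

    `transmit(iface)` returns None on success or a Failure value.
    Returns the name of the interface that succeeded, or None when every
    interface failed and the error has to be passed to the upper layer.
    """
    tried = []
    while True:
        iface = pick_interface(interfaces, exclude=tried)
        if iface is None:
            return None
        failure = transmit(iface)
        if failure is None:
            return iface.name
        # Penalise the failing interface so later sends prefer healthier ones.
        iface.health = max(0, iface.health - sensitivity)
        tried.append(iface.name)


# A peer with one MLX and one OPA interface; the MLX interface fails locally.
peer = [Interface("mlx0"), Interface("opa0")]
result = send_with_health(
    peer, lambda i: Failure.LOCAL if i.name == "mlx0" else None
)
print(result, peer[0].health)  # prints "opa0 900"
```

      In this model the message is delivered over opa0 after the failure on mlx0, and the lowered health of mlx0 means the next message would start on opa0 instead.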

          Activity

            [LU-9120] LNet Network Health Feature

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32767/
            Subject: LU-9120 lnet: handle remote errors in LNet
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 76fad19c2deaa72b5b70eff4bf9d84e20a42a74e

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32766/
            Subject: LU-9120 lnet: handle socklnd tx failure
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 25c1cb2c4d6f4430c8e1be915f5e8742ba16a94c

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32765/
            Subject: LU-9120 lnet: handle o2iblnd tx failure
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 8cf835e425d845da2ad3a787898a2b3001f75114

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32764/
            Subject: LU-9120 lnet: handle local ni failure
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 70616605dd44be37068f4e1a4745a2f8b90eb1f5

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32763/
            Subject: LU-9120 lnet: add monitor thread
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: b01e6fce1c988139b5fe59484c7568362992f37b

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32762/
            Subject: LU-9120 lnet: add lnet_health_sensitivity
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 63cf744d0fdf72fc5ac7e154ec60c4a08139acc4

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32761/
            Subject: LU-9120 lnet: add health value per ni
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: d54afb86116c0640d7a201571b337042c87a3e40

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32760/
            Subject: LU-9120 lnet: refactor lnet_select_pathway()
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 4e48761a57193279ce3f3d5170c3e38cf287b59a

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32992
            Subject: LU-9120 lnet: remove duplicate timeout mechanism
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set: 1
            Commit: 3dbdebcfb26c2703e5f94e772a5d119c070bf7f2

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32951
            Subject: LU-9120 lnet: health error simulation
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set: 1
            Commit: 4de8ad54a5703b217035715c842b5682591cb70e

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32949
            Subject: LU-9120 lnet: keep track of resent messages
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set: 1
            Commit: be3e441cb5876d90c83be7f0a7f1c2c4a7e16c2d

            People

              ashehata Amir Shehata (Inactive)
              ashehata Amir Shehata (Inactive)
              Votes: 0
              Watchers: 17
