Details

    • New Feature
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.0
    • None
    • None

    Description

      LNet Multi-Rail allows multiple interfaces to be used on the same LNet network or across multiple LNet networks. The LNet Health feature adds the ability to resend messages over a different interface when an interface or network failure is detected. This lets LNet mitigate communication failures before passing them to upper layers for further error handling. To accomplish this, LNet Health depends on health information reported by the underlying fabrics, such as MLX and OPA.
      LNet Health will monitor three different types of failures:

      • local interface failures as reported by the underlying fabric
      • remote interface failures as reported by the remote fabric
      • network timeouts

      Each of these failure classes is handled separately at the LNet layer. Implementing the health feature at the LNet layer allows LNet to retransmit messages across different types of interfaces. For example, if a peer has both MLX and OPA interfaces and a transmit error is detected on one of them, LNet can retransmit the message on the other available interface.
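
The retransmit behaviour described above can be illustrated with a small sketch. This is not LNet code: the `Failure`, `Interface`, and `send_with_failover` names, the `transmit` callback, and the interface labels are all hypothetical, and the sketch treats the three failure classes identically, whereas the description says LNet handles each class separately.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List, Optional


class Failure(Enum):
    """The three failure classes named in the description."""
    LOCAL_INTERFACE = auto()
    REMOTE_INTERFACE = auto()
    NETWORK_TIMEOUT = auto()


@dataclass
class Interface:
    name: str             # illustrative label such as "mlx0" or "opa0"
    healthy: bool = True  # assumption: health reduced to a single flag


def send_with_failover(
    interfaces: List[Interface],
    transmit: Callable[[Interface], Optional[Failure]],
) -> Optional[Interface]:
    """Try each healthy interface in order and return the one that worked.

    `transmit` is a hypothetical stand-in for the fabric. It returns None
    on success, or the Failure class that was detected.
    """
    for iface in interfaces:
        if not iface.healthy:
            continue
        failure = transmit(iface)
        if failure is None:
            return iface
        # Simplification: any failure marks the interface unhealthy.
        iface.healthy = False
    # Every interface failed, so the error is passed to the upper layer.
    return None


def fake_fabric(iface: Interface) -> Optional[Failure]:
    # Simulate a transmit error on the MLX interface only.
    return Failure.LOCAL_INTERFACE if iface.name == "mlx0" else None


peer_interfaces = [Interface("mlx0"), Interface("opa0")]
used = send_with_failover(peer_interfaces, fake_fabric)
print(used.name if used else "send failed")
```

In this run the simulated error on `mlx0` causes the message to be retried on `opa0`, mirroring the MLX/OPA example in the description. Only when every interface has failed does the caller see an error.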

          Activity

            [LU-9120] LNet Network Health Feature
            pjones Peter Jones added a comment -

            Feature landed for 2.12. Any bug fixes etc will be tracked under new tickets


            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33037
            Subject: LU-9120 lnet: LNet Health/Resiliency Feature
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3461795c2a407aa99b7e178c275367e27a3ddc42

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33023
            Subject: LU-9120 lnet: LNet Health/Resiliency Feature
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 958ef71f33fa925e6657f9902702cd3677e15ec9

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32951/
            Subject: LU-9120 lnet: health error simulation
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 5c17777d97bd20cde68771c6186320b5eae90e62

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32950/
            Subject: LU-9120 lnet: print recovery queues content
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 826ea19c077b2a3e1a32464a7eb63fba6e460946

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32949/
            Subject: LU-9120 lnet: add global health statistics
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 15020fd977af68620e862ad999eaab17688933e2

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32863/
            Subject: LU-9120 lnet: set health value from user space
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: c0ad398fd71610c42b7ed06f8d2ca722daa01391

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32783/
            Subject: LU-9120 lnet: show peer ni health stats
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: f64abb795a893d8208688350220c69e808d0af69

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32782/
            Subject: LU-9120 lnet: show local ni health stats
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 60fc3c74757025f03ddd4b8e322716811ae97d3c

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32779/
            Subject: LU-9120 lnet: set health sensitivity from lnetctl
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: 4a7357d5e945e5fe89e03b5d65edb45af95b49ee

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/32778/
            Subject: LU-9120 lnet: set transaction timeout from lnetctl
            Project: fs/lustre-release
            Branch: multi-rail
            Current Patch Set:
            Commit: cf47570b0273e11f81be0fc67126172fe73ad367

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: ashehata Amir Shehata (Inactive)
              Votes: 0
              Watchers: 17
