
Use lnet_health_sensitivity for restoring health for each lnet_recovery_interval

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.13.0
    • Labels: None
    • Environment: Any Lustre 2.12 system with LNet health enabled

    Description

      Currently, for each lnet_recovery_interval the LNet health value is incremented by 1. The maximum LNet health value is 1000, so it can take up to 1000 seconds to recover, depending on the setup. A better way to handle this is to use lnet_health_sensitivity, so that on each recovery interval the health is restored by the same amount it was decremented by.
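The recovery-time arithmetic in the description can be sketched as follows. This is only an illustration, not LNet code: the function name, the one-second interval, the maximum health of 1000, and the sensitivity of 100 are assumptions taken from the figures quoted in this ticket.

```python
import math

MAX_HEALTH = 1000  # assumed maximum health value, per the 1000-second figure above


def intervals_to_recover(health, increment, max_health=MAX_HEALTH):
    """Return how many recovery intervals are needed to reach max_health.

    `increment` is the amount added to the health value on each interval
    (hypothetical model; the real recovery logic lives in the LNet code).
    """
    if increment <= 0:
        raise ValueError("increment must be positive")
    deficit = max(0, max_health - health)
    return math.ceil(deficit / increment)


# Fully degraded interface (health 0), one interval per second assumed.
print(intervals_to_recover(0, 1))    # increment by 1 per interval
print(intervals_to_recover(0, 100))  # increment by an assumed sensitivity of 100
```

Under these assumptions, recovering at the same rate as the decrement shortens the worst case from 1000 intervals to 10.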

          Activity

            [LU-12303] Use lnet_health_sensitivity for restoring health for each lnet_recovery_interval
            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36920/
            Subject: LU-12303 lnet: recover health at same rate as dec
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1d94a29dbc018fd00aa1c8a7a7ae343e0c9a4b83

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36920
            Subject: LU-12303 lnet: recover health at same rate as dec
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 601be615b8409dc74f2f6e5c49fe0810bc443a73

            simmonsja James A Simmons added a comment -

            Your docs are wrong. They state at

            https://wiki.whamcloud.com/display/LNet/LNet+Health+User+Documentation

            that the health increments every time interval, which is one second. By those docs it could be 1000 seconds before the interface is seen as healthy.

            adilger Andreas Dilger added a comment -

            If lnet_health_sensitivity were also used to increment the interface health, there would be no point in having a variable lnet_health_sensitivity value: there would be N health decrements and the same N health increments for an interface. Also note (AFAIK, but Amir to confirm) that while the retry interval is 1s, the health is incremented for every successful RPC sent/received.

            With lnet_health_sensitivity=100 for decrements but 1 for increments, the interface can only lose 1/100 = 1% of messages on that interface for it to continue to be in use. If it fails RPCs more than 1% of the time, its health will decrement faster than it increments, which is good because you don't want to be using that interface. If it fails less than 1% of the time, it will generally remain in use. This "minimum acceptable failure ratio" is tunable via lnet_health_sensitivity.
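The "minimum acceptable failure ratio" argument above can be sketched numerically. This is a simplified model of the behaviour as described in the comment (which is itself marked as unconfirmed), not LNet's implementation: it assumes each failed RPC subtracts `decrement` and each successful RPC adds `increment`, and both function names are hypothetical.

```python
def expected_drift(failure_ratio, decrement, increment=1):
    """Expected health change per RPC under the simplified model.

    A fraction `failure_ratio` of RPCs fail and each subtracts `decrement`;
    the rest succeed and each adds `increment`.
    """
    return (1 - failure_ratio) * increment - failure_ratio * decrement


def break_even_failure_ratio(decrement, increment=1):
    """Failure ratio at which the expected drift is exactly zero."""
    return increment / (increment + decrement)


# With an assumed decrement of 100 and increment of 1:
print(break_even_failure_ratio(100))  # 1/101, close to the 1% quoted above
print(expected_drift(0.02, 100))      # negative: health trends down at 2% failures
print(expected_drift(0.005, 100))     # positive: health trends up at 0.5% failures
```

In this model the exact break-even point is increment / (increment + decrement), which for 1 and 100 is 1/101, so the 1% figure is a close approximation.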

            People

              ashehata Amir Shehata (Inactive)
              simmonsja James A Simmons
              Votes: 0
              Watchers: 6
