LNet Health should not recover interfaces indefinitely

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0
    • None
    • None

    Description

      We should not recover interfaces, remote or local, indefinitely. A remote interface may never come back (for example, when the remote peer reboots with a new LNet config). A local interface may be down for an extended period when, say, a NIC needs to be swapped or a cable replaced.

      Let's define criteria for when we should stop trying to recover an interface.

      • After X recovery attempts?
      • After Y amount of time?
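
      To make the two candidate criteria concrete, here is a minimal sketch of a check that stops recovery once either limit is reached. The RecoveryState class, its fields, and the max_attempts/max_age parameters are hypothetical names chosen for illustration; they are not LNet structures or tunables.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryState:
    """Hypothetical per-interface bookkeeping (not an LNet structure)."""
    first_failure: float  # time in seconds when the interface first went unhealthy
    attempts: int = 0     # recovery attempts made so far


def should_stop_recovery(state: RecoveryState, now: float,
                         max_attempts: Optional[int] = None,
                         max_age: Optional[float] = None) -> bool:
    """Return True once either configured limit has been reached.

    A limit of None disables that criterion, so with both set to None
    recovery continues indefinitely (the behavior this ticket wants to avoid).
    """
    if max_attempts is not None and state.attempts >= max_attempts:
        return True
    if max_age is not None and now - state.first_failure >= max_age:
        return True
    return False
```

      Keeping both limits optional lets an administrator choose an attempt-based limit, a time-based limit, or both.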

      We also need to decide how to stop recovery.

      • Flag the ni/lpni so that it doesn't get placed back onto the recovery queue?
      • Delete the ni/lpni outright?
      • Rather than stop completely, use an exponential backoff algorithm so that recovery doesn't add any real load to the system.
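
      As an illustration of the exponential backoff option, the sketch below doubles the wait between recovery attempts and caps it at a maximum. The function name and the base and cap values (1 and 900 seconds) are assumptions made for this example, not values taken from any patch.

```python
def next_recovery_interval(attempts: int, base: int = 1, cap: int = 900) -> int:
    """Seconds to wait before the next recovery attempt.

    The interval doubles with every failed attempt and never exceeds cap.
    base and cap are illustrative defaults, not LNet values.
    """
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # Bound the exponent so the intermediate value stays small even for
    # interfaces that have failed a very large number of times.
    exponent = min(attempts, cap.bit_length())
    return min(base * 2 ** exponent, cap)
```

      With these defaults the first few intervals are 1, 2, 4, 8 and 16 seconds, and a long-dead interface is retried at most once every 900 seconds, so recovery traffic stays negligible without being stopped outright.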

          Activity

            [LU-13569] LNet Health should not recover interfaces indefinitely
            pjones Peter Jones added a comment -

            Landed for 2.15

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39723/
            Subject: LU-13569 tests: Check LNet Health recovery logic
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: aa7391445519b46752b5b0adcbe5baa368750e70

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40314/
            Subject: LU-13569 lnet: Add health ping stats
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4c7e4aa57629660386ae2849151a0639b6177200

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39722/
            Subject: LU-13569 lnet: Deprecate lnet_recovery_interval
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 79ab0535622782c82636cee47918dc4b19983144

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39721/
            Subject: LU-13569 lnet: Recover local NI w/exponential backoff interval
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8fdf2bc62ac9c418bd8e326112da9a2835f09ccb

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39720/
            Subject: LU-13569 lnet: Recover peer NI w/exponential backoff interval
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 917553c537a8860f57a50dc9752e3ac69d06c11c

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39719/
            Subject: LU-13569 lnet: Only recover known good peer NIs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 39a169cd02738a13866f3b88fbe3304dc20565d6

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39718/
            Subject: LU-13569 lnet: Age peer NI out of recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: cc27201a76574b51dc3ffb37f039b3364cab386d

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39717/
            Subject: LU-13569 lnet: Add lnet_recovery_limit to lnetctl
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3e5c6620fd0b0511498d14d38e8610d08f6da7b3

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39716/
            Subject: LU-13569 lnet: Introduce lnet_recovery_limit parameter
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a2e61838f8de89e0f7c80c3bf288cbeb1b358baa

            People

              hornc Chris Horn
              hornc Chris Horn