[LU-13569] LNet Health should not recover interfaces indefinitely Created: 15/May/20  Updated: 27/Jan/23  Resolved: 14/Jun/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Improvement Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-13572 LNet Health should only attempt recov... Closed
Epic Link: unlabelled-LU-13422

 Description   

We should not recover interfaces, remote or local, indefinitely. A remote interface may never come back (remote peer rebooted with new LNet config). A local interface may be down for extended periods of time when, say, a NIC needs to be swapped or cable replaced.

Let's define criteria for when we should stop trying to recover an interface.

  • After X recovery attempts?
  • After Y amount of time?

We also need to decide how to stop recovery.

  • Flag the ni/lpni so that it doesn't get placed back onto the recovery queue?
  • Delete the ni/lpni outright?
  • Rather than stop completely, use an exponential backoff algorithm so that recovery doesn't add any real load to the system.
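To make the two stopping criteria concrete, here is a minimal sketch that combines an attempt limit (X) and a time limit (Y) with the "flag it" option. It is not LNet code: RecoveryState, record_failed_attempt, max_attempts and max_age_s are hypothetical names chosen for illustration, and the limits themselves are left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryState:
    """Hypothetical per-interface bookkeeping (not an LNet structure)."""
    attempts: int = 0
    first_attempt_s: Optional[float] = None
    abandoned: bool = False  # the "flag" that keeps it off the recovery queue


def record_failed_attempt(state, now_s, max_attempts, max_age_s):
    """Record one failed recovery attempt.

    Returns True if the interface should stay on the recovery queue,
    False once either the attempt limit or the age limit is reached.
    """
    if state.first_attempt_s is None:
        state.first_attempt_s = now_s
    state.attempts += 1

    too_many = state.attempts >= max_attempts
    too_old = (now_s - state.first_attempt_s) >= max_age_s
    if too_many or too_old:
        state.abandoned = True
    return not state.abandoned
```

Flagging rather than deleting keeps the state around, so whatever event is chosen to revive the interface can simply clear the flag and the counters.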


 Comments   
Comment by Andreas Dilger [ 15/May/20 ]

What is the expected timeframe for giving up on the interface? I can definitely see that interfaces might be down for many hours/days because of bad cables/switches/etc. Typically what we do in such cases is have exponential backoff of the retry so that they do not add any real load to the system (on the order of one message every 5-10 minutes).

The alternative would be to disable local-side recovery and wait until the peer starts using the interface to send messages to this node again. The drawback here would be if e.g. a switch goes down for an hour and the nodes all stop using their interfaces and never restart.
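As a rough illustration of the backoff idea in this comment, the sketch below doubles the retry delay after each attempt and caps it. The 1 second starting delay and the 600 second cap are assumptions (the cap is taken from the upper end of the "5-10 minutes" figure); backoff_intervals is a hypothetical helper, not an LNet function.

```python
def backoff_intervals(base_s, cap_s, attempts):
    """Return the delay (in seconds) before each of `attempts` retries.

    The delay starts at base_s and doubles after every retry until it
    reaches cap_s, after which it stays constant.
    """
    intervals = []
    delay = base_s
    for _ in range(attempts):
        intervals.append(delay)
        delay = min(delay * 2, cap_s)
    return intervals


# Assumed values: start at 1 s, cap at 10 minutes.
schedule = backoff_intervals(base_s=1, cap_s=600, attempts=12)
total_s = sum(schedule)
```

With these assumed values the first ten delays run from 1 s to 512 s, the last two sit at the 600 s cap, and the twelve attempts together span 2223 s (about 37 minutes). Once the cap is reached the steady-state cost is one message per interface every 10 minutes.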

Comment by Chris Horn [ 19/Jun/20 ]

What I'm thinking is that:
1. We can put a simple hard limit on remote NI recovery. Try to recover remote NIs, perhaps with exponential backoff, until the hard limit is reached. When the hard limit is reached we no longer attempt to recover the NI until we receive a message from that NI. At that point the algorithm resets and the remote NI becomes eligible for recovery again.
1.1 We might be able to leverage the existing lpni_last_alive field of the lnet_peer_ni struct for this purpose.
2. We recover local NIs indefinitely, but with an exponential backoff. This ensures that when a local NI becomes healthy again then we'll eventually use it for a send and this will cause other peers to mark it back up.
3. Maybe successful recovery messages cause us to send subsequent ones without delay. Failed recovery ping -> increases recovery interval; successful recovery ping -> decreases recovery interval.
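The remote NI part of this proposal (points 1 and 3) can be sketched as a small state machine. This is only an illustration of the idea, not the code that landed: PeerNiRecovery and its methods are hypothetical, last_alive_s merely stands in for the role suggested for lpni_last_alive, and halving the interval on success is one assumed way to "decrease" it.

```python
class PeerNiRecovery:
    """Hypothetical model of the proposed remote NI recovery rules."""

    def __init__(self, limit_s, last_alive_s=0, min_interval_s=1,
                 max_interval_s=900):
        self.limit_s = limit_s            # hard limit on recovery
        self.last_alive_s = last_alive_s  # time of last message from the NI
        self.min_interval_s = min_interval_s
        self.max_interval_s = max_interval_s
        self.interval_s = min_interval_s
        self.next_ping_s = last_alive_s

    def eligible(self, now_s):
        """Recovery is attempted only within limit_s of the last message."""
        return (now_s - self.last_alive_s) <= self.limit_s

    def on_message_received(self, now_s):
        """A message from the NI resets the algorithm."""
        self.last_alive_s = now_s
        self.interval_s = self.min_interval_s
        self.next_ping_s = now_s

    def on_ping_result(self, now_s, ok):
        """Failed ping: back off. Successful ping: shorten the interval."""
        if ok:
            self.interval_s = max(self.min_interval_s, self.interval_s // 2)
        else:
            self.interval_s = min(self.max_interval_s, self.interval_s * 2)
        self.next_ping_s = now_s + self.interval_s
```

Under point 2, a local NI would use the same interval logic but skip the eligible() check, since local NIs are recovered indefinitely.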

Comment by Gerrit Updater [ 24/Aug/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39716
Subject: LU-13569 lnet: Introduce lnet_recovery_limit parameter
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b8b728355cbaead6cba862f5cec9703e77bfced7

Comment by Gerrit Updater [ 24/Aug/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39717
Subject: LU-13569 lnet: Add lnet_recovery_limit to lnetctl
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 21ba92081131f86de4a6ee0bd1211a51b6f8ff32

Comment by Gerrit Updater [ 24/Aug/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39718
Subject: LU-13569 lnet: Age peer NI out of recovery
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0aa87980f624ceb0e8c96553becb953a5c7b95cb

Comment by Gerrit Updater [ 24/Aug/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39719
Subject: LU-13569 lnet: Only recover known good peer NIs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 043ae30076f397cb1ac2f5a0bdd2c6ac4e171fd1

Comment by Gerrit Updater [ 24/Aug/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39720
Subject: LU-13569 lnet: Recover peer NI w/exponential backoff interval
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 35017495ca09ad86c6f414d1d3c1f7f61394ee65

Comment by Gerrit Updater [ 24/Aug/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39721
Subject: LU-13569 lnet: Recover local NI w/exponential backoff interval
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e9ba241290326846f9796b9652d58e80b2796f92

Comment by Gerrit Updater [ 24/Aug/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39722
Subject: LU-13569 lnet: Deprecate lnet_recovery_interval
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f7e142362ad03d6c43af1dee6dd3fdd15fe26fc0

Comment by Gerrit Updater [ 24/Aug/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39723
Subject: LU-13569 tests: Check LNet Health recovery logic
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2f0cf735e93c9defb7ac911289a65ff5cccd0f5e

Comment by Gerrit Updater [ 20/Oct/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/40314
Subject: LU-13569 lnet: Add health ping stats
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bc7d9963c08ced81fca69f3f735e477583ba0bf4

Comment by Gerrit Updater [ 09/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39716/
Subject: LU-13569 lnet: Introduce lnet_recovery_limit parameter
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a2e61838f8de89e0f7c80c3bf288cbeb1b358baa

Comment by Gerrit Updater [ 09/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39717/
Subject: LU-13569 lnet: Add lnet_recovery_limit to lnetctl
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3e5c6620fd0b0511498d14d38e8610d08f6da7b3

Comment by Gerrit Updater [ 30/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39718/
Subject: LU-13569 lnet: Age peer NI out of recovery
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cc27201a76574b51dc3ffb37f039b3364cab386d

Comment by Gerrit Updater [ 30/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39719/
Subject: LU-13569 lnet: Only recover known good peer NIs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 39a169cd02738a13866f3b88fbe3304dc20565d6

Comment by Gerrit Updater [ 30/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39720/
Subject: LU-13569 lnet: Recover peer NI w/exponential backoff interval
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 917553c537a8860f57a50dc9752e3ac69d06c11c

Comment by Gerrit Updater [ 28/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39721/
Subject: LU-13569 lnet: Recover local NI w/exponential backoff interval
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8fdf2bc62ac9c418bd8e326112da9a2835f09ccb

Comment by Gerrit Updater [ 28/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39722/
Subject: LU-13569 lnet: Deprecate lnet_recovery_interval
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 79ab0535622782c82636cee47918dc4b19983144

Comment by Gerrit Updater [ 14/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40314/
Subject: LU-13569 lnet: Add health ping stats
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4c7e4aa57629660386ae2849151a0639b6177200

Comment by Gerrit Updater [ 14/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39723/
Subject: LU-13569 tests: Check LNet Health recovery logic
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: aa7391445519b46752b5b0adcbe5baa368750e70

Comment by Peter Jones [ 14/Jun/21 ]

Landed for 2.15
