[LU-13569] LNet Health should not recover interfaces indefinitely Created: 15/May/20 Updated: 27/Jan/23 Resolved: 14/Jun/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Rank (Obsolete): | 9223372036854775807 |
| Epic Link: | unlabelled-LU-13422 |
| Description |
|
We should not recover interfaces, remote or local, indefinitely. A remote interface may never come back (remote peer rebooted with new LNet config). A local interface may be down for extended periods of time when, say, a NIC needs to be swapped or cable replaced. Let's define criteria for when we should stop trying to recover an interface.
We also need to decide how to stop recovery.
|
| Comments |
| Comment by Andreas Dilger [ 15/May/20 ] |
|
What is the expected timeframe for giving up on the interface? I can definitely see that interfaces might be down for many hours or days because of bad cables, switches, etc. Typically what we do in such cases is use exponential backoff for the retries so that they do not add any real load to the system (on the order of one message every 5-10 minutes). The alternative would be to disable local-side recovery and wait until the peer starts using the interface to send messages to this node again. The drawback there is that if, e.g., a switch goes down for an hour, the nodes could all stop using their interfaces and never restart. |
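The exponential backoff mentioned above can be sketched roughly as follows. This is an illustration only, not LNet's actual recovery code: the function name `backoff_intervals`, the 1-second starting delay, and the 600-second cap (taken from the upper end of the "5-10 minutes" figure) are all assumptions.

```python
def backoff_intervals(base=1, cap=600, attempts=12):
    """Return the wait, in seconds, before each successive recovery attempt.

    Hypothetical helper: the delay doubles after every attempt and is
    clamped at `cap`, so a long-dead interface is probed at most once
    every `cap` seconds.
    """
    intervals = []
    delay = base
    for _ in range(attempts):
        intervals.append(delay)
        # Double the delay, but never exceed the cap.
        delay = min(delay * 2, cap)
    return intervals
```

With these assumed defaults the delay reaches the cap after ten doublings, after which an unhealthy interface costs one probe every ten minutes, however long it stays down.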
| Comment by Chris Horn [ 19/Jun/20 ] |
|
What I'm thinking is that: |
| Comment by Gerrit Updater [ 24/Aug/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39716 |
| Comment by Gerrit Updater [ 24/Aug/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39717 |
| Comment by Gerrit Updater [ 24/Aug/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39718 |
| Comment by Gerrit Updater [ 24/Aug/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39719 |
| Comment by Gerrit Updater [ 24/Aug/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39720 |
| Comment by Gerrit Updater [ 24/Aug/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39721 |
| Comment by Gerrit Updater [ 24/Aug/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39722 |
| Comment by Gerrit Updater [ 24/Aug/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39723 |
| Comment by Gerrit Updater [ 20/Oct/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/40314 |
| Comment by Gerrit Updater [ 09/Dec/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39716/ |
| Comment by Gerrit Updater [ 09/Dec/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39717/ |
| Comment by Gerrit Updater [ 30/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39718/ |
| Comment by Gerrit Updater [ 30/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39719/ |
| Comment by Gerrit Updater [ 30/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39720/ |
| Comment by Gerrit Updater [ 28/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39721/ |
| Comment by Gerrit Updater [ 28/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39722/ |
| Comment by Gerrit Updater [ 14/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40314/ |
| Comment by Gerrit Updater [ 14/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39723/ |
| Comment by Peter Jones [ 14/Jun/21 ] |
|
Landed for 2.15 |