[LU-14979] LNet: add tunable parameter to control max recovery interval duration Created: 01/Sep/21  Updated: 11/Jun/22  Resolved: 11/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Improvement Priority: Minor
Reporter: Serguei Smirnov Assignee: Cyril Bordage
Resolution: Fixed Votes: 0
Labels: lnet, lnet-health

 Description   

The currently implemented recovery ping mechanism doubles the delay before the next scheduled recovery ping after each attempt (exponential backoff, base 2) and caps that delay at 900 seconds. This hard-coded cap appears to be too high in many cases. Introduce a tunable parameter that limits the recovery ping interval, and choose a reasonable default for it.
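
To make the backoff concrete, here is a minimal Python sketch of the schedule described above. It is an illustration only, not the LNet code: the function name, the 1-second starting interval, and the `max_interval` argument are assumptions made for this example. Only the base-2 growth and the 900-second cap come from the description.

```python
def recovery_ping_intervals(attempts, max_interval=900):
    """Return the wait (in seconds) before each of the first `attempts` pings.

    Assumed model: the wait starts at 1 second, doubles after every
    unsuccessful recovery ping, and is clamped to `max_interval`.
    """
    intervals = []
    wait = 1
    for _ in range(attempts):
        intervals.append(min(wait, max_interval))
        wait *= 2
    return intervals


if __name__ == "__main__":
    # Hard-coded cap from the description.
    print(recovery_ping_intervals(12))
    # Hypothetical lower cap, as a tunable would allow.
    print(recovery_ping_intervals(12, max_interval=60))
```

Under these assumptions the wait reaches the 900-second cap at the eleventh attempt, while a hypothetical 60-second cap is reached at the seventh.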



 Comments   
Comment by Chris Horn [ 01/Sep/21 ]

Can you add some detail about the cases where the value is too high? My hope was that resetting the interval when we received a message from an NI would be sufficient. Is that not working for some reason?

Comment by Serguei Smirnov [ 01/Sep/21 ]

Chris,

I set up a test for LU-14978: Node A with one NI, Node B with two NIs, all on the same net. I was using lnetctl ping to create traffic from A to B. Then I executed "ifdown" on the interface corresponding to one of B's NIs. (This was to simulate a hardware failure on node B.) Some lnetctl pings failed and the "failed" peer NI's health got decremented as seen by A. I left it alone for a few minutes, then brought the "failed" interface on B back up. A didn't realize that B had both NIs healthy until it got around to sending the next recovery ping. In my opinion, the delay was too long. Unless I initiate a ping from B to A and it uses the recovered interface, A has no idea that B has both NIs back until the 900-second timeout expires.
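
The effect of the cap on this scenario can be estimated with a small Python sketch. It is a simplified model rather than the actual LNet scheduling: it assumes the failure is noticed at t=0, the first recovery ping goes out 1 second later, each later wait doubles up to the cap, and no other traffic resets the interval. The function name and the 60-second alternative cap are hypothetical.

```python
def delay_until_next_ping(recovered_at, max_interval=900):
    """Seconds between the interface recovering and the next recovery ping.

    Assumed model: failure noticed at t=0, first recovery ping after
    1 second, each following wait doubles, clamped to `max_interval`.
    """
    ping_time = 0
    wait = 1
    while True:
        ping_time += min(wait, max_interval)
        if ping_time >= recovered_at:
            # First ping sent at or after the moment the interface is back.
            return ping_time - recovered_at
        wait *= 2


if __name__ == "__main__":
    # Interface brought back up 5 minutes after the failure.
    print(delay_until_next_ping(300))                   # 900-second cap
    print(delay_until_next_ping(300, max_interval=60))  # hypothetical 60-second cap
```

In this model an interface restored 300 seconds after the failure stays unnoticed for 211 seconds with the 900-second cap, against 3 seconds with a 60-second cap. The longer the outage, the wider the gap becomes.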

Comment by Chris Horn [ 01/Sep/21 ]

Okay, that makes sense and is working like I would expect. It might be interesting to see whether this is an issue in an environment where Node A is a Lustre server and B is a Lustre client (and vice versa) and there is actual I/O going on (or maybe even just idle client traffic). I think that with I/O going on, things might recover more quickly. The idle client case might also take a while to recover the NI, but if the client is idle maybe it doesn't really matter: once I/O starts, we may again recover quickly.

Comment by Gerrit Updater [ 15/Sep/21 ]

"Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44927
Subject: LU-14979 lnet: set max recovery interval duration
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f3bea849cd255b6bcd8c379904795e3d8d6ffde8

Comment by Gerrit Updater [ 11/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44927/
Subject: LU-14979 lnet: set max recovery interval duration
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4027395fe463b6ea11084ff2af43ba0732ad0ddb

Comment by Peter Jones [ 11/Jun/22 ]

Landed for 2.16

Generated at Sat Feb 10 03:14:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.