[LU-12292] Decrement Health Value even if recovery processing fault Created: 13/May/19  Updated: 16/Oct/20  Resolved: 31/Mar/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.1
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Tatsushi Takamura Assignee: Tatsushi Takamura
Resolution: Fixed Votes: 0
Labels: LTS12

Issue Links:
Related
is related to LU-12303 Use lnet_health_sensitivity for resto... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The health value is used to select a route.
When a device fails, the recovery process periodically decrements the value, and
after the device is restored, the value is periodically incremented.
However, the normal route is not selected until the value has been fully restored.
We think the recovery process should not decrement the value,
because it takes a long time for the value to return to its original level.

We stopped decrementing the health value during recovery processing once a device failure has been detected.

 



 Comments   
Comment by Amir Shehata (Inactive) [ 16/May/19 ]

simmonsja had a good suggestion in this case. We should increment the health value by the same amount we decrement it by. That way the interface recovers faster. For example, if you want to fail the interface on the first failure and recover it on the first success, you can set the health value to 1000. This will basically bring down the interface on the first failure, and once we get one successful recovery the interface will be back to full health.

thoughts?
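
To make the suggestion above concrete, here is a minimal sketch of a symmetric update, assuming a maximum health of 1000 and a floor of 0. The names `MAX_HEALTH`, `apply_result`, and `step` are hypothetical and chosen for illustration only; this is not the LNet implementation.

```python
MAX_HEALTH = 1000  # assumed full-health value, taken from the comment above


def apply_result(health: int, success: bool, step: int) -> int:
    """Move health up or down by the same step, clamped to [0, MAX_HEALTH]."""
    if success:
        return min(MAX_HEALTH, health + step)
    return max(0, health - step)


# With step = 1000, a single failure drops a healthy interface to the floor,
# and a single successful recovery brings it back to full health.
after_failure = apply_result(MAX_HEALTH, success=False, step=1000)
after_success = apply_result(after_failure, success=True, step=1000)
```

With a smaller step, the same function needs several successes to undo several failures, which is the slower but more cautious end of the same trade-off.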

Comment by Chris Horn [ 16/May/19 ]

I like that idea.

Comment by Philip B Curtis [ 17/May/19 ]

I think Andreas had a valid point in LU-12303 that this behavior could cause issues with flapping hardware. Could a better approach be a weighted recovery that starts with 1 per success but increases the health recovery amount on consecutive pings? This could allow faster recovery after the hardware issue is resolved, without simply trusting that it is resolved right away.

Comment by Amir Shehata (Inactive) [ 21/May/19 ]

curtispb, yes that sounds like a good idea. Do you have suggestions on the ratio it should increase by on consecutive successes?

Comment by Philip B Curtis [ 22/May/19 ]

My first pass at this would be an exponential growth pattern such as health value + ((1 + consecutive successes)^2) with a max bound of 1000, which would recover in ~15 consecutive successes if the health value hit the floor. If there is a failure during this recovery period, the value is decremented normally and the consecutive-success counter is reset. Thoughts?
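
A minimal sketch of the weighted recovery proposed above, under stated assumptions: the consecutive-success counter starts at 0, the floor is 0, and the per-failure decrement (`sensitivity`, set to 100 here) is a hypothetical placeholder for "decrement normally". The class and function names are invented for illustration and do not come from the Lustre source.

```python
MAX_HEALTH = 1000  # max bound from the proposal


class HealthTracker:
    def __init__(self, health: int = MAX_HEALTH, sensitivity: int = 100):
        self.health = health
        self.sensitivity = sensitivity  # hypothetical per-failure decrement
        self.consecutive = 0            # consecutive successful pings so far

    def on_success(self) -> None:
        # health + (1 + consecutive successes)^2, capped at MAX_HEALTH
        self.health = min(MAX_HEALTH, self.health + (1 + self.consecutive) ** 2)
        self.consecutive += 1

    def on_failure(self) -> None:
        # Decrement as usual and restart the growth sequence.
        self.health = max(0, self.health - self.sensitivity)
        self.consecutive = 0


def successes_to_recover(start: int = 0) -> int:
    """Count consecutive successes needed to reach MAX_HEALTH from `start`."""
    tracker = HealthTracker(health=start)
    count = 0
    while tracker.health < MAX_HEALTH:
        tracker.on_success()
        count += 1
    return count
```

Under this reading the increments are 1, 4, 9, and so on, so `successes_to_recover(0)` returns 14 (the running sum is 819 after 13 successes and passes 1000 on the 14th), which is in line with the ~15 estimate. A single failure resets the counter, so flapping hardware keeps earning only small increments.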

Comment by Tatsushi Takamura [ 20/Sep/19 ]

Recent IB hardware is stable and of high quality, so we thought it would be enough for us to stop decrementing the health value during recovery processing (1000 sec is too much).
I think your idea is good, because it can handle flapping hardware and recovers in about 15 seconds in the normal case.

Comment by Gerrit Updater [ 04/Dec/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36921
Subject: LU-12292 lnet: keep health even if recovery failed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f29e84d3ea2f7a16c98619489b63b42571774003

Comment by Gerrit Updater [ 31/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36921/
Subject: LU-12292 lnet: keep health even if recovery failed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 610a7542107d5a8ab0a12dc8bda7a4f44f9f0b60

Comment by Peter Jones [ 31/Mar/20 ]

Landed for 2.14
