[LU-12292] Decrement Health Value even if recovery processing fault Created: 13/May/19 Updated: 16/Oct/20 Resolved: 31/Mar/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.1 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Tatsushi Takamura | Assignee: | Tatsushi Takamura |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LTS12 |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The health value is used to select a route. We stopped decrementing the health value during recovery processing once a device failure has been detected.
|
| Comments |
| Comment by Amir Shehata (Inactive) [ 16/May/19 ] |
|
simmonsja had a good suggestion in this case: we should increment the health value by the same amount we decrement it by, so that the interface recovers faster. For example, if you want to fail the interface on the first failure and recover it on the first success, you can set the health value to 1000. This will bring down the interface on the first failure, and once we get one successful recovery the interface will be back to full health. Thoughts? |
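To make the symmetric scheme concrete, here is a minimal Python sketch of it. It is not LNet code: the function names, the `sensitivity` parameter (the amount added or subtracted per event), and the floor of 0 are assumptions made for illustration; only the maximum of 1000 comes from the comment above.

```python
MAX_HEALTH = 1000  # full health, as in the example above
MIN_HEALTH = 0     # assumed floor; not stated in the comment


def on_failure(health: int, sensitivity: int) -> int:
    """Decrement by `sensitivity`, clamped at MIN_HEALTH."""
    return max(MIN_HEALTH, health - sensitivity)


def on_recovery_success(health: int, sensitivity: int) -> int:
    """Increment by the same `sensitivity`, clamped at MAX_HEALTH."""
    return min(MAX_HEALTH, health + sensitivity)


# With sensitivity 1000, one failure takes the interface from full
# health to the floor, and one successful recovery restores it fully.
failed = on_failure(MAX_HEALTH, sensitivity=1000)
recovered = on_recovery_success(failed, sensitivity=1000)
```

With a smaller `sensitivity`, the same two functions give a gradual decline and an equally gradual recovery, which is the trade-off being discussed here.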
| Comment by Chris Horn [ 16/May/19 ] |
|
I like that idea. |
| Comment by Philip B Curtis [ 17/May/19 ] |
|
I think Andreas had a valid point in |
| Comment by Amir Shehata (Inactive) [ 21/May/19 ] |
|
curtispb, yes that sounds like a good idea. Do you have suggestions on the ratio it should increase by on consecutive successes? |
| Comment by Philip B Curtis [ 22/May/19 ] |
|
My first pass at this would be an exponential growth pattern such as health value + ((1 + consecutive successes)^2) with a max bound of 1000 which would recover in ~15 consecutive successes if the health value hit the floor. If there is a failure during this recovery period decrement normally and the consecutive counter is reset. Thoughts? |
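The proposed growth rule can be sketched in Python to check the recovery estimate. This is an illustration only, under stated assumptions: the floor is taken to be 0, "consecutive successes" is read as the count of successes before the current one (so the first increment is 1), and the failure decrement of 100 is a hypothetical placeholder for whatever "decrement normally" means in practice. All function names are made up for the sketch. Note that the increments grow quadratically under this formula.

```python
MAX_HEALTH = 1000  # max bound from the proposal
MIN_HEALTH = 0     # assumed floor


def on_success(health: int, streak: int) -> tuple[int, int]:
    """Apply one successful recovery.

    `streak` is the number of consecutive successes before this one.
    Returns the new (health, streak) pair.
    """
    new_health = min(MAX_HEALTH, health + (1 + streak) ** 2)
    return new_health, streak + 1


def on_failure(health: int, streak: int, decrement: int = 100) -> tuple[int, int]:
    """Decrement normally (amount is hypothetical) and reset the streak."""
    return max(MIN_HEALTH, health - decrement), 0


def successes_to_full(health: int = MIN_HEALTH) -> int:
    """Count consecutive successes needed to reach MAX_HEALTH."""
    streak = 0
    while health < MAX_HEALTH:
        health, streak = on_success(health, streak)
    return streak
```

Under these assumptions, `successes_to_full()` returns 14 when starting from the floor, which is in line with the "~15 consecutive successes" estimate. If the counter instead includes the current success, the first increment is 4 and the count drops to 13.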
| Comment by Tatsushi Takamura [ 20/Sep/19 ] |
|
Recent IB is stable and of high quality, so we thought it would be enough for us to stop decrementing the health value during recovery processing (1000 sec is too much). |
| Comment by Gerrit Updater [ 04/Dec/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36921 |
| Comment by Gerrit Updater [ 31/Mar/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36921/ |
| Comment by Peter Jones [ 31/Mar/20 ] |
|
Landed for 2.14 |