[LU-14518] allow slow request processing to be removed from health check Created: 12/Mar/21 Updated: 23/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Andreas Dilger |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
If a request is not processed in a timely manner, the service is marked unhealthy, which can lead to STONITH. However, when the server is very heavily loaded, requests may take longer than at_max to be processed. This should not cause the server to be killed, since that slows request processing even further and puts extra load on the backup server(s), slowing their processing and possibly causing them to fail in the same way. |
| Comments |
| Comment by Mikhail Pershin [ 01/Feb/23 ] |
|
I am duplicating my review comment here: we could consider the history of timeouts, e.g. ost.OSS.ost_io.timeouts, and decide on service health based on the number of timeouts per unit of time.
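A minimal sketch of the timeouts-per-time-unit idea, not taken from any patch on this ticket. It assumes the cumulative timeout count (for example the value behind ost.OSS.ost_io.timeouts) is sampled periodically by some external caller; the class name `TimeoutRateMonitor`, the window length, and the threshold are all hypothetical.

```python
from collections import deque


class TimeoutRateMonitor:
    """Flag a service as unhealthy only when its timeout rate over a
    sliding window exceeds a threshold (hypothetical helper)."""

    def __init__(self, window_sec=60.0, max_timeouts_per_sec=1.0):
        self.window_sec = window_sec          # assumed window length
        self.max_rate = max_timeouts_per_sec  # assumed threshold
        self.samples = deque()                # (timestamp_sec, cumulative_timeouts)

    def add_sample(self, now, total_timeouts):
        """Record the cumulative timeout counter observed at time `now`."""
        self.samples.append((now, total_timeouts))
        # Keep as baseline the newest sample that is at least one window old.
        while len(self.samples) > 2 and now - self.samples[1][0] >= self.window_sec:
            self.samples.popleft()

    def rate(self):
        """Timeouts per second since the baseline sample."""
        if len(self.samples) < 2:
            return 0.0
        (t0, c0), (t1, c1) = self.samples[0], self.samples[-1]
        # Divide by at least one full window so that a single timeout
        # right after startup does not look like a high rate.
        elapsed = max(t1 - t0, self.window_sec)
        # A counter that went backwards (reset) is treated as no new timeouts.
        return max(c1 - c0, 0) / elapsed

    def healthy(self):
        return self.rate() <= self.max_rate
```

With a 10-second window and a threshold of 0.5 timeouts per second, one isolated timeout leaves `healthy()` true, while a burst of ten timeouts within five seconds turns it false. Dividing by at least one full window is a deliberate choice here so that a single stuck thread does not trip the check on its own.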
| Comment by Mikhail Pershin [ 01/Feb/23 ] |
|
Another note about service-based health checking: when it fails, it stops the whole node, so all servers will fail over. That looks like an overreaction in some cases, e.g. when a thread is stuck waiting for an event such as another server's recovery. It is even worse if the storage is overloaded for a moment: moving the server to another node will not help, since the storage remains the same, and that would cause constant ping-pong failovers. The only case where it is helpful is a deadlock or any other situation in which the node requires a reboot. Detecting that from a single thread timeout is too aggressive.
| Comment by Gerrit Updater [ 24/Nov/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53225 |
| Comment by Gerrit Updater [ 14/Dec/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53451 |