[LU-14518] allow slow request processing to be removed from health check Created: 12/Mar/21  Updated: 23/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: Andreas Dilger
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Rank (Obsolete): 9223372036854775807

 Description   

If a request is not processed in a timely manner, the service is marked unhealthy, which can lead to STONITH. However, when the server is very heavily loaded, requests may take longer than at_max to be processed. This should not cause the server to be killed: killing it slows request processing even further and puts extra load on the backup server(s), which slows their processing and may cause them to fail in the same way.



 Comments   
Comment by Mikhail Pershin [ 01/Feb/23 ]

I am duplicating my review comment here: we could consider the history of timeouts, e.g. ost.OSS.ost_io.timeouts, and decide on service health based on the number of timeouts per unit of time.
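
To make the "timeouts per time unit" idea concrete, here is a minimal user-space sketch in Python. It is not Lustre code and not what any patch on this ticket implements. The class name `TimeoutRateMonitor`, the window length, and the threshold are all hypothetical, and how timeout events would be obtained from a counter such as ost.OSS.ost_io.timeouts is left out.

```python
from collections import deque


class TimeoutRateMonitor:
    """Hypothetical helper: judge health from the rate of timeouts, not one event."""

    def __init__(self, window_seconds, max_timeouts):
        self.window_seconds = window_seconds  # length of the sliding window
        self.max_timeouts = max_timeouts      # timeouts tolerated per window
        self._events = deque()                # timestamps of observed timeouts

    def record_timeout(self, now):
        """Record one request timeout observed at time `now` (in seconds)."""
        self._events.append(now)

    def _expire(self, now):
        # Drop timeouts that have fallen out of the sliding window.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()

    def timeouts_in_window(self, now):
        """Return the number of timeouts seen within the last window."""
        self._expire(now)
        return len(self._events)

    def is_healthy(self, now):
        """Healthy as long as the recent timeout count stays within the limit."""
        return self.timeouts_in_window(now) <= self.max_timeouts
```

With `TimeoutRateMonitor(window_seconds=60, max_timeouts=3)`, a couple of isolated timeouts leave the service healthy. A burst of four within a minute marks it unhealthy, and it recovers once those events age out of the window.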

Comment by Mikhail Pershin [ 01/Feb/23 ]

Another note about health checking based on services: if the check fails, it stops the whole node, so all servers fail over, which looks like an overreaction in some cases. For example, a thread may be stuck waiting for some event such as another server's recovery. Even worse, if the storage becomes overloaded for a moment, moving the server to another node will not help, since the storage remains the same, and this would cause constant ping-pong failovers. The only case where it is helpful is a deadlock or any other situation in which the node requires a reboot. Detecting that from a single thread timeout is too aggressive.
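
One way to read "a single thread timeout is too aggressive" is that the node should be declared unhealthy only when most or all service threads are stalled, as in a deadlock. The Python sketch below illustrates that reading under stated assumptions. The function `service_is_healthy`, its parameters, and the idea of a per-thread "seconds busy on current request" value are hypothetical and do not come from the Lustre code or from the patches on this ticket.

```python
def service_is_healthy(thread_busy_seconds, stall_limit, min_stalled_fraction=1.0):
    """Return False only when enough threads are stalled to suggest a deadlock.

    thread_busy_seconds: seconds each service thread has spent on its
        current request (0 for an idle thread); hypothetical input.
    stall_limit: a thread counts as stalled above this many seconds.
    min_stalled_fraction: fraction of stalled threads at which the
        service is considered unhealthy (1.0 means all threads).
    """
    if not thread_busy_seconds:
        # No threads to inspect: nothing indicates a problem.
        return True
    stalled = sum(1 for busy in thread_busy_seconds if busy > stall_limit)
    return stalled / len(thread_busy_seconds) < min_stalled_fraction
```

With the default fraction of 1.0, one thread stuck for 700 seconds among idle or fast threads does not trip the check, while every thread exceeding the limit does. A lower fraction trades tolerance for faster detection.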

Comment by Gerrit Updater [ 24/Nov/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53225
Subject: LU-14518 ptlrpc: WIP avoid server STONITH for slow requests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 515100fb45dda4587c9d60b73a44258c7aba5bdc

Comment by Gerrit Updater [ 14/Dec/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53451
Subject: LU-14518 libcfs: print CFS_FAIL_CHECK() location
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3f1231a9deb9cde5f2b81d8899d80359ff73dbe7
