[LU-16002] Ping evictor delayed client eviction for 3 ping interval more than defined Created: 10/Jul/22  Updated: 13/Sep/23

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Alexander Boyko
Resolution: Unresolved Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-9912 fix multiple client mounts with diffe... Open
is related to LU-16271 replay-single test_89 FAIL: 20480 blo... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The ping evictor adds 3 ping intervals on top of the eviction time PING_EVICT_TIMEOUT (6 * ping interval). For obd_timeout = 300s, the resulting eviction time becomes 670s instead of 450s. This is confusing and delays all conflicting requests on the server side.
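
The arithmetic behind these numbers can be sketched as follows. This is a minimal illustration, assuming the ping interval is obd_timeout / 4 (as stated in the comments below) and that the extra delay is exactly three ping intervals. The function names are hypothetical and are not Lustre symbols.

```c
/* Hypothetical helpers for illustration only; not Lustre APIs. */

/* Ping interval in seconds, assumed to be a quarter of obd_timeout. */
static unsigned int ping_interval_secs(unsigned int obd_timeout)
{
	return obd_timeout / 4;
}

/* Intended eviction window: 6 ping intervals (PING_EVICT_TIMEOUT). */
static unsigned int intended_evict_secs(unsigned int obd_timeout)
{
	return 6 * ping_interval_secs(obd_timeout);
}

/* Window including the 3 extra ping intervals described in this ticket. */
static unsigned int delayed_evict_secs(unsigned int obd_timeout)
{
	return intended_evict_secs(obd_timeout) +
	       3 * ping_interval_secs(obd_timeout);
}
```

For obd_timeout = 300 these helpers give a 75s ping interval, a 450s intended window, and a 675s delayed window, which is close to the roughly 670s reported above.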



 Comments   
Comment by Gerrit Updater [ 10/Jul/22 ]

"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47928
Subject: LU-16002 ptlrpc: reduce pinger eviction time
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 747f6d6f7dfad19a9340275e905e79152978cf35

Comment by Gerrit Updater [ 19/Jul/22 ]

"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47982
Subject: LU-16002 ptlrpc: adds configurable ping interval
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f53e72d9b8ad713f5bb509d6e3ff3765eef8f587

Comment by Gerrit Updater [ 17/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47982/
Subject: LU-16002 ptlrpc: adds configurable ping interval
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8e66f061c01e53cda84ce80af3860f488e927210

Comment by Gerrit Updater [ 15/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/47928/
Subject: LU-16002 ptlrpc: reduce pinger eviction time
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6bdeda7afe92d61db56367875774fa074aaac0fd

Comment by Peter Jones [ 16/Oct/22 ]

Landed for 2.16

Comment by Andreas Dilger [ 11/Sep/23 ]

aboyko, could you please provide some more background on why a tunable ping_interval is needed? I'm concerned that allowing ping_interval to be tuned separately from obd_timeout can lead to random client eviction when clients are not sending RPCs or OBD_PING in a timely manner. This might be hard to notice if it works out like e.g. ping_interval = obd_timeout - 10 and this is OK while an import is active and sending RPCs or OBD_PING, but fails intermittently if the import becomes idle and a ping is also lost.

I'd much prefer to have a per-device obd_timeout value as implemented in patch https://review.whamcloud.com/50519 "LU-9912 ptlrpc: make obd timeout a per-device param", and then the ping_interval for each import is controlled by obd->obd_timeout / 4. This would work properly for clients that mount multiple filesystems, unlike having a global obd_timeout (and now global ping_interval).

However, before we change anything with the global ping_interval that was added in 2.16, I'd like to understand why it was added and what problem it was solving. I'd prefer to avoid having a tunable ping_interval entirely, just because it can go badly. If this was needed to solve some specific problem, would a per-device obd_timeout also solve this same issue? Also, hornc landed patch https://review.whamcloud.com/49807 "LU-16483 ptlrpc: Track highest reply XID" that also solves a longstanding problem where clients reconnect on a ping failure, even though they have successfully sent other pings in the meantime.

I'm thinking we should remove the global ping_interval tunable completely (so that pings are always tied to obd_timeout), and use something like:

#define PING_INTERVAL(obd) (obd_timeout(obd) / 4)

It would still be possible to keep evict_multiplier if that is important, something like:

#define PING_INTERVAL(obd) (obd_timeout(obd) * 3 / (evict_multiplier * 2))

but before we add complexity I'd like to understand what this was needed for.
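
As a rough check of the two formulas above, the sketch below writes them as plain C functions. It assumes obd_timeout(obd) simply yields a timeout in seconds, so the functions take that value directly, and it assumes the eviction window is the ping interval times evict_multiplier. All function names are hypothetical.

```c
/* Hypothetical stand-ins for the proposed macros; not Lustre code. */

/* First proposal: ping interval tied directly to obd_timeout. */
static unsigned int ping_interval_fixed(unsigned int timeout)
{
	return timeout / 4;
}

/* Second proposal: scale the interval by evict_multiplier (must be > 0). */
static unsigned int ping_interval_scaled(unsigned int timeout,
					 unsigned int evict_multiplier)
{
	return timeout * 3 / (evict_multiplier * 2);
}

/* Eviction window implied by a ping interval and a multiplier. */
static unsigned int evict_window(unsigned int interval,
				 unsigned int evict_multiplier)
{
	return interval * evict_multiplier;
}
```

With a 300s timeout and evict_multiplier = 6, both proposals give a 75s interval. With evict_multiplier = 3 the scaled form gives 150s, and the window stays at 450s. Under these assumptions the second formula keeps the eviction window near 1.5 * obd_timeout (up to integer rounding) and only changes how many pings fit inside it.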

Comment by Alexander Boyko [ 12/Sep/23 ]

We had an issue where cascading failures pushed timeouts (the blocking callback timeout) to ~1700s. Roughly: one client node holding an LDLM lock crashed, the server waited for it, and the AT increased. The crash and eviction were not a problem for the system as a whole, but they greatly increased the AT, and the lock involved was a shared lock on a root directory. We found 3-6 problems in the process: bl timeouts, eviction logic, etc. One way to prevent such a case is to detect the crashed client early and evict it via pinger_evictor; for this we can reduce ping_interval and the evictor multiplier. By default the eviction time is 6 ping intervals, so the server will not evict a client even if 5 pings are lost. On a perfect network this is overhead, and it could be reduced (the eviction multiplier). The same applies to ping_interval. If obd_timeout is 300s (a value used in production), the ping interval is 75s. To detect client failure faster, the ping interval should be reduced.

I've made a comment about obd_timeout at LU. From my point of view, obd_timeout is primarily a recovery timeout, and recovery and the pinger are not related to each other, apart from some history.

Comment by Andreas Dilger [ 13/Sep/23 ]

I agree that it is useful in such cases to be able to tune ping_interval and/or evict_multiplier, but it would make sense to ensure that ping_interval < obd_timeout / 2 and evict_multiplier >= 2 so that it cannot be set to a value where the client will be evicted easily. Even so, it still makes sense to use obd_timeout(exp) directly by default, unless ping_interval is explicitly set:

#define PING_INTERVAL(obd) (ping_interval ?: (obd_timeout(obd) * 3 / (evict_multiplier * 2)))
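
A minimal sketch of how the bounds suggested above could be combined with the fallback in that macro. The `?:` form with an omitted middle operand is a GNU C extension, so the sketch uses an explicit test instead. The struct, its field names, and the validation functions are hypothetical, and treating 0 as "ping_interval not set" is an assumption.

```c
#include <stdbool.h>

/* Hypothetical tunables for illustration only; not Lustre structures. */
struct ping_tunables {
	unsigned int obd_timeout;      /* seconds */
	unsigned int ping_interval;    /* seconds, 0 means "not set" */
	unsigned int evict_multiplier; /* ping intervals before eviction */
};

/* Accept only ping_interval < obd_timeout / 2 (0 keeps the default). */
static bool ping_interval_valid(const struct ping_tunables *t,
				unsigned int new_interval)
{
	return new_interval == 0 || new_interval < t->obd_timeout / 2;
}

/* Accept only evict_multiplier >= 2. */
static bool evict_multiplier_valid(unsigned int new_multiplier)
{
	return new_multiplier >= 2;
}

/* Use the explicit interval if set, else derive it from obd_timeout. */
static unsigned int effective_ping_interval(const struct ping_tunables *t)
{
	if (t->ping_interval != 0)
		return t->ping_interval;
	return t->obd_timeout * 3 / (t->evict_multiplier * 2);
}
```

With obd_timeout = 300 and evict_multiplier = 6, an unset ping_interval falls back to 75s, while an explicit value such as 20s is used directly. A requested interval of 150s or more would be rejected by the first check.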

Also, for future reference, it is possible to evict specific clients from the MDS with "lctl set_param mdt.*.evict_client=UUID" or "lctl set_param mdt.*.evict_client=nid:NID". It will evict the client UUID/NID from all the targets if "mdt.*.evict_tgt_nids" is set.
