[LU-9912] fix multiple client mounts with different server timeouts Created: 24/Aug/17  Updated: 08/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Feng Lei
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16749 apply a filter to the configuration log. Open
is related to LU-16002 Ping evictor delayed client eviction ... Reopened
is related to LU-8750 Wrong obd_timeout on the client when ... Resolved
is related to LU-15246 Add per device adaptive timeout param... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

If a single client is mounting multiple filesystems, but the servers are configured with different timeouts (e.g. local vs. WAN mounts), then the client will use the timeout value of the most recently mounted filesystem. Since the obd_timeout value is global across all connections, if the most recent mount has a significantly higher timeout than the other filesystems, the client will not ping the servers often enough, resulting in client evictions during periods of inactivity.

The timeout parameter is currently implemented as a single value for all mounted filesystems, but it should be possible to make this work better when filesystems with different timeouts are mounted. One possibility is to use the minimum timeout across the mounted filesystem configurations as the client ping interval. Currently, since http://review.whamcloud.com/2881 the timeout is the maximum value across all filesystems, which was done to fix a problem with timeouts on the MGS. For clients it makes more sense to use the minimum timeout across all mounted filesystems.
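A minimal userspace sketch of the minimum-timeout idea (struct and function names are hypothetical, not actual Lustre code):

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical illustration: derive the client ping interval from the
 * minimum "timeout" among all mounted filesystem configurations,
 * instead of the current behavior of keeping the maximum in the
 * global obd_timeout. */
struct mount_cfg {
	const char *fsname;
	int timeout;		/* "timeout" value from this fs config log */
};

static int min_client_timeout(const struct mount_cfg *cfgs, int count)
{
	int min = INT_MAX;
	int i;

	for (i = 0; i < count; i++)
		if (cfgs[i].timeout < min)
			min = cfgs[i].timeout;
	return count > 0 ? min : 0;
}
```

With a local mount (timeout=100) and a WAN mount (timeout=600), the client would derive its ping interval from 100, keeping both sets of servers satisfied.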

Of course, it would be even better if the timeout value in the config was stored on a per-mountpoint or per-import basis, so that it is possible to have different timeouts for local and remote filesystems, large and small filesystems, interactive and batch filesystems, etc. I don't know how easily that could be implemented, but it would be a better long-term solution than a single timeout for the whole client.



 Comments   
Comment by James A Simmons [ 25/Aug/17 ]

The most logical place to me to expose this to admins is /sys/fs/lustre/mgc/MGC10.37.248.196@o2ib1/*. We could take an LND-type approach: treat the module parameters as global settings that by default apply to all network interfaces, and let admins configure each network interface via DLC to override the global values.

Comment by Chris Hunter (Inactive) [ 11/Sep/17 ]

Isn't obd_timeout controlled by adaptive timeouts?

Comment by Andreas Dilger [ 11/Sep/17 ]

Not exactly. Adaptive timeouts keep per-connection values for RPC processing and resends, but obd_timeout is still used for cases that are not directly related to RPC timeouts, such as when there is no activity (e.g. pings while the client is idle).

For this ticket, we don't need to change the server code, but we should track the ping interval on a per-server basis.

Comment by Chris Hunter (Inactive) [ 11/Sep/17 ]

Does obd_timeout show up in ptlrpc timeout messages?
Any idea what opcodes are reported (e.g. 38/MDS_CONNECT or 400/OBD_PING)?

Thanks.

Comment by Andreas Dilger [ 09/Apr/21 ]

To make this work properly, I think we should introduce a new device-specific parameter "*.*.timeout" (and equivalents for at_min and the other global parameters we want to fix), and then allow setting it directly. If the global timeout parameter is sent from the MGS, it should only be applied to devices configured by that MGS, possibly tracked by fsname and translated into a glob like *.*fsname*.timeout or equivalent when it is applied.

Comment by Andreas Dilger [ 29/Mar/23 ]

This needs a similar change as patch https://review.whamcloud.com/45598 "LU-15246 ptlrpc: add per-device adaptive timeout parameters" that changes the global "timeout" parameter to be per-device "*.<fsname>*.timeout" parameters.

Separately, a patch is needed for the MGC to translate global config log parameter settings of timeout, at_min, at_max, at_history, and ldlm_enqueue_min into *.<fsname>*.<param>, so that they only apply to devices from that filesystem rather than to all filesystems mounted by the client.

Comment by Feng Lei [ 31/Mar/23 ]

adilger, do you mean the variable obd_timeout in class_obd.c?

Comment by Andreas Dilger [ 31/Mar/23 ]

Yes, this is the global "timeout" parameter set for the node. It is actually more problematic than at_min, etc., because obd_timeout controls the client's OBD_PING interval when it is idle. If the client has a smaller obd_timeout than the server, no significant harm is done and the client is just noisier than needed. If the client has a much larger obd_timeout than the server, then it will repeatedly be evicted by the servers, because they think it is dead (no contact for minutes).

The implementation should be the same as in the previous patch: add a wrapper to access obd_timeout from the device if it is set there, otherwise (by default, for now) fall back to the global obd_timeout value.
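A rough sketch of such a wrapper, with hypothetical field and function names (the real identifiers would come from the actual patch):

```c
#include <assert.h>
#include <stddef.h>

/* Global default, as in class_obd.c (value here is illustrative). */
unsigned int obd_timeout = 100;

/* Hypothetical per-device structure: 0 means "unset, use the global". */
struct obd_device {
	unsigned int obd_timeout;
};

/* Return the per-device timeout if set, otherwise fall back to the
 * global obd_timeout, mirroring the LU-15246 per-device AT approach. */
static unsigned int obd_timeout_get(const struct obd_device *obd)
{
	if (obd != NULL && obd->obd_timeout != 0)
		return obd->obd_timeout;
	return obd_timeout;
}
```

Callers would then use obd_timeout_get(obd) instead of reading the global directly, so a WAN-mounted filesystem's devices can carry a larger timeout without changing the ping behavior of other mounts.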

Comment by Gerrit Updater [ 04/Apr/23 ]

"Feng Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50519
Subject: LU-9912 ptlrpc: add per-device timeout param
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f29a032a06507b980281d9921413800fdf7d3846

Comment by Alexander Boyko [ 12/Sep/23 ]

Lustre uses obd_timeout mostly for:

1) recovery cases: soft timeout, hard timeout, stale client waiting, etc.

2) the ping interval and pinger eviction timeout. With LU-16002 the ping interval and eviction multiplier were separated from obd_timeout. I don't see a strong case for a dependency between ping and recovery. I also think the pinger eviction timeout could be based on AT rather than obd_timeout, and network latency and busy ptlrpc threads could be handled by the ping evictor.

3) covering network-layer timeouts: different network configurations with/without routers, link changes, network reconfiguration, etc.

Other parts (request timeouts, LDLM timeouts) mostly use Adaptive Timeouts.

I see the value of having different ping intervals for different filesystems, but I think adding obd_timeout to every device does not serve the original idea well. It would be better to have something like per-mount/per-filesystem variables.


Comment by Andreas Dilger [ 13/Sep/23 ]

I also thought about per-mount/per-filesystem timeouts, but that is also complex, because pinging is actually done on a per-OBD-device import basis, so having the timeout on the OBD device makes sense. When this parameter is documented, it will be described as "*.FSNAME-*.timeout=TIMEOUT", so that it applies uniformly to all OBD devices of that filesystem.

Comment by Andreas Dilger [ 08/Jan/24 ]

One final thing that needs to be done here to fix the problem of a client mounting two different filesystems is to filter the "global" tunable parameters (timeout, ldlm_timeout, at_min, at_max, etc.) to only apply to OBD devices for that mountpoint. For example, if a client is mounting testfs1, then it should internally map timeout and related parameters to *.testfs1-*.timeout. When mounting otherfs from a different MGS it should map timeout to *.otherfs-*.timeout, etc. so that there is no confusion about global parameters affecting a different filesystem than intended.
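A hypothetical sketch of that mapping (function name and buffer handling are invented for illustration; the real implementation would live in the MGC config log processing):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical illustration: scope a "global" tunable from the config
 * log of filesystem 'fsname' so it only matches that filesystem's OBD
 * devices, e.g. "timeout=600" for testfs1 becomes
 * "*.testfs1-*.timeout=600". */
static const char *scope_param_to_fs(const char *fsname, const char *param)
{
	static char buf[128];

	snprintf(buf, sizeof(buf), "*.%s-*.%s", fsname, param);
	return buf;
}
```

With this kind of rewrite, the testfs1 config log's timeout only reaches testfs1 devices, and otherfs (from a different MGS) keeps its own value.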

Generated at Sat Feb 10 02:30:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.