[LU-8750] Wrong obd_timeout on the client when we have 2 or more lustre fs Created: 24/Oct/16  Updated: 18/Apr/23  Resolved: 29/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Antoine Percher Assignee: Hongchao Zhang
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9912 fix multiple client mounts with diffe... Open
is related to LU-8066 Move lustre procfs handling to sysfs ... Open
is related to LU-15246 Add per device adaptive timeout param... Resolved
is related to LU-16749 apply a filter to the configuration log. Open
Epic/Theme: lnet
Rank (Obsolete): 9223372036854775807

 Description   

When we mount 2 or more Lustre filesystems on a client, the client's obd_timeout becomes the maximum of all the servers' obd_timeout values. In some cases this can lead to evictions, because a server with a smaller timeout does not wait long enough for the client's obd_ping requests.

In my case I have 2 Lustre filesystems, with 2.5.x servers and some 2.7 clients. The first server has obd_timeout=100 and the second server has obd_timeout=300, so the obd_timeout inherited on the client is 300. The client then sends one obd_ping request every 75 seconds, and if just one obd_ping request is lost the client can be evicted. It would be better to have an obd_timeout per filesystem, or to use the minimum of the servers' values.
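
For illustration, a quick way to confirm the inherited value on the client (a sketch; the fsnames FS100/FS300 are just example names for the two filesystems):

    # on the client, after mounting both filesystems
    lctl get_param timeout
    # timeout=300  -> the client sends obd_ping every timeout/4 = 75s,
    # so, as described above, losing a single ping can be enough for the
    # 100s-timeout servers to evict the client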



 Comments   
Comment by Andreas Dilger [ 25/Oct/16 ]

I agree that this is a potential issue. A single global obd_timeout value doesn't align with configurations where, e.g., one filesystem is local and another is remote, and they should really have different timeout values.

There are a few options that can be tried to resolve this problem without needing to wait for a patch and new release:
1) Try mounting the filesystems on a test client in the opposite order: the filesystem with the longer timeout (FS300) mounted first and the shorter timeout (FS100) mounted second, and then check lctl get_param timeout to see if this client uses the 100s timeout. If yes, then this could be put into production immediately without any further changes, except in the rare case where one filesystem is being mounted inside the other. If the client still has a timeout of 300s, then it appears that FS100 is using the default obd_timeout of 100s and not explicitly setting a timeout at all, and something more needs to be done.
2) As with #1 above, change the mount order to mount FS300 first and FS100 second, and also explicitly set the timeout parameter for FS100 via lctl conf_param <fsname>.sys.timeout=100, and see if this allows the client to store the shorter timeout (see the example commands after this list).
3) Set the timeout for FS100 to 300s to match FS300, so that the servers will wait up to 300s for the pings to arrive. However, this will also increase the recovery time for FS100 and that may not be desirable for some configurations.
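
As a rough sketch of options 1 and 2 (the mount points and MGS NIDs below are placeholders, not taken from this ticket):

    # option 1: mount the longer-timeout filesystem first, then check the result on the client
    mount -t lustre mgs300@tcp:/FS300 /mnt/fs300
    mount -t lustre mgs100@tcp:/FS100 /mnt/fs100
    lctl get_param timeout

    # option 2: additionally set the timeout for FS100 explicitly (run on the FS100 MGS)
    lctl conf_param FS100.sys.timeout=100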

There are also potential code fixes for this problem; in particular, we discussed adding a per-target ping_interval tunable in /proc, similar to max_rpcs_in_flight and max_pages_per_rpc, that would allow setting the ping interval for a single filesystem explicitly.
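
If such a tunable existed, usage might look like the following (ping_interval here is a hypothetical parameter named by analogy with the existing max_rpcs_in_flight/max_pages_per_rpc tunables, not an implemented interface):

    # existing per-target tunables, for comparison
    lctl get_param osc.FS100-OST0000-osc-*.max_rpcs_in_flight
    # hypothetical per-target ping interval, applied only to the FS100 targets
    lctl set_param osc.FS100-OST*.ping_interval=25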

Comment by Joseph Gmitter (Inactive) [ 25/Oct/16 ]

Hi Hongchao,

Can you please look into the suggested code fixes that Andreas has highlighted in the last comment?

Thanks.
Joe

Comment by Hongchao Zhang [ 28/Oct/16 ]

Test output:
1) Mount the filesystem with timeout 300 first and the filesystem with timeout 100 second:
the resulting timeout is 100.

After explicitly setting the timeout of FS100 to 300 via lctl conf_param FS100.sys.timeout=300, the timeout changes to 300.

2) Mount the filesystem with timeout 100 first and the filesystem with timeout 300 second:
the resulting timeout is 300.

After explicitly setting the timeout of FS300 to 100 via lctl conf_param FS300.sys.timeout=100, the timeout changes to 100.

Comment by Andreas Dilger [ 12/Apr/18 ]

To properly fix this problem, it would be good to store the ping_interval and obd_timeout on a per-import basis. That would allow a single client to mount two or more different filesystems with different server timeouts (which the client can't control).

Comment by Andreas Dilger [ 16/Aug/18 ]

With the newer userspace-driven parameter parsing (an upcall via udev to lctl) it should be relatively easy to implement per-OBD timeouts. By default, new OBD devices would inherit the global timeout value when they are created (stored in each obd_device or obd_export separately, and always used from the local device instead of the global value). If there is a timeout parameter in the configuration logs (which would normally generate an "lctl set_param timeout=<value>" upcall), this would be replaced by "*.<fsname>-*.timeout" so that the upcall for that filesystem's configuration log will only change the devices for the named filesystem.
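
To illustrate the scoping described above (the scoped parameter form comes from this comment and is a proposal, not an existing interface):

    # today: a 'timeout' record in FS300's configuration log effectively becomes
    # a global upcall on the client, affecting every mounted filesystem
    lctl set_param timeout=300
    # proposed: the upcall generated from FS300's log would instead be scoped
    # to that filesystem's devices only, e.g.
    lctl set_param *.FS300-*.timeout=300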

Comment by Andreas Dilger [ 29/Mar/23 ]

Closing this as a duplicate of LU-9912; I've copied the CCs over already.
