
Wrong obd_timeout on the client when we have two or more Lustre filesystems

Details

    • Type: Improvement
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version/s: Lustre 2.7.0

    Description

      When we mount two or more Lustre filesystems on a client, the client's obd_timeout is the maximum of all the servers' obd_timeout values. In some cases this can lead to evictions, because a server with a shorter timeout does not wait long enough for the next obd_ping request.

      In my case I have two Lustre filesystems, with 2.5.x servers and 2.7 clients. The first server has obd_timeout=100 and the second has obd_timeout=300, so the obd_timeout inherited on the client is 300. The client then sends one obd_ping request every 75 seconds (obd_timeout / 4), so if just one obd_ping request is lost the client can be evicted by the first filesystem's servers. It would be better to have an obd_timeout per filesystem, or to use the minimum of the servers' timeouts.
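      To make the numbers concrete, here is a minimal shell sketch using the values from this report (the eviction margin in the comments illustrates the general mechanism, not the server's exact eviction logic):

          # The single global timeout the client inherited from its
          # configuration logs (there is no per-filesystem value today):
          lctl get_param timeout
          # timeout=300

          # The client pings each server every obd_timeout / 4 seconds:
          #   300 / 4 = 75s between obd_ping requests
          # If one ping is lost, the gap a server sees is ~150s, well past
          # the 100s obd_timeout on the first filesystem's servers, so
          # that filesystem can evict the client.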


          Activity


            adilger Andreas Dilger added a comment - Closing this as a duplicate of LU-9912; I've copied the CCs over already.

            adilger Andreas Dilger added a comment - With the newer userspace-driven parameter parsing (an upcall via udev to lctl), it may be relatively easy to implement per-OBD timeouts. By default, new OBD devices would inherit the global timeout value when they are created (stored in each obd_device or obd_export separately, and always read from the local device instead of the global value). If there is a timeout parameter in the configuration logs (which would normally generate an "lctl set_param timeout=<value>" upcall), it would be replaced by "*.<fsname>-*.timeout" so that the upcall for that filesystem's configuration log only changes the devices of the named filesystem.
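            As a sketch of the change this proposes (the wildcard syntax is real lctl usage, but the per-filesystem "*.<fsname>-*.timeout" parameter is the proposal from this comment, not an existing tunable):

                # Today: a "timeout" record in any filesystem's configuration
                # log updates the one global value for every mount:
                lctl set_param timeout=300

                # Proposed: the upcall generated for FS100's configuration log
                # would name only FS100's devices, leaving other filesystems
                # untouched:
                lctl set_param *.FS100-*.timeout=100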

            adilger Andreas Dilger added a comment - To properly fix this problem, it would be good to store the ping_interval and obd_timeout on a per-import basis. That would allow a single client to mount two or more different filesystems with different server timeouts (which the client can't control).
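            For reference, a short sketch of what can be inspected today (these commands exist in current Lustre; the point is that the timeout is still a single global value, not stored per import):

                # The one global value used for every import on this client:
                lctl get_param timeout

                # Per-import state already exists (connection state, RPC
                # statistics, ...), but it carries no per-import timeout or
                # ping interval yet:
                lctl get_param mdc.*.import
                lctl get_param osc.*.import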

            hongchao.zhang Hongchao Zhang added a comment - Test output:

            1) Mounting the timeout-300 filesystem first and the timeout-100 filesystem second: the resulting client timeout is 100. After explicitly setting the timeout of FS100 to 300 with "lctl conf_param FS100.sys.timeout=300", the client timeout changes to 300.

            2) Mounting the timeout-100 filesystem first and the timeout-300 filesystem second: the resulting client timeout is 300. After explicitly setting the timeout of FS300 to 100 with "lctl conf_param FS300.sys.timeout=100", the client timeout changes to 100.
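            A sketch of that test sequence (the MGS NIDs and mount points below are placeholders; "lctl conf_param" is run on the MGS):

                # 1) Mount the timeout-300 filesystem first, then the
                #    timeout-100 filesystem; the last configuration log
                #    processed wins:
                mount -t lustre mgs300@tcp:/FS300 /mnt/fs300
                mount -t lustre mgs100@tcp:/FS100 /mnt/fs100
                lctl get_param timeout        # timeout=100

                # Explicitly setting FS100's timeout (on the MGS) pushes the
                # new value to the client:
                lctl conf_param FS100.sys.timeout=300
                lctl get_param timeout        # timeout=300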

            jgmitter Joseph Gmitter (Inactive) added a comment - Hi Hongchao,

            Can you please look into the suggested code fixes that Andreas highlighted in the last comment?

            Thanks,
            Joe

            adilger Andreas Dilger added a comment - I agree that this is a potential issue: a single global obd_timeout value doesn't align with configurations where, for example, one filesystem is local and another is remote, and they should really have different timeout values.

            There are a few options that can be tried to resolve this problem without waiting for a patch and a new release:
            1) Try mounting the filesystems on a test client in the opposite order: the filesystem with the longer timeout (FS300) mounted first and the one with the shorter timeout (FS100) second, then check "lctl get_param timeout" to see whether the client uses the 100s timeout. If it does, this could be put into production immediately without any further changes, except in the rare case where one filesystem is mounted inside the other. If the client still has a timeout of 300s, then FS100 is likely using the default obd_timeout of 100s without explicitly setting a timeout at all, and something more needs to be done.
            2) As with #1, mount FS300 first and FS100 second, and also explicitly set the timeout parameter for FS100 (the shorter-timeout filesystem) via "lctl conf_param <fsname>.sys.timeout=100" to see whether this lets the client keep the shorter timeout.
            3) Set the timeout for FS100 to 300s to match FS300, so that the servers will wait up to 300s for the pings to arrive. However, this also increases the recovery time for FS100, which may not be desirable for some configurations.

            There are also potential code fixes for this problem; in particular, we discussed adding a per-target ping_interval tunable in /proc, similar to max_rpcs_in_flight and max_pages_per_rpc, that would allow setting the ping interval for a single filesystem explicitly. A sketch of how that might look follows this comment.
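            A hypothetical sketch of the proposed per-target tunable (the "ping_interval" parameter below does not exist yet; max_rpcs_in_flight and max_pages_per_rpc are the existing per-target tunables it would be modeled on):

                # Hypothetical: ping FS100's targets every 25s (100 / 4), even
                # though the client's global obd_timeout is 300:
                lctl set_param mdc.FS100-MDT*.ping_interval=25
                lctl set_param osc.FS100-OST*.ping_interval=25

                # Existing per-target tunables that the proposal mirrors:
                lctl get_param osc.*.max_rpcs_in_flight
                lctl get_param osc.*.max_pages_per_rpc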

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: apercher Antoine Percher