[LU-12029] do not try to muck with max_sectors_kb on multipath configurations Created: 27/Feb/19  Updated: 26/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-12297 Add an option to disable mount.lustre... Resolved
Related
is related to LU-9551 I/O errors when lustre uses multipath... Resolved
is related to LU-12387 l_tunedisk does not properly handle m... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

There have been many reports lately that increasing max_sectors_kb on multipath configurations breaks multipath.

It looks like we really need to stop adjusting the values there and just print a warning, so that users can investigate whether a larger value works and, if so, incorporate it into their config by some other means.

There are too many related patches to list here.



 Comments   
Comment by Chris Walker (Inactive) [ 30/Jul/19 ]

For installations with a large number (>50) of attached disks, the additional overhead of running l_tunedisk on every change/add event is significant: one test with an 84-disk enclosure showed a 'udev settle' time of 55s with l_tunedisk in place and 34s without.
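
For reference, a rough way to reproduce that kind of measurement (a sketch, assuming the rules are installed):

# Trigger change events for all block devices, then time how long
# the udev event queue takes to drain.
udevadm trigger --subsystem-match=block --action=change
time udevadm settle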

I realize that removing 99-lustre-server.rules entirely might not be palatable for some customers, but could it at least be moved to /usr/lib/udev/rules.d so that it can be stubbed out easily? That seems like a more appropriate place for it, since that directory is for 'system' files, and the Lustre RPM is (IMO) the 'system' in this case.
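
As an illustration of the stub-out, if the rule shipped in /usr/lib/udev/rules.d, a site could mask it with an override of the same name (standard udev precedence; a sketch, not a recommendation):

# Mask the packaged rule with an empty site-local override, then
# reload the udev rules.
ln -s /dev/null /etc/udev/rules.d/99-lustre-server.rules
udevadm control --reload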

Thanks,
Chris

Comment by Andreas Dilger [ 10/Aug/20 ]

The primary reason that the l_tunedisk udev script exists is that dm_multipath devices "forget" their max_sectors_kb (and other) settings when they disconnect and reconnect as new SCSI devices (e.g. because of a brief cable disconnect or SCSI bus reset). From the kernel's perspective, the underlying SCSI device (e.g. /dev/sdb) disappears and reappears, possibly with a new device name, and is reset back to the default settings.

The upper-layer dm_multipath device still reports the larger size for max_sectors_kb, but the reconnected device has been reset to the smaller default value (no problem is hit if the user-specified max_sectors_kb is smaller than the default). This causes SCSI errors as reported in LU-9551 (and many other tickets) because in-flight IO is larger than what the "new" device queue will accept:

Mar 31 00:02:44 oss01 kernel: blk_cloned_rq_check_limits: over max size limit.
Mar 31 00:02:44 oss01 kernel: device-mapper: multipath: Failing path 8:160.
:
Mar 31 00:17:30 oss01 kernel: blk_update_request: I/O error, dev dm-17, sector 1182279680
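
A quick way to see this mismatch on a live system is to compare the dm device's limit with each of its paths (dm-17 here is just an example name; a freshly reconnected path will show the smaller default again):

# Print the multipath device's limit, then each underlying path's.
cat /sys/block/dm-17/queue/max_sectors_kb
for p in /sys/block/dm-17/slaves/*; do
    echo "${p##*/}: $(cat "$p"/queue/max_sectors_kb)"
done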

It would be best to fix the root of this problem in the kernel in the dm_multipath device driver code, rather than continue to make l_tunedisk or other udev scripts increasingly complex to handle this case. Since this is (IMHO) a bug in the core kernel, it should also be submitted upstream.

Firstly, the "blk_cloned_rq_check_limits: over max size limit." error message should be improved to print the device name, actual request size, and the current queue size limit to make it clear where the error lies (too large a request, or too small a limit). This is just a one-line change to this function.

Secondly, the dm_multipath code needs to remember the max_sectors_kb (and other) block device settings set on the multipath device. It should already have these parameters stored in its own queue settings; it just needs to automatically set those parameters on the underlying devices when they are re-added to the multipath, before any IO is submitted there. This might benefit from flags that indicate which parameters were tuned away from the defaults, so that it doesn't touch parameters that have never actually been changed.

This would properly handle in-flight IOs generated while the device was being reconnected, and would avoid the gap between the "new" device being added to the multipath and the (potentially several-second) delay before the udev script runs to (re-)tune the low-level block device queue values. I don't think it would be too hard to patch the dm_multipath code to do this, but I haven't looked at this code in detail.
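
Until such a kernel fix exists, the user-space equivalent of the re-tuning step is essentially this (a minimal sketch, not the actual l_tunedisk logic; dm-17 and sdk are example names):

# Re-apply the multipath device's current limit to a re-added path
# before I/O is routed over it.
cat /sys/block/dm-17/queue/max_sectors_kb > /sys/block/sdk/queue/max_sectors_kb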

AFAIK, there is already code in dm_multipath to limit max_sectors_kb (and other parameters) to the minimum value reported by any of the underlying storage paths at setup time, and there is code to pass tuning written to /sys/block/dm-X/queue/max_sectors_kb down to /sys/block/sdX,sdY,sdZ/queue/max_sectors_kb at the time it is set; this essentially needs to be made "persistent" when a device is reconnected to the multipath. In theory, a new path could be reintroduced with a smaller limit (e.g. connected via a "worse" HBA or iSCSI transport), and that new limit should also "bubble up" to the higher levels (if it doesn't already). It is far more likely, though, that the previously-tuned parameters can simply be set on the new device again, because the disconnect was just a temporary blip in connectivity (a flaky/disconnected cable) and it is still the same device.

Comment by Andreas Dilger [ 10/May/22 ]

According to RedHat, the /etc/multipath.conf has a parameter for setting max_sectors_kb at setup time:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/dm_multipath/config_file_multipath

max_sectors_kb
(Red Hat Enterprise Linux Release 6.9 and later) Sets the max_sectors_kb device queue parameter to the specified value on all underlying paths of a multipath device before the multipath device is first activated. When a multipath device is created, the device inherits the max_sectors_kb value from the path devices. Manually raising this value for the multipath device or lowering this value for the path devices can cause multipath to create I/O operations larger than the path devices allow. Using the max_sectors_kb parameter is an easy way to set these values before a multipath device is created on top of the path devices and prevent invalid-sized I/O operations from being passed. If this parameter is not set by the user, the path devices have it set by their device driver, and the multipath device inherits it from the path devices.

It still isn't clear from this description whether it is any better than calling tune_devices.sh from udev, since it only mentions "before a multipath device is created" and says nothing about what happens when a path reconnects (which is the core issue here).
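
For reference, using that parameter would look something like this in /etc/multipath.conf (the 16384 value is just an example, not a recommendation):

defaults {
        max_sectors_kb 16384
}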

Comment by Etienne Aujames [ 17/Jul/23 ]

We recently hit this issue at the CEA during a disk firmware update on an SFA18K:

  • The firmware update triggers an ALUA event on the device.
  • The udev rule 99-lustre-server.rules fires on each server that sees the block device (mounted or unmounted).
  • The rule runs l_tunedisk on each VM.
  • osd_is_lustre/ldiskfs_is_lustre accesses the raw device (via the debugfs/e2fsprogs API) concurrently on every server (lots of 4k reads).
  • The targets hang.

One target is seen on 8 VMs, and one SFA pool contains 2 VDs, so the udev event is triggered 16 times for each firmware update. The OSTs are large: over 620T.

To mitigate this issue, I think we should avoid using debugfs/e2fsprogs on the raw devices to identify whether a device is used by Lustre, and run l_tunedisk only for mounted devices (on device add/change rules); a sketch of such a check follows.
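
A minimal sketch of that check (the rule argument and the mount-type test here are assumptions, not the final patch):

#!/bin/sh
# Only invoke l_tunedisk when the device is currently mounted as a
# Lustre (ldiskfs) target, instead of probing the raw device.
dev=/dev/$1
if grep -qE "^$dev [^ ]+ (lustre|ldiskfs) " /proc/mounts; then
        l_tunedisk "$dev"
fi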

Comment by Gerrit Updater [ 17/Jul/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51695
Subject: LU-12029 utils: l_tunedisk only tune mounted target
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 217a0475d4a1bf3df394e1008ff4937f60a12e9d

Comment by Etienne Aujames [ 19/Jul/23 ]

Hi Andreas,

I have submitted a pull request on multipathd: https://github.com/opensvc/multipath-tools/pull/69

With the following kernel patch, this should work fine:

commit 3ae706561637331aa578e52bb89ecbba5edcb7a9
Author: Mike Snitzer <snitzer@redhat.com>
Date:   Wed Sep 26 23:45:45 2012 +0100

    dm: retain table limits when swapping to new table with no devices
    
    Add a safety net that will re-use the DM device's existing limits in the
    event that DM device has a temporary table that doesn't have any
    component devices.  This is to reduce the chance that requests not
    respecting the hardware limits will reach the device.
    
    DM recalculates queue limits based only on devices which currently exist
    in the table.  This creates a problem in the event all devices are
    temporarily removed such as all paths being lost in multipath.  DM will
    reset the limits to the maximum permissible, which can then assemble
    requests which exceed the limits of the paths when the paths are
    restored.  The request will fail the blk_rq_check_limits() test when
    sent to a path with lower limits, and will be retried without end by
    multipath.  This became a much bigger issue after v3.6 commit fe86cdcef
    ("block: do not artificially constrain max_sectors for stacking
    drivers").
    
    Reported-by: David Jeffery <djeffery@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>

And the following multipathd patch:

commit 8fd48686d72ee10e8665f03399da128e8c1362bd
Author: Benjamin Marzinski <bmarzins@redhat.com>
Date:   Fri Apr 7 01:16:37 2017 -0500

    libmultipath: don't set max_sectors_kb on reloads

    Multipath was setting max_sectors_kb on the multipath device and all its
    path devices both when the device was created, and when it was reloaded.
    The problem with this is that while this would set max_sectors_kb on all
    the devices under multipath, it couldn't set this on devices on top of
    multipath.  This meant that if a user lowered max_sectors_kb on an
    already existing multipath device with a LV on top of it, the LV could
    send down IO that was too large for the new max_sectors_kb value,
    because the LV was still using the old value.  The solution to this is
    to only set max_sectors_kb to the configured value when the device is
    originally created, not when it is later reloaded.  Since not all paths
    may be present when the device is originally created, on reloads
    multipath still needs to make sure that the max_sectors_kb value on all
    the path devices is the same as the value of the multipath device. But
    if this value doesn't match the configuration value, that's o.k.

    This means that the max_sectors_kb value for a multipath device won't
    change after it has been initially created. All of the devices created
    on top of the multipath device will inherit that value, and all of the
    devices will use it all the way down, so IOs will never be mis-sized.

    I also moved sysfs_set_max_sectors_kb to configure.c, since it is only
    called from there, and it makes use of static functions from there.

    Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>

Comment by Andreas Dilger [ 19/Jul/23 ]

It looks like the DM patch landed as v3.6-rc7-5-g3ae706561637, so it should be present in el8.7 and later server kernels (3.10). Can you confirm that the libmultipath patch is also included in el8 installs...

Comment by Etienne Aujames [ 19/Jul/23 ]

I am on Rocky Linux 8.8; the libmultipath patch is present.
This patch only sets max_sectors_kb at multipath device init (if specified in the configuration), and thereafter keeps any value set by the user on the device.

So, as a workaround, a value can be set for max_sectors_kb inside multipath.conf.
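
One rough way to check whether the fix is shipped on an EL system is via the package changelog (package name as shipped by RHEL/Rocky; the grep pattern is just a guess):

# Look for the max_sectors_kb reload change in the packaged history.
rpm -q --changelog device-mapper-multipath | grep -i max_sectors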

Comment by Etienne Aujames [ 08/Sep/23 ]

The multipath patch landed in multipath-tools 0.9.6: https://github.com/opensvc/multipath-tools/pull/68/commits/bbb77f318ee483292f50a7782aecaecc7e60f727

Should we remove the 99-lustre-server.rules?

Comment by Andreas Dilger [ 10/Sep/23 ]

Etienne, thanks for submitting the patch upstream. I don't think we can remove this until at least the main distros ship a version of multipath-tools that includes your fix.
