[LU-12029] do not try to muck with max_sectors_kb on multipath configurations Created: 27/Feb/19 Updated: 26/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
There have been many reports lately of increasing max_sectors_kb breaking multipath configurations. It looks like we really need to stop adjusting the values there and instead just print a warning, so that users can investigate whether a larger value works and, if so, incorporate it into their configuration by some other means. There are just too many related patches to list here. |
| Comments |
| Comment by Chris Walker (Inactive) [ 30/Jul/19 ] |
|
For installations with many (>50) attached disks, the additional overhead of running l_tunedisk on every change/add event is significant: one test with an 84-disk enclosure showed a 'udev settle' time of 55s with l_tunedisk in place and 34s without. I realize that removing 99-lustre-server.rules entirely might not be palatable for some customers, but could it at least be moved to /usr/lib/udev/rules.d so that it can be stubbed out easily? This seems like a more appropriate place for it, since that is the directory for 'system' files, and the Lustre RPM is (IMO) the 'system' in this case. Thanks, |
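As a rough sketch of the "stubbed out" idea (assuming the rules file were shipped in /usr/lib/udev/rules.d as suggested), a site could mask it with a same-named file in /etc/udev/rules.d:
# Hypothetical stub: a file with the same name in /etc/udev/rules.d takes
# precedence over the copy in /usr/lib/udev/rules.d, so an empty file
# effectively disables the rule.
touch /etc/udev/rules.d/99-lustre-server.rules
udevadm control --reload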
| Comment by Andreas Dilger [ 10/Aug/20 ] |
|
The primary reason that the l_tunedisk udev script exists is because dm_multipath devices "forget" their max_sectors_kb (and other) settings when they disconnect and reconnect as new SCSI devices (e.g. because of a brief cable disconnect or SCSI bus reset). As far as the kernel is concerned, the underlying SCSI device (e.g. /dev/sdb) disappears and reappears, possibly with a new device name, and is reset back to the default settings. The upper-layer dm_multipath device still reports the larger max_sectors_kb, but the reconnected device has been reset to the smaller default value (no problem is hit if the user-specified max_sectors_kb is smaller than the default). This causes SCSI errors such as:
Mar 31 00:02:44 oss01 kernel: blk_cloned_rq_check_limits: over max size limit.
Mar 31 00:02:44 oss01 kernel: device-mapper: multipath: Failing path 8:160.
:
Mar 31 00:17:30 oss01 kernel: blk_update_request: I/O error, dev dm-17, sector 1182279680
It would be best to fix the root of this problem in the kernel dm_multipath device driver code, rather than continue to make l_tunedisk or other udev scripts increasingly complex to handle this case. Since this is (IMHO) a bug in the core kernel, it should also be submitted upstream.
Firstly, the "blk_cloned_rq_check_limits: over max size limit." error message should be improved to print the device name, the actual request size, and the current queue size limit, to make it clear where the error lies (too large a request, or too small a limit). This is just a one-line change to that function.
Secondly, the dm_multipath code needs to remember the max_sectors_kb (and other) block device settings set on the multipath device. It should already have these parameters stored in its own queue settings; it just needs to set them automatically on the underlying devices when they are re-added to the multipath, before any IO is submitted there. This might benefit from flags indicating which parameters were tuned away from the defaults, so that parameters that have never actually been changed are left alone. Doing this in the kernel would properly handle IOs that were already in flight while the device was being reconnected, and would avoid the gap between the "new" device being added to the multipath and the (potentially several-second) delay before the udev script runs to (re-)tune the low-level block device queue values. I don't think it would be too hard to patch the dm_multipath code to do this, but I haven't looked at this code in detail.
AFAIK, there is already code in dm_multipath to limit max_sectors_kb (and other parameters) to the minimum value reported by any of the underlying storage paths at setup time, and there is code to pass the tuning written to /sys/block/dm-X/queue/max_sectors_kb down to /sys/block/sdX,sdY,sdZ/queue/max_sectors_kb at the time it is set, but this essentially needs to be made "persistent" when a device is reconnected to the multipath. In theory it would be possible for a new path to be reintroduced with a smaller limit (e.g. connected via a "worse" HBA or iSCSI transport), and that new limit should also "bubble up" to the higher levels (if it doesn't already), but it is far more likely that the previously-tuned parameters can simply be set on the new device again, because it was just a temporary blip in connectivity (flaky/disconnected cable) and it is still the same device. |
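A minimal sketch (with made-up device names) of what "passing the tuning down" amounts to at the sysfs level; this is what the udev/l_tunedisk workaround effectively does for a reconnected path today, and what the proposed dm_multipath change would do automatically in the kernel before any IO is submitted:
# Hypothetical devices: dm-3 is the multipath device, sdb is a path that
# just reconnected and was reset to the kernel default max_sectors_kb.
mp_limit=$(cat /sys/block/dm-3/queue/max_sectors_kb)
echo "$mp_limit" > /sys/block/sdb/queue/max_sectors_kb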
| Comment by Andreas Dilger [ 10/May/22 ] |
|
According to Red Hat, /etc/multipath.conf has a parameter for setting max_sectors_kb at setup time.
It still isn't clear from that description whether this is any better than calling tune_devices.sh from udev, since it only mentions "before a multipath device is created" and says nothing about what happens when a path reconnects (which is the core issue here). |
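For reference, a minimal multipath.conf sketch of such a setting (the value here is only an example, and per the description it is applied when the multipath device is created):
defaults {
        # example value, in the same KB units as /sys/block/*/queue/max_sectors_kb
        max_sectors_kb 1024
}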
| Comment by Etienne Aujames [ 17/Jul/23 ] |
|
We recently hit this issue at the CEA during a disk firmware update on an SFA18K:
1 target is seen on 8 VMs, and 1 SFA pool contains 2 VDs, so the udev event is triggered 16 times for each firmware update. The OSTs are large (+620T). To mitigate this issue, I think we should avoid using debugfs/e2fsprogs on the raw devices to identify whether a device is used by Lustre, and instead run l_tunedisk only for mounted devices (on device add/change rules). |
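A rough sketch of that kind of filter (the findmnt-based check and the way l_tunedisk is invoked here are illustrative assumptions, not the actual patch):
# Hypothetical filter: only re-tune devices that are currently mounted,
# instead of probing every raw device with debugfs/e2fsprogs.
dev="$1"   # block device passed in from the udev rule, e.g. /dev/dm-3
if findmnt --source "$dev" >/dev/null 2>&1; then
        l_tunedisk "$dev"
fi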
| Comment by Gerrit Updater [ 17/Jul/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51695 |
| Comment by Etienne Aujames [ 19/Jul/23 ] |
|
Hi Andreas, I have submitted a pull request on multipathd: https://github.com/opensvc/multipath-tools/pull/69
With the following kernel patch, this should work fine:
commit 3ae706561637331aa578e52bb89ecbba5edcb7a9
Author: Mike Snitzer <snitzer@redhat.com>
Date: Wed Sep 26 23:45:45 2012 +0100
dm: retain table limits when swapping to new table with no devices
Add a safety net that will re-use the DM device's existing limits in the
event that DM device has a temporary table that doesn't have any
component devices. This is to reduce the chance that requests not
respecting the hardware limits will reach the device.
DM recalculates queue limits based only on devices which currently exist
in the table. This creates a problem in the event all devices are
temporarily removed such as all paths being lost in multipath. DM will
reset the limits to the maximum permissible, which can then assemble
requests which exceed the limits of the paths when the paths are
restored. The request will fail the blk_rq_check_limits() test when
sent to a path with lower limits, and will be retried without end by
multipath. This became a much bigger issue after v3.6 commit fe86cdcef
("block: do not artificially constrain max_sectors for stacking
drivers").
Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
And the following multipathd patch:
commit 8fd48686d72ee10e8665f03399da128e8c1362bd
Author: Benjamin Marzinski <bmarzins@redhat.com>
Date: Fri Apr 7 01:16:37 2017 -0500
libmultipath: don't set max_sectors_kb on reloads
Multipath was setting max_sectors_kb on the multipath device and all its
path devices both when the device was created, and when it was reloaded.
The problem with this is that while this would set max_sectors_kb on all
the devices under multipath, it couldn't set this on devices on top of
multipath. This meant that if a user lowered max_sectors_kb on an
already existing multipath device with a LV on top of it, the LV could
send down IO that was too large for the new max_sectors_kb value,
because the LV was still using the old value. The solution to this is
to only set max_sectors_kb to the configured value when the device is
originally created, not when it is later reloaded. Since not all paths
may be present when the device is original created, on reloads multipath
still needs to make sure that the max_sectors_kb value on all the path
devices is the same as the value of the multipath device. But if this
value doesn't match the configuration value, that's o.k.
This means that the max_sectors_kb value for a multipath device won't
change after it have been initially created. All of the devices created
on top of the multipath device will inherit that value, and all of the
devices will use it all the way down, so IOs will never be mis-sized.
I also moved sysfs_set_max_sectors_kb to configure.c, since it is only
called from there, and it it makes use of static functions from there.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
|
| Comment by Andreas Dilger [ 19/Jul/23 ] |
|
It looks like the DM patch landed as v3.6-rc7-5-g3ae706561637, so it should be present in el8.7 and later server kernels. Can you confirm that the libmultipath patch is also included in el8 installs? |
| Comment by Etienne Aujames [ 19/Jul/23 ] |
|
I am on Rocky Linux 8.8, and the libmultipath patch is present. So, as a workaround, a value can be set for max_sectors_kb inside multipath.conf. |
| Comment by Etienne Aujames [ 08/Sep/23 ] |
|
The multipath patch landed in multipath-tools 0.9.6: https://github.com/opensvc/multipath-tools/pull/68/commits/bbb77f318ee483292f50a7782aecaecc7e60f727
Should we remove the 99-lustre-server.rules? |
| Comment by Andreas Dilger [ 10/Sep/23 ] |
|
Etienne, thanks for submitting the patch upstream. I don't think we can remove this until at least the main distros ship a version of multipath-tools that includes your fix. |