[LU-12387] l_tunedisk does not properly handle multipath devices Created: 04/Jun/19  Updated: 14/Jul/20  Resolved: 16/Jun/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.3
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12029 do not try to muck with max_sectors_k... Open
is related to LU-9551 I/O errors when lustre uses multipath... Resolved
is related to LU-11736 Do not apply bulk IO tuning on MDT or... Resolved
is related to LU-12297 Add an option to disable mount.lustre... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I've identified two bugs or deficiencies in the l_tunedisk utility in Lustre 2.12/2.13

The first is a regression introduced by "LU-11736 utils: don't set max_sectors_kb on MDT/MGT".

The l_tunedisk path does provide sufficient information to the tune_block_dev() function for it to ascertain whether the device being tuned is in-fact an OST. The result is that tuning is not performed for any device. I believe the solution is, after we've determined that the device has been formatted for Lustre, we should read the "lustre_disk_data" (ldd) from the device. This information is stored in the "struct mount_opts" that is passed through to the underlying tuning functions.

diff --git a/lustre/utils/l_tunedisk.c b/lustre/utils/l_tunedisk.c
index 3321fd1f0f..feecdeea77 100644
--- a/lustre/utils/l_tunedisk.c
+++ b/lustre/utils/l_tunedisk.c
@@ -45,6 +45,8 @@ int main(int argc, char *const argv[])
        struct mount_opts mop = {
                .mo_max_sectors_kb = -1
        };
+       struct lustre_disk_data *ldd = &mop.mo_ldd;
+
        char real_path[PATH_MAX] = {'\0'};
        unsigned int mount_type;
        static struct option long_opts[] = {
@@ -95,6 +97,13 @@ int main(int argc, char *const argv[])
        if (ret == 0)
                goto out;

+       ret = osd_read_ldd(mop.mo_source, ldd);
+       if (ret != 0) {
+               fprintf(stderr, "Failed to read previous Lustre data from %s "
+                       "(%d)\n", mop.mo_source, ret);
+               goto out;
+       }
+
        ret = osd_tune_lustre(mop.mo_source, &mop);

 out:

The second issue has to do with how tunings are emplaced for slave devices when the device that was initially passed to l_tunedisk is a multipath device.

static int tune_block_dev(const char *src, struct mount_opts *mop)
{
<snip>
                /* If device is multipath device then tune its slave
                 * devices. */
                rc = tune_block_dev_slaves(real_sys_path, mop);

Here "real_sys_path" is something like

/sys/devices/virtual/block/dm-2

tune_block_dev_slaves() then looks in the slaves subdirectory.

static int tune_block_dev_slaves(const char *sys_path, struct mount_opts *mop)
{
<snip>
        snprintf(slaves_path, sizeof(slaves_path), "%s/slaves", sys_path);
        slaves_dir = opendir(slaves_path);

e.g.

# ls /sys/devices/virtual/block/dm-2/slaves
sdc  sdh  sdm  sdr
#

And it recursively calls tune_block_dev() on each symlink in that directory:

        while ((d = readdir(slaves_dir)) != NULL) {
                char path[PATH_MAX];
                int rc2;

                if (d->d_type != DT_LNK)
                        continue;

                snprintf(path, sizeof(path), "%s/%s", slaves_path, d->d_name);
                rc2 = tune_block_dev(path, mop);

The problem is that tune_block_dev() performs a stat on its "src" argument which is supposed to be the device path that we are tuning. And if the stat does not indicate that the file is a block device then we return without performing any tuning:

static int tune_block_dev(const char *src, struct mount_opts *mop)
{
<snip>
        rc = stat(src, &st);
        if (rc < 0) {
                if (verbose)
                        fprintf(stderr, "warning: cannot stat '%s': %s\n",
                                src, strerror(errno));
                return errno;
        }

        if (!S_ISBLK(st.st_mode))
                return 0;

stat will resolve the symbolic link, however these links point at directories. Directories are not block devices to tune_block_dev() will not tune them.

I'm not sure what is going to come out of LU-12029.
That ticket suggests that we might be getting rid of l_tunedisk altogether? If that's the case then this ticket could be closed. But as long as we're providing this utility we may as well fix it up.



 Comments   
Comment by Chris Horn [ 05/Jun/19 ]

It's not totally clear to me the best way to handle the symlink issue. I wonder if a simple fix would be to change the path for the slave device to:

        while ((d = readdir(slaves_dir)) != NULL) {
                char path[PATH_MAX];
                int rc2;

                if (d->d_type != DT_LNK)
                        continue;

                snprintf(path, sizeof(path), "/dev/%s", d->d_name);
                rc2 = tune_block_dev(path, mop);
Comment by Chris Horn [ 05/Jun/19 ]

The change proposed above seems to do the trick for my testbed, but it'd be nice to have someone with expertise in this area weigh in.

oss1:~ # for i in /sys/block/sd{m,c,r,h} /sys/block/dm-2; do echo 1280 > $i/queue/max_sectors_kb; done
oss1:~ # for i in /sys/block/sd{m,c,r,h} /sys/block/dm-2; do echo "$i $(cat $i/queue/max_sectors_kb)"; done
/sys/block/sdm 1280
/sys/block/sdc 1280
/sys/block/sdr 1280
/sys/block/sdh 1280
/sys/block/dm-2 1280
oss1:~ # /usr/sbin/l_tunedisk /dev/mapper/mpathc
l_tunedisk: increased '/sys/devices/virtual/block/dm-2/queue/max_sectors_kb' from 1280 to 16383
Tuning slaves for /sys/devices/virtual/block/dm-2
Checking for slaves at /sys/devices/virtual/block/dm-2/slaves
tuning slave /dev/sdm
l_tunedisk: increased '/sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:2/end_device-0:0:2/target0:0:2/0:0:2:2/block/sdm/queue/max_sectors_kb' from 1280 to 16383
Tuning slaves for /sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:2/end_device-0:0:2/target0:0:2/0:0:2:2/block/sdm
Checking for slaves at /sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:2/end_device-0:0:2/target0:0:2/0:0:2:2/block/sdm/slaves
tuning slave /dev/sdc
l_tunedisk: increased '/sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:0/end_device-0:0:0/target0:0:0/0:0:0:2/block/sdc/queue/max_sectors_kb' from 1280 to 16383
Tuning slaves for /sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:0/end_device-0:0:0/target0:0:0/0:0:0:2/block/sdc
Checking for slaves at /sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:0/end_device-0:0:0/target0:0:0/0:0:0:2/block/sdc/slaves
tuning slave /dev/sdr
l_tunedisk: increased '/sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:3/end_device-0:0:3/target0:0:3/0:0:3:2/block/sdr/queue/max_sectors_kb' from 1280 to 16383
Tuning slaves for /sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:3/end_device-0:0:3/target0:0:3/0:0:3:2/block/sdr
Checking for slaves at /sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:3/end_device-0:0:3/target0:0:3/0:0:3:2/block/sdr/slaves
tuning slave /dev/sdh
l_tunedisk: increased '/sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:1/end_device-0:0:1/target0:0:1/0:0:1:2/block/sdh/queue/max_sectors_kb' from 1280 to 16383
Tuning slaves for /sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:1/end_device-0:0:1/target0:0:1/0:0:1:2/block/sdh
Checking for slaves at /sys/devices/pci0000:00/0000:00:03.2/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:1/end_device-0:0:1/target0:0:1/0:0:1:2/block/sdh/slaves
oss1:~ # for i in /sys/block/sd{m,c,r,h} /sys/block/dm-2; do echo "$i $(cat $i/queue/max_sectors_kb)"; done
/sys/block/sdm 16383
/sys/block/sdc 16383
/sys/block/sdr 16383
/sys/block/sdh 16383
/sys/block/dm-2 16383
oss1:~ #
Comment by Gerrit Updater [ 05/Jun/19 ]

Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35065
Subject: LU-12387 utils: Avoid passing symlink to tune_block_dev
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 985f626fd0b0ecc9e5582af1ab8f6f59f706582f

Comment by Gerrit Updater [ 05/Jun/19 ]

Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35066
Subject: LU-12387 utils: Read existing ldd data in l_tunedisk
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5ec740f26828d53c9e1ff898df737d95f58eb699

Comment by Gerrit Updater [ 06/Jun/19 ]

Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35081
Subject: LU-12387 tests: Validate l_tunedisk max_sectors_kb tuning
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 490484ed694121ea70dcdcc5ffe8e2b26617e2ca

Comment by Gerrit Updater [ 16/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35066/
Subject: LU-12387 utils: Read existing ldd data in l_tunedisk
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9cb4f810164efa5f058dc7605fb1835ea51b0a92

Comment by Gerrit Updater [ 16/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35065/
Subject: LU-12387 utils: Avoid passing symlink to tune_block_dev
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c632a238e6d0c4a3240959a894d36a8a409d64f8

Comment by Gerrit Updater [ 16/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35081/
Subject: LU-12387 tests: Validate l_tunedisk max_sectors_kb tuning
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ac8bbb3ddd646e4aa04b77cb1e7640b5865f2c04

Comment by Peter Jones [ 16/Jun/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 28/Jun/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35370
Subject: LU-12387 utils: Read existing ldd data in l_tunedisk
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 3e7469ab4f69d08a2f3eb351a4d1555d97ac35a4

Comment by Gerrit Updater [ 28/Jun/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35371
Subject: LU-12387 utils: Avoid passing symlink to tune_block_dev
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 8d463022229509147991589ca3f7829feda3eeeb

Comment by Gerrit Updater [ 28/Jun/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35372
Subject: LU-12387 tests: Validate l_tunedisk max_sectors_kb tuning
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 6b647d7d365a172e7df2c0b3681f8abe6ba74fe9

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35370/
Subject: LU-12387 utils: Read existing ldd data in l_tunedisk
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 5c552e31271824d3512aa9fe8814422b5f6812e0

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35371/
Subject: LU-12387 utils: Avoid passing symlink to tune_block_dev
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 8c797c66583cbf5c13988df5a50652b5df9820b2

Comment by Gerrit Updater [ 05/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35372/
Subject: LU-12387 tests: Validate l_tunedisk max_sectors_kb tuning
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: bab0570ce3081c4a434126da2cb7a9bda1b2cdbb

Generated at Sat Feb 10 02:52:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.