[LU-9551] I/O errors when lustre uses multipath devices Created: 24/May/17  Updated: 03/Dec/21  Resolved: 13/Apr/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.12.0, Lustre 2.10.4

Type: Bug Priority: Critical
Reporter: xiangmin shen Assignee: Nathaniel Clark
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS Linux release 7.3.1611 (Core), OFED.3.4.2.0.0.1, lustre-2.7.19.8, Mellanox Technologies MT27500 Family


Attachments: Text File oss01.log    
Issue Links:
Duplicate
duplicates LU-9132 Tuning max_sectors_kb on mount Resolved
is duplicated by LU-9552 lustre uses multipath devices I/O errors Resolved
Related
is related to LU-10898 conf-sanity test 32a and 32d fail wit... Resolved
is related to LU-12029 do not try to muck with max_sectors_k... Open
is related to LU-12387 l_tunedisk does not properly handle m... Resolved
is related to LU-11563 99-lustre.rules on clients tries to e... Resolved
is related to LU-12530 udev add/change rule loads zfs module... Resolved
is related to LU-11736 Do not apply bulk IO tuning on MDT or... Resolved
Epic/Theme: centos7.3, lustre-2.7.19.8
Severity: 3
Epic: mount, server

 Description   

When the Lustre servers have OSTs configured on multipath devices, I/O errors occur that can lead to a server crash.

The following error appears in the system log:
Mar 31 00:02:44 oss01 kernel: blk_cloned_rq_check_limits: over max size limit.
Mar 31 00:02:44 oss01 kernel: device-mapper: multipath: Failing path 8:160.

This is followed by several I/O errors:
Mar 31 00:17:30 oss01 kernel: blk_update_request: I/O error, dev dm-17, sector 1182279680
Mar 31 00:17:30 oss01 kernel: blk_update_request: I/O error, dev dm-17, sector 1182291968
Mar 31 00:17:30 oss01 kernel: blk_update_request: I/O error, dev dm-17, sector 1182267392
Mar 31 00:17:30 oss01 kernel: blk_update_request: I/O error, dev dm-17, sector 1182304256
Mar 30 21:04:22 oss01 kernel: LDISKFS-fs (dm-17): Remounting filesystem read-only



 Comments   
Comment by Chris Hunter (Inactive) [ 20/Jun/17 ]

Message " blk_cloned_rq_check_limits" seen on non-lustre filesystems, believed caused by upstream commit to 4.3 kernel
https://patchwork.kernel.org/patch/8307491/

Feb. 14, 2016, 10:20 p.m.
From: Hannes Reinecke <hare@suse.de>
commit bf4e6b4e757488dee1b6a581f49c7ac34cd217f8 upstream.

When a cloned request is retried on other queues it always needs
to be checked against the queue limits of that queue.
Otherwise the calculations for nr_phys_segments might be wrong,
leading to a crash in scsi_init_sgtable().
Comment by Malcolm Haak - NCI (Inactive) [ 27/Sep/17 ]

We just hit this at ANU. The fix is to ensure that max_sectors_kb is 'large enough'.

We had an issue where multipath was generating 1 MB I/Os (as that's what Lustre was configured for), but the underlying /dev block devices had max_sectors_kb = 512.

I'm not sure how that is possible, but it was resolved by adding a udev rule that sets max_sectors_kb to at least 1024 while staying below max_hw_sectors_kb (a sketch of such a rule is included below).

I'm not sure if this is actually a Lustre error or a multipath error. Based on my reading of https://patchwork.kernel.org/patch/9140337/, this is resolved in a new enough kernel, but it seems that some patches might require backporting into CentOS/RHEL.

EDIT: Interestingly, this was only seen months after the filesystem went into production.

EDIT: Yes, I know that patch is for ppc, but the conversation was relevant.
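
A minimal sketch of the kind of udev rule we used (the device match and the 1024 value are illustrative, not our exact rule; note the write fails if the value exceeds max_hw_sectors_kb, so check that first):

# /etc/udev/rules.d/99-max-sectors.rules (illustrative example)
# Raise max_sectors_kb on whole SCSI disks so 1 MB I/Os pass the queue-limit check.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*[!0-9]", ATTR{queue/max_sectors_kb}="1024"

Reload with "udevadm control --reload" and re-trigger with "udevadm trigger --subsystem-match=block" so the rule also applies to already-present paths.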

Comment by Chris Hunter (Inactive) [ 27/Sep/17 ]

One possible workaround is described in LU-9132: setting the environment variable MOUNT_LUSTRE_MAX_SECTORS_KB=0 stops mount.lustre from changing max_sectors_kb when mounting OSTs, so the OSTs retain the max_sectors_kb value set by your udev rules.
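
For example (device and mount-point names are only illustrative, and this assumes the variable is inherited from the mount command's environment by the mount.lustre helper):

MOUNT_LUSTRE_MAX_SECTORS_KB=0 mount -t lustre /dev/mapper/ost0 /mnt/lustre/ost0

With the variable set to 0, mount.lustre leaves max_sectors_kb alone and the udev-configured value stays in effect.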

Comment by Malcolm Haak - NCI (Inactive) [ 28/Sep/17 ]

Has that been backported into 2.7/IEEL3?

I can see that it exists in 2.10 and Master.

Also, it doesn't explain why we would get issues months after going live. The OSTs were mounted and were not remounted.

Comment by Malcolm Haak - NCI (Inactive) [ 28/Sep/17 ]

Also, this might not fix it. Our issue seemed to come from the backing devices behind multipath having been reset to the default 512 value, not from the multipath devices that Lustre was mounted on.

Our udev rules only change the backing devices/paths, not the resulting dm-X devices that Lustre is mounted on.

Reading some of the discussions in the kernel.org threads, it also seems that during failover between paths multipath can do the wrong thing and check only max_hw_sectors_kb instead of max_sectors_kb.

Previously this would not have been an issue, but with the extra checks it clearly is.
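
For anyone checking for this condition, here is a quick sketch that prints the dm device's limits alongside those of each underlying path (dm-17 is just the device from the log above; substitute your own dm device):

for d in /sys/block/dm-17 /sys/block/dm-17/slaves/*; do
    echo "$(basename $d): max_sectors_kb=$(cat $d/queue/max_sectors_kb) max_hw_sectors_kb=$(cat $d/queue/max_hw_sectors_kb)"
done

If any path reports a smaller max_sectors_kb than the dm device itself, cloned requests can trip the blk_cloned_rq_check_limits check described above.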

Comment by Malcolm Haak - NCI (Inactive) [ 16/Oct/17 ]

The exact cause of our issues was discovered:

Lustre had increased the values at mount time; some paths then went away and came back, and they were reset to the default values on their return.

Prior to the kernel patch this would not have been an issue, so for us the udev rule enforcing the maximum on probe will resolve it.

Comment by Peter Jones [ 21/Dec/17 ]

This is fixed in more current releases

Comment by Gerrit Updater [ 28/Feb/18 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/31464
Subject: LU-9551 utils: add l_tunedisk to fix disk tunings
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 681f208b5ec25a12eeac5a7c1cea238154ffd6ff

Comment by Gerrit Updater [ 09/Apr/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31464/
Subject: LU-9551 utils: add l_tunedisk to fix disk tunings
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 892280742a2b6347df1464379b3ed223b2961ed4

Comment by Peter Jones [ 09/Apr/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 11/Apr/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31951
Subject: LU-9551 utils: add l_tunedisk to fix disk tunings
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: b30fb047c12c6354df2d81e2a0cd5dd21852f6b3

Comment by Chris Hunter (Inactive) [ 11/Apr/18 ]

The old mount method from LU-275 sets the value from the sysfs block parameter max_hw_sectors_kb.

However, due to bugs in the transport protocol this value can be wrong (https://patchwork.kernel.org/patch/7614871/; https://patchwork.kernel.org/patch/6662311/) and produce an error when used by the Lustre mount command.
The feature in LU-9132 to adjust the mount behaviour would be useful in this scenario.

Comment by Gerrit Updater [ 12/Apr/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31951/
Subject: LU-9551 utils: add l_tunedisk to fix disk tunings
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 3281d5d57cec9d6deaa50cb4d9ec9509e3d03507

Comment by Minh Diep [ 13/Apr/18 ]

This patch caused LU-10898

Comment by Peter Jones [ 13/Apr/18 ]

It looks like it is going to be fixed under LU-10898 rather than reverted, so keeping this as resolved.

Comment by Nathaniel Clark [ 30/May/18 ]

This got reverted on b2_10, but it didn't actually cause LU-10898 (afaik).  ZED holds zfs open if it's running.  Can we re-land this?  Should I resubmit?

Comment by Peter Jones [ 30/May/18 ]

Yes we want to resubmit it

Comment by Gerrit Updater [ 30/May/18 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/32583
Subject: LU-9551 utils: add l_tunedisk to fix disk tunings
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 1743fa638e8fdbe16e6cfd33dd91c24fa5047492

Comment by Gerrit Updater [ 01/Aug/18 ]

John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32583/
Subject: LU-9551 utils: add l_tunedisk to fix disk tunings
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 703d418908fa32f60decc3bd535e77784d2721c6
