[LU-9551] I/O errors when lustre uses multipath devices Created: 24/May/17 Updated: 03/Dec/21 Resolved: 13/Apr/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.10.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | xiangmin shen | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS Linux release 7.3.1611 (Core),OFED.3.4.2.0.0.1,lustre-2.7.19.8,Mellanox Technologies MT27500 Family |
||
| Epic/Theme: | centos7.3, lustre-2.7.19.8 |
| Severity: | 3 |
| Epic: | mount, server |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When Lustre servers have OSTs configured on multipath devices, I/O errors occur that can lead to a server crash. The following error appears in the system log: Followed by several I/O errors |
| Comments |
| Comment by Chris Hunter (Inactive) [ 20/Jun/17 ] |
|
The message "blk_cloned_rq_check_limits" is also seen on non-Lustre filesystems. It is believed to be caused by an upstream commit to the 4.3 kernel:
Feb. 14, 2016, 10:20 p.m. From: Hannes Reinecke <hare@suse.de>
commit bf4e6b4e757488dee1b6a581f49c7ac34cd217f8 upstream.
When a cloned request is retried on other queues it always needs to be checked against the queue limits of that queue. Otherwise the calculations for nr_phys_segments might be wrong, leading to a crash in scsi_init_sgtable(). |
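The size half of that check can be illustrated with a small model. This is only a sketch of the arithmetic, not kernel code: the function names are made up for illustration, and the real check also covers segment counts. It assumes the usual 512-byte sector unit and that max_sectors_kb is expressed in KiB.

```python
SECTOR_BYTES = 512  # the block layer counts request sizes in 512-byte sectors


def max_sectors(max_sectors_kb: int) -> int:
    """Convert a max_sectors_kb setting into a limit in 512-byte sectors."""
    return max_sectors_kb * 1024 // SECTOR_BYTES


def fits_queue(request_bytes: int, max_sectors_kb: int) -> bool:
    """Return True if a request of request_bytes fits the queue's size limit.

    Hypothetical helper: models only the size comparison, not segments.
    """
    request_sectors = request_bytes // SECTOR_BYTES
    return request_sectors <= max_sectors(max_sectors_kb)


if __name__ == "__main__":
    one_mib = 1 << 20
    # A 1 MiB request is accepted by a queue allowing 1024 KiB per request...
    print(fits_queue(one_mib, 1024))
    # ...but not by a path whose limit is 512 KiB.
    print(fits_queue(one_mib, 512))
```

Under this model, a request built against the multipath device's larger limit fails the check as soon as it is dispatched to a path with a smaller limit.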
| Comment by Malcolm Haak - NCI (Inactive) [ 27/Sep/17 ] |
|
We just hit this at ANU. The fix is to ensure that max_sectors_kb is 'large enough'. We had an issue where multipath was generating 1MB I/Os (as that's what Lustre was configured for) but the underlying /dev block devices had max_sectors_kb = 512. I'm not sure how that is possible, but it was resolved by adding a udev rule to set max_sectors_kb >= 1024 but < max_hw_sectors_kb. I'm not sure if this is actually a Lustre error or a multipath error, based on my reading of https://patchwork.kernel.org/patch/9140337/
EDIT: Interestingly, this was only seen months after the filesystem went into production.
EDIT: Yes, I know that patch is for ppc. The conversation was relevant. |
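The udev rule itself is not shown in this ticket. As a rough illustration of what such a rule has to do for each path device, here is a minimal sketch. It assumes the standard sysfs queue attributes max_sectors_kb and max_hw_sectors_kb; the function names, the default of 1024, and the choice to cap the value at the hardware limit are illustrative assumptions, not the rule used at ANU.

```python
from pathlib import Path


def read_kb(queue_dir: Path, name: str) -> int:
    """Read an integer queue attribute such as max_sectors_kb."""
    return int((queue_dir / name).read_text().strip())


def target_max_sectors_kb(wanted_kb: int, hw_limit_kb: int) -> int:
    """Pick the wanted value, capped at what the hardware reports."""
    return min(wanted_kb, hw_limit_kb)


def raise_max_sectors_kb(queue_dir: Path, wanted_kb: int = 1024) -> int:
    """Raise max_sectors_kb to the target if it is currently lower.

    queue_dir is a device's queue directory, e.g. /sys/block/sdb/queue.
    Returns the value in effect afterwards. Never lowers an existing value.
    """
    hw_limit = read_kb(queue_dir, "max_hw_sectors_kb")
    current = read_kb(queue_dir, "max_sectors_kb")
    target = target_max_sectors_kb(wanted_kb, hw_limit)
    if current < target:
        (queue_dir / "max_sectors_kb").write_text(f"{target}\n")
        return target
    return current
```

Calling `raise_max_sectors_kb(Path("/sys/block/sdb/queue"))` as root would apply this to a single path; `sdb` is a placeholder device name.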
| Comment by Chris Hunter (Inactive) [ 27/Sep/17 ] |
|
One possible workaround is described in |
| Comment by Malcolm Haak - NCI (Inactive) [ 28/Sep/17 ] |
|
Has that been backported into 2.7/IEEL3? I can see that it exists in 2.10 and Master. Also, it doesn't explain why we would get issues months after going live. The OSTs were mounted and were not remounted. |
| Comment by Malcolm Haak - NCI (Inactive) [ 28/Sep/17 ] |
|
Also, this might not fix it. Our issue seemed to come from the fact that the backing devices behind multipath had been reset to the default value of 512, not the multipath devices that Lustre was mounted on. Our udev rules only change the backing devices/paths, not the resulting dm-X devices Lustre is mounted on. Reading some of the discussions on the kernel.org threads, it seems that during failover between paths multipath can also do the wrong thing and check only max_hw_sectors_kb, not max_sectors_kb. Previously, this would not have been an issue, but with the extra checks it clearly is. |
| Comment by Malcolm Haak - NCI (Inactive) [ 16/Oct/17 ] |
|
The exact cause of our issues was discovered: Lustre had increased the values at mount, then some paths went away and came back, and they were set to default values on return. Prior to the kernel patch this would not have been an issue, so for us the udev rule enforcing the maximum on probe will resolve the issue. |
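A mismatch of this kind can be spotted by comparing each path's limit with that of the multipath device above it. The following is a minimal sketch, assuming the usual sysfs layout in which /sys/block/dm-N/slaves/ lists the path devices and each device has a queue/max_sectors_kb attribute; the function name and the dm-0 example are hypothetical.

```python
from pathlib import Path
from typing import Dict


def undersized_paths(dm_name: str,
                     sysfs_block: Path = Path("/sys/block")) -> Dict[str, int]:
    """Return {path_device: max_sectors_kb} for paths below the dm limit.

    A path that came back with the default limit shows up here when the
    multipath device on top of it still advertises the larger value.
    """
    dm_dir = sysfs_block / dm_name
    dm_limit = int((dm_dir / "queue" / "max_sectors_kb").read_text())
    result: Dict[str, int] = {}
    for slave in sorted((dm_dir / "slaves").iterdir()):
        limit = int((slave / "queue" / "max_sectors_kb").read_text())
        if limit < dm_limit:
            result[slave.name] = limit
    return result


if __name__ == "__main__":
    # dm-0 is a placeholder; substitute the dm device backing the OST.
    for dev, kb in undersized_paths("dm-0").items():
        print(f"{dev}: max_sectors_kb={kb} is below the multipath device limit")
```

An empty result means every path accepts requests as large as the multipath device allows.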
| Comment by Peter Jones [ 21/Dec/17 ] |
|
This is fixed in more current releases |
| Comment by Gerrit Updater [ 28/Feb/18 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/31464 |
| Comment by Gerrit Updater [ 09/Apr/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31464/ |
| Comment by Peter Jones [ 09/Apr/18 ] |
|
Landed for 2.12 |
| Comment by Gerrit Updater [ 11/Apr/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31951 |
| Comment by Chris Hunter (Inactive) [ 11/Apr/18 ] |
|
The old mount method in However, due to bugs in the transport protocol, this value can be wrong (https://patchwork.kernel.org/patch/7614871/; https://patchwork.kernel.org/patch/6662311/) and produce an error when used by the Lustre mount command. |
| Comment by Gerrit Updater [ 12/Apr/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31951/ |
| Comment by Minh Diep [ 13/Apr/18 ] |
|
This patch caused |
| Comment by Peter Jones [ 13/Apr/18 ] |
|
It looks like it is going to be fixed under |
| Comment by Nathaniel Clark [ 30/May/18 ] |
|
This got reverted on b2_10, but it didn't actually cause |
| Comment by Peter Jones [ 30/May/18 ] |
|
Yes we want to resubmit it |
| Comment by Gerrit Updater [ 30/May/18 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/32583 |
| Comment by Gerrit Updater [ 01/Aug/18 ] |
|
John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32583/ |