Details
- Bug
- Resolution: Fixed
- Critical
- Lustre 2.10.3
- None
- 2
Description
We have seen this before, in LU-5481. At the time we simply removed MMP from the OST because we didn't use host failover, but our new filesystem does use host failover. We are now seeing the same error on iSER + T10-PI connected storage. The error can happen at mount time and at random times during I/O.
[ 3520.840977] mlx5_3:mlx5_poll_one:657:(pid 0): CQN: 0xc05 Got SIGERR on key: 0x80007b0b err_type 0 err_offset 207 expected 9b3c actual a13c
[ 3520.878451] PI error found type 0 at sector 1337928 expected 953c vs actual 9b3c
[ 3520.900800] PI error found type 0 at sector 1337928 expected 9b3c vs actual a13c
[ 3520.923968] blk_update_request: I/O error, dev sdai, sector 20150568
[ 3520.943377] blk_update_request: I/O error, dev sdae, sector 20150568
[ 3520.963067] blk_update_request: I/O error, dev dm-15, sector 20150568
[ 3520.982436] Buffer I/O error on dev dm-15, logical block 2518821, lost async page write
[ 3521.006511] Buffer I/O error on dev dm-15, logical block 2518822, lost async page write
[ 3521.006558] blk_update_request: I/O error, dev dm-13, sector 20150568
[ 3521.006559] Buffer I/O error on dev dm-13, logical block 2518821, lost async page write
[ 3521.006563] Buffer I/O error on dev dm-13, logical block 2518822, lost async

device /dev/dm-15 mounted by lustre
Filesystem volume name: nbp10-OST001d
Last mounted on: /
Filesystem UUID: 08b337bb-b3b1-48b0-925b-0bf5d3ba7253
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 9337344
Block count: 19122880512
Reserved block count: 0
Free blocks: 19120188065
Free inodes: 9337011
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16
Inode blocks per group: 2
Flex block group size: 64
Filesystem created: Fri Jul 27 10:21:56 2018
Last mount time: Fri Jul 27 10:44:14 2018
Last write time: Fri Jul 27 10:44:15 2018
Mount count: 4
Maximum mount count: -1
Last checked: Fri Jul 27 10:21:56 2018
Check interval: 0 (<none>)
Lifetime writes: 7774 kB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 512
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 2ebd542d-9757-456f-b597-43fae5c542c0
Journal backup: inode blocks
MMP block number: 2518821
MMP update interval: 5
User quota inode: 3
Group quota inode: 4
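The "expected ... actual ..." pairs above are 16-bit protection-information values. Assuming err_type 0 is a guard-tag mismatch (IB_SIG_BAD_GUARD is the first value of the kernel's enum ib_sig_err_type), the guard is normally a CRC-16 over each 512-byte sector using the T10-DIF polynomial 0x8BB7 (some setups use an IP checksum instead). The sketch below is only an illustration of one way such a mismatch can arise, not a diagnosis of this bug: if the buffer changes after the guard was generated but before the data reaches the target, the CRC of the data no longer matches the guard. The sector contents and the "sequence byte" are hypothetical.

```python
def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16/T10-DIF: poly 0x8BB7, init 0, no reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            # Shift left; fold in the polynomial when the top bit falls off.
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


# Hypothetical 512-byte sector; byte 4 stands in for a changing sequence field.
sector = bytearray(512)
sector[4] = 7
guard_at_submit = crc16_t10dif(bytes(sector))  # guard generated when the write is prepared

# The buffer is modified in place before the transfer completes.
sector[4] = 8
guard_of_data_sent = crc16_t10dif(bytes(sector))  # what the target recomputes over the data
```

Because a CRC-16 detects every error burst of 16 bits or fewer, changing a single byte is always guaranteed to produce a different guard, so the two values above cannot match.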
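Note that the block with the error is the MMP block: the failing logical block 2518821 (512-byte sector 20150568 divided by 8 sectors per 4096-byte block) matches "MMP block number: 2518821" in the dumpe2fs output.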
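The sector-to-block arithmetic can be checked directly. Kernel sector numbers in blk_update_request are in 512-byte units, and the filesystem block size is 4096 bytes. The sketch also assumes that the iSER "PI error found ... at sector 1337928" value is printed in hexadecimal; if so, it is the same sector that blk_update_request reports in decimal, which ties the T10-PI error to the MMP block as well.

```python
SECTOR_SIZE = 512    # kernel sector units used by blk_update_request
BLOCK_SIZE = 4096    # "Block size: 4096" from dumpe2fs
MMP_BLOCK = 2518821  # "MMP block number: 2518821" from dumpe2fs


def sector_to_fs_block(sector: int) -> int:
    """Convert a 512-byte kernel sector number to a filesystem block number."""
    return sector * SECTOR_SIZE // BLOCK_SIZE


# Assumption: the iSER PI error message prints the sector in hex.
pi_sector = int("1337928", 16)
# Decimal sector from "blk_update_request: I/O error, dev dm-15, sector 20150568".
io_sector = 20150568

print(pi_sector == io_sector, sector_to_fs_block(io_sector) == MMP_BLOCK)
```

Both comparisons hold (0x1337928 is 20150568, and 20150568 / 8 is 2518821), so all three log sources point at the same on-disk block.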
Issue Links
- is duplicated by: LU-5481 mmp updates can some times fail T10PI checks (Resolved)
One thing that is puzzling is the "lost async page write" error message, since the REQ_SYNC flag should force the write to be synchronous. I wonder if this is an artifact of the DM Multipath code submitting sync writes asynchronously, so that it isn't blocked waiting for completion if one of the paths fails. That would lend more weight to trying to reproduce this problem without the DM Multipath driver involved. If the problem goes away, you can contact Red Hat about it: MMP and ext4 exist in the upstream kernel, and we do not modify MMP in recent releases, so it should be reproducible without Lustre (given a sufficiently similar I/O workload).
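For a reproduction attempt outside Lustre, a rough userspace approximation of the MMP access pattern (one block rewritten with a changing sequence number and synced at a fixed interval) might serve as a starting workload. This is a hypothetical sketch: the function name, the payload layout, and the parameters are assumptions, and buffered pwrite() plus fsync() from userspace does not replicate how the kernel submits the MMP buffer head, so it only exercises the I/O pattern. Point it only at a scratch file or scratch device.

```python
import os
import struct
import time

BLOCK_SIZE = 4096  # matches "Block size: 4096" in the dumpe2fs output


def hammer_block(path: str, block: int, iterations: int, interval: float = 5.0) -> int:
    """Hypothetical MMP-like workload: rewrite one block repeatedly with an
    incrementing sequence number, syncing after every write.

    Returns the last sequence number written (0 if iterations is 0).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        seq = 0
        for seq in range(1, iterations + 1):
            # Assumed layout: little-endian u32 counter followed by zero padding.
            payload = struct.pack("<I", seq).ljust(BLOCK_SIZE, b"\0")
            os.pwrite(fd, payload, block * BLOCK_SIZE)
            os.fsync(fd)  # push each update to storage before the next one
            time.sleep(interval)  # default 5 s mirrors "MMP update interval: 5"
        return seq
    finally:
        os.close(fd)
```

Running it, for example, as `hammer_block("/path/to/scratch.img", block=2518821, iterations=1000)` against a filesystem on a single path (no DM Multipath) and then on a multipath device would give a like-for-like comparison.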