Details
- Bug
- Resolution: Fixed
- Critical
- Lustre 2.10.3
- None
- 2
Description
We have seen this before, in LU-5481. At the time we just removed MMP from the OST, because we didn't use host failover. But our new filesystem does use host failover. We are seeing the same error on ISER+T10PI-connected storage. The error can occur at mount time and at random times during I/O.
[ 3520.840977] mlx5_3:mlx5_poll_one:657:(pid 0): CQN: 0xc05 Got SIGERR on key: 0x80007b0b err_type 0 err_offset 207 expected 9b3c actual a13c
[ 3520.878451] PI error found type 0 at sector 1337928 expected 953c vs actual 9b3c
[ 3520.900800] PI error found type 0 at sector 1337928 expected 9b3c vs actual a13c
[ 3520.923968] blk_update_request: I/O error, dev sdai, sector 20150568
[ 3520.943377] blk_update_request: I/O error, dev sdae, sector 20150568
[ 3520.963067] blk_update_request: I/O error, dev dm-15, sector 20150568
[ 3520.982436] Buffer I/O error on dev dm-15, logical block 2518821, lost async page write
[ 3521.006511] Buffer I/O error on dev dm-15, logical block 2518822, lost async page write
[ 3521.006558] blk_update_request: I/O error, dev dm-13, sector 20150568
[ 3521.006559] Buffer I/O error on dev dm-13, logical block 2518821, lost async page write
[ 3521.006563] Buffer I/O error on dev dm-13, logical block 2518822, lost async

device /dev/dm-15 mounted by lustre
Filesystem volume name: nbp10-OST001d
Last mounted on: /
Filesystem UUID: 08b337bb-b3b1-48b0-925b-0bf5d3ba7253
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 9337344
Block count: 19122880512
Reserved block count: 0
Free blocks: 19120188065
Free inodes: 9337011
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16
Inode blocks per group: 2
Flex block group size: 64
Filesystem created: Fri Jul 27 10:21:56 2018
Last mount time: Fri Jul 27 10:44:14 2018
Last write time: Fri Jul 27 10:44:15 2018
Mount count: 4
Maximum mount count: -1
Last checked: Fri Jul 27 10:21:56 2018
Check interval: 0 (<none>)
Lifetime writes: 7774 kB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 512
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 2ebd542d-9757-456f-b597-43fae5c542c0
Journal backup: inode blocks
MMP block number: 2518821
MMP update interval: 5
User quota inode: 3
Group quota inode: 4
Note that the block with the error is the MMP block: the first failed logical block (2518821) is the MMP block number reported above.
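The match between the failed sector and the MMP block can be checked with a little arithmetic. The sketch below assumes that the sector numbers in the blk_update_request lines are in 512-byte units and that the filesystem starts at offset 0 of the dm device; the helper name sector_to_fs_block is made up for illustration.

```python
SECTOR_SIZE = 512   # bytes; assumed unit of the kernel's sector numbers
BLOCK_SIZE = 4096   # "Block size" from the dumpe2fs output


def sector_to_fs_block(sector, block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Return (filesystem block, sector offset inside that block)."""
    sectors_per_block = block_size // sector_size
    return divmod(sector, sectors_per_block)


failed_sector = 20150568   # from the blk_update_request lines
mmp_block = 2518821        # "MMP block number" from dumpe2fs

block, offset = sector_to_fs_block(failed_sector)
print(f"sector {failed_sector} -> block {block}, sector offset {offset}")
print("failed write starts at the MMP block:", block == mmp_block)
```

The second failed logical block in the log (2518822) is simply the block that follows the MMP block.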
Issue Links
- is duplicated by: LU-5481 mmp updates can some times fail T10PI checks (Resolved)
It is also confusing to me why there is no "Error writing to MMP block" message being printed in this case, since the write error should be propagated up to the caller with REQ_SYNC. It makes me start to wonder if this block write is being generated somewhere else in the code, and only the MMP code is overwriting the same block in place?
As mentioned previously, it might help to hexdump the MMP block contents in the low-level code, and print out the address of the buffer being written, so that we can see if it is the same page as was submitted by write_mmp_block() or some other copy.
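As a userspace aid for that comparison, here is a minimal sketch that decodes and hexdumps a captured copy of the MMP block and reports the first byte at which two copies differ. It is not the low-level kernel instrumentation suggested above. The field layout (magic, sequence, time, node name, device name, check interval, all little-endian) and the magic value 0x004D4D50 are assumptions based on my reading of the ext4 MMP structure and should be checked against the kernel headers; the node name "oss-a" and the sample values are invented.

```python
import struct

MMP_MAGIC = 0x004D4D50  # assumed value of the ext4 MMP magic
# Assumed layout: magic, seq, time, nodename[64], bdevname[32], check_interval.
# The trailing padding and checksum are not decoded here.
MMP_HEADER = struct.Struct("<IIQ64s32sH")


def decode_mmp(buf):
    """Decode the leading fields of an MMP block from raw bytes."""
    magic, seq, time_, node, bdev, interval = MMP_HEADER.unpack_from(buf, 0)
    return {
        "magic_ok": magic == MMP_MAGIC,
        "seq": seq,
        "time": time_,
        "nodename": node.split(b"\0", 1)[0].decode("ascii", "replace"),
        "bdevname": bdev.split(b"\0", 1)[0].decode("ascii", "replace"),
        "check_interval": interval,
    }


def hexdump(buf, length=128):
    """Return a plain hex dump of the first `length` bytes, 16 per line."""
    lines = []
    for off in range(0, min(length, len(buf)), 16):
        chunk = buf[off:off + 16]
        lines.append(f"{off:08x}  " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines)


def first_difference(a, b):
    """Return the offset of the first differing byte, or None if none differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


# Two synthetic 4096-byte copies of the block, one update apart (invented values).
before = MMP_HEADER.pack(MMP_MAGIC, 7, 1532713455, b"oss-a", b"dm-15", 5).ljust(4096, b"\0")
after = MMP_HEADER.pack(MMP_MAGIC, 8, 1532713460, b"oss-a", b"dm-15", 5).ljust(4096, b"\0")

print(decode_mmp(before))
print(hexdump(before, 32))
print("first differing byte:", first_difference(before, after))
```

If two dumps of the same write differ only in the sequence or time fields, that would be consistent with the buffer being updated in place while an earlier write was still in flight, which would also change the data the guard tag was computed over.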