Lustre / LU-11187

MMP update sometimes fails T10-PI checks

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.7
    • Affects Version/s: Lustre 2.10.3
    • Labels: None
    • Severity: 2

    Description

      We had seen this before (LU-5481). At the time we just removed MMP from the OSTs, because we didn't use host failover. But our new filesystem does use host failover. We are seeing the same error on iSER+T10-PI connected storage. The error can happen at mount time and at random times during IO.

       [ 3520.840977] mlx5_3:mlx5_poll_one:657:(pid 0): CQN: 0xc05 Got SIGERR on key: 0x80007b0b err_type 0 err_offset 207 expected 9b3c actual a13c
      [ 3520.878451] PI error found type 0 at sector 1337928 expected 953c vs actual 9b3c
      [ 3520.900800] PI error found type 0 at sector 1337928 expected 9b3c vs actual a13c
      [ 3520.923968] blk_update_request: I/O error, dev sdai, sector 20150568
      [ 3520.943377] blk_update_request: I/O error, dev sdae, sector 20150568
      [ 3520.963067] blk_update_request: I/O error, dev dm-15, sector 20150568
      [ 3520.982436] Buffer I/O error on dev dm-15, logical block 2518821, lost async page write
      [ 3521.006511] Buffer I/O error on dev dm-15, logical block 2518822, lost async page write
      [ 3521.006558] blk_update_request: I/O error, dev dm-13, sector 20150568
      [ 3521.006559] Buffer I/O error on dev dm-13, logical block 2518821, lost async page write
      [ 3521.006563] Buffer I/O error on dev dm-13, logical block 2518822, lost async 
      
      device /dev/dm-15 mounted by lustre
      Filesystem volume name:   nbp10-OST001d
      Last mounted on:          /
      Filesystem UUID:          08b337bb-b3b1-48b0-925b-0bf5d3ba7253
      Filesystem magic number:  0xEF53
      Filesystem revision #:    1 (dynamic)
      Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
      Filesystem flags:         signed_directory_hash 
      Default mount options:    user_xattr acl
      Filesystem state:         clean
      Errors behavior:          Continue
      Filesystem OS type:       Linux
      Inode count:              9337344
      Block count:              19122880512
      Reserved block count:     0
      Free blocks:              19120188065
      Free inodes:              9337011
      First block:              0
      Block size:               4096
      Fragment size:            4096
      Group descriptor size:    64
      Blocks per group:         32768
      Fragments per group:      32768
      Inodes per group:         16
      Inode blocks per group:   2
      Flex block group size:    64
      Filesystem created:       Fri Jul 27 10:21:56 2018
      Last mount time:          Fri Jul 27 10:44:14 2018
      Last write time:          Fri Jul 27 10:44:15 2018
      Mount count:              4
      Maximum mount count:      -1
      Last checked:             Fri Jul 27 10:21:56 2018
      Check interval:           0 (<none>)
      Lifetime writes:          7774 kB
      Reserved blocks uid:      0 (user root)
      Reserved blocks gid:      0 (group root)
      First inode:              11
      Inode size:               512
      Required extra isize:     32
      Desired extra isize:      32
      Journal inode:            8
      Default directory hash:   half_md4
      Directory Hash Seed:      2ebd542d-9757-456f-b597-43fae5c542c0
      Journal backup:           inode blocks
      MMP block number:         2518821
      MMP update interval:      5
      User quota inode:         3
      Group quota inode:        4
      

      Note: the block with the error (logical block 2518821) is the MMP block.
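
      For context, the MMP block is rewritten continuously while the filesystem is mounted: the kmmpd thread bumps a sequence number and timestamp in it every MMP update interval (5 seconds here), so this one block sees a steady stream of small writes. A rough sketch of the on-disk layout follows, adapted from memory from the struct mmp_struct definition in fs/ext4/ext4.h and shown with userspace fixed-width types so it compiles standalone; treat field names and sizes as approximate.

      #include <stdint.h>

      /* Approximate layout of the ext4 MMP block (struct mmp_struct in
       * fs/ext4/ext4.h).  kmmpd rewrites mmp_seq and mmp_time every
       * mmp_check_interval seconds, so the block is under near-constant
       * modification while the filesystem is mounted. */
      struct mmp_block {
          uint32_t mmp_magic;            /* 0x004D4D50 ("MMP") */
          uint32_t mmp_seq;              /* sequence number, bumped each update */
          uint64_t mmp_time;             /* time of the last update */
          char     mmp_nodename[64];     /* node which last updated the block */
          char     mmp_bdevname[32];     /* block device name on that node */
          uint16_t mmp_check_interval;   /* update interval in seconds (5 here) */
          uint16_t mmp_pad1;
          uint32_t mmp_pad2[226];        /* pads the structure to 1024 bytes */
          uint32_t mmp_checksum;         /* crc32c, with metadata_csum enabled */
      };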

      Attachments

        1. dm20.hexdump
          1.76 MB
        2. trace.dat
          4.73 MB


          Activity

            [LU-11187] MMP update sometimes fails T10-PI checks
            dongyang Dongyang Li added a comment -

            This actually shows that blkdev writeback kicked in and wrote the MMP block, which explains where the "lost async page write" messages come from.

            I think the MMP thread should have total control of the MMP block; we don't want writeback to kick in and mess with it. I also suspect the checksum error happens because, while the block was submitted by writeback and still under I/O, the MMP thread started to modify its contents, stepping on the toes of the writeback.
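
            To make the suspected race concrete, here is a toy, single-threaded C sketch (purely illustrative: the checksum and field offsets are made up, and real T10-PI uses a CRC16 guard per 512-byte sector). The "guard" is computed when writeback queues the buffer, kmmpd then modifies the buffer in place, and the check done at the device no longer matches.

            #include <stdint.h>
            #include <stdio.h>
            #include <string.h>

            #define BLOCK_SIZE 4096

            /* Toy stand-in for the T10-PI guard tag computed over the data. */
            static uint16_t toy_guard(const uint8_t *buf, size_t len)
            {
                uint16_t sum = 0;
                for (size_t i = 0; i < len; i++)
                    sum = (uint16_t)(sum * 31 + buf[i]);
                return sum;
            }

            int main(void)
            {
                uint8_t mmp_block[BLOCK_SIZE] = { 0 };
                uint32_t seq = 1;

                memcpy(&mmp_block[4], &seq, sizeof(seq));  /* pretend mmp_seq */

                /* Writeback path: guard computed when the bio is submitted. */
                uint16_t expected = toy_guard(mmp_block, BLOCK_SIZE);

                /* kmmpd updates the block while the write is still in flight. */
                seq++;
                memcpy(&mmp_block[4], &seq, sizeof(seq));

                /* The device reads the (now modified) page and re-checks. */
                uint16_t actual = toy_guard(mmp_block, BLOCK_SIZE);

                if (actual != expected)
                    printf("PI error: expected %04x vs actual %04x\n",
                           expected, actual);
                return 0;
            }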

            I've updated https://review.whamcloud.com/#/c/33038/  patchset 6, can you give it a try?

            The patch is simple; you can also apply it to vanilla ext4 on CentOS 7 and test it from there.


            mhanafi Mahmoud Hanafi added a comment -

            I was able to reproduce this issue with ext4 on both the Lustre kernel and on vanilla CentOS 7 (3.10.0-862.11.6.el7.x86_64). I was also able to capture the error using trace-cmd.

            Error reported at the console

            [ 5246.256578] mlx5_2:mlx5_poll_one:671:(pid 0): CQN: 0x405 Got SIGERR on key: 0x800015f5 err_type 0 err_offset 207 expected 12a8 actual 18a8
            [ 5246.294172] PI error found type 0 at sector 12528 expected 12a8 vs actual 18a8
            [ 5246.325235] blk_update_request: I/O error, dev sdq, sector 75048
            [ 5246.359541] blk_update_request: I/O error, dev dm-14, sector 75048
            [ 5246.378121] Buffer I/O error on dev dm-14, logical block 9381, lost async page write
            

            DM-14 is 253,14 and from the trace-cmd report we have

            => ret_from_fork_nospec_begin (ffffffff8b7255dd)
              kworker/u72:10-11060 [001]  5246.065529: block_bio_queue:      253,14 W 75048 + 8 [kworker/u72:10]
              kworker/u72:10-11060 [001]  5246.065533: kernel_stack:         <stack trace>
            => _submit_bh (ffffffff8b2573d7)
            => __block_write_full_page (ffffffff8b257652)
            => block_write_full_page (ffffffff8b257a1e)
            => blkdev_writepage (ffffffff8b25d828)
            => __writepage (ffffffff8b1a1c49)
            => write_cache_pages (ffffffff8b1a2744)
            => generic_writepages (ffffffff8b1a2a1d)
            => blkdev_writepages (ffffffff8b25d7e5)
            => do_writepages (ffffffff8b1a3ac1)
            => __writeback_single_inode (ffffffff8b24cf00)
            => writeback_sb_inodes (ffffffff8b24d994)
            => __writeback_inodes_wb (ffffffff8b24dcff)
            => wb_writeback (ffffffff8b24e533)
            => bdi_writeback_workfn (ffffffff8b24eebb)
            => process_one_work (ffffffff8b0b613f)
            => worker_thread (ffffffff8b0b71d6)
            => kthread (ffffffff8b0bdf21)
            
            => ret_from_fork_nospec_begin (ffffffff8b7255dd)
                 kmmpd-dm-24-6248  [007]  5246.324427: block_bio_queue:      253,24 WSM 75048 + 8 [kmmpd-dm-24]
                 kmmpd-dm-24-6248  [007]  5246.324443: kernel_stack:         <stack trace>
            => _submit_bh (ffffffff8b2573d7)
            => submit_bh (ffffffff8b257420)
            => write_mmp_block (ffffffffc058fdb1)
            => kmmpd (ffffffffc0590028)
            => kthread (ffffffff8b0bdf21)
            
            => ret_from_fork_nospec_begin (ffffffff8b7255dd)
                 kmmpd-dm-14-6186  [007]  5246.401415: block_bio_queue:      253,14 WSM 75048 + 8 [kmmpd-dm-14]
                 kmmpd-dm-14-6186  [007]  5246.401419: kernel_stack:         <stack trace>
            => _submit_bh (ffffffff8b2573d7)
            => submit_bh (ffffffff8b257420)
            => write_mmp_block (ffffffffc058fdb1)
            => kmmpd (ffffffffc0590028)
            => kthread (ffffffff8b0bdf21)
            

            I am also attaching the trace.dat file.

            trace.dat

            dongyang Dongyang Li added a comment -

            Maybe we can figure out who submitted the write to the block by using trace-cmd with something like this:

            trace-cmd record -e block_bio_queue -f sector==20150568 -T

            and then try to reproduce the I/O error, without multipath, to avoid generating too many messages.

            The result can then be viewed with trace-cmd report.


            adilger Andreas Dilger added a comment -

            It is also confusing to me why there is no "Error writing to MMP block" message being printed in this case, since the write error should be propagated up to the caller with REQ_SYNC. It makes me wonder whether this block write is being generated somewhere else in the code, with only the MMP code overwriting the same block in place?

            As mentioned previously, it might help to hexdump the MMP block contents in the low-level code, and print out the address of the buffer being written, so that we can see if it is the same page as was submitted by write_mmp_block() or some other copy.
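
            A minimal sketch of the kind of instrumentation being suggested, to be dropped into write_mmp_block() in fs/ext4/mmp.c just before the buffer is submitted (illustrative only, and not the debug patch from https://review.whamcloud.com/#/c/33038/; field and function names are assumed from the RHEL7-era kernel):

            /* Hypothetical debug fragment for write_mmp_block() in fs/ext4/mmp.c:
             * log the buffer_head, data and page addresses and dump the first
             * bytes of the MMP block just before submission, so a later PI
             * failure can be compared against what was actually in memory. */
            printk(KERN_INFO "EXT4 MMP: dev %s block %llu bh %p b_data %p page %p\n",
                   sb->s_id, (unsigned long long)bh->b_blocknr,
                   bh, bh->b_data, bh->b_page);
            print_hex_dump(KERN_INFO, "mmp: ", DUMP_PREFIX_OFFSET,
                           16, 1, bh->b_data, 64, true);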


            mhanafi Mahmoud Hanafi added a comment -

            I ran with OSTs mounted directly on the slave device. Disabled multipath and flushed all the paths. It still got an error.

             [ 7087.139668] mlx5_2:mlx5_poll_one:671:(pid 0): CQN: 0x405 Got SIGERR on key: 0x8000c273 err_type 0 err_offset 207 expected 948f actual 9a8f
            [ 7087.200807] PI error found type 0 at sector 1337928 expected 948f vs actual 9a8f
            [ 7087.223066] blk_update_request: I/O error, dev sdac, sector 20150568
            [ 7087.242167] Buffer I/O error on dev sdac, logical block 2518821, lost async page write
            
             
            mhanafi Mahmoud Hanafi added a comment (edited) -

            We are only using DM. I will test without DM.

            If the write fails as logged, why doesn't the MMP update log an error?


            adilger Andreas Dilger added a comment -

            One thing that is puzzling is the error message "lost async page write", since the REQ_SYNC flag should be forcing the write to be synchronous. I wonder if this is an artifact of the DM multipath code submitting sync writes asynchronously, so that it isn't blocked waiting for completion if one of the paths fails? That would lend more weight to trying to reproduce this problem without the DM multipath driver involved. If the problem goes away, you can contact Red Hat about this issue, since MMP and ext4 exist in the upstream kernel and we do not modify MMP in recent releases, so it should be reproducible without Lustre (given a sufficiently similar IO workload).


            adilger Andreas Dilger added a comment -

            Mahmoud, that indicates that the T10-PI layer detected some kind of corruption, but the block was not modified in memory. This would indicate corruption at some lower layer, though it is confusing why only the MMP block is affected.

            One option would be to add a similar hex dump at the point where the error is being reported after the checksum failure, to see if the data is different somehow? It also makes sense to print out the buffer address, to see if a copy of the page is being used or something, possibly caused by the device mapper layer.

            Are you using software RAID or similar on this system, or is DM only in use for multipath? Have you tried disabling multipath to see if the problems go away? The only other thing I can think of is to remove the REQ_SYNC | REQ_META | REQ_PRIO flags from submit_bh() one at a time to see if this makes a difference, as it might indicate where the problem is located. Removing the REQ_SYNC flag may cause legitimate failures if the system is very busy, since the MMP thread would only be waiting on the 5s MMP update interval and not the actual write completion.
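
            For reference, the submission path being discussed looks roughly like this in the RHEL7-era write_mmp_block() (a sketch from memory, not an exact copy of fs/ext4/mmp.c); the experiment is to drop one flag at a time from the submit_bh() call:

            lock_buffer(bh);
            bh->b_end_io = end_buffer_write_sync;
            get_bh(bh);
            /* Experiment: drop WRITE_SYNC (REQ_SYNC), REQ_META or REQ_PRIO
             * one at a time and see whether the PI failures change. */
            submit_bh(WRITE_SYNC | REQ_META | REQ_PRIO, bh);
            wait_on_buffer(bh);
            if (unlikely(!buffer_uptodate(bh)))
                return -EIO;  /* kmmpd would then report "Error writing to MMP block" */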

            At this point the problem looks to be outside the scope of Lustre/ext4, so I'm not sure what else we can do.


            mhanafi Mahmoud Hanafi added a comment -

            I ran with the latest patch set 3. Reproduced the PI error, but there was no output from the debug patch.

            [  988.391944] mlx5_3:mlx5_poll_one:671:(pid 0): CQN: 0xc05 Got SIGERR on key: 0x80009d6b err_type 0 err_offset 207 expected 24f1 actual 2af1
            [  988.430812] PI error found type 0 at sector 1337928 expected 24f1 vs actual 2af1
            [  988.453087] blk_update_request: I/O error, dev sdam, sector 20150568
            [  988.472382] blk_update_request: I/O error, dev dm-18, sector 20150568
            [  988.491746] Buffer I/O error on dev dm-18, logical block 2518821, lost async page write
            
            [  996.218102] mlx5_2:mlx5_poll_one:671:(pid 0): CQN: 0x405 Got SIGERR on key: 0x80003110 err_type 0 err_offset 207 expected 9377 actual 9977
            [  996.260232] PI error found type 0 at sector 1337928 expected 9377 vs actual 9977
            [  996.282509] blk_update_request: I/O error, dev sdu, sector 20150568
            [  996.301422] blk_update_request: I/O error, dev dm-9, sector 20150568
            [  996.320524] Buffer I/O error on dev dm-9, logical block 2518821, lost async page write
            
            dongyang Dongyang Li added a comment -

            OK, I've made a stupid mistake in the debug patch. Mahmoud, can you rerun it using the latest patchset 3? My apologies.

            mhanafi Mahmoud Hanafi added a comment (edited) -

            dm20.hexdump

             

            Attaching hexdump for one of the devices.

             I was running the deadline I/O scheduler; I changed it to noop, with no difference in the MMP failures.


            People

              Assignee: dongyang Dongyang Li
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 8
