Lustre / LU-11697

BAD WRITE CHECKSUM with t10ip4K and t10ip512 checksums

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.12.0
    • Severity: 3

    Description

      I'm running racer and quite often get the following messages; I think Oleg can confirm this.
      If I set checksum_type to adler, racer runs fine with not a single bad checksum.

      LustreError: 168-f: lustre-OST0000: BAD WRITE CHECKSUM: from 12345-0@lo inode [0x200000402:0x3b89:0x0] object 0x0:1921 extent [326-4194303]: client csum 53f1f04e, server csum 9bf3f057
      LustreError: 132-0: lustre-OST0000-osc-ffff8801d25e8800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 0@lo inode [0x200000402:0x3b89:0x0] object 0x0:1921 extent [326-4194303], original client csum 53f1f04e (type 20), server csum 9bf3f057 (type 20), client csum now 53f1f04e

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.12


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33752/
            Subject: LU-11697 ost: do not reuse T10PI guards of unaligned page write
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6d8a7af1cd65409bdb5a307e76090ed460d5a6f6


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33727/
            Subject: LU-11697 osc: wrong page offset for T10PI checksum
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c1f0520554460237e835a3ad37e7d3baa0dca937


            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33768
            Subject: LU-11697 osc: calc checksum at right offset
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 425fd09d48a3d564324d36fa52e31191845eda6e


            hongchao.zhang Hongchao Zhang added a comment -

            The checksum mismatch issue also exists for the LDiskFS backend.
            lixi_wc Li Xi added a comment - - edited

            Hi Andreas, I agree on both the long-term ideal solution and the 2.12 fix, and agree that the 2.12 implementation might be enough for most of the cases. The ideal solution is complex enough to be error prone, and I doubt we could establish its correctness and usefulness unless we have fairly elaborate fault-injection code.

            Anyway, I will refresh the https://review.whamcloud.com/33752 patch.

            adilger Andreas Dilger added a comment - - edited

            There is still a problem with T10-PI checksums for partial pages, and I don't think that https://review.whamcloud.com/33752 is fixing the problem properly. That patch makes the T10-PI checksums consistent between the client and the server, but runs the risk of corrupting the data in the rest of the page that is being checksummed. Also, the main problem is that a T10-PI checksum on a partial sector is never going to be correct when sent to the underlying storage.

            What needs to be done is to have two separate T10-PI checksums for the first and last sectors if they are partially overwritten. One checksum on the partial sector is needed to compute the checksum-of-checksums and match it against the RPC checksum. A second checksum needs to be computed on the whole sector after it is modified, in order to submit it to disk. If only whole sectors are being modified (e.g. 10240-byte writes with a 512-byte sector size), the partial-sector handling is not needed.

            In order to ensure proper end-to-end integrity for the checksums, when the first or last partial page has been read from disk, but before the overwrite is done, a partial checksum should be computed for the areas that are not going to be overwritten (start of the first partial sector, end of the last partial sector). Ideally, the full sector would be checksummed again afterward to verify that it still matches the original full-sector checksum, confirming that the partial-sector checksum was computed on correct data. Then, after the BRW/RDMA has overwritten the partial page, a full-sector checksum needs to be computed from the modified data. Next, the first partial-sector checksum should be verified to match the value computed originally, and the overwritten range's checksum should be verified against the value sent from the client. That ensures the two (non-overlapping) parts of the sector match the original data, and that the new full-sector checksum was computed over valid data. The full-sector checksum should be used when submitting the bio.

            For 2.12 it may be enough to read the page, apply the write, compute the checksum on the full sector for bio submission, and then verify that the partial-sector checksum matches the checksum from the client. This ensures the data sent from the network is valid, with an overlap for the full-sector checksum. The risk of corrupting the partial sector in RAM between the read/verify from disk and the overwrite is very small, and if this is happening then it is likely we would see other corruption on the full-sector writes as well. However, doing the extra partial-sector checksum for the unmodified part of the sector (as described above) would catch the case if the partial-sector overwrite is done incorrectly.

            lixi_wc Li Xi added a comment -

            Hmm, I think https://review.whamcloud.com/#/c/32899 is still needed, for the reason I gave above. And I think the key point here is not the line "lnb[i].lnb_page_offset = poff;" —
            it could be changed to "lnb[i].lnb_page_offset = 0;" and the checksum would still be the same. The most important code is:

            				plen = min_t(int, poff + sz_in_block,
            					     PAGE_SIZE);
            				plen -= poff;


            This fixes the problem of merging two pages into a single one.

            lixi_wc Li Xi added a comment -

            I tend to believe the 4K T10PI checksum doesn't support this property, since it calculates over the whole page for each sector. Thus, merging data belonging to two pages into one on the server side will corrupt the checksum for sure...

            lixi_wc Li Xi added a comment - - edited

            Hi Alex, I guess this might be related to the characteristics of the checksum algorithms? I am not an expert on this, but I guess not all checksum types generate the same value for the following two processes:

            1) Calculate the checksum of 3K bytes and then accumulate the checksum of the remaining 1K bytes
            2) Calculate the checksum of the 4K bytes of data in one pass

            Even though 3K bytes of data + 1K bytes of data = 4K bytes of data, the checksum values of 1) and 2) might differ for some checksum types, right? I am not sure whether the T10PI checksums, crc32, or crc32c support this property, but let's assume there is an algorithm that doesn't.

            Then, if we don't apply the patch from LU-10683, two pages of 1K bytes and 3K bytes on the client side might become one page of 4K bytes on the server side, and the checksum calculation might differ between the two cases.

            The problem in LU-10683 shows up with the crc32c checksum, right? The checksum type there is OBD_CKSUM_CRC32C (4). We really need to understand whether crc32c and T10PI support this property. Otherwise, checksum corruption might happen in the case of 1) non-page-aligned I/O or 2) page-size differences between client and server...


            People

              Assignee: lixi_wc Li Xi
              Reporter: bzzz Alex Zhuravlev
              Votes: 0
              Watchers: 11