
[LU-10472] Data Integrity (T10PI) support for Lustre

Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0
    • None

    Description

      Data integrity (T10PI) support in hard disks is common now, which raises the need to add support for it in Lustre. This is not the first attempt to implement T10PI support for Lustre, but the earlier work has been stalled for years (LU-2584). Instead of implementing end-to-end data integrity in one shot, we are trying to implement data integrity step by step. Because of this difference, we feel it is better to create a separate ticket.

      The first step would be adding support for protecting data integrity from the Lustre OSD to disk, i.e. OSD-to-storage T10PI.

      Given that checksums are already supported for Lustre RPCs, the data is already protected while transferring over the network. By using both the network checksum and OSD-to-storage T10PI together, the window with no data protection is greatly reduced. The only remaining danger is that the page cache in memory on the OSD could somehow be changed between the RPC checksum verification and the OSD T10PI checksum calculation. This makes it debatable whether it is really necessary to implement end-to-end T10PI support. However, convincing concerned users to accept this small probability is still difficult, especially when we do not have a quantitative estimate of that probability.

      However, even if Lustre supports the OSC-to-storage T10PI feature, the data is in theory not fully protected unless some kind of T10PI API is provided to applications, e.g. https://lwn.net/Articles/592113. Supporting that kind of API would be even more difficult, because LOV striping needs to be taken into account, unless we limit end-to-end T10PI to single-stripe files.

      One major difficulty of implementing the OSC-to-storage T10PI feature for Lustre is that a single bulk RPC can be split into several I/Os on the server's disk, which means the checksum needs to be recalculated.

      In order to avoid this problem, the Lustre client needs to issue smaller bulk RPCs so as to make sure each RPC can always be written/read in one I/O. This eliminates the need to recalculate the T10PI protection data on the OSD side.
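      To make the mechanism concrete, here is a minimal C sketch (not Lustre code) of how per-sector T10-PI guard (GRD) tags are computed over a bulk buffer. The 4096-byte sector size follows the discussion later in this ticket, and the CRC16 polynomial 0x8BB7 is the standard T10-DIF guard; treat both as illustrative assumptions rather than as what the Lustre patches actually do.

```c
/*
 * Minimal sketch (not Lustre code): compute one 16-bit guard tag per
 * 4096-byte sector of a bulk buffer.  If the client caps the bulk RPC
 * so the server can submit it as a single I/O, the protection data can
 * be carried through unchanged; otherwise the OSD would have to
 * recompute it for each smaller I/O.  Details are illustrative only.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SECTOR_SIZE 4096

/* Bitwise CRC16 with the T10-DIF polynomial 0x8BB7 (slow but clear). */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)buf[i] << 8;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (crc << 1) ^ 0x8BB7
					     : crc << 1;
	}
	return crc;
}

/* Fill one guard tag per sector; returns the number of tags written. */
static size_t fill_guard_tags(const uint8_t *data, size_t len,
			      uint16_t *tags, size_t max_tags)
{
	size_t nsectors = len / SECTOR_SIZE;

	if (nsectors > max_tags)
		nsectors = max_tags;
	for (size_t i = 0; i < nsectors; i++)
		tags[i] = crc16_t10dif(data + i * SECTOR_SIZE, SECTOR_SIZE);
	return nsectors;
}

int main(void)
{
	static uint8_t rpc_buf[16 * SECTOR_SIZE];	/* a 64KB "bulk RPC" */
	uint16_t tags[16];

	memset(rpc_buf, 0xa5, sizeof(rpc_buf));
	size_t n = fill_guard_tags(rpc_buf, sizeof(rpc_buf), tags, 16);
	printf("%zu sectors, first guard tag 0x%04x\n", n, tags[0]);
	return 0;
}
```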

       

      Attachments

        Issue Links

          Activity

            [LU-10472] Data Integrity (T10PI) support for Lustre

            nrutman Nathan Rutman added a comment -

            To avoid reading through this ticket again, here is the final conclusion:

            In summary, the current T10-PI checksum support in 2.12 is no worse than the existing bulk RPC checksums (with separate T10-PI checksums at the kernel bio level), and with the 32266 patch it is better still due to reduced CPU overhead. With that patch, reads can use the GRD tags straight from the hardware to compute the RPC checksum without any bulk data checksums on the OSS.

            32266 includes a series of server-side kernel patches for RHEL7. These patches allow for overlapping protection domains from Lustre client to disk. Without the patches, there is a protection hole (no worse than older network checksums). 


            adilger Andreas Dilger added a comment -

            Correct, ZFS does not implement T10-PI support. There is the old design for integrated end-to-end checksums from the client, which would use a checksum method similar to, but more sophisticated than, the current T10 mechanism.


            nrutman Nathan Rutman added a comment -

            Thanks Andreas for the response. The T10-PI checks (and therefore end-to-end) only apply to ldiskfs, not ZFS, correct?

            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32266/
            Subject: LU-10472 osd-ldiskfs: T10PI between RPC and BIO
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ccf3674c9ca3ed8918c49163007708d1ae5db6f5


            adilger Andreas Dilger added a comment -

            Nathan, the patch https://review.whamcloud.com/32266 "osd-ldiskfs: T10PI between RPC and BIO" will allow passing the GRD tags from Lustre down to the storage, if the hardware supports it. The benefit of the integration with Lustre is indeed, as you wrote, to reduce the duplicate checksum CPU overhead on the server. Using the checksum-of-checksums avoids increasing the request size for GRD tags as the RPC size grows (and will help ZFS in the future since it is doing a Merkle tree for its checksums).

            To respond to your #2, Lustre has always computed the network checksums in an "overlapping" manner. It will verify the RPC checksum after the data checksums are computed on the OSS, so if they don't match, the client will re-send the RPC. This ensures that the T10-PI checksums that the OSS is computing (and the pages they represent) are the same as the ones computed at the client (up to the limit of the 16-bit GRD checksum itself, which is IMHO fairly weak). At least the 16-bit GRD checksum is only on the 4096-byte sector, and we have a somewhat stronger 32-bit checksum for the 2KB of GRD tags (for 4MB RPC).
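            A rough C sketch of the checksum-of-checksums arithmetic described above, assuming zlib's crc32() as a stand-in for the 32-bit algorithm actually used; the per-sector GRD tags are dummy values here, whereas in reality they would be the T10 guards computed over the received bulk pages:

```c
/*
 * Sketch of the "checksum of checksums" idea (not the actual Lustre
 * code).  A 4MB bulk RPC covers 1024 4KB sectors, so the per-sector
 * 16-bit GRD tags total only 2KB, and the RPC-level checksum is a
 * 32-bit checksum over that 2KB of tags rather than over the 4MB of
 * data.  The OSS recomputes this value from the tags it derives from
 * the received pages and compares it with the value carried in the
 * RPC; on mismatch the client resends.
 */
#include <stdint.h>
#include <stdio.h>
#include <zlib.h>			/* crc32(); link with -lz */

#define RPC_SIZE	(4u << 20)		/* 4MB bulk RPC */
#define SECTOR_SIZE	4096u
#define NSECTORS	(RPC_SIZE / SECTOR_SIZE)	/* 1024 sectors */

int main(void)
{
	/* Dummy stand-ins for the per-sector T10 guard tags. */
	static uint16_t grd_tags[NSECTORS];
	for (uint32_t i = 0; i < NSECTORS; i++)
		grd_tags[i] = (uint16_t)(i * 0x9e37);

	uint32_t tag_bytes = NSECTORS * sizeof(uint16_t);	/* 2KB */

	/* The RPC checksum covers only the tags, not the bulk data. */
	uint32_t oss_cksum = crc32(0L, (const unsigned char *)grd_tags,
				   tag_bytes);
	uint32_t client_cksum = oss_cksum;	/* placeholder for the
						 * value sent in the RPC */

	if (oss_cksum != client_cksum)
		printf("checksum mismatch: client must resend the RPC\n");
	else
		printf("%u sectors -> %u bytes of tags, checksum 0x%08x\n",
		       (unsigned)NSECTORS, tag_bytes, oss_cksum);
	return 0;
}
```

            The point of this arrangement, as the comment above states, is that only the small checksum-of-checksums needs to travel with the RPC rather than the full set of GRD tags, so the request size does not grow as the RPC size grows.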

            In summary, the current T10-PI checksum support in 2.12 is no worse than the existing bulk RPC checksums (with separate T10-PI checksums at the kernel bio level), and with the 32266 patch it is better still due to reduced CPU overhead. With that patch, reads can use the GRD tags straight from the hardware to compute the RPC checksum without any bulk data checksums on the OSS.


            People

              Assignee: qian_wc Qian Yingjin
              Reporter: lixi Li Xi (Inactive)
              Votes: 0
              Watchers: 16
