
LU-10472: Data Integrity (T10PI) support for Lustre

Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0
    • None

    Description

      Data integrity (T10PI) support in hard disks is common now, which raises the need to add such support to Lustre. This is not the first attempt to implement T10PI support for Lustre, but the earlier work has been stalled for years (LU-2584). Instead of implementing end-to-end data integrity in one shot, we are trying to implement data integrity step by step. Because of this difference, we feel it is better to create a separate ticket.

      The first step would be adding support for protecting data integrity from the Lustre OSD to disk, i.e. OSD-to-storage T10PI.

      Given that checksums are already supported for Lustre RPCs, the data is already protected while transferring through the network. By using both the network checksum and OSD-to-storage T10PI together, the time window with no data protection shrinks considerably. The only remaining danger is that the page cache in memory on the OSD is somehow corrupted between the RPC checksum check and the OSD T10PI checksum calculation. This raises the question of whether it is really necessary to implement end-to-end T10PI support. However, convincing concerned users to accept this small probability is still difficult, especially since we do not have a quantitative estimate of that probability.

      However, even if Lustre supports an OSC-to-storage T10PI feature, the data is still not fully protected in theory unless some kind of T10PI API is provided to applications, e.g. https://lwn.net/Articles/592113. Supporting that kind of API would be even more difficult, because LOV striping needs to be taken care of, unless we limit end-to-end T10PI to single-stripe files.

      One major difficulty of implementing an OSC-to-storage T10PI feature for Lustre is that a single bulk RPC can be split into several server-side I/Os on disk, which means the checksum needs to be recalculated.

      To avoid this problem, the Lustre client needs to issue smaller bulk RPCs so that each RPC can always be written/read in one I/O. This eliminates the need to recalculate the T10PI protection data on the OSD side.
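      To make this concrete, the protection data involved is just one 16-bit guard tag per integrity interval. Below is a minimal, hypothetical sketch in kernel-style C (the helper name and the fixed 4096-byte interval are assumptions for illustration, not code from the patches on this ticket) of how such per-sector guard tags could be generated on sector-aligned bulk data:

      #include <asm/byteorder.h>
      #include <linux/crc-t10dif.h>   /* crc_t10dif() */
      #include <linux/types.h>

      #define T10PI_SECTOR_SIZE 4096  /* assumed integrity interval */

      /*
       * Hypothetical helper: compute one 16-bit T10-DIF guard tag per
       * 4096-byte sector of a bulk buffer.  Assumes buf/len are sector
       * aligned, which is exactly why each bulk RPC must map to whole
       * I/Os as described above.
       */
      static void t10pi_generate_guard_tags(const unsigned char *buf, size_t len,
                                            __be16 *guard_tags)
      {
              size_t nr_sectors = len / T10PI_SECTOR_SIZE;
              size_t i;

              for (i = 0; i < nr_sectors; i++)
                      guard_tags[i] = cpu_to_be16(crc_t10dif(buf + i * T10PI_SECTOR_SIZE,
                                                             T10PI_SECTOR_SIZE));
      }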

       

      Attachments

        Issue Links

          Activity

            [LU-10472] Data Integrity (T10PI) support for Lustre

            To avoid reading through this ticket again, here is the final conclusion:

            In summary, the current T10-PI checksum support in 2.12 is no worse than the existing bulk RPC checksums (with separate T10-PI checksums at the kernel bio level), and with the 32266 patch it is better still due to reduced CPU overhead. With that patch, reads can use the GRD tags straight from the hardware to compute the RPC checksum without any bulk data checksums on the OSS.

            32266 includes a series of server-side kernel patches for RHEL7. These patches allow for overlapping protection domains from Lustre client to disk. Without the patches, there is a protection hole (no worse than older network checksums). 

            nrutman Nathan Rutman added a comment

            Correct, ZFS does not implement T10-PI support. There is the old design for integrated end-to-end checksums from the client, which would use a similar, but more sophisticated, checksum method to the current T10 mechanism.

            adilger Andreas Dilger added a comment

            Thanks Andreas for the response. The T10-PI checks (and therefore end-to-end) only apply to ldiskfs, not ZFS, correct?

            nrutman Nathan Rutman added a comment
            pjones Peter Jones added a comment -

            Landed for 2.12


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32266/
            Subject: LU-10472 osd-ldiskfs: T10PI between RPC and BIO
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ccf3674c9ca3ed8918c49163007708d1ae5db6f5

            gerrit Gerrit Updater added a comment

            Nathan, the patch https://review.whamcloud.com/32266 "osd-ldiskfs: T10PI between RPC and BIO" will allow passing the GRD tags from Lustre down to the storage, if the hardware supports it. The benefit of the integration with Lustre is indeed, as you wrote, to reduce the duplicate checksum CPU overhead on the server. Using the checksum-of-checksums avoids increasing the request size for GRD tags as the RPC size grows (and will help ZFS in the future since it is doing a Merkle tree for its checksums).

            To respond to your #2, Lustre has always computed the network checksums in an "overlapping" manner. It will verify the RPC checksum after the data checksums are computed on the OSS, so if they don't match, the client will re-send the RPC. This ensures that the T10-PI checksums that the OSS is computing (and the pages they represent) are the same as the ones computed at the client (up to the limit of the 16-bit GRD checksum itself, which is IMHO fairly weak). At least the 16-bit GRD checksum is only on the 4096-byte sector, and we have a somewhat stronger 32-bit checksum for the 2KB of GRD tags (for 4MB RPC).
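            For illustration, here is a minimal sketch of that overlapping check (the function name, the -EAGAIN convention, and the use of crc32 over the tag array are assumptions for this example, not actual Lustre code): the OSS first computes the per-sector guard tags from the received pages (as in the sketch under the Description), then verifies the client's RPC checksum against a 32-bit checksum of those tags. A 4MB RPC is 1024 sectors of 4KB, i.e. 2KB of 16-bit guard tags.

            #include <linux/crc32.h>   /* crc32_le() */
            #include <linux/errno.h>
            #include <linux/types.h>

            /*
             * Hypothetical sketch: guard_tags[] holds the per-sector T10-PI
             * guard tags the OSS just computed from the received bulk pages.
             * The RPC checksum is verified against a checksum of those same
             * tags, so the data covered by the network check and the data
             * handed to the bio layer overlap.
             */
            static int oss_verify_rpc_cksum(const __be16 *guard_tags,
                                            size_t nr_sectors, u32 client_cksum)
            {
                    u32 server_cksum;

                    server_cksum = crc32_le(~0U, (const unsigned char *)guard_tags,
                                            nr_sectors * sizeof(*guard_tags));
                    if (server_cksum != client_cksum)
                            return -EAGAIN; /* mismatch: the client resends the bulk RPC */

                    /*
                     * With patch 32266 the same guard tags are then passed down
                     * to the bio integrity code instead of being recomputed.
                     */
                    return 0;
            }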

            In summary, the current T10-PI checksum support in 2.12 is no worse than the existing bulk RPC checksums (with separate T10-PI checksums at the kernel bio level), and with the 32266 patch it is better still due to reduced CPU overhead. With that patch, reads can use the GRD tags straight from the hardware to compute the RPC checksum without any bulk data checksums on the OSS.

            adilger Andreas Dilger added a comment

            There are a few things I still don't understand about this design.

            1. Any over-the-wire checksum can be used, as long as you calculate the final GRD tags before the network checksum is verified. Yes, I suppose you might save some work on the server calculating both wire and T10 csums if they are related to each other, but this makes assumptions about the efficiency of various checksum types. It seems like T10 verification on the OSS should work with any wire checksum.
            2. As hinted at by Andrew's and Li Xi's latest comments, there are some problems with Linux T10 support. To Li Xi's last comment in particular, as I stated back in Feb, you CAN'T allow the OS to recalculate GRD tags at write (or read) time. This completely breaks the end-to-end guarantee, because the data may have changed between the first check and the second re-calculation, and there is no way to catch that. Without the end-to-end guarantee, I don't see how this is any better than what we have today - separate network checksum and T10 calculated/verified in HBA hardware.

            If the answer to #2 is to patch the kernel, then it seems that is required not just for performance, but for correctness (of the end-to-end claim).

             

            nrutman Nathan Rutman added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30980/
            Subject: LU-10472 osc: add T10PI support for RPC checksum
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b1e7be00cb6e1e7b3e35877cf8b8a209e456a2ea

            gerrit Gerrit Updater added a comment

            I am working on combining the T10PI RPC checksum with the BIO layer. I have figured out that without patching the Linux kernel it might be impossible to avoid recalculating the T10PI guard tags, which is bad for performance. But if we apply a small Linux kernel patch, things become much easier.

            The modification to the Linux kernel would be as follows (note that the following discussion is based on the Linux master branch):

            Add a new argument “struct bio *” to the following method:

            typedef blk_status_t (integrity_processing_fn) (struct blk_integrity_iter *);

            It becomes:

            typedef blk_status_t (integrity_processing_fn) (struct bio *, struct blk_integrity_iter *);

            This doesn't require much change to the Linux kernel, since integrity_processing_fn is only used in the following places:

            • bio_integrity_process() will call integrity_processing_fn, and bio_integrity_process() itself already has “struct bio *bio” as an argument.
            • The t10_pi_type[1|3]_[generate|verify]_[ip|crc] functions defined in block/t10-pi.c will need to be modified slightly: “struct bio *bio” can be added as an argument but left unused.

            Making that change to the Linux kernel enables Lustre (and other file systems) to integrate with the BIO layer for integrity generation/verification. The design is as follows:

            • When the Lustre OSD starts, it will first back up the default “struct blk_integrity_profile” template registered by the disk, and then replace the profile with a Lustre-native one.
            • When the Lustre OSD unmounts, it will restore the original “struct blk_integrity_profile”.
            • Private information can be attached to bio->bi_private. The T10PI guard tags already calculated by the Lustre RPC checksum code can be put into bio->bi_private in osd_do_bio().
            • When writing, bio_integrity_process() needs to call bi->profile->generate_fn(). In that case, the Lustre-native method will copy the guard tags rather than recalculate them.
            • When reading, bio_integrity_process() needs to call bi->profile->verify_fn(). In that case, the Lustre-native method will verify the tags (optionally) and then copy them to “struct osd_iobuf *iobuf” for later use in the RPC checksum (see the sketch after this list).
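            As a rough illustration of the write path above, here is a minimal sketch assuming the proposed two-argument integrity_processing_fn; the struct osd_bio_private container, its fields, and the osd_t10pi_generate_fn/osd_integrity_profile names are hypothetical, and the loop simply mirrors the interval-walking style of the stock block/t10-pi.c helpers:

            #include <linux/bio.h>
            #include <linux/blkdev.h>
            #include <linux/blk_types.h>
            #include <linux/kernel.h>
            #include <linux/t10-pi.h>

            /*
             * Hypothetical container attached to bio->bi_private by osd_do_bio(),
             * holding the guard tags already computed by the RPC checksum code.
             */
            struct osd_bio_private {
                    __be16          *obp_guard_tags;  /* one tag per integrity interval */
                    unsigned int     obp_cur;         /* next tag to consume */
            };

            /* Write side: copy precomputed tags instead of hashing iter->data_buf. */
            static blk_status_t osd_t10pi_generate_fn(struct bio *bio,
                                                      struct blk_integrity_iter *iter)
            {
                    struct osd_bio_private *obp = bio->bi_private;
                    unsigned int i;

                    for (i = 0; i < iter->data_size; i += iter->interval) {
                            struct t10_pi_tuple *pi = iter->prot_buf;

                            pi->guard_tag = obp->obp_guard_tags[obp->obp_cur++];
                            pi->app_tag = 0;
                            pi->ref_tag = cpu_to_be32(lower_32_bits(iter->seed));

                            iter->data_buf += iter->interval;
                            iter->prot_buf += sizeof(struct t10_pi_tuple);
                            iter->seed++;
                    }

                    return BLK_STS_OK;
            }

            /*
             * Profile swapped in for the disk's default one while the OSD is
             * mounted.  This only compiles with the proposed two-argument
             * integrity_processing_fn typedef; verify_fn would analogously
             * verify and copy tags on the read path.
             */
            static const struct blk_integrity_profile osd_integrity_profile = {
                    .name           = "LUSTRE-OSD-T10PI",  /* illustrative */
                    .generate_fn    = osd_t10pi_generate_fn,
            };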

            I investigated other ways to do a similar thing. However, it seems impossible, because bio_endio() will call bio_integrity_endio() before calling the bio->bi_end_io() method. That means the integrity data of the BIO will be freed before Lustre has any chance to copy it into its own buffer.

            I know that adding a Linux kernel patch is the last thing we want to do, since the community has put so much effort into patchless servers. But it seems there is no other choice. And since the kernel patch is small, I think it would be relatively easy to persuade the Linux kernel community to merge it into mainline.

            Any comment is welcome!

            lixi Li Xi (Inactive) added a comment
            lixi Li Xi (Inactive) added a comment - random_write.c

            People

              Assignee: qian_wc Qian Yingjin
              Reporter: lixi Li Xi (Inactive)
              Votes: 0
              Watchers: 16
