Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-16138

remove need for T10-PI block integrity patches from server kernel

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • Lustre 2.17.0
    • None
    • 9223372036854775807

    Description

      Kernels since v4.3-rc4-30-g0f8087ecdeac separate the block integrity verify and generate functions into a separate struct blk_integrity_profile instead of copying the pointers into struct blk_integrity. This might be used to facilitate replacing the blk_integrity_profile pointer on the block device for the duration that osd-ldiskfs is mounting it, and potentially avoid the need to patch the kernel for T10-PI at all.

      As a starting point, we would need to call blk_integrity_register() using a template holding the integrity methods provided by osd-ldiskfs - osd_dif_type<1,3>generate<crc,ip>() and osd_dif_type<1,3>verify<crc,ip>(). It looks like this function is just copying the individual fields from the provided template into the disk queue's blk_integrity fields, so we should save the original blk_integrity to be restored when osdldiskfs unmounts the target, otherwise the kernel would crash trying to access the old function pointers if there is any block device IO after osd-ldiskfs is removed from the kernel.

      It seems prudent to call blk_integrity_unregister() before unloading the module, as that would force all block device IO using the osd-ldiskfs.ko methods to be finished, and then blk_integrity_register() the saved values so that T10-PI is still functional on the device.

      The osd_dif_generate() and osd_dif_verify() functions need to handle being called on block read/write calls that are unrelated to osd-ldiskfs BRW IOs, since that will now happen for every block in the filesystem whether or not it came from osd-ldiskfs. One way to separate OSD vs. non-OSD IO might be by putting a magic at the end of struct osd_bio_private that can be used to reliably detect if this is a BRW IO or if it is IO on some other ldiskfs block. Non-BRW blocks would return NULL from find_lnb() and not dereference bio_private if NULL, and it looks like the rest of the code will "just work".

      Attachments

        Issue Links

          Activity

            [LU-16138] remove need for T10-PI block integrity patches from server kernel
            dongyang Dongyang Li added a comment -

            The per-device hash table could work, if the sector won't be remapped after we submit_bio().
            I'm trying to find what else we can make use of on the unpatched kernel to map the prot_buf to niobuf.

            dongyang Dongyang Li added a comment - The per-device hash table could work, if the sector won't be remapped after we submit_bio() . I'm trying to find what else we can make use of on the unpatched kernel to map the prot_buf to niobuf .

            OK, what about just having a per-device hash table using the device offsets? The generate_fn is at least called with "sector_t seed" and "disk_name" that could be mapped to the niobuf. It definitely isn't ideal, but could potentially work and we could see the overhead of that method vs. patching the kernel.

            I also see that there is a "prepare_fn" callback in the blk_integrity_profile on the device (e.g. t10_pi_type1_ip.t10_pi_type1_prepare) that has access to the bio, which might be helpful.

            adilger Andreas Dilger added a comment - OK, what about just having a per-device hash table using the device offsets? The generate_fn is at least called with " sector_t seed " and " disk_name " that could be mapped to the niobuf . It definitely isn't ideal, but could potentially work and we could see the overhead of that method vs. patching the kernel. I also see that there is a " prepare_fn " callback in the blk_integrity_profile on the device (e.g. t10_pi_type1_ip.t10_pi_type1_prepare ) that has access to the bio , which might be helpful.
            dongyang Dongyang Li added a comment - - edited

            Looking up bio using the bi_iter won't work, we are looking at different iter. the generate/verify_fn is using struct blk_integrity_iter and bio has struct bvec_iter. The blk_integrity_iter is also declared on stack of bio_integrity_process().

            One challenge of using iter.prot_buf as a hash key with unpatched kernel is the buf is allocated by bio_integrity_prep(), and bio_integrity_prep() will generate the guards for the write case immediately,
            so our generate_fn will see the iter.prot_buf before we setup the private data?

            dongyang Dongyang Li added a comment - - edited Looking up bio using the bi_iter won't work, we are looking at different iter. the generate/verify_fn is using struct blk_integrity_iter and bio has struct bvec_iter . The blk_integrity_iter is also declared on stack of bio_integrity_process() . One challenge of using iter.prot_buf as a hash key with unpatched kernel is the buf is allocated by bio_integrity_prep() , and bio_integrity_prep() will generate the guards for the write case immediately, so our generate_fn will see the iter.prot_buf before we setup the private data?

            Some potential options for discussion:

            • could we use "container_of(iter, struct bio, bi_iter)" to access the bio? It isn't very obvious that this would work, because there look like lots of "struct bvec_iter" copies being made and we aren't sure to be using the one in the bio.
            • we could use the hole in struct blk_integrity_iter between "interval/tuple_size" and "disk_name" directly without patching the kernel. In el8 this hole is 2 bytes (used for bi_idx in the current patch), but in newer kernels it is only 1 byte. If we had a maximum of 255 OST threads we could use this byte as an index into an array to get per-thread data, but there can be more than 255 threads
            • iter.prot_buf is constant for the whole call to bio_integrity_process() and could potentially be used as a hash key to look up the private data?
            adilger Andreas Dilger added a comment - Some potential options for discussion: could we use " container_of(iter, struct bio, bi_iter) " to access the bio? It isn't very obvious that this would work, because there look like lots of "struct bvec_iter" copies being made and we aren't sure to be using the one in the bio. we could use the hole in struct blk_integrity_iter between " interval/tuple_size " and " disk_name " directly without patching the kernel. In el8 this hole is 2 bytes (used for bi_idx in the current patch), but in newer kernels it is only 1 byte. If we had a maximum of 255 OST threads we could use this byte as an index into an array to get per-thread data, but there can be more than 255 threads iter.prot_buf is constant for the whole call to bio_integrity_process() and could potentially be used as a hash key to look up the private data?
            dongyang Dongyang Li added a comment -

            Andreas, I think patching the kernel is inevitable. We could register our own integrity_profile, or we could save the generate/verify functions and replace them with our versions, then restore them when umount osd.

            But without at least applying block-pass-bio-into-integrity_processing_fn the struct blk_integrity_exchg/blk_integrity_iter doesn't know which bio are we working on, without the bio we could not access the struct osd_bio_private, and could not use find_lnb() to get the guard buffers.

            dongyang Dongyang Li added a comment - Andreas, I think patching the kernel is inevitable. We could register our own integrity_profile, or we could save the generate/verify functions and replace them with our versions, then restore them when umount osd. But without at least applying block-pass-bio-into-integrity_processing_fn the struct blk_integrity_exchg/blk_integrity_iter doesn't know which bio are we working on, without the bio we could not access the struct osd_bio_private, and could not use find_lnb() to get the guard buffers.

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48609/
            Subject: LU-16138 kernel: preserve RHEL8.x server kABI for block integrity
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: f582cd01fba4a32c6861a3be56fb320c69376064

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48609/ Subject: LU-16138 kernel: preserve RHEL8.x server kABI for block integrity Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: f582cd01fba4a32c6861a3be56fb320c69376064
            yujian Jian Yu added a comment -

            Have to reopen this issue because removing the block integrity patches has not been done.
            The kABI issue is resolved.

            yujian Jian Yu added a comment - Have to reopen this issue because removing the block integrity patches has not been done. The kABI issue is resolved.
            pjones Peter Jones added a comment -

            Landed for 2.16

            pjones Peter Jones added a comment - Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48608/
            Subject: LU-16138 kernel: preserve RHEL8.x server kABI for block integrity
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b90c0100dd93b56f3bfaee037b3bdd077523f43e

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48608/ Subject: LU-16138 kernel: preserve RHEL8.x server kABI for block integrity Project: fs/lustre-release Branch: master Current Patch Set: Commit: b90c0100dd93b56f3bfaee037b3bdd077523f43e

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48609
            Subject: LU-16138 kernel: preserve RHEL8.x server kABI for block integrity
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: aba52c68e750710f9b799b87d7e03221b3d1e52a

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48609 Subject: LU-16138 kernel: preserve RHEL8.x server kABI for block integrity Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: aba52c68e750710f9b799b87d7e03221b3d1e52a

            People

              dongyang Dongyang Li
              adilger Andreas Dilger
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated: