  Lustre / LU-16138

remove need for T10-PI block integrity patches from server kernel

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: Lustre 2.17.0

    Description

      Kernels since v4.3-rc4-30-g0f8087ecdeac move the block integrity verify and generate functions into a separate struct blk_integrity_profile instead of copying the function pointers into struct blk_integrity. This could be used to replace the blk_integrity_profile pointer on the block device for the duration that osd-ldiskfs has it mounted, and potentially avoid the need to patch the kernel for T10-PI at all.
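      For reference, the split looks roughly like the following in el8-era (4.18) kernels; this is a paraphrase of include/linux/blkdev.h rather than an exact copy, and details vary between versions (the processing functions returned int before blk_status_t was introduced in v4.13):

typedef blk_status_t (integrity_processing_fn)(struct blk_integrity_iter *);
typedef void (integrity_prepare_fn)(struct request *);
typedef void (integrity_complete_fn)(struct request *, unsigned int);

struct blk_integrity_profile {
        integrity_processing_fn *generate_fn;  /* fills in guard/ref tags */
        integrity_processing_fn *verify_fn;    /* checks guard/ref tags */
        integrity_prepare_fn    *prepare_fn;
        integrity_complete_fn   *complete_fn;
        const char              *name;
};

struct blk_integrity {
        const struct blk_integrity_profile *profile;    /* methods now live here */
        unsigned char   flags;
        unsigned char   tuple_size;
        unsigned char   interval_exp;
        unsigned char   tag_size;
};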

      As a starting point, we would need to call blk_integrity_register() using a template holding the integrity methods provided by osd-ldiskfs - osd_dif_type<1,3>_generate_<crc,ip>() and osd_dif_type<1,3>_verify_<crc,ip>(). It looks like this function just copies the individual fields from the provided template into the disk queue's blk_integrity fields, so we should save the original blk_integrity to be restored when osd-ldiskfs unmounts the target; otherwise the kernel would crash trying to access the stale function pointers if there is any block device IO after osd-ldiskfs is removed from the kernel.
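      A minimal sketch of that mount-time takeover, assuming the el8-era API (blk_get_integrity(), blk_integrity_register()); osd_saved_bi, osd_integrity_takeover() and the exact spelling of the osd_dif_type1_*_crc() callbacks are illustrative names, not existing Lustre code:

#include <linux/blkdev.h>

static blk_status_t osd_dif_type1_generate_crc(struct blk_integrity_iter *iter);
static blk_status_t osd_dif_type1_verify_crc(struct blk_integrity_iter *iter);

/* copy of the device's original integrity info, restored at unmount */
static struct blk_integrity osd_saved_bi;

static const struct blk_integrity_profile osd_dif_type1_crc_profile = {
        .name           = "OSD-T10-DIF-TYPE1-CRC",
        .generate_fn    = osd_dif_type1_generate_crc,
        .verify_fn      = osd_dif_type1_verify_crc,
};

static int osd_integrity_takeover(struct gendisk *disk)
{
        struct blk_integrity *bi = blk_get_integrity(disk);
        struct blk_integrity tmpl;

        if (bi == NULL)                 /* device is not using T10-PI */
                return 0;

        osd_saved_bi = *bi;             /* remember original for unmount */

        tmpl = *bi;                     /* keep tuple_size/interval_exp/flags */
        tmpl.profile = &osd_dif_type1_crc_profile;
        blk_integrity_register(disk, &tmpl);    /* copies tmpl into the disk */
        return 0;
}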

      It seems prudent to call blk_integrity_unregister() before unloading the module, as that would force all block device IO using the osd-ldiskfs.ko methods to finish, and then call blk_integrity_register() with the saved values so that T10-PI is still functional on the device.
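      Correspondingly, a sketch of the unmount path under the same assumptions (osd_saved_bi as in the mount-time sketch above):

static void osd_integrity_restore(struct gendisk *disk)
{
        if (blk_get_integrity(disk) == NULL)
                return;

        /* drop the osd-ldiskfs.ko methods first ... */
        blk_integrity_unregister(disk);

        /* ... then put the saved profile back so T10-PI keeps working
         * on the device after osd-ldiskfs is unloaded */
        if (osd_saved_bi.profile != NULL)
                blk_integrity_register(disk, &osd_saved_bi);
}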

      The osd_dif_generate() and osd_dif_verify() functions need to handle being called for block read/write operations that are unrelated to osd-ldiskfs BRW IOs, since that will now happen for every block in the filesystem whether or not it came from osd-ldiskfs. One way to separate OSD vs. non-OSD IO might be to put a magic value at the end of struct osd_bio_private that can be used to reliably detect whether this is a BRW IO or IO on some other ldiskfs block. Non-BRW blocks would return NULL from find_lnb(), the code would not dereference bio_private when it is NULL, and it looks like the rest of the code would "just work".
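      A sketch of that non-BRW handling; OSD_BIO_PRIVATE_MAGIC, the obp_magic field, osd_lookup_bio_private() (however the iter-to-bio_private mapping ends up being done - see the comments below), and the osd_dif_generate_from_*() helpers are hypothetical names used only for illustration:

#include <linux/blkdev.h>

struct niobuf_local;                            /* Lustre BRW niobuf */

#define OSD_BIO_PRIVATE_MAGIC   0x0510b10b      /* hypothetical value */

struct osd_bio_private {
        /* ... existing fields ... */
        unsigned int    obp_magic;      /* proposed: set to OSD_BIO_PRIVATE_MAGIC
                                         * only when the bio was built for a BRW IO */
};

static blk_status_t osd_dif_generate(struct blk_integrity_iter *iter)
{
        struct osd_bio_private *bio_private = osd_lookup_bio_private(iter);
        struct niobuf_local *lnb = NULL;

        /* Only BRW bios carry a valid osd_bio_private; any other IO on the
         * ldiskfs device (journal, metadata, ...) must fall back to computing
         * the guard tags directly from iter->data_buf. */
        if (bio_private != NULL &&
            bio_private->obp_magic == OSD_BIO_PRIVATE_MAGIC)
                lnb = find_lnb(iter, bio_private);      /* NULL for non-BRW blocks */

        if (lnb == NULL)
                return osd_dif_generate_from_data(iter);        /* plain T10-PI path */

        return osd_dif_generate_from_lnb(iter, lnb);    /* reuse client-supplied guards */
}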

          Activity

            [LU-16138] remove need for T10-PI block integrity patches from server kernel
            adilger Andreas Dilger made changes -
            Labels Original: medium New: kernel medium
            adilger Andreas Dilger made changes -
            Link New: This issue is related to EX-10081 [ EX-10081 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.17.0 [ 16192 ]
            Fix Version/s Original: Lustre 2.16.0 [ 15190 ]
            adilger Andreas Dilger made changes -
            Labels New: medium
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-16413 [ LU-16413 ]
            yujian Jian Yu made changes -
            Assignee Original: Jian Yu [ yujian ] New: Dongyang Li [ dongyang ]
            dongyang Dongyang Li added a comment -

            The per-device hash table could work, provided the sector isn't remapped after we submit_bio().
            I'm trying to find what else we can make use of on an unpatched kernel to map the prot_buf to the niobuf.


            adilger Andreas Dilger added a comment -

            OK, what about just having a per-device hash table using the device offsets? The generate_fn is at least called with "sector_t seed" and "disk_name", which could be mapped to the niobuf. It definitely isn't ideal, but it could potentially work, and we could compare the overhead of that method against patching the kernel.

            I also see that there is a "prepare_fn" callback in the blk_integrity_profile on the device (e.g. t10_pi_type1_ip.t10_pi_type1_prepare) that has access to the bio, which might be helpful.
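            A sketch of the hash-table idea using the kernel's <linux/hashtable.h>; osd_sector_map, osd_sector_entry and the helpers are hypothetical, the table is shown as a single global for brevity (the proposal is per-device), entries would also need to be removed on IO completion, and the sector-remapping concern raised in the reply above still applies:

#include <linux/errno.h>
#include <linux/hashtable.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct niobuf_local;

/* hypothetical map from starting sector to the niobuf needed by the
 * integrity callbacks; keyed by the sector passed as iter->seed */
struct osd_sector_entry {
        struct hlist_node        ose_hash;
        sector_t                 ose_sector;
        struct niobuf_local     *ose_lnb;
};

#define OSD_SECTOR_HASH_BITS    10

static DEFINE_HASHTABLE(osd_sector_map, OSD_SECTOR_HASH_BITS);
static DEFINE_SPINLOCK(osd_sector_lock);

/* called before submit_bio(), keyed by the bio's starting sector */
static int osd_sector_map_add(sector_t sector, struct niobuf_local *lnb)
{
        struct osd_sector_entry *ose = kmalloc(sizeof(*ose), GFP_NOFS);

        if (ose == NULL)
                return -ENOMEM;
        ose->ose_sector = sector;
        ose->ose_lnb = lnb;
        spin_lock(&osd_sector_lock);
        hash_add(osd_sector_map, &ose->ose_hash, sector);
        spin_unlock(&osd_sector_lock);
        return 0;
}

/* called from generate_fn/verify_fn with iter->seed */
static struct niobuf_local *osd_sector_map_find(sector_t seed)
{
        struct osd_sector_entry *ose;
        struct niobuf_local *lnb = NULL;

        spin_lock(&osd_sector_lock);
        hash_for_each_possible(osd_sector_map, ose, ose_hash, seed) {
                if (ose->ose_sector == seed) {
                        lnb = ose->ose_lnb;
                        break;
                }
        }
        spin_unlock(&osd_sector_lock);
        return lnb;
}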

            dongyang Dongyang Li added a comment - edited

            Looking up the bio using bi_iter won't work; we are looking at a different iter: the generate/verify_fn gets a struct blk_integrity_iter, while the bio has a struct bvec_iter. The blk_integrity_iter is also declared on the stack of bio_integrity_process().

            One challenge of using iter.prot_buf as a hash key with an unpatched kernel is that the buffer is allocated by bio_integrity_prep(), and bio_integrity_prep() generates the guards for the write case immediately, so our generate_fn would see iter.prot_buf before we set up the private data?


            adilger Andreas Dilger added a comment -

            Some potential options for discussion:

            • Could we use "container_of(iter, struct bio, bi_iter)" to access the bio? It isn't obvious that this would work, because there appear to be many "struct bvec_iter" copies being made and we can't be sure we would be using the one embedded in the bio.
            • We could use the hole in struct blk_integrity_iter between "interval/tuple_size" and "disk_name" directly, without patching the kernel. In el8 this hole is 2 bytes (used for bi_idx in the current patch), but in newer kernels it is only 1 byte. If we had a maximum of 255 OST threads we could use this byte as an index into an array to get per-thread data, but there can be more than 255 threads.
            • iter.prot_buf is constant for the whole call to bio_integrity_process() and could potentially be used as a hash key to look up the private data?

            People

              Assignee: Dongyang Li (dongyang)
              Reporter: Andreas Dilger (adilger)
              Votes: 0
              Watchers: 7
