[LU-16138] remove need for T10-PI block integrity patches from server kernel Created: 06/Sep/22  Updated: 02/Oct/23

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Improvement Priority: Major
Reporter: Andreas Dilger Assignee: Dongyang Li
Resolution: Unresolved Votes: 0
Labels: medium

Issue Links:
Related
is related to LU-16137 Lustre 2.15.1 server kernel issue (RH... Open
is related to LU-16413 T10PI is broken for CentOS 8.x Resolved

 Description   

Kernels since v4.3-rc4-30-g0f8087ecdeac move the block integrity verify and generate functions into a separate struct blk_integrity_profile instead of copying the function pointers into struct blk_integrity. This could make it possible to replace the blk_integrity_profile pointer on the block device for as long as osd-ldiskfs has it mounted, and potentially avoid the need to patch the kernel for T10-PI at all.
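
For reference, after that commit the relevant structures are shaped roughly like this (field names from 4.18-era kernels; prepare_fn/complete_fn were added to the profile later):

    typedef blk_status_t (integrity_processing_fn)
            (struct blk_integrity_iter *);

    struct blk_integrity_profile {
            integrity_processing_fn *generate_fn;
            integrity_processing_fn *verify_fn;
            const char              *name;
    };

    struct blk_integrity {
            const struct blk_integrity_profile      *profile;
            unsigned char                            flags;
            unsigned char                            tuple_size;
            unsigned char                            interval_exp;
            unsigned char                            tag_size;
    };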

As a starting point, we would need to call blk_integrity_register() with a template holding the integrity methods provided by osd-ldiskfs: osd_dif_type{1,3}_generate_{crc,ip}() and osd_dif_type{1,3}_verify_{crc,ip}(). This function appears to just copy the individual fields from the provided template into the disk queue's blk_integrity, so we should save the original blk_integrity to be restored when osd-ldiskfs unmounts the target; otherwise the kernel would crash trying to call the stale function pointers if there is any block device IO after osd-ldiskfs is removed from the kernel.
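
A minimal sketch of that registration step, assuming hypothetical osd_integrity_claim()/osd_saved_bi names (the saved copy would be per-device in practice, and the profile would need all four type/checksum combinations):

    #include <linux/blkdev.h>       /* blk-integrity.h on kernels >= 5.16 */

    static struct blk_integrity osd_saved_bi;

    static const struct blk_integrity_profile osd_integrity_profile = {
            .name           = "OSD-DIF-TYPE1-CRC",
            .generate_fn    = osd_dif_type1_generate_crc,
            .verify_fn      = osd_dif_type1_verify_crc,
    };

    static void osd_integrity_claim(struct gendisk *disk)
    {
            struct blk_integrity *bi = blk_get_integrity(disk);
            struct blk_integrity tmpl;

            if (bi == NULL)
                    return;
            osd_saved_bi = *bi;     /* keep the original for unmount */
            tmpl = *bi;             /* preserve tuple_size/interval_exp/flags */
            tmpl.profile = &osd_integrity_profile;
            blk_integrity_register(disk, &tmpl);
    }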

It seems prudent to call blk_integrity_unregister() before unloading the module, as that would force all block device IO using the osd-ldiskfs.ko methods to finish, and then to blk_integrity_register() the saved values so that T10-PI remains functional on the device.
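
Continuing the sketch above, the unmount path would then be approximately:

    static void osd_integrity_release(struct gendisk *disk)
    {
            /* Per the reasoning above: dropping the profile forces IO that
             * uses the osd-ldiskfs.ko methods to finish... */
            blk_integrity_unregister(disk);
            /* ...then the saved original is put back so T10-PI keeps
             * working after osd-ldiskfs.ko is unloaded. */
            blk_integrity_register(disk, &osd_saved_bi);
    }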

The osd_dif_generate() and osd_dif_verify() functions need to handle being called for block reads/writes that are unrelated to osd-ldiskfs BRW IO, since they will now be invoked for every block in the filesystem whether or not it came from osd-ldiskfs. One way to separate OSD from non-OSD IO might be to put a magic value at the end of struct osd_bio_private that can be used to reliably detect whether this is a BRW IO or IO on some other ldiskfs block. For non-BRW blocks find_lnb() would return NULL, the code would avoid dereferencing a NULL bio_private, and it looks like the rest of the code would "just work".
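
A sketch of the magic-based detection; OSD_BIO_MAGIC, the exact field layout, and the helper name are illustrative assumptions, not the landed code:

    #define OSD_BIO_MAGIC   0x0510adab      /* hypothetical value */

    struct osd_bio_private {
            struct osd_iobuf        *obp_iobuf;
            int                      obp_start_page_idx;
            unsigned int             obp_magic;     /* keep as last field */
    };

    static struct osd_bio_private *osd_bio_private_get(struct bio *bio)
    {
            struct osd_bio_private *bio_private = bio->bi_private;

            /* On non-BRW IO, bi_private is NULL or owned by another layer,
             * so only trust it when the magic matches. */
            if (bio_private == NULL ||
                bio_private->obp_magic != OSD_BIO_MAGIC)
                    return NULL;
            return bio_private;
    }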



 Comments   
Comment by Gerrit Updater [ 20/Sep/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48608
Subject: LU-16138 kernel: preserve RHEL8.x server kABI for block integrity
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ec555c17c0254662c5886188dd2f7c0447fd4aef

Comment by Gerrit Updater [ 20/Sep/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48609
Subject: LU-16138 kernel: preserve RHEL8.x server kABI for block integrity
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: aba52c68e750710f9b799b87d7e03221b3d1e52a

Comment by Gerrit Updater [ 10/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48608/
Subject: LU-16138 kernel: preserve RHEL8.x server kABI for block integrity
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b90c0100dd93b56f3bfaee037b3bdd077523f43e

Comment by Peter Jones [ 10/Oct/22 ]

Landed for 2.16

Comment by Jian Yu [ 10/Oct/22 ]

Have to reopen this issue because removing the block integrity patches has not been done yet. The kABI issue itself is resolved.

Comment by Gerrit Updater [ 26/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48609/
Subject: LU-16138 kernel: preserve RHEL8.x server kABI for block integrity
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: f582cd01fba4a32c6861a3be56fb320c69376064

Comment by Dongyang Li [ 20/Dec/22 ]

Andreas, I think patching the kernel is inevitable. We could register our own integrity profile, or we could save the generate/verify functions and replace them with our versions, then restore them when the OSD is unmounted.

But without at least applying the block-pass-bio-into-integrity_processing_fn patch, struct blk_integrity_exchg/blk_integrity_iter doesn't tell us which bio we are working on; without the bio we cannot access struct osd_bio_private, and so cannot use find_lnb() to get the guard buffers.
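
For reference, the unpatched 4.18-era callback signature, and (paraphrased in the comment, not copied from the patch) what the Lustre patch changes it to:

    /* Unpatched kernel: the callback sees only the iter, which carries
     * data_buf/prot_buf/seed/data_size/interval/disk_name -- no bio. */
    typedef blk_status_t (integrity_processing_fn)
            (struct blk_integrity_iter *);

    /* With block-pass-bio-into-integrity_processing_fn applied, the bio
     * is passed through as well (approximately):
     *
     *      typedef blk_status_t (integrity_processing_fn)
     *              (struct blk_integrity_iter *, struct bio *);
     *
     * making bio->bi_private, struct osd_bio_private, and find_lnb()
     * reachable from the generate/verify functions. */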

Comment by Andreas Dilger [ 21/Dec/22 ]

Some potential options for discussion:

  • could we use "container_of(iter, struct bio, bi_iter)" to access the bio? It isn't obvious that this would work, since there appear to be many "struct bvec_iter" copies being made and we can't be sure we would be using the one embedded in the bio.
  • we could use the hole in struct blk_integrity_iter between "interval"/"tuple_size" and "disk_name" directly, without patching the kernel (see the layout sketch after this list). In el8 this hole is 2 bytes (used for bi_idx in the current patch), but in newer kernels it is only 1 byte. If there were a maximum of 255 OST threads we could use this byte as an index into an array of per-thread data, but there can be more than 255 threads.
  • iter.prot_buf is constant for the whole call to bio_integrity_process() and could potentially be used as a hash key to look up the private data?
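
For the second option, this is the el8-era layout in question (offsets assume x86_64; the hole is the padding before disk_name):

    struct blk_integrity_iter {
            void            *prot_buf;      /* offset  0 */
            void            *data_buf;      /* offset  8 */
            sector_t         seed;          /* offset 16 */
            unsigned int     data_size;     /* offset 24 */
            unsigned short   interval;      /* offset 28 */
            /* 2-byte padding hole here on el8; newer kernels add
             * "unsigned char tuple_size" after interval, leaving 1 byte */
            const char      *disk_name;     /* offset 32 */
    };
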
Comment by Dongyang Li [ 21/Dec/22 ]

Looking up the bio using the bi_iter won't work; we are looking at a different iter: the generate/verify_fn gets a struct blk_integrity_iter while the bio holds a struct bvec_iter. The blk_integrity_iter is also declared on the stack of bio_integrity_process().
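
For illustration, the 4.18-era code in block/bio-integrity.c is shaped roughly like this (body elided and paraphrased):

    static blk_status_t bio_integrity_process(struct bio *bio,
                                              struct bvec_iter *proc_iter,
                                              integrity_processing_fn *proc_fn)
    {
            struct blk_integrity_iter iter; /* stack local, not in any bio */
            blk_status_t ret = BLK_STS_OK;

            /* ... walks the bio, pointing iter.data_buf/iter.prot_buf at
             * each interval, then calls proc_fn(&iter) ... */
            return ret;
    }

so container_of() on the iter can never recover the bio.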

One challenge of using iter.prot_buf as a hash key on an unpatched kernel is that the buffer is allocated by bio_integrity_prep(), and bio_integrity_prep() generates the guards for the write case immediately, so our generate_fn would see iter.prot_buf before we have set up the private data?

Comment by Andreas Dilger [ 21/Dec/22 ]

OK, what about just having a per-device hash table keyed by device offset? The generate_fn is at least called with "sector_t seed" and "disk_name", which could be mapped to the niobuf. It definitely isn't ideal, but it could potentially work, and we could compare the overhead of that method vs. patching the kernel.
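
A sketch of that lookup, assuming hypothetical osd_* names and a Linux rhashtable keyed by the starting sector (the table would be filled from the BRW path before submit_bio() and emptied on completion):

    #include <linux/rhashtable.h>

    struct osd_sector_map {
            struct rhash_head        osm_hash;      /* table linkage */
            sector_t                 osm_sector;    /* key: start sector */
            struct niobuf_local     *osm_lnb;       /* guard buffers */
    };

    static const struct rhashtable_params osd_sector_params = {
            .key_len        = sizeof(sector_t),
            .key_offset     = offsetof(struct osd_sector_map, osm_sector),
            .head_offset    = offsetof(struct osd_sector_map, osm_hash),
    };

    /* Called from generate_fn/verify_fn, where iter->seed is the starting
     * sector of this IO on the device. */
    static struct niobuf_local *osd_lookup_lnb(struct rhashtable *ht,
                                               sector_t seed)
    {
            struct osd_sector_map *map;

            map = rhashtable_lookup_fast(ht, &seed, osd_sector_params);
            return map ? map->osm_lnb : NULL;       /* NULL: not a BRW IO */
    }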

I also see that there is a "prepare_fn" callback in the blk_integrity_profile on the device (e.g. t10_pi_type1_ip.t10_pi_type1_prepare) that has access to the bio, which might be helpful.
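
That callback takes the struct request rather than the iter, so each bio is reachable from it; a minimal sketch with a hypothetical osd_dif_prepare():

    #include <linux/blkdev.h>

    static void osd_dif_prepare(struct request *rq)
    {
            struct bio *bio;

            /* Unlike generate_fn/verify_fn on an unpatched kernel, here we
             * can walk the bios and see bio->bi_private before the
             * integrity generate/verify hooks run. */
            __rq_for_each_bio(bio, rq) {
                    /* stash per-bio state for later lookup */
            }
    }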

Comment by Dongyang Li [ 22/Dec/22 ]

The per-device hash table could work, as long as the sector isn't remapped after we submit_bio().
I'm trying to find what else we can make use of in the unpatched kernel to map the prot_buf to the niobuf.
