
[LU-10472] Data Integrity (T10PI) support for Lustre

Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0
    • None

    Description

      Data integrity (T10PI) support in hard disks is common now, which raises the need to add support for it in Lustre. This is not the first attempt to implement T10PI support for Lustre, but the earlier work has been stalled for years (LU-2584). Instead of implementing end-to-end data integrity in one shot, we are trying to implement data integrity step by step. Because of this difference, we feel it is better to create a separate ticket.

      The first step would be adding support for protecting data integrity from the Lustre OSD to disk, i.e. OSD-to-storage T10PI.

      Given that checksums are already supported for Lustre RPCs, the data is already protected while transferring over the network. By using both the network checksum and OSD-to-storage T10PI together, the window with no data protection is greatly reduced. The only remaining danger is that the page cache in memory on the OSD could somehow be changed between the RPC checksum verification and the OSD T10PI checksum calculation. This makes it debatable whether it is really necessary to implement end-to-end T10PI support. However, convincing concerned users to accept this small probability is still difficult, especially when we do not have a quantitative estimate of that probability.

      However, even if Lustre supports the OSC-to-storage T10PI feature, the data is in theory not fully protected unless some kind of T10PI API is provided to applications, e.g. https://lwn.net/Articles/592113. Supporting that kind of API would be even more difficult, because LOV striping needs to be taken into account, unless we limit end-to-end T10PI to single-stripe files.

      One major difficulty of implementing the OSC-to-storage T10PI feature for Lustre is that a single bulk RPC can be split into several I/Os on the server's disk, which means the checksum needs to be recalculated.

      In order to avoid this problem, the Lustre client needs to issue smaller bulk RPCs so as to make sure each RPC can always be written/read in one I/O. This eliminates the need to recalculate the T10PI protection data on the OSD side.
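      To make the mechanism concrete, here is a minimal C sketch (not Lustre code) of how per-sector T10-PI guard (GRD) tags are computed over a bulk buffer. The 4096-byte sector size follows the discussion later in this ticket, and the CRC16 polynomial 0x8BB7 is the standard T10-DIF guard; treat both as illustrative assumptions rather than as what the Lustre patches actually do.

```c
/*
 * Minimal sketch (not Lustre code): compute one 16-bit guard tag per
 * 4096-byte sector of a bulk buffer.  If the client caps the bulk RPC
 * so the server can submit it as a single I/O, the protection data can
 * be carried through unchanged; otherwise the OSD would have to
 * recompute it for each smaller I/O.  Details are illustrative only.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SECTOR_SIZE 4096

/* Bitwise CRC16 with the T10-DIF polynomial 0x8BB7 (slow but clear). */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)buf[i] << 8;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (crc << 1) ^ 0x8BB7
					     : crc << 1;
	}
	return crc;
}

/* Fill one guard tag per sector; returns the number of tags written. */
static size_t fill_guard_tags(const uint8_t *data, size_t len,
			      uint16_t *tags, size_t max_tags)
{
	size_t nsectors = len / SECTOR_SIZE;

	if (nsectors > max_tags)
		nsectors = max_tags;
	for (size_t i = 0; i < nsectors; i++)
		tags[i] = crc16_t10dif(data + i * SECTOR_SIZE, SECTOR_SIZE);
	return nsectors;
}

int main(void)
{
	static uint8_t rpc_buf[16 * SECTOR_SIZE];	/* a 64KB "bulk RPC" */
	uint16_t tags[16];

	memset(rpc_buf, 0xa5, sizeof(rpc_buf));
	size_t n = fill_guard_tags(rpc_buf, sizeof(rpc_buf), tags, 16);
	printf("%zu sectors, first guard tag 0x%04x\n", n, tags[0]);
	return 0;
}
```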

       

      Attachments

        Issue Links

          Activity

            [LU-10472] Data Integrity (T10PI) support for Lustre

            nrutman Nathan Rutman added a comment -

            To avoid reading through this ticket again, here is the final conclusion:

            In summary, the current T10-PI checksum support in 2.12 is no worse than the existing bulk RPC checksums (with separate T10-PI checksums at the kernel bio level), and with the 32266 patch it is better still due to reduced CPU overhead. With that patch, reads can use the GRD tags straight from the hardware to compute the RPC checksum without any bulk data checksums on the OSS.

            32266 includes a series of server-side kernel patches for RHEL7. These patches allow for overlapping protection domains from Lustre client to disk. Without the patches, there is a protection hole (no worse than older network checksums). 


            adilger Andreas Dilger added a comment -

            Correct, ZFS does not implement T10-PI support. There is the old design for integrated end-to-end checksums from the client, which would use a checksum method similar to, but more sophisticated than, the current T10 mechanism.


            nrutman Nathan Rutman added a comment -

            Thanks Andreas for the response. The T10-PI checks (and therefore end-to-end) only apply to ldiskfs, not ZFS, correct?

            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32266/
            Subject: LU-10472 osd-ldiskfs: T10PI between RPC and BIO
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ccf3674c9ca3ed8918c49163007708d1ae5db6f5


            adilger Andreas Dilger added a comment -

            Nathan, the patch https://review.whamcloud.com/32266 "osd-ldiskfs: T10PI between RPC and BIO" will allow passing the GRD tags from Lustre down to the storage, if the hardware supports it. The benefit of the integration with Lustre is indeed, as you wrote, to reduce the duplicate checksum CPU overhead on the server. Using the checksum-of-checksums avoids increasing the request size for GRD tags as the RPC size grows (and will help ZFS in the future since it is doing a Merkle tree for its checksums).

            To respond to your #2, Lustre has always computed the network checksums in an "overlapping" manner. It will verify the RPC checksum after the data checksums are computed on the OSS, so if they don't match, the client will re-send the RPC. This ensures that the T10-PI checksums that the OSS is computing (and the pages they represent) are the same as the ones computed at the client (up to the limit of the 16-bit GRD checksum itself, which is IMHO fairly weak). At least the 16-bit GRD checksum is only on the 4096-byte sector, and we have a somewhat stronger 32-bit checksum for the 2KB of GRD tags (for 4MB RPC).
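            A rough C sketch of the checksum-of-checksums arithmetic described above, assuming zlib's crc32() as a stand-in for the 32-bit algorithm actually used; the per-sector GRD tags are dummy values here, whereas in reality they would be the T10 guards computed over the received bulk pages:

```c
/*
 * Sketch of the "checksum of checksums" idea (not the actual Lustre
 * code).  A 4MB bulk RPC covers 1024 4KB sectors, so the per-sector
 * 16-bit GRD tags total only 2KB, and the RPC-level checksum is a
 * 32-bit checksum over that 2KB of tags rather than over the 4MB of
 * data.  The OSS recomputes this value from the tags it derives from
 * the received pages and compares it with the value carried in the
 * RPC; on mismatch the client resends.
 */
#include <stdint.h>
#include <stdio.h>
#include <zlib.h>			/* crc32(); link with -lz */

#define RPC_SIZE	(4u << 20)		/* 4MB bulk RPC */
#define SECTOR_SIZE	4096u
#define NSECTORS	(RPC_SIZE / SECTOR_SIZE)	/* 1024 sectors */

int main(void)
{
	/* Dummy stand-ins for the per-sector T10 guard tags. */
	static uint16_t grd_tags[NSECTORS];
	for (uint32_t i = 0; i < NSECTORS; i++)
		grd_tags[i] = (uint16_t)(i * 0x9e37);

	uint32_t tag_bytes = NSECTORS * sizeof(uint16_t);	/* 2KB */

	/* The RPC checksum covers only the tags, not the bulk data. */
	uint32_t oss_cksum = crc32(0L, (const unsigned char *)grd_tags,
				   tag_bytes);
	uint32_t client_cksum = oss_cksum;	/* placeholder for the
						 * value sent in the RPC */

	if (oss_cksum != client_cksum)
		printf("checksum mismatch: client must resend the RPC\n");
	else
		printf("%u sectors -> %u bytes of tags, checksum 0x%08x\n",
		       (unsigned)NSECTORS, tag_bytes, oss_cksum);
	return 0;
}
```

            The point of this arrangement, as the comment above states, is that only the small checksum-of-checksums needs to travel with the RPC rather than the full set of GRD tags, so the request size does not grow as the RPC size grows.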

            In summary, the current T10-PI checksum support in 2.12 is no worse than the existing bulk RPC checksums (with separate T10-PI checksums at the kernel bio level), and with the 32266 patch it is better still due to reduced CPU overhead. With that patch, reads can use the GRD tags straight from the hardware to compute the RPC checksum without any bulk data checksums on the OSS.


            People

              Assignee: qian_wc Qian Yingjin
              Reporter: lixi Li Xi (Inactive)
              Votes: 0
              Watchers: 16
