[LU-9939] Bad checksums from clients using SR-IOV Created: 01/Sep/17  Updated: 06/Nov/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Bruno Faccini (Inactive)
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Lustre 2.9 including fixes for LU-8851 (nodemap: add uid/gid only flags to control mapping) and LU-9258 (nodemap: group quota ID not properly mapped), kernel 3.10.0-514.16.1.el7_lustre.x86_64 on servers, 3.10.0-514.10.2.el7_lustre.x86_64 on clients


Attachments: lustre-log-checksum_dump.tar.gz, oak-gw06_kernel_Aug31.log, oak-gw06_kernel_full.log, oak-io1-s1_kernel_Aug31.log
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A Lustre client on a KVM hypervisor using SR-IOV for IB has started to generate the following errors:

OSS (oak-io1-s1 10.0.2.101@o2ib5):

Aug 31 11:27:04 oak-io1-s1 kernel: LustreError: 168-f: BAD WRITE CHECKSUM: oak-OST001a from 12345-10.0.2.225@o2ib5 inode [0x200002f84:0x6b6a:0x0] object 0x0:4413301 extent [726925312-727973887]: client csum 4ecd330, server csum 5610e5e5


The second OSS in production also has the same errors.

SR-IOV based client (oak-gw06 10.0.2.225@o2ib5):

Aug 31 11:27:05 oak-gw06 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.0.2.101@o2ib5 inode [0x200002f84:0x6b6a:0x0] object 0x0:4413301 extent [726925312-727973887]


The client also gets some read checksum errors later:

Aug 31 11:37:42 oak-gw06 kernel: LustreError: 133-1: oak-OST001a-osc-ffff88041b99c000: BAD READ CHECKSUM: from 10.0.2.101@o2ib5 inode [0x0:0x0:0x0] object 0x0:4413301 extent [1581252608-1582301183]


I will attach kernel logs of both.

In this particular case, the client is a Globus endpoint using Lustre as the backend. This is actually the second time we've seen this; the same issue was previously seen on another VM running rsnapshot jobs. Rebooting the affected VM does fix the issue.

Are you aware of such issues when using SR-IOV? Any idea how we could troubleshoot this?

Thanks!
Stephane Thiell



 Comments   
Comment by Bruno Faccini (Inactive) [ 02/Sep/17 ]

Hello Stephane!
Can you check whether your Lustre 2.9 version includes the patch for LU-8376?
If it does, you can enable the dump of pages upon checksum error on both the client and OST sides, which may provide more information to help find the cause of the error.
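
A minimal sketch of how this could be enabled with lctl, assuming the LU-8376 patch exposes a checksum_dump tunable under osc on the client and obdfilter on the OSS (the exact parameter names should be verified against the installed version):

# Client side: dump the pages of a bulk RPC when a checksum error is detected
lctl set_param osc.*.checksum_dump=1

# OSS side: same, for the server end of the transfer
lctl set_param obdfilter.*.checksum_dump=1

The dumped pages for the same extent can then be compared between client and OSS to see where the data changed.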

Comment by Stephane Thiell [ 06/Sep/17 ]

Hey Bruno!

That's very good to know! And no, I don't have the patch from LU-8376 in our 2.9, but we plan to upgrade to 2.10.1 in the near future, so I will just wait for that and then enable this new debugging option.

Thanks!

Stephane

Comment by Bruno Faccini (Inactive) [ 06/Sep/17 ]

OK. On the other hand, I am sorry but I don't have any other option to suggest for debugging, and I did not find any similar report of checksum errors when running SR-IOV.

Comment by Stephane Thiell [ 26/Oct/17 ]

Hi Bruno!

The problem occurred again on a VM used as a Globus endpoint. The good news is that we are now running 2.10.1 on this system (clients and servers), so I enabled the checksum_dump option and attached some of the resulting files to this ticket. Do you know how to troubleshoot this further?

Thanks!

Stephane

Comment by Bruno Faccini (Inactive) [ 26/Oct/17 ]

Hello Stephane,
We missed you at the last LAD!!
Your tarball of dumps only contains files from the client side. Did you also enable the dump of pages on bulk transfers with checksum errors on the OSS side? If so, there should be corresponding dumps available on the OSSes that would be of interest for comparison.
The syslogs of the affected clients and OSSes would also be helpful, along with the striping info for each affected file/FID, and the file contents themselves if still available and unmodified.
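
A minimal sketch of how the striping info could be gathered, assuming the FID from the console messages (e.g. [0x200002f84:0x6b6a:0x0]) and a hypothetical client mount point of /mnt/oak:

# Resolve the FID reported in the BAD CHECKSUM messages to a pathname
lfs fid2path /mnt/oak [0x200002f84:0x6b6a:0x0]

# Dump the striping layout of the file returned by fid2path
lfs getstripe -v /mnt/oak/path/to/affected/file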

Comment by Stephane Thiell [ 06/Nov/17 ]

Thanks Bruno! I'm still waiting for a new occurrence so I can further troubleshoot this issue on the OSS side. I believe our Globus endpoint VMs have been less loaded lately, which might be why I haven't seen the problem yet. I'll keep you posted.
