[LU-9939] Bad checksums from clients using SR-IOV Created: 01/Sep/17 Updated: 06/Nov/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre 2.9 including fixes for |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
A Lustre client on a KVM hypervisor using SR-IOV for IB has started to generate the following errors:

OSS (oak-io1-s1 10.0.2.101@o2ib5):

Aug 31 11:27:04 oak-io1-s1 kernel: LustreError: 168-f: BAD WRITE CHECKSUM: oak-OST001a from 12345-10.0.2.225@o2ib5 inode [0x200002f84:0x6b6a:0x0] object 0x0:4413301 extent [726925312-727973887]: client csum 4ecd330, server csum 5610e5e5

The second OSS in production also has the same errors.

SR-IOV based client (oak-gw06 10.0.2.225@o2ib5):

Aug 31 11:27:05 oak-gw06 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.0.2.101@o2ib5 inode [0x200002f84:0x6b6a:0x0] object 0x0:4413301 extent [726925312-727973887]

The client also gets some read checksum errors later:

Aug 31 11:37:42 oak-gw06 kernel: LustreError: 133-1: oak-OST001a-osc-ffff88041b99c000: BAD READ CHECKSUM: from 10.0.2.101@o2ib5 inode [0x0:0x0:0x0] object 0x0:4413301 extent [1581252608-1582301183]

I will attach kernel logs of both.

In this particular case, the client is a Globus endpoint, using Lustre as the backend. This is actually the second time we've seen this; the same issue was seen on another VM running rsnapshot jobs. Rebooting the impacted VM does fix the issue.

Are you aware of such issues when using SR-IOV? Any idea how we could troubleshoot this?

Thanks! |
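As a first troubleshooting aid, the OSS-side messages can be tallied per client and per extent to see whether the errors cluster on particular VMs or offsets. The following is only a minimal sketch: the regular expression is derived from the single sample line quoted above (other Lustre versions may format the message differently), and the names `PATTERN`, `SAMPLE`, and `parse_bad_write_checksum` are hypothetical, not part of any Lustre tool.

```python
import re

# Pattern derived from the one OSS log line quoted in this ticket.
# Assumption: other occurrences follow the same layout.
PATTERN = re.compile(
    r"BAD WRITE CHECKSUM: (?P<target>\S+) from (?P<peer>\S+) "
    r"inode \[(?P<fid>[^\]]+)\] object (?P<object>\S+) "
    r"extent \[(?P<start>\d+)-(?P<end>\d+)\]: "
    r"client csum (?P<client_csum>[0-9a-f]+), "
    r"server csum (?P<server_csum>[0-9a-f]+)"
)

# The OSS message from the description, copied verbatim.
SAMPLE = (
    "Aug 31 11:27:04 oak-io1-s1 kernel: LustreError: 168-f: "
    "BAD WRITE CHECKSUM: oak-OST001a from 12345-10.0.2.225@o2ib5 "
    "inode [0x200002f84:0x6b6a:0x0] object 0x0:4413301 "
    "extent [726925312-727973887]: client csum 4ecd330, "
    "server csum 5610e5e5"
)


def parse_bad_write_checksum(line):
    """Return a dict of fields from an OSS BAD WRITE CHECKSUM line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    rec = match.groupdict()
    # Extent bounds are inclusive byte offsets in the log message.
    rec["start"] = int(rec["start"])
    rec["end"] = int(rec["end"])
    rec["length"] = rec["end"] - rec["start"] + 1
    # Checksums are printed in hexadecimal without a 0x prefix.
    rec["client_csum"] = int(rec["client_csum"], 16)
    rec["server_csum"] = int(rec["server_csum"], 16)
    return rec


if __name__ == "__main__":
    record = parse_bad_write_checksum(SAMPLE)
    print(record["peer"], record["target"], record["length"])
```

For the sample line, the extent covers 1048576 bytes (1 MiB), which matches a single bulk write; feeding a whole kernel log through the parser line by line would show whether every failing extent has that size.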
| Comments |
| Comment by Bruno Faccini (Inactive) [ 02/Sep/17 ] |
|
Hello Stephane! |
| Comment by Stephane Thiell [ 06/Sep/17 ] |
|
Hey Bruno! That's very good to know! And no, I don't have the patch from Thanks! Stephane |
| Comment by Bruno Faccini (Inactive) [ 06/Sep/17 ] |
|
OK, but on the other hand, I am sorry: I don't have any other debugging option to suggest, and I did not find any similar report of checksum errors when running SR-IOV. |
| Comment by Stephane Thiell [ 26/Oct/17 ] |
|
Hi Bruno! The problem occurred again on a VM used as a Globus endpoint. The good news is that we are now running 2.10.1 on this system (clients and servers), so I did enable this checksum_dump thing and attached some of the resulting files to this ticket. Do you know how to troubleshoot this? Thanks! Stephane |
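One possible way to use the files produced with checksum_dump enabled is to compare a client-side and a server-side dump of the same extent and list the byte ranges that differ. The sketch below assumes both dumps are raw page data of the same extent, which this ticket does not confirm; `differing_ranges` and the file paths in the usage note are hypothetical.

```python
def differing_ranges(a, b):
    """Return inclusive (start, end) byte ranges where buffers a and b differ.

    If the buffers have different lengths, the extra tail of the longer
    one is reported as a final differing range.
    """
    ranges = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i  # open a new differing run
        elif start is not None:
            ranges.append((start, i - 1))  # close the current run
            start = None
    if start is not None:
        ranges.append((start, common - 1))
    if len(a) != len(b):
        ranges.append((common, max(len(a), len(b)) - 1))
    return ranges


if __name__ == "__main__":
    # Synthetic demonstration data, not taken from the attached dumps.
    client = bytes(16)
    server = bytearray(16)
    server[3] = server[4] = 0xFF
    server[10] = 0x01
    print(differing_ranges(client, bytes(server)))
```

With real dumps, the two buffers would be read with something like `open(path, "rb").read()` on each file. Whether the differing ranges fall on page boundaries or are scattered inside pages may help narrow down where the data was altered.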
| Comment by Bruno Faccini (Inactive) [ 26/Oct/17 ] |
|
Hello Stephane, |
| Comment by Stephane Thiell [ 06/Nov/17 ] |
|
Thanks Bruno! I'm still waiting to see a new occurrence to further troubleshoot this issue on the OSS side. I believe our Globus endpoint VMs have been less loaded lately, which might be why I haven't seen the problem yet. I'll keep you posted. |