Description
A Lustre client on a KVM hypervisor using SR-IOV for IB has started to generate the following errors:
OSS (oak-io1-s1 10.0.2.101@o2ib5):
Aug 31 11:27:04 oak-io1-s1 kernel: LustreError: 168-f: BAD WRITE CHECKSUM: oak-OST001a from 12345-10.0.2.225@o2ib5 inode [0x200002f84:0x6b6a:0x0] object 0x0:4413301 extent [726925312-727973887]: client csum 4ecd330, server csum 5610e5e5
The second OSS in production also has the same errors.
SR-IOV based client (oak-gw06 10.0.2.225@o2ib5):
Aug 31 11:27:05 oak-gw06 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.0.2.101@o2ib5 inode [0x200002f84:0x6b6a:0x0] object 0x0:4413301 extent [726925312-727973887]
The client also gets some read checksum errors later:
Aug 31 11:37:42 oak-gw06 kernel: LustreError: 133-1: oak-OST001a-osc-ffff88041b99c000: BAD READ CHECKSUM: from 10.0.2.101@o2ib5 inode [0x0:0x0:0x0] object 0x0:4413301 extent [1581252608-1582301183]
I will attach kernel logs of both.
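To help correlate the errors across the attached logs, here is a minimal Python sketch that extracts the operation, peer NID, object and extent from checksum messages. The regular expression is an assumption derived only from the three log lines quoted above, so other Lustre versions may format these messages differently. The names `parse_checksum_errors` and `summarize` are hypothetical helpers, not part of any Lustre tooling.

```python
import re
from collections import Counter

# Pattern inferred from the three messages quoted in this ticket only;
# it is an assumption, not an official description of the log format.
CSUM_RE = re.compile(
    r"BAD (?P<op>READ|WRITE) CHECKSUM.*?"
    r"from (?:\d+-)?(?P<nid>[\w.]+@\w+)"
    r".*?object (?P<obj>\S+) "
    r"extent \[(?P<start>\d+)-(?P<end>\d+)\]"
)

def parse_checksum_errors(lines):
    """Return one dict per checksum error found in the given log lines."""
    events = []
    for line in lines:
        m = CSUM_RE.search(line)
        if m is None:
            continue  # not a checksum message (or an unrecognized format)
        start, end = int(m["start"]), int(m["end"])
        events.append({
            "op": m["op"],            # READ or WRITE
            "peer": m["nid"],         # NID reported after "from"
            "object": m["obj"],
            "start": start,
            "end": end,
            "size": end - start + 1,  # extent bounds are inclusive
        })
    return events

def summarize(events):
    """Count errors per (operation, peer NID) pair."""
    return Counter((e["op"], e["peer"]) for e in events)
```

Run against the three messages above, every extent comes out at 1048576 bytes (1 MiB), and both write errors refer to the same object and extent.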
In this particular case, the client is a Globus endpoint using Lustre as the backend. This is actually the second time we've seen this: the same issue occurred on another VM running rsnapshot jobs. Rebooting the impacted VM does fix the issue.
Are you aware of such issues when using SR-IOV? Any idea how we could troubleshoot this?
Thanks!
Stephane Thiell