Details
-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
Lustre 2.12.3
-
None
-
The server side is all Dell gear: the MDS and OSS nodes are R740s, and the storage is Dell ME4 enclosures. The LNet routers are Lenovo SR630s running Lustre 2.12.4; the clients are mostly Dell and Lenovo gear running Lustre 2.10.7-1.
-
1
Description
One of our Lustre storage servers has become unstable. Below are the messages we found on the two OSSes that fell over (rebooted) due to bad write checksums. We don't recall seeing this before. We tracked down the files referenced by the hex codes (FIDs) in the error messages below with the lfs fid2path command and killed the user's jobs. I will mention that we changed our LNet routers from running 2.13.0 to 2.12.4 in the past week or two. Also, the Lustre clients accessing this storage are running Lustre 2.10.7-1.
The unstable server is running Lustre 2.12.3:
[root@holyscratch01mds01 ~]# rpm -qa |grep lustre
kernel-3.10.0-1062.1.1.el7_lustre.x86_64
kmod-lustre-2.12.3-1.el7.x86_64
kmod-lustre-osd-ldiskfs-2.12.3-1.el7.x86_64
kernel-devel-3.10.0-1062.1.1.el7_lustre.x86_64
lustre-osd-zfs-mount-2.12.3-1.el7.x86_64
lustre-2.12.3-1.el7.x86_64
lustre-zfs-dkms-2.12.3-1.el7.noarch
lustre-resource-agents-2.12.3-1.el7.x86_64
lustre-ldiskfs-zfs-5.0.0-1.el7.x86_64
kernel-mft-4.13.3-3.10.0_1062.1.1.el7_lustre.x86_64.x86_64
lustre-osd-ldiskfs-mount-2.12.3-1.el7.x86_64
kmod-spl-3.10.0-1062.1.1.el7_lustre.x86_64-0.7.13-1.el7.x86_64
Here are the errors:
Jun 3 18:23:54 holyscratch01oss03 kernel: LustreError: 168-f: scratch1-OST001a: BAD WRITE CHECKSUM: from 12345-10.31.164.172@o2ib via 10.31.179.131@o2ib4 inode [0x20001bc69:0xa3f6:0x0] object 0x0:76971842 extent [71303168-75497471]: client csum e6b01811, server csum 880d3728
Jun 3 18:23:55 holyscratch01oss03 kernel: LustreError: 168-f: scratch1-OST0009: BAD WRITE CHECKSUM: from 12345-10.31.163.222@o2ib via 10.31.179.133@o2ib4 inode [0x200012232:0xb3f4:0x0] object 0x0:76503185 extent [1279262720-1283457023]: client csum e611ac34, server csum 82fdc50a
Any ideas as to why this is happening?
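For anyone triaging similar messages, here is a minimal Python sketch of how the FIDs in the errors above can be extracted and turned into lfs fid2path commands, and how the events can be tallied per router NID. It only builds the command strings; it does not run them. The mount point /mnt/scratch1 and all helper names are hypothetical, and the regular expression is written against only the two messages shown here, so other variants of the message may need adjustments.

```python
import re
import shlex
from collections import Counter

# The two console messages quoted in this ticket.
SAMPLE = [
    "Jun 3 18:23:54 holyscratch01oss03 kernel: LustreError: 168-f: scratch1-OST001a: "
    "BAD WRITE CHECKSUM: from 12345-10.31.164.172@o2ib via 10.31.179.131@o2ib4 "
    "inode [0x20001bc69:0xa3f6:0x0] object 0x0:76971842 extent [71303168-75497471]: "
    "client csum e6b01811, server csum 880d3728",
    "Jun 3 18:23:55 holyscratch01oss03 kernel: LustreError: 168-f: scratch1-OST0009: "
    "BAD WRITE CHECKSUM: from 12345-10.31.163.222@o2ib via 10.31.179.133@o2ib4 "
    "inode [0x200012232:0xb3f4:0x0] object 0x0:76503185 extent [1279262720-1283457023]: "
    "client csum e611ac34, server csum 82fdc50a",
]

# Assumption: this pattern matches only the message layout seen in SAMPLE.
BAD_CSUM = re.compile(
    r"(?P<ost>\S+): BAD WRITE CHECKSUM: "
    r"from \d+-(?P<client>\S+) via (?P<router>\S+) "
    r"inode (?P<fid>\[0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+\]) "
    r"object \S+ extent \[(?P<start>\d+)-(?P<end>\d+)\]: "
    r"client csum (?P<client_csum>[0-9a-f]+), server csum (?P<server_csum>[0-9a-f]+)"
)


def parse_bad_checksums(lines):
    """Return one dict per BAD WRITE CHECKSUM line; other lines are skipped."""
    events = []
    for line in lines:
        match = BAD_CSUM.search(line)
        if match:
            events.append(match.groupdict())
    return events


def fid2path_command(fid, mount_point):
    """Build (but do not run) an lfs fid2path command line for one FID."""
    # The FID contains brackets, so it is shell-quoted.
    return "lfs fid2path {} {}".format(shlex.quote(mount_point), shlex.quote(fid))


def count_by(events, key):
    """Tally events by one parsed field, e.g. 'router' or 'client'."""
    return Counter(event[key] for event in events)


if __name__ == "__main__":
    events = parse_bad_checksums(SAMPLE)
    for event in events:
        # /mnt/scratch1 is a hypothetical client mount point.
        print(fid2path_command(event["fid"], "/mnt/scratch1"))
    for router, hits in count_by(events, "router").items():
        print(router, hits)
```

Tallying by router or client NID over a longer log window may help show whether the mismatches cluster on one path, which seems worth checking given the recent router version change.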