  1. Lustre
  2. LU-13632

BAD WRITE CHECKSUM

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • Lustre 2.12.3
    • None
    • The server side is all Dell gear: the MDS and OSS nodes are R740s, and the storage is Dell ME4 enclosures. The LNet routers are Lenovo SR630s running Lustre 2.12.4; the clients are mostly Dell and Lenovo gear running Lustre 2.10.7-1.
    • 1

    Description

      One of our Lustre storage servers has become unstable. Below are the messages we found on the two OSS nodes that fell over (rebooted) due to bad write checksums. We don't recall seeing this before. We tracked down the files referenced by the hex FIDs in the error messages below with the lfs fid2path command and killed the user's jobs. I will mention that we changed our LNet routers from running 2.13.0 to 2.12.4 in the past week or two. Also, the Lustre clients accessing this storage are running Lustre 2.10.7-1.

      The unstable Lustre server is running Lustre 2.12.3:
      [root@holyscratch01mds01 ~]# rpm -qa |grep lustre
      kernel-3.10.0-1062.1.1.el7_lustre.x86_64
      kmod-lustre-2.12.3-1.el7.x86_64
      kmod-lustre-osd-ldiskfs-2.12.3-1.el7.x86_64
      kernel-devel-3.10.0-1062.1.1.el7_lustre.x86_64
      lustre-osd-zfs-mount-2.12.3-1.el7.x86_64
      lustre-2.12.3-1.el7.x86_64
      lustre-zfs-dkms-2.12.3-1.el7.noarch
      lustre-resource-agents-2.12.3-1.el7.x86_64
      lustre-ldiskfs-zfs-5.0.0-1.el7.x86_64
      kernel-mft-4.13.3-3.10.0_1062.1.1.el7_lustre.x86_64.x86_64
      lustre-osd-ldiskfs-mount-2.12.3-1.el7.x86_64
      kmod-spl-3.10.0-1062.1.1.el7_lustre.x86_64-0.7.13-1.el7.x86_64

      Here are the errors:

      Jun 3 18:23:54 holyscratch01oss03 kernel: LustreError: 168-f: scratch1-OST001a: BAD WRITE CHECKSUM: from 12345-10.31.164.172@o2ib via 10.31.179.131@o2ib4 inode [0x20001bc69:0xa3f6:0x0] object 0x0:76971842 extent [71303168-75497471]: client csum e6b01811, server csum 880d3728
      Jun 3 18:23:55 holyscratch01oss03 kernel: LustreError: 168-f: scratch1-OST0009: BAD WRITE CHECKSUM: from 12345-10.31.163.222@o2ib via 10.31.179.133@o2ib4 inode [0x200012232:0xb3f4:0x0] object 0x0:76503185 extent [1279262720-1283457023]: client csum e611ac34, server csum 82fdc50a

      Any ideas as to why this is happening?
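
      For anyone triaging similar messages, here is a rough sketch of how the fields in the BAD WRITE CHECKSUM lines above could be pulled apart and turned into lfs fid2path invocations. It is not part of the original report: the regular expression is written against the two log lines quoted above only, the function names are made up for illustration, and the mount point /mnt/scratch1 is a placeholder assumption (substitute the real client mount point).

```python
import re
import shlex

# Pattern written against the two messages quoted in this report; other
# Lustre versions may format the line differently (assumption).
BAD_CSUM_RE = re.compile(
    r"(?P<target>\S+): BAD WRITE CHECKSUM: "
    r"from (?P<client>\S+) via (?P<router>\S+) "
    r"inode (?P<fid>\[0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+\]) "
    r"object (?P<object>\S+) "
    r"extent \[(?P<start>\d+)-(?P<end>\d+)\]: "
    r"client csum (?P<client_csum>[0-9a-f]+), "
    r"server csum (?P<server_csum>[0-9a-f]+)"
)


def parse_bad_checksum(line):
    """Return the fields of one BAD WRITE CHECKSUM line, or None if no match."""
    match = BAD_CSUM_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # The extent is inclusive on both ends, so add one to get its length.
    record["length"] = record["end"] - record["start"] + 1
    return record


def fid2path_command(fid, mount_point):
    """Build (but do not run) an 'lfs fid2path' command line for one FID.

    The FID contains square brackets, so it is shell-quoted.
    """
    return "lfs fid2path {} {}".format(shlex.quote(mount_point), shlex.quote(fid))


if __name__ == "__main__":
    sample = (
        "Jun 3 18:23:54 holyscratch01oss03 kernel: LustreError: 168-f: "
        "scratch1-OST001a: BAD WRITE CHECKSUM: from 12345-10.31.164.172@o2ib "
        "via 10.31.179.131@o2ib4 inode [0x20001bc69:0xa3f6:0x0] "
        "object 0x0:76971842 extent [71303168-75497471]: "
        "client csum e6b01811, server csum 880d3728"
    )
    record = parse_bad_checksum(sample)
    # /mnt/scratch1 is a hypothetical mount point, not taken from the report.
    print(record["target"], record["client"], record["router"], record["length"])
    print(fid2path_command(record["fid"], "/mnt/scratch1"))
```

      Grouping the parsed records by client NID and by router NID may help show whether the mismatches cluster on particular clients or on particular LNet routers. In the two lines quoted above, the clients differ and so do the routers.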

          People

            wshilong Wang Shilong (Inactive)
            mre64 Michael Ethier (Inactive)