Details
Bug
Resolution: Fixed
Major
Lustre 2.10.3
None
centos7, skylake, opa, zfs 0.7.6, lustre 2.10.3 + https://review.whamcloud.com/#/c/29992/. kernel nopti on servers, pti on clients, 3.10.0-693.17.1.el7.x86_64
3
9223372036854775807
Description
Hiya,
these appeared last night.
john100,101 are clients. arkle3,6 are OSS's. transom1 is running the fabric manager.
in light of the similar looking LU-9305 I thought I would create this ticket.
we run default (4M) rpcs on clients and servers. our OSTs are each 4 raidz3's in 1 zpool, and have recordsize=2M. 2 OSTs per OSS. 16 OSTs total.
I suppose it could be an OPA network glitch, but it affected 2 clients and 2 servers so that seems unlikely.
we have just moved from zfs 0.7.5 to zfs 0.7.6. we ran ior and mdtest after this change and they were fine. these errors occurred a couple of days after that.
Feb 19 23:45:12 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 54e81a18
Feb 19 23:45:12 john100 kernel: LNetError: 899:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600741136 length 1048576 too big: 1048176 left, 1048176 allowed
Feb 19 23:45:12 arkle3 kernel: LNet: Using FMR for registration
Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881764648600
Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881764648600
Feb 19 23:45:12 arkle3 kernel: LustreError: 337237:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff881687561450 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0
Feb 19 23:45:12 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110
Feb 19 23:45:12 arkle3 kernel: LNet: Skipped 1 previous similar message
Feb 19 23:45:12 john101 kernel: LNetError: 904:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282655776 length 1048576 too big: 1048176 left, 1048176 allowed
Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8810cb714600
Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8810cb714600
Feb 19 23:45:12 arkle6 kernel: LustreError: 42356:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff88276c3d5850 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0
Feb 19 23:45:12 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110
Feb 19 23:45:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 54e81a18 (type 4), client csum now efde5b36
Feb 19 23:45:12 john100 kernel: LustreError: 924:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880e7a9b4e00 x1591386600740944/t197580821951(197580821951) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044319 ref 2 fl Interpret:RM/0/0 rc 0/0
Feb 19 23:45:13 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum df7fda9d
Feb 19 23:45:13 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum df7fda9d (type 4), client csum now efde5b36
Feb 19 23:45:13 john100 kernel: LustreError: 911:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff8807492fe300 x1591386600747696/t197580821955(197580821955) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044320 ref 2 fl Interpret:RM/0/0 rc 0/0
Feb 19 23:45:15 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 87da008a
Feb 19 23:45:15 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 87da008a (type 4), client csum now efde5b36
Feb 19 23:45:15 john100 kernel: LustreError: 910:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880eb3d51b00 x1591386600751360/t197580821956(197580821956) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044322 ref 2 fl Interpret:RM/0/0 rc 0/0
Feb 19 23:45:18 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 1cc7a793
Feb 19 23:45:18 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 1cc7a793 (type 4), client csum now efde5b36
full log attached.
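To make the extents in the BAD WRITE CHECKSUM messages easier to reason about, here is a minimal sketch that pulls the extent out of a server-side (168-f) line and reports its length and page alignment. The regex, the field names and the 4096-byte page size are assumptions for illustration, not anything taken from Lustre itself.

```python
import re

# Regex written against the 168-f message text quoted in this ticket.
# The group names are labels chosen for this sketch.
BAD_CSUM = re.compile(
    r"(?P<target>\S+): BAD WRITE CHECKSUM: from (?P<peer>\S+) "
    r"inode (?P<fid>\[[^\]]+\]) object (?P<obj>\S+) "
    r"extent \[(?P<start>\d+)-(?P<end>\d+)\]: "
    r"client csum (?P<client>[0-9a-f]+), server csum (?P<server>[0-9a-f]+)"
)

PAGE = 4096  # assumed page size on these x86_64 nodes


def describe_extent(line):
    """Return extent length/alignment for one server-side log line, or None."""
    m = BAD_CSUM.search(line)
    if m is None:
        return None
    start, end = int(m["start"]), int(m["end"])
    return {
        "target": m["target"],
        "length": end - start + 1,            # extent bounds are inclusive
        "start_offset_in_page": start % PAGE,
        "end_is_page_aligned": (end + 1) % PAGE == 0,
        "client_csum": m["client"],
        "server_csum": m["server"],
    }


sample = (
    "Feb 19 23:45:12 arkle3 kernel: LustreError: 168-f: dagg-OST0005: "
    "BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 "
    "inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 "
    "extent [331258364-335450111]: client csum efde5b36, server csum 54e81a18"
)
info = describe_extent(sample)
print(info)
```

For the first extent in the log, the start is not page aligned while the end is, and the length plus the start's offset into its page comes to exactly 4 MiB, which matches the 4M RPC size mentioned above.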
there were hung and unkillable user processes on the 2 clients afterwards. a reboot of the 2 clients has cleared up the looping messages of the type shown below.
Feb 20 14:28:15 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef2c00
Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88176544be00
Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8814072e0200
Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88176544be00
Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8814072e0200
Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e25ef3800
Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef3800
Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881615f0f000
Feb 20 14:28:51 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880be3931c00
Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881615f0f000
cheers,
robin
Activity
Hi Robin, did you reduce the recordsize to 1MB on the filesystem? We haven't done any testing ourselves with larger recordsize. The clients would need to be remounted to also get a smaller RPC size (they default to RPC size == max blocksize for ZFS at mount).
Also, are you using TID RDMA (cap_mask=....) for your OPA connection? We've seen problems with that under load, and if yes it should be disabled.
a bunch more CHECKSUM and LNet errors today. this lot were again definitely associated with over quota. I don't know if all the incidents are though...
I guess I'd be very worried if these weren't purely over quota events, which is why the read checksum messages were very alarming.
any thoughts on whether these are just from quota events or not? is there any way we can easily tell that?
I'll attach messages for today's errors to this ticket in a min.
the user has many (~1100 so far) of the below in their job output. looks like they have ~127 write processes across their 28 nodes and 896 cores. the code is looping on the nodes trying to complete the writes.
HDF5-DIAG: Error detected in HDF5 (1.10.1) MPI-process 168:
  #000: H5Dio.c line 269 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 345 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 791 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dcontig.c line 642 in H5D__contig_write(): contiguous write failed
    major: Dataset
    minor: Write failed
  #004: H5Dselect.c line 309 in H5D__select_write(): write error
    major: Dataspace
    minor: Write failed
  #005: H5Dselect.c line 220 in H5D__select_io(): write error
    major: Dataspace
    minor: Write failed
  #006: H5Dcontig.c line 1267 in H5D__contig_writevv(): can't perform vectorized sieve buffer write
    major: Dataset
    minor: Can't operate on object
  #007: H5VM.c line 1500 in H5VM_opvv(): can't perform operation
    major: Internal error (too specific to document in detail)
    minor: Can't operate on object
  #008: H5Dcontig.c line 1014 in H5D__contig_writevv_sieve_cb(): block write failed
    major: Dataset
    minor: Write failed
  #009: H5Fio.c line 195 in H5F_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #010: H5PB.c line 1041 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #011: H5Faccum.c line 834 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #012: H5FDint.c line 308 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #013: H5FDsec2.c line 810 in H5FD_sec2_write(): file write failed: time = Fri Mar 16 03:17:20 2018 , filename = './L35_N2650/snapshot_050.24.hdf5', file descriptor = 24, errno = 122, error message = 'Disk quota exceeded', buf = 0x2b07c54a3010, total write size = 31457280, bytes this sub-write = 31457280, bytes actually written = 184467440737052
    major: Low-level I/O
    minor: Write failed
cheers,
robin
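The only frame in the HDF5 stack above that carries the underlying OS error is the last one, from H5FD_sec2_write(). As a rough sketch, and assuming the message layout is exactly as quoted in this ticket, the errno and write sizes can be pulled out of job output like this (the function and field names are made up for illustration):

```python
import errno
import re

# Matches the tail of the H5FD_sec2_write() message as it appears above.
# DOTALL lets the pattern survive line wrapping in captured job output.
FD_FAIL = re.compile(
    r"errno = (?P<errno>\d+), error message = '(?P<msg>[^']*)'.*?"
    r"total write size = (?P<total>\d+), bytes this sub-write = (?P<sub>\d+)",
    re.DOTALL,
)


def parse_sec2_failure(text):
    """Extract errno and write sizes from an HDF5 sec2 write failure, or None."""
    m = FD_FAIL.search(text)
    if m is None:
        return None
    code = int(m["errno"])
    return {
        "errno": code,
        # Symbolic name is platform dependent; on Linux 122 is EDQUOT.
        "symbol": errno.errorcode.get(code, "unknown"),
        "message": m["msg"],
        "total_write_size": int(m["total"]),
        "sub_write_size": int(m["sub"]),
    }


frame = (
    "#013: H5FDsec2.c line 810 in H5FD_sec2_write(): file write failed: "
    "time = Fri Mar 16 03:17:20 2018 , filename = './L35_N2650/snapshot_050.24.hdf5', "
    "file descriptor = 24, errno = 122, error message = 'Disk quota exceeded', "
    "buf = 0x2b07c54a3010, total write size = 31457280, "
    "bytes this sub-write = 31457280, bytes actually written = 184467440737052"
)
failure = parse_sec2_failure(frame)
print(failure)
```

Counting how many of the ~1100 failures in the job output carry errno 122 would be one way to check whether every failed write there was an over-quota event.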
Hi Nathaniel,
all messages for the lustre servers and from the 7 clients affected so far for all of 2018 have been uploaded.
grep for CHECKSUM
if you'd like console logs from these machines too then please let us know.
cheers,
robin
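Beyond a plain grep for CHECKSUM, it can help to group the hits. The sketch below is only an assumption about what is useful here: it counts server-side (168-f) BAD WRITE CHECKSUM lines per OST and collects the distinct server checksums seen for each retried extent. The regex is written against the message text quoted in this ticket, and the names are illustrative.

```python
import re
from collections import Counter, defaultdict

# Server-side message only; client-side 132-0 lines are skipped on purpose
# so each failed bulk write is counted once.
SERVER_MSG = re.compile(
    r"(?P<host>\S+) kernel: LustreError: 168-f: (?P<target>\S+): "
    r"BAD WRITE CHECKSUM: from (?P<peer>\S+) .*?"
    r"extent \[(?P<extent>\d+-\d+)\]: "
    r"client csum (?P<client>[0-9a-f]+), server csum (?P<server>[0-9a-f]+)"
)


def summarize(lines):
    """Return (errors per target, server checksums per (target, extent, client csum))."""
    per_target = Counter()
    server_csums = defaultdict(set)
    for line in lines:
        m = SERVER_MSG.search(line)
        if m is None:
            continue
        per_target[m["target"]] += 1
        server_csums[(m["target"], m["extent"], m["client"])].add(m["server"])
    return per_target, server_csums


lines = [
    "Feb 19 23:45:12 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 54e81a18",
    "Feb 19 23:45:13 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum df7fda9d",
    "Feb 19 23:45:12 arkle3 kernel: LNet: Using FMR for registration",
    "/var/log/messages-20180223.gz:Feb 22 17:13:04 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum 8dd68c18",
]
per_target, server_csums = summarize(lines)
for target, count in sorted(per_target.items()):
    print(target, count)
```

An extent whose client checksum stays fixed while the server checksum changes on each retry, as in the dagg-OST0005 lines, shows up here as a key with more than one entry in its set.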
Below are instructions for uploading logs to our write-only ftp site:
Sometimes the diagnostic data collected as part of Lustre troubleshooting is too large to be attached to a JIRA ticket. For these cases, HPDD provides an anonymous write-only FTP upload service. In order to use this service, you'll need an FTP client (e.g. ncftp, ftp, etc.) and a JIRA issue. Use the 'uploads' directory and create a new subdirectory using your Jira issue as a name.
In the following example, there are three debug logs in a single directory and the JIRA issue LU-4242 has been created. After completing the upload, please update the relevant issue with a note mentioning the upload, so that our engineers know where to find your logs.
$ ls -lh
total 333M
-rw-r--r-- 1 mjmac mjmac 98M Feb 23 17:36 mds-debug
-rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-00-debug
-rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-01-debug
$ ncftp ftp.hpdd.intel.com
NcFTP 3.2.2 (Sep 04, 2008) by Mike Gleason (http://www.NcFTP.com/contact/).
Connecting to 99.96.190.235...
(vsFTPd 2.2.2)
Logging in...
Login successful.
Logged in to ftp.hpdd.intel.com.
ncftp / > cd uploads
Directory successfully changed.
ncftp /uploads > mkdir LU-4242
ncftp /uploads > cd LU-4242
Directory successfully changed.
ncftp /uploads/LU-4242 > put *
mds-debug: 97.66 MB 11.22 MB/s
oss-00-debug: 117.19 MB 11.16 MB/s
oss-01-debug: 117.48 MB 11.18 MB/s
ncftp /uploads/LU-4242 >
Please note that this is a WRITE-ONLY FTP service, so you will not be able to see (with ls) the files or directories you've created, nor will you (or anyone other than HPDD staff) be able to see or read them.
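For anyone scripting the upload rather than using ncftp, the same steps (anonymous login, cd uploads, mkdir for the issue, put the files) could look roughly like the sketch below using Python's standard ftplib. The helper names are hypothetical, the host is passed in by the caller, and nothing here has been checked against the actual service.

```python
import re
from ftplib import FTP
from pathlib import Path

ISSUE_KEY = re.compile(r"[A-Z]+-\d+")  # e.g. LU-4242


def remote_dir(issue):
    """Return the upload directory for a Jira issue key such as LU-4242."""
    if not ISSUE_KEY.fullmatch(issue):
        raise ValueError(f"not a Jira issue key: {issue!r}")
    return f"uploads/{issue}"


def upload_logs(host, issue, paths, timeout=60):
    """Upload each file in paths to uploads/<issue> over anonymous FTP."""
    target = remote_dir(issue)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()            # anonymous login
        ftp.cwd("uploads")
        ftp.mkd(issue)         # write-only site: the new directory cannot be listed
        ftp.cwd(issue)
        for path in map(Path, paths):
            with path.open("rb") as fh:
                ftp.storbinary(f"STOR {path.name}", fh)
    return target


# Example call (not run here, it needs network access):
# upload_logs("ftp.hpdd.intel.com", "LU-4242",
#             ["mds-debug", "oss-00-debug", "oss-01-debug"])
```

Because the service is write-only, the script cannot verify the upload by listing the directory, so it is still worth adding a note to the issue afterwards.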
Hi Nathaniel,
hmm, I just grep'd a bit more and this is worrying. now there are read and write checksum errors, and also to our small /home OSTs (OSS's are umlaut1,2, which have zfs recordsize 1M and compression on.)
we have a lot of logspam at the moment from various things so I missed these 'til now
can I give you (via email or something) a URL to download complete logs from? I don't want to put them here 'cos they have usernames etc. in them.
all lustre servers are now running zfs 0.7.6, and all servers and clients are still on lustre 2.10.3.
clients on kernel 3.10.0-693.21.1.el7.x86_64, servers on 3.10.0-693.17.1.el7.x86_64
/var/log/messages-20180220.gz:Feb 19 23:45:12 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 54e81a18
/var/log/messages-20180220.gz:Feb 19 23:45:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 54e81a18 (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:45:13 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum df7fda9d
/var/log/messages-20180220.gz:Feb 19 23:45:13 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum df7fda9d (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:45:15 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 87da008a
/var/log/messages-20180220.gz:Feb 19 23:45:15 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 87da008a (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:45:18 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 1cc7a793
/var/log/messages-20180220.gz:Feb 19 23:45:18 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 1cc7a793 (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:45:22 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum a7de2e17
/var/log/messages-20180220.gz:Feb 19 23:45:23 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum a7de2e17 (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:45:35 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum a7de2e17
/var/log/messages-20180220.gz:Feb 19 23:45:35 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum a7de2e17 (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:45:53 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum a7de2e17
/var/log/messages-20180220.gz:Feb 19 23:45:54 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum a7de2e17 (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:46:31 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [358060376-362254335]: client csum 164c6d3b, server csum c5bdd26c
/var/log/messages-20180220.gz:Feb 19 23:46:32 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [358060376-362254335], original client csum 164c6d3b (type 4), server csum c5bdd26c (type 4), client csum now 164c6d3b
/var/log/messages-20180220.gz:Feb 19 23:47:38 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [389517656-393711615]: client csum f0550656, server csum ea7a06d9
/var/log/messages-20180220.gz:Feb 19 23:47:38 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [389517656-393711615], original client csum f0550656 (type 4), server csum ea7a06d9 (type 4), client csum now f0550656
/var/log/messages-20180220.gz:Feb 19 23:49:51 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [447804664-451997695]: client csum f6a340ce, server csum 62064124
/var/log/messages-20180220.gz:Feb 19 23:49:53 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [447804664-451997695], original client csum f6a340ce (type 4), server csum 62064124 (type 4), client csum now f6a340ce
/var/log/messages-20180220.gz:Feb 19 23:54:11 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [599407780-603598847]: client csum 2a11218d, server csum 5539349b
/var/log/messages-20180220.gz:Feb 19 23:54:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [599407780-603598847], original client csum 2a11218d (type 4), server csum 5539349b (type 4), client csum now 2a11218d
/var/log/messages-20180223.gz:Feb 22 17:13:04 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum 8dd68c18
/var/log/messages-20180223.gz:Feb 22 17:13:04 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum 8dd68c18 (type 4), client csum now c2abd616
/var/log/messages-20180223.gz:Feb 22 17:13:05 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum 8dd68c18
/var/log/messages-20180223.gz:Feb 22 17:13:05 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum 8dd68c18 (type 4), client csum now c2abd616
/var/log/messages-20180223.gz:Feb 22 17:13:07 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum c1141035
/var/log/messages-20180223.gz:Feb 22 17:13:07 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum c1141035 (type 4), client csum now c2abd616
/var/log/messages-20180223.gz:Feb 22 17:13:10 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum 3e076abf
/var/log/messages-20180223.gz:Feb 22 17:13:11 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum 3e076abf (type 4), client csum now c2abd616
/var/log/messages-20180223.gz:Feb 22 17:13:15 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum 556fb060
/var/log/messages-20180223.gz:Feb 22 17:13:15 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum 556fb060 (type 4), client csum now c2abd616
/var/log/messages-20180223.gz:Feb 22 17:13:26 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum 8dd68c18
/var/log/messages-20180223.gz:Feb 22 17:13:26 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum 8dd68c18 (type 4), client csum now c2abd616
/var/log/messages-20180223.gz:Feb 22 17:13:44 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum d623d4d5
/var/log/messages-20180223.gz:Feb 22 17:13:44 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum d623d4d5 (type 4), client csum now c2abd616
/var/log/messages-20180308.gz:Mar 7 15:03:06 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x44:0x0] object 0x0:11031820 extent [516-4611]: client csum 9977a425, server csum 5747b5ea
/var/log/messages-20180308.gz:Mar 7 15:03:06 john72 kernel: LustreError: 132-0: home-OST0001-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004296:0x44:0x0] object 0x0:11031820 extent [516-4611], original client csum 9977a425 (type 4), server csum 5747b5ea (type 4), client csum now 9977a425
/var/log/messages-20180308.gz:Mar 7 15:03:07 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x44:0x0] object 0x0:11031820 extent [4620-8715]: client csum 5ef5f7db, server csum bf44f4ab
/var/log/messages-20180308.gz:Mar 7 15:03:07 john72 kernel: LustreError: 132-0: home-OST0001-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004296:0x44:0x0] object 0x0:11031820 extent [4620-8715], original client csum 5ef5f7db (type 4), server csum bf44f4ab (type 4), client csum now 5ef5f7db
/var/log/messages-20180308.gz:Mar 7 15:03:16 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x45:0x0] object 0x0:11014864 extent [516-4611]: client csum 1dcd24ad, server csum 3a01165
/var/log/messages-20180308.gz:Mar 7 15:03:16 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x45:0x0] object 0x0:11014864 extent [516-4611], original client csum 1dcd24ad (type 4), server csum 3a01165 (type 4), client csum now 1dcd24ad
/var/log/messages-20180308.gz:Mar 7 15:03:17 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x45:0x0] object 0x0:11014864 extent [4620-8715]: client csum 27a49363, server csum 799787cb
/var/log/messages-20180308.gz:Mar 7 15:04:07 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x46:0x0] object 0x0:11014870 extent [516-4611]: client csum 8a6d82ff, server csum c06e206e
/var/log/messages-20180308.gz:Mar 7 15:04:07 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x46:0x0] object 0x0:11014870 extent [516-4611], original client csum 8a6d82ff (type 4), server csum c06e206e (type 4), client csum now 8a6d82ff
/var/log/messages-20180308.gz:Mar 7 15:04:50 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x47:0x0] object 0x0:11014876 extent [516-4611]: client csum e76922f6, server csum ad6a8067
/var/log/messages-20180308.gz:Mar 7 15:04:50 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x47:0x0] object 0x0:11014876 extent [516-4611], original client csum e76922f6 (type 4), server csum ad6a8067 (type 4), client csum now e76922f6
/var/log/messages-20180308.gz:Mar 7 15:05:14 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x48:0x0] object 0x0:11031833 extent [516-4611]: client csum c606d023, server csum e53f9b17
/var/log/messages-20180308.gz:Mar 7 15:05:14 john72 kernel: LustreError: 132-0: home-OST0001-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004296:0x48:0x0] object 0x0:11031833 extent [516-4611], original client csum c606d023 (type 4), server csum e53f9b17 (type 4), client csum now c606d023
/var/log/messages-20180308.gz:Mar 7 15:05:24 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x49:0x0] object 0x0:11014877 extent [516-4611]: client csum d80dfda8, server csum 89d05138
/var/log/messages-20180308.gz:Mar 7 15:07:44 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x4a:0x0] object 0x0:11014894 extent [516-4611]: client csum 60a143e6, server csum 2aa2e177
/var/log/messages-20180308.gz:Mar 7 15:07:44 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x4a:0x0] object 0x0:11014894 extent [516-4611], original client csum 60a143e6 (type 4), server csum 2aa2e177 (type 4), client csum now 60a143e6
/var/log/messages-20180308.gz:Mar 7 15:15:05 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x4b:0x0] object 0x0:11014906 extent [516-4611]: client csum ff1728a9, server csum 8ba1a1e0
/var/log/messages-20180308.gz:Mar 7 15:15:05 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x4b:0x0] object 0x0:11014906 extent [516-4611], original client csum ff1728a9 (type 4), server csum 8ba1a1e0 (type 4), client csum now ff1728a9
/var/log/messages-20180308.gz:Mar 7 15:15:16 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x4c:0x0] object 0x0:11031863 extent [516-4611]: client csum cca2b223, server csum 3a7a9a4e
/var/log/messages-20180308.gz:Mar 7 15:15:23 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x4e:0x0] object 0x0:11031864 extent [516-4611]: client csum ee48cab1, server csum 6f69a9ca
/var/log/messages-20180308.gz:Mar 7 15:15:43 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x50:0x0] object 0x0:11031865 extent [516-4611]: client csum c8bdaad3, server csum 82be0842
/var/log/messages-20180308.gz:Mar 7 15:15:46 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x51:0x0] object 0x0:11014909 extent [516-4611]: client csum 57497f3, server csum 4f773562
/var/log/messages-20180308.gz:Mar 7 15:16:21 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x52:0x0] object 0x0:11031866 extent [516-4611]: client csum 179adff1, server csum d885540d
/var/log/messages-20180308.gz:Mar 7 15:16:21 john72 kernel: LustreError: 132-0: home-OST0001-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004296:0x52:0x0] object 0x0:11031866 extent [516-4611], original client csum 179adff1 (type 4), server csum d885540d (type 4), client csum now 179adff1
/var/log/messages-20180308.gz:Mar 7 15:19:50 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x59:0x0] object 0x0:11014910 extent [516-4611]: client csum 98a90fb3, server csum d2aaad22
/var/log/messages-20180308.gz:Mar 7 15:19:50 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x59:0x0] object 0x0:11014910 extent [516-4611], original client csum 98a90fb3 (type 4), server csum d2aaad22 (type 4), client csum now 98a90fb3
/var/log/messages-20180308.gz:Mar 7 15:20:08 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x5d:0x0] object 0x0:11031867 extent [516-4611]: client csum 34fda7cb, server csum e078ee98
/var/log/messages-20180308.gz:Mar 7 15:21:15 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x64:0x0] object 0x0:11031870 extent [516-4611]: client csum 24b13698, server csum 6eb29409
/var/log/messages-20180308.gz:Mar 7 15:22:56 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x67:0x0] object 0x0:11014915 extent [516-4611]: client csum fcb4e51c, server csum ffb2a769
/var/log/messages-20180308.gz:Mar 7 15:23:50 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x6a:0x0] object 0x0:11031873 extent [516-4611]: client csum a1a66091, server csum 320a1aef
/var/log/messages-20180308.gz:Mar 7 15:25:07 john72 kernel: LustreError: 132-0: home-OST0001-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004296:0x6e:0x0] object 0x0:11031875 extent [516-4611], original client csum 4a4e93eb (type 4), server csum fe6a3a5f (type 4), client csum now 4a4e93eb
/var/log/messages-20180308.gz:Mar 7 15:27:16 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x75:0x0] object 0x0:11014922 extent [516-4611]: client csum 629c147f, server csum 289fb6ee
/var/log/messages-20180308.gz:Mar 7 15:30:39 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x88:0x0] object 0x0:11031881 extent [516-4611]: client csum b0252143, server csum fa2683d2
/var/log/messages-20180308.gz:Mar 7 15:51:08 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x89:0x0] object 0x0:11014925 extent [516-4611]: client csum f7b9952b, server csum 539602f8
/var/log/messages-20180308.gz:Mar 7 15:51:08 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x89:0x0] object 0x0:11014925 extent [516-4611], original client csum f7b9952b (type
4), server csum 539602f8 (type 4), client csum now f7b9952b /var/log/messages-20180308.gz:Mar 7 17:44:25 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.173@o2ib44 inode [0x200004a85:0x1d:0x0] object 0x0:11015246 extent [516-4611]: client csum c6b24442, server csum bb49caf9 /var/log/messages-20180308.gz:Mar 7 17:44:25 john73 kernel: LustreError: 132-0: home-OST0000-osc-ffff882fd67d1800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004a85:0x1d:0x0] object 0x0:11015246 extent [516-4611], original client csum c6b24442 (type 4), server csum bb49caf9 (type 4), client csum now c6b24442 /var/log/messages-20180308.gz:Mar 7 17:44:27 john73 kernel: LustreError: 132-0: home-OST0000-osc-ffff882fd67d1800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004a85:0x1d:0x0] object 0x0:11015246 extent [4620-8715], original client csum bfce872c (type 4), server csum 850e6171 (type 4), client csum now bfce872c /var/log/messages-20180308.gz:Mar 7 18:00:30 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.173@o2ib44 inode [0x200004a85:0x20:0x0] object 0x0:11032280 extent [516-4611]: client csum 941ab186, server csum 43900b50 /var/log/messages-20180308.gz:Mar 7 18:00:30 john73 kernel: LustreError: 132-0: home-OST0001-osc-ffff882fd67d1800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004a85:0x20:0x0] object 0x0:11032280 extent [516-4611], original client csum 941ab186 (type 4), server csum 43900b50 (type 4), client csum now 941ab186 /var/log/messages-20180308.gz:Mar 7 18:00:31 john73 kernel: LustreError: 132-0: home-OST0001-osc-ffff882fd67d1800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004a85:0x20:0x0] object 0x0:11032280 extent [4620-8715], original client csum 4da5042f (type 4), server 
csum 1a0af759 (type 4), client csum now 4da5042f /var/log/messages-20180308.gz:Mar 7 18:00:42 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.173@o2ib44 inode [0x200004a85:0x21:0x0] object 0x0:11015326 extent [516-4611]: client csum 51cfa889, server csum 308c8cb3 /var/log/messages-20180308.gz:Mar 7 18:00:42 john73 kernel: LustreError: 132-0: home-OST0000-osc-ffff882fd67d1800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004a85:0x21:0x0] object 0x0:11015326 extent [516-4611], original client csum 51cfa889 (type 4), server csum 308c8cb3 (type 4), client csum now 51cfa889 /var/log/messages-20180308.gz:Mar 7 18:00:59 john73 kernel: LustreError: 132-0: home-OST0001-osc-ffff882fd67d1800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004a85:0x22:0x0] object 0x0:11032283 extent [29924-72851], original client csum 8d6ddb8c (type 4), server csum c3ce8506 (type 4), client csum now 8d6ddb8c /var/log/messages-20180314.gz:Mar 13 15:23:06 john57 kernel: LustreError: 133-1: dagg-OST0004-osc-ffff881899dd5800: BAD READ CHECKSUM: from 192.168.44.33@o2ib44 inode [0x28001c1e4:0x128:0x0] object 0x540000400:8139196 extent [0-4095], client 9187de52, server 3ea5bb6b, cksum_type 4 /var/log/messages-20180314.gz:Mar 13 15:23:06 john56 kernel: LustreError: 133-1: dagg-OST0003-osc-ffff8817c5166000: BAD READ CHECKSUM: from 192.168.44.32@o2ib44 inode [0x28001c1c2:0x16c:0x0] object 0x500000400:8116043 extent [0-4095], client 37af0280, server a79ccd9f, cksum_type 4 /var/log/messages-20180314.gz:Mar 13 15:23:06 john63 kernel: LustreError: 133-1: dagg-OST000c-osc-ffff88189a274000: BAD READ CHECKSUM: from 192.168.44.37@o2ib44 inode [0x28001c1c4:0x172:0x0] object 0x400000400:8173179 extent [0-4095], client 8cdf64cc, server 4df3e0bb, cksum_type 4 /var/log/messages-20180314.gz:Mar 13 15:23:07 arkle7 kernel: LustreError: 132-0: dagg-OST000c: BAD READ 
CHECKSUM: should have changed on the client or in transit: from 192.168.44.163@o2ib44 inode [0x28001c1c4:0x172:0x0] object 0x400000400:8173179 extent [0-4095], client returned csum 8cdf64cc (type 4), server csum 4df3e0bb (type 4) /var/log/messages-20180314.gz:Mar 13 15:23:07 arkle3 kernel: LustreError: 132-0: dagg-OST0004: BAD READ CHECKSUM: should have changed on the client or in transit: from 192.168.44.157@o2ib44 inode [0x28001c1e4:0x128:0x0] object 0x540000400:8139196 extent [0-4095], client returned csum 9187de52 (type 4), server csum 3ea5bb6b (type 4) /var/log/messages-20180314.gz:Mar 13 15:23:07 arkle2 kernel: LustreError: 132-0: dagg-OST0003: BAD READ CHECKSUM: should have changed on the client or in transit: from 192.168.44.156@o2ib44 inode [0x28001c1c2:0x16c:0x0] object 0x500000400:8116043 extent [0-4095], client returned csum 37af0280 (type 4), server csum a79ccd9f (type 4)
cheers,
robin
rjh,
I'm trying to figure out if it's just that file, or if other files are affected. So the messages files covering a couple of hours, or even days, would be useful.
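One way to check whether other files are affected is to tally the checksum errors per FID across the messages files. The following is only a rough sketch: the regular expression is an assumption based on the message format seen in this ticket, `tally_fids` is a hypothetical helper name, and the sample lines are copied from the logs above.

```python
import re
from collections import Counter

# Captures the direction (WRITE/READ) and the FID printed as "inode [seq:oid:ver]".
# The pattern is inferred from the messages in this ticket, not from Lustre source.
CSUM_RE = re.compile(
    r"BAD (WRITE|READ) CHECKSUM.*?inode \[(0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+)\]"
)


def tally_fids(lines):
    """Count checksum-error messages per (direction, FID)."""
    counts = Counter()
    for line in lines:
        # findall copes with several log entries flattened onto one line.
        for direction, fid in CSUM_RE.findall(line):
            counts[(direction, fid)] += 1
    return counts


# Three entries copied verbatim from the logs in this ticket.
SAMPLE = [
    "/var/log/messages-20180308.gz:Mar 7 15:15:05 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x4b:0x0] object 0x0:11014906 extent [516-4611]: client csum ff1728a9, server csum 8ba1a1e0",
    "/var/log/messages-20180308.gz:Mar 7 15:15:05 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x4b:0x0] object 0x0:11014906 extent [516-4611], original client csum ff1728a9 (type 4), server csum 8ba1a1e0 (type 4), client csum now ff1728a9",
    "/var/log/messages-20180308.gz:Mar 7 15:15:16 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x4c:0x0] object 0x0:11031863 extent [516-4611]: client csum cca2b223, server csum 3a7a9a4e",
]

if __name__ == "__main__":
    for (direction, fid), n in sorted(tally_fids(SAMPLE).items()):
        print(direction, fid, n)
```

Note that this counts messages, not events: the client and the server can each log the same failed RPC, so one bad write may show up twice. To run it over rotated logs, something like `gzip.open(path, "rt")` could supply the lines.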
Hi Nathaniel,
you mean more messages than in the attachment above? if you'd like a few hours before and after then I can do that... let me know. we also have conman (console) logs if you'd like, but I'm not sure there's any more info in those.
are parallel writes to zfs which exceed group quotas part of the lustre test suite y'all run things through?
looks like the file is still there...
we play some bind mount tricks to mount subdirs and fid2path wasn't happy with those, but after I mounted the root of the fs directly on a tmp mountpoint then it worked ok (+/- the // in the filename?) ->
[root@john5 ~]# lfs fid2path /tmp/dagg 0x20000c02e:0x9ef:0x0
/tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
[root@john5 ~]# ls -l /tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
-rw-rw-r-- 1 <username> oz009 1194134864 Feb 21 19:54 /tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
[root@john5 ~]# ls -lsh /tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
915M -rw-rw-r-- 1 <username> oz009 1.2G Feb 21 19:54 /tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
[root@john5 ~]# stat /tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
  File: ‘/tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5’
  Size: 1194134864 Blocks: 1872277 IO Block: 4194304 regular file
Device: ef57e2ach/4015514284d Inode: 144116013481331183 Links: 1
Access: (0664/-rw-rw-r--) Uid: (10056/ <username>) Gid: (10204/ oz009)
Access: 2018-03-11 13:32:32.000000000 +1100
Modify: 2018-02-21 19:54:15.000000000 +1100
Change: 2018-03-05 16:53:29.000000000 +1100
 Birth: -
[root@john5 ~]# lfs getstripe /tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
/tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 5
obdidx objid objid group
5 2683981 0x28f44d 0
cheers,
robin
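The getstripe output above can be cross-checked against the first BAD WRITE CHECKSUM message in the description (dagg-OST0005, object 0x0:2683981, extent [331258364-335450111]). A minimal sketch of that arithmetic follows; the helper names are hypothetical, and it assumes that with a stripe count of 1 the extent offsets in the object correspond directly to file offsets.

```python
def ost_index(target_name):
    """Return the OST index from a target name like 'dagg-OST0005'.

    The index in the name is hexadecimal (e.g. OST000c is index 12).
    """
    return int(target_name.rsplit("OST", 1)[1], 16)


def same_object(log_target, log_objid, obdidx, objid_dec, objid_hex):
    """True if the object in the log message is the one getstripe reports."""
    return (
        ost_index(log_target) == obdidx
        and log_objid == objid_dec == int(objid_hex, 16)
    )


def extent_length(extent):
    """Length in bytes of an inclusive [start-end] extent."""
    start, end = extent
    return end - start + 1


# Values copied from the ticket.
LOG_TARGET = "dagg-OST0005"              # server that logged the error
LOG_OBJID = 2683981                      # "object 0x0:2683981"
LOG_EXTENT = (331258364, 335450111)      # "extent [331258364-335450111]"
OBDIDX, OBJID_DEC, OBJID_HEX = 5, 2683981, "0x28f44d"  # lfs getstripe
FILE_SIZE = 1194134864                   # stat Size

if __name__ == "__main__":
    print("same object:", same_object(LOG_TARGET, LOG_OBJID, OBDIDX, OBJID_DEC, OBJID_HEX))
    print("extent bytes:", extent_length(LOG_EXTENT))
    # Assumption: single-stripe file, so object offsets equal file offsets.
    print("extent inside file:", LOG_EXTENT[1] < FILE_SIZE)
```

This only confirms that the logged object and the file found by fid2path line up; it says nothing about where the corruption happened.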
Do you have full messages files from both the clients and servers? In the messages file, it looks like only a single file is having this issue: FID 0x20000c02e:0x9ef:0x0
To find the file (on a client):
lfs fid2path /PATH/TO/LUSTRE 0x20000c02e:0x9ef:0x0
I asked the user to run the code again and go over quota in a similar way, and there was nothing triggered from lustre this time. so unfortunately this may be hard to reproduce.
cheers,
robin
Which hdf5 library (I assume hdf5 from the H5 prefix) do you use for this IO, if any? Does it use direct IO internally by any chance?