[LU-10683] write checksum errors Created: 20/Feb/18 Updated: 28/Nov/18 Resolved: 24/Jul/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.3 |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.10.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | SC Admin (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
centos7, skylake, opa, zfs 0.7.6, lustre 2.10.3 + https://review.whamcloud.com/#/c/29992/. kernel nopti on servers, pti on clients, 3.10.0-693.17.1.el7.x86_64 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hiya, these appeared last night. In light of the similar-looking errors I suppose it could be an OPA network glitch, but it affected 2 clients and 2 servers, so that seems unlikely. We have just moved from ZFS 0.7.5 to ZFS 0.7.6; we ran IOR and mdtest after that change and they were fine. These errors occurred a couple of days later.

Feb 19 23:45:12 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 54e81a18
Feb 19 23:45:12 john100 kernel: LNetError: 899:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600741136 length 1048576 too big: 1048176 left, 1048176 allowed
Feb 19 23:45:12 arkle3 kernel: LNet: Using FMR for registration
Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881764648600
Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881764648600
Feb 19 23:45:12 arkle3 kernel: LustreError: 337237:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff881687561450 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0
Feb 19 23:45:12 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110
Feb 19 23:45:12 arkle3 kernel: LNet: Skipped 1 previous similar message
Feb 19 23:45:12 john101 kernel: LNetError: 904:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282655776 length 1048576 too big: 1048176 left, 1048176 allowed
Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8810cb714600
Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8810cb714600
Feb 19 23:45:12 arkle6 kernel: LustreError: 42356:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff88276c3d5850 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0
Feb 19 23:45:12 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110
Feb 19 23:45:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 54e81a18 (type 4), client csum now efde5b36
Feb 19 23:45:12 john100 kernel: LustreError: 924:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880e7a9b4e00 x1591386600740944/t197580821951(197580821951) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044319 ref 2 fl Interpret:RM/0/0 rc 0/0
Feb 19 23:45:13 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum df7fda9d
Feb 19 23:45:13 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum df7fda9d (type 4), client csum now efde5b36
Feb 19 23:45:13 john100 kernel: LustreError: 911:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff8807492fe300 x1591386600747696/t197580821955(197580821955) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044320 ref 2 fl Interpret:RM/0/0 rc 0/0
Feb 19 23:45:15 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 87da008a
Feb 19 23:45:15 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 87da008a (type 4), client csum now efde5b36
Feb 19 23:45:15 john100 kernel: LustreError: 910:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880eb3d51b00 x1591386600751360/t197580821956(197580821956) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044322 ref 2 fl Interpret:RM/0/0 rc 0/0
Feb 19 23:45:18 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 1cc7a793
Feb 19 23:45:18 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 1cc7a793 (type 4), client csum now efde5b36

full log attached.

Feb 20 14:28:15 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef2c00
Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88176544be00
Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8814072e0200
Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88176544be00
Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8814072e0200
Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e25ef3800
Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef3800
Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881615f0f000
Feb 20 14:28:51 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880be3931c00
Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881615f0f000

cheers, |
| Comments |
| Comment by Robin Humble [ 20/Feb/18 ] |
|
The user running the code thinks that they went over the group block quota when the above messages occurred and the code hung. It was probably writing HDF5s. The ZFS filesystem has compression turned on. cheers, |
| Comment by Andreas Dilger [ 23/Feb/18 ] |
|
While it is possible the quota issue is related, the client should still be able to finish writing data that it had cached before the quota was exceeded. It looks more like there was a problem with the bulk RPC that prevented the write from completing properly. There are two possible ways to mitigate this and see if we can isolate the cause:
The |
| Comment by Peter Jones [ 23/Feb/18 ] |
|
Nathaniel, anything to add on this one? Peter |
| Comment by SC Admin (Inactive) [ 26/Feb/18 ] |
|
I asked the user to run the code again and go over quota in a similar way, and nothing was triggered from Lustre this time. So unfortunately this may be hard to reproduce. cheers, |
| Comment by Nathaniel Clark [ 12/Mar/18 ] |
|
Do you have full messages files from both the clients and the servers? In the messages file, it looks like only a single file is having this issue: FID 0x20000c02e:0x9ef:0x0. To find the file (on a client):

lfs fid2path /PATH/TO/LUSTRE 0x20000c02e:0x9ef:0x0 |
| Comment by Robin Humble [ 13/Mar/18 ] |
|
Hi Nathaniel, you mean more messages than the attachment above? If you'd like a few hours before and after, I can do that... let me know. We also have conman (console) logs if you'd like, but I'm not sure there's any more info in those. Are parallel writes to ZFS which exceed group quotas part of the Lustre test suite y'all run things through?

Looks like the file is still there...

[root@john5 ~]# lfs fid2path /tmp/dagg 0x20000c02e:0x9ef:0x0
/tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
[root@john5 ~]# ls -l /tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
-rw-rw-r-- 1 <username> oz009 1194134864 Feb 21 19:54 /tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
[root@john5 ~]# ls -lsh /tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
915M -rw-rw-r-- 1 <username> oz009 1.2G Feb 21 19:54 /tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
[root@john5 ~]# stat /tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
  File: ‘/tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5’
  Size: 1194134864 Blocks: 1872277 IO Block: 4194304 regular file
  Device: ef57e2ach/4015514284d Inode: 144116013481331183 Links: 1
  Access: (0664/-rw-rw-r--) Uid: (10056/ <username>) Gid: (10204/ oz009)
  Access: 2018-03-11 13:32:32.000000000 +1100
  Modify: 2018-02-21 19:54:15.000000000 +1100
  Change: 2018-03-05 16:53:29.000000000 +1100
  Birth: -
[root@john5 ~]# lfs getstripe /tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
/tmp/dagg/projects/oz009/N1024//snapshot_100.7.hdf5
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  5
        obdidx  objid   objid   group
        5       2683981 0x28f44d        0

cheers, |
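As a quick sanity check (not part of the original comment), the OST object named in the checksum errors ("object 0x0:2683981") can be confirmed to be the same object that `lfs getstripe` reports in hex for this file (objid 0x28f44d on obdidx 5):

```python
# Cross-check the decimal objid from the BAD WRITE CHECKSUM log lines
# against the hex objid column printed by `lfs getstripe`.
# Purely illustrative arithmetic; not a Lustre tool.

err_objid = 2683981            # "object 0x0:2683981" from the server error
stripe_objid_hex = "0x28f44d"  # objid column from lfs getstripe

assert int(stripe_objid_hex, 16) == err_objid
print(f"{stripe_objid_hex} is {err_objid}: the errors hit this file's stripe on OST0005")
```

This confirms all the dagg-OST0005 errors in the original report refer to this single HDF5 file's object.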
| Comment by Nathaniel Clark [ 13/Mar/18 ] |
|
rjh, I'm trying to figure out if it's just that file, or if other files are affected. So the messages files for a couple of hours, or even days, would be useful. |
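One way to answer "is it just that file?" is to extract the distinct FIDs from the BAD WRITE CHECKSUM lines and resolve each with `lfs fid2path`. A minimal sketch, assuming the log format shown in this ticket (the helper name `affected_fids` is my own, not a Lustre tool):

```python
# Extract the distinct FIDs mentioned in "BAD WRITE CHECKSUM" messages,
# e.g. from: zcat /var/log/messages-*.gz | python3 this_script.py
import re
import sys

# Matches the "inode [0x...:0x...:0x...]" part of a checksum-error line.
FID_RE = re.compile(r"BAD WRITE CHECKSUM.*?inode \[(0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+)\]")

def affected_fids(lines):
    """Return the set of FIDs seen in BAD WRITE CHECKSUM log lines."""
    return {m.group(1) for line in lines if (m := FID_RE.search(line))}

if __name__ == "__main__":
    for fid in sorted(affected_fids(sys.stdin)):
        # Each FID can then be resolved on a client with:
        #   lfs fid2path <mountpoint> <fid>
        print(fid)
```

Running this over the logs quoted below would show whether one FID or many are implicated.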
| Comment by Robin Humble [ 14/Mar/18 ] |
|
Hi Nathaniel, hmm, I just grep'd a bit more and this is worrying. Now there are read and write checksum errors, and also on our small /home OSTs (the OSSes are umlaut1,2, which have ZFS recordsize 1M and compression on). We have a lot of logspam at the moment from various things, so I missed these 'til now. Can I give you (via email or something) a URL to download complete logs from? I don't want to put them here 'cos they have usernames etc. in them. All Lustre servers are now running ZFS 0.7.6, and all servers and clients are still on Lustre 2.10.3.

/var/log/messages-20180220.gz:Feb 19 23:45:12 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 54e81a18
/var/log/messages-20180220.gz:Feb 19 23:45:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 54e81a18 (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:45:13 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum df7fda9d
/var/log/messages-20180220.gz:Feb 19 23:45:13 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum df7fda9d (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:45:15 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 87da008a
/var/log/messages-20180220.gz:Feb 19 23:45:15 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 87da008a (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:45:18 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 1cc7a793
/var/log/messages-20180220.gz:Feb 19 23:45:18 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 1cc7a793 (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:45:22 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum a7de2e17
/var/log/messages-20180220.gz:Feb 19 23:45:23 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum a7de2e17 (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:45:35 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum a7de2e17
/var/log/messages-20180220.gz:Feb 19 23:45:35 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum a7de2e17 (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:45:53 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum a7de2e17
/var/log/messages-20180220.gz:Feb 19 23:45:54 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum a7de2e17 (type 4), client csum now efde5b36
/var/log/messages-20180220.gz:Feb 19 23:46:31 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [358060376-362254335]: client csum 164c6d3b, server csum c5bdd26c
/var/log/messages-20180220.gz:Feb 19 23:46:32 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [358060376-362254335], original client csum 164c6d3b (type 4), server csum c5bdd26c (type 4), client csum now 164c6d3b
/var/log/messages-20180220.gz:Feb 19 23:47:38 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [389517656-393711615]: client csum f0550656, server csum ea7a06d9
/var/log/messages-20180220.gz:Feb 19 23:47:38 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [389517656-393711615], original client csum f0550656 (type 4), server csum ea7a06d9 (type 4), client csum now f0550656
/var/log/messages-20180220.gz:Feb 19 23:49:51 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [447804664-451997695]: client csum f6a340ce, server csum 62064124
/var/log/messages-20180220.gz:Feb 19 23:49:53 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [447804664-451997695], original client csum f6a340ce (type 4), server csum 62064124 (type 4), client csum now f6a340ce
/var/log/messages-20180220.gz:Feb 19 23:54:11 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [599407780-603598847]: client csum 2a11218d, server csum 5539349b
/var/log/messages-20180220.gz:Feb 19 23:54:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [599407780-603598847], original client csum 2a11218d (type 4), server csum 5539349b (type 4), client csum now 2a11218d
/var/log/messages-20180223.gz:Feb 22 17:13:04 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum 8dd68c18
/var/log/messages-20180223.gz:Feb 22 17:13:04 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum 8dd68c18 (type 4), client csum now c2abd616
/var/log/messages-20180223.gz:Feb 22 17:13:05 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum 8dd68c18
/var/log/messages-20180223.gz:Feb 22 17:13:05 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum 8dd68c18 (type 4), client csum now c2abd616
/var/log/messages-20180223.gz:Feb 22 17:13:07 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum c1141035
/var/log/messages-20180223.gz:Feb 22 17:13:07 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum c1141035 (type 4), client csum now c2abd616
/var/log/messages-20180223.gz:Feb 22 17:13:10 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum 3e076abf
/var/log/messages-20180223.gz:Feb 22 17:13:11 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum 3e076abf (type 4), client csum now c2abd616
/var/log/messages-20180223.gz:Feb 22 17:13:15 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum 556fb060
/var/log/messages-20180223.gz:Feb 22 17:13:15 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum 556fb060 (type 4), client csum now c2abd616
/var/log/messages-20180223.gz:Feb 22 17:13:26 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum 8dd68c18
/var/log/messages-20180223.gz:Feb 22 17:13:26 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum 8dd68c18 (type 4), client csum now c2abd616
/var/log/messages-20180223.gz:Feb 22 17:13:44 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.13@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343]: client csum c2abd616, server csum d623d4d5
/var/log/messages-20180223.gz:Feb 22 17:13:44 farnarkle1 kernel: LustreError: 132-0: home-OST0000-osc-ffff88189a352800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x2000032e9:0x8a9c:0x0] object 0x0:10896796 extent [7152-13343], original client csum c2abd616 (type 4), server csum d623d4d5 (type 4), client csum now c2abd616
/var/log/messages-20180308.gz:Mar 7 15:03:06 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x44:0x0] object 0x0:11031820 extent [516-4611]: client csum 9977a425, server csum 5747b5ea
/var/log/messages-20180308.gz:Mar 7 15:03:06 john72 kernel: LustreError: 132-0: home-OST0001-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004296:0x44:0x0] object 0x0:11031820 extent [516-4611], original client csum 9977a425 (type 4), server csum 5747b5ea (type 4), client csum now 9977a425
/var/log/messages-20180308.gz:Mar 7 15:03:07 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x44:0x0] object 0x0:11031820 extent [4620-8715]: client csum 5ef5f7db, server csum bf44f4ab
/var/log/messages-20180308.gz:Mar 7 15:03:07 john72 kernel: LustreError: 132-0: home-OST0001-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004296:0x44:0x0] object 0x0:11031820 extent [4620-8715], original client csum 5ef5f7db (type 4), server csum bf44f4ab (type 4), client csum now 5ef5f7db
/var/log/messages-20180308.gz:Mar 7 15:03:16 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x45:0x0] object 0x0:11014864 extent [516-4611]: client csum 1dcd24ad, server csum 3a01165
/var/log/messages-20180308.gz:Mar 7 15:03:16 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x45:0x0] object 0x0:11014864 extent [516-4611], original client csum 1dcd24ad (type 4), server csum 3a01165 (type 4), client csum now 1dcd24ad
/var/log/messages-20180308.gz:Mar 7 15:03:17 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x45:0x0] object 0x0:11014864 extent [4620-8715]: client csum 27a49363, server csum 799787cb
/var/log/messages-20180308.gz:Mar 7 15:04:07 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x46:0x0] object 0x0:11014870 extent [516-4611]: client csum 8a6d82ff, server csum c06e206e
/var/log/messages-20180308.gz:Mar 7 15:04:07 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x46:0x0] object 0x0:11014870 extent [516-4611], original client csum 8a6d82ff (type 4), server csum c06e206e (type 4), client csum now 8a6d82ff
/var/log/messages-20180308.gz:Mar 7 15:04:50 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x47:0x0] object 0x0:11014876 extent [516-4611]: client csum e76922f6, server csum ad6a8067
/var/log/messages-20180308.gz:Mar 7 15:04:50 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x47:0x0] object 0x0:11014876 extent [516-4611], original client csum e76922f6 (type 4), server csum ad6a8067 (type 4), client csum now e76922f6
/var/log/messages-20180308.gz:Mar 7 15:05:14 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x48:0x0] object 0x0:11031833 extent [516-4611]: client csum c606d023, server csum e53f9b17
/var/log/messages-20180308.gz:Mar 7 15:05:14 john72 kernel: LustreError: 132-0: home-OST0001-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004296:0x48:0x0] object 0x0:11031833 extent [516-4611], original client csum c606d023 (type 4), server csum e53f9b17 (type 4), client csum now c606d023
/var/log/messages-20180308.gz:Mar 7 15:05:24 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x49:0x0] object 0x0:11014877 extent [516-4611]: client csum d80dfda8, server csum 89d05138
/var/log/messages-20180308.gz:Mar 7 15:07:44 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x4a:0x0] object 0x0:11014894 extent [516-4611]: client csum 60a143e6, server csum 2aa2e177
/var/log/messages-20180308.gz:Mar 7 15:07:44 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x4a:0x0] object 0x0:11014894 extent [516-4611], original client csum 60a143e6 (type 4), server csum 2aa2e177 (type 4), client csum now 60a143e6
/var/log/messages-20180308.gz:Mar 7 15:15:05 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x4b:0x0] object 0x0:11014906 extent [516-4611]: client csum ff1728a9, server csum 8ba1a1e0
/var/log/messages-20180308.gz:Mar 7 15:15:05 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x4b:0x0] object 0x0:11014906 extent [516-4611], original client csum ff1728a9 (type 4), server csum 8ba1a1e0 (type 4), client csum now ff1728a9
/var/log/messages-20180308.gz:Mar 7 15:15:16 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x4c:0x0] object 0x0:11031863 extent [516-4611]: client csum cca2b223, server csum 3a7a9a4e
/var/log/messages-20180308.gz:Mar 7 15:15:23 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x4e:0x0] object 0x0:11031864 extent [516-4611]: client csum ee48cab1, server csum 6f69a9ca
/var/log/messages-20180308.gz:Mar 7 15:15:43 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x50:0x0] object 0x0:11031865 extent [516-4611]: client csum c8bdaad3, server csum 82be0842
/var/log/messages-20180308.gz:Mar 7 15:15:46 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x51:0x0] object 0x0:11014909 extent [516-4611]: client csum 57497f3, server csum 4f773562
/var/log/messages-20180308.gz:Mar 7 15:16:21 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x52:0x0] object 0x0:11031866 extent [516-4611]: client csum 179adff1, server csum d885540d
/var/log/messages-20180308.gz:Mar 7 15:16:21 john72 kernel: LustreError: 132-0: home-OST0001-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004296:0x52:0x0] object 0x0:11031866 extent [516-4611], original client csum 179adff1 (type 4), server csum d885540d (type 4), client csum now 179adff1
/var/log/messages-20180308.gz:Mar 7 15:19:50 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x59:0x0] object 0x0:11014910 extent [516-4611]: client csum 98a90fb3, server csum d2aaad22
/var/log/messages-20180308.gz:Mar 7 15:19:50 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x59:0x0] object 0x0:11014910 extent [516-4611], original client csum 98a90fb3 (type 4), server csum d2aaad22 (type 4), client csum now 98a90fb3
/var/log/messages-20180308.gz:Mar 7 15:20:08 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x5d:0x0] object 0x0:11031867 extent [516-4611]: client csum 34fda7cb, server csum e078ee98
/var/log/messages-20180308.gz:Mar 7 15:21:15 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x64:0x0] object 0x0:11031870 extent [516-4611]: client csum 24b13698, server csum 6eb29409
/var/log/messages-20180308.gz:Mar 7 15:22:56 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x67:0x0] object 0x0:11014915 extent [516-4611]: client csum fcb4e51c, server csum ffb2a769
/var/log/messages-20180308.gz:Mar 7 15:23:50 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x6a:0x0] object 0x0:11031873 extent [516-4611]: client csum a1a66091, server csum 320a1aef
/var/log/messages-20180308.gz:Mar 7 15:25:07 john72 kernel: LustreError: 132-0: home-OST0001-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004296:0x6e:0x0] object 0x0:11031875 extent [516-4611], original client csum 4a4e93eb (type 4), server csum fe6a3a5f (type 4), client csum now 4a4e93eb
/var/log/messages-20180308.gz:Mar 7 15:27:16 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x75:0x0] object 0x0:11014922 extent [516-4611]: client csum 629c147f, server csum 289fb6ee
/var/log/messages-20180308.gz:Mar 7 15:30:39 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x88:0x0] object 0x0:11031881 extent [516-4611]: client csum b0252143, server csum fa2683d2
/var/log/messages-20180308.gz:Mar 7 15:51:08 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.172@o2ib44 inode [0x200004296:0x89:0x0] object 0x0:11014925 extent [516-4611]: client csum f7b9952b, server csum 539602f8
/var/log/messages-20180308.gz:Mar 7 15:51:08 john72 kernel: LustreError: 132-0: home-OST0000-osc-ffff8817d9a20800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004296:0x89:0x0] object 0x0:11014925 extent [516-4611], original client csum f7b9952b (type 4), server csum 539602f8 (type 4), client csum now f7b9952b
/var/log/messages-20180308.gz:Mar 7 17:44:25 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.173@o2ib44 inode [0x200004a85:0x1d:0x0] object 0x0:11015246 extent [516-4611]: client csum c6b24442, server csum bb49caf9
/var/log/messages-20180308.gz:Mar 7 17:44:25 john73 kernel: LustreError: 132-0: home-OST0000-osc-ffff882fd67d1800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004a85:0x1d:0x0] object 0x0:11015246 extent [516-4611], original client csum c6b24442 (type 4), server csum bb49caf9 (type 4), client csum now c6b24442
/var/log/messages-20180308.gz:Mar 7 17:44:27 john73 kernel: LustreError: 132-0: home-OST0000-osc-ffff882fd67d1800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004a85:0x1d:0x0] object 0x0:11015246 extent [4620-8715], original client csum bfce872c (type 4), server csum 850e6171 (type 4), client csum now bfce872c
/var/log/messages-20180308.gz:Mar 7 18:00:30 umlaut2 kernel: LustreError: 168-f: home-OST0001: BAD WRITE CHECKSUM: from 12345-192.168.44.173@o2ib44 inode [0x200004a85:0x20:0x0] object 0x0:11032280 extent [516-4611]: client csum 941ab186, server csum 43900b50
/var/log/messages-20180308.gz:Mar 7 18:00:30 john73 kernel: LustreError: 132-0: home-OST0001-osc-ffff882fd67d1800: BAD WRITE CHECKSUM: changed in transit
before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004a85:0x20:0x0] object 0x0:11032280 extent [516-4611], original client csum 941ab186 (type 4), server csum 43900b50 (type 4), client csum now 941ab186 /var/log/messages-20180308.gz:Mar 7 18:00:31 john73 kernel: LustreError: 132-0: home-OST0001-osc-ffff882fd67d1800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004a85:0x20:0x0] object 0x0:11032280 extent [4620-8715], original client csum 4da5042f (type 4), server csum 1a0af759 (type 4), client csum now 4da5042f /var/log/messages-20180308.gz:Mar 7 18:00:42 umlaut1 kernel: LustreError: 168-f: home-OST0000: BAD WRITE CHECKSUM: from 12345-192.168.44.173@o2ib44 inode [0x200004a85:0x21:0x0] object 0x0:11015326 extent [516-4611]: client csum 51cfa889, server csum 308c8cb3 /var/log/messages-20180308.gz:Mar 7 18:00:42 john73 kernel: LustreError: 132-0: home-OST0000-osc-ffff882fd67d1800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.51@o2ib44 inode [0x200004a85:0x21:0x0] object 0x0:11015326 extent [516-4611], original client csum 51cfa889 (type 4), server csum 308c8cb3 (type 4), client csum now 51cfa889 /var/log/messages-20180308.gz:Mar 7 18:00:59 john73 kernel: LustreError: 132-0: home-OST0001-osc-ffff882fd67d1800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.52@o2ib44 inode [0x200004a85:0x22:0x0] object 0x0:11032283 extent [29924-72851], original client csum 8d6ddb8c (type 4), server csum c3ce8506 (type 4), client csum now 8d6ddb8c /var/log/messages-20180314.gz:Mar 13 15:23:06 john57 kernel: LustreError: 133-1: dagg-OST0004-osc-ffff881899dd5800: BAD READ CHECKSUM: from 192.168.44.33@o2ib44 inode [0x28001c1e4:0x128:0x0] object 0x540000400:8139196 extent [0-4095], client 9187de52, server 3ea5bb6b, cksum_type 4 /var/log/messages-20180314.gz:Mar 13 15:23:06 john56 kernel: LustreError: 133-1: dagg-OST0003-osc-ffff8817c5166000: BAD READ CHECKSUM: from 
192.168.44.32@o2ib44 inode [0x28001c1c2:0x16c:0x0] object 0x500000400:8116043 extent [0-4095], client 37af0280, server a79ccd9f, cksum_type 4 /var/log/messages-20180314.gz:Mar 13 15:23:06 john63 kernel: LustreError: 133-1: dagg-OST000c-osc-ffff88189a274000: BAD READ CHECKSUM: from 192.168.44.37@o2ib44 inode [0x28001c1c4:0x172:0x0] object 0x400000400:8173179 extent [0-4095], client 8cdf64cc, server 4df3e0bb, cksum_type 4 /var/log/messages-20180314.gz:Mar 13 15:23:07 arkle7 kernel: LustreError: 132-0: dagg-OST000c: BAD READ CHECKSUM: should have changed on the client or in transit: from 192.168.44.163@o2ib44 inode [0x28001c1c4:0x172:0x0] object 0x400000400:8173179 extent [0-4095], client returned csum 8cdf64cc (type 4), server csum 4df3e0bb (type 4) /var/log/messages-20180314.gz:Mar 13 15:23:07 arkle3 kernel: LustreError: 132-0: dagg-OST0004: BAD READ CHECKSUM: should have changed on the client or in transit: from 192.168.44.157@o2ib44 inode [0x28001c1e4:0x128:0x0] object 0x540000400:8139196 extent [0-4095], client returned csum 9187de52 (type 4), server csum 3ea5bb6b (type 4) /var/log/messages-20180314.gz:Mar 13 15:23:07 arkle2 kernel: LustreError: 132-0: dagg-OST0003: BAD READ CHECKSUM: should have changed on the client or in transit: from 192.168.44.156@o2ib44 inode [0x28001c1c2:0x16c:0x0] object 0x500000400:8116043 extent [0-4095], client returned csum 37af0280 (type 4), server csum a79ccd9f (type 4) cheers, |
| Comment by Nathaniel Clark [ 14/Mar/18 ] |
|
Below are instructions for uploading logs to our write-only ftp site:
Sometimes the diagnostic data collected as part of Lustre troubleshooting is too large to be attached to a JIRA ticket. For these cases, HPDD provides an anonymous write-only FTP upload service. In order to use this service, you'll need an FTP client (e.g. ncftp, ftp, etc.) and a JIRA issue. Use the 'uploads' directory and create a new subdirectory using your Jira issue as a name. In the following example, there are three debug logs in a single directory and the JIRA issue is LU-4242:
$ ls -lh
total 333M
-rw-r--r-- 1 mjmac mjmac 98M Feb 23 17:36 mds-debug
-rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-00-debug
-rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-01-debug
$ ncftp ftp.hpdd.intel.com
NcFTP 3.2.2 (Sep 04, 2008) by Mike Gleason (http://www.NcFTP.com/contact/).
Connecting to 99.96.190.235...
(vsFTPd 2.2.2)
Logging in...
Login successful.
Logged in to ftp.hpdd.intel.com.
ncftp / > cd uploads
Directory successfully changed.
ncftp /uploads > mkdir LU-4242
ncftp /uploads > cd LU-4242
Directory successfully changed.
ncftp /uploads/LU-4242 > put *
mds-debug: 97.66 MB 11.22 MB/s
oss-00-debug: 117.19 MB 11.16 MB/s
oss-01-debug: 117.48 MB 11.18 MB/s
ncftp /uploads/LU-4242 >
Please note that this is a WRITE-ONLY FTP service, so you will not be able to see (with ls) the files or directories you've created, nor will you (or anyone other than HPDD staff) be able to see or read them. |
| Comment by Robin Humble [ 15/Mar/18 ] |
|
Hi Nathaniel, all messages for the lustre servers and from the 7 clients affected so far for all of 2018 have been uploaded. if you'd like console logs from these machines too then please let us know. cheers, |
| Comment by Robin Humble [ 16/Mar/18 ] |
|
a bunch more CHECKSUM and LNet errors today. this lot were again definitely associated with over quota. I don't know if all the incidents are though... I guess I'd be very worried if these weren't purely over quota events, which is why the read checksum messages were very alarming. any thoughts on whether these are just from quota events or not? is there any way we can easily tell that? I'll attach messages for today's errors to this ticket in a min.

the user has many (~1100 so far) of the below in their job output. looks like they have ~127 write processes across their 28 nodes and 896 cores. the code is looping on the nodes trying to complete the writes.

HDF5-DIAG: Error detected in HDF5 (1.10.1) MPI-process 168:
#000: H5Dio.c line 269 in H5Dwrite(): can't prepare for writing data
major: Dataset
minor: Write failed
#001: H5Dio.c line 345 in H5D__pre_write(): can't write data
major: Dataset
minor: Write failed
#002: H5Dio.c line 791 in H5D__write(): can't write data
major: Dataset
minor: Write failed
#003: H5Dcontig.c line 642 in H5D__contig_write(): contiguous write failed
major: Dataset
minor: Write failed
#004: H5Dselect.c line 309 in H5D__select_write(): write error
major: Dataspace
minor: Write failed
#005: H5Dselect.c line 220 in H5D__select_io(): write error
major: Dataspace
minor: Write failed
#006: H5Dcontig.c line 1267 in H5D__contig_writevv(): can't perform vectorized sieve buffer write
major: Dataset
minor: Can't operate on object
#007: H5VM.c line 1500 in H5VM_opvv(): can't perform operation
major: Internal error (too specific to document in detail)
minor: Can't operate on object
#008: H5Dcontig.c line 1014 in H5D__contig_writevv_sieve_cb(): block write failed
major: Dataset
minor: Write failed
#009: H5Fio.c line 195 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#010: H5PB.c line 1041 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#011: H5Faccum.c line 834 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#012: H5FDint.c line 308 in H5FD_write(): driver write request failed
major: Virtual File Layer
minor: Write failed
#013: H5FDsec2.c line 810 in H5FD_sec2_write(): file write failed: time = Fri Mar 16 03:17:20 2018
, filename = './L35_N2650/snapshot_050.24.hdf5', file descriptor = 24, errno = 122, error message = 'Disk quota exceeded', buf = 0x2b07c54a3010, total write size = 31457280, bytes this sub-write = 31457280, bytes actually written = 184467440737052
major: Low-level I/O
minor: Write failed
cheers, |
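The trailing `errno = 122` in the HDF5 trace above maps to EDQUOT on Linux, i.e. the write itself failed on quota rather than on the checksum errors. A quick sanity check (my sketch, Linux-specific, not from the ticket):

```python
import errno
import os

# On Linux, errno 122 is EDQUOT ("Disk quota exceeded"), which matches
# the error message HDF5 reports in the trace above.
assert errno.EDQUOT == 122
quota_msg = os.strerror(errno.EDQUOT)   # "Disk quota exceeded"
```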
| Comment by Peter Jones [ 26/Mar/18 ] |
|
Hongchao, can you please advise on this one? Thanks, Peter |
| Comment by Andreas Dilger [ 27/Mar/18 ] |
|
Hi Robin, did you reduce the recordsize to 1MB on the filesystem? We haven't done any testing ourselves with larger recordsize. The clients would need to be remounted to also get a smaller RPC size (they default to RPC size == max blocksize for ZFS at mount). Also, are you using TID RDMA (cap_mask=....) for your OPA connection? We've seen problems with that under load, and if yes it should be disabled. |
| Comment by Oleg Drokin [ 27/Mar/18 ] |
|
Which HDF5 library (I assume from the H5 prefix) do you use for this I/O, if any? Does it use direct I/O internally, by any chance? |
| Comment by Robin Humble [ 28/Mar/18 ] |
|
Hi Andreas,

on our big /dagg filesystem I left the zfs recordsize at 2MB. it's a significant performance loss to set it to 1M - halves the size of all the i/o's to disks. however these events also happen to the /home lustre filesystem where the zfs recordsize is 1M and always has been. those are in the logs above. oh, damn, looks like /sys/module/zfs/parameters/zfs_max_recordsize=2M for these /home filesystems though, even though nothing uses that. that must be a left over from testing.

I haven't tried to change the RPC size for anything, sorry. all the lustre filesystems in this cluster seem to be using max_pages_per_rpc=1024. is that the right number to look at? we haven't tweaked anything around that. I just assumed 4M was the default these days.

we aren't using any cap_mask= options for hfi1. on clients it is

options hfi1 sge_copy_mode=2 krcvqs=4 piothreshold=0 wss_threshold=70 max_mtu=10240 eager_buffer_size=4194304

and on lustre servers it's

options hfi1 sge_copy_mode=2 krcvqs=8 piothreshold=0 wss_threshold=70

with max_mtu=10240 being a driver default, and the default eager_buffer_size is 2097152. but AFAIK lustre's verbs doesn't use that eager_buffer stuff - it's for PSM2 comms I think. I actually don't know what most of these options do. it's just what various Intel people and docs told us was good.

Oleg - yes it's a hdf5 library. we're looking into the hdf5 code and will try to communicate with the user and see what options they used when calling it in parallel. I suspect it uses whatever MPIIO uses, but haven't looked at either of those in years.

thanks for looking into this. cheers, |
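For reference, `max_pages_per_rpc` is counted in client `PAGE_SIZE` units, so the 1024 reported here corresponds to 4 MiB bulk RPCs on these nodes. The arithmetic as a sketch (assumption: 4 KiB client pages, as on x86_64):

```python
# max_pages_per_rpc counts client PAGE_SIZE pages; with 4 KiB pages
# (assumed, standard on x86_64), 1024 pages is a 4 MiB bulk RPC.
PAGE_SIZE = 4096
max_pages_per_rpc = 1024
rpc_bytes = PAGE_SIZE * max_pages_per_rpc
assert rpc_bytes == 4 * 1024 * 1024   # 4 MiB
```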
| Comment by Amir Shehata (Inactive) [ 10/Apr/18 ] |
|
Is it possible to turn on net logging and capture logs for a short period of time when this happens: lctl set_param debug=+"net neterror"
lctl dk > log
|
| Comment by SC Admin (Inactive) [ 10/Apr/18 ] |
|
Hi Amir, the incidents are rare. the last one was 3 days ago. probably over-quota related. we've tried to reproduce them but haven't managed to with simple dd's etc. I guess I could turn on 'net neterror' debug and write a script to tail syslog and automatically run dk's when it sees the next burst of CHECKSUM's. cheers, |
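The syslog-watching idea above could be sketched roughly as follows (a hypothetical helper, not the script actually used on this cluster; it assumes syslog at /var/log/messages and `lctl` in PATH):

```python
# Sketch of a watcher that dumps the Lustre kernel debug log ("lctl dk")
# whenever a checksum error message appears in syslog.  Hypothetical
# helper: path and tooling assumptions are noted in the lead-in.
import re
import subprocess
import time

CKSUM_RE = re.compile(r'BAD (READ|WRITE) CHECKSUM')

def is_checksum_error(line):
    """True for the 168-f/132-0/133-1 style checksum messages."""
    return bool(CKSUM_RE.search(line))

def watch(path='/var/log/messages'):
    with open(path) as f:
        f.seek(0, 2)                      # start at end of file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            if is_checksum_error(line):
                out = 'dk-%d.log' % int(time.time())
                with open(out, 'w') as dk:
                    subprocess.call(['lctl', 'dk'], stdout=dk)

# watch()  # uncomment to run on a Lustre node
```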
| Comment by Amir Shehata (Inactive) [ 10/Apr/18 ] |
|
Hi Robin, logs from both client and server would be great, separated into two files respectively. thanks |
| Comment by SC Admin (Inactive) [ 12/Apr/18 ] |
|
Hi Amir, I setup a script to tail syslog and run dk on anything that hits a CHECKSUM error.
Apr 12 18:57:25 arkle5 kernel: ------------[ cut here ]------------
Apr 12 18:57:25 arkle5 kernel: WARNING: CPU: 1 PID: 127223 at kernel/softirq.c:151 __local_bh_enable_ip+0x82/0xb0
Apr 12 18:57:25 arkle5 kernel: Modules linked in: sctp_diag sctp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag ip6table_filter ip6_tables iptable_filter osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) mptctl mptbase 8021q garp mrp stp llc hfi1 sunrpc xfs dm_round_robin dcdbas intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr iTCO_wdt iTCO_vendor_support zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) mgag200 ttm dm_multipath drm_kms_helper ses syscopyarea enclosure dm_mod sysfillrect sysimgblt fb_sys_fops drm mei_me lpc_ich shpchp i2c_i801 sg mei
Apr 12 18:57:25 arkle5 kernel: ipmi_si ipmi_devintf nfit ipmi_msghandler libnvdimm acpi_power_meter tpm_crb rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm iw_cm binfmt_misc ip_tables ib_ipoib ib_cm sr_mod cdrom sd_mod crc_t10dif crct10dif_generic bonding bnx2x rdmavt ahci i2c_algo_bit libahci crct10dif_pclmul mpt3sas mdio crct10dif_common i2c_core crc32c_intel ptp raid_class ib_core libata megaraid_sas scsi_transport_sas pps_core libcrc32c [last unloaded: hfi1]
Apr 12 18:57:25 arkle5 kernel: CPU: 1 PID: 127223 Comm: hfi1_cq0 Tainted: P OE ------------ 3.10.0-693.17.1.el7.x86_64 #1
Apr 12 18:57:25 arkle5 kernel: Hardware name: Dell Inc. PowerEdge R740/06G98X, BIOS 1.3.7 02/08/2018
Apr 12 18:57:25 arkle5 kernel: Call Trace:
Apr 12 18:57:25 arkle5 kernel: [<ffffffff816a6071>] dump_stack+0x19/0x1b
Apr 12 18:57:25 arkle5 kernel: [<ffffffff810895e8>] __warn+0xd8/0x100
Apr 12 18:57:25 arkle5 kernel: [<ffffffff8108972d>] warn_slowpath_null+0x1d/0x20
Apr 12 18:57:25 arkle5 kernel: [<ffffffff81091be2>] __local_bh_enable_ip+0x82/0xb0
Apr 12 18:57:25 arkle5 kernel: [<ffffffff816ade8e>] _raw_spin_unlock_bh+0x1e/0x20
Apr 12 18:57:25 arkle5 kernel: [<ffffffffc06183b5>] cfs_trace_unlock_tcd+0x55/0x90 [libcfs]
Apr 12 18:57:25 arkle5 kernel: [<ffffffffc0623708>] libcfs_debug_vmsg2+0x6d8/0xb40 [libcfs]
Apr 12 18:57:25 arkle5 kernel: [<ffffffff810cfb6c>] ? dequeue_entity+0x11c/0x5d0
Apr 12 18:57:25 arkle5 kernel: [<ffffffff810c95d5>] ? sched_clock_cpu+0x85/0xc0
Apr 12 18:57:25 arkle5 kernel: [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
Apr 12 18:57:25 arkle5 kernel: [<ffffffffc0623bc7>] libcfs_debug_msg+0x57/0x80 [libcfs]
Apr 12 18:57:25 arkle5 kernel: [<ffffffffc069682a>] kiblnd_cq_completion+0x11a/0x160 [ko2iblnd]
Apr 12 18:57:25 arkle5 kernel: [<ffffffffc03ab4a2>] send_complete+0x32/0x50 [rdmavt]
Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2ac0>] kthread_worker_fn+0x80/0x180
Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2a40>] ? kthread_stop+0xe0/0xe0
Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b270f>] kthread+0xcf/0xe0
Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
Apr 12 18:57:25 arkle5 kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90
Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
Apr 12 18:57:25 arkle5 kernel: ---[ end trace aaf779f5b67c32db ]---
Apr 12 18:57:25 arkle7 kernel: ------------[ cut here ]------------

I also have one client that looks permanently upset now (john50 is a client, arkle3 is an OSS)
Apr 12 21:28:05 john50 kernel: LNetError: 909:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1594448751341360 length 1048576 too big: 1045288 left, 1045288 allowed
Apr 12 21:28:05 arkle3 kernel: LustreError: 296380:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880c21c90a00
Apr 12 21:28:05 arkle3 kernel: LustreError: 296380:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880c21c90a00
Apr 12 21:28:05 arkle3 kernel: LustreError: 233673:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff8817643ce450 x1594448751341360/t0(0) o4->98af46e3-9fa3-6f5b-2dcd-f89325115978@192.168.44.150@o2ib44:626/0 lens 608/448 e 0 to 0 dl 1523532491 ref 1 fl Interpret:/0/0 rc 0/0
Apr 12 21:28:05 arkle3 kernel: Lustre: dagg-OST0004: Bulk IO write error with 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44), client will retry: rc = -110
Apr 12 21:28:12 john50 kernel: Lustre: 933:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523532485/real 1523532485] req@ffff8817b5bd6900 x1594448751341360/t0(0) o4->dagg-OST0004-osc-ffff88189aa1f000@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1523532492 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Apr 12 21:28:12 john50 kernel: Lustre: 933:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Apr 12 21:28:12 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection to dagg-OST0004 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
Apr 12 21:28:12 john50 kernel: Lustre: Skipped 3 previous similar messages
Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Client 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44) reconnecting
Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Connection restored to 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44)
Apr 12 21:28:12 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44)
Apr 12 21:28:12 john50 kernel: Lustre: Skipped 2 previous similar messages
Apr 12 21:28:12 john50 kernel: LNetError: 910:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1594448751343216 length 1048576 too big: 1045288 left, 1045288 allowed
Apr 12 21:28:12 arkle3 kernel: LustreError: 296379:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88169ed25a00
Apr 12 21:28:12 arkle3 kernel: LustreError: 296379:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88169ed25a00
Apr 12 21:28:12 arkle3 kernel: LustreError: 50811:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff8816dd9f2c50 x1594448751341360/t0(0) o4->98af46e3-9fa3-6f5b-2dcd-f89325115978@192.168.44.150@o2ib44:633/0 lens 608/448 e 0 to 0 dl 1523532498 ref 1 fl Interpret:/2/0 rc 0/0
Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Bulk IO write error with 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44), client will retry: rc = -110
Apr 12 21:28:19 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection to dagg-OST0004 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete

I'll reboot the client as (IIRC) this has cleared up this kind of problem in the past
Apr 13 01:22:00 arkle3 kernel: LustreError: 296378:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880f6ea28800
Apr 13 01:22:07 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88150b4d4800
Apr 13 01:22:07 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88150b4d4800
Apr 13 01:22:15 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88150b4d2800
Apr 13 01:22:15 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88150b4d2800

debug +neterror (a default) is still enabled and the dk will still catch that. hopefully that will be enough for you. cheers, |
| Comment by SC Admin (Inactive) [ 18/Apr/18 ] |
|
the script triggered on 3 sets of checksum errors last night. the 4 fids fingered are
[root@john5 ~]# lfs fid2path /dagg 0x20001251f:0x144:0x0
/dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0500/meraxes_grids_74.hdf5
[root@john5 ~]# lfs fid2path /dagg 0x280022737:0x5b:0x0
/dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0010/meraxes_grids_72.hdf5
[root@john5 ~]# lfs fid2path /dagg 0x680024e04:0x46:0x0
/dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_67.hdf5
[root@john5 ~]# lfs fid2path /dagg 0x680024e04:0x47:0x0
/dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_68.hdf5
[root@john5 ~]# ls -l /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0500/meraxes_grids_74.hdf5 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0010/meraxes_grids_72.hdf5 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_67.hdf5 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_68.hdf5
-rw-rw-r-- 1 yqin oz025 4832124728 Apr 18 03:31 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0010/meraxes_grids_72.hdf5
-rw-rw-r-- 1 yqin oz025 4832124728 Apr 18 03:22 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_67.hdf5
-rw-rw-r-- 1 yqin oz025 4832124728 Apr 18 03:24 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_68.hdf5
-rw-rw-r-- 1 yqin oz025 4832124728 Apr 18 03:35 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0500/meraxes_grids_74.hdf5
and that group is currently well under quota
[root@john5 ~]# lfs quota -h -g oz025 /dagg/
Disk quotas for grp oz025 (gid 10227):
Filesystem used quota limit grace files quota limit grace
/dagg/ 6.342T 0k 10T - 188885 0 1000000 -
I'll check with them about the state of jobs that ran last night, and also the state of those files, and quota. cheers, |
| Comment by SC Admin (Inactive) [ 18/Apr/18 ] |
|
so far it looks like this group was NOT over quota, but was again using parallel hdf5 writes. the 4 files in question apparently look ok to them (ie. not obviously corrupted). cheers, |
| Comment by Hongchao Zhang [ 19/Apr/18 ] |
|
As per the logs, the checksum is a little strange at client side
[3111837064-3112361351]: client csum c253e960, server csum 5e6e4da8
00000020:02020000:15.0:1523986236.833762:0:10678:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3186813384-3187337671]: client csum c253e960, server csum a42a5590
00000020:02020000:1.0:1523986236.845204:0:285172:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3069890888-3070415175]: client csum c253e960, server csum fc43f2ee
00000020:02020000:17.0:1523986236.845219:0:298862:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3128614280-3129138567]: client csum c253e960, server csum 976d257c
00000020:02020000:5.0:1523986236.845231:0:287464:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3095056712-3095580999]: client csum c253e960, server csum 1747b303
00000020:02020000:7.0:1523986236.845411:0:287463:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3061502280-3062026567]: client csum c253e960, server csum 1c6d4fd9
00000020:02020000:9.0:1523986236.845578:0:134229:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3036333320-3036857607]: client csum c253e960, server csum 4b320be5
00000020:02020000:15.0:1523986236.845877:0:287086:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3027420424-3027944711]: client csum c253e960, server csum d537dae0
00000020:02020000:1.0:1523986236.846009:0:34879:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3102921032-3103445319]: client csum c253e960, server csum 50418907
the checksum of different extents of the same file (0x280022737:0x5b:0x0) is the same on the client side (all are "c253e960"). I'll look at it more deeply to find out what causes it. |
| Comment by SC Admin (Inactive) [ 04/May/18 ] |
|
from reading the lustre manual it sounds like these checksum events are re-tried after they are detected, so the users might be seeing no effect from them. cheers, |
| Comment by Andreas Dilger [ 08/May/18 ] |
|
Correct. The client will resend on checksum failures up to 10 times by default (controlled by osc.*.resend_count). That the checksum at the client is always the same implies that the data is also the same (e.g. all zero). That it is different on the server each time implies it is being changed after the client has computed the checksum (e.g. in client RAM, over the network, or in OSS RAM). If you are using mmap() on files or O_DIRECT with another thread modifying the pages it is possible to see such corruption, but Lustre can't do anything about it (short of copying the data, which is highly undesirable). |
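Andreas's point can be illustrated with any checksum: identical buffers (e.g. all zeroes) always checksum identically, so the unchanging client csum across extents suggests the client computed it over identical data. A sketch using zlib's crc32 as a stand-in (Lustre's cksum_type 4 is crc32c, not shown here):

```python
from zlib import crc32

# Two different extents whose in-memory contents are identical
# (all zeroes) produce identical checksums ...
extent_a = bytes(524288)
extent_b = bytes(524288)
assert crc32(extent_a) == crc32(extent_b)

# ... while any change to the data after the checksum was computed
# (in client RAM, on the wire, or in OSS RAM) yields a mismatch,
# which is what the BAD WRITE CHECKSUM messages report.
corrupted = b'\x01' + extent_a[1:]
assert crc32(corrupted) != crc32(extent_a)
```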
| Comment by Gerrit Updater [ 05/Jul/18 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32788 |
| Comment by Gerrit Updater [ 24/Jul/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32788/ |
| Comment by Gerrit Updater [ 30/Jul/18 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32899 |
| Comment by Gerrit Updater [ 02/Aug/18 ] |
|
John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32899/ |
| Comment by Gerrit Updater [ 27/Nov/18 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33728 |
| Comment by Gerrit Updater [ 28/Nov/18 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33741 |