Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.10.3
-
None
-
centos7, skylake, opa, zfs 0.7.6, lustre 2.10.3 + https://review.whamcloud.com/#/c/29992/. kernel nopti on servers, pti on clients, 3.10.0-693.17.1.el7.x86_64
-
3
-
9223372036854775807
Description
Hiya,
these appeared last night.
john100,101 are clients. arkle3,6 are OSS's. transom1 is running the fabric manager.
in light of the similar looking LU-9305 I thought I would create this ticket.
we run default (4M) rpcs on clients and servers. our OSTs are each 4 raidz3's in 1 zpool, and have recordsize=2M. 2 OSTs per OSS. 16 OSTs total.
I suppose it could be a OPA network glitch, but it affected 2 clients and 2 servers so that seems unlikely.
we have just moved from zfs 0.7.5 to zfs 0.7.6. we ran ior and mdtest after this change and they were fine. these errors occurred a couple of days after that.
Feb 19 23:45:12 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 54e81a18 Feb 19 23:45:12 john100 kernel: LNetError: 899:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600741136 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:12 arkle3 kernel: LNet: Using FMR for registration Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881764648600 Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881764648600 Feb 19 23:45:12 arkle3 kernel: LustreError: 337237:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff881687561450 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0 Feb 19 23:45:12 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110 Feb 19 23:45:12 arkle3 kernel: LNet: Skipped 1 previous similar message Feb 19 23:45:12 john101 kernel: LNetError: 904:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282655776 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8810cb714600 Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8810cb714600 Feb 19 23:45:12 arkle6 kernel: LustreError: 42356:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff88276c3d5850 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0 Feb 19 23:45:12 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110 Feb 19 23:45:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 54e81a18 (type 4), client csum now efde5b36 Feb 19 23:45:12 john100 kernel: LustreError: 924:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880e7a9b4e00 x1591386600740944/t197580821951(197580821951) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044319 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:45:13 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum df7fda9d Feb 19 23:45:13 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum df7fda9d (type 4), client csum now efde5b36 Feb 19 23:45:13 john100 kernel: LustreError: 911:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff8807492fe300 x1591386600747696/t197580821955(197580821955) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044320 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:45:15 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 87da008a Feb 19 23:45:15 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 87da008a (type 4), client csum now efde5b36 Feb 19 23:45:15 john100 kernel: LustreError: 910:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880eb3d51b00 x1591386600751360/t197580821956(197580821956) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044322 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:45:18 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 1cc7a793 Feb 19 23:45:18 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 1cc7a793 (type 4), client csum now efde5b36
full log attached.
there were hung and unkillable user processes on the 2 clients afterwards. a reboot of the 2 clients has cleared up the looping messages of the type shown below.
Feb 20 14:28:15 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef2c00 Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88176544be00 Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8814072e0200 Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88176544be00 Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8814072e0200 Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e25ef3800 Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef3800 Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881615f0f000 Feb 20 14:28:51 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880be3931c00 Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881615f0f000
cheers,
robin