Feb 19 23:36:37 john100 slurmd[311531]: _run_prolog: run job script took usec=233917 Feb 19 23:36:37 john100 slurmd[311531]: _run_prolog: prolog with lock for job 24986 ran for 0 seconds Feb 19 23:36:37 john100 slurmstepd[349198]: task/cgroup: /slurm/uid_10056/job_24986: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:37 john100 slurmstepd[349198]: task/cgroup: /slurm/uid_10056/job_24986/step_extern: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:37 john102 slurmd[207898]: _run_prolog: run job script took usec=275513 Feb 19 23:36:37 john102 slurmd[207898]: _run_prolog: prolog with lock for job 24986 ran for 0 seconds Feb 19 23:36:37 john104 slurmd[195983]: _run_prolog: run job script took usec=293687 Feb 19 23:36:37 john104 slurmd[195983]: _run_prolog: prolog with lock for job 24986 ran for 0 seconds Feb 19 23:36:37 john101 slurmd[19462]: _run_prolog: run job script took usec=297182 Feb 19 23:36:37 john101 slurmd[19462]: _run_prolog: prolog with lock for job 24986 ran for 0 seconds Feb 19 23:36:37 john102 slurmstepd[245530]: task/cgroup: /slurm/uid_10056/job_24986: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:37 john102 slurmstepd[245530]: task/cgroup: /slurm/uid_10056/job_24986/step_extern: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:37 john104 slurmstepd[233595]: task/cgroup: /slurm/uid_10056/job_24986: alloc=65536MB mem.limit=65536MB memsw.limit=72089MB Feb 19 23:36:37 john104 slurmstepd[233595]: task/cgroup: /slurm/uid_10056/job_24986/step_extern: alloc=65536MB mem.limit=65536MB memsw.limit=72089MB Feb 19 23:36:37 john103 slurmd[210392]: _run_prolog: run job script took usec=333523 Feb 19 23:36:37 john103 slurmd[210392]: _run_prolog: prolog with lock for job 24986 ran for 0 seconds Feb 19 23:36:37 john101 slurmstepd[57109]: task/cgroup: /slurm/uid_10056/job_24986: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:37 john101 slurmstepd[57109]: task/cgroup: 
/slurm/uid_10056/job_24986/step_extern: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:37 john103 slurmstepd[248037]: task/cgroup: /slurm/uid_10056/job_24986: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:37 john103 slurmstepd[248037]: task/cgroup: /slurm/uid_10056/job_24986/step_extern: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:38 john100 slurmd[311531]: Launching batch job 24986 for UID 10056 Feb 19 23:36:38 john100 slurmstepd[349278]: task/cgroup: /slurm/uid_10056/job_24986: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:38 john100 slurmstepd[349278]: task/cgroup: /slurm/uid_10056/job_24986/step_batch: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:38 john100 slurmd[311531]: launch task 24986.0 request from 10056.504@192.168.44.200 (port 10430) Feb 19 23:36:38 john102 slurmd[207898]: launch task 24986.0 request from 10056.504@192.168.44.200 (port 6343) Feb 19 23:36:38 john103 slurmd[210392]: launch task 24986.0 request from 10056.504@192.168.44.200 (port 15571) Feb 19 23:36:38 john104 slurmd[195983]: launch task 24986.0 request from 10056.504@192.168.44.200 (port 61598) Feb 19 23:36:38 john101 slurmd[19462]: launch task 24986.0 request from 10056.504@192.168.44.200 (port 49891) Feb 19 23:36:38 john100 slurmstepd[349794]: task/cgroup: /slurm/uid_10056/job_24986: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:38 john100 slurmstepd[349794]: task/cgroup: /slurm/uid_10056/job_24986/step_0: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:38 john104 slurmstepd[233606]: task/cgroup: /slurm/uid_10056/job_24986: alloc=65536MB mem.limit=65536MB memsw.limit=72089MB Feb 19 23:36:38 john104 slurmstepd[233606]: task/cgroup: /slurm/uid_10056/job_24986/step_0: alloc=65536MB mem.limit=65536MB memsw.limit=72089MB Feb 19 23:36:38 john102 slurmstepd[245541]: task/cgroup: /slurm/uid_10056/job_24986: alloc=131072MB mem.limit=131072MB 
memsw.limit=144179MB Feb 19 23:36:38 john102 slurmstepd[245541]: task/cgroup: /slurm/uid_10056/job_24986/step_0: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:38 john101 slurmstepd[57120]: task/cgroup: /slurm/uid_10056/job_24986: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:38 john101 slurmstepd[57120]: task/cgroup: /slurm/uid_10056/job_24986/step_0: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:38 john103 slurmstepd[248047]: task/cgroup: /slurm/uid_10056/job_24986: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:36:38 john103 slurmstepd[248047]: task/cgroup: /slurm/uid_10056/job_24986/step_0: alloc=131072MB mem.limit=131072MB memsw.limit=144179MB Feb 19 23:40:01 transom1 systemd: Started Session 8316 of user root. Feb 19 23:40:01 transom1 systemd: Starting Session 8316 of user root. Feb 19 23:40:48 transom1 fm0_sm[158971]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Scheduled sweep interval Feb 19 23:40:48 transom1 fm0_sm[158971]: PROGR[topology]: SM: topology_main: DISCOVERY CYCLE END. 
9 SWs, 136 HFIs, 136 end ports, 569 total ports, 1 SM(s), 1893 packets, 0 retries, 0.293 sec sweep Feb 19 23:45:12 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 54e81a18 Feb 19 23:45:12 john100 kernel: LNetError: 899:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600741136 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:12 arkle3 kernel: LNet: Using FMR for registration Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881764648600 Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881764648600 Feb 19 23:45:12 arkle3 kernel: LustreError: 337237:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff881687561450 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0 Feb 19 23:45:12 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110 Feb 19 23:45:12 arkle3 kernel: LNet: Skipped 1 previous similar message Feb 19 23:45:12 john101 kernel: LNetError: 904:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282655776 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8810cb714600 Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8810cb714600 Feb 19 23:45:12 arkle6 kernel: LustreError: 
42356:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff88276c3d5850 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0 Feb 19 23:45:12 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110 Feb 19 23:45:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 54e81a18 (type 4), client csum now efde5b36 Feb 19 23:45:12 john100 kernel: LustreError: 924:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880e7a9b4e00 x1591386600740944/t197580821951(197580821951) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044319 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:45:13 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum df7fda9d Feb 19 23:45:13 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum df7fda9d (type 4), client csum now efde5b36 Feb 19 23:45:13 john100 kernel: LustreError: 911:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff8807492fe300 x1591386600747696/t197580821955(197580821955) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044320 ref 2 
fl Interpret:RM/0/0 rc 0/0 Feb 19 23:45:15 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 87da008a Feb 19 23:45:15 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 87da008a (type 4), client csum now efde5b36 Feb 19 23:45:15 john100 kernel: LustreError: 910:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880eb3d51b00 x1591386600751360/t197580821956(197580821956) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044322 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:45:18 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 1cc7a793 Feb 19 23:45:18 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 1cc7a793 (type 4), client csum now efde5b36 Feb 19 23:45:18 john100 kernel: LustreError: 915:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880eb3d57500 x1591386600762336/t197580821957(197580821957) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044325 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:45:19 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow 
reply: [sent 1519044312/real 1519044312] req@ffff8817407c1800 x1591895282655776/t0(0) o4->dagg-OST000b-osc-ffff881899d0c800@192.168.44.36@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044319 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Feb 19 23:45:19 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 6 previous similar messages Feb 19 23:45:19 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection to dagg-OST000b (at 192.168.44.36@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:45:19 john101 kernel: Lustre: Skipped 4 previous similar messages Feb 19 23:45:19 arkle6 kernel: Lustre: dagg-OST000b: Client 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44) reconnecting Feb 19 23:45:19 arkle6 kernel: Lustre: dagg-OST000b: Connection restored to (at 192.168.44.201@o2ib44) Feb 19 23:45:19 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection restored to 192.168.44.36@o2ib44 (at 192.168.44.36@o2ib44) Feb 19 23:45:19 john101 kernel: Lustre: Skipped 1 previous similar message Feb 19 23:45:19 john101 kernel: LNetError: 906:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282679664 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:19 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044312/real 1519044312] req@ffff8822d040c500 x1591386600741136/t0(0) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044319 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Feb 19 23:45:19 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Feb 19 23:45:19 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection to dagg-OST0005 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete 
Feb 19 23:45:19 john100 kernel: Lustre: Skipped 5 previous similar messages Feb 19 23:45:19 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e07fe5000 Feb 19 23:45:19 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e07fe5000 Feb 19 23:45:19 arkle6 kernel: LustreError: 128420:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff882f41635050 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:180/0 lens 608/448 e 0 to 0 dl 1519044325 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:45:19 arkle3 kernel: Lustre: dagg-OST0005: Client 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) reconnecting Feb 19 23:45:19 arkle3 kernel: Lustre: dagg-OST0005: Connection restored to 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) Feb 19 23:45:19 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110 Feb 19 23:45:19 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44) Feb 19 23:45:19 john100 kernel: Lustre: Skipped 1 previous similar message Feb 19 23:45:19 john100 kernel: LNetError: 900:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600773504 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88086d3e4a00 Feb 19 23:45:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88086d3e4a00 Feb 19 23:45:19 arkle3 kernel: LustreError: 141780:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff880a86a82050 x1591386600741136/t0(0) 
o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:180/0 lens 608/448 e 0 to 0 dl 1519044325 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:45:19 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110 Feb 19 23:45:22 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum a7de2e17 Feb 19 23:45:23 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum a7de2e17 (type 4), client csum now efde5b36 Feb 19 23:45:23 john100 kernel: LustreError: 927:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff8807492fd400 x1591386600773456/t197580821958(197580821958) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044329 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:45:26 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044319/real 1519044319] req@ffff8817407c1800 x1591895282655776/t0(0) o4->dagg-OST000b-osc-ffff881899d0c800@192.168.44.36@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044326 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:45:26 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection to dagg-OST000b (at 192.168.44.36@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:45:26 arkle6 kernel: Lustre: dagg-OST000b: Client 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44) reconnecting Feb 19 23:45:26 arkle6 kernel: Lustre: dagg-OST000b: Connection 
restored to (at 192.168.44.201@o2ib44) Feb 19 23:45:26 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection restored to 192.168.44.36@o2ib44 (at 192.168.44.36@o2ib44) Feb 19 23:45:26 john101 kernel: LNetError: 907:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282680208 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:26 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88204916ac00 Feb 19 23:45:26 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88204916ac00 Feb 19 23:45:26 arkle6 kernel: LustreError: 128420:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff88276c3d0c50 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:187/0 lens 608/448 e 0 to 0 dl 1519044332 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:45:26 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044319/real 1519044319] req@ffff8822d040c500 x1591386600741136/t0(0) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044326 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:45:26 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection to dagg-OST0005 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:45:26 arkle3 kernel: Lustre: dagg-OST0005: Client 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) reconnecting Feb 19 23:45:26 arkle3 kernel: Lustre: dagg-OST0005: Connection restored to 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) Feb 19 23:45:26 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110 Feb 19 
23:45:26 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44) Feb 19 23:45:26 john100 kernel: LNetError: 899:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600773664 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:26 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88086d3e2600 Feb 19 23:45:26 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88086d3e2600 Feb 19 23:45:26 arkle3 kernel: LustreError: 141780:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff8800617f9050 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:187/0 lens 608/448 e 0 to 0 dl 1519044332 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:45:27 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110 Feb 19 23:45:33 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044326/real 1519044326] req@ffff8817407c1800 x1591895282655776/t0(0) o4->dagg-OST000b-osc-ffff881899d0c800@192.168.44.36@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044333 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:45:33 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection to dagg-OST000b (at 192.168.44.36@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:45:33 arkle6 kernel: Lustre: dagg-OST000b: Client 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44) reconnecting Feb 19 23:45:33 arkle6 kernel: Lustre: dagg-OST000b: Connection restored to (at 192.168.44.201@o2ib44) Feb 19 23:45:33 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection restored 
to 192.168.44.36@o2ib44 (at 192.168.44.36@o2ib44) Feb 19 23:45:33 john101 kernel: LNetError: 905:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282680256 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:33 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044326/real 1519044326] req@ffff8822d040c500 x1591386600741136/t0(0) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044333 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:45:33 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection to dagg-OST0005 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:45:33 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88204916e800 Feb 19 23:45:33 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88204916e800 Feb 19 23:45:33 arkle6 kernel: LustreError: 128420:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff882f3774d450 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:194/0 lens 608/448 e 0 to 0 dl 1519044339 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:45:34 arkle3 kernel: Lustre: dagg-OST0005: Client 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) reconnecting Feb 19 23:45:34 arkle3 kernel: Lustre: dagg-OST0005: Connection restored to 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) Feb 19 23:45:34 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44) Feb 19 23:45:34 john100 kernel: LNetError: 900:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600774016 length 1048576 
too big: 1048176 left, 1048176 allowed Feb 19 23:45:34 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110 Feb 19 23:45:34 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88086d3e1000 Feb 19 23:45:34 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88086d3e1000 Feb 19 23:45:34 arkle3 kernel: LustreError: 141780:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff880a86a83450 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:195/0 lens 608/448 e 0 to 0 dl 1519044340 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:45:34 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110 Feb 19 23:45:35 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum a7de2e17 Feb 19 23:45:35 arkle3 kernel: LustreError: Skipped 1 previous similar message Feb 19 23:45:35 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum a7de2e17 (type 4), client csum now efde5b36 Feb 19 23:45:35 john100 kernel: LustreError: Skipped 1 previous similar message Feb 19 23:45:35 john100 kernel: LustreError: 918:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880e7a9b6900 x1591386600773696/t197580821960(197580821960) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 
1519044341 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:45:35 john100 kernel: LustreError: 918:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 1 previous similar message Feb 19 23:45:40 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044333/real 1519044333] req@ffff8817407c1800 x1591895282655776/t0(0) o4->dagg-OST000b-osc-ffff881899d0c800@192.168.44.36@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044340 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:45:40 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection to dagg-OST000b (at 192.168.44.36@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:45:41 arkle6 kernel: Lustre: dagg-OST000b: Client 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44) reconnecting Feb 19 23:45:41 arkle6 kernel: Lustre: dagg-OST000b: Connection restored to (at 192.168.44.201@o2ib44) Feb 19 23:45:41 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection restored to 192.168.44.36@o2ib44 (at 192.168.44.36@o2ib44) Feb 19 23:45:41 john101 kernel: LNetError: 907:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282680288 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:41 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044334/real 1519044334] req@ffff8822d040c500 x1591386600741136/t0(0) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044341 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:45:41 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection to dagg-OST0005 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:45:41 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 
5, status -61, desc ffff88204916c600 Feb 19 23:45:41 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88204916c600 Feb 19 23:45:41 arkle6 kernel: LustreError: 128420:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff882f41630c50 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:202/0 lens 608/448 e 0 to 0 dl 1519044347 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:45:41 arkle3 kernel: Lustre: dagg-OST0005: Client 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) reconnecting Feb 19 23:45:41 arkle3 kernel: Lustre: dagg-OST0005: Connection restored to 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) Feb 19 23:45:41 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44) Feb 19 23:45:41 john100 kernel: LNetError: 900:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600774144 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:41 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110 Feb 19 23:45:41 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88086d3e4e00 Feb 19 23:45:41 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88086d3e4e00 Feb 19 23:45:41 arkle3 kernel: LustreError: 141780:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff880a86a87050 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:202/0 lens 608/448 e 0 to 0 dl 1519044347 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:45:41 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 
192.168.44.200@o2ib44), client will retry: rc = -110 Feb 19 23:45:48 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044341/real 1519044341] req@ffff8817407c1800 x1591895282655776/t0(0) o4->dagg-OST000b-osc-ffff881899d0c800@192.168.44.36@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044348 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:45:48 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection to dagg-OST000b (at 192.168.44.36@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:45:48 arkle6 kernel: Lustre: dagg-OST000b: Client 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44) reconnecting Feb 19 23:45:48 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection restored to 192.168.44.36@o2ib44 (at 192.168.44.36@o2ib44) Feb 19 23:45:48 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882049168400 Feb 19 23:45:48 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882049168400 Feb 19 23:45:48 transom1 fm0_sm[158971]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Scheduled sweep interval Feb 19 23:45:48 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044341/real 1519044341] req@ffff8822d040c500 x1591386600741136/t0(0) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044348 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:45:48 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection to dagg-OST0005 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:45:48 arkle3 kernel: Lustre: dagg-OST0005: Client 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 
192.168.44.200@o2ib44) reconnecting Feb 19 23:45:48 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44) Feb 19 23:45:48 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8805cb53ea00 Feb 19 23:45:48 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8805cb53ea00 Feb 19 23:45:48 transom1 fm0_sm[158971]: PROGR[topology]: SM: topology_main: DISCOVERY CYCLE END. 9 SWs, 136 HFIs, 136 end ports, 569 total ports, 1 SM(s), 1893 packets, 0 retries, 0.290 sec sweep Feb 19 23:45:53 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum a7de2e17 Feb 19 23:45:53 arkle3 kernel: LustreError: Skipped 2 previous similar messages Feb 19 23:45:54 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum a7de2e17 (type 4), client csum now efde5b36 Feb 19 23:45:54 john100 kernel: LustreError: Skipped 2 previous similar messages Feb 19 23:45:54 john100 kernel: LustreError: 923:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff8807492ff200 x1591386600774352/t197580821979(197580821979) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044360 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:45:54 john100 kernel: LustreError: 923:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 2 previous similar messages Feb 19 23:45:55 arkle6 kernel: Lustre: dagg-OST000b: Connection restored to (at 192.168.44.201@o2ib44) Feb 19 23:45:55 arkle6 
kernel: Lustre: Skipped 1 previous similar message Feb 19 23:45:55 john101 kernel: LNetError: 904:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282681024 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:55 john101 kernel: LNetError: 904:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 1 previous similar message Feb 19 23:45:55 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882049168400 Feb 19 23:45:55 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882049168400 Feb 19 23:45:55 arkle6 kernel: LustreError: 128420:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff882f36a65850 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:216/0 lens 608/448 e 0 to 0 dl 1519044361 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:45:55 arkle6 kernel: LustreError: 128420:0:(ldlm_lib.c:3242:target_bulk_io()) Skipped 1 previous similar message Feb 19 23:45:55 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110 Feb 19 23:45:55 arkle6 kernel: Lustre: Skipped 1 previous similar message Feb 19 23:45:55 arkle3 kernel: Lustre: dagg-OST0005: Connection restored to 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) Feb 19 23:45:55 arkle3 kernel: Lustre: Skipped 1 previous similar message Feb 19 23:45:55 john100 kernel: LNetError: 901:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600774704 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:45:55 john100 kernel: LNetError: 901:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 1 previous similar message Feb 19 23:45:55 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc 
ffff8827175e5a00 Feb 19 23:45:55 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8827175e5a00 Feb 19 23:45:55 arkle3 kernel: LustreError: 141780:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff881687560850 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:216/0 lens 608/448 e 0 to 0 dl 1519044361 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:45:55 arkle3 kernel: LustreError: 141780:0:(ldlm_lib.c:3242:target_bulk_io()) Skipped 1 previous similar message Feb 19 23:45:55 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110 Feb 19 23:45:55 arkle3 kernel: Lustre: Skipped 1 previous similar message Feb 19 23:46:00 john100 kernel: LustreError: 927:0:(osc_request.c:1735:brw_interpret()) dagg-OST0005-osc-ffff88015a1ad800: too many resent retries for object: 0:2683981, rc = -11. 
Feb 19 23:46:02 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044355/real 1519044355] req@ffff8817407c1800 x1591895282655776/t0(0) o4->dagg-OST000b-osc-ffff881899d0c800@192.168.44.36@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044362 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:46:02 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message Feb 19 23:46:02 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection to dagg-OST000b (at 192.168.44.36@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:46:02 john101 kernel: Lustre: Skipped 1 previous similar message Feb 19 23:46:02 arkle6 kernel: Lustre: dagg-OST000b: Client 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44) reconnecting Feb 19 23:46:02 arkle6 kernel: Lustre: Skipped 1 previous similar message Feb 19 23:46:02 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection restored to 192.168.44.36@o2ib44 (at 192.168.44.36@o2ib44) Feb 19 23:46:02 john101 kernel: Lustre: Skipped 1 previous similar message Feb 19 23:46:02 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88204916da00 Feb 19 23:46:02 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88204916da00 Feb 19 23:46:02 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044355/real 1519044355] req@ffff8822d040c500 x1591386600741136/t0(0) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044362 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:46:02 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message Feb 19 23:46:02 john100 kernel: Lustre: 
dagg-OST0005-osc-ffff88015a1ad800: Connection to dagg-OST0005 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:46:02 john100 kernel: Lustre: Skipped 1 previous similar message Feb 19 23:46:02 arkle3 kernel: Lustre: dagg-OST0005: Client 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) reconnecting Feb 19 23:46:02 arkle3 kernel: Lustre: Skipped 1 previous similar message Feb 19 23:46:02 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44) Feb 19 23:46:02 john100 kernel: Lustre: Skipped 1 previous similar message Feb 19 23:46:02 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8827175e5800 Feb 19 23:46:02 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8827175e5800 Feb 19 23:46:09 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88204916f200 Feb 19 23:46:09 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88204916f200 Feb 19 23:46:09 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8827175e3e00 Feb 19 23:46:09 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8827175e3e00 Feb 19 23:46:16 arkle6 kernel: Lustre: dagg-OST000b: Connection restored to (at 192.168.44.201@o2ib44) Feb 19 23:46:16 arkle6 kernel: Lustre: Skipped 2 previous similar messages Feb 19 23:46:16 john101 kernel: LNetError: 905:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282681136 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:46:16 john101 kernel: LNetError: 905:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 2 
previous similar messages Feb 19 23:46:16 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88204916c200 Feb 19 23:46:16 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88204916c200 Feb 19 23:46:16 arkle6 kernel: LustreError: 128420:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff882f41633450 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:237/0 lens 608/448 e 0 to 0 dl 1519044382 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:46:16 arkle6 kernel: LustreError: 128420:0:(ldlm_lib.c:3242:target_bulk_io()) Skipped 2 previous similar messages Feb 19 23:46:16 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110 Feb 19 23:46:16 arkle6 kernel: Lustre: Skipped 2 previous similar messages Feb 19 23:46:16 arkle3 kernel: Lustre: dagg-OST0005: Connection restored to 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) Feb 19 23:46:16 arkle3 kernel: Lustre: Skipped 2 previous similar messages Feb 19 23:46:16 john100 kernel: LNetError: 898:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600775376 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:46:16 john100 kernel: LNetError: 898:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 2 previous similar messages Feb 19 23:46:16 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8827175e5a00 Feb 19 23:46:16 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8827175e5a00 Feb 19 23:46:16 arkle3 kernel: LustreError: 145591:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff880a86a80c50 x1591386600741136/t0(0) 
o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:237/0 lens 608/448 e 0 to 0 dl 1519044382 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:46:16 arkle3 kernel: LustreError: 145591:0:(ldlm_lib.c:3242:target_bulk_io()) Skipped 2 previous similar messages Feb 19 23:46:16 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110 Feb 19 23:46:16 arkle3 kernel: Lustre: Skipped 2 previous similar messages Feb 19 23:46:23 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044376/real 1519044376] req@ffff8817407c1800 x1591895282655776/t0(0) o4->dagg-OST000b-osc-ffff881899d0c800@192.168.44.36@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044383 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:46:23 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Feb 19 23:46:23 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection to dagg-OST000b (at 192.168.44.36@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:46:23 john101 kernel: Lustre: Skipped 2 previous similar messages Feb 19 23:46:23 arkle6 kernel: Lustre: dagg-OST000b: Client 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44) reconnecting Feb 19 23:46:23 arkle6 kernel: Lustre: Skipped 2 previous similar messages Feb 19 23:46:23 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection restored to 192.168.44.36@o2ib44 (at 192.168.44.36@o2ib44) Feb 19 23:46:23 john101 kernel: Lustre: Skipped 2 previous similar messages Feb 19 23:46:23 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88204916c400 Feb 19 23:46:23 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88204916c400 Feb 19 23:46:23 
john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044376/real 1519044376] req@ffff8822d040c500 x1591386600741136/t0(0) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044383 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:46:23 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Feb 19 23:46:23 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection to dagg-OST0005 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:46:23 john100 kernel: Lustre: Skipped 2 previous similar messages Feb 19 23:46:23 arkle3 kernel: Lustre: dagg-OST0005: Client 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) reconnecting Feb 19 23:46:23 arkle3 kernel: Lustre: Skipped 2 previous similar messages Feb 19 23:46:23 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44) Feb 19 23:46:23 john100 kernel: Lustre: Skipped 2 previous similar messages Feb 19 23:46:23 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8815c8c7d400 Feb 19 23:46:23 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8815c8c7d400 Feb 19 23:46:30 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882af4aa8a00 Feb 19 23:46:30 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882af4aa8a00 Feb 19 23:46:30 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8815c8c79000 Feb 19 23:46:30 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc 
ffff8815c8c79000 Feb 19 23:46:31 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [358060376-362254335]: client csum 164c6d3b, server csum c5bdd26c Feb 19 23:46:31 arkle3 kernel: LustreError: Skipped 8 previous similar messages Feb 19 23:46:32 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [358060376-362254335], original client csum 164c6d3b (type 4), server csum c5bdd26c (type 4), client csum now 164c6d3b Feb 19 23:46:32 john100 kernel: LustreError: Skipped 8 previous similar messages Feb 19 23:46:32 john100 kernel: LustreError: 921:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff8807492ff200 x1591386600775568/t197580821991(197580821991) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044435 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:46:32 john100 kernel: LustreError: 921:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 7 previous similar messages Feb 19 23:46:37 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882af4aa8a00 Feb 19 23:46:37 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882af4aa8a00 Feb 19 23:46:37 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8815c8c7fe00 Feb 19 23:46:37 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8815c8c7fe00 Feb 19 23:46:44 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8813f8afa800 Feb 19 23:46:44 arkle6 kernel: LustreError: 
298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8813f8afa800 Feb 19 23:46:44 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8815c8c79c00 Feb 19 23:46:44 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8815c8c79c00 Feb 19 23:46:51 arkle6 kernel: Lustre: dagg-OST000b: Connection restored to (at 192.168.44.201@o2ib44) Feb 19 23:46:51 arkle6 kernel: Lustre: Skipped 4 previous similar messages Feb 19 23:46:51 john101 kernel: LNetError: 904:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282682192 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:46:51 john101 kernel: LNetError: 904:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 4 previous similar messages Feb 19 23:46:51 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8813f8aff600 Feb 19 23:46:51 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8813f8aff600 Feb 19 23:46:51 arkle6 kernel: LustreError: 128420:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff88276c3d3050 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:272/0 lens 608/448 e 0 to 0 dl 1519044417 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:46:51 arkle6 kernel: LustreError: 128420:0:(ldlm_lib.c:3242:target_bulk_io()) Skipped 4 previous similar messages Feb 19 23:46:51 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110 Feb 19 23:46:51 arkle6 kernel: Lustre: Skipped 4 previous similar messages Feb 19 23:46:51 arkle3 kernel: Lustre: dagg-OST0005: Connection restored to 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) Feb 19 23:46:51 arkle3 
kernel: Lustre: Skipped 4 previous similar messages Feb 19 23:46:51 john100 kernel: LNetError: 901:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600776832 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:46:51 john100 kernel: LNetError: 901:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 4 previous similar messages Feb 19 23:46:51 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8815c8c7ba00 Feb 19 23:46:51 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8815c8c7ba00 Feb 19 23:46:51 arkle3 kernel: LustreError: 145591:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff880a86a85450 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:272/0 lens 608/448 e 0 to 0 dl 1519044417 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:46:51 arkle3 kernel: LustreError: 145591:0:(ldlm_lib.c:3242:target_bulk_io()) Skipped 4 previous similar messages Feb 19 23:46:52 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110 Feb 19 23:46:52 arkle3 kernel: Lustre: Skipped 4 previous similar messages Feb 19 23:46:58 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044411/real 1519044411] req@ffff8817407c1800 x1591895282655776/t0(0) o4->dagg-OST000b-osc-ffff881899d0c800@192.168.44.36@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044418 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:46:58 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Feb 19 23:46:58 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection to dagg-OST000b (at 192.168.44.36@o2ib44) was lost; in progress operations using this service will wait 
for recovery to complete Feb 19 23:46:58 john101 kernel: Lustre: Skipped 4 previous similar messages Feb 19 23:46:58 arkle6 kernel: Lustre: dagg-OST000b: Client 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44) reconnecting Feb 19 23:46:58 arkle6 kernel: Lustre: Skipped 4 previous similar messages Feb 19 23:46:58 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection restored to 192.168.44.36@o2ib44 (at 192.168.44.36@o2ib44) Feb 19 23:46:58 john101 kernel: Lustre: Skipped 4 previous similar messages Feb 19 23:46:58 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e07fe1000 Feb 19 23:46:58 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e07fe1000 Feb 19 23:46:58 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044411/real 1519044411] req@ffff8822d040c500 x1591386600741136/t0(0) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044418 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:46:58 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Feb 19 23:46:58 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection to dagg-OST0005 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:46:58 arkle3 kernel: Lustre: dagg-OST0005: Client 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) reconnecting Feb 19 23:46:58 arkle3 kernel: Lustre: Skipped 4 previous similar messages Feb 19 23:46:59 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8815c8c7ba00 Feb 19 23:46:59 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8815c8c7ba00 Feb 19 
23:46:58 john100 kernel: Lustre: Skipped 4 previous similar messages Feb 19 23:46:59 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44) Feb 19 23:46:59 john100 kernel: Lustre: Skipped 4 previous similar messages Feb 19 23:47:01 john100 kernel: LustreError: 924:0:(osc_request.c:1735:brw_interpret()) dagg-OST0005-osc-ffff88015a1ad800: too many resent retries for object: 0:2683981, rc = -11. Feb 19 23:47:05 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8807fca6e400 Feb 19 23:47:05 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8807fca6e400 Feb 19 23:47:06 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880be960d000 Feb 19 23:47:06 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880be960d000 Feb 19 23:47:12 arkle6 kernel: LNet: Using FMR for registration Feb 19 23:47:12 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8813f8af8c00 Feb 19 23:47:12 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8813f8af8c00 Feb 19 23:47:12 arkle6 kernel: LNet: Skipped 1 previous similar message Feb 19 23:47:13 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8816060b2a00 Feb 19 23:47:13 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8816060b2a00 Feb 19 23:47:19 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882633384a00 Feb 19 23:47:19 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882633384a00 Feb 
19 23:47:20 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8805cb53da00 Feb 19 23:47:20 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8805cb53da00 Feb 19 23:47:26 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882efb34d800 Feb 19 23:47:26 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882efb34d800 Feb 19 23:47:27 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8816060b3e00 Feb 19 23:47:27 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8816060b3e00 Feb 19 23:47:33 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882efb34d800 Feb 19 23:47:33 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882efb34d800 Feb 19 23:47:34 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88128afe1400 Feb 19 23:47:34 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88128afe1400 Feb 19 23:47:38 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [389517656-393711615]: client csum f0550656, server csum ea7a06d9 Feb 19 23:47:38 arkle3 kernel: LustreError: Skipped 11 previous similar messages Feb 19 23:47:38 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [389517656-393711615], original client csum f0550656 (type 
4), server csum ea7a06d9 (type 4), client csum now f0550656 Feb 19 23:47:38 john100 kernel: LustreError: Skipped 11 previous similar messages Feb 19 23:47:38 john100 kernel: LustreError: 918:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff8807492fe600 x1591386600777984/t197580822009(197580822009) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044502 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:47:38 john100 kernel: LustreError: 918:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 10 previous similar messages Feb 19 23:47:40 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88206fa9a200 Feb 19 23:47:40 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88206fa9a200 Feb 19 23:47:41 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8804a77f7200 Feb 19 23:47:41 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8804a77f7200 Feb 19 23:47:47 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882c7b419200 Feb 19 23:47:47 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882c7b419200 Feb 19 23:47:48 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8805cb53aa00 Feb 19 23:47:48 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8805cb53aa00 Feb 19 23:47:54 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8823bb460600 Feb 19 23:47:54 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8823bb460600 Feb 19 
23:47:55 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881693e49000 Feb 19 23:47:55 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881693e49000 Feb 19 23:47:57 john100 kernel: LustreError: 924:0:(osc_request.c:1735:brw_interpret()) dagg-OST0005-osc-ffff88015a1ad800: too many resent retries for object: 0:2683981, rc = -11. Feb 19 23:48:01 arkle6 kernel: Lustre: dagg-OST000b: Connection restored to (at 192.168.44.201@o2ib44) Feb 19 23:48:01 arkle6 kernel: Lustre: Skipped 9 previous similar messages Feb 19 23:48:01 john101 kernel: LNetError: 905:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282683424 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:48:01 john101 kernel: LNetError: 905:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 9 previous similar messages Feb 19 23:48:01 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882df8095c00 Feb 19 23:48:01 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882df8095c00 Feb 19 23:48:01 arkle6 kernel: LustreError: 63277:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff88276c3d0c50 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:342/0 lens 608/448 e 0 to 0 dl 1519044487 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:48:01 arkle6 kernel: LustreError: 63277:0:(ldlm_lib.c:3242:target_bulk_io()) Skipped 9 previous similar messages Feb 19 23:48:01 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110 Feb 19 23:48:01 arkle6 kernel: Lustre: Skipped 9 previous similar messages Feb 19 23:48:02 arkle3 kernel: Lustre: dagg-OST0005: Connection restored to 
8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) Feb 19 23:48:02 arkle3 kernel: Lustre: Skipped 9 previous similar messages Feb 19 23:48:02 john100 kernel: LNetError: 901:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600778800 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:48:02 john100 kernel: LNetError: 901:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 9 previous similar messages Feb 19 23:48:02 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8804a77f1800 Feb 19 23:48:02 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8804a77f1800 Feb 19 23:48:02 arkle3 kernel: LustreError: 143351:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff881714b5b850 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:343/0 lens 608/448 e 0 to 0 dl 1519044488 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:48:02 arkle3 kernel: LustreError: 143351:0:(ldlm_lib.c:3242:target_bulk_io()) Skipped 9 previous similar messages Feb 19 23:48:02 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110 Feb 19 23:48:02 arkle3 kernel: Lustre: Skipped 9 previous similar messages Feb 19 23:48:08 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044481/real 1519044481] req@ffff8817407c1800 x1591895282655776/t0(0) o4->dagg-OST000b-osc-ffff881899d0c800@192.168.44.36@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044488 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:48:08 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 9 previous similar messages Feb 19 23:48:08 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection to dagg-OST000b 
(at 192.168.44.36@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:48:08 john101 kernel: Lustre: Skipped 9 previous similar messages Feb 19 23:48:08 arkle6 kernel: Lustre: dagg-OST000b: Client 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44) reconnecting Feb 19 23:48:08 arkle6 kernel: Lustre: Skipped 9 previous similar messages Feb 19 23:48:08 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection restored to 192.168.44.36@o2ib44 (at 192.168.44.36@o2ib44) Feb 19 23:48:08 john101 kernel: Lustre: Skipped 9 previous similar messages Feb 19 23:48:08 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882aead49200 Feb 19 23:48:08 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882aead49200 Feb 19 23:48:09 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044482/real 1519044482] req@ffff8822d040c500 x1591386600741136/t0(0) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044489 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:48:09 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 9 previous similar messages Feb 19 23:48:09 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection to dagg-OST0005 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:48:09 john100 kernel: Lustre: Skipped 9 previous similar messages Feb 19 23:48:09 arkle3 kernel: Lustre: dagg-OST0005: Client 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) reconnecting Feb 19 23:48:09 arkle3 kernel: Lustre: Skipped 9 previous similar messages Feb 19 23:48:09 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection restored to 192.168.44.33@o2ib44 
(at 192.168.44.33@o2ib44) Feb 19 23:48:09 john100 kernel: Lustre: Skipped 9 previous similar messages Feb 19 23:48:09 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8809e2feac00 Feb 19 23:48:09 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8809e2feac00 Feb 19 23:48:15 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882fd2198200 Feb 19 23:48:15 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882fd2198200 Feb 19 23:48:16 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8809e2fe9000 Feb 19 23:48:16 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8809e2fe9000 Feb 19 23:48:22 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881d97615a00 Feb 19 23:48:22 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881d97615a00 Feb 19 23:48:23 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881693e4ac00 Feb 19 23:48:23 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881693e4ac00 Feb 19 23:48:29 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8807fca6fc00 Feb 19 23:48:29 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8807fca6fc00 Feb 19 23:48:30 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8817ca0b9200 Feb 19 23:48:30 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) 
event type 3, status -61, desc ffff8817ca0b9200 Feb 19 23:48:36 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88228a5a0c00 Feb 19 23:48:36 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88228a5a0c00 Feb 19 23:48:37 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8814cc55b600 Feb 19 23:48:37 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8814cc55b600 Feb 19 23:48:43 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882d469e0c00 Feb 19 23:48:43 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882d469e0c00 Feb 19 23:48:44 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8814654d7e00 Feb 19 23:48:44 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8814654d7e00 Feb 19 23:48:50 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881e07427600 Feb 19 23:48:50 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881e07427600 Feb 19 23:48:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880b1f0f4200 Feb 19 23:48:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880b1f0f4200 Feb 19 23:48:54 john100 kernel: LustreError: 910:0:(osc_request.c:1735:brw_interpret()) dagg-OST0005-osc-ffff88015a1ad800: too many resent retries for object: 0:2683981, rc = -11. 
Feb 19 23:48:57 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8807fca6c000 Feb 19 23:48:57 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8807fca6c000 Feb 19 23:48:58 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880b72656600 Feb 19 23:48:58 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880b72656600 Feb 19 23:49:04 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88228a5a1400 Feb 19 23:49:04 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88228a5a1400 Feb 19 23:49:05 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880eae89f000 Feb 19 23:49:05 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880eae89f000 Feb 19 23:49:11 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e07fe3400 Feb 19 23:49:11 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e07fe3400 Feb 19 23:49:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8816706db800 Feb 19 23:49:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8816706db800 Feb 19 23:49:18 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e756ffc00 Feb 19 23:49:18 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e756ffc00 Feb 19 23:49:19 arkle3 kernel: LustreError: 
298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8816940af600 Feb 19 23:49:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8816940af600 Feb 19 23:49:25 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e756fcc00 Feb 19 23:49:25 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e756fcc00 Feb 19 23:49:26 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880b1f0f5200 Feb 19 23:49:26 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880b1f0f5200 Feb 19 23:49:32 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e756f8000 Feb 19 23:49:32 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e756f8000 Feb 19 23:49:33 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8817802e4800 Feb 19 23:49:33 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8817802e4800 Feb 19 23:49:39 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e756f8000 Feb 19 23:49:39 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e756f8000 Feb 19 23:49:40 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880a7766f000 Feb 19 23:49:40 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880a7766f000 Feb 19 23:49:46 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event 
type 5, status -61, desc ffff882e756fce00 Feb 19 23:49:46 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e756fce00 Feb 19 23:49:47 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880a7766c800 Feb 19 23:49:47 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880a7766c800 Feb 19 23:49:51 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [447804664-451997695]: client csum f6a340ce, server csum 62064124 Feb 19 23:49:51 arkle3 kernel: LustreError: Skipped 23 previous similar messages Feb 19 23:49:53 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [447804664-451997695], original client csum f6a340ce (type 4), server csum 62064124 (type 4), client csum now f6a340ce Feb 19 23:49:53 john100 kernel: LustreError: Skipped 23 previous similar messages Feb 19 23:49:53 john100 kernel: LustreError: 911:0:(osc_request.c:1735:brw_interpret()) dagg-OST0005-osc-ffff88015a1ad800: too many resent retries for object: 0:2683981, rc = -11. 
Feb 19 23:49:53 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e756fea00 Feb 19 23:49:53 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e756fea00 Feb 19 23:49:53 john100 kernel: LustreError: 913:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880eb3d54200 x1591386600781488/t197580822040(197580822040) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044637 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:49:53 john100 kernel: LustreError: 913:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 21 previous similar messages Feb 19 23:49:54 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8806e4e16200 Feb 19 23:49:54 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8806e4e16200 Feb 19 23:50:00 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8807fca6f600 Feb 19 23:50:00 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8807fca6f600 Feb 19 23:50:01 transom1 systemd: Started Session 8317 of user root. Feb 19 23:50:01 transom1 systemd: Starting Session 8317 of user root. 
Feb 19 23:50:01 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8806e4e16800 Feb 19 23:50:01 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8806e4e16800 Feb 19 23:50:07 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8807fca69200 Feb 19 23:50:07 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8807fca69200 Feb 19 23:50:08 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8806e4e12c00 Feb 19 23:50:08 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8806e4e12c00 Feb 19 23:50:14 arkle6 kernel: Lustre: dagg-OST000b: Connection restored to (at 192.168.44.201@o2ib44) Feb 19 23:50:14 arkle6 kernel: Lustre: Skipped 18 previous similar messages Feb 19 23:50:14 john101 kernel: LNetError: 905:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282686304 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:50:14 john101 kernel: LNetError: 905:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 18 previous similar messages Feb 19 23:50:14 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881e07424e00 Feb 19 23:50:14 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881e07424e00 Feb 19 23:50:14 arkle6 kernel: LustreError: 63277:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff88276c3d6050 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:475/0 lens 608/448 e 0 to 0 dl 1519044620 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:50:14 arkle6 kernel: LustreError: 
63277:0:(ldlm_lib.c:3242:target_bulk_io()) Skipped 18 previous similar messages Feb 19 23:50:14 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110 Feb 19 23:50:14 arkle6 kernel: Lustre: Skipped 18 previous similar messages Feb 19 23:50:15 arkle3 kernel: Lustre: dagg-OST0005: Connection restored to 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) Feb 19 23:50:15 arkle3 kernel: Lustre: Skipped 18 previous similar messages Feb 19 23:50:15 john100 kernel: LNetError: 898:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600782240 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:50:15 john100 kernel: LNetError: 898:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 18 previous similar messages Feb 19 23:50:21 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044614/real 1519044614] req@ffff8817407c1800 x1591895282655776/t0(0) o4->dagg-OST000b-osc-ffff881899d0c800@192.168.44.36@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1519044621 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:50:21 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 18 previous similar messages Feb 19 23:50:21 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection to dagg-OST000b (at 192.168.44.36@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:50:21 john101 kernel: Lustre: Skipped 18 previous similar messages Feb 19 23:50:21 arkle6 kernel: Lustre: dagg-OST000b: Client 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44) reconnecting Feb 19 23:50:21 arkle6 kernel: Lustre: Skipped 18 previous similar messages Feb 19 23:50:21 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection restored to 192.168.44.36@o2ib44 (at 
192.168.44.36@o2ib44) Feb 19 23:50:21 john101 kernel: Lustre: Skipped 18 previous similar messages Feb 19 23:50:21 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881e07425000 Feb 19 23:50:21 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881e07425000 Feb 19 23:50:23 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880f3248c200 Feb 19 23:50:23 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880f3248c200 Feb 19 23:50:23 arkle3 kernel: LustreError: 142057:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff8817ceecbc50 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:501/0 lens 608/448 e 1 to 0 dl 1519044646 ref 1 fl Interpret:/2/0 rc 0/0 Feb 19 23:50:23 arkle3 kernel: LustreError: 142057:0:(ldlm_lib.c:3242:target_bulk_io()) Skipped 18 previous similar messages Feb 19 23:50:24 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110 Feb 19 23:50:24 arkle3 kernel: Lustre: Skipped 18 previous similar messages Feb 19 23:50:28 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881e07420200 Feb 19 23:50:28 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881e07420200 Feb 19 23:50:35 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881e07424000 Feb 19 23:50:35 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881e07424000 Feb 19 23:50:42 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status 
-61, desc ffff881e07427e00 Feb 19 23:50:42 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881e07427e00 Feb 19 23:50:47 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044615/real 1519044615] req@ffff8822d040c500 x1591386600741136/t0(0) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/448 e 1 to 1 dl 1519044647 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 Feb 19 23:50:47 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 18 previous similar messages Feb 19 23:50:47 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection to dagg-OST0005 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Feb 19 23:50:47 john100 kernel: Lustre: Skipped 18 previous similar messages Feb 19 23:50:47 arkle3 kernel: Lustre: dagg-OST0005: Client 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) reconnecting Feb 19 23:50:47 arkle3 kernel: Lustre: Skipped 18 previous similar messages Feb 19 23:50:47 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44) Feb 19 23:50:47 john100 kernel: Lustre: Skipped 18 previous similar messages Feb 19 23:50:47 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880ecb1b5600 Feb 19 23:50:47 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880ecb1b5600 Feb 19 23:50:48 transom1 fm0_sm[158971]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Scheduled sweep interval Feb 19 23:50:49 transom1 fm0_sm[158971]: PROGR[topology]: SM: topology_main: DISCOVERY CYCLE END. 
9 SWs, 136 HFIs, 136 end ports, 569 total ports, 1 SM(s), 1893 packets, 0 retries, 0.292 sec sweep Feb 19 23:50:49 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88228a5a6000 Feb 19 23:50:49 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88228a5a6000 Feb 19 23:50:51 john100 kernel: LustreError: 918:0:(osc_request.c:1735:brw_interpret()) dagg-OST0005-osc-ffff88015a1ad800: too many resent retries for object: 0:2683981, rc = -11. Feb 19 23:50:56 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882fd219aa00 Feb 19 23:50:56 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882fd219aa00 Feb 19 23:51:03 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882fd219e000 Feb 19 23:51:03 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882fd219e000 Feb 19 23:51:10 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882fd219d600 Feb 19 23:51:10 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882fd219d600 Feb 19 23:51:17 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882fd2199400 Feb 19 23:51:17 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882fd2199400 Feb 19 23:51:19 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8804c76bf800 Feb 19 23:51:19 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8804c76bf800 Feb 19 23:51:24 arkle6 kernel: LustreError: 
298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882fd219ba00 Feb 19 23:51:24 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882fd219ba00 Feb 19 23:51:31 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882fd219e000 Feb 19 23:51:31 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882fd219e000 Feb 19 23:51:38 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882fd2198200 Feb 19 23:51:38 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882fd2198200 Feb 19 23:51:45 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882fd219a800 Feb 19 23:51:45 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882fd219a800 Feb 19 23:51:49 john100 kernel: LustreError: 926:0:(osc_request.c:1735:brw_interpret()) dagg-OST0005-osc-ffff88015a1ad800: too many resent retries for object: 0:2683981, rc = -11. 
Feb 19 23:51:51 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880d59fdf600 Feb 19 23:51:51 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880d59fdf600 Feb 19 23:51:52 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882ca437ae00 Feb 19 23:51:52 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882ca437ae00 Feb 19 23:51:59 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882ca4378200 Feb 19 23:51:59 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882ca4378200 Feb 19 23:52:06 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882ca4378200 Feb 19 23:52:06 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882ca4378200 Feb 19 23:52:13 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882ca4378200 Feb 19 23:52:13 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882ca4378200 Feb 19 23:52:20 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882ca437fa00 Feb 19 23:52:20 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882ca437fa00 Feb 19 23:52:23 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881377bbc600 Feb 19 23:52:23 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881377bbc600 Feb 19 23:52:27 arkle6 kernel: LustreError: 
298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881e07427a00 Feb 19 23:52:27 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881e07427a00 Feb 19 23:52:34 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881e07427a00 Feb 19 23:52:34 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881e07427a00 Feb 19 23:52:41 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881e07421600 Feb 19 23:52:41 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881e07421600 Feb 19 23:52:48 john100 kernel: LustreError: 927:0:(osc_request.c:1735:brw_interpret()) dagg-OST0005-osc-ffff88015a1ad800: too many resent retries for object: 0:2683981, rc = -11. Feb 19 23:52:48 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881e07424200 Feb 19 23:52:48 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881e07424200 Feb 19 23:52:55 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88178d2e0c00 Feb 19 23:52:55 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88178d2e0c00 Feb 19 23:52:55 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882b345b7800 Feb 19 23:52:55 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882b345b7800 Feb 19 23:53:01 transom1 systemd: Started Session 8318 of user root. Feb 19 23:53:01 transom1 systemd: Starting Session 8318 of user root. 
Feb 19 23:53:02 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882ebb674e00 Feb 19 23:53:02 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882ebb674e00 Feb 19 23:53:09 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882ebb670c00 Feb 19 23:53:09 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882ebb670c00 Feb 19 23:53:16 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881e07426600 Feb 19 23:53:16 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881e07426600 Feb 19 23:53:23 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88228a5a2400 Feb 19 23:53:23 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88228a5a2400 Feb 19 23:53:27 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88178d2e4a00 Feb 19 23:53:27 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88178d2e4a00 Feb 19 23:53:30 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88228a5a0c00 Feb 19 23:53:30 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88228a5a0c00 Feb 19 23:53:37 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882ca437a400 Feb 19 23:53:37 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882ca437a400 Feb 19 23:53:53 arkle6 kernel: LustreError: 
298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882ca437a400 Feb 19 23:53:53 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882ca437a400 Feb 19 23:53:59 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88132b31be00 Feb 19 23:53:59 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88132b31be00 Feb 19 23:54:11 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [599407780-603598847]: client csum 2a11218d, server csum 5539349b Feb 19 23:54:11 arkle3 kernel: LustreError: Skipped 51 previous similar messages Feb 19 23:54:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [599407780-603598847], original client csum 2a11218d (type 4), server csum 5539349b (type 4), client csum now 2a11218d Feb 19 23:54:12 john100 kernel: LustreError: Skipped 50 previous similar messages Feb 19 23:54:12 john100 kernel: LustreError: 927:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880e7a9b0000 x1591386600788656/t197580822106(197580822106) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044895 ref 2 fl Interpret:RM/0/0 rc 0/0 Feb 19 23:54:12 john100 kernel: LustreError: 927:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 45 previous similar messages Feb 19 23:54:16 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882ca437be00 Feb 19 23:54:16 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc 
ffff882ca437be00 Feb 19 23:54:31 arkle3 kernel: Lustre: dagg-OST0005: Connection restored to 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) Feb 19 23:54:31 arkle3 kernel: Lustre: Skipped 7 previous similar messages Feb 19 23:54:31 john100 kernel: LNetError: 899:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600789184 length 1048576 too big: 1048176 left, 1048176 allowed Feb 19 23:54:31 john100 kernel: LNetError: 899:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 7 previous similar messages Feb 19 23:54:31 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88047592c600 Feb 19 23:54:31 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88047592c600 Feb 19 23:54:47 john100 kernel: LustreError: 922:0:(osc_request.c:1735:brw_interpret()) dagg-OST0005-osc-ffff88015a1ad800: too many resent retries for object: 0:2683981, rc = -11. 
Feb 19 23:54:47 john100 kernel: LustreError: 922:0:(osc_request.c:1735:brw_interpret()) Skipped 1 previous similar message
Feb 19 23:54:48 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044856/real 1519044856] req@ffff8817407c1800 x1591895282655776/t0(0) o4->dagg-OST000b-osc-ffff881899d0c800@192.168.44.36@o2ib44:6/4 lens 608/448 e 1 to 1 dl 1519044888 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Feb 19 23:54:48 john101 kernel: Lustre: 929:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 30 previous similar messages
Feb 19 23:54:48 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection to dagg-OST000b (at 192.168.44.36@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
Feb 19 23:54:48 john101 kernel: Lustre: Skipped 30 previous similar messages
Feb 19 23:54:48 arkle6 kernel: Lustre: dagg-OST000b: Client 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44) reconnecting
Feb 19 23:54:48 arkle6 kernel: Lustre: Skipped 30 previous similar messages
Feb 19 23:54:48 arkle6 kernel: Lustre: dagg-OST000b: Connection restored to (at 192.168.44.201@o2ib44)
Feb 19 23:54:48 arkle6 kernel: Lustre: Skipped 31 previous similar messages
Feb 19 23:54:48 john101 kernel: Lustre: dagg-OST000b-osc-ffff881899d0c800: Connection restored to 192.168.44.36@o2ib44 (at 192.168.44.36@o2ib44)
Feb 19 23:54:48 john101 kernel: Lustre: Skipped 30 previous similar messages
Feb 19 23:54:48 john101 kernel: LNetError: 907:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282693568 length 1048576 too big: 1048176 left, 1048176 allowed
Feb 19 23:54:48 john101 kernel: LNetError: 907:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 31 previous similar messages
Feb 19 23:54:48 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e07fe2e00
Feb 19 23:54:48 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e07fe2e00
Feb 19 23:54:48 arkle6 kernel: LustreError: 63277:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff882dd99f6c50 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:19/0 lens 608/448 e 0 to 0 dl 1519044919 ref 1 fl Interpret:/2/0 rc 0/0
Feb 19 23:54:48 arkle6 kernel: LustreError: 63277:0:(ldlm_lib.c:3242:target_bulk_io()) Skipped 31 previous similar messages
Feb 19 23:54:48 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110
Feb 19 23:54:48 arkle6 kernel: Lustre: Skipped 31 previous similar messages
Feb 19 23:55:03 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1519044871/real 1519044871] req@ffff8822d040c500 x1591386600741136/t0(0) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/448 e 1 to 1 dl 1519044903 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Feb 19 23:55:03 john100 kernel: Lustre: 945:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Feb 19 23:55:03 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection to dagg-OST0005 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
Feb 19 23:55:03 john100 kernel: Lustre: Skipped 7 previous similar messages
Feb 19 23:55:03 arkle3 kernel: Lustre: dagg-OST0005: Client 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44) reconnecting
Feb 19 23:55:03 arkle3 kernel: Lustre: Skipped 7 previous similar messages
Feb 19 23:55:03 john100 kernel: Lustre: dagg-OST0005-osc-ffff88015a1ad800: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44)
Feb 19 23:55:03 john100 kernel: Lustre: Skipped 7 previous similar messages
Feb 19 23:55:03 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8813faa3ac00
Feb 19 23:55:03 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8813faa3ac00
Feb 19 23:55:03 arkle3 kernel: LustreError: 142057:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE req@ffff8812eddcb450 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:34/0 lens 608/448 e 0 to 0 dl 1519044934 ref 1 fl Interpret:/2/0 rc 0/0
Feb 19 23:55:03 arkle3 kernel: LustreError: 142057:0:(ldlm_lib.c:3242:target_bulk_io()) Skipped 8 previous similar messages
Feb 19 23:55:03 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110
Feb 19 23:55:03 arkle3 kernel: Lustre: Skipped 8 previous similar messages
Feb 19 23:55:20 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e07fe1e00
Feb 19 23:55:20 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e07fe1e00
Feb 19 23:55:35 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880abc8eec00
Feb 19 23:55:35 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880abc8eec00
Feb 19 23:55:49 transom1 fm0_sm[158971]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Scheduled sweep interval
Feb 19 23:55:49 transom1 fm0_sm[158971]: PROGR[topology]: SM: topology_main: DISCOVERY CYCLE END. 9 SWs, 136 HFIs, 136 end ports, 569 total ports, 1 SM(s), 1893 packets, 0 retries, 0.287 sec sweep
Feb 19 23:55:52 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e07fe3400
Feb 19 23:55:52 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e07fe3400
Feb 19 23:56:07 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880d8bd73a00
Feb 19 23:56:07 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880d8bd73a00
Feb 19 23:56:24 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e07fe3800
Feb 19 23:56:24 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e07fe3800
Feb 19 23:56:39 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880d8bd74a00
Feb 19 23:56:39 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880d8bd74a00
Feb 19 23:56:56 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e77e66a00
Feb 19 23:56:56 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e77e66a00
Feb 19 23:57:11 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880709a44800
Feb 19 23:57:11 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880709a44800
Feb 19 23:57:28 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e77e61800
Feb 19 23:57:28 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e77e61800
Feb 19 23:57:42 john100 kernel: LustreError: 920:0:(osc_request.c:1735:brw_interpret()) dagg-OST0005-osc-ffff88015a1ad800: too many resent retries for object: 0:2683981, rc = -11.
Feb 19 23:57:42 john100 kernel: LustreError: 920:0:(osc_request.c:1735:brw_interpret()) Skipped 2 previous similar messages
Feb 19 23:57:42 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880709a40c00
Feb 19 23:57:42 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880709a40c00
Feb 19 23:57:43 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880abc8ef800
Feb 19 23:57:43 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88086d3e1000
Feb 19 23:57:43 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880abc8ef800
Feb 19 23:57:43 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88086d3e1000
Feb 19 23:58:00 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e77e62200
Feb 19 23:58:00 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e77e62200
Feb 19 23:58:15 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88090c258000
Feb 19 23:58:15 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88086d3e7c00
Feb 19 23:58:15 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88090c258000
Feb 19 23:58:15 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88086d3e7c00
Feb 19 23:58:32 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e77e67a00
Feb 19 23:58:32 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e77e67a00
Feb 19 23:58:47 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88177c4bda00
Feb 19 23:58:47 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88086d3e2c00
Feb 19 23:58:47 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88177c4bda00
Feb 19 23:58:47 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88086d3e2c00
Feb 19 23:59:04 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e77e67000
Feb 19 23:59:04 arkle6 kernel: LustreError: 298396:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e77e67000
...