
    Description

      Hiya,

      these appeared last night.
      john100 and john101 are clients, arkle3 and arkle6 are OSSes, and transom1 is running the fabric manager.

      in light of the similar-looking LU-9305 I thought I would create this ticket.
      we run default (4M) RPCs on clients and servers. our OSTs are each 4 raidz3 vdevs in 1 zpool, with recordsize=2M. 2 OSTs per OSS, 16 OSTs total.

      I suppose it could be an OPA network glitch, but it affected 2 clients and 2 servers, so that seems unlikely.

      we have just moved from zfs 0.7.5 to zfs 0.7.6. we ran ior and mdtest after this change and they were fine. these errors occurred a couple of days after that.

      Feb 19 23:45:12 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 54e81a18
      Feb 19 23:45:12 john100 kernel: LNetError: 899:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600741136 length 1048576 too big: 1048176 left, 1048176 allowed
      Feb 19 23:45:12 arkle3 kernel: LNet: Using FMR for registration
      Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881764648600
      Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881764648600
      Feb 19 23:45:12 arkle3 kernel: LustreError: 337237:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff881687561450 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0
      Feb 19 23:45:12 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110
      Feb 19 23:45:12 arkle3 kernel: LNet: Skipped 1 previous similar message
      Feb 19 23:45:12 john101 kernel: LNetError: 904:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282655776 length 1048576 too big: 1048176 left, 1048176 allowed
      Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8810cb714600
      Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8810cb714600
      Feb 19 23:45:12 arkle6 kernel: LustreError: 42356:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff88276c3d5850 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0
      Feb 19 23:45:12 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110
      Feb 19 23:45:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 54e81a18 (type 4), client csum now efde5b36
      Feb 19 23:45:12 john100 kernel: LustreError: 924:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff880e7a9b4e00 x1591386600740944/t197580821951(197580821951) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044319 ref 2 fl Interpret:RM/0/0 rc 0/0
      Feb 19 23:45:13 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum df7fda9d
      Feb 19 23:45:13 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum df7fda9d (type 4), client csum now efde5b36
      Feb 19 23:45:13 john100 kernel: LustreError: 911:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff8807492fe300 x1591386600747696/t197580821955(197580821955) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044320 ref 2 fl Interpret:RM/0/0 rc 0/0
      Feb 19 23:45:15 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 87da008a
      Feb 19 23:45:15 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 87da008a (type 4), client csum now efde5b36
      Feb 19 23:45:15 john100 kernel: LustreError: 910:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff880eb3d51b00 x1591386600751360/t197580821956(197580821956) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044322 ref 2 fl Interpret:RM/0/0 rc 0/0
      Feb 19 23:45:18 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 1cc7a793
      Feb 19 23:45:18 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 1cc7a793 (type 4), client csum now efde5b36
      

      full log attached.
      there were hung and unkillable user processes on the 2 clients afterwards. a reboot of the 2 clients has cleared up the looping messages of the type shown below.

      Feb 20 14:28:15 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef2c00
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88176544be00
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8814072e0200
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88176544be00
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8814072e0200
      Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e25ef3800
      Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef3800
      Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881615f0f000
      Feb 20 14:28:51 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880be3931c00
      Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881615f0f000
      

      cheers,
      robin

          Activity

            [LU-10683] write checksum errors

            Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33728
            Subject: Revert "LU-10683 osd_zfs: set offset in page correctly"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a95e7e398ee411a14c3d67072fbea273832e0957


            John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32899/
            Subject: LU-10683 osd_zfs: set offset in page correctly
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 686a73ea9467c53d261cf12d0802bb1332d50f4a


            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32899
            Subject: LU-10683 osd_zfs: set offset in page correctly
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 1484662e0213007bfa5b7b68f02df77719a1a6d7


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32788/
            Subject: LU-10683 osd_zfs: set offset in page correctly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 83cb17031913ba2f33a5b67219a03c5605f48f27


            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32788
            Subject: LU-10683 osd_zfs: set offset in page correctly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1fb85e7eba6823b5822f7298c7fa648770239635

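            For illustration of why a wrong intra-page offset on the OSS side would produce exactly this signature (stable client checksum, server checksum different on every resend): a misplaced payload means the server checksums, and writes, the wrong bytes within each page, and whatever stale bytes fill the rest of the page could also differ from one retry to the next. The sketch below is only a guess at the mechanism suggested by the patch subject; the struct, field, and function names are invented stand-ins, not the Lustre bulk/niobuf API, and the numbers are arbitrary (note the extents in the logs, e.g. [331258364-335450111], do not start on 4 KiB page boundaries).

            /*
             * Hypothetical sketch only -- none of these identifiers are real
             * Lustre code.  If the receiving side records the wrong offset
             * inside a page for a bulk fragment, it checksums (and writes)
             * different bytes than the sender checksummed, even though the
             * data crossed the network intact.
             */
            #include <stdint.h>
            #include <stdio.h>
            #include <string.h>

            #define PAGE_SZ 4096

            struct bulk_frag {                  /* toy per-page bulk descriptor */
                unsigned char page[PAGE_SZ];
                unsigned int  offset;           /* payload offset inside the page */
                unsigned int  len;              /* payload length */
            };

            /* toy checksum, standing in for the real crc32c/adler over the bulk */
            static uint32_t cksum(const unsigned char *p, unsigned int len)
            {
                uint32_t c = 0;
                while (len--)
                    c = c * 31 + *p++;
                return c;
            }

            int main(void)
            {
                struct bulk_frag cli = { .offset = 2556, .len = 1540 };  /* unaligned extent */
                struct bulk_frag srv = { .offset = cli.offset, .len = cli.len };
                unsigned char wire[PAGE_SZ];

                memset(cli.page, 0xab, PAGE_SZ);        /* the client's data */
                memset(srv.page, 0x00, PAGE_SZ);        /* fresh server-side page */

                /* client: checksum the payload at its true offset, then send it */
                uint32_t client_csum = cksum(cli.page + cli.offset, cli.len);
                memcpy(wire, cli.page + cli.offset, cli.len);

                /* server-side bug: the payload lands at offset 0 in the page,
                 * but the descriptor still claims offset 2556, so the checksum
                 * covers untouched page contents instead of the payload */
                memcpy(srv.page, wire, srv.len);
                uint32_t server_csum = cksum(srv.page + srv.offset, srv.len);

                printf("client csum %08x, server csum %08x\n", client_csum, server_csum);
                return 0;
            }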

            adilger Andreas Dilger added a comment -

            Correct. The client will resend on checksum failures up to 10 times by default (controlled by osc.*.resend_count).

            That the checksum at the client is always the same implies that the data is also the same (e.g. all zero). That it is different on the server each time implies it is being changed after the client has computed the checksum (e.g. in client RAM, over the network, or in OSS RAM). If you are using mmap() on files or O_DIRECT with another thread modifying the pages it is possible to see such corruption, but Lustre can't do anything about it (short of copying the data, which is highly undesirable).
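
            For illustration of the mmap()/O_DIRECT scenario above: if one thread keeps modifying a buffer while another thread has an O_DIRECT write of that buffer in flight, the bytes that leave the node can differ from the bytes present when the bulk checksum was computed, and the filesystem cannot detect it short of copying the data. This is a minimal sketch, assuming nothing about the actual jobs on john100/john101; the path, sizes, and program are entirely hypothetical (build with gcc -pthread).

            #define _GNU_SOURCE                 /* for O_DIRECT */
            #include <fcntl.h>
            #include <pthread.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <string.h>
            #include <unistd.h>

            #define BUF_SIZE (1 << 20)          /* 1 MiB, the bulk size seen in the logs */

            static char *buf;
            static volatile int writing = 1;    /* deliberate race, for illustration */

            static void *scribbler(void *arg)
            {
                size_t i = 0;
                (void)arg;
                while (writing)                 /* keep flipping bytes in the in-flight buffer */
                    buf[i++ % BUF_SIZE] ^= 0xff;
                return NULL;
            }

            int main(void)
            {
                pthread_t t;
                int fd;

                /* O_DIRECT needs an aligned buffer; 4096 covers common block sizes */
                if (posix_memalign((void **)&buf, 4096, BUF_SIZE))
                    return 1;
                memset(buf, 0, BUF_SIZE);

                /* hypothetical path on the affected filesystem */
                fd = open("/dagg/scratch/odirect-race.dat",
                          O_WRONLY | O_CREAT | O_DIRECT, 0644);
                if (fd < 0) {
                    perror("open");
                    return 1;
                }

                pthread_create(&t, NULL, scribbler, NULL);
                /* the buffer may still be changing while the kernel client
                 * checksums and transmits it */
                if (pwrite(fd, buf, BUF_SIZE, 0) != BUF_SIZE)
                    perror("pwrite");
                writing = 0;
                pthread_join(t, NULL);
                close(fd);
                return 0;
            }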

            scadmin SC Admin added a comment -

            from reading the lustre manual it sounds like these checksum events are re-tried after they are detected, so the users might be seeing no effect from them.
            is that correct?

            cheers,
            robin


            hongchao.zhang Hongchao Zhang added a comment -

            As per the logs, the checksum looks a little strange on the client side

            [3111837064-3112361351]: client csum c253e960, server csum 5e6e4da8
            00000020:02020000:15.0:1523986236.833762:0:10678:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3186813384-3187337671]: client csum c253e960, server csum a42a5590
            00000020:02020000:1.0:1523986236.845204:0:285172:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3069890888-3070415175]: client csum c253e960, server csum fc43f2ee
            00000020:02020000:17.0:1523986236.845219:0:298862:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3128614280-3129138567]: client csum c253e960, server csum 976d257c
            00000020:02020000:5.0:1523986236.845231:0:287464:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3095056712-3095580999]: client csum c253e960, server csum 1747b303
            00000020:02020000:7.0:1523986236.845411:0:287463:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3061502280-3062026567]: client csum c253e960, server csum 1c6d4fd9
            00000020:02020000:9.0:1523986236.845578:0:134229:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3036333320-3036857607]: client csum c253e960, server csum 4b320be5
            00000020:02020000:15.0:1523986236.845877:0:287086:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3027420424-3027944711]: client csum c253e960, server csum d537dae0
            00000020:02020000:1.0:1523986236.846009:0:34879:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3102921032-3103445319]: client csum c253e960, server csum 50418907
            

            the checksums of the different extents of the same file (0x280022737:0x5b:0x0) are all the same on the client side ("c253e960"),
            but they differ on the server side.

            I'll look into it more deeply to find out what causes it.

            scadmin SC Admin added a comment -

            so far it looks like this group was NOT over quota, but was again using parallel hdf5 writes.

            the 4 files in question apparently look ok to them (i.e. not obviously corrupted).

            cheers,
            robin

            scadmin SC Admin added a comment -

            the script triggered on 3 sets of checksum errors last night.
            dk's attached in 2018-04-18-03.LU10683.neterror.tgz
            the names of the files tell you when they were captured.
            also messages.checksum-i gives you an overall picture of when and where the dk's ran too.
            arkles are servers, johns are clients.

            the 4 fids fingered are

            [root@john5 ~]# lfs fid2path /dagg 0x20001251f:0x144:0x0
            /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0500/meraxes_grids_74.hdf5
            [root@john5 ~]# lfs fid2path /dagg 0x280022737:0x5b:0x0
            /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0010/meraxes_grids_72.hdf5
            [root@john5 ~]# lfs fid2path /dagg 0x680024e04:0x46:0x0
            /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_67.hdf5
            [root@john5 ~]# lfs fid2path /dagg 0x680024e04:0x47:0x0
            /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_68.hdf5
            [root@john5 ~]# ls -l /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0500/meraxes_grids_74.hdf5 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0010/meraxes_grids_72.hdf5 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_67.hdf5 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_68.hdf5
            -rw-rw-r-- 1 yqin oz025 4832124728 Apr 18 03:31 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0010/meraxes_grids_72.hdf5
            -rw-rw-r-- 1 yqin oz025 4832124728 Apr 18 03:22 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_67.hdf5
            -rw-rw-r-- 1 yqin oz025 4832124728 Apr 18 03:24 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_68.hdf5
            -rw-rw-r-- 1 yqin oz025 4832124728 Apr 18 03:35 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0500/meraxes_grids_74.hdf5
            

            and that group is currently well under quota

            [root@john5 ~]# lfs quota -h -g oz025 /dagg/
            Disk quotas for grp oz025 (gid 10227):
                 Filesystem    used   quota   limit   grace   files   quota   limit   grace
                     /dagg/  6.342T      0k     10T       -  188885       0 1000000       -
            

            I'll check with them about the state of jobs that ran last night, and also the state of those files, and quota.

            cheers,
            robin

            scadmin SC Admin added a comment -

            Hi Amir,

            I set up a script to tail syslog and run dk on anything that hits a CHECKSUM error.
            however it doesn't appear safe to turn on +net.
            I've seen 132 stack traces like these across all the servers and clients since I turned on +net this afternoon, so I've now turned it off.

            Apr 12 18:57:25 arkle5 kernel: ------------[ cut here ]------------
            Apr 12 18:57:25 arkle5 kernel: WARNING: CPU: 1 PID: 127223 at kernel/softirq.c:151 __local_bh_enable_ip+0x82/0xb0
            Apr 12 18:57:25 arkle5 kernel: Modules linked in: sctp_diag sctp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag ip6table_filter ip6_tables iptable_filter osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) mptctl mptbase 8021q garp mrp stp llc hfi1 sunrpc xfs dm_round_robin dcdbas intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr iTCO_wdt iTCO_vendor_support zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) mgag200 ttm dm_multipath drm_kms_helper ses syscopyarea enclosure dm_mod sysfillrect sysimgblt fb_sys_fops drm mei_me lpc_ich shpchp i2c_i801 sg mei
            Apr 12 18:57:25 arkle5 kernel: ipmi_si ipmi_devintf nfit ipmi_msghandler libnvdimm acpi_power_meter tpm_crb rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm iw_cm binfmt_misc ip_tables ib_ipoib ib_cm sr_mod cdrom sd_mod crc_t10dif crct10dif_generic bonding bnx2x rdmavt ahci i2c_algo_bit libahci crct10dif_pclmul mpt3sas mdio crct10dif_common i2c_core crc32c_intel ptp raid_class ib_core libata megaraid_sas scsi_transport_sas pps_core libcrc32c [last unloaded: hfi1]
            Apr 12 18:57:25 arkle5 kernel: CPU: 1 PID: 127223 Comm: hfi1_cq0 Tainted: P           OE  ------------   3.10.0-693.17.1.el7.x86_64 #1
            Apr 12 18:57:25 arkle5 kernel: Hardware name: Dell Inc. PowerEdge R740/06G98X, BIOS 1.3.7 02/08/2018
            Apr 12 18:57:25 arkle5 kernel: Call Trace:
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff816a6071>] dump_stack+0x19/0x1b
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810895e8>] __warn+0xd8/0x100
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff8108972d>] warn_slowpath_null+0x1d/0x20
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff81091be2>] __local_bh_enable_ip+0x82/0xb0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff816ade8e>] _raw_spin_unlock_bh+0x1e/0x20
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc06183b5>] cfs_trace_unlock_tcd+0x55/0x90 [libcfs]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc0623708>] libcfs_debug_vmsg2+0x6d8/0xb40 [libcfs]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810cfb6c>] ? dequeue_entity+0x11c/0x5d0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810c95d5>] ? sched_clock_cpu+0x85/0xc0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc0623bc7>] libcfs_debug_msg+0x57/0x80 [libcfs]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc069682a>] kiblnd_cq_completion+0x11a/0x160 [ko2iblnd]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc03ab4a2>] send_complete+0x32/0x50 [rdmavt]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2ac0>] kthread_worker_fn+0x80/0x180
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2a40>] ? kthread_stop+0xe0/0xe0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b270f>] kthread+0xcf/0xe0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
            Apr 12 18:57:25 arkle5 kernel: ---[ end trace aaf779f5b67c32db ]---
            Apr 12 18:57:25 arkle7 kernel: ------------[ cut here ]------------
            

            I also have one client that looks permanently upset now (john50 is a client, arkle3 is an OSS)

            Apr 12 21:28:05 john50 kernel: LNetError: 909:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1594448751341360 length 1048576 too big: 1045288 left, 1045288 allowed
            Apr 12 21:28:05 arkle3 kernel: LustreError: 296380:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880c21c90a00
            Apr 12 21:28:05 arkle3 kernel: LustreError: 296380:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880c21c90a00
            Apr 12 21:28:05 arkle3 kernel: LustreError: 233673:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8817643ce450 x1594448751341360/t0(0) o4->98af46e3-9fa3-6f5b-2dcd-f89325115978@192.168.44.150@o2ib44:626/0 lens 608/448 e 0 to 0 dl 1523532491 ref 1 fl Interpret:/0/0 rc 0/0
            Apr 12 21:28:05 arkle3 kernel: Lustre: dagg-OST0004: Bulk IO write error with 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44), client will retry: rc = -110
            Apr 12 21:28:12 john50 kernel: Lustre: 933:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523532485/real 1523532485]  req@ffff8817b5bd6900 x1594448751341360/t0(0) o4->dagg-OST0004-osc-ffff88189aa1f000@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1523532492 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            Apr 12 21:28:12 john50 kernel: Lustre: 933:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
            Apr 12 21:28:12 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection to dagg-OST0004 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
            Apr 12 21:28:12 john50 kernel: Lustre: Skipped 3 previous similar messages
            Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Client 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44) reconnecting
            Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Connection restored to 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44)
            Apr 12 21:28:12 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44)
            Apr 12 21:28:12 john50 kernel: Lustre: Skipped 2 previous similar messages
            Apr 12 21:28:12 john50 kernel: LNetError: 910:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1594448751343216 length 1048576 too big: 1045288 left, 1045288 allowed
            Apr 12 21:28:12 arkle3 kernel: LustreError: 296379:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88169ed25a00
            Apr 12 21:28:12 arkle3 kernel: LustreError: 296379:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88169ed25a00
            Apr 12 21:28:12 arkle3 kernel: LustreError: 50811:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8816dd9f2c50 x1594448751341360/t0(0) o4->98af46e3-9fa3-6f5b-2dcd-f89325115978@192.168.44.150@o2ib44:633/0 lens 608/448 e 0 to 0 dl 1523532498 ref 1 fl Interpret:/2/0 rc 0/0
            Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Bulk IO write error with 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44), client will retry: rc = -110
            Apr 12 21:28:19 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection to dagg-OST0004 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
            

            I'll reboot the client as (IIRC) this has cleared up this kind of problem in the past

            Apr 13 01:22:00 arkle3 kernel: LustreError: 296378:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880f6ea28800
            Apr 13 01:22:07 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88150b4d4800
            Apr 13 01:22:07 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88150b4d4800
            Apr 13 01:22:15 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88150b4d2800
            Apr 13 01:22:15 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88150b4d2800
            

            debug +neterror (a default) is still enabled and the dk will still catch that. hopefully that will be enough for you.

            cheers,
            robin


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: scadmin SC Admin
              Votes: 0
              Watchers: 14
