
    Description

      Hiya,

      these appeared last night.
      john100 and john101 are clients, arkle3 and arkle6 are OSSes, and transom1 is running the fabric manager.

      in light of the similar-looking LU-9305 I thought I would create this ticket.
      we run default (4M) RPCs on clients and servers. our OSTs are each 4 raidz3 vdevs in 1 zpool, with recordsize=2M. 2 OSTs per OSS, 16 OSTs total.

      I suppose it could be an OPA network glitch, but it affected 2 clients and 2 servers, so that seems unlikely.

      we have just moved from zfs 0.7.5 to zfs 0.7.6. we ran ior and mdtest after this change and they were fine. these errors occurred a couple of days after that.

      Feb 19 23:45:12 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 54e81a18
      Feb 19 23:45:12 john100 kernel: LNetError: 899:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600741136 length 1048576 too big: 1048176 left, 1048176 allowed
      Feb 19 23:45:12 arkle3 kernel: LNet: Using FMR for registration
      Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881764648600
      Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881764648600
      Feb 19 23:45:12 arkle3 kernel: LustreError: 337237:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff881687561450 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0
      Feb 19 23:45:12 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110
      Feb 19 23:45:12 arkle3 kernel: LNet: Skipped 1 previous similar message
      Feb 19 23:45:12 john101 kernel: LNetError: 904:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282655776 length 1048576 too big: 1048176 left, 1048176 allowed
      Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8810cb714600
      Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8810cb714600
      Feb 19 23:45:12 arkle6 kernel: LustreError: 42356:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff88276c3d5850 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0
      Feb 19 23:45:12 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110
      Feb 19 23:45:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 54e81a18 (type 4), client csum now efde5b36
      Feb 19 23:45:12 john100 kernel: LustreError: 924:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff880e7a9b4e00 x1591386600740944/t197580821951(197580821951) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044319 ref 2 fl Interpret:RM/0/0 rc 0/0
      Feb 19 23:45:13 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum df7fda9d
      Feb 19 23:45:13 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum df7fda9d (type 4), client csum now efde5b36
      Feb 19 23:45:13 john100 kernel: LustreError: 911:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff8807492fe300 x1591386600747696/t197580821955(197580821955) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044320 ref 2 fl Interpret:RM/0/0 rc 0/0
      Feb 19 23:45:15 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 87da008a
      Feb 19 23:45:15 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 87da008a (type 4), client csum now efde5b36
      Feb 19 23:45:15 john100 kernel: LustreError: 910:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff880eb3d51b00 x1591386600751360/t197580821956(197580821956) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044322 ref 2 fl Interpret:RM/0/0 rc 0/0
      Feb 19 23:45:18 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 1cc7a793
      Feb 19 23:45:18 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 1cc7a793 (type 4), client csum now efde5b36
      

      full log attached.
      there were hung and unkillable user processes on the 2 clients afterwards. a reboot of the 2 clients has cleared up the looping messages of the type shown below.

      Feb 20 14:28:15 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef2c00
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88176544be00
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8814072e0200
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88176544be00
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8814072e0200
      Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e25ef3800
      Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef3800
      Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881615f0f000
      Feb 20 14:28:51 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880be3931c00
      Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881615f0f000
      

      cheers,
      robin

          Activity

            [LU-10683] write checksum errors

            Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33728
            Subject: Revert "LU-10683 osd_zfs: set offset in page correctly"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a95e7e398ee411a14c3d67072fbea273832e0957


            John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32899/
            Subject: LU-10683 osd_zfs: set offset in page correctly
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 686a73ea9467c53d261cf12d0802bb1332d50f4a


            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32899
            Subject: LU-10683 osd_zfs: set offset in page correctly
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 1484662e0213007bfa5b7b68f02df77719a1a6d7


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32788/
            Subject: LU-10683 osd_zfs: set offset in page correctly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 83cb17031913ba2f33a5b67219a03c5605f48f27


            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32788
            Subject: LU-10683 osd_zfs: set offset in page correctly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1fb85e7eba6823b5822f7298c7fa648770239635

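            For illustration of why a wrong intra-page offset on the OSS side would produce exactly this signature (stable client checksum, server checksum different on every resend): a misplaced payload means the server checksums, and writes, the wrong bytes within each page, and whatever stale bytes fill the rest of the page could also differ from one retry to the next. The sketch below is only a guess at the mechanism suggested by the patch subject; the struct, field, and function names are invented stand-ins, not the Lustre bulk/niobuf API, and the numbers are arbitrary (note the extents in the logs, e.g. [331258364-335450111], do not start on 4 KiB page boundaries).

            /*
             * Hypothetical sketch only -- none of these identifiers are real
             * Lustre code.  If the receiving side records the wrong offset
             * inside a page for a bulk fragment, it checksums (and writes)
             * different bytes than the sender checksummed, even though the
             * data crossed the network intact.
             */
            #include <stdint.h>
            #include <stdio.h>
            #include <string.h>

            #define PAGE_SZ 4096

            struct bulk_frag {                  /* toy per-page bulk descriptor */
                unsigned char page[PAGE_SZ];
                unsigned int  offset;           /* payload offset inside the page */
                unsigned int  len;              /* payload length */
            };

            /* toy checksum, standing in for the real crc32c/adler over the bulk */
            static uint32_t cksum(const unsigned char *p, unsigned int len)
            {
                uint32_t c = 0;
                while (len--)
                    c = c * 31 + *p++;
                return c;
            }

            int main(void)
            {
                struct bulk_frag cli = { .offset = 2556, .len = 1540 };  /* unaligned extent */
                struct bulk_frag srv = { .offset = cli.offset, .len = cli.len };
                unsigned char wire[PAGE_SZ];

                memset(cli.page, 0xab, PAGE_SZ);        /* the client's data */
                memset(srv.page, 0x00, PAGE_SZ);        /* fresh server-side page */

                /* client: checksum the payload at its true offset, then send it */
                uint32_t client_csum = cksum(cli.page + cli.offset, cli.len);
                memcpy(wire, cli.page + cli.offset, cli.len);

                /* server-side bug: the payload lands at offset 0 in the page,
                 * but the descriptor still claims offset 2556, so the checksum
                 * covers untouched page contents instead of the payload */
                memcpy(srv.page, wire, srv.len);
                uint32_t server_csum = cksum(srv.page + srv.offset, srv.len);

                printf("client csum %08x, server csum %08x\n", client_csum, server_csum);
                return 0;
            }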

            adilger Andreas Dilger added a comment -

            Correct. The client will resend on checksum failures up to 10 times by default (controlled by osc.*.resend_count).

            That the checksum at the client is always the same implies that the data is also the same (e.g. all zero). That it is different on the server each time implies it is being changed after the client has computed the checksum (e.g. in client RAM, over the network, or in OSS RAM). If you are using mmap() on files or O_DIRECT with another thread modifying the pages it is possible to see such corruption, but Lustre can't do anything about it (short of copying the data, which is highly undesirable).
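
            For illustration of the mmap()/O_DIRECT scenario above: if one thread keeps modifying a buffer while another thread has an O_DIRECT write of that buffer in flight, the bytes that leave the node can differ from the bytes present when the bulk checksum was computed, and the filesystem cannot detect it short of copying the data. This is a minimal sketch, assuming nothing about the actual jobs on john100/john101; the path, sizes, and program are entirely hypothetical (build with gcc -pthread).

            #define _GNU_SOURCE                 /* for O_DIRECT */
            #include <fcntl.h>
            #include <pthread.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <string.h>
            #include <unistd.h>

            #define BUF_SIZE (1 << 20)          /* 1 MiB, the bulk size seen in the logs */

            static char *buf;
            static volatile int writing = 1;    /* deliberate race, for illustration */

            static void *scribbler(void *arg)
            {
                size_t i = 0;
                (void)arg;
                while (writing)                 /* keep flipping bytes in the in-flight buffer */
                    buf[i++ % BUF_SIZE] ^= 0xff;
                return NULL;
            }

            int main(void)
            {
                pthread_t t;
                int fd;

                /* O_DIRECT needs an aligned buffer; 4096 covers common block sizes */
                if (posix_memalign((void **)&buf, 4096, BUF_SIZE))
                    return 1;
                memset(buf, 0, BUF_SIZE);

                /* hypothetical path on the affected filesystem */
                fd = open("/dagg/scratch/odirect-race.dat",
                          O_WRONLY | O_CREAT | O_DIRECT, 0644);
                if (fd < 0) {
                    perror("open");
                    return 1;
                }

                pthread_create(&t, NULL, scribbler, NULL);
                /* the buffer may still be changing while the kernel client
                 * checksums and transmits it */
                if (pwrite(fd, buf, BUF_SIZE, 0) != BUF_SIZE)
                    perror("pwrite");
                writing = 0;
                pthread_join(t, NULL);
                close(fd);
                return 0;
            }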

            scadmin SC Admin added a comment -

            from reading the lustre manual it sounds like these checksum events are re-tried after they are detected, so the users might be seeing no effect from them.
            is that correct?

            cheers,
            robin


            hongchao.zhang Hongchao Zhang added a comment -

            As per the logs, the checksum looks a little strange on the client side

            [3111837064-3112361351]: client csum c253e960, server csum 5e6e4da8
            00000020:02020000:15.0:1523986236.833762:0:10678:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3186813384-3187337671]: client csum c253e960, server csum a42a5590
            00000020:02020000:1.0:1523986236.845204:0:285172:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3069890888-3070415175]: client csum c253e960, server csum fc43f2ee
            00000020:02020000:17.0:1523986236.845219:0:298862:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3128614280-3129138567]: client csum c253e960, server csum 976d257c
            00000020:02020000:5.0:1523986236.845231:0:287464:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3095056712-3095580999]: client csum c253e960, server csum 1747b303
            00000020:02020000:7.0:1523986236.845411:0:287463:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3061502280-3062026567]: client csum c253e960, server csum 1c6d4fd9
            00000020:02020000:9.0:1523986236.845578:0:134229:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3036333320-3036857607]: client csum c253e960, server csum 4b320be5
            00000020:02020000:15.0:1523986236.845877:0:287086:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3027420424-3027944711]: client csum c253e960, server csum d537dae0
            00000020:02020000:1.0:1523986236.846009:0:34879:0:(tgt_handler.c:2112:tgt_warn_on_cksum()) 168-f: dagg-OST0004: BAD WRITE CHECKSUM: from 12345-192.168.44.170@o2ib44 inode [0x280022737:0x5b:0x0] object 0x540000400:8286655 extent [3102921032-3103445319]: client csum c253e960, server csum 50418907
            

            the checksums of the different extents of the same file (0x280022737:0x5b:0x0) are all the same on the client side ("c253e960"),
            but they differ on the server side.

            I'll look into it more deeply to find out what causes it.

            scadmin SC Admin added a comment -

            so far it looks like this group was NOT over quota, but was again using parallel hdf5 writes.

            the 4 files in question apparently look ok to them (i.e. not obviously corrupted).

            cheers,
            robin

            scadmin SC Admin added a comment -

            the script triggered on 3 sets of checksum errors last night.
            dk's attached in 2018-04-18-03.LU10683.neterror.tgz
            the names of the files tell you when they were captured.
            also messages.checksum-i gives you an overall picture of when and where the dk's ran too.
            arkles are servers, johns are clients.

            the 4 fids fingered are

            [root@john5 ~]# lfs fid2path /dagg 0x20001251f:0x144:0x0
            /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0500/meraxes_grids_74.hdf5
            [root@john5 ~]# lfs fid2path /dagg 0x280022737:0x5b:0x0
            /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0010/meraxes_grids_72.hdf5
            [root@john5 ~]# lfs fid2path /dagg 0x680024e04:0x46:0x0
            /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_67.hdf5
            [root@john5 ~]# lfs fid2path /dagg 0x680024e04:0x47:0x0
            /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_68.hdf5
            [root@john5 ~]# ls -l /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0500/meraxes_grids_74.hdf5 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0010/meraxes_grids_72.hdf5 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_67.hdf5 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_68.hdf5
            -rw-rw-r-- 1 yqin oz025 4832124728 Apr 18 03:31 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0010/meraxes_grids_72.hdf5
            -rw-rw-r-- 1 yqin oz025 4832124728 Apr 18 03:22 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_67.hdf5
            -rw-rw-r-- 1 yqin oz025 4832124728 Apr 18 03:24 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0050/meraxes_grids_68.hdf5
            -rw-rw-r-- 1 yqin oz025 4832124728 Apr 18 03:35 /dagg/projects/oz025/$user/dragons/results/Tiamat/popiii_v2_tocf/PopIIIEfficiency_0pt0500/meraxes_grids_74.hdf5
            

            and that group is currently well under quota

            [root@john5 ~]# lfs quota -h -g oz025 /dagg/
            Disk quotas for grp oz025 (gid 10227):
                 Filesystem    used   quota   limit   grace   files   quota   limit   grace
                     /dagg/  6.342T      0k     10T       -  188885       0 1000000       -
            

            I'll check with them about the state of jobs that ran last night, and also the state of those files, and quota.

            cheers,
            robin

            scadmin SC Admin added a comment -

            Hi Amir,

            I set up a script to tail syslog and run dk on anything that hits a CHECKSUM error.
            however it doesn't appear safe to turn on +net.
            I've seen 132 stack traces like these across all the servers and clients since I turned on +net this afternoon, so I've now turned it off.

            Apr 12 18:57:25 arkle5 kernel: ------------[ cut here ]------------
            Apr 12 18:57:25 arkle5 kernel: WARNING: CPU: 1 PID: 127223 at kernel/softirq.c:151 __local_bh_enable_ip+0x82/0xb0
            Apr 12 18:57:25 arkle5 kernel: Modules linked in: sctp_diag sctp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag ip6table_filter ip6_tables iptable_filter osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) mptctl mptbase 8021q garp mrp stp llc hfi1 sunrpc xfs dm_round_robin dcdbas intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr iTCO_wdt iTCO_vendor_support zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) mgag200 ttm dm_multipath drm_kms_helper ses syscopyarea enclosure dm_mod sysfillrect sysimgblt fb_sys_fops drm mei_me lpc_ich shpchp i2c_i801 sg mei
            Apr 12 18:57:25 arkle5 kernel: ipmi_si ipmi_devintf nfit ipmi_msghandler libnvdimm acpi_power_meter tpm_crb rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm iw_cm binfmt_misc ip_tables ib_ipoib ib_cm sr_mod cdrom sd_mod crc_t10dif crct10dif_generic bonding bnx2x rdmavt ahci i2c_algo_bit libahci crct10dif_pclmul mpt3sas mdio crct10dif_common i2c_core crc32c_intel ptp raid_class ib_core libata megaraid_sas scsi_transport_sas pps_core libcrc32c [last unloaded: hfi1]
            Apr 12 18:57:25 arkle5 kernel: CPU: 1 PID: 127223 Comm: hfi1_cq0 Tainted: P           OE  ------------   3.10.0-693.17.1.el7.x86_64 #1
            Apr 12 18:57:25 arkle5 kernel: Hardware name: Dell Inc. PowerEdge R740/06G98X, BIOS 1.3.7 02/08/2018
            Apr 12 18:57:25 arkle5 kernel: Call Trace:
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff816a6071>] dump_stack+0x19/0x1b
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810895e8>] __warn+0xd8/0x100
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff8108972d>] warn_slowpath_null+0x1d/0x20
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff81091be2>] __local_bh_enable_ip+0x82/0xb0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff816ade8e>] _raw_spin_unlock_bh+0x1e/0x20
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc06183b5>] cfs_trace_unlock_tcd+0x55/0x90 [libcfs]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc0623708>] libcfs_debug_vmsg2+0x6d8/0xb40 [libcfs]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810cfb6c>] ? dequeue_entity+0x11c/0x5d0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810c95d5>] ? sched_clock_cpu+0x85/0xc0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc0623bc7>] libcfs_debug_msg+0x57/0x80 [libcfs]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc069682a>] kiblnd_cq_completion+0x11a/0x160 [ko2iblnd]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc03ab4a2>] send_complete+0x32/0x50 [rdmavt]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2ac0>] kthread_worker_fn+0x80/0x180
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2a40>] ? kthread_stop+0xe0/0xe0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b270f>] kthread+0xcf/0xe0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
            Apr 12 18:57:25 arkle5 kernel: ---[ end trace aaf779f5b67c32db ]---
            Apr 12 18:57:25 arkle7 kernel: ------------[ cut here ]------------
            

            I also have one client that looks permanently upset now (john50 is a client, arkle3 is an OSS)

            Apr 12 21:28:05 john50 kernel: LNetError: 909:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1594448751341360 length 1048576 too big: 1045288 left, 1045288 allowed
            Apr 12 21:28:05 arkle3 kernel: LustreError: 296380:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880c21c90a00
            Apr 12 21:28:05 arkle3 kernel: LustreError: 296380:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880c21c90a00
            Apr 12 21:28:05 arkle3 kernel: LustreError: 233673:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8817643ce450 x1594448751341360/t0(0) o4->98af46e3-9fa3-6f5b-2dcd-f89325115978@192.168.44.150@o2ib44:626/0 lens 608/448 e 0 to 0 dl 1523532491 ref 1 fl Interpret:/0/0 rc 0/0
            Apr 12 21:28:05 arkle3 kernel: Lustre: dagg-OST0004: Bulk IO write error with 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44), client will retry: rc = -110
            Apr 12 21:28:12 john50 kernel: Lustre: 933:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523532485/real 1523532485]  req@ffff8817b5bd6900 x1594448751341360/t0(0) o4->dagg-OST0004-osc-ffff88189aa1f000@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1523532492 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            Apr 12 21:28:12 john50 kernel: Lustre: 933:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
            Apr 12 21:28:12 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection to dagg-OST0004 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
            Apr 12 21:28:12 john50 kernel: Lustre: Skipped 3 previous similar messages
            Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Client 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44) reconnecting
            Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Connection restored to 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44)
            Apr 12 21:28:12 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44)
            Apr 12 21:28:12 john50 kernel: Lustre: Skipped 2 previous similar messages
            Apr 12 21:28:12 john50 kernel: LNetError: 910:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1594448751343216 length 1048576 too big: 1045288 left, 1045288 allowed
            Apr 12 21:28:12 arkle3 kernel: LustreError: 296379:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88169ed25a00
            Apr 12 21:28:12 arkle3 kernel: LustreError: 296379:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88169ed25a00
            Apr 12 21:28:12 arkle3 kernel: LustreError: 50811:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8816dd9f2c50 x1594448751341360/t0(0) o4->98af46e3-9fa3-6f5b-2dcd-f89325115978@192.168.44.150@o2ib44:633/0 lens 608/448 e 0 to 0 dl 1523532498 ref 1 fl Interpret:/2/0 rc 0/0
            Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Bulk IO write error with 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44), client will retry: rc = -110
            Apr 12 21:28:19 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection to dagg-OST0004 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
            

            I'll reboot the client as (IIRC) this has cleared up this kind of problem in the past

            Apr 13 01:22:00 arkle3 kernel: LustreError: 296378:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880f6ea28800
            Apr 13 01:22:07 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88150b4d4800
            Apr 13 01:22:07 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88150b4d4800
            Apr 13 01:22:15 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88150b4d2800
            Apr 13 01:22:15 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88150b4d2800
            

            debug +neterror (a default) is still enabled and the dk will still catch that. hopefully that will be enough for you.

            cheers,
            robin


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: scadmin SC Admin
              Votes: 0
              Watchers: 14
