Details


    Description

      Hiya,

      these appeared last night.
      john100,101 are clients. arkle3,6 are OSSs. transom1 is running the fabric manager.

      in light of the similar looking LU-9305 I thought I would create this ticket.
      we run default (4M) rpcs on clients and servers. our OSTs are each 4 raidz3's in 1 zpool, and have recordsize=2M. 2 OSTs per OSS. 16 OSTs total.

      I suppose it could be an OPA network glitch, but it affected 2 clients and 2 servers, so that seems unlikely.

      we have just moved from zfs 0.7.5 to zfs 0.7.6. we ran ior and mdtest after this change and they were fine. these errors occurred a couple of days after that.
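
      For cross-checking, the sizes above can be verified directly. This is a minimal sketch (the zpool/zfs queries are guarded since they need ZFS installed, and pool names will vary):

```shell
#!/bin/sh
# Sketch: cross-check the I/O geometry described above.
# With 4M RPCs and recordsize=2M, each RPC spans two ZFS records:
echo $(( 4194304 / 2097152 ))    # 2

# Live queries (need ZFS installed, hence the guards):
if command -v zpool >/dev/null 2>&1; then
    zpool status                          # expect 4 raidz3 vdevs per OST pool
fi
if command -v zfs >/dev/null 2>&1; then
    zfs get -H -o name,value recordsize   # expect 2M on the OST datasets
fi
[ -r /sys/module/zfs/parameters/zfs_max_recordsize ] && \
    cat /sys/module/zfs/parameters/zfs_max_recordsize
true
```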

      Feb 19 23:45:12 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 54e81a18
      Feb 19 23:45:12 john100 kernel: LNetError: 899:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600741136 length 1048576 too big: 1048176 left, 1048176 allowed
      Feb 19 23:45:12 arkle3 kernel: LNet: Using FMR for registration
      Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881764648600
      Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881764648600
      Feb 19 23:45:12 arkle3 kernel: LustreError: 337237:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff881687561450 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0
      Feb 19 23:45:12 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110
      Feb 19 23:45:12 arkle3 kernel: LNet: Skipped 1 previous similar message
      Feb 19 23:45:12 john101 kernel: LNetError: 904:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282655776 length 1048576 too big: 1048176 left, 1048176 allowed
      Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8810cb714600
      Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8810cb714600
      Feb 19 23:45:12 arkle6 kernel: LustreError: 42356:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff88276c3d5850 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0
      Feb 19 23:45:12 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110
      Feb 19 23:45:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 54e81a18 (type 4), client csum now efde5b36
      Feb 19 23:45:12 john100 kernel: LustreError: 924:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff880e7a9b4e00 x1591386600740944/t197580821951(197580821951) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044319 ref 2 fl Interpret:RM/0/0 rc 0/0
      Feb 19 23:45:13 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum df7fda9d
      Feb 19 23:45:13 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum df7fda9d (type 4), client csum now efde5b36
      Feb 19 23:45:13 john100 kernel: LustreError: 911:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff8807492fe300 x1591386600747696/t197580821955(197580821955) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044320 ref 2 fl Interpret:RM/0/0 rc 0/0
      Feb 19 23:45:15 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 87da008a
      Feb 19 23:45:15 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 87da008a (type 4), client csum now efde5b36
      Feb 19 23:45:15 john100 kernel: LustreError: 910:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff880eb3d51b00 x1591386600751360/t197580821956(197580821956) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044322 ref 2 fl Interpret:RM/0/0 rc 0/0
      Feb 19 23:45:18 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 1cc7a793
      Feb 19 23:45:18 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 1cc7a793 (type 4), client csum now efde5b36
      

      full log attached.
      there were hung and unkillable user processes on the 2 clients afterwards. a reboot of the 2 clients has cleared up the looping messages of the type shown below.

      Feb 20 14:28:15 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef2c00
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88176544be00
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8814072e0200
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88176544be00
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8814072e0200
      Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e25ef3800
      Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef3800
      Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881615f0f000
      Feb 20 14:28:51 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880be3931c00
      Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881615f0f000
      

      cheers,
      robin

      Attachments

      Issue Links

      Activity

            [LU-10683] write checksum errors
            scadmin SC Admin added a comment -

            Hi Amir,

            I set up a script to tail syslog and run dk on anything that hits a CHECKSUM error.
            however it doesn't appear safe to turn on +net.
            I've seen 132 stack traces like these across all the servers and clients since I turned on +net this afternoon, so I've now turned it off.

            Apr 12 18:57:25 arkle5 kernel: ------------[ cut here ]------------
            Apr 12 18:57:25 arkle5 kernel: WARNING: CPU: 1 PID: 127223 at kernel/softirq.c:151 __local_bh_enable_ip+0x82/0xb0
            Apr 12 18:57:25 arkle5 kernel: Modules linked in: sctp_diag sctp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag ip6table_filter ip6_tables iptable_filter osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) mptctl mptbase 8021q garp mrp stp llc hfi1 sunrpc xfs dm_round_robin dcdbas intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr iTCO_wdt iTCO_vendor_support zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) mgag200 ttm dm_multipath drm_kms_helper ses syscopyarea enclosure dm_mod sysfillrect sysimgblt fb_sys_fops drm mei_me lpc_ich shpchp i2c_i801 sg mei
            Apr 12 18:57:25 arkle5 kernel: ipmi_si ipmi_devintf nfit ipmi_msghandler libnvdimm acpi_power_meter tpm_crb rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm iw_cm binfmt_misc ip_tables ib_ipoib ib_cm sr_mod cdrom sd_mod crc_t10dif crct10dif_generic bonding bnx2x rdmavt ahci i2c_algo_bit libahci crct10dif_pclmul mpt3sas mdio crct10dif_common i2c_core crc32c_intel ptp raid_class ib_core libata megaraid_sas scsi_transport_sas pps_core libcrc32c [last unloaded: hfi1]
            Apr 12 18:57:25 arkle5 kernel: CPU: 1 PID: 127223 Comm: hfi1_cq0 Tainted: P           OE  ------------   3.10.0-693.17.1.el7.x86_64 #1
            Apr 12 18:57:25 arkle5 kernel: Hardware name: Dell Inc. PowerEdge R740/06G98X, BIOS 1.3.7 02/08/2018
            Apr 12 18:57:25 arkle5 kernel: Call Trace:
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff816a6071>] dump_stack+0x19/0x1b
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810895e8>] __warn+0xd8/0x100
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff8108972d>] warn_slowpath_null+0x1d/0x20
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff81091be2>] __local_bh_enable_ip+0x82/0xb0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff816ade8e>] _raw_spin_unlock_bh+0x1e/0x20
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc06183b5>] cfs_trace_unlock_tcd+0x55/0x90 [libcfs]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc0623708>] libcfs_debug_vmsg2+0x6d8/0xb40 [libcfs]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810cfb6c>] ? dequeue_entity+0x11c/0x5d0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810c95d5>] ? sched_clock_cpu+0x85/0xc0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc0623bc7>] libcfs_debug_msg+0x57/0x80 [libcfs]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc069682a>] kiblnd_cq_completion+0x11a/0x160 [ko2iblnd]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc03ab4a2>] send_complete+0x32/0x50 [rdmavt]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2ac0>] kthread_worker_fn+0x80/0x180
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2a40>] ? kthread_stop+0xe0/0xe0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b270f>] kthread+0xcf/0xe0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
            Apr 12 18:57:25 arkle5 kernel: ---[ end trace aaf779f5b67c32db ]---
            Apr 12 18:57:25 arkle7 kernel: ------------[ cut here ]------------
            

            I also have one client that looks permanently upset now (john50 is a client, arkle3 is an OSS)

            Apr 12 21:28:05 john50 kernel: LNetError: 909:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1594448751341360 length 1048576 too big: 1045288 left, 1045288 allowed
            Apr 12 21:28:05 arkle3 kernel: LustreError: 296380:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880c21c90a00
            Apr 12 21:28:05 arkle3 kernel: LustreError: 296380:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880c21c90a00
            Apr 12 21:28:05 arkle3 kernel: LustreError: 233673:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8817643ce450 x1594448751341360/t0(0) o4->98af46e3-9fa3-6f5b-2dcd-f89325115978@192.168.44.150@o2ib44:626/0 lens 608/448 e 0 to 0 dl 1523532491 ref 1 fl Interpret:/0/0 rc 0/0
            Apr 12 21:28:05 arkle3 kernel: Lustre: dagg-OST0004: Bulk IO write error with 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44), client will retry: rc = -110
            Apr 12 21:28:12 john50 kernel: Lustre: 933:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523532485/real 1523532485]  req@ffff8817b5bd6900 x1594448751341360/t0(0) o4->dagg-OST0004-osc-ffff88189aa1f000@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1523532492 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            Apr 12 21:28:12 john50 kernel: Lustre: 933:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
            Apr 12 21:28:12 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection to dagg-OST0004 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
            Apr 12 21:28:12 john50 kernel: Lustre: Skipped 3 previous similar messages
            Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Client 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44) reconnecting
            Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Connection restored to 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44)
            Apr 12 21:28:12 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44)
            Apr 12 21:28:12 john50 kernel: Lustre: Skipped 2 previous similar messages
            Apr 12 21:28:12 john50 kernel: LNetError: 910:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1594448751343216 length 1048576 too big: 1045288 left, 1045288 allowed
            Apr 12 21:28:12 arkle3 kernel: LustreError: 296379:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88169ed25a00
            Apr 12 21:28:12 arkle3 kernel: LustreError: 296379:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88169ed25a00
            Apr 12 21:28:12 arkle3 kernel: LustreError: 50811:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8816dd9f2c50 x1594448751341360/t0(0) o4->98af46e3-9fa3-6f5b-2dcd-f89325115978@192.168.44.150@o2ib44:633/0 lens 608/448 e 0 to 0 dl 1523532498 ref 1 fl Interpret:/2/0 rc 0/0
            Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Bulk IO write error with 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44), client will retry: rc = -110
            Apr 12 21:28:19 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection to dagg-OST0004 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
            

            I'll reboot the client as (IIRC) this has cleared up this kind of problem in the past

            Apr 13 01:22:00 arkle3 kernel: LustreError: 296378:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880f6ea28800
            Apr 13 01:22:07 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88150b4d4800
            Apr 13 01:22:07 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88150b4d4800
            Apr 13 01:22:15 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88150b4d2800
            Apr 13 01:22:15 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88150b4d2800
            

            debug +neterror (a default) is still enabled and the dk will still catch that. hopefully that will be enough for you.
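
            The tail-syslog-and-dk watcher described at the top of this comment might look roughly like this (a sketch: the log path, output directory, and match strings are assumptions, and the lctl call needs a live Lustre node):

```shell
#!/bin/sh
# Sketch of a syslog watcher that dumps the Lustre debug log whenever a
# checksum error appears. LOG/OUT and the match strings are assumptions.
LOG=/var/log/messages
OUT=/var/tmp/dk-dumps

matches_checksum() {
    # true for lines like "LustreError: 168-f: ...: BAD WRITE CHECKSUM: ..."
    case "$1" in
        *"BAD WRITE CHECKSUM"*|*"BAD READ CHECKSUM"*) return 0 ;;
        *) return 1 ;;
    esac
}

# The loop blocks forever, so only start it when invoked with --run.
if [ "${1-}" = "--run" ]; then
    mkdir -p "$OUT"
    tail -Fn0 "$LOG" | while IFS= read -r line; do
        matches_checksum "$line" && lctl dk > "$OUT/dk.$(date +%s)"
    done
fi
```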

            cheers,
            robin


            Hi Robin,

            From both client and server would be great, separated into two files respectively.

            thanks
            amir

            ashehata Amir Shehata (Inactive) added a comment -
            scadmin SC Admin added a comment -

            Hi Amir,

            the incidents are rare. the last one was 3 days ago. probably over-quota related. we've tried to reproduce them but haven't managed to with simple dd's etc.

            I guess I could turn on 'net neterror' debug and write a script to tail syslog and automatically run dk when it sees the next burst of CHECKSUM errors.
            does that sound ok?
            do you want dk just from servers or clients too?

            cheers,
            robin


            Is it possible to turn on net logging and capture logs for a short period of time when this happens:

            lctl set_param debug=+"net neterror"
            lctl dk > log

             

            ashehata Amir Shehata (Inactive) added a comment -
            rjh Robin Humble (Inactive) added a comment - - edited

            Hi Andreas,

            on our big /dagg filesystem I left the zfs recordsize at 2MB. it's a significant performance loss to set it to 1M - it halves the size of all the I/Os to the disks.

            however these events also happen to the /home lustre filesystem where the zfs recordsize is 1M and always has been. those are in the logs above.

            oh, damn, looks like /sys/module/zfs/parameters/zfs_max_recordsize=2M for these /home filesystems though, even though nothing uses that. that must be a leftover from testing.

            I haven't tried to change the RPC size for anything, sorry. all the lustre filesystems in this cluster seem to be using max_pages_per_rpc=1024. is that the right number to look at? we haven't tweaked anything around that. I just assumed 4M was the default these days.
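
            For reference, max_pages_per_rpc does appear to be the right knob: it counts 4 KiB pages, so 1024 pages is the 4M default mentioned above. A sketch of the arithmetic (the lctl lines need a live client and are left as comments):

```shell
#!/bin/sh
# max_pages_per_rpc is counted in 4 KiB pages (on x86_64).
pages_to_bytes() { echo $(( $1 * 4096 )); }

pages_to_bytes 1024    # 4194304 = 4 MiB, the "4M rpcs" above
pages_to_bytes 256     # 1048576 = 1 MiB

# On a live client (needs Lustre, hence left as comments):
#   lctl get_param osc.*.max_pages_per_rpc
#   lctl set_param osc.*.max_pages_per_rpc=256    # drop to 1 MiB RPCs
```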

            we aren't using any cap_mask= options for hfi1. on clients it is

            options hfi1 sge_copy_mode=2 krcvqs=4 piothreshold=0 wss_threshold=70 max_mtu=10240 eager_buffer_size=4194304
            

            and on lustre servers it's

            options hfi1 sge_copy_mode=2 krcvqs=8 piothreshold=0 wss_threshold=70
            

            with max_mtu=10240 being a driver default, and the default eager_buffer_size is 2097152. but AFAIK lustre's verbs doesn't use that eager_buffer stuff - it's for PSM2 comms I think.

            I actually don't know what most of these options do. it's just what various Intel people and docs told us was good.
            only krcvqs seems to impact performance (<=half a socket seems best), so we've tuned that a bit.

            Oleg - yes, it's an hdf5 library. we're looking into the hdf5 code and will try to communicate with the user and see what options they used when calling it in parallel. I suspect it uses whatever MPIIO uses, but haven't looked at either of those in years.

            thanks for looking into this.

            cheers,
            robin

            green Oleg Drokin added a comment -

            Which hdf5 library (I assume from the H5 prefix) do you use for this io (if any), and does it use direct IO internally by any chance?


            Hi Robin, did you reduce the recordsize to 1MB on the filesystem? We haven't done any testing ourselves with larger recordsize. The clients would need to be remounted to also get a smaller RPC size (they default to RPC size == max blocksize for ZFS at mount).

            Also, are you using TID RDMA (cap_mask=....) for your OPA connection? We've seen problems with that under load, and if yes it should be disabled.
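
            Sketched out, the recordsize change described above would look something like the following (dataset, mount, and MGS names are illustrative, and the live steps are left as comments since they need ZFS/Lustre):

```shell
#!/bin/sh
# Sketch of reducing recordsize and remounting clients to renegotiate the
# RPC size. Dataset/mount/MGS names below are illustrative only.
echo $(( 1024 * 1024 ))    # 1048576: recordsize=1M in bytes

# On each OSS (only newly written blocks pick up the smaller recordsize):
#   zfs set recordsize=1M oss-pool/ost0
# On each client, remount so RPC size == max blocksize is re-evaluated:
#   umount /mnt/dagg
#   mount -t lustre mgs-nid@o2ib44:/dagg /mnt/dagg
```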

            adilger Andreas Dilger added a comment -
            pjones Peter Jones added a comment -

            Hongchao

            Can you please advise with this one?

            Thanks

            Peter


            a bunch more CHECKSUM and LNet errors today. this lot were again definitely associated with being over quota. I don't know if all the incidents are though...

            I guess I'd be very worried if these weren't purely over-quota events, which is why the read checksum messages were very alarming.

            any thoughts on whether these are just from quota events or not? is there an easy way we can tell?

            I'll attach messages for today's errors to this ticket in a minute.

            the user has many (~1100 so far) of the below in their job output. it looks like they have ~127 write processes across their 28 nodes and 896 cores. the code is looping on the nodes, trying to complete the writes.

            HDF5-DIAG: Error detected in HDF5 (1.10.1) MPI-process 168:
              #000: H5Dio.c line 269 in H5Dwrite(): can't prepare for writing data
                major: Dataset
                minor: Write failed
              #001: H5Dio.c line 345 in H5D__pre_write(): can't write data
                major: Dataset
                minor: Write failed
              #002: H5Dio.c line 791 in H5D__write(): can't write data
                major: Dataset
                minor: Write failed
              #003: H5Dcontig.c line 642 in H5D__contig_write(): contiguous write failed
                major: Dataset
                minor: Write failed
              #004: H5Dselect.c line 309 in H5D__select_write(): write error
                major: Dataspace
                minor: Write failed
              #005: H5Dselect.c line 220 in H5D__select_io(): write error
                major: Dataspace
                minor: Write failed
              #006: H5Dcontig.c line 1267 in H5D__contig_writevv(): can't perform vectorized sieve buffer write
                major: Dataset
                minor: Can't operate on object
              #007: H5VM.c line 1500 in H5VM_opvv(): can't perform operation
                major: Internal error (too specific to document in detail)
                minor: Can't operate on object
              #008: H5Dcontig.c line 1014 in H5D__contig_writevv_sieve_cb(): block write failed
                major: Dataset
                minor: Write failed
              #009: H5Fio.c line 195 in H5F_block_write(): write through page buffer failed
                major: Low-level I/O
                minor: Write failed
              #010: H5PB.c line 1041 in H5PB_write(): write through metadata accumulator failed
                major: Page Buffering
                minor: Write failed
              #011: H5Faccum.c line 834 in H5F__accum_write(): file write failed
                major: Low-level I/O
                minor: Write failed
              #012: H5FDint.c line 308 in H5FD_write(): driver write request failed
                major: Virtual File Layer
                minor: Write failed
              #013: H5FDsec2.c line 810 in H5FD_sec2_write(): file write failed: time = Fri Mar 16 03:17:20 2018
            , filename = './L35_N2650/snapshot_050.24.hdf5', file descriptor = 24, errno = 122, error message = 'Disk quota exceeded', buf = 0x2b07c54a3010, total write size = 31457280, bytes this sub-write = 31457280, bytes actually written = 184467440737052
                major: Low-level I/O
                minor: Write failed
            

            cheers,
            robin

            rjh Robin Humble (Inactive) added a comment -
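            As a side note on the trace above: errno 122 in the final H5FDsec2 line is EDQUOT on Linux, i.e. the kernel itself reported "Disk quota exceeded" for that write, which fits the over-quota theory. A quick check (Python used purely for illustration):

            ```python
            import errno
            import os

            # errno 122 on Linux is EDQUOT, matching the 'Disk quota exceeded'
            # text in the HDF5 error stack.
            print(errno.errorcode[122])   # EDQUOT
            print(os.strerror(122))       # Disk quota exceeded
            ```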

            Hi Nathaniel,

            all messages for the lustre servers and from the 7 clients affected so far for all of 2018 have been uploaded.
            grep for CHECKSUM

            if you'd like console logs from these machines too then please let us know.

            cheers,
            robin

            rjh Robin Humble (Inactive) added a comment -

            Below are instructions for uploading logs to our write-only FTP site:

            Sometimes the diagnostic data collected as part of Lustre troubleshooting is too large to attach to a JIRA ticket. For these cases, HPDD provides an anonymous write-only FTP upload service. To use this service, you'll need an FTP client (e.g. ncftp, ftp, etc.) and a JIRA issue. Use the 'uploads' directory and create a new subdirectory named after your JIRA issue.

            In the following example, there are three debug logs in a single directory and the JIRA issue LU-4242 has been created. After completing the upload, please update the relevant issue with a note mentioning the upload, so that our engineers know where to find your logs.

            $ ls -lh
            total 333M
            -rw-r--r-- 1 mjmac mjmac  98M Feb 23 17:36 mds-debug
            -rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-00-debug
            -rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-01-debug
            $ ncftp ftp.hpdd.intel.com
            NcFTP 3.2.2 (Sep 04, 2008) by Mike Gleason (http://www.NcFTP.com/contact/).
            Connecting to 99.96.190.235...
            (vsFTPd 2.2.2)
            Logging in...
            Login successful.
            Logged in to ftp.hpdd.intel.com.
            ncftp / > cd uploads
            Directory successfully changed.
            ncftp /uploads > mkdir LU-4242
            ncftp /uploads > cd LU-4242
            Directory successfully changed.
            ncftp /uploads/LU-4242 > put *
            mds-debug:                                              97.66 MB   11.22 MB/s
            oss-00-debug:                                          117.19 MB   11.16 MB/s
            oss-01-debug:                                          117.48 MB   11.18 MB/s
            ncftp /uploads/LU-4242 >
            

            Please note that this is a WRITE-ONLY FTP service, so you will not be able to see (with ls) the files or directories you've created, nor will you (or anyone other than HPDD staff) be able to see or read them.

            utopiabound Nathaniel Clark added a comment -

            People

              hongchao.zhang Hongchao Zhang
              scadmin SC Admin