
BUG: Bad page state in process ll_ost_io01_013 pfn:1a01bcd kernel BUG at include/linux/scatterlist.h:65!

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • Lustre 2.10.0
    • None
    • 3
    • 9223372036854775807

    Description

      The setup is 4 Lustre clients, 2 OSS nodes with 1 zpool each, and 1 MDS.
      zpool status on the affected OSS node:

      # zpool status -v
        pool: ost0
       state: ONLINE
        scan: none requested
      config:

          NAME              STATE   READ WRITE CKSUM
          ost0              ONLINE     0     0     0
            draid1-0 {any}  ONLINE     0     0     0
              mpathaj       ONLINE     0     0     0
              mpathai       ONLINE     0     0     0
              mpathah       ONLINE     0     0     0
              mpathag       ONLINE     0     0     0
              mpathaq       ONLINE     0     0     0
              mpathap       ONLINE     0     0     0
              mpathak       ONLINE     0     0     0
              mpathz        ONLINE     0     0     0
              mpatham       ONLINE     0     0     0
              mpathal       ONLINE     0     0     0
              mpathao       ONLINE     0     0     0
          spares
            $draid1-0-s0    AVAIL

      errors: No known data errors

      ZFS was built from the coral-prototype branch, and Lustre was built from master as of Dec 1st.

      We were running our file system aging utility, FileAger.py (1-2 copies on each of the 4 client nodes), alongside an IOR run:

      mpirun -wdir /mnt/lustre/ -np 4 -rr -machinefile hosts -env I_MPI_EXTRA_FILESYSTEM=on -env I_MPI_EXTRA_FILESYSTEM_LIST=lustre /home/johnsali/wolf-3/ior/src/ior -a POSIX -F -N 4 -d 2 -i 1 -s 20000 -b 16MB -t 16MB -k -w -r
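      For a sense of the load, the IOR parameters imply the data volume sketched below. This is only a rough estimate. It assumes IOR reads `16MB` as 16 MiB, that `-s` is the number of segments per task, and that all 4 tasks write their own file (`-F`). The helper name `ior_bytes` is made up for illustration.

```python
MIB = 1024 * 1024

def ior_bytes(segments, block_size, tasks):
    """Bytes written per task and in aggregate for one IOR write pass.

    Assumes each task writes `segments` blocks of `block_size` bytes
    (one block per segment), which is how -s and -b are read here.
    """
    per_task = segments * block_size
    return per_task, per_task * tasks

# -s 20000 -b 16MB with 4 tasks (values taken from the command line)
per_task, total = ior_bytes(segments=20000, block_size=16 * MIB, tasks=4)
print(per_task / 2**30, "GiB per task")
print(total / 2**40, "TiB aggregate")
```

      Under these assumptions each task writes 312.5 GiB, about 1.22 TiB in total, and because `-t` equals `-b` every block goes out as a single 16 MiB transfer.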

      While this was running, we hit the following failure:

      [159898.950714] BUG: Bad page state in process ll_ost_io01_013 pfn:1a01bcd
      [159898.960045] page:ffffea006806f340 count:-1 mapcount:0 mapping: (null) index:0x0
      [159898.970667] page flags: 0x6fffff00000000()
      [159898.976808] page dumped because: nonzero _count
      [159898.983412] Modules linked in: nfsv3 nfs_acl raid10 osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mpt3sas ipmi_devintf ipmi_ssif ipmi_si
      [159899.072452] raid_class sb_edac iTCO_wdt iTCO_vendor_support scsi_transport_sas sg edac_core pcspkr ipmi_msghandler wmi ioatdma mei_me mei lpc_ich shpchp i2c_i801 mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_ib mlx4_en ib_sa vxlan ib_mad ip6_udp_tunnel udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul igb crct10dif_common ttm ptp crc32c_intel ahci pps_core drm mlx4_core libahci dca i2c_algo_bit libata i2c_core [last unloaded: zunicode]
      [159899.135473] CPU: 57 PID: 98747 Comm: ll_ost_io01_013 Tainted: G IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
      [159899.149461] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
      [159899.162801] ffffea006806f340 00000000424e76b3 ffff880f9e233908 ffffffff81636431
      [159899.172821] ffff880f9e233930 ffffffff81631645 ffffea006806f340 0000000000000000
      [159899.182870] 000fffff00000000 ffff880f9e233978 ffffffff811714dd fff00000fe000000
      [159899.192895] Call Trace:
      [159899.197269] [<ffffffff81636431>] dump_stack+0x19/0x1b
      [159899.204667] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
      [159899.212639] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
      [159899.220965] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
      [159899.229171] [<ffffffff8117200f>] __free_pages+0x3f/0x60
      [159899.236690] [<ffffffffa100bad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
      [159899.245372] [<ffffffffa118284a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
      [159899.254234] [<ffffffffa1186f2d>] ofd_commitrw+0x51d/0xa40 [ofd]
      [159899.262551] [<ffffffffa0d538d5>] obd_commitrw+0x2ec/0x32f [ptlrpc]
      [159899.271488] [<ffffffffa0d2bf71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
      [159899.280509] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
      [159899.288372] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
      [159899.297010] [<ffffffffa0c82560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
      [159899.306746] [<ffffffffa0d28225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      [159899.316058] [<ffffffffa0cd41ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [159899.326348] [<ffffffffa0967128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [159899.335679] [<ffffffffa0cd1d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [159899.345029] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
      [159899.353394] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
      [159899.361264] [<ffffffffa0cd8260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
      [159899.369596] [<ffffffffa0cd77c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
      [159899.379160] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [159899.385881] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [159899.394413] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [159899.401653] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [159899.410157] Disabling lock debugging due to kernel taint
      [163012.964891] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.8@o2ib inode [0x200000406:0x3c5:0x0] object 0x0:44785 extent [67108864-80752639]: client csum 7f08fe36, server csum f8fbfe4c
      [163012.990138] LustreError: Skipped 2 previous similar messages
      [163020.008131] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.8@o2ib inode [0x200000406:0x3d6:0x0] object 0x0:44794 extent [83886080-100270079]: client csum 886feb33, server csum ccc0eb4a
      [163042.829796] ------------[ cut here ]------------
      [163042.837389] kernel BUG at include/linux/scatterlist.h:65!
      [163042.845758] invalid opcode: 0000 [#1] SMP
      [163042.852645] Modules linked in: nfsv3 nfs_acl raid10 osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mpt3sas ipmi_devintf ipmi_ssif ipmi_si
      [163042.944819] raid_class sb_edac iTCO_wdt iTCO_vendor_support scsi_transport_sas sg edac_core pcspkr ipmi_msghandler wmi ioatdma mei_me mei lpc_ich shpchp i2c_i801 mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_ib mlx4_en ib_sa vxlan ib_mad ip6_udp_tunnel udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul igb crct10dif_common ttm ptp crc32c_intel ahci pps_core drm mlx4_core libahci dca i2c_algo_bit libata i2c_core [last unloaded: zunicode]
      [163043.010335] CPU: 12 PID: 84956 Comm: ll_ost_io00_002 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
      [163043.025057] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
      [163043.038989] task: ffff880fc52bc500 ti: ffff880fc55bc000 task.ti: ffff880fc55bc000
      [163043.049639] RIP: 0010:[<ffffffffa0960fef>] [<ffffffffa0960fef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
      [163043.063453] RSP: 0018:ffff880fc55bfab8 EFLAGS: 00010202
      [163043.071687] RAX: 0000000000000002 RBX: ffff8810f6db9b80 RCX: 0000000000000000
      [163043.081918] RDX: 0000000000000020 RSI: 0000000000000000 RDI: ffff880fc55bfad8
      [163043.092095] RBP: ffff880fc55bfb00 R08: 00000000000195a0 R09: ffff880fc55bfab8
      [163043.103441] R10: ffff88103e807900 R11: 0000000000000001 R12: 3635343332313036
      [163043.113462] R13: 0000000033323130 R14: 0000000000000534 R15: 0000000000000000
      [163043.123487] FS: 0000000000000000(0000) GS:ffff88103ef00000(0000) knlGS:0000000000000000
      [163043.134599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [163043.143101] CR2: 00007fce5afab000 CR3: 000000000194a000 CR4: 00000000001407e0
      [163043.153184] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [163043.163242] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [163043.173280] Stack:
      [163043.177580] 0000000000000002 0000000000000000 0000000000000000 0000000000000000
      [163043.188354] 00000000f43b381e 0000000000000000 ffff880fcc7d1301 ffff880e73ecc200
      [163043.199140] 0000000000000000 ffff880fc55bfb68 ffffffffa0d5345c ffff88202563f0a8
      [163043.209907] Call Trace:
      [163043.215455] [<ffffffffa0d5345c>] tgt_checksum_bulk.isra.33+0x35a/0x4e7 [ptlrpc]
      [163043.226242] [<ffffffffa0d2c21d>] tgt_brw_write+0x114d/0x1640 [ptlrpc]
      [163043.235986] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
      [163043.244558] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
      [163043.254271] [<ffffffffa0c82560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
      [163043.264858] [<ffffffffa0d28225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      [163043.275043] [<ffffffffa0cd41ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [163043.286074] [<ffffffffa0967128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [163043.296175] [<ffffffffa0cd1d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [163043.306194] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
      [163043.315553] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
      [163043.324714] [<ffffffffa0cd8260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
      [163043.334070] [<ffffffffa0cd77c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
      [163043.344635] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [163043.352181] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [163043.361606] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [163043.369571] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [163043.378772] Code: 89 43 38 48 8b 43 20 ff 50 c0 48 8b 55 d8 65 48 33 14 25 28 00 00 00 75 0d 48 83 c4 28 5b 41 5c 41 5d 41 5e 5d c3 e8 61 a0 71 e0 <0f> 0b 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
      [163043.406113] RIP [<ffffffffa0960fef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
      [163043.416991] RSP <ffff880fc55bfab8>

      This happened fairly quickly. After this run I restarted the system and it happened again almost immediately.
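      The `page:` and `pfn:` fields of the "Bad page state" report can be cross-checked against each other. The sketch below assumes the usual x86_64 layout for kernels of this generation: the `struct page` array mapped at `0xffffea0000000000`, with 64 bytes per entry. Both constants are assumptions about this kernel build; the log does not state them.

```python
VMEMMAP_BASE = 0xFFFFEA0000000000  # assumed x86_64 vmemmap start (no KASLR)
STRUCT_PAGE_SIZE = 64              # assumed sizeof(struct page)

def page_to_pfn(page_addr):
    """Map a struct page address back to its page frame number."""
    return (page_addr - VMEMMAP_BASE) // STRUCT_PAGE_SIZE

# page:ffffea006806f340 from the first report
print(hex(page_to_pfn(0xFFFFEA006806F340)))  # → 0x1a01bcd
```

      The result matches `pfn:1a01bcd` in the log, so under these assumptions the two fields are consistent with each other.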

          Activity

            [LU-9304] BUG: Bad page state in process ll_ost_io01_013 pfn:1a01bcd kernel BUG at include/linux/scatterlist.h:65!

            Yesterday I tried the following combinations. Both crashed:
            Lustre 2.9.0 + latest coral_beta_combined, record size 16M, brw_size=16, draid, zfs_abd_scatter_enabled=0, max_pages_per_rpc=4096: crash 10.8.1.3-2017-04-14-22:46:09
            Lustre 2.9.0 + latest coral_beta_combined, record size 16M, brw_size=16, draid, zfs_abd_scatter_enabled=0, max_pages_per_rpc=256: crash 10.8.1.3-2017-04-15-00:39:07
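            The two runs differ only in max_pages_per_rpc. As a rough guide to what that changes, the sketch below converts the page count into a bulk RPC size. It assumes 4 KiB pages, which is typical for x86_64 but not stated in this report.

```python
PAGE_SIZE = 4096  # assumed client page size in bytes

def rpc_bytes(max_pages_per_rpc, page_size=PAGE_SIZE):
    """Largest bulk RPC payload for a given max_pages_per_rpc setting."""
    return max_pages_per_rpc * page_size

for pages in (4096, 256):
    print(pages, "pages =", rpc_bytes(pages) // (1024 * 1024), "MiB per RPC")
```

            Under that assumption, the crash reproduces with 16 MiB RPCs as well as with 1 MiB RPCs.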

            wolf-3 OSS 10.8.1.3-2017-04-14-22:46:09

            [147931.299899] Lustre: lsdraid-OST0000: new disk, initializing
            [147931.307239] Lustre: srv-lsdraid-OST0000: No data found on store. Initialize space
            [147936.355608] Lustre: lsdraid-OST0000: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
            [147963.624729] Lustre: lsdraid-OST0000: Connection restored to bd4f4e40-dbac-a829-f1fd-3c4450a08dcb (at 192.168.1.6@o2ib)
            [147970.995882] Lustre: lsdraid-OST0000: Connection restored to b9fbce4c-a90b-3f7f-770e-f9863c38efb5 (at 192.168.1.8@o2ib)
            [147975.210049] Lustre: lsdraid-OST0000: Connection restored to 862f84d1-bf42-0dd3-ba54-1e1a9568317e (at 192.168.1.7@o2ib)
            [147975.223042] Lustre: Skipped 1 previous similar message
            [148306.620448] Lustre: lsdraid-OST0000: Connection restored to b9fbce4c-a90b-3f7f-770e-f9863c38efb5 (at 192.168.1.8@o2ib)
            [148306.633674] Lustre: Skipped 1 previous similar message
            [233987.779195] perf interrupt took too long (10163 > 9615), lowering kernel.perf_event_max_sample_rate to 13000
            [414188.327658] Lustre: 83697:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1492208717/real 1492208717]  req@ffff880f11ac8300 x1564414877971952/t0(0) o39->lsdraid-MDT0000-lwp-OST0000@192.168.1.5@o2ib:12/10 lens 224/224 e 0 to 1 dl 1492208723 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            [414188.364839] Lustre: Failing over lsdraid-OST0000
            [414192.689319] Lustre: 118209:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1492208721/real 1492208721]  req@ffff8815f6846f00 x1564414877971968/t0(0) o400->MGC192.168.1.5@o2ib@192.168.1.5@o2ib:26/25 lens 224/224 e 0 to 1 dl 1492208728 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [414194.373337] Lustre: 83697:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1492208723/real 1492208723]  req@ffff880f11ac8300 x1564414877972032/t0(0) o251->MGC192.168.1.5@o2ib@192.168.1.5@o2ib:26/25 lens 224/224 e 0 to 1 dl 1492208729 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            [414194.411850] Lustre: server umount lsdraid-OST0000 complete
            [414368.256969] Lustre: lsdraid-OST0000: new disk, initializing
            [414368.265405] Lustre: srv-lsdraid-OST0000: No data found on store. Initialize space
            [414375.147139] Lustre: lsdraid-OST0000: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
            [414533.259382] Lustre: Failing over lsdraid-OST0000
            [414533.276260] Lustre: server umount lsdraid-OST0000 complete
            [414724.001373] Lustre: lsdraid-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
            [414725.696637] Lustre: lsdraid-OST0000: Will be in recovery for at least 2:30, or until 1 client reconnects
            [414725.709414] Lustre: lsdraid-OST0000: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
            [414725.874431] Lustre: lsdraid-OST0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
            [415336.132350] Lustre: lsdraid-OST0000: Connection restored to bd4f4e40-dbac-a829-f1fd-3c4450a08dcb (at 192.168.1.6@o2ib)
            [415406.632740] ------------[ cut here ]------------
            [415406.633861] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000401:0x7a:0x0] object 0x0:88 extent [50331648-57343999]: client csum 41b33fd5, server csum 649d3feb
            [415406.665939] kernel BUG at include/linux/scatterlist.h:65!
            [415406.674352] invalid opcode: 0000 [#1] SMP 
            [415406.681344] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null xfs libcrc32c rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm dm_service_time ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper mpt3sas ablk_helper cryptd raid_class scsi_transport_sas ipmi_devintf ipmi_ssif iTCO_wdt
            [415406.776798]  sg pcspkr iTCO_vendor_support ipmi_si ipmi_msghandler mei_me sb_edac acpi_power_meter ioatdma lpc_ich edac_core acpi_pad shpchp mei wmi i2c_i801 mfd_core dm_multipath dm_mod nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb crct10dif_pclmul ttm crct10dif_common ptp crc32c_intel ahci pps_core drm mlx4_core libahci dca i2c_algo_bit libata i2c_core [last unloaded: zunicode]
            [415406.848441] CPU: 29 PID: 89865 Comm: ll_ost_io01_000 Tainted: G          IOE  ------------   3.10.0-327.36.3.el7.x86_64 #1
            [415406.863708] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [415406.878344] task: ffff8817d96e5c00 ti: ffff881a6b35c000 task.ti: ffff881a6b35c000
            [415406.889651] RIP: 0010:[<ffffffffa0c0cfef>]  [<ffffffffa0c0cfef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
            [415406.903951] RSP: 0018:ffff881a6b35fab8  EFLAGS: 00010202
            [415406.912870] RAX: 0000000000000002 RBX: ffff8820050b5900 RCX: 0000000000000000
            [415406.923849] RDX: 0000000000000020 RSI: 0000000000000000 RDI: ffff881a6b35fad8
            [415406.934787] RBP: ffff881a6b35fb00 R08: 00000000000195a0 R09: ffff881a6b35fab8
            [415406.945693] R10: ffff88103e807900 R11: 0000000000000001 R12: 3534333231303635
            [415406.956568] R13: 0000000032313036 R14: 0000000000000433 R15: 0000000000000000
            [415406.967407] FS:  0000000000000000(0000) GS:ffff88203e6c0000(0000) knlGS:0000000000000000
            [415406.979287] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [415406.988490] CR2: 00007fc89400b008 CR3: 000000000194a000 CR4: 00000000001407e0
            [415406.999227] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [415407.009940] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            [415407.020607] Stack:
            [415407.025494]  0000000000000002 0000000000000000 0000000000000000 0000000000000000
            [415407.036487]  00000000ced088e5 0000000000000000 ffff882024772701 ffff880db7053000
            [415407.047418]  0000000000000000 ffff881a6b35fb68 ffffffffa0f8e459 ffff8819d6ea98a8
            [415407.058319] Call Trace:
            [415407.063640]  [<ffffffffa0f8e459>] tgt_checksum_bulk.isra.33+0x35a/0x4e7 [ptlrpc]
            [415407.074501]  [<ffffffffa0f6721d>] tgt_brw_write+0x114d/0x1640 [ptlrpc]
            [415407.084323]  [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
            [415407.092958]  [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
            [415407.102588]  [<ffffffffa0ebd560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [415407.113192]  [<ffffffffa0f63225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [415407.123952]  [<ffffffffa0f0f1ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [415407.135575]  [<ffffffffa0c13128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [415407.146329]  [<ffffffffa0f0cd68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [415407.156963]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [415407.166363]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [415407.175301]  [<ffffffffa0f13260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [415407.184635]  [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
            [415407.193114]  [<ffffffffa0f127c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [415407.204113]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [415407.212374]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [415407.222423]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [415407.231187]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [415407.241105] Code: 89 43 38 48 8b 43 20 ff 50 c0 48 8b 55 d8 65 48 33 14 25 28 00 00 00 75 0d 48 83 c4 28 5b 41 5c 41 5d 41 5e 5d c3 e8 61 e0 46 e0 <0f> 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 
            [415407.268624] RIP  [<ffffffffa0c0cfef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
            [415407.279914]  RSP <ffff881a6b35fab8>
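            The BAD WRITE CHECKSUM message in this log carries the byte extent of the failing write. A small sketch for pulling the extent out and sizing it follows. It assumes the extent bounds are inclusive byte offsets and that pages are 4 KiB; `parse_extent` is a hypothetical helper, not part of any Lustre tool.

```python
import re

EXTENT_RE = re.compile(r"extent \[(\d+)-(\d+)\]")
PAGE_SIZE = 4096  # assumed page size in bytes

def parse_extent(line):
    """Return (start_offset, length_in_bytes) from a log line, or None."""
    m = EXTENT_RE.search(line)
    if m is None:
        return None
    start, end = int(m.group(1)), int(m.group(2))
    return start, end - start + 1  # bounds treated as inclusive

# Message copied from the log above
line = ("LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from "
        "12345-192.168.1.6@o2ib inode [0x200000401:0x7a:0x0] object 0x0:88 "
        "extent [50331648-57343999]: client csum 41b33fd5, server csum 649d3feb")

start, length = parse_extent(line)
print(start // 2**20, "MiB offset,", length // PAGE_SIZE, "pages")
```

            Under these assumptions the failing write starts at a 48 MiB offset and covers 1712 whole pages.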
            

            wolf-3 OSS 10.8.1.3-2017-04-15-00:39:07

            [ 6415.538534] Lustre: lsdraid-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
            [ 6422.155237] Lustre: lsdraid-OST0000: Will be in recovery for at least 2:30, or until 1 client reconnects
            [ 6422.165992] Lustre: lsdraid-OST0000: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
            [ 6422.291438] Lustre: lsdraid-OST0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
            [ 6422.301549] Lustre: lsdraid-OST0000: deleting orphan objects from 0x0:91 to 0x0:129
            [ 6474.856831] Lustre: lsdraid-OST0000: Connection restored to  (at 192.168.1.8@o2ib)
            [ 6565.960924] BUG: Bad page state in process ll_ost_io01_007  pfn:18eecce
            [ 6565.961668] BUG: Bad page state in process ll_ost_io01_006  pfn:18eecca
            [ 6565.961672] page:ffffea0063bb3280 count:-1 mapcount:0 mapping:          (null) index:0x0
            [ 6565.961674] page flags: 0x6fffff00000000()
            [ 6565.961675] page dumped because: nonzero _count
            [ 6565.961726] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas raid_class lrw gf128mul glue_helper ablk_helper cryptd scsi_transport_sas iTCO_wdt iTCO_vendor_support mei_me ipmi_devintf ipmi_ssif lpc_ich sb_edac ipmi_si
            [ 6565.961778]  edac_core sg ipmi_msghandler mei shpchp pcspkr ioatdma mfd_core i2c_i801 wmi acpi_pad acpi_power_meter nfsd dm_multipath dm_mod nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ptp crct10dif_pclmul pps_core crct10dif_common ahci drm crc32c_intel dca libahci mlx4_core i2c_algo_bit libata i2c_core
            [ 6565.961782] CPU: 31 PID: 10886 Comm: ll_ost_io01_006 Tainted: G          IOE  ------------   3.10.0-327.36.3.el7.x86_64 #1
            [ 6565.961784] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [ 6565.961792]  ffffea0063bb3280 000000008d05d0f2 ffff88202236f6f8 ffffffff81636431
            [ 6565.961797]  ffff88202236f720 ffffffff81631645 ffff88203e759c68 0000000000003735
            [ 6565.961803]  0000000000000001 ffff88202236f828 ffffffff81173028 ffff881022e59370
            [ 6565.961804] Call Trace:
            [ 6565.961819]  [<ffffffff81636431>] dump_stack+0x19/0x1b
            [ 6565.961824]  [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
            [ 6565.961833]  [<ffffffff81173028>] get_page_from_freelist+0x848/0x9b0
            [ 6565.961844]  [<ffffffffa06cadaa>] ? spl_kmem_free+0x2a/0x40 [spl]
            [ 6565.961848]  [<ffffffff81173327>] __alloc_pages_nodemask+0x197/0xba0
            [ 6565.961862]  [<ffffffffa01f9f02>] ? mlx4_ib_post_send+0x4e2/0xb20 [mlx4_ib]
            [ 6565.961910]  [<ffffffffa0b68f8d>] ? lu_obj_hop_keycmp+0x1d/0x30 [obdclass]
            [ 6565.961927]  [<ffffffffa081d717>] ? cfs_hash_bd_lookup_intent+0x57/0x160 [libcfs]
            [ 6565.961935]  [<ffffffff811b4afa>] alloc_pages_current+0xaa/0x170
            [ 6565.961952]  [<ffffffffa0d5786b>] osd_bufs_get+0x4cb/0xba0 [osd_zfs]
            [ 6565.961970]  [<ffffffffa10ade3d>] ofd_preprw_write.isra.29+0x1bd/0xcd0 [ofd]
            [ 6565.961980]  [<ffffffffa10af13a>] ofd_preprw+0x7ea/0x10c0 [ofd]
            [ 6565.962092]  [<ffffffffa0e8fce7>] tgt_brw_write+0xc17/0x1640 [ptlrpc]
            [ 6565.962098]  [<ffffffff81632d15>] ? __slab_free+0x10e/0x277
            [ 6565.962105]  [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
            [ 6565.962110]  [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
            [ 6565.962115]  [<ffffffff81639d72>] ? mutex_lock+0x12/0x2f
            [ 6565.962178]  [<ffffffffa0e8c225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [ 6565.962234]  [<ffffffffa0e381ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [ 6565.962249]  [<ffffffffa081a128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [ 6565.962302]  [<ffffffffa0e35d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [ 6565.962311]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [ 6565.962315]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [ 6565.962368]  [<ffffffffa0e3c260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [ 6565.962377]  [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
            [ 6565.962428]  [<ffffffffa0e3b7c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [ 6565.962436]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [ 6565.962441]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 6565.962449]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [ 6565.962454]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 6565.962456] Disabling lock debugging due to kernel taint
            [ 6565.962539] BUG: Bad page state in process ll_ost_io01_006  pfn:18eecc5
            [ 6565.962541] page:ffffea0063bb3140 count:-1 mapcount:0 mapping:          (null) index:0x0
            [ 6565.962542] page flags: 0x6fffff00000000()
            [ 6565.962543] page dumped because: nonzero _count
            [ 6565.962576] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas raid_class lrw gf128mul glue_helper ablk_helper cryptd scsi_transport_sas iTCO_wdt iTCO_vendor_support mei_me ipmi_devintf ipmi_ssif lpc_ich sb_edac ipmi_si
            [ 6565.962601]  edac_core sg ipmi_msghandler mei shpchp pcspkr ioatdma mfd_core i2c_i801 wmi acpi_pad acpi_power_meter nfsd dm_multipath dm_mod nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ptp crct10dif_pclmul pps_core crct10dif_common ahci drm crc32c_intel dca libahci mlx4_core i2c_algo_bit libata i2c_core
            [ 6565.962604] CPU: 31 PID: 10886 Comm: ll_ost_io01_006 Tainted: G    B     IOE  ------------   3.10.0-327.36.3.el7.x86_64 #1
            [ 6565.962605] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [ 6565.962612]  ffffea0063bb3140 000000008d05d0f2 ffff88202236f6f8 ffffffff81636431
            [ 6565.962619]  ffff88202236f720 ffffffff81631645 ffff88203e759c68 0000000000003735
            [ 6565.962625]  0000000000000001 ffff88202236f828 ffffffff81173028 ffff881022e59370
            [ 6565.962626] Call Trace:
            [ 6565.962632]  [<ffffffff81636431>] dump_stack+0x19/0x1b
            [ 6565.962636]  [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
            [ 6565.962641]  [<ffffffff81173028>] get_page_from_freelist+0x848/0x9b0
            [ 6565.962650]  [<ffffffffa06cadaa>] ? spl_kmem_free+0x2a/0x40 [spl]
            [ 6565.962655]  [<ffffffff81173327>] __alloc_pages_nodemask+0x197/0xba0
            [ 6565.962669]  [<ffffffffa01f9f02>] ? mlx4_ib_post_send+0x4e2/0xb20 [mlx4_ib]
            [ 6565.962711]  [<ffffffffa0b68f8d>] ? lu_obj_hop_keycmp+0x1d/0x30 [obdclass]
            [ 6565.962727]  [<ffffffffa081d717>] ? cfs_hash_bd_lookup_intent+0x57/0x160 [libcfs]
            [ 6565.962733]  [<ffffffff811b4afa>] alloc_pages_current+0xaa/0x170
            [ 6565.962745]  [<ffffffffa0d5786b>] osd_bufs_get+0x4cb/0xba0 [osd_zfs]
            [ 6565.962767]  [<ffffffffa10ade3d>] ofd_preprw_write.isra.29+0x1bd/0xcd0 [ofd]
            [ 6565.962781]  [<ffffffffa10af13a>] ofd_preprw+0x7ea/0x10c0 [ofd]
            [ 6565.962855]  [<ffffffffa0e8fce7>] tgt_brw_write+0xc17/0x1640 [ptlrpc]
            [ 6565.962861]  [<ffffffff81632d15>] ? __slab_free+0x10e/0x277
            [ 6565.962866]  [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
            [ 6565.962870]  [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
            [ 6565.962875]  [<ffffffff81639d72>] ? mutex_lock+0x12/0x2f
            [ 6565.962949]  [<ffffffffa0e8c225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [ 6565.963019]  [<ffffffffa0e381ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [ 6565.963034]  [<ffffffffa081a128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [ 6565.963103]  [<ffffffffa0e35d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [ 6565.963109]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [ 6565.963112]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [ 6565.963181]  [<ffffffffa0e3c260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [ 6565.963187]  [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
            [ 6565.963256]  [<ffffffffa0e3b7c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [ 6565.963262]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [ 6565.963267]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 6565.963273]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [ 6565.963278]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 6565.963280] BUG: Bad page state in process ll_ost_io01_006  pfn:18eecc6
            [ 6565.963282] page:ffffea0063bb3180 count:-1 mapcount:0 mapping:          (null) index:0x0
            [ 6565.963284] page flags: 0x6fffff00000000()
            [ 6565.963285] page dumped because: nonzero _count
            [ 6565.963320] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas raid_class lrw gf128mul glue_helper ablk_helper cryptd scsi_transport_sas iTCO_wdt iTCO_vendor_support mei_me ipmi_devintf ipmi_ssif lpc_ich sb_edac ipmi_si
            [ 6565.963346]  edac_core sg ipmi_msghandler mei shpchp pcspkr ioatdma mfd_core i2c_i801 wmi acpi_pad acpi_power_meter nfsd dm_multipath dm_mod nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ptp crct10dif_pclmul pps_core crct10dif_common ahci drm crc32c_intel dca libahci mlx4_core i2c_algo_bit libata i2c_core
            [ 6565.963349] CPU: 31 PID: 10886 Comm: ll_ost_io01_006 Tainted: G    B     IOE  ------------   3.10.0-327.36.3.el7.x86_64 #1
            [ 6565.963350] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [ 6565.963358]  ffffea0063bb3180 000000008d05d0f2 ffff88202236f6f8 ffffffff81636431
            [ 6565.963365]  ffff88202236f720 ffffffff81631645 ffff88203e759c68 0000000000003735
            [ 6565.963372]  0000000000000001 ffff88202236f828 ffffffff81173028 ffff881022e59370
            [ 6565.963372] Call Trace:
            [ 6565.963378]  [<ffffffff81636431>] dump_stack+0x19/0x1b
            [ 6565.963383]  [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
            [ 6565.963388]  [<ffffffff81173028>] get_page_from_freelist+0x848/0x9b0
            [ 6565.963397]  [<ffffffffa06cadaa>] ? spl_kmem_free+0x2a/0x40 [spl]
            [ 6565.963403]  [<ffffffff81173327>] __alloc_pages_nodemask+0x197/0xba0
            [ 6565.963416]  [<ffffffffa01f9f02>] ? mlx4_ib_post_send+0x4e2/0xb20 [mlx4_ib]
            [ 6565.963458]  [<ffffffffa0b68f8d>] ? lu_obj_hop_keycmp+0x1d/0x30 [obdclass]
            [ 6565.963473]  [<ffffffffa081d717>] ? cfs_hash_bd_lookup_intent+0x57/0x160 [libcfs]
            [ 6565.963479]  [<ffffffff811b4afa>] alloc_pages_current+0xaa/0x170
            [ 6565.963491]  [<ffffffffa0d5786b>] osd_bufs_get+0x4cb/0xba0 [osd_zfs]
            [ 6565.963506]  [<ffffffffa10ade3d>] ofd_preprw_write.isra.29+0x1bd/0xcd0 [ofd]
            [ 6565.963519]  [<ffffffffa10af13a>] ofd_preprw+0x7ea/0x10c0 [ofd]
            [ 6565.963593]  [<ffffffffa0e8fce7>] tgt_brw_write+0xc17/0x1640 [ptlrpc]
            [ 6565.963599]  [<ffffffff81632d15>] ? __slab_free+0x10e/0x277
            [ 6565.963603]  [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
            [ 6565.963607]  [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
            [ 6565.963612]  [<ffffffff81639d72>] ? mutex_lock+0x12/0x2f
            [ 6565.963686]  [<ffffffffa0e8c225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [ 6565.963756]  [<ffffffffa0e381ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [ 6565.963778]  [<ffffffffa081a128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [ 6565.963847]  [<ffffffffa0e35d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [ 6565.963853]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [ 6565.963856]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [ 6565.963925]  [<ffffffffa0e3c260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [ 6565.963931]  [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
            [ 6565.964000]  [<ffffffffa0e3b7c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [ 6565.964006]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [ 6565.964011]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 6565.964016]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [ 6565.964021]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 6567.436859] page:ffffea0063bb3380 count:-1 mapcount:0 mapping:          (null) index:0x0
            [ 6567.447916] page flags: 0x6fffff00000000()
            [ 6567.454287] page dumped because: nonzero _count
            [ 6567.461107] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas raid_class lrw gf128mul glue_helper ablk_helper cryptd scsi_transport_sas iTCO_wdt iTCO_vendor_support mei_me ipmi_devintf ipmi_ssif lpc_ich sb_edac ipmi_si
            [ 6567.549458]  edac_core sg ipmi_msghandler mei shpchp pcspkr ioatdma mfd_core i2c_i801 wmi acpi_pad acpi_power_meter nfsd dm_multipath dm_mod nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ptp crct10dif_pclmul pps_core crct10dif_common ahci drm crc32c_intel dca libahci mlx4_core i2c_algo_bit libata i2c_core
            [ 6567.606553] CPU: 19 PID: 11266 Comm: ll_ost_io01_007 Tainted: G    B     IOE  ------------   3.10.0-327.36.3.el7.x86_64 #1
            [ 6567.619967] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [ 6567.632682]  ffffea0063bb3380 0000000029637c7c ffff880f32283908 ffffffff81636431
            [ 6567.642074]  ffff880f32283930 ffffffff81631645 ffffea0063bb3380 0000000000000000
            [ 6567.651459]  000fffff00000000 ffff880f32283978 ffffffff811714dd fff00000fe000000
            [ 6567.660857] Call Trace:
            [ 6567.664645]  [<ffffffff81636431>] dump_stack+0x19/0x1b
            [ 6567.671441]  [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
            [ 6567.678829]  [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
            [ 6567.686591]  [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
            [ 6567.694250]  [<ffffffff8117200f>] __free_pages+0x3f/0x60
            [ 6567.701235]  [<ffffffffa0d56ad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
            [ 6567.709381]  [<ffffffffa10ab84a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
            [ 6567.717717]  [<ffffffffa10aff2d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [ 6567.725522]  [<ffffffffa0eb78d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [ 6567.733604]  [<ffffffffa0e8ff71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
            [ 6567.741863]  [<ffffffffa0de6560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [ 6567.751008]  [<ffffffffa0e8c225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [ 6567.760002]  [<ffffffffa0e381ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [ 6567.769852]  [<ffffffffa081a128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [ 6567.778843]  [<ffffffffa0e35d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [ 6567.787757]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [ 6567.796038]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [ 6567.803866]  [<ffffffffa0e3c260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [ 6567.812150]  [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
            [ 6567.819620]  [<ffffffffa0e3b7c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [ 6567.828951]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [ 6567.835460]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 6567.843817]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [ 6567.850900]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 6591.647844] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000402:0x38:0x0] object 0x0:151 extent [67108864-74711039]: client csum 10225ab5, server csum d83f5ab1
            [ 6602.366408] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000402:0x46:0x0] object 0x0:158 extent [67108864-82968575]: client csum df6bd34a, server csum a629d34d
            [ 6611.821644] general protection fault: 0000 [#1] SMP 
            [ 6611.829518] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas raid_class lrw gf128mul glue_helper ablk_helper cryptd scsi_transport_sas iTCO_wdt iTCO_vendor_support mei_me ipmi_devintf ipmi_ssif lpc_ich sb_edac ipmi_si
            [ 6611.923714]  edac_core sg ipmi_msghandler mei shpchp pcspkr ioatdma mfd_core i2c_i801 wmi acpi_pad acpi_power_meter nfsd dm_multipath dm_mod nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ptp crct10dif_pclmul pps_core crct10dif_common ahci drm crc32c_intel dca libahci mlx4_core i2c_algo_bit libata i2c_core
            [ 6611.985416] CPU: 55 PID: 9668 Comm: ll_ost_io01_000 Tainted: G    B     IOE  ------------   3.10.0-327.36.3.el7.x86_64 #1
            [ 6611.999894] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [ 6612.013786] task: ffff880fd957e780 ti: ffff880fdb368000 task.ti: ffff880fdb368000
            [ 6612.024361] RIP: 0010:[<ffffffffa0814e30>]  [<ffffffffa0814e30>] adler32_update+0x70/0x250 [libcfs]
            [ 6612.036764] RSP: 0018:ffff880fdb36b990  EFLAGS: 00010212
            [ 6612.044902] RAX: 0000000000000cce RBX: 0000000000000cce RCX: 3433323130363534
            [ 6612.055097] RDX: 0000000000000cce RSI: 0cd1944c0d8d4332 RDI: 0cd1944c0d8d4332
            [ 6612.065272] RBP: ffff880fdb36b9f8 R08: 00000000000195a0 R09: 0000000000000cce
            [ 6612.075453] R10: ffff88103e807900 R11: 0000000000000001 R12: 3433323130363534
            [ 6612.085641] R13: 0000000031303635 R14: ffffffffa0834410 R15: 0000000000000001
            [ 6612.095830] FS:  0000000000000000(0000) GS:ffff88203e8c0000(0000) knlGS:0000000000000000
            [ 6612.107119] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [ 6612.115792] CR2: 00007f19c6c7c000 CR3: 000000000194a000 CR4: 00000000001407e0
            [ 6612.126030] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [ 6612.136265] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            [ 6612.146492] Stack:
            [ 6612.150994]  ffff881d1a0d1cd0 00000ccedb36b9c8 0cd1944c0d8d4332 0000000000000000
            [ 6612.161627]  00000cce00000000 ffffffffa0834410 ffff882027752a08 ffff880fdb36b9f0
            [ 6612.172284]  0cd1944c0d8d4332 3433323130363534 0000000031303635 ffffffffa0834410
            [ 6612.182948] Call Trace:
            [ 6612.187988]  [<ffffffff812b1a78>] crypto_shash_update+0x38/0x100
            [ 6612.197017]  [<ffffffff812b1d6e>] shash_ahash_update+0x3e/0x70
            [ 6612.205854]  [<ffffffff812b1db2>] shash_async_update+0x12/0x20
            [ 6612.214676]  [<ffffffffa0813fce>] cfs_crypto_hash_update_page+0x7e/0xb0 [libcfs]
            [ 6612.225344]  [<ffffffffa0eb7459>] tgt_checksum_bulk.isra.33+0x35a/0x4e7 [ptlrpc]
            [ 6612.236606]  [<ffffffffa0e9021d>] tgt_brw_write+0x114d/0x1640 [ptlrpc]
            [ 6612.246831]  [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
            [ 6612.255910]  [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
            [ 6612.265910]  [<ffffffffa0de6560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [ 6612.276879]  [<ffffffffa0e8c225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [ 6612.287460]  [<ffffffffa0e381ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [ 6612.298869]  [<ffffffffa081a128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [ 6612.309312]  [<ffffffffa0e35d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [ 6612.319759]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [ 6612.329610]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [ 6612.338955]  [<ffffffffa0e3c260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [ 6612.348565]  [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
            [ 6612.357360]  [<ffffffffa0e3b7c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [ 6612.368146]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [ 6612.376092]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 6612.385802]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [ 6612.394179]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 6612.403767] Code: 44 00 00 8b 5d b8 b8 b0 15 00 00 81 fb b0 15 00 00 0f 46 c3 29 45 b8 83 f8 0f 89 45 a4 0f 8e f8 00 00 00 48 8b 7d a8 89 45 bc 90 <44> 0f b6 2f 44 0f b6 77 01 48 83 c7 10 44 0f b6 67 f2 0f b6 5f 
            [ 6612.430647] RIP  [<ffffffffa0814e30>] adler32_update+0x70/0x250 [libcfs]
            [ 6612.440428]  RSP <ffff880fdb36b990>
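
For anyone triaging the console output above, the following is a minimal sketch of two helpers, not part of the original report. It assumes the "BAD WRITE CHECKSUM" message layout exactly as printed in this log. The names `parse_bad_checksum` and `reg_as_ascii` are hypothetical. The first helper pulls the extent size and the two checksums out of a message. The second prints a 64-bit register value as the bytes it would occupy in little-endian memory, which makes it easier to see when a register such as RCX in the general protection fault holds printable characters rather than an address. It draws no conclusion about the cause.

```python
import re

# Layout of the "BAD WRITE CHECKSUM" console message, taken from the log above.
# This is an assumption about the format, not a Lustre-provided parser.
CSUM_RE = re.compile(
    r"BAD WRITE CHECKSUM: (?P<target>\S+) from (?P<peer>\S+) "
    r"inode \[(?P<fid>[^\]]+)\] object (?P<obj>\S+) "
    r"extent \[(?P<start>\d+)-(?P<end>\d+)\]: "
    r"client csum (?P<client>[0-9a-f]+), server csum (?P<server>[0-9a-f]+)"
)


def parse_bad_checksum(line):
    """Return a dict describing one checksum mismatch, or None if no match."""
    m = CSUM_RE.search(line)
    if m is None:
        return None
    start, end = int(m["start"]), int(m["end"])
    client, server = int(m["client"], 16), int(m["server"], 16)
    return {
        "target": m["target"],
        "peer": m["peer"],
        "fid": m["fid"],
        "object": m["obj"],
        # The extent is printed as an inclusive byte range.
        "extent_bytes": end - start + 1,
        "client_csum": client,
        "server_csum": server,
        # Bits that differ between the client and server checksums.
        "xor": client ^ server,
    }


def reg_as_ascii(value_hex):
    """Show a register value as the bytes it occupies in little-endian memory.

    Non-printable bytes are shown as '.'.
    """
    raw = bytes.fromhex(value_hex)[::-1]  # x86-64 stores the low byte first
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    sample = (
        "[ 6591.647844] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 "
        "from 12345-192.168.1.6@o2ib inode [0x200000402:0x38:0x0] object 0x0:151 "
        "extent [67108864-74711039]: client csum 10225ab5, server csum d83f5ab1"
    )
    print(parse_bad_checksum(sample))
    print(reg_as_ascii("3433323130363534"))
```

Run against a saved console log, `parse_bad_checksum` can be applied line by line to list every mismatching extent along with the client it came from.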
            
kthread_create_on_node+0x140/0x140 [ 6567.436859] page:ffffea0063bb3380 count:-1 mapcount:0 mapping: (null) index:0x0 [ 6567.447916] page flags: 0x6fffff00000000() [ 6567.454287] page dumped because: nonzero _count [ 6567.461107] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas raid_class lrw gf128mul glue_helper ablk_helper cryptd scsi_transport_sas iTCO_wdt iTCO_vendor_support mei_me ipmi_devintf ipmi_ssif lpc_ich sb_edac ipmi_si [ 6567.549458] edac_core sg ipmi_msghandler mei shpchp pcspkr ioatdma mfd_core i2c_i801 wmi acpi_pad acpi_power_meter nfsd dm_multipath dm_mod nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ptp crct10dif_pclmul pps_core crct10dif_common ahci drm crc32c_intel dca libahci mlx4_core i2c_algo_bit libata i2c_core [ 6567.606553] CPU: 19 PID: 11266 Comm: ll_ost_io01_007 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1 [ 6567.619967] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [ 6567.632682] ffffea0063bb3380 0000000029637c7c ffff880f32283908 ffffffff81636431 [ 6567.642074] ffff880f32283930 ffffffff81631645 ffffea0063bb3380 0000000000000000 [ 6567.651459] 000fffff00000000 ffff880f32283978 
ffffffff811714dd fff00000fe000000 [ 6567.660857] Call Trace: [ 6567.664645] [<ffffffff81636431>] dump_stack+0x19/0x1b [ 6567.671441] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc [ 6567.678829] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190 [ 6567.686591] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140 [ 6567.694250] [<ffffffff8117200f>] __free_pages+0x3f/0x60 [ 6567.701235] [<ffffffffa0d56ad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs] [ 6567.709381] [<ffffffffa10ab84a>] ofd_commitrw_write+0xea/0x1c20 [ofd] [ 6567.717717] [<ffffffffa10aff2d>] ofd_commitrw+0x51d/0xa40 [ofd] [ 6567.725522] [<ffffffffa0eb78d2>] obd_commitrw+0x2ec/0x32f [ptlrpc] [ 6567.733604] [<ffffffffa0e8ff71>] tgt_brw_write+0xea1/0x1640 [ptlrpc] [ 6567.741863] [<ffffffffa0de6560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [ 6567.751008] [<ffffffffa0e8c225>] tgt_request_handle+0x915/0x1320 [ptlrpc] [ 6567.760002] [<ffffffffa0e381ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [ 6567.769852] [<ffffffffa081a128>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [ 6567.778843] [<ffffffffa0e35d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [ 6567.787757] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20 [ 6567.796038] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90 [ 6567.803866] [<ffffffffa0e3c260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc] [ 6567.812150] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0 [ 6567.819620] [<ffffffffa0e3b7c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [ 6567.828951] [<ffffffff810a5b8f>] kthread+0xcf/0xe0 [ 6567.835460] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [ 6567.843817] [<ffffffff81646a98>] ret_from_fork+0x58/0x90 [ 6567.850900] [<ffffffff810a5ac0>] ? 
kthread_create_on_node+0x140/0x140 [ 6591.647844] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000402:0x38:0x0] object 0x0:151 extent [67108864-74711039]: client csum 10225ab5, server csum d83f5ab1 [ 6602.366408] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000402:0x46:0x0] object 0x0:158 extent [67108864-82968575]: client csum df6bd34a, server csum a629d34d [ 6611.821644] general protection fault: 0000 [#1] SMP [ 6611.829518] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas raid_class lrw gf128mul glue_helper ablk_helper cryptd scsi_transport_sas iTCO_wdt iTCO_vendor_support mei_me ipmi_devintf ipmi_ssif lpc_ich sb_edac ipmi_si [ 6611.923714] edac_core sg ipmi_msghandler mei shpchp pcspkr ioatdma mfd_core i2c_i801 wmi acpi_pad acpi_power_meter nfsd dm_multipath dm_mod nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ptp crct10dif_pclmul pps_core crct10dif_common ahci drm crc32c_intel dca libahci mlx4_core i2c_algo_bit libata i2c_core [ 6611.985416] CPU: 55 PID: 9668 Comm: ll_ost_io01_000 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1 [ 6611.999894] Hardware name: 
Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [ 6612.013786] task: ffff880fd957e780 ti: ffff880fdb368000 task.ti: ffff880fdb368000 [ 6612.024361] RIP: 0010:[<ffffffffa0814e30>] [<ffffffffa0814e30>] adler32_update+0x70/0x250 [libcfs] [ 6612.036764] RSP: 0018:ffff880fdb36b990 EFLAGS: 00010212 [ 6612.044902] RAX: 0000000000000cce RBX: 0000000000000cce RCX: 3433323130363534 [ 6612.055097] RDX: 0000000000000cce RSI: 0cd1944c0d8d4332 RDI: 0cd1944c0d8d4332 [ 6612.065272] RBP: ffff880fdb36b9f8 R08: 00000000000195a0 R09: 0000000000000cce [ 6612.075453] R10: ffff88103e807900 R11: 0000000000000001 R12: 3433323130363534 [ 6612.085641] R13: 0000000031303635 R14: ffffffffa0834410 R15: 0000000000000001 [ 6612.095830] FS: 0000000000000000(0000) GS:ffff88203e8c0000(0000) knlGS:0000000000000000 [ 6612.107119] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6612.115792] CR2: 00007f19c6c7c000 CR3: 000000000194a000 CR4: 00000000001407e0 [ 6612.126030] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6612.136265] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 6612.146492] Stack: [ 6612.150994] ffff881d1a0d1cd0 00000ccedb36b9c8 0cd1944c0d8d4332 0000000000000000 [ 6612.161627] 00000cce00000000 ffffffffa0834410 ffff882027752a08 ffff880fdb36b9f0 [ 6612.172284] 0cd1944c0d8d4332 3433323130363534 0000000031303635 ffffffffa0834410 [ 6612.182948] Call Trace: [ 6612.187988] [<ffffffff812b1a78>] crypto_shash_update+0x38/0x100 [ 6612.197017] [<ffffffff812b1d6e>] shash_ahash_update+0x3e/0x70 [ 6612.205854] [<ffffffff812b1db2>] shash_async_update+0x12/0x20 [ 6612.214676] [<ffffffffa0813fce>] cfs_crypto_hash_update_page+0x7e/0xb0 [libcfs] [ 6612.225344] [<ffffffffa0eb7459>] tgt_checksum_bulk.isra.33+0x35a/0x4e7 [ptlrpc] [ 6612.236606] [<ffffffffa0e9021d>] tgt_brw_write+0x114d/0x1640 [ptlrpc] [ 6612.246831] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150 [ 6612.255910] [<ffffffff810be46e>] ? 
account_entity_dequeue+0xae/0xd0 [ 6612.265910] [<ffffffffa0de6560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [ 6612.276879] [<ffffffffa0e8c225>] tgt_request_handle+0x915/0x1320 [ptlrpc] [ 6612.287460] [<ffffffffa0e381ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [ 6612.298869] [<ffffffffa081a128>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [ 6612.309312] [<ffffffffa0e35d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [ 6612.319759] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20 [ 6612.329610] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90 [ 6612.338955] [<ffffffffa0e3c260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc] [ 6612.348565] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0 [ 6612.357360] [<ffffffffa0e3b7c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [ 6612.368146] [<ffffffff810a5b8f>] kthread+0xcf/0xe0 [ 6612.376092] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [ 6612.385802] [<ffffffff81646a98>] ret_from_fork+0x58/0x90 [ 6612.394179] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [ 6612.403767] Code: 44 00 00 8b 5d b8 b8 b0 15 00 00 81 fb b0 15 00 00 0f 46 c3 29 45 b8 83 f8 0f 89 45 a4 0f 8e f8 00 00 00 48 8b 7d a8 89 45 bc 90 <44> 0f b6 2f 44 0f b6 77 01 48 83 c7 10 44 0f b6 67 f2 0f b6 5f [ 6612.430647] RIP [<ffffffffa0814e30>] adler32_update+0x70/0x250 [libcfs] [ 6612.440428] RSP <ffff880fdb36b990>
            jsalians_intel John Salinas (Inactive) added a comment (edited)

            Lustre 2.9.0 + ZFS 0.7.0 RC3 (none of our patches), record size 1M on OST0 and 16M on OST1, brw_size=16 on both, raidz pools: error messages but no crash.

            Manual dumps: 10.8.1.4-2017-04-15-00:26:17, 10.8.1.4-2017-04-15-01:47:43, 10.8.1.3-2017-04-15-13:22:45, 10.8.1.4-2017-04-15-13:22:47

            wolf-4 OSS

            [  163.434692] Lustre: lsdraid-OST0001: Recovery over after 0:06, of 5 clients 5 recovered and 0 were evicted.
            [  163.480746] Lustre: lsdraid-OST0001: deleting orphan objects from 0x0:720 to 0x0:1025
            [  370.631336] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x3b0:0x0] object 0x0:1225 extent [83886080-92680191]: client csum d5f42113, server csum 1a89e99c
            [  480.339896] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x49c:0x0] object 0x0:4041 extent [33554432-47890431]: client csum e47bcdcb, server csum 86becdcf
            [  488.890964] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x4ae:0x0] object 0x0:5107 extent [67108864-73793535]: client csum b74b30df, server csum 20c030ec
            [  509.914190] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x52f:0x0] object 0x0:6348 extent [33554432-43007999]: client csum cbc76f28, server csum 4b241635
            [  539.505532] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x5be:0x0] object 0x0:7700 extent [67108864-78381055]: client csum b6e2021c, server csum c5ce4f88
            [  560.736133] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x5f1:0x0] object 0x0:8747 extent [67108864-81104895]: client csum ddc22e54, server csum 894f5e1a
            [  618.743576] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x6d0:0x0] object 0x0:11762 extent [67108864-81694719]: client csum 734e4939, server csum 175394a5
            [  618.764867] LustreError: Skipped 1 previous similar message
            [ 1080.395798] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x7fa:0x0] object 0x0:14839 extent [40140800-50331647]: client csum 937c50bf, server csum f71e2e65
            [ 1080.417120] LustreError: Skipped 2 previous similar messages
            [ 3001.142322] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0xd10:0x0] object 0x0:49284 extent [100663296-108527615]: client csum ab9466a8, server csum 10b4e228
            [ 3400.563954] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0xfb0:0x0] object 0x0:54388 extent [67108864-82837503]: client csum 71e8cd52, server csum 35becd53
            [ 3461.970072] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x1052:0x0] object 0x0:55534 extent [67108864-74973183]: client csum c0a766ab, server csum ab5a66bb
            [ 3762.672549] BUG: Bad page state in process ll_ost_io01_003  pfn:182ec6d
            [ 3762.680002] page:ffffea0060bb1b40 count:-1 mapcount:0 mapping:          (null) index:0x0
            [ 3762.689091] page flags: 0x6fffff00000000()
            [ 3762.693727] page dumped because: nonzero _count
            [ 3762.700757] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas lrw gf128mul glue_helper ablk_helper cryptd raid_class scsi_transport_sas mei_me iTCO_wdt ipmi_ssif iTCO_vendor_support mei ipmi_devintf sb_edac sg
            [ 3762.790920]  ioatdma lpc_ich shpchp edac_core pcspkr i2c_i801 ipmi_si mfd_core ipmi_msghandler acpi_pad acpi_power_meter wmi nfsd dm_multipath dm_mod auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb crct10dif_pclmul ptp crct10dif_common ttm ahci crc32c_intel pps_core mlx4_core libahci drm dca i2c_algo_bit libata i2c_core
            [ 3762.850233] CPU: 31 PID: 9096 Comm: ll_ost_io01_003 Tainted: P          IOE  ------------   3.10.0-327.36.3.el7.x86_64 #1
            [ 3762.864178] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [ 3762.877501]  ffffea0060bb1b40 000000006cbfa991 ffff880fd6a47908 ffffffff81636431
            [ 3762.887516]  ffff880fd6a47930 ffffffff81631645 ffffea0060bb1b40 0000000000000000
            [ 3762.897491]  000fffff00000000 ffff880fd6a47978 ffffffff811714dd fff00000fe000000
            [ 3762.907458] Call Trace:
            [ 3762.912046]  [<ffffffff81636431>] dump_stack+0x19/0x1b
            [ 3762.919394]  [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
            [ 3762.927333]  [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
            [ 3762.935630]  [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
            [ 3762.943790]  [<ffffffff8117200f>] __free_pages+0x3f/0x60
            [ 3762.951264]  [<ffffffffa0fa1ad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
            [ 3762.959874]  [<ffffffffa109b84a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
            [ 3762.968646]  [<ffffffffa109ff2d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [ 3762.976868]  [<ffffffffa0e0f8d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [ 3762.985338]  [<ffffffffa0de7f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
            [ 3762.993957]  [<ffffffffa0d3e560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [ 3763.003453]  [<ffffffffa0de4225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [ 3763.012530]  [<ffffffffa0d901ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [ 3763.022429]  [<ffffffffa0a33128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [ 3763.031354]  [<ffffffffa0d8dd68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [ 3763.040220]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [ 3763.048476]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [ 3763.056267]  [<ffffffffa0d94260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [ 3763.064562]  [<ffffffffa0d937c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [ 3763.074037]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [ 3763.080685]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 3763.089162]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [ 3763.096349]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 3855.476573] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x12e3:0x0] object 0x0:58439 extent [67108864-82837503]: client csum 71e8cd52, server csum 14e5cd5e
            [ 3923.650281] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x13bc:0x0] object 0x0:59171 extent [33554432-48742399]: client csum 9005f4a9, server csum db87ac4c
            [ 5698.551136] perf interrupt took too long (2521 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
            [ 5904.311835] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x1734:0x0] object 0x0:66681 extent [67108864-80281599]: client csum 1eaa58ca, server csum 44a378f0
            [ 8708.045614] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x1d31:0x0] object 0x0:67733 extent [121729024-134217727]: client csum 99efe98c, server csum e23d22e1
            [ 9738.442312] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x2051:0x0] object 0x0:68278 extent [100663296-116666367]: client csum d42f69dc, server csum 8732074f
            [10448.854337] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x237a:0x0] object 0x0:68809 extent [100663296-112549887]: client csum 7a8b3e1a, server csum 1dbd0291
            [10480.902373] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x2396:0x0] object 0x0:68834 extent [85426176-100663295]: client csum f43a36f0, server csum 9d10e702
            [11720.767365] BUG: Bad page state in process ll_ost_io01_001  pfn:15d132f
            [11720.777259] page:ffffea005744cbc0 count:-1 mapcount:0 mapping:          (null) index:0x0
            [11720.788693] page flags: 0x6fffff00000000()
            [11720.795463] page dumped because: nonzero _count
            [11720.802596] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas lrw gf128mul glue_helper ablk_helper cryptd raid_class scsi_transport_sas mei_me iTCO_wdt ipmi_ssif iTCO_vendor_support mei ipmi_devintf sb_edac sg
            [11720.893130]  ioatdma lpc_ich shpchp edac_core pcspkr i2c_i801 ipmi_si mfd_core ipmi_msghandler acpi_pad acpi_power_meter wmi nfsd dm_multipath dm_mod auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb crct10dif_pclmul ptp crct10dif_common ttm ahci crc32c_intel pps_core mlx4_core libahci drm dca i2c_algo_bit libata i2c_core
            [11720.951749] CPU: 35 PID: 8509 Comm: ll_ost_io01_001 Tainted: P    B     IOE  ------------   3.10.0-327.36.3.el7.x86_64 #1
            [11720.965393] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [11720.978463]  ffffea005744cbc0 00000000a971f860 ffff880fdb6bf908 ffffffff81636431
            [11720.988249]  ffff880fdb6bf930 ffffffff81631645 ffffea005744cbc0 0000000000000000
            [11720.998053]  000fffff00000000 ffff880fdb6bf978 ffffffff811714dd fff00000fe000000
            [11721.007838] Call Trace:
            [11721.012009]  [<ffffffff81636431>] dump_stack+0x19/0x1b
            [11721.019195]  [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
            [11721.026948]  [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
            [11721.035167]  [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
            [11721.043294]  [<ffffffff8117200f>] __free_pages+0x3f/0x60
            [11721.050752]  [<ffffffffa0fa1ad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
            [11721.059372]  [<ffffffffa109b84a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
            [11721.068157]  [<ffffffffa109ff2d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [11721.076424]  [<ffffffffa0e0f8d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [11721.085001]  [<ffffffffa0de7f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
            [11721.094001]  [<ffffffff81632d15>] ? __slab_free+0x10e/0x277
            [11721.101706]  [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
            [11721.109340]  [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
            [11721.117924]  [<ffffffffa0d3e560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [11721.127469]  [<ffffffffa0de4225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [11721.136601]  [<ffffffffa0d901ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [11721.146564]  [<ffffffffa0a33128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [11721.155726]  [<ffffffffa0d8dd68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [11721.164815]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [11721.173099]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [11721.180948]  [<ffffffffa0d94260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [11721.189287]  [<ffffffffa0d937c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [11721.198828]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [11721.205490]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [11721.214017]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [11721.221178]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [11906.409714] perf interrupt took too long (5056 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
            [12369.576466] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x28dd:0x0] object 0x0:69605 extent [100663296-115441663]: client csum 34b2200, server csum 5f29220d
            [12574.297235] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x2a16:0x0] object 0x0:69767 extent [100663296-114409471]: client csum c953b2e4, server csum f3b9a3f5
            [12583.154014] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x2a22:0x0] object 0x0:69773 extent [100663296-117309439]: client csum fa39f722, server csum 17548bac
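
For triage, the BAD WRITE CHECKSUM lines in the wolf-4 log can be tabulated mechanically. The following is a minimal Python sketch, assuming the console format is exactly as quoted in this comment; `CSUM_RE`, `BadChecksum`, and `parse_bad_checksums` are hypothetical helper names for illustration, not part of Lustre.

```python
import re
from typing import List, NamedTuple

# Assumption: matches only the "168-f: BAD WRITE CHECKSUM" line format quoted above.
CSUM_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+LustreError: 168-f: BAD WRITE CHECKSUM: "
    r"(?P<target>\S+) from (?P<client>\S+) inode \[(?P<fid>[^\]]+)\] "
    r"object (?P<obj>\S+) extent \[(?P<start>\d+)-(?P<end>\d+)\]: "
    r"client csum (?P<ccsum>[0-9a-f]+), server csum (?P<scsum>[0-9a-f]+)"
)


class BadChecksum(NamedTuple):
    timestamp: float   # seconds since boot, from the dmesg prefix
    target: str        # OST name, e.g. lsdraid-OST0001
    client: str        # client NID as printed in the log
    start: int         # first byte of the extent
    end: int           # last byte of the extent (inclusive)
    client_csum: int
    server_csum: int

    @property
    def length(self) -> int:
        """Extent length in bytes (the logged range is inclusive)."""
        return self.end - self.start + 1


def parse_bad_checksums(log_text: str) -> List[BadChecksum]:
    """Extract every BAD WRITE CHECKSUM event found in log_text."""
    return [
        BadChecksum(
            float(m["ts"]), m["target"], m["client"],
            int(m["start"]), int(m["end"]),
            int(m["ccsum"], 16), int(m["scsum"], 16),
        )
        for m in CSUM_RE.finditer(log_text)
    ]


# Two lines copied from the wolf-4 log above.
SAMPLE = (
    "[  370.631336] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from "
    "12345-192.168.1.6@o2ib inode [0x200000405:0x3b0:0x0] object 0x0:1225 "
    "extent [83886080-92680191]: client csum d5f42113, server csum 1a89e99c\n"
    "[ 3400.563954] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from "
    "12345-192.168.1.6@o2ib inode [0x200000405:0xfb0:0x0] object 0x0:54388 "
    "extent [67108864-82837503]: client csum 71e8cd52, server csum 35becd53\n"
)

if __name__ == "__main__":
    for ev in parse_bad_checksums(SAMPLE):
        print(f"{ev.timestamp:>12.6f} {ev.target} {ev.client} "
              f"len={ev.length} client={ev.client_csum:08x} server={ev.server_csum:08x}")
```

Feeding the full `dmesg` output through `parse_bad_checksums` makes it easy to check, for example, whether every mismatch comes from the same client NID or clusters around particular extent sizes.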
            

            wolf-3 OSS

            [  702.495373] Lustre: lsdraid-OST0000: Connection restored to 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 (at 192.168.1.6@o2ib)
            [  712.111566] LustreError: 35894:0:(ofd_grant.c:641:ofd_grant_check()) lsdraid-OST0000: cli 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 claims 17432576 GRANT, real grant 0
            [  712.629997] LustreError: 39491:0:(ofd_grant.c:641:ofd_grant_check()) lsdraid-OST0000: cli 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 claims 17432576 GRANT, real grant 0
            [  712.649481] LustreError: 39491:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 8 previous similar messages
            [  713.660785] LustreError: 38266:0:(ofd_grant.c:641:ofd_grant_check()) lsdraid-OST0000: cli 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 claims 17432576 GRANT, real grant 0
            [  713.679875] LustreError: 38266:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 5 previous similar messages
            [  715.665680] LustreError: 38165:0:(ofd_grant.c:641:ofd_grant_check()) lsdraid-OST0000: cli 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 claims 17432576 GRANT, real grant 0
            [  715.685499] LustreError: 38165:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 48 previous similar messages
            [  835.423369] Lustre: lsdraid-OST0000: Connection restored to 4e5e1424-c5a7-dbfe-ccf8-a041ec520cb5 (at 192.168.1.9@o2ib)
            [  835.437468] Lustre: Skipped 2 previous similar messages
            [11228.546836] perf interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
            [28193.720410] LNet: Service thread pid 91775 was inactive for 200.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [28193.743765] Pid: 91775, comm: ll_ost00_010
            [28193.750363] 
            Call Trace:
            [28193.758633]  [<ffffffff8163bb39>] schedule+0x29/0x70
            [28193.765982]  [<ffffffffa05cb2fd>] cv_wait_common+0x10d/0x130 [spl]
            [28193.774687]  [<ffffffff810a6b80>] ? autoremove_wake_function+0x0/0x40
            [28193.783567]  [<ffffffffa05cb335>] __cv_wait+0x15/0x20 [spl]
            [28193.791608]  [<ffffffffa1439c23>] txg_wait_open+0xb3/0xf0 [zfs]
            [28193.799877]  [<ffffffffa13e264d>] dmu_free_long_range+0x25d/0x3d0 [zfs]
            [28193.808919]  [<ffffffffa1092468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
            [28193.818586]  [<ffffffffa10927d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
            [28193.828178]  [<ffffffffa1094dba>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
            [28193.836927]  [<ffffffffa119516f>] ofd_trans_stop+0x1f/0x60 [ofd]
            [28193.845026]  [<ffffffffa1198d82>] ofd_object_destroy+0x2b2/0x890 [ofd]
            [28193.853770]  [<ffffffffa1191987>] ofd_destroy_by_fid+0x307/0x510 [ofd]
            [28193.862440]  [<ffffffffa0cdcbe0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
            [28193.871264]  [<ffffffffa0cd71f0>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
            [28193.880161]  [<ffffffffa1181627>] ofd_destroy_hdl+0x267/0xa50 [ofd]
            [28193.888454]  [<ffffffffa0d6b225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [28193.897329]  [<ffffffffa0d171ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [28193.907053]  [<ffffffffa09c7128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [28193.915785]  [<ffffffffa0d14d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [28193.924476]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [28193.932565]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [28193.940211]  [<ffffffffa0d1b260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [28193.948394]  [<ffffffffa0d1a7c0>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
            [28193.956493]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [28193.963027]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
            [28193.969635]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [28193.976729]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
            
            [28193.985950] LustreError: dumping log to /tmp/lustre-log.1492246924.91775
            [28199.712751] LNet: Service thread pid 91775 completed after 206.29s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [31329.310375] perf interrupt took too long (5002 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
            [root@wolf-3 10.8.1.3-2017-04-14-22:46:09]# 
            
            

            [root@wolf-4 combined]# ps aux |grep 9096
            root 9096 0.6 0.0 0 0 ? S 01:55 4:21 [ll_ost_io01_003]
            root 77386 0.0 0.0 112656 976 pts/0 S+ 12:56 0:00 grep --color=auto 9096
            [root@wolf-4 combined]# man ps
            [root@wolf-4 combined]# ps aux |grep 8509
            root 8509 4.3 0.0 0 0 ? D 01:55 28:56 [ll_ost_io01_001]
            root 84813 0.0 0.0 112656 976 pts/0 S+ 12:57 0:00 grep --color=auto 8509

            [root@wolf-4 combined]# cat /proc/9096/stack
            [<ffffffffa0d8dff5>] ptlrpc_wait_event+0x325/0x340 [ptlrpc]
            [<ffffffffa0d93fcb>] ptlrpc_main+0x80b/0x1de0 [ptlrpc]
            [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [<ffffffffffffffff>] 0xffffffffffffffff
            [root@wolf-4 combined]# cat /proc/8509/stack
            [<ffffffff8108c04f>] usleep_range+0x4f/0x70
            [<ffffffffa269c99a>] dmu_tx_wait+0x33a/0x360 [zfs]
            [<ffffffffa269ca45>] dmu_tx_assign+0x85/0x3f0 [zfs]
            [<ffffffffa0f94fea>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            [<ffffffffa10960db>] ofd_trans_start+0x6b/0xe0 [ofd]
            [<ffffffffa109c0a3>] ofd_commitrw_write+0x943/0x1c20 [ofd]
            [<ffffffffa109ff2d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [<ffffffffa0e0f8d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [<ffffffffa0de7f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
            [<ffffffffa0de4225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [<ffffffffa0d901ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [<ffffffffa0d94260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [<ffffffffffffffff>] 0xffffffffffffffff

            jsalians_intel John Salinas (Inactive) added a comment - edited
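
            The `BAD WRITE CHECKSUM` messages in the OSS log above follow a regular layout, so they can be pulled out of a saved dmesg capture and compared with the timestamps of the `Bad page state` reports. The following is a minimal sketch, assuming the message format is exactly as quoted in this ticket; the function name `parse_bad_checksums` and the record field names are hypothetical and are not part of Lustre.

```python
import re

# Pattern modeled only on the "BAD WRITE CHECKSUM" lines quoted in this ticket.
# \s+ is used between tokens so that messages wrapped across lines still match.
BAD_CSUM = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+LustreError:\s+168-f:\s+BAD\s+WRITE\s+CHECKSUM:\s+"
    r"(?P<target>\S+)\s+from\s+(?P<peer>\S+)\s+"
    r"inode\s+\[(?P<fid>[^\]]+)\]\s+object\s+(?P<obj>\S+)\s+"
    r"extent\s+\[(?P<start>\d+)-(?P<end>\d+)\]:\s+"
    r"client\s+csum\s+(?P<client>[0-9a-f]+),\s+server\s+csum\s+(?P<server>[0-9a-f]+)"
)


def parse_bad_checksums(text):
    """Return one dict per BAD WRITE CHECKSUM message found in `text`."""
    records = []
    for m in BAD_CSUM.finditer(text):
        start, end = int(m["start"]), int(m["end"])
        records.append({
            "timestamp": float(m["ts"]),   # seconds since boot, as printed by dmesg
            "target": m["target"],         # e.g. lsdraid-OST0001
            "peer": m["peer"],             # client NID as logged
            "fid": m["fid"],
            "object": m["obj"],
            "extent": (start, end),        # inclusive byte range
            "length": end - start + 1,
            "client_csum": m["client"],
            "server_csum": m["server"],
        })
    return records


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python parse_csum.py dmesg.txt
    with open(sys.argv[1]) as fh:
        for rec in parse_bad_checksums(fh.read()):
            print(rec["timestamp"], rec["target"], rec["object"], rec["length"])
```

            Sorting the resulting records by `timestamp` next to the `BUG: Bad page state` timestamps makes it easier to see whether checksum mismatches cluster before each bad-page report; the sketch itself draws no conclusion about cause.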

            I don't remember BUG: Bad page state in process in LU-9279 but it was a month ago so anything is possible.

            None of the traces here look ZFS related – can you give us any hint on where to look?

            jsalians_intel John Salinas (Inactive) added a comment

            Could the initial dump of LU-9279 be truncated and there's a double free prior to the bad page pointer? That would actually make more sense for a failure scenario.

            utopiabound Nathaniel Clark added a comment

            On Onyx: $ ls -lart /scratch/johnsali/LU-9304.tgz
            -rwxr-xr-x 1 johnsali johnsali 815773487 Apr 14 07:28 /scratch/johnsali/LU-9304.tgz

            jsalians_intel John Salinas (Inactive) added a comment

            I have logins to Onyx and Lola.

            utopiabound Nathaniel Clark added a comment

            Which clusters do you have a login for? I will copy it over to NFS on that cluster.

            jsalians_intel John Salinas (Inactive) added a comment

            How can I get a copy? I don't have a login to wolf currently.

            utopiabound Nathaniel Clark added a comment

            Oh good we have a dump for this one!

            jsalians_intel John Salinas (Inactive) added a comment

            Yes, looking at this, I would assume they come from the same root cause.

            utopiabound Nathaniel Clark added a comment

            Hi Nate,

            Can you please look into this one? We thought on the triage call that this could be a duplicate of LU-9279. Do you agree?

            Thanks.
            Joe

            jgmitter Joseph Gmitter (Inactive) added a comment

            People

              utopiabound Nathaniel Clark
              jsalians_intel John Salinas (Inactive)