Lustre / LU-9304

BUG: Bad page state in process ll_ost_io01_013 pfn:1a01bcd kernel BUG at include/linux/scatterlist.h:65!

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • Lustre 2.10.0
    • None
    • 3

    Description

      The setup is 4 Lustre clients, 2 OSS nodes with 1 zpool each, and 1 MDS.
      The zpool on the affected OSS node:

      # zpool status -v
        pool: ost0
       state: ONLINE
        scan: none requested
      config:

              NAME             STATE     READ WRITE CKSUM
              ost0             ONLINE       0     0     0
                draid1-0{any}  ONLINE       0     0     0
                  mpathaj      ONLINE       0     0     0
                  mpathai      ONLINE       0     0     0
                  mpathah      ONLINE       0     0     0
                  mpathag      ONLINE       0     0     0
                  mpathaq      ONLINE       0     0     0
                  mpathap      ONLINE       0     0     0
                  mpathak      ONLINE       0     0     0
                  mpathz       ONLINE       0     0     0
                  mpatham      ONLINE       0     0     0
                  mpathal      ONLINE       0     0     0
                  mpathao      ONLINE       0     0     0
              spares
                $draid1-0-s0   AVAIL

      errors: No known data errors

      The ZFS build was from the coral-prototype branch, and Lustre was built from master as of Dec 1st.

      We were running our file system aging utility, FileAger.py (1-2 copies on each of the 4 client nodes), alongside an IOR run:

      mpirun -wdir /mnt/lustre/ -np 4 -rr -machinefile hosts -env I_MPI_EXTRA_FILESYSTEM=on -env I_MPI_EXTRA_FILESYSTEM_LIST=lustre /home/johnsali/wolf-3/ior/src/ior -a POSIX -F -N 4 -d 2 -i 1 -s 20000 -b 16MB -t 16MB -k -w -r
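For a sense of scale, the IOR options above can be turned into a data volume. This is a rough sketch, assuming IOR reads `16MB` as 16 MiB and that each task writes `-s` segments of `-b` bytes to its own file (`-F`); the helper name `ior_volume` is made up for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

def ior_volume(num_tasks, segments, block_bytes):
    """Return (bytes per task, bytes across all tasks) for one write pass."""
    per_task = segments * block_bytes
    return per_task, per_task * num_tasks

# Values taken from the command line: -N 4 -s 20000 -b 16MB
per_task, total = ior_volume(num_tasks=4, segments=20000, block_bytes=16 * MIB)

print(f"per task: {per_task / GIB:.1f} GiB")
print(f"all tasks: {total / TIB:.2f} TiB")
```

Under those assumptions each task writes about 312.5 GiB, so the write phase alone pushes roughly 1.22 TiB through the two OSS nodes, in addition to whatever FileAger.py generates.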

      While this was running, we hit the following failure:

      [159898.950714] BUG: Bad page state in process ll_ost_io01_013 pfn:1a01bcd
      [159898.960045] page:ffffea006806f340 count:-1 mapcount:0 mapping: (null) index:0x0
      [159898.970667] page flags: 0x6fffff00000000()
      [159898.976808] page dumped because: nonzero _count
      [159898.983412] Modules linked in: nfsv3 nfs_acl raid10 osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mpt3sas ipmi_devintf ipmi_ssif ipmi_si
      [159899.072452] raid_class sb_edac iTCO_wdt iTCO_vendor_support scsi_transport_sas sg edac_core pcspkr ipmi_msghandler wmi ioatdma mei_me mei lpc_ich shpchp i2c_i801 mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_ib mlx4_en ib_sa vxlan ib_mad ip6_udp_tunnel udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul igb crct10dif_common ttm ptp crc32c_intel ahci pps_core drm mlx4_core libahci dca i2c_algo_bit libata i2c_core [last unloaded: zunicode]
      [159899.135473] CPU: 57 PID: 98747 Comm: ll_ost_io01_013 Tainted: G IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
      [159899.149461] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
      [159899.162801] ffffea006806f340 00000000424e76b3 ffff880f9e233908 ffffffff81636431
      [159899.172821] ffff880f9e233930 ffffffff81631645 ffffea006806f340 0000000000000000
      [159899.182870] 000fffff00000000 ffff880f9e233978 ffffffff811714dd fff00000fe000000
      [159899.192895] Call Trace:
      [159899.197269] [<ffffffff81636431>] dump_stack+0x19/0x1b
      [159899.204667] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
      [159899.212639] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
      [159899.220965] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
      [159899.229171] [<ffffffff8117200f>] __free_pages+0x3f/0x60
      [159899.236690] [<ffffffffa100bad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
      [159899.245372] [<ffffffffa118284a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
      [159899.254234] [<ffffffffa1186f2d>] ofd_commitrw+0x51d/0xa40 [ofd]
      [159899.262551] [<ffffffffa0d538d5>] obd_commitrw+0x2ec/0x32f [ptlrpc]
      [159899.271488] [<ffffffffa0d2bf71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
      [159899.280509] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
      [159899.288372] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
      [159899.297010] [<ffffffffa0c82560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
      [159899.306746] [<ffffffffa0d28225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      [159899.316058] [<ffffffffa0cd41ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [159899.326348] [<ffffffffa0967128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [159899.335679] [<ffffffffa0cd1d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [159899.345029] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
      [159899.353394] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
      [159899.361264] [<ffffffffa0cd8260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
      [159899.369596] [<ffffffffa0cd77c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
      [159899.379160] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [159899.385881] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [159899.394413] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [159899.401653] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [159899.410157] Disabling lock debugging due to kernel taint
      [163012.964891] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.8@o2ib inode [0x200000406:0x3c5:0x0] object 0x0:44785 extent [67108864-80752639]: client csum 7f08fe36, server csum f8fbfe4c
      [163012.990138] LustreError: Skipped 2 previous similar messages
      [163020.008131] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.8@o2ib inode [0x200000406:0x3d6:0x0] object 0x0:44794 extent [83886080-100270079]: client csum 886feb33, server csum ccc0eb4a
      [163042.829796] -----------[ cut here ]-----------
      [163042.837389] kernel BUG at include/linux/scatterlist.h:65!
      [163042.845758] invalid opcode: 0000 [#1] SMP
      [163042.852645] Modules linked in: nfsv3 nfs_acl raid10 osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mpt3sas ipmi_devintf ipmi_ssif ipmi_si
      [163042.944819] raid_class sb_edac iTCO_wdt iTCO_vendor_support scsi_transport_sas sg edac_core pcspkr ipmi_msghandler wmi ioatdma mei_me mei lpc_ich shpchp i2c_i801 mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_ib mlx4_en ib_sa vxlan ib_mad ip6_udp_tunnel udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul igb crct10dif_common ttm ptp crc32c_intel ahci pps_core drm mlx4_core libahci dca i2c_algo_bit libata i2c_core [last unloaded: zunicode]
      [163043.010335] CPU: 12 PID: 84956 Comm: ll_ost_io00_002 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
      [163043.025057] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
      [163043.038989] task: ffff880fc52bc500 ti: ffff880fc55bc000 task.ti: ffff880fc55bc000
      [163043.049639] RIP: 0010:[<ffffffffa0960fef>] [<ffffffffa0960fef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
      [163043.063453] RSP: 0018:ffff880fc55bfab8 EFLAGS: 00010202
      [163043.071687] RAX: 0000000000000002 RBX: ffff8810f6db9b80 RCX: 0000000000000000
      [163043.081918] RDX: 0000000000000020 RSI: 0000000000000000 RDI: ffff880fc55bfad8
      [163043.092095] RBP: ffff880fc55bfb00 R08: 00000000000195a0 R09: ffff880fc55bfab8
      [163043.103441] R10: ffff88103e807900 R11: 0000000000000001 R12: 3635343332313036
      [163043.113462] R13: 0000000033323130 R14: 0000000000000534 R15: 0000000000000000
      [163043.123487] FS: 0000000000000000(0000) GS:ffff88103ef00000(0000) knlGS:0000000000000000
      [163043.134599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [163043.143101] CR2: 00007fce5afab000 CR3: 000000000194a000 CR4: 00000000001407e0
      [163043.153184] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [163043.163242] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [163043.173280] Stack:
      [163043.177580] 0000000000000002 0000000000000000 0000000000000000 0000000000000000
      [163043.188354] 00000000f43b381e 0000000000000000 ffff880fcc7d1301 ffff880e73ecc200
      [163043.199140] 0000000000000000 ffff880fc55bfb68 ffffffffa0d5345c ffff88202563f0a8
      [163043.209907] Call Trace:
      [163043.215455] [<ffffffffa0d5345c>] tgt_checksum_bulk.isra.33+0x35a/0x4e7 [ptlrpc]
      [163043.226242] [<ffffffffa0d2c21d>] tgt_brw_write+0x114d/0x1640 [ptlrpc]
      [163043.235986] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
      [163043.244558] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
      [163043.254271] [<ffffffffa0c82560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
      [163043.264858] [<ffffffffa0d28225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      [163043.275043] [<ffffffffa0cd41ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [163043.286074] [<ffffffffa0967128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [163043.296175] [<ffffffffa0cd1d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [163043.306194] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
      [163043.315553] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
      [163043.324714] [<ffffffffa0cd8260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
      [163043.334070] [<ffffffffa0cd77c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
      [163043.344635] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [163043.352181] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [163043.361606] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [163043.369571] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [163043.378772] Code: 89 43 38 48 8b 43 20 ff 50 c0 48 8b 55 d8 65 48 33 14 25 28 00 00 00 75 0d 48 83 c4 28 5b 41 5c 41 5d 41 5e 5d c3 e8 61 a0 71 e0 <0f> 0b 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
      [163043.406113] RIP [<ffffffffa0960fef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
      [163043.416991] RSP <ffff880fc55bfab8>

      This happened fairly quickly. After this run I restarted the system and it happened again almost immediately.
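The CPU lines in the two traces above carry different taint strings: `G IOE` in the first and `G B IOE` in the second. A small sketch of how those letters decode, covering only the flags that appear in this report (the table and the function name `decode_taint` are illustrative, not kernel code):

```python
# Subset of kernel taint flags, limited to the letters seen in this report.
TAINT_FLAGS = {
    "P": "proprietary module loaded",
    "G": "only GPL-compatible modules loaded",
    "B": "bad page state was detected",
    "I": "working around a firmware/BIOS bug",
    "O": "out-of-tree module loaded",
    "E": "unsigned module loaded",
}

def decode_taint(flags):
    """Map each non-blank taint letter to a short description."""
    return [TAINT_FLAGS.get(c, "unknown flag " + c)
            for c in flags if not c.isspace()]

for taint in ("G IOE", "G B IOE"):
    print(taint, "->", decode_taint(taint))
```

The extra `B` in the second trace is set when the kernel reports a bad page, so the scatterlist BUG was hit on a kernel that had already logged the earlier "Bad page state" event.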

          Activity

            jsalians_intel John Salinas (Inactive) added a comment (edited):

            Lustre 2.9.0 + ZFS 0.7.0 RC3 (none of our patches), record size 1M on OST0 and 16M on OST1, brw_size=16 on both (raidz). Messages but no crash.

            Manual dumps:
            10.8.1.4-2017-04-15-00:26:17
            10.8.1.4-2017-04-15-01:47:43
            10.8.1.3-2017-04-15-13:22:45
            10.8.1.4-2017-04-15-13:22:47

            wolf-4 OSS

            [  163.434692] Lustre: lsdraid-OST0001: Recovery over after 0:06, of 5 clients 5 recovered and 0 were evicted.
            [  163.480746] Lustre: lsdraid-OST0001: deleting orphan objects from 0x0:720 to 0x0:1025
            [  370.631336] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x3b0:0x0] object 0x0:1225 extent [83886080-92680191]: client csum d5f42113, server csum 1a89e99c
            [  480.339896] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x49c:0x0] object 0x0:4041 extent [33554432-47890431]: client csum e47bcdcb, server csum 86becdcf
            [  488.890964] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x4ae:0x0] object 0x0:5107 extent [67108864-73793535]: client csum b74b30df, server csum 20c030ec
            [  509.914190] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x52f:0x0] object 0x0:6348 extent [33554432-43007999]: client csum cbc76f28, server csum 4b241635
            [  539.505532] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x5be:0x0] object 0x0:7700 extent [67108864-78381055]: client csum b6e2021c, server csum c5ce4f88
            [  560.736133] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x5f1:0x0] object 0x0:8747 extent [67108864-81104895]: client csum ddc22e54, server csum 894f5e1a
            [  618.743576] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x6d0:0x0] object 0x0:11762 extent [67108864-81694719]: client csum 734e4939, server csum 175394a5
            [  618.764867] LustreError: Skipped 1 previous similar message
            [ 1080.395798] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x7fa:0x0] object 0x0:14839 extent [40140800-50331647]: client csum 937c50bf, server csum f71e2e65
            [ 1080.417120] LustreError: Skipped 2 previous similar messages
            [ 3001.142322] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0xd10:0x0] object 0x0:49284 extent [100663296-108527615]: client csum ab9466a8, server csum 10b4e228
            [ 3400.563954] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0xfb0:0x0] object 0x0:54388 extent [67108864-82837503]: client csum 71e8cd52, server csum 35becd53
            [ 3461.970072] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x1052:0x0] object 0x0:55534 extent [67108864-74973183]: client csum c0a766ab, server csum ab5a66bb
            [ 3762.672549] BUG: Bad page state in process ll_ost_io01_003  pfn:182ec6d
            [ 3762.680002] page:ffffea0060bb1b40 count:-1 mapcount:0 mapping:          (null) index:0x0
            [ 3762.689091] page flags: 0x6fffff00000000()
            [ 3762.693727] page dumped because: nonzero _count
            [ 3762.700757] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas lrw gf128mul glue_helper ablk_helper cryptd raid_class scsi_transport_sas mei_me iTCO_wdt ipmi_ssif iTCO_vendor_support mei ipmi_devintf sb_edac sg
            [ 3762.790920]  ioatdma lpc_ich shpchp edac_core pcspkr i2c_i801 ipmi_si mfd_core ipmi_msghandler acpi_pad acpi_power_meter wmi nfsd dm_multipath dm_mod auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb crct10dif_pclmul ptp crct10dif_common ttm ahci crc32c_intel pps_core mlx4_core libahci drm dca i2c_algo_bit libata i2c_core
            [ 3762.850233] CPU: 31 PID: 9096 Comm: ll_ost_io01_003 Tainted: P          IOE  ------------   3.10.0-327.36.3.el7.x86_64 #1
            [ 3762.864178] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [ 3762.877501]  ffffea0060bb1b40 000000006cbfa991 ffff880fd6a47908 ffffffff81636431
            [ 3762.887516]  ffff880fd6a47930 ffffffff81631645 ffffea0060bb1b40 0000000000000000
            [ 3762.897491]  000fffff00000000 ffff880fd6a47978 ffffffff811714dd fff00000fe000000
            [ 3762.907458] Call Trace:
            [ 3762.912046]  [<ffffffff81636431>] dump_stack+0x19/0x1b
            [ 3762.919394]  [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
            [ 3762.927333]  [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
            [ 3762.935630]  [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
            [ 3762.943790]  [<ffffffff8117200f>] __free_pages+0x3f/0x60
            [ 3762.951264]  [<ffffffffa0fa1ad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
            [ 3762.959874]  [<ffffffffa109b84a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
            [ 3762.968646]  [<ffffffffa109ff2d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [ 3762.976868]  [<ffffffffa0e0f8d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [ 3762.985338]  [<ffffffffa0de7f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
            [ 3762.993957]  [<ffffffffa0d3e560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [ 3763.003453]  [<ffffffffa0de4225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [ 3763.012530]  [<ffffffffa0d901ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [ 3763.022429]  [<ffffffffa0a33128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [ 3763.031354]  [<ffffffffa0d8dd68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [ 3763.040220]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [ 3763.048476]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [ 3763.056267]  [<ffffffffa0d94260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [ 3763.064562]  [<ffffffffa0d937c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [ 3763.074037]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [ 3763.080685]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 3763.089162]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [ 3763.096349]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [ 3855.476573] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x12e3:0x0] object 0x0:58439 extent [67108864-82837503]: client csum 71e8cd52, server csum 14e5cd5e
            [ 3923.650281] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x13bc:0x0] object 0x0:59171 extent [33554432-48742399]: client csum 9005f4a9, server csum db87ac4c
            [ 5698.551136] perf interrupt took too long (2521 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
            [ 5904.311835] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x1734:0x0] object 0x0:66681 extent [67108864-80281599]: client csum 1eaa58ca, server csum 44a378f0
            [ 8708.045614] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x1d31:0x0] object 0x0:67733 extent [121729024-134217727]: client csum 99efe98c, server csum e23d22e1
            [ 9738.442312] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x2051:0x0] object 0x0:68278 extent [100663296-116666367]: client csum d42f69dc, server csum 8732074f
            [10448.854337] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x237a:0x0] object 0x0:68809 extent [100663296-112549887]: client csum 7a8b3e1a, server csum 1dbd0291
            [10480.902373] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x2396:0x0] object 0x0:68834 extent [85426176-100663295]: client csum f43a36f0, server csum 9d10e702
            [11720.767365] BUG: Bad page state in process ll_ost_io01_001  pfn:15d132f
            [11720.777259] page:ffffea005744cbc0 count:-1 mapcount:0 mapping:          (null) index:0x0
            [11720.788693] page flags: 0x6fffff00000000()
            [11720.795463] page dumped because: nonzero _count
            [11720.802596] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas lrw gf128mul glue_helper ablk_helper cryptd raid_class scsi_transport_sas mei_me iTCO_wdt ipmi_ssif iTCO_vendor_support mei ipmi_devintf sb_edac sg
            [11720.893130]  ioatdma lpc_ich shpchp edac_core pcspkr i2c_i801 ipmi_si mfd_core ipmi_msghandler acpi_pad acpi_power_meter wmi nfsd dm_multipath dm_mod auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb crct10dif_pclmul ptp crct10dif_common ttm ahci crc32c_intel pps_core mlx4_core libahci drm dca i2c_algo_bit libata i2c_core
            [11720.951749] CPU: 35 PID: 8509 Comm: ll_ost_io01_001 Tainted: P    B     IOE  ------------   3.10.0-327.36.3.el7.x86_64 #1
            [11720.965393] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [11720.978463]  ffffea005744cbc0 00000000a971f860 ffff880fdb6bf908 ffffffff81636431
            [11720.988249]  ffff880fdb6bf930 ffffffff81631645 ffffea005744cbc0 0000000000000000
            [11720.998053]  000fffff00000000 ffff880fdb6bf978 ffffffff811714dd fff00000fe000000
            [11721.007838] Call Trace:
            [11721.012009]  [<ffffffff81636431>] dump_stack+0x19/0x1b
            [11721.019195]  [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
            [11721.026948]  [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
            [11721.035167]  [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
            [11721.043294]  [<ffffffff8117200f>] __free_pages+0x3f/0x60
            [11721.050752]  [<ffffffffa0fa1ad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
            [11721.059372]  [<ffffffffa109b84a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
            [11721.068157]  [<ffffffffa109ff2d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [11721.076424]  [<ffffffffa0e0f8d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [11721.085001]  [<ffffffffa0de7f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
            [11721.094001]  [<ffffffff81632d15>] ? __slab_free+0x10e/0x277
            [11721.101706]  [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
            [11721.109340]  [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
            [11721.117924]  [<ffffffffa0d3e560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [11721.127469]  [<ffffffffa0de4225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [11721.136601]  [<ffffffffa0d901ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [11721.146564]  [<ffffffffa0a33128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [11721.155726]  [<ffffffffa0d8dd68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [11721.164815]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [11721.173099]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [11721.180948]  [<ffffffffa0d94260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [11721.189287]  [<ffffffffa0d937c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [11721.198828]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [11721.205490]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [11721.214017]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [11721.221178]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [11906.409714] perf interrupt took too long (5056 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
            [12369.576466] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x28dd:0x0] object 0x0:69605 extent [100663296-115441663]: client csum 34b2200, server csum 5f29220d
            [12574.297235] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x2a16:0x0] object 0x0:69767 extent [100663296-114409471]: client csum c953b2e4, server csum f3b9a3f5
            [12583.154014] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x2a22:0x0] object 0x0:69773 extent [100663296-117309439]: client csum fa39f722, server csum 17548bac
            

            wolf-3 OSS

            [  702.495373] Lustre: lsdraid-OST0000: Connection restored to 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 (at 192.168.1.6@o2ib)
            [  712.111566] LustreError: 35894:0:(ofd_grant.c:641:ofd_grant_check()) lsdraid-OST0000: cli 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 claims 17432576 GRANT, real grant 0
            [  712.629997] LustreError: 39491:0:(ofd_grant.c:641:ofd_grant_check()) lsdraid-OST0000: cli 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 claims 17432576 GRANT, real grant 0
            [  712.649481] LustreError: 39491:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 8 previous similar messages
            [  713.660785] LustreError: 38266:0:(ofd_grant.c:641:ofd_grant_check()) lsdraid-OST0000: cli 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 claims 17432576 GRANT, real grant 0
            [  713.679875] LustreError: 38266:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 5 previous similar messages
            [  715.665680] LustreError: 38165:0:(ofd_grant.c:641:ofd_grant_check()) lsdraid-OST0000: cli 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 claims 17432576 GRANT, real grant 0
            [  715.685499] LustreError: 38165:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 48 previous similar messages
            [  835.423369] Lustre: lsdraid-OST0000: Connection restored to 4e5e1424-c5a7-dbfe-ccf8-a041ec520cb5 (at 192.168.1.9@o2ib)
            [  835.437468] Lustre: Skipped 2 previous similar messages
            [11228.546836] perf interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
            [28193.720410] LNet: Service thread pid 91775 was inactive for 200.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [28193.743765] Pid: 91775, comm: ll_ost00_010
            [28193.750363] 
            Call Trace:
            [28193.758633]  [<ffffffff8163bb39>] schedule+0x29/0x70
            [28193.765982]  [<ffffffffa05cb2fd>] cv_wait_common+0x10d/0x130 [spl]
            [28193.774687]  [<ffffffff810a6b80>] ? autoremove_wake_function+0x0/0x40
            [28193.783567]  [<ffffffffa05cb335>] __cv_wait+0x15/0x20 [spl]
            [28193.791608]  [<ffffffffa1439c23>] txg_wait_open+0xb3/0xf0 [zfs]
            [28193.799877]  [<ffffffffa13e264d>] dmu_free_long_range+0x25d/0x3d0 [zfs]
            [28193.808919]  [<ffffffffa1092468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
            [28193.818586]  [<ffffffffa10927d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
            [28193.828178]  [<ffffffffa1094dba>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
            [28193.836927]  [<ffffffffa119516f>] ofd_trans_stop+0x1f/0x60 [ofd]
            [28193.845026]  [<ffffffffa1198d82>] ofd_object_destroy+0x2b2/0x890 [ofd]
            [28193.853770]  [<ffffffffa1191987>] ofd_destroy_by_fid+0x307/0x510 [ofd]
            [28193.862440]  [<ffffffffa0cdcbe0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
            [28193.871264]  [<ffffffffa0cd71f0>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
            [28193.880161]  [<ffffffffa1181627>] ofd_destroy_hdl+0x267/0xa50 [ofd]
            [28193.888454]  [<ffffffffa0d6b225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [28193.897329]  [<ffffffffa0d171ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [28193.907053]  [<ffffffffa09c7128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [28193.915785]  [<ffffffffa0d14d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [28193.924476]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [28193.932565]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [28193.940211]  [<ffffffffa0d1b260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [28193.948394]  [<ffffffffa0d1a7c0>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
            [28193.956493]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [28193.963027]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
            [28193.969635]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [28193.976729]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
            
            [28193.985950] LustreError: dumping log to /tmp/lustre-log.1492246924.91775
            [28199.712751] LNet: Service thread pid 91775 completed after 206.29s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [31329.310375] perf interrupt took too long (5002 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
            [root@wolf-3 10.8.1.3-2017-04-14-22:46:09]# 
            
            

            [root@wolf-4 combined]# ps aux |grep 9096
            root 9096 0.6 0.0 0 0 ? S 01:55 4:21 [ll_ost_io01_003]
            root 77386 0.0 0.0 112656 976 pts/0 S+ 12:56 0:00 grep --color=auto 9096
            [root@wolf-4 combined]# man ps
            [root@wolf-4 combined]# ps aux |grep 8509
            root 8509 4.3 0.0 0 0 ? D 01:55 28:56 [ll_ost_io01_001]
            root 84813 0.0 0.0 112656 976 pts/0 S+ 12:57 0:00 grep --color=auto 8509

            [root@wolf-4 combined]# cat /proc/9096/stack
            [<ffffffffa0d8dff5>] ptlrpc_wait_event+0x325/0x340 [ptlrpc]
            [<ffffffffa0d93fcb>] ptlrpc_main+0x80b/0x1de0 [ptlrpc]
            [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [<ffffffffffffffff>] 0xffffffffffffffff
            [root@wolf-4 combined]# cat /proc/8509/stack
            [<ffffffff8108c04f>] usleep_range+0x4f/0x70
            [<ffffffffa269c99a>] dmu_tx_wait+0x33a/0x360 [zfs]
            [<ffffffffa269ca45>] dmu_tx_assign+0x85/0x3f0 [zfs]
            [<ffffffffa0f94fea>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            [<ffffffffa10960db>] ofd_trans_start+0x6b/0xe0 [ofd]
            [<ffffffffa109c0a3>] ofd_commitrw_write+0x943/0x1c20 [ofd]
            [<ffffffffa109ff2d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [<ffffffffa0e0f8d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [<ffffffffa0de7f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
            [<ffffffffa0de4225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [<ffffffffa0d901ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [<ffffffffa0d94260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [<ffffffffffffffff>] 0xffffffffffffffff

            jsalians_intel John Salinas (Inactive) added a comment - edited

            I don't remember BUG: Bad page state in process in LU-9279 but it was a month ago so anything is possible.

            None of the traces here look ZFS related – can you give us any hint on where to look?

            jsalians_intel John Salinas (Inactive) added a comment

            Could the initial dump of LU-9279 be truncated and there's a double free prior to the bad page pointer? That would actually make more sense for a failure scenario.

            utopiabound Nathaniel Clark added a comment

            On Onyx: $ ls -lart /scratch/johnsali/LU-9304.tgz
            -rwxr-xr-x 1 johnsali johnsali 815773487 Apr 14 07:28 /scratch/johnsali/LU-9304.tgz

            jsalians_intel John Salinas (Inactive) added a comment

            I have logins to Onyx and Lola.

            utopiabound Nathaniel Clark added a comment

            Which clusters do you have a login for? I will copy it over to NFS on that cluster.

            jsalians_intel John Salinas (Inactive) added a comment

            How can I get a copy? I don't have a login to wolf currently.

            utopiabound Nathaniel Clark added a comment

            Oh good, we have a dump for this one!

            jsalians_intel John Salinas (Inactive) added a comment

            Yes, looking at this, I would assume they come from the same root cause.

            utopiabound Nathaniel Clark added a comment

            Hi Nate,

            Can you please look into this one? We thought on the triage call that this could be a duplicate of LU-9279. Do you agree?

            Thanks.
            Joe

            jgmitter Joseph Gmitter (Inactive) added a comment

            Here is another one:
            [85463.960467] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0xc8:0x0] object 0x0:493 extent [50331648-66977791]: client csum 26eef72b, server csum 6a2afc80
            [85538.710838] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x130:0x0] object 0x0:545 extent [68812800-83886079]: client csum 7f41af68, server csum f877af67
            [85629.615262] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x30e:0x0] object 0x0:783 extent [67108864-82313215]: client csum bd02b56a, server csum 8f588935
            [85680.448461] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x3df:0x0] object 0x0:887 extent [67108864-81018879]: client csum 54933a67, server csum 31bca8f7
            [87381.228273] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x6d3:0x0] object 0x0:1265 extent [83886080-100007935]: client csum c62adf42, server csum 47f2df45
            [87450.618291] BUG: Bad page state in process ll_ost_io01_018 pfn:1fef99b
            [87450.627834] page:ffffea007fbe66c0 count:-1 mapcount:0 mapping: (null) index:0x0
            [87450.639074] page flags: 0x6fffff00000000()
            [87450.645680] page dumped because: nonzero _count
            [87450.652779] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_ssse3 sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel mpt3sas kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper sb_edac cryptd iTCO_wdt edac_core ipmi_devintf
            [87450.743972] ipmi_ssif mei_me raid_class sg iTCO_vendor_support scsi_transport_sas pcspkr mei ipmi_si ipmi_msghandler ioatdma shpchp lpc_ich i2c_i801 wmi mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt igb drm_kms_helper crct10dif_pclmul ahci crct10dif_common ttm ptp crc32c_intel libahci pps_core drm mlx4_core dca libata i2c_algo_bit i2c_core [last unloaded: zunicode]
            [87450.805273] CPU: 21 PID: 124934 Comm: ll_ost_io01_018 Tainted: G IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
            [87450.819123] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [87450.832223] ffffea007fbe66c0 00000000140992fa ffff8800354cf908 ffffffff81636431
            [87450.842024] ffff8800354cf930 ffffffff81631645 ffffea007fbe66c0 0000000000000000
            [87450.851816] 000fffff00000000 ffff8800354cf978 ffffffff811714dd fff00000fe000000
            [87450.861609] Call Trace:
            [87450.865810] [<ffffffff81636431>] dump_stack+0x19/0x1b
            [87450.873020] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
            [87450.880805] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
            [87450.888959] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
            [87450.897005] [<ffffffff8117200f>] __free_pages+0x3f/0x60
            [87450.904375] [<ffffffffa13c0ad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
            [87450.912902] [<ffffffffa153d84a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
            [87450.921600] [<ffffffffa1541f2d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [87450.929762] [<ffffffffa0da08d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [87450.938190] [<ffffffffa0d78f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
            [87450.946742] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
            [87450.954201] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
            [87450.962643] [<ffffffffa0ccf560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [87450.972101] [<ffffffffa0d75225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [87450.981134] [<ffffffffa0d211ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [87450.991008] [<ffffffffa09a4128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [87451.000321] [<ffffffffa0d1ed68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [87451.009495] [<ffffffffa0d25260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [87451.018091] [<ffffffffa0d247c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [87451.027944] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [87451.034889] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [87451.043631] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [87451.051138] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [87451.059821] Disabling lock debugging due to kernel taint
            [88135.004640] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x9c5:0x0] object 0x0:1640 extent [67108864-83230719]: client csum d48fdf40, server csum 7834d05f
            [88167.103209] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x9f5:0x0] object 0x0:1664 extent [100663296-108920831]: client csum f45b7896, server csum 796e789a
            [88372.104154] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0xae9:0x0] object 0x0:1785 extent [67108864-83099647]: client csum 63d944, server csum 990a54d0
            [89192.783421] -----------[ cut here ]-----------
            [89192.790964] WARNING: at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0()
            [89192.800675] list_del corruption. prev->next should be ffffc906a3d0c010, but was 3635343332313036
            [89192.812702] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_ssse3 sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel mpt3sas kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper sb_edac cryptd iTCO_wdt edac_core ipmi_devintf
            [89192.906561] ipmi_ssif mei_me raid_class sg iTCO_vendor_support scsi_transport_sas pcspkr mei ipmi_si ipmi_msghandler ioatdma shpchp lpc_ich i2c_i801 wmi mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt igb drm_kms_helper crct10dif_pclmul ahci crct10dif_common ttm ptp crc32c_intel libahci pps_core drm mlx4_core dca libata i2c_algo_bit i2c_core [last unloaded: zunicode]
            [89192.971373] CPU: 22 PID: 47821 Comm: z_wr_int_7 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
            [89192.985319] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [89192.999116] ffff880fd3713bc8 00000000c561abf7 ffff880fd3713b80 ffffffff81636431
            [89193.009585] ffff880fd3713bb8 ffffffff8107b260 ffffc906a3d0c010 ffff88202372a660
            [89193.020080] 0000000000000010 0000000000000000 ffff882013de9800 ffff880fd3713c20
            [89193.030560] Call Trace:
            [89193.035444] [<ffffffff81636431>] dump_stack+0x19/0x1b
            [89193.043527] [<ffffffff8107b260>] warn_slowpath_common+0x70/0xb0
            [89193.052574] [<ffffffff8107b2fc>] warn_slowpath_fmt+0x5c/0x80
            [89193.061337] [<ffffffff8130c6a1>] __list_del_entry+0xa1/0xd0
            [89193.069975] [<ffffffff8130c6dd>] list_del+0xd/0x30
            [89193.077745] [<ffffffffa04f056d>] __spl_cache_flush+0xed/0x150 [spl]
            [89193.087183] [<ffffffffa04f0696>] spl_cache_flush+0x36/0x50 [spl]
            [89193.096324] [<ffffffffa04f15a2>] spl_kmem_cache_free+0x1c2/0x1d0 [spl]
            [89193.106221] [<ffffffffa11254fa>] zio_buf_free+0x5a/0x60 [zfs]
            [89193.115119] [<ffffffffa104bba9>] abd_free+0x249/0x270 [zfs]
            [89193.123765] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
            [89193.133434] [<ffffffffa10db5f4>] vdev_raidz_map_free+0x34/0xd0 [zfs]
            [89193.142998] [<ffffffffa10db6e9>] vdev_raidz_map_free_vsd+0x29/0x30 [zfs]
            [89193.152927] [<ffffffffa11265ed>] zio_vdev_io_assess+0x4d/0x250 [zfs]
            [89193.162466] [<ffffffffa112622c>] zio_execute+0x9c/0x100 [zfs]
            [89193.171271] [<ffffffffa04f2ed6>] taskq_thread+0x246/0x470 [spl]
            [89193.180262] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
            [89193.188773] [<ffffffffa04f2c90>] ? taskq_thread_spawn+0x60/0x60 [spl]
            [89193.198360] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [89193.206072] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [89193.215629] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [89193.223914] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [89193.233417] --[ end trace c1da4e4c37ad9549 ]--
            [89193.409308] general protection fault: 0000 [#1] SMP
            [89193.416842] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_ssse3 sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel mpt3sas kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper sb_edac cryptd iTCO_wdt edac_core ipmi_devintf
            [89193.509290] ipmi_ssif mei_me raid_class sg iTCO_vendor_support scsi_transport_sas pcspkr mei ipmi_si ipmi_msghandler ioatdma shpchp lpc_ich i2c_i801 wmi mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt igb drm_kms_helper crct10dif_pclmul ahci crct10dif_common ttm ptp crc32c_intel libahci pps_core drm mlx4_core dca libata i2c_algo_bit i2c_core [last unloaded: zunicode]
            [89193.573354] CPU: 37 PID: 86386 Comm: z_wr_int_7 Tainted: G B W IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
            [89193.587115] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [89193.600690] task: ffff881ecece7300 ti: ffff88176c4c4000 task.ti: ffff88176c4c4000
            [89193.610926] RIP: 0010:[<ffffffff8130c54f>] [<ffffffff8130c54f>] __list_add+0xf/0xc0
            [89193.621652] RSP: 0018:ffff88176c4c7c30 EFLAGS: 00010086
            [89193.629539] RAX: 0000000000380000 RBX: ffffc906a8127000 RCX: 0000000000000004
            [89193.639440] RDX: 3130363534333231 RSI: ffffc906a8127020 RDI: ffffc906a9d2f018
            [89193.649298] RBP: ffff88176c4c7c48 R08: 0000000000000000 R09: 0000000000000000
            [89193.659138] R10: 0000000000000007 R11: 0000000000000000 R12: 3130363534333231
            [89193.668948] R13: ffffc906a8127020 R14: 0000000000000000 R15: ffff882013de9800
            [89193.678753] FS: 0000000000000000(0000) GS:ffff88103f0c0000(0000) knlGS:0000000000000000
            [89193.689643] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [89193.697898] CR2: 00007f2413fce000 CR3: 000000000194a000 CR4: 00000000001407e0
            [89193.707726] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [89193.717569] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            [89193.727403] Stack:
            [89193.731485] ffffc906a8127000 ffff8810252800c0 0000000000000010 ffff88176c4c7c98
            [89193.741719] ffffffffa04f0535 0000000200a3c286 0000003e862d74d4 ffff882013de98a0
            [89193.751956] ffff882013de98b8 ffff882013de9800 ffff8810252800c0 0000000000000002
            [89193.762205] Call Trace:
            [89193.766851] [<ffffffffa04f0535>] __spl_cache_flush+0xb5/0x150 [spl]
            [89193.775877] [<ffffffffa04f0696>] spl_cache_flush+0x36/0x50 [spl]
            [89193.784617] [<ffffffffa04f15a2>] spl_kmem_cache_free+0x1c2/0x1d0 [spl]
            [89193.793997] [<ffffffffa11254fa>] zio_buf_free+0x5a/0x60 [zfs]
            [89193.802468] [<ffffffffa104bba9>] abd_free+0x249/0x270 [zfs]
            [89193.810746] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
            [89193.819797] [<ffffffffa10db5f4>] vdev_raidz_map_free+0x34/0xd0 [zfs]
            [89193.828971] [<ffffffffa10db6e9>] vdev_raidz_map_free_vsd+0x29/0x30 [zfs]
            [89193.838527] [<ffffffffa11265ed>] zio_vdev_io_assess+0x4d/0x250 [zfs]
            [89193.847696] [<ffffffffa112622c>] zio_execute+0x9c/0x100 [zfs]
            [89193.856147] [<ffffffffa04f2ed6>] taskq_thread+0x246/0x470 [spl]
            [89193.864781] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
            [89193.872946] [<ffffffffa04f2c90>] ? taskq_thread_spawn+0x60/0x60 [spl]
            [89193.882186] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [89193.889553] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [89193.898764] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [89193.906707] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [89193.915871] Code: 48 89 df e8 f4 45 eb ff b8 f4 ff ff ff e9 4a ff ff ff b8 f4 ff ff ff e9 40 ff ff ff 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 <4c> 8b 42 08 48 89 fb 49 39 f0 75 2a 4d 8b 45 00 4d 39 c4 75 68
            [89193.942003] RIP [<ffffffff8130c54f>] __list_add+0xf/0xc0
            [89193.949893] RSP <ffff88176c4c7c30>

            /scratch/dumps/wolf-4.wolf.hpdd.intel.com/10.8.1.4-2017-04-06-14:08:00
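A note on the register dump above: RDX and R12 both hold 3130363534333231, which is not a canonical kernel address, and the earlier list_del warning reported prev->next as 3635343332313036. Every byte in both values falls in the printable ASCII digit range, so one possible reading is that buffer data was written over the list pointers. This is an observation about the byte values only, not a confirmed root cause. A minimal sketch for decoding such values follows; the helper name `decode_le_ascii` is hypothetical, and it assumes x86_64 little-endian byte order.

```python
def decode_le_ascii(value_hex: str) -> str:
    """Decode a 64-bit register value (hex string) as little-endian ASCII.

    Non-printable bytes are shown as '.' so the result is always readable.
    """
    # The oops prints the value most-significant byte first; memory on
    # x86_64 holds it least-significant byte first, so reverse the bytes.
    raw = bytes.fromhex(value_hex)[::-1]
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # Values copied from the log: RDX/R12 in the GPF, and the bad
    # prev->next pointer from the list_del corruption warning.
    for value in ("3130363534333231", "3635343332313036"):
        print(value, "->", repr(decode_le_ascii(value)))
```

Both values decode to runs of ASCII digits, which looks more like file or test-pattern payload than a stray pointer. Whether that payload came from the client writes flagged by the BAD WRITE CHECKSUM messages is not something the log alone establishes.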


            People

              utopiabound Nathaniel Clark
              jsalians_intel John Salinas (Inactive)