BUG: Bad page state in process ll_ost_io01_013 pfn:1a01bcd kernel BUG at include/linux/scatterlist.h:65!

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: Lustre 2.10.0
    • Fix Version/s: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Running 4 Lustre clients, 2 OSS nodes (each with 1 zpool), and 1 MDS.
      The zpool on this OSS node:

      # zpool status -v
        pool: ost0
        state: ONLINE
        scan: none requested
        config:

      NAME             STATE   READ WRITE CKSUM
      ost0             ONLINE     0     0     0
        draid1-0{any}  ONLINE     0     0     0
          mpathaj      ONLINE     0     0     0
          mpathai      ONLINE     0     0     0
          mpathah      ONLINE     0     0     0
          mpathag      ONLINE     0     0     0
          mpathaq      ONLINE     0     0     0
          mpathap      ONLINE     0     0     0
          mpathak      ONLINE     0     0     0
          mpathz       ONLINE     0     0     0
          mpatham      ONLINE     0     0     0
          mpathal      ONLINE     0     0     0
          mpathao      ONLINE     0     0     0
      spares
        $draid1-0-s0   AVAIL

      errors: No known data errors

      This build of ZFS was from the coral-prototype branch, and Lustre was built from master as of Dec 1st.

      We were running our file system aging utility, FileAger.py (1-2 copies on each of the 4 client nodes), alongside an IOR run:

      mpirun -wdir /mnt/lustre/ -np 4 -rr -machinefile hosts -env I_MPI_EXTRA_FILESYSTEM=on -env I_MPI_EXTRA_FILESYSTEM_LIST=lustre /home/johnsali/wolf-3/ior/src/ior -a POSIX -F -N 4 -d 2 -i 1 -s 20000 -b 16MB -t 16MB -k -w -r

      While this was running, it appears we hit this failure:

      [159898.950714] BUG: Bad page state in process ll_ost_io01_013 pfn:1a01bcd
      [159898.960045] page:ffffea006806f340 count:-1 mapcount:0 mapping: (null) index:0x0
      [159898.970667] page flags: 0x6fffff00000000()
      [159898.976808] page dumped because: nonzero _count
      [159898.983412] Modules linked in: nfsv3 nfs_acl raid10 osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mpt3sas ipmi_devintf ipmi_ssif ipmi_si
      [159899.072452] raid_class sb_edac iTCO_wdt iTCO_vendor_support scsi_transport_sas sg edac_core pcspkr ipmi_msghandler wmi ioatdma mei_me mei lpc_ich shpchp i2c_i801 mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_ib mlx4_en ib_sa vxlan ib_mad ip6_udp_tunnel udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul igb crct10dif_common ttm ptp crc32c_intel ahci pps_core drm mlx4_core libahci dca i2c_algo_bit libata i2c_core [last unloaded: zunicode]
      [159899.135473] CPU: 57 PID: 98747 Comm: ll_ost_io01_013 Tainted: G IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
      [159899.149461] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
      [159899.162801] ffffea006806f340 00000000424e76b3 ffff880f9e233908 ffffffff81636431
      [159899.172821] ffff880f9e233930 ffffffff81631645 ffffea006806f340 0000000000000000
      [159899.182870] 000fffff00000000 ffff880f9e233978 ffffffff811714dd fff00000fe000000
      [159899.192895] Call Trace:
      [159899.197269] [<ffffffff81636431>] dump_stack+0x19/0x1b
      [159899.204667] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
      [159899.212639] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
      [159899.220965] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
      [159899.229171] [<ffffffff8117200f>] __free_pages+0x3f/0x60
      [159899.236690] [<ffffffffa100bad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
      [159899.245372] [<ffffffffa118284a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
      [159899.254234] [<ffffffffa1186f2d>] ofd_commitrw+0x51d/0xa40 [ofd]
      [159899.262551] [<ffffffffa0d538d5>] obd_commitrw+0x2ec/0x32f [ptlrpc]
      [159899.271488] [<ffffffffa0d2bf71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
      [159899.280509] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
      [159899.288372] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
      [159899.297010] [<ffffffffa0c82560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
      [159899.306746] [<ffffffffa0d28225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      [159899.316058] [<ffffffffa0cd41ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [159899.326348] [<ffffffffa0967128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [159899.335679] [<ffffffffa0cd1d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [159899.345029] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
      [159899.353394] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
      [159899.361264] [<ffffffffa0cd8260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
      [159899.369596] [<ffffffffa0cd77c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
      [159899.379160] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [159899.385881] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [159899.394413] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [159899.401653] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [159899.410157] Disabling lock debugging due to kernel taint
      [163012.964891] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.8@o2ib inode [0x200000406:0x3c5:0x0] object 0x0:44785 extent [67108864-80752639]: client csum 7f08fe36, server csum f8fbfe4c
      [163012.990138] LustreError: Skipped 2 previous similar messages
      [163020.008131] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.8@o2ib inode [0x200000406:0x3d6:0x0] object 0x0:44794 extent [83886080-100270079]: client csum 886feb33, server csum ccc0eb4a
      [163042.829796] -----------[ cut here ]-----------
      [163042.837389] kernel BUG at include/linux/scatterlist.h:65!
      [163042.845758] invalid opcode: 0000 [#1] SMP
      [163042.852645] Modules linked in: nfsv3 nfs_acl raid10 osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mpt3sas ipmi_devintf ipmi_ssif ipmi_si
      [163042.944819] raid_class sb_edac iTCO_wdt iTCO_vendor_support scsi_transport_sas sg edac_core pcspkr ipmi_msghandler wmi ioatdma mei_me mei lpc_ich shpchp i2c_i801 mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_ib mlx4_en ib_sa vxlan ib_mad ip6_udp_tunnel udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul igb crct10dif_common ttm ptp crc32c_intel ahci pps_core drm mlx4_core libahci dca i2c_algo_bit libata i2c_core [last unloaded: zunicode]
      [163043.010335] CPU: 12 PID: 84956 Comm: ll_ost_io00_002 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
      [163043.025057] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
      [163043.038989] task: ffff880fc52bc500 ti: ffff880fc55bc000 task.ti: ffff880fc55bc000
      [163043.049639] RIP: 0010:[<ffffffffa0960fef>] [<ffffffffa0960fef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
      [163043.063453] RSP: 0018:ffff880fc55bfab8 EFLAGS: 00010202
      [163043.071687] RAX: 0000000000000002 RBX: ffff8810f6db9b80 RCX: 0000000000000000
      [163043.081918] RDX: 0000000000000020 RSI: 0000000000000000 RDI: ffff880fc55bfad8
      [163043.092095] RBP: ffff880fc55bfb00 R08: 00000000000195a0 R09: ffff880fc55bfab8
      [163043.103441] R10: ffff88103e807900 R11: 0000000000000001 R12: 3635343332313036
      [163043.113462] R13: 0000000033323130 R14: 0000000000000534 R15: 0000000000000000
      [163043.123487] FS: 0000000000000000(0000) GS:ffff88103ef00000(0000) knlGS:0000000000000000
      [163043.134599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [163043.143101] CR2: 00007fce5afab000 CR3: 000000000194a000 CR4: 00000000001407e0
      [163043.153184] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [163043.163242] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [163043.173280] Stack:
      [163043.177580] 0000000000000002 0000000000000000 0000000000000000 0000000000000000
      [163043.188354] 00000000f43b381e 0000000000000000 ffff880fcc7d1301 ffff880e73ecc200
      [163043.199140] 0000000000000000 ffff880fc55bfb68 ffffffffa0d5345c ffff88202563f0a8
      [163043.209907] Call Trace:
      [163043.215455] [<ffffffffa0d5345c>] tgt_checksum_bulk.isra.33+0x35a/0x4e7 [ptlrpc]
      [163043.226242] [<ffffffffa0d2c21d>] tgt_brw_write+0x114d/0x1640 [ptlrpc]
      [163043.235986] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
      [163043.244558] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
      [163043.254271] [<ffffffffa0c82560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
      [163043.264858] [<ffffffffa0d28225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      [163043.275043] [<ffffffffa0cd41ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [163043.286074] [<ffffffffa0967128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [163043.296175] [<ffffffffa0cd1d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [163043.306194] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
      [163043.315553] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
      [163043.324714] [<ffffffffa0cd8260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
      [163043.334070] [<ffffffffa0cd77c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
      [163043.344635] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [163043.352181] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [163043.361606] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [163043.369571] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [163043.378772] Code: 89 43 38 48 8b 43 20 ff 50 c0 48 8b 55 d8 65 48 33 14 25 28 00 00 00 75 0d 48 83 c4 28 5b 41 5c 41 5d 41 5e 5d c3 e8 61 a0 71 e0 <0f> 0b 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
      [163043.406113] RIP [<ffffffffa0960fef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
      [163043.416991] RSP <ffff880fc55bfab8>

      This happened fairly quickly. After this run I restarted the system and it happened again almost immediately.
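
As a quick sanity check on the first report, the `page:` address and the `pfn:` value printed by the kernel should describe the same physical page. The sketch below assumes (the report does not state this) the conventional x86_64 vmemmap base of 0xffffea0000000000 for this kernel generation and a 64-byte struct page. The helper names are hypothetical and exist only for illustration.

```python
# Assumptions, not taken from the report:
#   - x86_64 vmemmap starts at 0xffffea0000000000 on this kernel
#   - sizeof(struct page) is 64 bytes
VMEMMAP_BASE = 0xFFFFEA0000000000
STRUCT_PAGE_SIZE = 64


def pfn_to_page(pfn: int) -> int:
    """Return the struct page address for a page frame number."""
    return VMEMMAP_BASE + pfn * STRUCT_PAGE_SIZE


def page_to_pfn(page: int) -> int:
    """Return the page frame number for a struct page address."""
    return (page - VMEMMAP_BASE) // STRUCT_PAGE_SIZE


# Values copied from the "Bad page state" message above.
reported_pfn = 0x1A01BCD
reported_page = 0xFFFFEA006806F340

print(hex(pfn_to_page(reported_pfn)))
print(pfn_to_page(reported_pfn) == reported_page)
```

If the two values agree under these assumptions, the bad-page report is internally consistent, and `count:-1` would then read as one reference drop too many on an otherwise valid page, which would be consistent with a double free.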

          Activity

            [LU-9304] BUG: Bad page state in process ll_ost_io01_013 pfn:1a01bcd kernel BUG at include/linux/scatterlist.h:65!

            utopiabound Nathaniel Clark added a comment - Could the initial dump of LU-9279 be truncated, with a double free occurring before the bad page pointer? That would actually make more sense for a failure scenario.

            jsalians_intel John Salinas (Inactive) added a comment - On Onyx:
            $ ls -lart /scratch/johnsali/LU-9304.tgz
            -rwxr-xr-x 1 johnsali johnsali 815773487 Apr 14 07:28 /scratch/johnsali/LU-9304.tgz

            utopiabound Nathaniel Clark added a comment - I have logins to Onyx and Lola.

            jsalians_intel John Salinas (Inactive) added a comment - Which clusters do you have a login for? I will copy it over to NFS on that cluster.

            utopiabound Nathaniel Clark added a comment - How can I get a copy? I don't have a login to wolf currently.

            jsalians_intel John Salinas (Inactive) added a comment - Oh good, we have a dump for this one!

            utopiabound Nathaniel Clark added a comment - Yes, looking at this, I would assume they come from the same root cause.

            jgmitter Joseph Gmitter (Inactive) added a comment - Hi Nate,

            Can you please look into this one? We thought on the triage call that this could be a duplicate of LU-9279. Do you agree?

            Thanks.
            Joe

            Here is another one:
            [85463.960467] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0xc8:0x0] object 0x0:493 extent [50331648-66977791]: client csum 26eef72b, server csum 6a2afc80
            [85538.710838] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x130:0x0] object 0x0:545 extent [68812800-83886079]: client csum 7f41af68, server csum f877af67
            [85629.615262] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x30e:0x0] object 0x0:783 extent [67108864-82313215]: client csum bd02b56a, server csum 8f588935
            [85680.448461] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x3df:0x0] object 0x0:887 extent [67108864-81018879]: client csum 54933a67, server csum 31bca8f7
            [87381.228273] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x6d3:0x0] object 0x0:1265 extent [83886080-100007935]: client csum c62adf42, server csum 47f2df45
            [87450.618291] BUG: Bad page state in process ll_ost_io01_018 pfn:1fef99b
            [87450.627834] page:ffffea007fbe66c0 count:-1 mapcount:0 mapping: (null) index:0x0
            [87450.639074] page flags: 0x6fffff00000000()
            [87450.645680] page dumped because: nonzero _count
            [87450.652779] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_ssse3 sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel mpt3sas kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper sb_edac cryptd iTCO_wdt edac_core ipmi_devintf
            [87450.743972] ipmi_ssif mei_me raid_class sg iTCO_vendor_support scsi_transport_sas pcspkr mei ipmi_si ipmi_msghandler ioatdma shpchp lpc_ich i2c_i801 wmi mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt igb drm_kms_helper crct10dif_pclmul ahci crct10dif_common ttm ptp crc32c_intel libahci pps_core drm mlx4_core dca libata i2c_algo_bit i2c_core [last unloaded: zunicode]
            [87450.805273] CPU: 21 PID: 124934 Comm: ll_ost_io01_018 Tainted: G IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
            [87450.819123] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [87450.832223] ffffea007fbe66c0 00000000140992fa ffff8800354cf908 ffffffff81636431
            [87450.842024] ffff8800354cf930 ffffffff81631645 ffffea007fbe66c0 0000000000000000
            [87450.851816] 000fffff00000000 ffff8800354cf978 ffffffff811714dd fff00000fe000000
            [87450.861609] Call Trace:
            [87450.865810] [<ffffffff81636431>] dump_stack+0x19/0x1b
            [87450.873020] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
            [87450.880805] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
            [87450.888959] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
            [87450.897005] [<ffffffff8117200f>] __free_pages+0x3f/0x60
            [87450.904375] [<ffffffffa13c0ad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
            [87450.912902] [<ffffffffa153d84a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
            [87450.921600] [<ffffffffa1541f2d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [87450.929762] [<ffffffffa0da08d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [87450.938190] [<ffffffffa0d78f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
            [87450.946742] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
            [87450.954201] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
            [87450.962643] [<ffffffffa0ccf560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [87450.972101] [<ffffffffa0d75225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [87450.981134] [<ffffffffa0d211ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [87450.991008] [<ffffffffa09a4128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [87451.000321] [<ffffffffa0d1ed68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [87451.009495] [<ffffffffa0d25260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [87451.018091] [<ffffffffa0d247c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [87451.027944] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [87451.034889] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [87451.043631] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [87451.051138] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [87451.059821] Disabling lock debugging due to kernel taint
            [88135.004640] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x9c5:0x0] object 0x0:1640 extent [67108864-83230719]: client csum d48fdf40, server csum 7834d05f
            [88167.103209] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x9f5:0x0] object 0x0:1664 extent [100663296-108920831]: client csum f45b7896, server csum 796e789a
            [88372.104154] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0xae9:0x0] object 0x0:1785 extent [67108864-83099647]: client csum 63d944, server csum 990a54d0
            [89192.783421] -----------[ cut here ]-----------
            [89192.790964] WARNING: at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0()
            [89192.800675] list_del corruption. prev->next should be ffffc906a3d0c010, but was 3635343332313036
            [89192.812702] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_ssse3 sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel mpt3sas kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper sb_edac cryptd iTCO_wdt edac_core ipmi_devintf
            [89192.906561] ipmi_ssif mei_me raid_class sg iTCO_vendor_support scsi_transport_sas pcspkr mei ipmi_si ipmi_msghandler ioatdma shpchp lpc_ich i2c_i801 wmi mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt igb drm_kms_helper crct10dif_pclmul ahci crct10dif_common ttm ptp crc32c_intel libahci pps_core drm mlx4_core dca libata i2c_algo_bit i2c_core [last unloaded: zunicode]
            [89192.971373] CPU: 22 PID: 47821 Comm: z_wr_int_7 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
            [89192.985319] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [89192.999116] ffff880fd3713bc8 00000000c561abf7 ffff880fd3713b80 ffffffff81636431
            [89193.009585] ffff880fd3713bb8 ffffffff8107b260 ffffc906a3d0c010 ffff88202372a660
            [89193.020080] 0000000000000010 0000000000000000 ffff882013de9800 ffff880fd3713c20
            [89193.030560] Call Trace:
            [89193.035444] [<ffffffff81636431>] dump_stack+0x19/0x1b
            [89193.043527] [<ffffffff8107b260>] warn_slowpath_common+0x70/0xb0
            [89193.052574] [<ffffffff8107b2fc>] warn_slowpath_fmt+0x5c/0x80
            [89193.061337] [<ffffffff8130c6a1>] __list_del_entry+0xa1/0xd0
            [89193.069975] [<ffffffff8130c6dd>] list_del+0xd/0x30
            [89193.077745] [<ffffffffa04f056d>] __spl_cache_flush+0xed/0x150 [spl]
            [89193.087183] [<ffffffffa04f0696>] spl_cache_flush+0x36/0x50 [spl]
            [89193.096324] [<ffffffffa04f15a2>] spl_kmem_cache_free+0x1c2/0x1d0 [spl]
            [89193.106221] [<ffffffffa11254fa>] zio_buf_free+0x5a/0x60 [zfs]
            [89193.115119] [<ffffffffa104bba9>] abd_free+0x249/0x270 [zfs]
            [89193.123765] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
            [89193.133434] [<ffffffffa10db5f4>] vdev_raidz_map_free+0x34/0xd0 [zfs]
            [89193.142998] [<ffffffffa10db6e9>] vdev_raidz_map_free_vsd+0x29/0x30 [zfs]
            [89193.152927] [<ffffffffa11265ed>] zio_vdev_io_assess+0x4d/0x250 [zfs]
            [89193.162466] [<ffffffffa112622c>] zio_execute+0x9c/0x100 [zfs]
            [89193.171271] [<ffffffffa04f2ed6>] taskq_thread+0x246/0x470 [spl]
            [89193.180262] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
            [89193.188773] [<ffffffffa04f2c90>] ? taskq_thread_spawn+0x60/0x60 [spl]
            [89193.198360] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [89193.206072] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [89193.215629] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [89193.223914] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [89193.233417] --[ end trace c1da4e4c37ad9549 ]--
            [89193.409308] general protection fault: 0000 [#1] SMP
            [89193.416842] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_ssse3 sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel mpt3sas kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper sb_edac cryptd iTCO_wdt edac_core ipmi_devintf
            [89193.509290] ipmi_ssif mei_me raid_class sg iTCO_vendor_support scsi_transport_sas pcspkr mei ipmi_si ipmi_msghandler ioatdma shpchp lpc_ich i2c_i801 wmi mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt igb drm_kms_helper crct10dif_pclmul ahci crct10dif_common ttm ptp crc32c_intel libahci pps_core drm mlx4_core dca libata i2c_algo_bit i2c_core [last unloaded: zunicode]
            [89193.573354] CPU: 37 PID: 86386 Comm: z_wr_int_7 Tainted: G B W IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
            [89193.587115] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [89193.600690] task: ffff881ecece7300 ti: ffff88176c4c4000 task.ti: ffff88176c4c4000
            [89193.610926] RIP: 0010:[<ffffffff8130c54f>] [<ffffffff8130c54f>] __list_add+0xf/0xc0
            [89193.621652] RSP: 0018:ffff88176c4c7c30 EFLAGS: 00010086
            [89193.629539] RAX: 0000000000380000 RBX: ffffc906a8127000 RCX: 0000000000000004
            [89193.639440] RDX: 3130363534333231 RSI: ffffc906a8127020 RDI: ffffc906a9d2f018
            [89193.649298] RBP: ffff88176c4c7c48 R08: 0000000000000000 R09: 0000000000000000
            [89193.659138] R10: 0000000000000007 R11: 0000000000000000 R12: 3130363534333231
            [89193.668948] R13: ffffc906a8127020 R14: 0000000000000000 R15: ffff882013de9800
            [89193.678753] FS: 0000000000000000(0000) GS:ffff88103f0c0000(0000) knlGS:0000000000000000
            [89193.689643] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [89193.697898] CR2: 00007f2413fce000 CR3: 000000000194a000 CR4: 00000000001407e0
            [89193.707726] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [89193.717569] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            [89193.727403] Stack:
            [89193.731485] ffffc906a8127000 ffff8810252800c0 0000000000000010 ffff88176c4c7c98
            [89193.741719] ffffffffa04f0535 0000000200a3c286 0000003e862d74d4 ffff882013de98a0
            [89193.751956] ffff882013de98b8 ffff882013de9800 ffff8810252800c0 0000000000000002
            [89193.762205] Call Trace:
            [89193.766851] [<ffffffffa04f0535>] __spl_cache_flush+0xb5/0x150 [spl]
            [89193.775877] [<ffffffffa04f0696>] spl_cache_flush+0x36/0x50 [spl]
            [89193.784617] [<ffffffffa04f15a2>] spl_kmem_cache_free+0x1c2/0x1d0 [spl]
            [89193.793997] [<ffffffffa11254fa>] zio_buf_free+0x5a/0x60 [zfs]
            [89193.802468] [<ffffffffa104bba9>] abd_free+0x249/0x270 [zfs]
            [89193.810746] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
            [89193.819797] [<ffffffffa10db5f4>] vdev_raidz_map_free+0x34/0xd0 [zfs]
            [89193.828971] [<ffffffffa10db6e9>] vdev_raidz_map_free_vsd+0x29/0x30 [zfs]
            [89193.838527] [<ffffffffa11265ed>] zio_vdev_io_assess+0x4d/0x250 [zfs]
            [89193.847696] [<ffffffffa112622c>] zio_execute+0x9c/0x100 [zfs]
            [89193.856147] [<ffffffffa04f2ed6>] taskq_thread+0x246/0x470 [spl]
            [89193.864781] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
            [89193.872946] [<ffffffffa04f2c90>] ? taskq_thread_spawn+0x60/0x60 [spl]
            [89193.882186] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [89193.889553] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [89193.898764] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [89193.906707] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [89193.915871] Code: 48 89 df e8 f4 45 eb ff b8 f4 ff ff ff e9 4a ff ff ff b8 f4 ff ff ff e9 40 ff ff ff 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 <4c> 8b 42 08 48 89 fb 49 39 f0 75 2a 4d 8b 45 00 4d 39 c4 75 68
            [89193.942003] RIP [<ffffffff8130c54f>] __list_add+0xf/0xc0
            [89193.949893] RSP <ffff88176c4c7c30>

            /scratch/dumps/wolf-4.wolf.hpdd.intel.com/10.8.1.4-2017-04-06-14:08:00
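
The bogus pointer values in this log (`3635343332313036` in the list_del warning, `3130363534333231` in RDX and R12) look like printable ASCII rather than kernel addresses. A minimal sketch of that check follows, assuming the values are 64-bit little-endian words as they would sit in x86_64 memory. `le_ascii` is a hypothetical helper, not part of any tool mentioned in this ticket.

```python
def le_ascii(value: int, width: int = 8) -> str:
    """Decode an integer as little-endian bytes, showing non-printables as '.'."""
    raw = value.to_bytes(width, "little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Values copied from the log above.
suspect_values = {
    "prev->next": 0x3635343332313036,
    "RDX/R12": 0x3130363534333231,
}

for name, value in suspect_values.items():
    print(name, repr(le_ascii(value)))
# prev->next '60123456'
# RDX/R12 '12345601'
```

Both values decode to runs of the digits 0 through 6, which may indicate that file payload data overwrote the SPL cache list pointers. That reading is an inference from the register contents, not something the report confirms.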

            jsalians_intel John Salinas (Inactive) added a comment - Here is another one (kernel log and dump path quoted in full above).
dump_stack+0x19/0x1b [89193.043527] [<ffffffff8107b260>] warn_slowpath_common+0x70/0xb0 [89193.052574] [<ffffffff8107b2fc>] warn_slowpath_fmt+0x5c/0x80 [89193.061337] [<ffffffff8130c6a1>] __list_del_entry+0xa1/0xd0 [89193.069975] [<ffffffff8130c6dd>] list_del+0xd/0x30 [89193.077745] [<ffffffffa04f056d>] __spl_cache_flush+0xed/0x150 [spl] [89193.087183] [<ffffffffa04f0696>] spl_cache_flush+0x36/0x50 [spl] [89193.096324] [<ffffffffa04f15a2>] spl_kmem_cache_free+0x1c2/0x1d0 [spl] [89193.106221] [<ffffffffa11254fa>] zio_buf_free+0x5a/0x60 [zfs] [89193.115119] [<ffffffffa104bba9>] abd_free+0x249/0x270 [zfs] [89193.123765] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0 [89193.133434] [<ffffffffa10db5f4>] vdev_raidz_map_free+0x34/0xd0 [zfs] [89193.142998] [<ffffffffa10db6e9>] vdev_raidz_map_free_vsd+0x29/0x30 [zfs] [89193.152927] [<ffffffffa11265ed>] zio_vdev_io_assess+0x4d/0x250 [zfs] [89193.162466] [<ffffffffa112622c>] zio_execute+0x9c/0x100 [zfs] [89193.171271] [<ffffffffa04f2ed6>] taskq_thread+0x246/0x470 [spl] [89193.180262] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20 [89193.188773] [<ffffffffa04f2c90>] ? taskq_thread_spawn+0x60/0x60 [spl] [89193.198360] [<ffffffff810a5b8f>] kthread+0xcf/0xe0 [89193.206072] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [89193.215629] [<ffffffff81646a98>] ret_from_fork+0x58/0x90 [89193.223914] [<ffffffff810a5ac0>] ? 
kthread_create_on_node+0x140/0x140 [89193.233417] -- [ end trace c1da4e4c37ad9549 ] -- [89193.409308] general protection fault: 0000 1 SMP [89193.416842] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_ssse3 sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel mpt3sas kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper sb_edac cryptd iTCO_wdt edac_core ipmi_devintf [89193.509290] ipmi_ssif mei_me raid_class sg iTCO_vendor_support scsi_transport_sas pcspkr mei ipmi_si ipmi_msghandler ioatdma shpchp lpc_ich i2c_i801 wmi mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt igb drm_kms_helper crct10dif_pclmul ahci crct10dif_common ttm ptp crc32c_intel libahci pps_core drm mlx4_core dca libata i2c_algo_bit i2c_core [last unloaded: zunicode] [89193.573354] CPU: 37 PID: 86386 Comm: z_wr_int_7 Tainted: G B W IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1 [89193.587115] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [89193.600690] task: ffff881ecece7300 ti: ffff88176c4c4000 task.ti: ffff88176c4c4000 [89193.610926] RIP: 0010: [<ffffffff8130c54f>] [<ffffffff8130c54f>] __list_add+0xf/0xc0 [89193.621652] RSP: 0018:ffff88176c4c7c30 EFLAGS: 00010086 
[89193.629539] RAX: 0000000000380000 RBX: ffffc906a8127000 RCX: 0000000000000004 [89193.639440] RDX: 3130363534333231 RSI: ffffc906a8127020 RDI: ffffc906a9d2f018 [89193.649298] RBP: ffff88176c4c7c48 R08: 0000000000000000 R09: 0000000000000000 [89193.659138] R10: 0000000000000007 R11: 0000000000000000 R12: 3130363534333231 [89193.668948] R13: ffffc906a8127020 R14: 0000000000000000 R15: ffff882013de9800 [89193.678753] FS: 0000000000000000(0000) GS:ffff88103f0c0000(0000) knlGS:0000000000000000 [89193.689643] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [89193.697898] CR2: 00007f2413fce000 CR3: 000000000194a000 CR4: 00000000001407e0 [89193.707726] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [89193.717569] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [89193.727403] Stack: [89193.731485] ffffc906a8127000 ffff8810252800c0 0000000000000010 ffff88176c4c7c98 [89193.741719] ffffffffa04f0535 0000000200a3c286 0000003e862d74d4 ffff882013de98a0 [89193.751956] ffff882013de98b8 ffff882013de9800 ffff8810252800c0 0000000000000002 [89193.762205] Call Trace: [89193.766851] [<ffffffffa04f0535>] __spl_cache_flush+0xb5/0x150 [spl] [89193.775877] [<ffffffffa04f0696>] spl_cache_flush+0x36/0x50 [spl] [89193.784617] [<ffffffffa04f15a2>] spl_kmem_cache_free+0x1c2/0x1d0 [spl] [89193.793997] [<ffffffffa11254fa>] zio_buf_free+0x5a/0x60 [zfs] [89193.802468] [<ffffffffa104bba9>] abd_free+0x249/0x270 [zfs] [89193.810746] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0 [89193.819797] [<ffffffffa10db5f4>] vdev_raidz_map_free+0x34/0xd0 [zfs] [89193.828971] [<ffffffffa10db6e9>] vdev_raidz_map_free_vsd+0x29/0x30 [zfs] [89193.838527] [<ffffffffa11265ed>] zio_vdev_io_assess+0x4d/0x250 [zfs] [89193.847696] [<ffffffffa112622c>] zio_execute+0x9c/0x100 [zfs] [89193.856147] [<ffffffffa04f2ed6>] taskq_thread+0x246/0x470 [spl] [89193.864781] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20 [89193.872946] [<ffffffffa04f2c90>] ? 
taskq_thread_spawn+0x60/0x60 [spl] [89193.882186] [<ffffffff810a5b8f>] kthread+0xcf/0xe0 [89193.889553] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [89193.898764] [<ffffffff81646a98>] ret_from_fork+0x58/0x90 [89193.906707] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [89193.915871] Code: 48 89 df e8 f4 45 eb ff b8 f4 ff ff ff e9 4a ff ff ff b8 f4 ff ff ff e9 40 ff ff ff 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 <4c> 8b 42 08 48 89 fb 49 39 f0 75 2a 4d 8b 45 00 4d 39 c4 75 68 [89193.942003] RIP [<ffffffff8130c54f>] __list_add+0xf/0xc0 [89193.949893] RSP <ffff88176c4c7c30> /scratch/dumps/wolf-4.wolf.hpdd.intel.com/10.8.1.4-2017-04-06-14:08:00
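
            The two bad pointer values in the trace above (`prev->next ... was 3635343332313036` in the list_del warning, and `3130363534333231` in RDX/R12 at the general protection fault) consist entirely of bytes in the ASCII digit range. The following is a minimal Python sketch for decoding them, assuming the little-endian byte order of x86_64; the helper name `decode_le_ascii` is hypothetical and not part of any Lustre or ZFS tooling.

```python
def decode_le_ascii(value: int, width: int = 8) -> str:
    """Read a register/pointer value as little-endian bytes and render it as ASCII.

    Non-printable bytes are shown as '.' so ordinary pointers stay recognisable.
    """
    raw = value.to_bytes(width, byteorder="little")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


# Values copied from the log above.
bad_prev_next = 0x3635343332313036  # list_del corruption: "but was ..."
bad_rdx_r12 = 0x3130363534333231    # RDX and R12 at the GPF in __list_add

print(decode_le_ascii(bad_prev_next))  # 60123456
print(decode_le_ascii(bad_rdx_r12))    # 12345601
```

            Both strings fit a repeating `0123456` digit sequence. That would be consistent with payload data having overwritten the SPL cache list pointers, but this is an inference from the register values, not something the log proves.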

            Hit this again with Lustre 2.9.0 and ZFS RC3 + dRAID/Metadata Segregation:

            [35316.591117] BUG: Bad page state in process ll_ost_io01_020 pfn:1c17405
            [35316.598572] page:ffffea00705d0140 count:-1 mapcount:0 mapping: (null) index:0x0
            [35316.607674] page flags: 0x6fffff00000000()
            [35316.612314] page dumped because: nonzero _count
            [35316.617415] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas lrw gf128mul glue_helper ablk_helper cryptd raid_class scsi_transport_sas mei_me iTCO_wdt iTCO_vendor_support lpc_ich mei sg shpchp ipmi_ssif ipmi_devintf
            [35316.698570] sb_edac ioatdma ipmi_si edac_core mfd_core pcspkr i2c_i801 ipmi_msghandler wmi acpi_power_meter acpi_pad dm_multipath dm_mod nfsd nfs_acl lockd grace binfmt_misc auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt igb drm_kms_helper crct10dif_pclmul ttm crct10dif_common ptp crc32c_intel ahci pps_core mlx4_core drm libahci dca i2c_algo_bit libata i2c_core
            [35316.753391] CPU: 21 PID: 62915 Comm: ll_ost_io01_020 Tainted: G IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
            [35316.767182] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [35316.780218] ffffea00705d0140 00000000be615ee9 ffff880d1ee53908 ffffffff81636431
            [35316.789972] ffff880d1ee53930 ffffffff81631645 ffffea00705d0140 0000000000000000
            [35316.799723] 000fffff00000000 ffff880d1ee53978 ffffffff811714dd fff00000fe000000
            [35316.809454] Call Trace:
            [35316.813567] [<ffffffff81636431>] dump_stack+0x19/0x1b
            [35316.820677] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
            [35316.828342] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
            [35316.836431] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
            [35316.844392] [<ffffffff8117200f>] __free_pages+0x3f/0x60
            [35316.851674] [<ffffffffa0fbead3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
            [35316.860120] [<ffffffffa10b884a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
            [35316.868709] [<ffffffffa10bcf2d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [35316.876752] [<ffffffffa0e2c8d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [35316.885066] [<ffffffffa0e04f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
            [35316.893554] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
            [35316.900939] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
            [35316.909327] [<ffffffffa0d5b560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [35316.918682] [<ffffffffa0e01225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [35316.927668] [<ffffffffa0dad1ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [35316.937769] [<ffffffffa09f2128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [35316.946877] [<ffffffffa0daad68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [35316.955969] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
            [35316.964579] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
            [35316.972520] [<ffffffffa0db1260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
            [35316.980768] [<ffffffffa0db07c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [35316.990532] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
            [35316.997352] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [35317.005890] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
            [35317.013064] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
            [35317.021494] Disabling lock debugging due to kernel taint

            /scratch/dumps/wolf-3.wolf.hpdd.intel.com/10.8.1.3-2017-04-06-19:44:09

            jsalians_intel John Salinas (Inactive) added the comment above.
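
            Both "Bad page state" traces on this page (PID 124934 on the earlier run and PID 62915 on the Lustre 2.9.0 run) reach `__free_pages` from `osd_bufs_put+0x123/0x1f0 [osd_zfs]`. To compare traces like these without the timestamps and load addresses getting in the way, one option is to keep only the reliable frames (those not prefixed with `?`). The sketch below assumes the 3.10-style `[<addr>] func+0xoff/0xlen [module]` frame format seen in these logs; `reliable_frames` is a hypothetical helper name.

```python
import re

# One stack frame: [<addr>] optional '?' marker, symbol+0xoff/0xlen, optional [module].
# The module group must start with a letter so a following "[35316.85...]" timestamp
# is never mistaken for a module name.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[([A-Za-z_]\w*)\])?"
)


def reliable_frames(log_text):
    """Return (function, module) pairs for frames not marked unreliable with '?'."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        if match.group(1):  # '?' means the unwinder was not sure about this frame
            continue
        frames.append((match.group(2), match.group(3) or "kernel"))
    return frames


# Lines copied from the 2.9.0 trace above.
sample = """
[35316.844392] [<ffffffff8117200f>] __free_pages+0x3f/0x60
[35316.851674] [<ffffffffa0fbead3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
[35316.860120] [<ffffffffa10b884a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
[35316.893554] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
"""

for func, module in reliable_frames(sample):
    print(f"{func} [{module}]")
```

            Running the same function over two traces and comparing the resulting lists makes it easy to see whether they share a call path, independent of where the modules were loaded.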

            kalyana Kalyana Chadalavada (Inactive) added a comment - John Salinas to retest with new stack.

            People

              utopiabound Nathaniel Clark
              jsalians_intel John Salinas (Inactive)