Details
Type: Bug
Resolution: Duplicate
Priority: Critical
[root@wolf-3 10.8.1.3-2017-04-06-19:44:09]# rpm -qa |grep -i lustre
kmod-lustre-tests-2.9.0_dirty-1.el7.centos.x86_64
lustre-tests-2.9.0_dirty-1.el7.centos.x86_64
lustre-osd-zfs-mount-2.9.0_dirty-1.el7.centos.x86_64
lustre-2.9.0_dirty-1.el7.centos.x86_64
lustre-iokit-2.9.0_dirty-1.el7.centos.x86_64
kmod-lustre-2.9.0_dirty-1.el7.centos.x86_64
kmod-lustre-osd-zfs-2.9.0_dirty-1.el7.centos.x86_64
lustre-debuginfo-2.9.0_dirty-1.el7.centos.x86_64
[root@wolf-3 10.8.1.3-2017-04-06-19:44:09]# rpm -qa |grep -i zfs
libzfs2-0.7.0-rc3_29_g48659df.el7.centos.x86_64
kmod-zfs-0.7.0-rc3_29_g48659df.el7.centos.x86_64
zfs-debuginfo-0.7.0-rc3_29_g48659df.el7.centos.x86_64
lustre-osd-zfs-mount-2.9.0_dirty-1.el7.centos.x86_64
zfs-0.7.0-rc3_29_g48659df.el7.centos.x86_64
zfs-test-0.7.0-rc3_29_g48659df.el7.centos.x86_64
kmod-lustre-osd-zfs-2.9.0_dirty-1.el7.centos.x86_64
zfs-kmod-debuginfo-0.7.0-rc3_29_g48659df.el7.centos.x86_64
4 Clients over IB to 2 OSS and 1 MDS.
OSS each have 1 OST:
quick_oss1.sh:zpool create -f -o ashift=12 -o cachefile=none -O recordsize=16MB ost0 draid2 cfg=test_2_5_4_18_draidcfg.nvl mpathaa mpathab mpathac mpathad mpathae mpathaf mpathag mpathah mpathai mpathaj mpathak mpathal mpatham mpathan mpathao mpathap mpathaq mpathar
quick_oss1.sh:zpool status -v ost0
quick_oss1.sh:zpool feature@large_blocks=enabled ost0
quick_oss1.sh:zpool get all ost0 |grep large_blocks
quick_oss2.sh:zpool create -f -o ashift=12 -o cachefile=none -O recordsize=16MB ost1 draid2 cfg=test_2_5_4_18_draidcfg.nvl mpatha mpathb mpathc mpathd mpathe mpathf mpathg mpathh mpathi mpathj mpathk mpathl mpathm mpathn mpatho mpathp mpathq mpathr
quick_oss2.sh:zpool status -v ost1
quick_oss2.sh:zpool feature@large_blocks=enabled ost1
quick_oss2.sh:zpool get all ost1 |grep large_blocks
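As an aside, the `feature@large_blocks` lines grepped from the scripts above are missing a subcommand (`zpool feature@...` is not valid zpool syntax); presumably the intended commands were of the form:

```shell
# Presumed intent of the quick_oss scripts' large_blocks lines
# ("zpool set" is the subcommand that takes feature@... properties):
zpool set feature@large_blocks=enabled ost0
zpool get feature@large_blocks ost0
```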
Description
Running 4 Lustre clients, 2 OSS nodes each with 1 zpool, and 1 MDS.
This OSS node:
- zpool status -v
pool: ost0
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
ost0 ONLINE 0 0 0
draid1-0 ONLINE 0 0 0
mpathaj ONLINE 0 0 0
mpathai ONLINE 0 0 0
mpathah ONLINE 0 0 0
mpathag ONLINE 0 0 0
mpathaq ONLINE 0 0 0
mpathap ONLINE 0 0 0
mpathak ONLINE 0 0 0
mpathz ONLINE 0 0 0
mpatham ONLINE 0 0 0
mpathal ONLINE 0 0 0
mpathao ONLINE 0 0 0
spares
$draid1-0-s0 AVAIL
errors: No known data errors
This build of ZFS was from the coral-prototype branch, and Lustre was built from master as of Dec 1st.
We were running our file system aging utility, FileAger.py (1-2 copies on each of the 4 client nodes), alongside an IOR run: mpirun -wdir /mnt/lustre/ -np 4 -rr -machinefile hosts -env I_MPI_EXTRA_FILESYSTEM=on -env I_MPI_EXTRA_FILESYSTEM_LIST=lustre /home/johnsali/wolf-3/ior/src/ior -a POSIX -F -N 4 -d 2 -i 1 -s 20000 -b 16MB -t 16MB -k -w -r
While this was running we appear to have hit this failure.
[159898.950714] BUG: Bad page state in process ll_ost_io01_013 pfn:1a01bcd
[159898.960045] page:ffffea006806f340 count:-1 mapcount:0 mapping: (null) index:0x0
[159898.970667] page flags: 0x6fffff00000000()
[159898.976808] page dumped because: nonzero _count
[159898.983412] Modules linked in: nfsv3 nfs_acl raid10 osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mpt3sas ipmi_devintf ipmi_ssif ipmi_si
[159899.072452] raid_class sb_edac iTCO_wdt iTCO_vendor_support scsi_transport_sas sg edac_core pcspkr ipmi_msghandler wmi ioatdma mei_me mei lpc_ich shpchp i2c_i801 mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_ib mlx4_en ib_sa vxlan ib_mad ip6_udp_tunnel udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul igb crct10dif_common ttm ptp crc32c_intel ahci pps_core drm mlx4_core libahci dca i2c_algo_bit libata i2c_core [last unloaded: zunicode]
[159899.135473] CPU: 57 PID: 98747 Comm: ll_ost_io01_013 Tainted: G IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
[159899.149461] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[159899.162801] ffffea006806f340 00000000424e76b3 ffff880f9e233908 ffffffff81636431
[159899.172821] ffff880f9e233930 ffffffff81631645 ffffea006806f340 0000000000000000
[159899.182870] 000fffff00000000 ffff880f9e233978 ffffffff811714dd fff00000fe000000
[159899.192895] Call Trace:
[159899.197269] [<ffffffff81636431>] dump_stack+0x19/0x1b
[159899.204667] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
[159899.212639] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
[159899.220965] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
[159899.229171] [<ffffffff8117200f>] __free_pages+0x3f/0x60
[159899.236690] [<ffffffffa100bad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
[159899.245372] [<ffffffffa118284a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
[159899.254234] [<ffffffffa1186f2d>] ofd_commitrw+0x51d/0xa40 [ofd]
[159899.262551] [<ffffffffa0d538d5>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[159899.271488] [<ffffffffa0d2bf71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
[159899.280509] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
[159899.288372] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
[159899.297010] [<ffffffffa0c82560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[159899.306746] [<ffffffffa0d28225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[159899.316058] [<ffffffffa0cd41ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[159899.326348] [<ffffffffa0967128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[159899.335679] [<ffffffffa0cd1d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[159899.345029] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
[159899.353394] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
[159899.361264] [<ffffffffa0cd8260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[159899.369596] [<ffffffffa0cd77c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[159899.379160] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
[159899.385881] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[159899.394413] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
[159899.401653] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[159899.410157] Disabling lock debugging due to kernel taint
[163012.964891] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.8@o2ib inode [0x200000406:0x3c5:0x0] object 0x0:44785 extent [67108864-80752639]: client csum 7f08fe36, server csum f8fbfe4c
[163012.990138] LustreError: Skipped 2 previous similar messages
[163020.008131] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.8@o2ib inode [0x200000406:0x3d6:0x0] object 0x0:44794 extent [83886080-100270079]: client csum 886feb33, server csum ccc0eb4a
[163042.829796] ------------[ cut here ]------------
[163042.837389] kernel BUG at include/linux/scatterlist.h:65!
[163042.845758] invalid opcode: 0000 [#1] SMP
[163042.852645] Modules linked in: nfsv3 nfs_acl raid10 osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mpt3sas ipmi_devintf ipmi_ssif ipmi_si
[163042.944819] raid_class sb_edac iTCO_wdt iTCO_vendor_support scsi_transport_sas sg edac_core pcspkr ipmi_msghandler wmi ioatdma mei_me mei lpc_ich shpchp i2c_i801 mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_ib mlx4_en ib_sa vxlan ib_mad ip6_udp_tunnel udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul igb crct10dif_common ttm ptp crc32c_intel ahci pps_core drm mlx4_core libahci dca i2c_algo_bit libata i2c_core [last unloaded: zunicode]
[163043.010335] CPU: 12 PID: 84956 Comm: ll_ost_io00_002 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
[163043.025057] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[163043.038989] task: ffff880fc52bc500 ti: ffff880fc55bc000 task.ti: ffff880fc55bc000
[163043.049639] RIP: 0010:[<ffffffffa0960fef>] [<ffffffffa0960fef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
[163043.063453] RSP: 0018:ffff880fc55bfab8 EFLAGS: 00010202
[163043.071687] RAX: 0000000000000002 RBX: ffff8810f6db9b80 RCX: 0000000000000000
[163043.081918] RDX: 0000000000000020 RSI: 0000000000000000 RDI: ffff880fc55bfad8
[163043.092095] RBP: ffff880fc55bfb00 R08: 00000000000195a0 R09: ffff880fc55bfab8
[163043.103441] R10: ffff88103e807900 R11: 0000000000000001 R12: 3635343332313036
[163043.113462] R13: 0000000033323130 R14: 0000000000000534 R15: 0000000000000000
[163043.123487] FS: 0000000000000000(0000) GS:ffff88103ef00000(0000) knlGS:0000000000000000
[163043.134599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[163043.143101] CR2: 00007fce5afab000 CR3: 000000000194a000 CR4: 00000000001407e0
[163043.153184] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[163043.163242] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[163043.173280] Stack:
[163043.177580] 0000000000000002 0000000000000000 0000000000000000 0000000000000000
[163043.188354] 00000000f43b381e 0000000000000000 ffff880fcc7d1301 ffff880e73ecc200
[163043.199140] 0000000000000000 ffff880fc55bfb68 ffffffffa0d5345c ffff88202563f0a8
[163043.209907] Call Trace:
[163043.215455] [<ffffffffa0d5345c>] tgt_checksum_bulk.isra.33+0x35a/0x4e7 [ptlrpc]
[163043.226242] [<ffffffffa0d2c21d>] tgt_brw_write+0x114d/0x1640 [ptlrpc]
[163043.235986] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
[163043.244558] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
[163043.254271] [<ffffffffa0c82560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[163043.264858] [<ffffffffa0d28225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[163043.275043] [<ffffffffa0cd41ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[163043.286074] [<ffffffffa0967128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[163043.296175] [<ffffffffa0cd1d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[163043.306194] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
[163043.315553] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
[163043.324714] [<ffffffffa0cd8260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[163043.334070] [<ffffffffa0cd77c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[163043.344635] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
[163043.352181] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[163043.361606] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
[163043.369571] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[163043.378772] Code: 89 43 38 48 8b 43 20 ff 50 c0 48 8b 55 d8 65 48 33 14 25 28 00 00 00 75 0d 48 83 c4 28 5b 41 5c 41 5d 41 5e 5d c3 e8 61 a0 71 e0 <0f> 0b 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[163043.406113] RIP [<ffffffffa0960fef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
[163043.416991] RSP <ffff880fc55bfab8>
This happened fairly quickly. After this run I restarted the system and it happened again almost immediately.
Activity
Here is a test case:
#!/usr/bin/python3.4
import os
import uuid
import tempfile

def WriteFileBasic(multiplysize, blocksize, writechar, path):
    """
    Writes a file sequentially, filling it with writechar. Each chunk is
    writechar repeated multiplysize times, and blocksize chunks are written,
    so the file size is len(writechar) * multiplysize * blocksize bytes.
    """
    writethis = writechar * multiplysize
    unique_filename = uuid.uuid4()
    filetowrite = path + '/basic-' + str(unique_filename)
    fd = open(filetowrite, 'wb')
    for x in range(blocksize):
        fd.write(bytes(writethis, 'UTF-8'))
    fd.close()

directory = tempfile.mkdtemp(dir="/mnt/lustre")
if not os.path.exists(directory):
    os.makedirs(directory)
print("Writing files to: {0}".format(directory))
for i in range(100):
    for x in range(2):
        WriteFileBasic(multiplysize=204800, blocksize=1024, writechar='0123456', path=directory)
    WriteFileBasic(multiplysize=204800, blocksize=1024, writechar='0123456', path=directory)
    WriteFileBasic(multiplysize=204800, blocksize=128, writechar='0', path=directory)
    WriteFileBasic(multiplysize=204800, blocksize=128, writechar='0', path=directory)
    WriteFileBasic(multiplysize=204800, blocksize=1024, writechar='0123456', path=directory)
    WriteFileBasic(multiplysize=124, blocksize=1024, writechar='0', path=directory)
I start multiple copies of that on one client node, at least 2, though 4 sometimes seems to make this come out a little quicker.
Here is an example from tonight running this:
mpirun -np 4 -wdir /mnt/lustre -machinefile hosts -env I_MPI_EXTRA_FILESYSTEM=on -env I_MPI_EXTRA_FILESYSTEM_LIST=lustre /home/johnsali/wolf-3/ior/src/ior -a POSIX -F -N 4 -d 2 -i 1 -s 20480 -b 8m -t 8m
./testcase_pagestate.py &
./testcase_pagestate.py &
almost immediately:
[103132.581173] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000bd0:0xa98:0x0] object 0x0:1018 extent [83886080-92979199]: client csum 3e2f59b2, server csum cf13a5a5
But it took several minutes to get:
[103332.411485] BUG: Bad page state in process ll_ost_io01_000 pfn:171cd2c
[103332.420695] page:ffffea005c734b00 count:-1 mapcount:0 mapping: (null) index:0x0
[103332.431396] page flags: 0x6fffff00000000()
[103332.437504] page dumped because: nonzero _count
[103332.444053] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt mpt3sas iTCO_vendor_support lpc_ich ipmi_ssif sb_edac ipmi_devintf mei_me raid_class scsi_transport_sas sg mei edac_core i2c_i801 mfd_core pcspkr ipmi_si ioatdma shpchp ipmi_msghandler acpi_pad acpi_power_meter wmi dm_multipath dm_mod binfmt_misc nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb crct10dif_pclmul ttm crct10dif_common ahci ptp crc32c_intel libahci pps_core drm mlx4_core dca libata i2c_algo_bit i2c_core
[103332.588503] CPU: 58 PID: 7806 Comm: ll_ost_io01_000 Tainted: P B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
[103332.601978] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[103332.614871] ffffea005c734b00 000000002ff1b307 ffff880ff5e47908 ffffffff81636431
[103332.624444] ffff880ff5e47930 ffffffff81631645 ffffea005c734b00 0000000000000000
[103332.634042] 000fffff00000000 ffff880ff5e47978 ffffffff811714dd fff00000fe000000
[103332.643624] Call Trace:
[103332.647573] [<ffffffff81636431>] dump_stack+0x19/0x1b
[103332.654528] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
[103332.662089] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
[103332.670015] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
[103332.677872] [<ffffffff8117200f>] __free_pages+0x3f/0x60
[103332.685071] [<ffffffffa0aeead3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
[103332.693421] [<ffffffffa0b5c84a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
[103332.701942] [<ffffffffa0b60f2d>] ofd_commitrw+0x51d/0xa40 [ofd]
[103332.709944] [<ffffffffa0ece8d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[103332.718231] [<ffffffffa0ea6f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
[103332.726897] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
[103332.734244] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
[103332.742599] [<ffffffffa0dfd560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[103332.751928] [<ffffffffa0ea3225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[103332.761126] [<ffffffffa0e4f1ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[103332.770928] [<ffffffffa09c1128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[103332.779767] [<ffffffffa0e4cd68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[103332.788569] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
[103332.796803] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
[103332.804581] [<ffffffffa0e53260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[103332.812826] [<ffffffffa0e527c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[103332.822297] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
[103332.828979] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[103332.837491] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
[103332.844738] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
The case is basically: write large file, large file, very small file (or large file, very small file) while streaming IO from IOR is going on.
For the dumps in 0.7.0 RC3: after an OST process hits "BUG: Bad page state in process ll_ost_io01" but the OSS node doesn't crash, is that process "hung" / dead forever?
Do you have any idea why in the one case the OSS node hits the bad page state in stock RC3 but not the node crash/reboot? Is that just different configuration options (like panicking on this error), or is it a symptom that something additional is wrong in beta-coral-combined?
Is there any way to isolate the memory for ptlrpc? I am not sure how to figure out what is stepping on these values. I can run these same tests directly on ZFS without issues; figuring out the interactions between ZFS and Lustre is a bit challenging.
I did a mostly successful run with Lustre 2.9.0 + 0.7.0 RC3 (none of our patches), record size 1M on both OSS nodes (which forces a max RPC size of 1MB), both raidz. With this combination I did not see "Bad page state in process XXX". My assumption based on these runs is that the 16MB ZFS record size and the associated 16MB brw size are the issue.
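For what it's worth, the record-size-to-RPC relationship lines up with the max_pages_per_rpc values used in the runs described later; a quick sketch of the arithmetic (my own, assuming 4 KiB pages):

```python
# Relate bulk RPC (brw) size to page count, assuming 4 KiB pages.
PAGE_SIZE = 4096  # bytes

def pages_per_rpc(brw_bytes):
    # Number of 4 KiB pages carried by one bulk RPC of the given size.
    return brw_bytes // PAGE_SIZE

print(pages_per_rpc(1 * 1024 * 1024))   # 1 MB RPC  -> 256 pages
print(pages_per_rpc(16 * 1024 * 1024))  # 16 MB RPC -> 4096 pages
```

A 16 MB brw thus needs a 4096-entry page vector per RPC, versus 256 entries for the 1 MB case that did not reproduce the problem.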
Looking at "wolf-3 OSS 10.8.1.3-2017-04-15-00:39:07":
Dump the backtrace with stack:
crash> bt -f
...
 #9 [ffff880fdb36bab0] cfs_crypto_hash_update_page at ffffffffa0813fce [libcfs]
    ffff880fdb36bab8: 3433323130363536 3130363500000332
    ffff880fdb36bac8: 0000000000000000 0000000000000000
    ffff880fdb36bad8: 00000000b643a6e1 0000000000000000
    ffff880fdb36bae8: ffff882027752a01 ffff881dc0c9a600
    ffff880fdb36baf8: 0000000000000000 ffff880fdb36bb68
    ffff880fdb36bb08: ffffffffa0eb7459
#10 [ffff880fdb36bb08] tgt_checksum_bulk at ffffffffa0eb7459 [ptlrpc]
    ffff880fdb36bb10: ffff881969ae18a8 ffff880fd957e780
    ffff880fdb36bb20: 00000004810b8940 ffff881d1a0d1c80
    ffff880fdb36bb30: dead000000200200 00000000b643a6e1
    ffff880fdb36bb40: ffff8817ffd40050 ffff882027752a80
    ffff880fdb36bb50: ffff881f26180000 0000000000000000
    ffff880fdb36bb60: ffff8818dee4cc00 ffff880fdb36bcd0
    ffff880fdb36bb70: ffffffffa0e9021d
...
Hunt until we find the ptlrpc_bulk_desc:
crash> struct ptlrpc_bulk_desc ffff881dc0c9a600
struct ptlrpc_bulk_desc {
  bd_failure = 0,
  bd_registered = 0,
  bd_lock = { { rlock = { raw_lock = { { head_tail = 4587590, tickets = { head = 70, tail = 70 } } } } } },
  bd_import_generation = 0,
  bd_type = 41,
  bd_portal = 8,
  bd_export = 0xffff8818dee4cc00,
  bd_import = 0x0,
  bd_req = 0xffff8817ffd40050,
  bd_frag_ops = 0xffffffffa0ec16a0 <ptlrpc_bulk_kiov_nopin_ops>,
  bd_waitq = {
    lock = { { rlock = { raw_lock = { { head_tail = 393222, tickets = { head = 6, tail = 6 } } } } } },
    task_list = { next = 0xffff881dc0c9a640, prev = 0xffff881dc0c9a640 }
  },
  bd_iov_count = 3872,
  bd_max_iov = 3872,
  bd_nob = 15859712,
  bd_nob_transferred = 15859712,
  bd_last_mbits = 0,
  bd_cbid = { cbid_fn = 0xffffffffa0e30b80 <reply_out_callback+736>, cbid_arg = 0xffff881dc0c9a600 },
  bd_sender = 1407378115789062,
  bd_md_count = 0,
  bd_md_max_brw = 16,
  bd_mds = {{ cookie = 237797 }, { cookie = 237805 }, { cookie = 237813 }, { cookie = 237821 },
            { cookie = 237829 }, { cookie = 237837 }, { cookie = 237845 }, { cookie = 237853 },
            { cookie = 237861 }, { cookie = 237869 }, { cookie = 237877 }, { cookie = 237885 },
            { cookie = 237893 }, { cookie = 237901 }, { cookie = 237909 }, { cookie = 237917 }},
  bd_u = {
    bd_kiov = { bd_enc_vec = 0x0, bd_vec = 0xffff881be3ca0000 },
    bd_kvec = { bd_enc_kvec = 0x0, bd_kvec = 0xffff881be3ca0000 }
  }
}
Since we know the page pointer in the bd_vec is the issue:
crash> lnet_kiov_t ffff881be3ca0000
struct lnet_kiov_t {
  kiov_page = 0x3433323130363534,
  kiov_len = 825243189,
  kiov_offset = 892613426
}
The kiov_page isn't even remotely a valid kernel pointer. I'll work on tracking down where the bad values could have come from.
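It isn't random garbage, though: reinterpreted as little-endian bytes, the clobbered kiov_page (and R12 in the earlier oops) decodes to the '0123456' fill pattern the test case writes, which suggests bulk file data is overwriting the kiov array. A quick sanity check of my own (not from the dump tooling):

```python
# Reinterpret the bogus kiov_page value from the crash dump as
# little-endian bytes; it is exactly the test case's '0123456' fill data.
bad_kiov_page = 0x3433323130363534
print(bad_kiov_page.to_bytes(8, "little"))  # b'45601234'

# R12 in the first oops shows the same repeating pattern:
bad_r12 = 0x3635343332313036
print(bad_r12.to_bytes(8, "little"))  # b'60123456'
```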
The above dumps are now on onyx:
/scratch/johnsali/
-rw-r--r-- 1 johnsali johnsali 430M Apr 15 08:24 10.8.1.3-2017-04-14-224609.tgz
-rw-r--r-- 1 johnsali johnsali 142M Apr 15 08:24 10.8.1.3-2017-04-15-003907.tgz
-rw-r--r-- 1 johnsali johnsali 707M Apr 15 08:26 10.8.1.3-2017-04-15-132245.tgz
-rw-r--r-- 1 johnsali johnsali 274M Apr 15 08:27 10.8.1.4-2017-04-15-002617.tgz
-rw-r--r-- 1 johnsali johnsali 485M Apr 15 08:29 10.8.1.4-2017-04-15-014743.tgz
-rw-r--r-- 1 johnsali johnsali 782M Apr 15 08:31 10.8.1.4-2017-04-15-132247.tgz
Yesterday I tried the following combinations:
Lustre 2.9.0 + latest coral_beta_combined record size 16M brw_size=16 draid zfs_abd_scatter_enabled = 0, max_pages_per_rpc=4096 – crash 10.8.1.3-2017-04-14-22:46:09
Lustre 2.9.0 + latest coral_beta_combined record size 16M brw_size=16 draid zfs_abd_scatter_enabled = 0, max_pages_per_rpc=256 – crash 10.8.1.3-2017-04-15-00:39:07
wolf-3 OSS 10.8.1.3-2017-04-14-22:46:09
[147931.299899] Lustre: lsdraid-OST0000: new disk, initializing
[147931.307239] Lustre: srv-lsdraid-OST0000: No data found on store. Initialize space
[147936.355608] Lustre: lsdraid-OST0000: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
[147963.624729] Lustre: lsdraid-OST0000: Connection restored to bd4f4e40-dbac-a829-f1fd-3c4450a08dcb (at 192.168.1.6@o2ib)
[147970.995882] Lustre: lsdraid-OST0000: Connection restored to b9fbce4c-a90b-3f7f-770e-f9863c38efb5 (at 192.168.1.8@o2ib)
[147975.210049] Lustre: lsdraid-OST0000: Connection restored to 862f84d1-bf42-0dd3-ba54-1e1a9568317e (at 192.168.1.7@o2ib)
[147975.223042] Lustre: Skipped 1 previous similar message
[148306.620448] Lustre: lsdraid-OST0000: Connection restored to b9fbce4c-a90b-3f7f-770e-f9863c38efb5 (at 192.168.1.8@o2ib)
[148306.633674] Lustre: Skipped 1 previous similar message
[233987.779195] perf interrupt took too long (10163 > 9615), lowering kernel.perf_event_max_sample_rate to 13000
[414188.327658] Lustre: 83697:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1492208717/real 1492208717] req@ffff880f11ac8300 x1564414877971952/t0(0) o39->lsdraid-MDT0000-lwp-OST0000@192.168.1.5@o2ib:12/10 lens 224/224 e 0 to 1 dl 1492208723 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[414188.364839] Lustre: Failing over lsdraid-OST0000
[414192.689319] Lustre: 118209:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1492208721/real 1492208721] req@ffff8815f6846f00 x1564414877971968/t0(0) o400->MGC192.168.1.5@o2ib@192.168.1.5@o2ib:26/25 lens 224/224 e 0 to 1 dl 1492208728 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[414194.373337] Lustre: 83697:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1492208723/real 1492208723] req@ffff880f11ac8300 x1564414877972032/t0(0) o251->MGC192.168.1.5@o2ib@192.168.1.5@o2ib:26/25 lens 224/224 e 0 to 1 dl 1492208729 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[414194.411850] Lustre: server umount lsdraid-OST0000 complete
[414368.256969] Lustre: lsdraid-OST0000: new disk, initializing
[414368.265405] Lustre: srv-lsdraid-OST0000: No data found on store. Initialize space
[414375.147139] Lustre: lsdraid-OST0000: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
[414533.259382] Lustre: Failing over lsdraid-OST0000
[414533.276260] Lustre: server umount lsdraid-OST0000 complete
[414724.001373] Lustre: lsdraid-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
[414725.696637] Lustre: lsdraid-OST0000: Will be in recovery for at least 2:30, or until 1 client reconnects
[414725.709414] Lustre: lsdraid-OST0000: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
[414725.874431] Lustre: lsdraid-OST0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
[415336.132350] Lustre: lsdraid-OST0000: Connection restored to bd4f4e40-dbac-a829-f1fd-3c4450a08dcb (at 192.168.1.6@o2ib)
[415406.632740] ------------[ cut here ]------------
[415406.633861] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000401:0x7a:0x0] object 0x0:88 extent [50331648-57343999]: client csum 41b33fd5, server csum 649d3feb
[415406.665939] kernel BUG at include/linux/scatterlist.h:65!
[415406.674352] invalid opcode: 0000 [#1] SMP
[415406.681344] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null xfs libcrc32c rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm dm_service_time ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper mpt3sas ablk_helper cryptd raid_class scsi_transport_sas ipmi_devintf ipmi_ssif iTCO_wdt
[415406.776798] sg pcspkr iTCO_vendor_support ipmi_si ipmi_msghandler mei_me sb_edac acpi_power_meter ioatdma lpc_ich edac_core acpi_pad shpchp mei wmi i2c_i801 mfd_core dm_multipath dm_mod nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb crct10dif_pclmul ttm crct10dif_common ptp crc32c_intel ahci pps_core drm mlx4_core libahci dca i2c_algo_bit libata i2c_core [last unloaded: zunicode]
[415406.848441] CPU: 29 PID: 89865 Comm: ll_ost_io01_000 Tainted: G IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
[415406.863708] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[415406.878344] task: ffff8817d96e5c00 ti: ffff881a6b35c000 task.ti: ffff881a6b35c000
[415406.889651] RIP: 0010:[<ffffffffa0c0cfef>] [<ffffffffa0c0cfef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
[415406.903951] RSP: 0018:ffff881a6b35fab8 EFLAGS: 00010202
[415406.912870] RAX: 0000000000000002 RBX: ffff8820050b5900 RCX: 0000000000000000
[415406.923849] RDX: 0000000000000020 RSI: 0000000000000000 RDI: ffff881a6b35fad8
[415406.934787] RBP: ffff881a6b35fb00 R08: 00000000000195a0 R09: ffff881a6b35fab8
[415406.945693] R10: ffff88103e807900 R11: 0000000000000001 R12: 3534333231303635
[415406.956568] R13: 0000000032313036 R14: 0000000000000433 R15: 0000000000000000
[415406.967407] FS: 0000000000000000(0000) GS:ffff88203e6c0000(0000) knlGS:0000000000000000
[415406.979287] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[415406.988490] CR2: 00007fc89400b008 CR3: 000000000194a000 CR4: 00000000001407e0
[415406.999227] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[415407.009940] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[415407.020607] Stack:
[415407.025494] 0000000000000002 0000000000000000 0000000000000000 0000000000000000
[415407.036487] 00000000ced088e5 0000000000000000 ffff882024772701 ffff880db7053000
[415407.047418] 0000000000000000 ffff881a6b35fb68 ffffffffa0f8e459 ffff8819d6ea98a8
[415407.058319] Call Trace:
[415407.063640] [<ffffffffa0f8e459>] tgt_checksum_bulk.isra.33+0x35a/0x4e7 [ptlrpc]
[415407.074501] [<ffffffffa0f6721d>] tgt_brw_write+0x114d/0x1640 [ptlrpc]
[415407.084323] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
[415407.092958] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
[415407.102588] [<ffffffffa0ebd560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[415407.113192] [<ffffffffa0f63225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[415407.123952] [<ffffffffa0f0f1ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[415407.135575] [<ffffffffa0c13128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[415407.146329] [<ffffffffa0f0cd68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[415407.156963] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
[415407.166363] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
[415407.175301] [<ffffffffa0f13260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[415407.184635] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
[415407.193114] [<ffffffffa0f127c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[415407.204113] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
[415407.212374] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[415407.222423] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
[415407.231187] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[415407.241105] Code: 89 43 38 48 8b 43 20 ff 50 c0 48 8b 55 d8 65 48 33 14 25 28 00 00 00 75 0d 48 83 c4 28 5b 41 5c 41 5d 41 5e 5d c3 e8 61 e0 46 e0 <0f> 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[415407.268624] RIP [<ffffffffa0c0cfef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
[415407.279914] RSP <ffff881a6b35fab8>
wolf-3 OSS 10.8.1.3-2017-04-15-00:39:07
[ 6415.538534] Lustre: lsdraid-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450 [ 6422.155237] Lustre: lsdraid-OST0000: Will be in recovery for at least 2:30, or until 1 client reconnects [ 6422.165992] Lustre: lsdraid-OST0000: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) [ 6422.291438] Lustre: lsdraid-OST0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted. [ 6422.301549] Lustre: lsdraid-OST0000: deleting orphan objects from 0x0:91 to 0x0:129 [ 6474.856831] Lustre: lsdraid-OST0000: Connection restored to (at 192.168.1.8@o2ib) [ 6565.960924] BUG: Bad page state in process ll_ost_io01_007 pfn:18eecce [ 6565.961668] BUG: Bad page state in process ll_ost_io01_006 pfn:18eecca [ 6565.961672] page:ffffea0063bb3280 count:-1 mapcount:0 mapping: (null) index:0x0 [ 6565.961674] page flags: 0x6fffff00000000() [ 6565.961675] page dumped because: nonzero _count [ 6565.961726] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas raid_class lrw gf128mul glue_helper ablk_helper cryptd scsi_transport_sas iTCO_wdt iTCO_vendor_support mei_me ipmi_devintf ipmi_ssif lpc_ich sb_edac ipmi_si [ 6565.961778] edac_core sg ipmi_msghandler mei shpchp pcspkr ioatdma mfd_core i2c_i801 wmi acpi_pad acpi_power_meter nfsd dm_multipath dm_mod nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en 
vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ptp crct10dif_pclmul pps_core crct10dif_common ahci drm crc32c_intel dca libahci mlx4_core i2c_algo_bit libata i2c_core [ 6565.961782] CPU: 31 PID: 10886 Comm: ll_ost_io01_006 Tainted: G IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1 [ 6565.961784] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [ 6565.961792] ffffea0063bb3280 000000008d05d0f2 ffff88202236f6f8 ffffffff81636431 [ 6565.961797] ffff88202236f720 ffffffff81631645 ffff88203e759c68 0000000000003735 [ 6565.961803] 0000000000000001 ffff88202236f828 ffffffff81173028 ffff881022e59370 [ 6565.961804] Call Trace: [ 6565.961819] [<ffffffff81636431>] dump_stack+0x19/0x1b [ 6565.961824] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc [ 6565.961833] [<ffffffff81173028>] get_page_from_freelist+0x848/0x9b0 [ 6565.961844] [<ffffffffa06cadaa>] ? spl_kmem_free+0x2a/0x40 [spl] [ 6565.961848] [<ffffffff81173327>] __alloc_pages_nodemask+0x197/0xba0 [ 6565.961862] [<ffffffffa01f9f02>] ? mlx4_ib_post_send+0x4e2/0xb20 [mlx4_ib] [ 6565.961910] [<ffffffffa0b68f8d>] ? lu_obj_hop_keycmp+0x1d/0x30 [obdclass] [ 6565.961927] [<ffffffffa081d717>] ? cfs_hash_bd_lookup_intent+0x57/0x160 [libcfs] [ 6565.961935] [<ffffffff811b4afa>] alloc_pages_current+0xaa/0x170 [ 6565.961952] [<ffffffffa0d5786b>] osd_bufs_get+0x4cb/0xba0 [osd_zfs] [ 6565.961970] [<ffffffffa10ade3d>] ofd_preprw_write.isra.29+0x1bd/0xcd0 [ofd] [ 6565.961980] [<ffffffffa10af13a>] ofd_preprw+0x7ea/0x10c0 [ofd] [ 6565.962092] [<ffffffffa0e8fce7>] tgt_brw_write+0xc17/0x1640 [ptlrpc] [ 6565.962098] [<ffffffff81632d15>] ? __slab_free+0x10e/0x277 [ 6565.962105] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150 [ 6565.962110] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0 [ 6565.962115] [<ffffffff81639d72>] ? 
mutex_lock+0x12/0x2f [ 6565.962178] [<ffffffffa0e8c225>] tgt_request_handle+0x915/0x1320 [ptlrpc] [ 6565.962234] [<ffffffffa0e381ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [ 6565.962249] [<ffffffffa081a128>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [ 6565.962302] [<ffffffffa0e35d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [ 6565.962311] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20 [ 6565.962315] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90 [ 6565.962368] [<ffffffffa0e3c260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc] [ 6565.962377] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0 [ 6565.962428] [<ffffffffa0e3b7c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [ 6565.962436] [<ffffffff810a5b8f>] kthread+0xcf/0xe0 [ 6565.962441] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [ 6565.962449] [<ffffffff81646a98>] ret_from_fork+0x58/0x90 [ 6565.962454] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [ 6565.962456] Disabling lock debugging due to kernel taint [ 6565.962539] BUG: Bad page state in process ll_ost_io01_006 pfn:18eecc5 [ 6565.962541] page:ffffea0063bb3140 count:-1 mapcount:0 mapping: (null) index:0x0 [ 6565.962542] page flags: 0x6fffff00000000() [ 6565.962543] page dumped because: nonzero _count [ 6565.962576] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas raid_class lrw gf128mul glue_helper ablk_helper cryptd scsi_transport_sas iTCO_wdt 
iTCO_vendor_support mei_me ipmi_devintf ipmi_ssif lpc_ich sb_edac ipmi_si [ 6565.962601] edac_core sg ipmi_msghandler mei shpchp pcspkr ioatdma mfd_core i2c_i801 wmi acpi_pad acpi_power_meter nfsd dm_multipath dm_mod nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ptp crct10dif_pclmul pps_core crct10dif_common ahci drm crc32c_intel dca libahci mlx4_core i2c_algo_bit libata i2c_core [ 6565.962604] CPU: 31 PID: 10886 Comm: ll_ost_io01_006 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1 [ 6565.962605] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [ 6565.962612] ffffea0063bb3140 000000008d05d0f2 ffff88202236f6f8 ffffffff81636431 [ 6565.962619] ffff88202236f720 ffffffff81631645 ffff88203e759c68 0000000000003735 [ 6565.962625] 0000000000000001 ffff88202236f828 ffffffff81173028 ffff881022e59370 [ 6565.962626] Call Trace: [ 6565.962632] [<ffffffff81636431>] dump_stack+0x19/0x1b [ 6565.962636] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc [ 6565.962641] [<ffffffff81173028>] get_page_from_freelist+0x848/0x9b0 [ 6565.962650] [<ffffffffa06cadaa>] ? spl_kmem_free+0x2a/0x40 [spl] [ 6565.962655] [<ffffffff81173327>] __alloc_pages_nodemask+0x197/0xba0 [ 6565.962669] [<ffffffffa01f9f02>] ? mlx4_ib_post_send+0x4e2/0xb20 [mlx4_ib] [ 6565.962711] [<ffffffffa0b68f8d>] ? lu_obj_hop_keycmp+0x1d/0x30 [obdclass] [ 6565.962727] [<ffffffffa081d717>] ? 
cfs_hash_bd_lookup_intent+0x57/0x160 [libcfs] [ 6565.962733] [<ffffffff811b4afa>] alloc_pages_current+0xaa/0x170 [ 6565.962745] [<ffffffffa0d5786b>] osd_bufs_get+0x4cb/0xba0 [osd_zfs] [ 6565.962767] [<ffffffffa10ade3d>] ofd_preprw_write.isra.29+0x1bd/0xcd0 [ofd] [ 6565.962781] [<ffffffffa10af13a>] ofd_preprw+0x7ea/0x10c0 [ofd] [ 6565.962855] [<ffffffffa0e8fce7>] tgt_brw_write+0xc17/0x1640 [ptlrpc] [ 6565.962861] [<ffffffff81632d15>] ? __slab_free+0x10e/0x277 [ 6565.962866] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150 [ 6565.962870] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0 [ 6565.962875] [<ffffffff81639d72>] ? mutex_lock+0x12/0x2f [ 6565.962949] [<ffffffffa0e8c225>] tgt_request_handle+0x915/0x1320 [ptlrpc] [ 6565.963019] [<ffffffffa0e381ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [ 6565.963034] [<ffffffffa081a128>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [ 6565.963103] [<ffffffffa0e35d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [ 6565.963109] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20 [ 6565.963112] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90 [ 6565.963181] [<ffffffffa0e3c260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc] [ 6565.963187] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0 [ 6565.963256] [<ffffffffa0e3b7c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [ 6565.963262] [<ffffffff810a5b8f>] kthread+0xcf/0xe0 [ 6565.963267] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [ 6565.963273] [<ffffffff81646a98>] ret_from_fork+0x58/0x90 [ 6565.963278] [<ffffffff810a5ac0>] ? 
kthread_create_on_node+0x140/0x140 [ 6565.963280] BUG: Bad page state in process ll_ost_io01_006 pfn:18eecc6 [ 6565.963282] page:ffffea0063bb3180 count:-1 mapcount:0 mapping: (null) index:0x0 [ 6565.963284] page flags: 0x6fffff00000000() [ 6565.963285] page dumped because: nonzero _count [ 6565.963320] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas raid_class lrw gf128mul glue_helper ablk_helper cryptd scsi_transport_sas iTCO_wdt iTCO_vendor_support mei_me ipmi_devintf ipmi_ssif lpc_ich sb_edac ipmi_si [ 6565.963346] edac_core sg ipmi_msghandler mei shpchp pcspkr ioatdma mfd_core i2c_i801 wmi acpi_pad acpi_power_meter nfsd dm_multipath dm_mod nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ptp crct10dif_pclmul pps_core crct10dif_common ahci drm crc32c_intel dca libahci mlx4_core i2c_algo_bit libata i2c_core [ 6565.963349] CPU: 31 PID: 10886 Comm: ll_ost_io01_006 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1 [ 6565.963350] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [ 6565.963358] ffffea0063bb3180 000000008d05d0f2 ffff88202236f6f8 ffffffff81636431 [ 6565.963365] ffff88202236f720 ffffffff81631645 
ffff88203e759c68 0000000000003735 [ 6565.963372] 0000000000000001 ffff88202236f828 ffffffff81173028 ffff881022e59370 [ 6565.963372] Call Trace: [ 6565.963378] [<ffffffff81636431>] dump_stack+0x19/0x1b [ 6565.963383] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc [ 6565.963388] [<ffffffff81173028>] get_page_from_freelist+0x848/0x9b0 [ 6565.963397] [<ffffffffa06cadaa>] ? spl_kmem_free+0x2a/0x40 [spl] [ 6565.963403] [<ffffffff81173327>] __alloc_pages_nodemask+0x197/0xba0 [ 6565.963416] [<ffffffffa01f9f02>] ? mlx4_ib_post_send+0x4e2/0xb20 [mlx4_ib] [ 6565.963458] [<ffffffffa0b68f8d>] ? lu_obj_hop_keycmp+0x1d/0x30 [obdclass] [ 6565.963473] [<ffffffffa081d717>] ? cfs_hash_bd_lookup_intent+0x57/0x160 [libcfs] [ 6565.963479] [<ffffffff811b4afa>] alloc_pages_current+0xaa/0x170 [ 6565.963491] [<ffffffffa0d5786b>] osd_bufs_get+0x4cb/0xba0 [osd_zfs] [ 6565.963506] [<ffffffffa10ade3d>] ofd_preprw_write.isra.29+0x1bd/0xcd0 [ofd] [ 6565.963519] [<ffffffffa10af13a>] ofd_preprw+0x7ea/0x10c0 [ofd] [ 6565.963593] [<ffffffffa0e8fce7>] tgt_brw_write+0xc17/0x1640 [ptlrpc] [ 6565.963599] [<ffffffff81632d15>] ? __slab_free+0x10e/0x277 [ 6565.963603] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150 [ 6565.963607] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0 [ 6565.963612] [<ffffffff81639d72>] ? mutex_lock+0x12/0x2f [ 6565.963686] [<ffffffffa0e8c225>] tgt_request_handle+0x915/0x1320 [ptlrpc] [ 6565.963756] [<ffffffffa0e381ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [ 6565.963778] [<ffffffffa081a128>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [ 6565.963847] [<ffffffffa0e35d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [ 6565.963853] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20 [ 6565.963856] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90 [ 6565.963925] [<ffffffffa0e3c260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc] [ 6565.963931] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0 [ 6565.964000] [<ffffffffa0e3b7c0>] ? 
ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [ 6565.964006] [<ffffffff810a5b8f>] kthread+0xcf/0xe0 [ 6565.964011] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [ 6565.964016] [<ffffffff81646a98>] ret_from_fork+0x58/0x90 [ 6565.964021] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [ 6567.436859] page:ffffea0063bb3380 count:-1 mapcount:0 mapping: (null) index:0x0 [ 6567.447916] page flags: 0x6fffff00000000() [ 6567.454287] page dumped because: nonzero _count [ 6567.461107] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas raid_class lrw gf128mul glue_helper ablk_helper cryptd scsi_transport_sas iTCO_wdt iTCO_vendor_support mei_me ipmi_devintf ipmi_ssif lpc_ich sb_edac ipmi_si [ 6567.549458] edac_core sg ipmi_msghandler mei shpchp pcspkr ioatdma mfd_core i2c_i801 wmi acpi_pad acpi_power_meter nfsd dm_multipath dm_mod nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ptp crct10dif_pclmul pps_core crct10dif_common ahci drm crc32c_intel dca libahci mlx4_core i2c_algo_bit libata i2c_core [ 6567.606553] CPU: 19 PID: 11266 Comm: ll_ost_io01_007 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1 [ 6567.619967] Hardware name: Intel Corporation 
S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [ 6567.632682] ffffea0063bb3380 0000000029637c7c ffff880f32283908 ffffffff81636431 [ 6567.642074] ffff880f32283930 ffffffff81631645 ffffea0063bb3380 0000000000000000 [ 6567.651459] 000fffff00000000 ffff880f32283978 ffffffff811714dd fff00000fe000000 [ 6567.660857] Call Trace: [ 6567.664645] [<ffffffff81636431>] dump_stack+0x19/0x1b [ 6567.671441] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc [ 6567.678829] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190 [ 6567.686591] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140 [ 6567.694250] [<ffffffff8117200f>] __free_pages+0x3f/0x60 [ 6567.701235] [<ffffffffa0d56ad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs] [ 6567.709381] [<ffffffffa10ab84a>] ofd_commitrw_write+0xea/0x1c20 [ofd] [ 6567.717717] [<ffffffffa10aff2d>] ofd_commitrw+0x51d/0xa40 [ofd] [ 6567.725522] [<ffffffffa0eb78d2>] obd_commitrw+0x2ec/0x32f [ptlrpc] [ 6567.733604] [<ffffffffa0e8ff71>] tgt_brw_write+0xea1/0x1640 [ptlrpc] [ 6567.741863] [<ffffffffa0de6560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [ 6567.751008] [<ffffffffa0e8c225>] tgt_request_handle+0x915/0x1320 [ptlrpc] [ 6567.760002] [<ffffffffa0e381ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [ 6567.769852] [<ffffffffa081a128>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [ 6567.778843] [<ffffffffa0e35d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [ 6567.787757] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20 [ 6567.796038] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90 [ 6567.803866] [<ffffffffa0e3c260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc] [ 6567.812150] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0 [ 6567.819620] [<ffffffffa0e3b7c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [ 6567.828951] [<ffffffff810a5b8f>] kthread+0xcf/0xe0 [ 6567.835460] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [ 6567.843817] [<ffffffff81646a98>] ret_from_fork+0x58/0x90 [ 6567.850900] [<ffffffff810a5ac0>] ? 
kthread_create_on_node+0x140/0x140 [ 6591.647844] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000402:0x38:0x0] object 0x0:151 extent [67108864-74711039]: client csum 10225ab5, server csum d83f5ab1 [ 6602.366408] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.6@o2ib inode [0x200000402:0x46:0x0] object 0x0:158 extent [67108864-82968575]: client csum df6bd34a, server csum a629d34d [ 6611.821644] general protection fault: 0000 [#1] SMP [ 6611.829518] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(OE) zunicode(OE) zavl(OE) icp(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate dm_service_time xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas raid_class lrw gf128mul glue_helper ablk_helper cryptd scsi_transport_sas iTCO_wdt iTCO_vendor_support mei_me ipmi_devintf ipmi_ssif lpc_ich sb_edac ipmi_si [ 6611.923714] edac_core sg ipmi_msghandler mei shpchp pcspkr ioatdma mfd_core i2c_i801 wmi acpi_pad acpi_power_meter nfsd dm_multipath dm_mod nfs_acl lockd binfmt_misc grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ptp crct10dif_pclmul pps_core crct10dif_common ahci drm crc32c_intel dca libahci mlx4_core i2c_algo_bit libata i2c_core [ 6611.985416] CPU: 55 PID: 9668 Comm: ll_ost_io01_000 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1 [ 6611.999894] Hardware name: 
Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [ 6612.013786] task: ffff880fd957e780 ti: ffff880fdb368000 task.ti: ffff880fdb368000 [ 6612.024361] RIP: 0010:[<ffffffffa0814e30>] [<ffffffffa0814e30>] adler32_update+0x70/0x250 [libcfs] [ 6612.036764] RSP: 0018:ffff880fdb36b990 EFLAGS: 00010212 [ 6612.044902] RAX: 0000000000000cce RBX: 0000000000000cce RCX: 3433323130363534 [ 6612.055097] RDX: 0000000000000cce RSI: 0cd1944c0d8d4332 RDI: 0cd1944c0d8d4332 [ 6612.065272] RBP: ffff880fdb36b9f8 R08: 00000000000195a0 R09: 0000000000000cce [ 6612.075453] R10: ffff88103e807900 R11: 0000000000000001 R12: 3433323130363534 [ 6612.085641] R13: 0000000031303635 R14: ffffffffa0834410 R15: 0000000000000001 [ 6612.095830] FS: 0000000000000000(0000) GS:ffff88203e8c0000(0000) knlGS:0000000000000000 [ 6612.107119] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6612.115792] CR2: 00007f19c6c7c000 CR3: 000000000194a000 CR4: 00000000001407e0 [ 6612.126030] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6612.136265] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 6612.146492] Stack: [ 6612.150994] ffff881d1a0d1cd0 00000ccedb36b9c8 0cd1944c0d8d4332 0000000000000000 [ 6612.161627] 00000cce00000000 ffffffffa0834410 ffff882027752a08 ffff880fdb36b9f0 [ 6612.172284] 0cd1944c0d8d4332 3433323130363534 0000000031303635 ffffffffa0834410 [ 6612.182948] Call Trace: [ 6612.187988] [<ffffffff812b1a78>] crypto_shash_update+0x38/0x100 [ 6612.197017] [<ffffffff812b1d6e>] shash_ahash_update+0x3e/0x70 [ 6612.205854] [<ffffffff812b1db2>] shash_async_update+0x12/0x20 [ 6612.214676] [<ffffffffa0813fce>] cfs_crypto_hash_update_page+0x7e/0xb0 [libcfs] [ 6612.225344] [<ffffffffa0eb7459>] tgt_checksum_bulk.isra.33+0x35a/0x4e7 [ptlrpc] [ 6612.236606] [<ffffffffa0e9021d>] tgt_brw_write+0x114d/0x1640 [ptlrpc] [ 6612.246831] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150 [ 6612.255910] [<ffffffff810be46e>] ? 
account_entity_dequeue+0xae/0xd0 [ 6612.265910] [<ffffffffa0de6560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [ 6612.276879] [<ffffffffa0e8c225>] tgt_request_handle+0x915/0x1320 [ptlrpc] [ 6612.287460] [<ffffffffa0e381ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [ 6612.298869] [<ffffffffa081a128>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [ 6612.309312] [<ffffffffa0e35d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [ 6612.319759] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20 [ 6612.329610] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90 [ 6612.338955] [<ffffffffa0e3c260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc] [ 6612.348565] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0 [ 6612.357360] [<ffffffffa0e3b7c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [ 6612.368146] [<ffffffff810a5b8f>] kthread+0xcf/0xe0 [ 6612.376092] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [ 6612.385802] [<ffffffff81646a98>] ret_from_fork+0x58/0x90 [ 6612.394179] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 [ 6612.403767] Code: 44 00 00 8b 5d b8 b8 b0 15 00 00 81 fb b0 15 00 00 0f 46 c3 29 45 b8 83 f8 0f 89 45 a4 0f 8e f8 00 00 00 48 8b 7d a8 89 45 bc 90 <44> 0f b6 2f 44 0f b6 77 01 48 83 c7 10 44 0f b6 67 f2 0f b6 5f [ 6612.430647] RIP [<ffffffffa0814e30>] adler32_update+0x70/0x250 [libcfs] [ 6612.440428] RSP <ffff880fdb36b990>
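The general protection fault above lands in adler32_update while tgt_checksum_bulk is checksumming bulk write pages (cfs_crypto_hash_update_page -> adler32_update), with what looks like ASCII junk in R12/RCX, i.e. the checksum routine is walking a corrupted page. For reference, Adler-32 is the simple two-running-sum checksum sketched below; this is a minimal Python illustration of the algorithm, not the libcfs implementation (which defers the modulo over blocks for speed):

```python
MOD_ADLER = 65521  # largest prime below 2**16

def adler32_update(value: int, data: bytes) -> int:
    """Fold `data` into a running Adler-32 state (initial state is 1).

    `a` is the byte sum, `b` the sum of the running sums; the result
    packs them as (b << 16) | a, so the state can be carried across
    successive calls, one per bulk page.
    """
    a = value & 0xFFFF
    b = (value >> 16) & 0xFFFF
    for byte in data:
        a = (a + byte) % MOD_ADLER
        b = (b + a) % MOD_ADLER
    return (b << 16) | a
```

Checksumming a bulk RPC this way is just a loop of update calls, one per page of the transfer, which is why a single bad page takes down the whole ll_ost_io thread inside the hash update.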
Lustre 2.9.0 + ZFS 0.7.0 RC3 (none of our patches); recordsize 1M on OST0 and 16M on OST1, brw_size=16 on both. With raidz: the messages appear, but no crash. Manual dumps: 10.8.1.4-2017-04-15-00:26:17, 10.8.1.4-2017-04-15-01:47:43, 10.8.1.3-2017-04-15-13:22:45, 10.8.1.4-2017-04-15-13:22:47
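For the record, the raidz comparison run was set up along the same lines as the quick_oss scripts above, with raidz2 swapped in for draid2. A sketch (device names follow the dRAID scripts and are placeholders for the local multipath layout; the brw_size parameter path is per Lustre 2.9, verify locally):

```shell
# Same pool layout as quick_oss2.sh, but raidz2 instead of draid2.
zpool create -f -o ashift=12 -o cachefile=none -O recordsize=16M ost1 \
    raidz2 mpatha mpathb mpathc mpathd mpathe mpathf mpathg mpathh mpathi \
    mpathj mpathk mpathl mpathm mpathn mpatho mpathp mpathq mpathr
# A 16M recordsize needs the large_blocks pool feature enabled:
zpool set feature@large_blocks=enabled ost1
zpool get all ost1 | grep large_blocks
# 16MB bulk I/O on the OST side (after the target is mounted):
lctl set_param obdfilter.*.brw_size=16
```

(Note the quick_oss scripts above are missing the `set` in `zpool feature@large_blocks=enabled`; the feature is enabled by default on pool creation in 0.7, so the pools still came up with it active.)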
wolf-4 OSS
[ 163.434692] Lustre: lsdraid-OST0001: Recovery over after 0:06, of 5 clients 5 recovered and 0 were evicted. [ 163.480746] Lustre: lsdraid-OST0001: deleting orphan objects from 0x0:720 to 0x0:1025 [ 370.631336] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x3b0:0x0] object 0x0:1225 extent [83886080-92680191]: client csum d5f42113, server csum 1a89e99c [ 480.339896] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x49c:0x0] object 0x0:4041 extent [33554432-47890431]: client csum e47bcdcb, server csum 86becdcf [ 488.890964] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x4ae:0x0] object 0x0:5107 extent [67108864-73793535]: client csum b74b30df, server csum 20c030ec [ 509.914190] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x52f:0x0] object 0x0:6348 extent [33554432-43007999]: client csum cbc76f28, server csum 4b241635 [ 539.505532] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x5be:0x0] object 0x0:7700 extent [67108864-78381055]: client csum b6e2021c, server csum c5ce4f88 [ 560.736133] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x5f1:0x0] object 0x0:8747 extent [67108864-81104895]: client csum ddc22e54, server csum 894f5e1a [ 618.743576] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x6d0:0x0] object 0x0:11762 extent [67108864-81694719]: client csum 734e4939, server csum 175394a5 [ 618.764867] LustreError: Skipped 1 previous similar message [ 1080.395798] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x7fa:0x0] object 0x0:14839 extent [40140800-50331647]: client csum 937c50bf, server csum f71e2e65 [ 1080.417120] 
LustreError: Skipped 2 previous similar messages [ 3001.142322] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0xd10:0x0] object 0x0:49284 extent [100663296-108527615]: client csum ab9466a8, server csum 10b4e228 [ 3400.563954] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0xfb0:0x0] object 0x0:54388 extent [67108864-82837503]: client csum 71e8cd52, server csum 35becd53 [ 3461.970072] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x1052:0x0] object 0x0:55534 extent [67108864-74973183]: client csum c0a766ab, server csum ab5a66bb [ 3762.672549] BUG: Bad page state in process ll_ost_io01_003 pfn:182ec6d [ 3762.680002] page:ffffea0060bb1b40 count:-1 mapcount:0 mapping: (null) index:0x0 [ 3762.689091] page flags: 0x6fffff00000000() [ 3762.693727] page dumped because: nonzero _count [ 3762.700757] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas lrw gf128mul glue_helper ablk_helper cryptd raid_class scsi_transport_sas mei_me iTCO_wdt ipmi_ssif iTCO_vendor_support mei ipmi_devintf sb_edac sg [ 3762.790920] ioatdma lpc_ich shpchp edac_core pcspkr i2c_i801 ipmi_si mfd_core ipmi_msghandler acpi_pad acpi_power_meter wmi nfsd dm_multipath dm_mod auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables ext4 mbcache jbd2 
mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb crct10dif_pclmul ptp crct10dif_common ttm ahci crc32c_intel pps_core mlx4_core libahci drm dca i2c_algo_bit libata i2c_core [ 3762.850233] CPU: 31 PID: 9096 Comm: ll_ost_io01_003 Tainted: P IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1 [ 3762.864178] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [ 3762.877501] ffffea0060bb1b40 000000006cbfa991 ffff880fd6a47908 ffffffff81636431 [ 3762.887516] ffff880fd6a47930 ffffffff81631645 ffffea0060bb1b40 0000000000000000 [ 3762.897491] 000fffff00000000 ffff880fd6a47978 ffffffff811714dd fff00000fe000000 [ 3762.907458] Call Trace: [ 3762.912046] [<ffffffff81636431>] dump_stack+0x19/0x1b [ 3762.919394] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc [ 3762.927333] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190 [ 3762.935630] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140 [ 3762.943790] [<ffffffff8117200f>] __free_pages+0x3f/0x60 [ 3762.951264] [<ffffffffa0fa1ad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs] [ 3762.959874] [<ffffffffa109b84a>] ofd_commitrw_write+0xea/0x1c20 [ofd] [ 3762.968646] [<ffffffffa109ff2d>] ofd_commitrw+0x51d/0xa40 [ofd] [ 3762.976868] [<ffffffffa0e0f8d2>] obd_commitrw+0x2ec/0x32f [ptlrpc] [ 3762.985338] [<ffffffffa0de7f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc] [ 3762.993957] [<ffffffffa0d3e560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [ 3763.003453] [<ffffffffa0de4225>] tgt_request_handle+0x915/0x1320 [ptlrpc] [ 3763.012530] [<ffffffffa0d901ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [ 3763.022429] [<ffffffffa0a33128>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [ 3763.031354] [<ffffffffa0d8dd68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [ 3763.040220] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20 [ 3763.048476] [<ffffffff810af0b8>] ? 
__wake_up_common+0x58/0x90
[ 3763.056267] [<ffffffffa0d94260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[ 3763.064562] [<ffffffffa0d937c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[ 3763.074037] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
[ 3763.080685] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[ 3763.089162] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
[ 3763.096349] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[ 3855.476573] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x12e3:0x0] object 0x0:58439 extent [67108864-82837503]: client csum 71e8cd52, server csum 14e5cd5e
[ 3923.650281] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x13bc:0x0] object 0x0:59171 extent [33554432-48742399]: client csum 9005f4a9, server csum db87ac4c
[ 5698.551136] perf interrupt took too long (2521 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[ 5904.311835] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x1734:0x0] object 0x0:66681 extent [67108864-80281599]: client csum 1eaa58ca, server csum 44a378f0
[ 8708.045614] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x1d31:0x0] object 0x0:67733 extent [121729024-134217727]: client csum 99efe98c, server csum e23d22e1
[ 9738.442312] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x2051:0x0] object 0x0:68278 extent [100663296-116666367]: client csum d42f69dc, server csum 8732074f
[10448.854337] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x237a:0x0] object 0x0:68809 extent [100663296-112549887]: client csum 7a8b3e1a, server csum 1dbd0291
[10480.902373] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x2396:0x0] object 0x0:68834 extent [85426176-100663295]: client csum f43a36f0, server csum 9d10e702
[11720.767365] BUG: Bad page state in process ll_ost_io01_001 pfn:15d132f
[11720.777259] page:ffffea005744cbc0 count:-1 mapcount:0 mapping: (null) index:0x0
[11720.788693] page flags: 0x6fffff00000000()
[11720.795463] page dumped because: nonzero _count
[11720.802596] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel mpt3sas lrw gf128mul glue_helper ablk_helper cryptd raid_class scsi_transport_sas mei_me iTCO_wdt ipmi_ssif iTCO_vendor_support mei ipmi_devintf sb_edac sg
[11720.893130] ioatdma lpc_ich shpchp edac_core pcspkr i2c_i801 ipmi_si mfd_core ipmi_msghandler acpi_pad acpi_power_meter wmi nfsd dm_multipath dm_mod auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables ext4 mbcache jbd2 mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb crct10dif_pclmul ptp crct10dif_common ttm ahci crc32c_intel pps_core mlx4_core libahci drm dca i2c_algo_bit libata i2c_core
[11720.951749] CPU: 35 PID: 8509 Comm: ll_ost_io01_001 Tainted: P B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
[11720.965393] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[11720.978463] ffffea005744cbc0 00000000a971f860 ffff880fdb6bf908 ffffffff81636431
[11720.988249] ffff880fdb6bf930 ffffffff81631645 ffffea005744cbc0 0000000000000000
[11720.998053] 000fffff00000000 ffff880fdb6bf978 ffffffff811714dd fff00000fe000000
[11721.007838] Call Trace:
[11721.012009] [<ffffffff81636431>] dump_stack+0x19/0x1b
[11721.019195] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
[11721.026948] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
[11721.035167] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
[11721.043294] [<ffffffff8117200f>] __free_pages+0x3f/0x60
[11721.050752] [<ffffffffa0fa1ad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
[11721.059372] [<ffffffffa109b84a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
[11721.068157] [<ffffffffa109ff2d>] ofd_commitrw+0x51d/0xa40 [ofd]
[11721.076424] [<ffffffffa0e0f8d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[11721.085001] [<ffffffffa0de7f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
[11721.094001] [<ffffffff81632d15>] ? __slab_free+0x10e/0x277
[11721.101706] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
[11721.109340] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
[11721.117924] [<ffffffffa0d3e560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[11721.127469] [<ffffffffa0de4225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[11721.136601] [<ffffffffa0d901ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[11721.146564] [<ffffffffa0a33128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[11721.155726] [<ffffffffa0d8dd68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[11721.164815] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
[11721.173099] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
[11721.180948] [<ffffffffa0d94260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[11721.189287] [<ffffffffa0d937c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[11721.198828] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
[11721.205490] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[11721.214017] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
[11721.221178] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[11906.409714] perf interrupt took too long (5056 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[12369.576466] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x28dd:0x0] object 0x0:69605 extent [100663296-115441663]: client csum 34b2200, server csum 5f29220d
[12574.297235] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x2a16:0x0] object 0x0:69767 extent [100663296-114409471]: client csum c953b2e4, server csum f3b9a3f5
[12583.154014] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0001 from 12345-192.168.1.6@o2ib inode [0x200000405:0x2a22:0x0] object 0x0:69773 extent [100663296-117309439]: client csum fa39f722, server csum 17548bac
wolf-3 OSS
[ 702.495373] Lustre: lsdraid-OST0000: Connection restored to 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 (at 192.168.1.6@o2ib)
[ 712.111566] LustreError: 35894:0:(ofd_grant.c:641:ofd_grant_check()) lsdraid-OST0000: cli 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 claims 17432576 GRANT, real grant 0
[ 712.629997] LustreError: 39491:0:(ofd_grant.c:641:ofd_grant_check()) lsdraid-OST0000: cli 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 claims 17432576 GRANT, real grant 0
[ 712.649481] LustreError: 39491:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 8 previous similar messages
[ 713.660785] LustreError: 38266:0:(ofd_grant.c:641:ofd_grant_check()) lsdraid-OST0000: cli 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 claims 17432576 GRANT, real grant 0
[ 713.679875] LustreError: 38266:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 5 previous similar messages
[ 715.665680] LustreError: 38165:0:(ofd_grant.c:641:ofd_grant_check()) lsdraid-OST0000: cli 5dd53d1b-72ff-64c0-86f7-b4ab04036f55 claims 17432576 GRANT, real grant 0
[ 715.685499] LustreError: 38165:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 48 previous similar messages
[ 835.423369] Lustre: lsdraid-OST0000: Connection restored to 4e5e1424-c5a7-dbfe-ccf8-a041ec520cb5 (at 192.168.1.9@o2ib)
[ 835.437468] Lustre: Skipped 2 previous similar messages
[11228.546836] perf interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[28193.720410] LNet: Service thread pid 91775 was inactive for 200.29s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[28193.743765] Pid: 91775, comm: ll_ost00_010
[28193.750363] Call Trace:
[28193.758633] [<ffffffff8163bb39>] schedule+0x29/0x70
[28193.765982] [<ffffffffa05cb2fd>] cv_wait_common+0x10d/0x130 [spl]
[28193.774687] [<ffffffff810a6b80>] ? autoremove_wake_function+0x0/0x40
[28193.783567] [<ffffffffa05cb335>] __cv_wait+0x15/0x20 [spl]
[28193.791608] [<ffffffffa1439c23>] txg_wait_open+0xb3/0xf0 [zfs]
[28193.799877] [<ffffffffa13e264d>] dmu_free_long_range+0x25d/0x3d0 [zfs]
[28193.808919] [<ffffffffa1092468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
[28193.818586] [<ffffffffa10927d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
[28193.828178] [<ffffffffa1094dba>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
[28193.836927] [<ffffffffa119516f>] ofd_trans_stop+0x1f/0x60 [ofd]
[28193.845026] [<ffffffffa1198d82>] ofd_object_destroy+0x2b2/0x890 [ofd]
[28193.853770] [<ffffffffa1191987>] ofd_destroy_by_fid+0x307/0x510 [ofd]
[28193.862440] [<ffffffffa0cdcbe0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[28193.871264] [<ffffffffa0cd71f0>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
[28193.880161] [<ffffffffa1181627>] ofd_destroy_hdl+0x267/0xa50 [ofd]
[28193.888454] [<ffffffffa0d6b225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[28193.897329] [<ffffffffa0d171ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[28193.907053] [<ffffffffa09c7128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[28193.915785] [<ffffffffa0d14d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[28193.924476] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
[28193.932565] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
[28193.940211] [<ffffffffa0d1b260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[28193.948394] [<ffffffffa0d1a7c0>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
[28193.956493] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
[28193.963027] [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
[28193.969635] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
[28193.976729] [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
[28193.985950] LustreError: dumping log to /tmp/lustre-log.1492246924.91775
[28199.712751] LNet: Service thread pid 91775 completed after 206.29s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[31329.310375] perf interrupt took too long (5002 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[root@wolf-3 10.8.1.3-2017-04-14-22:46:09]#
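The ll_ost00_010 thread above is blocked in txg_wait_open, i.e. waiting for ZFS to open the next transaction group. A quick way to confirm whether the pool's txg pipeline is actually stalled is to watch the txgs kstat; this is a minimal sketch, assuming the ZoL 0.7 kstat path /proc/spl/kstat/zfs/<pool>/txgs (the txg_probe helper name is hypothetical):

```shell
# txg_probe: peek at the txg kstat of a pool (assumed ZoL 0.7 kstat path).
# If service threads sit in txg_wait_open, the open txg listed here stops
# advancing between successive calls.
txg_probe() {
    pool="$1"
    kstat="/proc/spl/kstat/zfs/$pool/txgs"
    if [ ! -r "$kstat" ]; then
        echo "txg_probe: no txgs kstat for pool '$pool'" >&2
        return 1
    fi
    # Show the most recent txg records (header plus last few txgs).
    tail -n 5 "$kstat"
}
```

Running `txg_probe ost0; sleep 10; txg_probe ost0` on the OSS during the hang and comparing the txg numbers would show whether the pool is making any forward progress.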
[root@wolf-4 combined]# ps aux |grep 9096
root 9096 0.6 0.0 0 0 ? S 01:55 4:21 [ll_ost_io01_003]
root 77386 0.0 0.0 112656 976 pts/0 S+ 12:56 0:00 grep --color=auto 9096
[root@wolf-4 combined]# man ps
[root@wolf-4 combined]# ps aux |grep 8509
root 8509 4.3 0.0 0 0 ? D 01:55 28:56 [ll_ost_io01_001]
root 84813 0.0 0.0 112656 976 pts/0 S+ 12:57 0:00 grep --color=auto 8509
[root@wolf-4 combined]# cat /proc/9096/stack
[<ffffffffa0d8dff5>] ptlrpc_wait_event+0x325/0x340 [ptlrpc]
[<ffffffffa0d93fcb>] ptlrpc_main+0x80b/0x1de0 [ptlrpc]
[<ffffffff810a5b8f>] kthread+0xcf/0xe0
[<ffffffff81646a98>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
[root@wolf-4 combined]# cat /proc/8509/stack
[<ffffffff8108c04f>] usleep_range+0x4f/0x70
[<ffffffffa269c99a>] dmu_tx_wait+0x33a/0x360 [zfs]
[<ffffffffa269ca45>] dmu_tx_assign+0x85/0x3f0 [zfs]
[<ffffffffa0f94fea>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[<ffffffffa10960db>] ofd_trans_start+0x6b/0xe0 [ofd]
[<ffffffffa109c0a3>] ofd_commitrw_write+0x943/0x1c20 [ofd]
[<ffffffffa109ff2d>] ofd_commitrw+0x51d/0xa40 [ofd]
[<ffffffffa0e0f8d2>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[<ffffffffa0de7f71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
[<ffffffffa0de4225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[<ffffffffa0d901ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[<ffffffffa0d94260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[<ffffffff810a5b8f>] kthread+0xcf/0xe0
[<ffffffff81646a98>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
Memory corruption: the "Bad page state ... nonzero _count" BUG on the OSS, together with the repeated BAD WRITE CHECKSUM mismatches, points at pages being freed or reused while still referenced.
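The D-state stack for pid 8509 above was captured by reading /proc/<pid>/stack. A minimal sketch to repeat that capture for any pid (the dump_kstack helper name is hypothetical; reading kernel stacks generally requires root):

```shell
# dump_kstack: print the kernel-mode stack of one process.
# Hypothetical helper; /proc/<pid>/stack is only readable by root.
dump_kstack() {
    pid="$1"
    if [ -z "$pid" ] || [ ! -d "/proc/$pid" ]; then
        echo "dump_kstack: no such pid: $pid" >&2
        return 1
    fi
    printf '=== %s (pid %s) ===\n' "$(cat "/proc/$pid/comm" 2>/dev/null)" "$pid"
    cat "/proc/$pid/stack" 2>/dev/null || echo "(stack unreadable: run as root)"
}
```

Looping this over the D-state threads from `ps -eo pid,stat,comm | awk '$2 ~ /D/'` would collect the same evidence for every stuck service thread at once, instead of one pid at a time as above.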