We hit a similar problem during mdtest.
<format>
[327157.963523] Lustre: scratch-MDT0000: Recovery over after 0:23, of 3 clients 3 recovered and 0 were evicted.
[360914.220229] Lustre: scratch-MDT0000: haven't heard from client a0cd3281-9880-d337-3d07-517afc288361 (at 10.0.10.69@o2ib) in 241 seconds. I think it's dead, and I am evicting it. exp ffff881fd2af1c00, cur 1475220380 expire 1475220230 last 1475220139
[360914.220233] Lustre: Skipped 21 previous similar messages
[375716.231473] BUG: soft lockup - CPU#26 stuck for 23s! [mdt00_030:34083]
[375716.238822] Modules linked in: nls_utf8 isofs ofd(OE) ost(OE) loop iptable_filter rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) ksocklnd(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) ib_srp(OE) scsi_transport_srp(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) mlx4_core(OE) dm_service_time intel_powerclamp coretemp intel_rapl dm_round_robin kvm crc32_pclmul ghash_clmulni_intel cryptd iTCO_wdt shpchp mxm_wmi iTCO_vendor_support lpc_ich mfd_core sb_edac edac_core mei_me mei i2c_i801 ioatdma pcspkr ipmi_devintf acpi_power_meter
[375716.238855] ipmi_si ipmi_msghandler wmi acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath ip_tables ext4 mbcache jbd2 mlx5_ib(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) sd_mod crc_t10dif crct10dif_generic ast syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul crct10dif_common crc32c_intel ttm ahci igb drm mlx5_core(OE) libahci qla2xxx vxlan dca ip6_udp_tunnel i2c_algo_bit udp_tunnel libata i2c_core mlx_compat(OE) ptp scsi_transport_fc pps_core scsi_tgt dm_mirror dm_region_hash dm_log dm_mod sg
[375716.238878] CPU: 26 PID: 34083 Comm: mdt00_030 Tainted: G OE ------------ 3.10.0-327.28.3.el7_lustre.2.7.18.ddn0.gd4e0769.x86_64 #1
[375716.238879] Hardware name: Supermicro X10DDW-i/X10DDW-i, BIOS 2.0 01/11/2016
[375716.238880] task: ffff880fd0527300 ti: ffff880fcd528000 task.ti: ffff880fcd528000
[375716.238881] RIP: 0010:[<ffffffff8163e1a0>] [<ffffffff8163e1a0>] _raw_spin_lock+0x30/0x50
[375716.238887] RSP: 0018:ffff880fcd52b638 EFLAGS: 00000287
[375716.238888] RAX: 0000000000007420 RBX: ffff880fb2ba2b60 RCX: 000000000000e160
[375716.238889] RDX: 000000000000dec6 RSI: 000000000000dec6 RDI: ffff880fc6011ba0
[375716.238889] RBP: ffff880fcd52b638 R08: 8010000000000000 R09: 10264119c0080000
[375716.238890] R10: efbbc2efd03e7002 R11: ffffea0040975b80 R12: ffff880fb60f4750
[375716.238891] R13: ffff880f96f91138 R14: ffff880fb60f51a0 R15: ffff880fefb5d820
[375716.238892] FS: 0000000000000000(0000) GS:ffff88103fb80000(0000) knlGS:0000000000000000
[375716.238893] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[375716.238894] CR2: 00007f5af2025854 CR3: 000000000194e000 CR4: 00000000001407e0
[375716.238894] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[375716.238895] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[375716.238896] Stack:
[375716.238896] ffff880fcd52b6c0 ffffffffa0218bfc 8000000000012820 0000000000012820
[375716.238899] 00000000267cb810 ffff880fc6011800 0000000000000000 0000000000000001
[375716.238902] ffff880fb0c54f01 ffff880fcd52b6e0 ffffffffa1004e96 00000000e0d19ef6
[375716.238904] Call Trace:
[375716.238912] [<ffffffffa0218bfc>] do_get_write_access+0x32c/0x4e0 [jbd2]
[375716.238924] [<ffffffffa1004e96>] ? ldiskfs_getblk+0xa6/0x200 [ldiskfs]
[375716.238928] [<ffffffffa0218dd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
[375716.238932] [<ffffffffa0fe11db>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
[375716.238936] [<ffffffffa0219144>] ? jbd2_journal_dirty_metadata+0xd4/0x260 [jbd2]
[375716.238951] [<ffffffffa10a32e8>] osd_ldiskfs_write_record+0xa8/0x360 [osd_ldiskfs]
[375716.238957] [<ffffffffa10a3698>] osd_write+0xf8/0x230 [osd_ldiskfs]
[375716.238987] [<ffffffffa0b3a295>] dt_record_write+0x45/0x130 [obdclass]
[375716.238997] [<ffffffffa0af769f>] llog_osd_write_rec+0x72f/0x1210 [obdclass]
[375716.239003] [<ffffffffa109a602>] ? iam_path_release+0x42/0x60 [osd_ldiskfs]
[375716.239013] [<ffffffffa0ae7f0a>] llog_write_rec+0xaa/0x280 [obdclass]
[375716.239023] [<ffffffffa0aebfae>] llog_cat_add_rec+0x46e/0xe00 [obdclass]
[375716.239031] [<ffffffffa0ae514a>] llog_add+0x7a/0x1a0 [obdclass]
[375716.239044] [<ffffffffa13c789d>] osp_sync_add_rec+0x24d/0x9a0 [osp]
[375716.239050] [<ffffffffa1096e71>] ? osd_oi_delete+0x1a1/0x420 [osd_ldiskfs]
[375716.239055] [<ffffffffa13cb147>] osp_sync_add+0x47/0x50 [osp]
[375716.239059] [<ffffffffa13b7f1f>] osp_object_destroy+0x10f/0x170 [osp]
[375716.239073] [<ffffffffa1310d87>] lod_object_destroy+0x677/0xa50 [lod]
[375716.239084] [<ffffffffa135d2e7>] ? mdd_mark_dead_object+0x27/0x3d0 [mdd]
[375716.239091] [<ffffffffa136a20e>] mdd_finish_unlink+0x2fe/0x460 [mdd]
[375716.239097] [<ffffffffa136e5ed>] mdd_unlink+0x8dd/0xa90 [mdd]
[375716.239120] [<ffffffffa122d936>] mdt_reint_unlink+0xa96/0x11f0 [mdt]
[375716.239137] [<ffffffffa0b5699e>] ? lu_ucred+0x1e/0x30 [obdclass]
[375716.239146] [<ffffffffa1231420>] mdt_reint_rec+0x80/0x210 [mdt]
[375716.239155] [<ffffffffa1212299>] mdt_reint_internal+0x5d9/0xb30 [mdt]
[375716.239164] [<ffffffffa121d237>] mdt_reint+0x67/0x140 [mdt]
[375716.239208] [<ffffffffa0db4adb>] tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
[375716.239232] [<ffffffffa0d5797b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[375716.239249] [<ffffffffa07b7d78>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[375716.239272] [<ffffffffa0d54a48>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
[375716.239296] [<ffffffffa0d5b2a0>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
[375716.239300] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
[375716.239323] [<ffffffffa0d5a6a0>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
[375716.239328] [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[375716.239330] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[375716.239333] [<ffffffff81646e58>] ret_from_fork+0x58/0x90
[375716.239335] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[375716.239336] Code: 55 48 89 e5 b8 00 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7 f2 b8 00 80 00 00 eb 0c 0f 1f 44 00 00 <f3> 90 83 e8 01 74 0a 0f b7 0f 66 39 ca 75 f1 5d c3 0f 1f 80 00
[375744.242810] BUG: soft lockup - CPU#26 stuck for 23s! [mdt00_030:34083]
[375744.250127] Modules linked in: nls_utf8 isofs ofd(OE) ost(OE) loop iptable_filter rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) ksocklnd(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) ib_srp(OE) scsi_transport_srp(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) mlx4_core(OE) dm_service_time intel_powerclamp coretemp intel_rapl dm_round_robin kvm crc32_pclmul ghash_clmulni_intel cryptd iTCO_wdt shpchp mxm_wmi iTCO_vendor_support lpc_ich mfd_core sb_edac edac_core mei_me mei i2c_i801 ioatdma pcspkr ipmi_devintf acpi_power_meter
[375744.250147] ipmi_si ipmi_msghandler wmi acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath ip_tables ext4 mbcache jbd2 mlx5_ib(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) sd_mod crc_t10dif crct10dif_generic ast syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul crct10dif_common crc32c_intel ttm ahci igb drm mlx5_core(OE) libahci qla2xxx vxlan dca ip6_udp_tunnel i2c_algo_bit udp_tunnel libata i2c_core mlx_compat(OE) ptp scsi_transport_fc pps_core scsi_tgt dm_mirror dm_region_hash dm_log dm_mod sg
[375744.250162] CPU: 26 PID: 34083 Comm: mdt00_030 Tainted: G OEL ------------ 3.10.0-327.28.3.el7_lustre.2.7.18.ddn0.gd4e0769.x86_64 #1
[375744.250163] Hardware name: Supermicro X10DDW-i/X10DDW-i, BIOS 2.0 01/11/2016
[375744.250164] task: ffff880fd0527300 ti: ffff880fcd528000 task.ti: ffff880fcd528000
[375744.250165] RIP: 0010:[<ffffffff8163e1a2>] [<ffffffff8163e1a2>] _raw_spin_lock+0x32/0x50
[375744.250168] RSP: 0018:ffff880fcd52b638 EFLAGS: 00000287
[375744.250169] RAX: 000000000000184a RBX: ffff880fb2ba2b60 RCX: 000000000000e160
[375744.250170] RDX: 000000000000dec6 RSI: 000000000000dec6 RDI: ffff880fc6011ba0
[375744.250170] RBP: ffff880fcd52b638 R08: 8010000000000000 R09: 10264119c0080000
[375744.250171] R10: efbbc2efd03e7002 R11: ffffea0040975b80 R12: ffff880fb60f4750
[375744.250172] R13: ffff880f96f91138 R14: ffff880fb60f51a0 R15: ffff880fefb5d820
[375744.250173] FS: 0000000000000000(0000) GS:ffff88103fb80000(0000) knlGS:0000000000000000
[375744.250173] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[375744.250174] CR2: 00007f5af2025854 CR3: 000000000194e000 CR4: 00000000001407e0
[375744.250175] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[375744.250176] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[375744.250176] Stack:
[375744.250177] ffff880fcd52b6c0 ffffffffa0218bfc 8000000000012820 0000000000012820
[375744.250180] 00000000267cb810 ffff880fc6011800 0000000000000000 0000000000000001
[375744.250182] ffff880fb0c54f01 ffff880fcd52b6e0 ffffffffa1004e96 00000000e0d19ef6
[375744.250185] Call Trace:
[375744.250191] [<ffffffffa0218bfc>] do_get_write_access+0x32c/0x4e0 [jbd2]
[375744.250198] [<ffffffffa1004e96>] ? ldiskfs_getblk+0xa6/0x200 [ldiskfs]
[375744.250203] [<ffffffffa0218dd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
[375744.250207] [<ffffffffa0fe11db>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
[375744.250211] [<ffffffffa0219144>] ? jbd2_journal_dirty_metadata+0xd4/0x260 [jbd2]
[375744.250219] [<ffffffffa10a32e8>] osd_ldiskfs_write_record+0xa8/0x360 [osd_ldiskfs]
[375744.250225] [<ffffffffa10a3698>] osd_write+0xf8/0x230 [osd_ldiskfs]
[375744.250240] [<ffffffffa0b3a295>] dt_record_write+0x45/0x130 [obdclass]
[375744.250250] [<ffffffffa0af769f>] llog_osd_write_rec+0x72f/0x1210 [obdclass]
[375744.250256] [<ffffffffa109a602>] ? iam_path_release+0x42/0x60 [osd_ldiskfs]
[375744.250266] [<ffffffffa0ae7f0a>] llog_write_rec+0xaa/0x280 [obdclass]
[375744.250275] [<ffffffffa0aebfae>] llog_cat_add_rec+0x46e/0xe00 [obdclass]
[375744.250283] [<ffffffffa0ae514a>] llog_add+0x7a/0x1a0 [obdclass]
[375744.250288] [<ffffffffa13c789d>] osp_sync_add_rec+0x24d/0x9a0 [osp]
[375744.250294] [<ffffffffa1096e71>] ? osd_oi_delete+0x1a1/0x420 [osd_ldiskfs]
[375744.250299] [<ffffffffa13cb147>] osp_sync_add+0x47/0x50 [osp]
[375744.250302] [<ffffffffa13b7f1f>] osp_object_destroy+0x10f/0x170 [osp]
[375744.250311] [<ffffffffa1310d87>] lod_object_destroy+0x677/0xa50 [lod]
[375744.250316] [<ffffffffa135d2e7>] ? mdd_mark_dead_object+0x27/0x3d0 [mdd]
[375744.250321] [<ffffffffa136a20e>] mdd_finish_unlink+0x2fe/0x460 [mdd]
[375744.250325] [<ffffffffa136e5ed>] mdd_unlink+0x8dd/0xa90 [mdd]
[375744.250334] [<ffffffffa122d936>] mdt_reint_unlink+0xa96/0x11f0 [mdt]
[375744.250347] [<ffffffffa0b5699e>] ? lu_ucred+0x1e/0x30 [obdclass]
[375744.250355] [<ffffffffa1231420>] mdt_reint_rec+0x80/0x210 [mdt]
[375744.250361] [<ffffffffa1212299>] mdt_reint_internal+0x5d9/0xb30 [mdt]
[375744.250367] [<ffffffffa121d237>] mdt_reint+0x67/0x140 [mdt]
[375744.250390] [<ffffffffa0db4adb>] tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
[375744.250410] [<ffffffffa0d5797b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[375744.250421] [<ffffffffa07b7d78>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[375744.250439] [<ffffffffa0d54a48>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
[375744.250457] [<ffffffffa0d5b2a0>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
[375744.250460] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
[375744.250478] [<ffffffffa0d5a6a0>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
[375744.250480] [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[375744.250483] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[375744.250485] [<ffffffff81646e58>] ret_from_fork+0x58/0x90
[375744.250487] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[375744.250488] Code: 89 e5 b8 00 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7 f2 b8 00 80 00 00 eb 0c 0f 1f 44 00 00 f3 90 <83> e8 01 74 0a 0f b7 0f 66 39 ca 75 f1 5d c3 0f 1f 80 00 00 00
[375750.710430] INFO: rcu_sched self-detected stall on CPU
[375750.713449] INFO: rcu_sched detected stalls on CPUs/tasks:
[375750.713449] {
[375750.713450] 26
</format>
In case it helps: I just came across this ticket and the others that have been linked to it or marked as duplicates of it, and I wonder whether some of them might actually be related to LU-8685 instead. According to my own debugging and disassembly, the spin-lock that the stuck threads are spinning on at do_get_write_access()+0x32c is (journal_t *)->j_list_lock, so the bug and patch identified in LU-8685 could well be relevant here too.
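To check whether the linked tickets really share this signature, something like the following could be run over each console log. It is only a rough sketch: the function name `lockup_signatures` and the regular expressions are my own, written against the log format pasted above rather than taken from any Lustre or kernel tool. For each soft-lockup report it extracts the RIP symbol and the first reliable (non-`?`) call-trace frame.

```python
import re

# Patterns written against the RHEL 7 console format shown above (an assumption).
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]")
RIP_RE = re.compile(r"RIP: .*\] ([\w.]+)\+(0x[0-9a-f]+)/")
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (\? )?([\w.]+)\+(0x[0-9a-f]+)/0x[0-9a-f]+(?: \[(\w+)\])?")


def lockup_signatures(log_text):
    """Return one dict per soft-lockup report found in log_text."""
    reports = []
    current = None
    in_trace = False
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            # A new report starts: record CPU, duration, and task identity.
            current = {"cpu": int(m.group(1)), "seconds": int(m.group(2)),
                       "comm": m.group(3), "pid": int(m.group(4)),
                       "rip": None, "caller": None}
            reports.append(current)
            in_trace = False
            continue
        if current is None:
            continue
        if "RIP:" in line:
            m = RIP_RE.search(line)
            if m and current["rip"] is None:
                current["rip"] = f"{m.group(1)}+{m.group(2)}"
        elif "Call Trace:" in line:
            in_trace = True
        elif in_trace and current["caller"] is None:
            m = FRAME_RE.search(line)
            # Frames marked "?" are unreliable stack leftovers; skip them.
            if m and not m.group(1):
                module = m.group(4) or "kernel"
                current["caller"] = f"{m.group(2)}+{m.group(3)} [{module}]"
    return reports


# A few lines copied from the trace above, used as sample input.
SAMPLE = """\
[375716.231473] BUG: soft lockup - CPU#26 stuck for 23s! [mdt00_030:34083]
[375716.238881] RIP: 0010:[<ffffffff8163e1a0>] [<ffffffff8163e1a0>] _raw_spin_lock+0x30/0x50
[375716.238904] Call Trace:
[375716.238912] [<ffffffffa0218bfc>] do_get_write_access+0x32c/0x4e0 [jbd2]
[375716.238924] [<ffffffffa1004e96>] ? ldiskfs_getblk+0xa6/0x200 [ldiskfs]
"""

if __name__ == "__main__":
    for report in lockup_signatures(SAMPLE):
        print(report)
```

A report whose RIP is in `_raw_spin_lock` and whose caller is `do_get_write_access+0x32c [jbd2]` would match what I see here. The offset is only comparable between logs from the same kernel build, and a match does not by itself prove that the contended lock is j_list_lock.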