Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.10.1, Lustre 2.12.4
Labels: None
Severity: 3
Description
This issue was created by maloo for sarah_lw <wei3.liu@intel.com>
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/fd03f350-9c9e-11e7-ba27-5254006e85c2.
The sub-test test_iozone failed with the following error:
test failed to respond and timed out
Server and client: RHEL 7.4, ZFS.
The MDS dmesg shows:
[25904.206622] osp-syn-0-0 D 0000000000000000 0 23127 2 0x00000080
[25904.208389] ffff88005c5976a0 0000000000000046 ffff88004e28eeb0 ffff88005c597fd8
[25904.210163] ffff88005c597fd8 ffff88005c597fd8 ffff88004e28eeb0 ffff8800608452f8
[25904.211921] ffff880060845240 ffff880060845268 ffff880060845300 0000000000000000
[25904.213716] Call Trace:
[25904.214985] [<ffffffff816a94c9>] schedule+0x29/0x70
[25904.216575] [<ffffffffc07144d5>] cv_wait_common+0x125/0x150 [spl]
[25904.218157] [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
[25904.219805] [<ffffffffc0714515>] __cv_wait+0x15/0x20 [spl]
[25904.221433] [<ffffffffc086317f>] txg_wait_synced+0xef/0x140 [zfs]
[25904.223020] [<ffffffffc0818a75>] dmu_tx_wait+0x275/0x3c0 [zfs]
[25904.224676] [<ffffffffc0818c51>] dmu_tx_assign+0x91/0x490 [zfs]
[25904.226256] [<ffffffffc0c3ad00>] ? llog_osd_declare_destroy+0x2f0/0x640 [obdclass]
[25904.228018] [<ffffffffc1085efa>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[25904.229710] [<ffffffffc0c278c7>] llog_cancel_rec+0x147/0x870 [obdclass]
[25904.231414] [<ffffffffc0c2e33a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
[25904.233232] [<ffffffffc0e4b8a0>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[25904.234923] [<ffffffffc13c57f3>] osp_sync_process_committed+0x213/0x6c0 [osp]
[25904.236691] [<ffffffffc13c6bd6>] osp_sync_process_queues+0x556/0x2010 [osp]
[25904.238371] [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
[25904.239978] [<ffffffffc0c28595>] llog_process_thread+0x5a5/0x1180 [obdclass]
[25904.241718] [<ffffffffc13c6680>] ? osp_sync_thread+0x9e0/0x9e0 [osp]
[25904.243404] [<ffffffffc0c2922c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[25904.245050] [<ffffffffc0c2e91d>] llog_cat_process_cb+0x43d/0x4e0 [obdclass]
[25904.246794] [<ffffffffc0c28595>] llog_process_thread+0x5a5/0x1180 [obdclass]
[25904.248521] [<ffffffff810ce8d8>] ? check_preempt_wakeup+0x148/0x250
[25904.250109] [<ffffffffc0c2e4e0>] ? llog_cat_cancel_records+0x2e0/0x2e0 [obdclass]
[25904.251906] [<ffffffffc0c2922c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[25904.253600] [<ffffffffc0c2e4e0>] ? llog_cat_cancel_records+0x2e0/0x2e0 [obdclass]
[25904.255390] [<ffffffffc0c2daa9>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
[25904.257072] [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
[25904.258766] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[25904.260355] [<ffffffffc13c6680>] ? osp_sync_thread+0x9e0/0x9e0 [osp]
[25904.262039] [<ffffffffc0c2dbde>] llog_cat_process+0x2e/0x30 [obdclass]
[25904.263683] [<ffffffffc13c5ea8>] osp_sync_thread+0x208/0x9e0 [osp]
[25904.265347] [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[25904.266851] [<ffffffffc13c5ca0>] ? osp_sync_process_committed+0x6c0/0x6c0 [osp]
[25904.268620] [<ffffffff810b098f>] kthread+0xcf/0xe0
[25904.270093] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[25904.271736] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[25904.273300] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
...
[25904.821292] mdt_rdpg00_002 D ffff880060845300 0 24274 2 0x00000080
[25904.822912] ffff88005a573880 0000000000000046 ffff88006b1b8fd0 ffff88005a573fd8
[25904.824661] ffff88005a573fd8 ffff88005a573fd8 ffff88006b1b8fd0 ffff8800608452f8
[25904.826407] ffff880060845240 ffff880060845268 ffff880060845300 0000000000000000
[25904.828129] Call Trace:
[25904.829349] [<ffffffff816a94c9>] schedule+0x29/0x70
[25904.830842] [<ffffffffc07144d5>] cv_wait_common+0x125/0x150 [spl]
[25904.832434] [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
[25904.833928] [<ffffffffc0714515>] __cv_wait+0x15/0x20 [spl]
[25904.835494] [<ffffffffc086317f>] txg_wait_synced+0xef/0x140 [zfs]
[25904.837084] [<ffffffffc0818a75>] dmu_tx_wait+0x275/0x3c0 [zfs]
[25904.838670] [<ffffffffc0818c51>] dmu_tx_assign+0x91/0x490 [zfs]
[25904.840246] [<ffffffffc1085efa>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[25904.841792] [<ffffffffc103a128>] qmt_trans_start_with_slv+0x248/0x530 [lquota]
[25904.843472] [<ffffffffc1033196>] qmt_dqacq0+0x1a6/0xf00 [lquota]
[25904.845069] [<ffffffffc0e4a2df>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
[25904.846753] [<ffffffffc1036b21>] qmt_intent_policy+0x831/0xe50 [lquota]
[25904.848387] [<ffffffffc12207c2>] mdt_intent_policy+0x662/0xc70 [mdt]
[25904.849991] [<ffffffffc0e0112f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]
[25904.851631] [<ffffffffc0dfa2b7>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
[25904.853270] [<ffffffffc0e23c23>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
[25904.854932] [<ffffffffc0e4be90>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[25904.856589] [<ffffffffc0ea9182>] tgt_enqueue+0x62/0x210 [ptlrpc]
[25904.858208] [<ffffffffc0ead085>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[25904.859868] [<ffffffffc0e55ec6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
[25904.861533] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[25904.863131] [<ffffffffc0e59602>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[25904.864765] [<ffffffffc0e58b70>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[25904.866473] [<ffffffff810b098f>] kthread+0xcf/0xe0
[25904.867963] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[25904.869514] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[25904.871056] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
Info required for matching: sanity-benchmark iozone
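Both MDS traces above end in dmu_tx_assign() -> dmu_tx_wait() -> txg_wait_synced(): the osp-syn and mdt_rdpg threads are not stuck on a Lustre lock, they are sleeping on a condition variable until the ZFS sync thread finishes writing out the current transaction group. The standalone userspace model below is an illustration only (it is not ZFS or Lustre source; all names, timings and thread counts in it are made up) of why a single slow or stalled txg syncer makes every thread that calls osd_trans_start() appear hung.
{code:c}
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t txg_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  txg_cv   = PTHREAD_COND_INITIALIZER;
static unsigned long   txg_synced;          /* last transaction group written out */

/* Modeled loosely on txg_wait_synced(): sleep until the given txg has synced. */
static void wait_for_txg(unsigned long txg)
{
        pthread_mutex_lock(&txg_lock);
        while (txg_synced < txg)
                pthread_cond_wait(&txg_cv, &txg_lock);  /* where the MDS threads sit */
        pthread_mutex_unlock(&txg_lock);
}

/* Modeled on the ZFS sync thread: only it ever advances txg_synced. */
static void *sync_thread(void *arg)
{
        (void)arg;
        for (;;) {
                sleep(5);                   /* stand-in for slow vdev I/O on the test VM */
                pthread_mutex_lock(&txg_lock);
                txg_synced++;
                pthread_cond_broadcast(&txg_cv);
                pthread_mutex_unlock(&txg_lock);
        }
        return NULL;
}

/* Modeled on an MDS service thread whose transaction cannot join the open txg. */
static void *service_thread(void *arg)
{
        long id = (long)(intptr_t)arg;
        unsigned long target;

        pthread_mutex_lock(&txg_lock);
        target = txg_synced + 1;            /* must wait for the next sync to finish */
        pthread_mutex_unlock(&txg_lock);

        wait_for_txg(target);
        printf("thread %ld: transaction started after txg %lu synced\n", id, target);
        return NULL;
}

int main(void)
{
        pthread_t syncer, workers[3];
        long i;

        pthread_create(&syncer, NULL, sync_thread, NULL);
        for (i = 0; i < 3; i++)
                pthread_create(&workers[i], NULL, service_thread, (void *)(intptr_t)i);
        for (i = 0; i < 3; i++)
                pthread_join(workers[i], NULL);
        return 0;
}
{code}
If the model's syncer never wakes up, all of the service threads block forever in pthread_cond_wait(), which is the userspace analogue of the blocked-task reports in the dmesg above.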
This might require a new ticket, but ... I have a new case that looks like what is described here. For (future) Lustre 2.12.4, at https://testing.whamcloud.com/test_sets/4c5e02d0-349e-11ea-b0f4-52540065bddc, running ZFS with DNE, we see test_bonnie hang and also see errors during test_dbench.
Looking at the console logs for client1 (vm1), we see the dbench process hung with the same traces.
In the console logs of the OSS, we see inactive threads while running both the dbench and bonnie tests.
On MDS1/3 (vm4), we see inactive threads while running dbench.