[LU-10009] sanity-benchmark test_iozone: test failed to respond and timed out Created: 19/Sep/17  Updated: 16/Jan/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-10250 replay-single test_74: hang and time... Open
Related
is related to LU-12234 sanity-benchmark test iozone hangs in... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/fd03f350-9c9e-11e7-ba27-5254006e85c2.

The sub-test test_iozone failed with the following error:

test failed to respond and timed out

Server and client: RHEL 7.4, ZFS

The MDS dmesg shows:

[25904.206622] osp-syn-0-0     D 0000000000000000     0 23127      2 0x00000080
[25904.208389]  ffff88005c5976a0 0000000000000046 ffff88004e28eeb0 ffff88005c597fd8
[25904.210163]  ffff88005c597fd8 ffff88005c597fd8 ffff88004e28eeb0 ffff8800608452f8
[25904.211921]  ffff880060845240 ffff880060845268 ffff880060845300 0000000000000000
[25904.213716] Call Trace:
[25904.214985]  [<ffffffff816a94c9>] schedule+0x29/0x70
[25904.216575]  [<ffffffffc07144d5>] cv_wait_common+0x125/0x150 [spl]
[25904.218157]  [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
[25904.219805]  [<ffffffffc0714515>] __cv_wait+0x15/0x20 [spl]
[25904.221433]  [<ffffffffc086317f>] txg_wait_synced+0xef/0x140 [zfs]
[25904.223020]  [<ffffffffc0818a75>] dmu_tx_wait+0x275/0x3c0 [zfs]
[25904.224676]  [<ffffffffc0818c51>] dmu_tx_assign+0x91/0x490 [zfs]
[25904.226256]  [<ffffffffc0c3ad00>] ? llog_osd_declare_destroy+0x2f0/0x640 [obdclass]
[25904.228018]  [<ffffffffc1085efa>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[25904.229710]  [<ffffffffc0c278c7>] llog_cancel_rec+0x147/0x870 [obdclass]
[25904.231414]  [<ffffffffc0c2e33a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
[25904.233232]  [<ffffffffc0e4b8a0>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[25904.234923]  [<ffffffffc13c57f3>] osp_sync_process_committed+0x213/0x6c0 [osp]
[25904.236691]  [<ffffffffc13c6bd6>] osp_sync_process_queues+0x556/0x2010 [osp]
[25904.238371]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
[25904.239978]  [<ffffffffc0c28595>] llog_process_thread+0x5a5/0x1180 [obdclass]
[25904.241718]  [<ffffffffc13c6680>] ? osp_sync_thread+0x9e0/0x9e0 [osp]
[25904.243404]  [<ffffffffc0c2922c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[25904.245050]  [<ffffffffc0c2e91d>] llog_cat_process_cb+0x43d/0x4e0 [obdclass]
[25904.246794]  [<ffffffffc0c28595>] llog_process_thread+0x5a5/0x1180 [obdclass]
[25904.248521]  [<ffffffff810ce8d8>] ? check_preempt_wakeup+0x148/0x250
[25904.250109]  [<ffffffffc0c2e4e0>] ? llog_cat_cancel_records+0x2e0/0x2e0 [obdclass]
[25904.251906]  [<ffffffffc0c2922c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[25904.253600]  [<ffffffffc0c2e4e0>] ? llog_cat_cancel_records+0x2e0/0x2e0 [obdclass]
[25904.255390]  [<ffffffffc0c2daa9>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
[25904.257072]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
[25904.258766]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[25904.260355]  [<ffffffffc13c6680>] ? osp_sync_thread+0x9e0/0x9e0 [osp]
[25904.262039]  [<ffffffffc0c2dbde>] llog_cat_process+0x2e/0x30 [obdclass]
[25904.263683]  [<ffffffffc13c5ea8>] osp_sync_thread+0x208/0x9e0 [osp]
[25904.265347]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[25904.266851]  [<ffffffffc13c5ca0>] ? osp_sync_process_committed+0x6c0/0x6c0 [osp]
[25904.268620]  [<ffffffff810b098f>] kthread+0xcf/0xe0
[25904.270093]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[25904.271736]  [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[25904.273300]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40

...

[25904.821292] mdt_rdpg00_002  D ffff880060845300     0 24274      2 0x00000080
[25904.822912]  ffff88005a573880 0000000000000046 ffff88006b1b8fd0 ffff88005a573fd8
[25904.824661]  ffff88005a573fd8 ffff88005a573fd8 ffff88006b1b8fd0 ffff8800608452f8
[25904.826407]  ffff880060845240 ffff880060845268 ffff880060845300 0000000000000000
[25904.828129] Call Trace:
[25904.829349]  [<ffffffff816a94c9>] schedule+0x29/0x70
[25904.830842]  [<ffffffffc07144d5>] cv_wait_common+0x125/0x150 [spl]
[25904.832434]  [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
[25904.833928]  [<ffffffffc0714515>] __cv_wait+0x15/0x20 [spl]
[25904.835494]  [<ffffffffc086317f>] txg_wait_synced+0xef/0x140 [zfs]
[25904.837084]  [<ffffffffc0818a75>] dmu_tx_wait+0x275/0x3c0 [zfs]
[25904.838670]  [<ffffffffc0818c51>] dmu_tx_assign+0x91/0x490 [zfs]
[25904.840246]  [<ffffffffc1085efa>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[25904.841792]  [<ffffffffc103a128>] qmt_trans_start_with_slv+0x248/0x530 [lquota]
[25904.843472]  [<ffffffffc1033196>] qmt_dqacq0+0x1a6/0xf00 [lquota]
[25904.845069]  [<ffffffffc0e4a2df>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
[25904.846753]  [<ffffffffc1036b21>] qmt_intent_policy+0x831/0xe50 [lquota]
[25904.848387]  [<ffffffffc12207c2>] mdt_intent_policy+0x662/0xc70 [mdt]
[25904.849991]  [<ffffffffc0e0112f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]
[25904.851631]  [<ffffffffc0dfa2b7>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
[25904.853270]  [<ffffffffc0e23c23>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
[25904.854932]  [<ffffffffc0e4be90>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[25904.856589]  [<ffffffffc0ea9182>] tgt_enqueue+0x62/0x210 [ptlrpc]
[25904.858208]  [<ffffffffc0ead085>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[25904.859868]  [<ffffffffc0e55ec6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
[25904.861533]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[25904.863131]  [<ffffffffc0e59602>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[25904.864765]  [<ffffffffc0e58b70>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[25904.866473]  [<ffffffff810b098f>] kthread+0xcf/0xe0
[25904.867963]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[25904.869514]  [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[25904.871056]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40

Info required for matching: sanity-benchmark iozone



 Comments   
Comment by Sarah Liu [ 19/Sep/17 ]

This failure has been seen in multiple tests:
sanity-hsm test_24a also shows a similar trace on the MDS:
https://testing.hpdd.intel.com/test_sets/fdc369ec-9c9e-11e7-ba27-5254006e85c2
performance-sanity:
https://testing.hpdd.intel.com/test_sets/0083a0c0-9c9f-11e7-ba27-5254006e85c2

Comment by Andreas Dilger [ 21/Sep/17 ]

It looks like these threads are all stuck in the ZFS transaction commit code, waiting for the txg_sync thread to finish committing the transaction.
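If this can be caught live, it may also help to capture the state of the txg_sync thread and the recent transaction group history on the MDS at the time of the hang. A rough sketch (the pool name lustre-mdt1 is only a placeholder, and the txgs kstat is only populated when the zfs_txg_history module parameter is non-zero):

# dump stack traces of all blocked (D-state) tasks to the console
echo w > /proc/sysrq-trigger

# stacks of the ZFS txg_sync kernel threads (one per imported pool)
for p in $(pgrep txg_sync); do echo "== pid $p"; cat /proc/$p/stack; done

# recent txg history for the MDT pool (requires zfs_txg_history > 0)
cat /proc/spl/kstat/zfs/lustre-mdt1/txgs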

How consistently is this problem reproduced (e.g. 1 in 5 runs, 1 in 10, ...)? If the problem can be reproduced fairly consistently, then I would recommend trying to see whether it can also be reproduced with ZFS 0.6.5.9 instead of ZFS 0.7.1, since it may very well be a problem in the ZFS code and not in Lustre, or at least a bad interaction with how Lustre is using ZFS.

To test this, change SPLZFSVER in the lbuild script to "0.6.5.9" and then submit the patch with a Test-Parameters: line that runs the test 10 times (or however many runs are needed to reproduce it consistently).
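For reference, one possible form of that change and commit footer; the exact file location and Test-Parameters: syntax are from memory and should be checked against the current lbuild script and autotest documentation, and the number of repeated lines below is only an example:

# in the lbuild script, pin the older SPL/ZFS release
SPLZFSVER="0.6.5.9"

# in the Gerrit commit message, request repeated ZFS runs of the failing suite,
# one Test-Parameters: line per test session (repeat as needed)
Test-Parameters: fstype=zfs testlist=sanity-benchmark
Test-Parameters: fstype=zfs testlist=sanity-benchmark
Test-Parameters: fstype=zfs testlist=sanity-benchmark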

Comment by Sarah Liu [ 21/Sep/17 ]

Is this related to LU-9247?
As for consistency: in the 2.10.1 RC1 results (DNE and non-DNE ZFS), the sanity-benchmark, performance-sanity and obdfilter-survey runs hit either LU-9247 or this issue. Searching all sanity-benchmark runs on b2_10 over the last 4 weeks, this issue was hit in 2 of 7 runs and LU-9247 in 3 of 7 runs.

I will make a patch to test with ZFS 0.6.5.9.

Comment by James Nunez (Inactive) [ 15/Jan/20 ]

This might require a new ticket, but ... I have a new case that looks like what is described here. For Lustre (future) 2.12.4, ZFS with DNE, at https://testing.whamcloud.com/test_sets/4c5e02d0-349e-11ea-b0f4-52540065bddc, we see test_bonnie hang, but also errors during test_dbench.

Looking at the console logs for client1 (vm1), we see dbench processes hung, all with the same trace:

[28504.786302] Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench ============================================================== 17:07:28 (1578676048)
[29239.794475] LNet: 13710:0:(debug.c:370:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
[29510.651244] INFO: task dbench:13724 blocked for more than 120 seconds.
[29510.652635] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[29510.653990] dbench          D ffff9d4dbb0362a0     0 13724  13723 0x00000080
[29510.655393] Call Trace:
[29510.655905]  [<ffffffffa637f229>] schedule+0x29/0x70
[29510.656775]  [<ffffffffa637cbb1>] schedule_timeout+0x221/0x2d0
[29510.658160]  [<ffffffffc0b4ec60>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[29510.659430]  [<ffffffffa637f5dd>] wait_for_completion+0xfd/0x140
[29510.660590]  [<ffffffffa5cda0b0>] ? wake_up_state+0x20/0x20
[29510.661793]  [<ffffffffc0d1caf4>] osc_io_fsync_end+0x74/0xa0 [osc]
[29510.663143]  [<ffffffffc097bc80>] cl_io_end+0x60/0x150 [obdclass]
[29510.664277]  [<ffffffffc0d6e0ab>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[29510.665413]  [<ffffffffc0d6e380>] lov_io_fsync_end+0x80/0x1b0 [lov]
[29510.666552]  [<ffffffffc097bc80>] cl_io_end+0x60/0x150 [obdclass]
[29510.667664]  [<ffffffffc097e39a>] cl_io_loop+0xda/0x1c0 [obdclass]
[29510.668887]  [<ffffffffc0ee4dfb>] cl_sync_file_range+0x2db/0x380 [lustre]
[29510.670094]  [<ffffffffc0ee5129>] ll_fsync+0x289/0x490 [lustre]
[29510.671181]  [<ffffffffa5e7d9f7>] do_fsync+0x67/0xb0
[29510.672102]  [<ffffffffa638ce15>] ? system_call_after_swapgs+0xa2/0x146
[29510.673232]  [<ffffffffa5e7dce0>] SyS_fsync+0x10/0x20
[29510.674238]  [<ffffffffa638cede>] system_call_fastpath+0x25/0x2a
[29510.675317]  [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
[29510.676491] INFO: task dbench:13731 blocked for more than 120 seconds.
[29510.677654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[29510.679025] dbench          D ffff9d4dbbe341c0     0 13731  13723 0x00000080
[29510.680320] Call Trace:
[29510.680788]  [<ffffffffa637f229>] schedule+0x29/0x70
…
[29510.699317]  [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
[29510.700579] INFO: task dbench:13732 blocked for more than 120 seconds.
[29510.701739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[29510.703114] dbench          D ffff9d4dbbe30000     0 13732  13723 0x00000080
[29510.704412] Call Trace:
[29510.704859]  [<ffffffffa637f229>] schedule+0x29/0x70
…
[29510.723527]  [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
[29510.724697] INFO: task dbench:13739 blocked for more than 120 seconds.
[29510.725860] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[29510.727220] dbench          D ffff9d4da6735230     0 13739  13723 0x00000080
[29510.728500] Call Trace:
[29510.729054]  [<ffffffffa637f229>] schedule+0x29/0x70
…
[29510.747529]  [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
[29510.748711] INFO: task dbench:13741 blocked for more than 120 seconds.
[29510.749927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[29510.751293] dbench          D ffff9d4da67341c0     0 13741  13723 0x00000080
[29510.752583] Call Trace:
[29510.753102]  [<ffffffffa637f229>] schedule+0x29/0x70
…
[29510.771672]  [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
[30016.551909] LNet: 14222:0:(debug.c:370:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
[30017.592350] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
[30017.997077] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
[30019.791258] Lustre: DEBUG MARKER: dmesg
[30020.607155] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-benchmark test bonnie: bonnie++ ============================================================ 17:32:45 \(1578677565\)
[30020.885290] Lustre: DEBUG MARKER: == sanity-benchmark test bonnie: bonnie++ ============================================================ 17:32:45 (1578677565)
[30022.115531] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1889280kB available, using 3438440kB file size
[30022.397509] Lustre: DEBUG MARKER: min OST has 1889280kB available, using 3438440kB file size
[31417.232936] Lustre: 3163:0:(client.c:376:ptlrpc_at_adj_net_latency()) Reported service time 297 > total measured time 9
[31417.257892] Lustre: 3163:0:(client.c:376:ptlrpc_at_adj_net_latency()) Reported service time 297 > total measured time 9
[31418.333175] Lustre: 3163:0:(client.c:376:ptlrpc_at_adj_net_latency()) Reported service time 290 > total measured time 2
…

In the console logs of the OSS, we see inactive threads while running both the dbench and bonnie tests:

[28373.743227] Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench ============================================================== 17:07:28 (1578676048)
[29151.759167] LNet: Service thread pid 30219 was inactive for 40.01s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29151.762111] Pid: 30219, comm: ll_ost_io00_009 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29151.763845] Call Trace:
[29151.764347]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
[29151.765484]  [<ffffffffc0244315>] __cv_wait+0x15/0x20 [spl]
[29151.766544]  [<ffffffffc064a503>] txg_wait_open+0xc3/0x110 [zfs]
[29151.768064]  [<ffffffffc05ffdfa>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
[29151.769162]  [<ffffffffc05ffea2>] dmu_tx_assign+0x92/0x490 [zfs]
[29151.770306]  [<ffffffffc10d1fd9>] osd_trans_start+0x199/0x440 [osd_zfs]
[29151.771576]  [<ffffffffc121bbf5>] ofd_trans_start+0x75/0xf0 [ofd]
[29151.772766]  [<ffffffffc1222cb1>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
[29151.774017]  [<ffffffffc122714c>] ofd_commitrw+0x48c/0x9e0 [ofd]
[29151.775172]  [<ffffffffc0efc81c>] obd_commitrw+0x9c/0x370 [ptlrpc]
[29151.776625]  [<ffffffffc0f00cc2>] tgt_brw_write+0xf02/0x1ad0 [ptlrpc]
[29151.777896]  [<ffffffffc0f02b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[29151.779158]  [<ffffffffc0ea746b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[29151.780513]  [<ffffffffc0eaadd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[29151.781657]  [<ffffffff8f2c50d1>] kthread+0xd1/0xe0
[29151.782580]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29151.783703]  [<ffffffffffffffff>] 0xffffffffffffffff
[29151.784681] LustreError: dumping log to /tmp/lustre-log.1578676828.30219
[29170.341906] LNet: Service thread pid 30229 completed after 57.73s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[29174.364066] LNet: Service thread pid 30238 was inactive for 60.85s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29174.367218] Pid: 30238, comm: ll_ost_io00_023 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29174.369133] Call Trace:
[29174.369646]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29174.385846]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29174.386947]  [<ffffffffffffffff>] 0xffffffffffffffff
[29174.387861] Pid: 30237, comm: ll_ost_io00_022 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29174.389604] Call Trace:
[29174.390059]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29174.405701]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29174.406812]  [<ffffffffffffffff>] 0xffffffffffffffff
[29174.407694] Pid: 30232, comm: ll_ost_io00_021 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29174.409403] Call Trace:
[29174.409865]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29174.425546]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29174.426647]  [<ffffffffffffffff>] 0xffffffffffffffff
[29174.427546] Pid: 30230, comm: ll_ost_io00_019 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29174.429258] Call Trace:
[29174.429709]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29174.445405]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29174.446519]  [<ffffffffffffffff>] 0xffffffffffffffff
[29174.447422] LNet: Service thread pid 30231 was inactive for 61.71s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[29175.811107] LNet: Service thread pid 30239 was inactive for 62.21s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[29175.813367] LNet: Skipped 14 previous similar messages
[29175.814259] LustreError: dumping log to /tmp/lustre-log.1578676852.30239
[29180.860776] LNet: Service thread pid 30245 completed after 66.26s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[29180.863656] LNet: Skipped 1 previous similar message
[29185.479226] LNet: Service thread pid 1697 was inactive for 71.57s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
…
[29250.371484] LNet: Service thread pid 30224 completed after 138.03s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[29250.374371] LNet: Skipped 1 previous similar message
[29886.591943] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
[29887.500162] Lustre: DEBUG MARKER: rc=0;
[29887.500162] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
[29887.500162] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
[29887.500162] 				echo $(hostname -s): $val;
[29887.500162] 				rc=$val;
[29887.500162] 			fi;
[29887.500162] 			exit $rc
[29888.765447] Lustre: DEBUG MARKER: dmesg
[29889.610387] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-benchmark test bonnie: bonnie++ ============================================================ 17:32:45 \(1578677565\)
[29889.875373] Lustre: DEBUG MARKER: == sanity-benchmark test bonnie: bonnie++ ============================================================ 17:32:45 (1578677565)
[29891.104509] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1889280kB available, using 3438440kB file size
[29891.374921] Lustre: DEBUG MARKER: min OST has 1889280kB available, using 3438440kB file size
[29937.030473] LNet: Service thread pid 30532 was inactive for 40.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29937.033411] LNet: Skipped 3 previous similar messages
[29937.034276] Pid: 30532, comm: ll_ost_io00_039 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29937.036006] Call Trace:
[29937.036737]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
[29937.037866]  [<ffffffffc0244315>] __cv_wait+0x15/0x20 [spl]
[29937.038927]  [<ffffffffc064a503>] txg_wait_open+0xc3/0x110 [zfs]
[29937.040315]  [<ffffffffc05ffdfa>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
[29937.041495]  [<ffffffffc05ffea2>] dmu_tx_assign+0x92/0x490 [zfs]
[29937.042577]  [<ffffffffc10d1fd9>] osd_trans_start+0x199/0x440 [osd_zfs]
[29937.043860]  [<ffffffffc121bbf5>] ofd_trans_start+0x75/0xf0 [ofd]
[29937.045045]  [<ffffffffc1222cb1>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
[29937.046283]  [<ffffffffc122714c>] ofd_commitrw+0x48c/0x9e0 [ofd]
[29937.047428]  [<ffffffffc0efc81c>] obd_commitrw+0x9c/0x370 [ptlrpc]
[29937.048870]  [<ffffffffc0f00cc2>] tgt_brw_write+0xf02/0x1ad0 [ptlrpc]
[29937.050125]  [<ffffffffc0f02b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[29937.051456]  [<ffffffffc0ea746b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[29937.052819]  [<ffffffffc0eaadd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[29937.054037]  [<ffffffff8f2c50d1>] kthread+0xd1/0xe0
[29937.055092]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29937.056210]  [<ffffffffffffffff>] 0xffffffffffffffff
[29937.057192] LustreError: dumping log to /tmp/lustre-log.1578677613.30532
[29939.358695] LNet: Service thread pid 30527 completed after 42.34s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[29939.361926] LNet: Skipped 5 previous similar messages
[29944.592261] LNet: Service thread pid 30535 was inactive for 46.67s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29944.592622] LNet: Service thread pid 30532 completed after 47.66s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[29944.597937] Pid: 30535, comm: ll_ost_io00_042 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29944.599693] Call Trace:
[29944.600416]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29944.631503]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29944.633800]  [<ffffffffffffffff>] 0xffffffffffffffff
[29944.635849] Pid: 30538, comm: ll_ost_io00_045 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29944.638729] Call Trace:
[29944.639574]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29944.669318]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29944.671370]  [<ffffffffffffffff>] 0xffffffffffffffff
[29944.673425] Pid: 1703, comm: ll_ost_io00_002 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29944.676410] Call Trace:
[29944.677279]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29944.709297]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29944.711640]  [<ffffffffffffffff>] 0xffffffffffffffff
[29944.713831] Pid: 30554, comm: ll_ost_io00_061 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29944.716912] Call Trace:
[29944.717817]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
 

On MDS1/3 (vm4), we see inactive threads while running dbench:

[28372.841352] Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench ============================================================== 17:07:28 (1578676048)
[29178.025765] LNet: Service thread pid 19144 was inactive for 40.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29178.028758] Pid: 19144, comm: mdt00_001 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29178.030468] Call Trace:
[29178.031077]  [<ffffffffc1778c98>] osp_precreate_reserve+0x2e8/0x800 [osp]
[29178.032389]  [<ffffffffc176d8b9>] osp_declare_create+0x199/0x5b0 [osp]
[29178.033653]  [<ffffffffc16b569f>] lod_sub_declare_create+0xdf/0x210 [lod]
[29178.035062]  [<ffffffffc16ad86e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
[29178.036345]  [<ffffffffc16b080e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
[29178.037720]  [<ffffffffc16b492d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
[29178.038944]  [<ffffffffc16b50d5>] lod_prepare_create+0x215/0x2e0 [lod]
[29178.040287]  [<ffffffffc16a4f3e>] lod_declare_striped_create+0x1ee/0x980 [lod]
[29178.041651]  [<ffffffffc16a9814>] lod_declare_create+0x204/0x590 [lod]
[29178.042920]  [<ffffffffc171faf2>] mdd_declare_create_object_internal+0xe2/0x2f0 [mdd]
[29178.044380]  [<ffffffffc170f75c>] mdd_declare_create+0x4c/0xdf0 [mdd]
[29178.045636]  [<ffffffffc1713247>] mdd_create+0x867/0x14a0 [mdd]
[29178.046819]  [<ffffffffc15af9ff>] mdt_reint_open+0x224f/0x3240 [mdt]
[29178.048314]  [<ffffffffc15a2a53>] mdt_reint_rec+0x83/0x210 [mdt]
[29178.049508]  [<ffffffffc157f213>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[29178.050820]  [<ffffffffc158c2e2>] mdt_intent_open+0x82/0x3a0 [mdt]
[29178.051980]  [<ffffffffc158a405>] mdt_intent_policy+0x435/0xd80 [mdt]
[29178.053325]  [<ffffffffc1148e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[29178.054935]  [<ffffffffc1171506>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[29178.056274]  [<ffffffffc11f9cf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[29178.057669]  [<ffffffffc1200b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[29178.058956]  [<ffffffffc11a546b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[29178.060423]  [<ffffffffc11a8dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[29178.061674]  [<ffffffffb5ec50d1>] kthread+0xd1/0xe0
[29178.062605]  [<ffffffffb658cd37>] ret_from_fork_nospec_end+0x0/0x39
[29178.063870]  [<ffffffffffffffff>] 0xffffffffffffffff
[29178.064808] LustreError: dumping log to /tmp/lustre-log.1578676854.19144
[29179.962488] LNet: Service thread pid 26259 was inactive for 42.04s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
… 