[LU-10009] sanity-benchmark test_iozone: test failed to respond and timed out
| Created: | 19/Sep/17 | Updated: | 16/Jan/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/fd03f350-9c9e-11e7-ba27-5254006e85c2.

The sub-test test_iozone failed with the following error:

test failed to respond and timed out

server and client: RHEL7.4 zfs

MDS dmesg shows:

[25904.206622] osp-syn-0-0 D 0000000000000000 0 23127 2 0x00000080
[25904.208389] ffff88005c5976a0 0000000000000046 ffff88004e28eeb0 ffff88005c597fd8
[25904.210163] ffff88005c597fd8 ffff88005c597fd8 ffff88004e28eeb0 ffff8800608452f8
[25904.211921] ffff880060845240 ffff880060845268 ffff880060845300 0000000000000000
[25904.213716] Call Trace:
[25904.214985] [<ffffffff816a94c9>] schedule+0x29/0x70
[25904.216575] [<ffffffffc07144d5>] cv_wait_common+0x125/0x150 [spl]
[25904.218157] [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
[25904.219805] [<ffffffffc0714515>] __cv_wait+0x15/0x20 [spl]
[25904.221433] [<ffffffffc086317f>] txg_wait_synced+0xef/0x140 [zfs]
[25904.223020] [<ffffffffc0818a75>] dmu_tx_wait+0x275/0x3c0 [zfs]
[25904.224676] [<ffffffffc0818c51>] dmu_tx_assign+0x91/0x490 [zfs]
[25904.226256] [<ffffffffc0c3ad00>] ? llog_osd_declare_destroy+0x2f0/0x640 [obdclass]
[25904.228018] [<ffffffffc1085efa>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[25904.229710] [<ffffffffc0c278c7>] llog_cancel_rec+0x147/0x870 [obdclass]
[25904.231414] [<ffffffffc0c2e33a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
[25904.233232] [<ffffffffc0e4b8a0>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[25904.234923] [<ffffffffc13c57f3>] osp_sync_process_committed+0x213/0x6c0 [osp]
[25904.236691] [<ffffffffc13c6bd6>] osp_sync_process_queues+0x556/0x2010 [osp]
[25904.238371] [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
[25904.239978] [<ffffffffc0c28595>] llog_process_thread+0x5a5/0x1180 [obdclass]
[25904.241718] [<ffffffffc13c6680>] ? osp_sync_thread+0x9e0/0x9e0 [osp]
[25904.243404] [<ffffffffc0c2922c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[25904.245050] [<ffffffffc0c2e91d>] llog_cat_process_cb+0x43d/0x4e0 [obdclass]
[25904.246794] [<ffffffffc0c28595>] llog_process_thread+0x5a5/0x1180 [obdclass]
[25904.248521] [<ffffffff810ce8d8>] ? check_preempt_wakeup+0x148/0x250
[25904.250109] [<ffffffffc0c2e4e0>] ? llog_cat_cancel_records+0x2e0/0x2e0 [obdclass]
[25904.251906] [<ffffffffc0c2922c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[25904.253600] [<ffffffffc0c2e4e0>] ? llog_cat_cancel_records+0x2e0/0x2e0 [obdclass]
[25904.255390] [<ffffffffc0c2daa9>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
[25904.257072] [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
[25904.258766] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[25904.260355] [<ffffffffc13c6680>] ? osp_sync_thread+0x9e0/0x9e0 [osp]
[25904.262039] [<ffffffffc0c2dbde>] llog_cat_process+0x2e/0x30 [obdclass]
[25904.263683] [<ffffffffc13c5ea8>] osp_sync_thread+0x208/0x9e0 [osp]
[25904.265347] [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[25904.266851] [<ffffffffc13c5ca0>] ? osp_sync_process_committed+0x6c0/0x6c0 [osp]
[25904.268620] [<ffffffff810b098f>] kthread+0xcf/0xe0
[25904.270093] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[25904.271736] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[25904.273300] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
...
[25904.821292] mdt_rdpg00_002 D ffff880060845300 0 24274 2 0x00000080
[25904.822912] ffff88005a573880 0000000000000046 ffff88006b1b8fd0 ffff88005a573fd8
[25904.824661] ffff88005a573fd8 ffff88005a573fd8 ffff88006b1b8fd0 ffff8800608452f8
[25904.826407] ffff880060845240 ffff880060845268 ffff880060845300 0000000000000000
[25904.828129] Call Trace:
[25904.829349] [<ffffffff816a94c9>] schedule+0x29/0x70
[25904.830842] [<ffffffffc07144d5>] cv_wait_common+0x125/0x150 [spl]
[25904.832434] [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
[25904.833928] [<ffffffffc0714515>] __cv_wait+0x15/0x20 [spl]
[25904.835494] [<ffffffffc086317f>] txg_wait_synced+0xef/0x140 [zfs]
[25904.837084] [<ffffffffc0818a75>] dmu_tx_wait+0x275/0x3c0 [zfs]
[25904.838670] [<ffffffffc0818c51>] dmu_tx_assign+0x91/0x490 [zfs]
[25904.840246] [<ffffffffc1085efa>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[25904.841792] [<ffffffffc103a128>] qmt_trans_start_with_slv+0x248/0x530 [lquota]
[25904.843472] [<ffffffffc1033196>] qmt_dqacq0+0x1a6/0xf00 [lquota]
[25904.845069] [<ffffffffc0e4a2df>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
[25904.846753] [<ffffffffc1036b21>] qmt_intent_policy+0x831/0xe50 [lquota]
[25904.848387] [<ffffffffc12207c2>] mdt_intent_policy+0x662/0xc70 [mdt]
[25904.849991] [<ffffffffc0e0112f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]
[25904.851631] [<ffffffffc0dfa2b7>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
[25904.853270] [<ffffffffc0e23c23>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
[25904.854932] [<ffffffffc0e4be90>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[25904.856589] [<ffffffffc0ea9182>] tgt_enqueue+0x62/0x210 [ptlrpc]
[25904.858208] [<ffffffffc0ead085>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[25904.859868] [<ffffffffc0e55ec6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
[25904.861533] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[25904.863131] [<ffffffffc0e59602>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[25904.864765] [<ffffffffc0e58b70>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[25904.866473] [<ffffffff810b098f>] kthread+0xcf/0xe0
[25904.867963] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[25904.869514] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[25904.871056] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40

Info required for matching: sanity-benchmark iozone |
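Editorial note: both traces above are blocked in txg_wait_synced()/dmu_tx_assign(), waiting on the ZFS sync task for the MDT pool. A minimal diagnostic sketch for the MDS, assuming the ZFS-on-Linux kstat layout; the pool name lustre-mdt1 is only a placeholder, not taken from this report:

# Per-pool transaction group history: shows whether txgs are opening/quiescing/syncing at all
cat /proc/spl/kstat/zfs/lustre-mdt1/txgs
# Stack of the pool's txg_sync thread, to see what the sync task itself is blocked on
for pid in $(pgrep -x txg_sync); do echo "== txg_sync pid $pid =="; cat /proc/$pid/stack; done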
| Comments |
| Comment by Sarah Liu [ 19/Sep/17 ] |
|
This failure has been seen on multiple tests |
| Comment by Andreas Dilger [ 21/Sep/17 ] |
|
It looks like these threads are all stuck in the ZFS transaction commit code, waiting for the txg_sync thread to finish committing the transaction. How consistently is this problem reproduced (e.g. 1 in 5 runs, 1 in 10, ...)? If the problem can be reproduced fairly consistently, then I would recommend trying to reproduce it with ZFS 0.6.5.9 instead of ZFS 0.7.1, since it may very well be a problem in the ZFS code rather than in Lustre, or at least a bad interaction with how Lustre is using ZFS. To test this, change SPLZFSVER in the lbuild script to "0.6.5.9" and then submit the patch with a Test-Parameters: line that runs the test 10 times (or however many runs are needed to reproduce it consistently). |
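Editorial note: a rough sketch of the suggested setup. The SPLZFSVER default-assignment form and the exact Test-Parameters keywords accepted by autotest are assumptions here; repeating the suite name in testlist is only an illustration of "run it N times".

# In the lbuild script (e.g. contrib/lbuild/lbuild), pin the SPL/ZFS version used for the build:
SPLZFSVER=${SPLZFSVER:-0.6.5.9}

# In the test patch's commit message, ask autotest to repeat the suite, for example:
#   Test-Parameters: fstype=zfs testlist=sanity-benchmark,sanity-benchmark,sanity-benchmark
# (repeat the suite name for as many iterations as needed to reproduce the hang)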
| Comment by Sarah Liu [ 21/Sep/17 ] |
|
Does this relate to …? I will make a patch to test with ZFS 0.6.5.9. |
| Comment by James Nunez (Inactive) [ 15/Jan/20 ] |
|
This might require a new ticket, but ... I have a new case that looks like what is described here. For Lustre (future) 2.12.4 at https://testing.whamcloud.com/test_sets/4c5e02d0-349e-11ea-b0f4-52540065bddc for ZFS with DNE, we see test_bonnie hang, but also errors during test_dbench.

Looking at console logs for client1 (vm1), we see dbench process hung with all the same traces:

[28504.786302] Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench ============================================================== 17:07:28 (1578676048)
[29239.794475] LNet: 13710:0:(debug.c:370:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
[29510.651244] INFO: task dbench:13724 blocked for more than 120 seconds.
[29510.652635] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[29510.653990] dbench D ffff9d4dbb0362a0 0 13724 13723 0x00000080
[29510.655393] Call Trace:
[29510.655905] [<ffffffffa637f229>] schedule+0x29/0x70
[29510.656775] [<ffffffffa637cbb1>] schedule_timeout+0x221/0x2d0
[29510.658160] [<ffffffffc0b4ec60>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[29510.659430] [<ffffffffa637f5dd>] wait_for_completion+0xfd/0x140
[29510.660590] [<ffffffffa5cda0b0>] ? wake_up_state+0x20/0x20
[29510.661793] [<ffffffffc0d1caf4>] osc_io_fsync_end+0x74/0xa0 [osc]
[29510.663143] [<ffffffffc097bc80>] cl_io_end+0x60/0x150 [obdclass]
[29510.664277] [<ffffffffc0d6e0ab>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[29510.665413] [<ffffffffc0d6e380>] lov_io_fsync_end+0x80/0x1b0 [lov]
[29510.666552] [<ffffffffc097bc80>] cl_io_end+0x60/0x150 [obdclass]
[29510.667664] [<ffffffffc097e39a>] cl_io_loop+0xda/0x1c0 [obdclass]
[29510.668887] [<ffffffffc0ee4dfb>] cl_sync_file_range+0x2db/0x380 [lustre]
[29510.670094] [<ffffffffc0ee5129>] ll_fsync+0x289/0x490 [lustre]
[29510.671181] [<ffffffffa5e7d9f7>] do_fsync+0x67/0xb0
[29510.672102] [<ffffffffa638ce15>] ? system_call_after_swapgs+0xa2/0x146
[29510.673232] [<ffffffffa5e7dce0>] SyS_fsync+0x10/0x20
[29510.674238] [<ffffffffa638cede>] system_call_fastpath+0x25/0x2a
[29510.675317] [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
[29510.676491] INFO: task dbench:13731 blocked for more than 120 seconds.
[29510.677654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[29510.679025] dbench D ffff9d4dbbe341c0 0 13731 13723 0x00000080
[29510.680320] Call Trace:
[29510.680788] [<ffffffffa637f229>] schedule+0x29/0x70
…
[29510.699317] [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
[29510.700579] INFO: task dbench:13732 blocked for more than 120 seconds.
[29510.701739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[29510.703114] dbench D ffff9d4dbbe30000 0 13732 13723 0x00000080
[29510.704412] Call Trace:
[29510.704859] [<ffffffffa637f229>] schedule+0x29/0x70
…
[29510.723527] [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
[29510.724697] INFO: task dbench:13739 blocked for more than 120 seconds.
[29510.725860] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[29510.727220] dbench D ffff9d4da6735230 0 13739 13723 0x00000080
[29510.728500] Call Trace:
[29510.729054] [<ffffffffa637f229>] schedule+0x29/0x70
…
[29510.747529] [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
[29510.748711] INFO: task dbench:13741 blocked for more than 120 seconds.
[29510.749927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[29510.751293] dbench D ffff9d4da67341c0 0 13741 13723 0x00000080
[29510.752583] Call Trace:
[29510.753102] [<ffffffffa637f229>] schedule+0x29/0x70
…
[29510.771672] [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
[30016.551909] LNet: 14222:0:(debug.c:370:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
[30017.592350] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[30017.997077] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
[30019.791258] Lustre: DEBUG MARKER: dmesg
[30020.607155] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-benchmark test bonnie: bonnie++ ============================================================ 17:32:45 \(1578677565\)
[30020.885290] Lustre: DEBUG MARKER: == sanity-benchmark test bonnie: bonnie++ ============================================================ 17:32:45 (1578677565)
[30022.115531] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1889280kB available, using 3438440kB file size
[30022.397509] Lustre: DEBUG MARKER: min OST has 1889280kB available, using 3438440kB file size
[31417.232936] Lustre: 3163:0:(client.c:376:ptlrpc_at_adj_net_latency()) Reported service time 297 > total measured time 9
[31417.257892] Lustre: 3163:0:(client.c:376:ptlrpc_at_adj_net_latency()) Reported service time 297 > total measured time 9
[31418.333175] Lustre: 3163:0:(client.c:376:ptlrpc_at_adj_net_latency()) Reported service time 290 > total measured time 2
…

In the console logs of the OSS, we see inactive threads while running both dbench and bonnie tests:

[28373.743227] Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench ============================================================== 17:07:28 (1578676048)
[29151.759167] LNet: Service thread pid 30219 was inactive for 40.01s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29151.762111] Pid: 30219, comm: ll_ost_io00_009 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29151.763845] Call Trace:
[29151.764347] [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
[29151.765484] [<ffffffffc0244315>] __cv_wait+0x15/0x20 [spl]
[29151.766544] [<ffffffffc064a503>] txg_wait_open+0xc3/0x110 [zfs]
[29151.768064] [<ffffffffc05ffdfa>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
[29151.769162] [<ffffffffc05ffea2>] dmu_tx_assign+0x92/0x490 [zfs]
[29151.770306] [<ffffffffc10d1fd9>] osd_trans_start+0x199/0x440 [osd_zfs]
[29151.771576] [<ffffffffc121bbf5>] ofd_trans_start+0x75/0xf0 [ofd]
[29151.772766] [<ffffffffc1222cb1>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
[29151.774017] [<ffffffffc122714c>] ofd_commitrw+0x48c/0x9e0 [ofd]
[29151.775172] [<ffffffffc0efc81c>] obd_commitrw+0x9c/0x370 [ptlrpc]
[29151.776625] [<ffffffffc0f00cc2>] tgt_brw_write+0xf02/0x1ad0 [ptlrpc]
[29151.777896] [<ffffffffc0f02b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[29151.779158] [<ffffffffc0ea746b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[29151.780513] [<ffffffffc0eaadd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[29151.781657] [<ffffffff8f2c50d1>] kthread+0xd1/0xe0
[29151.782580] [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29151.783703] [<ffffffffffffffff>] 0xffffffffffffffff
[29151.784681] LustreError: dumping log to /tmp/lustre-log.1578676828.30219
[29170.341906] LNet: Service thread pid 30229 completed after 57.73s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[29174.364066] LNet: Service thread pid 30238 was inactive for 60.85s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29174.367218] Pid: 30238, comm: ll_ost_io00_023 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29174.369133] Call Trace:
[29174.369646] [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29174.385846] [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29174.386947] [<ffffffffffffffff>] 0xffffffffffffffff
[29174.387861] Pid: 30237, comm: ll_ost_io00_022 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29174.389604] Call Trace:
[29174.390059] [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29174.405701] [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29174.406812] [<ffffffffffffffff>] 0xffffffffffffffff
[29174.407694] Pid: 30232, comm: ll_ost_io00_021 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29174.409403] Call Trace:
[29174.409865] [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29174.425546] [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29174.426647] [<ffffffffffffffff>] 0xffffffffffffffff
[29174.427546] Pid: 30230, comm: ll_ost_io00_019 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29174.429258] Call Trace:
[29174.429709] [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29174.445405] [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29174.446519] [<ffffffffffffffff>] 0xffffffffffffffff
[29174.447422] LNet: Service thread pid 30231 was inactive for 61.71s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[29175.811107] LNet: Service thread pid 30239 was inactive for 62.21s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[29175.813367] LNet: Skipped 14 previous similar messages
[29175.814259] LustreError: dumping log to /tmp/lustre-log.1578676852.30239
[29180.860776] LNet: Service thread pid 30245 completed after 66.26s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[29180.863656] LNet: Skipped 1 previous similar message
[29185.479226] LNet: Service thread pid 1697 was inactive for 71.57s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
…
[29250.371484] LNet: Service thread pid 30224 completed after 138.03s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[29250.374371] LNet: Skipped 1 previous similar message
[29886.591943] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[29887.500162] Lustre: DEBUG MARKER: rc=0;
[29887.500162] val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
[29887.500162] if [[ $? -eq 0 && $val -ne 0 ]]; then
[29887.500162] echo $(hostname -s): $val;
[29887.500162] rc=$val;
[29887.500162] fi;
[29887.500162] exit $rc
[29888.765447] Lustre: DEBUG MARKER: dmesg
[29889.610387] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-benchmark test bonnie: bonnie++ ============================================================ 17:32:45 \(1578677565\)
[29889.875373] Lustre: DEBUG MARKER: == sanity-benchmark test bonnie: bonnie++ ============================================================ 17:32:45 (1578677565)
[29891.104509] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1889280kB available, using 3438440kB file size
[29891.374921] Lustre: DEBUG MARKER: min OST has 1889280kB available, using 3438440kB file size
[29937.030473] LNet: Service thread pid 30532 was inactive for 40.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29937.033411] LNet: Skipped 3 previous similar messages
[29937.034276] Pid: 30532, comm: ll_ost_io00_039 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29937.036006] Call Trace:
[29937.036737] [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
[29937.037866] [<ffffffffc0244315>] __cv_wait+0x15/0x20 [spl]
[29937.038927] [<ffffffffc064a503>] txg_wait_open+0xc3/0x110 [zfs]
[29937.040315] [<ffffffffc05ffdfa>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
[29937.041495] [<ffffffffc05ffea2>] dmu_tx_assign+0x92/0x490 [zfs]
[29937.042577] [<ffffffffc10d1fd9>] osd_trans_start+0x199/0x440 [osd_zfs]
[29937.043860] [<ffffffffc121bbf5>] ofd_trans_start+0x75/0xf0 [ofd]
[29937.045045] [<ffffffffc1222cb1>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
[29937.046283] [<ffffffffc122714c>] ofd_commitrw+0x48c/0x9e0 [ofd]
[29937.047428] [<ffffffffc0efc81c>] obd_commitrw+0x9c/0x370 [ptlrpc]
[29937.048870] [<ffffffffc0f00cc2>] tgt_brw_write+0xf02/0x1ad0 [ptlrpc]
[29937.050125] [<ffffffffc0f02b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[29937.051456] [<ffffffffc0ea746b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[29937.052819] [<ffffffffc0eaadd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[29937.054037] [<ffffffff8f2c50d1>] kthread+0xd1/0xe0
[29937.055092] [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29937.056210] [<ffffffffffffffff>] 0xffffffffffffffff
[29937.057192] LustreError: dumping log to /tmp/lustre-log.1578677613.30532
[29939.358695] LNet: Service thread pid 30527 completed after 42.34s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[29939.361926] LNet: Skipped 5 previous similar messages
[29944.592261] LNet: Service thread pid 30535 was inactive for 46.67s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29944.592622] LNet: Service thread pid 30532 completed after 47.66s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[29944.597937] Pid: 30535, comm: ll_ost_io00_042 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29944.599693] Call Trace:
[29944.600416] [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29944.631503] [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29944.633800] [<ffffffffffffffff>] 0xffffffffffffffff
[29944.635849] Pid: 30538, comm: ll_ost_io00_045 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29944.638729] Call Trace:
[29944.639574] [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29944.669318] [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29944.671370] [<ffffffffffffffff>] 0xffffffffffffffff
[29944.673425] Pid: 1703, comm: ll_ost_io00_002 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29944.676410] Call Trace:
[29944.677279] [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…
[29944.709297] [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
[29944.711640] [<ffffffffffffffff>] 0xffffffffffffffff
[29944.713831] Pid: 30554, comm: ll_ost_io00_061 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29944.716912] Call Trace:
[29944.717817] [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
…

On the MDS1/3 (vm4), we see inactive threads while running dbench:

[28372.841352] Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench ============================================================== 17:07:28 (1578676048)
[29178.025765] LNet: Service thread pid 19144 was inactive for 40.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[29178.028758] Pid: 19144, comm: mdt00_001 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
[29178.030468] Call Trace:
[29178.031077] [<ffffffffc1778c98>] osp_precreate_reserve+0x2e8/0x800 [osp]
[29178.032389] [<ffffffffc176d8b9>] osp_declare_create+0x199/0x5b0 [osp]
[29178.033653] [<ffffffffc16b569f>] lod_sub_declare_create+0xdf/0x210 [lod]
[29178.035062] [<ffffffffc16ad86e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
[29178.036345] [<ffffffffc16b080e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
[29178.037720] [<ffffffffc16b492d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
[29178.038944] [<ffffffffc16b50d5>] lod_prepare_create+0x215/0x2e0 [lod]
[29178.040287] [<ffffffffc16a4f3e>] lod_declare_striped_create+0x1ee/0x980 [lod]
[29178.041651] [<ffffffffc16a9814>] lod_declare_create+0x204/0x590 [lod]
[29178.042920] [<ffffffffc171faf2>] mdd_declare_create_object_internal+0xe2/0x2f0 [mdd]
[29178.044380] [<ffffffffc170f75c>] mdd_declare_create+0x4c/0xdf0 [mdd]
[29178.045636] [<ffffffffc1713247>] mdd_create+0x867/0x14a0 [mdd]
[29178.046819] [<ffffffffc15af9ff>] mdt_reint_open+0x224f/0x3240 [mdt]
[29178.048314] [<ffffffffc15a2a53>] mdt_reint_rec+0x83/0x210 [mdt]
[29178.049508] [<ffffffffc157f213>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[29178.050820] [<ffffffffc158c2e2>] mdt_intent_open+0x82/0x3a0 [mdt]
[29178.051980] [<ffffffffc158a405>] mdt_intent_policy+0x435/0xd80 [mdt]
[29178.053325] [<ffffffffc1148e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[29178.054935] [<ffffffffc1171506>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[29178.056274] [<ffffffffc11f9cf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[29178.057669] [<ffffffffc1200b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[29178.058956] [<ffffffffc11a546b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[29178.060423] [<ffffffffc11a8dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[29178.061674] [<ffffffffb5ec50d1>] kthread+0xd1/0xe0
[29178.062605] [<ffffffffb658cd37>] ret_from_fork_nospec_end+0x0/0x39
[29178.063870] [<ffffffffffffffff>] 0xffffffffffffffff
[29178.064808] LustreError: dumping log to /tmp/lustre-log.1578676854.19144
[29179.962488] LNet: Service thread pid 26259 was inactive for 42.04s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
… |
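Editorial note: the repeated "This indicates the system was overloaded (too many service threads, or there were not enough hardware resources)" warnings point at checking how many ll_ost_io service threads the OSS has spawned relative to what these VMs can sustain. A hedged sketch using the standard OSS tunables; the value 32 is only an illustrative cap, not a recommendation from this ticket:

# On the OSS: current, minimum, and maximum I/O service thread counts
lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_min ost.OSS.ost_io.threads_max
# Example only: cap the thread count for a test run to see whether backing off
# reduces the txg_wait_open pile-up seen in the stack traces above
lctl set_param ost.OSS.ost_io.threads_max=32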