Lustre / LU-10009

sanity-benchmark test_iozone: test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Versions: Lustre 2.10.1, Lustre 2.12.4
    • Severity: 3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/fd03f350-9c9e-11e7-ba27-5254006e85c2.

      The sub-test test_iozone failed with the following error:

      test failed to respond and timed out
      

      Server and client: RHEL 7.4, ZFS

      MDS dmesg shows:

      [25904.206622] osp-syn-0-0     D 0000000000000000     0 23127      2 0x00000080
      [25904.208389]  ffff88005c5976a0 0000000000000046 ffff88004e28eeb0 ffff88005c597fd8
      [25904.210163]  ffff88005c597fd8 ffff88005c597fd8 ffff88004e28eeb0 ffff8800608452f8
      [25904.211921]  ffff880060845240 ffff880060845268 ffff880060845300 0000000000000000
      [25904.213716] Call Trace:
      [25904.214985]  [<ffffffff816a94c9>] schedule+0x29/0x70
      [25904.216575]  [<ffffffffc07144d5>] cv_wait_common+0x125/0x150 [spl]
      [25904.218157]  [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
      [25904.219805]  [<ffffffffc0714515>] __cv_wait+0x15/0x20 [spl]
      [25904.221433]  [<ffffffffc086317f>] txg_wait_synced+0xef/0x140 [zfs]
      [25904.223020]  [<ffffffffc0818a75>] dmu_tx_wait+0x275/0x3c0 [zfs]
      [25904.224676]  [<ffffffffc0818c51>] dmu_tx_assign+0x91/0x490 [zfs]
      [25904.226256]  [<ffffffffc0c3ad00>] ? llog_osd_declare_destroy+0x2f0/0x640 [obdclass]
      [25904.228018]  [<ffffffffc1085efa>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
      [25904.229710]  [<ffffffffc0c278c7>] llog_cancel_rec+0x147/0x870 [obdclass]
      [25904.231414]  [<ffffffffc0c2e33a>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
      [25904.233232]  [<ffffffffc0e4b8a0>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
      [25904.234923]  [<ffffffffc13c57f3>] osp_sync_process_committed+0x213/0x6c0 [osp]
      [25904.236691]  [<ffffffffc13c6bd6>] osp_sync_process_queues+0x556/0x2010 [osp]
      [25904.238371]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
      [25904.239978]  [<ffffffffc0c28595>] llog_process_thread+0x5a5/0x1180 [obdclass]
      [25904.241718]  [<ffffffffc13c6680>] ? osp_sync_thread+0x9e0/0x9e0 [osp]
      [25904.243404]  [<ffffffffc0c2922c>] llog_process_or_fork+0xbc/0x450 [obdclass]
      [25904.245050]  [<ffffffffc0c2e91d>] llog_cat_process_cb+0x43d/0x4e0 [obdclass]
      [25904.246794]  [<ffffffffc0c28595>] llog_process_thread+0x5a5/0x1180 [obdclass]
      [25904.248521]  [<ffffffff810ce8d8>] ? check_preempt_wakeup+0x148/0x250
      [25904.250109]  [<ffffffffc0c2e4e0>] ? llog_cat_cancel_records+0x2e0/0x2e0 [obdclass]
      [25904.251906]  [<ffffffffc0c2922c>] llog_process_or_fork+0xbc/0x450 [obdclass]
      [25904.253600]  [<ffffffffc0c2e4e0>] ? llog_cat_cancel_records+0x2e0/0x2e0 [obdclass]
      [25904.255390]  [<ffffffffc0c2daa9>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
      [25904.257072]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      [25904.258766]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      [25904.260355]  [<ffffffffc13c6680>] ? osp_sync_thread+0x9e0/0x9e0 [osp]
      [25904.262039]  [<ffffffffc0c2dbde>] llog_cat_process+0x2e/0x30 [obdclass]
      [25904.263683]  [<ffffffffc13c5ea8>] osp_sync_thread+0x208/0x9e0 [osp]
      [25904.265347]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      [25904.266851]  [<ffffffffc13c5ca0>] ? osp_sync_process_committed+0x6c0/0x6c0 [osp]
      [25904.268620]  [<ffffffff810b098f>] kthread+0xcf/0xe0
      [25904.270093]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      [25904.271736]  [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
      [25904.273300]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      
      ...
      
      [25904.821292] mdt_rdpg00_002  D ffff880060845300     0 24274      2 0x00000080
      [25904.822912]  ffff88005a573880 0000000000000046 ffff88006b1b8fd0 ffff88005a573fd8
      [25904.824661]  ffff88005a573fd8 ffff88005a573fd8 ffff88006b1b8fd0 ffff8800608452f8
      [25904.826407]  ffff880060845240 ffff880060845268 ffff880060845300 0000000000000000
      [25904.828129] Call Trace:
      [25904.829349]  [<ffffffff816a94c9>] schedule+0x29/0x70
      [25904.830842]  [<ffffffffc07144d5>] cv_wait_common+0x125/0x150 [spl]
      [25904.832434]  [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
      [25904.833928]  [<ffffffffc0714515>] __cv_wait+0x15/0x20 [spl]
      [25904.835494]  [<ffffffffc086317f>] txg_wait_synced+0xef/0x140 [zfs]
      [25904.837084]  [<ffffffffc0818a75>] dmu_tx_wait+0x275/0x3c0 [zfs]
      [25904.838670]  [<ffffffffc0818c51>] dmu_tx_assign+0x91/0x490 [zfs]
      [25904.840246]  [<ffffffffc1085efa>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
      [25904.841792]  [<ffffffffc103a128>] qmt_trans_start_with_slv+0x248/0x530 [lquota]
      [25904.843472]  [<ffffffffc1033196>] qmt_dqacq0+0x1a6/0xf00 [lquota]
      [25904.845069]  [<ffffffffc0e4a2df>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
      [25904.846753]  [<ffffffffc1036b21>] qmt_intent_policy+0x831/0xe50 [lquota]
      [25904.848387]  [<ffffffffc12207c2>] mdt_intent_policy+0x662/0xc70 [mdt]
      [25904.849991]  [<ffffffffc0e0112f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]
      [25904.851631]  [<ffffffffc0dfa2b7>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
      [25904.853270]  [<ffffffffc0e23c23>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
      [25904.854932]  [<ffffffffc0e4be90>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
      [25904.856589]  [<ffffffffc0ea9182>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [25904.858208]  [<ffffffffc0ead085>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      [25904.859868]  [<ffffffffc0e55ec6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      [25904.861533]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      [25904.863131]  [<ffffffffc0e59602>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [25904.864765]  [<ffffffffc0e58b70>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
      [25904.866473]  [<ffffffff810b098f>] kthread+0xcf/0xe0
      [25904.867963]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      [25904.869514]  [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
      [25904.871056]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      

      Info required for matching: sanity-benchmark iozone

    Activity

            This might require a new ticket, but ... I have a new case that looks like what is described here. For (future) Lustre 2.12.4, in the ZFS-with-DNE run at https://testing.whamcloud.com/test_sets/4c5e02d0-349e-11ea-b0f4-52540065bddc, we see test_bonnie hang, but also errors during test_dbench.

            Looking at the console logs for client1 (vm1), we see dbench processes hung, all with the same trace

            [28504.786302] Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench ============================================================== 17:07:28 (1578676048)
            [29239.794475] LNet: 13710:0:(debug.c:370:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
            [29510.651244] INFO: task dbench:13724 blocked for more than 120 seconds.
            [29510.652635] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [29510.653990] dbench          D ffff9d4dbb0362a0     0 13724  13723 0x00000080
            [29510.655393] Call Trace:
            [29510.655905]  [<ffffffffa637f229>] schedule+0x29/0x70
            [29510.656775]  [<ffffffffa637cbb1>] schedule_timeout+0x221/0x2d0
            [29510.658160]  [<ffffffffc0b4ec60>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
            [29510.659430]  [<ffffffffa637f5dd>] wait_for_completion+0xfd/0x140
            [29510.660590]  [<ffffffffa5cda0b0>] ? wake_up_state+0x20/0x20
            [29510.661793]  [<ffffffffc0d1caf4>] osc_io_fsync_end+0x74/0xa0 [osc]
            [29510.663143]  [<ffffffffc097bc80>] cl_io_end+0x60/0x150 [obdclass]
            [29510.664277]  [<ffffffffc0d6e0ab>] lov_io_end_wrapper+0xdb/0xe0 [lov]
            [29510.665413]  [<ffffffffc0d6e380>] lov_io_fsync_end+0x80/0x1b0 [lov]
            [29510.666552]  [<ffffffffc097bc80>] cl_io_end+0x60/0x150 [obdclass]
            [29510.667664]  [<ffffffffc097e39a>] cl_io_loop+0xda/0x1c0 [obdclass]
            [29510.668887]  [<ffffffffc0ee4dfb>] cl_sync_file_range+0x2db/0x380 [lustre]
            [29510.670094]  [<ffffffffc0ee5129>] ll_fsync+0x289/0x490 [lustre]
            [29510.671181]  [<ffffffffa5e7d9f7>] do_fsync+0x67/0xb0
            [29510.672102]  [<ffffffffa638ce15>] ? system_call_after_swapgs+0xa2/0x146
            [29510.673232]  [<ffffffffa5e7dce0>] SyS_fsync+0x10/0x20
            [29510.674238]  [<ffffffffa638cede>] system_call_fastpath+0x25/0x2a
            [29510.675317]  [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
            [29510.676491] INFO: task dbench:13731 blocked for more than 120 seconds.
            [29510.677654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [29510.679025] dbench          D ffff9d4dbbe341c0     0 13731  13723 0x00000080
            [29510.680320] Call Trace:
            [29510.680788]  [<ffffffffa637f229>] schedule+0x29/0x70
            …
            [29510.699317]  [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
            [29510.700579] INFO: task dbench:13732 blocked for more than 120 seconds.
            [29510.701739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [29510.703114] dbench          D ffff9d4dbbe30000     0 13732  13723 0x00000080
            [29510.704412] Call Trace:
            [29510.704859]  [<ffffffffa637f229>] schedule+0x29/0x70
            …
            [29510.723527]  [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
            [29510.724697] INFO: task dbench:13739 blocked for more than 120 seconds.
            [29510.725860] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [29510.727220] dbench          D ffff9d4da6735230     0 13739  13723 0x00000080
            [29510.728500] Call Trace:
            [29510.729054]  [<ffffffffa637f229>] schedule+0x29/0x70
            …
            [29510.747529]  [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
            [29510.748711] INFO: task dbench:13741 blocked for more than 120 seconds.
            [29510.749927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [29510.751293] dbench          D ffff9d4da67341c0     0 13741  13723 0x00000080
            [29510.752583] Call Trace:
            [29510.753102]  [<ffffffffa637f229>] schedule+0x29/0x70
            …
            [29510.771672]  [<ffffffffa638ce21>] ? system_call_after_swapgs+0xae/0x146
            [30016.551909] LNet: 14222:0:(debug.c:370:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
            [30017.592350] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
            [30017.997077] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
            [30019.791258] Lustre: DEBUG MARKER: dmesg
            [30020.607155] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-benchmark test bonnie: bonnie++ ============================================================ 17:32:45 \(1578677565\)
            [30020.885290] Lustre: DEBUG MARKER: == sanity-benchmark test bonnie: bonnie++ ============================================================ 17:32:45 (1578677565)
            [30022.115531] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1889280kB available, using 3438440kB file size
            [30022.397509] Lustre: DEBUG MARKER: min OST has 1889280kB available, using 3438440kB file size
            [31417.232936] Lustre: 3163:0:(client.c:376:ptlrpc_at_adj_net_latency()) Reported service time 297 > total measured time 9
            [31417.257892] Lustre: 3163:0:(client.c:376:ptlrpc_at_adj_net_latency()) Reported service time 297 > total measured time 9
            [31418.333175] Lustre: 3163:0:(client.c:376:ptlrpc_at_adj_net_latency()) Reported service time 290 > total measured time 2
            …
            

            In the console logs of the OSS, we see inactive threads while running both dbench and bonnie tests

            [28373.743227] Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench ============================================================== 17:07:28 (1578676048)
            [29151.759167] LNet: Service thread pid 30219 was inactive for 40.01s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [29151.762111] Pid: 30219, comm: ll_ost_io00_009 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
            [29151.763845] Call Trace:
            [29151.764347]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
            [29151.765484]  [<ffffffffc0244315>] __cv_wait+0x15/0x20 [spl]
            [29151.766544]  [<ffffffffc064a503>] txg_wait_open+0xc3/0x110 [zfs]
            [29151.768064]  [<ffffffffc05ffdfa>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
            [29151.769162]  [<ffffffffc05ffea2>] dmu_tx_assign+0x92/0x490 [zfs]
            [29151.770306]  [<ffffffffc10d1fd9>] osd_trans_start+0x199/0x440 [osd_zfs]
            [29151.771576]  [<ffffffffc121bbf5>] ofd_trans_start+0x75/0xf0 [ofd]
            [29151.772766]  [<ffffffffc1222cb1>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
            [29151.774017]  [<ffffffffc122714c>] ofd_commitrw+0x48c/0x9e0 [ofd]
            [29151.775172]  [<ffffffffc0efc81c>] obd_commitrw+0x9c/0x370 [ptlrpc]
            [29151.776625]  [<ffffffffc0f00cc2>] tgt_brw_write+0xf02/0x1ad0 [ptlrpc]
            [29151.777896]  [<ffffffffc0f02b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
            [29151.779158]  [<ffffffffc0ea746b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [29151.780513]  [<ffffffffc0eaadd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            [29151.781657]  [<ffffffff8f2c50d1>] kthread+0xd1/0xe0
            [29151.782580]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
            [29151.783703]  [<ffffffffffffffff>] 0xffffffffffffffff
            [29151.784681] LustreError: dumping log to /tmp/lustre-log.1578676828.30219
            [29170.341906] LNet: Service thread pid 30229 completed after 57.73s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [29174.364066] LNet: Service thread pid 30238 was inactive for 60.85s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [29174.367218] Pid: 30238, comm: ll_ost_io00_023 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
            [29174.369133] Call Trace:
            [29174.369646]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
            …
            [29174.385846]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
            [29174.386947]  [<ffffffffffffffff>] 0xffffffffffffffff
            [29174.387861] Pid: 30237, comm: ll_ost_io00_022 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
            [29174.389604] Call Trace:
            [29174.390059]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
            …
            [29174.405701]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
            [29174.406812]  [<ffffffffffffffff>] 0xffffffffffffffff
            [29174.407694] Pid: 30232, comm: ll_ost_io00_021 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
            [29174.409403] Call Trace:
            [29174.409865]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
            …
            [29174.425546]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
            [29174.426647]  [<ffffffffffffffff>] 0xffffffffffffffff
            [29174.427546] Pid: 30230, comm: ll_ost_io00_019 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
            [29174.429258] Call Trace:
            [29174.429709]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
            …
            [29174.445405]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
            [29174.446519]  [<ffffffffffffffff>] 0xffffffffffffffff
            [29174.447422] LNet: Service thread pid 30231 was inactive for 61.71s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
            [29175.811107] LNet: Service thread pid 30239 was inactive for 62.21s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
            [29175.813367] LNet: Skipped 14 previous similar messages
            [29175.814259] LustreError: dumping log to /tmp/lustre-log.1578676852.30239
            [29180.860776] LNet: Service thread pid 30245 completed after 66.26s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [29180.863656] LNet: Skipped 1 previous similar message
            [29185.479226] LNet: Service thread pid 1697 was inactive for 71.57s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
            …
            [29250.371484] LNet: Service thread pid 30224 completed after 138.03s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [29250.374371] LNet: Skipped 1 previous similar message
            [29886.591943] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
            [29887.500162] Lustre: DEBUG MARKER: rc=0;
            [29887.500162] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
            [29887.500162] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
            [29887.500162] 				echo $(hostname -s): $val;
            [29887.500162] 				rc=$val;
            [29887.500162] 			fi;
            [29887.500162] 			exit $rc
            [29888.765447] Lustre: DEBUG MARKER: dmesg
            [29889.610387] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-benchmark test bonnie: bonnie++ ============================================================ 17:32:45 \(1578677565\)
            [29889.875373] Lustre: DEBUG MARKER: == sanity-benchmark test bonnie: bonnie++ ============================================================ 17:32:45 (1578677565)
            [29891.104509] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1889280kB available, using 3438440kB file size
            [29891.374921] Lustre: DEBUG MARKER: min OST has 1889280kB available, using 3438440kB file size
            [29937.030473] LNet: Service thread pid 30532 was inactive for 40.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [29937.033411] LNet: Skipped 3 previous similar messages
            [29937.034276] Pid: 30532, comm: ll_ost_io00_039 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
            [29937.036006] Call Trace:
            [29937.036737]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
            [29937.037866]  [<ffffffffc0244315>] __cv_wait+0x15/0x20 [spl]
            [29937.038927]  [<ffffffffc064a503>] txg_wait_open+0xc3/0x110 [zfs]
            [29937.040315]  [<ffffffffc05ffdfa>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
            [29937.041495]  [<ffffffffc05ffea2>] dmu_tx_assign+0x92/0x490 [zfs]
            [29937.042577]  [<ffffffffc10d1fd9>] osd_trans_start+0x199/0x440 [osd_zfs]
            [29937.043860]  [<ffffffffc121bbf5>] ofd_trans_start+0x75/0xf0 [ofd]
            [29937.045045]  [<ffffffffc1222cb1>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
            [29937.046283]  [<ffffffffc122714c>] ofd_commitrw+0x48c/0x9e0 [ofd]
            [29937.047428]  [<ffffffffc0efc81c>] obd_commitrw+0x9c/0x370 [ptlrpc]
            [29937.048870]  [<ffffffffc0f00cc2>] tgt_brw_write+0xf02/0x1ad0 [ptlrpc]
            [29937.050125]  [<ffffffffc0f02b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
            [29937.051456]  [<ffffffffc0ea746b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [29937.052819]  [<ffffffffc0eaadd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            [29937.054037]  [<ffffffff8f2c50d1>] kthread+0xd1/0xe0
            [29937.055092]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
            [29937.056210]  [<ffffffffffffffff>] 0xffffffffffffffff
            [29937.057192] LustreError: dumping log to /tmp/lustre-log.1578677613.30532
            [29939.358695] LNet: Service thread pid 30527 completed after 42.34s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [29939.361926] LNet: Skipped 5 previous similar messages
            [29944.592261] LNet: Service thread pid 30535 was inactive for 46.67s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [29944.592622] LNet: Service thread pid 30532 completed after 47.66s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [29944.597937] Pid: 30535, comm: ll_ost_io00_042 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
            [29944.599693] Call Trace:
            [29944.600416]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
            …
            [29944.631503]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
            [29944.633800]  [<ffffffffffffffff>] 0xffffffffffffffff
            [29944.635849] Pid: 30538, comm: ll_ost_io00_045 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
            [29944.638729] Call Trace:
            [29944.639574]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
            …
            [29944.669318]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
            [29944.671370]  [<ffffffffffffffff>] 0xffffffffffffffff
            [29944.673425] Pid: 1703, comm: ll_ost_io00_002 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
            [29944.676410] Call Trace:
            [29944.677279]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
            …
            [29944.709297]  [<ffffffff8f98cd37>] ret_from_fork_nospec_end+0x0/0x39
            [29944.711640]  [<ffffffffffffffff>] 0xffffffffffffffff
            [29944.713831] Pid: 30554, comm: ll_ost_io00_061 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
            [29944.716912] Call Trace:
            [29944.717817]  [<ffffffffc02442d5>] cv_wait_common+0x125/0x150 [spl]
            …
             

            On the MDS1/3 (vm4), we see inactive threads while running dbench

            [28372.841352] Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench ============================================================== 17:07:28 (1578676048)
            [29178.025765] LNet: Service thread pid 19144 was inactive for 40.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [29178.028758] Pid: 19144, comm: mdt00_001 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Dec 13 20:09:43 UTC 2019
            [29178.030468] Call Trace:
            [29178.031077]  [<ffffffffc1778c98>] osp_precreate_reserve+0x2e8/0x800 [osp]
            [29178.032389]  [<ffffffffc176d8b9>] osp_declare_create+0x199/0x5b0 [osp]
            [29178.033653]  [<ffffffffc16b569f>] lod_sub_declare_create+0xdf/0x210 [lod]
            [29178.035062]  [<ffffffffc16ad86e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
            [29178.036345]  [<ffffffffc16b080e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
            [29178.037720]  [<ffffffffc16b492d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
            [29178.038944]  [<ffffffffc16b50d5>] lod_prepare_create+0x215/0x2e0 [lod]
            [29178.040287]  [<ffffffffc16a4f3e>] lod_declare_striped_create+0x1ee/0x980 [lod]
            [29178.041651]  [<ffffffffc16a9814>] lod_declare_create+0x204/0x590 [lod]
            [29178.042920]  [<ffffffffc171faf2>] mdd_declare_create_object_internal+0xe2/0x2f0 [mdd]
            [29178.044380]  [<ffffffffc170f75c>] mdd_declare_create+0x4c/0xdf0 [mdd]
            [29178.045636]  [<ffffffffc1713247>] mdd_create+0x867/0x14a0 [mdd]
            [29178.046819]  [<ffffffffc15af9ff>] mdt_reint_open+0x224f/0x3240 [mdt]
            [29178.048314]  [<ffffffffc15a2a53>] mdt_reint_rec+0x83/0x210 [mdt]
            [29178.049508]  [<ffffffffc157f213>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
            [29178.050820]  [<ffffffffc158c2e2>] mdt_intent_open+0x82/0x3a0 [mdt]
            [29178.051980]  [<ffffffffc158a405>] mdt_intent_policy+0x435/0xd80 [mdt]
            [29178.053325]  [<ffffffffc1148e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
            [29178.054935]  [<ffffffffc1171506>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
            [29178.056274]  [<ffffffffc11f9cf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
            [29178.057669]  [<ffffffffc1200b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
            [29178.058956]  [<ffffffffc11a546b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [29178.060423]  [<ffffffffc11a8dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            [29178.061674]  [<ffffffffb5ec50d1>] kthread+0xd1/0xe0
            [29178.062605]  [<ffffffffb658cd37>] ret_from_fork_nospec_end+0x0/0x39
            [29178.063870]  [<ffffffffffffffff>] 0xffffffffffffffff
            [29178.064808] LustreError: dumping log to /tmp/lustre-log.1578676854.19144
            [29179.962488] LNet: Service thread pid 26259 was inactive for 42.04s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            … 
            
            jamesanunez James Nunez (Inactive) added a comment
            sarah Sarah Liu added a comment - - edited

            Does this relate to LU-9247?
            For consistency, looking at the 2.10.1 RC1 results for both DNE and non-DNE ZFS, the sanity-benchmark, performance-sanity and obdfilter-survey runs each hit either LU-9247 or this issue. Also, searching all sanity-benchmark runs on b2_10 over the past 4 weeks, this issue was hit in 2 of 7 runs and LU-9247 in 3 of 7 runs.

            I will make a patch to test with zfs 0.6.5.9


            It looks like these threads are all stuck in the ZFS transaction commit code, waiting for the txg_sync thread to finish committing the transaction.
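
            One way to check whether the txg_sync thread is actually making progress on the affected server is to capture its stack and the pool's transaction-group history while the hang is happening. A rough sketch, assuming root access and the kstat layout of ZFS 0.7.x (the pool name is a placeholder, and the txgs history is only populated when zfs_txg_history is set to a non-zero value):

            # stack of the ZFS sync thread(s), one per imported pool
            for pid in $(pgrep txg_sync); do echo "== $pid =="; cat /proc/$pid/stack; done

            # per-txg open/quiesce/sync times for the suspected pool
            cat /proc/spl/kstat/zfs/<poolname>/txgs

            # pool I/O, to tell a slow device apart from a commit that never finishes
            zpool iostat -v <poolname> 1 5

            If the txgs history shows a sync that never completes while zpool iostat reports little or no I/O, that points at a stuck commit rather than simple overload.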

            How consistently is this problem reproduced (e.g. 1 in 5 runs, 1 in 10, ...)? If it can be reproduced fairly consistently, then I would recommend trying to reproduce it with ZFS 0.6.5.9 instead of ZFS 0.7.1, since it may very well be a problem in the ZFS code rather than Lustre, or at least a bad interaction with how Lustre uses ZFS.

            To test this, change SPLZFSVER in the lbuild script to "0.6.5.9" and then submit the patch with Test-Parameters: lines that run the test 10 times (or however many runs are needed to reproduce it consistently).
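
            A rough sketch of those two changes - the SPLZFSVER variable comes from the comment above, but the file location, default-string format, and the Test-Parameters keys below are assumptions to double-check against the current tree and the autotest documentation:

            # In the lbuild script (contrib/lbuild/lbuild in recent trees), pin the SPL/ZFS version,
            # keeping whatever release suffix the existing default string uses:
            SPLZFSVER="0.6.5.9"

            # In the commit message of the test-only patch, request repeated ZFS runs of the failing
            # suite; repeating the suite name in testlist is one way to get multiple runs:
            Test-Parameters: fstype=zfs testlist=sanity-benchmark,sanity-benchmark,sanity-benchmark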

            adilger Andreas Dilger added a comment
            sarah Sarah Liu added a comment - - edited

            This failure has been seen in multiple tests.
            sanity-hsm test_24a also shows a similar trace on the MDS:
            https://testing.hpdd.intel.com/test_sets/fdc369ec-9c9e-11e7-ba27-5254006e85c2
            performance-sanity
            https://testing.hpdd.intel.com/test_sets/0083a0c0-9c9f-11e7-ba27-5254006e85c2


            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 4
