Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.6
-
3
-
9223372036854775807
Description
parallel-scale-nfsv4 test_compilebench hangs. The last lines in the suite_log is that compilebench is starting up
== parallel-scale-nfsv4 test compilebench: compilebench ============================================== 05:21:00 (1575696060) OPTIONS: cbench_DIR=/usr/bin cbench_IDIRS=2 cbench_RUNS=2 trevis-42vm1.trevis.whamcloud.com trevis-42vm2 free space = 15345664 KB ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.7562 -i 2 -r 2 --makej using working directory /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.7562, 2 intial dirs 2 runs
The client 1 (vm1) console log has an NFS error
[39354.701601] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.7562 -i 2 -r 2 --makej [39677.612810] nfs: server trevis-42vm4 not responding, still trying
And the OSS (vm3) has stack traces filled with spl/zfs function calls
[39370.234418] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.7562 -i 2 -r 2 --makej [39553.055992] LNet: Service thread pid 6880 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [39553.059178] Pid: 6880, comm: ll_ost00_066 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019 [39553.060958] Call Trace: [39553.061440] [<ffffffffc04972d5>] cv_wait_common+0x125/0x150 [spl] [39553.062685] [<ffffffffc0497315>] __cv_wait+0x15/0x20 [spl] [39553.063896] [<ffffffffc08d217b>] dbuf_read+0x7fb/0x9e0 [zfs] [39553.065039] [<ffffffffc08de090>] dmu_buf_hold_by_dnode+0x50/0x80 [zfs] [39553.066284] [<ffffffffc09516f2>] zap_get_leaf_byblk.isra.11+0x92/0x260 [zfs] [39553.067668] [<ffffffffc09519a8>] zap_deref_leaf+0xe8/0x100 [zfs] [39553.068928] [<ffffffffc09529e2>] fzap_lookup+0x62/0x180 [zfs] [39553.070075] [<ffffffffc09572e1>] zap_lookup_impl+0xd1/0x200 [zfs] [39553.071244] [<ffffffffc0957c5e>] zap_lookup_norm+0x8e/0xd0 [zfs] [39553.072424] [<ffffffffc0957cd3>] zap_lookup+0x33/0x40 [zfs] [39553.073580] [<ffffffffc132ba71>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs] [39553.074820] [<ffffffffc1323969>] osd_object_init+0x389/0xca0 [osd_zfs] [39553.076116] [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass] [39553.077481] [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass] [39553.078846] [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass] [39553.080130] [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd] [39553.081346] [<ffffffffc146fe73>] ofd_lvbo_init+0x363/0x7f0 [ofd] [39553.082503] [<ffffffffc10b929c>] ldlm_handle_enqueue0+0x97c/0x15f0 [ptlrpc] [39553.083876] [<ffffffffc1141b02>] tgt_enqueue+0x62/0x210 [ptlrpc] [39553.085151] [<ffffffffc11482ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [39553.086446] [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [39553.087908] [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] [39553.089184] [<ffffffffa38c50d1>] kthread+0xd1/0xe0 [39553.090104] [<ffffffffa3f8cd37>] ret_from_fork_nospec_end+0x0/0x39 [39553.091311] [<ffffffffffffffff>] 0xffffffffffffffff [39553.092282] LustreError: dumping log to /tmp/lustre-log.1575696244.6880 [39698.591981] LNet: Service thread pid 16643 was inactive for 186.08s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [39698.594909] Pid: 16643, comm: ldlm_cn00_008 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019 [39698.596657] Call Trace: [39698.597134] [<ffffffffa3b90f88>] call_rwsem_down_read_failed+0x18/0x30 [39698.598432] [<ffffffffc0e03cea>] lu_cache_shrink+0x6a/0x2d0 [obdclass] [39698.599673] [<ffffffffa39cf73e>] shrink_slab+0xae/0x340 [39698.600710] [<ffffffffa39d2aca>] do_try_to_free_pages+0x3ca/0x520 [39698.601856] [<ffffffffa39d2d1c>] try_to_free_pages+0xfc/0x180 [39698.602931] [<ffffffffa3f745fd>] __alloc_pages_slowpath+0x457/0x724 [39698.604109] [<ffffffffa39c6b84>] __alloc_pages_nodemask+0x404/0x420 [39698.605236] [<ffffffffa3a14c68>] alloc_pages_current+0x98/0x110 [39698.606337] [<ffffffffa3a1fea5>] new_slab+0x2c5/0x390 [39698.607303] [<ffffffffa3a2170c>] ___slab_alloc+0x3ac/0x4f0 [39698.608334] [<ffffffffa3f75a1e>] __slab_alloc+0x40/0x5c [39698.609398] [<ffffffffa3a25acf>] __kmalloc_node+0xbf/0x2b0 [39698.610442] [<ffffffffc0490238>] spl_kmem_zalloc+0xd8/0x180 [spl] [39698.611575] [<ffffffffc08f28a3>] dnode_hold_impl+0x8a3/0xc40 [zfs] [39698.612753] [<ffffffffc08f2c5b>] dnode_hold+0x1b/0x20 [zfs] [39698.613878] [<ffffffffc08db6b2>] dmu_bonus_hold+0x32/0x1d0 [zfs] [39698.615016] [<ffffffffc1323544>] __osd_obj2dnode+0x34/0xd0 [osd_zfs] [39698.616172] [<ffffffffc13237c1>] osd_object_init+0x1e1/0xca0 [osd_zfs] [39698.617404] [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass] [39698.618738] [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass] [39698.620106] [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass] [39698.621289] [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd] [39698.622443] [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd] [39698.623588] [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc] [39698.624888] [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc] [39698.626197] [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc] [39698.627439] [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [39698.628845] [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] [39698.630074] [<ffffffffa38c50d1>] kthread+0xd1/0xe0 [39698.630968] [<ffffffffa3f8cd37>] ret_from_fork_nospec_end+0x0/0x39 [39698.632127] [<ffffffffffffffff>] 0xffffffffffffffff [39698.633134] LustreError: dumping log to /tmp/lustre-log.1575696390.16643 [39699.006888] Pid: 16648, comm: ldlm_cn00_011 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019 [39699.008967] Call Trace: [39699.009454] [<ffffffffa3b90f88>] call_rwsem_down_read_failed+0x18/0x30 [39699.010685] [<ffffffffc0e03cea>] lu_cache_shrink+0x6a/0x2d0 [obdclass] [39699.012057] [<ffffffffa39cf73e>] shrink_slab+0xae/0x340 [39699.013044] [<ffffffffa39d2aca>] do_try_to_free_pages+0x3ca/0x520 [39699.014236] [<ffffffffa39d2d1c>] try_to_free_pages+0xfc/0x180 [39699.015351] [<ffffffffa3f745fd>] __alloc_pages_slowpath+0x457/0x724 [39699.016567] [<ffffffffa39c6b84>] __alloc_pages_nodemask+0x404/0x420 [39699.017806] [<ffffffffa3a14c68>] alloc_pages_current+0x98/0x110 [39699.019015] [<ffffffffa3a1fea5>] new_slab+0x2c5/0x390 [39699.019958] [<ffffffffa3a2170c>] ___slab_alloc+0x3ac/0x4f0 [39699.021035] [<ffffffffa3f75a1e>] __slab_alloc+0x40/0x5c [39699.022050] [<ffffffffa3a25acf>] __kmalloc_node+0xbf/0x2b0 [39699.023116] [<ffffffffc0490238>] spl_kmem_zalloc+0xd8/0x180 [spl] [39699.024293] [<ffffffffc08f28a3>] dnode_hold_impl+0x8a3/0xc40 [zfs] [39699.025520] [<ffffffffc08f2c5b>] dnode_hold+0x1b/0x20 [zfs] [39699.026613] [<ffffffffc08db6b2>] dmu_bonus_hold+0x32/0x1d0 [zfs] [39699.027792] [<ffffffffc1323544>] __osd_obj2dnode+0x34/0xd0 [osd_zfs] [39699.029096] [<ffffffffc13237c1>] osd_object_init+0x1e1/0xca0 [osd_zfs] [39699.030320] [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass] [39699.031703] [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass] [39699.033058] [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass] [39699.034261] [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd] [39699.035461] [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd] [39699.036665] [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc] [39699.038037] [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc] [39699.039308] [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc] [39699.040636] [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [39699.042176] [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] [39699.043369] [<ffffffffa38c50d1>] kthread+0xd1/0xe0 [39699.044312] [<ffffffffa3f8cd37>] ret_from_fork_nospec_end+0x0/0x39 [39699.045516] [<ffffffffffffffff>] 0xffffffffffffffff [39699.046475] Pid: 17567, comm: ldlm_cn00_039 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019 [39699.048317] Call Trace: [39699.048822] [<ffffffffc04972d5>] cv_wait_common+0x125/0x150 [spl] [39699.050027] [<ffffffffc0497315>] __cv_wait+0x15/0x20 [spl] [39699.051062] [<ffffffffc08d217b>] dbuf_read+0x7fb/0x9e0 [zfs] [39699.052182] [<ffffffffc08de090>] dmu_buf_hold_by_dnode+0x50/0x80 [zfs] [39699.053510] [<ffffffffc09516f2>] zap_get_leaf_byblk.isra.11+0x92/0x260 [zfs] [39699.054882] [<ffffffffc09519a8>] zap_deref_leaf+0xe8/0x100 [zfs] [39699.056121] [<ffffffffc09529e2>] fzap_lookup+0x62/0x180 [zfs] [39699.057226] [<ffffffffc09572e1>] zap_lookup_impl+0xd1/0x200 [zfs] [39699.058441] [<ffffffffc0957c5e>] zap_lookup_norm+0x8e/0xd0 [zfs] [39699.059621] [<ffffffffc0957cd3>] zap_lookup+0x33/0x40 [zfs] [39699.060717] [<ffffffffc132ba71>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs] [39699.062011] [<ffffffffc1323969>] osd_object_init+0x389/0xca0 [osd_zfs] [39699.063231] [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass] [39699.064614] [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass] [39699.065976] [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass] [39699.067174] [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd] [39699.068368] [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd] [39699.069546] [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc] [39699.070850] [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc] [39699.072277] [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc] [39699.073564] [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [39699.074999] [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] [39699.076212] [<ffffffffa38c50d1>] kthread+0xd1/0xe0 [39699.077174] [<ffffffffa3f8cd37>] ret_from_fork_nospec_end+0x0/0x39 [39699.078403] [<ffffffffffffffff>] 0xffffffffffffffff [39699.079386] Pid: 17556, comm: ldlm_cn00_028 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019 [39699.081221] Call Trace: [39699.081718] [<ffffffffa3b90f88>] call_rwsem_down_read_failed+0x18/0x30 [39699.082970] [<ffffffffc0e03cea>] lu_cache_shrink+0x6a/0x2d0 [obdclass] [39699.084190] [<ffffffffa39cf73e>] shrink_slab+0xae/0x340 [39699.085196] [<ffffffffa39d2aca>] do_try_to_free_pages+0x3ca/0x520 [39699.086428] [<ffffffffa39d2d1c>] try_to_free_pages+0xfc/0x180 [39699.087560] [<ffffffffa3f745fd>] __alloc_pages_slowpath+0x457/0x724 [39699.088770] [<ffffffffa39c6b84>] __alloc_pages_nodemask+0x404/0x420 [39699.090045] [<ffffffffa3a14c68>] alloc_pages_current+0x98/0x110 [39699.091160] [<ffffffffa3a1fea5>] new_slab+0x2c5/0x390 [39699.092168] [<ffffffffa3a2170c>] ___slab_alloc+0x3ac/0x4f0 [39699.093222] [<ffffffffa3f75a1e>] __slab_alloc+0x40/0x5c [39699.094293] [<ffffffffa3a25acf>] __kmalloc_node+0xbf/0x2b0 [39699.095339] [<ffffffffc0490238>] spl_kmem_zalloc+0xd8/0x180 [spl] [39699.096521] [<ffffffffc08f28a3>] dnode_hold_impl+0x8a3/0xc40 [zfs] [39699.097734] [<ffffffffc08f2c5b>] dnode_hold+0x1b/0x20 [zfs] [39699.098856] [<ffffffffc08db6b2>] dmu_bonus_hold+0x32/0x1d0 [zfs] [39699.100097] [<ffffffffc1323544>] __osd_obj2dnode+0x34/0xd0 [osd_zfs] [39699.101298] [<ffffffffc13237c1>] osd_object_init+0x1e1/0xca0 [osd_zfs] [39699.102540] [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass] [39699.103920] [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass] [39699.105264] [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass] [39699.106464] [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd] [39699.107634] [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd] [39699.108852] [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc] [39699.110194] [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc] [39699.111454] [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc] [39699.112783] [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [39699.114322] [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] [39699.115512] [<ffffffffa38c50d1>] kthread+0xd1/0xe0 [39699.116452] [<ffffffffa3f8cd37>] ret_from_fork_nospec_end+0x0/0x39 [39699.117628] [<ffffffffffffffff>] 0xffffffffffffffff
and hung processes
[39699.118607] LNet: Service thread pid 16644 was inactive for 186.50s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. [39720.323966] INFO: task kswapd0:35 blocked for more than 120 seconds. [39720.325102] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [39720.326433] kswapd0 D ffff9d67bc3de2a0 0 35 2 0x00000000 [39720.327744] Call Trace: [39720.328231] [<ffffffffc08d3628>] ? dbuf_destroy+0x278/0x3b0 [zfs] [39720.329428] [<ffffffffa3f80149>] schedule_preempt_disabled+0x29/0x70 [39720.330576] [<ffffffffa3f7df57>] __mutex_lock_slowpath+0xc7/0x1d0 [39720.331658] [<ffffffffa3f7d32f>] mutex_lock+0x1f/0x2f [39720.332579] [<ffffffffc08cbe36>] arc_buf_destroy+0x96/0x130 [zfs] [39720.333671] [<ffffffffc08d33e1>] dbuf_destroy+0x31/0x3b0 [zfs] [39720.334743] [<ffffffffc0935166>] ? spa_get_random+0x36/0x60 [zfs] [39720.335846] [<ffffffffc08d3800>] dbuf_evict_one+0xa0/0xc0 [zfs] [39720.336949] [<ffffffffc08d3295>] dbuf_rele_and_unlock+0x4a5/0x4c0 [zfs] [39720.338139] [<ffffffffc08f2e4b>] ? dnode_rele+0x3b/0x40 [zfs] [39720.339139] [<ffffffffa3a224af>] ? __slab_free+0x9f/0x320 [39720.340264] [<ffffffffc08d3d88>] dbuf_rele+0x38/0x40 [zfs] [39720.341295] [<ffffffffc08d5c7e>] dmu_buf_rele+0xe/0x10 [zfs] [39720.342360] [<ffffffffc091df83>] sa_handle_destroy+0x93/0xe0 [zfs] [39720.343499] [<ffffffffc131e247>] osd_object_delete+0x57/0x100 [osd_zfs] [39720.344729] [<ffffffffc0e021d8>] lu_object_free.isra.32+0x68/0x170 [obdclass] [39720.346031] [<ffffffffc0e02fbe>] lu_site_purge_objects+0x2fe/0x530 [obdclass] [39720.347330] [<ffffffffc0e03ed9>] lu_cache_shrink+0x259/0x2d0 [obdclass] [39720.348525] [<ffffffffa39cf805>] shrink_slab+0x175/0x340 [39720.349511] [<ffffffffa3a3e171>] ? vmpressure+0x21/0x90 [39720.350522] [<ffffffffa39d3538>] balance_pgdat+0x3a8/0x5e0 [39720.351548] [<ffffffffa39d38e3>] kswapd+0x173/0x440 [39720.352468] [<ffffffffa38c61c0>] ? wake_up_atomic_t+0x30/0x30 [39720.353505] [<ffffffffa39d3770>] ? balance_pgdat+0x5e0/0x5e0 [39720.354523] [<ffffffffa38c50d1>] kthread+0xd1/0xe0 [39720.355413] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.356526] [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21 [39720.357687] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.358783] INFO: task ldlm_cn00_000:22155 blocked for more than 120 seconds. [39720.360016] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [39720.361467] ldlm_cn00_000 D ffff9d67b43341c0 0 22155 2 0x00000080 [39720.362758] Call Trace: [39720.363216] [<ffffffffa3f7f229>] schedule+0x29/0x70 [39720.364159] [<ffffffffa3f80c15>] rwsem_down_read_failed+0x105/0x1c0 [39720.365318] [<ffffffffa3b90f88>] call_rwsem_down_read_failed+0x18/0x30 [39720.366482] [<ffffffffa3f7e3b0>] down_read+0x20/0x40 [39720.367424] [<ffffffffc0e03cea>] lu_cache_shrink+0x6a/0x2d0 [obdclass] [39720.368614] [<ffffffffa39cf73e>] shrink_slab+0xae/0x340 [39720.369584] [<ffffffffa3a3e171>] ? vmpressure+0x21/0x90 [39720.370556] [<ffffffffa39d2aca>] do_try_to_free_pages+0x3ca/0x520 [39720.371705] [<ffffffffa39d2d1c>] try_to_free_pages+0xfc/0x180 [39720.372764] [<ffffffffa3f745fd>] __alloc_pages_slowpath+0x457/0x724 [39720.373896] [<ffffffffa39c6b84>] __alloc_pages_nodemask+0x404/0x420 [39720.375039] [<ffffffffa3a14c68>] alloc_pages_current+0x98/0x110 [39720.376120] [<ffffffffa3a1fea5>] new_slab+0x2c5/0x390 [39720.377058] [<ffffffffa3a2170c>] ___slab_alloc+0x3ac/0x4f0 [39720.378143] [<ffffffffc0490238>] ? spl_kmem_zalloc+0xd8/0x180 [spl] [39720.379322] [<ffffffffc0491605>] ? spl_kmem_cache_free+0x185/0x1f0 [spl] [39720.380542] [<ffffffffc0490238>] ? spl_kmem_zalloc+0xd8/0x180 [spl] [39720.381685] [<ffffffffa3f75a1e>] __slab_alloc+0x40/0x5c [39720.382704] [<ffffffffa3a25acf>] __kmalloc_node+0xbf/0x2b0 [39720.383715] [<ffffffffc0490238>] spl_kmem_zalloc+0xd8/0x180 [spl] [39720.384880] [<ffffffffc08f28a3>] dnode_hold_impl+0x8a3/0xc40 [zfs] [39720.386020] [<ffffffffc08f2c5b>] dnode_hold+0x1b/0x20 [zfs] [39720.387064] [<ffffffffc08db6b2>] dmu_bonus_hold+0x32/0x1d0 [zfs] [39720.388177] [<ffffffffc1323544>] __osd_obj2dnode+0x34/0xd0 [osd_zfs] [39720.389334] [<ffffffffc13237c1>] osd_object_init+0x1e1/0xca0 [osd_zfs] [39720.390518] [<ffffffffc0de4829>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [39720.391800] [<ffffffffc0e02a2d>] ? lu_object_add+0x2d/0x40 [obdclass] [39720.393007] [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass] [39720.394546] [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass] [39720.395937] [<ffffffffc0de4829>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [39720.397217] [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass] [39720.398380] [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd] [39720.399495] [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd] [39720.400651] [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc] [39720.401901] [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc] [39720.403136] [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc] [39720.404384] [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [39720.405758] [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] [39720.406936] [<ffffffffa38d3efe>] ? finish_task_switch+0x4e/0x1c0 [39720.408018] [<ffffffffa3f7ec88>] ? __schedule+0x448/0x9c0 [39720.409032] [<ffffffffc10f00d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [39720.410340] [<ffffffffa38c50d1>] kthread+0xd1/0xe0 [39720.411291] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.412386] [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21 [39720.413554] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.414651] INFO: task ldlm_cn00_001:22156 blocked for more than 120 seconds. [39720.415914] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [39720.417284] ldlm_cn00_001 D ffff9d67b4d11070 0 22156 2 0x00000080 [39720.418655] Call Trace: [39720.419121] [<ffffffffc04913c9>] ? spl_kmem_cache_alloc+0x99/0x150 [spl] [39720.420338] [<ffffffffa3f80149>] schedule_preempt_disabled+0x29/0x70 [39720.421478] [<ffffffffa3f7df57>] __mutex_lock_slowpath+0xc7/0x1d0 [39720.422568] [<ffffffffa3f7d32f>] mutex_lock+0x1f/0x2f [39720.423511] [<ffffffffc08c47af>] buf_hash_find+0x9f/0x160 [zfs] [39720.424606] [<ffffffffc08c823f>] arc_read+0x16f/0xab0 [zfs] [39720.425647] [<ffffffffc08d32b0>] ? dbuf_rele_and_unlock+0x4c0/0x4c0 [zfs] [39720.426893] [<ffffffffc08d1cc9>] dbuf_read+0x349/0x9e0 [zfs] [39720.427968] [<ffffffffc08d4402>] ? dbuf_hold_impl+0xb2/0xd0 [zfs] [39720.429145] [<ffffffffc08de399>] dmu_spill_hold_by_dnode+0x49/0x120 [zfs] [39720.430379] [<ffffffffc091d719>] ? sa_find_idx_tab+0x69/0x200 [zfs] [39720.431546] [<ffffffffc08de517>] dmu_spill_hold_existing+0xa7/0x100 [zfs] [39720.432799] [<ffffffffc091f6e4>] sa_get_spill.part.10+0x24/0x80 [zfs] [39720.434072] [<ffffffffc09208ec>] sa_attr_op+0x42c/0x460 [zfs] [39720.435141] [<ffffffffc0920e8f>] sa_lookup+0xaf/0x100 [zfs] [39720.436176] [<ffffffffc1321320>] osd_object_init0+0x1e0/0xf80 [osd_zfs] [39720.437372] [<ffffffffc1323a5b>] osd_object_init+0x47b/0xca0 [osd_zfs] [39720.438566] [<ffffffffc0de4829>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [39720.439862] [<ffffffffc0e02a2d>] ? lu_object_add+0x2d/0x40 [obdclass] [39720.441098] [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass] [39720.442401] [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass] [39720.443634] [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass] [39720.444790] [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd] [39720.445905] [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd] [39720.447016] [<ffffffffa3a22912>] ? kmem_cache_free+0x1e2/0x200 [39720.448111] [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc] [39720.449357] [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc] [39720.450582] [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc] [39720.451831] [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [39720.453265] [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] [39720.454389] [<ffffffffa38d3efe>] ? finish_task_switch+0x4e/0x1c0 [39720.455481] [<ffffffffa3f7ec88>] ? __schedule+0x448/0x9c0 [39720.456484] [<ffffffffc10f00d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [39720.457790] [<ffffffffa38c50d1>] kthread+0xd1/0xe0 [39720.458676] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.459766] [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21 [39720.460933] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.462016] INFO: task ldlm_cn00_002:22157 blocked for more than 120 seconds. [39720.463276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [39720.464705] ldlm_cn00_002 D ffff9d67b4d15230 0 22157 2 0x00000080 [39720.466009] Call Trace: [39720.466446] [<ffffffffa3a21569>] ? ___slab_alloc+0x209/0x4f0 [39720.467433] [<ffffffffa3f7f229>] schedule+0x29/0x70 [39720.468398] [<ffffffffc04972d5>] cv_wait_common+0x125/0x150 [spl] [39720.469511] [<ffffffffa38c61c0>] ? wake_up_atomic_t+0x30/0x30 [39720.470554] [<ffffffffc0497315>] __cv_wait+0x15/0x20 [spl] [39720.471575] [<ffffffffc08d217b>] dbuf_read+0x7fb/0x9e0 [zfs] [39720.472629] [<ffffffffc08d4496>] ? dbuf_hold+0x16/0x20 [zfs] [39720.473681] [<ffffffffc08de090>] dmu_buf_hold_by_dnode+0x50/0x80 [zfs] [39720.474936] [<ffffffffc09516f2>] zap_get_leaf_byblk.isra.11+0x92/0x260 [zfs] [39720.476247] [<ffffffffc09519a8>] zap_deref_leaf+0xe8/0x100 [zfs] [39720.477366] [<ffffffffc09529e2>] fzap_lookup+0x62/0x180 [zfs] [39720.478434] [<ffffffffc09572e1>] zap_lookup_impl+0xd1/0x200 [zfs] [39720.479565] [<ffffffffc0957c5e>] zap_lookup_norm+0x8e/0xd0 [zfs] [39720.480669] [<ffffffffc0957cd3>] zap_lookup+0x33/0x40 [zfs] [39720.481700] [<ffffffffc132ba71>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs] [39720.482868] [<ffffffffc1323969>] osd_object_init+0x389/0xca0 [osd_zfs] [39720.484053] [<ffffffffc0de4829>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [39720.485328] [<ffffffffc0e02a2d>] ? lu_object_add+0x2d/0x40 [obdclass] [39720.486555] [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass] [39720.487850] [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass] [39720.489085] [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass] [39720.490245] [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd] [39720.491362] [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd] [39720.492482] [<ffffffffa3a22912>] ? kmem_cache_free+0x1e2/0x200 [39720.493590] [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc] [39720.494838] [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc] [39720.496071] [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc] [39720.497327] [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [39720.498751] [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] [39720.499868] [<ffffffffa38d3efe>] ? finish_task_switch+0x4e/0x1c0 [39720.500963] [<ffffffffa3f7ec88>] ? __schedule+0x448/0x9c0 [39720.501987] [<ffffffffc10f00d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [39720.503319] [<ffffffffa38c50d1>] kthread+0xd1/0xe0 [39720.504202] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.505292] [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21 [39720.506444] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.507555] INFO: task z_wr_int_3:1371 blocked for more than 120 seconds. [39720.508719] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [39720.510179] z_wr_int_3 D ffff9d67bbfde2a0 0 1371 2 0x00000080 [39720.511506] Call Trace: [39720.511964] [<ffffffffa38d20a0>] ? task_rq_unlock+0x20/0x20 [39720.513011] [<ffffffffa3f80149>] schedule_preempt_disabled+0x29/0x70 [39720.514168] [<ffffffffa3f7df57>] __mutex_lock_slowpath+0xc7/0x1d0 [39720.515283] [<ffffffffa3f7d32f>] mutex_lock+0x1f/0x2f [39720.516233] [<ffffffffc08c44bf>] buf_hash_insert+0x7f/0x190 [zfs] [39720.517376] [<ffffffffc08c94f6>] arc_write_done+0x136/0x400 [zfs] [39720.518493] [<ffffffffc0994c4a>] zio_done+0x2da/0xd60 [zfs] [39720.519519] [<ffffffffa3a22df6>] ? kfree+0x106/0x140 [39720.520540] [<ffffffffc098f99f>] zio_execute+0x9f/0x100 [zfs] [39720.521603] [<ffffffffc0492d7c>] taskq_thread+0x2ac/0x4f0 [spl] [39720.522689] [<ffffffffa38da0b0>] ? wake_up_state+0x20/0x20 [39720.523726] [<ffffffffc098f900>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs] [39720.525146] [<ffffffffc0492ad0>] ? taskq_thread_spawn+0x60/0x60 [spl] [39720.526310] [<ffffffffa38c50d1>] kthread+0xd1/0xe0 [39720.527211] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.528286] [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21 [39720.529433] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.530532] INFO: task txg_sync:1424 blocked for more than 120 seconds. [39720.531794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [39720.533156] txg_sync D ffff9d677716a0e0 0 1424 2 0x00000080 [39720.534428] Call Trace: [39720.534927] [<ffffffffc0493017>] ? taskq_dispatch_ent+0x57/0x170 [spl] [39720.536076] [<ffffffffc098f900>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs] [39720.537499] [<ffffffffa3f7f229>] schedule+0x29/0x70 [39720.538387] [<ffffffffa3f7cbb1>] schedule_timeout+0x221/0x2d0 [39720.539461] [<ffffffffc098e91f>] ? zio_taskq_dispatch+0x8f/0xa0 [zfs] [39720.540628] [<ffffffffa386c27e>] ? kvm_clock_get_cycles+0x1e/0x20 [39720.541716] [<ffffffffa3f7e79d>] io_schedule_timeout+0xad/0x130 [39720.542856] [<ffffffffa38c5d46>] ? prepare_to_wait_exclusive+0x56/0x90 [39720.544033] [<ffffffffa3f7e838>] io_schedule+0x18/0x20 [39720.544985] [<ffffffffc0497262>] cv_wait_common+0xb2/0x150 [spl] [39720.546076] [<ffffffffa38c61c0>] ? wake_up_atomic_t+0x30/0x30 [39720.547125] [<ffffffffc0497338>] __cv_wait_io+0x18/0x20 [spl] [39720.548194] [<ffffffffc0992a6b>] zio_wait+0x11b/0x1c0 [zfs] [39720.549234] [<ffffffffc090885f>] dsl_pool_sync+0xbf/0x440 [zfs] [39720.550323] [<ffffffffc0926937>] spa_sync+0x437/0xd90 [zfs] [39720.551364] [<ffffffffc093b351>] txg_sync_thread+0x301/0x510 [zfs] [39720.552502] [<ffffffffc093b050>] ? txg_fini+0x2a0/0x2a0 [zfs] [39720.553588] [<ffffffffc0492063>] thread_generic_wrapper+0x73/0x80 [spl] [39720.554786] [<ffffffffc0491ff0>] ? __thread_exit+0x20/0x20 [spl] [39720.555866] [<ffffffffa38c50d1>] kthread+0xd1/0xe0 [39720.556741] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.557819] [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21 [39720.558970] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.560068] INFO: task txg_sync:3484 blocked for more than 120 seconds. [39720.561204] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [39720.562600] txg_sync D ffff9d679b2f8000 0 3484 2 0x00000080 [39720.563892] Call Trace: [39720.564336] [<ffffffffa3f80149>] schedule_preempt_disabled+0x29/0x70 [39720.565570] [<ffffffffa3f7df57>] __mutex_lock_slowpath+0xc7/0x1d0 [39720.566666] [<ffffffffa3f7d32f>] mutex_lock+0x1f/0x2f [39720.567606] [<ffffffffc08c47af>] buf_hash_find+0x9f/0x160 [zfs] [39720.568694] [<ffffffffc08c823f>] arc_read+0x16f/0xab0 [zfs] [39720.569737] [<ffffffffc08d32b0>] ? dbuf_rele_and_unlock+0x4c0/0x4c0 [zfs] [39720.570989] [<ffffffffc08d1cc9>] dbuf_read+0x349/0x9e0 [zfs] [39720.572030] [<ffffffffc08d40aa>] __dbuf_hold_impl+0x31a/0x5c0 [zfs] [39720.573183] [<ffffffffc08d43f2>] dbuf_hold_impl+0xa2/0xd0 [zfs] [39720.574278] [<ffffffffc08d4455>] dbuf_hold_level+0x35/0x60 [zfs] [39720.575439] [<ffffffffc08d4496>] dbuf_hold+0x16/0x20 [zfs] [39720.576466] [<ffffffffc08dd1fe>] dmu_buf_hold_array_by_dnode+0xee/0x4a0 [zfs] [39720.577762] [<ffffffffc08ddb91>] dmu_write.part.7+0x71/0x110 [zfs] [39720.578898] [<ffffffffc08ddc49>] dmu_write+0x19/0x20 [zfs] [39720.579929] [<ffffffffc09392a9>] space_map_write+0x309/0x530 [zfs] [39720.581063] [<ffffffffc0917364>] metaslab_sync+0x154/0x7e0 [zfs] [39720.582174] [<ffffffffc08dabd7>] ? ddt_sync+0x3d7/0x740 [zfs] [39720.583229] [<ffffffffc090ff1c>] ? dsl_scan_sync+0x4c/0xc40 [zfs] [39720.584334] [<ffffffffa3f7d322>] ? mutex_lock+0x12/0x2f [39720.585310] [<ffffffffc09404a3>] vdev_sync+0x73/0x140 [zfs] [39720.586416] [<ffffffffc09269ab>] spa_sync+0x4ab/0xd90 [zfs] [39720.587459] [<ffffffffc093b351>] txg_sync_thread+0x301/0x510 [zfs] [39720.588599] [<ffffffffc093b050>] ? txg_fini+0x2a0/0x2a0 [zfs] [39720.589634] [<ffffffffc0492063>] thread_generic_wrapper+0x73/0x80 [spl] [39720.590830] [<ffffffffc0491ff0>] ? __thread_exit+0x20/0x20 [spl] [39720.591923] [<ffffffffa38c50d1>] kthread+0xd1/0xe0 [39720.592807] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.593883] [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21 [39720.595045] [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40 [39720.596143] INFO: task ldlm_cn00_003:10024 blocked for more than 120 seconds. [39720.597454] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [39720.598851] ldlm_cn00_003 D ffff9d67b5d8e2a0 0 10024 2 0x00000080 [39720.600146] Call Trace: …
There may be failures for ldiskfs testing, but the stack traces would look very different. So, this is a ticket only for ZFS failures.
The stack traces look similar to what is seen in LU-12957, but not exactly the same.
Starting 05 MAY 2019, we’ve seen this failure five times. Logs for parallel-scale-nfsv4 test compilebench hangs are at
2.12.3.40 - https://testing.whamcloud.com/test_sets/dff91922-18c3-11ea-a9d7-52540065bddc
2.13.50 - https://testing.whamcloud.com/test_sets/6f45b128-040e-11ea-b934-52540065bddc
2.12.58.155 - https://testing.whamcloud.com/test_sets/5cd37e5c-f212-11e9-add9-52540065bddc
2.12.53.13 - https://testing.whamcloud.com/test_sets/6a7ae358-7065-11e9-a6f9-52540065bddc
2.12.53.45 - https://testing.whamcloud.com/test_sets/d4888a58-7c44-11e9-a028-52540065bddc