Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13068

parallel-scale-nfsv4 test compilebench hangs

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.6
    • 3
    • 9223372036854775807

    Description

      parallel-scale-nfsv4 test_compilebench hangs. The last lines in the suite_log is that compilebench is starting up

      == parallel-scale-nfsv4 test compilebench: compilebench ============================================== 05:21:00 (1575696060)
      OPTIONS:
      cbench_DIR=/usr/bin
      cbench_IDIRS=2
      cbench_RUNS=2
      trevis-42vm1.trevis.whamcloud.com
      trevis-42vm2
      free space = 15345664 KB
      ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.7562 -i 2         -r 2 --makej
      using working directory /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.7562, 2 intial dirs 2 runs
      

      The client 1 (vm1) console log has an NFS error

      [39354.701601] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.7562 -i 2 -r 2 --makej
      [39677.612810] nfs: server trevis-42vm4 not responding, still trying
      

      And the OSS (vm3) has stack traces filled with spl/zfs function calls

      [39370.234418] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.7562 -i 2 -r 2 --makej
      [39553.055992] LNet: Service thread pid 6880 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [39553.059178] Pid: 6880, comm: ll_ost00_066 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019
      [39553.060958] Call Trace:
      [39553.061440]  [<ffffffffc04972d5>] cv_wait_common+0x125/0x150 [spl]
      [39553.062685]  [<ffffffffc0497315>] __cv_wait+0x15/0x20 [spl]
      [39553.063896]  [<ffffffffc08d217b>] dbuf_read+0x7fb/0x9e0 [zfs]
      [39553.065039]  [<ffffffffc08de090>] dmu_buf_hold_by_dnode+0x50/0x80 [zfs]
      [39553.066284]  [<ffffffffc09516f2>] zap_get_leaf_byblk.isra.11+0x92/0x260 [zfs]
      [39553.067668]  [<ffffffffc09519a8>] zap_deref_leaf+0xe8/0x100 [zfs]
      [39553.068928]  [<ffffffffc09529e2>] fzap_lookup+0x62/0x180 [zfs]
      [39553.070075]  [<ffffffffc09572e1>] zap_lookup_impl+0xd1/0x200 [zfs]
      [39553.071244]  [<ffffffffc0957c5e>] zap_lookup_norm+0x8e/0xd0 [zfs]
      [39553.072424]  [<ffffffffc0957cd3>] zap_lookup+0x33/0x40 [zfs]
      [39553.073580]  [<ffffffffc132ba71>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs]
      [39553.074820]  [<ffffffffc1323969>] osd_object_init+0x389/0xca0 [osd_zfs]
      [39553.076116]  [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
      [39553.077481]  [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass]
      [39553.078846]  [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass]
      [39553.080130]  [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd]
      [39553.081346]  [<ffffffffc146fe73>] ofd_lvbo_init+0x363/0x7f0 [ofd]
      [39553.082503]  [<ffffffffc10b929c>] ldlm_handle_enqueue0+0x97c/0x15f0 [ptlrpc]
      [39553.083876]  [<ffffffffc1141b02>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [39553.085151]  [<ffffffffc11482ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [39553.086446]  [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [39553.087908]  [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [39553.089184]  [<ffffffffa38c50d1>] kthread+0xd1/0xe0
      [39553.090104]  [<ffffffffa3f8cd37>] ret_from_fork_nospec_end+0x0/0x39
      [39553.091311]  [<ffffffffffffffff>] 0xffffffffffffffff
      [39553.092282] LustreError: dumping log to /tmp/lustre-log.1575696244.6880
      [39698.591981] LNet: Service thread pid 16643 was inactive for 186.08s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [39698.594909] Pid: 16643, comm: ldlm_cn00_008 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019
      [39698.596657] Call Trace:
      [39698.597134]  [<ffffffffa3b90f88>] call_rwsem_down_read_failed+0x18/0x30
      [39698.598432]  [<ffffffffc0e03cea>] lu_cache_shrink+0x6a/0x2d0 [obdclass]
      [39698.599673]  [<ffffffffa39cf73e>] shrink_slab+0xae/0x340
      [39698.600710]  [<ffffffffa39d2aca>] do_try_to_free_pages+0x3ca/0x520
      [39698.601856]  [<ffffffffa39d2d1c>] try_to_free_pages+0xfc/0x180
      [39698.602931]  [<ffffffffa3f745fd>] __alloc_pages_slowpath+0x457/0x724
      [39698.604109]  [<ffffffffa39c6b84>] __alloc_pages_nodemask+0x404/0x420
      [39698.605236]  [<ffffffffa3a14c68>] alloc_pages_current+0x98/0x110
      [39698.606337]  [<ffffffffa3a1fea5>] new_slab+0x2c5/0x390
      [39698.607303]  [<ffffffffa3a2170c>] ___slab_alloc+0x3ac/0x4f0
      [39698.608334]  [<ffffffffa3f75a1e>] __slab_alloc+0x40/0x5c
      [39698.609398]  [<ffffffffa3a25acf>] __kmalloc_node+0xbf/0x2b0
      [39698.610442]  [<ffffffffc0490238>] spl_kmem_zalloc+0xd8/0x180 [spl]
      [39698.611575]  [<ffffffffc08f28a3>] dnode_hold_impl+0x8a3/0xc40 [zfs]
      [39698.612753]  [<ffffffffc08f2c5b>] dnode_hold+0x1b/0x20 [zfs]
      [39698.613878]  [<ffffffffc08db6b2>] dmu_bonus_hold+0x32/0x1d0 [zfs]
      [39698.615016]  [<ffffffffc1323544>] __osd_obj2dnode+0x34/0xd0 [osd_zfs]
      [39698.616172]  [<ffffffffc13237c1>] osd_object_init+0x1e1/0xca0 [osd_zfs]
      [39698.617404]  [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
      [39698.618738]  [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass]
      [39698.620106]  [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass]
      [39698.621289]  [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd]
      [39698.622443]  [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd]
      [39698.623588]  [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc]
      [39698.624888]  [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
      [39698.626197]  [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [39698.627439]  [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [39698.628845]  [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [39698.630074]  [<ffffffffa38c50d1>] kthread+0xd1/0xe0
      [39698.630968]  [<ffffffffa3f8cd37>] ret_from_fork_nospec_end+0x0/0x39
      [39698.632127]  [<ffffffffffffffff>] 0xffffffffffffffff
      [39698.633134] LustreError: dumping log to /tmp/lustre-log.1575696390.16643
      [39699.006888] Pid: 16648, comm: ldlm_cn00_011 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019
      [39699.008967] Call Trace:
      [39699.009454]  [<ffffffffa3b90f88>] call_rwsem_down_read_failed+0x18/0x30
      [39699.010685]  [<ffffffffc0e03cea>] lu_cache_shrink+0x6a/0x2d0 [obdclass]
      [39699.012057]  [<ffffffffa39cf73e>] shrink_slab+0xae/0x340
      [39699.013044]  [<ffffffffa39d2aca>] do_try_to_free_pages+0x3ca/0x520
      [39699.014236]  [<ffffffffa39d2d1c>] try_to_free_pages+0xfc/0x180
      [39699.015351]  [<ffffffffa3f745fd>] __alloc_pages_slowpath+0x457/0x724
      [39699.016567]  [<ffffffffa39c6b84>] __alloc_pages_nodemask+0x404/0x420
      [39699.017806]  [<ffffffffa3a14c68>] alloc_pages_current+0x98/0x110
      [39699.019015]  [<ffffffffa3a1fea5>] new_slab+0x2c5/0x390
      [39699.019958]  [<ffffffffa3a2170c>] ___slab_alloc+0x3ac/0x4f0
      [39699.021035]  [<ffffffffa3f75a1e>] __slab_alloc+0x40/0x5c
      [39699.022050]  [<ffffffffa3a25acf>] __kmalloc_node+0xbf/0x2b0
      [39699.023116]  [<ffffffffc0490238>] spl_kmem_zalloc+0xd8/0x180 [spl]
      [39699.024293]  [<ffffffffc08f28a3>] dnode_hold_impl+0x8a3/0xc40 [zfs]
      [39699.025520]  [<ffffffffc08f2c5b>] dnode_hold+0x1b/0x20 [zfs]
      [39699.026613]  [<ffffffffc08db6b2>] dmu_bonus_hold+0x32/0x1d0 [zfs]
      [39699.027792]  [<ffffffffc1323544>] __osd_obj2dnode+0x34/0xd0 [osd_zfs]
      [39699.029096]  [<ffffffffc13237c1>] osd_object_init+0x1e1/0xca0 [osd_zfs]
      [39699.030320]  [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
      [39699.031703]  [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass]
      [39699.033058]  [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass]
      [39699.034261]  [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd]
      [39699.035461]  [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd]
      [39699.036665]  [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc]
      [39699.038037]  [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
      [39699.039308]  [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [39699.040636]  [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [39699.042176]  [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [39699.043369]  [<ffffffffa38c50d1>] kthread+0xd1/0xe0
      [39699.044312]  [<ffffffffa3f8cd37>] ret_from_fork_nospec_end+0x0/0x39
      [39699.045516]  [<ffffffffffffffff>] 0xffffffffffffffff
      [39699.046475] Pid: 17567, comm: ldlm_cn00_039 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019
      [39699.048317] Call Trace:
      [39699.048822]  [<ffffffffc04972d5>] cv_wait_common+0x125/0x150 [spl]
      [39699.050027]  [<ffffffffc0497315>] __cv_wait+0x15/0x20 [spl]
      [39699.051062]  [<ffffffffc08d217b>] dbuf_read+0x7fb/0x9e0 [zfs]
      [39699.052182]  [<ffffffffc08de090>] dmu_buf_hold_by_dnode+0x50/0x80 [zfs]
      [39699.053510]  [<ffffffffc09516f2>] zap_get_leaf_byblk.isra.11+0x92/0x260 [zfs]
      [39699.054882]  [<ffffffffc09519a8>] zap_deref_leaf+0xe8/0x100 [zfs]
      [39699.056121]  [<ffffffffc09529e2>] fzap_lookup+0x62/0x180 [zfs]
      [39699.057226]  [<ffffffffc09572e1>] zap_lookup_impl+0xd1/0x200 [zfs]
      [39699.058441]  [<ffffffffc0957c5e>] zap_lookup_norm+0x8e/0xd0 [zfs]
      [39699.059621]  [<ffffffffc0957cd3>] zap_lookup+0x33/0x40 [zfs]
      [39699.060717]  [<ffffffffc132ba71>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs]
      [39699.062011]  [<ffffffffc1323969>] osd_object_init+0x389/0xca0 [osd_zfs]
      [39699.063231]  [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
      [39699.064614]  [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass]
      [39699.065976]  [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass]
      [39699.067174]  [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd]
      [39699.068368]  [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd]
      [39699.069546]  [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc]
      [39699.070850]  [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
      [39699.072277]  [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [39699.073564]  [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [39699.074999]  [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [39699.076212]  [<ffffffffa38c50d1>] kthread+0xd1/0xe0
      [39699.077174]  [<ffffffffa3f8cd37>] ret_from_fork_nospec_end+0x0/0x39
      [39699.078403]  [<ffffffffffffffff>] 0xffffffffffffffff
      [39699.079386] Pid: 17556, comm: ldlm_cn00_028 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019
      [39699.081221] Call Trace:
      [39699.081718]  [<ffffffffa3b90f88>] call_rwsem_down_read_failed+0x18/0x30
      [39699.082970]  [<ffffffffc0e03cea>] lu_cache_shrink+0x6a/0x2d0 [obdclass]
      [39699.084190]  [<ffffffffa39cf73e>] shrink_slab+0xae/0x340
      [39699.085196]  [<ffffffffa39d2aca>] do_try_to_free_pages+0x3ca/0x520
      [39699.086428]  [<ffffffffa39d2d1c>] try_to_free_pages+0xfc/0x180
      [39699.087560]  [<ffffffffa3f745fd>] __alloc_pages_slowpath+0x457/0x724
      [39699.088770]  [<ffffffffa39c6b84>] __alloc_pages_nodemask+0x404/0x420
      [39699.090045]  [<ffffffffa3a14c68>] alloc_pages_current+0x98/0x110
      [39699.091160]  [<ffffffffa3a1fea5>] new_slab+0x2c5/0x390
      [39699.092168]  [<ffffffffa3a2170c>] ___slab_alloc+0x3ac/0x4f0
      [39699.093222]  [<ffffffffa3f75a1e>] __slab_alloc+0x40/0x5c
      [39699.094293]  [<ffffffffa3a25acf>] __kmalloc_node+0xbf/0x2b0
      [39699.095339]  [<ffffffffc0490238>] spl_kmem_zalloc+0xd8/0x180 [spl]
      [39699.096521]  [<ffffffffc08f28a3>] dnode_hold_impl+0x8a3/0xc40 [zfs]
      [39699.097734]  [<ffffffffc08f2c5b>] dnode_hold+0x1b/0x20 [zfs]
      [39699.098856]  [<ffffffffc08db6b2>] dmu_bonus_hold+0x32/0x1d0 [zfs]
      [39699.100097]  [<ffffffffc1323544>] __osd_obj2dnode+0x34/0xd0 [osd_zfs]
      [39699.101298]  [<ffffffffc13237c1>] osd_object_init+0x1e1/0xca0 [osd_zfs]
      [39699.102540]  [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
      [39699.103920]  [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass]
      [39699.105264]  [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass]
      [39699.106464]  [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd]
      [39699.107634]  [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd]
      [39699.108852]  [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc]
      [39699.110194]  [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
      [39699.111454]  [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [39699.112783]  [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [39699.114322]  [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [39699.115512]  [<ffffffffa38c50d1>] kthread+0xd1/0xe0
      [39699.116452]  [<ffffffffa3f8cd37>] ret_from_fork_nospec_end+0x0/0x39
      [39699.117628]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      and hung processes

      [39699.118607] LNet: Service thread pid 16644 was inactive for 186.50s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [39720.323966] INFO: task kswapd0:35 blocked for more than 120 seconds.
      [39720.325102] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [39720.326433] kswapd0         D ffff9d67bc3de2a0     0    35      2 0x00000000
      [39720.327744] Call Trace:
      [39720.328231]  [<ffffffffc08d3628>] ? dbuf_destroy+0x278/0x3b0 [zfs]
      [39720.329428]  [<ffffffffa3f80149>] schedule_preempt_disabled+0x29/0x70
      [39720.330576]  [<ffffffffa3f7df57>] __mutex_lock_slowpath+0xc7/0x1d0
      [39720.331658]  [<ffffffffa3f7d32f>] mutex_lock+0x1f/0x2f
      [39720.332579]  [<ffffffffc08cbe36>] arc_buf_destroy+0x96/0x130 [zfs]
      [39720.333671]  [<ffffffffc08d33e1>] dbuf_destroy+0x31/0x3b0 [zfs]
      [39720.334743]  [<ffffffffc0935166>] ? spa_get_random+0x36/0x60 [zfs]
      [39720.335846]  [<ffffffffc08d3800>] dbuf_evict_one+0xa0/0xc0 [zfs]
      [39720.336949]  [<ffffffffc08d3295>] dbuf_rele_and_unlock+0x4a5/0x4c0 [zfs]
      [39720.338139]  [<ffffffffc08f2e4b>] ? dnode_rele+0x3b/0x40 [zfs]
      [39720.339139]  [<ffffffffa3a224af>] ? __slab_free+0x9f/0x320
      [39720.340264]  [<ffffffffc08d3d88>] dbuf_rele+0x38/0x40 [zfs]
      [39720.341295]  [<ffffffffc08d5c7e>] dmu_buf_rele+0xe/0x10 [zfs]
      [39720.342360]  [<ffffffffc091df83>] sa_handle_destroy+0x93/0xe0 [zfs]
      [39720.343499]  [<ffffffffc131e247>] osd_object_delete+0x57/0x100 [osd_zfs]
      [39720.344729]  [<ffffffffc0e021d8>] lu_object_free.isra.32+0x68/0x170 [obdclass]
      [39720.346031]  [<ffffffffc0e02fbe>] lu_site_purge_objects+0x2fe/0x530 [obdclass]
      [39720.347330]  [<ffffffffc0e03ed9>] lu_cache_shrink+0x259/0x2d0 [obdclass]
      [39720.348525]  [<ffffffffa39cf805>] shrink_slab+0x175/0x340
      [39720.349511]  [<ffffffffa3a3e171>] ? vmpressure+0x21/0x90
      [39720.350522]  [<ffffffffa39d3538>] balance_pgdat+0x3a8/0x5e0
      [39720.351548]  [<ffffffffa39d38e3>] kswapd+0x173/0x440
      [39720.352468]  [<ffffffffa38c61c0>] ? wake_up_atomic_t+0x30/0x30
      [39720.353505]  [<ffffffffa39d3770>] ? balance_pgdat+0x5e0/0x5e0
      [39720.354523]  [<ffffffffa38c50d1>] kthread+0xd1/0xe0
      [39720.355413]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.356526]  [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [39720.357687]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.358783] INFO: task ldlm_cn00_000:22155 blocked for more than 120 seconds.
      [39720.360016] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [39720.361467] ldlm_cn00_000   D ffff9d67b43341c0     0 22155      2 0x00000080
      [39720.362758] Call Trace:
      [39720.363216]  [<ffffffffa3f7f229>] schedule+0x29/0x70
      [39720.364159]  [<ffffffffa3f80c15>] rwsem_down_read_failed+0x105/0x1c0
      [39720.365318]  [<ffffffffa3b90f88>] call_rwsem_down_read_failed+0x18/0x30
      [39720.366482]  [<ffffffffa3f7e3b0>] down_read+0x20/0x40
      [39720.367424]  [<ffffffffc0e03cea>] lu_cache_shrink+0x6a/0x2d0 [obdclass]
      [39720.368614]  [<ffffffffa39cf73e>] shrink_slab+0xae/0x340
      [39720.369584]  [<ffffffffa3a3e171>] ? vmpressure+0x21/0x90
      [39720.370556]  [<ffffffffa39d2aca>] do_try_to_free_pages+0x3ca/0x520
      [39720.371705]  [<ffffffffa39d2d1c>] try_to_free_pages+0xfc/0x180
      [39720.372764]  [<ffffffffa3f745fd>] __alloc_pages_slowpath+0x457/0x724
      [39720.373896]  [<ffffffffa39c6b84>] __alloc_pages_nodemask+0x404/0x420
      [39720.375039]  [<ffffffffa3a14c68>] alloc_pages_current+0x98/0x110
      [39720.376120]  [<ffffffffa3a1fea5>] new_slab+0x2c5/0x390
      [39720.377058]  [<ffffffffa3a2170c>] ___slab_alloc+0x3ac/0x4f0
      [39720.378143]  [<ffffffffc0490238>] ? spl_kmem_zalloc+0xd8/0x180 [spl]
      [39720.379322]  [<ffffffffc0491605>] ? spl_kmem_cache_free+0x185/0x1f0 [spl]
      [39720.380542]  [<ffffffffc0490238>] ? spl_kmem_zalloc+0xd8/0x180 [spl]
      [39720.381685]  [<ffffffffa3f75a1e>] __slab_alloc+0x40/0x5c
      [39720.382704]  [<ffffffffa3a25acf>] __kmalloc_node+0xbf/0x2b0
      [39720.383715]  [<ffffffffc0490238>] spl_kmem_zalloc+0xd8/0x180 [spl]
      [39720.384880]  [<ffffffffc08f28a3>] dnode_hold_impl+0x8a3/0xc40 [zfs]
      [39720.386020]  [<ffffffffc08f2c5b>] dnode_hold+0x1b/0x20 [zfs]
      [39720.387064]  [<ffffffffc08db6b2>] dmu_bonus_hold+0x32/0x1d0 [zfs]
      [39720.388177]  [<ffffffffc1323544>] __osd_obj2dnode+0x34/0xd0 [osd_zfs]
      [39720.389334]  [<ffffffffc13237c1>] osd_object_init+0x1e1/0xca0 [osd_zfs]
      [39720.390518]  [<ffffffffc0de4829>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [39720.391800]  [<ffffffffc0e02a2d>] ? lu_object_add+0x2d/0x40 [obdclass]
      [39720.393007]  [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
      [39720.394546]  [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass]
      [39720.395937]  [<ffffffffc0de4829>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [39720.397217]  [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass]
      [39720.398380]  [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd]
      [39720.399495]  [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd]
      [39720.400651]  [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc]
      [39720.401901]  [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
      [39720.403136]  [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [39720.404384]  [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [39720.405758]  [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [39720.406936]  [<ffffffffa38d3efe>] ? finish_task_switch+0x4e/0x1c0
      [39720.408018]  [<ffffffffa3f7ec88>] ? __schedule+0x448/0x9c0
      [39720.409032]  [<ffffffffc10f00d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [39720.410340]  [<ffffffffa38c50d1>] kthread+0xd1/0xe0
      [39720.411291]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.412386]  [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [39720.413554]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.414651] INFO: task ldlm_cn00_001:22156 blocked for more than 120 seconds.
      [39720.415914] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [39720.417284] ldlm_cn00_001   D ffff9d67b4d11070     0 22156      2 0x00000080
      [39720.418655] Call Trace:
      [39720.419121]  [<ffffffffc04913c9>] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
      [39720.420338]  [<ffffffffa3f80149>] schedule_preempt_disabled+0x29/0x70
      [39720.421478]  [<ffffffffa3f7df57>] __mutex_lock_slowpath+0xc7/0x1d0
      [39720.422568]  [<ffffffffa3f7d32f>] mutex_lock+0x1f/0x2f
      [39720.423511]  [<ffffffffc08c47af>] buf_hash_find+0x9f/0x160 [zfs]
      [39720.424606]  [<ffffffffc08c823f>] arc_read+0x16f/0xab0 [zfs]
      [39720.425647]  [<ffffffffc08d32b0>] ? dbuf_rele_and_unlock+0x4c0/0x4c0 [zfs]
      [39720.426893]  [<ffffffffc08d1cc9>] dbuf_read+0x349/0x9e0 [zfs]
      [39720.427968]  [<ffffffffc08d4402>] ? dbuf_hold_impl+0xb2/0xd0 [zfs]
      [39720.429145]  [<ffffffffc08de399>] dmu_spill_hold_by_dnode+0x49/0x120 [zfs]
      [39720.430379]  [<ffffffffc091d719>] ? sa_find_idx_tab+0x69/0x200 [zfs]
      [39720.431546]  [<ffffffffc08de517>] dmu_spill_hold_existing+0xa7/0x100 [zfs]
      [39720.432799]  [<ffffffffc091f6e4>] sa_get_spill.part.10+0x24/0x80 [zfs]
      [39720.434072]  [<ffffffffc09208ec>] sa_attr_op+0x42c/0x460 [zfs]
      [39720.435141]  [<ffffffffc0920e8f>] sa_lookup+0xaf/0x100 [zfs]
      [39720.436176]  [<ffffffffc1321320>] osd_object_init0+0x1e0/0xf80 [osd_zfs]
      [39720.437372]  [<ffffffffc1323a5b>] osd_object_init+0x47b/0xca0 [osd_zfs]
      [39720.438566]  [<ffffffffc0de4829>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [39720.439862]  [<ffffffffc0e02a2d>] ? lu_object_add+0x2d/0x40 [obdclass]
      [39720.441098]  [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
      [39720.442401]  [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass]
      [39720.443634]  [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass]
      [39720.444790]  [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd]
      [39720.445905]  [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd]
      [39720.447016]  [<ffffffffa3a22912>] ? kmem_cache_free+0x1e2/0x200
      [39720.448111]  [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc]
      [39720.449357]  [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
      [39720.450582]  [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [39720.451831]  [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [39720.453265]  [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [39720.454389]  [<ffffffffa38d3efe>] ? finish_task_switch+0x4e/0x1c0
      [39720.455481]  [<ffffffffa3f7ec88>] ? __schedule+0x448/0x9c0
      [39720.456484]  [<ffffffffc10f00d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [39720.457790]  [<ffffffffa38c50d1>] kthread+0xd1/0xe0
      [39720.458676]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.459766]  [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [39720.460933]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.462016] INFO: task ldlm_cn00_002:22157 blocked for more than 120 seconds.
      [39720.463276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [39720.464705] ldlm_cn00_002   D ffff9d67b4d15230     0 22157      2 0x00000080
      [39720.466009] Call Trace:
      [39720.466446]  [<ffffffffa3a21569>] ? ___slab_alloc+0x209/0x4f0
      [39720.467433]  [<ffffffffa3f7f229>] schedule+0x29/0x70
      [39720.468398]  [<ffffffffc04972d5>] cv_wait_common+0x125/0x150 [spl]
      [39720.469511]  [<ffffffffa38c61c0>] ? wake_up_atomic_t+0x30/0x30
      [39720.470554]  [<ffffffffc0497315>] __cv_wait+0x15/0x20 [spl]
      [39720.471575]  [<ffffffffc08d217b>] dbuf_read+0x7fb/0x9e0 [zfs]
      [39720.472629]  [<ffffffffc08d4496>] ? dbuf_hold+0x16/0x20 [zfs]
      [39720.473681]  [<ffffffffc08de090>] dmu_buf_hold_by_dnode+0x50/0x80 [zfs]
      [39720.474936]  [<ffffffffc09516f2>] zap_get_leaf_byblk.isra.11+0x92/0x260 [zfs]
      [39720.476247]  [<ffffffffc09519a8>] zap_deref_leaf+0xe8/0x100 [zfs]
      [39720.477366]  [<ffffffffc09529e2>] fzap_lookup+0x62/0x180 [zfs]
      [39720.478434]  [<ffffffffc09572e1>] zap_lookup_impl+0xd1/0x200 [zfs]
      [39720.479565]  [<ffffffffc0957c5e>] zap_lookup_norm+0x8e/0xd0 [zfs]
      [39720.480669]  [<ffffffffc0957cd3>] zap_lookup+0x33/0x40 [zfs]
      [39720.481700]  [<ffffffffc132ba71>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs]
      [39720.482868]  [<ffffffffc1323969>] osd_object_init+0x389/0xca0 [osd_zfs]
      [39720.484053]  [<ffffffffc0de4829>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [39720.485328]  [<ffffffffc0e02a2d>] ? lu_object_add+0x2d/0x40 [obdclass]
      [39720.486555]  [<ffffffffc0e0236b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
      [39720.487850]  [<ffffffffc0e06251>] lu_object_find_at+0x1e1/0xa60 [obdclass]
      [39720.489085]  [<ffffffffc0e06ae6>] lu_object_find+0x16/0x20 [obdclass]
      [39720.490245]  [<ffffffffc1461185>] ofd_object_find+0x35/0x100 [ofd]
      [39720.491362]  [<ffffffffc146f297>] ofd_lvbo_update+0x577/0xdf0 [ofd]
      [39720.492482]  [<ffffffffa3a22912>] ? kmem_cache_free+0x1e2/0x200
      [39720.493590]  [<ffffffffc10b6324>] ldlm_request_cancel+0x404/0x780 [ptlrpc]
      [39720.494838]  [<ffffffffc10ba612>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
      [39720.496071]  [<ffffffffc10ba7e8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [39720.497327]  [<ffffffffc10ed29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [39720.498751]  [<ffffffffc10f0bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [39720.499868]  [<ffffffffa38d3efe>] ? finish_task_switch+0x4e/0x1c0
      [39720.500963]  [<ffffffffa3f7ec88>] ? __schedule+0x448/0x9c0
      [39720.501987]  [<ffffffffc10f00d0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [39720.503319]  [<ffffffffa38c50d1>] kthread+0xd1/0xe0
      [39720.504202]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.505292]  [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [39720.506444]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.507555] INFO: task z_wr_int_3:1371 blocked for more than 120 seconds.
      [39720.508719] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [39720.510179] z_wr_int_3      D ffff9d67bbfde2a0     0  1371      2 0x00000080
      [39720.511506] Call Trace:
      [39720.511964]  [<ffffffffa38d20a0>] ? task_rq_unlock+0x20/0x20
      [39720.513011]  [<ffffffffa3f80149>] schedule_preempt_disabled+0x29/0x70
      [39720.514168]  [<ffffffffa3f7df57>] __mutex_lock_slowpath+0xc7/0x1d0
      [39720.515283]  [<ffffffffa3f7d32f>] mutex_lock+0x1f/0x2f
      [39720.516233]  [<ffffffffc08c44bf>] buf_hash_insert+0x7f/0x190 [zfs]
      [39720.517376]  [<ffffffffc08c94f6>] arc_write_done+0x136/0x400 [zfs]
      [39720.518493]  [<ffffffffc0994c4a>] zio_done+0x2da/0xd60 [zfs]
      [39720.519519]  [<ffffffffa3a22df6>] ? kfree+0x106/0x140
      [39720.520540]  [<ffffffffc098f99f>] zio_execute+0x9f/0x100 [zfs]
      [39720.521603]  [<ffffffffc0492d7c>] taskq_thread+0x2ac/0x4f0 [spl]
      [39720.522689]  [<ffffffffa38da0b0>] ? wake_up_state+0x20/0x20
      [39720.523726]  [<ffffffffc098f900>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs]
      [39720.525146]  [<ffffffffc0492ad0>] ? taskq_thread_spawn+0x60/0x60 [spl]
      [39720.526310]  [<ffffffffa38c50d1>] kthread+0xd1/0xe0
      [39720.527211]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.528286]  [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [39720.529433]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.530532] INFO: task txg_sync:1424 blocked for more than 120 seconds.
      [39720.531794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [39720.533156] txg_sync        D ffff9d677716a0e0     0  1424      2 0x00000080
      [39720.534428] Call Trace:
      [39720.534927]  [<ffffffffc0493017>] ? taskq_dispatch_ent+0x57/0x170 [spl]
      [39720.536076]  [<ffffffffc098f900>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs]
      [39720.537499]  [<ffffffffa3f7f229>] schedule+0x29/0x70
      [39720.538387]  [<ffffffffa3f7cbb1>] schedule_timeout+0x221/0x2d0
      [39720.539461]  [<ffffffffc098e91f>] ? zio_taskq_dispatch+0x8f/0xa0 [zfs]
      [39720.540628]  [<ffffffffa386c27e>] ? kvm_clock_get_cycles+0x1e/0x20
      [39720.541716]  [<ffffffffa3f7e79d>] io_schedule_timeout+0xad/0x130
      [39720.542856]  [<ffffffffa38c5d46>] ? prepare_to_wait_exclusive+0x56/0x90
      [39720.544033]  [<ffffffffa3f7e838>] io_schedule+0x18/0x20
      [39720.544985]  [<ffffffffc0497262>] cv_wait_common+0xb2/0x150 [spl]
      [39720.546076]  [<ffffffffa38c61c0>] ? wake_up_atomic_t+0x30/0x30
      [39720.547125]  [<ffffffffc0497338>] __cv_wait_io+0x18/0x20 [spl]
      [39720.548194]  [<ffffffffc0992a6b>] zio_wait+0x11b/0x1c0 [zfs]
      [39720.549234]  [<ffffffffc090885f>] dsl_pool_sync+0xbf/0x440 [zfs]
      [39720.550323]  [<ffffffffc0926937>] spa_sync+0x437/0xd90 [zfs]
      [39720.551364]  [<ffffffffc093b351>] txg_sync_thread+0x301/0x510 [zfs]
      [39720.552502]  [<ffffffffc093b050>] ? txg_fini+0x2a0/0x2a0 [zfs]
      [39720.553588]  [<ffffffffc0492063>] thread_generic_wrapper+0x73/0x80 [spl]
      [39720.554786]  [<ffffffffc0491ff0>] ? __thread_exit+0x20/0x20 [spl]
      [39720.555866]  [<ffffffffa38c50d1>] kthread+0xd1/0xe0
      [39720.556741]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.557819]  [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [39720.558970]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.560068] INFO: task txg_sync:3484 blocked for more than 120 seconds.
      [39720.561204] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [39720.562600] txg_sync        D ffff9d679b2f8000     0  3484      2 0x00000080
      [39720.563892] Call Trace:
      [39720.564336]  [<ffffffffa3f80149>] schedule_preempt_disabled+0x29/0x70
      [39720.565570]  [<ffffffffa3f7df57>] __mutex_lock_slowpath+0xc7/0x1d0
      [39720.566666]  [<ffffffffa3f7d32f>] mutex_lock+0x1f/0x2f
      [39720.567606]  [<ffffffffc08c47af>] buf_hash_find+0x9f/0x160 [zfs]
      [39720.568694]  [<ffffffffc08c823f>] arc_read+0x16f/0xab0 [zfs]
      [39720.569737]  [<ffffffffc08d32b0>] ? dbuf_rele_and_unlock+0x4c0/0x4c0 [zfs]
      [39720.570989]  [<ffffffffc08d1cc9>] dbuf_read+0x349/0x9e0 [zfs]
      [39720.572030]  [<ffffffffc08d40aa>] __dbuf_hold_impl+0x31a/0x5c0 [zfs]
      [39720.573183]  [<ffffffffc08d43f2>] dbuf_hold_impl+0xa2/0xd0 [zfs]
      [39720.574278]  [<ffffffffc08d4455>] dbuf_hold_level+0x35/0x60 [zfs]
      [39720.575439]  [<ffffffffc08d4496>] dbuf_hold+0x16/0x20 [zfs]
      [39720.576466]  [<ffffffffc08dd1fe>] dmu_buf_hold_array_by_dnode+0xee/0x4a0 [zfs]
      [39720.577762]  [<ffffffffc08ddb91>] dmu_write.part.7+0x71/0x110 [zfs]
      [39720.578898]  [<ffffffffc08ddc49>] dmu_write+0x19/0x20 [zfs]
      [39720.579929]  [<ffffffffc09392a9>] space_map_write+0x309/0x530 [zfs]
      [39720.581063]  [<ffffffffc0917364>] metaslab_sync+0x154/0x7e0 [zfs]
      [39720.582174]  [<ffffffffc08dabd7>] ? ddt_sync+0x3d7/0x740 [zfs]
      [39720.583229]  [<ffffffffc090ff1c>] ? dsl_scan_sync+0x4c/0xc40 [zfs]
      [39720.584334]  [<ffffffffa3f7d322>] ? mutex_lock+0x12/0x2f
      [39720.585310]  [<ffffffffc09404a3>] vdev_sync+0x73/0x140 [zfs]
      [39720.586416]  [<ffffffffc09269ab>] spa_sync+0x4ab/0xd90 [zfs]
      [39720.587459]  [<ffffffffc093b351>] txg_sync_thread+0x301/0x510 [zfs]
      [39720.588599]  [<ffffffffc093b050>] ? txg_fini+0x2a0/0x2a0 [zfs]
      [39720.589634]  [<ffffffffc0492063>] thread_generic_wrapper+0x73/0x80 [spl]
      [39720.590830]  [<ffffffffc0491ff0>] ? __thread_exit+0x20/0x20 [spl]
      [39720.591923]  [<ffffffffa38c50d1>] kthread+0xd1/0xe0
      [39720.592807]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.593883]  [<ffffffffa3f8cd37>] ret_from_fork_nospec_begin+0x21/0x21
      [39720.595045]  [<ffffffffa38c5000>] ? insert_kthread_work+0x40/0x40
      [39720.596143] INFO: task ldlm_cn00_003:10024 blocked for more than 120 seconds.
      [39720.597454] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [39720.598851] ldlm_cn00_003   D ffff9d67b5d8e2a0     0 10024      2 0x00000080
      [39720.600146] Call Trace:
      …
      

      There may be failures for ldiskfs testing, but the stack traces would look very different. So, this is a ticket only for ZFS failures.

      The stack traces look similar to what is seen in LU-12957, but not exactly the same.

      Starting 05 MAY 2019, we’ve seen this failure five times. Logs for parallel-scale-nfsv4 test compilebench hangs are at
      2.12.3.40 - https://testing.whamcloud.com/test_sets/dff91922-18c3-11ea-a9d7-52540065bddc
      2.13.50 - https://testing.whamcloud.com/test_sets/6f45b128-040e-11ea-b934-52540065bddc
      2.12.58.155 - https://testing.whamcloud.com/test_sets/5cd37e5c-f212-11e9-add9-52540065bddc
      2.12.53.13 - https://testing.whamcloud.com/test_sets/6a7ae358-7065-11e9-a6f9-52540065bddc
      2.12.53.45 - https://testing.whamcloud.com/test_sets/d4888a58-7c44-11e9-a028-52540065bddc

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated: