[LU-10572] Hang in conf-sanity test 69 on zfs Created: 28/Jan/18  Updated: 18/Jun/21  Resolved: 18/Jun/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10250 replay-single test_74: hang and time... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I recently added hang detection to my conf-sanity test runs and immediately hit a problem in conf-sanity test 69 with ZFS.
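(For context, the hang detection here is essentially a watchdog around the test run. A minimal sketch of such a wrapper is shown below; it assumes the standard ONLY= and FSTYPE= conventions of the lustre/tests scripts plus coreutils timeout, and is not the exact setup used for this report.)

# Run conf-sanity test 69 against ZFS-backed targets under a watchdog.
# The 2-hour limit and the install path are illustrative example values.
cd /usr/lib64/lustre/tests
FSTYPE=zfs ONLY=69 timeout --signal=KILL 7200 bash ./conf-sanity.sh ||
    echo "conf-sanity test 69 did not finish in time (possible hang)"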

Console output looks like this:

[ 8960.194121] Lustre: DEBUG MARKER: == conf-sanity test 69: replace an OST with the same index =========================================== 13:23:05 (1517163785)
[ 8961.087061] Lustre: Lustre: Build Version: 2.10.57_58_gb6ec609
[ 8961.516996] LNet: Added LNI 192.168.123.104@tcp [8/512/0/180] 
[ 8961.518530] LNet: Accept all, port 988
[ 8963.178685] Lustre: 7186:0:(gss_svc_upcall.c:1185:gss_init_svc_upcall()) Init channel is not opened by lsvcgssd, following request might be dropped until lsvcgssd is active
[ 8963.181215] Lustre: 7186:0:(gss_mech_switch.c:71:lgss_mech_register()) Register gssnull mechanism
[ 8963.182155] Key type lgssc registered
[ 8964.538931] Lustre: Echo OBD driver; http://www.lustre.org/
[ 8970.072662] Lustre: MGS: Connection restored to MGC192.168.123.104@tcp_0 (at 0@lo)
[ 8971.975498] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[ 8975.966431] Lustre: lustre-MDT0000: Connection restored to 192.168.123.104@tcp (at 0@lo)
[ 8976.096272] Lustre: lustre-OST0000: Imperative Recovery not enabled, recovery window 60-180
[ 8977.021749] Lustre: lustre-OST0000: Connection restored to 192.168.123.104@tcp (at 0@lo)
[ 8977.023981] Lustre: Skipped 1 previous similar message
[ 8977.042514] Lustre: lustre-OST0000: deleting orphan objects from 0x0:3 to 0x0:33
[ 8977.556824] Lustre: Mounted lustre-client
[ 8979.082258] Lustre: DEBUG MARKER: On OST0, 2176 inodes available. Want 99936.
[ 9329.642647] LNet: Service thread pid 18129 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 9329.645105] Pid: 18129, comm: mdt03_003
[ 9329.645761] 
Call Trace:
[ 9329.648546]  [<ffffffff81704319>] schedule+0x29/0x70
[ 9329.649263]  [<ffffffff81700952>] schedule_timeout+0x162/0x2a0
[ 9329.650032]  [<ffffffffa0227ea5>] ? cfs_block_sigsinv+0x45/0xa0 [libcfs]
[ 9329.651617]  [<ffffffff810879d0>] ? process_timeout+0x0/0x10
[ 9329.652397]  [<ffffffffa0227ed1>] ? cfs_block_sigsinv+0x71/0xa0 [libcfs]
[ 9329.653319]  [<ffffffffa13be2c8>] osp_precreate_reserve+0x318/0x830 [osp]
[ 9329.654066]  [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
[ 9329.654770]  [<ffffffffa13b3b53>] osp_declare_create+0x193/0x590 [osp]
[ 9329.655491]  [<ffffffffa1367ab2>] lod_sub_declare_create+0xe2/0x210 [lod]
[ 9329.656237]  [<ffffffffa1360dce>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
[ 9329.668533]  [<ffffffffa136688d>] lod_qos_prep_create+0x10dd/0x1800 [lod]
[ 9329.669452]  [<ffffffffa13674e5>] lod_prepare_create+0x255/0x350 [lod]
[ 9329.670180]  [<ffffffffa13588de>] lod_declare_striped_create+0x1ee/0x980 [lod]
[ 9329.671284]  [<ffffffffa1367ab2>] ? lod_sub_declare_create+0xe2/0x210 [lod]
[ 9329.672301]  [<ffffffffa135d074>] lod_declare_create+0x204/0x5a0 [lod]
[ 9329.672874]  [<ffffffffa0232f47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 9329.673435]  [<ffffffffa12300ff>] mdd_declare_create_object_internal+0xdf/0x2f0 [mdd]
[ 9329.674527]  [<ffffffffa1220953>] mdd_declare_create+0x53/0xe30 [mdd]
[ 9329.675323]  [<ffffffffa022f734>] ? libcfs_log_return+0x24/0x30 [libcfs]
[ 9329.676039]  [<ffffffffa1224c69>] mdd_create+0x859/0x1410 [mdd]
[ 9329.676746]  [<ffffffffa12a75f4>] mdt_reint_open+0x2144/0x3190 [mdt]
[ 9329.677500]  [<ffffffffa129b900>] mdt_reint_rec+0x80/0x210 [mdt]
[ 9329.679262]  [<ffffffffa127b5cb>] mdt_reint_internal+0x5fb/0x990 [mdt]
[ 9329.681088]  [<ffffffffa1287927>] mdt_intent_reint+0x157/0x420 [mdt]
[ 9329.682780]  [<ffffffffa127e6a2>] mdt_intent_opc+0x442/0xad0 [mdt]
[ 9329.684883]  [<ffffffffa05e2490>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
[ 9329.686409]  [<ffffffffa1286116>] mdt_intent_policy+0x1a6/0x360 [mdt]
[ 9329.688190]  [<ffffffffa059209a>] ldlm_lock_enqueue+0x35a/0x8d0 [ptlrpc]
[ 9329.693351]  [<ffffffffa05ba543>] ldlm_handle_enqueue0+0x913/0x13f0 [ptlrpc]
[ 9329.694593]  [<ffffffffa0232f47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 9329.696037]  [<ffffffffa063ec32>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 9329.697482]  [<ffffffffa0646e5b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[ 9329.698628]  [<ffffffffa05ebc71>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[ 9329.700755]  [<ffffffffa05efa28>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[ 9329.702545]  [<ffffffffa05eefd0>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[ 9329.704139]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[ 9329.705436]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[ 9329.706325]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[ 9329.707374]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[ 9329.708383] 
[ 9329.709145] LustreError: dumping log to /tmp/lustre-log.1517164156.18129
[ 9331.070281] LNet: Service thread pid 18129 completed after 41.49s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[10031.724772] LNet: Service thread pid 29224 was inactive for 40.02s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[10031.727847] Pid: 29224, comm: ll_ost00_007
[10031.736752] 
Call Trace:
[10031.738429]  [<ffffffff81704319>] schedule+0x29/0x70
[10031.739283]  [<ffffffffa09a4fad>] cv_wait_common+0x17d/0x240 [spl]
[10031.740453]  [<ffffffff810a4070>] ? autoremove_wake_function+0x0/0x40
[10031.740960]  [<ffffffffa09a5085>] __cv_wait+0x15/0x20 [spl]
[10031.741456]  [<ffffffffa0b5d293>] txg_wait_open+0xe3/0x1a0 [zfs]
[10031.741957]  [<ffffffffa0b065d2>] dmu_tx_wait+0x5c2/0x5d0 [zfs]
[10031.742442]  [<ffffffffa0b06675>] dmu_tx_assign+0x95/0x5a0 [zfs]
[10031.742969]  [<ffffffffa09bad87>] osd_trans_start+0xa7/0x3c0 [osd_zfs]
[10031.743500]  [<ffffffffa1415596>] ofd_precreate_objects+0xa46/0x1aa0 [ofd]
[10031.743995]  [<ffffffffa1408cda>] ofd_create_hdl+0x47a/0x20b0 [ofd]
[10031.744519]  [<ffffffffa022f734>] ? libcfs_log_return+0x24/0x30 [libcfs]
[10031.808155]  [<ffffffffa05e07a0>] ? lustre_pack_reply_v2+0x1a0/0x2a0 [ptlrpc]
[10031.809367]  [<ffffffffa0646e5b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[10031.810419]  [<ffffffffa05ebc71>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[10031.811779]  [<ffffffffa05efa28>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[10031.812493]  [<ffffffffa05eefd0>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[10031.813192]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[10031.813831]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[10031.814592]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[10031.815264]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[10031.815936] 
[10031.816501] LustreError: dumping log to /tmp/lustre-log.1517164858.29224
[10043.338621] LNet: Service thread pid 29224 completed after 51.63s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[10281.450667] LNet: Service thread pid 8784 was inactive for 40.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[10281.453783] Pid: 8784, comm: ll_ost06_001
[10281.455676] 
Call Trace:
[10281.456876]  [<ffffffff81704319>] schedule+0x29/0x70
[10281.470252]  [<ffffffffa09a4fad>] cv_wait_common+0x17d/0x240 [spl]
[10281.471091]  [<ffffffff810a4070>] ? autoremove_wake_function+0x0/0x40
[10281.471942]  [<ffffffffa09a5085>] __cv_wait+0x15/0x20 [spl]
[10281.472837]  [<ffffffffa0b5d293>] txg_wait_open+0xe3/0x1a0 [zfs]
[10281.473563]  [<ffffffffa0b065d2>] dmu_tx_wait+0x5c2/0x5d0 [zfs]
[10281.483714]  [<ffffffffa0b06675>] dmu_tx_assign+0x95/0x5a0 [zfs]
[10281.484251]  [<ffffffffa09bad87>] osd_trans_start+0xa7/0x3c0 [osd_zfs]
[10281.485140]  [<ffffffffa1415596>] ofd_precreate_objects+0xa46/0x1aa0 [ofd]
[10281.485926]  [<ffffffffa1408cda>] ofd_create_hdl+0x47a/0x20b0 [ofd]
[10281.486792]  [<ffffffffa022f734>] ? libcfs_log_return+0x24/0x30 [libcfs]
[10281.496076]  [<ffffffffa05e07a0>] ? lustre_pack_reply_v2+0x1a0/0x2a0 [ptlrpc]
[10281.497057]  [<ffffffffa0646e5b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[10281.497794]  [<ffffffffa05ebc71>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[10281.499265]  [<ffffffffa05efa28>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[10281.512041]  [<ffffffffa05eefd0>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[10281.512751]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[10281.513402]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[10281.514076]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[10281.514786]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[10281.515472] 
[10281.516186] LustreError: dumping log to /tmp/lustre-log.1517165107.8784
[10293.208563] LNet: Service thread pid 8784 completed after 51.86s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[10656.875336] LNet: Service thread pid 687 was inactive for 40.11s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[10656.878646] Pid: 687, comm: ll_ost01_004
[10656.879276] 
Call Trace:
[10656.880437]  [<ffffffff81704319>] schedule+0x29/0x70
[10656.883153]  [<ffffffffa09a4fad>] cv_wait_common+0x17d/0x240 [spl]
[10656.883758]  [<ffffffff810a4070>] ? autoremove_wake_function+0x0/0x40
[10656.884387]  [<ffffffffa09a5085>] __cv_wait+0x15/0x20 [spl]
[10656.884938]  [<ffffffffa0b5d293>] txg_wait_open+0xe3/0x1a0 [zfs]
[10656.885448]  [<ffffffffa0b065d2>] dmu_tx_wait+0x5c2/0x5d0 [zfs]
[10656.885962]  [<ffffffffa0b06675>] dmu_tx_assign+0x95/0x5a0 [zfs]
[10656.886435]  [<ffffffffa09bad87>] osd_trans_start+0xa7/0x3c0 [osd_zfs]
[10656.886957]  [<ffffffffa1415596>] ofd_precreate_objects+0xa46/0x1aa0 [ofd]
[10656.887439]  [<ffffffffa1408cda>] ofd_create_hdl+0x47a/0x20b0 [ofd]
[10656.888005]  [<ffffffffa022f734>] ? libcfs_log_return+0x24/0x30 [libcfs]
[10656.888573]  [<ffffffffa05e07a0>] ? lustre_pack_reply_v2+0x1a0/0x2a0 [ptlrpc]
[10656.889143]  [<ffffffffa0646e5b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[10656.910113]  [<ffffffffa05ebc71>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[10656.911377]  [<ffffffffa05efa28>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[10656.912038]  [<ffffffffa05eefd0>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[10656.912543]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[10656.913030]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[10656.913633]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[10656.914298]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[10656.914770] 
[10656.915235] LustreError: dumping log to /tmp/lustre-log.1517165483.687
[10664.977508] LNet: Service thread pid 687 completed after 48.21s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[10817.394032] LNet: Service thread pid 3812 was inactive for 40.05s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[10817.396885] Pid: 3812, comm: ll_ost05_007
[10817.397532] 
Call Trace:
[10817.407582]  [<ffffffff81704319>] schedule+0x29/0x70
[10817.414462]  [<ffffffffa09a4fad>] cv_wait_common+0x17d/0x240 [spl]
[10817.415219]  [<ffffffff810a4070>] ? autoremove_wake_function+0x0/0x40
[10817.415935]  [<ffffffffa09a5085>] __cv_wait+0x15/0x20 [spl]
[10817.416663]  [<ffffffffa0b5d293>] txg_wait_open+0xe3/0x1a0 [zfs]
[10817.417376]  [<ffffffffa0b065d2>] dmu_tx_wait+0x5c2/0x5d0 [zfs]
[10817.434514]  [<ffffffffa0b06675>] dmu_tx_assign+0x95/0x5a0 [zfs]
[10817.435237]  [<ffffffffa09bad87>] osd_trans_start+0xa7/0x3c0 [osd_zfs]
[10817.435958]  [<ffffffffa1415596>] ofd_precreate_objects+0xa46/0x1aa0 [ofd]
[10817.436663]  [<ffffffffa1408cda>] ofd_create_hdl+0x47a/0x20b0 [ofd]
[10817.437367]  [<ffffffffa022f734>] ? libcfs_log_return+0x24/0x30 [libcfs]
[10817.453458]  [<ffffffffa05e07a0>] ? lustre_pack_reply_v2+0x1a0/0x2a0 [ptlrpc]
[10817.459190]  [<ffffffffa0646e5b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[10817.460135]  [<ffffffffa05ebc71>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[10817.461755]  [<ffffffffa05efa28>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[10817.462857]  [<ffffffffa05eefd0>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[10817.463724]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[10817.464563]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[10817.465320]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[10817.474397]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[10817.475266] 
[10817.475867] LustreError: dumping log to /tmp/lustre-log.1517165643.3812
[10828.195354] LNet: Service thread pid 3812 completed after 50.85s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[10875.628351] LNet: Service thread pid 7993 was inactive for 40.03s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[10875.631117] Pid: 7993, comm: mdt06_002
[10875.631734] 
Call Trace:
[10875.633730]  [<ffffffff81704319>] schedule+0x29/0x70
[10875.634417]  [<ffffffff81700952>] schedule_timeout+0x162/0x2a0
[10875.636968]  [<ffffffffa0227ea5>] ? cfs_block_sigsinv+0x45/0xa0 [libcfs]
[10875.637682]  [<ffffffff810879d0>] ? process_timeout+0x0/0x10
[10875.638380]  [<ffffffffa0227ed1>] ? cfs_block_sigsinv+0x71/0xa0 [libcfs]
[10875.639103]  [<ffffffffa13be2c8>] osp_precreate_reserve+0x318/0x830 [osp]
[10875.639799]  [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
[10875.648290]  [<ffffffffa13b3b53>] osp_declare_create+0x193/0x590 [osp]
[10875.649029]  [<ffffffffa1367ab2>] lod_sub_declare_create+0xe2/0x210 [lod]
[10875.649755]  [<ffffffffa1360dce>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
[10875.650483]  [<ffffffffa136688d>] lod_qos_prep_create+0x10dd/0x1800 [lod]
[10875.651199]  [<ffffffffa13674e5>] lod_prepare_create+0x255/0x350 [lod]
[10875.651900]  [<ffffffffa13588de>] lod_declare_striped_create+0x1ee/0x980 [lod]
[10875.653563]  [<ffffffffa1367ab2>] ? lod_sub_declare_create+0xe2/0x210 [lod]
[10875.654291]  [<ffffffffa135d074>] lod_declare_create+0x204/0x5a0 [lod]
[10875.655017]  [<ffffffffa0232f47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[10875.655885]  [<ffffffffa12300ff>] mdd_declare_create_object_internal+0xdf/0x2f0 [mdd]
[10875.668080]  [<ffffffffa1220953>] mdd_declare_create+0x53/0xe30 [mdd]
[10875.677779]  [<ffffffffa022f734>] ? libcfs_log_return+0x24/0x30 [libcfs]
[10875.678567]  [<ffffffffa1224c69>] mdd_create+0x859/0x1410 [mdd]
[10875.679110]  [<ffffffffa12a75f4>] mdt_reint_open+0x2144/0x3190 [mdt]
[10875.679778]  [<ffffffffa129b900>] mdt_reint_rec+0x80/0x210 [mdt]
[10875.680363]  [<ffffffffa127b5cb>] mdt_reint_internal+0x5fb/0x990 [mdt]
[10875.680878]  [<ffffffffa1287927>] mdt_intent_reint+0x157/0x420 [mdt]
[10875.682050]  [<ffffffffa127e6a2>] mdt_intent_opc+0x442/0xad0 [mdt]
[10875.682629]  [<ffffffffa05e2490>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
[10875.683321]  [<ffffffffa1286116>] mdt_intent_policy+0x1a6/0x360 [mdt]
[10875.683850]  [<ffffffffa059209a>] ldlm_lock_enqueue+0x35a/0x8d0 [ptlrpc]
[10875.684545]  [<ffffffffa05ba543>] ldlm_handle_enqueue0+0x913/0x13f0 [ptlrpc]
[10875.685128]  [<ffffffffa0232f47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[10875.685825]  [<ffffffffa063ec32>] tgt_enqueue+0x62/0x210 [ptlrpc]
[10875.686373]  [<ffffffffa0646e5b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[10875.686909]  [<ffffffffa05ebc71>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[10875.688364]  [<ffffffffa05efa28>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[10875.688910]  [<ffffffffa05eefd0>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[10875.689628]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[10875.690224]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[10875.690698]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[10875.691349]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[10875.691818] 
[10875.692431] LustreError: dumping log to /tmp/lustre-log.1517165701.7993
[10875.723487] LNet: Service thread pid 7993 completed after 40.13s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[11386.093776] LNet: Service thread pid 7976 was inactive for 40.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[11386.096569] Pid: 7976, comm: mdt01_000
[11386.099291] 
Call Trace:
[11386.100643]  [<ffffffff81704319>] schedule+0x29/0x70
[11386.101324]  [<ffffffff81700952>] schedule_timeout+0x162/0x2a0
[11386.102030]  [<ffffffffa0227ea5>] ? cfs_block_sigsinv+0x45/0xa0 [libcfs]
[11386.102855]  [<ffffffff810879d0>] ? process_timeout+0x0/0x10
[11386.103616]  [<ffffffffa0227ed1>] ? cfs_block_sigsinv+0x71/0xa0 [libcfs]
[11386.104349]  [<ffffffffa13be2c8>] osp_precreate_reserve+0x318/0x830 [osp]
[11386.105054]  [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
[11386.105869]  [<ffffffffa13b3b53>] osp_declare_create+0x193/0x590 [osp]
[11386.106720]  [<ffffffffa1367ab2>] lod_sub_declare_create+0xe2/0x210 [lod]
[11386.108825]  [<ffffffffa1360dce>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
[11386.163398]  [<ffffffffa136688d>] lod_qos_prep_create+0x10dd/0x1800 [lod]
[11386.164178]  [<ffffffffa13674e5>] lod_prepare_create+0x255/0x350 [lod]
[11386.166825]  [<ffffffffa13588de>] lod_declare_striped_create+0x1ee/0x980 [lod]
[11386.168090]  [<ffffffffa1367ab2>] ? lod_sub_declare_create+0xe2/0x210 [lod]
[11386.168794]  [<ffffffffa135d074>] lod_declare_create+0x204/0x5a0 [lod]
[11386.178636]  [<ffffffffa0232f47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[11386.180684]  [<ffffffffa12300ff>] mdd_declare_create_object_internal+0xdf/0x2f0 [mdd]
[11386.197534]  [<ffffffffa1220953>] mdd_declare_create+0x53/0xe30 [mdd]
[11386.202676]  [<ffffffffa022f734>] ? libcfs_log_return+0x24/0x30 [libcfs]
[11386.203980]  [<ffffffffa1224c69>] mdd_create+0x859/0x1410 [mdd]
[11386.213411]  [<ffffffffa12a75f4>] mdt_reint_open+0x2144/0x3190 [mdt]
[11386.218277]  [<ffffffffa129b900>] mdt_reint_rec+0x80/0x210 [mdt]
[11386.228289]  [<ffffffffa127b5cb>] mdt_reint_internal+0x5fb/0x990 [mdt]
[11386.245324]  [<ffffffffa1287927>] mdt_intent_reint+0x157/0x420 [mdt]
[11386.298037]  [<ffffffffa127e6a2>] mdt_intent_opc+0x442/0xad0 [mdt]
[11386.298932]  [<ffffffffa05e2490>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
[11386.300428]  [<ffffffffa1286116>] mdt_intent_policy+0x1a6/0x360 [mdt]
[11386.312540]  [<ffffffffa059209a>] ldlm_lock_enqueue+0x35a/0x8d0 [ptlrpc]
[11386.313331]  [<ffffffffa05ba543>] ldlm_handle_enqueue0+0x913/0x13f0 [ptlrpc]
[11386.314102]  [<ffffffffa0232f47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[11386.314873]  [<ffffffffa063ec32>] tgt_enqueue+0x62/0x210 [ptlrpc]
[11386.371225]  [<ffffffffa0646e5b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[11386.372055]  [<ffffffffa05ebc71>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[11386.373355]  [<ffffffffa05efa28>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[11386.392312]  [<ffffffffa05eefd0>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[11386.393037]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[11386.393686]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[11386.394358]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[11386.395042]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[11386.395735] 
[11386.396429] LustreError: dumping log to /tmp/lustre-log.1517166212.7976
[11391.441778] LNet: Service thread pid 7976 completed after 45.44s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[11646.381483] LNet: Service thread pid 27607 was inactive for 40.14s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[11646.406010] Pid: 27607, comm: ll_ost04_009
[11646.406793] 
Call Trace:
[11646.408187]  [<ffffffff81704319>] schedule+0x29/0x70
[11646.427239]  [<ffffffffa09a4fad>] cv_wait_common+0x17d/0x240 [spl]
[11646.427962]  [<ffffffff810a4070>] ? autoremove_wake_function+0x0/0x40
[11646.428667]  [<ffffffffa09a5085>] __cv_wait+0x15/0x20 [spl]
[11646.429380]  [<ffffffffa0b5d293>] txg_wait_open+0xe3/0x1a0 [zfs]
[11646.483045]  [<ffffffffa0b065d2>] dmu_tx_wait+0x5c2/0x5d0 [zfs]
[11646.483791]  [<ffffffffa0b06675>] dmu_tx_assign+0x95/0x5a0 [zfs]
[11646.487195]  [<ffffffffa09bad87>] osd_trans_start+0xa7/0x3c0 [osd_zfs]
[11646.488017]  [<ffffffffa1415596>] ofd_precreate_objects+0xa46/0x1aa0 [ofd]
[11646.488726]  [<ffffffffa1408cda>] ofd_create_hdl+0x47a/0x20b0 [ofd]
[11646.489443]  [<ffffffffa022f734>] ? libcfs_log_return+0x24/0x30 [libcfs]
[11646.490213]  [<ffffffffa05e07a0>] ? lustre_pack_reply_v2+0x1a0/0x2a0 [ptlrpc]
[11646.490978]  [<ffffffffa0646e5b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[11646.491713]  [<ffffffffa05ebc71>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[11646.585560]  [<ffffffffa05efa28>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[11646.587445]  [<ffffffffa05eefd0>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[11646.588380]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[11646.589045]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[11646.589502]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[11646.589984]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[11646.590438] 
[11646.590834] LustreError: dumping log to /tmp/lustre-log.1517166472.27607
[11650.848143] Pid: 7986, comm: mdt04_001
[11650.911957] 
Call Trace:
[11650.915395]  [<ffffffff81704319>] schedule+0x29/0x70
[11650.915855]  [<ffffffff81700952>] schedule_timeout+0x162/0x2a0
[11650.916386]  [<ffffffffa0227ea5>] ? cfs_block_sigsinv+0x45/0xa0 [libcfs]
[11650.917146]  [<ffffffff810879d0>] ? process_timeout+0x0/0x10
[11650.917808]  [<ffffffffa0227ed1>] ? cfs_block_sigsinv+0x71/0xa0 [libcfs]
[11650.919882]  [<ffffffffa13be2c8>] osp_precreate_reserve+0x318/0x830 [osp]
[11650.920577]  [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
[11650.927677]  [<ffffffffa13b3b53>] osp_declare_create+0x193/0x590 [osp]
[11650.928576]  [<ffffffffa1367ab2>] lod_sub_declare_create+0xe2/0x210 [lod]
[11650.929423]  [<ffffffffa1360dce>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
[11650.930299]  [<ffffffffa136688d>] lod_qos_prep_create+0x10dd/0x1800 [lod]
[11650.931383]  [<ffffffffa13674e5>] lod_prepare_create+0x255/0x350 [lod]
[11650.932126]  [<ffffffffa13588de>] lod_declare_striped_create+0x1ee/0x980 [lod]
[11650.933401]  [<ffffffffa1367ab2>] ? lod_sub_declare_create+0xe2/0x210 [lod]
[11650.942676]  [<ffffffffa135d074>] lod_declare_create+0x204/0x5a0 [lod]
[11650.943445]  [<ffffffffa0232f47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[11650.944187]  [<ffffffffa12300ff>] mdd_declare_create_object_internal+0xdf/0x2f0 [mdd]
[11650.945472]  [<ffffffffa1220953>] mdd_declare_create+0x53/0xe30 [mdd]
[11650.955257]  [<ffffffffa022f734>] ? libcfs_log_return+0x24/0x30 [libcfs]
[11650.955987]  [<ffffffffa1224c69>] mdd_create+0x859/0x1410 [mdd]
[11650.956711]  [<ffffffffa12a75f4>] mdt_reint_open+0x2144/0x3190 [mdt]
[11650.957500]  [<ffffffffa129b900>] mdt_reint_rec+0x80/0x210 [mdt]
[11650.958246]  [<ffffffffa127b5cb>] mdt_reint_internal+0x5fb/0x990 [mdt]
[11650.958967]  [<ffffffffa1287927>] mdt_intent_reint+0x157/0x420 [mdt]
[11650.959667]  [<ffffffffa127e6a2>] mdt_intent_opc+0x442/0xad0 [mdt]
[11650.960434]  [<ffffffffa05e2490>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
[11650.961232]  [<ffffffffa1286116>] mdt_intent_policy+0x1a6/0x360 [mdt]
[11650.961996]  [<ffffffffa059209a>] ldlm_lock_enqueue+0x35a/0x8d0 [ptlrpc]
[11650.962893]  [<ffffffffa05ba543>] ldlm_handle_enqueue0+0x913/0x13f0 [ptlrpc]
[11650.963881]  [<ffffffffa0232f47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[11650.964779]  [<ffffffffa063ec32>] tgt_enqueue+0x62/0x210 [ptlrpc]
[11650.965676]  [<ffffffffa0646e5b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[11650.967478]  [<ffffffffa05ebc71>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[11650.968910]  [<ffffffffa05efa28>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[11650.969707]  [<ffffffffa05eefd0>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[11650.970237]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[11650.982955]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[11650.983796]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[11651.002062]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[11651.002815] 
[11651.003438] LustreError: dumping log to /tmp/lustre-log.1517166477.7986
[11657.848212] LNet: Service thread pid 7986 completed after 47.10s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[12003.206088] LNet: Service thread pid 29404 was inactive for 40.16s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[12003.248977] LNet: Skipped 1 previous similar message
[12003.249773] Pid: 29404, comm: ll_ost00_008
[12003.250417] 
Call Trace:
[12003.251575]  [<ffffffff81704319>] schedule+0x29/0x70
[12003.252248]  [<ffffffffa09a4fad>] cv_wait_common+0x17d/0x240 [spl]
[12003.266969]  [<ffffffff810a4070>] ? autoremove_wake_function+0x0/0x40
[12003.267718]  [<ffffffffa09a5085>] __cv_wait+0x15/0x20 [spl]
[12003.268463]  [<ffffffffa0b5d293>] txg_wait_open+0xe3/0x1a0 [zfs]
[12003.270220]  [<ffffffffa0b065d2>] dmu_tx_wait+0x5c2/0x5d0 [zfs]
[12003.270931]  [<ffffffffa0b06675>] dmu_tx_assign+0x95/0x5a0 [zfs]
[12003.271637]  [<ffffffffa09bad87>] osd_trans_start+0xa7/0x3c0 [osd_zfs]
[12003.278620]  [<ffffffffa1415596>] ofd_precreate_objects+0xa46/0x1aa0 [ofd]
[12003.279417]  [<ffffffffa1408cda>] ofd_create_hdl+0x47a/0x20b0 [ofd]
[12003.288353]  [<ffffffffa022f734>] ? libcfs_log_return+0x24/0x30 [libcfs]
[12003.289170]  [<ffffffffa05e07a0>] ? lustre_pack_reply_v2+0x1a0/0x2a0 [ptlrpc]
[12003.289937]  [<ffffffffa0646e5b>] tgt_request_handle+0x93b/0x13e0 [ptlrpc]
[12003.290673]  [<ffffffffa05ebc71>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[12003.298125]  [<ffffffffa05efa28>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[12003.298851]  [<ffffffffa05eefd0>] ? ptlrpc_main+0x0/0x1df0 [ptlrpc]
[12003.299707]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[12003.309644]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[12003.310163]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[12003.310627]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[12003.311173] 
[12003.311602] LustreError: dumping log to /tmp/lustre-log.1517166829.29404
[12008.410465] LNet: Service thread pid 29404 completed after 45.36s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[12008.414754] LNet: Skipped 1 previous similar message
[12037.466656] SysRq : Trigger a crash

As far as I can tell, ldiskfs is unaffected.

I checked in Maloo, and the test passes there in about 600 seconds, so this is not just regular ZFS slowness or anything of the sort.



 Comments   
Comment by James A Simmons [ 03/Jul/19 ]

I wonder if upstream kernel commit 84e07b9d0ac8728b1865b23498d746861a8ab4c2 would resolve this.

Comment by James A Simmons [ 04/Jul/19 ]

The backtrace looks like a signal problem, but it is not. Still worth a try.

Comment by Andreas Dilger [ 18/Jun/21 ]

I have not seen this in the past 6 months. While this test is skipped as a SLOW test in review sessions, it is run in full test sessions (101 passes in a row on ZFS since 2020-12-01).
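(For reference, forcing this SLOW-listed test in a manual ZFS run would look something like the sketch below, assuming the usual SLOW=, ONLY= and FSTYPE= conventions of the lustre/tests scripts.)

# Include SLOW tests and run only conf-sanity test 69 on ZFS-backed targets
SLOW=yes ONLY=69 FSTYPE=zfs bash ./conf-sanity.sh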
