[LU-12167] conf-sanity test_90c: transaction blocked hang on OST umount Created: 07/Apr/19  Updated: 04/Jul/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Alex Zhuravlev
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for wangshilong <wshilong@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/30cc74a4-5923-11e9-8e92-52540065bddc

test_90c failed with the following error:

Timeout occurred after 379 mins, last suite running was conf-sanity, restarting cluster to continue tests

Looks ZFS transaction blocked which hang on OST umount

19270.120645] Pid: 1384, comm: ll_ost00_006 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Mon Mar 25 21:14:10 UTC 2019
[19270.122330] Call Trace:
[19270.122837]  [<ffffffffc042f2d5>] cv_wait_common+0x125/0x150 [spl]
[19270.123988]  [<ffffffffc042f315>] __cv_wait+0x15/0x20 [spl]
[19270.124988]  [<ffffffffc05a62bf>] txg_wait_synced+0xef/0x140 [zfs]
[19270.126335]  [<ffffffffc11d39db>] osd_trans_stop+0x53b/0x5e0 [osd_zfs]
[19270.127563]  [<ffffffffc132ed45>] ofd_trans_stop+0x25/0x60 [ofd]
[19270.128692]  [<ffffffffc1333335>] ofd_destroy+0x2c5/0x960 [ofd]
[19270.129759]  [<ffffffffc132b5a4>] ofd_destroy_by_fid+0x1f4/0x4a0 [ofd]
[19270.130919]  [<ffffffffc1321677>] ofd_destroy_hdl+0x267/0x970 [ofd]
[19270.132027]  [<ffffffffc0fed06a>] tgt_request_handle+0x91a/0x15c0 [ptlrpc]
[19270.133698]  [<ffffffffc0f906ae>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[19270.135099]  [<ffffffffc0f941ac>] ptlrpc_main+0xbbc/0x2090 [ptlrpc]
[19270.136246]  [<ffffffff9d4c1c31>] kthread+0xd1/0xe0
[19270.137169]  [<ffffffff9db74c37>] ret_from_fork_nospec_end+0x0/0x39
[19270.138304]  [<ffffffffffffffff>] 0xffffffffffffffff
[19270.139282] Pid: 1381, comm: ll_ost00_005 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Mon Mar 25 21:14:10 UTC 2019
[19270.141084] Call Trace:
[19270.141612]  [<ffffffffc042f2d5>] cv_wait_common+0x125/0x150 [spl]
[19270.143092]  [<ffffffffc042f315>] __cv_wait+0x15/0x20 [spl]
[19270.144111]  [<ffffffffc05a62bf>] txg_wait_synced+0xef/0x140 [zfs]
[19270.145275]  [<ffffffffc11d39db>] osd_trans_stop+0x53b/0x5e0 [osd_zfs]
[19270.146454]  [<ffffffffc132ed45>] ofd_trans_stop+0x25/0x60 [ofd]
[19270.147545]  [<ffffffffc1333335>] ofd_destroy+0x2c5/0x960 [ofd]
[19270.148616]  [<ffffffffc132b5a4>] ofd_destroy_by_fid+0x1f4/0x4a0 [ofd]
[19270.149771]  [<ffffffffc1321677>] ofd_destroy_hdl+0x267/0x970 [ofd]
[19270.150910]  [<ffffffffc0fed06a>] tgt_request_handle+0x91a/0x15c0 [ptlrpc]
[19270.152179]  [<ffffffffc0f906ae>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[19270.153593]  [<ffffffffc0f941ac>] ptlrpc_main+0xbbc/0x2090 [ptlrpc]
[19270.154744]  [<ffffffff9d4c1c31>] kthread+0xd1/0xe0
[19270.155633]  [<ffffffff9db74c37>] ret_from_fork_nospec_end+0x0/0x39
[19270.156767]  [<ffffffffffffffff>] 0xffffffffffffffff
[19270.157685] Pid: 1393, comm: ll_ost00_009 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Mon Mar 25 21:14:10 UTC 2019
[19270.159379] Call Trace:
[19270.159829]  [<ffffffffc042f2d5>] cv_wait_common+0x125/0x150 [spl]
[19270.160932]  [<ffffffffc042f315>] __cv_wait+0x15/0x20 [spl]
[19270.161933]  [<ffffffffc05a62bf>] txg_wait_synced+0xef/0x140 [zfs]
[19270.163056]  [<ffffffffc11d39db>] osd_trans_stop+0x53b/0x5e0 [osd_zfs]
[19270.164205]  [<ffffffffc132ed45>] ofd_trans_stop+0x25/0x60 [ofd]
[19270.165278]  [<ffffffffc1333335>] ofd_destroy+0x2c5/0x960 [ofd]
[19270.166331]  [<ffffffffc132b5a4>] ofd_destroy_by_fid+0x1f4/0x4a0 [ofd]
[19270.167492]  [<ffffffffc1321677>] ofd_destroy_hdl+0x267/0x970 [ofd]
[19270.168600]  [<ffffffffc0fed06a>] tgt_request_handle+0x91a/0x15c0 [ptlrpc]
[19270.169846]  [<ffffffffc0f906ae>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[19270.171223]  [<ffffffffc0f941ac>] ptlrpc_main+0xbbc/0x2090 [ptlrpc]
[19270.172365]  [<ffffffff9d4c1c31>] kthread+0xd1/0xe0
[19270.173243]  [<ffffffff9db74c37>] ret_from_fork_nospec_end+0x0/0x39
[19270.174349]  [<ffffffffffffffff>] 0xffffffffffffffff
....

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_90c - Timeout occurred after 379 mins, last suite running was conf-sanity, restarting cluster to continue tests



 Comments   
Comment by Wang Shilong (Inactive) [ 07/Apr/19 ]

Check following OST backtrace.

https://testing.whamcloud.com/test_logs/3373a93e-5923-11e9-8e92-52540065bddc/show_text

Comment by Wang Shilong (Inactive) [ 07/Apr/19 ]

Looks dmu objection allocation blocked on mutex acquired:

[22795.946424] Call Trace:
[22795.946868]  [<ffffffff9d4d31b6>] __cond_resched+0x26/0x30
[22795.947820]  [<ffffffff9db67eea>] _cond_resched+0x3a/0x50
[22795.948754]  [<ffffffff9db65e92>] mutex_lock+0x12/0x2f
[22795.949668]  [<ffffffffc055d5b3>] dnode_check_slots_free+0x63/0x100 [zfs]
[22795.950844]  [<ffffffffc055e4b6>] dnode_hold_impl+0x4e6/0xc40 [zfs]
[22795.951938]  [<ffffffffc054b905>] ? dmu_object_next+0x95/0x140 [zfs]
[22795.953063]  [<ffffffffc054bb9f>] dmu_object_alloc_dnsize+0x1ef/0x3e0 [zfs]
[22795.954262]  [<ffffffffc11ddba2>] __osd_object_create+0x82/0x170 [osd_zfs]
[22795.955441]  [<ffffffffc11ddf0d>] osd_mkreg+0x7d/0x210 [osd_zfs]
[22795.956484]  [<ffffffffc11d9506>] osd_create+0x316/0xaf0 [osd_zfs]
[22795.957577]  [<ffffffffc133028c>] ofd_precreate_objects+0xeec/0x1aa0 [ofd]
[22795.958765]  [<ffffffffc13221f4>] ofd_create_hdl+0x474/0x2100 [ofd]
[22795.959861]  [<ffffffffc0c0a8d4>] ? libcfs_log_return+0x24/0x30 [libcfs]
[22795.961032]  [<ffffffffc0f84613>] ? lustre_pack_reply_v2+0x183/0x280 [ptlrpc]
[22795.962281]  [<ffffffffc0fed06a>] tgt_request_handle+0x91a/0x15c0 [ptlrpc]
[22795.963464]  [<ffffffffc0c0df17>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[22795.964627]  [<ffffffffc0f906ae>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[22795.965966]  [<ffffffffc0f941ac>] ptlrpc_main+0xbbc/0x2090 [ptlrpc]
[22795.967048]  [<ffffffff9d4d0880>] ? finish_task_switch+0x50/0x1c0
[22795.968125]  [<ffffffffc0f935f0>] ? ptlrpc_register_service+0xfa0/0xfa0 [ptlrpc]
[22795.969395]  [<ffffffff9d4c1c31>] kthread+0xd1/0xe0
[22795.970237]  [<ffffffff9d4c1b60>] ? insert_kthread_work+0x40/0x40
[22795.971279]  [<ffffffff9db74c37>] ret_from_fork_nospec_begin+0x21/0x21
[22795.972400]  [<ffffffff9d4c1b60>] ? insert_kthread_work+0x40/0x40
[22795.973453] ll_ost00_004    D ffff9f40243a30c0     0  1372      2 0x00000080
Comment by Alex Zhuravlev [ 08/Apr/19 ]

this looks likeĀ https://github.com/zfsonlinux/zfs/issues/8426

Comment by Wang Shilong (Inactive) [ 08/Apr/19 ]

So problem will be solved once ZFS is upgraded to 0.8

Comment by Sebastien Buisson [ 19/Nov/19 ]

Probably the same problem hit in https://testing.whamcloud.com/test_sets/6ccd175e-0a5a-11ea-8e77-52540065bddc, but during conf-sanity test_76c and with a slightly different stack trace:

[22615.838841] Pid: 9453, comm: ll_ost00_006 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Fri Nov 8 17:23:54 UTC 2019
[22615.838842] Call Trace:
[22615.838866]  [<ffffffffc05272d5>] cv_wait_common+0x125/0x150 [spl]
[22615.838870]  [<ffffffffc0527315>] __cv_wait+0x15/0x20 [spl]
[22615.838996]  [<ffffffffc07d84d3>] txg_wait_open+0xc3/0x110 [zfs]
[22615.839014]  [<ffffffffc078ddca>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
[22615.839031]  [<ffffffffc078de72>] dmu_tx_assign+0x92/0x490 [zfs]
[22615.839075]  [<ffffffffc137dfd9>] osd_trans_start+0x199/0x440 [osd_zfs]
[22615.839136]  [<ffffffffc14c906c>] ofd_precreate_objects+0xa1c/0x1df0 [ofd]
[22615.839140]  [<ffffffffc14bc024>] ofd_create_hdl+0x474/0x2100 [ofd]
[22615.839518]  [<ffffffffc11afeea>] tgt_request_handle+0x97a/0x1620 [ptlrpc]
[22615.839556]  [<ffffffffc1156a66>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[22615.839590]  [<ffffffffc115a59c>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[22615.839594]  [<ffffffffb7ac2e81>] kthread+0xd1/0xe0
[22615.839597]  [<ffffffffb8177c37>] ret_from_fork_nospec_end+0x0/0x39
[22615.839605]  [<ffffffffffffffff>] 0xffffffffffffffff
[22615.862159] Pid: 30965, comm: ll_ost00_004 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Fri Nov 8 17:23:54 UTC 2019
[22615.863851] Call Trace:
[22615.864279]  [<ffffffffc05272d5>] cv_wait_common+0x125/0x150 [spl]
[22615.865584]  [<ffffffffc0527315>] __cv_wait+0x15/0x20 [spl]
[22615.866771]  [<ffffffffc07d84d3>] txg_wait_open+0xc3/0x110 [zfs]
[22615.867885]  [<ffffffffc078ddca>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
[22615.868958]  [<ffffffffc078de72>] dmu_tx_assign+0x92/0x490 [zfs]
[22615.870026]  [<ffffffffc137dfd9>] osd_trans_start+0x199/0x440 [osd_zfs]
[22615.871199]  [<ffffffffc14c906c>] ofd_precreate_objects+0xa1c/0x1df0 [ofd]
[22615.872413]  [<ffffffffc14bc024>] ofd_create_hdl+0x474/0x2100 [ofd]
[22615.873538]  [<ffffffffc11afeea>] tgt_request_handle+0x97a/0x1620 [ptlrpc]
[22615.874798]  [<ffffffffc1156a66>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[22615.876165]  [<ffffffffc115a59c>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[22615.877303]  [<ffffffffb7ac2e81>] kthread+0xd1/0xe0
[22615.878183]  [<ffffffffb8177c37>] ret_from_fork_nospec_end+0x0/0x39
[22615.879298]  [<ffffffffffffffff>] 0xffffffffffffffff
Comment by Andreas Dilger [ 04/Jul/20 ]

+1 on master with the same stack as Sebastien:
https://testing.whamcloud.com/test_sets/ba979f34-0a79-406f-931d-202a6419d48e

Generated at Sat Feb 10 02:50:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.