[LU-10250] replay-single test_74: hang and timed out Created: 16/Nov/17  Updated: 18/Feb/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-10009 sanity-benchmark test_iozone: test fa... Open
is duplicated by LU-10065 ost-pools test_5a: test failed to res... Resolved
Related
is related to LU-10223 OSS hang Resolved
is related to LU-10572 Hang in conf-sanity test 69 on zfs Resolved
is related to LU-10670 sanity-flr test 43 timeout Resolved
is related to LU-12510 mds server hangs cv_wait_common Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/96539eb0-cadb-11e7-8027-52540065bddc.

The sub-test test_74 failed with the following error:

Timeout occurred after 128 mins, last suite running was replay-single, restarting cluster to continue tests

Please provide additional information about the failure here.

Info required for matching: replay-single 74

On the client, the hung-task watchdog dumped the following stack trace:

[ 4440.087054] INFO: task touch:18639 blocked for more than 120 seconds.
[ 4440.089288] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4440.091541] touch           D 0000000000000000     0 18639  17432 0x00000080
[ 4440.093752]  ffff880069033a40 0000000000000086 ffff880036a79fa0 ffff880069033fd8
[ 4440.096088]  ffff880069033fd8 ffff880069033fd8 ffff880036a79fa0 ffff880066647650
[ 4440.098281]  7fffffffffffffff ffff880066647648 ffff880036a79fa0 0000000000000000
[ 4440.100517] Call Trace:
[ 4440.102304]  [<ffffffff816a9589>] schedule+0x29/0x70
[ 4440.104209]  [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
[ 4440.106187]  [<ffffffffc09e3108>] ? ptlrpc_set_add_new_req+0xd8/0x150 [ptlrpc]
[ 4440.108288]  [<ffffffffc0bc2bc0>] ? osc_io_ladvise_end+0x50/0x50 [osc]
[ 4440.110272]  [<ffffffffc0a1324b>] ? ptlrpcd_add_req+0x22b/0x300 [ptlrpc]
[ 4440.112208]  [<ffffffffc09f5e40>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[ 4440.114221]  [<ffffffff816a993d>] wait_for_completion+0xfd/0x140
[ 4440.116026]  [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
[ 4440.117835]  [<ffffffffc0bc2d84>] osc_io_setattr_end+0xc4/0x180 [osc]
[ 4440.119639]  [<ffffffffc0bc4381>] ? osc_io_setattr_start+0x471/0x6e0 [osc]
[ 4440.121517]  [<ffffffffc0c15450>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[ 4440.123344]  [<ffffffffc08201ed>] cl_io_end+0x5d/0x150 [obdclass]
[ 4440.125122]  [<ffffffffc0c1552b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 4440.126825]  [<ffffffffc0c15b75>] lov_io_call.isra.5+0x85/0x140 [lov]
[ 4440.128586]  [<ffffffffc0c15c66>] lov_io_end+0x36/0xb0 [lov]
[ 4440.130210]  [<ffffffffc08201ed>] cl_io_end+0x5d/0x150 [obdclass]
[ 4440.131901]  [<ffffffffc082287f>] cl_io_loop+0x13f/0xc70 [obdclass]
[ 4440.133544]  [<ffffffffc0cbb4d0>] cl_setattr_ost+0x240/0x3a0 [lustre]
[ 4440.135229]  [<ffffffffc0c95b05>] ll_setattr_raw+0x1185/0x1290 [lustre]
[ 4440.136821]  [<ffffffffc0c95c7c>] ll_setattr+0x6c/0xd0 [lustre]
[ 4440.138380]  [<ffffffff8121ee71>] notify_change+0x2c1/0x420
[ 4440.139813]  [<ffffffff81233b59>] utimes_common+0xd9/0x1c0
[ 4440.141276]  [<ffffffff81233d7e>] do_utimes+0x13e/0x180
[ 4440.142646]  [<ffffffff81233ec4>] SyS_utimensat+0x64/0xb0
[ 4440.144085]  [<ffffffff816ac5c8>] ? page_fault+0x28/0x30
[ 4440.145450]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[ 4560.146045] INFO: task touch:18639 blocked for more than 120 seconds.

This shows the client waiting for a setattr to complete. On the OFD side:

[ 4163.936021] LNet: Service thread pid 15315 was inactive for 40.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 4163.941814] Pid: 15315, comm: ll_ost00_000
[ 4163.944340] 
Call Trace:
[ 4163.948946]  [<ffffffff816a9589>] schedule+0x29/0x70
[ 4163.951469]  [<ffffffffc07084d5>] cv_wait_common+0x125/0x150 [spl]
[ 4163.954051]  [<ffffffff810b1920>] ? autoremove_wake_function+0x0/0x40
[ 4163.956663]  [<ffffffffc0708515>] __cv_wait+0x15/0x20 [spl]
[ 4163.959223]  [<ffffffffc085781f>] txg_wait_synced+0xef/0x140 [zfs]
[ 4163.961745]  [<ffffffffc080cc65>] dmu_tx_wait+0x275/0x3c0 [zfs]
[ 4163.964242]  [<ffffffffc080ce41>] dmu_tx_assign+0x91/0x490 [zfs]
[ 4163.966725]  [<ffffffffc0ecbe1a>] ? tgt_txn_start_cb+0x1da/0x3b0 [ptlrpc]
[ 4163.969213]  [<ffffffffc1098f37>] osd_trans_start+0xa7/0x3a0 [osd_zfs]
[ 4163.971617]  [<ffffffffc11c1fab>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 4163.973952]  [<ffffffffc11c4943>] ofd_attr_set+0x433/0xb00 [ofd]
[ 4163.976281]  [<ffffffffc11afff3>] ofd_setattr_hdl+0x303/0x950 [ofd]
[ 4163.978596]  [<ffffffffc0ed7c05>] tgt_request_handle+0x925/0x13b0 [ptlrpc]
[ 4163.980977]  [<ffffffffc0e7be8e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[ 4163.983391]  [<ffffffff810ba598>] ? __wake_up_common+0x58/0x90
[ 4163.985676]  [<ffffffffc0e7f632>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 4163.987983]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[ 4163.990182]  [<ffffffff816a9000>] ? __schedule+0x350/0x8b0
[ 4163.992348]  [<ffffffffc0e7eba0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[ 4163.994528]  [<ffffffff810b099f>] kthread+0xcf/0xe0
[ 4163.996574]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
[ 4163.998609]  [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
[ 4164.000651]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0

[ 4164.004284] LustreError: dumping log to /tmp/lustre-log.1510820580.15315

The OSS thread is waiting for the transaction that handled the client's setattr to sync, and it seems that transaction never completes.
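
For context, below is a minimal sketch of the generic DMU transaction pattern that a ZFS consumer such as osd-zfs follows (simplified, with an illustrative hold call only; this is not the actual osd_trans_start() code):

#include <sys/dmu.h>

/*
 * With TXG_WAIT, dmu_tx_assign() retries internally via dmu_tx_wait(),
 * which can block in txg_wait_synced() until an earlier TXG finishes
 * syncing; that is exactly where ll_ost00_000 is stuck in the trace above.
 */
static int
setattr_tx_sketch(objset_t *os, uint64_t object)
{
	dmu_tx_t *tx = dmu_tx_create(os);
	int error;

	/* Declare the dnode/bonus buffer we intend to dirty. */
	dmu_tx_hold_bonus(tx, object);

	/* May sleep in dmu_tx_wait() -> txg_wait_synced(). */
	error = dmu_tx_assign(tx, TXG_WAIT);
	if (error != 0) {
		dmu_tx_abort(tx);
		return (error);
	}

	/* ... apply the attribute change within this transaction ... */

	dmu_tx_commit(tx);
	return (0);
}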

I guess this is the same issue as LU-4440; unfortunately there is no useful description on that ticket.



 Comments   
Comment by Andreas Dilger [ 28/Nov/17 ]

The stack traces dumped on the client are because osc_io_setattr_end() is using wait_for_completion() directly. It would be better to use wait_for_completion_interruptible() so that the wait is interruptible in case the user wants to kill the process and get on with their day.

That doesn't resolve the issue on the server, but the client may as well be fixed at the same time.
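
For illustration, here is a minimal sketch of the interruptible-wait pattern suggested above (hypothetical function name, not the actual osc_io_setattr_end() code):

#include <linux/completion.h>

/*
 * Sketch only: wait_for_completion_interruptible() returns 0 once the
 * completion fires, or -ERESTARTSYS if a signal arrives first, so a
 * user killing "touch" would no longer be stuck uninterruptibly (D
 * state) behind a hung server transaction.
 */
static int example_setattr_end_wait(struct completion *done)
{
	int rc;

	rc = wait_for_completion_interruptible(done);
	if (rc < 0)
		return rc;	/* interrupted; the caller must clean up the RPC */

	return 0;
}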

Comment by Andreas Dilger [ 29/Aug/18 ]

This seems to be hit occasionally: an unexplained and unreported hang in ZFS dmu_tx_wait() (see the linked tickets). I don't know whether this is an artifact of our VM testing or a problem that also hits real hardware, but it is something to be aware of and keep an eye open for.

Comment by James A Simmons [ 03/Jul/19 ]

This is a problem hit on real hardware. It took down our production file system.

Comment by Andreas Dilger [ 03/Jul/19 ]

James, could you please file a separate ticket for your issue? It is of course fine to link it to this ticket, but a separate ticket allows the problem with your production system to be tracked better, and it may be that the problem you are seeing is unrelated (being stuck in "dmu_tx_wait()" is basically the equivalent of being stuck in "start_this_handle()" for ldiskfs, so there are a hundred ways to get there).

Also, I wasn't aware that you were running ZFS in production. I thought you had ldiskfs-based filesystems on Spider2?

Comment by James A Simmons [ 03/Jul/19 ]

This is for our NOAA file system, which is also production. It runs 2.12 LTS with ZFS as the back end.

Comment by Alex Zhuravlev [ 04/Jul/19 ]

It would be helpful to get a full set of backtraces. dmu_tx_wait() can be waiting for a new TXG while the current TXG is pinned by another process.
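
As a rough illustration of that pinning scenario (hypothetical helper names, heavily simplified from the real ZFS internals):

#include <sys/dmu.h>
#include <sys/dmu_objset.h>
#include <sys/txg.h>

/*
 * Thread A assigns a transaction into the currently open TXG but is
 * slow to commit it (for example, blocked on another lock).  Until
 * dmu_tx_commit() runs, that TXG cannot quiesce and therefore cannot
 * sync.
 */
static void thread_a(objset_t *os, uint64_t object)
{
	dmu_tx_t *tx = dmu_tx_create(os);

	dmu_tx_hold_bonus(tx, object);
	(void) dmu_tx_assign(tx, TXG_WAIT);	/* sketch: ignore errors */

	/* ... stalls here while still holding the open TXG ... */

	dmu_tx_commit(tx);
}

/*
 * Thread B waits for that same TXG to reach disk and blocks in
 * txg_wait_synced() -> cv_wait_common() until thread A commits,
 * matching the OSS stack trace in the description.
 */
static void thread_b(objset_t *os, uint64_t txg)
{
	txg_wait_synced(dmu_objset_pool(os), txg);
}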
