[LU-10250] replay-single test_74: hang and timed out Created: 16/Nov/17 Updated: 18/Feb/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/96539eb0-cadb-11e7-8027-52540065bddc.

The sub-test test_74 failed with the following error:

Timeout occurred after 128 mins, last suite running was replay-single, restarting cluster to continue tests

Please provide additional information about the failure here.

Info required for matching: replay-single 74

On the client, there is a stack trace:

[ 4440.087054] INFO: task touch:18639 blocked for more than 120 seconds.
[ 4440.089288] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4440.091541] touch D 0000000000000000 0 18639 17432 0x00000080
[ 4440.093752] ffff880069033a40 0000000000000086 ffff880036a79fa0 ffff880069033fd8
[ 4440.096088] ffff880069033fd8 ffff880069033fd8 ffff880036a79fa0 ffff880066647650
[ 4440.098281] 7fffffffffffffff ffff880066647648 ffff880036a79fa0 0000000000000000
[ 4440.100517] Call Trace:
[ 4440.102304] [<ffffffff816a9589>] schedule+0x29/0x70
[ 4440.104209] [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
[ 4440.106187] [<ffffffffc09e3108>] ? ptlrpc_set_add_new_req+0xd8/0x150 [ptlrpc]
[ 4440.108288] [<ffffffffc0bc2bc0>] ? osc_io_ladvise_end+0x50/0x50 [osc]
[ 4440.110272] [<ffffffffc0a1324b>] ? ptlrpcd_add_req+0x22b/0x300 [ptlrpc]
[ 4440.112208] [<ffffffffc09f5e40>] ? lustre_swab_niobuf_remote+0x30/0x30 [ptlrpc]
[ 4440.114221] [<ffffffff816a993d>] wait_for_completion+0xfd/0x140
[ 4440.116026] [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
[ 4440.117835] [<ffffffffc0bc2d84>] osc_io_setattr_end+0xc4/0x180 [osc]
[ 4440.119639] [<ffffffffc0bc4381>] ? osc_io_setattr_start+0x471/0x6e0 [osc]
[ 4440.121517] [<ffffffffc0c15450>] ? lov_io_iter_fini_wrapper+0x50/0x50 [lov]
[ 4440.123344] [<ffffffffc08201ed>] cl_io_end+0x5d/0x150 [obdclass]
[ 4440.125122] [<ffffffffc0c1552b>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 4440.126825] [<ffffffffc0c15b75>] lov_io_call.isra.5+0x85/0x140 [lov]
[ 4440.128586] [<ffffffffc0c15c66>] lov_io_end+0x36/0xb0 [lov]
[ 4440.130210] [<ffffffffc08201ed>] cl_io_end+0x5d/0x150 [obdclass]
[ 4440.131901] [<ffffffffc082287f>] cl_io_loop+0x13f/0xc70 [obdclass]
[ 4440.133544] [<ffffffffc0cbb4d0>] cl_setattr_ost+0x240/0x3a0 [lustre]
[ 4440.135229] [<ffffffffc0c95b05>] ll_setattr_raw+0x1185/0x1290 [lustre]
[ 4440.136821] [<ffffffffc0c95c7c>] ll_setattr+0x6c/0xd0 [lustre]
[ 4440.138380] [<ffffffff8121ee71>] notify_change+0x2c1/0x420
[ 4440.139813] [<ffffffff81233b59>] utimes_common+0xd9/0x1c0
[ 4440.141276] [<ffffffff81233d7e>] do_utimes+0x13e/0x180
[ 4440.142646] [<ffffffff81233ec4>] SyS_utimensat+0x64/0xb0
[ 4440.144085] [<ffffffff816ac5c8>] ? page_fault+0x28/0x30
[ 4440.145450] [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[ 4560.146045] INFO: task touch:18639 blocked for more than 120 seconds.

That shows the client was waiting for a setattr to complete. On the OFD side:

[ 4163.936021] LNet: Service thread pid 15315 was inactive for 40.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 4163.941814] Pid: 15315, comm: ll_ost00_000
[ 4163.944340] Call Trace:
[ 4163.948946] [<ffffffff816a9589>] schedule+0x29/0x70
[ 4163.951469] [<ffffffffc07084d5>] cv_wait_common+0x125/0x150 [spl]
[ 4163.954051] [<ffffffff810b1920>] ? autoremove_wake_function+0x0/0x40
[ 4163.956663] [<ffffffffc0708515>] __cv_wait+0x15/0x20 [spl]
[ 4163.959223] [<ffffffffc085781f>] txg_wait_synced+0xef/0x140 [zfs]
[ 4163.961745] [<ffffffffc080cc65>] dmu_tx_wait+0x275/0x3c0 [zfs]
[ 4163.964242] [<ffffffffc080ce41>] dmu_tx_assign+0x91/0x490 [zfs]
[ 4163.966725] [<ffffffffc0ecbe1a>] ? tgt_txn_start_cb+0x1da/0x3b0 [ptlrpc]
[ 4163.969213] [<ffffffffc1098f37>] osd_trans_start+0xa7/0x3a0 [osd_zfs]
[ 4163.971617] [<ffffffffc11c1fab>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 4163.973952] [<ffffffffc11c4943>] ofd_attr_set+0x433/0xb00 [ofd]
[ 4163.976281] [<ffffffffc11afff3>] ofd_setattr_hdl+0x303/0x950 [ofd]
[ 4163.978596] [<ffffffffc0ed7c05>] tgt_request_handle+0x925/0x13b0 [ptlrpc]
[ 4163.980977] [<ffffffffc0e7be8e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[ 4163.983391] [<ffffffff810ba598>] ? __wake_up_common+0x58/0x90
[ 4163.985676] [<ffffffffc0e7f632>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 4163.987983] [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[ 4163.990182] [<ffffffff816a9000>] ? __schedule+0x350/0x8b0
[ 4163.992348] [<ffffffffc0e7eba0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[ 4163.994528] [<ffffffff810b099f>] kthread+0xcf/0xe0
[ 4163.996574] [<ffffffff810b08d0>] ? kthread+0x0/0xe0
[ 4163.998609] [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
[ 4164.000651] [<ffffffff810b08d0>] ? kthread+0x0/0xe0
[ 4164.004284] LustreError: dumping log to /tmp/lustre-log.1510820580.15315

The OST thread is waiting for a sync of the transaction that handled the setattr from the client, and it seems the transaction never completes. I guess this is the same issue |
| Comments |
| Comment by Andreas Dilger [ 28/Nov/17 ] |
|
The stack traces dumped on the client are because osc_io_setattr_end() is using wait_for_completion() directly. It would be better to use wait_for_completion_interruptible() so that the wait is interruptible in case the user wants to kill the process and get on with their day. That doesn't resolve the issue on the server, but the client may as well be fixed at the same time. |
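For illustration only, here is a minimal sketch of the kind of change suggested above, switching from wait_for_completion() to wait_for_completion_interruptible(). The structure and field names (setattr_args, sa_done, sa_rc) are hypothetical, not the actual osc code; only the two kernel completion APIs are real:

#include <linux/completion.h>
#include <linux/errno.h>

struct setattr_args {
	struct completion	sa_done;	/* completed by the RPC callback */
	int			sa_rc;		/* result of the setattr RPC */
};

static int setattr_wait(struct setattr_args *args)
{
	int rc;

	/* Current behaviour: an uninterruptible sleep, so a stuck server
	 * leaves the 'touch' process unkillable (as in the trace above). */
	/* wait_for_completion(&args->sa_done); */

	/* Interruptible variant: returns -ERESTARTSYS if a signal (e.g.
	 * the user killing the process) arrives before the completion fires. */
	rc = wait_for_completion_interruptible(&args->sa_done);
	if (rc != 0)
		return rc;	/* caller must unwind without the reply */

	return args->sa_rc;
}

One caveat with the interruptible wait is that the completion can still fire later, so the callback must not touch memory that the interrupted caller has already freed. |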
| Comment by Andreas Dilger [ 29/Aug/18 ] |
|
This seems to hit occasionally: an unexplained and unreported hang in ZFS dmu_tx_wait() (see the linked tickets). I don't know whether this is an artifact of our VM testing or a problem that also hits real hardware, but it is something to be aware of and keep an eye open for. |
| Comment by James A Simmons [ 03/Jul/19 ] |
|
This is a problem hit on real hardware. It took down our production file system. |
| Comment by Andreas Dilger [ 03/Jul/19 ] |
|
James, could you please file a separate ticket for your issue? It is of course fine to link it to this ticket, but a separate ticket allows tracking the problem with your production system better, and the problem you are seeing may be unrelated (being stuck in "dmu_tx_wait()" is basically the equivalent of being stuck in "start_this_handle()" for ldiskfs, so there are a hundred ways to get there). Also, I wasn't aware that you are running ZFS in production; I thought you had ldiskfs-based filesystems on Spider2? |
| Comment by James A Simmons [ 03/Jul/19 ] |
|
This is for our NOAA file system, which is also production. It is 2.12 LTS using ZFS as the back end. |
| Comment by Alex Zhuravlev [ 04/Jul/19 ] |
|
It would be helpful to get a full set of backtraces. dmu_tx_wait() can be waiting for a new TXG while the current TXG is pinned by another process. |
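For context, a hypothetical sketch of the pinning scenario described above. The DMU calls (dmu_tx_create(), dmu_tx_assign(), dmu_tx_commit(), txg_wait_synced()) are real ZFS interfaces, but the two threads and the surrounding logic are invented purely for illustration; in the trace above txg_wait_synced() is reached via dmu_tx_wait(), but the blocking relationship is the same:

#include <sys/dmu.h>
#include <sys/txg.h>
#include <sys/dsl_pool.h>

/* Thread A: assigning a tx pins the currently open TXG until commit. */
static void thread_a(objset_t *os, uint64_t object)
{
	dmu_tx_t *tx = dmu_tx_create(os);

	dmu_tx_hold_bonus(tx, object);
	if (dmu_tx_assign(tx, TXG_WAIT) != 0) {
		dmu_tx_abort(tx);
		return;
	}

	/*
	 * If this thread blocks here (on a lock, an allocation, ...) before
	 * committing, the open TXG can never close and sync.
	 */

	dmu_tx_commit(tx);
}

/* Thread B: waits for that TXG to reach disk, like ll_ost00_000 above. */
static void thread_b(objset_t *os, uint64_t txg)
{
	/* Sleeps in cv_wait() inside txg_wait_synced() until thread A
	 * commits and the sync thread finishes that TXG. */
	txg_wait_synced(dmu_objset_pool(os), txg);
}

A full set of backtraces (for example from a crash dump or sysrq-t) would show whether some other thread is sitting between dmu_tx_assign() and dmu_tx_commit(), holding the TXG open while ll_ost00_000 waits in txg_wait_synced(). |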