[LU-11235] sanity: test_230d timeout on ZFS backends Created: 13/Aug/18  Updated: 09/Dec/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: always_except, zfs

Issue Links:
Duplicate
duplicates LU-8601 sanity test_230d: Timeout on ZFS back... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Lai Siyao <lai.siyao@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b3a1452a-9c82-11e8-8ee3-52540065bddc

The MDS backtrace https://testing.whamcloud.com/test_logs/b50461b8-9c82-11e8-8ee3-52540065bddc/show_text shows it is stuck in txg_wait_synced().

[10285.072535] ldlm_cn00_002   D ffff9825b67adee0     0 32218      2 0x00000080
[10285.073583] Call Trace:
[10285.073963]  [<ffffffff8b914029>] schedule+0x29/0x70
[10285.074659]  [<ffffffffc05da205>] cv_wait_common+0x125/0x150 [spl]
[10285.075360]  [<ffffffff8b2bc610>] ? wake_up_atomic_t+0x30/0x30
[10285.076201]  [<ffffffffc05da245>] __cv_wait+0x15/0x20 [spl]
[10285.076939]  [<ffffffffc0704bdf>] txg_wait_synced+0xef/0x140 [zfs]
[10285.077663]  [<ffffffffc1348a43>] osd_sync+0xc3/0x140 [osd_zfs]
[10285.078507]  [<ffffffffc1163a5b>] tgt_sync+0x14b/0x270 [ptlrpc]
[10285.079287]  [<ffffffffc1165f32>] tgt_blocking_ast+0x302/0x630 [ptlrpc]
[10285.080043]  [<ffffffffc10b6e5a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
[10285.080787]  [<ffffffffc10b7156>] ldlm_lock_cancel+0x56/0x1f0 [ptlrpc]
[10285.081492]  [<ffffffffc10dc06b>] ldlm_request_cancel+0x18b/0x730 [ptlrpc]
[10285.082266]  [<ffffffffc10df5ea>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
[10285.082964]  [<ffffffffc10df8d8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[10285.083688]  [<ffffffffc111040b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[10285.084449]  [<ffffffff8b2c52ab>] ? __wake_up_common+0x5b/0x90
[10285.085095]  [<ffffffffc1113c44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[10285.085752]  [<ffffffffc1113130>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[10285.086464]  [<ffffffff8b2bb621>] kthread+0xd1/0xe0
[10285.087009]  [<ffffffff8b2bb550>] ? insert_kthread_work+0x40/0x40
[10285.087626]  [<ffffffff8b9205f7>] ret_from_fork_nospec_begin+0x21/0x21
[10285.088259]  [<ffffffff8b2bb550>] ? insert_kthread_work+0x40/0x40


 Comments   
Comment by Lai Siyao [ 13/Aug/18 ]

Alex, do you have any clue?

Comment by Alex Zhuravlev [ 13/Aug/18 ]

Lai, in many other cases that was just a sign of large (and usually slow) I/O.

Comment by Lai Siyao [ 13/Aug/18 ]

The test log shows it was stuck there for over 170 minutes.

Comment by Andreas Dilger [ 17/Aug/18 ]

Given that this test is "check migrate big directory", it seems probable that this relates to the patches themselves. The txg_wait_synced() call is the equivalent of being stuck in jbd2_journal_stop() - it is waiting for the transaction to commit, but something is keeping the transaction open for a long time. Normally ZFS will commit a TXG every 1s, so I suspect some kind of reference leak.
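
As a rough mental model only (a user-space sketch, not actual ZFS code; the names, the 1s interval, and the pthread details are all illustrative), the TXG machinery behaves like a condition variable keyed on the last-synced TXG number, so anything that keeps the sync thread from completing a TXG keeps every txg_wait_synced() caller blocked:

/*
 * Toy user-space model of the semantics described above: a sync thread
 * commits TXGs in order and wakes waiters, while toy_txg_wait_synced()
 * blocks on a condition variable until its target TXG has been written
 * out.  If the sync thread stalls on I/O, every waiter stays blocked.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t tx_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t tx_sync_cv = PTHREAD_COND_INITIALIZER;
static uint64_t open_txg = 1;   /* TXG currently accepting new changes */
static uint64_t synced_txg;     /* newest TXG fully committed to disk */

/* Block until the given TXG (or anything newer) is on stable storage. */
static void toy_txg_wait_synced(uint64_t txg)
{
        pthread_mutex_lock(&tx_lock);
        while (synced_txg < txg)
                pthread_cond_wait(&tx_sync_cv, &tx_lock);
        pthread_mutex_unlock(&tx_lock);
}

/* Sync thread: commit one TXG per interval, like the ~1s ZFS default. */
static void *toy_txg_sync_thread(void *arg)
{
        (void)arg;
        for (;;) {
                sleep(1);               /* stand-in for writing the TXG out */
                pthread_mutex_lock(&tx_lock);
                synced_txg = open_txg;  /* old open TXG is now on disk */
                open_txg++;             /* new changes go into the next TXG */
                pthread_cond_broadcast(&tx_sync_cv);
                pthread_mutex_unlock(&tx_lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid;
        uint64_t target;

        pthread_create(&tid, NULL, toy_txg_sync_thread, NULL);

        /* Rough analogue of osd_sync(): wait for the currently open TXG. */
        pthread_mutex_lock(&tx_lock);
        target = open_txg;
        pthread_mutex_unlock(&tx_lock);

        printf("waiting for txg %llu\n", (unsigned long long)target);
        toy_txg_wait_synced(target);
        printf("txg %llu synced\n", (unsigned long long)target);
        return 0;
}

(Build with gcc -pthread; the only point is that a stalled sync thread leaves all waiters stuck in the while loop, which matches the backtraces below.)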

Comment by Andreas Dilger [ 17/Aug/18 ]

I see some other stuck threads; in particular, txg_sync is waiting on I/O to complete, and a second thread is also stuck in the same place (I'm not sure what the in:imjournal thread is for):

[10284.337042] in:imjournal    D ffff9825fafbdee0     0  1304      1 0x00000080
[10284.337732] Call Trace:
[10284.338475]  [<ffffffff8b914029>] schedule+0x29/0x70
[10284.338958]  [<ffffffff8b911999>] schedule_timeout+0x239/0x2c0
[10284.341130]  [<ffffffff8b91353d>] io_schedule_timeout+0xad/0x130
[10284.341696]  [<ffffffff8b9135d8>] io_schedule+0x18/0x20
[10284.342205]  [<ffffffff8b911fc1>] bit_wait_io+0x11/0x50
[10284.342703]  [<ffffffff8b911ae7>] __wait_on_bit+0x67/0x90
[10284.343789]  [<ffffffff8b3936d1>] wait_on_page_bit+0x81/0xa0
[10284.344939]  [<ffffffff8b3a4e7b>] truncate_inode_pages_range+0x42b/0x700
[10284.348657]  [<ffffffff8b3a51bf>] truncate_inode_pages_final+0x4f/0x60
[10284.349288]  [<ffffffffc047610f>] ext4_evict_inode+0x11f/0x4c0 [ext4]
[10284.349899]  [<ffffffff8b4387d4>] evict+0xb4/0x180
[10284.350372]  [<ffffffff8b4390dc>] iput+0xfc/0x190
[10284.350816]  [<ffffffff8b4339a0>] __dentry_kill+0x120/0x180
[10284.351361]  [<ffffffff8b433aa9>] dput+0xa9/0x160
[10284.351811]  [<ffffffff8b42db08>] SYSC_renameat2+0x518/0x5a0
[10284.355455]  [<ffffffff8b42ea0e>] SyS_renameat2+0xe/0x10
[10284.355977]  [<ffffffff8b42ea4e>] SyS_rename+0x1e/0x20

[10285.072535] ldlm_cn00_002   D ffff9825b67adee0     0 32218      2 0x00000080
[10285.073583] Call Trace:
[10285.073963]  [<ffffffff8b914029>] schedule+0x29/0x70
[10285.074659]  [<ffffffffc05da205>] cv_wait_common+0x125/0x150 [spl]
[10285.076201]  [<ffffffffc05da245>] __cv_wait+0x15/0x20 [spl]
[10285.076939]  [<ffffffffc0704bdf>] txg_wait_synced+0xef/0x140 [zfs]
[10285.077663]  [<ffffffffc1348a43>] osd_sync+0xc3/0x140 [osd_zfs]
[10285.078507]  [<ffffffffc1163a5b>] tgt_sync+0x14b/0x270 [ptlrpc]
[10285.079287]  [<ffffffffc1165f32>] tgt_blocking_ast+0x302/0x630 [ptlrpc]
[10285.080043]  [<ffffffffc10b6e5a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
[10285.080787]  [<ffffffffc10b7156>] ldlm_lock_cancel+0x56/0x1f0 [ptlrpc]
[10285.081492]  [<ffffffffc10dc06b>] ldlm_request_cancel+0x18b/0x730 [ptlrpc]
[10285.082266]  [<ffffffffc10df5ea>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
[10285.082964]  [<ffffffffc10df8d8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[10285.083688]  [<ffffffffc111040b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[10285.085095]  [<ffffffffc1113c44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[10285.086464]  [<ffffffff8b2bb621>] kthread+0xd1/0xe0

[10285.745845] txg_sync        D ffff9825d9792f70     0  1592      2 0x00000080
[10285.746553] Call Trace:
[10285.747373]  [<ffffffff8b914029>] schedule+0x29/0x70
[10285.747843]  [<ffffffff8b911999>] schedule_timeout+0x239/0x2c0
[10285.749545]  [<ffffffff8b91353d>] io_schedule_timeout+0xad/0x130
[10285.750737]  [<ffffffff8b9135d8>] io_schedule+0x18/0x20
[10285.751252]  [<ffffffffc05da192>] cv_wait_common+0xb2/0x150 [spl]
[10285.752390]  [<ffffffffc05da268>] __cv_wait_io+0x18/0x20 [spl]
[10285.752994]  [<ffffffffc075d023>] zio_wait+0x113/0x1c0 [zfs]
[10285.753560]  [<ffffffffc0711bc1>] vdev_config_sync+0xf1/0x180 [zfs]
[10285.754199]  [<ffffffffc06f1a9c>] spa_sync+0xa1c/0xd90 [zfs]
[10285.755372]  [<ffffffffc0705c41>] txg_sync_thread+0x301/0x510 [zfs]
[10285.756565]  [<ffffffffc05d5013>] thread_generic_wrapper+0x73/0x80 [spl]
[10285.757786]  [<ffffffff8b2bb621>] kthread+0xd1/0xe0

There is a second txg_sync thread that doesn't appear to be doing anything; I'm not sure if that is for a different MDT zpool:

[10285.004345] txg_sync        S ffff9825e40b9fa0     0 31971      2 0x00000080
[10285.005189] Call Trace:
[10285.005442]  [<ffffffff8b914029>] schedule+0x29/0x70
[10285.005912]  [<ffffffff8b9118d4>] schedule_timeout+0x174/0x2c0
[10285.007064]  [<ffffffffc05da362>] __cv_timedwait_common+0xd2/0x190 [spl]
[10285.008255]  [<ffffffffc05da476>] __cv_timedwait_sig+0x16/0x20 [spl]
[10285.008890]  [<ffffffffc0705add>] txg_sync_thread+0x19d/0x510 [zfs]
[10285.010105]  [<ffffffffc05d5013>] thread_generic_wrapper+0x73/0x80 [spl]
[10285.011345]  [<ffffffff8b2bb621>] kthread+0xd1/0xe0

This could conceivably be stuck waiting for I/O from the VM host to complete; I'm not sure yet, so I've resubmitted this test session to re-run and see whether the problem hits again. I wouldn't want to just accept a 1/2 failure rate for landing, but it is possible this is a freak failure. Resubmitting the patch with several lines of Test-Parameters: mdtbackfstype=zfs ostbackfstype=zfs mdscount=2 mdtcount=2 testlist=sanity,sanity,sanity,sanity would tell us whether this was a freak accident or a regular problem.
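
For reference, that resubmission would just mean repeating the trailer below in the patch's commit message, once per extra test session (assuming the usual Gerrit commit-message trailer form; parameters exactly as listed above):

Test-Parameters: mdtbackfstype=zfs ostbackfstype=zfs mdscount=2 mdtcount=2 testlist=sanity,sanity,sanity,sanity
Test-Parameters: mdtbackfstype=zfs ostbackfstype=zfs mdscount=2 mdtcount=2 testlist=sanity,sanity,sanity,sanity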

Comment by Lai Siyao [ 11/Sep/18 ]

This may be an old issue, because the same symptom also appears on 2.10; see LDEV-309 and LU-8601.

I've enabled full debug and checked the logs; each time it's stuck in osd_sync() --> txg_wait_synced(). With the LU-4684 patches this happens more often because the new directory migration implementation triggers more lock revokes, which trigger Sync-on-Lock-Cancel more often. Per Alex's suggestion in LU-8601, I'd suggest disabling this test for ZFS systems, otherwise the patch for LU-4684 can't pass this test.

Andreas, what's your opinion?

Comment by Andreas Dilger [ 13/Sep/18 ]

I guess the important question is not whether the test is failing, but whether the test represents something that will happen in the real world when directory restripe is being used. If yes, then it doesn't help us to disable the test, because it just means that users will get similar hangs on their production systems.

Comment by Lai Siyao [ 13/Sep/18 ]

I agree. There are two possible issues with ZFS that need investigation:
1. Transactions in sync mode may not work.
2. txg_wait_synced() may hang if called frequently.

I looked through the latest commits in the ZFS code, but didn't find a related fix. BTW, which version of ZFS is deployed in autotest?

Comment by Alex Zhuravlev [ 13/Sep/18 ]

Another theoretical possibility would be a missing I/O completion callback or some race in that area.

Comment by Peter Jones [ 13/Sep/18 ]

>  BTW which version of ZFS is deployed in autotest?

Will be 0.7.9 IIUC

Comment by Jian Yu [ 12/Nov/18 ]

Occurred again on Lustre b2_10 branch:
https://testing.whamcloud.com/test_sets/ff7e4cd0-e698-11e8-bfe1-52540065bddc

Comment by Andreas Dilger [ 09/Dec/20 ]

+1 on master review-dne-selinux-ssk:
https://testing.whamcloud.com/test_sets/dd21c09d-9d88-44e4-9e8e-2f7e93045034

There are no errors in any of the logs; it just looks like it is taking a long time?
