[LU-11235] sanity: test_230d timeout on ZFS backends Created: 13/Aug/18 Updated: 09/Dec/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | always_except, zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Lai Siyao <lai.siyao@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b3a1452a-9c82-11e8-8ee3-52540065bddc

The MDS backtrace https://testing.whamcloud.com/test_logs/b50461b8-9c82-11e8-8ee3-52540065bddc/show_text shows it is stuck in txg_wait_synced():

[10285.072535] ldlm_cn00_002 D ffff9825b67adee0 0 32218 2 0x00000080
[10285.073583] Call Trace:
[10285.073963] [<ffffffff8b914029>] schedule+0x29/0x70
[10285.074659] [<ffffffffc05da205>] cv_wait_common+0x125/0x150 [spl]
[10285.075360] [<ffffffff8b2bc610>] ? wake_up_atomic_t+0x30/0x30
[10285.076201] [<ffffffffc05da245>] __cv_wait+0x15/0x20 [spl]
[10285.076939] [<ffffffffc0704bdf>] txg_wait_synced+0xef/0x140 [zfs]
[10285.077663] [<ffffffffc1348a43>] osd_sync+0xc3/0x140 [osd_zfs]
[10285.078507] [<ffffffffc1163a5b>] tgt_sync+0x14b/0x270 [ptlrpc]
[10285.079287] [<ffffffffc1165f32>] tgt_blocking_ast+0x302/0x630 [ptlrpc]
[10285.080043] [<ffffffffc10b6e5a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
[10285.080787] [<ffffffffc10b7156>] ldlm_lock_cancel+0x56/0x1f0 [ptlrpc]
[10285.081492] [<ffffffffc10dc06b>] ldlm_request_cancel+0x18b/0x730 [ptlrpc]
[10285.082266] [<ffffffffc10df5ea>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
[10285.082964] [<ffffffffc10df8d8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[10285.083688] [<ffffffffc111040b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[10285.084449] [<ffffffff8b2c52ab>] ? __wake_up_common+0x5b/0x90
[10285.085095] [<ffffffffc1113c44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[10285.085752] [<ffffffffc1113130>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[10285.086464] [<ffffffff8b2bb621>] kthread+0xd1/0xe0
[10285.087009] [<ffffffff8b2bb550>] ? insert_kthread_work+0x40/0x40
[10285.087626] [<ffffffff8b9205f7>] ret_from_fork_nospec_begin+0x21/0x21
[10285.088259] [<ffffffff8b2bb550>] ? insert_kthread_work+0x40/0x40 |
| Comments |
| Comment by Lai Siyao [ 13/Aug/18 ] |
|
Alex, do you have any clue? |
| Comment by Alex Zhuravlev [ 13/Aug/18 ] |
|
Lai, in many other cases that was just a sign of large (and usually slow) I/O |
| Comment by Lai Siyao [ 13/Aug/18 ] |
|
The test log shows it was stuck there for over 170 minutes. |
| Comment by Andreas Dilger [ 17/Aug/18 ] |
|
Given that this test is "check migrate big directory" it seems probable that this relates to the patches themselves. The txg_wait_synced() call is the equivalent of being stuck in jbd2_journal_stop() - this is waiting for the transaction to commit, but something is keeping the transaction open for a long time. Normally ZFS will commit a TXG every 1s, so I suspect some kind of reference leak. |
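As a side note on the "TXG every 1s" point above: a minimal way to check whether TXGs are actually being committed on the MDS would be to sample the per-pool TXG history kstat. This is only a sketch; the pool name lustre-mdt1 is an assumption, and the txgs kstat is only populated when the zfs_txg_history module parameter is non-zero.

    # enable TXG history (keep the last 100 TXG records; assumes ZFS on Linux)
    echo 100 > /sys/module/zfs/parameters/zfs_txg_history
    # sample the newest TXGs twice, a few seconds apart (pool name is hypothetical)
    tail -5 /proc/spl/kstat/zfs/lustre-mdt1/txgs
    sleep 5
    tail -5 /proc/spl/kstat/zfs/lustre-mdt1/txgs

If the newest TXG number does not advance between samples, something is keeping the current transaction group open (or its I/O is not completing), which would match the txg_wait_synced() hang in the backtrace above.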
| Comment by Andreas Dilger [ 17/Aug/18 ] |
|
I see some other stuck threads; in particular, txg_sync is waiting on IO to complete, and a second thread is also stuck in the same place (I'm not sure what the in:imjournal thread is for):

[10284.337042] in:imjournal D ffff9825fafbdee0 0 1304 1 0x00000080
[10284.337732] Call Trace:
[10284.338475] [<ffffffff8b914029>] schedule+0x29/0x70
[10284.338958] [<ffffffff8b911999>] schedule_timeout+0x239/0x2c0
[10284.341130] [<ffffffff8b91353d>] io_schedule_timeout+0xad/0x130
[10284.341696] [<ffffffff8b9135d8>] io_schedule+0x18/0x20
[10284.342205] [<ffffffff8b911fc1>] bit_wait_io+0x11/0x50
[10284.342703] [<ffffffff8b911ae7>] __wait_on_bit+0x67/0x90
[10284.343789] [<ffffffff8b3936d1>] wait_on_page_bit+0x81/0xa0
[10284.344939] [<ffffffff8b3a4e7b>] truncate_inode_pages_range+0x42b/0x700
[10284.348657] [<ffffffff8b3a51bf>] truncate_inode_pages_final+0x4f/0x60
[10284.349288] [<ffffffffc047610f>] ext4_evict_inode+0x11f/0x4c0 [ext4]
[10284.349899] [<ffffffff8b4387d4>] evict+0xb4/0x180
[10284.350372] [<ffffffff8b4390dc>] iput+0xfc/0x190
[10284.350816] [<ffffffff8b4339a0>] __dentry_kill+0x120/0x180
[10284.351361] [<ffffffff8b433aa9>] dput+0xa9/0x160
[10284.351811] [<ffffffff8b42db08>] SYSC_renameat2+0x518/0x5a0
[10284.355455] [<ffffffff8b42ea0e>] SyS_renameat2+0xe/0x10
[10284.355977] [<ffffffff8b42ea4e>] SyS_rename+0x1e/0x20

[10285.072535] ldlm_cn00_002 D ffff9825b67adee0 0 32218 2 0x00000080
[10285.073583] Call Trace:
[10285.073963] [<ffffffff8b914029>] schedule+0x29/0x70
[10285.074659] [<ffffffffc05da205>] cv_wait_common+0x125/0x150 [spl]
[10285.076201] [<ffffffffc05da245>] __cv_wait+0x15/0x20 [spl]
[10285.076939] [<ffffffffc0704bdf>] txg_wait_synced+0xef/0x140 [zfs]
[10285.077663] [<ffffffffc1348a43>] osd_sync+0xc3/0x140 [osd_zfs]
[10285.078507] [<ffffffffc1163a5b>] tgt_sync+0x14b/0x270 [ptlrpc]
[10285.079287] [<ffffffffc1165f32>] tgt_blocking_ast+0x302/0x630 [ptlrpc]
[10285.080043] [<ffffffffc10b6e5a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
[10285.080787] [<ffffffffc10b7156>] ldlm_lock_cancel+0x56/0x1f0 [ptlrpc]
[10285.081492] [<ffffffffc10dc06b>] ldlm_request_cancel+0x18b/0x730 [ptlrpc]
[10285.082266] [<ffffffffc10df5ea>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
[10285.082964] [<ffffffffc10df8d8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[10285.083688] [<ffffffffc111040b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[10285.085095] [<ffffffffc1113c44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[10285.086464] [<ffffffff8b2bb621>] kthread+0xd1/0xe0

[10285.745845] txg_sync D ffff9825d9792f70 0 1592 2 0x00000080
[10285.746553] Call Trace:
[10285.747373] [<ffffffff8b914029>] schedule+0x29/0x70
[10285.747843] [<ffffffff8b911999>] schedule_timeout+0x239/0x2c0
[10285.749545] [<ffffffff8b91353d>] io_schedule_timeout+0xad/0x130
[10285.750737] [<ffffffff8b9135d8>] io_schedule+0x18/0x20
[10285.751252] [<ffffffffc05da192>] cv_wait_common+0xb2/0x150 [spl]
[10285.752390] [<ffffffffc05da268>] __cv_wait_io+0x18/0x20 [spl]
[10285.752994] [<ffffffffc075d023>] zio_wait+0x113/0x1c0 [zfs]
[10285.753560] [<ffffffffc0711bc1>] vdev_config_sync+0xf1/0x180 [zfs]
[10285.754199] [<ffffffffc06f1a9c>] spa_sync+0xa1c/0xd90 [zfs]
[10285.755372] [<ffffffffc0705c41>] txg_sync_thread+0x301/0x510 [zfs]
[10285.756565] [<ffffffffc05d5013>] thread_generic_wrapper+0x73/0x80 [spl]
[10285.757786] [<ffffffff8b2bb621>] kthread+0xd1/0xe0

There is a second txg_sync thread that doesn't appear to be doing anything; I'm not sure if that is for a different MDT zpool:

[10285.004345] txg_sync S ffff9825e40b9fa0 0 31971 2 0x00000080
[10285.005189] Call Trace:
[10285.005442] [<ffffffff8b914029>] schedule+0x29/0x70
[10285.005912] [<ffffffff8b9118d4>] schedule_timeout+0x174/0x2c0
[10285.007064] [<ffffffffc05da362>] __cv_timedwait_common+0xd2/0x190 [spl]
[10285.008255] [<ffffffffc05da476>] __cv_timedwait_sig+0x16/0x20 [spl]
[10285.008890] [<ffffffffc0705add>] txg_sync_thread+0x19d/0x510 [zfs]
[10285.010105] [<ffffffffc05d5013>] thread_generic_wrapper+0x73/0x80 [spl]
[10285.011345] [<ffffffff8b2bb621>] kthread+0xd1/0xe0

This could conceivably be stuck waiting for IO from the VM host to complete. I'm not sure yet, so I've resubmitted this test session to see whether the problem hits again. I wouldn't want to just accept a 1/2 failure rate for landing, but it is possible this is a freak failure. Resubmitting the patch with several lines of

Test-Parameters: mdtbackfstype=zfs ostbackfstype=zfs mdscount=2 mdtcount=2 testlist=sanity,sanity,sanity,sanity

would tell us whether this was a freak accident or a regular problem. |
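To make the resubmission suggestion above concrete: as I understand autotest, each Test-Parameters: line in the Gerrit commit message is handled as its own test session, so a test-only patch could carry a footer like the following (the number of repeated lines is just an illustration):

    Test-Parameters: mdtbackfstype=zfs ostbackfstype=zfs mdscount=2 mdtcount=2 testlist=sanity,sanity,sanity,sanity
    Test-Parameters: mdtbackfstype=zfs ostbackfstype=zfs mdscount=2 mdtcount=2 testlist=sanity,sanity,sanity,sanity
    Test-Parameters: mdtbackfstype=zfs ostbackfstype=zfs mdscount=2 mdtcount=2 testlist=sanity,sanity,sanity,sanity

That would run sanity (and therefore test_230d) many times against a 2-MDT ZFS configuration and give a rough failure rate.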
| Comment by Lai Siyao [ 11/Sep/18 ] |
|
This may be an old issue, because the same symptom appears on 2.10 also (see LDEV-309). I've enabled full debug and checked the logs, and each time it's stuck in osd_sync() --> txg_wait_synced(). Andreas, what's your opinion? |
| Comment by Andreas Dilger [ 13/Sep/18 ] |
|
I guess the important question is not whether the test is failing, but whether the test represents something that will happen in the real world when directory restripe is being used. If yes, then it doesn't help us to disable the test, because it just means that users will get similar hangs on their production systems. |
| Comment by Lai Siyao [ 13/Sep/18 ] |
|
I agree. There are two possible issues in ZFS that need investigation. I looked through the latest commits in the ZFS code but didn't find a related fix. BTW, which version of ZFS is deployed in autotest? |
| Comment by Alex Zhuravlev [ 13/Sep/18 ] |
|
Another theoretical possibility would be a missing I/O completion callback, or some race in that area. |
| Comment by Peter Jones [ 13/Sep/18 ] |
|
> BTW which version of ZFS is deployed in autotest?
It will be 0.7.9, IIUC. |
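If that needs double-checking, the deployed version can also be read directly off a test node. A couple of hedged examples (assuming ZFS on Linux with the module loaded, and RPM-based nodes for the second command):

    # version string of the loaded zfs kernel module
    cat /sys/module/zfs/version
    # or list the installed ZFS/SPL packages
    rpm -qa | grep -E '^(zfs|libzfs|spl)'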
| Comment by Jian Yu [ 12/Nov/18 ] |
|
Occurred again on Lustre b2_10 branch: |
| Comment by Andreas Dilger [ 09/Dec/20 ] |
|
+1 on master review-dne-selinux-ssk: there are no errors in any logs; it just looks like it is taking a long time? |