[LU-9247] replay-ost-single test_5: test failed to respond and timed out Created: 23/Mar/17 Updated: 16/Apr/20 Resolved: 16/Apr/20 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0, Lustre 2.10.1, Lustre 2.11.0, Lustre 2.10.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James Casper | Assignee: | Alex Zhuravlev |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Environment: | onyx-32vm1-8, Full Group test |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
https://testing.hpdd.intel.com/test_sessions/afc7f4b0-0af4-11e7-8c9f-5254006e85c2

It appears that zfs was hung and caused this timeout. Here are a couple of indications of this:

test_log:
Starting ost1: lustre-ost1/ost1 /mnt/lustre-ost1
CMD: onyx-32vm8 mkdir -p /mnt/lustre-ost1; mount -t lustre lustre-ost1/ost1 /mnt/lustre-ost1
onyx-32vm8: e2label: No such file or directory while trying to open lustre-ost1/ost1
onyx-32vm8: Couldn't find valid filesystem superblock.

OST console:
10:35:06:[31399.498089] txg_sync D 0000000000000001 0 27626 2 0x00000080
10:35:06:[31399.498090] ffff880049607ac0 0000000000000046 ffff88003d98edd0 ffff880049607fd8
10:35:06:[31399.498091] ffff880049607fd8 ffff880049607fd8 ffff88003d98edd0 ffff88007fc16c40
10:35:06:[31399.498092] 0000000000000000 7fffffffffffffff ffff88005ac587a8 0000000000000001
10:35:06:[31399.498092] Call Trace:
10:35:06:[31399.498093] [<ffffffff8168bac9>] schedule+0x29/0x70
10:35:06:[31399.498095] [<ffffffff81689519>] schedule_timeout+0x239/0x2d0
10:35:06:[31399.498096] [<ffffffff810c4fe2>] ? default_wake_function+0x12/0x20
10:35:06:[31399.498098] [<ffffffff810ba238>] ? __wake_up_common+0x58/0x90
10:35:06:[31399.498101] [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
10:35:06:[31399.498103] [<ffffffff8168b06e>] io_schedule_timeout+0xae/0x130
10:35:06:[31399.498104] [<ffffffff810b1416>] ? prepare_to_wait_exclusive+0x56/0x90
10:35:06:[31399.498106] [<ffffffff8168b108>] io_schedule+0x18/0x20
10:35:06:[31399.498109] [<ffffffffa0677617>] cv_wait_common+0xa7/0x130 [spl]
10:35:06:[31399.498111] [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
10:35:06:[31399.498114] [<ffffffffa06776f8>] __cv_wait_io+0x18/0x20 [spl]
10:35:06:[31399.498150] [<ffffffffa07d151b>] zio_wait+0x10b/0x1f0 [zfs]
10:35:06:[31399.498169] [<ffffffffa075acdf>] dsl_pool_sync+0xbf/0x440 [zfs]
10:35:06:[31399.498187] [<ffffffffa0775868>] spa_sync+0x388/0xb50 [zfs]
10:35:06:[31399.498189] [<ffffffff810b174b>] ? autoremove_wake_function+0x2b/0x40
10:35:06:[31399.498191] [<ffffffff81689c72>] ? mutex_lock+0x12/0x2f
10:35:06:[31399.498208] [<ffffffffa07874e5>] txg_sync_thread+0x3c5/0x620 [zfs]
10:35:06:[31399.498226] [<ffffffffa0787120>] ? txg_init+0x280/0x280 [zfs]
10:35:06:[31399.498229] [<ffffffffa0672851>] thread_generic_wrapper+0x71/0x80 [spl]
10:35:06:[31399.498232] [<ffffffffa06727e0>] ? __thread_exit+0x20/0x20 [spl]
10:35:06:[31399.498234] [<ffffffff810b064f>] kthread+0xcf/0xe0
10:35:06:[31399.498235] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
10:35:06:[31399.498237] [<ffffffff81696958>] ret_from_fork+0x58/0x90
10:35:06:[31399.498239] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
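As a side note (not from the original report): when a txg_sync hang like the one above is suspected on a live OSS, a few standard checks can confirm the thread is genuinely blocked rather than just slow. This is a hedged diagnostic sketch; the node and pool names (onyx-32vm8, lustre-ost1) are assumptions taken from the log above.

```
# Hedged diagnostic sketch, not part of the test suite: confirm that txg_sync
# is blocked on the OSS. Node and pool names are taken from the log above.

# The hung-task watchdog prints the same trace the console captured
ssh onyx-32vm8 "dmesg | grep -B1 -A5 'blocked for more than'"

# Dump kernel stacks of all uninterruptible tasks for comparison
ssh onyx-32vm8 "echo w > /proc/sysrq-trigger; dmesg | tail -n 100"

# Quick look at pool health before digging further
ssh onyx-32vm8 "zpool status -v lustre-ost1"
```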
| Comments |
| Comment by Sarah Liu [ 27/Mar/17 ] |
|
not sure if this is a dup of |
| Comment by Sarah Liu [ 27/Mar/17 ] |
|
Another similar failure seen on sanity-quota on the same run.

OSS dmesg:
[ 6136.708017] txg_sync D 0000000000000001 0 29681 2 0x00000080
[ 6136.709697] ffff880058e87ac0 0000000000000046 ffff88005f6f5e20 ffff880058e87fd8
[ 6136.711425] ffff880058e87fd8 ffff880058e87fd8 ffff88005f6f5e20 ffff88007fc16c40
[ 6136.713142] 0000000000000000 7fffffffffffffff ffff880061d92ec0 0000000000000001
[ 6136.714847] Call Trace:
[ 6136.716104] [<ffffffff8168bac9>] schedule+0x29/0x70
[ 6136.717564] [<ffffffff81689519>] schedule_timeout+0x239/0x2d0
[ 6136.719086] [<ffffffff810c4fe2>] ? default_wake_function+0x12/0x20
[ 6136.720623] [<ffffffff810ba238>] ? __wake_up_common+0x58/0x90
[ 6136.722131] [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
[ 6136.723666] [<ffffffff8168b06e>] io_schedule_timeout+0xae/0x130
[ 6136.725197] [<ffffffff810b1416>] ? prepare_to_wait_exclusive+0x56/0x90
[ 6136.726767] [<ffffffff8168b108>] io_schedule+0x18/0x20
[ 6136.728231] [<ffffffffa0670617>] cv_wait_common+0xa7/0x130 [spl]
[ 6136.729775] [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
[ 6136.731277] [<ffffffffa06706f8>] __cv_wait_io+0x18/0x20 [spl]
[ 6136.732815] [<ffffffffa07ca51b>] zio_wait+0x10b/0x1f0 [zfs]
[ 6136.734304] [<ffffffffa0753edc>] dsl_pool_sync+0x2bc/0x440 [zfs]
[ 6136.735829] [<ffffffffa076e868>] spa_sync+0x388/0xb50 [zfs]
[ 6136.737299] [<ffffffff810b174b>] ? autoremove_wake_function+0x2b/0x40
[ 6136.738871] [<ffffffff81689c72>] ? mutex_lock+0x12/0x2f
[ 6136.740337] [<ffffffffa07804e5>] txg_sync_thread+0x3c5/0x620 [zfs]
[ 6136.741859] [<ffffffffa0780120>] ? txg_init+0x280/0x280 [zfs]
[ 6136.743323] [<ffffffffa066b851>] thread_generic_wrapper+0x71/0x80 [spl]
[ 6136.744868] [<ffffffffa066b7e0>] ? __thread_exit+0x20/0x20 [spl]
[ 6136.746368] [<ffffffff810b064f>] kthread+0xcf/0xe0
[ 6136.747868] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[ 6136.749474] [<ffffffff81696958>] ret_from_fork+0x58/0x90
[ 6136.750975] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[ 6136.752560] inconsistency_v S 0000000000000000 0 29844 2 0x00000080
[ 6136.754234] ffff8800696f3d98 0000000000000046 ffff88005f6f6dd0 ffff8800696f3fd8
[ 6136.755996] ffff8800696f3fd8 ffff8800696f3fd8 ffff88005f6f6dd0 ffff88005c176b40
| Comment by Sarah Liu [ 27/Mar/17 ] |
|
More instances. Oleg suggested this issue be marked as |
| Comment by Sarah Liu [ 13/Apr/17 ] |
|
Keep seeing the same error in sanity-quota master ZFS testing; the error looks like https://testing.hpdd.intel.com/test_sets/5d38651e-1dae-11e7-9de9-5254006e85c2 |
| Comment by Peter Jones [ 20/Apr/17 ] |
|
Alex, does this appear to be a duplicate of the linked ticket?

Peter |
| Comment by Alex Zhuravlev [ 24/Apr/17 ] |
|
Peter, I'm going through the logs; at the moment it doesn't look like |
| Comment by James Casper [ 17/May/17 ] |
|
This can also happen after an MDS failover.

https://testing.hpdd.intel.com/test_sessions/edde2a3e-9ae8-434a-8170-b64e9e85529c
full group, CentOS7/zfs/DNE

From MDS1/3 dmesg:
[66569.880864] txg_sync D 0000000000000001 0 19793 2 0x00000080
[66569.882578] ffff880052433ae0 0000000000000046 ffff880063c1bec0 ffff880052433fd8
[66569.884423] ffff880052433fd8 ffff880052433fd8 ffff880063c1bec0 ffff88007fc16c40
[66569.886169] 0000000000000000 7fffffffffffffff ffff880067d20c00 0000000000000001
[66569.887898] Call Trace:
[66569.889172] [<ffffffff8168c3c9>] schedule+0x29/0x70
[66569.890664] [<ffffffff81689e29>] schedule_timeout+0x239/0x2c0
[66569.892192] [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
[66569.893743] [<ffffffff810eb08c>] ? ktime_get_ts64+0x4c/0xf0
[66569.895244] [<ffffffff8168b96e>] io_schedule_timeout+0xae/0x130
[66569.896773] [<ffffffff810b1816>] ? prepare_to_wait_exclusive+0x56/0x90
[66569.898348] [<ffffffff8168ba08>] io_schedule+0x18/0x20
[66569.899836] [<ffffffffa06ec617>] cv_wait_common+0xa7/0x130 [spl]
[66569.901382] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[66569.902986] [<ffffffffa06ec6f8>] __cv_wait_io+0x18/0x20 [spl]
[66569.904618] [<ffffffffa0fd1cc3>] zio_wait+0xc3/0x140 [zfs]
[66569.906167] [<ffffffffa0f5bfbf>] dsl_pool_sync+0xbf/0x440 [zfs]
[66569.907750] [<ffffffffa0f763e8>] spa_sync+0x388/0xb50 [zfs]
[66569.909258] [<ffffffff810b1b4b>] ? autoremove_wake_function+0x2b/0x40
[66569.910857] [<ffffffff8168a572>] ? mutex_lock+0x12/0x2f
[66569.912439] [<ffffffffa0f88065>] txg_sync_thread+0x3c5/0x620 [zfs]
[66569.914047] [<ffffffffa0f87ca0>] ? txg_init+0x280/0x280 [zfs]
[66569.915586] [<ffffffffa06e7851>] thread_generic_wrapper+0x71/0x80 [spl]
[66569.917185] [<ffffffffa06e77e0>] ? __thread_exit+0x20/0x20 [spl]
[66569.918744] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[66569.920183] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[66569.921777] [<ffffffff81697318>] ret_from_fork+0x58/0x90
[66569.923264] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
| Comment by Alex Zhuravlev [ 18/May/17 ] |
|
Please tell me the exact test in which you found the last trace. |
| Comment by James Casper [ 18/May/17 ] |
|
Sorry about that: replay-single, test_70c |
| Comment by Alex Zhuravlev [ 18/May/17 ] |
|
Hm, I think it's important to note that the stack itself seems to be valid - the syncing thread was waiting for I/O. |
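A hedged way to check whether the sync is genuinely stuck on device I/O (as the zio_wait frame suggests) rather than inside ZFS itself; the pool name below is the one from the logs, and the txgs kstat is only populated when the zfs_txg_history module parameter is non-zero.

```
# Sketch only, assuming ZFS-on-Linux kstats and the pool name from the logs.

# Per-txg history: shows whether any txg is still moving through
# open/quiescing/syncing states (requires zfs_txg_history > 0)
cat /proc/spl/kstat/zfs/lustre-ost1/txgs | tail

# Ongoing I/O against the pool's vdevs; no activity while txg_sync sits in
# zio_wait points at the device/virtio layer rather than at ZFS
zpool iostat -v lustre-ost1 1 5

# Any error or delay events ZFS has recorded
zpool events | tail
```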
| Comment by Alex Zhuravlev [ 24/Aug/17 ] |
|
First of all, in many of the linked reports the client got stuck in dd: it is trying to do sync IO, but notransno mode has been set on the OST, so the sync IO cannot commit. I'm trying to understand why it needs to be sync..

/* out of quota, try sync write */ |
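To make that fallback observable, one can watch the client's grant while dd is blocked. This is a hedged sketch (not from the ticket); the parameter names are the standard Lustre tunables on 2.x, and the fallback path referred to is the one quoted above.

```
# Hedged sketch: check whether the blocked dd has run out of grant and is
# therefore forced into the sync-write path while the OST is in notransno mode.

# On the client: grant currently held by each OSC (zero or tiny => sync writes)
lctl get_param osc.*.cur_grant_bytes

# On the OSS: total space the OST believes it has granted to clients
lctl get_param obdfilter.*.tot_granted
```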
| Comment by Gerrit Updater [ 29/Aug/17 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/28775 |
| Comment by Alex Zhuravlev [ 31/Aug/17 ] |
|
I think the reason is a missing grant. There is code in the test to get grants, but it seems to fail, so dd gets stuck. |
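For illustration only (the file names and sizes below are hypothetical, not the actual test code), the kind of grant warm-up the test is expected to perform, plus a check that it actually took effect, could look like this:

```
# Hypothetical warm-up sketch, not the actual test code: do a small fsync'd
# write so the OST grants cache space to this client before the main I/O.
dd if=/dev/zero of=/mnt/lustre/d0.grant-warmup bs=1M count=1 conv=fsync

# Verify the warm-up worked; if this stays at 0, the later buffered dd will
# fall back to sync writes and hang while the OST is held in notransno mode.
lctl get_param -n osc.*.cur_grant_bytes
```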
| Comment by James Casper [ 11/Oct/17 ] |
|
This ticket should not be used for ldiskfs. It was opened for OST consoles showing a hung txg_sync process, and I have only seen this process hang when ZFS is in the configuration. |
| Comment by James Nunez (Inactive) [ 19/Dec/17 ] |
|
Alex - has this issue been fixed, is it understood, or is it being addressed in another ticket? replay-ost-single test 5 still hangs frequently and there are several tickets open for this test; |
| Comment by Alex Zhuravlev [ 19/Dec/17 ] |
|
I've seen many different failures with symptoms very similar to those described in this ticket, so it's hard to say what has been fixed.. |
| Comment by James Nunez (Inactive) [ 15/Mar/18 ] |
|
I think we are seeing this issue again. Please see the following for logs: https://testing.hpdd.intel.com/test_sets/d2b2a44c-2746-11e8-9e0e-52540065bddc |