[LU-9247] replay-ost-single test_5: test failed to respond and timed out Created: 23/Mar/17  Updated: 16/Apr/20  Resolved: 16/Apr/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.10.1, Lustre 2.11.0, Lustre 2.10.7
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Casper Assignee: Alex Zhuravlev
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

onyx-32vm1-8, Full Group test,
RHEL7.3/zfs, branch master, v2.9.54, b3541


Issue Links:
Related
is related to LU-4950 sanity-benchmark test fsx hung: txg_s... Closed
is related to LU-8601 sanity test_230d: Timeout on ZFS back... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/afc7f4b0-0af4-11e7-8c9f-5254006e85c2

It appears that ZFS was hung and caused this timeout. Here are a couple of indications of this:

test_log:

Starting ost1: lustre-ost1/ost1 /mnt/lustre-ost1
CMD: onyx-32vm8 mkdir -p /mnt/lustre-ost1; mount -t lustre lustre-ost1/ost1 /mnt/lustre-ost1
onyx-32vm8: e2label: No such file or directory while trying to open lustre-ost1/ost1
onyx-32vm8: Couldn't find valid filesystem superblock.

OST console:

10:35:06:[31399.498089] txg_sync        D 0000000000000001     0 27626      2 0x00000080
10:35:06:[31399.498090]  ffff880049607ac0 0000000000000046 ffff88003d98edd0 ffff880049607fd8
10:35:06:[31399.498091]  ffff880049607fd8 ffff880049607fd8 ffff88003d98edd0 ffff88007fc16c40
10:35:06:[31399.498092]  0000000000000000 7fffffffffffffff ffff88005ac587a8 0000000000000001
10:35:06:[31399.498092] Call Trace:
10:35:06:[31399.498093]  [<ffffffff8168bac9>] schedule+0x29/0x70
10:35:06:[31399.498095]  [<ffffffff81689519>] schedule_timeout+0x239/0x2d0
10:35:06:[31399.498096]  [<ffffffff810c4fe2>] ? default_wake_function+0x12/0x20
10:35:06:[31399.498098]  [<ffffffff810ba238>] ? __wake_up_common+0x58/0x90
10:35:06:[31399.498101]  [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
10:35:06:[31399.498103]  [<ffffffff8168b06e>] io_schedule_timeout+0xae/0x130
10:35:06:[31399.498104]  [<ffffffff810b1416>] ? prepare_to_wait_exclusive+0x56/0x90
10:35:06:[31399.498106]  [<ffffffff8168b108>] io_schedule+0x18/0x20
10:35:06:[31399.498109]  [<ffffffffa0677617>] cv_wait_common+0xa7/0x130 [spl]
10:35:06:[31399.498111]  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
10:35:06:[31399.498114]  [<ffffffffa06776f8>] __cv_wait_io+0x18/0x20 [spl]
10:35:06:[31399.498150]  [<ffffffffa07d151b>] zio_wait+0x10b/0x1f0 [zfs]
10:35:06:[31399.498169]  [<ffffffffa075acdf>] dsl_pool_sync+0xbf/0x440 [zfs]
10:35:06:[31399.498187]  [<ffffffffa0775868>] spa_sync+0x388/0xb50 [zfs]
10:35:06:[31399.498189]  [<ffffffff810b174b>] ? autoremove_wake_function+0x2b/0x40
10:35:06:[31399.498191]  [<ffffffff81689c72>] ? mutex_lock+0x12/0x2f
10:35:06:[31399.498208]  [<ffffffffa07874e5>] txg_sync_thread+0x3c5/0x620 [zfs]
10:35:06:[31399.498226]  [<ffffffffa0787120>] ? txg_init+0x280/0x280 [zfs]
10:35:06:[31399.498229]  [<ffffffffa0672851>] thread_generic_wrapper+0x71/0x80 [spl]
10:35:06:[31399.498232]  [<ffffffffa06727e0>] ? __thread_exit+0x20/0x20 [spl]
10:35:06:[31399.498234]  [<ffffffff810b064f>] kthread+0xcf/0xe0
10:35:06:[31399.498235]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
10:35:06:[31399.498237]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
10:35:06:[31399.498239]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140


 Comments   
Comment by Sarah Liu [ 27/Mar/17 ]

Not sure if this is a duplicate of LU-4950.

Comment by Sarah Liu [ 27/Mar/17 ]

Another similar failure was seen in sanity-quota on the same run:
https://testing.hpdd.intel.com/test_sets/ddb7634c-0af4-11e7-8c9f-5254006e85c2

oss dmesg

[ 6136.708017] txg_sync        D 0000000000000001     0 29681      2 0x00000080
[ 6136.709697]  ffff880058e87ac0 0000000000000046 ffff88005f6f5e20 ffff880058e87fd8
[ 6136.711425]  ffff880058e87fd8 ffff880058e87fd8 ffff88005f6f5e20 ffff88007fc16c40
[ 6136.713142]  0000000000000000 7fffffffffffffff ffff880061d92ec0 0000000000000001
[ 6136.714847] Call Trace:
[ 6136.716104]  [<ffffffff8168bac9>] schedule+0x29/0x70
[ 6136.717564]  [<ffffffff81689519>] schedule_timeout+0x239/0x2d0
[ 6136.719086]  [<ffffffff810c4fe2>] ? default_wake_function+0x12/0x20
[ 6136.720623]  [<ffffffff810ba238>] ? __wake_up_common+0x58/0x90
[ 6136.722131]  [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
[ 6136.723666]  [<ffffffff8168b06e>] io_schedule_timeout+0xae/0x130
[ 6136.725197]  [<ffffffff810b1416>] ? prepare_to_wait_exclusive+0x56/0x90
[ 6136.726767]  [<ffffffff8168b108>] io_schedule+0x18/0x20
[ 6136.728231]  [<ffffffffa0670617>] cv_wait_common+0xa7/0x130 [spl]
[ 6136.729775]  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
[ 6136.731277]  [<ffffffffa06706f8>] __cv_wait_io+0x18/0x20 [spl]
[ 6136.732815]  [<ffffffffa07ca51b>] zio_wait+0x10b/0x1f0 [zfs]
[ 6136.734304]  [<ffffffffa0753edc>] dsl_pool_sync+0x2bc/0x440 [zfs]
[ 6136.735829]  [<ffffffffa076e868>] spa_sync+0x388/0xb50 [zfs]
[ 6136.737299]  [<ffffffff810b174b>] ? autoremove_wake_function+0x2b/0x40
[ 6136.738871]  [<ffffffff81689c72>] ? mutex_lock+0x12/0x2f
[ 6136.740337]  [<ffffffffa07804e5>] txg_sync_thread+0x3c5/0x620 [zfs]
[ 6136.741859]  [<ffffffffa0780120>] ? txg_init+0x280/0x280 [zfs]
[ 6136.743323]  [<ffffffffa066b851>] thread_generic_wrapper+0x71/0x80 [spl]
[ 6136.744868]  [<ffffffffa066b7e0>] ? __thread_exit+0x20/0x20 [spl]
[ 6136.746368]  [<ffffffff810b064f>] kthread+0xcf/0xe0
[ 6136.747868]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[ 6136.749474]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
[ 6136.750975]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
[ 6136.752560] inconsistency_v S 0000000000000000     0 29844      2 0x00000080
[ 6136.754234]  ffff8800696f3d98 0000000000000046 ffff88005f6f6dd0 ffff8800696f3fd8
[ 6136.755996]  ffff8800696f3fd8 ffff8800696f3fd8 ffff88005f6f6dd0 ffff88005c176b40
Comment by Sarah Liu [ 27/Mar/17 ]

Another instance. Oleg suggested this issue may be the same as LU-8601; are these two duplicates?
https://testing.hpdd.intel.com/test_sets/e463d0ae-0af4-11e7-8c9f-5254006e85c2

Comment by Sarah Liu [ 13/Apr/17 ]

We keep seeing the same error in sanity-quota testing on master with ZFS, and the error looks like LU-6812, which is marked as a duplicate of LU-4950.

https://testing.hpdd.intel.com/test_sets/5d38651e-1dae-11e7-9de9-5254006e85c2

Comment by Peter Jones [ 20/Apr/17 ]

Alex

Does this appear to be a duplicate of LU-4950?

Peter

Comment by Alex Zhuravlev [ 24/Apr/17 ]

Peter, I'm going through the logs; at the moment it doesn't look like LU-4950, where one thread was spinning in ZFS, failing to start a transaction (too many blocks to release from a huge file).

Comment by James Casper [ 17/May/17 ]

This can also happen after an MDS failover.

https://testing.hpdd.intel.com/test_sessions/edde2a3e-9ae8-434a-8170-b64e9e85529c

full group, CentOS7/zfs/DNE
master branch, v2.9.57, b3575

From MDS1/3 dmesg:

[66569.880864] txg_sync        D 0000000000000001     0 19793      2 0x00000080
[66569.882578]  ffff880052433ae0 0000000000000046 ffff880063c1bec0 ffff880052433fd8
[66569.884423]  ffff880052433fd8 ffff880052433fd8 ffff880063c1bec0 ffff88007fc16c40
[66569.886169]  0000000000000000 7fffffffffffffff ffff880067d20c00 0000000000000001
[66569.887898] Call Trace:
[66569.889172]  [<ffffffff8168c3c9>] schedule+0x29/0x70
[66569.890664]  [<ffffffff81689e29>] schedule_timeout+0x239/0x2c0
[66569.892192]  [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
[66569.893743]  [<ffffffff810eb08c>] ? ktime_get_ts64+0x4c/0xf0
[66569.895244]  [<ffffffff8168b96e>] io_schedule_timeout+0xae/0x130
[66569.896773]  [<ffffffff810b1816>] ? prepare_to_wait_exclusive+0x56/0x90
[66569.898348]  [<ffffffff8168ba08>] io_schedule+0x18/0x20
[66569.899836]  [<ffffffffa06ec617>] cv_wait_common+0xa7/0x130 [spl]
[66569.901382]  [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[66569.902986]  [<ffffffffa06ec6f8>] __cv_wait_io+0x18/0x20 [spl]
[66569.904618]  [<ffffffffa0fd1cc3>] zio_wait+0xc3/0x140 [zfs]
[66569.906167]  [<ffffffffa0f5bfbf>] dsl_pool_sync+0xbf/0x440 [zfs]
[66569.907750]  [<ffffffffa0f763e8>] spa_sync+0x388/0xb50 [zfs]
[66569.909258]  [<ffffffff810b1b4b>] ? autoremove_wake_function+0x2b/0x40
[66569.910857]  [<ffffffff8168a572>] ? mutex_lock+0x12/0x2f
[66569.912439]  [<ffffffffa0f88065>] txg_sync_thread+0x3c5/0x620 [zfs]
[66569.914047]  [<ffffffffa0f87ca0>] ? txg_init+0x280/0x280 [zfs]
[66569.915586]  [<ffffffffa06e7851>] thread_generic_wrapper+0x71/0x80 [spl]
[66569.917185]  [<ffffffffa06e77e0>] ? __thread_exit+0x20/0x20 [spl]
[66569.918744]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[66569.920183]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[66569.921777]  [<ffffffff81697318>] ret_from_fork+0x58/0x90
[66569.923264]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Comment by Alex Zhuravlev [ 18/May/17 ]

Please tell me the exact test in which you found the last trace.

Comment by James Casper [ 18/May/17 ]

Sorry about that: replay-single, test_70c

Comment by Alex Zhuravlev [ 18/May/17 ]

Hm, I think it's important to note that the stack itself seems to be valid: the syncing thread was waiting for I/O.
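
For reference, one quick way to confirm on the OSS that the sync txg really is stuck (a sketch only; the kstat path assumes a standard ZFS-on-Linux install, and the pool name lustre-ost1 is taken from the test log above):

# dump recent txg history for the OST pool; a txg that stays in the
# syncing state with no progress across repeated reads matches the
# txg_sync stack above
cat /proc/spl/kstat/zfs/lustre-ost1/txgs
# check for pool-level errors or stalled vdevs that would explain the hang
zpool status -v lustre-ost1
zpool events -v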

Comment by Alex Zhuravlev [ 24/Aug/17 ]

First of all, many reports marked with LU-9247 were runs with ldiskfs, so it's not ZFS-specific.

The client got stuck in dd:
23:19:31:[33123.740321] dd S 0000000000000000 0 13435 12942 0x00000080
23:19:31:[33123.748089] [<ffffffff8168b679>] schedule+0x29/0x70
23:19:31:[33123.749516] [<ffffffffa0827cfd>] cl_sync_io_wait+0x2ad/0x3c0 [obdclass]
23:19:31:[33123.751062] [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
23:19:31:[33123.752515] [<ffffffffa0827f48>] cl_io_submit_sync+0x138/0x1e0 [obdclass]
23:19:31:[33123.754107] [<ffffffffa0cb2620>] vvp_io_write_commit+0x640/0x8d0 [lustre]
23:19:31:[33123.755663] [<ffffffffa0cb2dfa>] vvp_io_write_start+0x54a/0x720 [lustre]
23:19:31:[33123.757211] [<ffffffffa0826f95>] cl_io_start+0x65/0x130 [obdclass]
23:19:31:[33123.758696] [<ffffffffa0829335>] cl_io_loop+0xa5/0x190 [obdclass]
23:19:31:[33123.760178] [<ffffffffa0c60cbf>] ll_file_io_generic+0x67f/0xb50 [lustre]
23:19:31:[33123.761706] [<ffffffffa0c6145d>] ll_file_aio_write+0x12d/0x1f0 [lustre]
23:19:31:[33123.763229] [<ffffffffa0c615ee>] ll_file_write+0xce/0x1e0 [lustre]
23:19:31:[33123.764716] [<ffffffff811fe12d>] vfs_write+0xbd/0x1e0

It's trying to do sync I/O, but notransno mode has been set on the OST, so the sync I/O can't commit.

I'm trying to understand why it needs to be sync...

/* out of quota, try sync write */
if (rc == -EDQUOT && !cl_io_is_mkwrite(io)) {
        rc = vvp_io_commit_sync(env, io, queue,
                                vio->u.write.vui_from,
                                vio->u.write.vui_to);

Comment by Gerrit Updater [ 29/Aug/17 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/28775
Subject: LU-9247 tests: additional debug information
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: efb1281e4a9f372725af689ecd3aa2002860c185

Comment by Alex Zhuravlev [ 31/Aug/17 ]

I think the reason is a missing grant. There is code in the test to get grants, but it seems to fail, so dd gets stuck.
It would be good to land the patch above, as I'm unable to reproduce the failure manually. The suggested changes just dump additional information about the current grant, so it'll be easier to verify the theory.
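
A minimal manual spot-check of the grant theory (a sketch only; these are the standard grant counters exposed through lctl, not the extra debugging added by the patch above):

# on the client: current grant held by each OSC device; if dd is stuck
# waiting for grant, the affected OSC would be expected to show ~0 here
lctl get_param osc.*.cur_grant_bytes
# on the OSS: how much grant the OST believes it has handed out
lctl get_param obdfilter.*.tot_granted obdfilter.*.tot_pending obdfilter.*.tot_dirty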

Comment by James Casper [ 11/Oct/17 ]

This ticket should not be used for ldiskfs. It was opened for OST consoles showing a hung txg_sync process. I have only seen this process hang when ZFS is in the configuration.

Comment by James Nunez (Inactive) [ 19/Dec/17 ]

Alex - Has this issue been fixed, is it understood, or is it being addressed in another ticket? replay-ost-single test 5 still hangs frequently and there are several tickets open for this test: LU-9273, LU-5214, LU-10348, and maybe others. Many of the dmesg and console logs for this failure don't have much information.

Comment by Alex Zhuravlev [ 19/Dec/17 ]

I've seen many different failures with symptoms very similar to those described in this ticket, so it's hard to say what has been fixed.

Comment by James Nunez (Inactive) [ 15/Mar/18 ]

I think we are seeing this issue again. Please see the following for logs:

https://testing.hpdd.intel.com/test_sets/d2b2a44c-2746-11e8-9e0e-52540065bddc
