[LU-11128] replay-single test timeout Created: 08/Jul/18  Updated: 28/May/19  Resolved: 02/Oct/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-11126 replay-single test 89: import.c:1668:... Resolved
is duplicated by LU-11183 sanity test 244 hangs with no informa... Resolved
Related
is related to LU-7236 connections on demand Resolved
is related to LU-11362 sanity test_156: timeout loop in ptlr... Open
is related to LU-11269 ptlrpc_set_add_req()) ASSERTION( req-... Resolved
is related to LU-11405 add a test for idle connection feature Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for bobijam <bobijam@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5c95f0b2-8186-11e8-b441-52540065bddc

test_115 failed with the following error:

Timeout occurred after 216 mins, last suite running was replay-single, restarting cluster to continue tests

The MDS dmesg keeps showing the following error messages across several tests, and the tests take far too long:

[ 2545.541360] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2545.571570] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.210@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2545.618732] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2545.618926] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2545.619112] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
...

Another hit occurred at https://testing.whamcloud.com/test_sets/08372d04-8188-11e8-97ff-52540065bddc

test_80c failed with 'Timeout occurred after 159 mins, last suite running was replay-single, restarting cluster to continue tests'

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
replay-single test_115 - Timeout occurred after 216 mins, last suite running was replay-single, restarting cluster to continue tests
 



 Comments   
Comment by Andreas Dilger [ 17/Jul/18 ]

There are a significant number of test timeouts in replay-single (60 in the past 40 weeks), but they are spread across a large number of different subtests, so they may be harder to notice. Running a subtest query for TIMEOUT failures while leaving the subtest field blank will show this.

Comment by Alex Zhuravlev [ 16/Aug/18 ]

I'm working on this as it seems to affect many patches I've been working on and, more importantly, it seems to be related to LU-7236 (idle connections can disconnect).

https://review.whamcloud.com/#/c/32980/ is a patch to reproduce/debug the issue. So far it looks like a late DISCONNECT reply resets a just-reinitiated connection, which then gets stuck.
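
The race described above can be sketched as follows: a DISCONNECT reply that arrives after the import has already started a new connection must not tear that new connection down. Tagging each connection attempt with a generation number and discarding stale replies illustrates the guard. This is a hypothetical model for illustration only; the struct and field names echo Lustre's obd_import conventions but are not the actual patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the import state; imp_generation is bumped
 * on every (re)connect attempt, so replies can be matched against the
 * attempt that issued them.  Illustrative only, not Lustre code. */
struct fake_import {
	int imp_generation;	/* bumped on every (re)connect */
	bool imp_connected;
};

/* Reply handler: act on a DISCONNECT reply only if it belongs to the
 * generation that sent the request; a late reply from an earlier
 * generation is ignored, so it cannot reset the new connection. */
bool handle_disconnect_reply(struct fake_import *imp, int reply_generation)
{
	if (reply_generation != imp->imp_generation)
		return false;		/* stale reply: ignore */
	imp->imp_connected = false;	/* current generation: tear down */
	return true;
}
```

With this guard, the late DISCONNECT reply from generation 1 is dropped while the reconnected generation 2 import stays up.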

 

Comment by Alex Zhuravlev [ 20/Aug/18 ]

https://review.whamcloud.com/#/c/32980 has passed many replay-single runs; I think it's ready for inspection and regular testing.

 

Comment by Andreas Dilger [ 13/Sep/18 ]

One question I had about this failure: how long is the idle connection timeout? It seems like we shouldn't be getting so many timeouts in the middle of actively running tests. Is there some correlation between the tests failing with this issue and the length of time the client is idle?

When you think about it, we don't want the connections to be dropping after only a few seconds of idle time, or we may get big reconnection storms if the system is still mostly in use, which will also hurt performance because of dropped grant and such.

Comment by Alex Zhuravlev [ 13/Sep/18 ]

20s by default

Comment by Andreas Dilger [ 13/Sep/18 ]

I wonder if 20s is too short by default, especially for large systems where there may be thousands of clients with nearly identical behaviour (e.g. active/idle at the same time, though possibly to different OSTs).

On the one hand, the 20s timeout is definitely good for finding issues with this code during testing, but I think the default should be longer (e.g. 60s or 300s), depending on how long it takes for a large number of clients to reconnect. We could still set a shorter time in the test-framework to ensure the code continues to be tested.

For testing purposes, it might also make sense to have an option (e.g. "lctl set_param osc.*.idle_timeout=debug" and "...=nodebug" or similar) to print a message to the console when the client disconnects (e.g. "testfs-OST0004: disconnect after 50s idle" and "testfs-OST0004: reconnect after 650s idle" or similar) so that we can help debug problems related to this feature. The console message should be enabled during testing.

I see in the code that idle_timeout has a maximum value of CONNECT_SWITCH_MAX = 50s, which seems a bit short to me. Is that because the OBD_PING RPCs will keep the connection alive if it is longer than this? What happens if ping_interval (default 25s) is shorter than idle_timeout? Is that why the default idle_timeout is 20s?

Comment by Gerrit Updater [ 14/Sep/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33168
Subject: LU-11128 ptlrpc: add debugging for idle connections
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 216c9c7cbcd38fa56ee2240c8a13066ad66b3f77

Comment by Alex Zhuravlev [ 14/Sep/18 ]

Andreas, I'm fine with changing the defaults, and yes, one of the reasons to keep it short is to hit the code more frequently.

A ping reply is not counted:

	if (lustre_msg_get_opc(req->rq_reqmsg) != OBD_PING)
		req->rq_import->imp_last_reply_time = ktime_get_real_seconds();

Then the idle check:

	if (now - imp->imp_last_reply_time < imp->imp_idle_timeout)
		return false;

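
The two fragments above can be combined into a tiny self-contained model showing why pings alone never keep a connection "busy": only non-PING replies refresh imp_last_reply_time, so even with ping_interval (25s) shorter than a longer idle_timeout, an otherwise-quiet import still goes idle. The field names mirror Lustre's, but this is a simplified illustration under those assumptions, not the real code.

```c
#include <assert.h>
#include <stdbool.h>

#define OBD_PING 400	/* ping opcode; value matches Lustre's headers */

/* Toy model of the import fields involved in the idle check. */
struct toy_import {
	long imp_last_reply_time;	/* seconds; last non-ping reply */
	long imp_idle_timeout;		/* seconds of idleness allowed */
};

/* Called when a reply arrives: record the time unless it is a ping,
 * mirroring the "ping reply is not counted" fragment above. */
void toy_after_reply(struct toy_import *imp, int opc, long now)
{
	if (opc != OBD_PING)
		imp->imp_last_reply_time = now;
}

/* Idle check: has the import been quiet for idle_timeout or longer? */
bool toy_import_is_idle(const struct toy_import *imp, long now)
{
	return now - imp->imp_last_reply_time >= imp->imp_idle_timeout;
}
```

Feeding this model a ping leaves imp_last_reply_time untouched, while any other reply resets the idle clock.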
Comment by Andreas Dilger [ 15/Sep/18 ]

In that case, can you submit a patch to increase the default idle_timeout value, and set it lower in the test config for testing?

Comment by Gerrit Updater [ 21/Sep/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33168/
Subject: LU-11128 ptlrpc: add debugging for idle connections
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0aa58d26f5df2b71a040ed6f0f419b925528b6ad

Comment by Gerrit Updater [ 02/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32980/
Subject: LU-11128 ptlrpc: new request vs disconnect race
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 93d20d171c20491a96e5e85d7442a002f300619d

Comment by Peter Jones [ 02/Oct/18 ]

Landed for 2.12

Generated at Sat Feb 10 02:41:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.