[LU-11128] replay-single test timeout Created: 08/Jul/18 Updated: 28/May/19 Resolved: 02/Oct/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for bobijam <bobijam@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5c95f0b2-8186-11e8-b441-52540065bddc

test_115 failed with the following error:

Timeout occurred after 216 mins, last suite running was replay-single, restarting cluster to continue tests

The MDS dmesg keeps showing the following error messages during several tests, and the tests take too much time:

[ 2545.541360] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2545.571570] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.210@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2545.618732] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2545.618926] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2545.619112] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
...

Another hit also happened at https://testing.whamcloud.com/test_sets/08372d04-8188-11e8-97ff-52540065bddc

test_80c: 'Timeout occurred after 159 mins, last suite running was replay-single, restarting cluster to continue tests'

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Andreas Dilger [ 17/Jul/18 ] |
|
There are a significant number of test timeouts on replay-single (60 in the past 40 weeks), but they are spread across a large number of different subtests, so they may be harder to notice. Doing a subtest query for TIMEOUT failures while leaving the subtest field blank will show this. |
| Comment by Alex Zhuravlev [ 16/Aug/18 ] |
|
I'm working on this as it seems to affect many patches I've been working on. https://review.whamcloud.com/#/c/32980/ is a patch to reproduce/debug the issue. So far it looks like a late DISCONNECT reply resets a just-reinitiated connection, which then gets stuck.
|
| Comment by Alex Zhuravlev [ 20/Aug/18 ] |
|
https://review.whamcloud.com/#/c/32980 has passed many replay-single runs; I think it's ready for inspection and regular testing.
|
| Comment by Andreas Dilger [ 13/Sep/18 ] |
|
One question I had about this failure: how short/long is the idle connection timeout? It seems like we shouldn't be getting so many timeouts in the middle of actively running tests. Is there some correlation between the tests failing with this issue and the length of time the client is idle?

When you think about it, we don't want the connections to be dropping after only a few seconds of idle time, or we may get big reconnection storms if the system is still mostly in use, which will also hurt performance because of dropped grant and such. |
| Comment by Alex Zhuravlev [ 13/Sep/18 ] |
|
20s by default |
| Comment by Andreas Dilger [ 13/Sep/18 ] |
|
I wonder if 20s is too short by default? Especially in the case of large systems where there may be thousands of clients that have nearly identical behaviour (e.g. active/idle at the same time, though possibly to different OSTs).

On the one hand, the 20s timeout is definitely good for finding issues with this code during testing, but I think the default should be longer (e.g. 60s or 300s) depending on how long it takes for a large number of clients to reconnect. We could still set a shorter time in the test-framework to ensure the code continues to be tested.

For testing purposes, it might also make sense to have an option (e.g. "lctl set_param osc.*.idle_timeout=debug" and "...=nodebug" or similar) to print a message to the console when the client disconnects (e.g. "testfs-OST0004: disconnect after 50s idle" and "testfs-OST0004: reconnect after 650s idle" or similar) so that we can help debug problems related to this feature. The console message should be enabled during testing.

I see in the code that idle_timeout has a maximum value of CONNECT_SWITCH_MAX = 50s, which seems a bit short to me? Is that because the OBD_PING RPCs will keep the connection alive if it is longer than this? What happens if ping_interval (default 25s) is shorter than idle_timeout? Is that why the default idle_timeout is 20s? |
| Comment by Gerrit Updater [ 14/Sep/18 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33168 |
| Comment by Alex Zhuravlev [ 14/Sep/18 ] |
|
Andreas, I'm fine to change the defaults, and yes, one of the reasons to keep it short is to hit the code more frequently. A ping reply is not counted:

    if (lustre_msg_get_opc(req->rq_reqmsg) != OBD_PING)
            req->rq_import->imp_last_reply_time = ktime_get_real_seconds();

then the check for idle:

    if (now - imp->imp_last_reply_time < imp->imp_idle_timeout)
            return false; |
| Comment by Andreas Dilger [ 15/Sep/18 ] |
|
In that case, can you submit a patch to increase the default idle_timeout value, and set it lower in the test config for testing? |
| Comment by Gerrit Updater [ 21/Sep/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33168/ |
| Comment by Gerrit Updater [ 02/Oct/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32980/ |
| Comment by Peter Jones [ 02/Oct/18 ] |
|
Landed for 2.12 |