[LU-4349] conf-sanity test_47: test failed to respond and timed out Created: 05/Dec/13 Updated: 11/Oct/14 Resolved: 18/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre-master build # 1791 RHEL6 zfs |
||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||||||
| Rank (Obsolete): | 11914 | ||||||||||||||||||||||||||||
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/1d5f9fd8-5c6a-11e3-9d08-52540035b04c. The sub-test test_47 failed with the following error:
Info required for matching: conf-sanity 47 This test took 3600s and failed with timeout, cannot find error logs. I checked https://maloo.whamcloud.com/sub_tests/b9351066-5d53-11e3-956b-52540035b04c |
| Comments |
| Comment by Sarah Liu [ 05/Dec/13 ] |
|
also seen this error in SLES11 SP3 client: https://maloo.whamcloud.com/test_sets/fb3c1e54-5de5-11e3-8f3c-52540035b04c |
| Comment by Andreas Dilger [ 09/Dec/13 ] |
|
This is also timing out on ldiskfs test runs: https://maloo.whamcloud.com/sub_tests/e2c36e9c-5eab-11e3-b298-52540035b04c |
| Comment by Andreas Dilger [ 09/Dec/13 ] |
|
This failure looks like it was introduced by http://review.whamcloud.com/4960 since it first appeared on 2013-11-26 on two tests for this patch (review-zfs and review-dne-zfs so it was not flagged as a failure) and then after that patch was landed on 2013-12-02 it started failing on other test runs starting 2013-12-03. It does fail more often on ZFS than ldiskfs, but maybe only because ZFS writes take longer? |
| Comment by Mikhail Pershin [ 16/Dec/13 ] |
|
Test failed due to recovery problems, for some reason the recovery timers grows up and never stops. I am not 100% sure why but I found problem with http://review.whamcloud.com/4960 that may be related. Patch is there http://review.whamcloud.com/8582 |
| Comment by nasf (Inactive) [ 27/Dec/13 ] |
|
Another failure instance: https://maloo.whamcloud.com/test_sets/b089e1d4-6e89-11e3-9de0-52540035b04c |
| Comment by Jinshan Xiong (Inactive) [ 27/Dec/13 ] |
|
failed at: https://maloo.whamcloud.com/test_sessions/187b6dee-6f07-11e3-ad93-52540035b04c |
| Comment by Andreas Dilger [ 29/Dec/13 ] |
|
Patch landed to master on 2013-12-27 19:13 so please verify any failures on tests after this time do not contain this patch. |
| Comment by Nathaniel Clark [ 03/Jan/14 ] |
|
This failed more recently than patch #8582 was landed: I had opened bug |
| Comment by Nathaniel Clark [ 06/Jan/14 ] |
|
Recovery timer goes around again, it seems to be stuck: 00010000:00080000:0.0:1388959703.814015:0:25116:0:(ldlm_lib.c:1010:target_handle_connect()) lustre-OST0000: connection from lustre-MDT0000-mdtlov_UUID@10.10.17.33@tcp recovering/t0 exp (null) cur 1388959703 last 0 00010000:02000400:0.0:1388959703.814020:0:25116:0:(ldlm_lib.c:1054:target_handle_connect()) lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp), waiting for all 2 known clients (0 recovered, 1 in progress, and 0 evicted) to recover in 0:03 ... 00010000:00080000:0.1:1388959707.256838:0:0:0:(ldlm_lib.c:2152:target_recovery_expired()) lustre-OST0000: recovery timed out; 1 clients are still in recovery after 2076s (1 clients connected) 00010000:02000400:0.0:1388959707.256873:0:24824:0:(ldlm_lib.c:1803:target_recovery_overseer()) lustre-OST0000: recovery is timed out, evict stale exports 00000020:00000001:0.0:1388959707.256876:0:24824:0:(genops.c:1354:class_disconnect_stale_exports()) Process entered 00000020:00000001:0.0:1388959707.256879:0:24824:0:(genops.c:1281:class_disconnect_export_list()) Process entered 00000020:00000001:0.0:1388959707.256880:0:24824:0:(genops.c:1319:class_disconnect_export_list()) Process leaving 00000020:00000001:0.0:1388959707.256881:0:24824:0:(genops.c:1394:class_disconnect_stale_exports()) Process leaving 00010000:00080000:0.0:1388959707.256883:0:24824:0:(ldlm_lib.c:1607:extend_recovery_timer()) lustre-OST0000: recovery timer will expire in 30 seconds ... 00010000:00080000:0.0:1388959708.813899:0:25116:0:(ldlm_lib.c:1010:target_handle_connect()) lustre-OST0000: connection from lustre-MDT0000-mdtlov_UUID@10.10.17.33@tcp recovering/t0 exp (null) cur 1388959708 last 0 00010000:02000400:0.0:1388959708.813904:0:25116:0:(ldlm_lib.c:1054:target_handle_connect()) lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp), waiting for all 2 known clients (0 recovered, 1 in progress, and 0 evicted) to recover in 0:28 |
| Comment by Andreas Dilger [ 08/Jan/14 ] |
|
Mike, any progress on this problem? This test failed 30 times in the past 4 days. |
| Comment by Mikhail Pershin [ 09/Jan/14 ] |
|
Patch to count disconnected clients in stale counter: http://review.whamcloud.com/8785 |
| Comment by Jian Yu [ 13/Jan/14 ] |
|
Another instance on master branch: https://maloo.whamcloud.com/test_sets/601c8390-7bbe-11e3-8a96-52540035b04c |
| Comment by Jodi Levi (Inactive) [ 16/Jan/14 ] |
|
Yu Jian, |
| Comment by Jian Yu [ 17/Jan/14 ] |
|
Hi Jodi, |
| Comment by Andreas Dilger [ 17/Jan/14 ] |
|
It is hard to know whether the patch fixes the problem without repeated runs, since it also passes without the patch. That said, there were almost 100 failures of this test in the past 4 weeks, so this is failing pretty often. |
| Comment by Jodi Levi (Inactive) [ 20/Jan/14 ] |
|
Yu Jian, |
| Comment by Jian Yu [ 21/Jan/14 ] |
|
By searching on Maloo, I found that after patch #8785 landed on master branch 3 days ago, conf-sanity test 47 all passed (more than 60 test runs, except several failed ones because the patches were not rebased). So, the issue was fixed. Let's close this ticket. |
| Comment by nasf (Inactive) [ 22/Jan/14 ] |
|
I think I met the same failure with #8785 applied. https://maloo.whamcloud.com/test_sets/fdfb4102-8355-11e3-bedf-52540035b04c |
| Comment by Mikhail Pershin [ 22/Jan/14 ] |
|
I don't see evidences that this is the same failure, in original bug there was recovery stuck with message like this: In your log there are no such messages, there is nothing useful in log actually. If there will be another test failures with recovery stuck let me know. I'd wait for more instances of this bug in master to say for sure this bug still exists, until that this looks more related to patch applied. |
| Comment by nasf (Inactive) [ 22/Jan/14 ] |
|
Here is the summary for recent conf-sanity test_47 failure. I am not sure they are really the same as https://maloo.whamcloud.com/sub_tests/query?utf8=✓&test_set%5Btest_set_script_id%5D=7f66aa20-3db2-11e0-80c0-52540025f9af&sub_test%5Bsub_test_script_id%5D=8392c2be-3db2-11e0-80c0-52540025f9af&sub_test%5Bstatus%5D=TIMEOUT&sub_test%5Bquery_bugs%5D=&test_session%5Btest_host%5D=&test_session%5Btest_group%5D=&test_session%5Buser_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=604800&test_node%5Bos_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Barchitecture_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node_network%5Bnetwork_type_id%5D=&commit=Update+results |
| Comment by nasf (Inactive) [ 03/Feb/14 ] |
|
Hm… with the patch http://review.whamcloud.com/8997 (for |
| Comment by nasf (Inactive) [ 03/Feb/14 ] |
|
Please reopen it when hit it again in the future. |
| Comment by Andreas Dilger [ 05/Feb/14 ] |
|
If this was fixed by landing the |
| Comment by Mikhail Pershin [ 10/Feb/14 ] |
|
Andreas, while this was happened in the same test, the failure in |
| Comment by Sarah Liu [ 04/Mar/14 ] |
|
Hit this issue again during tag-2.5.56 testing in DNE mode https://maloo.whamcloud.com/test_sets/92cc87f8-a02c-11e3-947c-52540035b04c |
| Comment by Bob Glossman (Inactive) [ 06/Mar/14 ] |
|
may be the same issue, seen in b2_5: |
| Comment by Mikhail Pershin [ 28/Mar/14 ] |
|
Last two issues were happened in branches without |
| Comment by Mikhail Pershin [ 28/Mar/14 ] |
|
set ticket as replated to |
| Comment by Jodi Levi (Inactive) [ 28/Mar/14 ] |
|
Sarah, |
| Comment by Sarah Liu [ 31/Mar/14 ] |
|
Hi Jodi, |
| Comment by Oleg Drokin [ 02/Apr/14 ] |
|
Somebody needs to backport the patch from lu-4413 as it could not be cleanly cherry-picked to b2_5 |
| Comment by Bob Glossman (Inactive) [ 02/Apr/14 ] |
|
We are talking about back port of http://review.whamcloud.com/8997, right? I can do that. |
| Comment by Jodi Levi (Inactive) [ 09/Apr/14 ] |
|
Minh retriggered the test of Change, 9875 on Shadow today. |
| Comment by Peter Jones [ 18/Apr/14 ] |
|
Landed for 2.6 |