[LU-4349] conf-sanity test_47: test failed to respond and timed out Created: 05/Dec/13  Updated: 11/Oct/14  Resolved: 18/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.4

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre-master build # 1791 RHEL6 zfs


Issue Links:
Duplicate
duplicates LU-4413 Test failure on test suite conf-sanit... Resolved
is duplicated by LU-4426 Test failure on test suite conf-sanit... Resolved
Related
is related to LU-793 Reconnections should not be refused w... Resolved
is related to LU-4413 Test failure on test suite conf-sanit... Resolved
is related to LU-3895 conf-sanity test 47: umount -d -f /mn... Closed
Severity: 3
Rank (Obsolete): 11914

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/1d5f9fd8-5c6a-11e3-9d08-52540035b04c.

The sub-test test_47 failed with the following error:

test failed to respond and timed out

Info required for matching: conf-sanity 47

This test ran for 3600s and failed with a timeout; I cannot find any error logs. I checked a passing conf-sanity run on ZFS, where the same sub-test takes about 134s to complete:

https://maloo.whamcloud.com/sub_tests/b9351066-5d53-11e3-956b-52540035b04c



 Comments   
Comment by Sarah Liu [ 05/Dec/13 ]

Also seen this error on a SLES11 SP3 client:

https://maloo.whamcloud.com/test_sets/fb3c1e54-5de5-11e3-8f3c-52540035b04c

Comment by Andreas Dilger [ 09/Dec/13 ]

This is also timing out on ldiskfs test runs:

https://maloo.whamcloud.com/sub_tests/e2c36e9c-5eab-11e3-b298-52540035b04c

Comment by Andreas Dilger [ 09/Dec/13 ]

This failure looks like it was introduced by http://review.whamcloud.com/4960: it first appeared on 2013-11-26 in two test runs for that patch (review-zfs and review-dne-zfs, so it was not flagged as a failure), and after the patch landed on 2013-12-02 it started failing on other test runs from 2013-12-03 onward. It does fail more often on ZFS than on ldiskfs, but maybe only because ZFS writes take longer?

Comment by Mikhail Pershin [ 16/Dec/13 ]

The test failed due to recovery problems: for some reason the recovery timer keeps growing and never stops. I am not 100% sure why, but I found a problem with http://review.whamcloud.com/4960 that may be related. The patch is at http://review.whamcloud.com/8582

Comment by nasf (Inactive) [ 27/Dec/13 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/b089e1d4-6e89-11e3-9de0-52540035b04c

Comment by Jinshan Xiong (Inactive) [ 27/Dec/13 ]

failed at: https://maloo.whamcloud.com/test_sessions/187b6dee-6f07-11e3-ad93-52540035b04c

Comment by Andreas Dilger [ 29/Dec/13 ]

The patch landed to master on 2013-12-27 19:13, so please verify that any failures seen after this time are on builds that do not contain this patch.

Comment by Nathaniel Clark [ 03/Jan/14 ]

This failed again after patch #8582 landed:
https://maloo.whamcloud.com/test_sets/325bb0ac-7317-11e3-92ab-52540035b04c

I had opened LU-4426; I'll close that as a duplicate of this one.

Comment by Nathaniel Clark [ 06/Jan/14 ]

The recovery timer goes around again; it seems to be stuck:

00010000:00080000:0.0:1388959703.814015:0:25116:0:(ldlm_lib.c:1010:target_handle_connect()) lustre-OST0000: connection from lustre-MDT0000-mdtlov_UUID@10.10.17.33@tcp recovering/t0 exp (null) cur 1388959703 last 0
00010000:02000400:0.0:1388959703.814020:0:25116:0:(ldlm_lib.c:1054:target_handle_connect()) lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp), waiting for all 2 known clients (0 recovered, 1 in progress, and 0 evicted) to recover in 0:03
...
00010000:00080000:0.1:1388959707.256838:0:0:0:(ldlm_lib.c:2152:target_recovery_expired()) lustre-OST0000: recovery timed out; 1 clients are still in recovery after 2076s (1 clients connected)
00010000:02000400:0.0:1388959707.256873:0:24824:0:(ldlm_lib.c:1803:target_recovery_overseer()) lustre-OST0000: recovery is timed out, evict stale exports
00000020:00000001:0.0:1388959707.256876:0:24824:0:(genops.c:1354:class_disconnect_stale_exports()) Process entered
00000020:00000001:0.0:1388959707.256879:0:24824:0:(genops.c:1281:class_disconnect_export_list()) Process entered
00000020:00000001:0.0:1388959707.256880:0:24824:0:(genops.c:1319:class_disconnect_export_list()) Process leaving
00000020:00000001:0.0:1388959707.256881:0:24824:0:(genops.c:1394:class_disconnect_stale_exports()) Process leaving
00010000:00080000:0.0:1388959707.256883:0:24824:0:(ldlm_lib.c:1607:extend_recovery_timer()) lustre-OST0000: recovery timer will expire in 30 seconds
...
00010000:00080000:0.0:1388959708.813899:0:25116:0:(ldlm_lib.c:1010:target_handle_connect()) lustre-OST0000: connection from lustre-MDT0000-mdtlov_UUID@10.10.17.33@tcp recovering/t0 exp (null) cur 1388959708 last 0
00010000:02000400:0.0:1388959708.813904:0:25116:0:(ldlm_lib.c:1054:target_handle_connect()) lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp), waiting for all 2 known clients (0 recovered, 1 in progress, and 0 evicted) to recover in 0:28
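
Reading the trace, the pattern appears to be: the overseer times out, class_disconnect_stale_exports() enters and leaves without evicting anything, extend_recovery_timer() re-arms the timer for another 30 seconds, and meanwhile the MDT's new connection keeps being denied because recovery is still pending, so the state never changes. Below is a minimal, standalone C model of that loop; it is NOT the Lustre implementation, and the struct fields and function names (recovery_state, overseer_step, etc.) are hypothetical simplifications of the counters shown in the messages above.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical simplification of the recovery bookkeeping seen in the
 * log above; NOT the actual Lustre code. */
struct recovery_state {
	int known;      /* clients expected to reconnect ("2 known clients") */
	int recovered;  /* clients that completed recovery ("0 recovered")   */
	int stale;      /* clients given up on ("0 evicted")                 */
};

static bool recovery_complete(const struct recovery_state *rs)
{
	return rs->recovered + rs->stale >= rs->known;
}

/* One overseer pass after a timer expiry, mirroring the log sequence
 * "recovery is timed out, evict stale exports" followed by
 * "recovery timer will expire in 30 seconds". */
static void overseer_step(struct recovery_state *rs)
{
	/* In the trace, class_disconnect_stale_exports() enters and leaves
	 * without doing anything: the client that went away was never
	 * counted as stale, so the counters never change here. */
	if (!recovery_complete(rs))
		printf("still waiting for %d of %d clients, extending timer by 30s\n",
		       rs->known - rs->recovered - rs->stale, rs->known);
}

int main(void)
{
	/* Roughly matches the trace: 2 known clients, 0 recovered, 0 evicted;
	 * the missing client never returns, and the MDT's new connection is
	 * denied while recovery is in progress, so the state never changes
	 * and the timer is re-armed forever.  Only a few iterations shown. */
	struct recovery_state ost = { .known = 2, .recovered = 0, .stale = 0 };

	for (int i = 0; i < 3 && !recovery_complete(&ost); i++)
		overseer_step(&ost);

	return 0;
}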
Comment by Andreas Dilger [ 08/Jan/14 ]

Mike, any progress on this problem? This test failed 30 times in the past 4 days.

Comment by Mikhail Pershin [ 09/Jan/14 ]

Patch to count disconnected clients in the stale counter: http://review.whamcloud.com/8785
I am not sure this is exactly the reason, but it looks like the stale counter is wrong after a normal export disconnect compared with an eviction.
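
In other words, the idea appears to be that a client which disconnects normally during recovery should be accounted for the same way an evicted client is, so recovery stops waiting for it. A hypothetical sketch of that accounting follows; it is NOT the actual change in http://review.whamcloud.com/8785, and the names (recovery_counters, client_gone_during_recovery) are made up for illustration.

#include <stdbool.h>

/* Hypothetical sketch of the fix idea described above; NOT the actual
 * patch in review 8785.  The point is only that a normal disconnect
 * during recovery must be counted like an eviction, so the completion
 * check (recovered + stale >= known) can eventually be satisfied. */
struct recovery_counters {
	int known;      /* clients expected to reconnect            */
	int recovered;  /* clients that completed recovery          */
	int stale;      /* evicted OR cleanly disconnected clients  */
};

static void client_gone_during_recovery(struct recovery_counters *rc,
					bool evicted)
{
	/* Before the fix idea, only evictions bumped the stale count, so a
	 * normal disconnect left recovery waiting for that client forever. */
	(void)evicted;   /* both cases are now accounted for the same way */
	rc->stale++;
}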

Comment by Jian Yu [ 13/Jan/14 ]

Another instance on master branch: https://maloo.whamcloud.com/test_sets/601c8390-7bbe-11e3-8a96-52540035b04c

Comment by Jodi Levi (Inactive) [ 16/Jan/14 ]

Yu Jian,
Did the latest instance happen with Mike's patch? Or do we think his patch will fix this issue once it lands?

Comment by Jian Yu [ 17/Jan/14 ]

Hi Jodi,
The latest instance happened without Mike's patch. The test results in http://review.whamcloud.com/8785 showed that conf-sanity test 47 passed.

Comment by Andreas Dilger [ 17/Jan/14 ]

It is hard to know whether the patch fixes the problem without repeated runs, since it also passes without the patch.

That said, there were almost 100 failures of this test in the past 4 weeks, so this is failing pretty often.

Comment by Jodi Levi (Inactive) [ 20/Jan/14 ]

Yu Jian,
Since change #8785 has landed, can you verify that this is fixed and close the ticket, or comment?
Thank you!

Comment by Jian Yu [ 21/Jan/14 ]

Searching on Maloo, I found that since patch #8785 landed on the master branch 3 days ago, conf-sanity test 47 has passed in more than 60 test runs (apart from several failures on patches that were not rebased). So the issue was fixed; let's close this ticket.

Comment by nasf (Inactive) [ 22/Jan/14 ]

I think I hit the same failure with #8785 applied.

https://maloo.whamcloud.com/test_sets/fdfb4102-8355-11e3-bedf-52540035b04c

Comment by Mikhail Pershin [ 22/Jan/14 ]

I don't see evidence that this is the same failure; in the original bug, recovery was stuck with messages like this:
kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 10.10.18.247@tcp), waiting for all 5 known clients (3 recovered, 1 in progress, and 0 evicted) to recover in 0:12

In your log there are no such messages; in fact there is nothing useful in the log at all. If there are more test failures with recovery stuck, let me know. I'd wait for more instances of this bug on master before saying for sure that it still exists; until then, this looks more related to the applied patch.

Comment by nasf (Inactive) [ 22/Jan/14 ]

Here is a summary of recent conf-sanity test_47 failures. I am not sure they are really the same as LU-4349, but we are certainly still having trouble with conf-sanity failures. If they are not the same, I will open a new ticket.

https://maloo.whamcloud.com/sub_tests/query?utf8=✓&test_set%5Btest_set_script_id%5D=7f66aa20-3db2-11e0-80c0-52540025f9af&sub_test%5Bsub_test_script_id%5D=8392c2be-3db2-11e0-80c0-52540025f9af&sub_test%5Bstatus%5D=TIMEOUT&sub_test%5Bquery_bugs%5D=&test_session%5Btest_host%5D=&test_session%5Btest_group%5D=&test_session%5Buser_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=604800&test_node%5Bos_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Barchitecture_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node_network%5Bnetwork_type_id%5D=&commit=Update+results

Comment by nasf (Inactive) [ 03/Feb/14 ]

Hm… with the patch http://review.whamcloud.com/8997 (for LU-4413) landed to master, I have not hit the conf-sanity failure. So I prefer to close this bug.

Comment by nasf (Inactive) [ 03/Feb/14 ]

Please reopen it if you hit it again in the future.

Comment by Andreas Dilger [ 05/Feb/14 ]

If this was fixed by landing the LU-4413 patch, then it should be linked to that bug.

Comment by Mikhail Pershin [ 10/Feb/14 ]

Andreas, while this happened in the same test, the failure in LU-4413 looked different from the one in LU-4349: recovery was not stuck there, the hang happened after recovery. So I doubt it is actually related; it looks more like a coincidence.

Comment by Sarah Liu [ 04/Mar/14 ]

Hit this issue again during tag-2.5.56 testing in DNE mode

https://maloo.whamcloud.com/test_sets/92cc87f8-a02c-11e3-947c-52540035b04c

Comment by Bob Glossman (Inactive) [ 06/Mar/14 ]

May be the same issue, seen on b2_5:
https://maloo.whamcloud.com/test_sets/550c0fde-a521-11e3-9fee-52540035b04c

Comment by Mikhail Pershin [ 28/Mar/14 ]

The last two instances happened on branches without the LU-4413 fix mentioned in Fan Yong's comment. It seems we have to port http://review.whamcloud.com/8997 to b2_5.

Comment by Mikhail Pershin [ 28/Mar/14 ]

Set the ticket as related to LU-4413 just because of the similar symptoms.

Comment by Jodi Levi (Inactive) [ 28/Mar/14 ]

Sarah,
Did you hit this on the 2.5.57 tag testing?

Comment by Sarah Liu [ 31/Mar/14 ]

Hi Jodi,
No, I didn't see this issue in tag-2.5.57 testing.

Comment by Oleg Drokin [ 02/Apr/14 ]

Somebody needs to backport the patch from LU-4413, as it could not be cleanly cherry-picked to b2_5.

Comment by Bob Glossman (Inactive) [ 02/Apr/14 ]

We are talking about a backport of http://review.whamcloud.com/8997, right? I can do that.

http://review.whamcloud.com/9875

Comment by Jodi Levi (Inactive) [ 09/Apr/14 ]

Minh retriggered testing of change #9875 on Shadow today.

Comment by Peter Jones [ 18/Apr/14 ]

Landed for 2.6
