[LU-4331] Test timeout on recovery-small test_59 Created: 02/Dec/13  Updated: 11/Aug/15  Resolved: 11/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: zfs

Severity: 3
Rank (Obsolete): 11852

Description

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite runs:
http://maloo.whamcloud.com/test_sets/1272e790-4240-11e3-b1be-52540035b04c
http://maloo.whamcloud.com/test_sets/8f6311b8-59f0-11e3-9a87-52540035b04c
http://maloo.whamcloud.com/test_sets/59cc8d90-500e-11e3-b42b-52540035b04c

The sub-test test_59 failed with the following error:

test failed to respond and timed out

Info required for matching: recovery-small 59



Comments
Comment by nasf (Inactive) [ 11/Jan/14 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/b9ca3f64-7aaf-11e3-a11f-52540035b04c

Comment by Bobbie Lind (Inactive) [ 02/May/14 ]

Most tests seem to time out during a mount or unmount operation, as can be seen in http://review.whamcloud.com/#/c/9205/. I need to update and rerun the test timeouts, as it's been a while since I last ran this test.

Comment by nasf (Inactive) [ 28/Jun/15 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/e53e97c6-1d51-11e5-bb4a-5254006e85c2

Test output:

== recovery-small test 59: Read cancel race on client eviction == 15:17:50 (1435443470)
Starting client: onyx-41vm6.onyx.hpdd.intel.com:  -o user_xattr,flock onyx-41vm3@tcp:/lustre /mnt/lustre2
CMD: onyx-41vm6.onyx.hpdd.intel.com mkdir -p /mnt/lustre2
CMD: onyx-41vm6.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-41vm3@tcp:/lustre /mnt/lustre2
fail_loc=0x311
fail_loc=0

MDS console log:

23:19:05:Lustre: DEBUG MARKER: == recovery-small test 59: Read cancel race on client eviction == 15:17:50 (1435443470)
23:19:05:LustreError: 11-0: lustre-OST0005-osc-MDT0000: operation ost_connect to node 10.2.4.195@tcp failed: rc = -114
23:19:05:LustreError: Skipped 51 previous similar messages
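The MDS line above reports `rc = -114` for `ost_connect`. Lustre reports failures as negated errno values, and on Linux errno 114 is `EALREADY` ("Operation already in progress"), which appears consistent with the OSS repeatedly logging `Export ... already connecting` below. The following is a minimal Python sketch for decoding such values from console lines; the function name `decode_rc` and the regex are illustrative assumptions, not part of any Lustre tooling. Errno numbers are platform-specific, so it should be run on a Linux host to match the servers' numbering.

```python
import errno
import os
import re
from typing import Optional

# Lustre console errors end with "rc = <value>"; failures are negated errno codes.
# (Hypothetical helper pattern, written for the log format shown in this ticket.)
RC_PATTERN = re.compile(r"rc = (-?\d+)")


def decode_rc(line: str) -> Optional[str]:
    """Return 'SYMBOL (message)' for a negative rc in a console line, else None."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    if rc >= 0:
        return None  # 0 or positive values are not errno failures
    code = -rc  # flip the sign to get the platform errno number
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


if __name__ == "__main__":
    line = ("LustreError: 11-0: lustre-OST0005-osc-MDT0000: operation ost_connect "
            "to node 10.2.4.195@tcp failed: rc = -114")
    print(decode_rc(line))
```

On a typical x86_64 Linux host the usage example prints `EALREADY (Operation already in progress)`; on other platforms errno 114 may map to a different symbol or to `UNKNOWN`.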

OSS console log:

22:39:52:Lustre: DEBUG MARKER: == recovery-small test 59: Read cancel race on client eviction == 15:17:50 (1435443470)
22:39:52:Lustre: lustre-OST0005: Export ffff880065304000 already connecting from 10.2.4.197@tcp
22:39:52:Lustre: Skipped 71 previous similar messages
22:39:52:Lustre: 5060:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
22:39:52:  req@ffff8800440d3980 x1505169144091928/t0(0) o4->512c404f-3996-b0b6-e5c0-201d629dc366@10.2.4.197@tcp:591/0 lens 608/448 e 23 to 0 dl 1435443586 ref 2 fl Interpret:H/0/0 rc 0/0
22:39:52:Lustre: lustre-OST0005: Client 512c404f-3996-b0b6-e5c0-201d629dc366 (at 10.2.4.197@tcp) reconnecting
22:39:52:Lustre: 5061:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
22:39:52:  req@ffff880068f9e680 x1505169144119492/t0(0) o4->512c404f-3996-b0b6-e5c0-201d629dc366@10.2.4.197@tcp:643/0 lens 608/448 e 8 to 0 dl 1435443638 ref 2 fl Interpret:/0/0 rc 0/0
22:39:52:Lustre: 5061:0:(service.c:1336:ptlrpc_at_send_early_reply()) Skipped 3 previous similar messages
22:39:52:Lustre: lustre-OST0005: Export ffff880065304000 already connecting from 10.2.4.197@tcp
22:39:52:Lustre: Skipped 203 previous similar messages
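The `req@` lines in the OSS log carry per-request fields that help show how long the stuck requests had been waiting. A minimal Python sketch for extracting them follows. It assumes (our reading of the dump format, not confirmed in this ticket) that `x` is the request XID, `o` the opcode, `e` the number of early replies sent, and `dl` the absolute deadline in epoch seconds. It also assumes the OSS clock roughly matches the client clock that printed the DEBUG MARKER epoch `1435443470`. The names `parse_requests` and `RequestInfo` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed layout of a Lustre request dump line:
#   req@<ptr> x<xid>/t<transno>(<n>) o<opc>->... e <early> to <timedout> dl <deadline> ...
REQ_PATTERN = re.compile(
    r"req@\S+ x(?P<xid>\d+)/t\d+\(\d+\) o(?P<opc>\d+)->\S+"
    r".*? e (?P<early>\d+) to \d+ dl (?P<deadline>\d+)"
)


@dataclass
class RequestInfo:
    xid: int
    opcode: int
    early_replies: int
    deadline: int             # absolute deadline, epoch seconds (assumed)
    seconds_after_start: int  # deadline relative to the test start marker


def parse_requests(log: str, test_start: int) -> List[RequestInfo]:
    """Collect request-dump fields from console text; other lines are ignored."""
    requests = []
    for line in log.splitlines():
        m = REQ_PATTERN.search(line)
        if m is None:
            continue
        deadline = int(m.group("deadline"))
        requests.append(RequestInfo(
            xid=int(m.group("xid")),
            opcode=int(m.group("opc")),
            early_replies=int(m.group("early")),
            deadline=deadline,
            seconds_after_start=deadline - test_start,
        ))
    return requests
```

Given the two `req@` lines above and `test_start=1435443470`, this yields deadlines 116 s and 168 s after the test started, with 23 and 8 early replies respectively.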

Comment by Andreas Dilger [ 11/Aug/15 ]

This test hasn't really been failing since 2014-06-27, apart from possibly one hit every three months, which might be totally unrelated. It is no longer possible to determine whether these failures are related, because the old test logs are not available and the bug contains no description of the failure beyond the fact that test_59 timed out.

Closing this; we can reopen it or create a new ticket if this test starts failing again.

Generated at Sat Feb 10 01:41:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.