[LU-1890] Test failure on test suite recovery-small, subtest test_101 Created: 11/Sep/12  Updated: 12/Jun/13  Resolved: 12/Jun/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: Text File fat-intel-3vm8.syslog.log     Text File recovery-small.test_101.console.fat-intel-3vm7.log    
Severity: 3
Rank (Obsolete): 4143

 Description   

This issue was created by maloo for yujian <yujian@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/ae7cd73a-f760-11e1-8b95-52540035b04c.

Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/12

The sub-test test_101 failed with the following error:

test failed to respond and timed out

Info required for matching: recovery-small 101

MGS/MDS Nodes: fat-intel-3vm3(10.10.4.88), fat-intel-3vm7(10.10.4.92)
OSS Nodes: fat-intel-3vm4(10.10.4.89), fat-intel-3vm8(10.10.4.93)
Client Nodes: fat-intel-3vm1(10.10.4.86), fat-intel-3vm5(10.10.4.90), fat-intel-3vm6(10.10.4.91)

Syslog on OSS fat-intel-3vm8(10.10.4.93):

Sep  4 18:14:34 fat-intel-3vm8 rshd[4135]: root@fat-intel-3vm1.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "e2label /dev/lvm-OSS/P7 2>/dev/null");echo XXRETCODE:$?'
Sep  4 18:14:35 fat-intel-3vm8 kernel: Lustre: lustre-OST0006: Will be in recovery for at least 1:00, or until 4 clients reconnect
Sep  4 18:14:39 fat-intel-3vm8 kernel: Lustre: lustre-OST0006: Recovery over after 0:04, of 4 clients 4 recovered and 0 were evicted.
Sep  4 18:14:39 fat-intel-3vm8 kernel: Lustre: lustre-OST0006: received MDS connection from 10.10.4.92@tcp
Sep  4 18:18:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346807862/real 1346807862]  req@ffff880037705800 x1412229540872552/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346807887 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep  4 18:18:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Sep  4 18:27:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346808402/real 1346808402]  req@ffff88007c723400 x1412229540872570/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346808427 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep  4 18:27:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
Sep  4 18:37:37 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346809032/real 1346809032]  req@ffff88007c723c00 x1412229540872591/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346809057 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep  4 18:37:37 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
Sep  4 18:48:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346809662/real 1346809662]  req@ffff8800709db800 x1412229540872612/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346809687 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep  4 18:48:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
Sep  4 18:58:37 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346810292/real 1346810292]  req@ffff880037705c00 x1412229540872633/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346810317 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep  4 18:58:37 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
Sep  4 19:04:34 fat-intel-3vm8 kernel: Lustre: lustre-OST0004: haven't heard from client 85be9246-99f9-3b19-5f81-ad80152fd420 (at 10.10.4.91@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff88007254fc00, cur 1346810674 expire 1346810644 last 1346810627
Sep  4 19:04:34 fat-intel-3vm8 kernel: Lustre: lustre-OST0004: haven't heard from client f75916a5-eaec-360e-dafc-5ebb1448c443 (at 10.10.4.90@tcp) in 34 seconds. I think it's dead, and I am evicting it. exp ffff88007598b000, cur 1346810674 expire 1346810644 last 1346810640
Sep  4 19:04:39 fat-intel-3vm8 kernel: Lustre: lustre-OST0000: haven't heard from client 85be9246-99f9-3b19-5f81-ad80152fd420 (at 10.10.4.91@tcp) in 52 seconds. I think it's dead, and I am evicting it. exp ffff8800732a6000, cur 1346810679 expire 1346810649 last 1346810627
Sep  4 19:04:39 fat-intel-3vm8 kernel: Lustre: Skipped 10 previous similar messages
Sep  4 19:04:40 fat-intel-3vm8 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298439482, last ping 4298444482, now 4298449482
Sep  4 19:04:40 fat-intel-3vm8 kernel: connection1:0: detected conn error (1011)
Sep  4 19:04:41 fat-intel-3vm8 iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
Sep  4 19:05:38 fat-intel-3vm8 kernel: MTRR variable ranges enabled:
Sep  4 19:05:38 fat-intel-3vm8 kernel:  0 base 00E0000000 mask FFE0000000 uncachable

Syslog on client fat-intel-3vm6(10.10.4.91):

Sep  4 18:03:35 fat-intel-3vm6 kernel: Lustre: DEBUG MARKER: == recovery-small test 101: IR: Make sure IR works w/o normal recovery == 18:03:35 (1346807015)
Sep  4 18:04:07 fat-intel-3vm6 kernel: Lustre: lustre-OST0001-osc-ffff8800749ac400: Connection to lustre-OST0001 (at 10.10.4.89@tcp) was lost; in progress operations using this service will wait for recovery to complete
Sep  4 18:04:07 fat-intel-3vm6 kernel: Lustre: Skipped 3 previous similar messages
Sep  4 18:05:07 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
Sep  4 18:06:17 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
Sep  4 18:06:17 fat-intel-3vm6 kernel: LustreError: Skipped 28 previous similar messages
Sep  4 18:06:24 fat-intel-3vm6 kernel: Lustre: lustre-OST0000-osc-ffff8800749ac400: Connection restored to lustre-OST0000 (at 10.10.4.93@tcp)
Sep  4 18:06:24 fat-intel-3vm6 kernel: Lustre: Skipped 6 previous similar messages
Sep  4 18:07:47 fat-intel-3vm6 kernel: Lustre: 2131:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has failed due to network error: [sent 1346807267/real 1346807267]  req@ffff880074801800 x1412225718819426/t0(0) o8->lustre-OST0002-osc-ffff8800749ac400@10.10.4.89@tcp:28/4 lens 400/544 e 0 to 1 dl 1346807293 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep  4 18:07:47 fat-intel-3vm6 kernel: Lustre: 2131:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 159 previous similar messages
Sep  4 18:08:27 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -11
Sep  4 18:08:27 fat-intel-3vm6 kernel: LustreError: Skipped 49 previous similar messages
Sep  4 18:13:02 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -11
Sep  4 18:13:02 fat-intel-3vm6 kernel: LustreError: Skipped 51 previous similar messages
Sep  4 19:04:28 fat-intel-3vm6 kernel: MTRR variable ranges enabled:
Sep  4 19:04:28 fat-intel-3vm6 kernel:  0 base 00E0000000 mask FFE0000000 uncachable

Syslog on client fat-intel-3vm5(10.10.4.90):

Sep  4 18:03:35 fat-intel-3vm5 kernel: Lustre: DEBUG MARKER: == recovery-small test 101: IR: Make sure IR works w/o normal recovery == 18:03:35 (1346807015)
Sep  4 18:04:10 fat-intel-3vm5 kernel: Lustre: lustre-OST0001-osc-ffff880037556c00: Connection to lustre-OST0001 (at 10.10.4.89@tcp) was lost; in progress operations using this service will wait for recovery to complete
Sep  4 18:04:10 fat-intel-3vm5 kernel: Lustre: Skipped 3 previous similar messages
Sep  4 18:05:20 fat-intel-3vm5 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
Sep  4 18:06:24 fat-intel-3vm5 kernel: Lustre: lustre-OST0000-osc-ffff880037556c00: Connection restored to lustre-OST0000 (at 10.10.4.93@tcp)
Sep  4 18:06:24 fat-intel-3vm5 kernel: Lustre: Skipped 6 previous similar messages
Sep  4 18:07:30 fat-intel-3vm5 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
Sep  4 18:07:30 fat-intel-3vm5 kernel: LustreError: Skipped 37 previous similar messages
Sep  4 18:10:06 fat-intel-3vm5 kernel: Lustre: 2138:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346807380/real 1346807380]  req@ffff8800794cd000 x1412225718819590/t0(0) o8->lustre-OST0003-osc-ffff880037556c00@10.10.4.89@tcp:28/4 lens 400/544 e 0 to 1 dl 1346807406 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep  4 18:10:06 fat-intel-3vm5 kernel: Lustre: 2138:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 173 previous similar messages
Sep  4 18:12:20 fat-intel-3vm5 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
Sep  4 18:12:20 fat-intel-3vm5 kernel: LustreError: Skipped 70 previous similar messages
Sep  4 19:04:42 fat-intel-3vm5 kernel: MTRR variable ranges enabled:
Sep  4 19:04:42 fat-intel-3vm5 kernel:  0 base 00E0000000 mask FFE0000000 uncachable

Syslog on client fat-intel-3vm1(10.10.4.86):

Sep  4 18:03:35 fat-intel-3vm1 kernel: Lustre: DEBUG MARKER: == recovery-small test 101: IR: Make sure IR works w/o normal recovery == 18:03:35 (1346807015)
Sep  4 18:05:08 fat-intel-3vm1 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
Sep  4 18:05:08 fat-intel-3vm1 kernel: LustreError: Skipped 6 previous similar messages
Sep  4 18:06:54 fat-intel-3vm1 kernel: Lustre: 3295:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346807188/real 1346807188]  req@ffff88002a4e5000 x1412225453747374/t0(0) o8->lustre-OST0005-osc-ffff880047d0f000@10.10.4.89@tcp:28/4 lens 400/544 e 0 to 1 dl 1346807213 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep  4 18:06:54 fat-intel-3vm1 kernel: Lustre: 3295:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 119 previous similar messages
Sep  4 18:11:35 fat-intel-3vm1 kernel: Lustre: lustre-OST0004-osc-ffff880047d0f000: Connection restored to lustre-OST0004 (at 10.10.4.93@tcp)
Sep  4 18:11:35 fat-intel-3vm1 kernel: Lustre: Skipped 5 previous similar messages
Sep  4 19:05:09 fat-intel-3vm1 kernel: MTRR variable ranges enabled:
Sep  4 19:05:09 fat-intel-3vm1 kernel:  0 base 00E0000000 mask FFE0000000 uncachable


 Comments   
Comment by Jian Yu [ 17/Sep/12 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_3/19
Test group: failover

The same issue occurred: https://maloo.whamcloud.com/test_sets/4b58e1bc-fff3-11e1-9f3c-52540035b04c

Comment by Peter Jones [ 17/Sep/12 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Lai Siyao [ 20/Sep/12 ]

When I ran subtest 101 alone in a loop, it always passed, so I'm now running the full test suite to try to reproduce the failure. The log shows the MDS suddenly rebooted but doesn't give much detail. Is that right, Yujian?

Comment by Jian Yu [ 20/Sep/12 ]

When I ran subtest 101 alone in a loop, it always passed, so I'm now running the full test suite to try to reproduce the failure. The log shows the MDS suddenly rebooted but doesn't give much detail. Is that right, Yujian?

For report https://maloo.whamcloud.com/test_sets/ae7cd73a-f760-11e1-8b95-52540035b04c, I could only find the attached console log for the MDS/MGS node fat-intel-3vm7(10.10.4.92) in the autotest log directory on the brent node.

The issue only occurred while running the test under the failover configuration. So, if you want to reproduce or debug it, you can push a patch to Gerrit with the following Test-Parameters lines in the commit message:

Test-Parameters: fortestonly envdefinitions=SLOW=yes \
clientcount=4 osscount=2 mdscount=2 austeroptions=-R \
failover=true useiscsi=true testlist=recovery-small
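
For illustration, a hypothetical Gerrit commit message carrying these parameters could look like the following; only the Test-Parameters lines come from this ticket, while the subject line, Signed-off-by, and Change-Id values are placeholders:

    LU-1890 tests: reproduce recovery-small test_101 under failover

    Test-only patch to reproduce the recovery-small test_101 timeout
    in the failover test group.

    Test-Parameters: fortestonly envdefinitions=SLOW=yes \
    clientcount=4 osscount=2 mdscount=2 austeroptions=-R \
    failover=true useiscsi=true testlist=recovery-small
    Signed-off-by: Jane Doe <jane.doe@example.com>
    Change-Id: I0123456789abcdef0123456789abcdef01234567
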
Comment by Lai Siyao [ 21/Sep/12 ]

I reproduced this in my test environment. This is an Imperative Recovery test; the client is not able to reconnect within the recovery window and eventually fails. I'm not yet clear on the cause and am collecting more logs to debug.

Comment by Jian Yu [ 23/Sep/12 ]

Here is a link to the historical Maloo reports for recovery-small test 101 in failover test group:
http://tinyurl.com/8adbnzz

As we can see:

  • Before 2012-05-31, the test passed at times, but it was not run with the failover configuration.
  • Between 2012-05-31 and 2012-08-20, the test was run with the failover configuration but hit TT-765.
  • After TT-765 was fixed, LU-1890 has been occurring regularly since 2012-09-05.

Comment by Lai Siyao [ 24/Sep/12 ]

Jinshan, the test log shows that after OST failover, the client is denied connection while the MDS connects successfully. Do you have any clue?

Comment by Jinshan Xiong (Inactive) [ 24/Sep/12 ]

Siyao, can you please check whether this is because the client data was not written to persistent storage in last_rcvd, so the OST treated it as a new client and denied the reconnect? I have seen this kind of issue several times, and if this is the case, we can fix it by issuing a sync I/O before failing the OST.
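
For illustration only, a minimal sketch of the kind of pre-failover sync Jinshan suggests, assuming the standard Lustre test-framework helpers (do_facet, fail) are sourced; the helper name sync_then_fail_ost is hypothetical:

    # Hypothetical helper: flush dirty state (including last_rcvd updates) to
    # disk on the node currently serving the OST facet before failing it over,
    # so reconnecting clients are still recorded in the server's last_rcvd file.
    sync_then_fail_ost() {
        local facet=${1:-ost1}
        # force a sync on the OSS backing the facet before the failover
        do_facet $facet "sync; sync"
        # standard test-framework failover of the facet
        fail $facet
    }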

Comment by Lai Siyao [ 25/Sep/12 ]

Chris, could you describe how the shared disk is configured for the failover test?

In my manual test, the cause of the failure is that I used a /dev/sdX-style device name (an iSCSI disk) as the OST device, but device enumeration can differ between nodes: /dev/sdc on the primary OST node may be /dev/sdd on the failover OST node. After failover the two nodes therefore access different devices, so the client is unable to connect. Since switching to /dev/disk/by-id/iscsi-xxxx to access the shared disk, I have been running this test successfully for more than an hour.
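
For illustration only, a sketch of the corresponding configuration change, assuming the test configuration sets the OST device through an OSTDEV1-style variable; the variable name and device paths below are placeholders:

    # Enumeration-dependent device name; /dev/sdc on the primary OSS may show
    # up as /dev/sdd on the failover OSS after a rescan or reboot:
    #OSTDEV1=/dev/sdc
    # Persistent by-id name; resolves to the same iSCSI LUN on both OSS nodes
    # ("xxxx" stands for the target-specific identifier):
    OSTDEV1=/dev/disk/by-id/iscsi-xxxx
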

Comment by Peter Jones [ 25/Sep/12 ]

I think that there is enough evidence to suggest that this is purely a testing issue. We should still continue to work on resolving it and include a fix in an RC2 if one is ready and we need one, but it would not warrant holding the release on its own merits.

Comment by Lai Siyao [ 26/Sep/12 ]

There is another issue with failover testing: only the logs from the failed node are collected on Maloo; the logs from the failover node are nowhere to be found.

Comment by Jian Yu [ 12/Oct/12 ]

Lustre Tag: v2_3_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/32
Distro/Arch: RHEL6.3/x86_64(server), RHEL5.8/x86_64(client)
Test Group: failover

The issue occurred again: https://maloo.whamcloud.com/test_sets/8f2e24c4-146b-11e2-af8d-52540035b04c

Comment by Andreas Dilger [ 12/Jun/13 ]

This has not been seen in testing since about 2012/11. Closing it because it is incorrectly being marked as the cause of many other failures, including LU-3453.
