
Test failure on test suite recovery-small, subtest test_101

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Affects Version/s: Lustre 2.3.0

    Description

      This issue was created by maloo for yujian <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/ae7cd73a-f760-11e1-8b95-52540035b04c.

      Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/12

      The sub-test test_101 failed with the following error:

      test failed to respond and timed out

      Info required for matching: recovery-small 101

      MGS/MDS Nodes: fat-intel-3vm3(10.10.4.88), fat-intel-3vm7(10.10.4.92)
      OSS Nodes: fat-intel-3vm4(10.10.4.89), fat-intel-3vm8(10.10.4.93)
      Client Nodes: fat-intel-3vm1(10.10.4.86), fat-intel-3vm5(10.10.4.90), fat-intel-3vm6(10.10.4.91)

      Syslog on OSS fat-intel-3vm8(10.10.4.93):

      Sep  4 18:14:34 fat-intel-3vm8 rshd[4135]: root@fat-intel-3vm1.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "e2label /dev/lvm-OSS/P7 2>/dev/null");echo XXRETCODE:$?'
      Sep  4 18:14:35 fat-intel-3vm8 kernel: Lustre: lustre-OST0006: Will be in recovery for at least 1:00, or until 4 clients reconnect
      Sep  4 18:14:39 fat-intel-3vm8 kernel: Lustre: lustre-OST0006: Recovery over after 0:04, of 4 clients 4 recovered and 0 were evicted.
      Sep  4 18:14:39 fat-intel-3vm8 kernel: Lustre: lustre-OST0006: received MDS connection from 10.10.4.92@tcp
      Sep  4 18:18:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346807862/real 1346807862]  req@ffff880037705800 x1412229540872552/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346807887 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Sep  4 18:18:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
      Sep  4 18:27:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346808402/real 1346808402]  req@ffff88007c723400 x1412229540872570/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346808427 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Sep  4 18:27:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
      Sep  4 18:37:37 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346809032/real 1346809032]  req@ffff88007c723c00 x1412229540872591/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346809057 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Sep  4 18:37:37 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
      Sep  4 18:48:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346809662/real 1346809662]  req@ffff8800709db800 x1412229540872612/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346809687 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Sep  4 18:48:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
      Sep  4 18:58:37 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346810292/real 1346810292]  req@ffff880037705c00 x1412229540872633/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346810317 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Sep  4 18:58:37 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
      Sep  4 19:04:34 fat-intel-3vm8 kernel: Lustre: lustre-OST0004: haven't heard from client 85be9246-99f9-3b19-5f81-ad80152fd420 (at 10.10.4.91@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff88007254fc00, cur 1346810674 expire 1346810644 last 1346810627
      Sep  4 19:04:34 fat-intel-3vm8 kernel: Lustre: lustre-OST0004: haven't heard from client f75916a5-eaec-360e-dafc-5ebb1448c443 (at 10.10.4.90@tcp) in 34 seconds. I think it's dead, and I am evicting it. exp ffff88007598b000, cur 1346810674 expire 1346810644 last 1346810640
      Sep  4 19:04:39 fat-intel-3vm8 kernel: Lustre: lustre-OST0000: haven't heard from client 85be9246-99f9-3b19-5f81-ad80152fd420 (at 10.10.4.91@tcp) in 52 seconds. I think it's dead, and I am evicting it. exp ffff8800732a6000, cur 1346810679 expire 1346810649 last 1346810627
      Sep  4 19:04:39 fat-intel-3vm8 kernel: Lustre: Skipped 10 previous similar messages
      Sep  4 19:04:40 fat-intel-3vm8 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298439482, last ping 4298444482, now 4298449482
      Sep  4 19:04:40 fat-intel-3vm8 kernel: connection1:0: detected conn error (1011)
      Sep  4 19:04:41 fat-intel-3vm8 iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
      Sep  4 19:05:38 fat-intel-3vm8 kernel: MTRR variable ranges enabled:
      Sep  4 19:05:38 fat-intel-3vm8 kernel:  0 base 00E0000000 mask FFE0000000 uncachable
      

      Syslog on client fat-intel-3vm6(10.10.4.91):

      Sep  4 18:03:35 fat-intel-3vm6 kernel: Lustre: DEBUG MARKER: == recovery-small test 101: IR: Make sure IR works w/o normal recovery == 18:03:35 (1346807015)
      Sep  4 18:04:07 fat-intel-3vm6 kernel: Lustre: lustre-OST0001-osc-ffff8800749ac400: Connection to lustre-OST0001 (at 10.10.4.89@tcp) was lost; in progress operations using this service will wait for recovery to complete
      Sep  4 18:04:07 fat-intel-3vm6 kernel: Lustre: Skipped 3 previous similar messages
      Sep  4 18:05:07 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
      Sep  4 18:06:17 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
      Sep  4 18:06:17 fat-intel-3vm6 kernel: LustreError: Skipped 28 previous similar messages
      Sep  4 18:06:24 fat-intel-3vm6 kernel: Lustre: lustre-OST0000-osc-ffff8800749ac400: Connection restored to lustre-OST0000 (at 10.10.4.93@tcp)
      Sep  4 18:06:24 fat-intel-3vm6 kernel: Lustre: Skipped 6 previous similar messages
      Sep  4 18:07:47 fat-intel-3vm6 kernel: Lustre: 2131:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has failed due to network error: [sent 1346807267/real 1346807267]  req@ffff880074801800 x1412225718819426/t0(0) o8->lustre-OST0002-osc-ffff8800749ac400@10.10.4.89@tcp:28/4 lens 400/544 e 0 to 1 dl 1346807293 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Sep  4 18:07:47 fat-intel-3vm6 kernel: Lustre: 2131:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 159 previous similar messages
      Sep  4 18:08:27 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -11
      Sep  4 18:08:27 fat-intel-3vm6 kernel: LustreError: Skipped 49 previous similar messages
      Sep  4 18:13:02 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -11
      Sep  4 18:13:02 fat-intel-3vm6 kernel: LustreError: Skipped 51 previous similar messages
      Sep  4 19:04:28 fat-intel-3vm6 kernel: MTRR variable ranges enabled:
      Sep  4 19:04:28 fat-intel-3vm6 kernel:  0 base 00E0000000 mask FFE0000000 uncachable
      

      Syslog on client fat-intel-3vm5(10.10.4.90):

      Sep  4 18:03:35 fat-intel-3vm5 kernel: Lustre: DEBUG MARKER: == recovery-small test 101: IR: Make sure IR works w/o normal recovery == 18:03:35 (1346807015)
      Sep  4 18:04:10 fat-intel-3vm5 kernel: Lustre: lustre-OST0001-osc-ffff880037556c00: Connection to lustre-OST0001 (at 10.10.4.89@tcp) was lost; in progress operations using this service will wait for recovery to complete
      Sep  4 18:04:10 fat-intel-3vm5 kernel: Lustre: Skipped 3 previous similar messages
      Sep  4 18:05:20 fat-intel-3vm5 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
      Sep  4 18:06:24 fat-intel-3vm5 kernel: Lustre: lustre-OST0000-osc-ffff880037556c00: Connection restored to lustre-OST0000 (at 10.10.4.93@tcp)
      Sep  4 18:06:24 fat-intel-3vm5 kernel: Lustre: Skipped 6 previous similar messages
      Sep  4 18:07:30 fat-intel-3vm5 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
      Sep  4 18:07:30 fat-intel-3vm5 kernel: LustreError: Skipped 37 previous similar messages
      Sep  4 18:10:06 fat-intel-3vm5 kernel: Lustre: 2138:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346807380/real 1346807380]  req@ffff8800794cd000 x1412225718819590/t0(0) o8->lustre-OST0003-osc-ffff880037556c00@10.10.4.89@tcp:28/4 lens 400/544 e 0 to 1 dl 1346807406 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Sep  4 18:10:06 fat-intel-3vm5 kernel: Lustre: 2138:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 173 previous similar messages
      Sep  4 18:12:20 fat-intel-3vm5 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
      Sep  4 18:12:20 fat-intel-3vm5 kernel: LustreError: Skipped 70 previous similar messages
      Sep  4 19:04:42 fat-intel-3vm5 kernel: MTRR variable ranges enabled:
      Sep  4 19:04:42 fat-intel-3vm5 kernel:  0 base 00E0000000 mask FFE0000000 uncachable
      

      Syslog on client fat-intel-3vm1(10.10.4.86):

      Sep  4 18:03:35 fat-intel-3vm1 kernel: Lustre: DEBUG MARKER: == recovery-small test 101: IR: Make sure IR works w/o normal recovery == 18:03:35 (1346807015)
      Sep  4 18:05:08 fat-intel-3vm1 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
      Sep  4 18:05:08 fat-intel-3vm1 kernel: LustreError: Skipped 6 previous similar messages
      Sep  4 18:06:54 fat-intel-3vm1 kernel: Lustre: 3295:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346807188/real 1346807188]  req@ffff88002a4e5000 x1412225453747374/t0(0) o8->lustre-OST0005-osc-ffff880047d0f000@10.10.4.89@tcp:28/4 lens 400/544 e 0 to 1 dl 1346807213 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Sep  4 18:06:54 fat-intel-3vm1 kernel: Lustre: 3295:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 119 previous similar messages
      Sep  4 18:11:35 fat-intel-3vm1 kernel: Lustre: lustre-OST0004-osc-ffff880047d0f000: Connection restored to lustre-OST0004 (at 10.10.4.93@tcp)
      Sep  4 18:11:35 fat-intel-3vm1 kernel: Lustre: Skipped 5 previous similar messages
      Sep  4 19:05:09 fat-intel-3vm1 kernel: MTRR variable ranges enabled:
      Sep  4 19:05:09 fat-intel-3vm1 kernel:  0 base 00E0000000 mask FFE0000000 uncachable
      

      Attachments

        Activity

          [LU-1890] Test failure on test suite recovery-small, subtest test_101

          adilger Andreas Dilger added a comment -

          This is no longer being seen in testing, since about 2012/11. Closing it because it is incorrectly being marked as the cause of many other failures, including LU-3453.
          yujian Jian Yu added a comment -

          Lustre Tag: v2_3_0_RC2
          Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/32
          Distro/Arch: RHEL6.3/x86_64(server), RHEL5.8/x86_64(client)
          Test Group: failover

          The issue occurred again: https://maloo.whamcloud.com/test_sets/8f2e24c4-146b-11e2-af8d-52540035b04c

          laisiyao Lai Siyao added a comment -

          There is another issue with failover testing: only the logs on the failed node are collected on Maloo; the logs of the failover node are nowhere to be found.

          pjones Peter Jones added a comment -

          I think that there is enough evidence to suggest that this is purely a testing issue. We should still continue to work on resolving it and include a fix in an RC2 if one is ready and we need one, but it would not warrant holding the release on its own merits.

          laisiyao Lai Siyao added a comment -

          Chris, could you describe how the shared disk is configured for the failover test?

          In my manual test, the cause of the failure is that I used a /dev/sdX-style device (an iSCSI disk) as the OST device, but the device ordering tends to change across nodes, i.e. /dev/sdc on the OST node may be /dev/sdd on the failover OST node. After failover the two nodes therefore access different devices, so the client is not able to connect. After switching to /dev/disk/by-id/iscsi-xxxx to access the shared disk, I have been running this test successfully for more than an hour now.

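          For illustration, a minimal sketch of the stable-device-path approach described above, assuming a typical lustre/tests configuration file such as cfg/local.sh; the by-id name below is a made-up example, and the real one comes from listing /dev/disk/by-id/ on the OSS nodes:

          # Refer to the shared iSCSI LUN by its persistent by-id path so the primary
          # and failover OSS resolve the same device regardless of /dev/sdX ordering.
          OSTDEV1="/dev/disk/by-id/iscsi-iqn.2012-01.com.example:ost1-lun-0"
          # Sanity check on both the primary and the failover OSS before formatting:
          #   ls -l "$OSTDEV1"    # must resolve to the same LUN on both nodes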

          jay Jinshan Xiong (Inactive) added a comment -

          Siyao, can you please check whether this is because the client data was not written to the persistent storage of last_rcvd, so the OST thought it was a new client and denied the reconnect? I have seen this kind of issue several times, and if that is the case, we can fix it by issuing a sync I/O before failing the OST.
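          A minimal sketch of that sync-before-failover idea, assuming the standard lustre/tests/test-framework.sh helpers (do_facet, fail) have been sourced; exactly where this would sit in recovery-small.sh test_101 is an assumption:

          # Flush dirty data (including last_rcvd updates) to disk on the OSS
          # before triggering failover, so the reconnecting client is still
          # present in the on-disk client table.
          do_facet ost1 sync
          sleep 2          # give the commit a moment to complete
          fail ost1        # hypothetical placement; the real test drives failover itself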
          laisiyao Lai Siyao added a comment -

          Jinshan, the test log shows that after OST failover the client is denied reconnection, while the MDS connects successfully. Do you have any clue?

          yujian Jian Yu added a comment -

          Here is a link to the historical Maloo reports for recovery-small test 101 in failover test group:
          http://tinyurl.com/8adbnzz

          As we can see:

          • before 2012-05-31, the test passed at times, but not with the failover configuration
          • between 2012-05-31 and 2012-08-20, the test was run with the failover configuration but hit TT-765
          • after TT-765 was fixed, LU-1890 has been occurring regularly since 2012-09-05
          laisiyao Lai Siyao added a comment -

          I reproduced this in my test environment. This is an Imperative Recovery test; the client is not able to reconnect within the recovery window and finally fails. I'm not yet clear on the cause and am collecting more logs to debug.

          yujian Jian Yu added a comment -

          When I ran this single subtest 101 in a loop, it always passed. Now I'm running the full test suite to try to reproduce it. The log shows the MDS suddenly rebooted but doesn't give many details; is that right, Yujian?

          For report https://maloo.whamcloud.com/test_sets/ae7cd73a-f760-11e1-8b95-52540035b04c, I could only find the attached console log for the MDS/MGS fat-intel-3vm7(10.10.4.92) in the autotest logdir on the brent node.

          The issue only occurred while running the test under the failover configuration. So, if you want to reproduce or debug it, you can upload a patch to Gerrit with the following commit message:

          Test-Parameters: fortestonly envdefinitions=SLOW=yes \
          clientcount=4 osscount=2 mdscount=2 austeroptions=-R \
          failover=true useiscsi=true testlist=recovery-small
          
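          For local reproduction outside autotest, a hedged sketch of the kind of loop mentioned above, assuming an already-configured lustre/tests environment; the path and iteration count are examples, and ONLY= restricts the run to subtest 101:

          cd /usr/lib64/lustre/tests
          for i in $(seq 1 20); do
              ONLY=101 sh recovery-small.sh || { echo "failed on iteration $i"; break; }
          done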

          People

            Assignee: Lai Siyao (laisiyao)
            Reporter: Maloo (maloo)
            Votes: 0
            Watchers: 8
