Details
-
Bug
-
Resolution: Cannot Reproduce
-
Minor
-
None
-
Lustre 2.3.0
-
None
-
3
-
4143
Description
This issue was created by maloo for yujian <yujian@whamcloud.com>
This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/ae7cd73a-f760-11e1-8b95-52540035b04c.
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/12
The sub-test test_101 failed with the following error:
test failed to respond and timed out
Info required for matching: recovery-small 101
MGS/MDS Nodes: fat-intel-3vm3(10.10.4.88), fat-intel-3vm7(10.10.4.92)
OSS Nodes: fat-intel-3vm4(10.10.4.89), fat-intel-3vm8(10.10.4.93)
Client Nodes: fat-intel-3vm1(10.10.4.86), fat-intel-3vm5(10.10.4.90), fat-intel-3vm6(10.10.4.91)
Syslog on OSS fat-intel-3vm8(10.10.4.93):
Sep 4 18:14:34 fat-intel-3vm8 rshd[4135]: root@fat-intel-3vm1.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "e2label /dev/lvm-OSS/P7 2>/dev/null");echo XXRETCODE:$?'
Sep 4 18:14:35 fat-intel-3vm8 kernel: Lustre: lustre-OST0006: Will be in recovery for at least 1:00, or until 4 clients reconnect
Sep 4 18:14:39 fat-intel-3vm8 kernel: Lustre: lustre-OST0006: Recovery over after 0:04, of 4 clients 4 recovered and 0 were evicted.
Sep 4 18:14:39 fat-intel-3vm8 kernel: Lustre: lustre-OST0006: received MDS connection from 10.10.4.92@tcp
Sep 4 18:18:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346807862/real 1346807862] req@ffff880037705800 x1412229540872552/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346807887 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 18:18:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Sep 4 18:27:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346808402/real 1346808402] req@ffff88007c723400 x1412229540872570/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346808427 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 18:27:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
Sep 4 18:37:37 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346809032/real 1346809032] req@ffff88007c723c00 x1412229540872591/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346809057 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 18:37:37 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
Sep 4 18:48:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346809662/real 1346809662] req@ffff8800709db800 x1412229540872612/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346809687 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 18:48:07 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
Sep 4 18:58:37 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346810292/real 1346810292] req@ffff880037705c00 x1412229540872633/t0(0) o250->MGC10.10.4.88@tcp@10.10.4.88@tcp:26/25 lens 400/544 e 0 to 1 dl 1346810317 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 18:58:37 fat-intel-3vm8 kernel: Lustre: 2075:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
Sep 4 19:04:34 fat-intel-3vm8 kernel: Lustre: lustre-OST0004: haven't heard from client 85be9246-99f9-3b19-5f81-ad80152fd420 (at 10.10.4.91@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff88007254fc00, cur 1346810674 expire 1346810644 last 1346810627
Sep 4 19:04:34 fat-intel-3vm8 kernel: Lustre: lustre-OST0004: haven't heard from client f75916a5-eaec-360e-dafc-5ebb1448c443 (at 10.10.4.90@tcp) in 34 seconds. I think it's dead, and I am evicting it. exp ffff88007598b000, cur 1346810674 expire 1346810644 last 1346810640
Sep 4 19:04:39 fat-intel-3vm8 kernel: Lustre: lustre-OST0000: haven't heard from client 85be9246-99f9-3b19-5f81-ad80152fd420 (at 10.10.4.91@tcp) in 52 seconds. I think it's dead, and I am evicting it. exp ffff8800732a6000, cur 1346810679 expire 1346810649 last 1346810627
Sep 4 19:04:39 fat-intel-3vm8 kernel: Lustre: Skipped 10 previous similar messages
Sep 4 19:04:40 fat-intel-3vm8 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4298439482, last ping 4298444482, now 4298449482
Sep 4 19:04:40 fat-intel-3vm8 kernel: connection1:0: detected conn error (1011)
Sep 4 19:04:41 fat-intel-3vm8 iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
Sep 4 19:05:38 fat-intel-3vm8 kernel: MTRR variable ranges enabled:
Sep 4 19:05:38 fat-intel-3vm8 kernel: 0 base 00E0000000 mask FFE0000000 uncachable
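The eviction messages in the OSS syslog carry three epoch timestamps (cur, expire, last), and in the entries above the reported "in N seconds" value equals cur minus last. The following is a minimal Python sketch for pulling those fields out of a pasted log when triaging similar runs. The names EVICT_RE and parse_evictions are hypothetical helpers written for this illustration and are not part of Lustre or its test framework; the message layout is assumed to match the lines shown above.

```python
import re

# Assumed layout, taken from the eviction lines in the OSS syslog above.
# DOTALL lets the match survive a message that wraps across lines.
EVICT_RE = re.compile(
    r"(?P<target>lustre-OST[0-9a-f]{4}): haven't heard from client "
    r"(?P<uuid>[0-9a-f-]+) \(at (?P<nid>[^)]+)\) in (?P<secs>\d+) seconds\."
    r".*?cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)",
    re.DOTALL,
)


def parse_evictions(log_text):
    """Return one dict per eviction message found in log_text."""
    results = []
    for m in EVICT_RE.finditer(log_text):
        cur, last = int(m["cur"]), int(m["last"])
        results.append({
            "target": m["target"],
            "nid": m["nid"],
            "reported_secs": int(m["secs"]),
            # Seconds between the last message from the client and the eviction.
            "silent_secs": cur - last,
        })
    return results


# Two entries copied from the OSS syslog in this ticket.
SAMPLE = (
    "Sep 4 19:04:34 fat-intel-3vm8 kernel: Lustre: lustre-OST0004: haven't heard "
    "from client 85be9246-99f9-3b19-5f81-ad80152fd420 (at 10.10.4.91@tcp) in 47 "
    "seconds. I think it's dead, and I am evicting it. exp ffff88007254fc00, "
    "cur 1346810674 expire 1346810644 last 1346810627 "
    "Sep 4 19:04:34 fat-intel-3vm8 kernel: Lustre: lustre-OST0004: haven't heard "
    "from client f75916a5-eaec-360e-dafc-5ebb1448c443 (at 10.10.4.90@tcp) in 34 "
    "seconds. I think it's dead, and I am evicting it. exp ffff88007598b000, "
    "cur 1346810674 expire 1346810644 last 1346810640"
)

if __name__ == "__main__":
    for ev in parse_evictions(SAMPLE):
        print(ev["target"], ev["nid"], ev["reported_secs"], ev["silent_secs"])
```

The sketch only reads fields that are printed in the message; it makes no claim about how the OSS computes the eviction deadline internally.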
Syslog on client fat-intel-3vm6(10.10.4.91):
Sep 4 18:03:35 fat-intel-3vm6 kernel: Lustre: DEBUG MARKER: == recovery-small test 101: IR: Make sure IR works w/o normal recovery == 18:03:35 (1346807015)
Sep 4 18:04:07 fat-intel-3vm6 kernel: Lustre: lustre-OST0001-osc-ffff8800749ac400: Connection to lustre-OST0001 (at 10.10.4.89@tcp) was lost; in progress operations using this service will wait for recovery to complete
Sep 4 18:04:07 fat-intel-3vm6 kernel: Lustre: Skipped 3 previous similar messages
Sep 4 18:05:07 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
Sep 4 18:06:17 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
Sep 4 18:06:17 fat-intel-3vm6 kernel: LustreError: Skipped 28 previous similar messages
Sep 4 18:06:24 fat-intel-3vm6 kernel: Lustre: lustre-OST0000-osc-ffff8800749ac400: Connection restored to lustre-OST0000 (at 10.10.4.93@tcp)
Sep 4 18:06:24 fat-intel-3vm6 kernel: Lustre: Skipped 6 previous similar messages
Sep 4 18:07:47 fat-intel-3vm6 kernel: Lustre: 2131:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1346807267/real 1346807267] req@ffff880074801800 x1412225718819426/t0(0) o8->lustre-OST0002-osc-ffff8800749ac400@10.10.4.89@tcp:28/4 lens 400/544 e 0 to 1 dl 1346807293 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 18:07:47 fat-intel-3vm6 kernel: Lustre: 2131:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 159 previous similar messages
Sep 4 18:08:27 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -11
Sep 4 18:08:27 fat-intel-3vm6 kernel: LustreError: Skipped 49 previous similar messages
Sep 4 18:13:02 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -11
Sep 4 18:13:02 fat-intel-3vm6 kernel: LustreError: Skipped 51 previous similar messages
Sep 4 19:04:28 fat-intel-3vm6 kernel: MTRR variable ranges enabled:
Sep 4 19:04:28 fat-intel-3vm6 kernel: 0 base 00E0000000 mask FFE0000000 uncachable
Syslog on client fat-intel-3vm5(10.10.4.90):
Sep 4 18:03:35 fat-intel-3vm5 kernel: Lustre: DEBUG MARKER: == recovery-small test 101: IR: Make sure IR works w/o normal recovery == 18:03:35 (1346807015)
Sep 4 18:04:10 fat-intel-3vm5 kernel: Lustre: lustre-OST0001-osc-ffff880037556c00: Connection to lustre-OST0001 (at 10.10.4.89@tcp) was lost; in progress operations using this service will wait for recovery to complete
Sep 4 18:04:10 fat-intel-3vm5 kernel: Lustre: Skipped 3 previous similar messages
Sep 4 18:05:20 fat-intel-3vm5 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
Sep 4 18:06:24 fat-intel-3vm5 kernel: Lustre: lustre-OST0000-osc-ffff880037556c00: Connection restored to lustre-OST0000 (at 10.10.4.93@tcp)
Sep 4 18:06:24 fat-intel-3vm5 kernel: Lustre: Skipped 6 previous similar messages
Sep 4 18:07:30 fat-intel-3vm5 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
Sep 4 18:07:30 fat-intel-3vm5 kernel: LustreError: Skipped 37 previous similar messages
Sep 4 18:10:06 fat-intel-3vm5 kernel: Lustre: 2138:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346807380/real 1346807380] req@ffff8800794cd000 x1412225718819590/t0(0) o8->lustre-OST0003-osc-ffff880037556c00@10.10.4.89@tcp:28/4 lens 400/544 e 0 to 1 dl 1346807406 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 18:10:06 fat-intel-3vm5 kernel: Lustre: 2138:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 173 previous similar messages
Sep 4 18:12:20 fat-intel-3vm5 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
Sep 4 18:12:20 fat-intel-3vm5 kernel: LustreError: Skipped 70 previous similar messages
Sep 4 19:04:42 fat-intel-3vm5 kernel: MTRR variable ranges enabled:
Sep 4 19:04:42 fat-intel-3vm5 kernel: 0 base 00E0000000 mask FFE0000000 uncachable
Syslog on client fat-intel-3vm1(10.10.4.86):
Sep 4 18:03:35 fat-intel-3vm1 kernel: Lustre: DEBUG MARKER: == recovery-small test 101: IR: Make sure IR works w/o normal recovery == 18:03:35 (1346807015)
Sep 4 18:05:08 fat-intel-3vm1 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19
Sep 4 18:05:08 fat-intel-3vm1 kernel: LustreError: Skipped 6 previous similar messages
Sep 4 18:06:54 fat-intel-3vm1 kernel: Lustre: 3295:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346807188/real 1346807188] req@ffff88002a4e5000 x1412225453747374/t0(0) o8->lustre-OST0005-osc-ffff880047d0f000@10.10.4.89@tcp:28/4 lens 400/544 e 0 to 1 dl 1346807213 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 18:06:54 fat-intel-3vm1 kernel: Lustre: 3295:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 119 previous similar messages
Sep 4 18:11:35 fat-intel-3vm1 kernel: Lustre: lustre-OST0004-osc-ffff880047d0f000: Connection restored to lustre-OST0004 (at 10.10.4.93@tcp)
Sep 4 18:11:35 fat-intel-3vm1 kernel: Lustre: Skipped 5 previous similar messages
Sep 4 19:05:09 fat-intel-3vm1 kernel: MTRR variable ranges enabled:
Sep 4 19:05:09 fat-intel-3vm1 kernel: 0 base 00E0000000 mask FFE0000000 uncachable
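The client syslogs report ost_connect failing with -19 and -11, which on Linux are the negated errno values ENODEV and EAGAIN. A minimal Python sketch for tallying such failures from a pasted client log follows. FAIL_RE, LINUX_ERRNO_NAMES, and tally_failures are hypothetical names chosen for this illustration; the errno table is hard-coded for the two values seen in this ticket rather than taken from the host platform.

```python
import re
from collections import Counter

# Linux errno names for the two return codes that appear in this ticket.
LINUX_ERRNO_NAMES = {11: "EAGAIN", 19: "ENODEV"}

# Matches the tail of the LustreError 11-0 message as shown in the logs above.
FAIL_RE = re.compile(r"The (\w+) operation failed with (-?\d+)")


def tally_failures(log_text):
    """Count logged failures per (operation, return code, errno name)."""
    counts = Counter()
    for op, rc in FAIL_RE.findall(log_text):
        rc = int(rc)
        name = LINUX_ERRNO_NAMES.get(abs(rc), "unknown")
        counts[(op, rc, name)] += 1
    return counts


# Three entries copied from the fat-intel-3vm6 client syslog in this ticket.
SAMPLE = (
    "Sep 4 18:05:07 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred "
    "while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19 "
    "Sep 4 18:06:17 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred "
    "while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -19 "
    "Sep 4 18:08:27 fat-intel-3vm6 kernel: LustreError: 11-0: an error occurred "
    "while communicating with 10.10.4.93@tcp. The ost_connect operation failed with -11"
)

if __name__ == "__main__":
    for (op, rc, name), n in sorted(tally_failures(SAMPLE).items()):
        print(op, rc, name, n)
```

The tally covers only the messages that were actually printed; occurrences folded into "Skipped N previous similar messages" lines are not counted.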
This has not been seen in testing since about 2012/11. Closing it because it is incorrectly being marked as the cause of many other failures, including LU-3453.