[LU-1499] parallel-scale-nfsv4, subtest test_iorfpp Created: 10/Jun/12 Updated: 08/Jan/16 Resolved: 08/Jan/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.1.2, Lustre 2.1.4, Lustre 2.5.1, Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | mn1 | ||
| Environment: |
Lustre Tag: v2_1_2_RC3 |
||
| Severity: | 3 |
| Rank (Obsolete): | 5424 |
| Description |
|
This issue was created by maloo for yujian <yujian@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/edb73380-b329-11e1-a6d7-52540035b04c.

The sub-test test_iorfpp failed with the following error:

** error **
ERROR in aiori-POSIX.c (line 256): transfer failed. ERROR: Input/output error
** exiting **
parallel-scale-nfsv4 test_iorfpp: @@@@@@ FAIL: ior failed! 1

Info required for matching: parallel-scale-nfsv4 iorfpp

Console log on MDS showed that:

10:30:59:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test iorfpp: iorfpp ========================================================== 10:30:56 (1339349456)
10:31:19:LustreError: 11-0: an error occurred while communicating with 10.10.4.167@tcp. The ost_write operation failed with -107
10:31:19:LustreError: Skipped 2 previous similar messages
10:31:19:Lustre: lustre-OST0000-osc-ffff81006c110800: Connection to lustre-OST0000 (at 10.10.4.167@tcp) was lost; in progress operations using this service will wait for recovery to complete
10:31:19:Lustre: Skipped 2 previous similar messages
10:31:19:LustreError: 11-0: an error occurred while communicating with 10.10.4.167@tcp. The obd_ping operation failed with -107
10:31:20:LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
10:31:21:LustreError: 3215:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81006ae4dc00 x1404345569553508/t0(0) o4->lustre-OST0000-osc-ffff81006c110800@10.10.4.167@tcp:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
10:31:22:LustreError: 3215:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff810042141800 x1404345569553507/t0(0) o4->lustre-OST0000-osc-ffff81006c110800@10.10.4.167@tcp:6/4 lens 472/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
10:31:22:LustreError: 3215:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff810016181000 x1404345569553531/t0(0) o4->lustre-OST0000-osc-ffff81006c110800@10.10.4.167@tcp:6/4 lens 472/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
10:31:22:LustreError: 3215:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 22 previous similar messages
10:31:27:LustreError: 3217:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81005e151400 x1404345569553575/t0(0) o101->lustre-OST0000-osc-ffff81006c110800@10.10.4.167@tcp:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
10:31:27:LustreError: 3217:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
10:34:05:Lustre: DEBUG MARKER: parallel-scale-nfsv4 test_iorfpp: @@@@@@ FAIL: ior failed! 1

Console log on OSS showed that:

10:30:59:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test iorfpp: iorfpp ========================================================== 10:30:56 (1339349456)
10:31:16:Lustre: 8348:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1339349469/real 1339349469] req@ffff81002bb25400 x1404345556224852/t0(0) o105->lustre-OST0000@10.10.4.166@tcp:15/16 lens 344/192 e 0 to 1 dl 1339349476 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
10:31:17:LustreError: 138-a: lustre-OST0000: A client on nid 10.10.4.166@tcp was evicted due to a lock completion callback time out: rc -107
10:31:21:Lustre: 8361:0:(ldlm_lib.c:933:target_handle_connect()) lustre-OST0000: connection from d2bf4c4f-6776-3492-f34a-d2b1bfc5d2b3@10.10.4.166@tcp t1460255 exp 0000000000000000 cur 1339349480 last 0
10:31:21:Lustre: 8361:0:(ldlm_lib.c:933:target_handle_connect()) Skipped 2 previous similar messages
10:34:04:Lustre: DEBUG MARKER: parallel-scale-nfsv4 test_iorfpp: @@@@@@ FAIL: ior failed! 1 |
| Comments |
| Comment by Jian Yu [ 10/Jun/12 ] |
|
The same test passed in another test session on the same distro/arch and Lustre build: |
| Comment by Peter Jones [ 23/Jul/12 ] |
|
Hongchao, can you explain what is happening here? Peter |
| Comment by Hongchao Zhang [ 23/Jul/12 ] |
|
The client was evicted by lustre-OST0000 because it did not respond to the lock completion callback; the IOR run then failed with EIO. |
| Comment by Peter Jones [ 31/Jul/12 ] |
|
As per Sarah, this has not occurred on recent 2.3 tags, so dropping this as a blocker. |
| Comment by Jian Yu [ 22/Dec/12 ] |
|
Lustre Tag: v2_1_4_RC2

The same issue occurred again: https://maloo.whamcloud.com/test_sets/0dea2186-4bfe-11e2-b603-52540035b04c |
| Comment by Hongchao Zhang [ 05/Jan/13 ] |
|
There is no information about this test in the debug logs in https://maloo.whamcloud.com/test_sets/0dea2186-4bfe-11e2-b603-52540035b04c:

00000001:02000400:0.0:1356150192.989830:0:21898:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: == parallel-scale-nfsv4 test connectathon: connectathon ============================================== 20:23:12 (1356150192)

And from the console log of the OST, the client was evicted just after the iorfpp test began. |
| Comment by Hongchao Zhang [ 18/Jan/13 ] |
|
Hi YuJian, can the issue still be reproduced? And have there been any new occurrences of this issue recently? Thanks |
| Comment by Jian Yu [ 18/Jan/13 ] |
|
Hi Hongchao, since Lustre 2.1.4 was released, no testing has been performed on the Lustre b2_1 branch because nothing has landed on that branch. To try to reproduce/debug the issue, you can submit a patch against the b2_1 branch with the following test parameters:

Test-Parameters: fortestonly envdefinitions=SLOW=yes,ENABLE_QUOTA=yes clientdistro=el5 serverdistro=el5 clientarch=x86_64 serverarch=x86_64 testlist=parallel-scale-nfsv4

For a real patch, please remove "fortestonly". |
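For context, the Test-Parameters line above is placed in the Gerrit commit message of the test patch. A minimal sketch of what such a test-only commit message might look like (the subject line and body text are placeholders; only the Test-Parameters line is taken from the comment above):

```
LU-1499 tests: attempt to reproduce iorfpp failure on b2_1

Test-only patch to rerun parallel-scale-nfsv4 on el5/x86_64
clients and servers, trying to reproduce the client eviction
seen in test_iorfpp.

Test-Parameters: fortestonly envdefinitions=SLOW=yes,ENABLE_QUOTA=yes clientdistro=el5 serverdistro=el5 clientarch=x86_64 serverarch=x86_64 testlist=parallel-scale-nfsv4
```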
| Comment by Hongchao Zhang [ 01/Feb/13 ] |
|
The timeout for the lock completion callback was 7 seconds (1356152572 - 1356152565):

00000100:00000400:0.0:1356152572.131291:0:10135:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1356152565/real 1356152565] req@ffff810030989000 x1421990094479393/t0(0) o105->lustre-OST0003@10.10.4.198@tcp:15/16 lens 344/192 e 0 to 1 dl 1356152572 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

This timeout could be a little too small for the test. |
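The 7-second window comes straight from the `sent` and `dl` (deadline) timestamps in the ptlrpc log line above; a quick sanity check of the arithmetic:

```python
# Timestamps copied from the ptlrpc_expire_one_request() message above.
sent = 1356152565      # "[sent 1356152565/real 1356152565]"
deadline = 1356152572  # "dl 1356152572"

# The callback had deadline - sent seconds before the request expired.
timeout = deadline - sent
print(timeout)  # 7
```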
| Comment by Andreas Dilger [ 23/Nov/13 ] |
|
This test is still failing on a regular basis in full testing: |
| Comment by Jian Yu [ 03/Jan/14 ] |
|
Lustre client build: http://build.whamcloud.com/job/lustre-b2_5/5/

parallel-scale test iorssf hit the same failure: |
| Comment by Jian Yu [ 04/Jan/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/ The same failure occurred: |
| Comment by Gerrit Updater [ 05/Aug/15 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15859 |
| Comment by John Fuchs-Chesney (Inactive) [ 08/Jan/16 ] |
|
Incomplete and out of date. |