[LU-1499] parallel-scale-nfsv4, subtest test_iorfpp Created: 10/Jun/12  Updated: 08/Jan/16  Resolved: 08/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.1.2, Lustre 2.1.4, Lustre 2.5.1, Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Incomplete Votes: 0
Labels: mn1
Environment:

Lustre Tag: v2_1_2_RC3
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/89/
Distro/Arch: RHEL5.8/x86_64


Severity: 3
Rank (Obsolete): 5424

 Description   

This issue was created by maloo for yujian <yujian@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/edb73380-b329-11e1-a6d7-52540035b04c.

The sub-test test_iorfpp failed with the following error:

** error **
ERROR in aiori-POSIX.c (line 256): transfer failed.
ERROR: Input/output error
** exiting **

parallel-scale-nfsv4 test_iorfpp: @@@@@@ FAIL: ior failed! 1

Info required for matching: parallel-scale-nfsv4 iorfpp

Console log on MDS showed that:

10:30:59:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test iorfpp: iorfpp ========================================================== 10:30:56 (1339349456)
10:31:19:LustreError: 11-0: an error occurred while communicating with 10.10.4.167@tcp. The ost_write operation failed with -107
10:31:19:LustreError: Skipped 2 previous similar messages
10:31:19:Lustre: lustre-OST0000-osc-ffff81006c110800: Connection to lustre-OST0000 (at 10.10.4.167@tcp) was lost; in progress operations using this service will wait for recovery to complete
10:31:19:Lustre: Skipped 2 previous similar messages
10:31:19:LustreError: 11-0: an error occurred while communicating with 10.10.4.167@tcp. The obd_ping operation failed with -107
10:31:20:LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
10:31:21:LustreError: 3215:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81006ae4dc00 x1404345569553508/t0(0) o4->lustre-OST0000-osc-ffff81006c110800@10.10.4.167@tcp:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
10:31:22:LustreError: 3215:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff810042141800 x1404345569553507/t0(0) o4->lustre-OST0000-osc-ffff81006c110800@10.10.4.167@tcp:6/4 lens 472/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
10:31:22:LustreError: 3215:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff810016181000 x1404345569553531/t0(0) o4->lustre-OST0000-osc-ffff81006c110800@10.10.4.167@tcp:6/4 lens 472/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
10:31:22:LustreError: 3215:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 22 previous similar messages
10:31:27:LustreError: 3217:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81005e151400 x1404345569553575/t0(0) o101->lustre-OST0000-osc-ffff81006c110800@10.10.4.167@tcp:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
10:31:27:LustreError: 3217:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
10:34:05:Lustre: DEBUG MARKER: parallel-scale-nfsv4 test_iorfpp: @@@@@@ FAIL: ior failed! 1

Console log on OSS showed that:

10:30:59:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test iorfpp: iorfpp ========================================================== 10:30:56 (1339349456)
10:31:16:Lustre: 8348:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1339349469/real 1339349469]  req@ffff81002bb25400 x1404345556224852/t0(0) o105->lustre-OST0000@10.10.4.166@tcp:15/16 lens 344/192 e 0 to 1 dl 1339349476 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
10:31:17:LustreError: 138-a: lustre-OST0000: A client on nid 10.10.4.166@tcp was evicted due to a lock completion callback time out: rc -107
10:31:21:Lustre: 8361:0:(ldlm_lib.c:933:target_handle_connect()) lustre-OST0000: connection from d2bf4c4f-6776-3492-f34a-d2b1bfc5d2b3@10.10.4.166@tcp t1460255 exp 0000000000000000 cur 1339349480 last 0
10:31:21:Lustre: 8361:0:(ldlm_lib.c:933:target_handle_connect()) Skipped 2 previous similar messages
10:34:04:Lustre: DEBUG MARKER: parallel-scale-nfsv4 test_iorfpp: @@@@@@ FAIL: ior failed! 1


 Comments   
Comment by Jian Yu [ 10/Jun/12 ]

The same test passed in another test session on the same distro/arch and Lustre build:
https://maloo.whamcloud.com/test_sets/76a71432-b32d-11e1-b1be-52540035b04c

Comment by Peter Jones [ 23/Jul/12 ]

Hongchao

Can you explain what is happening here?

Peter

Comment by Hongchao Zhang [ 23/Jul/12 ]

the client was evicted by lustre-OST0000 for it didn't response the lock completion callback, then the IOR failed with EIO
the logs was deleted from brent.whamcloud.com for it was over one month, then don't know why the client failed to repsonse
the completion ast, which should be okay and it is the blocking ast that often fails to reponse!

Comment by Peter Jones [ 31/Jul/12 ]

As per Sarah this has not occurred on recent 2.3 tags so dropping as a blocker

Comment by Jian Yu [ 22/Dec/12 ]

Lustre Tag: v2_1_4_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/164/
Distro/Arch: RHEL5.8/x86_64

The same issue occurred again: https://maloo.whamcloud.com/test_sets/0dea2186-4bfe-11e2-b603-52540035b04c

Comment by Hongchao Zhang [ 05/Jan/13 ]

there is no info about this test in its debug logs in https://maloo.whamcloud.com/test_sets/0dea2186-4bfe-11e2-b603-52540035b04c

00000001:02000400:0.0:1356150192.989830:0:21898:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: == parallel-scale-nfsv4 test connectathon: connectathon ============================================== 20:23:12 (1356150192)
00000001:02000400:0.0:1356150194.678210:0:21997:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark .\/runtests -N 10 -b -f \/mnt\/lustre\/d0.connectathon
00000001:02000400:0.0:1356150195.142271:0:22041:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: ./runtests -N 10 -b -f /mnt/lustre/d0.connectathon
00000001:02000400:0.0:1356150220.844645:0:22360:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark .\/runtests -N 10 -g -f \/mnt\/lustre\/d0.connectathon
00000001:02000400:0.0:1356150221.134746:0:22404:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: ./runtests -N 10 -g -f /mnt/lustre/d0.connectathon
00000001:02000400:0.0:1356150260.582849:0:26170:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark .\/runtests -N 10 -s -f \/mnt\/lustre\/d0.connectathon
00000001:02000400:0.0:1356150260.891751:0:26214:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: ./runtests -N 10 -s -f /mnt/lustre/d0.connectathon
00000001:02000400:0.0:1356150372.401565:0:26714:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: lctl set_param -n fail_loc=0 2>/dev/null || true
00000001:02000400:0.0:1356150375.034833:0:26943:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test iorssf: iorssf ========================================================== 20:26:14 (1356150374)
00000001:02000400:0.0:1356150375.440530:0:26987:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: == parallel-scale-nfsv4 test iorssf: iorssf ========================================================== 20:26:14 (1356150374)
00000001:02000400:0.0:1356152546.042780:0:27240:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: lctl set_param -n fail_loc=0 2>/dev/null || true
00000001:02000400:0.0:1356152549.325566:0:27469:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test iorfpp: iorfpp ========================================================== 21:02:29 (1356152549)
00000001:02000400:0.0:1356152549.729173:0:27513:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: == parallel-scale-nfsv4 test iorfpp: iorfpp ========================================================== 21:02:29 (1356152549)
00000001:02000400:0.0:1356152701.083852:0:27649:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark parallel-scale-nfsv4 test_iorfpp: @@@@@@ FAIL: ior failed! 1
00000001:02000400:0.0:1356152701.852008:0:27693:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: parallel-scale-nfsv4 test_iorfpp: @@@@@@ FAIL: ior failed! 1
00000001:02000400:0.0:1356152702.893441:0:27787:0:(debug.c:447:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2012-12-20/lustre-b2_1-el5-x86_64-ofa_164_-70041010976000-095156/parallel-scale-nfsv4.test_iorfpp.debug_log.$(hostname -s).1356152702.log;

And from the console log of OST, the client is evicted just after the iorfpp test began,
21:02:54:Lustre: 10135:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1356152565/real 1356152565] req@ffff810030989000 x1421990094479393/t0(0) o105->lustre-OST0003@10.10.4.198@tcp:15/16 lens 344/192 e 0 to 1 dl 1356152572 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
21:02:54:LustreError: 138-a: lustre-OST0003: A client on nid 10.10.4.198@tcp was evicted due to a lock completion callback time out: rc -107

Comment by Hongchao Zhang [ 18/Jan/13 ]

Hi YuJian,

Can the issue be reproduced? and is there new occurrence of this issue recently? Thanks

Comment by Jian Yu [ 18/Jan/13 ]

Hi Hongchao,

After Lustre 2.1.4 was released, there has been no testing performed on Lustre b2_1 branch because there has been no landing on that branch.

To try to reproduce/debug the issue, you can submit a patch against b2_1 branch with the following test parameters:

Test-Parameters: fortestonly envdefinitions=SLOW=yes,ENABLE_QUOTA=yes clientdistro=el5 serverdistro=el5 clientarch=x86_64 serverarch=x86_64 testlist=parallel-scale-nfsv4

For real patch, please remove "fortestonly".

Comment by Hongchao Zhang [ 01/Feb/13 ]

the timeout for the lock completion is 7s (1356152572 - 1356152572),

00000100:00000400:0.0:1356152572.131291:0:10135:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1356152565/real 1356152565] req@ffff810030989000 x1421990094479393/t0(0) o105->lustre-OST0003@10.10.4.198@tcp:15/16 lens 344/192 e 0 to 1 dl 1356152572 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

this timeout could be a little small for the test.

Comment by Andreas Dilger [ 23/Nov/13 ]

This test is still failing on a regular basis in full testing:
https://maloo.whamcloud.com/test_sets/7738cba4-5168-11e3-8300-52540035b04c
https://maloo.whamcloud.com/test_sets/e8283092-543e-11e3-9029-52540035b04c

Comment by Jian Yu [ 03/Jan/14 ]

Lustre client: http://build.whamcloud.com/job/lustre-b2_5/5/
Lustre server: http://build.whamcloud.com/job/lustre-b2_1/220/ (2.1.6)

parallel-scale test iorssf hit the same failure:
https://maloo.whamcloud.com/test_sets/581b9c0a-7441-11e3-a802-52540035b04c

Comment by Jian Yu [ 04/Jan/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/
Distro/Arch: RHEL6.4/x86_64
MDSCOUNT=2

The same failure occurred:
https://maloo.whamcloud.com/test_sets/d48a412a-749a-11e3-8b21-52540035b04c

Comment by Gerrit Updater [ 05/Aug/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15859
Subject: LU-1499 ldlm:debug patch to collect logs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d3560728b533e1f168baf9026ba0de0617ee73a7

Comment by John Fuchs-Chesney (Inactive) [ 08/Jan/16 ]

Incomplete and out of date.
~ jfc.

Generated at Sat Feb 10 01:17:10 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.