[LU-386] Test failure on test suite replay-single Created: 02/Jun/11  Updated: 07/Jul/11  Resolved: 07/Jul/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Attachments: File 65a.tar.gz     File 65alog.tar.gz     File 65alogs.tar.gz     Text File debug.log     File logs-2.tar.gz     File logs.tar.gz     Text File mds-dmsg.log     Text File tentative.patch     File test_65a     File test_65a-2     File test_65a_debug.diff    
Severity: 3
Rank (Obsolete): 4971

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/27ae3e58-8d47-11e0-aab9-52540025f9af.



 Comments   
Comment by Sarah Liu [ 02/Jun/11 ]

this one can be reproduced

Comment by Peter Jones [ 08/Jun/11 ]

Bobi

I understand that you believe that this is related to bz 10821. Could you please elaborate as to why?

Thanks

Peter

Comment by Zhenyu Xu [ 08/Jun/11 ]

It was a misunderstanding, I think. Someone had associated bug 10821 with this error, so I wanted to check it to see what history this issue has.

Comment by Zhenyu Xu [ 09/Jun/11 ]

Sarah,

Would you mind applying this debug patch, reproducing the issue, and uploading the /tmp/test_65a file that the debug patch generates, as well as the MDS debug logs? Thanks.

Comment by Sarah Liu [ 09/Jun/11 ]

sure, will do it tomorrow.

Comment by Zhenyu Xu [ 09/Jun/11 ]

Is debug.log the MDS log? I ask because I cannot find the following messages in it:

(fail.c:126:__cfs_fail_timeout_set()) cfs_fail_timeout id 50a sleeping for 6000ms
..
(fail.c:130:__cfs_fail_timeout_set()) cfs_fail_timeout id 50a awake

They should be there, since the test script sets:

do_facet $SINGLEMDS lctl set_param fail_val=$((${REQ_DELAY} * 1000))
#define OBD_FAIL_PTLRPC_PAUSE_REQ 0x50a
do_facet $SINGLEMDS $LCTL set_param fail_loc=0x8000050a
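
The arithmetic behind these settings can be sketched in shell. This is only an illustration under two assumptions that this ticket does not confirm: that the high bit 0x80000000 of fail_loc is a "fire once" flag layered on top of the 0x50a id, and that REQ_DELAY is 6 seconds (inferred from the 6000ms in the expected message). The helper names are hypothetical and are not part of the test framework.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of test-framework.sh.

# Assumption: bit 0x80000000 of fail_loc means "trigger only once".
FAIL_ONCE_MASK=0x80000000

# Print the fail id left after clearing the assumed fire-once bit.
fail_loc_id() {
    printf '0x%x\n' $(( $1 & ~$FAIL_ONCE_MASK ))
}

# Print 1 if the assumed fire-once bit is set, otherwise 0.
fail_loc_once() {
    echo $(( ($1 & $FAIL_ONCE_MASK) != 0 ))
}

# Convert a delay in seconds to the millisecond value written to fail_val.
delay_to_fail_val() {
    echo $(( $1 * 1000 ))
}

fail_loc_id 0x8000050a     # prints 0x50a, the OBD_FAIL_PTLRPC_PAUSE_REQ id
fail_loc_once 0x8000050a   # prints 1
delay_to_fail_val 6        # prints 6000 (REQ_DELAY=6 is an assumption)
```

If the assumptions hold, an MDS log captured while the fail_loc is armed should contain the "sleeping for 6000ms" line for id 50a.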

Comment by Sarah Liu [ 10/Jun/11 ]

It is the combined debug log of the MDS and OST. I can separate the MDS and OST logs and run the test again.

Comment by Zhenyu Xu [ 12/Jun/11 ]

It would be better to get -1 logs from the MDS side if possible (and the test_65a file from the client side as well).

Comment by Sarah Liu [ 13/Jun/11 ]

The test_65a file is already attached.

Comment by Zhenyu Xu [ 13/Jun/11 ]

In test_65a-2 I found that the client got an early reply:

00000100:00001000:3.0:1308021631.187436:0:9231:0:(events.c:140:reply_in_callback()) @@@ Early reply received: mlen=192 offset=0 replen=216 replied=0 unlinked=0 req@ffff81032c4a9400 x1371560011104309/t0(0) o-1->lustre-MDT0000_UUID@192.168.4.18@o2ib:30/10 lens 232/216 e 0 to 0 dl 1308021637 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1

but I don't know why the client didn't call ptlrpc_at_recv_early_reply().

As a last resort, please reproduce the issue and gather -1 logs from both the client and the MDS (so the test_65a file will also contain -1 logs from the client). You can zip and upload them.

Comment by Zhenyu Xu [ 15/Jun/11 ]

As the last comment shows (and the client log uploaded in logs.tar.gz confirms), the client consumes the early reply on the 30/10 portal pair, which is SEQ_METADATA_PORTAL/MDC_REPLY_PORTAL. My local test shows different portal usage (12/10, MDS_REQUEST_PORTAL/MDC_REPLY_PORTAL). I don't know the FID seq mechanism well, and I wonder whether the seq allocation handling relates to the issue.
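
The portal comparison above can be repeated on other log lines with a small shell sketch. It assumes the request/reply portal pair always appears as ":<req>/<rep> lens" after the target NID, as it does in the quoted line, and it maps only the three portal numbers named in this ticket. The function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only.

# Read debug log lines on stdin and print the "req/rep" portal pair.
# Assumes the pair is written as ":<req>/<rep> lens" after the NID.
portal_pair() {
    sed -n 's|.*:\([0-9][0-9]*\)/\([0-9][0-9]*\) lens .*|\1/\2|p'
}

# Map a portal number to its name; only the portals named in this ticket.
portal_name() {
    case "$1" in
        10) echo MDC_REPLY_PORTAL ;;
        12) echo MDS_REQUEST_PORTAL ;;
        30) echo SEQ_METADATA_PORTAL ;;
        *)  echo "unknown($1)" ;;
    esac
}

# Turn a "req/rep" pair into "REQ_NAME/REP_NAME".
describe_pair() {
    local req=${1%/*} rep=${1#*/}
    echo "$(portal_name "$req")/$(portal_name "$rep")"
}

# Excerpt of the early-reply line quoted in the previous comment.
line='o-1->lustre-MDT0000_UUID@192.168.4.18@o2ib:30/10 lens 232/216 e 0 to 0'
describe_pair "$(printf '%s\n' "$line" | portal_pair)"
```

Running the same pipeline over a log from a passing run would show whether that request went out on 12/10 instead.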

Comment by Zhenyu Xu [ 16/Jun/11 ]

would you mind trying this patch?

Comment by Sarah Liu [ 16/Jun/11 ]

waiting for build: http://review.whamcloud.com/#change,957

Comment by Sarah Liu [ 20/Jun/11 ]

Here is the maloo result with your patch:
https://maloo.whamcloud.com/test_sets/5f21a89e-9b79-11e0-9a27-52540025f9af

Please find the other logs in the attachments.

Comment by Zhenyu Xu [ 20/Jun/11 ]

I still need the client test_65 log.

Comment by Sarah Liu [ 21/Jun/11 ]

I've uploaded 65a.tar.gz, please check it.

Comment by Zhenyu Xu [ 24/Jun/11 ]

Sarah,

I cannot reserve any node, so I still need your help testing the latest patch set in http://review.whamcloud.com/#change,957
I need a -1 MDS log and a -1 test_65a_client log. (If possible, can the failure be reproduced by running just this single test case?)

TIA

Comment by Sarah Liu [ 24/Jun/11 ]

Yes, it is reproducible by running just test_65a. I will try this patch tomorrow and give you feedback.

Comment by Sarah Liu [ 24/Jun/11 ]

here is the log

Comment by Zhenyu Xu [ 28/Jun/11 ]

patch tracking at http://review.whamcloud.com/1025

Comment by Sarah Liu [ 30/Jun/11 ]

this patch works. https://maloo.whamcloud.com/test_sets/9232fa84-a2dc-11e0-aee5-52540025f9af

Comment by James A Simmons [ 06/Jul/11 ]

LU-320 also deals with this issue. The difference is that my patch moves away from sysctl. I recommend we move away from sysctl to using $LCTL set_param debug="other". Both test 65a and 65b have this issue.

Comment by Zhenyu Xu [ 06/Jul/11 ]

Updated the patch per James's suggestion.

Comment by Sarah Liu [ 06/Jul/11 ]

verified. https://maloo.whamcloud.com/test_sets/341a370c-a84d-11e0-bd2a-52540025f9af

Comment by Build Master (Inactive) [ 07/Jul/11 ]

Integrated in lustre-master #194 (14 build notifications, one per configuration):

  • x86_64,client,el5,inkernel
  • i686,client,el6,inkernel
  • i686,client,el5,inkernel
  • i686,client,el5,ofa
  • x86_64,client,el5,ofa
  • x86_64,server,el5,ofa
  • x86_64,server,el6,inkernel
  • i686,server,el6,inkernel
  • x86_64,client,sles11,inkernel
  • x86_64,client,el6,inkernel
  • x86_64,server,el5,inkernel
  • x86_64,client,ubuntu1004,inkernel
  • i686,server,el5,inkernel
  • i686,server,el5,ofa

LU-386 fix replay_single test_65a

Oleg Drokin : e0ee0aacd358893dad5c9f0da0dc19ba3ddf08a0
Files :

  • lustre/tests/replay-single.sh

Comment by Zhenyu Xu [ 07/Jul/11 ]

Landed on master for 2.1.0.

Generated at Sat Feb 10 01:06:30 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.