[LU-1793] Test failure on test suite replay-single, subtest test_44a Created: 27/Aug/12  Updated: 21/Dec/12  Resolved: 20/Dec/12

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Hongchao Zhang
Resolution: Duplicate
Votes: 0
Labels: LB

Issue Links:
Duplicate
duplicates LU-2275 Open request leak Resolved
Severity: 3
Rank (Obsolete): 4081

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c0bd39b0-ee84-11e1-9426-52540035b04c.

The sub-test test_44a failed with the following error:

test_44a failed with 3

 == replay-single test 44a: race in target handle connect ============================================= 21:37:30 (1345783050)
3 13=3 13
 replay-single test_44a: @@@@@@ FAIL: test_44a failed with 3 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:3638:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:3660:error()
  = /usr/lib64/lustre/tests/test-framework.sh:3893:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:3922:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:3796:run_test()
  = /usr/lib64/lustre/tests/replay-single.sh:922:main()
Dumping lctl log to /logdir/test_logs/2012-08-23/lustre-b2_3-el6-x86_64-el5-x86_64__3__-7fad99b000d8/replay-single.test_44a.*.1345783052.log
CMD: client-11vm1.lab.whamcloud.com,client-11vm2,client-11vm3,client-11vm4 /usr/sbin/lctl dk > /logdir/test_logs/2012-08-23/lustre-b2_3-el6-x86_64-el5-x86_64__3__-7fad99b000d8/replay-single.test_44a.debug_log.\$(hostname -s).1345783052.log;
         dmesg > /logdir/test_logs/2012-08-23/lustre-b2_3-el6-x86_64-el5-x86_64__3__-7fad99b000d8/replay-single.test_44a.dmesg.\$(hostname -s).1345783052.log


 Comments   
Comment by Sarah Liu [ 10/Sep/12 ]

another failure: https://maloo.whamcloud.com/test_sets/68aa2ccc-f683-11e1-8eb0-52540035b04c

Comment by Sarah Liu [ 24/Sep/12 ]

server/client lustre-b2_3-RC1 RHEL6

https://maloo.whamcloud.com/test_sets/b0389fb2-0653-11e2-9b17-52540035b04c

Comment by Peter Jones [ 25/Sep/12 ]

Hongchao

Could you please comment on this one?

Thanks

Peter

Comment by Hongchao Zhang [ 25/Sep/12 ]

In test_44a:
...
mdcdev=`lctl get_param -n devices | awk '/MDT0000-mdc/ {print $1}'`
[ "$mdcdev" ] || return 2
[ $(echo $mdcdev | wc -w) -eq 1 ] ||
	{ echo $mdcdev=$mdcdev && return 3; }
...

More than one MDC was found! We'd better print "$mdcdev" to see what these MDCs are.

the patch is tracked at http://review.whamcloud.com/#change,4088
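
For illustration only, here is a minimal sketch of the kind of debug output suggested above. It is not the content of the tracked patch. The helper names list_mdc_devices and check_single_mdc are hypothetical, and the sketch assumes the device listing has one device per line with the device number in the first column. The "3 13=3 13" line in the failure output suggests that two device numbers (3 and 13) matched, so printing the full matching lines rather than only the numbers should show which MDCs they are.

```shell
# Hypothetical helper: print the full device lines that match MDT0000-mdc
# from a device listing read on stdin (not only the device numbers).
list_mdc_devices() {
    awk '/MDT0000-mdc/ { print }'
}

# Hypothetical helper: read a device listing on stdin.
# Returns 2 if no MDC matches, 3 if more than one matches (after printing
# every matching line), and prints the single device number otherwise.
check_single_mdc() {
    local listing matches count
    listing=$(cat)
    matches=$(printf '%s\n' "$listing" | list_mdc_devices)
    [ -n "$matches" ] || return 2
    count=$(printf '%s\n' "$matches" | awk 'END { print NR }')
    if [ "$count" -ne 1 ]; then
        echo "expected 1 MDC device, found $count:"
        printf '%s\n' "$matches"
        return 3
    fi
    printf '%s\n' "$matches" | awk '{ print $1 }'
}
```

In the test this could be driven as `lctl get_param -n devices | check_single_mdc`, which keeps the existing return codes 2 and 3 while logging the offending device lines.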

Comment by Sarah Liu [ 25/Sep/12 ]

lustre-b2_3-RC1 OFED build
https://maloo.whamcloud.com/test_sets/79e86a92-0756-11e2-ac99-52540035b04c

Comment by Jian Yu [ 10/Oct/12 ]

Lustre Tag: v2_3_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/32
Distro/Arch: RHEL6.3/x86_64

The same issue occurred again: https://maloo.whamcloud.com/test_sets/0dd9033e-12aa-11e2-bd97-52540035b04c

Comment by Jian Yu [ 10/Oct/12 ]

Lustre Tag: v2_3_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/32
Distro/Arch: RHEL6.3/x86_64(server), RHEL5.8/x86_64(client)

This was hit regularly on different distros/archs:
https://maloo.whamcloud.com/test_sets/bbdfb02a-12b1-11e2-a23c-52540035b04c

Comment by Jodi Levi (Inactive) [ 10/Oct/12 ]

Reducing from blocker per Oleg's comments in 2.3 channel.

1793 is not a blocker - it's a test script or test environment issue; somehow we ended up having 3 mdc devices instead of just one. There was a debug patch somewhere to print the list of devices in that case.

Comment by Sarah Liu [ 02/Nov/12 ]

another failure on SLES11 SP2 client:
https://maloo.whamcloud.com/test_sets/6aff6916-253a-11e2-9e7c-52540035b04c

Comment by Peter Jones [ 28/Nov/12 ]

Sarah reports that this consistently fails 100% of the time so it should be possible to debug and fix this test script/env issue now.

Comment by Oleg Drokin [ 03/Dec/12 ]

This problem potentially got fixed by LU-2275, http://review.whamcloud.com/4458

Comment by Andreas Dilger [ 03/Dec/12 ]

Peter, I think the bug that is failing 100% of the time is LU-2297 (replay-single.sh test_74), not this one.

Comment by Peter Jones [ 03/Dec/12 ]

ok - thanks Andreas

Comment by Hongchao Zhang [ 20/Dec/12 ]

This test has passed 100% of the time since 28 Nov; should we consider dropping its priority?

Comment by Jodi Levi (Inactive) [ 20/Dec/12 ]

Duplicate of LU-2275
