[LU-4442] Failure on test suite replay-vbr test_7g: Test 7g.3 failed Created: 06/Jan/14  Updated: 14/Jul/15  Resolved: 11/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6, Lustre 2.6.0, Lustre 2.5.1, Lustre 2.4.3
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Emoly Liu
Resolution: Fixed
Votes: 0
Labels: mn1, mn4
Environment:

client and server: lustre-master build 1823 RHEL6 ldiskfs


Severity: 3
Rank (Obsolete): 12186

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/930d9194-74df-11e3-96b0-52540035b04c.

The sub-test test_7g failed with the following error:

Test 7g.3 failed

test log shows:

CMD: client-31vm7 /usr/sbin/lctl list_param osp.*osc*.old_sync_processed 2> /dev/null
osp.lustre-OST0000-osc-MDT0000.old_sync_processed
osp.lustre-OST0001-osc-MDT0000.old_sync_processed
osp.lustre-OST0002-osc-MDT0000.old_sync_processed
osp.lustre-OST0003-osc-MDT0000.old_sync_processed
osp.lustre-OST0004-osc-MDT0000.old_sync_processed
osp.lustre-OST0005-osc-MDT0000.old_sync_processed
osp.lustre-OST0006-osc-MDT0000.old_sync_processed
wait mds1 secs maximumly for client-31vm7 mds-ost sync done.
/usr/lib64/lustre/tests/test-framework.sh: line 2135: [: mds1: integer expression expected
CMD: client-31vm7 /usr/sbin/lctl get_param -n osp.*osc*.old_sync_processed
1
1
1
1
1
1
1
 recovery node iozone not done in mds1 sec. 
 replay-vbr test_7g: @@@@@@ FAIL: Test 7g.3 failed 


 Comments   
Comment by Oleg Drokin [ 09/Jan/14 ]

There's certainly a parsing error somewhere that makes us pick up the MDS name instead of a timeout, and things go downhill from there:

/usr/lib64/lustre/tests/test-framework.sh: line 2135: [: mds1: integer expression expected
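A minimal sketch of how this kind of error arises and one way to guard against it. This is not the code at line 2135 of test-framework.sh, which is not shown in this ticket; the helper names `is_uint` and `safe_wait_secs` are hypothetical. When a facet name such as `mds1` lands in the position where a timeout is expected, a numeric test like `[ mds1 -gt 0 ]` makes bash report "integer expression expected".

```shell
# Hypothetical helpers, for illustration only (not from test-framework.sh).

# Succeeds only if $1 is a non-empty string of decimal digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Prints $1 if it is a valid timeout in seconds, otherwise the fallback $2.
# This keeps a misplaced argument (e.g. a facet name) out of numeric tests.
safe_wait_secs() {
    if is_uint "$1"; then
        echo "$1"
    else
        echo "$2"
    fi
}

# A facet name passed where seconds were expected falls back to the default
# instead of reaching a "[ ... -gt ... ]" comparison.
max_wait=$(safe_wait_secs "mds1" 40)
echo "waiting at most $max_wait seconds"
```

Validating the argument only hides the symptom; the actual fix for the argument mix-up is the test script patch referenced later in this ticket.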

Comment by Peter Jones [ 09/Jan/14 ]

Emoly

Could you please look into this one?

Thanks

Peter

Comment by Emoly Liu [ 10/Jan/14 ]

I can reproduce it easily. I will investigate and fix it.

Comment by Emoly Liu [ 10/Jan/14 ]

patch at: http://review.whamcloud.com/8796, which fixes the test script issue.

Comment by Jian Yu [ 10/Jan/14 ]

Lustre build: http://build.whamcloud.com/job/lustre-reviews/20841/
Distro/Arch: SLES11SP3/x86_64 (both server and client, kernel version: 3.0.101-0.8)

The same failure occurred:
https://maloo.whamcloud.com/test_sets/43ce86d2-798b-11e3-a27b-52540035b04c

Comment by Emoly Liu [ 13/Jan/14 ]

The Maloo test report https://maloo.whamcloud.com/test_logs/2a98cbc0-7bfa-11e3-a7b6-52540035b04c/show_text shows that test_7g has another problem besides the test script issue that http://review.whamcloud.com/8796 fixes.

I will investigate and provide another patch for it.

Comment by Emoly Liu [ 17/Jan/14 ]

Searching Maloo, I see that this error has been occurring since Dec. 21, and I found that it is related to LU-3528 http://review.whamcloud.com/8371 .

I am working on the patch.

Comment by Emoly Liu [ 23/Jan/14 ]

The root cause of this failure is that, since mdt_object_exists() was added to mdt_reint_link() in http://review.whamcloud.com/#/c/8371, if the child object doesn't exist there is no chance to do the object version check, so client1 is not evicted.

I created the following two patches to fix this problem, and I am not sure which is better:

Tappro, could you please give any advice? Thanks.

Comment by Emoly Liu [ 24/Jan/14 ]

Thanks, Tappro, I saw your choice of http://review.whamcloud.com/8973 .

Comment by Emoly Liu [ 11/Feb/14 ]

Both patches have landed in 2.6.

Comment by Jian Yu [ 17/Feb/14 ]

Patches for Lustre b2_5 branch:
http://review.whamcloud.com/9289
http://review.whamcloud.com/9290

Comment by Jian Yu [ 17/Feb/14 ]

Landing http://review.whamcloud.com/8371 on Lustre b2_5 build #25 also caused this regression failure on Lustre b2_5 branch:
https://maloo.whamcloud.com/test_sets/0103e6d0-977c-11e3-b941-52540035b04c
https://maloo.whamcloud.com/test_sets/c79df196-97b8-11e3-acb5-52540035b04c

Comment by Sarah Liu [ 25/Mar/14 ]

Also hit on interop test between 2.5.1 server and master client:

https://maloo.whamcloud.com/test_sets/8e4f6b3a-b244-11e3-a93f-52540035b04c

Comment by Jian Yu [ 17/Apr/14 ]

Also hit on interop test between 2.5.1 server and master client:

This is because http://review.whamcloud.com/9213 was landed on the Lustre b2_5 branch. We need to change the Lustre version number from 2.5.52 to 2.5.1 in replay-vbr test_7g().
Patch for master branch: http://review.whamcloud.com/9986
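To illustrate why the threshold matters, here is a small sketch of a version comparison. The helpers `ver_code` and `meets_threshold` are hypothetical and written for this example; they are not the functions used in replay-vbr, and what the check gates inside test_7g() is not shown in this ticket. The point is only that a 2.5.1 server carrying the back-ported change does not satisfy a "2.5.52 or newer" test, while it does satisfy "2.5.1 or newer".

```shell
# Hypothetical helper: pack "major.minor.patch" into one integer so that
# versions can be compared numerically.
ver_code() {
    local IFS=.
    # Unquoted on purpose: split the version string on dots.
    set -- $1
    echo $(( (${1:-0} << 16) | (${2:-0} << 8) | ${3:-0} ))
}

# Hypothetical helper: succeeds if version $1 is at least threshold $2.
meets_threshold() {
    [ "$(ver_code "$1")" -ge "$(ver_code "$2")" ]
}

server_version=2.5.1

if meets_threshold "$server_version" 2.5.52; then
    echo "old threshold: server treated as having the change"
else
    echo "old threshold: server treated as NOT having the change"
fi

if meets_threshold "$server_version" 2.5.1; then
    echo "new threshold: server treated as having the change"
fi
```

With the threshold at 2.5.52, an interop run against a 2.5.1 server takes the branch meant for servers without the change, which matches the failure reported above.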

Comment by Jian Yu [ 17/Apr/14 ]

Back-ported patch for Lustre b2_4 branch: http://review.whamcloud.com/9987
Back-ported patch for Lustre b2_1 branch: http://review.whamcloud.com/9988
