[LU-7103] ost-pools test_7a: test failed to respond and timed out Created: 04/Sep/15  Updated: 02/Dec/15  Resolved: 02/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

client and server: lustre-master build # 3167 RHEL7.1


Issue Links:
Duplicate
is duplicated by LU-7208 ost-pools test_7a: test failed to res... Resolved
Related
is related to LU-6234 lfs computes pool name length incorre... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/30ff2db8-5288-11e5-920d-5254006e85c2.

The sub-test test_7a failed with the following error:

test failed to respond and timed out

The test hung, but I could not find any useful information in the logs.



 Comments   
Comment by Peter Jones [ 04/Sep/15 ]

Bob

Could you please try to identify which commit introduced this recent regression?

Thanks

Peter

Comment by Andreas Dilger [ 04/Sep/15 ]

This first started failing on 2015-08-20 13:07:43 so is likely related to a patch that landed just before that time.

Comment by Bob Glossman (Inactive) [ 10/Sep/15 ]

I'm quite suspicious of the fix for "LU-6234 util: check fsname and pool name". That went in on 8/18 and added test 7a. No fails before then since the test didn't exist.

commit 7c25eb1ba2b1db3009a0e88b3ecf229134f8ac92

Comment by Peter Jones [ 10/Sep/15 ]

Bobijam

Do you think that this issue could be related to the LU-6234 patch?

Thanks

Peter

Comment by Zhenyu Xu [ 11/Sep/15 ]

no, from https://testing.hpdd.intel.com/test_sets/query?utf8=%E2%9C%93&test_set%5Btest_set_script_id%5D=6bea3250-3db2-11e0-80c0-52540025f9af&test_set%5Bstatus%5D=TIMEOUT&test_set%5Bquery_bugs%5D=&test_session%5Btest_host%5D=&test_session%5Btest_group%5D=&test_session%5Buser_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=&test_node%5Bos_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Barchitecture_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node_network%5Bnetwork_type_id%5D=&commit=Update+results

there are ost-pools test TIMEOUT failures from before the LU-6234 patch, and test_7a has also passed (https://testing.hpdd.intel.com/test_sets/02cddbde-4e9f-11e5-b0b8-5254006e85c2). I tend to think that there is a hidden issue other than LU-6234 behind these TIMEOUT failures.

Comment by Andreas Dilger [ 29/Sep/15 ]

This is now the top failing test in autotest.

Comment by Andreas Dilger [ 29/Sep/15 ]

Bob, can you please push a patch to add set -vx at the start of test_7a and set +vx at the end, and add Test-Parameters: testlist=ost-pools,ost-pools,... to run this test maybe 10 times (if you put them on two lines they would run in parallel). Then we might see what the test is doing when it is failing. The current logs are not helpful.
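
As a rough illustration of the tracing suggested above, the sketch below wraps a test function in set -vx / set +vx. The function test_7a_body is a hypothetical stand-in, since the real body of test_7a is not shown in this ticket; only the placement of the two set calls is the point.

```shell
#!/bin/bash
# Hypothetical stand-in for the real test body, which is not shown in this ticket.
test_7a_body() {
	echo "creating pool names"
}

test_7a() {
	# -x traces each command to stderr; -v echoes shell input lines as they are read.
	set -vx
	test_7a_body
	local rc=$?
	# Turn tracing off again before returning so later tests stay quiet.
	set +vx
	return $rc
}

test_7a
```

With tracing on, the last command echoed before a hang shows where the test stopped, which is what the current logs lack.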

Comment by Andreas Dilger [ 29/Sep/15 ]

Bobijam, I'm not so sure I agree that this timeout wasn't added by test_7a itself. There are a bunch of test_11 timeouts from 2015-07-25 to 2015-08-05, but those were all related to a single patch (version 3 of http://review.whamcloud.com/15767), and the timeout was no longer hit by any later version of that patch.

Comment by Gerrit Updater [ 30/Sep/15 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/16676
Subject: LU-7103 debug: add verbose shell echo
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bb1f19082b2a227daa7eee3ba2840b84e479e3b0

Comment by Zhenyu Xu [ 30/Sep/15 ]

I've pushed the verbose echo shell debug patch at http://review.whamcloud.com/16676

Comment by Bob Glossman (Inactive) [ 05/Oct/15 ]

bobijam, you say in an earlier comment that there are failures from before the LU-6234 patch landed. I can't find any. Searching https://testing.hpdd.intel.com/sub_tests/query?commit=Update+results&page=2&sub_test%5Bquery_bugs%5D=&sub_test%5Bstatus%5D=TIMEOUT&sub_test%5Bsub_test_script_id%5D=e1c3310c-da20-11e4-b0f3-5254006e85c2&test_node%5Barchitecture_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node%5Bos_type_id%5D=&test_node_network%5Bnetwork_type_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=&test_session%5Btest_group%5D=&test_session%5Btest_host%5D=&test_session%5Buser_id%5D=&test_set%5Btest_set_script_id%5D=6bea3250-3db2-11e0-80c0-52540025f9af&utf8=

reports the earliest instance as 8/20. That is after the LU-6234 patch landed.

Comment by Zhenyu Xu [ 08/Oct/15 ]

Bob,

I meant other TIMEOUT failures of ost-pools tests. Since test_7a was added in the LU-6234 patch, you would not find a test_7a failure before that time.

Comment by Zhenyu Xu [ 08/Oct/15 ]

The debug patch hit two instances of the failure (https://testing.hpdd.intel.com/test_sessions/3eeb4454-6bb8-11e5-b85e-5254006e85c2), but only the client2 test log shows the following info:

== ost-pools test 7a: create various pool name ======================================================= 22:18:26 (1444083506)
'[' 8 -lt 2 ']'
mkdir -p /mnt/lustre/d7a.ost-pools
for i in 1 9 15
cat /dev/urandom
tr -dc a-zA-Z0-9
fold -w 1
head -n 1

No other logs show any trace of the shell command echo, and I don't understand why.

Comment by Andreas Dilger [ 24/Nov/15 ]

I would suspect that cat /dev/urandom is getting blocked for some reason in the VM. Please add a patch that removes the use of this and instead use something like echo $$$RANDOM$RANDOM which should generate a string between 3 and 15 characters long that can be cut shorter if needed.
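
A minimal sketch of that suggestion, assuming bash: gen_name is a hypothetical helper (not from the landed patch) that builds a name from $$ and $RANDOM and trims it to the requested length. It appends extra $RANDOM values when the initial string is too short, and it yields digits only, unlike the a-zA-Z0-9 set in the traced pipeline.

```shell
#!/bin/bash
# Hypothetical helper: build a name of the requested length from the shell PID
# and $RANDOM, without reading /dev/urandom.
gen_name() {
	local len=$1
	local name=$$$RANDOM$RANDOM
	# Append more digits until the string is long enough.
	while [ ${#name} -lt "$len" ]; do
		name=$name$RANDOM
	done
	# Cut the string down to exactly $len characters.
	echo "${name:0:$len}"
}

# Same lengths as the loop seen in the trace above.
for i in 1 9 15; do
	gen_name $i
done
```

Nothing here reads from a device file, so the helper cannot stall in the way suspected for cat /dev/urandom.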

Comment by Gerrit Updater [ 24/Nov/15 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/17350
Subject: LU-7103 test: avoid cat of /dev/urandom
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 72bf3691eb56e3b187b73ec1bb4a173feab81dd1

Comment by Gerrit Updater [ 02/Dec/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17350/
Subject: LU-7103 test: avoid cat of /dev/urandom
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cddbef5fae44d09756639c4cd9de62f28f5b34cf

Comment by Joseph Gmitter (Inactive) [ 02/Dec/15 ]

Landed for 2.8.0
