[LU-8646] sanity-scrub tests 4b and 4c fail with '(16) Auto trigger full scrub unexpectedly' Created: 28/Sep/16  Updated: 08/Oct/16  Resolved: 08/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: nasf (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

autotest review-dne


Issue Links:
Related
is related to LU-8472 sanity-scrub test_5 times out Resolved
is related to LU-8416 sanity-scrub test_4c: Auto trigger fu... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-scrub test_4b and test_4c fail with the error:

sanity-scrub test_4b: @@@@@@ FAIL: (16) Auto trigger full scrub unexpectedly

The sanity-scrub code from test 4b that is failing is:

for n in $(seq $MDSCOUNT); do
	updated0[$n]=$(scrub_status $n |
		       awk '/^prior_updated/ { print $2 }')

	echo "OI scrub on MDS$n status for the 3rd time:"
	do_facet mds$n $LCTL get_param -n \
		osd-ldiskfs.$(facet_svc mds$n).oi_scrub

	[ ${updated0[$n]} -gt ${updated1[$n]} ] ||
		error "(16) Auto trigger full scrub unexpectedly"
done
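
The check in that loop can be tried outside the test framework. The following is a minimal sketch, not the sanity-scrub code itself: sample_status, get_prior_updated and prior_updated_grew are hypothetical helper names, and the status text and counter values are made up for illustration. It applies the same awk extraction and the same -gt comparison as the loop above.

```shell
#!/bin/bash
# Hypothetical excerpt of oi_scrub status output; the field values are
# invented for illustration only.
sample_status() {
	cat <<'EOF'
status: completed
updated: 0
prior_updated: 3
EOF
}

# Same extraction the test uses: second field of the prior_updated line.
get_prior_updated() {
	awk '/^prior_updated/ { print $2 }'
}

# Succeeds only when the counter grew, which is what check (16) requires.
prior_updated_grew() {
	local before=$1 after=$2
	[ "$after" -gt "$before" ]
}

before=1	# hypothetical earlier value, standing in for updated1[n]
after=$(sample_status | get_prior_updated)

if prior_updated_grew "$before" "$after"; then
	echo "ok: prior_updated grew from $before to $after"
else
	echo "FAIL: (16) Auto trigger full scrub unexpectedly"
fi
```

With these made-up values the comparison succeeds. If prior_updated had not increased since the previous reading, the else branch would fire, which corresponds to the failure reported here.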

And, from the test log, we see that
Logs for this failure are at https://testing.hpdd.intel.com/test_sets/5862d05c-8406-11e6-a81d-5254006e85c2

Note that the patch that hit the above failure includes the fix for LU-8472 (http://review.whamcloud.com/#/c/21918/).
This failure is similar to the failure seen in LU-8416, which was closed as a duplicate of LU-8472.



 Comments   
Comment by nasf (Inactive) [ 08/Oct/16 ]

According to the Maloo results, in the past two weeks there have been the following 11 failures for sanity-scrub test_4b at (16)...:
https://testing.hpdd.intel.com/sub_tests/f9a50aaa-8b88-11e6-a9b0-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/90f7f918-874b-11e6-a9b0-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/db51a1ea-8736-11e6-ad53-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/5f6cdc90-86c1-11e6-91aa-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/360a3ee2-8621-11e6-91aa-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/2ae9c2ae-85d0-11e6-91aa-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/22345cd6-8577-11e6-a9b0-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/4566e872-847d-11e6-a35f-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/bc9c7326-8423-11e6-a454-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/586f4db4-8406-11e6-a81d-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/6abcd7be-8278-11e6-a35f-5254006e85c2

I have checked them one by one. The build numbers show that they are all based on old parent commits from before the LU-8472 patch http://review.whamcloud.com/21918/ landed on master. So I think that patch 21918 has already fixed the issue. I would suggest that the owners of the related patches rebase them against the latest master, instead of simply re-triggering the old Maloo test session.

Comment by nasf (Inactive) [ 08/Oct/16 ]

For example:

And, from the test log, we see that
Logs for this failure are at https://testing.hpdd.intel.com/test_sets/5862d05c-8406-11e6-a81d-5254006e85c2

It is for the patch http://review.whamcloud.com/#/c/13595; the build is https://build.hpdd.intel.com/job/lustre-reviews/41642, and build #41642 is for patch set 251. Patch set 251's parent is "Parent(s) a653adc01d71f890d1b2b6549f34afb409485bf5", which is older than the LU-8472 patch http://review.whamcloud.com/21918 that was committed as a41c6fad4672a60166088b9ad8aeb4f1b51c38e7.
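
Whether a patch's parent already contains the fix can also be checked locally with git. The following is a minimal sketch, assuming it is run inside a Lustre git checkout that has both commits; parent_has_fix is a hypothetical helper name, and the two commit IDs are the ones quoted above.

```shell
#!/bin/bash
# Succeeds if commit $1 (the fix) is an ancestor of commit $2 (the
# patch's parent), i.e. the parent already contains the fix.
parent_has_fix() {
	local fix=$1 parent=$2
	git merge-base --is-ancestor "$fix" "$parent"
}

# Commit IDs quoted in the comment above.
fix=a41c6fad4672a60166088b9ad8aeb4f1b51c38e7
parent=a653adc01d71f890d1b2b6549f34afb409485bf5

if parent_has_fix "$fix" "$parent" 2>/dev/null; then
	echo "parent already contains the LU-8472 fix"
else
	echo "parent lacks the fix (or the commits are not in this repo): rebase needed"
fi
```

If the check fails, rebasing the patch onto the current master (for example git fetch origin followed by git rebase origin/master, assuming the remote is named origin) brings in the fix before the tests are re-run.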

Comment by nasf (Inactive) [ 08/Oct/16 ]

Please reopen this ticket if the problem recurs after rebasing onto the latest master.
