[LU-3304] Test failure on sanity-quota test_18: watchdog triggered Created: 09/May/13  Updated: 10/Jul/13  Resolved: 10/Jul/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-2283 Test failure on sanity-quota test_18:... Resolved
is related to LU-3195 Interop 2.3.0<->2.3.63 sanity-quota t... Resolved
Severity: 3
Rank (Obsolete): 8186

 Description   

sanity-quota test_18 fails because its watchdog check picks up messages logged by other sanity-quota tests.

This issue relates to the following test suite run:
https://maloo.whamcloud.com/test_sets/e2a22f56-a90d-11e2-85b9-52540035b04c

The error message for test_18 is:

Lustre: DEBUG MARKER: sanity-quota test_18: @@@@@@ FAIL: Lustre: DEBUG MARKER: sanity-quota test_6: @@@@@@ FAIL: LNet: Service thread pid 21586 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugg


 Comments   
Comment by James Nunez (Inactive) [ 09/May/13 ]

sanity-quota test_6 and test_18 check the watchdog in the same way, using very similar code, so the fixes will be similar.

Comment by James Nunez (Inactive) [ 10/May/13 ]

The proposed patch is at:
http://review.whamcloud.com/#change,6310

Comment by James Nunez (Inactive) [ 10/May/13 ]

The problem is that test_18 picks up watchdog lines in dmesg from any test. The awk command

local watchdog=$(awk '/sanity-quota test 18/ {start = 1;}
	/Service thread pid/ && /was inactive/ {
		if (start) {
			print;
		}
	}' $TMP/lustre-log-${TESTNAME}.log)

sets the start flag to 1 when it sees "sanity-quota test 18" in dmesg, but the flag is never reset. So once "sanity-quota test 18" has appeared in dmesg, any subsequent line from another test that contains both "Service thread pid" and "was inactive" is reported as a false positive.
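
The latching behaviour can be reproduced in isolation. The sketch below is illustrative only: the sample log lines, the "== sanity-quota test N:" marker format, and the function names sample_log, latched_scan and scoped_scan are assumptions made for the example, not taken from the test suite. latched_scan uses the same two awk rules as the command above, and scoped_scan is a hypothetical variant that resets the flag at every test marker.

```shell
#!/bin/bash
# Illustrative log; the marker format and messages are assumptions.
sample_log() {
	cat <<'EOF'
Lustre: DEBUG MARKER: == sanity-quota test 18: first marker
Lustre: DEBUG MARKER: == sanity-quota test 19: another test
LNet: Service thread pid 21586 was inactive for 40.00s.
EOF
}

# Same rules as the original command: start latches at 1 once the
# marker has been seen and is never cleared.
latched_scan() {
	awk '/sanity-quota test 18/ { start = 1 }
	     /Service thread pid/ && /was inactive/ { if (start) print }'
}

# Hypothetical variant: every test marker line recomputes the flag, so
# it is 1 only while scanning the section that follows a test 18 marker.
scoped_scan() {
	awk '/DEBUG MARKER: == sanity-quota test / {
		start = ($0 ~ /sanity-quota test 18:/)
	     }
	     /Service thread pid/ && /was inactive/ { if (start) print }'
}
```

Running "sample_log | latched_scan" prints the watchdog line even though it was logged after the marker of a different test, while "sample_log | scoped_scan" prints nothing. This variant is shown only to make the failure mode concrete; it is not the fix that was chosen for this ticket.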

I chose to implement Oleg's suggested fix for LU-3195, which is the same issue in test_6: clear dmesg at the beginning of test_18.
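
A minimal sketch of that approach, under stated assumptions: "dmesg -c" (util-linux) prints and then clears the kernel ring buffer and usually needs root. The helper names clear_kernel_log and watchdog_lines and the DMESG override variable are hypothetical, introduced here so the flow can be dry-run without privileges. The landed patch may differ, for example in how the command is run on the server node.

```shell
#!/bin/bash
# Hypothetical override for dry runs; defaults to the real dmesg(1).
DMESG=${DMESG:-dmesg}

# Print and clear the kernel ring buffer, discarding the old contents,
# so a later snapshot holds only messages logged after this point.
clear_kernel_log() {
	$DMESG -c > /dev/null
}

# Print every watchdog line found in a fresh snapshot of the buffer.
# No start marker is needed once the buffer was cleared at test start.
watchdog_lines() {
	$DMESG | awk '/Service thread pid/ && /was inactive/ { print }'
}
```

In a test, clear_kernel_log would be called first and watchdog_lines after the workload, with a non-empty result treated as a failure. The trade-off is that messages from earlier tests are no longer available in dmesg afterwards.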

Comment by James Nunez (Inactive) [ 10/Jul/13 ]

Patch landed to master.

Generated at Sat Feb 10 01:32:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.