[LU-12210] Test failures associated with DNE index and stripe count randomization Created: 19/Apr/19  Updated: 25/Jul/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Patrick Farrell (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11560 recovery-small test 134 fails with ‘r... Open
is related to LU-12175 sanity test 208 fails with 'lease bro... Reopened
is related to LU-11636 t-f test_mkdir() does not support int... Resolved
is related to LU-12171 sanity test_133g: Timeout occurred af... Resolved
is related to LU-12176 recovery-small test 107 fails with 't... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

As a secondary change, the LU-11636 patch added MDS index and stripe count randomization to the test-framework in DNE configs.
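
For readers unfamiliar with what this kind of randomization means, the following is a minimal sketch, not the LU-11636 patch itself (which is not shown in this ticket). It assumes that a test directory is created with `lfs mkdir`, with the MDT index passed as `-i` and the stripe count as `-c`. The helper name `random_mkdir_args`, its parameters, and the path used below are hypothetical and exist only for illustration.

```python
import random


def random_mkdir_args(mds_count, path, rng=random):
    """Build an `lfs mkdir` argument list with a random MDT index and stripe count.

    Hypothetical helper for illustration only; this is an assumption about
    how such randomization could look, not the test-framework code.
    """
    if mds_count < 1:
        raise ValueError("mds_count must be at least 1")
    # Pick the starting MDT index from 0 .. mds_count - 1.
    mdt_index = rng.randrange(mds_count)
    # Pick the directory stripe count from 1 .. mds_count.
    stripe_count = rng.randint(1, mds_count)
    return ["lfs", "mkdir", "-i", str(mdt_index), "-c", str(stripe_count), path]


if __name__ == "__main__":
    # Print the command a test run might issue on a 4-MDT setup (example path).
    print(" ".join(random_mkdir_args(4, "/mnt/lustre/d1")))
```

With a single MDT the result is always index 0 and stripe count 1, so a sketch like this would only change behaviour on configurations with more than one MDT.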

The randomization introduced several high-frequency failures (and, I suspect, other lower-frequency failures) and was reverted at https://review.whamcloud.com/34705/

This ticket is to track these test failures, as the randomization change is fundamentally an improvement and is something we'd like to eventually put in.

The four failures that were specifically traced to this change are:
sanity 208 (LU-12175)
sanity 133g (LU-12171)
recovery-small 107 (LU-12176)
recovery-small 134 (LU-11560)

Each of those tickets has some limited details. The failures varied and did not appear to result from a single underlying bug.



 Comments   
Comment by Cory Spitz [ 23/Apr/19 ]

Has anyone tried applying the tests to an older master? I don't think that Cray is experiencing this volume of failures on our 2.11 or 2.12. Maybe there is some regression causing the problem, and landing the tests only exposed it. Maybe the test improvements could be tried with git bisect on a rebased master?

Comment by Cory Spitz [ 23/Apr/19 ]

By the way, why are all of the recent related Bugs RESOLVED? None of them are resolved, right? We're just skipping tests that bring the bugs out.

Comment by Patrick Farrell (Inactive) [ 23/Apr/19 ]

Because the failures associated with the LU-11636 MDT randomization change are tracked here. I opened this bug to track the underlying problem(s) and their resolution. Those other bugs serve(d) to track known test failures. Any current failures of those tests are not known failures, and we don't want them tracked against those bugs (i.e., if someone sees a failure in Maloo, we know it's not the one described by those bugs, and they should open a new bug).

Note that LU-11560 is still open because there was a pre-existing (rare) failure there.

By the way, you noted elsewhere that Cray isn't seeing the same level of test issues with the MDT randomization change from LU-11636. I know Cray uses a standard config for test-framework testing; I believe it's 4 nodes? Is that two MDS and two OSS? How many MDTs are used, and how many on each node? I'm trying to see how different the results of the randomization might be in that environment.

Comment by Alexander Lezhoev [ 25/Jul/19 ]

@pfarrel, Cray uses two DNE configurations:

  1. Four nodes with a single MDS: mds/mgs, oss, two clients.
  2. Ten nodes: mgs, 3 mds, 2 oss, 4 clients.
Generated at Sat Feb 10 02:50:36 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.