
Test failures associated with DNE index and stripe count randomization

Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

      As a secondary change, the LU-11636 patch added MDS index and stripe count randomization to the test-framework in DNE configs.

      The randomization introduced several high-frequency failures (and, I suspect, other lower-frequency failures), and was reverted in https://review.whamcloud.com/34705/

      This ticket tracks those test failures, since the randomization change is fundamentally an improvement and something we'd like to land eventually.
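For readers unfamiliar with the change, here is a minimal sketch of what "MDS index and stripe count randomization" could look like. It is not the LU-11636 patch (the real test-framework is shell), and `pick_mdt_layout`, its parameters, and the chosen ranges are assumptions made for illustration only. It picks a random starting MDT index and a random directory stripe count for a given number of MDTs, using a seeded generator so that a failing combination can be replayed.

```python
import random


def pick_mdt_layout(mdt_count, rng):
    """Return a hypothetical (mdt_index, stripe_count) pair for a test directory.

    Assumption: the index is drawn from [0, mdt_count) and the stripe count
    from [1, mdt_count]. The real patch may choose its ranges differently.
    """
    if mdt_count < 1:
        raise ValueError("mdt_count must be at least 1")
    if mdt_count == 1:
        # Nothing to randomize on a single-MDT (non-DNE) setup.
        return 0, 1
    mdt_index = rng.randrange(mdt_count)       # 0 .. mdt_count - 1
    stripe_count = rng.randint(1, mdt_count)   # 1 .. mdt_count, inclusive
    return mdt_index, stripe_count


if __name__ == "__main__":
    seed = 12210  # arbitrary; logging the seed makes a failing run repeatable
    rng = random.Random(seed)
    index, count = pick_mdt_layout(4, rng)
    print(f"seed={seed} mdt_index={index} stripe_count={count}")
```

Recording the seed alongside the test results is one way to make an intermittent, layout-dependent failure reproducible, which may help when comparing results across differently sized DNE setups.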

      The four failures that were specifically traced to this change are:
      sanity 208 (LU-12175)

      sanity 133g (LU-12171)

      recovery-small 107 (LU-12176)

      recovery-small 134 (LU-11560)

       

      There are some limited details in each of those tickets.  The failures varied, and did not appear to result from a single underlying bug.

          Activity

            [LU-12210] Test failures associated with DNE index and stripe count randomization

            Alexander Lezhoev added a comment -

            @pfarrell, Cray uses two DNE configurations:

            1. Four nodes with a single MDS: MDS/MGS, OSS, two clients.
            2. Ten nodes: MGS, 3 MDS, 2 OSS, 4 clients.

            pfarrell Patrick Farrell (Inactive) added a comment -

            Because the failures associated with the LU-11636 MDT randomization change are tracked here.  I opened this bug to track the underlying problem(s) and their resolution.  Those other bugs serve(d) to track known test failures.  Any current failures of those tests are not known failures, and we don't want them tracked against those bugs (i.e., if someone sees a failure in Maloo, we know it's not the one described by those bugs, and they should open a new bug).

            Note that LU-11560 is still open because there was a pre-existing (rare) failure there.

            By the way, you noted elsewhere that Cray isn't seeing the same level of test issues with the MDT randomization change from LU-11636.  I know Cray uses a standard config for test-framework testing; I believe it's 4 nodes?  Is that two MDS and two OSS?  How many MDTs are used, and how many on each node?  I'm trying to see how different the results of the randomization might be in that environment.
            spitzcor Cory Spitz added a comment -

            By the way, why are all of the recent related Bugs RESOLVED? None of them are resolved, right? We're just skipping tests that bring the bugs out.

            spitzcor Cory Spitz added a comment -

            Has anyone tried applying the tests to an older master? I don't think that Cray is experiencing this volume of failures on our 2.11 or 2.12. Maybe there is a regression causing the problem, and landing the tests merely exposed it. Maybe the test improvements could be tried with git bisect and a rebased master?


            People

              wc-triage WC Triage
              pfarrell Patrick Farrell (Inactive)