Details

    • 9595

    Description

      The current version of sanity-hsm has to be adapted to support running with MDSCOUNT >= 2.
      This is a prerequisite for integrating HSM+DNE testing in sanity-hsm.

      We are currently working on it (we will provide a patch).

          Activity

            [LU-3726] Adapt sanity-hsm to support MDSCOUNT >= 2

            The patch http://review.whamcloud.com/7571 landed to master. There is an open bug LU-4375 tracking some failures in sanity-hsm on DNE which should be used to address those failures.

            adilger Andreas Dilger added a comment

            After an email discussion, Thomas agrees with me that the failures of sub-tests 301/401/403/404 in Change #7571 patch-set #7 could come from the CDT on mds2 not being restarted, because not all MDTs were failed and restarted (only $SINGLEMDS was).

            Thus, after rebasing Change #7571 patch-set #7, I also changed sub-test test_302 to fail all MDSs/MDTs instead of only $SINGLEMDS. This is patch-set #8.
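The change described above amounts to looping over every MDS facet rather than failing only $SINGLEMDS. A minimal sketch, assuming the facets are named mds1..mdsN as in the error messages quoted in this ticket. `fail_facet` is a hypothetical stand-in that only records the facet name, so the snippet runs without a Lustre test environment; the actual patch would use the test framework's own failover helper, which is not shown in this ticket.

```shell
# Sketch only: fail_facet is a hypothetical stand-in, not a test-framework helper.
MDSCOUNT=${MDSCOUNT:-2}
FAILED_FACETS=""

fail_facet() {
    # Record the facet that would be failed over and restarted.
    FAILED_FACETS="$FAILED_FACETS $1"
}

fail_all_mds() {
    local idx
    # Visit every MDS, not only the first one.
    for idx in $(seq 1 "$MDSCOUNT"); do
        fail_facet "mds${idx}"
    done
}

fail_all_mds
echo "failed over:${FAILED_FACETS}"
```

With MDSCOUNT=1 the loop degenerates to the single-MDS case, so one code path could serve both configurations.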

            bfaccini Bruno Faccini (Inactive) added a comment
            bfaccini Bruno Faccini (Inactive) added a comment - - edited

            I spent some time looking at the Change #7571 patch-set #7 failures, and here are some ideas:

            _ sub-test 302, error message "hsm_control state is not 'enabled' on mds2". It seems that not all the MDSs/MDTs are failed and restarted (only $SINGLEMDS is), so mds2 may still have its CDT stopped from the last cdt_shutdown. If so, all MDSs/MDTs must be failed.

            _ sub-test 400, error message "request on 0x3c0000401:0x8e:0x0 is not SUCCEED on mds1". In this case, it seems that the HSM request did not arrive where the test expected it. Some local, manual debugging with the patch/build is needed to understand what happens.

            _ sub-test 401, error message "lfs hsm_archive" (with -EAGAIN). This may be a consequence of the CDT not having restarted on mds2 since sub-test 302. If not, local, manual debugging with the patch/build is again needed to understand what happens.

            _ sub-test 403, error message "uuid 289ff266-f294-17cc-b407-fe2e0f15c9a0 not found in agent list on mds2". Again, this may be a consequence of the CDT not having restarted on mds2 since sub-test 302. If not, local, manual debugging with the patch/build is needed to understand what happens.

            _ sub-test 404, error message "request on 0x3c0000401:0x90:0x0 is not SUCCEED on mds1". The immediate WAITING status of the request can come from the specific problem this sub-test tracks, or can be a consequence of max_requests having been reached due to the failures just before. Thus, it may also be a consequence of the CDT not having restarted on mds2 since sub-test 302. If not, local, manual debugging with the patch/build is needed to understand what happens.

            I will run these tests with a local configuration running the patch/build to see if I am right.
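The hypothesis above (the coordinator is restarted on mds1 but not on mds2) can be checked by querying the coordinator state on every MDS. A minimal sketch under stated assumptions: `get_hsm_control` is a hypothetical stand-in that simulates the suspected situation, so the snippet runs without Lustre. On a real system the state would be read from each MDT's hsm_control parameter instead.

```shell
MDSCOUNT=${MDSCOUNT:-2}

# Hypothetical stand-in for reading hsm_control on one MDS. It simulates
# the suspected case where only mds1 had its coordinator restarted.
get_hsm_control() {
    case "$1" in
    mds1) echo "enabled" ;;
    *) echo "stopped" ;;
    esac
}

# Report every MDS whose coordinator is not in the expected state.
# Returns non-zero if at least one MDS differs.
check_all_cdt() {
    local expected="$1"
    local idx state rc=0
    for idx in $(seq 1 "$MDSCOUNT"); do
        state=$(get_hsm_control "mds${idx}")
        if [ "$state" != "$expected" ]; then
            echo "hsm_control state is not '$expected' on mds${idx}"
            rc=1
        fi
    done
    return $rc
}

check_all_cdt enabled || echo "at least one coordinator was not restarted"
```

The message format deliberately mirrors the sub-test 302 error quoted above, so a simulated run shows the same symptom as the failing session.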

            The auto-test failures experienced by patch-set #8 of http://review.whamcloud.com/7437 under DNE conditions are not related to the patch itself; they are due to the issue addressed in LU-4093, where orphan HSM requests prevent others from starting. On the other hand, since this condition can clear itself during a test suite, the sub-tests modified by this patch to become "DNE-aware" ran successfully.

            The auto-test failures of Change #7571 patch-set #7 are DNE-specific, and I will provide an update about them soon.

            bfaccini Bruno Faccini (Inactive) added a comment

            sanity-hsm subtests test_301/test_302 failed during Change #7437 patch-set #8 auto-tests DNE session.

            sanity-hsm subtests test_302/test_400/test_401/test_403/test_404 failed during Change #7571 patch-set #7 auto-tests DNE-specific session.

            The Maloo error reports need to be analyzed.

            bfaccini Bruno Faccini (Inactive) added a comment

            Thanks, Thomas! That will greatly help to clarify things and keep the Gerrit changes from being orphaned.

            bfaccini Bruno Faccini (Inactive) added a comment

            This ticket is now referenced in the commit message for changes http://review.whamcloud.com/7437 and http://review.whamcloud.com/7571.
            The following test parameters have also been set for both of them, as they only modify sanity-hsm.sh:
            "Test-Parameters: mdtcount=2 mdscount=2 testlist=sanity-hsm"
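As a rough illustration of where this line sits, the sketch below builds a commit message and checks that it requests a two-MDS session. Only the ticket number and the Test-Parameters line come from this ticket. The summary line is a placeholder, and `has_dne_params` is a hypothetical helper, not part of Gerrit or the Lustre tooling.

```shell
# Hypothetical commit message: the summary line is a placeholder.
commit_msg=$(cat <<'EOF'
LU-3726 tests: <summary line>

Test-Parameters: mdtcount=2 mdscount=2 testlist=sanity-hsm
EOF
)

# Hypothetical helper: succeeds if the message asks for two MDSs.
has_dne_params() {
    printf '%s\n' "$1" | grep -q '^Test-Parameters:.*mdscount=2'
}

has_dne_params "$commit_msg" && echo "DNE test session requested"
```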

            leibovici-cea Thomas LEIBOVICI - CEA (Inactive) added a comment

            Hello Thomas,

            Thanks for adding "Test-Parameters: mdtcount=2 mdscount=2" to the commit message.

            I also have another request: to clarify the relationship between JIRA tickets and Gerrit changes, could you change the commit message of http://review.whamcloud.com/7437 to refer to this ticket instead of LU-3561?

            Thanks again and in advance for your help.

            bfaccini Bruno Faccini (Inactive) added a comment

            Thomas,

            Can you, as WangDi indicated, rebase and re-submit your change #7437 patch-set #3 with "Test-Parameters: mdtcount=2 mdscount=2" added to the commit message, so that it can be tested under DNE conditions?

            Thanks!

            bfaccini Bruno Faccini (Inactive) added a comment

            I had to re-trigger auto-tests for both patches due to an issue related to TEI-534 (aka no_root_squash for the copy-tool back-end).

            bfaccini Bruno Faccini (Inactive) added a comment

            People

              bfaccini Bruno Faccini (Inactive)
              leibovici-cea Thomas LEIBOVICI - CEA (Inactive)