I spent some time looking at Change #7571 patch-set #7 failures, and here are some ideas that came out of that review:
_ sub-test 302, error/msg "hsm_control state is not 'enabled' on mds2". It seems that not all the MDSs/MDTs are restarted/failed (i.e., only $SINGLEMDS is). Thus mds2 may still have its CDT stopped from the last cdt_shutdown? If so, this means all MDSs/MDTs must be failed.
_ sub-test 400, error/msg "request on 0x3c0000401:0x8e:0x0 is not SUCCEED on mds1". In this case, it seems that the HSM request did not arrive where the test expected it. Some local manual debugging with the patch/build is needed to understand what happens.
_ sub-test 401, error/msg "lfs hsm_archive" (with -EAGAIN). This may be a consequence of the CDT not restarting on mds2 since sub-test 302? If not, again some local manual debugging with the patch/build is needed to understand what happens.
_ sub-test 403, error/msg "uuid 289ff266-f294-17cc-b407-fe2e0f15c9a0 not found in agent list on mds2". Again, this may be a consequence of the CDT not restarting on mds2 since sub-test 302? If not, some local manual debugging with the patch/build is needed to understand what happens.
_ sub-test 404, error/msg "request on 0x3c0000401:0x90:0x0 is not SUCCEED on mds1". The immediate WAITING status of the request may come from the specific problem this sub-test tracks, or may indicate that the max_requests limit has been reached due to the preceding failures. So this too may be a consequence of the CDT not restarting on mds2 since sub-test 302? If not, some local manual debugging with the patch/build is needed to understand what happens.
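Since several of the hypotheses above hinge on the CDT still being stopped on mds2, a quick sanity check is to dump `mdt.*.hsm_control` on every MDS and verify each MDT reports "enabled". The sketch below is a minimal, hedged example: on a live filesystem the input would come from `lctl get_param mdt.*.hsm_control`; here it parses a captured sample (the device names and the "stopped" value are assumed for illustration).

```shell
#!/bin/sh
# Hypothetical check: does every MDT report hsm_control=enabled?
# On a live system, replace the sample with:
#   sample=$(lctl get_param mdt.*.hsm_control)
sample='mdt.lustre-MDT0000.hsm_control=enabled
mdt.lustre-MDT0001.hsm_control=stopped'

# Count MDTs whose CDT is not enabled.
bad=$(printf '%s\n' "$sample" | grep -cv '=enabled$')
if [ "$bad" -gt 0 ]; then
    echo "CDT not enabled on $bad MDT(s)"
else
    echo "all CDTs enabled"
fi
```

If any MDT shows "stopped" here after the failover in sub-test 302, that would confirm the theory that only $SINGLEMDS was failed and the remaining CDTs were never restarted.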
I will run these tests against a local configuration with the patch/build to check whether I am right.
The patch http://review.whamcloud.com/7571 has landed on master. There is an open ticket, LU-4375, tracking some failures of sanity-hsm on DNE, which should be used to address those failures.