HSM _not only_ small fixes and to do list goes here
(LU-3647)
|
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Type: | Technical task | Priority: | Major |
| Reporter: | Thomas LEIBOVICI - CEA (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HSM, patch | ||
| Issue Links: |
|
||||||||
| Rank (Obsolete): | 9595 | ||||||||
| Description |
|
The current version of sanity-hsm has to be adapted to support running with MDSCOUNT >= 2. We are currently working on it (we will provide a patch). |
| Comments |
| Comment by Peter Jones [ 08/Aug/13 ] |
|
Bruno, could you please take care of this patch when it arrives? Thanks, Peter |
| Comment by Thomas LEIBOVICI - CEA (Inactive) [ 23/Aug/13 ] |
|
Here is the proposed patch for this change: |
| Comment by Bruno Faccini (Inactive) [ 28/Aug/13 ] |
|
I had to re-run auto-tests for the change due |
| Comment by Thomas LEIBOVICI - CEA (Inactive) [ 06/Sep/13 ] |
|
The following change depends on this patch: http://review.whamcloud.com/#/c/7571/ (DNE specific tests for HSM). |
| Comment by Bruno Faccini (Inactive) [ 11/Sep/13 ] |
|
I had to re-trigger auto-tests for both patches due to TEI-534 (aka no_root_squash for copy-tool back-end) related issue ... |
| Comment by Bruno Faccini (Inactive) [ 25/Sep/13 ] |
|
Thomas, can you, as WangDi indicated, re-base and then re-submit your change #7437/patch-set #3 with "Test-Parameters: mdtcount=2 mdscount=2" added to the Commit-msg, so that it can be tested under DNE conditions? Thanks! |
| Comment by Bruno Faccini (Inactive) [ 26/Sep/13 ] |
|
Hello Thomas, thanks for adding "Test-Parameters: mdtcount=2 mdscount=2" to the Commit-msg. I also have another request: in order to clarify the JIRA-Ticket<->Gerrit-Change relationships, would it be possible for you to change the Commit-msg of http://review.whamcloud.com/7437 and refer to this ticket instead of Thanks again and in advance for your help. |
| Comment by Thomas LEIBOVICI - CEA (Inactive) [ 27/Sep/13 ] |
|
This ticket is now referenced in the commit message for changes http://review.whamcloud.com/7437 and http://review.whamcloud.com/7571. |
| Comment by Bruno Faccini (Inactive) [ 27/Sep/13 ] |
|
Thanks Thomas!! That will greatly help to clarify things and to keep Gerrit changes from being orphaned. |
| Comment by Bruno Faccini (Inactive) [ 21/Oct/13 ] |
|
sanity-hsm subtests test_301/test_302 failed during the Change #7437 patch-set #8 auto-tests DNE session. sanity-hsm subtests test_302/test_400/test_401/test_403/test_404 failed during the Change #7571 patch-set #7 auto-tests DNE-specific session. I need to analyze the Maloo error reports. |
| Comment by Bruno Faccini (Inactive) [ 24/Oct/13 ] |
|
The auto-test failures that patch-set #8 of http://review.whamcloud.com/7437 experienced under DNE conditions are not related to the patch itself but are due to the issue addressed in Auto-tests failures of Change #7571 patch-set #7 are DNE-specific and I will provide an update about them soon. |
| Comment by Bruno Faccini (Inactive) [ 29/Oct/13 ] |
|
I spent some time looking at the Change #7571 patch-set #7 failures, and here are some ideas I got while doing so:
- sub-test 302, error/msg "hsm_control state is not 'enabled' on mds2". It seems that not all the MDSs/MDTs are re-started/failed (i.e., only $SINGLEMDS). Thus mds2 still has its CDT stopped due to the last cdt_shutdown? If yes, this means all MDSs/MDTs must be failed.
- sub-test 400, error/msg "request on 0x3c0000401:0x8e:0x0 is not SUCCEED on mds1". In this case, it seems that the HSM request did not arrive where the test expected it!! Thus some local+manual debugging with the patch/build is needed to understand what happens.
- sub-test 401, error/msg "lfs hsm_archive" (with -EAGAIN). It may be a consequence of the fact that the CDT did not restart on mds2 since sub-test #302?? If not, again some local+manual debugging with the patch/build is needed to understand what happens.
- sub-test 403, error/msg "uuid 289ff266-f294-17cc-b407-fe2e0f15c9a0 not found in agent list on mds2". Again, it may be a consequence of the fact that the CDT did not restart on mds2 since sub-test #302?? And again, if not, some local+manual debugging with the patch/build is needed to understand what happens.
- sub-test 404, error/msg "request on 0x3c0000401:0x90:0x0 is not SUCCEED on mds1". The immediate WAITING status of the request can come from the specific problem this sub-test tracks, or can be a consequence of max_requests having been reached due to the failures just before. Thus, it may also be a consequence of the fact that the CDT did not restart on mds2 since sub-test #302???? And again, if not, some local+manual debugging with the patch/build is needed to understand what happens.
I will run these tests with a local config running the patch/build to see if I am right. |
| Comment by Bruno Faccini (Inactive) [ 05/Nov/13 ] |
|
I had an email discussion with Thomas, and he agrees with me that the Change #7571 patch-set #7 sub-tests 301/401/403/404 failures could come from the fact that the CDT on mds2 was not restarted, because not all MDTs were failed+restarted (i.e., only $SINGLEMDS was). Thus, after re-basing change #7571 patch-set #7, I also changed sub-test test_302 to fail all MDSs/MDTs instead of only SINGLEMDS. This is patch-set #8. |
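The idea of failing every MDS rather than only $SINGLEMDS can be sketched roughly as below. This is an illustration only, not the content of patch-set #8: `fail` here is a local stub standing in for whatever failover helper the test framework provides, `fail_all_mds` and `FAILED_FACETS` are hypothetical names, and the `mds1..mdsN` facet naming driven by MDSCOUNT is assumed from the discussion above.

```shell
#!/bin/bash
# Sketch only: loop over all MDS facets instead of a single one.
# MDSCOUNT is assumed to be provided by the test environment (default 2 here).
MDSCOUNT=${MDSCOUNT:-2}

# Stub standing in for the real failover helper (assumption): it only
# records which facet would have been failed, it restarts nothing.
FAILED_FACETS=""
fail() {
	FAILED_FACETS="${FAILED_FACETS:+$FAILED_FACETS }$1"
}

# Hypothetical helper: fail mds1..mds$MDSCOUNT, so that the coordinator
# (CDT) on every MDT goes through a restart, not only the one on $SINGLEMDS.
fail_all_mds() {
	local mdtno
	for mdtno in $(seq 1 "$MDSCOUNT"); do
		fail "mds${mdtno}"
	done
}

fail_all_mds
echo "failed facets: $FAILED_FACETS"
```

With the stub replaced by the real helper, a subsequent check of hsm_control would then have to pass on each MDT, which is what the failing sub-test 302 message on mds2 points at.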
| Comment by Andreas Dilger [ 12/Dec/13 ] |
|
The patch http://review.whamcloud.com/7571 landed to master. There is an open bug |