HSM _not only_ small fixes and to do list goes here (LU-3647)

[LU-3726] Adapt sanity-hsm to support MDSCOUNT >= 2 Created: 08/Aug/13  Updated: 12/Dec/13  Resolved: 12/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Technical task Priority: Major
Reporter: Thomas LEIBOVICI - CEA (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM, patch

Issue Links:
Related
is related to LU-4375 Test failure on test suite sanity-hsm... Resolved
Rank (Obsolete): 9595

 Description   

The current version of sanity-hsm has to be adapted to support running with MDSCOUNT >= 2.
This is a prerequisite for integrating HSM+DNE testing into sanity-hsm.

We are currently working on it (we will provide a patch).



 Comments   
Comment by Peter Jones [ 08/Aug/13 ]

Bruno

Could you please take care of this patch when it arrives?

Thanks

Peter

Comment by Thomas LEIBOVICI - CEA (Inactive) [ 23/Aug/13 ]

Here is the proposed patch for this change:
http://review.whamcloud.com/7437

Comment by Bruno Faccini (Inactive) [ 28/Aug/13 ]

I had to re-run the auto-tests for the change because the LU-1458 bug was triggered during lustre-rsync-test/test_2b.

Comment by Thomas LEIBOVICI - CEA (Inactive) [ 06/Sep/13 ]

The following change depends on this patch: http://review.whamcloud.com/#/c/7571/ (DNE specific tests for HSM).

Comment by Bruno Faccini (Inactive) [ 11/Sep/13 ]

I had to re-trigger the auto-tests for both patches because of an issue related to TEI-534 (aka no_root_squash for the copy-tool back-end).

Comment by Bruno Faccini (Inactive) [ 25/Sep/13 ]

Thomas,

Can you, as WangDi indicated, re-base and then re-submit your change #7437 (patch-set #3) with "Test-Parameters: mdtcount=2 mdscount=2" added to the commit message, so that it can be tested under DNE conditions?

Thanks!

Comment by Bruno Faccini (Inactive) [ 26/Sep/13 ]

Hello Thomas,

Thanks for adding "Test-Parameters: mdtcount=2 mdscount=2" to the commit message.

I also have another request: to clarify the JIRA ticket <-> Gerrit change relationships, could you change the commit message of http://review.whamcloud.com/7437 so that it refers to this ticket instead of LU-3561?

Thanks again and in advance for your help.

Comment by Thomas LEIBOVICI - CEA (Inactive) [ 27/Sep/13 ]

This ticket is now referenced in the commit message for changes http://review.whamcloud.com/7437 and http://review.whamcloud.com/7571.
The following test parameters have also been set for both of them, as they only modify sanity-hsm.sh:
"Test-Parameters: mdtcount=2 mdscount=2 testlist=sanity-hsm"

Comment by Bruno Faccini (Inactive) [ 27/Sep/13 ]

Thanks Thomas! That will greatly help to clarify things and to keep the Gerrit changes from being orphaned.

Comment by Bruno Faccini (Inactive) [ 21/Oct/13 ]

sanity-hsm subtests test_301/test_302 failed during the DNE auto-test session for Change #7437 patch-set #8.

sanity-hsm subtests test_302/test_400/test_401/test_403/test_404 failed during the DNE-specific auto-test session for Change #7571 patch-set #7.

The Maloo error reports need to be analyzed.

Comment by Bruno Faccini (Inactive) [ 24/Oct/13 ]

The auto-test failures that patch-set #8 of http://review.whamcloud.com/7437 experienced under DNE conditions are not caused by the patch itself. They are due to the issue addressed in LU-4093, where orphan HSM requests prevent others from starting. On the other hand, since this condition can clear itself during a test suite, the sub-tests modified by this patch to become "DNE-aware" ran successfully.

The auto-test failures of Change #7571 patch-set #7 are DNE-specific, and I will provide an update about them soon.

Comment by Bruno Faccini (Inactive) [ 29/Oct/13 ]

I spent some time looking at the Change #7571 patch-set #7 failures, and here are some ideas I got from doing so:

_ sub-test 302, error message "hsm_control state is not 'enabled' on mds2". It seems that not all the MDSs/MDTs are restarted/failed (i.e., only $SINGLEMDS is). Does mds2 therefore still have its CDT stopped from the last cdt_shutdown? If so, all MDSs/MDTs must be failed.

_ sub-test 400, error message "request on 0x3c0000401:0x8e:0x0 is not SUCCEED on mds1". In this case, it seems that the HSM request did not arrive where the test expected it. Some local, manual debugging with the patch/build is needed to understand what happens.

_ sub-test 401, error message "lfs hsm_archive" (with -EAGAIN). This may be a consequence of the CDT not having restarted on mds2 since sub-test 302. If not, again, some local, manual debugging with the patch/build is needed to understand what happens.

_ sub-test 403, error message "uuid 289ff266-f294-17cc-b407-fe2e0f15c9a0 not found in agent list on mds2". Again, this may be a consequence of the CDT not having restarted on mds2 since sub-test 302. If not, some local, manual debugging with the patch/build is needed to understand what happens.

_ sub-test 404, error message "request on 0x3c0000401:0x90:0x0 is not SUCCEED on mds1". The immediate WAITING status of the request can come from the specific problem this sub-test tracks, or it can be a consequence of max_requests having been reached because of the preceding failures. Thus it may also be a consequence of the CDT not having restarted on mds2 since sub-test 302. If not, some local, manual debugging with the patch/build is needed to understand what happens.

I will run these tests with a local config running patch/build to see if I am right.
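
The hypothesis above (the coordinator is running on mds1 but still stopped on mds2) can be sketched as a per-MDS check. This is only an illustration: get_hsm_control is a hypothetical stub that hard-codes the suspected state instead of querying a real MDS, and the facet names mds1..mdsN are assumed to follow MDSCOUNT.

```shell
# Number of MDSs to check; defaults to the 2-MDS DNE setup discussed here.
MDSCOUNT=${MDSCOUNT:-2}

# Hypothetical stub, not a real framework helper: reports the suspected
# state, i.e. the coordinator is enabled on mds1 and stopped elsewhere.
get_hsm_control() {
	case "$1" in
	mds1) echo "enabled" ;;
	*) echo "stopped" ;;
	esac
}

# Print one line per MDS whose coordinator is not enabled.
# Returns 0 if all are enabled, 1 otherwise.
check_cdt_all_mds() {
	local idx facet state rc=0
	for idx in $(seq 1 "$MDSCOUNT"); do
		facet="mds${idx}"
		state=$(get_hsm_control "$facet")
		if [ "$state" != "enabled" ]; then
			echo "hsm_control state is not 'enabled' on $facet (got '$state')"
			rc=1
		fi
	done
	return $rc
}

check_cdt_all_mds || true
```

Checking every MDS rather than only $SINGLEMDS is what would make a stale, stopped coordinator on mds2 visible.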

Comment by Bruno Faccini (Inactive) [ 05/Nov/13 ]

I had an email discussion with Thomas, and he agrees with me that the Change #7571 patch-set #7 sub-tests 301/401/403/404 failures could come from the CDT on mds2 not having been restarted, because not all MDTs were failed and restarted (i.e., only $SINGLEMDS was).

Thus, after re-basing change #7571 patch-set #7, I also changed sub-test test_302 to fail all MDSs/MDTs instead of only SINGLEMDS. This is patch-set #8.
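
A minimal sketch of the difference between failing only $SINGLEMDS and failing every MDS. It is not the code from patch-set #8: restart_facet is a hypothetical stand-in for the test framework's failover helper and only prints the facet it is given, and the mds1..mdsN facet names are assumed.

```shell
MDSCOUNT=${MDSCOUNT:-2}
SINGLEMDS=${SINGLEMDS:-mds1}

# Hypothetical stand-in for the real failover helper: it only reports
# which facet would be failed and restarted.
restart_facet() {
	echo "failing over $1"
}

# Previous behaviour: only $SINGLEMDS is restarted, so a coordinator
# stopped on any other MDT stays stopped.
fail_single_mds() {
	restart_facet "$SINGLEMDS"
}

# Changed behaviour: every MDS from mds1 to mds$MDSCOUNT is restarted.
fail_all_mds() {
	local idx
	for idx in $(seq 1 "$MDSCOUNT"); do
		restart_facet "mds${idx}"
	done
}

fail_all_mds
```

With MDSCOUNT=1 both functions act on the same single facet, so the loop form should not change single-MDS runs.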

Comment by Andreas Dilger [ 12/Dec/13 ]

The patch http://review.whamcloud.com/7571 landed to master. There is an open bug LU-4375 tracking some failures in sanity-hsm on DNE which should be used to address those failures.
