[LU-4438] Test failure sanity-hsm test_402: Copytool start should have failed Created: 06/Jan/14 Updated: 13/Aug/14 Resolved: 11/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HSM | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 12182 | ||||||||
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>.
This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/788a5dda-765f-11e3-b3c0-52540035b04c.
The sub-test test_402 failed with the following error:
Copytool start should have failed
Info required for matching: sanity-hsm 402 |
| Comments |
| Comment by Nathaniel Clark [ 06/Jan/14 ] |
|
This failure appears to be a one-off occurrence, and it does not seem to be related to the patch it was found on (http://review.whamcloud.com/8623) |
| Comment by Bob Glossman (Inactive) [ 20/Feb/14 ] |
|
not unique at all. another: maloo says "Failure Rate: 5.00% of last 100 executions [all branches]" so it must be happening fairly often. |
| Comment by Jian Yu [ 21/Feb/14 ] |
|
By searching on Maloo, I found more instances on Lustre b2_5 and master branches: https://maloo.whamcloud.com/test_sets/0bffe69a-9a26-11e3-965c-52540035b04c |
| Comment by Bruno Faccini (Inactive) [ 28/Mar/14 ] |
|
Looking at the Maloo failures, this problem looks like a timing issue: some of the copytool threads take too long to die while handling the errors caused by the deactivated MDT. Maybe we should add some delay before checking whether the copytool is still present (in search_copytools()) after copytool_setup, or change the tested/grep'ed pattern to be the copytool's main PID? |
| Comment by Bruno Faccini (Inactive) [ 30/Mar/14 ] |
|
In fact, after a closer look at the copytool/lhsmtool_posix source code, it seems that even if the ct_setup() routine, which checks the availability of both the archive root and the Lustre mount point, returns an error, main() continues as if everything were OK, causing new threads to start and encounter further errors. So rather than adding a delay to wait for all copytool threads to finish error processing and die, the fix here seems to be to handle the ct_setup() error and exit immediately. A patch implementing this is at http://review.whamcloud.com/9853. |
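To illustrate the shape of the fix described in the comment above, here is a minimal sketch and not the actual patch from http://review.whamcloud.com/9853. Every name in it (struct ct_options, check_dir, ct_setup_sketch, ct_run_sketch, copytool_main_sketch, workers_started) is hypothetical, and the mount-point check is simplified to a plain directory test, which is an assumption and not how lhsmtool_posix validates a Lustre mount. The point is only that a setup error is returned to the caller before any worker is started.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical options; the real copytool keeps more state than this. */
struct ct_options {
	const char *archive_root;	/* HSM archive root directory */
	const char *lustre_mnt;		/* Lustre client mount point */
};

/* Counts how many times the worker stage was entered (for illustration). */
int workers_started;

/* Return 0 if path is an existing directory, a negative errno otherwise. */
static int check_dir(const char *path)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -errno;
	if (!S_ISDIR(st.st_mode))
		return -ENOTDIR;
	return 0;
}

/* Stand-in for ct_setup(): verify both locations before doing any work. */
static int ct_setup_sketch(const struct ct_options *opt)
{
	int rc;

	assert(opt != NULL);

	rc = check_dir(opt->archive_root);
	if (rc < 0) {
		fprintf(stderr, "cannot use archive root '%s': %s\n",
			opt->archive_root, strerror(-rc));
		return rc;
	}

	rc = check_dir(opt->lustre_mnt);
	if (rc < 0) {
		fprintf(stderr, "cannot use mount point '%s': %s\n",
			opt->lustre_mnt, strerror(-rc));
		return rc;
	}
	return 0;
}

/* Stand-in for the stage that would start the copytool threads. */
static int ct_run_sketch(const struct ct_options *opt)
{
	(void)opt;
	workers_started++;
	return 0;
}

/* Body of a main(): stop immediately when setup fails. */
int copytool_main_sketch(const struct ct_options *opt)
{
	int rc;

	rc = ct_setup_sketch(opt);
	if (rc < 0)
		return EXIT_FAILURE;	/* no threads are started on error */

	rc = ct_run_sketch(opt);
	return rc < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}
```

A real main() would return the value of copytool_main_sketch() as its exit status. With an early exit like this, a test such as test_402 would presumably not need an extra delay, since no copytool thread would still be running after a failed setup.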
| Comment by Bruno Faccini (Inactive) [ 05/Apr/14 ] |
|
Had to re-base patch-set #4 due to Maloo/auto-tests failures related to |
| Comment by James Nunez (Inactive) [ 21/Apr/14 ] |
|
Hit this problem in review-zfs at https://maloo.whamcloud.com/test_sets/e11b4944-c822-11e3-888b-52540035b04c |
| Comment by Bruno Faccini (Inactive) [ 21/Apr/14 ] |
|
Had to re-base again due to unrelated failures during Maloo/auto-tests session. |
| Comment by Nathaniel Clark [ 05/Jun/14 ] |
|
This seems like it would be a nice to have for b2_5 once it lands on master, but doesn't seem like it should hold up a 2.5 release. This is a minor issue in copytool. |
| Comment by Peter Jones [ 11/Jun/14 ] |
|
Landed for 2.6 |
| Comment by James Nunez (Inactive) [ 13/Jun/14 ] |
|
Patch for b2_5 at http://review.whamcloud.com/#/c/10715/ |