[LU-4545] Test failure sanity-hsm test_223a: request on 0x200000402:0x13f:0x0 is not SUCCEED Created: 27/Jan/14  Updated: 09/Apr/14  Resolved: 12/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.1
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM, patch

Severity: 3
Rank (Obsolete): 12423

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:
http://maloo.whamcloud.com/test_sets/7976264e-650b-11e3-b82f-52540035b04c
https://maloo.whamcloud.com/test_sets/64947b40-858b-11e3-a2cb-52540035b04c

The sub-test test_223a failed with the following error:

request on 0x200000402:0x13f:0x0 is not SUCCEED on mds1

Info required for matching: sanity-hsm 223a



 Comments   
Comment by Jian Yu [ 29/Jan/14 ]

While testing patch http://review.whamcloud.com/9006 on Lustre b2_5 branch, the same failure occurred:
https://maloo.whamcloud.com/test_sets/ca227340-886a-11e3-af42-52540035b04c

Comment by nasf (Inactive) [ 12/Feb/14 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/0cad3fe2-93ce-11e3-b8a9-52540035b04c

Comment by Bruno Faccini (Inactive) [ 13/Feb/14 ]

Another at https://maloo.whamcloud.com/test_sets/4af8947e-9458-11e3-9ec0-52540035b04c.

Comment by Aurelien Degremont (Inactive) [ 18/Feb/14 ]

https://maloo.whamcloud.com/test_sets/7976264e-650b-11e3-b82f-52540035b04c
is probably related to an ENOSPC error.

But the 4 others have a similar pattern.
In most cases, RESTORE is started by the copytool and, when sending the first PROGRESS, the copytool realizes the operation was canceled and aborts it. It notifies the coordinator of this in ct_fini().
But this is a little bit racy. In some cases, the copytool does not have enough time to really start the restore. It calls ct_restore_begin(), which sends a PROGRESS to the coordinator, which returns -ECANCELED.
ct_restore() does not handle this error the same way when it appears that early: ct_fini() is not called in this case because hpc is NULL (lustre/utils/lhsmtool_posix.c:1140).

In real life, this is not a real problem. The request really is canceled, which is what is expected. The cancel request itself stays STARTED in the action llog until it eventually times out. It is expected to become SUCCEED only if the copytool actually acknowledges the cancel operation, which it does not do in this case.
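
To make the two orderings concrete, here is a toy model of the race, not the actual lhsmtool_posix.c code. The Coordinator class, its methods, and the cancel_before_begin flag are all hypothetical names made up for this illustration; only ct_restore, ct_restore_begin, ct_fini, hpc and -ECANCELED come from the description above.

```python
from enum import Enum

ECANCELED = 125  # Linux errno value, used here only as a label


class Status(Enum):
    STARTED = "STARTED"
    SUCCEED = "SUCCEED"


class Coordinator:
    """Toy stand-in for the coordinator (hypothetical, not the real one)."""

    def __init__(self):
        self.restore_canceled = False
        self.cancel_status = Status.STARTED  # state of the CANCEL request

    def progress(self):
        # A PROGRESS is refused once the RESTORE has been canceled.
        return -ECANCELED if self.restore_canceled else 0

    def end(self):
        # The copytool's final report is what acknowledges the cancel.
        if self.restore_canceled:
            self.cancel_status = Status.SUCCEED


def ct_restore(cdt, cancel_before_begin):
    """Model of the restore path; cancel_before_begin picks the ordering."""
    hpc = None  # copy handle, only valid once begin has succeeded
    if cancel_before_begin:
        cdt.restore_canceled = True
    rc = cdt.progress()  # models the PROGRESS sent by ct_restore_begin()
    if rc == 0:
        hpc = object()
        cdt.restore_canceled = True  # cancel arrives while copying
        rc = cdt.progress()  # first real PROGRESS sees -ECANCELED
    if hpc is not None:
        cdt.end()  # models ct_fini(); skipped when hpc is NULL
    return rc


late, early = Coordinator(), Coordinator()
ct_restore(late, cancel_before_begin=False)
ct_restore(early, cancel_before_begin=True)
print(late.cancel_status.value, early.cancel_status.value)
```

In this model the late cancel ends with the CANCEL request acknowledged, while the early cancel leaves it in STARTED, which is the state test_223a complains about.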

Comment by Henri Doreau (Inactive) [ 19/Feb/14 ]

The following patch should prevent that from happening by forcing the CT to report progress/error.

http://review.whamcloud.com/9310
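
The patch itself is not reproduced here. As a rough sketch of the idea only, assuming "forcing the CT to report" means the final report is sent whether or not a copy handle was obtained, the end-of-action step could be modeled as below. The function finish_action, its parameters, and the cancel_log dict are hypothetical names for illustration and do not come from the patch.

```python
def finish_action(cancel_log, have_handle, always_report):
    """Hypothetical end-of-action step; returns True if a report was sent.

    cancel_log is a plain dict standing in for the CANCEL entry in the
    action llog; it is an assumption of this sketch, not a Lustre structure.
    """
    if have_handle or always_report:
        # The final report lets the coordinator close the CANCEL request.
        cancel_log["status"] = "SUCCEED"
        return True
    # No report sent: the CANCEL entry is left as it was.
    return False


before = {"status": "STARTED"}
after = {"status": "STARTED"}
finish_action(before, have_handle=False, always_report=False)
finish_action(after, have_handle=False, always_report=True)
print(before["status"], after["status"])
```

With always_report set, the early -ECANCELED case no longer leaves the CANCEL entry in STARTED in this model.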

Comment by Bob Glossman (Inactive) [ 19/Feb/14 ]

another one:
https://maloo.whamcloud.com/test_sets/1ae74dec-99b3-11e3-83d7-52540035b04c

Comment by Peter Jones [ 24/Feb/14 ]

Bruno

Could you please take care of this patch?

Thanks

Peter

Comment by Bruno Faccini (Inactive) [ 11/Mar/14 ]

After some unrelated errors and a re-trigger, the patch successfully passed auto-tests. Only waiting for more reviewers' feedback now.

Comment by Jodi Levi (Inactive) [ 12/Mar/14 ]

Patch landed to master. Will be backported to b2_5.

Comment by James Nunez (Inactive) [ 18/Mar/14 ]

Patch for b2_5 at http://review.whamcloud.com/#/c/9708/

Comment by James Nunez (Inactive) [ 09/Apr/14 ]

http://review.whamcloud.com/#/c/9708/ landed to b2_5.

Generated at Sat Feb 10 01:43:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.