Test failure sanity-hsm test_402: Copytool start should have failed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.6.0, Lustre 2.5.3
    • Lustre 2.6.0, Lustre 2.5.1
    • 3
    • 12182

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/788a5dda-765f-11e3-b3c0-52540035b04c.

      The sub-test test_402 failed with the following error:

      Copytool start should have failed

      Info required for matching: sanity-hsm 402

          Activity

            [LU-4438] Test failure sanity-hsm test_402: Copytool start should have failed
            jamesanunez James Nunez (Inactive) added a comment - Patch for b2_5 at http://review.whamcloud.com/#/c/10715/
            pjones Peter Jones added a comment - Landed for 2.6
            utopiabound Nathaniel Clark added a comment (edited) - This seems like it would be nice to have for b2_5 once it lands on master, but it doesn't seem like it should hold up a 2.5 release. This is a minor issue in copytool.

            bfaccini Bruno Faccini (Inactive) added a comment - Had to re-base again due to unrelated failures during Maloo/auto-tests session.
            jamesanunez James Nunez (Inactive) added a comment - Hit this problem in review-zfs at https://maloo.whamcloud.com/test_sets/e11b4944-c822-11e3-888b-52540035b04c

            bfaccini Bruno Faccini (Inactive) added a comment - Had to re-base patch-set #4 due to Maloo/auto-tests failures related to LU-4805 ...

            bfaccini Bruno Faccini (Inactive) added a comment -

            In fact, a closer look at the copytool (lhsmtool_posix) source code shows that even when ct_setup(), which checks that both the archive root and the Lustre mount point are available, returns an error, main() continues as if everything were OK, so new threads start and run into further errors.

            So rather than adding a delay to wait for all copytool threads to finish error processing and die, the fix here seems to be to handle the ct_setup() error and exit immediately.

            A patch implementing this is at http://review.whamcloud.com/9853.
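
As a rough illustration of the control flow this comment describes (this is not the actual lhsmtool_posix code or the patch under review), the sketch below returns as soon as setup fails, before any worker is started. The names ct_setup_stub(), ct_run_stub(), and copytool_main() are hypothetical stand-ins, as is the choice of -ENOENT as the setup error.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for ct_setup(): returns 0 on success, or a
 * negative errno value when a required mount point is unavailable. */
int ct_setup_stub(int mounts_available)
{
    return mounts_available ? 0 : -ENOENT;
}

/* Hypothetical stand-in for the work done after setup
 * (starting worker threads, processing requests). */
int ct_run_stub(void)
{
    return 0;
}

/* Setup-then-run sequence: return immediately when setup fails, so no
 * worker is ever started against a missing mount point. */
int copytool_main(int mounts_available, int *workers_started)
{
    int rc;

    assert(workers_started != NULL);
    *workers_started = 0;

    rc = ct_setup_stub(mounts_available);
    if (rc < 0) {
        fprintf(stderr, "copytool setup failed: %s\n", strerror(-rc));
        return rc; /* the caller is expected to exit with a failure status */
    }

    *workers_started = 1;
    return ct_run_stub();
}
```

A real main() would turn the negative return value into a non-zero exit status, which is what allows a test script to see that the copytool start failed.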

            bfaccini Bruno Faccini (Inactive) added a comment -

            Looking at the Maloo failures, this appears to be a timing issue: some copytool threads take too long to die because of the error handling for the deactivated MDT.

            Maybe we should add a delay after copytool_setup before checking whether the copytool is still present (in search_copytools()), or change the tested/grep'ed pattern to the copytool's main PID?
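
The delay-based option raised in this comment amounts to polling for process exit with a timeout instead of checking once. The sketch below shows that idea in C under stated assumptions: the real check lives in the sanity-hsm shell script, the helpers wait_for_exit() and spawn_slow_exit() are hypothetical, and waitpid() only works here because the slow process is a child of the caller.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Poll until child `pid` has exited or `timeout_ms` has elapsed.
 * Returns 1 if the child exited, 0 on timeout, -1 on waitpid() error. */
int wait_for_exit(pid_t pid, int timeout_ms)
{
    const struct timespec step = { 0, 10L * 1000L * 1000L }; /* 10 ms */
    int waited_ms = 0;

    assert(timeout_ms >= 0);

    for (;;) {
        int status;
        pid_t rc = waitpid(pid, &status, WNOHANG);

        if (rc == pid)
            return 1;   /* child has terminated */
        if (rc < 0)
            return -1;  /* not our child, or another error */
        if (waited_ms >= timeout_ms)
            return 0;   /* still running when the deadline passed */

        nanosleep(&step, NULL);
        waited_ms += 10;
    }
}

/* Hypothetical stand-in for a copytool that needs `delay_ms`
 * to finish its error handling before it dies. */
pid_t spawn_slow_exit(int delay_ms)
{
    pid_t pid = fork();

    if (pid == 0) {
        struct timespec d = { delay_ms / 1000,
                              (long)(delay_ms % 1000) * 1000000L };
        nanosleep(&d, NULL);
        _exit(1);
    }
    return pid;
}
```

With a timeout of 0 the function behaves like the one-shot check, which can still see a process that is in the middle of dying; a larger timeout gives that process time to exit.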
            yujian Jian Yu added a comment - By searching on Maloo, I found more instances on the Lustre b2_5 and master branches:
            https://maloo.whamcloud.com/test_sets/0bffe69a-9a26-11e3-965c-52540035b04c
            https://maloo.whamcloud.com/test_sets/e604c3fa-9586-11e3-bbde-52540035b04c
            https://maloo.whamcloud.com/test_sets/23b6468a-9a1c-11e3-93d7-52540035b04c
            https://maloo.whamcloud.com/test_sets/5cacc594-9a40-11e3-baa9-52540035b04c

            bogl Bob Glossman (Inactive) added a comment -

            Not unique at all. Another:
            https://maloo.whamcloud.com/test_sets/7c39f198-9a72-11e3-ba17-52540035b04c

            Maloo says "Failure Rate: 5.00% of last 100 executions [all branches]", so it must be happening fairly often.

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: maloo Maloo