
[LU-6203] sanity-hsm test 251: FAIL: Copytool failed to stop in 20s

Details


    Description

      sanity-hsm test 251 failed as follows:

      CMD: shadow-26vm10 pkill -INT -x lhsmtool_posix
      CMD: shadow-26vm10 pgrep -x lhsmtool_posix
      shadow-26vm10: 7902
      Copytool still running on shadow-26vm10
      CMD: shadow-26vm10 pgrep -x lhsmtool_posix
      shadow-26vm10: 7902
      Copytool still running on shadow-26vm10
      CMD: shadow-26vm10 pgrep -x lhsmtool_posix
      shadow-26vm10: 7902
      Copytool still running on shadow-26vm10
      CMD: shadow-26vm10 pgrep -x lhsmtool_posix
      shadow-26vm10: 7902
      Copytool still running on shadow-26vm10
      CMD: shadow-26vm10 pgrep -x lhsmtool_posix
      shadow-26vm10: 7902
      Copytool still running on shadow-26vm10
      CMD: shadow-26vm10 pgrep -x lhsmtool_posix
      shadow-26vm10: 7902
      Copytool still running on shadow-26vm10
      CMD: shadow-26vm10 pgrep -x lhsmtool_posix
      shadow-26vm10: 7902
      Copytool still running on shadow-26vm10
      CMD: shadow-26vm10 pgrep -x lhsmtool_posix
      shadow-26vm10: 7902
      Copytool still running on shadow-26vm10
      CMD: shadow-26vm10 pgrep -x lhsmtool_posix
      shadow-26vm10: 7902
      Copytool still running on shadow-26vm10
      CMD: shadow-26vm10 pgrep -x lhsmtool_posix
      shadow-26vm10: 7902
       sanity-hsm test_251: @@@@@@ FAIL: Copytool failed to stop in 20s ... 
      

      Maloo reports:
      https://testing.hpdd.intel.com/test_sets/49281842-a9ef-11e4-8c6f-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/4cac5380-aa11-11e4-a5c6-5254006e85c2
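
      For context, the failing check follows the usual stop-and-poll pattern: send SIGINT to lhsmtool_posix, then poll with pgrep until the process is gone or a 20s budget runs out. A minimal sketch of that pattern (hypothetical helper name, and plain ssh instead of the test framework's remote-execution wrappers):

      # Sketch only: the real sanity-hsm code wraps these commands with the
      # test framework's remote-execution helpers rather than raw ssh.
      stop_copytool() {
              local node=$1
              local timeout=20 waited=0

              ssh "$node" "pkill -INT -x lhsmtool_posix"
              while ssh "$node" "pgrep -x lhsmtool_posix"; do
                      echo "Copytool still running on $node"
                      (( waited >= timeout )) && return 1
                      sleep 2
                      (( waited += 2 ))
              done
              return 0
      }

      In the failing runs above, the loop keeps seeing PID 7902 until the 20s budget is exhausted and the test reports the FAIL line.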

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13646/
            Subject: LU-6203 tests: early lock cancel to allow early copytool death
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 446ad1ef8be5b282224336817936043b42a10fee

            bfaccini Bruno Faccini (Inactive) added a comment -

            The last 2 failure occurrences again show the same delay in copytool death/cleanup: it is stuck for tens of seconds waiting for the lock flush/cancel (between the Client, which created the file and requested its archive, and the OST[s]) upon archiving the file.

            So far this seems to occur only on ZFS, and I have verified that my patch, which adds a cancel_lru_locks() before hsm_archive, allows the copytool copy to start immediately; normally it is delayed to allow for the previous Client's lock flush/cancel.
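
            For illustration, the shape of such a change in a sanity-hsm-style test is roughly the following (a sketch only, with a placeholder file name and illustrative helper calls; the actual patch is http://review.whamcloud.com/13646):

            # Sketch: flush/cancel the client's cached OSC locks before requesting
            # the archive, so the copytool's read of the file is not delayed by the
            # client's lock flush/cancel. File name and helpers are illustrative.
            local f=$DIR/$tdir/$tfile
            local fid=$(copy_file /etc/hosts $f)

            cancel_lru_locks osc
            $LFS hsm_archive $f || error "hsm_archive failed"
            wait_request_state $fid ARCHIVE SUCCEED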
            yong.fan nasf (Inactive) added a comment - Another failure instance on b2_5: https://testing.hpdd.intel.com/test_sets/465bb228-bbfa-11e4-a79b-5254006e85c2

            bfaccini Bruno Faccini (Inactive) added a comment -

            Concerning the "https://testing.hpdd.intel.com/test_sets/97fd06d8-ac1c-11e4-992b-5254006e85c2" case, the 'Copytool failed to stop in 20s ...' errors/symptoms there look more like an additional consequence (because cleanup()/copytool_cleanup() runs as a trap, set in copytool_cleanup(), upon exit) of an earlier problem within the same sub-test or even in preceding sub-tests:

            _ The 1st sub-test reported as failing with 'Copytool failed to stop in 20s ...' is test_33, but it had previously failed with "sanity-hsm test_33: @@@@@@ FAIL: request on 0x200007931:0x2b:0x0 is not SUCCEED on mds1" while waiting for the archive to complete/succeed.

            _ The 2nd sub-test reported as failing with 'Copytool failed to stop in 20s ...' is test_60, but it had previously failed with "sanity-hsm test_60: @@@@@@ FAIL: Timed out waiting for progress update!" while waiting for a progress update during the archive.

            _ The 3rd sub-test reported as failing with 'Copytool failed to stop in 20s ...' is test_70; it runs just after test_60 and, since its 1st command is a copytool_cleanup, it likely hit the same problem as the preceding sub-test.

            _ The 4th sub-test reported as failing with 'Copytool failed to stop in 20s ...' is test_71; it runs just after test_70 and, since its 1st command is a copytool_cleanup, it likely hit the same problem as the 2 preceding sub-tests.

            _ The 5th sub-test reported as failing with 'Copytool failed to stop in 20s ...' is test_103, and according to its specific logs it is the only one that seems to have triggered the same scenario (huge delay during lock flush/cancel processing) that I already described in my previous update. So it may be another candidate for the same change as test_251.

            After more Lustre debug log reading, it seems that the "huge delay during lock flush/cancel processing", which appears to be the root cause of the problem, is mainly on the OSS side, after the Client has handled the Blocking callback and sent its lock Cancel back to the OSS. The thread handling it on the OSS can then spend multiple tens of seconds in ldlm_request_cancel()->ldlm_lock_cancel()->ldlm_cancel_callback()->... and highly probably tgt_blocking_ast()->tgt_sync()->dt_object_sync()->osd_object_sync(). So is this finally some kind of ZFS performance issue?
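
            The carry-over effect described above comes from the EXIT trap: once it is installed, the cleanup runs at the end of every sub-test, so a copytool left blocked by an earlier failure makes each later cleanup hit the same 20s limit. A stripped-down illustration (hypothetical function and variable names, not the actual sanity-hsm.sh code, reusing the stop_copytool sketch from the description above):

            copytool_setup() {
                    lhsmtool_posix --daemon --hsm-root "$HSM_ARCHIVE" "$MOUNT" &
                    # The EXIT trap fires even when the sub-test fails for an
                    # unrelated reason (archive never reaching SUCCEED, etc.).
                    trap copytool_cleanup EXIT
            }

            copytool_cleanup() {
                    trap - EXIT
                    # Same stop-and-poll pattern as in the description; if the
                    # copytool is still blocked on a lock flush/cancel from an
                    # earlier sub-test, the 20s timeout fires here instead.
                    stop_copytool "$(hostname)" ||
                            error "Copytool failed to stop in 20s"
            }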
            yujian Jian Yu added a comment -

            The failure started affecting patch review testing on Lustre b2_5 patches:
            https://testing.hpdd.intel.com/test_sets/66e27944-acde-11e4-872a-5254006e85c2

            yujian Jian Yu added a comment -

            The zfs full group test session was not run on the master branch. That may be why we did not find any failure instances on the master branch in Maloo.

            In the following report on the b2_5 branch, many sub-tests failed with this issue:
            https://testing.hpdd.intel.com/test_sets/97fd06d8-ac1c-11e4-992b-5254006e85c2


            bfaccini Bruno Faccini (Inactive) added a comment -

            http://review.whamcloud.com/13646 implements an early lock cancel solution to speed up copytool death.

            People

              bfaccini Bruno Faccini (Inactive)
              yujian Jian Yu
