Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.5.0
    • 9919

    Description

      In the restore case of hsm_cdt_request_completed(), if the copytool returns success but the layout swap fails, we get an unreadable file with HS_RELEASED clear but LOV_PATTERN_F_RELEASED set.

      Perhaps the new HSM attributes should be applied to the volatile object before layout swap, and hsm_swap_layouts() should call mo_swap_layouts() with SWAP_LAYOUTS_MDS_HSM set.

        Activity

          [LU-3834] hsm_cdt_request_completed() may clear HS_RELEASED on failed restore
          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          Andreas, your b2_5 patch for this ticket at http://review.whamcloud.com/9212 has exposed a flaw in sanity-hsm/test_12o (which comes from the original patch http://review.whamcloud.com/7631, also from this ticket!!) during the auto-tests session.

          This new problem is tracked in LU-4613, where I have already pushed a patch to master (http://review.whamcloud.com/9235), since #7631 has already landed on master. But what should we do for the b2_5 version you just pushed?

          bfaccini Bruno Faccini (Inactive) added a comment

          OK, thanks Andreas. I understand now that this is also my responsibility: if a patch is required for earlier versions, I need to either create and push a new patch for each of those versions or ask Oleg to cherry-pick the original patch to each of them.

          I don't know why, but I thought that the patch integration/release decision was made by other people (you, Oleg, Peter, …). Maybe it is simply that you do this verification work very frequently and do the job for lazy guys like me!!

          adilger Andreas Dilger added a comment

          Bruno, the patch was marked as affecting the 2.5.0 release. I'm just going through patches that have landed to master and trying to see which ones still need to be landed for 2.5.1, since that is the long-term maintenance release. If you are closing a bug, you should consider whether it fixes a serious problem that may affect earlier versions of Lustre and should therefore land on the maintenance release. In many cases, Oleg can cherry-pick the patch directly to b2_5 without putting it through Gerrit/Jenkins/autotest again, but he needs to know to do this.

          bfaccini Bruno Faccini (Inactive) added a comment

          Hello Andreas,
          I am sorry if I missed something here. To be honest, I mainly focus on getting the patch done for the branch where the problem was reported. But should I then create a new version of the patch for each affected version listed?

          adilger Andreas Dilger added a comment

          The patch was only landed to master and not to b2_5. In the future, this type of patch should be cherry-picked to b2_5 so that it is fixed in the maintenance release.

          bfaccini Bruno Faccini (Inactive) added a comment

          Patch http://review.whamcloud.com/7631 has landed. Closing.

          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          Hehe, I finally found that my fault-injection code itself introduced a problem: it was added after the layout change of the volatile/2nd file and did not revert that change to mimic the error!! This made the restored data available as if the restore had succeeded …

          I changed this in patch-set #13, and the new sub-test test_12o now runs fine: with layout-swap fault-injection, errors are returned on both the copytool (ENOTSUPP, injected!) and client (ENODATA) sides, and the next restore attempt, without fault-injection, is successful.

          I will run with build+patch locally and see if I can still reproduce the volatile object leak on the MDT, seen as part of this ticket and LU-4293.

          bfaccini Bruno Faccini (Inactive) added a comment

          Some update, after I added fault-injection (forcing -ENOENT in the middle of mdd_swap_layouts() to cause the layouts to be swapped back) and the associated sub-test test_12o in patch-set #8.

          test_12o fails because the "diff" command, which triggers the implicit restore, succeeds when it is expected to fail because of the fault-injection. The strange thing is that the restore operation has been marked as failed, the copytool received the error, and the file still has the "released" flag set!!

          I wonder if there could be some issue in mdd_swap_layouts() causing this unexpected behavior?

          bfaccini Bruno Faccini (Inactive) added a comment

          I am wondering if I should also add some error injection to simulate a SWAP_LAYOUT failure during restore??

          I will also push a new patch-set #8 to address John's last comment and convert to the usual error-handling style.

          bfaccini Bruno Faccini (Inactive) added a comment

          I found a possible bug in my original patch version that causes the layout lock not to be released when a restore is canceled … I just submitted patch-set #8 to fix this and will see if it passes auto-tests (particularly sanity-hsm/test_33, which was timing out because the md5sum process never ended!!…).
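
          A lock that is not released on one exit path is the kind of bug that the single-exit "goto out" style is meant to prevent. The following is a minimal sketch of that style, assuming a hypothetical toy_lock counter and toy_restore() function; it does not reproduce the actual patch or the real layout-lock API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical lock, reduced to a hold counter. */
struct toy_lock {
    int held;
};

static void toy_lock_take(struct toy_lock *lk)
{
    lk->held++;
}

static void toy_lock_drop(struct toy_lock *lk)
{
    assert(lk->held > 0);
    lk->held--;
}

/* Every path after toy_lock_take() leaves through the out label, so the
 * lock is dropped whether the restore succeeds, fails, or is canceled. */
static int toy_restore(struct toy_lock *lk, bool canceled, bool swap_fails)
{
    int rc = 0;

    toy_lock_take(lk);
    if (canceled) {
        rc = -ECANCELED;
        goto out;
    }
    if (swap_fails) {
        rc = -EIO;
        goto out;
    }
    /* the work of a successful restore would happen here */
out:
    toy_lock_drop(lk);
    return rc;
}
```

          In this sketch a canceled restore returns -ECANCELED with the hold counter back at zero, so a later reader would not block on a stale lock.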

          People

            bfaccini Bruno Faccini (Inactive)
            jhammond John Hammond