Details

    • Technical task
    • Resolution: Fixed
    • Critical
    • Lustre 2.6.0, Lustre 2.5.1
    • Lustre 2.5.0
    • 9919

    Description

      In the restore case of hsm_cdt_request_completed(), if the copytool returned success but the layout swap fails then we get an unreadable file with HS_RELEASED clear but LOV_PATTERN_F_RELEASED set.

      Perhaps the new HSM attributes should be applied to the volatile object before layout swap, and hsm_swap_layouts() should call mo_swap_layouts() with SWAP_LAYOUTS_MDS_HSM set.

      Attachments

        Activity

          [LU-3834] hsm_cdt_request_completed() may clear HS_RELEASED on failed restore

          Patch was only landed to master and not b2_5. In the future, this type of patch should be cherry-picked to b2_5 so that it is fixed in the maintenance release.

          adilger Andreas Dilger added a comment - Patch was only landed to master and not b2_5. In the future, this type of patch should be cherry-picked to b2_5 so that it is fixed in the maintenance release.
          bfaccini Bruno Faccini (Inactive) added a comment - patch http://review.whamcloud.com/7631 has landed. Closing.
          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          Hehe, finally I found that my fault-injection code itself introduced some problem because being added after the volatile/2nd file layout change and not reverting it to mimic the error !! This caused the restored datas to be available as if restore succeed …

          I changed this in patch-set #13, and now new sub-test test_12o runs fine, returning errors on both copytool (ENOTSUPP, injected!) and client (ENODATA) sides with layout-swap fault-injection, and next restore attempt without fault-injection to be successful.

          Will run with build+patch locally and see if I can still reproduce the Volatile object leak on MDT, seen as part of this ticket and LU-4293.

          bfaccini Bruno Faccini (Inactive) added a comment - - edited Hehe, finally I found that my fault-injection code itself introduced some problem because being added after the volatile/2nd file layout change and not reverting it to mimic the error !! This caused the restored datas to be available as if restore succeed … I changed this in patch-set #13, and now new sub-test test_12o runs fine, returning errors on both copytool (ENOTSUPP, injected!) and client (ENODATA) sides with layout-swap fault-injection, and next restore attempt without fault-injection to be successful. Will run with build+patch locally and see if I can still reproduce the Volatile object leak on MDT, seen as part of this ticket and LU-4293 .

          Some update, after I added fault-injection (force -ENOENT in the middle of mdd_swap_layouts() to cause layouts swap back) and associated sub-test test_12o within patch-set #8.

          test_12o fails due to "diff" command, that caused the implicit restore, to be successful when it is expected to fail because of the fault-injection. Strange is that the Restore operation has been marked as failed, the Copytool received the error, and file still has the "released" flag set!!

          I wonder if there could be some issue in mdd_swap_layouts() causing this unexpected behavior ?

          bfaccini Bruno Faccini (Inactive) added a comment - Some update, after I added fault-injection (force -ENOENT in the middle of mdd_swap_layouts() to cause layouts swap back) and associated sub-test test_12o within patch-set #8. test_12o fails due to "diff" command, that caused the implicit restore, to be successful when it is expected to fail because of the fault-injection. Strange is that the Restore operation has been marked as failed, the Copytool received the error, and file still has the "released" flag set!! I wonder if there could be some issue in mdd_swap_layouts() causing this unexpected behavior ?

          I am wondering if I should also add some error injection to simulate SWAP_LAYOUT failure during restore ??

          I will also push a new patch-set #8 to address John's last comment and convert to usual error handling style.

          bfaccini Bruno Faccini (Inactive) added a comment - I am wondering if I should also add some error injection to simulate SWAP_LAYOUT failure during restore ?? I will also push a new patch-set #8 to address John's last comment and convert to usual error handling style.

          I found a possible bug in my original patch version causing layout-lock not to be released when restore is canceled … Just submitted patch-set #8 to fix this, will see if is passes auto-tests (particularly sanity-hsm/test_33 which was timing-out due to md5sum process never ending!!…).

          bfaccini Bruno Faccini (Inactive) added a comment - I found a possible bug in my original patch version causing layout-lock not to be released when restore is canceled … Just submitted patch-set #8 to fix this, will see if is passes auto-tests (particularly sanity-hsm/test_33 which was timing-out due to md5sum process never ending!!…).

          Latest patch-set #6 auto-tests failures appear not related to this patch but to multiple issues already addressed in others tickets (LU-4093, LU-4086, …). Some of them have heir patches that landed until now, so I will rebase it again …

          bfaccini Bruno Faccini (Inactive) added a comment - Latest patch-set #6 auto-tests failures appear not related to this patch but to multiple issues already addressed in others tickets ( LU-4093 , LU-4086 , …). Some of them have heir patches that landed until now, so I will rebase it again …

          1st patch attempt is at http://review.whamcloud.com/7631.
          Unfortunately it failed in multiple auto-tests and needs at least to re-base.
          But also failed in several sub-tests of sanity-hsm, which is highly suspect ...
          Currently under investigations.

          bfaccini Bruno Faccini (Inactive) added a comment - 1st patch attempt is at http://review.whamcloud.com/7631 . Unfortunately it failed in multiple auto-tests and needs at least to re-base. But also failed in several sub-tests of sanity-hsm, which is highly suspect ... Currently under investigations.
          jhammond John Hammond added a comment -

          > _ always call mo_swap_layouts() with SWAP_LAYOUTS_MDS_HSM flag from hsm_swap_layouts(), since (actually?) hsm_swap_layouts() is only called from mdt_hsm_update_request_state() for a RESTORE op.

          This seems better to me.

          jhammond John Hammond added a comment - > _ always call mo_swap_layouts() with SWAP_LAYOUTS_MDS_HSM flag from hsm_swap_layouts(), since (actually?) hsm_swap_layouts() is only called from mdt_hsm_update_request_state() for a RESTORE op. This seems better to me.

          Ok, I see where this can be fixed now, thank's John. But now, and to save time before submit patch, what is the preferred way to do this ? :

          _ always call mo_swap_layouts() with SWAP_LAYOUTS_MDS_HSM flag from hsm_swap_layouts(), since (actually?) hsm_swap_layouts() is only called from mdt_hsm_update_request_state() for a RESTORE op.

          _ add a new flags to hsm_swap_layouts() to enable SWAP_LAYOUTS_MDS_HSM flag use or not during call to mo_swap_layouts().

          bfaccini Bruno Faccini (Inactive) added a comment - Ok, I see where this can be fixed now, thank's John. But now, and to save time before submit patch, what is the preferred way to do this ? : _ always call mo_swap_layouts() with SWAP_LAYOUTS_MDS_HSM flag from hsm_swap_layouts(), since (actually?) hsm_swap_layouts() is only called from mdt_hsm_update_request_state() for a RESTORE op. _ add a new flags to hsm_swap_layouts() to enable SWAP_LAYOUTS_MDS_HSM flag use or not during call to mo_swap_layouts().

          If this is a simple fix, then we can work out a patch for this. Otherwise I'd like to put the resource on something else because it's unlikely for swap_layout to fail anyway.

          jay Jinshan Xiong (Inactive) added a comment - If this is a simple fix, then we can work out a patch for this. Otherwise I'd like to put the resource on something else because it's unlikely for swap_layout to fail anyway.

          People

            bfaccini Bruno Faccini (Inactive)
            jhammond John Hammond
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: