[LU-10302] hsm: obscure bug with multi-mountpoints and ldlm

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.11.0

    Description

      I do not have much to share except the attached reproducer.

      The key elements of the reproducer seem to be (a rough shell sketch of these steps follows the list):

      1. set up Lustre with two client mountpoints;
      2. create a file;
      3. launch a copytool on mountpoint A;
      4. suspend the copytool;
      5. archive the file created at step 2 from mountpoint A*;
      6. delete the file on mountpoint B;
      7. sync;
      8. resume the copytool (its output should indicate that llapi_hsm_action_begin() failed with EIO rather than ENOENT);
      9. umount => the process hangs in an unkillable state.

      *You can use mountpoint B at step 5, but only if you created the file from mountpoint A.
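
      For reference, a rough shell sketch of those steps on a single-node test setup. It assumes /mnt/lustre and /mnt/lustre2 are two client mounts of the same filesystem and lhsmtool_posix is the copytool; the MGS node name, paths and archive number are illustrative, not taken from the attached reproducer.

      # step 1: two client mounts of the same filesystem
      # (HSM must be enabled on the MDT, e.g. lctl set_param mdt.*.hsm_control=enabled)
      mount -t lustre mgsnode@tcp:/lustre /mnt/lustre      # mountpoint A
      mount -t lustre mgsnode@tcp:/lustre /mnt/lustre2     # mountpoint B

      # step 2: create a file on mountpoint A
      dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=1

      # step 3: copytool on mountpoint A (archive directory is illustrative)
      lhsmtool_posix --daemon --hsm-root /tmp/arc --archive=1 /mnt/lustre
      CT_PID=$(pgrep -f lhsmtool_posix)

      kill -STOP "$CT_PID"                      # step 4: suspend the copytool
      lfs hsm_archive /mnt/lustre/testfile      # step 5: archive from mountpoint A
      rm /mnt/lustre2/testfile                  # step 6: delete on mountpoint B
      sync                                      # step 7
      kill -CONT "$CT_PID"                      # step 8: resume; expect EIO from llapi_hsm_action_begin()

      # step 9: unmount; this is where the process hangs
      umount /mnt/lustre2
      umount /mnt/lustre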

      I added some debug output to the reproducer; it should be logged in /tmp.

      I suspect these two lines in dmesg are related to the issue (they are logged at umount time):

      [  143.575078] LustreError: 3703:0:(ldlm_resource.c:1094:ldlm_resource_complain()) filter-lustre-OST0000_UUID: namespace resource [0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [  143.578233] LustreError: 3703:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount = 2
      

      Note: the title should probably be updated once we figure out exactly what the issue is.

    Attachments

    Issue Links

    Activity

            pjones Peter Jones added a comment -

            Landed for 2.11


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30477/
            Subject: LU-10302 ldlm: destroy lock if LVB init fails
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c91cb6ee81e7751b719228efa58dc32fdea836e5

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30477
            Subject: LU-10302 ldlm: destroy lock if LVB init fails
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0be0459c0b1409c790a214a73735673ed9907b57

            bougetq Quentin Bouget (Inactive) added a comment -

            I cannot reproduce the bug anymore when I apply the patch you proposed for LU-10357. Thank you!

            Maybe we can keep this LU to fix search_inode_for_lustre() or ofd_lvbo_init()... or both, depending on what makes more sense. =)

            jhammond John Hammond added a comment -

            BTW, the CT is able to hit this because it calls search_inode_for_lustre() to get the data version, so it does not see that the file has been deleted.

            jhammond John Hammond added a comment -

            You are seeing the fact that the lock and resource reference counting in LDLM is intolerant of some lvbo init errors. In particular, if ofd_lvbo_init() fails because the object could not be found, then a reference on the resource is somehow leaked.

            bougetq Quentin Bouget (Inactive) added a comment - - edited

            The condition to trigger the bug is a bit more complex than I first thought: lhsmtool_posix != rm && !(create == lfs hsm_archive == rm)

            The more verbose version: lhsmtool_posix and rm are run on different mountpoints, and the file is not created, archived and deleted from the same mountpoint.

            I am not sure how useful this is. I am putting it here... just in case.
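
            To make the combinations concrete, a hypothetical pair of runs (paths are illustrative; /mnt/lustre is mountpoint A where lhsmtool_posix runs, /mnt/lustre2 is mountpoint B; the suspend/sync choreography from the description is omitted for brevity):

            # triggers: copytool on A; file created and archived via A, removed via B
            lhsmtool_posix --daemon --hsm-root /tmp/arc --archive=1 /mnt/lustre
            touch /mnt/lustre/f && lfs hsm_archive /mnt/lustre/f && rm /mnt/lustre2/f

            # does not trigger: copytool still on A, but create/archive/rm all via B
            touch /mnt/lustre2/g && lfs hsm_archive /mnt/lustre2/g && rm /mnt/lustre2/g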

            pjones Peter Jones added a comment -

            Bruno

            Can you look into this one?

            Thanks

            Peter


            bougetq Quentin Bouget (Inactive) added a comment -

            Letting the HSM request time out is not required to reproduce the issue; what matters is syncing data/metadata.

            I updated the description (once again) and the reproducer accordingly.

            bougetq Quentin Bouget (Inactive) added a comment -

            My bad, I updated the description: the client unmount hangs.

            > Is this problem hit in normal usage?

            The reproducer I provided works on a single-node setup, but you can also reproduce on a multi-node setup (copytool on one node, client doing the rm on another node), so this definitely impacts production setups.
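
            A rough sketch of that multi-node variant, assuming node1 and node2 are two Lustre clients of the same filesystem mounted at /mnt/lustre (hostnames and paths are illustrative):

            # node1: run the copytool, then archive the file (with the copytool
            # suspended, as in the description)
            node1$ lhsmtool_posix --daemon --hsm-root /tmp/arc --archive=1 /mnt/lustre
            node1$ lfs hsm_archive /mnt/lustre/testfile

            # node2: remove the same file through its own mount, then sync
            node2$ rm /mnt/lustre/testfile
            node2$ sync

            # node1: resume the copytool, then unmount; the umount hangs
            node1$ umount /mnt/lustre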

            adilger Andreas Dilger added a comment -

            Quentin, it isn't clear from your bug report what the actual problem is that you are hitting. Does the client unmount fail, or are the error messages unexpected but not a problem otherwise? Is this problem hit in normal usage?

            It does look like the copytool is holding a lock reference on the OST object longer than it should be, but they should be cleaned up at unmount.

            People

              Assignee: jhammond John Hammond
              Reporter: cealustre CEA
              Votes: 0
              Watchers: 7
