Lustre / LU-5242

Test hang sanity test_132, test_133: umount ost

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.8.0
    • Affects Versions: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:

      http://maloo.whamcloud.com/test_sets/e5783778-f887-11e3-b13a-52540035b04c.

      The sub-test test_132 failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity 132

      Activity

            adilger Andreas Dilger added a comment -

            I would agree with Alex on this. By deferring unlink of small files, it will probably double or triple the total IO that the MDT is doing, because in addition to the actual dnode deletion it also needs to insert the dnode into the deathrow ZAP in one TXG and then delete it from the same ZAP in a different TXG. If a large number of objects are being deleted at once (easily possible on the MDT), then the deathrow ZAP may get quite large (and never shrink), and updates would become less efficient than if it were kept small.

            isaac Isaac Huang (Inactive) added a comment -

            Thanks all. I'll work on a patch first without the small-object optimization to get this bug fixed; then I will benchmark to figure out whether to optimize the small-object path or not.

            bzzz Alex Zhuravlev added a comment -

            I have no objection to doing this as simply as possible. My point was that the MDT is known to be CPU-bound, and ZAP (even micro ZAP) isn't free.

            behlendorf Brian Behlendorf added a comment -

            I could see a case for zero-length files as a possible optimization. But I suspect that even for the MDT it would be more efficient to handle the freeing asynchronously, outside of any request processing. Even if the file is zero length, you're still going to be freeing a spill block for the xattrs and updating a dbuf for the dnode object. Personally, I'd keep it as simple and concise as possible until it's clear something more is required. But that's just my preference.

            Keep in mind that none of this free space will be available for a couple of TXGs anyway.

            bzzz Alex Zhuravlev added a comment -

            I'd think that the optimization for small objects (notice most of the MDT's objects are literally empty) makes sense, as we don't need to modify yet another ZAP twice. I guess there is no strong requirement to implement this right away, but still.

            behlendorf Brian Behlendorf added a comment -

            I like Andreas's idea of keeping the unlink behavior compatible with the ZPL. It would be ideal if you could reuse the existing ZPL functions, but those functions are tied quite closely to ZPL-specific data structures, so that's probably not workable. But the ZFS_UNLINKED_SET object itself is just a ZAP containing a list of object ids, and since objects on disk are already constructed to be compatible with the ZPL, we should be able to use it safely. Isaac's design is nice, but let me suggest a few minor tweaks:

            • Use the existing ZFS_UNLINKED_SET object linked from the MASTER_NODE_OBJ as the deathrow object.
            • In declare_object_destroy() and object_destroy(), just handle moving the object to the ZFS_UNLINKED_SET in a single TX.
            • In a dedicated thread, taskq, or generic Linux worker thread, regularly walk the ZFS_UNLINKED_SET and rely on dmu_free_long_range() to split the free over as many TXGs as required.
            • I don't think there's any advantage in handling small-object destruction synchronously in object_destroy(). It's simpler and probably more efficient to always do this asynchronously.
            • Start draining the ZFS_UNLINKED_SET right away when remounting the OSD (this happens during mount for the ZPL).
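            The enqueue-then-drain split described above can be modeled with a short Python sketch. This is an illustrative toy, not the osd-zfs C code: UnlinkedSet, enqueue, and drain are invented names, and the list append merely stands in for the block freeing that dmu_free_long_range() would spread over many TXGs.

```python
import threading
import queue

class UnlinkedSet:
    """Toy stand-in for the ZFS_UNLINKED_SET ZAP: object ids waiting
    for a background worker to free them."""

    def __init__(self):
        self.pending = queue.Queue()  # ids enqueued in the destroy TX
        self.freed = []               # ids the worker has processed

    def enqueue(self, object_id):
        # Models the single-TX step: the id enters the unlinked set
        # at the same time the object leaves the namespace.
        self.pending.put(object_id)

    def drain(self):
        # Models the worker walking the set; the real code would call
        # dmu_free_long_range() here, splitting work across TXGs.
        while True:
            oid = self.pending.get()
            if oid is None:           # shutdown sentinel
                break
            self.freed.append(oid)    # "free" the object's blocks

deathrow = UnlinkedSet()
worker = threading.Thread(target=deathrow.drain)
worker.start()
for oid in (101, 102, 103):
    deathrow.enqueue(oid)
deathrow.pending.put(None)
worker.join()
print(sorted(deathrow.freed))
```

            The point of the shape is that object_destroy() returns as soon as the enqueue commits; the expensive freeing never runs in request-processing context.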

            adilger Andreas Dilger added a comment -

            It makes sense to keep this compatible with the ZFS ZPL if at all possible, so that if large files are unlinked under the ZPL and then mounted as Lustre, or vice versa, we don't defer deleting them forever. I see this is handled in zfs_unlinked_add() and zfs_unlinked_drain() with a ZAP named "DELETE_QUEUE" (ZFS_UNLINKED_SET) in the MASTER_NODE_OBJ. Even if we need to implement our own routines to handle this, it makes sense to use the same ZAP and ZAP format (zap_add_int()).

            bzzz Alex Zhuravlev added a comment -

            From the Lustre point of view, the object is destroyed when it can't be found with lu_object_find(). IOW, when the corresponding FID is removed from the OI; this should be part of object destroy. Space accounting shouldn't be an issue, as the space isn't released immediately in any case: grants/quota can release reserved space upon commit, but a new reservation can be made only with a new statfs. Truncate/dnode destroy in osd_trans_stop() looks OK.
            isaac Isaac Huang (Inactive) added a comment - edited

            How about:

            • An object (maybe a ZAP) is created per OST/MDT to hold objects to be freed. Let's call it the deathrow object.
            • In declare_object_destroy() and object_destroy(), within a single TX, move the object to the deathrow, and update the OI and accounting ZAPs as well, so the object disappears from the OST/MDT namespace atomically.
              • If an object is small enough, it can be destroyed in the old way.
            • A separate thread works on the deathrow object, truncating and freeing every object in it.
              • This may break some tests that wait for free space to increase after object removal. We may also do at least the truncating in osd_trans_stop().

            All implemented at the osd-zfs layer.
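            The atomic-disappearance step in the proposal above can be sketched as follows. This is a hypothetical Python model only (Osd, create, and object_destroy are invented names): the three updates in object_destroy() represent what would be a single committed transaction, after which the FID is gone from the namespace while the blocks remain to be freed later.

```python
class Osd:
    """Toy model of one OST/MDT: object index (OI), space accounting,
    and a deathrow set of object ids awaiting asynchronous freeing."""

    def __init__(self):
        self.oi = {}           # fid -> object id
        self.used_bytes = 0    # accounting stand-in
        self.deathrow = set()  # ids queued for the deathrow worker

    def create(self, fid, oid, size):
        self.oi[fid] = oid
        self.used_bytes += size

    def object_destroy(self, fid, size):
        # All three updates belong to one TX: once it commits,
        # lu_object_find() can no longer resolve the FID, but the
        # actual block freeing happens later, off the deathrow set.
        oid = self.oi.pop(fid)
        self.used_bytes -= size
        self.deathrow.add(oid)

osd = Osd()
osd.create("fid-1", 42, 4096)
osd.object_destroy("fid-1", 4096)
print("fid-1" in osd.oi, 42 in osd.deathrow)
```

            Accounting is decremented in the same TX, which matches the observation below that reserved space is only rechecked via a later statfs anyway.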

            bzzz Alex Zhuravlev added a comment -

            Well, that would mean we can't atomically destroy huge objects using OUT, which I'd like to do to batch OST object destroys from the MDT.
            adilger Andreas Dilger added a comment - edited

            Note that by the time the OST object is being destroyed, it has already been unlinked from the MDS namespace, and that unlink is committed on the MDS. The OFD has also revoked client locks and discarded any cached dirty pages on the clients, so there is no danger in destroying the object in stages. That is what ext4 does internally (truncating blocks in small chunks from the end of the file).

            Since the MDT is already logging the object destroy locally and will resend it if the OSS crashes, and the OST object hasn't been deleted from the OST namespace yet, I don't see a requirement for more infrastructure to handle this at the OSD level. It is fine to truncate the object before destroy FOR THE OST ONLY, and higher layers will handle it.

            That said, since the OSD code is common, we shouldn't be exposing truncated-but-not-unlinked objects on the MDT, so it may be easiest to truncate to zero explicitly from the OFD before the object destroy?
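            The staged destroy mentioned above (ext4-style truncation in bounded chunks from the end of the file) can be illustrated with a tiny sketch. This is hypothetical Python, not ext4 or osd-zfs code; the function name and the chunk size of 4 blocks are arbitrary choices for the example.

```python
def staged_truncate(nblocks, chunk=4):
    """Free a file's blocks from the end in bounded chunks, yielding
    the remaining length after each transaction-sized step."""
    while nblocks > 0:
        # Each iteration models one committed transaction that frees
        # at most `chunk` blocks from the tail of the file.
        nblocks = max(0, nblocks - chunk)
        yield nblocks

print(list(staged_truncate(10)))   # [6, 2, 0]
```

            Because each step commits independently, a crash mid-way leaves a shorter-but-valid object, which is exactly why exposing such partially truncated objects is acceptable on the OST but undesirable on the MDT.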

            People

              Assignee: isaac Isaac Huang (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 26
