  Lustre / LU-5242

Test hang sanity test_132, test_133: umount ost

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.8.0
    • Affects Versions: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3
    • 3
    • 14622

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:

      http://maloo.whamcloud.com/test_sets/e5783778-f887-11e3-b13a-52540035b04c.

      The sub-test test_132 failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity 132

      Attachments

      Issue Links

      Activity

            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13630/
            Subject: LU-5242 osd-zfs: umount hang in sanity 133g
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9b704e4088d867851cdb011f0a2560b1e622555c

            gerrit Gerrit Updater added a comment -

            Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13805
            Subject: LU-5242 osd-zfs: umount hang in sanity 133g
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 817cf8a2e781d546508929a9f58b44561ae3361c

            bogl Bob Glossman (Inactive) added a comment -

            Another seen on b2_5 with ZFS: https://testing.hpdd.intel.com/test_sessions/435c3152-b816-11e4-9ecb-5254006e85c2

            adilger Andreas Dilger added a comment -

            I would agree with Alex on this. By deferring unlink of small files it will probably double or triple the total IO that the MDT is doing, because in addition to the actual dnode deletion it also needs to insert the dnode into the deathrow ZAP in one TXG and then delete it from the same ZAP in a different TXG. If a large number of objects is deleted at once (easily possible on the MDT), the deathrow ZAP may grow quite large (and never shrink), and updates would become less efficient than if it were kept small.

            isaac Isaac Huang (Inactive) added a comment -

            Thanks all. I'll work on a patch without the small-object optimization first, to get this bug fixed; then I'll benchmark to decide whether the small-object path is worth optimizing.

            bzzz Alex Zhuravlev added a comment -

            I have no objection to doing this as simply as possible. My point was that the MDT is known to be CPU-bound and ZAP (even microzap) isn't free.

            behlendorf Brian Behlendorf added a comment -

            I could see a case for zero-length files as a possible optimization. But I suspect that even for the MDT it would be more efficient to handle the freeing asynchronously, outside of any request processing. Even if the file is zero length, you're still going to be freeing a spill block for the xattrs and updating a dbuf for the dnode object. Personally I'd keep it as simple and concise as possible until it's clear something more is required. But that's just my preference.

            Keep in mind that none of this free space will be available for a couple of TXGs anyway.

            bzzz Alex Zhuravlev added a comment -

            I'd think that the optimization for small objects (notice most of the MDT's objects are literally empty) makes sense, as we wouldn't need to modify yet another ZAP twice. I guess there is no strong requirement to implement this right away, but still.

            behlendorf Brian Behlendorf added a comment -

            I like Andreas's idea of keeping the unlink behavior compatible with the ZPL. It would be ideal if you could reuse the existing ZPL functions, but those functions are tied quite closely to ZPL-specific data structures, so that's probably not workable. But the ZFS_UNLINKED_SET object itself is just a ZAP containing a list of object ids, and since objects on disk are already constructed to be compatible with the ZPL, we should be able to use it safely. Isaac's design is nice, but let me suggest a few minor tweaks:

            • Use the existing ZFS_UNLINKED_SET object linked from the MASTER_NODE_OBJ as the deathrow object.
            • In declare_object_destroy() and object_destroy(), just handle moving the object to the ZFS_UNLINKED_SET in a single TX.
            • In a dedicated thread, taskq, or generic Linux worker thread, regularly walk the ZFS_UNLINKED_SET and rely on dmu_free_long_range() to split the free over as many TXGs as required.
            • I don't think there's any advantage in handling small-object destruction synchronously in object_destroy(). It's simpler and probably more efficient to always do this asynchronously.
            • Start draining the ZFS_UNLINKED_SET right away when remounting the OSD (this happens during mount for the ZPL).

            People

              Assignee: isaac Isaac Huang (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 26

              Dates

                Created:
                Updated:
                Resolved: