[LU-5242] Test hang sanity test_132, test_133: umount ost

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3
    • Severity: 3
    • 14622

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:

      http://maloo.whamcloud.com/test_sets/e5783778-f887-11e3-b13a-52540035b04c

      The sub-test test_132 failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity 132

          Activity


            behlendorf Brian Behlendorf added a comment -

            I like Andreas's idea of keeping the unlink behavior compatible with the ZPL. It would be ideal if you could reuse the existing ZPL functions, but those functions are tied quite closely to ZPL-specific data structures, so that's probably not workable. But the ZFS_UNLINKED_SET object itself is just a ZAP containing a list of object ids, and since objects on disk are already constructed to be compatible with the ZPL, we should be able to safely use it. Isaac's design is nice, but let me suggest a few minor tweaks (see the sketch after this list):

            • Use the existing ZFS_UNLINKED_SET object linked to the MASTER_NODE_OBJ as the deathrow object.
            • In declare_object_destroy() and object_destroy(), just handle moving the object to the ZFS_UNLINKED_SET in a single TX.
            • In a dedicated thread, taskq, or generic Linux worker thread, regularly walk the ZFS_UNLINKED_SET and rely on dmu_free_long_range() to split the free over as many TXGs as required.
            • I don't think there's any advantage in handling small object destruction synchronously in object_destroy(). It's simpler and probably more efficient to always do this asynchronously.
            • Start draining the ZFS_UNLINKED_SET right away when remounting the OSD (this happens during mount for the ZPL).
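            For illustration, a minimal sketch of what such a drain loop might look like at the osd-zfs layer, modeled on zfs_unlinked_drain(); osd_unlinked_drain() is a hypothetical name, the caller is assumed to have already looked up the ZFS_UNLINKED_SET object, and error handling is reduced to skipping the entry:

            /* Hypothetical drain loop, modeled on zfs_unlinked_drain(). */
            static void
            osd_unlinked_drain(objset_t *os, uint64_t unlinked_obj)
            {
                    zap_cursor_t    zc;
                    zap_attribute_t za;
                    dmu_tx_t        *tx;
                    uint64_t        oid;

                    zap_cursor_init(&zc, os, unlinked_obj);
                    while (zap_cursor_retrieve(&zc, &za) == 0) {
                            /* zap_add_int() stored the object id as the entry value */
                            oid = za.za_first_integer;

                            /* Free the object's data over as many TXGs as needed. */
                            if (dmu_free_long_range(os, oid, 0, DMU_OBJECT_END) == 0) {
                                    /* Free the dnode and drop it from the set in one TX. */
                                    tx = dmu_tx_create(os);
                                    dmu_tx_hold_free(tx, oid, 0, DMU_OBJECT_END);
                                    dmu_tx_hold_zap(tx, unlinked_obj, FALSE, NULL);
                                    if (dmu_tx_assign(tx, TXG_WAIT) == 0) {
                                            (void) dmu_object_free(os, oid, tx);
                                            (void) zap_remove_int(os, unlinked_obj, oid, tx);
                                            dmu_tx_commit(tx);
                                    } else {
                                            dmu_tx_abort(tx);
                                    }
                            }
                            zap_cursor_advance(&zc);
                    }
                    zap_cursor_fini(&zc);
            }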

            adilger Andreas Dilger added a comment -

            It makes sense to keep this compatible with the ZFS ZPL if at all possible, so that if large files are unlinked under the ZPL and then mounted as Lustre, or vice versa, we don't defer deleting them forever. I see this is handled in zfs_unlinked_add() and zfs_unlinked_drain() with a ZAP named "DELETE_QUEUE" (ZFS_UNLINKED_SET) in the MASTER_NODE_OBJ. Even if we need to implement our own routines to handle this, it makes sense to use the same ZAP and ZAP format (zap_add_int()).
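            For reference, a minimal sketch of the enqueue side under these assumptions; osd_unlinked_add() is a hypothetical name:

            /* Find the ZPL's unlinked set and record an object id in it
             * within the caller's TX, using the same entry format as the
             * ZPL (zap_add_int() keys the entry by the integer value). */
            static int
            osd_unlinked_add(objset_t *os, uint64_t oid, dmu_tx_t *tx)
            {
                    uint64_t        unlinked_obj;
                    int             err;

                    /* ZFS_UNLINKED_SET ("DELETE_QUEUE") is registered in the
                     * MASTER_NODE_OBJ, so the ZPL and Lustre resolve the same ZAP. */
                    err = zap_lookup(os, MASTER_NODE_OBJ, ZFS_UNLINKED_SET,
                                     8, 1, &unlinked_obj);
                    if (err != 0)
                            return (err);

                    return (zap_add_int(os, unlinked_obj, oid, tx));
            }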

            bzzz Alex Zhuravlev added a comment -

            From the Lustre point of view, the object is destroyed when it can't be found with lu_object_find(); IOW, when the corresponding FID is removed from the OI. This should be a part of object destroy. Space accounting shouldn't be an issue, as space isn't released immediately in any case: grants/quota can release reserved space upon commit, but a new reserve can be made only with a new statfs. Truncate/dnode destroy in osd_trans_stop() looks OK.
            isaac Isaac Huang (Inactive) added a comment - edited

            How about:

            • An object (maybe a ZAP) is created per OST/MDT to hold objects to be freed. Let's call it the deathrow object.
            • In declare_object_destroy() and object_destroy(), within a single TX, move the object to the deathrow, and update the OI and accounting ZAPs as well, so the object disappears from the OST/MDT namespace atomically.
              • If an object is small enough, it can be destroyed the old way.
            • A separate thread works on the deathrow object, truncating and freeing every object in it.
              • This may break some tests that wait for free space to increase after object removal. We may also do at least the truncating in osd_trans_stop().

            All implemented at the osd-zfs layer. (A rough sketch of the destroy path follows.)
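            As a rough sketch of that destroy path, the declaration and the single-TX move might pair up as below; osd_declare_destroy(), osd_destroy(), and the deathrow/oi_zap handles are hypothetical names, and the accounting-ZAP updates are omitted:

            /* Declare: one entry added to the deathrow ZAP, one removed
             * from the OI ZAP, all in the same TX. */
            static void
            osd_declare_destroy(dmu_tx_t *tx, uint64_t deathrow, uint64_t oi_zap)
            {
                    dmu_tx_hold_zap(tx, deathrow, TRUE, NULL);
                    dmu_tx_hold_zap(tx, oi_zap, FALSE, NULL);
            }

            /* Execute: the object id moves onto deathrow and the FID->oid
             * mapping disappears from the OI atomically; the data blocks
             * are freed later by the deathrow thread. */
            static int
            osd_destroy(objset_t *os, uint64_t oid, uint64_t deathrow,
                        uint64_t oi_zap, const char *fid_name, dmu_tx_t *tx)
            {
                    int     err;

                    err = zap_add_int(os, deathrow, oid, tx);
                    if (err != 0)
                            return (err);

                    /* Once the OI entry is gone, lu_object_find() no longer
                     * finds the object, which is what "destroyed" means here. */
                    return (zap_remove(os, oi_zap, fid_name, tx));
            }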

            bzzz Alex Zhuravlev added a comment -

            Well, that would mean we can't atomically destroy huge objects using OUT, which I'd like to do to batch OST object destroys from the MDT.
            adilger Andreas Dilger added a comment - edited

            Note that by the time the OST object is being destroyed, it has already been unlinked from the MDS namespace, and that unlink is committed on the MDS. The OFD has also revoked client locks and discarded any cached dirty pages on the clients, so there is no danger in destroying the object in stages. That is what ext4 is doing internally (truncating blocks in small chunks from the end of the file).

            Since the MDT is already logging the object destroy locally and will resend it if the OSS crashes, and the OST object hasn't been deleted from the OST namespace yet, I don't see a requirement for more infrastructure to handle this at the OSD level. It is fine to truncate the object before destroy FOR THE OST ONLY, and higher layers will handle it.

            That said, since the OSD code is common, we shouldn't be exposing truncated-but-not-unlinked objects on the MDT, so it may be easiest to truncate to zero explicitly from the OFD before the object destroy?

            behlendorf Brian Behlendorf added a comment -

            Yes, what Alex said. This is exactly how the POSIX layer also works: the object to be removed is placed on an unlinked list for handling later in as many transactions as needed. Only once the blocks are all freed is the object removed from the list. See zfs_unlinked_drain().

            bzzz Alex Zhuravlev added a comment -

            Isaac, this is usually done using an additional index/list containing objects to be freed. So the OSD's destroy puts an object on the list (plus, say, OI removal), then a separate thread (or osd_trans_stop()) truncates the object, frees it, and removes it from the list/index using as many transactions as needed.

            behlendorf Brian Behlendorf added a comment -

            Nice find, that neatly explains everything. Thanks Isaac for opening issue 3064 at zfsonlinux (https://github.com/zfsonlinux/zfs/issues/3064) so we can decide what should be done on the ZFS side. We should certainly investigate improving the quality of the worst-case estimate, but it really does need to be a worst-case estimate or we risk potentially stalling the txg engine in the worst (but very unlikely) case. Anyway, I'm open to ideas!

            isaac Isaac Huang (Inactive) added a comment -

            It looks like the preferred DMU API to remove large/sparse objects is dmu_free_long_range(). The ZPL was able to remove any large/sparse objects I threw at it, including ones that would just choke up osd-zfs:
            zfs_rmnode(znode_t *zp)
            ......
                    /*
                     * Free up all the data in the file.
                     */
                    error = dmu_free_long_range(os, zp->z_id, 0, DMU_OBJECT_END);
            ......
                    /*
                     * Set up the final transaction.
                     */
                    tx = dmu_tx_create(os);
                    dmu_tx_hold_free(tx, zp->z_id, 0, DMU_OBJECT_END);
            ......
                    error = dmu_tx_assign(tx, TXG_WAIT);
            ......
                    zfs_znode_delete(zp, tx); ==> VERIFY(0 == dmu_object_free(os, obj, tx));
            

            I created a simple patch and my tests showed it worked:

            @@ -444,6 +444,9 @@ static void __osd_declare_object_destroy(const struct lu_env *env,
                    zap_cursor_t            *zc;
                    int                      rc = 0;
             
            +       rc = dmu_free_long_range(osd->od_os, oid, 0, DMU_OBJECT_END);
            +       LASSERTF(rc == 0, "dmu_free_long_range "LPU64" failed: %d\n", oid, rc);
            +
                    dmu_tx_hold_free(tx, oid, 0, DMU_OBJECT_END);
            

            But two things I'm not sure about:

            • I had to call it in do_declare_destroy() because it must be called before dmu_tx_hold_free(), which means that when do_declare_destroy() returns, the data in the object has already been freed. Is that OK for Lustre/OSD?
            • dmu_free_long_range() frees the data in the object over several TXs, so it might fail in the middle, leaving the data in the object partially freed. If do_declare_destroy() fails for this reason, what would the OSD layer do? The object itself and the OI and accounting ZAPs are still freed/updated in a single TX, so that part stays consistent, but the object data may already be partially freed.

            We can work on dmu_tx_count_free() to correctly estimate memory overhead for the particular sparse file created by test_80(). But we'd have to deal with any large/sparse object anyway, so I think in the long run it'd be better to go with dmu_free_long_range().


            gerrit Gerrit Updater added a comment -

            Isaac Huang (he.huang@intel.com) uploaded a new patch: http://review.whamcloud.com/13630
            Subject: LU-5242 osd-zfs: umount hang in sanity 133g
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 13cc94af090e0b1279c2987840a0e6de685e03ee

            People

              Assignee: Isaac Huang (Inactive)
              Reporter: Maloo
              Votes: 0
              Watchers: 26
