[LU-5242] Test hang sanity test_132, test_133: umount ost

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3
    • Severity: 3
    • 14622

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:

      http://maloo.whamcloud.com/test_sets/e5783778-f887-11e3-b13a-52540035b04c

      The sub-test test_132 failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity 132

          Activity


            behlendorf Brian Behlendorf added a comment -

            I like Andreas's idea of keeping the unlink behavior compatible with the ZPL. It would be ideal if you could reuse the existing ZPL functions, but those functions are tied quite closely to ZPL-specific data structures, so that's probably not workable. But the ZFS_UNLINKED_SET object itself is just a ZAP containing a list of object ids, and since objects on disk are already constructed to be compatible with the ZPL, we should be able to safely use it. Isaac's design is nice, but let me suggest a few minor tweaks (see the sketch after this list):

            • Use the existing ZFS_UNLINKED_SET object linked to the MASTER_NODE_OBJ as the deathrow object.
            • In declare_object_destroy() and object_destroy(), just handle moving the object to the ZFS_UNLINKED_SET in a single TX.
            • In a dedicated thread, taskq, or generic Linux worker thread, regularly walk the ZFS_UNLINKED_SET and rely on dmu_free_long_range() to split the free over as many TXGs as required.
            • I don't think there's any advantage in handling small object destruction synchronously in object_destroy(). It's simpler and probably more efficient to always do this asynchronously.
            • Start draining the ZFS_UNLINKED_SET right away when remounting the OSD (this happens during mount for the ZPL).
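            For illustration, a minimal sketch of what such a drain loop might look like at the osd-zfs layer, modeled on zfs_unlinked_drain(); osd_unlinked_drain() is a hypothetical name, the caller is assumed to have already looked up the ZFS_UNLINKED_SET object, and error handling is reduced to skipping the entry:

            /* Hypothetical drain loop, modeled on zfs_unlinked_drain(). */
            static void
            osd_unlinked_drain(objset_t *os, uint64_t unlinked_obj)
            {
                    zap_cursor_t    zc;
                    zap_attribute_t za;
                    dmu_tx_t        *tx;
                    uint64_t        oid;

                    zap_cursor_init(&zc, os, unlinked_obj);
                    while (zap_cursor_retrieve(&zc, &za) == 0) {
                            /* zap_add_int() stored the object id as the entry value */
                            oid = za.za_first_integer;

                            /* Free the object's data over as many TXGs as needed. */
                            if (dmu_free_long_range(os, oid, 0, DMU_OBJECT_END) == 0) {
                                    /* Free the dnode and drop it from the set in one TX. */
                                    tx = dmu_tx_create(os);
                                    dmu_tx_hold_free(tx, oid, 0, DMU_OBJECT_END);
                                    dmu_tx_hold_zap(tx, unlinked_obj, FALSE, NULL);
                                    if (dmu_tx_assign(tx, TXG_WAIT) == 0) {
                                            (void) dmu_object_free(os, oid, tx);
                                            (void) zap_remove_int(os, unlinked_obj, oid, tx);
                                            dmu_tx_commit(tx);
                                    } else {
                                            dmu_tx_abort(tx);
                                    }
                            }
                            zap_cursor_advance(&zc);
                    }
                    zap_cursor_fini(&zc);
            }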

            adilger Andreas Dilger added a comment -

            It makes sense to keep this compatible with the ZFS ZPL if at all possible, so that if large files are unlinked under the ZPL and then mounted as Lustre, or vice versa, we don't defer deleting them forever. I see this is handled in zfs_unlinked_add() and zfs_unlinked_drain() with a ZAP named "DELETE_QUEUE" (ZFS_UNLINKED_SET) in the MASTER_NODE_OBJ. Even if we need to implement our own routines to handle this, it makes sense to use the same ZAP and ZAP format (zap_add_int()).
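            For reference, a minimal sketch of the enqueue side under these assumptions; osd_unlinked_add() is a hypothetical name:

            /* Find the ZPL's unlinked set and record an object id in it
             * within the caller's TX, using the same entry format as the
             * ZPL (zap_add_int() keys the entry by the integer value). */
            static int
            osd_unlinked_add(objset_t *os, uint64_t oid, dmu_tx_t *tx)
            {
                    uint64_t        unlinked_obj;
                    int             err;

                    /* ZFS_UNLINKED_SET ("DELETE_QUEUE") is registered in the
                     * MASTER_NODE_OBJ, so the ZPL and Lustre resolve the same ZAP. */
                    err = zap_lookup(os, MASTER_NODE_OBJ, ZFS_UNLINKED_SET,
                                     8, 1, &unlinked_obj);
                    if (err != 0)
                            return (err);

                    return (zap_add_int(os, unlinked_obj, oid, tx));
            }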

            bzzz Alex Zhuravlev added a comment -

            From the Lustre point of view, the object is destroyed when it can't be found with lu_object_find(); IOW, when the corresponding FID is removed from the OI. This should be a part of object destroy. Space accounting shouldn't be an issue, as space isn't released immediately in any case: grants/quota can release reserved space upon commit, but a new reserve can be made only with a new statfs. Truncate/dnode destroy in osd_trans_stop() looks OK.
            isaac Isaac Huang (Inactive) added a comment - edited

            How about:

            • An object (maybe a ZAP) is created per OST/MDT to hold objects to be freed. Let's call it the deathrow object.
            • In declare_object_destroy() and object_destroy(), within a single TX, move the object to the deathrow, and update the OI and accounting ZAPs as well, so the object disappears from the OST/MDT namespace atomically.
              • If an object is small enough, it can be destroyed the old way.
            • A separate thread works on the deathrow object, truncating and freeing every object in it.
              • This may break some tests that wait for free space to increase after object removal. We may also do at least the truncating in osd_trans_stop().

            All implemented at the osd-zfs layer. (A rough sketch of the destroy path follows.)
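            As a rough sketch of that destroy path, the declaration and the single-TX move might pair up as below; osd_declare_destroy(), osd_destroy(), and the deathrow/oi_zap handles are hypothetical names, and the accounting-ZAP updates are omitted:

            /* Declare: one entry added to the deathrow ZAP, one removed
             * from the OI ZAP, all in the same TX. */
            static void
            osd_declare_destroy(dmu_tx_t *tx, uint64_t deathrow, uint64_t oi_zap)
            {
                    dmu_tx_hold_zap(tx, deathrow, TRUE, NULL);
                    dmu_tx_hold_zap(tx, oi_zap, FALSE, NULL);
            }

            /* Execute: the object id moves onto deathrow and the FID->oid
             * mapping disappears from the OI atomically; the data blocks
             * are freed later by the deathrow thread. */
            static int
            osd_destroy(objset_t *os, uint64_t oid, uint64_t deathrow,
                        uint64_t oi_zap, const char *fid_name, dmu_tx_t *tx)
            {
                    int     err;

                    err = zap_add_int(os, deathrow, oid, tx);
                    if (err != 0)
                            return (err);

                    /* Once the OI entry is gone, lu_object_find() no longer
                     * finds the object, which is what "destroyed" means here. */
                    return (zap_remove(os, oi_zap, fid_name, tx));
            }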

            bzzz Alex Zhuravlev added a comment -

            Well, that would mean we can't atomically destroy huge objects using OUT, which I'd like to do to batch OST object destroys from the MDT.
            adilger Andreas Dilger added a comment - edited

            Note that by the time the OST object is being destroyed, it has already been unlinked from the MDS namespace, and that unlink is committed on the MDS. The OFD has also revoked client locks and discarded any cached dirty pages on the clients, so there is no danger in destroying the object in stages. That is what ext4 is doing internally (truncating blocks in small chunks from the end of the file).

            Since the MDT is already logging the object destroy locally and will resend it if the OSS crashes, and the OST object hasn't been deleted from the OST namespace yet, I don't see a requirement for more infrastructure to handle this at the OSD level. It is fine to truncate the object before destroy FOR THE OST ONLY, and higher layers will handle it.

            That said, since the OSD code is common, we shouldn't be exposing truncated-but-not-unlinked objects on the MDT, so it may be easiest to truncate to zero explicitly from the OFD before the object destroy?

            behlendorf Brian Behlendorf added a comment -

            Yes, what Alex said. This is exactly how the POSIX layer also works: the object to be removed is placed on an unlinked list for handling later in as many transactions as needed. Only once the blocks are all freed is the object removed from the list. See zfs_unlinked_drain().

            bzzz Alex Zhuravlev added a comment -

            Isaac, this is usually done using an additional index/list containing objects to be freed. So the OSD's destroy puts an object on the list (plus, say, OI removal), then a separate thread (or osd_trans_stop()) truncates the object, frees it, and removes it from the list/index using as many transactions as needed.

            behlendorf Brian Behlendorf added a comment -

            Nice find, that neatly explains everything. Thanks Isaac for opening issue 3064 at zfsonlinux (https://github.com/zfsonlinux/zfs/issues/3064) so we can decide what should be done on the ZFS side. We should certainly investigate improving the quality of the worst-case estimate, but it really does need to be a worst-case estimate or we risk potentially stalling the txg engine in the worst (but very unlikely) case. Anyway, I'm open to ideas!

            isaac Isaac Huang (Inactive) added a comment -

            It looks like the preferred DMU API to remove large/sparse objects is dmu_free_long_range(). The ZPL was able to remove any large/sparse objects I threw at it, including ones that would just choke up osd-zfs:
            zfs_rmnode(znode_t *zp)
            ......
                    /*
                     * Free up all the data in the file.
                     */
                    error = dmu_free_long_range(os, zp->z_id, 0, DMU_OBJECT_END);
            ......
                    /*
                     * Set up the final transaction.
                     */
                    tx = dmu_tx_create(os);
                    dmu_tx_hold_free(tx, zp->z_id, 0, DMU_OBJECT_END);
            ......
                    error = dmu_tx_assign(tx, TXG_WAIT);
            ......
                    zfs_znode_delete(zp, tx); ==> VERIFY(0 == dmu_object_free(os, obj, tx));
            

            I created a simple patch and my tests showed it worked:

            @@ -444,6 +444,9 @@ static void __osd_declare_object_destroy(const struct lu_env *env,
                    zap_cursor_t            *zc;
                    int                      rc = 0;
             
            +       rc = dmu_free_long_range(osd->od_os, oid, 0, DMU_OBJECT_END);
            +       LASSERTF(rc == 0, "dmu_free_long_range "LPU64" failed: %d\n", oid, rc);
            +
                    dmu_tx_hold_free(tx, oid, 0, DMU_OBJECT_END);
            

            But two things I'm not sure about:

            • I had to call it in do_declare_destroy() because it must be called before dmu_tx_hold_free(), which means that when do_declare_destroy() returns, the data in the object has already been freed. Is that OK for Lustre/OSD?
            • dmu_free_long_range() frees the data in the object over several TXs, so it might fail in the middle, leaving the data in the object partially freed. If do_declare_destroy() fails for this reason, what would the OSD layer do? The object itself and the OI and accounting ZAPs are still freed/updated in a single TX, so that part stays consistent, but the object data may already be partially freed.

            We can work on dmu_tx_count_free() to correctly estimate memory overhead for the particular sparse file created by test_80(). But we'd have to deal with any large/sparse object anyway, so I think in the long run it'd be better to go with dmu_free_long_range().


            gerrit Gerrit Updater added a comment -

            Isaac Huang (he.huang@intel.com) uploaded a new patch: http://review.whamcloud.com/13630
            Subject: LU-5242 osd-zfs: umount hang in sanity 133g
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 13cc94af090e0b1279c2987840a0e6de685e03ee

            People

              Assignee: Isaac Huang (Inactive)
              Reporter: Maloo
              Votes: 0
              Watchers: 26
