Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.1

    Description

      In order to improve sync performance on ZFS based OSDs Lustre must be updated to utilize a ZFS ZIL device. This performance work was originally planned as part of Lustre/ZFS integration but has not yet been completed. I'm opening this issue to track it.

            Activity

            [LU-4009] Add ZIL support to osd-zfs

            Andreas, I'm not sure things like set-bit are enough... in the llog header itself we update a counter, for example. There are more similar cases where, within a single TXG, we update a conflicting resource - lov_objids, last_rcvd (a single slot containing two records), etc. I think if we start to implement all of that as "logical" updates we bring a lot of complexity into the higher layers.

            bzzz Alex Zhuravlev added a comment

            Alex, are you recording logical operations into the ZIL, or physical blocks? Isn't it true that logical updates (e.g. set bit X, decrement free count, etc.) could be applied in any order? I'd think that any updates done to a contended resource will have internal locking at least, so they should be safe to replay in some arbitrary order later. Obviously, this can't handle some cases (e.g. running out of space within an llog file), but that should never happen.

            The ZIL records are not meant to span multiple TXGs, I think - only to optimize sync operations that happen within a single TXG so that they can commit and reply to the client more quickly. If the parent TXG is committed, it should be possible to drop all ZIL records for that TXG without further processing (i.e. the ZIL is a write-through cache for the TXG, not write-back). If the MDT or OST crashes before TXG commit, then the first thing to recover, before any other pool update, is the ZIL updates, which will reconstruct the "replied as committed" parts of the incomplete TXG. I'd think we also need COS to handle uncommitted dependent updates within that TXG, but any earlier sync updates should already be in the ZIL before they reply and should not need further processing.

            Have I misunderstood what you are implementing?

            adilger Andreas Dilger added a comment

            I've implemented a basic scheme where all the updates are tagged with versions, and then the replay mechanism tries to apply them in the original order. The issue I'm facing now is that I can't control the txg border. Say we've got 3 transactions: t1, t2 and t3 - this is how they are stored in the ZIL. But they were parts of a single txg and applied their updates non-sequentially: t3 modified the llog first, then t1 and t2 did. (In the worst case they modified a few llog objects in different orders.) I was trying to mimic the original sequence: start t1, apply ready updates (looking at versions); start t2, again apply ready updates; then start t3, apply all its updates (as they become ready), stop t3; now t1 and t2 can apply their remaining updates and stop too. Most of the time that works, but at some point t2 or t3 can't be started because of the txg's timeout - a deadlock.

            One possibility is to redo t1, t2 and t3 within a single tx, but potentially the number can be up to the number of MDT threads, and this would turn into a huge tx.

            Another thought is to use snapshots as a rollback mechanism: apply the updates as singles, using only versions to order them; if this can't succeed for any reason, restore the original state using the snapshot, give up, and fail to mount.

            Any comments/suggestions are very welcome.

            bzzz Alex Zhuravlev added a comment

            In the latest version of the patch the VBR version is logged (as a regular xattr_set update), so after zil_replay() we're supposed to have the replays visible with an up-to-date version. Now OFD can drop "early" writes. To me it looks like a workaround, but I don't have a better idea at the moment.
            Clearly this can work with whole objects, not byte ranges. Also, I'm not sure that using separate ITXs for the data write and the version set is correct; I'm investigating this.

            bzzz Alex Zhuravlev added a comment

            If it's the same set of clients recovering after OST failover, then we're supposed to get the same set of OST_WRITEs (let's consider those only, for simplicity) and the same data as right after fsync(2). But if some client is missing, then we can get different data, which I guess isn't quite correct. Something like "last_committed on the object" is required, so that the OST can skip the writes? Changes in the wire protocol wouldn't be enough, as the OST can crash right after fsync(2), making it impossible to communicate commit status to the clients.

            bzzz Alex Zhuravlev added a comment

            ok, thanks. let me try that..

            bzzz Alex Zhuravlev added a comment

            That sounds doable with careful use of spa_freeze() and zil_set_sync(). Perhaps you just want to call zil_set_sync() prior to zil_close(); this would prevent all the outstanding itx commits from being written during close. But those are the two basic interfaces you have available:

            1. spa_freeze(spa) - prevents TXGs from being written to disk.
            2. zil_set_sync(zilog, ZFS_SYNC_DISABLED) - prevents zil_commit() from writing log records.

            behlendorf Brian Behlendorf added a comment

            Hmm, what I was trying to simulate is the following. Say there are two objects. They are both "logged" with the ZIL, but not flushed. Now we get fsync() on one of them. fsync() succeeds and then the server crashes. I'd expect that object to survive (given the successful fsync()) and the other object to lose data. So the simulation sequence includes a preceding osd_ro() (to stop any commits), then transferring data to the server (and ZIL), then fsync() on one of them. IOW, there is nothing to worry about at the osd_ro() point.

            bzzz Alex Zhuravlev added a comment

            That is definitely a step in the right direction! I think the configuration is also reasonably realistic, if good performance is possible without requiring a dedicated log device then that's an attractive configuration.

            > the issue is that zil_close() flushes all ITXs at umount. so I can't apply fsync() to some of the objects and then observe a difference (the fsync'ed file is OK, the other files are not).

            I see - because you are creating itx records for every object and they all get flushed on close, everything is always consistent on disk after a clean unmount. Have you tried disabling the ZIL in osd_ro() along with freezing the pool? This should prevent any additional itxs from ever being written to the log device, even on zil_close(), and should effectively simulate the failure mode you're testing. All fsync() calls prior to osd_ro() will be in the log and replayed; those that occur after will not.

            spa_freeze(spa);
            zil_set_sync(zilog, ZFS_SYNC_DISABLED);

            behlendorf Brian Behlendorf added a comment

            Yes, this seems to be the worst case - I dumped the stats during the test (columns: samples, min, max, sum, in usec):

            sync 301 samples [usec] 0 18739414 383896129
            sync 274 samples [usec] 0 18017451 292737165
            sync 249 samples [usec] 0 20157749 265892043
            sync 248 samples [usec] 0 17003970 269692774
            sync 431 samples [usec] 0 18782272 801983590
            sync 413 samples [usec] 0 19024038 689453547
            sync 378 samples [usec] 0 27903793 508680879
            OR 19946955 us on average

            zil-sync 397 samples [usec] 1 420873 10183718
            zil-sync 473 samples [usec] 1 629639 19512485
            zil-sync 484 samples [usec] 1 481490 22410665
            zil-sync 464 samples [usec] 1 635453 19926051
            zil-sync 412 samples [usec] 1 602652 9925743
            zil-sync 402 samples [usec] 1 620596 7984370
            zil-sync 407 samples [usec] 1 549678 11495810
            OR 562911 us on average

            Also, IIRC the OSTs share the same physical device, which makes TXG commits very expensive.
            It'd be interesting to try this on a more realistic setup.

            bzzz Alex Zhuravlev added a comment

            Nice improvement! I think multi-mount fsx is probably the worst case usage for sync-on-cancel.

            adilger Andreas Dilger added a comment

            People

              bzzz Alex Zhuravlev
              behlendorf Brian Behlendorf
              Votes: 3
              Watchers: 31