[LU-4009] Add ZIL support to osd-zfs - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.4.1
Labels:
- llnl
- prz
- zfs

Epic/Theme:
- Performance
- prz
- zfs
Rank (Obsolete):
10737

Description

To improve sync performance on ZFS-based OSDs, Lustre must be updated to use a ZFS ZIL device. This performance work was originally planned as part of the Lustre/ZFS integration but has not yet been completed. I'm opening this issue to track it.

Attachments

Issue Links

is blocked by

LU-4215 Some expected improvements for OUT

Open

is blocking

LU-2887 sanity-quota test_12a: slow due to ZFS VMs sharing single disk

Resolved

LU-7895 zfs metadata performance improvements

Resolved

is related to

LU-2716 DNE on ZFS create remote directory suffers from long sync.

Open

LU-6836 sanity-quota test_4a: Passed grace time 12, 1436542665, 1436542679

Resolved

LU-10392 LustreError: 82980:0:(fid_handler.c:329:__seq_server_alloc_meta()) srv-lglossy-MDT0002: Allocated super-sequence failed: rc = -115

Resolved

LU-2085 sanityn test_16 (fsx) ran over its Autotest time

Closed

LU-7426 DNE3: improve llog format for remote update llog

Open

LU-14678 ldiskfs fast commit feature

Open


Activity


Andreas Dilger added a comment - 20/Apr/15 10:06 PM

Alex, are you recording logical operations into the ZIL or physical blocks? Isn't it true that logical updates (e.g. set bit X, decrement free count, etc) could be applied in any order? I'd think that any updates that are done to a contended resource will have internal locking at least, so they should be safe to replay in some arbitrary order later. Obviously, this can't handle some cases (e.g. running out of space within an llog file), but that should never happen.

The ZIL records are not meant to span multiple TXGs I think, only to optimize sync operations that happen within a single TXG so that they can commit and reply to the client more quickly. If the parent TXG is committed it should be possible to drop all ZIL records for that TXG without further processing (i.e. ZIL is a writethrough cache for the TXG, not writeback). If the MDT or OST crashes before TXG commit, then the first thing to recover before any other pool update are the ZIL updates, and they will reconstruct the "replied as committed" parts of the incomplete TXG. I'd think we also need COS to handle uncommitted dependent updates within that TXG, but any earlier sync updates should already be in the ZIL before they reply and should not need further processing.
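
The write-through behaviour described above can be sketched as a toy model. This is only an illustration under simplifying assumptions: `ToyPool` and its methods are hypothetical names, not ZFS or osd-zfs interfaces, and dependent uncommitted updates (the COS case) are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPool:
    """Toy model: committed state, one open TXG, and a per-TXG intent log."""
    disk: dict = field(default_factory=dict)     # state as of the last committed TXG
    open_txg: int = 1
    pending: list = field(default_factory=list)  # (key, value) updates in the open TXG
    zil: list = field(default_factory=list)      # (txg, key, value) records on stable storage

    def update(self, key, value, sync=False):
        self.pending.append((key, value))
        if sync:
            # A sync update is logged before the reply is sent.
            self.zil.append((self.open_txg, key, value))

    def commit_txg(self):
        for key, value in self.pending:
            self.disk[key] = value
        self.pending.clear()
        # Write-through: records of a committed TXG can simply be dropped.
        self.zil = [rec for rec in self.zil if rec[0] > self.open_txg]
        self.open_txg += 1

    def crash_and_recover(self):
        self.pending.clear()                     # the in-memory TXG is lost
        for _txg, key, value in self.zil:        # replay before any other update
            self.disk[key] = value
        self.zil.clear()


pool = ToyPool()
pool.update("a", 1, sync=True)
pool.commit_txg()               # the record for "a" is dropped here
pool.update("b", 2, sync=True)  # logged, so it survives the crash
pool.update("c", 3)             # not logged, so it is lost
pool.crash_and_recover()
print(pool.disk)                # {'a': 1, 'b': 2}
```

Only the updates that were "replied as committed" are reconstructed; everything else in the incomplete TXG disappears.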

Have I misunderstood what you are implementing?


Alex Zhuravlev added a comment - 20/Apr/15 6:22 PM

I've implemented a basic scheme where all the updates are tagged with versions and the replay mechanism then tries to apply them in the original order. the issue I'm facing now is that I can't control the txg boundary. say, we've got 3 transactions: t1, t2 and t3 - this is how they are stored in ZIL. but they were parts of a single txg and applied their updates non-sequentially: t3 modified llog first, then t1 and t2 did. (in the worst case they modified a few llog objects in different orders). I was trying to mimic the original sequence: start t1, apply ready updates (looking at versions); start t2, again apply ready updates, then start t3, apply all the updates (as they are ready), stop t3; now t1 and t2 can apply their remaining updates and stop too. most of the time that works, but at some point t2 or t3 can't be started because of the txg timeout, and replay deadlocks.

one possibility is to redo t1, t2 and t3 within a single tx, but potentially the number can be up to the number of MDT threads and this would turn into a huge tx.

another thought is to use snapshots as a rollback mechanism: apply the updates individually, using just the versions to order them; if this can't succeed for some reason, restore the original state using the snapshot, give up and fail to mount.

any comments/suggestions are very welcome.
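
The version-ordered replay described in this comment can be sketched roughly as follows. The record layout (object name, per-object version, payload) and the `replay` function are assumptions made for illustration; the actual patch is not shown in this thread.

```python
def replay(transactions):
    """Replay logged updates in per-object version order.

    transactions: {tx_name: [(obj, version, payload), ...]} as stored in the log.
    Returns the (tx, obj, version) tuples in the order they were applied.
    Raises RuntimeError if no remaining update is ready to apply.
    """
    current = {}  # obj -> last applied version (0 if never touched)
    remaining = {tx: list(updates) for tx, updates in transactions.items()}
    applied = []
    while any(remaining.values()):
        progress = False
        for tx, updates in remaining.items():
            # Assumption: within one transaction, updates keep their logged order.
            while updates and updates[0][1] == current.get(updates[0][0], 0) + 1:
                obj, version, _payload = updates.pop(0)
                current[obj] = version
                applied.append((tx, obj, version))
                progress = True
        if not progress:
            raise RuntimeError("replay stuck: next version is missing")
    return applied


# t3 touched the llog first, then t1 and t2 did, all inside one txg.
log = {
    "t1": [("dir", 1, None), ("llog", 2, None)],
    "t2": [("llog", 3, None)],
    "t3": [("llog", 1, None), ("dir", 2, None)],
}
order = replay(log)
print(order)
```

In this example t1 has to stay open while t3 runs to completion, which is the overlap that the real replay cannot always sustain across a txg boundary. The `RuntimeError` path corresponds loosely to the "give up and fail to mount" fallback.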


Alex Zhuravlev added a comment - 21/Nov/14 6:52 AM

in the latest version of the patch the VBR version is logged (being a regular xattr_set update), so after zil_replay() the replayed objects are supposed to be visible with an up-to-date version. now OFD can drop "early" writes. to me it looks like a workaround, but I don't have a better idea at the moment.
clearly this can only work with whole objects, not byte ranges. also, I'm not sure that using separate ITXs for the data write and the version set is correct; investigating this.
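
A minimal sketch of dropping "early" writes by comparing versions, at whole-object granularity. The dictionary fields and the function name are hypothetical and do not reflect Lustre's actual structures or the patch itself.

```python
def apply_replayed_write(obj, write):
    """Apply a client-replayed write unless ZIL replay already covered it.

    obj and write are dicts with 'version' (int) and 'data' keys.
    Returns True if the write was applied, False if it was dropped.
    """
    if write["version"] <= obj["version"]:
        return False  # "early" write: the object already reflects it
    obj["data"] = write["data"]
    obj["version"] = write["version"]
    return True


# Assumed state after zil_replay(): the object carries version 5.
obj = {"version": 5, "data": b"from-zil"}
print(apply_replayed_write(obj, {"version": 4, "data": b"old"}))  # False
print(apply_replayed_write(obj, {"version": 6, "data": b"new"}))  # True
```

Because the check looks at one version per object, it cannot tell which byte ranges a dropped write covered, which matches the whole-object limitation noted above.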


Alex Zhuravlev added a comment - 20/Nov/14 9:40 AM

if it's the same set of clients recovering after OST failover, then we're supposed to get the same set of OST_WRITEs (let's consider only those for simplicity) and the same data as right after fsync(2), but if some client is missing, then we can get different data, which I guess isn't quite correct. something like "last_committed on the object" is required, so that the OST can skip the writes? changes in the wire protocol wouldn't be enough, as the OST can crash right after fsync(2), making it impossible to communicate the commit status to the clients.


Alex Zhuravlev added a comment - 17/Nov/14 6:52 PM

ok, thanks. let me try that..


Brian Behlendorf added a comment - 17/Nov/14 6:44 PM

That sounds doable with careful use of spa_freeze() and zil_set_sync(). Perhaps you just want to call zil_set_sync() prior to zil_close(); this would prevent all the outstanding itx commits from being written during close. Those are the two basic interfaces you have available:

1. spa_freeze(spa) - Prevents TXGs from being written to disk
2. zil_set_sync(zilog, ZFS_SYNC_DISABLED) - Prevents zil_commit() from writing log records.
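
The test sequence under discussion (freeze the pool, write two objects, fsync one, then disable ZIL commits just before close) can be illustrated with a toy model. `ToyOsd` is a hypothetical stand-in: its flags only mimic the described effect of spa_freeze() and zil_set_sync(), and its commit logic is far simpler than the real zil_commit().

```python
class ToyOsd:
    """Toy model of the failure simulation; not real ZFS or osd-zfs code."""

    def __init__(self):
        self.disk = {}              # state written by TXG sync or log replay
        self.dirty = {}             # in-memory changes of the open TXG
        self.itxs = []              # in-memory intent records: (obj, data)
        self.log = []               # intent records written to the log
        self.frozen = False         # mimics spa_freeze()
        self.sync_disabled = False  # mimics zil_set_sync(..., ZFS_SYNC_DISABLED)

    def write(self, obj, data):
        self.dirty[obj] = data
        self.itxs.append((obj, data))

    def txg_sync(self):
        if not self.frozen:
            self.disk.update(self.dirty)
            self.dirty.clear()

    def _commit(self, objs=None):
        if self.sync_disabled:
            return                  # nothing is written to the log
        keep = []
        for obj, data in self.itxs:
            if objs is None or obj in objs:
                self.log.append((obj, data))
            else:
                keep.append((obj, data))
        self.itxs = keep

    def fsync(self, obj):
        self._commit({obj})

    def close(self):
        self._commit()              # a clean close flushes every itx

    def remount(self):
        self.dirty.clear()
        self.itxs.clear()
        for obj, data in self.log:  # log replay
            self.disk[obj] = data
        self.log.clear()
        return self.disk


def run(disable_sync_before_close):
    osd = ToyOsd()
    osd.frozen = True               # osd_ro(): stop TXG commits
    osd.write("A", "a-data")
    osd.write("B", "b-data")
    osd.fsync("A")
    osd.txg_sync()                  # no-op while frozen
    if disable_sync_before_close:
        osd.sync_disabled = True
    osd.close()
    return osd.remount()


print(run(False))  # {'A': 'a-data', 'B': 'b-data'}
print(run(True))   # {'A': 'a-data'}
```

Without the second step the close flushes B's record as well, so both objects look intact after remount; with it, only the fsync'ed object survives.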


Alex Zhuravlev added a comment - 17/Nov/14 6:35 PM

hmm, what I was trying to simulate is the following. say, there are two objects. both are "logged" with ZIL, but not flushed. now we get fsync() on one of them. fsync() succeeds and then the server crashes. I'd expect this object to survive (given the successful fsync()) and the other object to lose data. so the simulation sequence includes a preceding osd_ro() (to stop any commits), then transferring data to the server (and ZIL), then fsync() on one of the objects. IOW, there is nothing to worry about at the osd_ro() point.


Brian Behlendorf added a comment - 17/Nov/14 6:29 PM

That is definitely a step in the right direction! I think the configuration is also reasonably realistic; if good performance is possible without requiring a dedicated log device, then that's an attractive configuration.

> the issue is that zil_close() flushes all ITXs at umount. so I can't apply fsync() to some of the objects and then observe a difference (fsync'ed file is OK, other files are not).

I see: because you are creating itx records for every object and they all get flushed on close, everything is always consistent on disk after a clean unmount. Have you tried disabling the ZIL in osd_ro() along with freezing the pool? This should prevent any additional itxs from ever being written to the log device, even on zil_close(). This should effectively simulate the failure mode you're testing. All fsync() calls prior to osd_ro() will be in the log and replayed; those which occur after will not.

spa_freeze(spa);
zil_set_sync(zilog, ZFS_SYNC_DISABLED);


Alex Zhuravlev added a comment - 17/Nov/14 2:22 PM

yes, this seems to be the worst case - I dumped the stats during the test:

sync 301 samples [usec] 0 18739414 383896129
sync 274 samples [usec] 0 18017451 292737165
sync 249 samples [usec] 0 20157749 265892043
sync 248 samples [usec] 0 17003970 269692774
sync 431 samples [usec] 0 18782272 801983590
sync 413 samples [usec] 0 19024038 689453547
sync 378 samples [usec] 0 27903793 508680879
or 19946955 us on average (mean of the second-to-last column, the per-line maximum)

zil-sync 397 samples [usec] 1 420873 10183718
zil-sync 473 samples [usec] 1 629639 19512485
zil-sync 484 samples [usec] 1 481490 22410665
zil-sync 464 samples [usec] 1 635453 19926051
zil-sync 412 samples [usec] 1 602652 9925743
zil-sync 402 samples [usec] 1 620596 7984370
zil-sync 407 samples [usec] 1 549678 11495810
or 562911 us on average (mean of the second-to-last column, the per-line maximum)

also, iirc the OSTs share the same physical device, which makes TXG commits very expensive.
it'd be interesting to try on a more realistic setup.
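
The averages quoted with the stats above can be reproduced with a short script. It assumes the usual Lustre stats column order (name, sample count, unit, min, max, sum); the helper names are illustrative only.

```python
SYNC = """\
sync 301 samples [usec] 0 18739414 383896129
sync 274 samples [usec] 0 18017451 292737165
sync 249 samples [usec] 0 20157749 265892043
sync 248 samples [usec] 0 17003970 269692774
sync 431 samples [usec] 0 18782272 801983590
sync 413 samples [usec] 0 19024038 689453547
sync 378 samples [usec] 0 27903793 508680879
"""

ZIL_SYNC = """\
zil-sync 397 samples [usec] 1 420873 10183718
zil-sync 473 samples [usec] 1 629639 19512485
zil-sync 484 samples [usec] 1 481490 22410665
zil-sync 464 samples [usec] 1 635453 19926051
zil-sync 412 samples [usec] 1 602652 9925743
zil-sync 402 samples [usec] 1 620596 7984370
zil-sync 407 samples [usec] 1 549678 11495810
"""


def parse(text):
    """Return (count, min, max, sum) tuples, one per stats line."""
    rows = []
    for line in text.strip().splitlines():
        _name, count, _samples, _unit, lo, hi, total = line.split()
        rows.append((int(count), int(lo), int(hi), int(total)))
    return rows


def mean_of_max(text):
    """Mean of the max column across lines, truncated to whole usec."""
    rows = parse(text)
    return sum(r[2] for r in rows) // len(rows)


def mean_per_sample(text):
    """Total time divided by total sample count, in usec."""
    rows = parse(text)
    return sum(r[3] for r in rows) / sum(r[0] for r in rows)


print(mean_of_max(SYNC), mean_of_max(ZIL_SYNC))  # 19946955 562911
print(round(mean_per_sample(SYNC)), round(mean_per_sample(ZIL_SYNC)))
```

Under the same column assumption, the per-sample mean (sum divided by count) comes out at roughly 1.4 s for sync and about 33 ms for zil-sync, so the gap is large whether the maximum or the mean is compared.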


Andreas Dilger added a comment - 15/Nov/14 10:26 PM

Nice improvement! I think multi-mount fsx is probably the worst case usage for sync-on-cancel.


Alex Zhuravlev added a comment - 15/Nov/14 7:12 PM

https://testing.hpdd.intel.com/test_logs/7da0f608-6cef-11e4-9bc9-5254006e85c2/show_text - some initial benchmark made with fsx.
sync_on_lock_cancel=always, no ZIL support - 3142s, with ZIL - 162s


People

Assignee: Alex Zhuravlev

Reporter: Brian Behlendorf

Votes: 3

Watchers: 31

Dates

Created: 25/Sep/13 5:18 PM

Updated: 05/Dec/22 6:51 PM