Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.1

    Description

      In order to improve sync performance on ZFS-based OSDs, Lustre must be updated to utilize a ZFS ZIL device. This performance work was originally planned as part of the Lustre/ZFS integration but has not yet been completed. I'm opening this issue to track it.


          Activity

            [LU-4009] Add ZIL support to osd-zfs

            gerrit Gerrit Updater added a comment -
            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15496
            Subject: LU-4009 osp: batch cancels
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6cd411afd2bcfe585ff29aa859df055ed14ee2fa


            gerrit Gerrit Updater added a comment -
            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15394
            Subject: LU-4009 osd: enable dt_index_try() on a non-existing object
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5ba67e0e4d7a25b0f37e0756841073e29a825479


            gerrit Gerrit Updater added a comment -
            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15393
            Subject: LU-4009 osd: be able to remount objset w/o osd restart
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c484f0aa6e09c533635cde51d3884f746b25cfd5


            bzzz Alex Zhuravlev added a comment -
            now again, with the improvements to OUT packing + cancel aggregation in OSP
            (it collects a bunch of cookies, then cancels them with a single llog write - a huge contribution to the average record size):

            with ZIL:
            Flush 29737 3.880 181.323
            Throughput 22.224 MB/sec 2 clients 2 procs max_latency=181.332 ms

            no ZIL:
            Flush 13605 54.994 327.793
            Throughput 10.1488 MB/sec 2 clients 2 procs max_latency=327.804 ms

            ZIL on MDT:
            zil-sync 35517 samples [usec] 0 77549 36271952
            zil-records 670175 samples [bytes] 32 9272 272855776
            zil-realloc 808 samples [realloc] 1 1 808

            ZIL on OST:
            zil-sync 35517 samples [usec] 0 123200 55843200
            zil-copied 455241 samples [writes] 1 1 455241
            zil-indirect 1663 samples [writes] 1 1 1663
            zil-records 785376 samples [bytes] 32 4288 1968476864

            the improvements that shrink the average ZIL record size give -73% (407 vs. 1541 bytes) and
            -23% on the average sync time (3.88 vs. 5.03 ms).

            of course, this is subject to a rerun on regular hardware.
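
            (for reference, these percentages follow from the stats above and the previous run in the next comment:
             average MDT ZIL record now:      272855776 / 670175  ≈ 407 bytes
             average MDT ZIL record before:   1066813288 / 692259 ≈ 1541 bytes
             record size reduction:           (1541 - 407) / 1541 ≈ 73%
             average Flush latency reduction: (5.03 - 3.88) / 5.03 ≈ 23%)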


            bzzz Alex Zhuravlev added a comment -
            in the latest patch I removed the optimizations to the packing mechanism to make the patch smaller;
            benchmarks again (made on a local node where all the targets share the same storage):

            with ZIL:
            Flush 26601 5.030 152.794
            Throughput 19.8412 MB/sec 2 clients 2 procs max_latency=152.803 ms

            no ZIL:
            Flush 12716 59.609 302.120
            Throughput 9.48099 MB/sec 2 clients 2 procs max_latency=302.140 ms

            zil-sync 31825 samples [usec] 0 99723 50668754
            zil-records 692259 samples [bytes] 40 9656 1066813288

            zil-sync 31825 samples [usec] 2 129809 66437030
            zil-copied 405907 samples [writes] 1 1 405907
            zil-indirect 1698 samples [writes] 1 1 1698
            zil-records 701379 samples [bytes] 40 4288 1799673720

            on the MDT an average record was 1066813288/692259 = 1541 bytes,
            on the OST it was 1799673720/701379 = 2565 bytes.
            the latter is a bit surprising; I'm going to check the details.


            bzzz Alex Zhuravlev added a comment -
            Doug asked me to put more details here. Would it make sense to have a picture?

            First of all, how the current commit mechanism works and how it is used by Lustre. Although we say "start
            and stop a transaction", our Lustre transactions actually join an existing DMU transaction, and that one is
            committed as a whole or discarded. Also, only the final state of the DMU transaction is subject to commit,
            not some intermediate state. Lustre heavily depends on these semantics to improve internal concurrency.

            Let's consider a very simple use case - object precreation. Lustre maintains the last assigned ID in a
            single slot. It doesn't matter when a transaction that updated the slot stops - only the final state of the
            slot will be committed. If we were following the "normal" rules (like ZPL does to support ZIL), Lustre would
            have to lock the slot, start the transaction, update the slot, close the transaction and release the slot.
            Such a stream of transactions is linear by definition and can be put into the ZIL for subsequent replay - the
            transaction stop gives us actual information on the order the slot was updated in. That also means zero
            concurrency, and thus bad performance for file creation. To improve concurrency and performance Lustre does
            the reverse: start the transaction, lock the slot, update the slot, release the slot, stop the transaction.
            This means, though, that the stop doesn't give us any information on the ordering - the order transactions
            get into the ZIL can mismatch the order the slot was updated in.
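
            To make the two orderings concrete, here is a minimal toy sketch in C (hypothetical names, not the actual
            osd-zfs code; the "transaction" is only a marker for where start/stop sit relative to the lock):

            #include <pthread.h>
            #include <stdio.h>

            static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;
            static long last_id;

            /* ZPL-style: lock -> start tx -> update -> stop tx -> unlock.
             * Transaction stops are serialized by the lock, so replaying
             * ZIL records in stop order reproduces the slot correctly. */
            static long update_zpl_style(void)
            {
                    long id;

                    pthread_mutex_lock(&slot_lock);
                    /* tx start */
                    id = ++last_id;
                    /* tx stop: stop order == update order */
                    pthread_mutex_unlock(&slot_lock);
                    return id;
            }

            /* Lustre-style: start tx -> lock -> update -> unlock -> stop tx.
             * Many transactions are open concurrently; the order of stops
             * (and of ZIL records) can differ from the order the slot was
             * actually updated in. */
            static long update_lustre_style(void)
            {
                    long id;

                    /* tx start */
                    pthread_mutex_lock(&slot_lock);
                    id = ++last_id;
                    pthread_mutex_unlock(&slot_lock);
                    /* other threads may update the slot before we stop */
                    /* tx stop: stop order may mismatch update order */
                    return id;
            }

            int main(void)
            {
                    printf("ids: %ld %ld\n", update_zpl_style(), update_lustre_style());
                    return 0;
            }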

            This is a problem partly because at the OSD level we see absolute values, not logical operations. We see new
            object IDs, or we see a new bitmap (in the case of llogs), etc. So what would happen if we started to store
            operations instead of values - say, for object precreation again, if we introduced an increment operation?
            Sometimes we need to reset that value (when we start a new sequence). And even worse - the whole point of
            increment is to not store absolute values, but we need the absolute values as they have already been returned
            to the client and used in the LOVEA, etc. And this is the case with very simple logic - just a single value;
            that's before we even get to llog. We'd also need a brand new mechanism to pass these special operations down
            through the stack, etc. Hence I tend to think this is way too complicated to even think through all the details.

            If the problem is only with the ordering, then why don't we solve that problem? If we know the order in which
            specific updates were made to an object, then we can replay the updates in that order again. But this order
            doesn't match the transactions the updates were made in, and the transactions are needed to keep the
            filesystem consistent through the replay. Say we have two transactions T1 and T2 modifying the same object.
            T1 got into the ZIL before T2, but T2 modified the object first. In the worst case T1 and T2 modified two
            objects in the reverse order, making them dependent on each other. The TXG mechanism solved this problem
            because the TXG was a single commit unit. We'd have to do something similar - start T1, detect the
            dependency, put T1 on hold, start T2, apply the updates in the correct order, then stop T1 and T2. That
            doesn't sound trivial. What if the ZIL got many transactions in between T1 and T2, given that we may run a
            thousand threads on the MDT? Are they all supposed to join the same big transaction with T1 and T2? What if
            the DMU doesn't let us put all of them in, due to the TXG commit timeout or a changed pool property resulting
            in bigger overhead?

            Here is where snapshots come in - the only reason for the transaction is to keep the filesystem consistent,
            so what if we implement our own commit points using snapshots? Essentially we mimic a TXG: take a snapshot,
            apply the updates in the order we need, discard the snapshot if all the updates succeeded, and roll back to
            the snapshot otherwise. If the system crashes during replay, we'll find the snapshot, roll back to it and can
            repeat again. In this scheme there is zero need to modify Lustre core code; everything (except optimizations
            like 8K writes to update a single bit in the llog header) is done within osd-zfs.
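
            A minimal userland sketch of that control flow (toy stand-ins for the real snapshot/rollback primitives, not
            the actual DMU/DSL calls):

            #include <stdbool.h>
            #include <stdio.h>
            #include <string.h>

            /* toy "dataset": a snapshot is just a saved copy of its state,
             * a rollback restores that copy */
            struct dataset {
                    long last_id;                  /* e.g. last precreated id */
                    unsigned char llog_bitmap[8];
            };

            static struct dataset ds;              /* live dataset */
            static struct dataset snap;            /* replay commit point */
            static bool snap_exists;

            static void snapshot_create(void)   { snap = ds; snap_exists = true; }
            static void snapshot_destroy(void)  { snap_exists = false; }
            static void snapshot_rollback(void) { if (snap_exists) ds = snap; }

            /* one logged update; return false to simulate a replay failure */
            static bool apply_update(int i)
            {
                    ds.last_id++;
                    ds.llog_bitmap[i / 8] |= 1 << (i % 8);
                    return true;
            }

            /* replay the ZIL updates of one original TXG between two
             * snapshot-based commit points, mimicking the TXG itself */
            static bool replay(int nupdates)
            {
                    snapshot_create();
                    for (int i = 0; i < nupdates; i++) {
                            if (!apply_update(i)) {
                                    snapshot_rollback();   /* back to a consistent state */
                                    return false;
                            }
                    }
                    snapshot_destroy();                    /* all applied: "commit" */
                    return true;
            }

            int main(void)
            {
                    memset(&ds, 0, sizeof(ds));
                    printf("replay %s, last_id=%ld\n",
                           replay(16) ? "ok" : "rolled back", ds.last_id);
                    return 0;
            }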


            bzzz Alex Zhuravlev added a comment -
            ZIL records do not last beyond a TXG, of course. But during replay, the set of ZIL records that originated
            from a single TXG can result in a few TXGs, as every transaction stored in the ZIL has its own start/stop.


            bzzz Alex Zhuravlev added a comment -
            Andreas, I'm not sure things like setbit are enough.. in the llog header itself we update a counter, for
            example. There are more similar cases where within a single TXG we update a conflicting resource -
            lov_objids, last_rcvd (a single slot containing two records), etc. I think if we start to implement all of
            that as "logical" updates we bring a lot of complexity into the higher layers.


            adilger Andreas Dilger added a comment -
            Alex, are you recording logical operations into the ZIL or physical blocks? Isn't it true that logical updates (e.g. set bit X, decrement free count, etc) could be applied in any order? I'd think that any updates that are done to a contended resource will have internal locking at least, so they should be safe to replay in some arbitrary order later. Obviously, this can't handle some cases (e.g. running out of space within an llog file), but that should never happen.

            The ZIL records are not meant to span multiple TXGs I think, only to optimize sync operations that happen within a single TXG so that they can commit and reply to the client more quickly. If the parent TXG is committed it should be possible to drop all ZIL records for that TXG without further processing (i.e. ZIL is a writethrough cache for the TXG, not writeback). If the MDT or OST crashes before TXG commit, then the first thing to recover before any other pool update are the ZIL updates, and they will reconstruct the "replied as committed" parts of the incomplete TXG. I'd think we also need COS to handle uncommitted dependent updates within that TXG, but any earlier sync updates should already be in the ZIL before they reply and should not need further processing.

            Have I misunderstood what you are implementing?


            bzzz Alex Zhuravlev added a comment -
            I've implemented a basic scheme where all the updates are tagged with versions and then the replay mechanism
            tries to apply them in the original order. The issue I'm facing now is that I can't control the txg border.
            Say we've got 3 transactions t1, t2 and t3 - this is how they are stored in the ZIL - but they were part of
            a single txg and applied their updates non-sequentially: t3 modified the llog first, then t1 and t2 did (in
            the worst case they modified a few llog objects in different orders). I was trying to mimic the original
            sequence: start t1, apply the ready updates (looking at the versions); start t2, again apply the ready
            updates; then start t3, apply all of its updates (as they are all ready), stop t3; now t1 and t2 can apply
            their remaining updates and stop too. Most of the time that works, but at some point t2 or t3 can't be
            started because of the txg's timeout - a deadlock.
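
            A toy illustration of that version-ordered replay (hypothetical structures, not the actual patch): every
            logged transaction carries updates tagged with a global version, and replay applies the updates strictly in
            version order, so each transaction stays "open" from its first to its last update - which is why, in the
            real code, every open transaction pins a DMU tx and the txg timeout can deadlock the scheme:

            #include <stdio.h>

            struct update {
                    int tx;         /* which logged transaction it belongs to */
                    int version;    /* global modification order */
            };

            int main(void)
            {
                    /* as in the example above: the ZIL stores t1, t2, t3,
                     * but t3 modified the llog first, then t1, then t2 */
                    struct update log[] = {
                            { 1, 2 }, { 1, 5 },    /* t1 */
                            { 2, 3 }, { 2, 6 },    /* t2 */
                            { 3, 1 }, { 3, 4 },    /* t3 */
                    };
                    int n = sizeof(log) / sizeof(log[0]);

                    /* apply in version order: 1(t3) 2(t1) 3(t2) 4(t3) 5(t1) 6(t2);
                     * all three transactions are open for most of the replay */
                    for (int v = 1; v <= n; v++)
                            for (int i = 0; i < n; i++)
                                    if (log[i].version == v)
                                            printf("apply version %d from t%d\n",
                                                   v, log[i].tx);
                    return 0;
            }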

            one possibility is to redo t1, t2 and t3 within a single tx, but potentially their number can be up to the
            number of MDT threads and this would turn into a huge tx.

            another thought is to use snapshots as a rollback mechanism: apply the updates one by one, just using the
            versions to order them; if this can't succeed for some reason, restore the original state from the snapshot,
            give up and fail to mount.

            any comments/suggestions are very welcome.


            bzzz Alex Zhuravlev added a comment -
            in the latest version of the patch the VBR version is logged (as a regular xattr_set update), so after
            zil_replay() we're supposed to have the replayed objects visible with an up-to-date version, and OFD can
            then drop "early" writes. To me it looks like a workaround, but I don't have a better idea at the moment.
            Clearly this can work with whole objects only, not byte ranges. Also, I'm not sure that using separate ITXs
            for the data write and the version set is correct; I'm investigating this.


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: behlendorf Brian Behlendorf
              Votes: 3
              Watchers: 31
