[LU-4009] Add ZIL support to osd-zfs Created: 25/Sep/13  Updated: 05/Dec/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Improvement Priority: Critical
Reporter: Brian Behlendorf Assignee: Alex Zhuravlev
Resolution: Unresolved Votes: 3
Labels: llnl, prz, zfs

Issue Links:
Blocker
is blocking LU-2887 sanity-quota test_12a: slow due to ZF... Resolved
is blocking LU-7895 zfs metadata performance improvements Resolved
is blocked by LU-4215 Some expected improvements for OUT Open
Related
is related to LU-2716 DNE on ZFS create remote directory su... Open
is related to LU-6836 sanity-quota test_4a: Passed grace ti... Resolved
is related to LU-10392 LustreError: 82980:0:(fid_handler.c:3... Resolved
is related to LU-2085 sanityn test_16 (fsx) ran over its Au... Closed
is related to LU-7426 DNE3: Current llog format for remote ... Open
is related to LU-14678 ldiskfs fast commit feature Open
Epic/Theme: Performance, prz, zfs
Rank (Obsolete): 10737

 Description   

In order to improve sync performance on ZFS-based OSDs, Lustre must be updated to utilize a ZFS ZIL device. This performance work was originally planned as part of the Lustre/ZFS integration but has not yet been completed. I'm opening this issue to track it.



 Comments   
Comment by Brian Behlendorf [ 25/Sep/13 ]

A preliminary debugging patch designed to measure the potential performance gain. Initial results are encouraging and suggest we can expect good performance if the zil_commit() time is kept to less than 10ms.

http://review.whamcloud.com/7761

Comment by Peter Jones [ 27/Sep/13 ]

Alex

Could you please comment on this?

Thanks

Peter

Comment by Alex Zhuravlev [ 27/Sep/13 ]

not sure what I can say right away. I think we discussed this enhancement a few times and the conclusion was that we want it implemented, but no actual dates were specified.

Comment by Brian Behlendorf [ 30/Sep/13 ]

Yes, I believe the hang-up was exactly how to integrate with the ZIL since it allows out-of-order commits. This is a problem for the Lustre wire protocol and we just need to sort out the best way to handle this so the ZIL can be implemented. Personally, I think the cleanest way to do this is to hide the entire ZIL implementation in the ZFS OSD and then only update the transno when the entire txg is synced. It's my understanding this should prevent the need for any wire protocol changes, but I'd love feedback on this plan before we start implementing anything.

Comment by Alex Zhuravlev [ 30/Sep/13 ]

probably we should use this ticket to discuss implementation details? I also wonder what will happen to fragmentation (and thus performance) if we start to put many writes into the ZIL?

Comment by Brian Behlendorf [ 30/Sep/13 ]

Indeed, this would be a good place to flesh out the implementation details.

At a high level, to implement the ZIL we're going to need to add a little bit of code to each of the handlers which modify the pool. This code needs to create a replay log record which fully describes the operation and is then attached to the per-filesystem zilog_t. The log can then be flushed synchronously if needed via a zil_commit(). The obvious case here is to properly handle fsync(), but it applies to any synchronous IO which needs to be performed. We'll likely need to export additional symbols from the ZFS code for this, but it should all be relatively straightforward.
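
A minimal sketch of what such a handler-side hook might look like, following the pattern the ZPL uses in zfs_log_write(); the osd_zil_* names and the WR_COPIED-style payload layout are assumptions for illustration, not the actual patch:

#include <sys/zil.h>
#include <sys/dmu.h>

/* WR_COPIED-style record: the payload is embedded right after the header */
static void
osd_zil_log_write(zilog_t *zilog, dmu_tx_t *tx, uint64_t foid,
                  uint64_t off, uint64_t len, const void *data)
{
        itx_t *itx;
        lr_write_t *lr;

        itx = zil_itx_create(TX_WRITE, sizeof (*lr) + len);
        lr = (lr_write_t *)&itx->itx_lr;
        lr->lr_foid = foid;
        lr->lr_offset = off;
        lr->lr_length = len;
        memcpy(lr + 1, data, len);

        itx->itx_wr_state = WR_COPIED;
        zil_itx_assign(zilog, itx, tx);   /* ties the record to the open TXG */
}

/* later, e.g. in the handler for an explicit sync on a single object */
static void
osd_zil_sync_object(zilog_t *zilog, uint64_t foid)
{
        zil_commit(zilog, foid);          /* waits only for the ZIL writes */
}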

What isn't completely clear to me are the consequences for Lustre. Once we've done the above, what if anything needs to change at the higher Lustre levels? Can this all be hidden from Lustre in the OSD?

As for fragmentation I don't expect it to be much worse than a normal ZPL filesystem unless Lustre is performing far more synchronous IO. But it's a good question and we'll have to see.

Comment by Andreas Dilger [ 31/Oct/13 ]

I don't think it is necessarily optimal to handle all of this internal to the OSD, but I'm not sure yet.

I think there is no problem for clients to submit synchronous RPCs (whether a "sync" flag set on a specific RPC, or an OST_SYNC or MDT_SYNC on a specific object) and have those return from the server immediately after the ZIL update without necessarily forcing the whole filesystem to commit. In particular, so long as the individual synchronous RPCs do not return before they are saved in the ZIL or actually committed to the OSD, the last_committed does not need to be updated in the reply. There used to be a check in the OSC BRW code that the reply had transno <= last_committed, but that was removed when bulk replay was implemented, and I couldn't find any more client code that checks this.

For operations like "sync object data range" there doesn't need to be any special handling at all - the data will be recovered from the ZIL if the client crashes (which IMHO is the main reason for sync writes), and the sync reply does not need a transno or need to be replayed. One minor drawback is that the data will be overwritten by Lustre resend/replay if the server crashes, but the data should be exactly the same since it is protected by DLM locks. Any updates from other clients should cause DLM lock revocation and the clients flush their dirty data/pending RPCs before the extent lock is granted to another client (which is the whole reason for the dreaded "sync_on_lock_cancel" OFD tunable).

For metadata operations, it would be possible to implement commit-on-share efficiently with the ZIL, so that the MDT doesn't need to pay the full cost of a transaction commit, but it can share the state to another client immediately after the ZIL update without concern if the original client crashes. With commit-on-share (via ZIL) and version-based-recovery the client replay should be a per-object recovery stream and no dependent operations should ever be lost.

At the protocol level it might be desirable to mark an RPC reply with a "committed to disk and doesn't need replay handling" flag. One option (for newly-modified clients) is that the server keeps a list of ZIL-committed transnos and uses these to fill in the gaps in its transno sequence. With VBR this might not even be needed, since uncommitted RPCs will always find the "previous" version committed on the object each time. If the servers are handling the replay behaviour, then clients could just drop sync'd RPCs from memory immediately and not have to replay them.

Another option (for old clients) is that the RPCs are saved by the clients and resent back to the server in case of recovery as normal, but are ignored by the server during replay because of the "no replay" flag in the RPC. This doesn't reduce the number of RPCs that need to be replayed, but allows this to work even with old clients.

Comment by Alex Zhuravlev [ 03/Nov/13 ]

yes, we discussed already that transno=0 should be fine. we also discussed this with Shadow in the context of replay, I don't remember the exact ticket number..

with metadata the issue is that the previous operations (which the one to be ZIL'ed depends on) should be part of the ZIL as well? we'd have to maintain a full tree of dependent operations in memory, given we don't know which one will be requested for immediate sync?

Comment by Alex Zhuravlev [ 05/Nov/14 ]

ZIL for metadata is going to be the challenge, IMHO. any operation can involve quite a few different changes, not just 2-4 objects like in ZPL, but more - changelogs, last_used_objid, unlink logs, etc. this looks very similar to the update logs we're doing for DNE2. the issue with update logs is linearity - it's a single stream with no actual dependency expressed.

as for data, it's a bit simpler, of course, but still - even OST_WRITE can update additional bits like UID/GID/FID. either we duplicate OFD logic in osd-zfs or we track the updates in osd-zfs (or above).

I've got a simple patch to put OST_WRITEs into the ZIL and replay them, but at the moment I have no good idea about the policy. say, if we don't commit immediately, then we have to update last_rcvd and assign a transno (which should be a part of the ZIL record?). if "sync" is explicitly encoded in an RPC, then it's easier - put the data into the ZIL, call zil_commit().

To be able to sync a range we have to send all the writes to the ZIL, AFAIU. Then all we can do is sync the object as a whole; we can't specify a range to sync. I guess this is still better than syncing a whole txg.

Comment by Alex Zhuravlev [ 11/Nov/14 ]

Brian, what's the most interesting use case for ZIL from your point of view? is it fsync() for data or something else?

Comment by Andreas Dilger [ 12/Nov/14 ]

I'm not Brian, but I do have one important use case for ZIL - shared file write. With the current ZFS code, performance is bad because "sync_on_lock_cancel" forces a sync for each lock conflict. With ldiskfs the sync overhead is not much, but with ZFS it is very bad.

Comment by Alex Zhuravlev [ 12/Nov/14 ]

Andreas, do you think support for range syncing is required, or would whole-object sync be enough? for the latter case I've got the patch - http://review.whamcloud.com/#/c/12572/, still trying to benchmark it though. as for ranges, the ZIL code doesn't support ranges and I guess we'd have to implement that ourselves.

Comment by Brian Behlendorf [ 12/Nov/14 ]

From what I've gathered, the two most important use cases here are the ones which have been mentioned. They both cause significant performance problems for applications.

1. fsync(2), and
2. sync_on_lock_cancel

Comment by Alex Zhuravlev [ 12/Nov/14 ]

the correct implementation of fsync(2) would mean the object's creation on MDT, right?

Comment by Brian Behlendorf [ 12/Nov/14 ]

Right, from the user's perspective it needs to be on stable storage at that point. So MDT and OSTs. I'm also not quite sure what you mean about the ZIL not providing range locking. Why would the ZIL provide any kind of range locking?

Comment by Alex Zhuravlev [ 12/Nov/14 ]

using ZIL on the MDT is a big challenge. I'm still thinking on that.. as for ranges, I didn't mean range locking, I meant that zil_commit() doesn't support ranges, only "all objects" or a specific object: zil_commit(zilog_t *zilog, uint64_t foid). to be able to sync only a specific range of bytes we'd need to develop a function similar to zil_commit() which flushes only the specified range, AFAIU.

Comment by Brian Behlendorf [ 12/Nov/14 ]

OK, that makes sense. A tiny bit of refactoring on the ZFS side could provide an interface which does this. It looks like the simplest thing to do would be to use zil_async_to_sync() to stage the foids which need to be synced. Then either slightly tweak zil_commit() to accept a reserved foid which means "sync all staged foids", or provide a new function for this. All the code logic is already there; we just need the interfaces.
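
A rough sketch of what such an interface could look like; nothing here exists in ZFS today. zil_commit_staged() is the hypothetical "reserved foid" variant described above, and zil_async_to_sync() is currently a zil.c-internal helper, so this would live alongside it:

void
zil_commit_foids(zilog_t *zilog, const uint64_t *foids, int nfoids)
{
        int i;

        /* stage each object's async itxs so the log writer will pick them up */
        for (i = 0; i < nfoids; i++)
                zil_async_to_sync(zilog, foids[i]);

        /*
         * Hypothetical variant of zil_commit() that writes out whatever has
         * already been staged without promoting every other object's async
         * itxs as a plain zil_commit(zilog, 0) would.
         */
        zil_commit_staged(zilog);
}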

Speaking of interfaces, I see none of the zil_* functions are exported. Let's get the needed interfaces available as soon as possible to minimize any build issues here. Thus far, what are you using?

Comment by Andreas Dilger [ 12/Nov/14 ]

Alex, while I think that range sync would optimize the lock cancellation some amount, the first step is to allow whole-file sync to the ZIL without TXG commit, and work on range sync in a second step, since the current performance is terrible. The range sync would improve the cancellation of a single lock, but I expect that there are often several locks under contention on the same object, so it may even be better overall to sync the whole object (better block allocations, fewer total seeks, etc.).

There is some question of whether the ZIL performance for single-file writes would be enough, but if it is doing "pass through" of large data writes to the actual VDEVs and only writing the updated metadata to the ZIL, it should be pretty good (equivalent to what ldiskfs is doing).

Comment by Alex Zhuravlev [ 13/Nov/14 ]

in the patch mentioned above full blocks aren't copied, but instead written in-place. at TXG commit they become part of the stable tree. in the code we can easily vary what to copy (in the simplest approach - depending on size).

Comment by Alex Zhuravlev [ 13/Nov/14 ]

Brian, please have a look at the exports I'm using - http://review.whamcloud.com/#/c/12572/7/contrib/patches/zil-export.patch,cm

Comment by Brian Behlendorf [ 13/Nov/14 ]

Alex I've taken your zil-export.patch, reworked it slightly, and opened the following pull request against ZoL. It exports all the public zil_* functions just in case it turns out we need them. This may help us avoid additional compatibility code down the road. I also agree with Andreas about the range optimization. While I don't think it's a ton of work to add the needed interfaces, let's leave that as a future optimization.

https://github.com/zfsonlinux/zfs/pull/2892

Comment by Alex Zhuravlev [ 13/Nov/14 ]

thanks Brian. I'm working on the tests verifying data stored via ZIL.

Comment by Alex Zhuravlev [ 14/Nov/14 ]

Brian, to test per-object sync I did the following at umount:

/* normally we should be calling zil_close() here to release ZIL
 * but this would cause all the logs to be flushed preventing
 * testing. so we suspend ZIL, release all the ITXs and resume ZIL */
zil_suspend(o->od_mntdev, &cookie);
zil_close(o->od_zilog);
zil_resume(cookie);

do you think it's OK or there is a better approach? thanks

Comment by Brian Behlendorf [ 14/Nov/14 ]

Alex, my suggestion would be to instead use spa_freeze() and structure the test a little differently. It's an interface which was added to facilitate exactly this sort of testing for the ZPL. Specifically to ensure that ZIL replay works correctly.

The basic idea goes something like this:
1. Freeze the dataset - this prevents TXGs from syncing.
2. Perform some synchronous operations which will be written to the ZIL
3. Unmount and export the pool to unfreeze the dataset
4. Import and mount the dataset triggering ZIL replay
5. Verify all your synchronous operations exist

Comment by Alex Zhuravlev [ 15/Nov/14 ]

Brian, this is exactly what I was doing - have a look at osd_ro() which uses spa_freeze(). but the issue is that zil_close() flushes all ITXs at umount. so I can't apply fsync() to some of the objects and then observe a difference (the fsync'ed file is OK, other files are not).

Comment by Alex Zhuravlev [ 15/Nov/14 ]

https://testing.hpdd.intel.com/test_logs/7da0f608-6cef-11e4-9bc9-5254006e85c2/show_text - some initial benchmark made with fsx.
sync_on_lock_cancel=always, no ZIL support - 3142s, with ZIL - 162s

Comment by Andreas Dilger [ 15/Nov/14 ]

Nice improvement! I think multi-mount fsx is probably the worst case usage for sync-on-cancel.

Comment by Alex Zhuravlev [ 17/Nov/14 ]

yes, this seems to be the worst case - I dumped the stats during the test:

sync 301 samples [usec] 0 18739414 383896129
sync 274 samples [usec] 0 18017451 292737165
sync 249 samples [usec] 0 20157749 265892043
sync 248 samples [usec] 0 17003970 269692774
sync 431 samples [usec] 0 18782272 801983590
sync 413 samples [usec] 0 19024038 689453547
sync 378 samples [usec] 0 27903793 508680879
OR 19946955 us on average

zil-sync 397 samples [usec] 1 420873 10183718
zil-sync 473 samples [usec] 1 629639 19512485
zil-sync 484 samples [usec] 1 481490 22410665
zil-sync 464 samples [usec] 1 635453 19926051
zil-sync 412 samples [usec] 1 602652 9925743
zil-sync 402 samples [usec] 1 620596 7984370
zil-sync 407 samples [usec] 1 549678 11495810
OR 562911 us on average

also, iirc the OSTs share the same physical device which makes TXG commits very expensive.
it'd be interesting to try on a more realistic setup.

Comment by Brian Behlendorf [ 17/Nov/14 ]

That is definitely a step in the right direction! I think the configuration is also reasonably realistic, if good performance is possible without requiring a dedicated log device then that's an attractive configuration.

> the issue is that zil_close() flushes all ITXs at umount. so I can't apply fsync() to some of the objects and then observe a difference (fsync'ed file is OK, another files are not).

I see, because you are creating itx records for every object and they all get flushed on close, everything is always consistent on disk after a clean unmount. Have you tried disabling the ZIL in osd_ro() along with freezing the pool? This should prevent any additional itxs from ever being written to the log device, even on zil_close(). This should effectively simulate the failure mode you're testing. All fsync() calls prior to osd_ro() will be in the log and replayed; those which occur after will not.

spa_freeze(spa);
zil_set_sync(zilog, ZFS_SYNC_DISABLED);

Comment by Alex Zhuravlev [ 17/Nov/14 ]

hmm, what I was trying to simulate is the following. say, there are two objects. they are both "logged" with ZIL, but not flushed. now we get fsync() on one of them. fsync() succeeds and now the server crashes. I'd expect this object to survive (given the successful fsync()) and the other object to lose data. so the simulation sequence includes a preceding osd_ro() (to stop any commits), then transferring data to the server (and ZIL), then fsync() on one of them. IOW, there is nothing to worry about at the osd_ro() point.

Comment by Brian Behlendorf [ 17/Nov/14 ]

That sounds doable with careful use of spa_freeze() and zil_set_sync(). Perhaps you just want to call zil_set_sync() prior to zil_close(); this would prevent all the outstanding itx commits from being written during close. But those are the two basic interfaces you have available.

1. spa_freeze(spa) - Prevents TXGs from being written to disk
2. zil_set_sync(zilog, ZFS_SYNC_DISABLED) - Prevents zil_commit() from writing log records.
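
Putting the two interfaces together, a hedged sketch of the failure simulation being discussed; the osd_test_write() helper and object ids are placeholders, only spa_freeze(), zil_commit(), zil_set_sync() and zil_close() are real ZFS entry points:

#include <sys/spa.h>
#include <sys/zil.h>
#include <sys/fs/zfs.h>

static void
osd_simulate_crash_after_fsync(spa_t *spa, zilog_t *zilog,
                               uint64_t synced_foid, uint64_t other_foid)
{
        /* 1. stop TXGs from syncing, so only the ZIL can make data stable */
        spa_freeze(spa);

        /* 2. write to both objects - logged as itxs, not yet committed */
        osd_test_write(synced_foid);    /* placeholder */
        osd_test_write(other_foid);     /* placeholder */

        /* 3. "fsync" one object: its log records reach the ZIL device */
        zil_commit(zilog, synced_foid);

        /* 4. simulate the crash: keep zil_close() from flushing the rest */
        zil_set_sync(zilog, ZFS_SYNC_DISABLED);
        zil_close(zilog);

        /*
         * After export + re-import, ZIL replay should restore synced_foid's
         * data while other_foid's writes are lost - the difference the test
         * wants to observe.
         */
}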

Comment by Alex Zhuravlev [ 17/Nov/14 ]

ok, thanks. let me try that..

Comment by Alex Zhuravlev [ 20/Nov/14 ]

if it's the same set of clients recovering after OST failover, then we're supposed to get the same set of OST_WRITEs (let's consider those only for simplicity) and the same data as right after fsync(2), but if some client is missing, then we can get different data, which I guess isn't quite correct. something like "last_committed on the object" is required? so that the OST can skip the writes? changes in the wire protocol wouldn't be enough as the OST can crash right after fsync(2), making it impossible to communicate commit status to the clients.

Comment by Alex Zhuravlev [ 21/Nov/14 ]

in the latest version of the patch the VBR version is logged (being a regular xattr_set update), so after zil_replay() we're supposed to have the replays visible with an up-to-date version. now OFD can drop "early" writes. to me it looks like a workaround, but I don't have a better idea at the moment.
clearly this can work with whole objects, not byte ranges. also, I'm not sure that using separate ITXs for the data write and the version set is correct, investigating this.

Comment by Alex Zhuravlev [ 20/Apr/15 ]

I've implemented a basic schema where all the updates are tagged with versions and then the replay mechanism tries to apply them in the original order. the issue I'm facing now is that I can't control the txg border. say, we've got 3 transactions: t1, t2 and t3 - this is how they are stored in the ZIL. but they were parts of a single txg and applied their updates non-sequentially: t3 modified the llog first, then t1 and t2 did (in the worst case they modified a few llog objects in different orders). I was trying to mimic the original sequence: start t1, apply ready updates (looking at versions); start t2, again apply ready updates; then start t3, apply all the updates (as they become ready), stop t3; now t1 and t2 can apply the remaining updates and stop too. most of the time that works, but at some point t2 or t3 can't be started because of the txg's timeout - a deadlock.

one possibility is to redo t1, t2 and t3 within a single tx, but potentially the number can be up to the number of MDT threads and this would turn into a huge tx.

another thought is to use snapshots as a rollback mechanism: apply the updates as singles, just using versions to order them; if this can't succeed for some reason - restore the original state using the snapshot, give up and fail to mount.

any comments/suggestions are very welcome.

Comment by Andreas Dilger [ 20/Apr/15 ]

Alex, are you recording logical operations into the ZIL or physical blocks? Isn't it true that logical updates (e.g. set bit X, decrement free count, etc) could be applied in any order? I'd think that any updates that are done to a contended resource will have internal locking at least, so they should be safe to replay in some arbitrary order later. Obviously, this can't handle some cases (e.g. running out of space within an llog file), but that should never happen.

The ZIL records are not meant to span multiple TXGs I think, only to optimize sync operations that happen within a single TXG so that they can commit and reply to the client more quickly. If the parent TXG is committed it should be possible to drop all ZIL records for that TXG without further processing (i.e. ZIL is a writethrough cache for the TXG, not writeback). If the MDT or OST crashes before TXG commit, then the first thing to recover before any other pool update are the ZIL updates, and they will reconstruct the "replied as committed" parts of the incomplete TXG. I'd think we also need COS to handle uncommitted dependent updates within that TXG, but any earlier sync updates should already be in the ZIL before they reply and should not need further processing.

Have I misunderstood what you are implementing?

Comment by Alex Zhuravlev [ 21/Apr/15 ]

Andreas, I'm not sure things like setbit are enough.. in the llog header itself we update a counter, for example. there are more similar cases where within a single TXG we update a conflicting resource - lov_objids, last_rcvd (a single slot containing two records), etc. I think if we start to implement all that as "logical" updates we bring a lot of complexity into the higher layers.

Comment by Alex Zhuravlev [ 21/Apr/15 ]

ZIL records do not last over a TXG, of course. but during replay that set of ZIL records (originating from a single TXG) can result in a few TXGs, as every transaction stored in the ZIL has its own start/stop.

Comment by Alex Zhuravlev [ 27/Apr/15 ]

Doug asked me to put more details here. would it make sense to have a picture?

first of all, how the current commit mechanism works and how it is used by Lustre. although we say “start
and stop transaction”, our Lustre transactions actually join an existing DMU transaction and that one is
committed as a whole or discarded. also, only the final state of the DMU transaction is subject to commit,
not some intermediate state. Lustre heavily depends on these semantics to improve internal concurrency.

Let's consider a very simple use case - object precreation. Lustre maintains the last assigned ID in a
single slot. it doesn't matter when a transaction updating the slot stops - only the final state of the slot
will be committed. if we were following the “normal” rules (like ZPL does to support the ZIL), Lustre would have
to lock the slot, start the transaction, update the slot, close the transaction and release the slot. Such
a stream of transactions is linear by definition and can be put into the ZIL for subsequent replay - the transaction
stop gives us actual information on the order the slot was updated in. That also means zero concurrency,
so bad performance for file creation. To improve concurrency and performance Lustre does the reverse:
start the transaction, lock the slot, update the slot, release the slot, stop the transaction. This means, though,
that the stop doesn't give us any information on the ordering - the order transactions get into the ZIL can mismatch
the order the slot was updated in. The two orderings are contrasted in the sketch below.
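
Illustrative pseudo-code only (the slot structure and function names are made up), contrasting the two orderings for the "last assigned ID" slot; normal dmu_tx_hold_*() declarations are omitted for brevity:

#include <sys/dmu.h>

struct slot {
        kmutex_t lock;
        uint64_t last_id;
};

/* ZPL-style: tx stop order == slot update order, so the ZIL stream is
 * replayable as-is, but the lock is held across the whole tx - no concurrency */
static void
precreate_zpl_style(objset_t *os, struct slot *s)
{
        dmu_tx_t *tx;

        mutex_enter(&s->lock);
        tx = dmu_tx_create(os);
        (void) dmu_tx_assign(tx, TXG_WAIT);
        s->last_id++;                   /* update while still holding the lock */
        dmu_tx_commit(tx);              /* stop defines the replay order */
        mutex_exit(&s->lock);
}

/* Lustre-style: good concurrency, but tx stop (and hence ZIL) order can
 * differ from the order the slot was actually updated in */
static void
precreate_lustre_style(objset_t *os, struct slot *s)
{
        dmu_tx_t *tx;

        tx = dmu_tx_create(os);
        (void) dmu_tx_assign(tx, TXG_WAIT);
        mutex_enter(&s->lock);
        s->last_id++;
        mutex_exit(&s->lock);           /* another thread may update and stop first */
        dmu_tx_commit(tx);              /* stop order != slot update order */
}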

This is a problem partially because at the OSD we see absolute values, not logical operations. We see new
objids or we see a new bitmap (in the case of llogs), etc. So what would happen if we started to store operations
instead of values? Say, for object precreation again - we'd introduce an increment operation? Sometimes
we need to reset that value (when we start a new sequence). And even worse - the whole point of increment
is to not store absolute values, but we need the absolute values as they have already been returned to the client
and used in LOVEA, etc. This is the case with very simple logic - just a single value. And there is the llog yet..
And then we'd need a brand new mechanism to pass these special operations down through the stack, etc.
Hence I tend to think this is way too complicated to even think through all the details.

If the problem is only with the ordering, then why don't we solve this problem? If we know the order specific
updates were made to the object in, then we can replay the updates in that order again. But this order doesn't
match the transactions the updates were made in? The transactions are needed to keep the filesystem
consistent through the replay. Say, we have two transactions T1 and T2 modifying the same object. T1 got
into the ZIL before T2, but T2 modified the object first. In the worst case T1 and T2 modified two objects, but
in the reverse order, making them dependent on each other. The TXG mechanism solved this problem as it
was a single commit unit. We'd have to do something similar - start T1, detect the dependency, put T1 on hold,
start T2, apply the updates in the correct order, stop T1 and T2. Doesn't sound trivial. What if the ZIL got many Tx
in between T1 and T2, as we used to run a thousand threads on the MDT? Are they all subject to joining the same
big transaction with T1 and T2? what if the DMU doesn't let us put all of them in due to the TXG commit timeout or
a changed pool property resulting in bigger overhead?

Here the snapshots come in - the only reason for the transaction is to keep the filesystem consistent, but
what if we implement our own commit points using snapshots? Essentially we mimic the TXG: take a snapshot,
apply the updates in the order we need, discard the snapshot if all the updates succeeded, roll back to the
snapshot otherwise. If the system crashes during replay, we'll find the snapshot, roll back to it and can
repeat again. In this schema there is zero need to modify the Lustre core code; everything (except optimizations
like 8K writes to update a single bit in the llog header) is done within osd-zfs.
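
A very rough sketch of that replay flow, under stated assumptions: osd_zil_apply_updates_by_version() and osd_rollback_to_snapshot() are hypothetical helpers (the latter would wrap dsl_dataset_rollback(), whose exact signature varies between ZFS versions); dmu_objset_snapshot_one() and dsl_destroy_snapshot() are the real in-kernel entry points such a scheme could build on:

static int
osd_zil_replay_ordered(struct osd_device *osd)
{
        char snap[MAXNAMELEN];
        int rc;

        /* our own "commit point": preserve the pre-replay state */
        snprintf(snap, sizeof(snap), "%s@lustre-zil-replay", osd->od_mntdev);
        rc = dmu_objset_snapshot_one(osd->od_mntdev, "lustre-zil-replay");
        if (rc)
                return rc;

        /* apply the logged updates one by one, ordered by per-object version */
        rc = osd_zil_apply_updates_by_version(osd);     /* hypothetical */
        if (rc == 0) {
                /* everything applied - the safety snapshot can go */
                rc = dsl_destroy_snapshot(snap, B_FALSE);
        } else {
                /* restore the original state and refuse to mount */
                osd_rollback_to_snapshot(osd, snap);     /* hypothetical wrapper */
        }
        return rc;
}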

Comment by Alex Zhuravlev [ 11/Jun/15 ]

in the latest patch I removed the optimizations to the packing mechanism to make the patch smaller,
now the benchmarks again (made on a local node where all the targets share the same storage):

with ZIL:
Flush 26601 5.030 152.794
Throughput 19.8412 MB/sec 2 clients 2 procs max_latency=152.803 ms

no ZIL:
Flush 12716 59.609 302.120
Throughput 9.48099 MB/sec 2 clients 2 procs max_latency=302.140 ms

zil-sync 31825 samples [usec] 0 99723 50668754
zil-records 692259 samples [bytes] 40 9656 1066813288

zil-sync 31825 samples [usec] 2 129809 66437030
zil-copied 405907 samples [writes] 1 1 405907
zil-indirect 1698 samples [writes] 1 1 1698
zil-records 701379 samples [bytes] 40 4288 1799673720

on MDT an average record was 1066813288/692259=1541 bytes
on OST it was 1799673720/701379=2565 bytes
the last one is a bit surprising, I'm going to check the details.

Comment by Alex Zhuravlev [ 12/Jun/15 ]

now again with the improvements to OUT packing + cancel aggregation in OSP
(it collects a bunch of cookies, then cancels with a single llog write - a huge contribution to the average record size):

with ZIL:
Flush 29737 3.880 181.323
Throughput 22.224 MB/sec 2 clients 2 procs max_latency=181.332 ms

no ZIL:
Flush 13605 54.994 327.793
Throughput 10.1488 MB/sec 2 clients 2 procs max_latency=327.804 ms

ZIL on MDT:
zil-sync 35517 samples [usec] 0 77549 36271952
zil-records 670175 samples [bytes] 32 9272 272855776
zil-realloc 808 samples [realloc] 1 1 808

ZIL on OST:
zil-sync 35517 samples [usec] 0 123200 55843200
zil-copied 455241 samples [writes] 1 1 455241
zil-indirect 1663 samples [writes] 1 1 1663
zil-records 785376 samples [bytes] 32 4288 1968476864

the improvements to shrink the average ZIL record size give -73% (407 vs. 1541) and
-23% on the average sync time (3.88 vs 5.03).

of course, this is subject to a rerun on regular hardware.

Comment by Gerrit Updater [ 25/Jun/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15393
Subject: LU-4009 osd: be able to remount objset w/o osd restart
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c484f0aa6e09c533635cde51d3884f746b25cfd5

Comment by Gerrit Updater [ 25/Jun/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15394
Subject: LU-4009 osd: enable dt_index_try() on a non-existing object
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5ba67e0e4d7a25b0f37e0756841073e29a825479

Comment by Gerrit Updater [ 04/Jul/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15496
Subject: LU-4009 osp: batch cancels
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6cd411afd2bcfe585ff29aa859df055ed14ee2fa

Comment by Gerrit Updater [ 09/Jul/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15542
Subject: LU-4009 osd: internal range locking for read/write/punch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4a47d8aa592d8731082db54df58f00e5fda54164

Comment by James A Simmons [ 02/Jan/18 ]

What is the status of this work?

Comment by Chris Hunter (Inactive) [ 04/Feb/18 ]

I think this project is related to LLNL lustre branch on github.

Comment by Aurelien Degremont (Inactive) [ 28/Nov/19 ]

I'm wondering if there is any news on this ticket?

If the development stopped, I'm curious to know what the reason was? Thanks

 

Comment by Andreas Dilger [ 29/Nov/19 ]

Aurelien,
the previous development for ZIL integration was too complex to land and maintain. The main problem was that while Lustre locks individual data structures during updates (e.g. each log file, directory, etc), it does not hold a single global lock across the whole filesystem update since that would cause a lot of contention. Since all of these updates are done within a single filesystem transaction (TXG for ZFS), there is no problem if they are applied to the on-disk data structures in slightly different orders because they will either commit to disk or be lost as a single unit.

With ZIL, each RPC update needs to be atomic across all of the files/directories/logs that are modified, which caused a number of problems with the implementation in ZFS and the Lustre code.

One idea that I had for a better approach for Lustre (though incompatible with the current ZPL ZIL usage) is, instead of trying to log the disk blocks that are being modified to the ZIL, to log the whole RPC request to the ZIL (at most 64KB of data). If the server doesn't crash, the RPC would be discarded from the ZIL on TXG commit. If the server does crash, the recovery steps would be to re-process the RPCs in the ZIL at the Lustre level to regenerate the filesystem changes. That avoids the issues with the unordered updates to disk. For large bulk IOs the data would still be "writethrough" to the disk blocks, and the RPC would use the data from the filesystem rather than doing another bulk transfer from the client (since ZIL RPCs would be considered "sync" and the client may not preserve the data in its memory).
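
A hedged sketch of what a "log the whole RPC" record could look like; TX_LUSTRE_RPC and the record layout are hypothetical (no such txtype exists in ZFS), only lr_t, zil_itx_create() and zil_itx_assign() are real:

#include <sys/zil.h>

#define TX_LUSTRE_RPC   64              /* hypothetical new txtype */

typedef struct {
        lr_t            lr_common;      /* standard ZIL record header */
        uint64_t        lr_rpc_xid;     /* client XID, to drop duplicates on replay */
        uint64_t        lr_rpc_len;     /* length of the request that follows */
        /* lustre_msg request buffer (<= 64KB) follows */
} lr_lustre_rpc_t;

static void
osd_zil_log_rpc(zilog_t *zilog, dmu_tx_t *tx, const void *req, uint64_t len,
                uint64_t xid)
{
        itx_t *itx;
        lr_lustre_rpc_t *lr;

        itx = zil_itx_create(TX_LUSTRE_RPC, sizeof (*lr) + len);
        lr = (lr_lustre_rpc_t *)&itx->itx_lr;
        lr->lr_rpc_xid = xid;
        lr->lr_rpc_len = len;
        memcpy(lr + 1, req, len);

        zil_itx_assign(zilog, itx, tx);
        /* on replay, Lustre would re-process this request at the RPC level
         * instead of re-applying raw block updates */
}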

Comment by Aurelien Degremont (Inactive) [ 29/Nov/19 ]

Since all of these updates are done within a single filesystem transaction (TXG for ZFS), there is no problem if they are applied to the on-disk data structures in slightly different orders because they will either commit to disk or be lost as a single unit.

With ZIL, each RPC update needs to be atomic across all of the files/directories/logs that are modified, which caused a number of problems with the implementation in ZFS and the Lustre code.

I'm trying to understand how this works. How does this compare to ldiskfs? Is ldiskfs starting a different transaction for each RPC?

Why are updates done within a single transaction for ZFS? Does this mean that in the normal situation the transaction is committed to disk every 5 seconds, and every 5 seconds a new one is created?

That avoids the issues with the unordered updates to disk.

You mean unordered updates are hurting performance, or that they don't fit with the transaction model above?

For large bulk IOs the data would still be "writethrough" to the disk blocks, and the RPC would use the data from the filesystem rather than doing another bulk transfer from the client (since ZIL RPCs would be considered "sync" and the client may not preserve the data in its memory).

How could you roll back the transaction if the I/Os are "writethrough" (you mean they are skipping the ZIL?)?

I don't see how reading or writing from disk will give you proper crash handling. How are you taking care of the sync() calls which are hurting ZFS performance a lot (because we have only a single transaction)?

 

If bulk transfer is not written to the ZIL, I see less interest in this feature. I thought the whole point was that writing to a ZIL log was faster than writing the same data through the DMU.

Are ZFS transactions atomic, right now?

 

Thanks for taking time explaining

Comment by Andreas Dilger [ 29/Nov/19 ]

The transaction behavior between ldiskfs and ZFS is exactly the same today. Multiple RPCs are batched together into a single disk transaction, and are committed to disk every few seconds, or sooner depending on space. Lustre does not make any filesystem modifications until after it has reserved transaction space (in the "declare" phase) and started the transaction handle (which is a refcount on the disk transaction). After the transaction handle is started, all filesystem modifications are atomic and will either be committed together, or lost if the transaction doesn't commit (e.g. crash).
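
As a minimal sketch of that declare/start/stop pattern using the Lustre dt_* API (error handling trimmed; the object and buffer are placeholders):

static int osd_update_example(const struct lu_env *env, struct dt_device *dev,
                              struct dt_object *obj, struct lu_buf *buf,
                              loff_t pos)
{
        struct thandle *th;
        int rc;

        th = dt_trans_create(env, dev);         /* allocate a handle */
        if (IS_ERR(th))
                return PTR_ERR(th);

        /* declare phase: reserve space in the disk transaction (TXG/journal) */
        rc = dt_declare_record_write(env, obj, buf, pos, th);
        if (rc == 0)
                rc = dt_trans_start(env, dev, th);      /* join the open TXG */

        if (rc == 0)
                rc = dt_record_write(env, obj, buf, &pos, th);  /* atomic wrt. the TXG */

        dt_trans_stop(env, dev, th);    /* actual commit happens with the whole TXG later */
        return rc;
}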

The unordered updates within a transaction help improve performance, because they increase concurrency within the filesystem. If we had to hold a huge lock across the whole filesystem for each update, this would hurt performance significantly. Instead, we only hold a lock for each object (eg. llog file, or leaf block of a directory being modified) to ensure the update does not corrupt the data from concurrent changes. Since all of the updates related in a single transaction will commit together, it doesn't matter if they are slightly unordered wrt. each other, as they will not cross a transaction boundary.

As for writethrough of large bulk data to disk, this is already done by ZPL+ZIL usage today, depending on configuration options. Small writes will go directly to the ZIL, which is good for Lustre because it can also pack small writes directly into the RPC request (16KB today, up to 64KB with my patch https://review.whamcloud.com/36587). For large writes, the data is written to the actual location on disk to avoid double IO of large amounts of data (which would typically overload the ZIL device).

The large write data is written to newly allocated and unused disk blocks (as is all data in a COW filesystem), and the block pointer is written to the ZIL. If the transaction commits, the ZIL record is dropped and the block pointer is already part of the transaction. If the transaction does not commit, but the ZIL record has been written, the ZIL replay will use the data written to the "free" blocks on disk.
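
A conceptual sketch of that decision, loosely modelled on the ZPL's zfs_log_write(); the immediate_write_sz parameter stands in for the real zfs_immediate_write_sz/logbias logic and is an assumption:

#include <sys/zil_impl.h>
#include <sys/fs/zfs.h>

static itx_wr_state_t
choose_wr_state(zilog_t *zilog, uint64_t len, uint64_t immediate_write_sz)
{
        if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
                return (WR_INDIRECT);           /* never copy data into the log */
        if (len <= immediate_write_sz)
                return (WR_COPIED);             /* small: embed payload in the itx */
        return (WR_INDIRECT);                   /* large: data written in place,
                                                 * the ZIL keeps only the blkptr */
}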

Note that ZIL does not necessarily cause all IO to be faster. The ZIL is only written to disk when there is a sync operation. This also requires the filesystem to track the dependency of all updates in memory, so that dependent updates are all written to ZIL, and the filesystem is not left in an inconsistent state after a crash and ZIL recovery. This is where the complexity arises in Lustre if the ZIL records for one RPC are written independently from another (not directly related) RPC.

In some cases, two RPCs are not actually dependent on each other, but may happen to share disk blocks (eg. in the changelog). If we have to write everything that modified the ChangeLog to the ZIL, then every sync will write everything to disk, and we are not further ahead than without the ZIL. My alternate proposal (which nobody is currently working on) is to log the RPCs as "logical" journal records, rather than the "physical" block records that exist today. This would make them incompatible with the ZPL+ZIL records, but should avoid the problems that were seen in the previous patch.

Comment by Aurelien Degremont (Inactive) [ 02/Dec/19 ]

Looks like I'm a bit slow minded today.

I'm trying to understand the benefits of using ZIL when the ZIL device is the same as the DMU device, especially under a heavy sync() workload (when ZIL is using a fast IOPS device and DMU a slow one, the benefits are obvious).

AFAIU, ldiskfs is relying on JBD to support transactions. Data is written to the jbd and the transaction commit flags the jbd transaction as committed. It will be replayed on remount if the system crashes, and will be discarded if the transaction was not fully written. Most (all?) I/Os go through the journal.

To implement transactions, is ZFS storing the transaction in memory, and when writing it, allocating new blocks for each block to be written/modified and relying on the final block write (uberblock?) to "commit" the transaction on disk? (I don't know how the uselessly written blocks will be reclaimed on restart if the server crashes though...)

So, I'm trying to understand what's the difference when a sync() is handled by a target, with and without ZIL.

Without ZIL, the current transaction will be flushed from memory to disk. Other RPCs are waiting for this transaction to be acknowledged by the disk to create and start writing to a new one?

With ZIL, well... the transaction is flushed from memory to ZIL. sync is acknowledged as soon as the transaction is in ZIL.

The benefits come from the fact that it is faster to write to the ZIL than to the device itself? This is where I'm probably missing something...

is to log the RPCs as "logical" journal records, rather than the "physical" block records that exist today. This would make them incompatible with the ZPL+ZIL records

According to this OpenZFS document, the actual operation is logged into ZIL, not the modified blocks (http://open-zfs.org/w/images/c/c8/10-ZIL_performance.pdf)

 

 

Comment by Andreas Dilger [ 02/Dec/19 ]

Definitely ZIL makes the most sense when it is on higher IOPS storage than the main pool, but there are still potential benefits even with spinning disk. The ZIL allows fast commit of sync operations without the need to flush the whole TXG to disk and write überblocks to all the disks (4 per disk, 2 at each end of the disk). A TXG commit might take hundreds or thousands of IOPS to finish. A sync write to the ZIL may take one or two IOPS.

For ldiskfs, while the metadata is written to the journal, most of the data is not. This avoids the slowdown of writing the data twice, and being throttled by the bandwidth of the journal device. The same is true of the ZIL - it does not write large IOs to the ZIL, but rather to the final location on disk, and then saves the block pointer into the ZIL record. Even with data going to the filesystem instead of the ZIL/jbd this only adds one or two IOPS to finish the write.

In theory, JBD2 could be optimized the same way - to allow small sync writes to go to the journal, and large writes to go to disk (as they do today), but that has never been implemented.

You are likely correct about ZIL records being logical records vs. physical blocks. My important point is that the format of the ZIL records would be different than regular ZPL ZIL records. This means that we would need to implement some code for recovery of Lustre RPC records in ZPL so that mounting a dataset with ZPL doesn't cause problems.

Comment by Brian Behlendorf [ 03/Dec/19 ]

My alternate proposal (which nobody is currently working on) is to log the RPCs as "logical" journal records, rather than the "physical" block records that exist today.

This idea makes a lot of sense to me. It would naturally fit with the way the ZIL currently works, which would be ideal. As it turns out, adding new ZIL log records is something we've recently been looking into in order to handle the following proposed changes.

https://github.com/zfsonlinux/zfs/pull/9414 - renameat(2) flags RENAME_*
https://github.com/zfsonlinux/zfs/pull/9078 - xattr=sa syncing to ZIL

There's an existing proposal to add support for new ZIL record types (currently not being worked on).

When we do a read-write import of the pool, when we walk all the ZILs to claim their blocks, we also check if there are any unrecognized record types. If so, we fail the import. One downside with this is that the record types are just an enum, so all implementations of ZFS must agree about what each value means.

Alternatively, we could add a single feature flag now which is always activated (refcount=1), and add a single new ZIL record type. This new record type would specify which other feature flag must be supported in order to import the pool read-write. The other feature flag would be specified by fully-qualified name, so it wouldn't have the downside mentioned above. But it doesn't solve the backwards-compatibility for the first thing that needs it.

https://github.com/zfsonlinux/zfs/pull/9078#issuecomment-553733379

This would lay the groundwork for us to be able to register a Lustre-specific ZIL feature flag. Ideally, that could be implemented in such a way that any new Lustre ZIL records are passed to Lustre for processing. This would let us avoid duplicating Lustre-specific logic in the ZFS code. We'd then want to refuse any read-write ZPL mount when there was a ZIL to replay.

Comment by Aurelien Degremont (Inactive) [ 04/Dec/19 ]

OK, I'm just trying to understand the difference in terms of IOPS when writing with and without ZIL enabled.

When the client is doing regular writes for a file, they will go to the current TXG, in memory. Then, the client issues a fsync(fd) for this file. Without ZIL, the whole transaction will be flushed to the zpool.
When ZIL is enabled, what will be done by this fsync() call? This is what I don't understand clearly. @Brian? @Andreas?

 

About this new approach with custom ZIL records and storing RPCs, that makes sense AFAICT. Does this need a dedicated ticket?

Comment by Andreas Dilger [ 04/Dec/19 ]

When the client is doing regular writes for a file, they will go to the current TXG, in memory.
Then, the client issues a fsync(fd) for this file. Without ZIL, the whole transaction will be flushed to the zpool.
When ZIL is enabled, what will be done by this fsync() call? This is what I don't understand clearly.

AFAIK (Brian can correct me if I'm wrong), the fsync() (on ZPL, where there is a ZIL) will write out ZIL records for all of the dependent operations related to the file (e.g. parent directory, dnode, file data) and return as soon as the ZIL writes are committed to storage. Depending on how the ZIL is configured, the file data may be written to the ZIL or it may be written directly to the pool. The ZIL writes are independent of the main ZFS TXG, but as soon as the ZFS TXG commits then the ZIL records are irrelevant, and the ZIL is only ever read if the node crashes.
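
For reference, this is roughly how the ZFS-on-Linux fsync path reduces to a single-object ZIL commit (a simplified sketch, not the verbatim zfs_fsync() code; the fields shown are real znode/objset members):

static int
zfs_fsync_sketch(zfsvfs_t *zfsvfs, znode_t *zp)
{
        if (zfsvfs->z_os->os_sync != ZFS_SYNC_DISABLED)
                zil_commit(zfsvfs->z_log, zp->z_id);    /* waits for the ZIL
                                                         * writes only, not a
                                                         * full TXG sync */
        return (0);
}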

Comment by Aurelien Degremont (Inactive) [ 05/Dec/19 ]

OK, so when you're saying that writing to the ZIL means very few operations, you were referring to a single sync write? Compared to a whole TXG flush when doing a sync operation without ZIL?

 

Regarding using the ZIL to store RPC records: the payload, if big, would be written to disk with references to it in the ZIL records, so we still need to reserve space for that. The RPCs themselves just need to be dumped into the ZIL; we don't really care about the order or other details.

The RPC layer has no access to the ZIL, so either it should, or the OSD interface should take care of that. That means preprocessing the RPC a minimum, only for the ZFS case?

 
