Doug asked me to put more details here. Would it make sense to have a picture?
First of all, how the current commit mechanism works and how it is used by Lustre. Although we say “start
and stop transaction”, our Lustre transactions actually join an existing DMU transaction, and that one is
committed as a whole or discarded. Also, only the final state of the DMU transaction is subject to commit,
not some intermediate state. Lustre depends heavily on these semantics to improve internal concurrency.
Let’s consider a very simple use case - object precreation. Lustre maintains the last assigned ID in a
single slot. It doesn’t matter when a transaction that updated the slot stops - only the final state of the
slot will be committed. If we were following “normal” rules (like ZPL does to support ZIL), Lustre would have
to lock the slot, start the transaction, update the slot, close the transaction and release the slot. Such
a stream of transactions is linear by definition and can be put into ZIL for subsequent replay - transaction
stop gives us actual information on the order the slot was updated in. That also means zero concurrency,
so bad performance for file creation. To improve concurrency and performance Lustre does the reverse:
start the transaction, lock the slot, update the slot, release the slot, stop the transaction. This means,
though, that the stop doesn’t give us any information on the ordering - the order in which transactions get
into ZIL can mismatch the order the slot was updated in.
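To make the mismatch concrete, here is a minimal sketch in Python. It is a toy model, not Lustre or ZFS
code: the slot, the log of (transaction, absolute value) pairs and all function names are made up for
illustration, and the interleaving in lock_inside() is written out by hand rather than produced by threads.

```python
def lock_outside(n):
    """ZPL-like order: lock, start, update, stop, unlock (fully serialized)."""
    slot, log = 0, []
    for i in range(n):
        # The lock is held across the whole transaction, so nothing interleaves
        # and the stop order equals the update order.
        slot += 1
        log.append((f"T{i + 1}", slot))
    return slot, log


def lock_inside():
    """Lustre-like order: start, lock, update, unlock, stop.

    One possible interleaving of two transactions: T2 updates the slot
    first, T1 second, but T1 stops (and is logged) before T2.
    """
    slot, log, values = 0, [], {}
    slot += 1
    values["T2"] = slot          # T2 holds the slot lock first
    slot += 1
    values["T1"] = slot          # T1 holds the slot lock second
    log.append(("T1", values["T1"]))   # stops happen outside the lock,
    log.append(("T2", values["T2"]))   # here in the opposite order
    return slot, log


def replay(log):
    """Naive replay: apply the absolute values in log order."""
    slot = 0
    for _, value in log:
        slot = value
    return slot
```

With the lock held outside the transaction the replayed value matches the live one. With the lock inside,
the live slot ends at 2, but replaying the log in stop order leaves it at 1, i.e. an already assigned ID
would be handed out again.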
This is a problem partially because at the OSD we see absolute values, not logical operations. We see new
objids or we see a new bitmap (in the case of llogs), etc. So what would happen if we started to store
operations instead of values? Say, for object precreation again - we’d introduce an increment operation?
Sometimes we need to reset that value (when we start a new sequence). And even worse - the whole point of
increment is to not store absolute values, but we need absolute values as they have already been returned to
the client and used in LOVEA, etc. And this is the case with very simple logic - just a single value. There
is still llog to deal with. And then we’d need a brand new mechanism to pass these special operations down
through the stack, etc. Hence I tend to think this is way too complicated to even think through all the
details.
If the problem is only with the ordering, then why not solve that problem? If we know the order in which
specific updates were made to the object, then we can replay the updates in that order again. But what if
this order doesn’t match the transactions the updates were made in? The transactions are needed to keep the
filesystem consistent through the replay. Say, we have two transactions T1 and T2 modifying the same object.
T1 got into ZIL before T2, but T2 modified the object first. In the worst case T1 and T2 modified two
objects, but in the reverse order, making them dependent on each other. The TXG mechanism solved this problem
as the TXG was a single commit unit. We’d have to do something similar - start T1, detect the dependency,
put T1 on hold, start T2, apply the updates in the correct order, stop T1 and T2. Doesn’t sound trivial.
What if ZIL got many transactions in between T1 and T2, as we used to run thousands of threads on MDT? Are
they subject to joining the same big transaction with T1 and T2? What if DMU doesn’t let us put all of them
into one transaction due to a TXG commit timeout or a changed pool property resulting in bigger overhead?
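A rough sketch of what the dependency detection could look like, in Python. Everything here is an
assumption for illustration: the log layout, the per-object sequence number `seq` recording the update
order, and the conservative choice to merge every transaction logged between two conflicting ones into a
single replay unit. It only shows how quickly the units can grow; it is not a proposal for the actual
implementation.

```python
def replay_groups(log):
    """Split a ZIL-ordered log into units that must be replayed atomically.

    log: list of (tx_name, [(obj, seq), ...]) in the order transactions
    reached the log; seq is a hypothetical per-object update counter.
    Returns a list of groups (lists of tx names), still in log order.
    """
    n = len(log)
    per_obj = {}
    for pos, (_, updates) in enumerate(log):
        for obj, seq in updates:
            per_obj.setdefault(obj, []).append((seq, pos))

    # reach[i] is the furthest log position that has to be merged with i.
    reach = list(range(n))
    for entries in per_obj.values():
        entries.sort()                     # walk the object in update order
        max_pos = -1
        for _, pos in entries:
            if pos < max_pos:
                # This tx updated the object later but was logged earlier
                # than the tx at max_pos: an ordering inversion.
                reach[pos] = max(reach[pos], max_pos)
            max_pos = max(max_pos, pos)

    # Merge overlapping [pos, reach[pos]] intervals into groups.
    groups, i = [], 0
    while i < n:
        end, j = reach[i], i
        while j <= end:
            end = max(end, reach[j])
            j += 1
        groups.append([log[k][0] for k in range(i, end + 1)])
        i = end + 1
    return groups
```

For a log of T1, T3, T2, T4 where T2 updated object A before T1 did, this yields one unit containing T1,
T3 and T2 (T3 is pulled in only because it sits between them) followed by T4 on its own. With thousands
of threads such units can cover a large part of the log.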
This is where snapshots come in - the only reason for the transaction is to keep the filesystem consistent,
but what if we implement our own commit points using snapshots? Essentially we mimic TXG: take a snapshot,
apply the updates in the order we need, discard the snapshot if all the updates succeeded, roll back to the
snapshot otherwise. If the system crashes during replay, we’ll find the snapshot, roll back to it and
repeat the replay. In this scheme there is zero need to modify Lustre core code; everything (except
optimizations like 8K writes to update a single bit in the llog header) is done within osd-zfs.
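The snapshot-based replay can be sketched as follows, again in Python and again as a toy model. The
Dataset class is an in-memory stand-in, not the ZFS snapshot API; the `crash_at` parameter and the
convention that a value of None marks a failing update are hypothetical and exist only to exercise the
rollback paths.

```python
import copy


class Crash(Exception):
    """Simulated crash in the middle of replay."""


class Dataset:
    """Toy in-memory dataset with a single snapshot that survives a crash."""

    def __init__(self):
        self.objects = {}
        self.snap = None

    def snapshot(self):
        self.snap = copy.deepcopy(self.objects)

    def rollback(self):
        self.objects = copy.deepcopy(self.snap)

    def destroy_snapshot(self):
        self.snap = None


def replay(ds, updates, crash_at=None):
    """Apply (obj, value) updates, already sorted in update order."""
    if ds.snap is not None:
        ds.rollback()            # leftover snapshot: a previous replay was interrupted
    else:
        ds.snapshot()            # our own commit point
    for i, (obj, value) in enumerate(updates):
        if i == crash_at:
            raise Crash()        # snapshot stays behind, as after a real crash
        if value is None:        # hypothetical marker for an update that fails
            ds.rollback()
            return False
        ds.objects[obj] = value
    ds.destroy_snapshot()        # all updates succeeded
    return True
```

A replay that crashes halfway leaves the partially applied state plus the snapshot; calling replay() again
rolls back first and then reapplies everything, so the result is the same as an uninterrupted replay.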
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15542
Subject: LU-4009 osd: internal range locking for read/write/punch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4a47d8aa592d8731082db54df58f00e5fda54164