The transaction behavior between ldiskfs and ZFS is exactly the same today. Multiple RPCs are batched together into a single disk transaction, and are committed to disk every few seconds, or sooner depending on space. Lustre does not make any filesystem modifications until after it has reserved transaction space (in the "declare" phase), and started the transaction handle (which is a refcount on the disk transaction). After the transaction handle is started, all filesystem modifications are atomic and will either be committed together, or lost if the transaction doesn't commit (eg. crash).
The unordered updates within a transaction help improve performance, because they increase concurrency within the filesystem. If we had to hold a huge lock across the whole filesystem for each update, this would hurt performance significantly. Instead, we only hold a lock on each object being modified (eg. an llog file, or the leaf block of a directory) to ensure concurrent changes do not corrupt its data. Since all of the updates in a single transaction will commit together, it doesn't matter if they are slightly unordered wrt. each other, as they will not cross a transaction boundary.
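To illustrate, this is roughly the pattern a single update follows through the OSD (dt_object) API; the exact signatures vary between Lustre versions and error handling is trimmed, so treat it as a sketch rather than buildable code:

#include <dt_object.h>

/* Sketch: reserve space, start the handle, then modify under a per-object
 * lock.  Everything done under this thandle commits (or is lost) together. */
static int update_one_object(const struct lu_env *env, struct dt_device *dev,
                             struct dt_object *obj, const struct lu_buf *buf,
                             loff_t pos)
{
        struct thandle *th;
        int rc;

        th = dt_trans_create(env, dev);
        if (IS_ERR(th))
                return PTR_ERR(th);

        /* "declare" phase: reserve transaction space, no modifications yet */
        rc = dt_declare_record_write(env, obj, buf, pos, th);
        if (rc == 0)
                /* start the handle: a reference on the open disk transaction */
                rc = dt_trans_start(env, dev, th);

        if (rc == 0) {
                /* lock only this object, not the whole filesystem */
                dt_write_lock(env, obj, 0);
                rc = dt_record_write(env, obj, buf, &pos, th);
                dt_write_unlock(env, obj);
        }

        dt_trans_stop(env, dev, th);
        return rc;
}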
As for writethrough of large bulk data to disk, this is already done by ZPL+ZIL usage today, depending on configuration options. Small writes go directly into the ZIL, which is good for Lustre because it can also pack small writes directly into the RPC request (16KB today, up to 64KB with my patch https://review.whamcloud.com/36587). For large writes, the data is written to its actual location on disk to avoid double IO of large amounts of data, which would typically overload the ZIL device.
The large write data is written to newly allocated and unused disk blocks (as is all data in a COW filesystem), and the block pointer is written to the ZIL. If the transaction commits, the ZIL record is dropped and the block pointer is already part of the transaction. If the transaction does not commit, but the ZIL record has been written, the ZIL replay will use the data written to the "free" blocks on disk.
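For reference, the decision between embedding data in the ZIL record and logging just a block pointer is made in zfs_log_write(); paraphrased from memory (the condition details differ between OpenZFS releases), it looks roughly like this:

/* Paraphrase of the write-state choice in OpenZFS zfs_log_write()
 * (module/zfs/zfs_log.c).  Illustrative only; check the source for the
 * exact conditions in a given release. */
static itx_wr_state_t choose_write_state(zilog_t *zilog, ssize_t resid,
                                         int ioflag)
{
        if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
                return WR_INDIRECT;     /* always log just the block pointer */

        if (!spa_has_slogs(zilog->zl_spa) && resid >= zfs_immediate_write_sz)
                return WR_INDIRECT;     /* large write: data goes to its final
                                         * location, lr_blkptr goes to the ZIL */

        if (ioflag & (O_SYNC | O_DSYNC))
                return WR_COPIED;       /* small sync write: data is embedded
                                         * in the log record itself */

        return WR_NEED_COPY;            /* copied into the record lazily, only
                                         * if a later sync forces it out */
}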
Note that the ZIL does not necessarily make all IO faster. The ZIL is only written to disk when there is a sync operation. It also requires the filesystem to track the dependencies of all updates in memory, so that dependent updates are all written to the ZIL and the filesystem is not left in an inconsistent state after a crash and ZIL recovery. This is where the complexity arises for Lustre, if the ZIL records for one RPC are written independently of another (not directly related) RPC.
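As a concrete example of that, the ZPL builds an in-memory itx for each update, and only zil_commit() (called from fsync and friends) pushes the accumulated chain to the log device; a simplified sketch, not the actual code:

/* Simplified sketch of the ZPL pattern: each update creates an in-memory
 * intent-log record (itx) attached to the open txg, and nothing hits the
 * log device until zil_commit() is called for a sync operation. */
static void log_create_and_maybe_sync(zilog_t *zilog, dmu_tx_t *tx,
                                      uint64_t foid, size_t namelen,
                                      boolean_t sync)
{
        itx_t *itx;

        /* build the in-memory record (the name is appended after the body) */
        itx = zil_itx_create(TX_CREATE, sizeof (lr_create_t) + namelen);
        /* ... fill in the lr_create_t body and the appended name ... */

        /* attach it to the open transaction; still memory-only at this point */
        zil_itx_assign(zilog, itx, tx);

        /* only a sync forces the itx chain (and anything it depends on,
         * which is the dependency tracking mentioned above) to the ZIL */
        if (sync)
                zil_commit(zilog, foid);
}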
In some cases, two RPCs are not actually dependent on each other, but may happen to share disk blocks (eg. in the ChangeLog). If we have to write everything that modified the ChangeLog to the ZIL, then every sync will write everything to disk, and we are no further ahead than without the ZIL. My alternate proposal (which nobody is currently working on) is to log the RPCs as "logical" journal records, rather than the "physical" block records that exist today. This would make them incompatible with the ZPL+ZIL records, but should avoid the problems that were seen in the previous patch.
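To make the "logical" vs "physical" distinction concrete, a logical record would describe the operation itself rather than the blocks it touched; something along these lines (purely hypothetical, nothing like this exists in the tree today):

/* Hypothetical layout of a "logical" RPC journal record, for illustration
 * only.  Instead of logging the modified blocks (as ZPL+ZIL records do),
 * it logs which RPC was executed so replay can re-apply the operation. */
struct lustre_logical_rec {
        __u64           llr_transno;    /* Lustre transaction number */
        __u64           llr_xid;        /* client XID, for reply matching */
        __u32           llr_opcode;     /* eg. OST_WRITE, MDS_REINT */
        __u32           llr_flags;
        struct lu_fid   llr_fid;        /* object the RPC operated on */
        __u64           llr_offset;     /* byte offset for write-type ops */
        __u64           llr_count;      /* length of the payload */
        /* followed by the request body, or block pointers to bulk data
         * already written to its final location on disk */
};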
OK, so when you said that writing to the ZIL means very few operations, you were referring to a single sync write, as compared to flushing a whole TXG when doing a sync operation without the ZIL?
Regarding using the ZIL to store RPC records: if the payload is big, it would be written to disk with references to it in the ZIL records, so we still need to reserve space for that. The RPCs themselves just need to be dumped into the ZIL; we don't really care about the order or other details.
The RPC layer has no access to the ZIL, so either it should, or the OSD interface should take care of that. That would mean doing at least some minimal preprocessing of the RPC, only for the ZFS case?