Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.1

    Description

      In order to improve sync performance on ZFS-based OSDs, Lustre must be updated to utilize a ZFS ZIL device. This performance work was originally planned as part of the Lustre/ZFS integration but has not yet been completed. I'm opening this issue to track it.

          Activity

            [LU-4009] Add ZIL support to osd-zfs

            OK, so when you say that writing to the ZIL means very few operations, you are referring to a single sync write, compared to flushing a whole TXG when doing a sync operation without the ZIL?

             

            Regarding using the ZIL to store RPC records: if the payload is big, it would be written to disk with references to it in the ZIL records, so we still need to reserve space for that. The RPCs themselves just need to be dumped into the ZIL; we don't really care about the order or other details.

            The RPC layer has no access to the ZIL, so either it should, or the OSD interface should take care of that. That means at least some minimal preprocessing of the RPC, only for the ZFS case?

             

            degremoa Aurelien Degremont (Inactive) added a comment

            > When the client is doing regular writes for a file, they will go to the current TXG, in memory.
            > Then, the client issues a fsync(fd) for this file. Without ZIL, the whole transaction will be flushed to the zpool.
            > When ZIL is enabled, what will be done by this fsync() call? This is what I don't understand clearly.

            AFAIK (Brian can correct me if I'm wrong), the fsync() (on ZPL, where there is a ZIL) will write out ZIL records for all of the dependent operations related to the file (e.g. parent directory, dnode, file data) and return as soon as the ZIL writes are committed to storage. Depending on how the ZIL is configured, the file data may be written to the ZIL or it may be written directly to the pool. The ZIL writes are independent of the main ZFS TXG, but as soon as the ZFS TXG commits then the ZIL records are irrelevant, and the ZIL is only ever read if the node crashes.

            adilger Andreas Dilger added a comment
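
            To make the two sync paths above concrete, here is a minimal sketch in C. zil_commit() and txg_wait_synced() are existing ZFS interfaces; the wrapper function and its arguments are hypothetical and only illustrate the control flow described in the comment above (syncing one object through the ZIL vs. forcing the whole TXG out without one).

                #include <sys/dmu_objset.h>
                #include <sys/txg.h>
                #include <sys/zil.h>

                /* Hypothetical helper: how an OSD could satisfy a sync request. */
                static void
                my_osd_sync_object(objset_t *os, zilog_t *zilog, uint64_t object)
                {
                    if (zilog != NULL) {
                        /* With a ZIL: write out only the log records for this
                         * object and its dependencies; returns once the log
                         * blocks are on stable storage (one or two IOs). */
                        zil_commit(zilog, object);
                    } else {
                        /* Without a ZIL: force the whole open transaction group
                         * to disk, including uberblock updates on every vdev,
                         * which may take hundreds or thousands of IOs. */
                        txg_wait_synced(dmu_objset_pool(os), 0);
                    }
                }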

            OK, I'm just trying to understand the difference in terms of IOPS when writing with and without the ZIL enabled.

            When the client is doing regular writes for a file, they will go to the current TXG, in memory. Then, the client issues a fsync(fd) for this file. Without ZIL, the whole transaction will be flushed to the zpool.
            When ZIL is enabled, what will be done by this fsync() call? This is what I don't understand clearly. @Brian? @Andreas?

             

            About this new approach with custom ZIL records and storing RPCs, that makes sense AFAICT. Does this need a dedicated ticket?

            degremoa Aurelien Degremont (Inactive) added a comment

            > My alternate proposal (which nobody is currently working on) is to log the RPCs as "logical" journal records, rather than the "physical" block records that exist today.

            This idea makes a lot of sense to me. It would naturally fit with the way the ZIL currently works, which would be ideal. As it turns out, adding new ZIL log records is something we've recently been looking into in order to handle the following proposed changes:

            https://github.com/zfsonlinux/zfs/pull/9414 - renameat(2) flags RENAME_*
            https://github.com/zfsonlinux/zfs/pull/9078 - xattr=sa syncing to ZIL

            There's an existing proposal to add support for new ZIL record types (currently not being worked on).

            When we do a read-write import of the pool, when we walk all the ZILs to claim their blocks, we also check whether there are any unrecognized record types. If so, we fail the import. One downside of this is that the record types are just an enum, so all implementations of ZFS must agree about what each value means.

            Alternatively, we could add a single feature flag now which is always activated (refcount=1), and add a single new ZIL record type. This new record type would specify which other feature flag must be supported in order to import the pool read-write. The other feature flag would be specified by fully-qualified name, so it wouldn't have the downside mentioned above. But it doesn't solve the backwards-compatibility for the first thing that needs it.

            https://github.com/zfsonlinux/zfs/pull/9078#issuecomment-553733379

            This would lay the groundwork for us to be able to register a Lustre-specific ZIL feature flag. Ideally, that could be implemented in such a way that any new Lustre ZIL records are passed to Lustre for processing. This would let us avoid duplicating Lustre-specific logic in the ZFS code. We'd then want to refuse any read-write ZPL mount when there is a ZIL to replay.

            behlendorf Brian Behlendorf added a comment
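
            For illustration, a minimal sketch of what a Lustre-specific log record could look like if such a feature flag existed. The TX_LUSTRE_RPC type, the record layout, and my_log_lustre_rpc() are hypothetical; zil_itx_create() and zil_itx_assign() are the existing ZFS interfaces for queueing in-memory log records against the open TXG (prototypes approximate, check sys/zil.h for the version in use).

                #include <sys/dmu.h>
                #include <sys/zil.h>

                /* Hypothetical record type; would be gated by a ZIL feature flag. */
                #define TX_LUSTRE_RPC   (TX_MAX_TYPE + 1)

                /* Hypothetical on-disk layout: common ZIL header plus the RPC. */
                typedef struct lr_lustre_rpc {
                    lr_t        lr_common;  /* standard ZIL record header */
                    uint64_t    lr_rpc_xid; /* Lustre RPC transfer id */
                    uint64_t    lr_rpc_len; /* length of the embedded request */
                    /* the (<= 64KB) RPC request body follows */
                } lr_lustre_rpc_t;

                /* Queue one RPC as an in-memory itx; it is written out by the
                 * next zil_commit() and discarded once its TXG has synced. */
                static void
                my_log_lustre_rpc(zilog_t *zilog, dmu_tx_t *tx,
                                  const void *rpc, uint64_t len, uint64_t xid)
                {
                    itx_t *itx = zil_itx_create(TX_LUSTRE_RPC,
                                                sizeof (lr_lustre_rpc_t) + len);
                    lr_lustre_rpc_t *lr = (lr_lustre_rpc_t *)&itx->itx_lr;

                    lr->lr_rpc_xid = xid;
                    lr->lr_rpc_len = len;
                    memcpy(lr + 1, rpc, len);       /* copy the payload into the record */
                    zil_itx_assign(zilog, itx, tx); /* tie the record to the open TXG */
                }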

            Definitely the ZIL makes the most sense when it is on higher-IOPS storage than the main pool, but there are still potential benefits even with spinning disk. The ZIL allows fast commit of sync operations without the need to flush the whole TXG to disk and write uberblocks to all the disks (4 per disk, 2 at each end of the disk). A TXG commit might take hundreds or thousands of IOPS to finish. A sync write to the ZIL may take one or two IOPS.

            For ldiskfs, while the metadata is written to the journal, most of the data is not. This avoids the slowdown of writing the data twice, and being throttled by the bandwidth of the journal device. The same is true of the ZIL - it does not write large IOs to the ZIL, but rather to the final location on disk, and then saves the block pointer into the ZIL record. Even with data going to the filesystem instead of the ZIL/jbd this only adds one or two IOPS to finish the write.

            In theory, JBD2 could be optimized the same way - to allow small sync writes to go to the journal, and large writes to go to disk (as they do today), but that has never been implemented.

            You are likely correct about ZIL records being logical records vs. physical blocks. My main point is that the format of the ZIL records would be different from regular ZPL ZIL records. This means that we would need to implement some code in ZPL for recovery of Lustre RPC records, so that mounting a dataset with ZPL doesn't cause problems.

            adilger Andreas Dilger added a comment
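
            The "small writes into the log, large writes to their final location" behaviour described above is roughly how the ZPL write-logging path already decides things. Below is a simplified sketch of that policy; the real logic lives in zfs_log_write() and also considers logbias and whether a separate log device is present, and my_choose_wr_state() with its threshold argument is a stand-in.

                #include <sys/zil.h>

                /* Sketch only: pick where sync-write data should go. */
                static itx_wr_state_t
                my_choose_wr_state(uint64_t len, uint64_t immediate_write_sz,
                                   boolean_t sync)
                {
                    if (len > immediate_write_sz)
                        return (WR_INDIRECT);   /* data goes to its final on-disk
                                                 * location; the ZIL record only
                                                 * carries a block pointer */
                    if (sync)
                        return (WR_COPIED);     /* small sync write: data is copied
                                                 * directly into the ZIL record */
                    return (WR_NEED_COPY);      /* copy later, only if zil_commit()
                                                 * actually has to write it out */
                }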

            Looks like I'm a bit slow minded today.

            I'm trying to understand the benefits of using the ZIL when the ZIL device is the same as the DMU device, especially under a heavy sync() workload (when the ZIL is on a fast, high-IOPS device and the DMU on a slow one, the benefits are obvious).

            AFAIU, ldiskfs relies on JBD to support transactions. Data is written to the journal, and the transaction commit flags the JBD transaction as committed. It will be replayed on remount if the system crashes, and discarded if the transaction was not fully written. Most (all?) I/Os go through the journal.

            To implement transactions, does ZFS store the transaction in memory and, when writing it out, allocate new blocks for each block to be written/modified, relying on the final block write (uberblock?) to "commit" the transaction on disk? (I don't know how the needlessly written blocks would be reclaimed on restart if the server crashes, though...)

            So, I'm trying to understand what the difference is when a sync() is handled by a target, with and without the ZIL.

            Without the ZIL, the current transaction will be flushed from memory to disk. Do other RPCs wait for this transaction to be acknowledged by the disk before a new one is created and written to?

            With the ZIL, well... the transaction is flushed from memory to the ZIL, and the sync is acknowledged as soon as the transaction is in the ZIL.

            The benefit comes from the fact that it is faster to write to the ZIL than to the device itself? This is where I'm probably missing something...

            > ... is to log the RPCs as "logical" journal records, rather than the "physical" block records that exist today. This would make them incompatible with the ZPL+ZIL records

            According to this OpenZFS document, the actual operation is logged into ZIL, not the modified blocks (http://open-zfs.org/w/images/c/c8/10-ZIL_performance.pdf)

             

             

            degremoa Aurelien Degremont (Inactive) added a comment

            The transaction behavior between ldiskfs and ZFS is exactly the same today. Multiple RPCs are batched together into a single disk transaction, and are committed to disk every few seconds, or sooner depending on space. Lustre does not make any filesystem modifications until after it has reserved transaction space (in the "declare" phase), and started the transaction handle (which is a refcount on the disk transaction). After the transaction handle is started, all filesystem modifications are atomic and will either be committed together, or lost if the transaction doesn't commit (eg. crash).

            The unordered updates within a transaction help improve performance, because they increase concurrency within the filesystem. If we had to hold a huge lock across the whole filesystem for each update, this would hurt performance significantly. Instead, we only hold a lock for each object (eg. llog file, or leaf block of a directory being modified) to ensure the update does not corrupt the data from concurrent changes. Since all of the updates related in a single transaction will commit together, it doesn't matter if they are slightly unordered wrt. each other, as they will not cross a transaction boundary.

            As for writethrough of large bulk data to disk, this is already done by ZPL+ZIL usage today, depending on configuration options. Small writes will go directly to the ZIL, which is good for Lustre because it can also pack small writes directly into the RPC request (16KB today, up to 64KB with my patch https://review.whamcloud.com/36587). For large writes, the data is written to the actual location on disk to avoid double IO of large amounts of data, which would typically overload the ZIL device.

            The large write data is written to newly allocated and unused disk blocks (as is all data in a COW filesystem), and the block pointer is written to the ZIL. If the transaction commits, the ZIL record is dropped and the block pointer is already part of the transaction. If the transaction does not commit, but the ZIL record has been written, the ZIL replay will use the data written to the "free" blocks on disk.

            Note that ZIL does not necessarily cause all IO to be faster. The ZIL is only written to disk when there is a sync operation. This also requires the filesystem to track the dependency of all updates in memory, so that dependent updates are all written to ZIL, and the filesystem is not left in an inconsistent state after a crash and ZIL recovery. This is where the complexity arises in Lustre if the ZIL records for one RPC are written independently from another (not directly related) RPC.

            In some cases, two RPCs are not actually dependent on each other, but may happen to share disk blocks (eg. in the changelog). If we have to write everything that modified the ChangeLog to the ZIL, then every sync will write everything to disk, and we are not further ahead than without the ZIL. My alternate proposal (which nobody is currently working on) is to log the RPCs as "logical" journal records, rather than the "physical" block records that exist today. This would make them incompatible with the ZPL+ZIL records, but should avoid the problems that were seen in the previous patch.

            adilger Andreas Dilger added a comment
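
            The declare/start/stop pattern referred to above is the OSD API contract. Below is a minimal sketch of one update under those rules, using the generic Lustre dt-object interfaces; the prototypes are approximate and the error handling simplified, so check dt_object.h in the tree in use.

                #include <dt_object.h>   /* Lustre dt-object/OSD API */

                static int
                my_write_record(const struct lu_env *env, struct dt_device *dev,
                                struct dt_object *obj, const struct lu_buf *buf,
                                loff_t pos)
                {
                    struct thandle *th;
                    int rc;

                    th = dt_trans_create(env, dev);         /* allocate a handle */
                    if (IS_ERR(th))
                        return PTR_ERR(th);

                    /* declare phase: reserve transaction space before modifying */
                    rc = dt_declare_record_write(env, obj, buf, pos, th);
                    if (rc == 0) {
                        rc = dt_trans_start(env, dev, th);  /* join the open TXG */
                        if (rc == 0)
                            /* everything between start and stop commits, or is
                             * lost, together with the enclosing disk transaction */
                            rc = dt_record_write(env, obj, buf, &pos, th);
                    }
                    dt_trans_stop(env, dev, th);            /* drop the handle */
                    return rc;
                }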

            > Since all of these updates are done within a single filesystem transaction (TXG for ZFS), there is no problem if they are applied to the on-disk data structures in slightly different orders because they will either commit to disk or be lost as a single unit.

            > With ZIL, each RPC update needs to be atomic across all of the files/directories/logs that are modified, which caused a number of problems with the implementation in ZFS and the Lustre code.

            I'm trying to understand how this works. How does this compare to ldiskfs? Does ldiskfs start a different transaction for each RPC?

            Why are updates done within a single transaction for ZFS? Does this mean that, in the normal situation, the transaction is committed to disk every 5 seconds, and a new one is created every 5 seconds?

            > That avoids the issues with the unordered updates to disk.

            Do you mean that unordered updates hurt performance, or that they don't fit the transaction model above?

            > For large bulk IOs the data would still be "writethrough" to the disk blocks, and the RPC would use the data from the filesystem rather than doing another bulk transfer from the client (since ZIL RPCs would be considered "sync" and the client may not preserve the data in its memory).

            How could you roll back the transaction if the I/Os are "writethrough" (do you mean they skip the ZIL)?

            I don't see how reading from or writing to disk will give you proper crash handling. How are you taking care of sync() calls, which hurt ZFS performance a lot (because we have only one open transaction)?

             

            If bulk transfers are not written to the ZIL, I see less interest in this feature. I thought the whole point was that writing to a ZIL log is faster than writing the same data through the DMU.

            Are ZFS transactions atomic, right now?

             

            Thanks for taking the time to explain.

            degremoa Aurelien Degremont (Inactive) added a comment
            adilger Andreas Dilger added a comment - edited

            Aurelien,
            the previous development for ZIL integration was too complex to land and maintain. The main problem was that while Lustre locks individual data structures during updates (e.g. each log file, directory, etc), it does not hold a single global lock across the whole filesystem update since that would cause a lot of contention. Since all of these updates are done within a single filesystem transaction (TXG for ZFS), there is no problem if they are applied to the on-disk data structures in slightly different orders because they will either commit to disk or be lost as a single unit.

            With ZIL, each RPC update needs to be atomic across all of the files/directories/logs that are modified, which caused a number of problems with the implementation in ZFS and the Lustre code.

            One idea that I had for a better approach for Lustre (though incompatible with the current ZPL ZIL usage) is, instead of trying to log the disk blocks that are being modified, to log the whole RPC request to the ZIL (at most 64KB of data). If the server doesn't crash, the RPC would be discarded from the ZIL on TXG commit. If the server does crash, the recovery steps would be to re-process the RPCs in the ZIL at the Lustre level to regenerate the filesystem changes. That avoids the issues with the unordered updates to disk. For large bulk IOs the data would still be "writethrough" to the disk blocks, and the RPC would use the data from the filesystem rather than doing another bulk transfer from the client (since ZIL RPCs would be considered "sync" and the client may not preserve the data in its memory).

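            For illustration only, the lifecycle of such a logical record might look like the sketch below. Every helper here is hypothetical; the code only restates the flow described above (apply the RPC inside the open TXG, log the request itself for sync operations, drop the log record on TXG commit, and re-process saved requests at the Lustre level after a crash).

                struct my_osd;              /* hypothetical OSD device state */
                struct my_rpc;              /* the (<= 64KB) request as received */

                /* hypothetical helpers, declared only to keep the sketch self-contained */
                static int  my_apply_rpc_to_objects(struct my_osd *osd, struct my_rpc *req);
                static int  my_rpc_is_sync(struct my_rpc *req);
                static void my_zil_log_rpc(struct my_osd *osd, struct my_rpc *req);

                /* Normal path: apply the RPC inside the open TXG; for sync requests,
                 * append the request itself to the ZIL rather than the modified blocks.
                 * Bulk data is still written through to its final disk blocks. */
                static int
                my_handle_update_rpc(struct my_osd *osd, struct my_rpc *req)
                {
                    int rc = my_apply_rpc_to_objects(osd, req);

                    if (rc == 0 && my_rpc_is_sync(req))
                        my_zil_log_rpc(osd, req);
                    return rc;
                    /* once the TXG commits, the logged request is irrelevant
                     * and its ZIL records are discarded */
                }

                /* Crash recovery: re-process, at the Lustre level, each request the
                 * last committed TXG did not cover; bulk data is re-read from its
                 * on-disk blocks instead of being re-transferred from the client. */
                static int
                my_replay_logged_rpc(struct my_osd *osd, struct my_rpc *saved)
                {
                    return my_apply_rpc_to_objects(osd, saved);
                }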

            I'm wondering if there is any news on this ticket?

            If development has stopped, I'm curious to know what the reason was. Thanks

             

            degremoa Aurelien Degremont (Inactive) added a comment

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: behlendorf Brian Behlendorf
              Votes: 3
              Watchers: 31
