Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Critical
    • None
    • Lustre 2.4.1

    Description

      In order to improve sync performance on ZFS-based OSDs, Lustre must be updated to utilize a ZFS ZIL device. This performance work was originally planned as part of the Lustre/ZFS integration but has not yet been completed. I'm opening this issue to track it.
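
      As a rough illustration of the intent (a sketch only, not code from this ticket; it assumes the standard OpenZFS interfaces and the osd-zfs objset handle, and the function names are hypothetical): today a sync has to wait for the whole open transaction group, whereas with ZIL support only the pending intent-log records would need to be made stable, ideally on a dedicated log (slog) device:

      #include <sys/dmu_objset.h>
      #include <sys/txg.h>
      #include <sys/zil.h>

      /* Without ZIL: every sync waits for the entire open TXG to reach disk. */
      static void osd_sync_without_zil(objset_t *os)
      {
              txg_wait_synced(dmu_objset_pool(os), 0ULL);
      }

      /* With ZIL: only the pending intent-log records are flushed, which is
       * what would make small, frequent syncs cheap on a ZIL/slog device. */
      static void osd_sync_with_zil(objset_t *os)
      {
              zil_commit(dmu_objset_zil(os), 0);
      }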

            Activity

            [LU-4009] Add ZIL support to osd-zfs

            degremoa Aurelien Degremont (Inactive) added a comment -

            > Since all of these updates are done within a single filesystem transaction (TXG for ZFS), there is no problem if they are applied to the on-disk data structures in slightly different orders because they will either commit to disk or be lost as a single unit.

            > With ZIL, each RPC update needs to be atomic across all of the files/directories/logs that are modified, which caused a number of problems with the implementation in ZFS and the Lustre code.

            I'm trying to understand how this works. How does this compare to ldiskfs? Does ldiskfs start a different transaction for each RPC?

            Why are all updates done within a single transaction for ZFS? Does this mean that in the normal situation the transaction is committed to disk every 5 seconds, and a new one is created every 5 seconds?

            > That avoids the issues with the unordered updates to disk.

            Do you mean that unordered updates hurt performance, or that they don't fit the transaction model above?

            > For large bulk IOs the data would still be "writethrough" to the disk blocks, and the RPC would use the data from the filesystem rather than doing another bulk transfer from the client (since ZIL RPCs would be considered "sync" and the client may not preserve the data in its memory).

            How could you roll back the transaction if the I/Os are "writethrough" (do you mean they skip the ZIL)?

            I don't see how reading from or writing to disk will give you proper crash handling. How are you taking care of sync() calls, which hurt ZFS performance a lot (because we have only one open transaction)?

            If bulk transfers are not written to the ZIL, I see less value in this feature. I thought the whole point was that writing to the ZIL log is faster than writing the same data through the DMU.

            Are ZFS transactions atomic right now?

            Thanks for taking the time to explain.

            adilger Andreas Dilger added a comment - edited

            Aurelien,
            the previous development for ZIL integration was too complex to land and maintain. The main problem was that while Lustre locks individual data structures during updates (e.g. each log file, directory, etc), it does not hold a single global lock across the whole filesystem update since that would cause a lot of contention. Since all of these updates are done within a single filesystem transaction (TXG for ZFS), there is no problem if they are applied to the on-disk data structures in slightly different orders because they will either commit to disk or be lost as a single unit.

            With ZIL, each RPC update needs to be atomic across all of the files/directories/logs that are modified, which caused a number of problems with the implementation in ZFS and the Lustre code.

            One idea that I had for a better approach for Lustre (though incompatible with the current ZPL ZIL usage) is, instead of trying to log the disk blocks that are being modified to the ZIL, to log the whole RPC request to the ZIL (at most 64KB of data). If the server doesn't crash, the RPC would be discarded from the ZIL on TXG commit. If the server does crash, the recovery steps would be to re-process the RPCs in the ZIL at the Lustre level to regenerate the filesystem changes. That avoids the issues with the unordered updates to disk. For large bulk IOs the data would still be "writethrough" to the disk blocks, and the RPC would use the data from the filesystem rather than doing another bulk transfer from the client (since ZIL RPCs would be considered "sync" and the client may not preserve the data in its memory).
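
            A minimal sketch of what "log the whole RPC to the ZIL" could look like, assuming the standard OpenZFS itx interfaces (zil_itx_create, zil_itx_assign, zil_commit); the TX_OSD_RPC record type and the osd_zil_* helpers are hypothetical names, not the actual implementation:

            #include <sys/dmu.h>
            #include <sys/zil.h>

            /* Hypothetical new intent-log record type meaning "replay this
             * buffer as a Lustre RPC during recovery". */
            #define TX_OSD_RPC 64

            /* Queue the raw RPC buffer (at most ~64KB) as a single log record
             * tied to the currently open TXG.  Once that TXG commits, the
             * record becomes obsolete and the ZIL discards it; after a crash
             * it would be handed back to Lustre for replay instead of being
             * applied block-by-block. */
            static void osd_zil_log_rpc(zilog_t *zilog, dmu_tx_t *tx,
                                        const void *rpc, size_t len)
            {
                    itx_t *itx;

                    itx = zil_itx_create(TX_OSD_RPC, sizeof(lr_t) + len);
                    memcpy((char *)&itx->itx_lr + sizeof(lr_t), rpc, len);
                    zil_itx_assign(zilog, itx, tx);
            }

            /* A "sync" RPC then only waits for its log records to be stable,
             * not for the whole TXG. */
            static void osd_zil_commit(zilog_t *zilog)
            {
                    zil_commit(zilog, 0);
            }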


            degremoa Aurelien Degremont (Inactive) added a comment -

            I'm wondering if there is any news on this ticket?

            If the development has stopped, I'm curious to know what the reason was. Thanks!

            chunteraa Chris Hunter (Inactive) added a comment -

            I think this project is related to the LLNL Lustre branch on GitHub.

            simmonsja James A Simmons added a comment -

            What is the status of this work?

            gerrit Gerrit Updater added a comment -

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15542
            Subject: LU-4009 osd: internal range locking for read/write/punch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4a47d8aa592d8731082db54df58f00e5fda54164


            gerrit Gerrit Updater added a comment -

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15496
            Subject: LU-4009 osp: batch cancels
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6cd411afd2bcfe585ff29aa859df055ed14ee2fa


            gerrit Gerrit Updater added a comment -

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15394
            Subject: LU-4009 osd: enable dt_index_try() on a non-existing object
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5ba67e0e4d7a25b0f37e0756841073e29a825479


            gerrit Gerrit Updater added a comment -

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15393
            Subject: LU-4009 osd: be able to remount objset w/o osd restart
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c484f0aa6e09c533635cde51d3884f746b25cfd5


            bzzz Alex Zhuravlev added a comment -

            Now again with the improvements to OUT packing plus cancel aggregation in OSP
            (it collects a bunch of cookies, then cancels them with a single llog write - a huge contribution to the average record size):

            with ZIL:
            Flush 29737 3.880 181.323
            Throughput 22.224 MB/sec 2 clients 2 procs max_latency=181.332 ms

            no ZIL:
            Flush 13605 54.994 327.793
            Throughput 10.1488 MB/sec 2 clients 2 procs max_latency=327.804 ms

            ZIL on MDT:
            zil-sync 35517 samples [usec] 0 77549 36271952
            zil-records 670175 samples [bytes] 32 9272 272855776
            zil-realloc 808 samples [realloc] 1 1 808

            ZIL on OST:
            zil-sync 35517 samples [usec] 0 123200 55843200
            zil-copied 455241 samples [writes] 1 1 455241
            zil-indirect 1663 samples [writes] 1 1 1663
            zil-records 785376 samples [bytes] 32 4288 1968476864

            the improvements shrink the average ZIL record size by 73% (272855776 / 670175 ≈ 407 bytes per MDT record now, vs. 1541 bytes in the previous run) and
            the average sync time by 23% (3.88 ms vs. 5.03 ms average Flush latency).

            of course, this is subject to a rerun on regular hardware.
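
            For illustration, the cookie-cancel aggregation mentioned at the top of this comment might look roughly like the sketch below; the names and the flush threshold are made up for the example, this is not the code from the patch:

            #include <string.h>

            /* Stand-in for Lustre's struct llog_cookie; the size is
             * illustrative, only the "copy it into a batch" idea matters. */
            struct llog_cookie {
                    unsigned char lgc_bytes[32];
            };

            #define OCB_MAX_COOKIES 64  /* flush threshold, illustrative */

            /* Cookies are collected here instead of each one triggering its
             * own small llog write (and, with ZIL, its own log record). */
            struct osp_cancel_batch {
                    struct llog_cookie ocb_cookies[OCB_MAX_COOKIES];
                    int                ocb_count;
            };

            /* Returns 1 when the batch should be flushed: the caller then
             * cancels all collected cookies with a single llog write, which
             * is what raises the average record size and cuts sync time. */
            static int osp_cancel_add(struct osp_cancel_batch *b,
                                      const struct llog_cookie *ck)
            {
                    memcpy(&b->ocb_cookies[b->ocb_count++], ck, sizeof(*ck));
                    return b->ocb_count == OCB_MAX_COOKIES;
            }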


            bzzz Alex Zhuravlev added a comment -

            In the latest patch I removed the optimizations to the packing mechanism to make the patch smaller;
            now the benchmarks again (made on a local node where all the targets share the same storage):

            with ZIL:
            Flush 26601 5.030 152.794
            Throughput 19.8412 MB/sec 2 clients 2 procs max_latency=152.803 ms

            no ZIL:
            Flush 12716 59.609 302.120
            Throughput 9.48099 MB/sec 2 clients 2 procs max_latency=302.140 ms

            ZIL on MDT:
            zil-sync 31825 samples [usec] 0 99723 50668754
            zil-records 692259 samples [bytes] 40 9656 1066813288

            ZIL on OST:
            zil-sync 31825 samples [usec] 2 129809 66437030
            zil-copied 405907 samples [writes] 1 1 405907
            zil-indirect 1698 samples [writes] 1 1 1698
            zil-records 701379 samples [bytes] 40 4288 1799673720

            on MDT an average record was 1066813288/692259=1541 bytes
            on OST it was 1799673720/701379=2565 bytes
            the last one is a bit surprising; I'm going to check the details.


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: behlendorf Brian Behlendorf
              Votes: 3
              Watchers: 31
