Details

    • 3
    • Orion
    • 2992

    Description

      While running some single shared file IOR tests I noticed, with 'zpool iostat', that there were a lot of writes going on during the read phase. This was absolutely destroying performance and it wasn't at all clear to me why there should be any writes at all. IOR is just reading during this phase, no writes.

      On a hunch I set 'sync_on_lock_cancel=never' and the writes stopped entirely and performance improved at least 6x. I haven't looked at the code yet but I think there might be a few issues here.

      1) Why were so many lock revocations occurring during the read phase? I would have thought that each client would end up with a single concurrent read lock rather quickly. However, this behavior persisted for the 30 minutes it took to read all the data back.

      2) When revoking a read lock, or even a write lock which isn't actually covering any dirty data, there's no reason to issue the sync. I suspect this was never noticed because under ldiskfs, if there's no dirty data to sync, performing the sync is basically free. Under ZFS, however, calling sync forces the current txg to be written out. That will result in, at a minimum, the vdev labels at the beginning and end of each disk being written. Lots of syncs will result in each disk head basically seeking from the beginning to the end of the disk repeatedly. Very very bad.
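
      A minimal standalone sketch of the guard being suggested above (the type and
      function names here are hypothetical stand-ins, not the real Lustre OFD/OSD
      symbols): only issue the sync on lock cancellation when the cancelled lock is
      a write lock that actually covers dirty data.

      /* Illustration only -- hypothetical names, not the actual Lustre code. */
      #include <stdbool.h>
      #include <stdio.h>

      enum lock_mode { LOCK_PR, LOCK_PW };    /* protected read / protected write */

      struct extent_lock {
              enum lock_mode mode;
              unsigned long long start;       /* byte range covered by the lock */
              unsigned long long end;
      };

      /* Stand-in for "does this object have dirty data inside the locked range?" */
      static bool covers_dirty_data(const struct extent_lock *lock)
      {
              (void)lock;
              return false;                   /* a pure read workload leaves nothing dirty */
      }

      /* Stand-in for the expensive sync: under ZFS this forces a txg commit,
       * which rewrites the vdev labels at both ends of every disk. */
      static void backend_sync(void)
      {
              printf("sync issued: txg commit, vdev labels rewritten\n");
      }

      /* Proposed guard: PR locks, and PW locks covering no dirty data, skip the sync. */
      static void on_lock_cancel(const struct extent_lock *lock)
      {
              if (lock->mode != LOCK_PW || !covers_dirty_data(lock)) {
                      printf("no dirty data under this lock, sync skipped\n");
                      return;
              }
              backend_sync();
      }

      int main(void)
      {
              struct extent_lock pr_lock = { LOCK_PR, 0, ~0ULL };

              on_lock_cancel(&pr_lock);       /* a read lock never needs the sync */
              return 0;
      }

      Only the condition in on_lock_cancel() matters here; the rest is scaffolding so
      the sketch compiles and runs on its own.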

      Attachments

        1. 57chaos-always.tar.bz2
          4.56 MB
        2. 58chaos-always.tar.bz2
          4.51 MB
        3. 58chaos-never.tar.bz2
          4.50 MB
        4. ori642.tar.bz2
          7.08 MB

        Issue Links

          Activity

            [LU-2078] sync_on_lock_cancel too aggressive

            Patch landed.

            ian Ian Colle (Inactive) added a comment
            morrone Christopher Morrone (Inactive) added a comment - Patch landed on master: http://review.whamcloud.com/4117

            I went ahead and ran the same test with `sync_on_lock_cancel=never` as well. Here's the logs and performance numbers.

            Single Shared File Read Performance (MiB/s)
            
                    |  always |  never  |
                  --+---------+---------+
            58chaos | 1311.51 | 1265.46 |
                  --+---------+---------+
            
            prakash Prakash Surya (Inactive) added a comment

            Same logs as before, except with our 58chaos tag (reverted debug patch-set 1 and applied debug patch-set 2 as requested).

            With this tag, I saw the best performance numbers of all my tests run so far (roughly equal to the sync_on_lock_cancel=never tests):

            Single Shared File Read Performance (MiB/s)
            
                    |  always |   never  |
                  --+---------+----------+
            58chaos | 1311.51 | untested |
                  --+---------+----------+
            

            So my question is, can we safely drop the write lock as is done in the debug patch? Or is this just a proof of concept?

            Also, what exactly is that lock protecting, and why?

            prakash Prakash Surya (Inactive) added a comment

            Thanks. Here is another patch: http://review.whamcloud.com/#change,3456 Hopefully this time we'll get some improvement. Again, a log with +dlmtrace +rpctrace would be very welcome. Thanks in advance.

            bzzz Alex Zhuravlev added a comment

            This has the IOR output from the read and write phases of the test, along with Lustre logs from the OSS with +dlmtrace and +rpctrace.

            prakash Prakash Surya (Inactive) added a comment

            Thanks for the data, Prakash. I'm still looking at the dlmtrace; it's still not clear why we don't reach
            peak performance: the very first PR lock should commit the whole object and all subsequent reads
            should skip osd_sync(). So here is a debug patch: http://review.whamcloud.com/#change,3456
            Please try it; just the current orion branch plus this patch with sync_on_cancel=always should be enough.
            A log with +dlmtrace +rpctrace would be very useful. Thanks in advance.
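
            For what it's worth, a rough standalone model of this "commit once, then skip"
            expectation (the names below are hypothetical, not the real osd-zfs symbols):
            track, per object, the newest txg already committed, so the first PR lock
            cancellation pays for the commit and later ones return without syncing at all.

            /* Model only -- hypothetical names, not the actual osd-zfs code. */
            #include <stdbool.h>
            #include <stdio.h>

            struct object_state {
                    unsigned long long last_written_txg;  /* newest txg holding data for this object */
                    unsigned long long last_synced_txg;   /* newest txg already committed to disk */
            };

            static bool needs_sync(const struct object_state *obj)
            {
                    return obj->last_written_txg > obj->last_synced_txg;
            }

            static void osd_sync_model(struct object_state *obj)
            {
                    printf("committing txg %llu\n", obj->last_written_txg);
                    obj->last_synced_txg = obj->last_written_txg;
            }

            static void on_pr_lock_cancel(struct object_state *obj)
            {
                    if (!needs_sync(obj)) {
                            printf("object already committed, sync skipped\n");
                            return;
                    }
                    osd_sync_model(obj);
            }

            int main(void)
            {
                    struct object_state obj = { .last_written_txg = 42, .last_synced_txg = 0 };

                    on_pr_lock_cancel(&obj);  /* first cancellation commits the object */
                    on_pr_lock_cancel(&obj);  /* subsequent cancellations skip the sync */
                    return 0;
            }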

            bzzz Alex Zhuravlev added a comment - edited

            IOR output for the read and write phases of each test, and MDS and OSS server logs for the read phase of each test with '+dlmtrace' enabled.

            prakash Prakash Surya (Inactive) added a comment

            Alex, I was able to run a few single shared file tests today. Here's a summary of my results.

            I ran 4 separate tests. Two tests with our 55chaos tag (without http://review.whamcloud.com/3355), one with sync_on_lock_cancel=always and another with sync_on_lock_cancel=never. Also, two tests were run with our 56chaos tag (with http://review.whamcloud.com/3355), again one with "always" and another with "never".

            Also, each test was run using 64 client nodes, 1 task per node, striped over a single OST, using these IOR options "-a POSIX -CegkY".

            Single Shared File Read Performance (MiB/s)
            
                     |  always |   never |
                   --+---------+---------+
             55chaos |  469.19 | 1304.03 |
                   --+---------+---------+
             56chaos |  988.26 | 1275.59 |
                   --+---------+---------+
            

            So just from the above data, your patch seems to provide a 2x improvement for reads!

            Also, I'll attach a tarball with the IOR output and logs from the MDS and OSS with '+dlmtrace' enabled during each test's read phase.

            prakash Prakash Surya (Inactive) added a comment

            Brian, do you remember any details of the test? How many stripes? Arguments to IOR, etc.? Thanks in advance.

            bzzz Alex Zhuravlev added a comment

            People

              bzzz Alex Zhuravlev
              behlendorf Brian Behlendorf
              Votes:
              0
              Watchers:
              4

              Dates

                Created:
                Updated:
                Resolved: