Details

    • 3
    • Orion
    • 2992

    Description

      While running some single-shared-file IOR tests I noticed, with 'zpool iostat', that there were a lot of writes going on during the read phase. This was absolutely destroying performance, and it wasn't at all clear to me why there should be any writes at all. IOR is only reading during this phase, no writes.

      On a hunch I set 'sync_on_lock_cancel=never', the writes stopped entirely, and performance improved by at least 6x. I haven't looked at the code yet, but I think there might be a few issues here.

      1) Why were so many lock revocations occurring during the read phase? I would have thought that each client would end up with a single concurrent read lock rather quickly. However, this behavior persisted for the 30 minutes it took to read all the data back.

      2) When revoking a read lock, or even a write lock which isn't actually covering any dirty data, there's no reason to issue the sync. I suspect this was never noticed because under ldiskfs, if there's no dirty data to sync, performing the sync is basically free. Under ZFS, however, calling sync forces the current txg to be written out, which at a minimum causes the vdev labels at the beginning and end of each disk to be written. Lots of syncs therefore result in each disk head seeking from the beginning to the end of the disk repeatedly. Very, very bad.
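
      For reference, a rough sketch of how to observe the behavior and toggle the tunable is below. The parameter prefix (obdfilter.*) and the one-second iostat interval are assumptions here, not values taken from this ticket; on an Orion/osd-zfs OSS the parameter may live under a different prefix, so check what 'lctl get_param' actually reports.

          # On the OSS, watch per-vdev I/O while the clients are in the IOR read phase.
          # With sync_on_lock_cancel=always a steady stream of writes shows up here
          # even though the clients are only reading.
          zpool iostat -v 1

          # Inspect and toggle the tunable (parameter prefix is an assumption, see above).
          lctl get_param obdfilter.*.sync_on_lock_cancel
          lctl set_param obdfilter.*.sync_on_lock_cancel=never   # valid values: never, blocking, always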

      Attachments

        1. 57chaos-always.tar.bz2
          4.56 MB
        2. 58chaos-always.tar.bz2
          4.51 MB
        3. 58chaos-never.tar.bz2
          4.50 MB
        4. ori642.tar.bz2
          7.08 MB


          Activity

            [LU-2078] sync_on_lock_cancel too aggressive

            Patch landed.

            ian Ian Colle (Inactive) added a comment
            morrone Christopher Morrone (Inactive) added a comment - Patch landed on master: http://review.whamcloud.com/4117

            I went ahead and ran the same test with `sync_on_lock_cancel=never` as well. Here are the logs and performance numbers.

            Single Shared File Read Performance (MiB/s)
            
                    |  always |  never  |
                  --+---------+---------+
            58chaos | 1311.51 | 1265.46 |
                  --+---------+---------+
            
            prakash Prakash Surya (Inactive) added a comment
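
            For anyone trying to reproduce these numbers: the exact IOR job parameters are not recorded in this ticket, so the invocation below is only a hypothetical single-shared-file example (task count, block size, transfer size, and output path are all made up). The relevant detail is simply omitting -F, so that all tasks share one file.

                # Hypothetical single-shared-file IOR run: write phase followed by read phase.
                # No -F flag, so all 64 tasks write to and read back the same shared file.
                mpirun -np 64 ior -a POSIX -w -r -C -e \
                    -b 4g -t 1m -o /mnt/lustre/ior_ssf_testfile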

            Same logs as before, except with our 58chaos tag (reverted debug patch-set 1 and applied debug patch-set 2 as requested).

            With this tag, I saw the best performance numbers of all the tests I've run so far (roughly equal to the sync_on_lock_cancel=never tests):

            Single Shared File Read Performance (MiB/s)
            
                    |  always |   never  |
                  --+---------+----------+
            58chaos | 1311.51 | untested |
                  --+---------+----------+
            

            So my question is, can we safely drop the write lock as is done in the debug patch? Or is this just a proof of concept?

            Also, what exactly is that lock protecting, and why?

            prakash Prakash Surya (Inactive) added a comment

            Thanks.. here is another patch: http://review.whamcloud.com/#change,3456. Hope this time we'll get some improvement. Again, a log with +dlmtrace +rpctrace is very welcome. Thanks in advance.

            bzzz Alex Zhuravlev added a comment

            This has the IOR output from the read and write phases of the test, along with Lustre logs from the OSS with +dlmtrace and +rpctrace enabled.

            prakash Prakash Surya (Inactive) added a comment
            bzzz Alex Zhuravlev added a comment - edited

            Thanks for the data, Prakash. I'm still looking at the dlmtrace.. it's still not clear why we don't reach
            peak performance: the very first PR lock should commit the whole object, and all subsequent reads
            should skip osd_sync(). So here is a debug patch: http://review.whamcloud.com/#change,3456
            Please try just current Orion plus this patch with sync_on_lock_cancel=always; that should be enough.
            A log with +dlmtrace +rpctrace would be very useful. Thanks in advance.

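            For completeness, here is one way to gather the kind of logs requested above. The debug buffer size and output path are arbitrary choices, not values taken from this ticket.

                # On the MDS/OSS: enable the requested debug flags and clear the old buffer.
                lctl set_param debug="+dlmtrace +rpctrace"
                lctl set_param debug_mb=512          # arbitrary buffer size
                lctl clear

                # ... run the IOR read phase ...

                # Dump the accumulated debug log to a file for attaching to the ticket.
                lctl dk /tmp/oss-dlmtrace-rpctrace.log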

            IOR output for the read and write phases of each test, plus MDS and OSS server logs for the read phase of each test with '+dlmtrace' enabled.

            prakash Prakash Surya (Inactive) added a comment

            People

              bzzz Alex Zhuravlev
              behlendorf Brian Behlendorf
              Votes: 0
              Watchers: 4
