Details

    Description

      Perform short I/O (requests <= 4 KB) without a bulk RPC transfer.
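
      Roughly, the idea is that a read or write small enough to fit in the request message is
      carried inline in the RPC itself rather than through a separate bulk transfer, avoiding
      the bulk setup round trips. A minimal conceptual sketch of that dispatch decision follows;
      the names, sizes, and structures below are illustrative only, not the actual Lustre
      osc/ost symbols:

        /* Illustrative sketch only - not the real Lustre code path or symbols. */
        #include <stdio.h>
        #include <string.h>

        #define SHORT_IO_MAX_BYTES (4 * 1024)          /* requests <= 4k go inline */

        struct io_request {
            size_t len;
            char   inline_buf[SHORT_IO_MAX_BYTES];     /* payload carried inside the RPC */
            int    uses_bulk;                          /* stand-in for a bulk descriptor */
        };

        /* Decide how a write of 'len' bytes travels to the server. */
        static void prepare_write(struct io_request *req, const void *data, size_t len)
        {
            req->len = len;
            if (len <= SHORT_IO_MAX_BYTES) {
                /* Short I/O: embed the data in the request, no bulk RPC handshake. */
                memcpy(req->inline_buf, data, len);
                req->uses_bulk = 0;
            } else {
                /* Normal path: the data moves in a separate bulk transfer. */
                req->uses_bulk = 1;
            }
        }

        int main(void)
        {
            struct io_request req;
            char small[512] = { 0 };
            char large[64 * 1024] = { 0 };

            prepare_write(&req, small, sizeof(small));
            printf("512 B write  -> uses_bulk=%d (inline)\n", req.uses_bulk);
            prepare_write(&req, large, sizeof(large));
            printf("64 KiB write -> uses_bulk=%d (bulk RPC)\n", req.uses_bulk);
            return 0;
        }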

          Activity

            [LU-1757] Short I/O support

            Original patch did not actually enable this functionality.

            paf Patrick Farrell (Inactive) added a comment

            Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/30435
            Subject: LU-1757 brw: Fix short i/o and enable for mdc
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8efc38861cb224d69c012862f6e8ae453b890d17

            gerrit Gerrit Updater added a comment

            This was landed for 2.11, but Data-on-MDT landed at the same time. The MDS connection does not support SHORTIO yet, but it should.

            adilger Andreas Dilger added a comment
            mdiep Minh Diep added a comment -

            Landed for 2.11

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27767/
            Subject: LU-1757 brw: add short io osc/ost transfer.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 70f092a0587866662735e1a6eaf27701a576370d

            gerrit Gerrit Updater added a comment

            Andreas,

            Aligned, mostly. Can't really test unaligned without getting aggregation or readahead, except for random reads. (Those do well.)

            Server side locking... How is that achieved, other than with the patch from LU-4198, and that only for direct i/o? (https://review.whamcloud.com/#/c/8201/20)
            And since it's direct i/o, it has to be aligned. (Trying to update LU-247 (unaligned dio) is my next project.)

            Also, other than with direct i/o, I'm not sure how to actually achieve small i/os (write aggregation or read ahead will prevent them), except for random reads. (Which do see a benefit - I didn't mention that, but they see the same sort of improvement.)

            So, I suppose I would say:
            I want to try all of that, and I agree that it would likely benefit enormously (you pointed this out in your earlier comments to this LU), but I believe I need LU-247 to make it really possible, since direct i/o is the only way I know of to A) skip the page cache, B) force small i/o, and C) move the locking to the server (I can fake the effect of that by doing my i/o from one node).

            Is there some easier route I've missed?

            paf Patrick Farrell (Inactive) added a comment
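
            As an illustration of the constraint discussed above (O_DIRECT requires aligned
            buffers and sizes, and it bypasses the client page cache, so readahead and write
            aggregation cannot hide the small I/O), here is a minimal sketch of an aligned
            O_DIRECT small write; the file path, alignment, and size are placeholders, and the
            exact alignment requirements depend on the filesystem and kernel:

              /* Minimal aligned O_DIRECT small-write sketch; path and sizes are placeholders. */
              #define _GNU_SOURCE
              #include <fcntl.h>
              #include <stdio.h>
              #include <stdlib.h>
              #include <string.h>
              #include <unistd.h>

              int main(void)
              {
                  const size_t align = 4096;   /* page-sized alignment */
                  const size_t len   = 4096;   /* one aligned 4 KiB I/O */
                  void *buf;
                  int fd;

                  /* O_DIRECT bypasses the client page cache, so the small write
                   * actually reaches the server as a small RPC. */
                  fd = open("/mnt/lustre/shortio_test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
                  if (fd < 0) { perror("open"); return 1; }

                  /* Both the buffer address and the I/O size must be aligned. */
                  if (posix_memalign(&buf, align, len)) { close(fd); return 1; }
                  memset(buf, 'x', len);

                  if (pwrite(fd, buf, len, 0) != (ssize_t)len)
                      perror("pwrite");

                  free(buf);
                  close(fd);
                  return 0;
              }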

            Patrick, have you tested aligned or unaligned reads/writes? I expect with unaligned multi-client writes and server-side locking that this could also improve performance significantly.

            There could also be a big benefit from bypassing the client-side aggregation and caching mechanisms completely in that case, just dumping the chunks to the OST as fast as possible, and using something like NRS ORR to aggregate the I/Os properly on the server, or at least avoid read-modify-write for small writes over the network.

            adilger Andreas Dilger added a comment

            I've resurrected this patch and ported it to current master. Some simple testing here suggests A) it's working fine, and B) it gives about a 30% performance improvement for direct I/O of appropriate size (I upped the limit to 3 pages, which is what fits in the RPC, I believe), when I'm reading from a fast storage device (RAM or flash). When I'm doing small I/O to/from a spinning disk, I see no real improvement - but that's probably because network latency is not the primary driver of I/O performance there.

            paf Patrick Farrell (Inactive) added a comment
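
            For context on how a per-I/O latency difference like that might be measured, here is
            a rough timing loop over small O_DIRECT reads; the path, size, and iteration count
            are placeholders, and "3 pages" above presumably means about 12 KiB with 4 KiB pages:

              /* Rough per-I/O latency measurement for small O_DIRECT reads.
               * Path, size, and iteration count are placeholders. */
              #define _GNU_SOURCE
              #include <fcntl.h>
              #include <stdio.h>
              #include <stdlib.h>
              #include <time.h>
              #include <unistd.h>

              int main(void)
              {
                  const size_t len  = 4096;    /* one page, within the short I/O limit */
                  const int   iters = 10000;
                  struct timespec t0, t1;
                  void *buf;
                  int fd, i;

                  fd = open("/mnt/lustre/shortio_test", O_RDONLY | O_DIRECT);
                  if (fd < 0) { perror("open"); return 1; }
                  if (posix_memalign(&buf, 4096, len)) { close(fd); return 1; }

                  clock_gettime(CLOCK_MONOTONIC, &t0);
                  for (i = 0; i < iters; i++) {
                      /* With O_DIRECT every read goes to the server, so the loop
                       * measures per-RPC latency rather than client cache hits. */
                      if (pread(fd, buf, len, 0) != (ssize_t)len) {
                          perror("pread");
                          break;
                      }
                  }
                  clock_gettime(CLOCK_MONOTONIC, &t1);

                  double us = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                               (t1.tv_nsec - t0.tv_nsec)) / 1e3;
                  printf("avg latency: %.1f us per %zu-byte read\n", us / iters, len);

                  free(buf);
                  close(fd);
                  return 0;
              }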

            Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/27767
            Subject: LU-1757 brw: add short io osc/ost transfer.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 01056e12846a73c041da92d8a4f216f2641ca1cc

            gerrit Gerrit Updater added a comment

            I want to reserve an OBDO flag for short i/o:
            http://review.whamcloud.com/8182

            Andreas, right now I have no time or resources to check your suggestion, and the short io patch is outdated for master and requires rework. Data-on-MDT looks very good for short io and needs another patch as well.
            We did not do the lockless test for shared files, but single-client page writes with oflags=direct did not show significant improvement.

            aboyko Alexander Boyko added a comment

            Alexander, I saw that the patch for this feature is abandoned, but small writes are definitely an area where Lustre could use a considerable amount of improvement. I'm still hopeful that there may be some workloads where this feature could show significant performance improvements, or at least show what other work needs to be done in addition to this patch. The Data-on-MDT work is more concerned with small files and is definitely orthogonal to this small write patch, which is intended to improve small write RPCs to a potentially very large file.

            It may be that we need to make additional changes in order to see the overall improvement of small files. Some areas to investigate to see why this patch isn't showing the expected improvements:

            • what is the improvement when the writes are smaller than a single disk block?
            • what is the improvement when multiple clients are doing interleaved writes to the same file? This can be tested relatively easily with IOR and multiple client nodes ("ior -w -b 32 -t 32 -s 65536 -N 8 -i 10 -o /mnt/lustre/testfile" runs on 8 clients and does 65536 interleaved 32-byte writes per client; a rough breakdown of these parameters follows this list).
            • what impact does NRS object-based round-robin (ORR) have when doing small writes to a single file? This should sort the writes by file offset, but it may be that short writes also need to be cached on the OST so that they can avoid synchronous read-modify-write on the disk. This might be more easily tested with a ZFS OSD, which already does write caching, while the ldiskfs OSD would need changes to the IO path in order to cache small writes.
            • in the shared single interleaved write case, are the clients doing lockless writes? If not, the lock contention and overhead of doing LDLM enqueue/cancel for each write may easily dominate over the improvement from the small write patch. For sub-page writes, it might also be that there needs to be some IO fastpath that bypasses the client page cache so that it can avoid read-modify-write for the local page.
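
            A rough reading of that IOR invocation (assuming -b, -t, and -s are the byte-count
            block size, transfer size, and segment count): each of the 8 tasks writes 65536
            transfers of 32 bytes, i.e. about 2 MiB per client and a 16 MiB shared file, with the
            per-task 32-byte blocks interleaved in task order, repeated for 10 iterations. The
            point is a stream of tiny, strictly interleaved writes that no single client can
            merge locally.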

            I suspect that if there were multiple clients doing small writes

            adilger Andreas Dilger added a comment

            People

              paf Patrick Farrell (Inactive)
              aboyko Alexander Boyko
              Votes: 0
              Watchers: 16
