Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-247

Lustre client slow performance on BG/P IONs: unaligned DIRECT_IO

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.0.0
    • None
    • 3
    • 18,801
    • 9105

    Description

      Port the fix of bug 18801 to 2.1

      Attachments

        Issue Links

          Activity

            [LU-247] Lustre client slow performance on BG/P IONs: unaligned DIRECT_IO

            I don't like the idea of adding new users of the obd_brw() APIs. These are pretty much obsolete on both the client and server, and we should be removing them entirely instead of adding new users.

            adilger Andreas Dilger added a comment - I don't like the idea of adding new users of the obd_brw() APIs. These are pretty much obsolete on both the client and server, and we should be removing them entirely instead of adding new users.

            The patch in http://review.whamcloud.com/980 uses obd_brw() API in osc_io_dio(), since it requires less code changes and easyer for landing, however, the best way to implement osc_io_dio() should be: introduce a new API which can naturally accept (file offset, bytes, offset in first page, incontiguous pages) and build io requests based on these information directly (brw_page should not be used for direct io, 'pshift' hacking should be eliminated), and osc_io_dio() calls the new API to build io reqeusts and sends them asynchronously.

            The proper implementaion requires more code changes, let's fix it in a follow-up patch.

            niu Niu Yawei (Inactive) added a comment - The patch in http://review.whamcloud.com/980 uses obd_brw() API in osc_io_dio(), since it requires less code changes and easyer for landing, however, the best way to implement osc_io_dio() should be: introduce a new API which can naturally accept (file offset, bytes, offset in first page, incontiguous pages) and build io requests based on these information directly (brw_page should not be used for direct io, 'pshift' hacking should be eliminated), and osc_io_dio() calls the new API to build io reqeusts and sends them asynchronously. The proper implementaion requires more code changes, let's fix it in a follow-up patch.
            niu Niu Yawei (Inactive) added a comment - Patch is at http://review.whamcloud.com/980
            pjones Peter Jones added a comment -

            I think that we will defer this task for now

            pjones Peter Jones added a comment - I think that we will defer this task for now
            green Oleg Drokin added a comment -

            Frankly I am not even sure why do we need radix trees or anything like that for directio or otherwise uncached pages.
            It's a purely one-shot job anyway.

            Lai, your description of patchless io partial page write in clio where unlocked page is marked as clean sounds totally broken.
            Good thing there is no easy way to trigger this mode or we'd have another blocker on our hands.

            green Oleg Drokin added a comment - Frankly I am not even sure why do we need radix trees or anything like that for directio or otherwise uncached pages. It's a purely one-shot job anyway. Lai, your description of patchless io partial page write in clio where unlocked page is marked as clean sounds totally broken. Good thing there is no easy way to trigger this mode or we'd have another blocker on our hands.

            I realize we should give it a try because this may be not a problem in clio. Clio has a finer grained lockless IO control, it can do lockless IO per ost object, instead of per file as b18.

            jay Jinshan Xiong (Inactive) added a comment - I realize we should give it a try because this may be not a problem in clio. Clio has a finer grained lockless IO control, it can do lockless IO per ost object, instead of per file as b18.

            Hi Niu,

            I composed a patch to bypass cl_page cache for TRANSIENT pages at: http://review.whamcloud.com/#change,495, just for a reference.

            jay Jinshan Xiong (Inactive) added a comment - Hi Niu, I composed a patch to bypass cl_page cache for TRANSIENT pages at: http://review.whamcloud.com/#change,495 , just for a reference.

            Hi Niu,

            Thanks for the points. Yes, this is a good chance for us to fix the directIO problem in one shot.

            I still tend to do this upon the infrastructure of clio. Also the reason to have a radix tree in clio is for porting purpose - we've done a lot of work to decouple linux vfs/vm in both MDT and client stack.

            Here is my idea: let's fix CPT_TRANSIENT page implementation to have arbitrary buffers for this kind of pages. However, this will not fit into cl_page cache any more which means stale data may be read if applications are doing regular IO and directIO on the same node. This may violate posix semantics a little bit, but I tend to think this is fine since linux kernel is doing the same thing.

            I'm going to compose a draft patch for this. It should be easy.

            jay Jinshan Xiong (Inactive) added a comment - Hi Niu, Thanks for the points. Yes, this is a good chance for us to fix the directIO problem in one shot. I still tend to do this upon the infrastructure of clio. Also the reason to have a radix tree in clio is for porting purpose - we've done a lot of work to decouple linux vfs/vm in both MDT and client stack. Here is my idea: let's fix CPT_TRANSIENT page implementation to have arbitrary buffers for this kind of pages. However, this will not fit into cl_page cache any more which means stale data may be read if applications are doing regular IO and directIO on the same node. This may violate posix semantics a little bit, but I tend to think this is fine since linux kernel is doing the same thing. I'm going to compose a draft patch for this. It should be easy.

            I think there are two major components in this porting:

            1. Make bufferred lockless io into direct lockless io.

            Current 2.0 code supports bufferred lockless io, we should make them into direct io, since:

            • Partial write could cause data corrption: a partial write should make page uptodate by read page in prepare_write, since
              it's lockless io, the cached page isn't protected by dlm lock, then another client could change other part of this page
              just after prior client's prepare_write, when the prior client flush this page, the change made by the 2nd client will
              be overwritten by stale data.
            • In fact in the bufferred lockless io case, the page never been cached (it can't be cached), so it doesn't make any sense
              to copy data between user page and kernel cache, as shown in b18801, the overhead of such copy sometimes is quite heavy.

            2. Support unaligned direct io.

            To support unaligned direct io in 2.0, I think we'd better build io request based on user pages and file offset directly,
            that should be similar to obd_brw api does. In my opinion, direct io doesn't fit those cl_xxx apis very well (they are
            designed for bufferred io, the page offset is tightly binding with file offset, I think), and I don't see what's the
            benefit of maitaining a raidx tree for those never cached user pages.

            Jay, Oleg
            Does my propsal sound ok to you? Thanks.

            niu Niu Yawei (Inactive) added a comment - I think there are two major components in this porting: 1. Make bufferred lockless io into direct lockless io. Current 2.0 code supports bufferred lockless io, we should make them into direct io, since: Partial write could cause data corrption: a partial write should make page uptodate by read page in prepare_write, since it's lockless io, the cached page isn't protected by dlm lock, then another client could change other part of this page just after prior client's prepare_write, when the prior client flush this page, the change made by the 2nd client will be overwritten by stale data. In fact in the bufferred lockless io case, the page never been cached (it can't be cached), so it doesn't make any sense to copy data between user page and kernel cache, as shown in b18801, the overhead of such copy sometimes is quite heavy. 2. Support unaligned direct io. To support unaligned direct io in 2.0, I think we'd better build io request based on user pages and file offset directly, that should be similar to obd_brw api does. In my opinion, direct io doesn't fit those cl_xxx apis very well (they are designed for bufferred io, the page offset is tightly binding with file offset, I think), and I don't see what's the benefit of maitaining a raidx tree for those never cached user pages. Jay, Oleg Does my propsal sound ok to you? Thanks.

            People

              niu Niu Yawei (Inactive)
              niu Niu Yawei (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: