
single stream write performance improvement with worker threads in llite

Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
    • 9223372036854775807

    Description

      This patch provides a single-stream write performance improvement with multiple worker threads in the llite layer. Its operation overview is as follows (a minimal sketch of the hand-off is given after the two lists below):

      In system call context
      1) get a worker thread's lu_env
      2) assemble and set parameters
      2-1) copy the user buffer to a kernel buffer
      2-2) copy the parameters the worker thread needs to resume the I/O
      2-3) set the parameters in the lu_env obtained in (1)
      2-4) set extra parameters in an I/O item, iosvc_item
      3) inform the worker thread, ll_iosvc, that the item is ready
      4) return immediately

      In worker thread context
      1) wake up
      2) gather information
      2-1) refer to its own lu_env to read the parameters set by the system call
      2-2) refer to the item prepared in (2-4)
      3) resume the I/O
      4) sleep
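
      The following is a minimal sketch of that hand-off, written against the
      steps above. Only the names iosvc_item, ll_iosvc and lu_env come from this
      ticket; the field names, the iosvc_thread structure and the
      iosvc_queue_write() helper are illustrative assumptions, not the patch's
      actual code.

      #include <linux/fs.h>
      #include <linux/slab.h>
      #include <linux/list.h>
      #include <linux/spinlock.h>
      #include <linux/wait.h>
      #include <linux/uaccess.h>

      struct lu_env;                          /* Lustre per-thread environment */

      /* One queued write request (hypothetical layout of an iosvc_item). */
      struct iosvc_item {
              struct lu_env    *ii_env;       /* worker's lu_env from step (1) */
              void             *ii_kbuf;      /* kernel copy of user data, step (2-1) */
              size_t            ii_count;     /* bytes to write, step (2-2) */
              loff_t            ii_pos;       /* file offset, step (2-2) */
              struct list_head  ii_list;
      };

      /* Per-worker state for an ll_iosvc thread (again hypothetical). */
      struct iosvc_thread {
              struct lu_env     *is_env;
              spinlock_t         is_lock;
              struct list_head   is_queue;
              wait_queue_head_t  is_waitq;
      };

      /* System call context: package the write, wake the worker, return. */
      static ssize_t iosvc_queue_write(struct iosvc_thread *svc,
                                       const char __user *ubuf,
                                       size_t count, loff_t pos)
      {
              struct iosvc_item *item;

              item = kmalloc(sizeof(*item), GFP_KERNEL);
              if (item == NULL)
                      return -ENOMEM;

              item->ii_kbuf = kmalloc(count, GFP_KERNEL);
              if (item->ii_kbuf == NULL) {
                      kfree(item);
                      return -ENOMEM;
              }
              if (copy_from_user(item->ii_kbuf, ubuf, count)) {  /* step 2-1 */
                      kfree(item->ii_kbuf);
                      kfree(item);
                      return -EFAULT;
              }
              item->ii_count = count;                            /* step 2-2 */
              item->ii_pos   = pos;
              item->ii_env   = svc->is_env;                      /* steps 1, 2-3 */

              spin_lock(&svc->is_lock);                          /* step 2-4 */
              list_add_tail(&item->ii_list, &svc->is_queue);
              spin_unlock(&svc->is_lock);

              wake_up(&svc->is_waitq);                           /* step 3 */
              return count;                                      /* step 4 */
      }

      The worker side then dequeues the item under is_lock, reads the parameters
      from its own lu_env and the item (steps 2-1/2-2 of the worker list), and
      resumes the I/O before going back to sleep.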

      I attached performance results comparing the original Lustre-2.7.52 with this custom Lustre-2.7.52.

      Attachments

        Issue Links

          Activity

            [LU-6658] single stream write performance improvement with worker threads in llite

            Closing this ticket because a partially similar approach landed from LU-8964.

            dmiter Dmitry Eremin (Inactive) added a comment

            Parallel I/O is definitely helpful for performance. However, we need to do it in a different way. From my point of view, it should be done stripe by stripe, because if we did it within a single stripe it would only add contention on a single OSC. That being said, the current I/O architecture should be changed as follows (a minimal sketch of the wait-for-sub-I/O part follows the list):
            1. I/O initialization;
            2. in I/O iter_init(), the I/O should be split into multiple sub tasks at the LOV layer, and the sub I/Os should be scheduled to worker threads by stripe;
            3. at the end of the I/O, the master thread should wait for all sub I/Os to finish or error out;
            4. error handling - decide what value should be returned to the application;
            5. special I/Os like append should go through the original I/O path.
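
            Here is a minimal sketch of steps 3 and 4 above (the master waiting for
            the per-stripe sub I/Os and collecting the first error). None of these
            names come from an actual patch; they are illustrative only.

            #include <linux/atomic.h>
            #include <linux/completion.h>

            /* Shared state for one split write (hypothetical). */
            struct subio_set {
                    atomic_t          sis_remaining;  /* sub I/Os still in flight */
                    int               sis_rc;         /* first error seen, 0 if none */
                    struct completion sis_done;       /* master waits here */
            };

            static void subio_set_init(struct subio_set *set, int nr_subios)
            {
                    atomic_set(&set->sis_remaining, nr_subios);
                    set->sis_rc = 0;
                    init_completion(&set->sis_done);
            }

            /* Called by a worker when its per-stripe sub I/O finishes. */
            static void subio_complete(struct subio_set *set, int rc)
            {
                    if (rc != 0)
                            cmpxchg(&set->sis_rc, 0, rc);   /* keep the first error */
                    if (atomic_dec_and_test(&set->sis_remaining))
                            complete(&set->sis_done);       /* last one wakes master */
            }

            /* Master side: after scheduling the sub I/Os in iter_init(), wait. */
            static int subio_wait(struct subio_set *set)
            {
                    wait_for_completion(&set->sis_done);
                    return set->sis_rc;                     /* value for the application */
            }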

            jay Jinshan Xiong (Inactive) added a comment

            I agree that we need to first fully understand why this patch brings performance improvement, i.e. why the original single thread write performance is not as good as it can be. Hiroya, do you have any profiling results or detailed comparison that we can look at?

            lixi Li Xi (Inactive) added a comment - I agree that we need to first fully understand why this patch brings performance improvement, i.e. why the original single thread write performance is not as good as it can be. Hiroya, do you have any profiling results or detailed comparison that we can look at?

            > Not every application calls fsync() at the end of I/O. Also, no-space and no-quota errors should be returned at the time of writing; otherwise it is useless to cache an I/O that is certain to fail.

            I intended to implement this feature as close-to-sync because, strictly speaking, every application should call close(2) and check its return value. And if applications follow that rule, the implementation may not be an issue, right? .... But it seems that my implementation doesn't work as I expected in close(2).

            > In my view, the bigger the buffer size, the more likely it can take advantage of parallel data copying. A small buffer can't amortize the cost of thread switching.

            I agree with you.

            > An obvious fact is that modern CPUs have huge caches and most likely the data being written is still warm in cache, so the worker thread should be limited to the same core or socket.

            I bind each thread to the same core when the threads are created (a minimal sketch of such pinning is shown below).
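
            The sketch below shows one standard way to pin a worker kthread to a
            given CPU at creation time, using kthread_create()/kthread_bind(). The
            thread name and the helper are assumptions for illustration, not the
            patch's actual code.

            #include <linux/err.h>
            #include <linux/kthread.h>
            #include <linux/sched.h>

            static int iosvc_main(void *data);      /* hypothetical worker loop */

            /* Create one worker for 'cpu' and pin it there before it runs. */
            static struct task_struct *iosvc_start_worker(int cpu, void *data)
            {
                    struct task_struct *task;

                    task = kthread_create(iosvc_main, data, "ll_iosvc_%02d", cpu);
                    if (IS_ERR(task))
                            return task;

                    kthread_bind(task, cpu);        /* pin to the chosen core */
                    wake_up_process(task);
                    return task;
            }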

            Anyway, it looks like I can't avoid revising my implementation, hmmm. Time is always an issue ...

            nozaki Hiroya Nozaki (Inactive) added a comment

            > When an error or something else causes a short write, the error is detected by iosvc_detect_swrite() and the errno is set in ll_inode_info.lli_iosvc_rc.
            > That member is then picked up by fsync or fflush, so a user can retrieve the error through them. Considering that fflush is called via close(), the user is able to notice the error there.

            Not every application calls fsync() at the end of I/O. Also, no-space and no-quota errors should be returned at the time of writing; otherwise it is useless to cache an I/O that is certain to fail.

            > Yes, the total cost increases, but I restrict the amount of duplicated memory with a module parameter, iosvc_max_iovec_mb. This means that iosvc can use only (iosvc_max_iovec_mb * thread_num) MiB, and when a write exceeds that amount it goes through the ordinary route, so we can easily avoid awful performance degradation. You can see the logic in iosvc_check_and_get_iovec()
            > ( ll_file_io_generic() -> iosvc_duplicate_env() -> iosvc_setup_iovec() -> iosvc_check_and_get_iovec() )

            In my view, the bigger the buffer size, the more likely it can take advantage of parallel data copying; a small buffer can't amortize the cost of thread switching. Of course, the policy can be way more complex than this. An obvious fact is that modern CPUs have huge caches and most likely the data being written is still warm in cache, so the worker thread should be limited to the same core or socket.

            jay Jinshan Xiong (Inactive) added a comment
            nozaki Hiroya Nozaki (Inactive) added a comment - edited

            Hi, Jinshan!
            Let me answer your questions inline.

            1>
            When an error or something else causes a short write, the error is detected by iosvc_detect_swrite() and the errno is set in ll_inode_info.lli_iosvc_rc.
            That member is then picked up by fsync or fflush, so a user can retrieve the error through them. Considering that fflush is called via close(), the user is able to notice the error there. (A minimal sketch of this deferred-error handling is given below.)
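
            A minimal sketch of that deferred-error path, assuming only that
            ll_inode_info carries the lli_iosvc_rc field mentioned here plus a
            spinlock; the structure shown is trimmed to those fields and the
            helpers are illustrative, not the patch's iosvc_detect_swrite():

            #include <linux/spinlock.h>

            struct ll_inode_info {          /* trimmed to the fields used here */
                    spinlock_t  lli_lock;
                    int         lli_iosvc_rc;
            };

            /* Worker side: remember the first deferred error on the inode. */
            static void iosvc_record_error(struct ll_inode_info *lli, int rc)
            {
                    spin_lock(&lli->lli_lock);
                    if (lli->lli_iosvc_rc == 0)
                            lli->lli_iosvc_rc = rc;
                    spin_unlock(&lli->lli_lock);
            }

            /* fsync()/close() path: hand the deferred error back to the caller. */
            static int iosvc_consume_error(struct ll_inode_info *lli)
            {
                    int rc;

                    spin_lock(&lli->lli_lock);
                    rc = lli->lli_iosvc_rc;
                    lli->lli_iosvc_rc = 0;
                    spin_unlock(&lli->lli_lock);
                    return rc;
            }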

            2>
            Yes, the total cost increases, but I restrict the amount of duplicated memory with a module parameter, iosvc_max_iovec_mb. This means that iosvc can use only (iosvc_max_iovec_mb * thread_num) MiB, and when a write exceeds that amount it goes through the ordinary route, so we can easily avoid awful performance degradation. You can see the logic in iosvc_check_and_get_iovec() (see also the sketch below)
            ( ll_file_io_generic() -> iosvc_duplicate_env() -> iosvc_setup_iovec() -> iosvc_check_and_get_iovec() )
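
            The rough shape of that budget check might look like the following. The
            tunable name iosvc_max_iovec_mb comes from this comment; its default
            value, the counter and the helpers are illustrative assumptions, not the
            actual iosvc_check_and_get_iovec():

            #include <linux/module.h>
            #include <linux/atomic.h>

            /* Per-thread budget for duplicated buffers, in MiB (default is a guess). */
            static unsigned int iosvc_max_iovec_mb = 32;
            module_param(iosvc_max_iovec_mb, uint, 0644);

            static atomic64_t iosvc_iovec_bytes;    /* bytes currently duplicated */

            /* Reserve 'count' bytes against (iosvc_max_iovec_mb * thread_num) MiB,
             * or tell the caller to fall back to the ordinary write path. */
            static bool iosvc_try_reserve(size_t count, int thread_num)
            {
                    u64 limit = (u64)iosvc_max_iovec_mb * thread_num << 20;

                    if (atomic64_add_return(count, &iosvc_iovec_bytes) > limit) {
                            atomic64_sub(count, &iosvc_iovec_bytes);
                            return false;           /* over budget: ordinary route */
                    }
                    return true;
            }

            static void iosvc_release_reservation(size_t count)
            {
                    atomic64_sub(count, &iosvc_iovec_bytes);
            }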


            Hi Hiroya,

            I understood you really well.

            Today I had a chance to read your patch and I have a few questions:
            1. As you mentioned in your previous comment, the writer doesn't wait for the async thread to finish the I/O, which means the I/O may fail without the application knowing about it. This can potentially cause data corruption, because otherwise the application could retry the write or simply stop running;

            2. The patch copies the user-space buffer into a worker thread's kernel buffer, and then this kernel buffer has to be copied once again into the inode's page cache. This is not good, and I think you will see significant performance degradation in the multi-threaded write case.

            Yes, I think you've realized this is not parallel I/O as it is intended to be. For a parallel I/O implementation, I'd like the I/O to be split at the LOV layer. In the first phase, we could support only files with multiple stripes, and different threads would work on different stripes, from requesting the lock to copying data. The writer thread must wait for all sub I/Os to complete before it can return to the application.

            jay Jinshan Xiong (Inactive) added a comment
            nozaki Hiroya Nozaki (Inactive) added a comment - edited

            > My understanding is that the benchmark results are for a single iozone thread running on one client?

            That's right. This was done by a single iozone thread running on one client.

            > Are there any benchmark results from many threads (e.g. one per core on a 16+ core client) to see what the overhead is from running the extra worker threads?

            There is, but it was measured for FEFS v2.0 based on Lustre-2.6.0. I will attach a simple graph, though I'm not sure it is suitable to present here.

            > There is definitely a performance improvement seen during writes (about 50%) with the patch compared to the unpatched client, though the read performance appears to be a bit lower (maybe 5%?). Since there is already parallelism in IO submission at the ptlrpcd layer, it would be interesting to figure out where the performance improvement is coming from?

            Strictly speaking, this patch doesn't work as parallel I/O but rather ... like async-async I/O. Focusing on write latency, the page-cache cost and the cost of waiting for requests, i.e. the cost below the vvp layer, dominate the write system call. So I tried to cut that out of the write system call using worker threads, though I had to add one more memcpy.

            That's where the performance improvement comes from:

            (non-patched write cost) - (cost below the vvp layer) + (one extra memcpy cost)

            And there was good news for me: Lustre started to support range_lock, so I don't have to wait for the previous write even if the target file is the same; of course, a write still has to wait if the extents overlap.

            Consequently, this works like parallel I/O, but I'd rather call it async-async I/O ...

            Anyway, my English is not great, so I'm not sure everyone who reads my comment will completely understand what I want to say. Please feel free to ask me anything. Thank you.


            This looks like very interesting work. My understanding is that the benchmark results are for a single iozone thread running on one client? Are there any benchmark results from many threads (e.g. one per core on a 16+ core client) to see what the overhead is from running the extra worker threads?

            There is definitely a performance improvement seen during writes (about 50%) with the patch compared to the unpatched client, though the read performance appears to be a bit lower (maybe 5%?). Since there is already parallelism in IO submission at the ptlrpcd layer, it would be interesting to figure out where the performance improvement is coming from?

            adilger Andreas Dilger added a comment

            Hi, Jinshan.

            > 1. Does the system call thread wait for the worker thread to complete?
            No, the system call thread returns immediately after registering the I/O information with a worker thread.

            > 2. In your performance benchmark sheet, what's the stripe count and size you used?
            Ah, yes, I forgot to write it down in the sheet.
            It was the following:

            lfs getstripe /mnt/lustre
              /mnt/lustre
              stripe_count: -1 stripe_size: 4194304 stripe_offset: -1
              /mnt/lustre/test
              lmm_stripe_count: 2
              lmm_stripe_size: 4194304
              lmm_pattern: 1
              lmm_layout_gen: 0
              lmm_stripe_offset: 2
              obdidx objid objid group
              2 2 0x2 0
              1 2 0x2 0

            > 3. it looks like you have done parallel IO on the LLITE layer, how much effort would it be to move it to LOV layer
            > so that worker thread can do IO on stripe boundary?
            Ah ... yeah, actually it looks like it would work well at the LOV layer too. Hmm ... it seems it would take a couple of months until I complete it. But, you know, I had to deal with tons of issues when implementing this feature, which is why I finally landed on this approach, so I'm not very confident about the estimate or the difficulty.

            nozaki Hiroya Nozaki (Inactive) added a comment

            People

              Assignee: dmiter Dmitry Eremin (Inactive)
              Reporter: nozaki Hiroya Nozaki (Inactive)
              Votes: 0
              Watchers: 17
