[LU-2816] sanity-benchmark test_bonnie slow after 4MB BRW RPC patch

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.4.0
    • Fix Version: Lustre 2.4.0
    • 3
    • 6820

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/91d8ae78-755b-11e2-bf59-52540035b04c.

      The sub-test test_bonnie failed with the following error:

      test failed to respond and timed out

      From the stdout log, I found the test ran extremely slowly compared to the successful run:

      It took almost 45 minutes for the failed test:

      16:44:39:== sanity-benchmark test bonnie: bonnie++ == 16:44:30 (1360629870)
      16:44:39:min OST has 1809000kB available, using 3845408kB file size
      16:44:39:debug=0
      16:44:39:running as uid/gid/euid/egid 500/500/500/500, groups:
      16:44:39: [touch] [/mnt/lustre/d0_runas_test/f16388]
      16:44:39:running as uid/gid/euid/egid 500/500/500/500, groups:
      16:44:39: [bonnie++] [-f] [-r] [0] [-s3755] [-n] [10] [-u500:500] [-d/mnt/lustre/d0.bonnie]
      16:44:39:Using uid:500, gid:500.
      16:47:56:Writing intelligently...done
      17:29:18:Rewriting...done
      

      Only 9 minutes for the successful test:

      01:31:59:== sanity-benchmark test bonnie: bonnie++ == 01:31:52 (1359624712)
      01:31:59:min OST has 1783256kB available, using 3845408kB file size
      01:31:59:debug=0
      01:31:59:running as uid/gid/euid/egid 500/500/500/500, groups:
      01:31:59: [touch] [/mnt/lustre/d0_runas_test/f22196]
      01:31:59:running as uid/gid/euid/egid 500/500/500/500, groups:
      01:31:59: [bonnie++] [-f] [-r] [0] [-s3755] [-n] [10] [-u500:500] [-d/mnt/lustre/d0.bonnie]
      01:31:59:Using uid:500, gid:500.
      01:35:21:Writing intelligently...done
      01:40:56:Rewriting...done
      


          Activity


            jay Jinshan Xiong (Inactive) added a comment - Let's fix it for now. Readahead performance tweak will be addressed in another ticket.
            jay Jinshan Xiong (Inactive) added a comment - patch is at: http://review.whamcloud.com/5691

            jay Jinshan Xiong (Inactive) added a comment - I did some research and it turned out that we need to adjust max_read_ahead_mb and max_read_ahead_per_file_mb to match the 4MB RPC size and the 4MB increase step. Otherwise the readahead budget gets used up and produces a huge number of RA_STAT_MAX_IN_FLIGHT events. After I increased those two parameters I didn't see a significant performance loss on read.

            However, adjusting those two parameters needs a lot of work to make sure they are okay with other use cases. So I'd like to set RAS_INCREASE_STEP back to 1MB; after 4MB RPCs are enabled by default, we should spend more time on that.
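
            To make the readahead-budget arithmetic concrete, here is a minimal standalone C sketch. The 40MB budget below is an assumed example value, not a figure from this ticket or a Lustre default; the point is only that a 4MB increase step exhausts a fixed max_read_ahead_mb budget four times sooner than a 1MB step, which is what drives the RA_STAT_MAX_IN_FLIGHT count up.

            /* Illustrative sketch only -- not Lustre code.  Shows how many files
             * can each hold one full readahead increase step before a fixed
             * global budget (max_read_ahead_mb) is exhausted, after which
             * further readahead shows up as RA_STAT_MAX_IN_FLIGHT. */
            #include <stdio.h>

            int main(void)
            {
                    const unsigned long budget_mb = 40;        /* assumed example budget */
                    const unsigned long steps_mb[] = { 1, 4 }; /* old vs new increase step */

                    for (int i = 0; i < 2; i++)
                            printf("%luMB step: budget exhausted after %lu concurrent windows\n",
                                   steps_mb[i], budget_mb / steps_mb[i]);
                    return 0;
            }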

            jay Jinshan Xiong (Inactive) added a comment - I've found the root cause of this problem - it's a simple one:

            diff --git a/lustre/llite/rw.c b/lustre/llite/rw.c
            index cbd1c91..8cfa70e 100644
            --- a/lustre/llite/rw.c
            +++ b/lustre/llite/rw.c
            @@ -576,7 +576,7 @@ static int ll_read_ahead_page(const struct lu_env *env, struct cl_io *io,
              * know what the actual RPC size is.  If this needs to change, it makes more
              * sense to tune the i_blkbits value for the file based on the OSTs it is
              * striped over, rather than having a constant value for all files here. */
            -#define RAS_INCREASE_STEP(inode) (1UL << inode->i_blkbits)
            +#define RAS_INCREASE_STEP(inode) (1UL << (inode->i_blkbits - CFS_PAGE_SHIFT))
             
             static inline int stride_io_mode(struct ll_readahead_state *ras)
             {

            But there is still a 10%~20% performance loss on read after applying this patch, so I'd like to spend more time on it.
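
            To spell out the arithmetic behind the one-line fix: the readahead code grows its window in pages, but the old macro returned a byte count. The standalone illustration below is not Lustre code; it assumes 4KB pages (CFS_PAGE_SHIFT = 12) and i_blkbits = LL_MAX_BLKSIZE_BITS = 22, i.e. the 4MB block size mentioned elsewhere in this ticket.

            /* Illustrative sketch only -- not Lustre code.  Demonstrates the units
             * mismatch fixed by the patch above, assuming 4KB pages and a 4MB
             * i_blkbits value. */
            #include <stdio.h>

            #define CFS_PAGE_SHIFT 12   /* 4KB pages (assumption) */
            #define I_BLKBITS      22   /* 1 << 22 = 4MB          */

            int main(void)
            {
                    unsigned long long old_step = 1ULL << I_BLKBITS;                    /* bytes */
                    unsigned long long new_step = 1ULL << (I_BLKBITS - CFS_PAGE_SHIFT); /* pages */

                    /* The readahead window is counted in pages, so the old value was
                     * taken as 4M pages (16GB) per step instead of the intended 4MB. */
                    printf("old step = %llu -> %llu MB if treated as pages\n",
                           old_step, (old_step << CFS_PAGE_SHIFT) >> 20);
                    printf("new step = %llu -> %llu MB if treated as pages\n",
                           new_step, (new_step << CFS_PAGE_SHIFT) >> 20);
                    return 0;
            }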

            keith Keith Mannthey (Inactive) added a comment - Do you have the parameters you were using for iozone? Do you have the raw data for all of the iozone output?

            Also, you say your disk was a RAM disk? I guess that explains why cached is so much slower (that is non-intuitive for spindles).

            Also, could you add the pre-4MB-BRW-patch data so we know what things looked like before the changes?

            jay Jinshan Xiong (Inactive) added a comment - There were some changes to the readahead code as a result of the 4MB patch. In particular, the changes in http://review.whamcloud.com/5230 were done by me because of PTLRPC_MAX_BRW_PAGES increasing to a large value. Instead of allowing the readahead window to grow immediately to potentially 32MB, the RAS_INCREASE_STEP is limited to (1 << inode->i_blkbits) = (1 << LL_MAX_BLKSIZE_BITS) = 4MB, instead of the previous 1MB value.

            I reverted this change but it didn't help. I'm now working on reverting the ptlrpc desc change; I guess this would be the major cause of the performance loss. Stay tuned.

            adilger Andreas Dilger added a comment - There were some changes to the readahead code as a result of the 4MB patch. In particular, the changes in http://review.whamcloud.com/5230 were done by me because of PTLRPC_MAX_BRW_PAGES increasing to a large value. Instead of allowing the readahead window to grow immediately to potentially 32MB, the RAS_INCREASE_STEP is limited to (1 << inode->i_blkbits) = (1 << LL_MAX_BLKSIZE_BITS) = 4MB, instead of the previous 1MB value.

            It would be a simple first test to see if changing LL_MAX_BLKSIZE_BITS to 20 or 21 fixes the read performance issue. This can't be the final fix, since it will affect st_blksize and the IO size that "cp" and other tools use, but it will at least help narrow down the problem.

            Probably a better long-term solution is to change inode->i_blkbits to be (2 * lmm_stripe_size) instead of being tied to PTLRPC_MAX_BRW_PAGES or LL_MAX_BLKSIZE_BITS. This originally worked in Lustre (lsm_xfersize), but was changed because of some minor NFS issues in https://bugzilla.lustre.org/show_bug.cgi?id=6062; it should be based on lmm_stripe_size again. The main difference compared to the old code is that the default lmm_xfersize should be 2x the default stripe size, so that it will match under normal usage (i.e. for SPEC SFS NFS testing), but it can be tuned on a per-file basis for applications that are badly behaved (e.g. https://bugzilla.lustre.org/show_bug.cgi?id=12739).
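
            As a rough sketch of the longer-term idea above (deriving the per-file block size from the stripe size rather than from a global constant), the standalone snippet below computes a blkbits-style value from 2x an assumed stripe size. The helper name and the rounding loop are illustrative only; this is not how Lustre actually sets i_blkbits.

            /* Illustrative sketch only -- not Lustre code.  Derives a blkbits value
             * from 2x the stripe size, rounded up to a power of two, as suggested
             * in the comment above. */
            #include <stdio.h>

            static unsigned int blkbits_from_stripe(unsigned long long stripe_size)
            {
                    unsigned long long xfersize = 2ULL * stripe_size;
                    unsigned int bits = 0;

                    while ((1ULL << bits) < xfersize)   /* round up to a power of two */
                            bits++;
                    return bits;
            }

            int main(void)
            {
                    /* a 1MB stripe (the common default) gives blkbits = 21, i.e. 2MB */
                    printf("1MB stripe -> blkbits %u\n", blkbits_from_stripe(1ULL << 20));
                    /* a 4MB stripe gives blkbits = 23, i.e. 8MB */
                    printf("4MB stripe -> blkbits %u\n", blkbits_from_stripe(1ULL << 22));
                    return 0;
            }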

            jay Jinshan Xiong (Inactive) added a comment - Also, I didn't see a significant performance loss for the write case.

            jay Jinshan Xiong (Inactive) added a comment - I confirmed that there is significant read performance degradation after the 4MB RPC patch landed; read performance dropped from ~3GB/s to ~1GB/s. Patch 5444 didn't help. I'm working on it.

            adilger Andreas Dilger added a comment - Jinshan, note that if you do find a read performance drop with the 4MB patch, it is worthwhile to see if Alex's http://review.whamcloud.com/5444 patch will fix it. This should reduce the memory pressure significantly.

            Also, it is worthwhile to do a Maloo search for the sanity-benchmark test_bonnie results and compare the times. It may be that the performance got worse for a time and then got better.

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9
