[LU-2816] sanity-benchmark test_bonnie slow after 4MB BRW RPC patch Created: 14/Feb/13  Updated: 13/Mar/13  Resolved: 13/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: HB

Attachments: PNG File Untitled.png    
Issue Links:
Related
is related to LU-2957 Proc tunable to control RAS_INCREASE_STEP Open
is related to LU-1431 Support for larger than 1MB sequentia... Resolved
Severity: 3
Rank (Obsolete): 6820

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/91d8ae78-755b-11e2-bf59-52540035b04c.

The sub-test test_bonnie failed with the following error:

test failed to respond and timed out

From the stdout log, I found the test ran extremely slowly compared to the successful run:

It took almost 45 minutes for the failed test:

16:44:39:== sanity-benchmark test bonnie: bonnie++ == 16:44:30 (1360629870)
16:44:39:min OST has 1809000kB available, using 3845408kB file size
16:44:39:debug=0
16:44:39:running as uid/gid/euid/egid 500/500/500/500, groups:
16:44:39: [touch] [/mnt/lustre/d0_runas_test/f16388]
16:44:39:running as uid/gid/euid/egid 500/500/500/500, groups:
16:44:39: [bonnie++] [-f] [-r] [0] [-s3755] [-n] [10] [-u500:500] [-d/mnt/lustre/d0.bonnie]
16:44:39:Using uid:500, gid:500.
16:47:56:Writing intelligently...done
17:29:18:Rewriting...done

Only 9 minutes for the successful test:

01:31:59:== sanity-benchmark test bonnie: bonnie++ == 01:31:52 (1359624712)
01:31:59:min OST has 1783256kB available, using 3845408kB file size
01:31:59:debug=0
01:31:59:running as uid/gid/euid/egid 500/500/500/500, groups:
01:31:59: [touch] [/mnt/lustre/d0_runas_test/f22196]
01:31:59:running as uid/gid/euid/egid 500/500/500/500, groups:
01:31:59: [bonnie++] [-f] [-r] [0] [-s3755] [-n] [10] [-u500:500] [-d/mnt/lustre/d0.bonnie]
01:31:59:Using uid:500, gid:500.
01:35:21:Writing intelligently...done
01:40:56:Rewriting...done


 Comments   
Comment by Andreas Dilger [ 15/Feb/13 ]

A test run that was fast before the 4MB patch landed (commit b7ee9ce814f37475cdfc6d4def83906b8b41c1d8):
https://maloo.whamcloud.com/sub_tests/e3aca9ba-7665-11e2-bc2f-52540035b04c

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
client-24vm2. 3755M           23662   6 12159   7           21428   5  64.2  23
Latency                        4517ms    3199ms              2569ms     655ms

After the 4MB patch landed, the read performance is much slower (commit 86097b23c959c50a81f7b60ce969bcfece8839aa):
https://maloo.whamcloud.com/sub_tests/cbd6a396-73f8-11e2-9df1-52540035b04c

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
client-29vm6. 3753M           26102  37  1526   4            1858   3 858.0  63
Latency                         430ms    1412ms             59478us   45500us

I'm not sure why the rewrite would be slower when the initial write performance is actually faster. It might have something to do with the returned st_blksize, or possibly with tying the readahead chunk size (RAS_INCREASE_STEP) to st_blksize instead of hard-coding it at 1MB.

Comment by Doug Oucharek (Inactive) [ 15/Feb/13 ]

Hi Jinshan,

This is a high-priority blocker. Can you determine whether this is something you can look at? If not, let me know.

Comment by Keith Mannthey (Inactive) [ 15/Feb/13 ]

The following stuck out at me on a quick review of this LU.

Both report (with a little spacing help):

OST has 1809000kB available, 
using   3845408kB file size

Are we creating a file that will not fit in the OST space for this test?

It seems this is the case for both the good and bad runs, but it struck me as odd.

Also, from the system log it looks like the OST was under memory pressure.

In the backtraces, pte_fault is a pretty common path for programs to be stuck in, and a few seconds after the system's showstate we see:

Feb 11 17:46:06 client-16vm4 kernel: cannot allocate a tage (399)
Feb 11 17:46:06 client-16vm4 kernel: cannot allocate a tage (399)
Feb 11 17:46:06 client-16vm4 kernel: cannot allocate a tage (399)
Feb 11 17:46:06 client-16vm4 kernel: cannot allocate a tage (399)
Feb 11 17:46:06 client-16vm4 kernel: cannot allocate a tage (399)
Feb 11 17:46:06 client-16vm4 kernel: cannot allocate a tage (399)

This system may have just been under memory pressure and swapping the whole time. It is hard to say for sure from the logs that it was not OOM, but things can slow down a lot before that point is reached.

Comment by Andreas Dilger [ 16/Feb/13 ]

Keith, note that the OST size reported is for one OST, but there are 3 OSTs in total, so roughly 3 x 1809000kB (about 5.4GB) is available for the 3845408kB file.

I agree that memory pressure can slow things down, and the 4MB patch will cause larger allocations. Alex has a patch that could help this.

I'm also wary about the readahead changes I had to make for this patch. They may be causing larger reads than needed. It might also relate to a larger st_blksize being reported to userspace for files (4MB vs. 2MB).
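
A quick way to check what the client actually reports is to stat a file on the mount from userspace. A minimal sketch, illustration only (the default path below is just an example):

/* Print the st_blksize the client reports to userspace.
 * Build with: gcc -o blksize blksize.c
 * Pass any file on the Lustre mount as the first argument. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
        struct stat st;
        const char *path = argc > 1 ? argv[1] : "/mnt/lustre/testfile";

        if (stat(path, &st) != 0) {
                perror("stat");
                return 1;
        }
        /* "cp" and other tools use this value as their IO size. */
        printf("%s: st_blksize = %ld bytes\n", path, (long)st.st_blksize);
        return 0;
}

Running it on the same file before and after the patch should show whether the reported block size moved from 2MB to 4MB.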

Comment by Sarah Liu [ 21/Feb/13 ]

Also seen in an interop test between a 2.3.0 server and a 2.4 client:
https://maloo.whamcloud.com/test_sets/2746a972-7572-11e2-93d9-52540035b04c

Comment by Jinshan Xiong (Inactive) [ 04/Mar/13 ]

I'm looking at this issue.

Comment by Jinshan Xiong (Inactive) [ 08/Mar/13 ]

Performance test with 1MB vs 4MB RPC and writethrough cache settings

Comment by Jinshan Xiong (Inactive) [ 08/Mar/13 ]

Please check the attached picture for 1MB vs. 4MB RPC w/ writethrough settings:

From this testing: if writethrough_cache is disabled on the OST side, performance does not drop significantly with the 4MB RPC size, roughly 10% less I would say. However, if writethrough_cache is enabled, I see a huge performance drop with both settings, and 4MB is worse.

So I guess this is all about memory pressure; memory is almost used up when writethrough cache is enabled.

About the testing environment: this test was run on rosso, with 16 CPUs and 48GB of memory. I used a memory disk on the OST and applied patch 5164 to saturate the QDR network. Iozone was used to generate the data with a 4MB block size.

Comment by Jinshan Xiong (Inactive) [ 08/Mar/13 ]

Initially I was trying to reproduce this issue, but I found the results interesting, so I completed the performance comparison under different settings.

I will continue working on this issue itself.

Comment by Andreas Dilger [ 09/Mar/13 ]

I suspect that the 4MB RPC size will be most interesting for DDN RAID-6 devices and similar. Running on a RAM disk will not show any of the seek overhead or the benefit of better streaming IO.

Comment by Andreas Dilger [ 09/Mar/13 ]

Jinshan, also note that the bug isn't about how to get 4MB performance to be the same as or better than 1MB performance, since the default is going to stay at 1MB for the 2.4 release. Rather, the performance at 1MB has gone down after the 4MB patch landed, even though the RPC size stayed at 1MB. Also, it may be that the performance loss is for reads and not writes.

Please compare performance from before and after the patch, and look at the test times for before/after in Maloo.

Maybe Alex's patch to dynamically allocate the OST memory has removed this problem already? I'm not sure if it recently landed, or is still to be landed...

Comment by Jinshan Xiong (Inactive) [ 09/Mar/13 ]

I see. I will perform the test. I reset the tree to before commit 0f8b7f951e7f43dbf389a431afea2091388a805e (LU-1431 ptlrpc: Support for over 1MB bulk I/O RPC) but didn't see much difference in write performance with 1MB RPCs. I will try reads.

Comment by Andreas Dilger [ 10/Mar/13 ]

Jinshan, note that if you do find a read performance drop with the 4MB patch, it is worthwhile to see if Alex's patch http://review.whamcloud.com/5444 will fix it. This should reduce the memory pressure significantly.

Also, it is worthwhile to do a Maloo search for the sanity-benchmark test_bonnie results and compare the times. It may be that the performance got worse for a time and then got better.

Comment by Jinshan Xiong (Inactive) [ 11/Mar/13 ]

I confirmed that there is a significant read performance degradation after the 4MB RPC patch landed; read performance dropped from ~3GB/s to ~1GB/s. Patch 5444 didn't help. I'm working on it.

Comment by Jinshan Xiong (Inactive) [ 11/Mar/13 ]

Also, I didn't see a significant performance loss for the write case.

Comment by Andreas Dilger [ 11/Mar/13 ]

There were some changes to the readahead code as a result of the 4MB patch. In particular, the changes in http://review.whamcloud.com/5230 were done by me because of PTLRPC_MAX_BRW_PAGES increasing to a large value. Instead of allowing the readahead window to grow immediately to potentially 32MB, the RAS_INCREASE_STEP is limited to (1 << inode->i_blkbits) = (1 << LL_MAX_BLKSIZE_BITS) = 4MB, instead of the previous 1MB value.

It would be a simple first test to see if changing LL_MAX_BLKSIZE_BITS to 20 or 21 fixes the read performance issue. This can't be the final fix, since this will affect st_blksize and the IO size that "cp" and other tools use, but it will at least help narrow down the problem.
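
A sketch of that quick test, purely for illustration: the paragraph above implies the constant is currently 22 (4MB), but the header where LL_MAX_BLKSIZE_BITS is defined isn't quoted here, so treat the location as an assumption.

/* Experiment only, not a fix: drop the maximum client block size from 4MB
 * back to 1MB and re-run bonnie++ to see whether the read regression follows. */
-#define LL_MAX_BLKSIZE_BITS     22      /* 4MB, current value per the comment above */
+#define LL_MAX_BLKSIZE_BITS     20      /* 1MB, test value suggested above */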

Probably a better long-term solution is to change inode->i_blkbits to be based on (2 * lmm_stripe_size) instead of being tied to PTLRPC_MAX_BRW_PAGES or LL_MAX_BLKSIZE_BITS. This was originally how Lustre worked (lsm_xfersize), but it was changed because of some minor NFS issues in https://bugzilla.lustre.org/show_bug.cgi?id=6062; it should be based on lmm_stripe_size again. The main difference compared to the old code is that the default lmm_xfersize should be 2x the default stripe size, so that it will match under normal usage (i.e. for SPEC SFS NFS testing), but it can be tuned on a per-file basis for applications that are badly behaved (e.g. https://bugzilla.lustre.org/show_bug.cgi?id=12739).
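
A hedged sketch of that long-term idea, not actual Lustre code; the helper name is made up and it assumes the stripe size is a power of two.

/* Illustration: derive i_blkbits from the file's stripe size so that
 * st_blksize follows the layout (2x stripe size) instead of a global
 * LL_MAX_BLKSIZE_BITS constant. */
#include <stdio.h>

static unsigned int blkbits_from_stripe(unsigned long stripe_size)
{
        unsigned long xfersize = 2 * stripe_size;       /* default: 2x stripe size */
        unsigned int bits = 0;

        while ((1UL << bits) < xfersize)
                bits++;
        return bits;
}

int main(void)
{
        /* 1MB stripe -> i_blkbits 21 -> 2MB st_blksize, matching the old default */
        printf("1MB stripe -> i_blkbits %u\n", blkbits_from_stripe(1UL << 20));
        /* 4MB stripe -> i_blkbits 23 -> 8MB st_blksize, tunable per file */
        printf("4MB stripe -> i_blkbits %u\n", blkbits_from_stripe(1UL << 22));
        return 0;
}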

Comment by Jinshan Xiong (Inactive) [ 11/Mar/13 ]

There were some changes to the readahead code as a result of the 4MB patch. In particular, the changes in http://review.whamcloud.com/5230 were done by me because of PTLRPC_MAX_BRW_PAGES increasing to a large value. Instead of allowing the readahead window to grow immediately to potentially 32MB, the RAS_INCREASE_STEP is limited to (1 << inode->i_blkbits) = (1 << LL_MAX_BLKSIZE_BITS) = 4MB, instead of the previous 1MB value.

I reverted this change but it didn't help. I'm now working on reverting the ptlrpc desc change; I guess that could be the major cause of the performance loss. Stay tuned.

Comment by Keith Mannthey (Inactive) [ 11/Mar/13 ]

Do you have the parameters you were using for iozone? Do you have the raw data for all of the iozone output?

Also, you say your disk was a RAM disk? I guess that explains why the cached case is so much slower (that would be non-intuitive for spindles).

Also, could you add the data from before the 4MB BRW patch so we know what things looked like before the changes?

Comment by Jinshan Xiong (Inactive) [ 12/Mar/13 ]

I've found the root cause of this problem - it's a simple one:

diff --git a/lustre/llite/rw.c b/lustre/llite/rw.c
index cbd1c91..8cfa70e 100644
--- a/lustre/llite/rw.c
+++ b/lustre/llite/rw.c
@@ -576,7 +576,7 @@ static int ll_read_ahead_page(const struct lu_env *env, struct cl_io *io,
  * know what the actual RPC size is.  If this needs to change, it makes more
  * sense to tune the i_blkbits value for the file based on the OSTs it is
  * striped over, rather than having a constant value for all files here. */
-#define RAS_INCREASE_STEP(inode) (1UL << inode->i_blkbits)
+#define RAS_INCREASE_STEP(inode) (1UL << (inode->i_blkbits - CFS_PAGE_SHIFT))
 
 static inline int stride_io_mode(struct ll_readahead_state *ras)
 {

But there is still a 10%~20% performance loss on reads after applying this patch; I'd like to spend more time on it.
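
To make the units problem concrete, here is a standalone arithmetic sketch (illustration only, assuming 4KB pages and i_blkbits = 22, i.e. a 4MB block size):

/* RAS_INCREASE_STEP is consumed as a page count, but the pre-fix macro
 * produced a byte count, inflating the readahead step enormously. */
#include <stdio.h>

#define PAGE_SHIFT      12      /* 4KB pages (assumption) */
#define BLKBITS         22      /* i_blkbits for a 4MB block size */

int main(void)
{
        unsigned long long buggy = 1ULL << BLKBITS;                  /* bytes, misread as pages */
        unsigned long long fixed = 1ULL << (BLKBITS - PAGE_SHIFT);   /* pages, as intended */

        printf("buggy step: %llu pages = %llu MB of readahead\n",
               buggy, (buggy << PAGE_SHIFT) >> 20);
        printf("fixed step: %llu pages = %llu MB of readahead\n",
               fixed, (fixed << PAGE_SHIFT) >> 20);
        return 0;
}

With those assumptions the buggy macro asks for 4194304 pages (16GB) per step instead of 1024 pages (4MB).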

Comment by Jinshan Xiong (Inactive) [ 12/Mar/13 ]

I did some research and it turns out that we need to adjust max_read_ahead_mb and max_read_ahead_per_file_mb to match the 4MB RPC size and the 4MB increase step. Otherwise the readahead budget gets used up and produces a huge number of RA_STAT_MAX_IN_FLIGHT events. After I increased those two parameters I didn't see a significant performance loss on reads.

However, adjusting those two parameters needs a lot of work to make sure they are okay with other use cases. So I'd like to set RAS_INCREASE_STEP back to 1MB; after 4MB RPCs are enabled by default, we should spend more time on that.
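
A rough illustration of why the readahead budget runs out with the larger step; the 40MB budget below is only an example figure, not the confirmed default for max_read_ahead_per_file_mb:

/* With a fixed readahead budget, a 4MB increase step exhausts the window in
 * a quarter of the increases a 1MB step allows; once the budget is gone,
 * further readahead is refused and RA_STAT_MAX_IN_FLIGHT climbs. */
#include <stdio.h>

int main(void)
{
        unsigned int budget_mb = 40;            /* example value only */
        unsigned int steps_mb[] = { 1, 4 };
        unsigned int i;

        for (i = 0; i < 2; i++)
                printf("%uMB step: window full after %u increases\n",
                       steps_mb[i], budget_mb / steps_mb[i]);
        return 0;
}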

Comment by Jinshan Xiong (Inactive) [ 12/Mar/13 ]

Patch is at: http://review.whamcloud.com/5691

Comment by Jinshan Xiong (Inactive) [ 13/Mar/13 ]

Let's fix it for now. The readahead performance tweak will be addressed in another ticket.
