[LU-410] Performance concern with "Shrink file_max_cache_size to alleviate the memory pressure of OST" patch for LU-15 Created: 13/Jun/11  Updated: 05/Jan/21  Resolved: 10/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: John Salinas (Inactive) Assignee: Di Wang
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Tested on a DDN SFA 10K with InfiniBand and the patches for LU-15 applied to 1.8.4. This testing started before 1.8.6 was tagged.


Issue Links:
Related
is related to LU-918 ensure that BRW requests prevent lock... Closed
is related to LU-12071 bypass pagecache for large files Resolved
Severity: 2
Epic: performance
Rank (Obsolete): 8548

 Description   

Running obdfilter-survey with the LU-15 patches on a DDN SFA 10K, it appeared that the IO from Lustre to disk was not aligned, because the observed sizes were 1020K and 4K. Once the file size exceeded the cache, the performance issue was very apparent. Setting vm.min_free_kbytes did not help this performance issue at all. For example, using obdfilter-survey to write an 8GB file to each OST would show approximately 30% unaligned I/O. The alignment issue was seen by observing cache statistics on the DDN SFA 10K controller.

Once we remove the shrink file_max_cache patch [define FILTER_MAX_CACHE_SIZE (8 * 1024 * 1024)], the alignment issue goes away. The unaligned IO seems to be caused by this change in the patch; once I changed cache_file_size back to 18446744073709551615 (the 1.8.4 and 1.8.5 default), all IO arrived at the SFA10K as aligned I/O.

Disabling the read cache (lctl set_param obdfilter.*.read_cache_enable=0) doesn't help, which is still very strange to me.

The only workaround we have found is changing cache_file_size to a large value; this is the only way we know to avoid this issue on 1.8.6WC. This could have other performance implications as well.

We hope to post some numbers and statistics, but we need additional runs to gather that information.
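
For reference, a minimal sketch of the tunables and the workaround described above, assuming an OSS running Lustre 1.8.x (the parameter names and values are the ones quoted later in this ticket):

# Check the per-file limit for the OSS read cache (the 1.8.6 default is 8388608)
lctl get_param obdfilter.*.readcache_max_filesize

# Workaround: restore the 1.8.4/1.8.5 default (effectively unlimited)
lctl set_param obdfilter.*.readcache_max_filesize=18446744073709551615

# Mitigation attempted earlier: keep more free memory to reduce page fragmentation
sysctl -w vm.min_free_kbytes=2097152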



 Comments   
Comment by Di Wang [ 13/Jun/11 ]

Does brw_stats show similar information? Could you please post the parameters you use for obdfilter-survey?

Comment by Di Wang [ 13/Jun/11 ]

If you did not set tests_str in obdfilter-survey, it should run in 3 phases, "write, rewrite, read". In which phase did you see this unaligned IO, or was it in all three phases? Thanks.

Comment by Shuichi Ihara (Inactive) [ 13/Jun/11 ]

I will send you brw_stats and more statistics later, but this problem doesn't happen only during obdfilter-survey; we also saw unaligned IO during writes when running IOR from the Lustre clients.

Comment by Shuichi Ihara (Inactive) [ 18/Jun/11 ]

Sorry for the late response.
Just tested this again. The test environment is very simple: 1 x SFA10K and 1 x OSS. I created a new OST on the SFA10K and started it on a single OSS, then ran obdfilter-survey. I used the latest 1.8.6WC.rc build. With readcache_max_filesize=8388608, we see many 4K I/Os in brw_stats compared with readcache_max_filesize=18446744073709551615.

Also, with readcache_max_filesize=8388608, I saw many unaligned I/Os (aligned I/O means 1MB x N) on the SFA10K (IO sizes collected inside the SFA10K). It kills write performance if many unaligned I/Os arrive instead of aligned I/Os, because the SFA10K caches I/O when the IO size is not aligned. That is why I see better performance with readcache_max_filesize=18446744073709551615 than with readcache_max_filesize=8388608.

Please see the detailed results below.

# lctl get_param obdfilter.*.readcache_max_filesize
obdfilter.lustre-OST0000.readcache_max_filesize=8388608

# nobjlo=2 nobjhi=2 thrlo=128 thrhi=128 tests_str="write read" /usr/bin/obdfilter-survey
Sat Jun 18 21:11:04 JST 2011 Obdfilter-survey for case=disk from r01
ost  1 sz 16777216K rsz 1024K obj    2 thr  128 write  478.35 [ 382.47, 638.33] read  761.16 [ 666.28, 956.03] 
done!

# cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats 
snapshot_time:         1308399155.300890 (secs.usecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs  % cum %
64:		         2   0   0   |    0   0   0
128:		         1   0   0   |    0   0   0
256:		     16367  99 100   | 16384 100 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs  % cum %
0:		     16370 100 100   | 16384 100 100

                           read      |     write
discontiguous blocks   rpcs  % cum % |  rpcs  % cum %
0:		     16370 100 100   | 16384 100 100

                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios   % cum %
1:		     14073  85  85   | 14431  88  88
2:		      2297  14 100   | 1953  11 100

                           read      |     write
disk I/Os in flight    ios   % cum % |  ios   % cum %
1:		         5   0   0   |   42   0   0
2:		         5   0   0   |   74   0   0
3:		         6   0   0   |   80   0   1
4:		         6   0   0   |  104   0   1
5:		         6   0   0   |  107   0   2
6:		         6   0   0   |  101   0   2
7:		         6   0   0   |   99   0   3
8:		         6   0   0   |  111   0   3
9:		         7   0   0   |  116   0   4
10:		         7   0   0   |  131   0   5
11:		         4   0   0   |  132   0   5
12:		         6   0   0   |  132   0   6
13:		         4   0   0   |  138   0   7
14:		         4   0   0   |  144   0   8
15:		         2   0   0   |  162   0   9
16:		         2   0   0   |  243   1  10
17:		         2   0   0   |  260   1  11
18:		         2   0   0   |  247   1  13
19:		         2   0   0   |  212   1  14
20:		         2   0   0   |  205   1  15
21:		         2   0   0   |  191   1  16
22:		         3   0   0   |  186   1  17
23:		         6   0   0   |  189   1  18
24:		         5   0   0   |  201   1  19
25:		         2   0   0   |  213   1  20
26:		         3   0   0   |  205   1  21
27:		         2   0   0   |  200   1  23
28:		         4   0   0   |  196   1  24
29:		         5   0   0   |  194   1  25
30:		         4   0   0   |  191   1  26
31:		     18541  99 100   | 13531  73 100

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios   % cum %
1:		        17   0   0   |  139   0   0
2:		        24   0   0   |  106   0   1
4:		        16   0   0   |  102   0   2
8:		        23   0   0   |  120   0   2
16:		        29   0   0   |  216   1   4
32:		       123   0   1   |  791   4   8
64:		       912   5   6   | 3504  21  30
128:		      4002  24  31   | 5916  36  66
256:		      9809  59  91   | 4535  27  94
512:		      1339   8  99   |  927   5  99
1K:		        76   0 100   |   28   0 100

                           read      |     write
disk I/O size          ios   % cum % |  ios   % cum %
4K:		      2297  12  12   | 1953  10  10
8K:		         0   0  12   |    0   0  10
16K:		         0   0  12   |    0   0  10
32K:		         0   0  12   |    0   0  10
64K:		         0   0  12   |    0   0  10
128K:		         0   0  12   |    0   0  10
256K:		         2   0  12   |    0   0  10
512K:		         1   0  12   |    0   0  10
1M:		     16367  87 100   | 16384  89 100

Also, I collected SFA-side statistics, so we can see what IO sizes are coming from the host.

----------------------------------------------------
Length           Port 0                 Port 1                 
Kbytes      Reads      Writes      Reads      Writes
----------------------------------------------------
     4          0           0       1211        2775 <------ many 4K I/O
     8          0           0          0          10
    12          0           0          0          15
    16          0           0          0           7
    20          0           0          0          14
    24          0           0          0         183
    28          0           0          0         546
    32          0           0          0         325
    36          0           0          0          36
    40          0           0          0           1
    44          0           0          0           4
    52          0           0          0           1
   208          0           0          1           0
   240          0           0          1           0
   416          0           0          1           0
   604          0           0          0           1
   640          0           0          1           0
   900          0           0          1           0
  1020          0           0       2297        1953 <------ many not aligned IO
  1024          0           0       9528       10484
  1028          0           0       1021         431 <------
  2048          0           0       1149        1150
  2052          0           0         62          88 <------
  3072          0           0        231         230 <------
  3076          0           0         11          14
  4092          0           0         86          69
# lctl set_param obdfilter.*.readcache_max_filesize=18446744073709551615
obdfilter.lustre-OST0000.readcache_max_filesize=18446744073709551615

# nobjlo=2 nobjhi=2 thrlo=128 thrhi=128 tests_str="write read" /usr/bin/obdfilter-survey
Sat Jun 18 21:17:25 JST 2011 Obdfilter-survey for case=disk from r01
ost  1 sz 16777216K rsz 1024K obj    2 thr  128 write  616.80 [ 469.54, 772.31] read  827.44 [ 767.28, 858.09] 
done!

# cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats 
snapshot_time:         1308399550.63654 (secs.usecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs  % cum %
256:		     16317 100 100   | 16384 100 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs  % cum %
0:		     16317 100 100   | 16384 100 100

                           read      |     write
discontiguous blocks   rpcs  % cum % |  rpcs  % cum %
0:		     16317 100 100   | 16384 100 100

                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios   % cum %
1:		     16203  99  99   | 16364  99  99
2:		       114   0 100   |   20   0 100

                           read      |     write
disk I/Os in flight    ios   % cum % |  ios   % cum %
1:		         4   0   0   |    4   0   0
2:		         2   0   0   |    5   0   0
3:		         1   0   0   |    4   0   0
4:		         1   0   0   |    6   0   0
5:		         1   0   0   |    3   0   0
6:		         1   0   0   |    5   0   0
7:		         3   0   0   |    2   0   0
8:		         7   0   0   |    6   0   0
9:		         4   0   0   |    4   0   0
10:		         3   0   0   |    3   0   0
11:		         2   0   0   |    7   0   0
12:		         3   0   0   |    5   0   0
13:		         4   0   0   |    6   0   0
14:		         3   0   0   |    8   0   0
15:		         3   0   0   |   11   0   0
16:		         6   0   0   |   19   0   0
17:		         6   0   0   |   26   0   0
18:		         3   0   0   |   30   0   0
19:		         3   0   0   |   43   0   1
20:		         3   0   0   |   52   0   1
21:		         4   0   0   |   54   0   1
22:		         2   0   0   |   51   0   2
23:		         2   0   0   |   53   0   2
24:		         4   0   0   |   69   0   2
25:		         2   0   0   |   75   0   3
26:		         4   0   0   |   65   0   3
27:		         8   0   0   |   74   0   4
28:		         3   0   0   |   87   0   4
29:		         3   0   0   |   94   0   5
30:		         8   0   0   |   94   0   5
31:		     16328  99 100   | 15439  94 100

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios   % cum %
1:		         0   0   0   |    7   0   0
2:		         0   0   0   |    1   0   0
4:		         3   0   0   |    0   0   0
8:		         1   0   0   |    7   0   0
16:		        22   0   0   |   14   0   0
32:		        71   0   0   |  117   0   0
64:		       993   6   6   | 1362   8   9
128:		      4603  28  34   | 8467  51  60
256:		      9657  59  94   | 6227  38  98
512:		       927   5  99   |  182   1 100
1K:		        40   0 100   |    0   0 100

                           read      |     write
disk I/O size          ios   % cum % |  ios   % cum %
4K:		       114   0   0   |   20   0   0
8K:		         0   0   0   |    0   0   0
16K:		         0   0   0   |    0   0   0
32K:		         0   0   0   |    0   0   0
64K:		         0   0   0   |    0   0   0
128K:		         0   0   0   |    0   0   0
256K:		         0   0   0   |    0   0   0
512K:		         0   0   0   |    0   0   0
1M:		     16317  99 100   | 16384  99 100


SFA statistics

----------------------------------------------------
Length           Port 0                 Port 1                 
Kbytes      Reads      Writes      Reads      Writes
----------------------------------------------------
     4          0           0         42         823
     8          0           0          0           5
    12          0           0          0           1
    16          0           0          0           1
    20          0           0          0           9
    24          0           0          0          71
    28          0           0          0         309
    32          0           0          0         234
    36          0           0          0          25
    40          0           0          0           2
    44          0           0          0           1
    52          0           0          0           1
    56          0           0          0           1
    60          0           0          0           1
  1020          0           0        114          20 <------ only 20 times
  1024          0           0       2148        3246 
  1028          0           0         45           3
  2048          0           0       1062        1004
  2052          0           0         11           0
  3072          0           0        821         583
  3076          0           0          5           1
  4092          0           0       1686        1519
Comment by Di Wang [ 19/Jun/11 ]

Thanks for the information. It seems the IO is somehow fragmented with readcache_max_filesize=8388608. Usually that means the extent allocation is not contiguous on disk (mballoc does not perform well in this case).

                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios   % cum %
1:                   14073  85  85   | 14431  88  88
2:                    2297  14 100   |  1953  11 100

Hmm, I do not understand why this is related to max_filesize. I will investigate further to see what is going on here. Thanks.

Comment by Di Wang [ 20/Jun/11 ]

It seems the problem is that MAX_HW_SEGMENTS on the DDN SFA 10K is < 256? It is usually 128 for most devices; Lustre actually carries a kernel patch to change that value (blkdev_tunables-2.6-rhel5.patch), but unfortunately that patch does not work for every device. Could you please check what your max_hw_segments setting is in this test?

If it is < 256, then the Lustre server might generate fragmented IO here, especially when pages are fragmented, i.e. not physically contiguous. (For example, if max_hw_segments is 128, then to issue a 1M IO (256 pages) you would need at least half of the 256 pages to merge into contiguous runs so that there are at most 128 segments; otherwise the IO is split into two smaller IOs.)

This also explains why you see less fragmented IO with a big read cache: those pages are not being created/released frequently, so there is less chance to fragment memory, i.e. pages are more likely to be physically contiguous in this case.
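
For anyone checking this: on kernels new enough to expose the block queue limits in sysfs, the effective scatter/gather segment limit can be read directly. A hedged example, assuming the OST device is sdb (adjust the device name for your setup):

# Maximum number of scatter/gather segments the block layer will use per request
cat /sys/block/sdb/queue/max_segments
# Maximum request size in KB, for comparison with the 1M (256-page) bulk IO size
cat /sys/block/sdb/queue/max_hw_sectors_kb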

Comment by Di Wang [ 21/Jun/11 ]

To prove this idea I did a test on a SATA drive. Default max_hw_segments = 128.

With readcache_max_filesize = 18446744073709551615:
[root@testnode obdfilter-survey]# cat /proc/fs/lustre/obdfilter/lustre-OST0001/readcache_max_filesize
18446744073709551615
[root@testnode obdfilter-survey]# nobjlo=2 nobjhi=2 thrlo=128 thrhi=128 tests_str="write read" ./obdfilter-survey
Mon Jun 20 08:03:19 MST 2011 Obdfilter-survey for case=disk from testnode
ost 1 sz 16777216K rsz 1024K obj 2 thr 128 write 82.67 [ 72.87, 90.77] read 80.89 [ 73.86, 87.93]
done!

[root@testnode obdfilter-survey]# cat /proc/fs/lustre/obdfilter/lustre-OST0001/brw_stats

.....

                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios   % cum %
0:                      17   0   0   |     0   0   0
1:                   14950  91  91   | 15958  97  97
2:                    1385   8 100   |   426   2 100
......

With readcache_max_filesize = 8388608:
[root@testnode obdfilter-survey]# cat /proc/fs/lustre/obdfilter/lustre-OST0001/readcache_max_filesize
8388608

[root@testnode obdfilter-survey]# nobjlo=2 nobjhi=2 thrlo=128 thrhi=128 tests_str="write read" ./obdfilter-survey
Mon Jun 20 07:37:39 MST 2011 Obdfilter-survey for case=disk from testnode
ost 1 sz 16777216K rsz 1024K obj 2 thr 128 write 72.44 SHORT read 78.38 [ 56.95, 83.93]
done!

......
                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios   % cum %
0:                       3   0   0   |     0   0   0
1:                    7108  91  91   | 10630  64  64
2:                     663   8 100   |  5754  35 100

......

Then I applied a patch to change the default max_hw_segments to 256:
--- include/linux/ata.h.old	2011-06-21 06:44:26.000000000 -0700
+++ include/linux/ata.h	2011-06-21 05:40:11.000000000 -0700
@@ -38,7 +38,8 @@
 enum {
 	/* various global constants */
 	ATA_MAX_DEVICES		= 2,	/* per bus/port */
-	ATA_MAX_PRD		= 256,	/* we could make these 256/256 */
+	//ATA_MAX_PRD		= 256,	/* we could make these 256/256 */
+	ATA_MAX_PRD		= 512,	/* we could make these 256/256 */
 	ATA_SECT_SIZE		= 512,
 	ATA_MAX_SECTORS_128	= 128,
 	ATA_MAX_SECTORS		= 256,

Then redo the test again
[root@testnode lustre]# cat /proc/fs/lustre/obdfilter/lustre-OST0000/readcache_max_filesize
8388608

[root@testnode obdfilter-survey]#
Tue Jun 21 06:10:26 MST 2011 Obdfilter-survey for case=disk from testnode
ost 1 sz 16777216K rsz 1024K obj 2 thr 128 write 81.05 SHORT read 81.21 [ 68.88, 88.93]
done!

[root@testnode lustre]# cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats

......
                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios   % cum %
0:                       3   0   0   |     0   0   0
1:                   16378  99 100   | 16384 100 100
.......

Comment by Shuichi Ihara (Inactive) [ 21/Jun/11 ]

That's what I was also thinking. The current OSS-to-SFA connection is SRP (SCSI RDMA Protocol) over QDR InfiniBand. The maximum number of scatter/gather entries per I/O in SRP is 255 (the default is 12). We can set this parameter with srp_sg_tablesize in the ib_srp module, but it can only go up to 255 descriptors. This means that, in order to send a 1M I/O to the SFA10K, the OSS sends two requests to the SFA: one of 1020K (4K x 255 descriptors) and another of 4K.

My understanding is that normally the SFA10K receives the two requests (1020K + 4K) together as a single I/O and handles it as a full-stripe I/O, but if the OSS's memory is heavily used, the two requests arrive fragmented and the SFA10K handles them as separate I/O requests. In this situation we saw many 1020K and 4K requests on the SFA10K.

In order to prevent this situation we set vm.min_free_kbytes to keep free memory available and avoid fragmenting the two requests, but it's not perfect.

However, it didn't help with the issue we see when readcache_max_filesize=8388608. Anyway, the two issues might be related, both caused by srp_sg_tablesize=255.

I will try the same testing on an SFA10K with 8Gbps FC (Fibre Channel), which can send an actual 1M I/O (256 scatter/gather entries), and then let's see what happens.

Comment by Matt Ezell [ 21/Jun/11 ]

Was this on a RedHat 5 OSS? Linux kernel 2.6.24 merged in SG list chaining, which could help the situation. Do you have access to RedHat 6 to test?

Comment by Cliff White (Inactive) [ 21/Jun/11 ]

I think I can now confirm this on Hyperion.

With # hyperion1154 /root > cat /proc/fs/lustre/obdfilter/lustre-OST0000/readcache_max_filesize
18446744073709551615

0000: Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
0000: --------- --------- --------- ---------- ------- --------- --------- ---------- ------- -------
0000: write 1488.54 1466.58 1479.54 9.39 1488.54 1466.58 1479.54 9.39 89.2852
0000: read 1221.81 1201.83 1210.44 8.39 1221.81 1201.83 1210.44 8.39 109.1356
0000:
0000: Max Write: 1488.54 MiB/sec (1560.85 MB/sec)
0000: Max Read: 1221.81 MiB/sec (1281.16 MB/sec)
0000:
0000: Run finished: Tue Jun 21 13:48:15 2011

Previous run with readcache_max_filesize=8M

0000: Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)Op grep #Tasks tPN reps fPP reord
0000: --------- --------- --------- ---------- ------- --------- --------- ---------- ------- -------
0000: write 955.17 944.80 950.17 4.24 955.17 944.80 950.17 4.24 139.02702 1032 8 3 1 1 1 0 0 1 134217728
0000: read 893.92 876.00 881.99 8.44 893.92 876.00 881.99 8.44 149.78426 1032 8 3 1 1 1 0 0 1 134217728
0000:
0000: Max Write: 955.17 MiB/sec (1001.57 MB/sec)
0000: Max Read: 893.92 MiB/sec (937.34 MB/sec)
0000:
0000: Run finished: Sat Jun 18 21:41:28 2011

Comment by Cliff White (Inactive) [ 21/Jun/11 ]

Ah, that is the case on Hyperion:

options ib_srp srp_sg_tablesize=255

is set.
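
For reference, a hedged way to confirm what the loaded initiator is actually using, assuming ib_srp exposes the parameter read-only under /sys/module (the options line is the one quoted above):

# Value the SRP initiator was loaded with
cat /sys/module/ib_srp/parameters/srp_sg_tablesize
# Persistent setting, e.g. in /etc/modprobe.conf or /etc/modprobe.d/ib_srp.conf
options ib_srp srp_sg_tablesize=255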

Comment by Shuichi Ihara (Inactive) [ 22/Jun/11 ]

Just tested on an SFA10K (FC model); there was no fragmentation.
So, if we use SRP between the SFA10K and the OSS, we should at least keep a large readcache_max_filesize until the SFA and the SRP initiator support FMR to send/read large I/Os in a single request. My understanding is that this development and improvement are in progress at DDN and ORNL, so it should be supported very soon.

# lctl get_param obdfilter.*.readcache_max_filesize
obdfilter.lustre-OST0000.readcache_max_filesize=8388608

# nobjlo=2 nobjhi=2 thrlo=128 thrhi=128 tests_str="write read" /usr/bin/obdfilter-survey
Thu Jun 23 00:54:32 JST 2011 Obdfilter-survey for case=disk from r13
ost  1 sz 16777216K rsz 1024K obj    2 thr  128 write  613.75 [ 534.49, 763.64] read  690.07 [ 630.40, 732.31] 
done!

# cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats 
snapshot_time:         1308758136.955857 (secs.usecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs  % cum %
16:		         1   0   0   |    0   0   0
32:		         3   0   0   |    0   0   0
64:		         3   0   0   |    0   0   0
128:		         2   0   0   |    0   0   0
256:		     16372  99 100   | 16384 100 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs  % cum %
0:		     16381 100 100   | 16384 100 100

                           read      |     write
discontiguous blocks   rpcs  % cum % |  rpcs  % cum %
0:		     16381 100 100   | 16384 100 100

                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios   % cum %
1:		     16381 100 100   | 16384 100 100

                           read      |     write
disk I/Os in flight    ios   % cum % |  ios   % cum %
1:		         2   0   0   |    2   0   0
2:		         2   0   0   |    2   0   0
3:		         1   0   0   |    2   0   0
4:		         1   0   0   |    1   0   0
5:		         3   0   0   |    3   0   0
6:		         2   0   0   |    3   0   0
7:		         1   0   0   |    3   0   0
8:		         1   0   0   |    3   0   0
9:		         1   0   0   |    5   0   0
10:		         2   0   0   |    9   0   0
11:		         1   0   0   |    7   0   0
12:		         1   0   0   |    7   0   0
13:		         1   0   0   |    5   0   0
14:		         2   0   0   |    4   0   0
15:		         1   0   0   |    7   0   0
16:		         1   0   0   |    9   0   0
17:		         2   0   0   |   10   0   0
18:		         1   0   0   |   29   0   0
19:		         1   0   0   |  296   1   2
20:		         3   0   0   |  175   1   3
21:		         2   0   0   |   10   0   3
22:		         1   0   0   |    7   0   3
23:		         1   0   0   |   14   0   3
24:		         2   0   0   |    5   0   3
25:		         1   0   0   |    8   0   3
26:		         2   0   0   |    7   0   3
27:		         1   0   0   |   14   0   3
28:		         3   0   0   |   10   0   4
29:		         5   0   0   |   11   0   4
30:		         1   0   0   |    5   0   4
31:		     16332  99 100   | 15711  95 100

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios   % cum %
4:		         1   0   0   |    0   0   0
8:		         2   0   0   |    0   0   0
16:		         4   0   0   |   22   0   0
32:		        31   0   0   |  528   3   3
64:		        80   0   0   | 4734  28  32
128:		     11400  69  70   | 4779  29  61
256:		      1685  10  80   | 5893  35  97
512:		       794   4  85   |  225   1  98
1K:		      2384  14 100   |   25   0  98
2K:		         0   0 100   |    0   0  98
4K:		         0   0 100   |   83   0  99
8K:		         0   0 100   |   95   0 100

                           read      |     write
disk I/O size          ios   % cum % |  ios   % cum %
64K:		         1   0   0   |    0   0   0
128K:		         3   0   0   |    0   0   0
256K:		         3   0   0   |    0   0   0
512K:		         2   0   0   |    0   0   0
1M:		     16372  99 100   | 16384 100 100

WangDi,
is there no impact on LU-15 even if we change readcache_max_filesize to 18446744073709551615 from 1.8.6's default of 8388608?

Comment by Di Wang [ 22/Jun/11 ]

There might be some impact. The reason we shrank readcache_max_filesize is that if the OSS caches too many pages, some metadata (for example, block group information) may be evicted from memory frequently, which is very bad for new extent allocation, especially as the OST becomes full. But you can always tell customers to shrink this value if they hit that issue (please check LU-15 for details). Otherwise, just keeping a big readcache_max_filesize might be a temporary solution for this.

Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/ChangeLog
  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/ChangeLog
  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/ChangeLog
  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/ChangeLog
  • lustre/obdfilter/filter_internal.h
Comment by Cory Spitz [ 22/Jun/11 ]

Di Wang wrote:
"This also explain why you see less fragmented IO with big readcache, because these pages are not being create/release frequently, so it will have less chance to fragment pages, i.e. pages are more physically contiguous in this case."

So, is the theory that the call into the kernel to truncate_inode_pages_range(), which releases pages one at a time, causes memory to become quickly fragmented? If true, then setting the readcache_max_filesize=1GiB and running the same testcase should hopefully result in more 1MiB I/Os from the SRP initiator. Can we easily prove that the memory an OSS acquires for bulk read/write data is less physically fragmented when readcache_max_filesize=-1?

Comment by Di Wang [ 22/Jun/11 ]

Yes, getting/releasing pages frequently (e.g. via truncate_inode_pages_range) will fragment memory. Actually, I had hoped to find an API to allocate contiguous pages for bulk read/write, but there seems to be no such API.

Yes, if you set readcache_max_filesize=1G, you should expect more 1MB IOs (though the test writes 16G to each object). And yes, if you set readcache_max_filesize=-1, you should expect less page fragmentation, IMHO.

Comment by Cory Spitz [ 22/Jun/11 ]

From the description John Salinas wrote:
"Disabling the read cache (lctl set_param=obdfilter.*.read_cache_enable=0) doesn't help which is still very strange to me."

Did you disable the writethrough_cache as well? If you kept the writethrough cache enabled and performed writes then they would still stay in the cache as filter_release_cache() wouldn't be called before returning from filter_commitrw_write().
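
For completeness, a hedged sketch of disabling both OSS caches for such a test, assuming writethrough_cache_enable is the companion tunable to read_cache_enable on this release:

# Disable the OSS read cache and the writethrough (write) cache on all OSTs
lctl set_param obdfilter.*.read_cache_enable=0
lctl set_param obdfilter.*.writethrough_cache_enable=0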

Comment by Cliff White (Inactive) [ 22/Jun/11 ]

Tested -rc3 on hyperion, looks better

0000: Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)Op grep #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
0000: --------- --------- --------- ---------- ------- --------- --------- ---------- ------- -------
0000: write 1372.40 1340.43 1359.45 13.74 1372.40 1340.43 1359.45 13.74 97.17835 1032 8 3 1 1 1 0 0 1 134217728 1048576 138512695296 -1 POSIX EXCEL
0000: read 1161.01 1055.91 1092.53 48.46 1161.01 1055.91 1092.53 48.46 121.13945 1032 8 3 1 1 1 0 0 1 134217728 1048576 138512695296 -1 POSIX EXCEL
0000:
0000: Max Write: 1372.40 MiB/sec (1439.07 MB/sec)
0000: Max Read: 1161.01 MiB/sec (1217.41 MB/sec)
0000:
0000: Run finished: Wed Jun 22 13:47:55 2011
Full system file per process MPIIO IOR

0000: Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)Op grep #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
0000: --------- --------- --------- ---------- ------- --------- --------- ---------- ------- -------
0000: write 1331.43 1317.12 1322.00 6.67 1331.43 1317.12 1322.00 6.67 99.92368 1032 8 3 0 1 1 0 0 1 134217728 1048576 138512695296 -1 POSIX EXCEL
0000: read 1029.69 1024.16 1026.19 2.48 1029.69 1024.16 1026.19 2.48 128.72528 1032 8 3 0 1 1 0 0 1 134217728 1048576 138512695296 -1 POSIX EXCEL
0000:
0000: Max Write: 1331.43 MiB/sec (1396.11 MB/sec)
0000: Max Read: 1029.69 MiB/sec (1079.70 MB/sec)

Comment by Andreas Dilger [ 07/Sep/11 ]

Di, I had occasion to look at this bug again, and one idea I had was to try to allocate order-1 pages (i.e. 8kB chunks) until that fails, and only then fall back to order-0 (4kB) allocations. Even getting a single 8kB allocation per IO would be enough to keep page fragmentation from overflowing the 255-segment limit for SRP.

Also, it would be interesting to watch the page allocation statistics on a system that is suffering from this problem, to see whether there are many 8kB pages available and the only reason fragmented 4kB pages are being used is that they are no longer being pinned by the read cache for a long time.
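
A hedged way to watch those page allocation statistics during a run: /proc/buddyinfo lists the number of free blocks per order for each memory zone, so the order-1 column shows how many free 8kB chunks remain.

# Free memory blocks per order (first column = order 0 = 4kB, second = order 1 = 8kB, ...)
cat /proc/buddyinfo
# Sample it every few seconds while obdfilter-survey is running
watch -n 5 cat /proc/buddyinfo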

Comment by Di Wang [ 07/Sep/11 ]

Ah, this is a good idea, I will cook a patch then. From what I see, there are almost no contiguous pages at that time. I will try to get page allocation statistics with the patch.

Comment by Di Wang [ 14/Sep/11 ]

Andreas, I just cooked a patch to use alloc_pages to allocate order-1 pages for niobuf, i.e. try to allocate 2 contiguous pages each time in filter_preprw_read/write. It indeed helped to avoid fragmented IO here.

Here are two results from obdfilter-survey; in both cases the backend max_hw_segments = 128 (< 256).

1. Without the patch,
Wed Sep 14 02:20:12 MST 2011 Obdfilter-survey for case=disk from testnode
ost 1 sz 16777216K rsz 1024K obj 2 thr 128 write 89.10 [ 74.92, 93.92] read 83.69 [ 71.94, 92.92]

brw_stats
....
                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios   % cum %
0:                       3   0   0   |     0   0   0
1:                    9557  58  58   | 11139  67  67
2:                    6817  41 100   |  5245  32 100
................

2. with the patch

Wed Sep 14 03:11:24 MST 2011 Obdfilter-survey for case=disk from testnode
ost 1 sz 16777216K rsz 1024K obj 2 thr 128 write 89.58 [ 80.93, 93.83] read 86.26 [ 76.94, 91.92]

brw_stats
........
                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios   % cum %
0:                       3   0   0   |     0   0   0
1:                   15739  96  96   | 15967  97  97
2:                     641   3 100   |   417   2 100
........

Though the performance does not improve a lot, it did help to avoid the fragmented IO.

I posted the patch here: http://review.whamcloud.com/#change,1377 , but the implementation might be a little hacky, since the whole data stack on the server side assumes single pages. So even though we allocate 2 contiguous pages (8k) at a time (in filter_preprw_read/write), we still need to handle each page individually in other functions. Also, the kernel seems to fully initialize only the "first" page in alloc_pages(order >= 1), so we have to initialize the following pages ourselves (for example _count and flags) before we can add all the pages to the page cache. The patch also needs to export the kernel API add_to_page_cache_lru.

Comment by Shuichi Ihara (Inactive) [ 06/Nov/11 ]

Di,

Regarding http://review.whamcloud.com/#change,1377, would you please make a patch for 2.x for testing? I'm seeing more non-aligned IOs with Lustre 2.1 on RHEL6, even with readcache_max_filesize=18446744073709551615 and vm.min_free_kbytes=2097152, which was one of the workarounds on RHEL5.x.

I'm still having a look at the current behavior on RHEL6, but I would like to try your patch with Lustre 2.1 to see if there are any differences.

Thanks

Comment by Di Wang [ 20/Dec/11 ]

Sorry, Ihara

Just saw this message. Yes, I am working on the patch for 2.x now. I was wondering whether there is any difference between RHEL6 and RHEL5 in this area.

Comment by Shuichi Ihara (Inactive) [ 21/Dec/11 ]

WangDi,

did you submit the patches? I wonder if I could test them with 2.x.

Comment by Di Wang [ 21/Dec/11 ]

Oh, not yet. I am working on it now. I will let you know once the patch is ready.

Comment by Di Wang [ 30/Dec/11 ]

did you submit the patches? I wonder if I could test them with 2.x.

Please try this http://review.whamcloud.com/#change,1881

Comment by Kit Westneat (Inactive) [ 10/Aug/12 ]

We haven't seen this issue again, and probably won't have time to do any testing, so this one can be closed.

Comment by Peter Jones [ 10/Aug/12 ]

ok thanks Kit!

Comment by Cory Spitz [ 10/Aug/12 ]

Kit, is that because your SRP initiator can construct and send I/O w/256 fragments?

Comment by Kit Westneat (Inactive) [ 10/Aug/12 ]

Cory, Ihara said that after reverting the LU-15 patch, he hasn't seen it.

Comment by Cory Spitz [ 10/Aug/12 ]

Ah, then there really is still an issue, right? At minimum, one cannot configure the cache's max file size lower to reduce cache waste without re-introducing fragmented I/O. There were multiple fixes for LU-15, though. Maybe this question is better suited as a comment on LU-15, but is the ldiskfs metadata eviction still a concern if readcache_max_filesize is not reduced? That is, were the other LU-15 changes sufficient to resolve that issue? I would suppose so, since LU-15 is closed, but I would like to make sure.

Comment by Shuichi Ihara (Inactive) [ 18/Aug/12 ]

No more fragmentation with the new OFED 3.x and the RHEL6-based OFED, since ib_srp supports indirect_sg_entries.
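
For reference, a purely illustrative modprobe configuration along those lines; the parameter names are the newer ib_srp ones, and the values shown here are placeholders, not the settings used in this test:

# /etc/modprobe.d/ib_srp.conf (illustrative values only)
options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 allow_ext_sg=1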

Comment by Mahmoud Hanafi [ 07/Nov/12 ]

Shuichi Ihara, could you please post your srp module options?
What values do you have for indirect_sg_entries, cmd_sg_entries, and allow_ext_sg?

Thanks,
Mahmoud
