[LU-410] Performance concern with the "Shrink file_max_cache_size to alleviate the memory pressure of OST" patch for LU-15 Created: 13/Jun/11 Updated: 05/Jan/21 Resolved: 10/Aug/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | John Salinas (Inactive) | Assignee: | Di Wang |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Tested on DDN SFA 10K with InfiniBand and patches for |
||
| Issue Links: |
|
| Severity: | 2 |
| Epic: | performance |
| Rank (Obsolete): | 8548 |
| Description |
|
Running obdfilter-survey with this patch applied produces many unaligned IOs. Once we remove the shrink file_max_cache patch [#define FILTER_MAX_CACHE_SIZE (8 * 1024 * 1024)], the alignment issue goes away. The unaligned IO appears to be caused by this change: once I set cache_file_size back to 18446744073709551615 (the 1.8.4 and 1.8.5 default), all IO arrived at the SFA10K aligned. Disabling the read cache (lctl set_param obdfilter.*.read_cache_enable=0) does not help, which is still very strange to me. The only workaround we have found on 1.8.6-wc is to change cache_file_size to a large value, which could have other performance implications as well. We hope to post some numbers and statistics, but we need additional runs to gather that information. |
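For readers unfamiliar with the tunable, the following is a simplified, self-contained model (plain C, not obdfilter code) of what readcache_max_filesize controls, consistent with how the parameter is described in the comments below: the OSS keeps a file's pages in its read cache after bulk IO only if the file is no larger than the limit, so lowering the default from effectively unlimited (~0ULL, printed as 18446744073709551615) to 8 MB means the pages of any large test object are dropped after every IO. The names SHRUNK_MAX_CACHE_SIZE, OLD_MAX_CACHE_SIZE and keep_pages_in_cache() are illustrative only.

/*
 * Simplified model of the readcache_max_filesize decision discussed in
 * this ticket; not Lustre code.
 */
#include <stdio.h>

#define SHRUNK_MAX_CACHE_SIZE   (8ULL * 1024 * 1024)   /* value set by the patch */
#define OLD_MAX_CACHE_SIZE      (~0ULL)                /* 1.8.4/1.8.5 default    */

/* Keep a file's pages in the OSS read cache only if the file is small enough. */
static int keep_pages_in_cache(unsigned long long i_size,
                               unsigned long long readcache_max_filesize)
{
    return i_size <= readcache_max_filesize;
}

int main(void)
{
    unsigned long long obj_size = 16ULL * 1024 * 1024 * 1024;  /* 16 GB test object */

    printf("old default %llu: keep=%d\n", OLD_MAX_CACHE_SIZE,
           keep_pages_in_cache(obj_size, OLD_MAX_CACHE_SIZE));     /* keep=1 */
    printf("patched default %llu: keep=%d\n", SHRUNK_MAX_CACHE_SIZE,
           keep_pages_in_cache(obj_size, SHRUNK_MAX_CACHE_SIZE));  /* keep=0 */
    return 0;
}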
| Comments |
| Comment by Di Wang [ 13/Jun/11 ] |
|
Does brw_stats show similar information? Could you please post the parameters you used for obdfilter-survey? |
| Comment by Di Wang [ 13/Jun/11 ] |
|
If you did not set tests_str, obdfilter-survey runs in three phases: write, rewrite, read. In which phase did you see this unaligned IO, or was it in all three? Thanks. |
| Comment by Shuichi Ihara (Inactive) [ 13/Jun/11 ] |
|
I will send you brw_stats and more statistics later, but this problem doesn't only happen during obdfilter-survey; we also saw unaligned IO during writes when running IOR from the Lustre clients. |
| Comment by Shuichi Ihara (Inactive) [ 18/Jun/11 ] |
|
Sorry for the late response. With readcache_max_filesize=8388608, I saw many unaligned I/Os (aligned I/O means 1MB x N) on the SFA10K, as collected from the IO size statistics inside the SFA10K. Many unaligned I/Os instead of aligned ones kill write performance, because the SFA10K has to cache an I/O whose size is not aligned. That is why I see better performance with readcache_max_filesize=18446744073709551615 than with readcache_max_filesize=8388608. Please see the detailed results.
# lctl get_param obdfilter.*.readcache_max_filesize
obdfilter.lustre-OST0000.readcache_max_filesize=8388608
# nobjlo=2 nobjhi=2 thrlo=128 thrhi=128 tests_str="write read" /usr/bin/obdfilter-survey
Sat Jun 18 21:11:04 JST 2011 Obdfilter-survey for case=disk from r01
ost 1 sz 16777216K rsz 1024K obj 2 thr 128 write 478.35 [ 382.47, 638.33] read 761.16 [ 666.28, 956.03]
done!
# cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats
snapshot_time: 1308399155.300890 (secs.usecs)
read | write
pages per bulk r/w rpcs % cum % | rpcs % cum %
64: 2 0 0 | 0 0 0
128: 1 0 0 | 0 0 0
256: 16367 99 100 | 16384 100 100
read | write
discontiguous pages rpcs % cum % | rpcs % cum %
0: 16370 100 100 | 16384 100 100
read | write
discontiguous blocks rpcs % cum % | rpcs % cum %
0: 16370 100 100 | 16384 100 100
read | write
disk fragmented I/Os ios % cum % | ios % cum %
1: 14073 85 85 | 14431 88 88
2: 2297 14 100 | 1953 11 100
read | write
disk I/Os in flight ios % cum % | ios % cum %
1: 5 0 0 | 42 0 0
2: 5 0 0 | 74 0 0
3: 6 0 0 | 80 0 1
4: 6 0 0 | 104 0 1
5: 6 0 0 | 107 0 2
6: 6 0 0 | 101 0 2
7: 6 0 0 | 99 0 3
8: 6 0 0 | 111 0 3
9: 7 0 0 | 116 0 4
10: 7 0 0 | 131 0 5
11: 4 0 0 | 132 0 5
12: 6 0 0 | 132 0 6
13: 4 0 0 | 138 0 7
14: 4 0 0 | 144 0 8
15: 2 0 0 | 162 0 9
16: 2 0 0 | 243 1 10
17: 2 0 0 | 260 1 11
18: 2 0 0 | 247 1 13
19: 2 0 0 | 212 1 14
20: 2 0 0 | 205 1 15
21: 2 0 0 | 191 1 16
22: 3 0 0 | 186 1 17
23: 6 0 0 | 189 1 18
24: 5 0 0 | 201 1 19
25: 2 0 0 | 213 1 20
26: 3 0 0 | 205 1 21
27: 2 0 0 | 200 1 23
28: 4 0 0 | 196 1 24
29: 5 0 0 | 194 1 25
30: 4 0 0 | 191 1 26
31: 18541 99 100 | 13531 73 100
read | write
I/O time (1/1000s) ios % cum % | ios % cum %
1: 17 0 0 | 139 0 0
2: 24 0 0 | 106 0 1
4: 16 0 0 | 102 0 2
8: 23 0 0 | 120 0 2
16: 29 0 0 | 216 1 4
32: 123 0 1 | 791 4 8
64: 912 5 6 | 3504 21 30
128: 4002 24 31 | 5916 36 66
256: 9809 59 91 | 4535 27 94
512: 1339 8 99 | 927 5 99
1K: 76 0 100 | 28 0 100
read | write
disk I/O size ios % cum % | ios % cum %
4K: 2297 12 12 | 1953 10 10
8K: 0 0 12 | 0 0 10
16K: 0 0 12 | 0 0 10
32K: 0 0 12 | 0 0 10
64K: 0 0 12 | 0 0 10
128K: 0 0 12 | 0 0 10
256K: 2 0 12 | 0 0 10
512K: 1 0 12 | 0 0 10
1M: 16367 87 100 | 16384 89 100
I also collected SFA-side statistics; we can see what IO sizes are coming from the host.
----------------------------------------------------
Length Port 0 Port 1
Kbytes Reads Writes Reads Writes
----------------------------------------------------
4 0 0 1211 2775 <------ many 4K I/O
8 0 0 0 10
12 0 0 0 15
16 0 0 0 7
20 0 0 0 14
24 0 0 0 183
28 0 0 0 546
32 0 0 0 325
36 0 0 0 36
40 0 0 0 1
44 0 0 0 4
52 0 0 0 1
208 0 0 1 0
240 0 0 1 0
416 0 0 1 0
604 0 0 0 1
640 0 0 1 0
900 0 0 1 0
1020 0 0 2297 1953 <------ many not aligned IO
1024 0 0 9528 10484
1028 0 0 1021 431 <------
2048 0 0 1149 1150
2052 0 0 62 88 <------
3072 0 0 231 230 <------
3076 0 0 11 14
4092 0 0 86 69
# lctl set_param obdfilter.*.readcache_max_filesize=18446744073709551615
obdfilter.lustre-OST0000.readcache_max_filesize=18446744073709551615
# nobjlo=2 nobjhi=2 thrlo=128 thrhi=128 tests_str="write read" /usr/bin/obdfilter-survey
Sat Jun 18 21:17:25 JST 2011 Obdfilter-survey for case=disk from r01
ost 1 sz 16777216K rsz 1024K obj 2 thr 128 write 616.80 [ 469.54, 772.31] read 827.44 [ 767.28, 858.09]
done!
# cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats
snapshot_time: 1308399550.63654 (secs.usecs)
read | write
pages per bulk r/w rpcs % cum % | rpcs % cum %
256: 16317 100 100 | 16384 100 100
read | write
discontiguous pages rpcs % cum % | rpcs % cum %
0: 16317 100 100 | 16384 100 100
read | write
discontiguous blocks rpcs % cum % | rpcs % cum %
0: 16317 100 100 | 16384 100 100
read | write
disk fragmented I/Os ios % cum % | ios % cum %
1: 16203 99 99 | 16364 99 99
2: 114 0 100 | 20 0 100
read | write
disk I/Os in flight ios % cum % | ios % cum %
1: 4 0 0 | 4 0 0
2: 2 0 0 | 5 0 0
3: 1 0 0 | 4 0 0
4: 1 0 0 | 6 0 0
5: 1 0 0 | 3 0 0
6: 1 0 0 | 5 0 0
7: 3 0 0 | 2 0 0
8: 7 0 0 | 6 0 0
9: 4 0 0 | 4 0 0
10: 3 0 0 | 3 0 0
11: 2 0 0 | 7 0 0
12: 3 0 0 | 5 0 0
13: 4 0 0 | 6 0 0
14: 3 0 0 | 8 0 0
15: 3 0 0 | 11 0 0
16: 6 0 0 | 19 0 0
17: 6 0 0 | 26 0 0
18: 3 0 0 | 30 0 0
19: 3 0 0 | 43 0 1
20: 3 0 0 | 52 0 1
21: 4 0 0 | 54 0 1
22: 2 0 0 | 51 0 2
23: 2 0 0 | 53 0 2
24: 4 0 0 | 69 0 2
25: 2 0 0 | 75 0 3
26: 4 0 0 | 65 0 3
27: 8 0 0 | 74 0 4
28: 3 0 0 | 87 0 4
29: 3 0 0 | 94 0 5
30: 8 0 0 | 94 0 5
31: 16328 99 100 | 15439 94 100
read | write
I/O time (1/1000s) ios % cum % | ios % cum %
1: 0 0 0 | 7 0 0
2: 0 0 0 | 1 0 0
4: 3 0 0 | 0 0 0
8: 1 0 0 | 7 0 0
16: 22 0 0 | 14 0 0
32: 71 0 0 | 117 0 0
64: 993 6 6 | 1362 8 9
128: 4603 28 34 | 8467 51 60
256: 9657 59 94 | 6227 38 98
512: 927 5 99 | 182 1 100
1K: 40 0 100 | 0 0 100
read | write
disk I/O size ios % cum % | ios % cum %
4K: 114 0 0 | 20 0 0
8K: 0 0 0 | 0 0 0
16K: 0 0 0 | 0 0 0
32K: 0 0 0 | 0 0 0
64K: 0 0 0 | 0 0 0
128K: 0 0 0 | 0 0 0
256K: 0 0 0 | 0 0 0
512K: 0 0 0 | 0 0 0
1M: 16317 99 100 | 16384 99 100
SFA statistics
----------------------------------------------------
Length Port 0 Port 1
Kbytes Reads Writes Reads Writes
----------------------------------------------------
4 0 0 42 823
8 0 0 0 5
12 0 0 0 1
16 0 0 0 1
20 0 0 0 9
24 0 0 0 71
28 0 0 0 309
32 0 0 0 234
36 0 0 0 25
40 0 0 0 2
44 0 0 0 1
52 0 0 0 1
56 0 0 0 1
60 0 0 0 1
1020 0 0 114 20 <------ only 20 times
1024 0 0 2148 3246
1028 0 0 45 3
2048 0 0 1062 1004
2052 0 0 11 0
3072 0 0 821 583
3076 0 0 5 1
4092 0 0 1686 1519
|
| Comment by Di Wang [ 19/Jun/11 ] |
|
Thanks for the information. It seems the IO is somehow fragmented with max_filesize=8388608. Usually that means the extent allocation is not contiguous on disk (mballoc does not perform well in this case). Hmm, I do not understand why it would be related to max_filesize. I will investigate further to see what is going on here. Thanks. |
| Comment by Di Wang [ 20/Jun/11 ] |
|
It seems the problem is that MAX_HW_SEGMENTS on the DDN SFA 10K is < 256? It is usually 128 for most devices; Lustre actually has a kernel patch to change that value (blkdev_tunables-2.6-rhel5.patch), but unfortunately that patch does not work for every device. Could you please check what your max_hw_segments setting is in your test? If it is < 256, then the Lustre server may create fragmented IO here, especially when pages are fragmented, i.e. not physically contiguous. (For example, if max_hw_segments is 128, then to issue a 1M IO (256 pages) you would need at least half of the 256 pages to be contiguous; otherwise the IO is split into two smaller IOs. The sketch below illustrates this.) This also explains why you see less fragmented IO with a big read cache: those pages are not created/released frequently, so there is less chance for them to become fragmented, i.e. the pages are more physically contiguous in this case. |
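Here is a minimal userspace model (plain C, not Lustre or block-layer code) of the segment counting described above: pages that are physically adjacent merge into one scatter/gather segment, and a request whose segment count exceeds max_hw_segments has to be split. count_segments() and the page-frame layouts are illustrative assumptions, not real kernel data.

/*
 * A 1 MB bulk IO is 256 x 4 KB pages; physically adjacent pages merge into
 * one scatter/gather segment.  If the segment count exceeds the device's
 * max_hw_segments (128 here), the block layer must split the request.
 */
#include <stdio.h>

#define PAGES_PER_1M     256
#define MAX_HW_SEGMENTS  128

/* Count segments: adjacent physical frame numbers merge into one segment. */
static int count_segments(const unsigned long *pfn, int npages)
{
    int i, segs = 1;

    for (i = 1; i < npages; i++)
        if (pfn[i] != pfn[i - 1] + 1)   /* not physically contiguous */
            segs++;
    return segs;
}

int main(void)
{
    unsigned long pfn[PAGES_PER_1M];
    int i, segs;

    /* Fragmented case: no two pages are physically adjacent. */
    for (i = 0; i < PAGES_PER_1M; i++)
        pfn[i] = 1000 + 2 * i;
    segs = count_segments(pfn, PAGES_PER_1M);
    printf("segments=%d max_hw_segments=%d -> %s\n", segs, MAX_HW_SEGMENTS,
           segs > MAX_HW_SEGMENTS ? "split into 2 IOs" : "single 1M IO");

    /* Cached case: pages are contiguous in pairs, so 128 segments fit. */
    for (i = 0; i < PAGES_PER_1M; i++)
        pfn[i] = 5000 + 3 * (i / 2) + (i % 2);
    segs = count_segments(pfn, PAGES_PER_1M);
    printf("segments=%d max_hw_segments=%d -> %s\n", segs, MAX_HW_SEGMENTS,
           segs > MAX_HW_SEGMENTS ? "split into 2 IOs" : "single 1M IO");
    return 0;
}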
| Comment by Di Wang [ 21/Jun/11 ] |
|
To prove this idea I did a test on a SATA drive, with the default hw_max_segments = 128.
With read_cache_size = 18446744073709551615:
[root@testnode obdfilter-survey]# cat /proc/fs/lustre/obdfilter/lustre-OST0001/brw_stats
.....
disk fragmented I/Os   ios % cum % | ios % cum %
With read_cache_size = 8388608:
[root@testnode obdfilter-survey]# nobjlo=2 nobjhi=2 thrlo=128 thrhi=128 tests_str="write read" ./obdfilter-survey
......
......
Then I applied a patch to change the default hw_max_segments to 256 and redid the test:
[root@testnode lustre]# cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats
...... |
| Comment by Shuichi Ihara (Inactive) [ 21/Jun/11 ] |
|
That's what I was also thinking. The current OSS-to-SFA connection is SRP (SCSI RDMA Protocol) over QDR InfiniBand. The maximum number of scatter/gather entries per I/O in SRP is 255 (the default is 12). We can set this parameter with srp_sg_tablesize in the ib_srp module, but it can only go up to 255 descriptors. This means that in order to send a 1M I/O to the SFA10K, the OSS sends two requests to the SFA: one of 1020K (4K x 255 descriptors) and another of 4K. My understanding is that normally the SFA10K receives the 1020K + 4K pair as a single I/O request and handles it as a full-stripe I/O, but if the OSS's memory is heavily used, the two requests are fragmented when they are sent to the SFA10K and are handled as separate I/O requests. In that situation we see many 1020K and 4K requests on the SFA10K. To prevent this we set vm.min_free_kbytes to keep memory free and avoid splitting the two requests, but it is not perfect, and it did not help for the issue we see with readcache_max_filesize=8388608. Anyway, the two issues might be related and caused by srp_sg_tablesize=255. I will try the same testing on an SFA10K with 8Gbps FC (Fibre Channel), which can send an actual 1M I/O (256 scatter/gather entries), and see what happens. |
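The descriptor arithmetic above, spelled out as a small standalone C example (not ib_srp code; the constant names are illustrative): with 4 KB pages and srp_sg_tablesize=255, one SRP request carries at most 255 x 4 KB = 1020 KB, so a 1 MB Lustre bulk IO becomes a 1020 KB request plus a 4 KB remainder.

#include <stdio.h>

int main(void)
{
    const unsigned int page_size        = 4096;          /* bytes per page */
    const unsigned int srp_sg_tablesize = 255;           /* S/G descriptors per SRP request */
    const unsigned int bulk_io          = 1024 * 1024;   /* 1 MB Lustre bulk IO */

    unsigned int max_per_request = srp_sg_tablesize * page_size;  /* 1044480 B = 1020 KB */
    unsigned int first  = bulk_io < max_per_request ? bulk_io : max_per_request;
    unsigned int second = bulk_io - first;

    printf("first request:  %u KB\n", first / 1024);     /* 1020 KB */
    printf("second request: %u KB\n", second / 1024);    /* 4 KB    */
    return 0;
}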
| Comment by Matt Ezell [ 21/Jun/11 ] |
|
Was this on a RedHat 5 OSS? Linux kernel 2.6.24 merged in SG list chaining which could help the situation. Do you have access to RedHat 6 to test? |
| Comment by Cliff White (Inactive) [ 21/Jun/11 ] |
|
I think I can now confirm this on Hyperion. With:
# hyperion1154 /root > cat /proc/fs/lustre/obdfilter/lustre-OST0000/readcache_max_filesize
[IOR summary: Operation, Max (MiB), Min (MiB), Mean (MiB), Std Dev, Max (OPs), Min (OPs), Mean (OPs), Std Dev, Mean (s)]
Previous results with readcache_max_filesize=8M:
[IOR summary: Operation, Max (MiB), Min (MiB), Mean (MiB), Std Dev, Max (OPs), Min (OPs), Mean (OPs), Std Dev, Mean (s); run parameters: #Tasks, tPN, reps, fPP, reord] |
| Comment by Cliff White (Inactive) [ 21/Jun/11 ] |
|
Ah, that is the case on Hyperion: options ib_srp srp_sg_tablesize=255 is set. |
| Comment by Shuichi Ihara (Inactive) [ 22/Jun/11 ] |
|
Just tested on an SFA10K (FC model); there was no fragmentation.
# lctl get_param obdfilter.*.readcache_max_filesize
obdfilter.lustre-OST0000.readcache_max_filesize=8388608
# nobjlo=2 nobjhi=2 thrlo=128 thrhi=128 tests_str="write read" /usr/bin/obdfilter-survey
Thu Jun 23 00:54:32 JST 2011 Obdfilter-survey for case=disk from r13
ost 1 sz 16777216K rsz 1024K obj 2 thr 128 write 613.75 [ 534.49, 763.64] read 690.07 [ 630.40, 732.31]
done!
# cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats
snapshot_time: 1308758136.955857 (secs.usecs)
read | write
pages per bulk r/w rpcs % cum % | rpcs % cum %
16: 1 0 0 | 0 0 0
32: 3 0 0 | 0 0 0
64: 3 0 0 | 0 0 0
128: 2 0 0 | 0 0 0
256: 16372 99 100 | 16384 100 100
read | write
discontiguous pages rpcs % cum % | rpcs % cum %
0: 16381 100 100 | 16384 100 100
read | write
discontiguous blocks rpcs % cum % | rpcs % cum %
0: 16381 100 100 | 16384 100 100
read | write
disk fragmented I/Os ios % cum % | ios % cum %
1: 16381 100 100 | 16384 100 100
read | write
disk I/Os in flight ios % cum % | ios % cum %
1: 2 0 0 | 2 0 0
2: 2 0 0 | 2 0 0
3: 1 0 0 | 2 0 0
4: 1 0 0 | 1 0 0
5: 3 0 0 | 3 0 0
6: 2 0 0 | 3 0 0
7: 1 0 0 | 3 0 0
8: 1 0 0 | 3 0 0
9: 1 0 0 | 5 0 0
10: 2 0 0 | 9 0 0
11: 1 0 0 | 7 0 0
12: 1 0 0 | 7 0 0
13: 1 0 0 | 5 0 0
14: 2 0 0 | 4 0 0
15: 1 0 0 | 7 0 0
16: 1 0 0 | 9 0 0
17: 2 0 0 | 10 0 0
18: 1 0 0 | 29 0 0
19: 1 0 0 | 296 1 2
20: 3 0 0 | 175 1 3
21: 2 0 0 | 10 0 3
22: 1 0 0 | 7 0 3
23: 1 0 0 | 14 0 3
24: 2 0 0 | 5 0 3
25: 1 0 0 | 8 0 3
26: 2 0 0 | 7 0 3
27: 1 0 0 | 14 0 3
28: 3 0 0 | 10 0 4
29: 5 0 0 | 11 0 4
30: 1 0 0 | 5 0 4
31: 16332 99 100 | 15711 95 100
read | write
I/O time (1/1000s) ios % cum % | ios % cum %
4: 1 0 0 | 0 0 0
8: 2 0 0 | 0 0 0
16: 4 0 0 | 22 0 0
32: 31 0 0 | 528 3 3
64: 80 0 0 | 4734 28 32
128: 11400 69 70 | 4779 29 61
256: 1685 10 80 | 5893 35 97
512: 794 4 85 | 225 1 98
1K: 2384 14 100 | 25 0 98
2K: 0 0 100 | 0 0 98
4K: 0 0 100 | 83 0 99
8K: 0 0 100 | 95 0 100
read | write
disk I/O size ios % cum % | ios % cum %
64K: 1 0 0 | 0 0 0
128K: 3 0 0 | 0 0 0
256K: 3 0 0 | 0 0 0
512K: 2 0 0 | 0 0 0
1M: 16372 99 100 | 16384 100 100
WangDi, |
| Comment by Di Wang [ 22/Jun/11 ] |
|
There might be some impact. The reason we shrank readcache_max_filesize is that if the OSS caches too many pages, some metadata (for example, group information) may be swapped out of memory frequently, which is very bad for new extent allocation, especially as the OST becomes full. But you can always tell customers to shrink this value if they see this issue. (Please check |
| Comment by Build Master (Inactive) [ 22/Jun/11 ] |
|
Integrated in Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
|
| Comment by Cory Spitz [ 22/Jun/11 ] |
|
Di Wang wrote: So, is the theory that the call into the kernel to truncate_inode_pages_range(), which releases pages one at a time, causes memory to become quickly fragmented? If true, then setting readcache_max_filesize=1GiB and running the same test case should hopefully result in more 1MiB I/Os from the SRP initiator. Can we easily prove that the memory an OSS acquires for bulk read/write data is less physically fragmented when readcache_max_filesize=-1? |
| Comment by Di Wang [ 22/Jun/11 ] |
|
Yes, getting/releasing pages frequently (as truncate_inode_pages_range does) fragments memory. Actually, I had hoped to find an API to allocate contiguous pages for bulk read/write, but there seems to be no such API. Yes, if you set readcache_max_filesize=1G, you should expect more 1MB IO (though the test writes 16G to each object). So yes, if you set readcache_max_filesize=-1, you should expect less fragmented pages, IMHO. |
| Comment by Cory Spitz [ 22/Jun/11 ] |
|
From the description John Salinas wrote: Did you disable the writethrough_cache as well? If you kept the writethrough cache enabled and performed writes, then the pages would still stay in the cache, as filter_release_cache() wouldn't be called before returning from filter_commitrw_write(). |
| Comment by Cliff White (Inactive) [ 22/Jun/11 ] |
|
Tested -rc3 on Hyperion; looks better.
[IOR summary: Operation, Max (MiB), Min (MiB), Mean (MiB), Std Dev, Max (OPs), Min (OPs), Mean (OPs), Std Dev, Mean (s); run parameters: #Tasks, tPN, reps, fPP, reord, reordoff, reordrand, seed, segcnt, blksiz, xsize, aggsize] |
| Comment by Andreas Dilger [ 07/Sep/11 ] |
|
Di, I had occasion to look at this bug again, and one idea I had was to try to allocate order-1 pages (i.e. 8kB chunks) until that fails, and only then fall back to order-0 (4kB) allocations. Even getting a single 8kB allocation per IO would be enough to keep page fragmentation from overflowing the 255-segment limit for SRP. Also, it would be interesting to watch the page allocation statistics on a system that is suffering from this problem, to see whether there are many 8kB chunks available and the only reason fragmented 4kB pages are being used is that they are no longer being pinned by the read cache for a long time. |
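A minimal sketch of the allocation policy suggested above, not the actual patch: opportunistically grab order-1 (two-page, 8 kB) chunks and drop back to order-0 pages only when that fails. fill_bulk_pages() is a hypothetical helper; alloc_pages(), alloc_page() and __GFP_NOWARN are standard kernel interfaces. Making the second page of each pair individually refcounted and freeable is a separate step, which the next comment discusses.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/errno.h>

/* Hypothetical helper: fill pages[] for one bulk IO, preferring 8 kB chunks. */
static int fill_bulk_pages(struct page **pages, int npages)
{
	int i = 0;

	while (i < npages) {
		struct page *pg = NULL;

		if (i + 1 < npages)	/* try two physically contiguous pages first */
			pg = alloc_pages(GFP_NOFS | __GFP_NOWARN, 1);

		if (pg) {
			pages[i++] = pg;
			pages[i++] = pg + 1;	/* physically adjacent to pg */
		} else {
			/* memory too fragmented (or last slot): plain 4 kB page */
			pg = alloc_page(GFP_NOFS);
			if (!pg)
				return -ENOMEM;	/* caller releases pages[0..i-1] */
			pages[i++] = pg;
		}
	}
	return 0;
}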
| Comment by Di Wang [ 07/Sep/11 ] |
|
Ah, this is a good idea, I will cook a patch then. From what I see, there are almost no contiguous pages at that time. I will try to get page allocation statistics with the patch. |
| Comment by Di Wang [ 14/Sep/11 ] |
|
Andreas, I just cooked a patch that uses alloc_pages to allocate order-1 pages for the niobuf, i.e. it tries to allocate 2 contiguous pages at a time in filter_preprw_read/write. It indeed helped avoid fragmented IO here. Here are two results from obdfilter-survey; in both cases the backend max_hw_segments = 128 (< 256):
1. Without the patch:
brw_stats
2. With the patch:
Wed Sep 14 03:11:24 MST 2011 Obdfilter-survey for case=disk from testnode
brw_stats
Though the performance does not improve much, the patch did help avoid the fragmented IO. I posted the patch at http://review.whamcloud.com/#change,1377 , but the implementation might be a little hacky, since the whole data stack on the server side assumes one page at a time. So even though we allocate 2 contiguous pages (8k) at a time in filter_preprw_read/write, we still need to handle each page individually in other functions. Also, the kernel only fully initializes the "first" page in alloc_pages(order >= 1), so we have to initialize the following pages ourselves (for example _count and flags) before we can add all the pages to the cache. The patch also needs the kernel API add_to_page_cache_lru to be exported. |
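A hedged sketch of the tail-page handling described above (not the actual change 1377): after alloc_pages(order=1) only the head page carries a usable reference count, so the second page has to be initialized before it can be treated as an ordinary page and inserted into the inode's page cache. In mainline kernels split_page() performs that tail-page initialization, and add_to_page_cache_lru() is the insertion API the comment says had to be exported; whether both are callable from a module depends on the kernel in question, cache_order1_chunk() is a made-up name, and the error path is simplified.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/errno.h>

static int cache_order1_chunk(struct address_space *mapping, pgoff_t index)
{
	struct page *pg;
	int i, rc;

	pg = alloc_pages(GFP_NOFS | __GFP_NOWARN, 1);	/* 2 contiguous pages */
	if (!pg)
		return -ENOMEM;

	split_page(pg, 1);	/* give the tail page its own refcount/flags */

	for (i = 0; i < 2; i++) {
		rc = add_to_page_cache_lru(pg + i, mapping, index + i, GFP_NOFS);
		if (rc) {
			/* simplified: real code must also undo any page
			 * already added to the cache above */
			__free_page(pg + i);
			return rc;
		}
	}
	return 0;
}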
| Comment by Shuichi Ihara (Inactive) [ 06/Nov/11 ] |
|
Di, regarding http://review.whamcloud.com/#change,1377, would you please make a patch for 2.x for testing? I'm seeing more unaligned IOs with Lustre 2.1 on RHEL6, even with readcache_max_filesize=18446744073709551615 and vm.min_free_kbytes=2097152, which was one of the workarounds on RHEL5.x. I'm still looking at the current behavior on RHEL6, but I want to try your patch with Lustre 2.1 to see whether it makes any difference. Thanks |
| Comment by Di Wang [ 20/Dec/11 ] |
|
Sorry, Ihara, I just saw this message. Yes, I am working on the patch for 2.x now. I was wondering whether there is any difference between RHEL6 and RHEL5 in this area. |
| Comment by Shuichi Ihara (Inactive) [ 21/Dec/11 ] |
|
WangDi, did you submit the patches? I wonder if I could test them with 2.x. |
| Comment by Di Wang [ 21/Dec/11 ] |
|
Oh, not yet. I am working on it now. I will let you know once the patch is ready. |
| Comment by Di Wang [ 30/Dec/11 ] |
Please try this http://review.whamcloud.com/#change,1881 |
| Comment by Kit Westneat (Inactive) [ 10/Aug/12 ] |
|
We haven't seen this issue again, and probably won't have time to do any testing, so this one can be closed. |
| Comment by Peter Jones [ 10/Aug/12 ] |
|
ok thanks Kit! |
| Comment by Cory Spitz [ 10/Aug/12 ] |
|
Kit, is that because your SRP initiator can construct and send I/O w/256 fragments? |
| Comment by Kit Westneat (Inactive) [ 10/Aug/12 ] |
|
Cory, Ihara said that after reverting the |
| Comment by Cory Spitz [ 10/Aug/12 ] |
|
Ah, then there really is still an issue, right? At minimum, one cannot configure the cache's max file size lower to reduce cache waste without re-introducing fragmented I/O. There were multiple fixes for |
| Comment by Shuichi Ihara (Inactive) [ 18/Aug/12 ] |
|
No more fragments on the new OFED 3.x and the RHEL6-based OFED, since ib_srp supports indirect_sg_entries. |
| Comment by Mahmoud Hanafi [ 07/Nov/12 ] |
|
Shuichi Ihara, could you please post your SRP module options? Thanks. |