[LU-744] Single client's performance degradation on 2.1
Created: 09/Oct/11   Updated: 13/Mar/14   Resolved: 06/Feb/14
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0, Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4018 |
| Description |
|
During performance testing on Lustre 2.1, I saw a performance degradation on a single client. The Write (MiB/sec) and Read (MiB/sec) results are reposted in the first comment below. Tested on the same infrastructure (hardware and network). Checksums were turned off on the client in both tests.
| Comments |
| Comment by Shuichi Ihara (Inactive) [ 09/Oct/11 ] | ||||||||||||||||||||||||
Here are the IOR results (posted again).

Write (MiB/sec)
v1.8.6.80   v2.1
446.25      411.43
808.53      761.30
1484.18     1151.41
1967.42     1172.06

Read (MiB/sec)
v1.8.6.80   v2.1
823.90      595.71
1449.49     1071.76
2502.49     1517.79
3133.43     1746.30

During testing I saw high CPU usage from the ptlrpcd-brw and kswapd processes on 2.1. A little kswapd activity showed up in the 1.8 testing as well, but not frequently. With 2.1, however, kswapd CPU usage is always high.

(during write testing)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6922 root 16 0 0 0 0 R 77.5 0.0 13:37.23 ptlrpcd-brw
409 root 11 -5 0 0 0 R 67.5 0.0 19:26.72 kswapd1
408 root 10 -5 0 0 0 R 64.5 0.0 20:09.53 kswapd0
13897 root 15 0 190m 7528 2840 R 36.3 0.1 0:52.97 IOR
13898 root 15 0 190m 7516 2828 S 35.6 0.1 0:52.70 IOR
13900 root 15 0 190m 7536 2844 S 35.3 0.1 0:52.12 IOR
13899 root 15 0 191m 7528 2836 S 34.6 0.1 0:54.06 IOR
13902 root 15 0 191m 7524 2828 S 34.6 0.1 0:53.32 IOR
13895 root 15 0 190m 7688 2992 S 33.9 0.1 0:52.92 IOR
13901 root 15 0 191m 7520 2832 R 33.3 0.1 0:53.05 IOR
13896 root 15 0 190m 7516 2832 S 32.9 0.1 0:53.15 IOR
406 root 15 0 0 0 0 R 4.7 0.0 0:28.83 pdflush
6915 root 15 0 0 0 0 S 1.0 0.0 0:16.27 kiblnd_sd_02
6916 root 15 0 0 0 0 S 1.0 0.0 0:16.33 kiblnd_sd_03
6917 root 15 0 0 0 0 S 1.0 0.0 0:16.17 kiblnd_sd_04
6918 root 15 0 0 0 0 S 1.0 0.0 0:16.26 kiblnd_sd_05
6919 root 15 0 0 0 0 S 1.0 0.0 0:16.29 kiblnd_sd_06
6920 root 15 0 0 0 0 S 1.0 0.0 0:16.33 kiblnd_sd_07
6913 root 15 0 0 0 0 S 0.7 0.0 0:16.28 kiblnd_sd_00
6914 root 15 0 0 0 0 S 0.7 0.0 0:16.15 kiblnd_sd_01
13921 root 15 0 12896 1220 824 R 0.3 0.0 0:00.14 top
(during read testing)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13896 root 18 0 190m 7540 2856 R 88.3 0.1 1:35.29 IOR
409 root 10 -5 0 0 0 R 86.6 0.0 20:44.50 kswapd1
13901 root 18 0 191m 7572 2884 R 83.9 0.1 1:40.79 IOR
408 root 10 -5 0 0 0 R 83.3 0.0 21:23.82 kswapd0
13899 root 18 0 191m 7668 2920 R 81.3 0.1 1:43.45 IOR
13902 root 18 0 191m 7544 2848 R 79.6 0.1 1:43.58 IOR
13898 root 19 0 190m 7536 2848 R 72.7 0.1 1:43.15 IOR
13895 root 18 0 190m 7860 3104 R 70.7 0.1 1:32.06 IOR
6922 root 15 0 0 0 0 R 66.0 0.0 14:53.78 ptlrpcd-brw
13900 root 23 0 190m 7552 2860 R 48.4 0.1 1:39.15 IOR
13897 root 23 0 190m 7584 2896 R 22.6 0.1 1:33.74 IOR
6913 root 15 0 0 0 0 S 1.7 0.0 0:17.39 kiblnd_sd_00
6914 root 15 0 0 0 0 S 1.7 0.0 0:17.24 kiblnd_sd_01
6917 root 15 0 0 0 0 S 1.7 0.0 0:17.31 kiblnd_sd_04
6916 root 15 0 0 0 0 S 1.3 0.0 0:17.44 kiblnd_sd_03
6918 root 15 0 0 0 0 S 1.3 0.0 0:17.40 kiblnd_sd_05
6919 root 15 0 0 0 0 S 1.3 0.0 0:17.41 kiblnd_sd_06
6920 root 15 0 0 0 0 S 1.3 0.0 0:17.45 kiblnd_sd_07
6915 root 15 0 0 0 0 S 1.0 0.0 0:17.39 kiblnd_sd_02
13924 root 15 0 12896 1220 824 R 0.3 0.0 0:00.17 top
1 root 15 0 10372 632 540 S 0.0 0.0 0:01.66 init
Note: I turned off the Lustre checksum in this testing, so this is not caused by checksum overhead.
| Comment by Jinshan Xiong (Inactive) [ 10/Oct/11 ] | ||||||||||||||||||||||||
|
A key difference between 2.1 and 1.8 is that there is no caching memory (max_dirty_mb) limitation in 2.1. This will cause high CPU usage in kswapd, but I'm not sure whether it is the root cause of the performance degradation for this IO-intensive program. For the read case, the first thing we need to know is the RPC size. Can you please collect the following information for both the read and write cases, on 1.8 and 2.1 specifically: Thanks.
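For reference, a minimal collection sketch along these lines; the exact list requested above was not preserved, so this is only an assumption based on the stats Ihara attaches later in the ticket (rpc_stats, vmstat, oprofile):

# clear the per-OSC RPC histograms before the run
lctl set_param osc.*.rpc_stats=0
# record memory/swap activity for the duration of the run
vmstat 1 > vmstat.log &
# ... run the IOR workload ...
# dump the RPC size / pages-per-RPC histograms after the run
lctl get_param osc.*.rpc_stats > rpc_stats.log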
| Comment by Oleg Drokin [ 10/Oct/11 ] | ||||||||||||||||||||||||
|
I wonder what the raw speed capability of the link is. We have a caching bug in 1.8 that manifests itself as too-fast reads if you have just done the writes, even if you wrote more data than fits in RAM.
| Comment by Shuichi Ihara (Inactive) [ 10/Oct/11 ] | ||||||||||||||||||||||||
|
The network between server and client is QDR InfiniBand, so the numbers should be reasonable. Also, the client only has 12GB of memory and I'm writing more data (256GB) than the memory size, so there is no cache effect here.
| Comment by Oleg Drokin [ 10/Oct/11 ] | ||||||||||||||||||||||||
|
I understand your test size is bigger than the client RAM. This may or may not contribute to the problem you are seeing, of course, and I see writes are also somewhat slower, which could not be explained by the caching problem we saw. Also, just to confirm, this is 4x QDR, right? That can carry up to 4 gigabytes/sec of useful bandwidth. Getting the data Jinshan requested is a good start indeed.
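One way to confirm the raw LNET bandwidth of the link independently of the filesystem is an lnet_selftest run; a minimal sketch follows, with the NIDs purely as placeholders (Ihara later reports using LNET selftest for exactly this kind of baseline):

# load the selftest module on the nodes involved and start a session
modprobe lnet_selftest
export LST_SESSION=$$
lst new_session bw_check
# one group for the client under test, one for a server, using their LNET NIDs
lst add_group clients 10.0.0.1@o2ib
lst add_group servers 10.0.0.2@o2ib
# bulk write test with 1 MiB transfers, client group -> server group
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw write size=1M
lst run bulk
lst stat clients servers     # watch the MiB/s rates, Ctrl-C to stop
lst end_session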
| Comment by Shuichi Ihara (Inactive) [ 11/Oct/11 ] | ||||||||||||||||||||||||
|
Attached are all the stats I got on 2.1. I will run the same benchmark on 1.8.
| Comment by Shuichi Ihara (Inactive) [ 27/Oct/11 ] | ||||||||||||||||||||||||
|
Got vmstat and rpc_stats during the IOR benchmark with Lustre 1.8. oprofile didn't work on this kernel, due to the following error message when I ran opreport: opreport error: basic_string::_S_construct NULL not valid
| Comment by Shuichi Ihara (Inactive) [ 06/Jan/12 ] | ||||||||||||||||||||||||
|
Please have a look at the log files and oprofile output, and let me know if you need more information.
| Comment by Peter Jones [ 06/Jan/12 ] | ||||||||||||||||||||||||
|
Reassign to Jinshan | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 09/Feb/12 ] | ||||||||||||||||||||||||
|
Tested again on the current master branch. The write numbers are a little bit improved, but the read numbers are the same and still show a big gap compared to 1.8.x. I think the current master code should have multiple ptlrpc threads, right? But it does not seem to help single-client performance yet.

    write(MB/s)  read(MB/s)
1   515          644
2   1041         1172
4   1438         1529
8   1601         1683
| Comment by Shuichi Ihara (Inactive) [ 12/Feb/12 ] | ||||||||||||||||||||||||
|
From more testing and monitoring of storage IO statistics, it looks like the performance is good if the total file size is smaller than the client's memory.
| Comment by Jinshan Xiong (Inactive) [ 14/Feb/12 ] | ||||||||||||||||||||||||
|
we'll address this in 2.3 due to the io engine work taking place under | ||||||||||||||||||||||||
| Comment by Eric Barton (Inactive) [ 24/Mar/12 ] | ||||||||||||||||||||||||
|
Can we confirm this is a client-side issue - e.g. by measuring 1.8 and 2.x clients v. 2.x servers? | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 26/Mar/12 ] | ||||||||||||||||||||||||
|
To eeb: from what Ihara has seen, if the file size being written exceeds the memory size, the performance drops a lot, so I think this may be a client-side issue. However, I haven't seen any other report of this kind of issue; one reason may be that it isn't noticed, or that other sites can't generate IO this fast. Actually, I have run performance benchmarks many times but did not see this issue. I guess one reason would be that I can't generate such high-speed IO with the hardware in our lab.
| Comment by Shuichi Ihara (Inactive) [ 30/Mar/12 ] | ||||||||||||||||||||||||
|
Here is what I demonstrated. Tested IOR on single client (12 therads) and collected memory size and Lustre throughput on an client during the IO. This is tested on Lustre-2.2RC2 for servers and client. # IOR -o /lustre/ior.out/file -b 8g -t 1m -F -C -w -e -vv -k # sync;echo 3 > /proc/sys/vm/drop_caches # IOR -o /lustre/ior.out/file -b 8g -t 1m -F -C -r -e -vv -k Write Performance # collectl -scml waiting for 1 second sample... #<--------CPU--------><-----------Memory-----------><--------Lustre Client--------> #cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes 0 0 66 83 46G 0 15M 5M 118M 32M 0 0 0 0 0 0 152 110 46G 0 15M 5M 118M 32M 0 0 0 0 0 0 1605 809 46G 0 17M 6M 118M 33M 0 0 0 0 2 1 8353 19105 46G 0 22M 8M 119M 68M 0 0 0 0 39 39 14389 22964 44G 0 1G 1G 405M 91M 0 0 1277952 1248 96 96 29362 47501 41G 0 3G 3G 1G 92M 0 0 2676736 2614 96 96 29109 46887 38G 0 6G 6G 1G 92M 0 0 2678784 2616 95 95 28936 46208 35G 0 8G 8G 2G 92M 0 0 2669568 2607 96 96 28813 46264 32G 0 11G 11G 2G 92M 0 0 2683904 2621 96 96 27957 43106 29G 0 13G 13G 3G 92M 0 0 2603008 2542 96 96 29186 47093 25G 0 16G 16G 3G 92M 0 0 2673664 2611 96 96 28878 46397 22G 0 19G 19G 4G 92M 0 0 2670592 2608 96 96 28736 46291 19G 0 21G 21G 4G 92M 0 0 2670592 2608 95 95 29202 47151 16G 0 24G 24G 5G 92M 0 0 2673664 2611 96 96 27200 42103 13G 0 26G 26G 5G 92M 0 0 2608128 2547 96 95 28900 46153 10G 0 29G 29G 6G 92M 0 0 2671616 2609 96 96 28962 46393 7G 0 31G 31G 7G 92M 0 0 2661376 2599 96 96 28982 46711 4G 0 34G 34G 7G 92M 0 0 2650112 2588 96 96 27615 43289 1G 0 36G 36G 8G 92M 0 0 2530304 2471 98 98 27524 34935 183M 0 37G 37G 8G 92M 0 0 1996800 1950 99 99 24298 30965 227M 0 37G 37G 8G 92M 0 0 1708032 1668 100 100 24578 31559 276M 0 37G 37G 8G 92M 0 0 1694720 1655 #<--------CPU--------><-----------Memory-----------><--------Lustre Client--------> #cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes 100 99 24758 32204 194M 0 37G 37G 8G 92M 0 0 1708032 1668 99 99 24367 30946 184M 0 37G 37G 8G 92M 0 0 1689600 1650 100 100 24772 31223 222M 0 37G 37G 8G 92M 0 0 1709056 1669 99 99 24742 31196 224M 0 37G 37G 8G 92M 0 0 1680751 1641 100 100 24502 31292 285M 0 37G 37G 8G 92M 0 0 1729218 1689 98 98 23817 31563 186M 0 37G 37G 8G 92M 0 0 1754112 1713 99 99 26300 32065 203M 0 37G 37G 8G 92M 0 0 1696096 1656 100 99 23777 30225 274M 0 37G 37G 8G 92M 0 0 1704617 1665 99 99 24663 31760 259M 0 37G 37G 8G 92M 0 0 1740800 1700 100 100 24885 32234 221M 0 37G 37G 8G 92M 0 0 1721344 1681 99 99 23912 30622 206M 0 37G 37G 8G 92M 0 0 1732608 1692 99 99 25136 32743 184M 0 37G 37G 8G 92M 0 0 1748992 1708 99 99 24931 31094 218M 0 37G 37G 8G 92M 0 0 1679360 1640 99 99 28119 33561 221M 0 37G 37G 8G 92M 0 0 1709056 1669 100 100 24796 32077 201M 0 37G 37G 8G 92M 0 0 1703936 1664 100 99 24805 32263 196M 0 37G 37G 8G 92M 0 0 1715506 1675 100 100 24191 30959 185M 0 37G 37G 8G 92M 0 0 1696386 1657 100 99 23907 30445 203M 0 37G 37G 8G 92M 0 0 1696768 1657 100 100 24488 31350 276M 0 37G 37G 8G 92M 0 0 1665024 1626 100 99 28522 32064 231M 0 37G 37G 8G 91M 0 0 1717248 1677 99 99 24475 30399 296M 0 37G 37G 8G 90M 0 0 1657856 1619 100 100 24613 31539 232M 0 37G 37G 8G 90M 0 0 1717248 1677 #<--------CPU--------><-----------Memory-----------><--------Lustre Client--------> #cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes 99 99 23441 29388 203M 0 37G 37G 8G 90M 0 0 1691648 1652 99 99 24446 30918 194M 0 37G 37G 8G 90M 0 0 1666048 1627 99 99 23687 29693 224M 0 37G 37G 8G 90M 0 0 1708032 1668 100 100 24159 30834 
207M 0 37G 37G 8G 90M 0 0 1713152 1673 99 99 23732 29601 260M 0 37G 37G 8G 90M 0 0 1652736 1614 100 100 24107 30571 259M 0 37G 37G 8G 90M 0 0 1705984 1666 99 98 27459 31224 268M 0 37G 37G 8G 88M 0 0 1613824 1576 99 96 24363 31208 190M 0 37G 37G 8G 87M 0 0 1603584 1566 99 95 23538 28255 206M 0 37G 37G 8G 87M 0 0 1478656 1444 99 93 22302 26656 242M 0 37G 37G 8G 85M 0 0 1384448 1352 99 90 20754 22468 217M 0 37G 37G 8G 81M 0 0 1137664 1111 99 86 17894 16390 216M 0 37G 37G 8G 80M 0 0 833536 814 99 85 16683 12804 219M 0 37G 37G 8G 80M 0 0 685056 669 80 67 14944 12993 277M 0 37G 37G 8G 34M 0 0 415744 406 0 0 67 78 279M 0 37G 37G 8G 32M 0 0 0 0 0 0 66 83 279M 0 37G 37G 8G 32M 0 0 0 0 0 0 73 82 279M 0 37G 37G 8G 32M 0 0 0 0 0 0 58 75 280M 0 37G 37G 8G 32M 0 0 0 0 0 0 72 79 280M 0 37G 37G 8G 32M 0 0 0 0 0 0 53 75 281M 0 37G 37G 8G 32M 0 0 0 0 Read Performance # collectl -scml waiting for 1 second sample... #<--------CPU--------><-----------Memory-----------><--------Lustre Client--------> #cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes 0 0 57 67 46G 0 16M 5M 117M 32M 0 0 0 0 0 0 125 85 46G 0 16M 5M 117M 32M 0 0 0 0 2 1 9829 19730 46G 0 22M 8M 119M 68M 0 0 0 0 1 0 821 1872 46G 0 48M 31M 124M 91M 7168 7 0 0 91 91 30999 56887 42G 0 2G 2G 722M 92M 2331648 2277 0 0 100 99 33243 57527 39G 0 5G 5G 1G 92M 2747392 2683 0 0 100 99 33049 57389 36G 0 7G 7G 1G 92M 2745344 2681 0 0 100 100 33696 58307 33G 0 10G 10G 2G 92M 2745344 2681 0 0 100 99 32735 56514 30G 0 13G 13G 2G 92M 2745344 2681 0 0 100 99 34001 58043 26G 0 15G 15G 3G 92M 2732032 2668 0 0 100 99 33038 57275 23G 0 18G 18G 4G 92M 2745344 2681 0 0 100 100 33305 58068 20G 0 21G 21G 4G 92M 2755584 2691 0 0 100 99 32786 56625 17G 0 23G 23G 5G 92M 2743296 2679 0 0 100 99 32977 57459 14G 0 26G 26G 5G 92M 2742272 2678 0 0 100 99 32748 56989 10G 0 28G 28G 6G 92M 2747392 2683 0 0 100 99 33028 57293 7G 0 31G 31G 6G 92M 2753536 2689 0 0 100 99 32779 56924 4G 0 34G 34G 7G 92M 2720768 2657 0 0 100 99 31996 54526 1G 0 36G 36G 8G 92M 2591744 2531 0 0 99 99 31920 45036 200M 0 37G 37G 8G 92M 2096128 2047 0 0 99 99 26673 38911 185M 0 37G 37G 8G 92M 1813504 1771 0 0 100 100 26120 38482 183M 0 37G 37G 8G 92M 1853440 1810 0 0 99 99 26358 38794 185M 0 37G 37G 8G 92M 1819648 1777 0 0 #<--------CPU--------><-----------Memory-----------><--------Lustre Client--------> #cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes 99 99 27138 40461 226M 0 37G 37G 8G 92M 1903616 1859 0 0 99 99 27660 41331 188M 0 37G 37G 8G 92M 1892352 1848 0 0 99 99 26490 38244 218M 0 37G 37G 8G 92M 1785856 1744 0 0 99 99 27106 40421 190M 0 37G 37G 8G 92M 1820672 1778 0 0 99 99 26841 40338 251M 0 37G 37G 8G 92M 1804288 1762 0 0 100 99 26798 39658 187M 0 37G 37G 8G 92M 1831129 1788 0 0 99 99 27658 41055 217M 0 37G 37G 8G 92M 1872721 1829 0 0 99 99 27175 40097 240M 0 37G 37G 8G 92M 1830912 1788 0 0 99 99 27205 40167 253M 0 37G 37G 8G 92M 1846272 1803 0 0 99 99 27506 41196 231M 0 37G 37G 8G 92M 1861632 1818 0 0 99 99 29622 41786 250M 0 37G 37G 8G 92M 1835008 1792 0 0 99 99 27734 41179 238M 0 37G 37G 8G 92M 1894400 1850 0 0 99 99 28140 40126 260M 0 37G 37G 8G 92M 1799168 1757 0 0 100 100 26986 39996 301M 0 37G 37G 8G 92M 1825792 1783 0 0 99 99 28804 41224 195M 0 37G 37G 8G 92M 1841152 1798 0 0 100 99 28819 41024 209M 0 37G 37G 8G 92M 1795072 1753 0 0 100 100 26511 38828 227M 0 37G 37G 8G 92M 1797120 1755 0 0 99 99 31510 40406 206M 0 37G 37G 8G 91M 1826816 1784 0 0 99 99 27219 39596 202M 0 37G 37G 8G 90M 1814761 1772 0 0 98 98 27858 39520 216M 0 37G 37G 8G 
90M 1831720 1789 0 0 99 99 27691 39656 270M 0 37G 37G 8G 90M 1830912 1788 0 0 99 99 26331 37845 242M 0 37G 37G 8G 90M 1778688 1737 0 0 #<--------CPU--------><-----------Memory-----------><--------Lustre Client--------> #cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes 100 99 25922 37428 214M 0 37G 37G 8G 90M 1738752 1698 0 0 100 99 25756 37430 246M 0 37G 37G 8G 90M 1766400 1725 0 0 99 99 26469 39634 184M 0 37G 37G 8G 90M 1866752 1823 0 0 99 99 25477 36527 257M 0 37G 37G 8G 90M 1738752 1698 0 0 99 99 26725 39618 248M 0 37G 37G 8G 90M 1840128 1797 0 0 99 99 25512 36383 261M 0 37G 37G 8G 90M 1755428 1714 0 0 99 98 27001 36997 193M 0 37G 37G 8G 87M 1857346 1814 0 0 99 93 24895 33872 202M 0 37G 37G 8G 84M 1631232 1593 0 0 64 57 13832 15585 274M 0 37G 37G 8G 34M 711680 695 0 0 0 0 62 76 276M 0 37G 37G 8G 32M 0 0 0 0 0 0 76 84 276M 0 37G 37G 8G 32M 0 0 0 0 | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 30/Mar/12 ] | ||||||||||||||||||||||||
|
This is another test results when servers are running with 2.2, but client is 1.8.7. (The checksum is diabled.) write test # collectl -scml waiting for 1 second sample... #<--------CPU--------><-----------Memory-----------><--------Lustre Client--------> #cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes 0 0 1041 165 46G 0 104M 89M 127M 25M 0 0 0 0 2 1 3580 31174 46G 0 105M 90M 127M 57M 0 0 0 0 9 8 1357 1582 45G 0 863M 848M 197M 83M 0 0 739336 722 25 25 7239 31394 42G 0 3G 3G 481M 83M 0 0 2985870 2916 27 27 7219 33245 39G 0 6G 6G 810M 84M 0 0 3274752 3198 29 28 7069 33014 35G 0 8G 9G 1G 84M 0 0 3337973 3260 29 29 7190 31910 32G 0 11G 13G 1G 84M 0 0 3326976 3249 30 29 7064 32743 29G 0 13G 16G 1G 84M 0 0 3352576 3274 29 29 6250 26734 31G 0 14G 14G 1G 84M 0 0 2642944 2581 31 31 6881 32248 28G 0 16G 16G 1G 84M 0 0 3367936 3289 33 33 7000 31743 28G 0 17G 16G 1G 84M 0 0 3303424 3226 33 32 6981 31856 26G 0 18G 18G 1G 84M 0 0 3352576 3274 36 36 6858 31487 25G 0 18G 19G 1G 84M 0 0 3354624 3276 28 28 6219 26187 25G 0 19G 19G 1G 84M 0 0 2723840 2660 31 31 7111 33539 23G 0 21G 21G 2G 84M 0 0 3350528 3272 40 40 7015 31281 24G 0 19G 19G 2G 84M 0 0 3338240 3260 38 38 6909 30377 25G 0 19G 19G 2G 84M 0 0 3331072 3253 29 29 6945 31835 22G 0 22G 22G 2G 84M 0 0 3314688 3237 37 37 6427 26715 24G 0 20G 20G 2G 84M 0 0 2864945 2798 37 37 6656 29771 24G 0 20G 20G 2G 84M 0 0 3322379 3245 35 35 6789 30431 23G 0 21G 21G 2G 84M 0 0 3369257 3290 38 38 6850 30454 23G 0 21G 21G 2G 84M 0 0 3315469 3238 #<--------CPU--------><-----------Memory-----------><--------Lustre Client--------> #cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes 41 41 7005 29638 25G 0 19G 19G 2G 84M 0 0 3340288 3262 36 36 6219 25574 26G 0 18G 18G 1G 84M 0 0 2834194 2768 35 35 6753 30124 25G 0 19G 19G 1G 84M 0 0 3354624 3276 41 41 6823 30396 27G 0 17G 17G 1G 84M 0 0 3360768 3282 34 34 6876 30411 25G 0 19G 19G 2G 84M 0 0 3308544 3231 35 33 7409 36583 22G 0 22G 21G 2G 83M 0 0 3203964 3129 51 48 7302 27904 23G 0 21G 21G 2G 82M 0 0 2851619 2785 53 49 8236 38922 23G 0 21G 21G 2G 80M 0 0 3172372 3098 73 63 7348 27435 22G 0 22G 22G 2G 79M 0 0 2996569 2926 73 63 7155 26827 22G 0 22G 22G 2G 79M 0 0 3010498 2940 67 57 7516 30396 19G 0 25G 25G 2G 79M 0 0 3147625 3074 65 55 7078 27202 16G 0 27G 27G 2G 79M 0 0 2726901 2663 41 34 5189 29696 15G 0 28G 28G 2G 27M 0 0 1465344 1431 0 0 1015 81 15G 0 28G 28G 2G 27M 0 0 0 0 0 0 1002 60 15G 0 28G 28G 2G 25M 0 0 0 0 0 0 1003 42 15G 0 28G 28G 2G 25M 0 0 0 0 0 0 1002 50 15G 0 28G 28G 2G 25M 0 0 0 0 0 0 1006 40 15G 0 28G 28G 2G 25M 0 0 0 0 Read test # collectl -scml waiting for 1 second sample... 
#<--------CPU--------><-----------Memory-----------><--------Lustre Client--------> #cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes 0 0 1490 161 46G 0 18M 14M 115M 24M 0 0 0 0 2 0 7539 19019 46G 0 24M 17M 117M 58M 0 0 0 0 14 13 7356 29789 45G 0 1G 1G 214M 82M 1507328 1472 0 0 27 27 14755 62755 42G 0 3G 3G 467M 84M 3478528 3397 0 0 26 26 15313 65755 39G 0 6G 6G 784M 84M 3476480 3395 0 0 25 25 15640 65148 35G 0 10G 10G 1G 84M 3471760 3390 0 0 26 26 15144 63964 31G 0 13G 13G 1G 84M 3474432 3393 0 0 26 26 15456 65563 28G 0 16G 16G 1G 84M 3462777 3382 0 0 32 32 15122 64064 27G 0 17G 17G 1G 84M 3481600 3400 0 0 28 28 14963 62571 24G 0 20G 19G 1G 84M 3471360 3390 0 0 39 39 14595 61223 27G 0 17G 17G 1G 84M 3484672 3403 0 0 33 33 14488 61248 26G 0 18G 18G 1G 84M 3482624 3401 0 0 34 34 14629 61651 26G 0 19G 19G 1G 84M 3477504 3396 0 0 31 31 14135 59877 23G 0 20G 20G 2G 84M 3473408 3392 0 0 29 29 14266 61184 21G 0 23G 23G 2G 84M 3458048 3377 0 0 41 41 14247 59518 24G 0 20G 20G 2G 84M 3504128 3422 0 0 30 30 13945 60057 22G 0 22G 22G 2G 84M 3464823 3384 0 0 36 36 14128 62603 22G 0 22G 22G 2G 84M 3479960 3398 0 0 41 41 13759 59870 24G 0 20G 20G 2G 84M 3483648 3402 0 0 36 36 14322 62899 24G 0 20G 20G 2G 84M 3457024 3376 0 0 38 38 14106 60171 25G 0 19G 19G 1G 84M 3508811 3427 0 0 31 31 13888 60317 23G 0 21G 21G 2G 84M 3483447 3402 0 0 #<--------CPU--------><-----------Memory-----------><--------Lustre Client--------> #cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes 36 36 13804 59705 23G 0 21G 21G 2G 84M 3480576 3399 0 0 35 35 13627 58586 23G 0 21G 21G 2G 84M 3474432 3393 0 0 38 38 14811 62796 23G 0 20G 20G 2G 84M 3490397 3409 0 0 38 38 14070 60788 24G 0 20G 20G 2G 84M 3468288 3387 0 0 44 43 14348 61167 24G 0 20G 20G 2G 82M 3470961 3390 0 0 57 51 14323 59216 23G 0 21G 21G 2G 81M 3462534 3381 0 0 66 58 13738 57619 21G 0 23G 23G 2G 79M 3464823 3384 0 0 79 67 13357 55589 19G 0 24G 24G 2G 78M 3427996 3348 0 0 76 64 13451 56659 16G 0 28G 28G 2G 77M 3432144 3352 0 0 65 52 8608 35219 14G 0 29G 29G 2G 27M 1908736 1864 0 0 0 0 1004 40 14G 0 29G 29G 2G 27M 0 0 0 0 0 0 1004 56 14G 0 29G 29G 2G 24M 0 0 0 0 0 0 1003 42 14G 0 29G 29G 2G 24M 0 0 0 0 0 0 1004 46 14G 0 29G 29G 2G 24M 0 0 0 0 0 0 1004 42 14G 0 29G 29G 2G 24M 0 0 0 0 | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 30/Mar/12 ] | ||||||||||||||||||||||||
|
I guess this is because there is no LRU for async pages in 2.x clients. The LRU mechanism is way too complex in 1.8 clients, so I have an idea to limit the number of cached pages at the OSC layer.
| Comment by Jinshan Xiong (Inactive) [ 06/Apr/12 ] | ||||||||||||||||||||||||
|
I'm working on a workaround patch to limit the max caching pages per OSC. | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 11/Apr/12 ] | ||||||||||||||||||||||||
|
Hi Ihara, can you please try the patch at http://review.whamcloud.com/2514 to see if it helps? Please note that this patch is for debugging purposes only and shouldn't be applied to a production system. Also, please collect memory usage statistics and oprofile results while you're running the test, thanks.
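A sketch of one way to collect that data during a run; the vmlinux path and output names are assumptions, and collectl with the -scml switches is what Ihara uses later in this ticket:

# start profiling before the benchmark (legacy oprofile; needs kernel-debuginfo for symbols)
opcontrol --init
opcontrol --start --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
# record CPU, memory and Lustre client throughput alongside
collectl -scml > collectl.log &
# ... run the IOR workload ...
opcontrol --dump && opcontrol --stop
opreport -l > oprofile.log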
| Comment by Minh Diep [ 12/Apr/12 ] | ||||||||||||||||||||||||
|
Here is the data from running IOR file-per-process on Hyperion.
Server: lustre 2.1.0/rhel5/x86_64

Write:
Thread   1.8.7   2.2.0

Read:
Thread   1.8.7   2.2.0
| Comment by Jinshan Xiong (Inactive) [ 12/Apr/12 ] | ||||||||||||||||||||||||
|
Thank you, Minh, I guess fast IO is a necessity to reproduce this problem. How many OSS nodes are there on Hyperion and what's their peak IO speed? | ||||||||||||||||||||||||
| Comment by Christopher Morrone [ 12/Apr/12 ] | ||||||||||||||||||||||||
|
I don't know what they are currently using, but we have more than enough hardware available to swamp one client's QDR IB link. I know there are at least 18 (soon to be 20) NetApp 60-bay enclosures with dual controllers, so the hyperion folks can set up more Lustre servers if needed. | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 13/Apr/12 ] | ||||||||||||||||||||||||
|
Jay, | ||||||||||||||||||||||||
| Comment by Minh Diep [ 13/Apr/12 ] | ||||||||||||||||||||||||
|
On chaos4, we have 4 OSSes, each with 2 LUNs connected to a DDN 9550. Here is obdfilter-survey output from one of the OSSes:

Tue Apr 10 11:42:35 PDT 2012 Obdfilter-survey for case=disk from hyperion1155
| Comment by Shuichi Ihara (Inactive) [ 16/Apr/12 ] | ||||||||||||||||||||||||
|
Attached are the IOR results, memory usage, and oprofile output from running the IOR benchmark on the original 2.2 and the patched 2.2.

original 2.2
Max Write: 1708.26 MiB/sec (1791.24 MB/sec)
Max Read:  1656.73 MiB/sec (1737.21 MB/sec)

patched 2.2
Max Write: 2028.24 MiB/sec (2126.76 MB/sec)
Max Read:  2179.34 MiB/sec (2285.21 MB/sec)
| Comment by Jinshan Xiong (Inactive) [ 16/Apr/12 ] | ||||||||||||||||||||||||
|
Thanks for the test, Ihara. Right now each OSC uses 128MB of memory at maximum, and that looks too small in your case, especially for reads. Please try patch set 2 (http://review.whamcloud.com/2514), where you can set how much memory will be used for the cache, for example:

lctl set_param osc.<osc1>.max_cache_mb=256

so that osc1 will use 256MB of memory for caching. Also, you forgot to tell oprofile where to find the Lustre object files, so it couldn't map addresses to symbol names.
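For reference, a minimal sketch applying that tunable across all OSCs on the client; max_cache_mb only exists with the debug patch above, and 256 is just the value from the example:

# set the per-OSC cache limit on every OSC at once
lctl set_param osc.*.max_cache_mb=256
# read the values back to confirm
lctl get_param osc.*.max_cache_mb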
| Comment by Shuichi Ihara (Inactive) [ 16/Apr/12 ] | ||||||||||||||||||||||||
|
Hi Jay,

cache_mb   Write(MB/sec)   Read(MB/sec)
256        2108.90         2263.33
512        2189.78         2266.49
1024       2353.56         2318.94
2048       2330.62         2313.96

It still looks lower than the case where the file size is smaller than the client's memory. The attachment includes the IOR results, memory usage, and oprofile results for each test. Sorry, in the previous oprofile results I was not pointing oprofile at the kernel modules. Does this one contain what you want?
| Comment by Jinshan Xiong (Inactive) [ 16/Apr/12 ] | ||||||||||||||||||||||||
|
Hi Ihara, thanks. Can you please refresh my memory about the performance of writing/reading a file smaller than the memory size? It looks like the performance data varied a lot from run to run; is this because different clients were used? In that case, it may make more sense to run patched/unpatched/b1_8, along with the case where the file size is less than memory, on the same kind of client to make comparison a bit easier. Yes, this oprofile result is better, but it would be even better to print the instruction addresses the CPU was busy on (I forget the opreport option). However, I suspect contention on the client_obd_list_lock lock would be significant in this case, and I'm fixing this at
| Comment by Shuichi Ihara (Inactive) [ 17/Apr/12 ] | ||||||||||||||||||||||||
|
The previous 2.2 numbers, I did test 2.2 without patch on the same client and confirmed the performance is better if filesize < client's memory. anyway, I just tested again various versions on the same hardware. Here is results. Version Write(MB/s) Read(MB/s) 1.8.7.80 3030 3589 2.1.1 1843 2466 2.1.1/patch 1863 2384 2.2 2012 2151 2.2/patch 2360 2398 The test is simple. it runs IOR on single client with 12 Thread. The attachment includes all IOR results, oprofile output and memory usages. You can see the following results on 2.2.0/collectl.out. #<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client--------> #cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes 0 0 1289 799 45G 21M 47M 19M 182M 72M 0 0 0 0 0 0 927 644 45G 21M 47M 19M 182M 72M 0 0 0 0 1 1 12546 1805 45G 21M 51M 22M 182M 106M 0 0 0 0 47 46 389K 57064 43G 23M 2G 2G 723M 138M 0 0 2433024 2376 63 62 495K 68661 39G 23M 5G 5G 1G 140M 0 0 3064832 2993 64 63 491K 64459 35G 23M 8G 8G 2G 140M 0 0 3028992 2958 64 63 474K 62071 32G 23M 11G 11G 2G 140M 0 0 3011584 2941 64 63 472K 61732 28G 23M 13G 13G 3G 140M 0 0 3014656 2944 72 72 526K 46775 25G 23M 16G 16G 3G 140M 0 0 2934784 2866 63 63 456K 57693 22G 23M 19G 19G 4G 140M 0 0 2902016 2834 64 63 458K 62259 18G 23M 22G 22G 5G 140M 0 0 2960384 2891 63 63 454K 67528 15G 23M 25G 25G 5G 140M 0 0 3014656 2944 63 63 429K 62035 11G 23M 28G 28G 6G 140M 0 0 2989056 2919 71 71 467K 49752 8G 23M 30G 30G 6G 140M 0 0 2865152 2798 64 63 425K 61280 4G 23M 33G 33G 7G 140M 0 0 2962432 2893 64 64 440K 60282 1G 23M 36G 36G 8G 140M 0 0 2936832 2868 63 63 395K 43352 226M 19M 37G 37G 8G 140M 0 0 2059264 2011 64 64 366K 34137 185M 18M 37G 37G 8G 140M 0 0 1609728 1572 With patches, free memory is keeping, but write speed is around 2.3GB/sec. #<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client--------> #cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes 0 0 993 650 45G 17M 44M 20M 181M 72M 0 0 0 0 1 0 10327 1534 45G 17M 47M 23M 181M 107M 0 0 0 0 19 19 149K 19335 44G 19M 956M 933M 389M 132M 0 0 920576 899 64 64 458K 62063 41G 19M 3G 3G 1G 136M 0 0 2990080 2920 64 64 383K 62531 37G 19M 6G 6G 1G 136M 0 0 3001344 2931 64 63 400K 64377 34G 19M 9G 9G 2G 136M 0 0 3001344 2931 63 63 372K 60830 30G 19M 12G 12G 2G 136M 0 0 2925568 2857 65 64 372K 43364 31G 19M 12G 12G 2G 136M 0 0 2252800 2200 63 63 350K 50420 31G 19M 12G 12G 2G 136M 0 0 2247680 2195 61 60 336K 54894 31G 19M 12G 12G 2G 136M 0 0 2313216 2259 60 60 336K 55181 31G 20M 12G 12G 2G 136M 0 0 2310144 2256 61 60 338K 55468 31G 20M 12G 12G 2G 136M 0 0 2306048 2252 63 63 351K 51601 31G 20M 12G 12G 2G 136M 0 0 2278400 2225 61 60 318K 49503 31G 20M 12G 12G 2G 137M 0 0 2285568 2232 61 60 332K 50634 31G 20M 12G 12G 2G 137M 0 0 2300928 2247 61 60 332K 50507 31G 20M 12G 12G 2G 137M 0 0 2302721 2249 61 61 350K 53943 31G 20M 12G 12G 2G 137M 0 0 2296056 2242 64 63 336K 50402 31G 20M 12G 12G 2G 137M 0 0 2273280 2220 61 60 296K 49321 31G 20M 12G 12G 2G 137M 0 0 2291712 2238 | ||||||||||||||||||||||||
| Comment by Andreas Dilger [ 10/May/12 ] | ||||||||||||||||||||||||
|
In bug | ||||||||||||||||||||||||
| Comment by Andreas Dilger [ 06/Jun/12 ] | ||||||||||||||||||||||||
|
The problem I see with this patch is that it is moving in the wrong direction. Administrators want to be able to specify the cache limit for all Lustre filesystems on a node, while adding a cache limit per OSC doesn't really improve anything for them. At a site like LLNL, they have over 3000 OSCs on the client, so any per-OSC limit will either have to be so small that it hurts performance, or it will be so large that it is much larger than the total RAM, and effectively no limit at all and only adding extra overhead to do useless LRU management. I'd rather see more effort put into understanding why Lustre IO pages do not work well with the Linux VM page cache management, and fix that. This will provide global cache management, and avoid memory pressure for all users, and will also improve more as the Linux page cache management improves also. | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 07/Jun/12 ] | ||||||||||||||||||||||||
|
From the oprofile and other stats, it was obvious that the CPU was busy evicting pages (kswapd used 100% CPU). Based on this, I think the problem is that kswapd couldn't free cached pages as fast as the RPC engine wrote dirty pages back (otherwise, the writing processes would be blocked waiting on obd_dirty_pages). In the end there were no free pages in the system, the writing processes got stuck freeing pages themselves, and this degraded the write performance a lot. An obvious way to fix this is to free pages while the processes are producing them; this way we can distribute the overhead of freeing pages across every write syscall and also limit the total memory consumed by Lustre. This is why I worked out this patch; I agree with the problem you mentioned, so I didn't start the landing process. Anyway, with this patch applied, the performance improved ~20% and I can no longer see 100% CPU time in kswapd. However, this is based on an educated guess and I'm not sure it is correct. Can you please elaborate on your idea? I will be happy to verify and implement it. Thanks.
| Comment by Andreas Dilger [ 08/Jun/12 ] | ||||||||||||||||||||||||
|
In the oprofile results during page eviction, are there any functions that show up as being very expensive that might be optimized? In the past there were code paths in CLIO that did too many expensive locking operations, and there may still be some paths that can be improved. My gut feeling is that we are keeping pages "active" somehow (references, pgcache_balance, etc) that makes it harder for the kernel to clear Lustre pages. I tried looking into this a bit, but there aren't very good statistics for seeing how many pages are in use (pgcache_balance is missing from 2.x clients, and dump_page_cache is too verbose). Also, in the normal kernel code paths, I believe that kswapd is rarely doing page cleaning. Instead, this is normally handled on a per-block-device basis, so that it can be done in parallel on both the block devices and the CPUs. Is there some way that we could get ptlrpcd to do page cleaning itself, or re-introduce the per-CPU LRU as was done in 1.8? | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 12/Jun/12 ] | ||||||||||||||||||||||||
|
From the oprofile result, the busiest part was osc_teardown_async_page(), which is called to destroy a cached page. I did find some env stuff eating a lot of CPU time as well, but replacing it by using journal_info to cache cl_env didn't help, so I believe that is not the problem. I don't think the number of dirty pages is the problem, because they're limited by obd_dirty_pages and cl_dirty_max per OSC, which is 384M at maximum on Ihara's node. So I guess the problem is that kswapd was evicting the cached pages too slowly - remember that kswapd is per NUMA node (correct me if I'm wrong); in other words, if there were a per-CPU kswapd daemon, we wouldn't see this problem at all. The per-CPU LRU code was complex in 1.8, sorry about that, because it was me who implemented it. So I just wanted to work out something simpler to address the problem - but all in all I have to know that the patch does fix the problem (actually it did, because in Ihara's test the performance didn't drop when the file size exceeded memory size); based on this result, the next step will be to address the problem when there are many OSTs.
| Comment by Gregoire Pichon [ 25/Jun/12 ] | ||||||||||||||||||||||||
|
We have also identified this degradation of the single client's performance at Customer sites (Tera100 for instance) and in Bull's R&D lab, and would be interested in having a fix provided for b2_1. Please, note that we can help testing new versions of a patch when available. | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 25/Jun/12 ] | ||||||||||||||||||||||||
|
Hi Pichon, I'm working on making this fix production ready. As usual, before working on new hardware we have to understand the performance of the current code; that way we will know whether our fix is really working later on. Can you please run the performance benchmark with the following branches/patches:

1. 1.8

Please be sure that the file size is far bigger than the memory size, and run the test with the optimal block size (stripe_size * number of OSTs) and different thread counts.
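For illustration only, a hypothetical IOR invocation following that guidance; the thread count, sizes, and output path are assumptions, patterned on the IOR commands used elsewhere in this ticket:

# block size per the guidance above (a multiple of stripe_size * OST count),
# with np * block size far larger than the client's RAM
mpirun -np 12 IOR -F -C -w -r -e -vv -t 1m -b 32g -o /lustre/ior.out/file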
| Comment by Gregoire Pichon [ 29/Jun/12 ] | ||||||||||||||||||||||||
|
Jinshan, I will not be able to run the performance benchmark on the 1.8 release, since that release has never been integrated into the Bull distribution. However, we can still compare single-client performance on Lustre 2.1 without and with the patch, both when the application workload is small (application memory + application page cache less than the client's memory) and when it is large (twice the client's memory). Once we have integrated Lustre master (in a few weeks) we could do the same kind of tests. By the way, I think Shuichi Ihara has provided comprehensive results on various branches. Would you need additional information or results to help your current development?
| Comment by Jinshan Xiong (Inactive) [ 29/Jun/12 ] | ||||||||||||||||||||||||
|
I think a performance benchmark on master is necessary because the new RPC engine is not landed in 2.1. That patch affects performance a lot, so I think we should base performance improvements on it. The reason I also want the performance results on 1.8 is that people always compare the performance of 1.8 and 2.x, so I guess it would be helpful to have that number. Just in case: when I talk about different versions, I only mean the version on the client side. You can use the same server version all the time, and you will need only one client to run those tests.
| Comment by Shuichi Ihara (Inactive) [ 01/Jul/12 ] | ||||||||||||||||||||||||
|
Sorry for the long pause in this work; I will resume these benchmarks this week. Thanks!
| Comment by Shuichi Ihara (Inactive) [ 08/Jul/12 ] | ||||||||||||||||||||||||
|
Hi Jay, these are new test results. I tested on a Sandy Bridge server with PCIe gen3 and FDR. This is still single-client testing; I just ran 12 IOR threads on the single client. Please see the attachment. Here is a quick summary.

1. With a b1_8 client, regardless of whether the server is running 2.1.2 or master, we get mostly the same performance.
| Comment by Jinshan Xiong (Inactive) [ 09/Jul/12 ] | ||||||||||||||||||||||||
|
Ihara, Thank you very much for the test results - this is helpful. From the test result, the new RPC engine improved the performance significantly on your node; the purpose of The next step for me is to generalize the | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 19/Jul/12 ] | ||||||||||||||||||||||||
|
I've pushed patch set 13 to http://review.whamcloud.com/2514, this patch should address the problem of having too many OSCs. Please give it a try. | ||||||||||||||||||||||||
| Comment by James A Simmons [ 20/Jul/12 ] | ||||||||||||||||||||||||
|
So is this work going to go into 2.3?
| Comment by Peter Jones [ 20/Jul/12 ] | ||||||||||||||||||||||||
|
James,

We'd love to have this in 2.3 (and even 2.1.3), but we'll have to see when it is ready. At the moment we are still iterating to find a fix that is suitable for production.

Peter
| Comment by Jinshan Xiong (Inactive) [ 20/Jul/12 ] | ||||||||||||||||||||||||
|
Hi James, I just pushed patch set 15 which is pretty close to production use. Please give it a try if you're interested. | ||||||||||||||||||||||||
| Comment by James A Simmons [ 20/Jul/12 ] | ||||||||||||||||||||||||
|
Integrated into our image. Will do regression testing then a performance evaluation after. | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 05/Aug/12 ] | ||||||||||||||||||||||||
|
Tested again with 1.8.8 and master with the latest LU-744 patches.

4 x Server: E5-2670 2.6GHz, 16 CPU cores, 64GB memory, FDR InfiniBand, lustre-master (2.2.92) + LU-744 patches
1 x Client: E5-2670 2.6GHz, 16 CPU cores, 64GB memory, FDR InfiniBand, lustre-master (2.2.92) + LU-744 patches or lustre-1.8.8

Lustre params:
lctl set_param osc.*.max_rpcs_in_flight=256
lctl set_param osc.*.checksums=0

Write/Read 1TB of files in total (64GB x 16 threads):
# mpirun -np 16 IOR -b 64g -t 1m -F -C -w -r -e -vv -o /lustre/ior.out/file

lustre-1.8.8 based client
Max Write: 4580.87 MiB/sec (4803.39 MB/sec)
Max Read:  3794.79 MiB/sec (3979.12 MB/sec)

master(2.2.92) + LU-744 (patch set 16) patches
Max Write: 2661.68 MiB/sec (2790.97 MB/sec)
Max Read:  2100.93 MiB/sec (2202.98 MB/sec)
| Comment by Jinshan Xiong (Inactive) [ 17/Aug/12 ] | ||||||||||||||||||||||||
|
Hi Ihara, LLNL has seen a huge performance improvement with patch http://review.whamcloud.com/3627, can you please apply the patch along with | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 19/Aug/12 ] | ||||||||||||||||||||||||
|
will test and update results soon. Thanks! | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 26/Aug/12 ] | ||||||||||||||||||||||||
|
Hi Jay, I tested the latest two patches. We saw some improvements with the patches, but the overall behavior was the same. When the total file size (32GB) is smaller than the client's memory, we got 5.1GB/sec, but once there is no free memory on the client, the performance goes down (2.7GB/sec). I collected oprofile and collectl output (memory usage and the client's throughput) during both IOR runs (total file sizes of 32GB and 256GB).
| Comment by Jinshan Xiong (Inactive) [ 30/Aug/12 ] | ||||||||||||||||||||||||
|
From the opreport for the 256GB run, obdclass.ko in aggregate consumes 37.3% of the CPU. However, it lacked function names; I guess you missed some switches for opreport. I usually run opreport as follows:

opreport -alwdg -p /lib/modules/`uname -r`/updates/kernel/fs/lustre -s sample -o out.txt
| Comment by Nathan Rutman [ 07/Sep/12 ] | ||||||||||||||||||||||||
|
We've noticed client memory swapping with 2.x causes significant performance loss. Attaching a graph of some "dd" operations against lustre, with and without sysctl vm.drop_caches=1 in between. Scales are memory bytes versus time. | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 07/Sep/12 ] | ||||||||||||||||||||||||
|
Hi Nathan, there is an LRU for llite pages in 1.x, so it would be interesting to figure out whether it is the Lustre LRU or the kernel's page-reclaim process that frees most of the pages in 1.8. It would also be helpful to see how this behaves with the patch in this ticket.
| Comment by Shuichi Ihara (Inactive) [ 08/Sep/12 ] | ||||||||||||||||||||||||
|
attached is re-tested results (the latest | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 10/Sep/12 ] | ||||||||||||||||||||||||
|
Hi Ihara, I pushed a combined patch at http://review.whamcloud.com/3924 with some changes to remove contention at cs_pages stats. Please benchmark it and collect stats of collectl and oprofile with switches -alwdgp. Thanks. | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 14/Sep/12 ] | ||||||||||||||||||||||||
|
Hi Jay, sure, I will test with the latest RPMs and get back to you soon.
| Comment by Shuichi Ihara (Inactive) [ 15/Sep/12 ] | ||||||||||||||||||||||||
#<--------CPU--------><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrite Writes
...
98 97 513K 74444 43G 4M 14G 14G 3G 189M 0 0 6201344 6056
96 95 498K 72091 36G 4M 20G 20G 4G 189M 0 0 6027264 5886
95 94 493K 71878 29G 4M 26G 26G 5G 189M 0 0 6011904 5871
97 97 503K 64727 22G 4M 31G 31G 6G 189M 0 0 6089728 5947
96 95 488K 56319 15G 4M 37G 37G 8G 189M 0 0 6054912 5913
96 95 487K 56600 8G 4M 43G 43G 9G 189M 0 0 6083584 5941
...

As shown above, while the client has more than 4GB of free memory this is also really improved: 5.9GB/sec per client, which matches the bandwidth I measured between server and client with RDMA bandwidth tests and LNET selftest on FDR. But once the client runs out of free memory, it drops to 4GB/sec. This is also a big improvement over the previous results, but still slower than b1_8. I'm attaching all of the collected information (collectl, IOR results and opreport output).
| Comment by Andreas Dilger [ 15/Sep/12 ] | ||||||||||||||||||||||||
Ihara, I think you & Jinshan just set a new record for single-client IO performance with Lustre. Looking at the memory usage, it does seem that most of the memory is in inactive, but doesn't even start to get cleaned up in the 7s it takes to fill the memory, let alone being cleaned up at the rate that Lustre is writing it. I'm assuming that the collectl output above for "Lustre Client" is real data RPCs sent over the network, since it definitely shouldn't be caching nearly so much data, so it isn't a case of "data going directly into cache, then getting slower when it starts writing out cache". Also of interest is that the "Slab" usage is growing by 1GB/s. That is 1/6 of the memory used by the cache, and a sign of high memory overhead from Lustre for each page of dirty data and/or RPCs. While not directly related to this bug, if Lustre used less memory for itself it would delay the time before the memory ran out... | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 15/Sep/12 ] | ||||||||||||||||||||||||
|
it looks good. Can you please add this patch and run the benchmark again: http://review.whamcloud.com/4001? Thanks. | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 15/Sep/12 ] | ||||||||||||||||||||||||
|
Ihara, can you please tell me the configuration of llite.*.max_cached_mb? | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 15/Sep/12 ] | ||||||||||||||||||||||||
I suspect this is because those Lustre pages are still in the kernel's LRU cache even after the OSCs try to discard them. So my recent patch tries to remove them from the kernel's LRU and free them voluntarily.
| Comment by Shuichi Ihara (Inactive) [ 15/Sep/12 ] | ||||||||||||||||||||||||
|
Jinshan, here is the current max_cached_mb on the client. I will try your new patches.
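For reference, the tunable can be read and adjusted with lctl as below (the value 16384 is purely illustrative):

# show the client-wide page cache limit
lctl get_param llite.*.max_cached_mb
# change it at runtime if needed
lctl set_param llite.*.max_cached_mb=16384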
| ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 15/Sep/12 ] | ||||||||||||||||||||||||
|
I see, please try the patch anyway though it may not help your case. | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 15/Sep/12 ] | ||||||||||||||||||||||||
|
Jinshan, I just applied the new patches as well, but they didn't help very much. Attached are the test results after the patches were applied.
| Comment by Robin Humble [ 26/Sep/12 ] | ||||||||||||||||||||||||
|
The above seems to be mostly about big streaming I/O. Should I open a new bug for random I/O problems, or does it fit into this discussion?

I've been doing some 2.1.3 pre-rollout testing, and there seems to be a client problem with small random reads. Performance is considerably worse on 2.1.3 clients than on 1.8.8 clients: it's about a 35x slowdown for 4k random read I/O. The tests use the same files on a RHEL6 x86_64 2.1.3 server (the stock 2.1.3 RPM is used), a QDR IB fabric, and a single disk or md 8+2 LUN for an OST, with all client and server VFS caches dropped between trials. Checksums on or off, and client RPCs of 8 or 32, make little difference. I've also tried umounting the filesystem from the 1.8.8 client between tests to make sure there's no hidden caching, but that didn't change anything.

random read ->

Although these numbers are for a single process, the same trend applies when the IOR is scaled up to 8 processes per node and to multiple nodes.
| Comment by Andreas Dilger [ 26/Sep/12 ] | ||||||||||||||||||||||||
|
Robin, it would be better to file the random IO issue as a separate bug. This one is already very long and complex, and it is likely that the solution to the random IO performance will be different than what is being implemented here. | ||||||||||||||||||||||||
| Comment by Robin Humble [ 26/Sep/12 ] | ||||||||||||||||||||||||
|
10-4. I've created LU-2032 for it. | ||||||||||||||||||||||||
| Comment by Gregoire Pichon [ 19/Oct/12 ] | ||||||||||||||||||||||||
|
Hi,

Here are the measurements I made with different Lustre versions. By the way, I don't understand why the read performance is lower than the write performance (although obdfilter performance is better for reads than for writes).

Hardware configuration:
Software configuration:
Client

In the lustre1.8.8-wc1, lustre2.1.3+lu-744 and lustre2.2.93+lu-744 configurations, I have left the default value for max_cached_mb (24GiB).

IOR file per process, 16 processes, blockSize=4GiB, xfersize=1MiB, fsync=1.

                     write  read  (MiB/s)
lustre1.8.8-wc1      4307   2478
lustre2.1.3          2341   1975
lustre2.1.3+lu-744   2351   1958
lustre2.2.93         2427   1988
lustre2.2.93+lu-744  3571   2808

With the last configuration, here are results with several max_cached_mb settings.

max_cached_mb  write  read  (MiB/s)
1024           2956   1621
2048           3028   2341
4096           3036   2388
8192           3245   2499
16384          3398   3069
24576          3575   3032
| Comment by Andreas Dilger [ 20/Oct/12 ] | ||||||||||||||||||||||||
|
Gregoire, if it isn't too much to ask, could you please also try the current master client (2.3.53+)? It already has the
| Comment by Gregoire Pichon [ 22/Oct/12 ] | ||||||||||||||||||||||||
|
Here are the results with lustre 2.3.53+ (master up to patch a9444bc).

               write  read  (MiB/s)
lustre2.3.53+  2582   2134

Results are not as good as the lustre2.2.93+lu-744 version described above.
| Comment by Jinshan Xiong (Inactive) [ 22/Oct/12 ] | ||||||||||||||||||||||||
|
For b2_1, we probably need the new IO engine to boost performance. I will work out a production patch for removing the stats so that we can get better performance on 2.3 and master.
| Comment by Shuichi Ihara (Inactive) [ 31/Oct/12 ] | ||||||||||||||||||||||||
|
Hi Jinshan, any updates on this, or is there something we can do to see how much the performance improves?
| Comment by Jinshan Xiong (Inactive) [ 31/Oct/12 ] | ||||||||||||||||||||||||
|
There is one more patch that needs productizing; I will finish it soon.
| Comment by Frederik Ferner (Inactive) [ 05/Nov/12 ] | ||||||||||||||||||||||||
|
I'm quite interested in these patches, as I'm currently trying to implement a file system where all traffic is via Ethernet, with the OSSes attached with (dual bonded) 10GigE. A small number of clients connected via 10GigE should each be able to write at 900MB/s from a single stream. Currently, with a 1.8.8 client writing to 2.3.0 OSSes and network checksums turned off, I get about 700MB/s. Upgrading the client to 2.3.0, I don't seem to get above 450MB/s; checksums don't make much difference here (IOR, 1M block size). I've so far not had much luck trying the patches attached to this ticket without an OOM on my client.
| Comment by Jinshan Xiong (Inactive) [ 05/Nov/12 ] | ||||||||||||||||||||||||
|
Hi Ihara, I pushed two patches to address the stats problem: {4471, 4472}. Can you please give them a try? Please collect stats while you're running with the patches, thanks.

Hi Frederik, can you please try patches http://review.whamcloud.com/{4245,4374,4375}? They may solve your problem if you're hitting the same one as LLNL.
| Comment by Frederik Ferner (Inactive) [ 08/Nov/12 ] | ||||||||||||||||||||||||
|
Using those patches, I managed to compile a client from the git master branch and run my IOR benchmark. It didn't improve performance, but my client didn't suffer an OOM either. I've not added any other patches on top of master (as of Monday evening: commit 82297027514416985a5557cfe154e174014804ba), as none of them seemed to apply cleanly. Were you expecting me to see higher performance? Are there any other patches I should test? Frederik
| Comment by Jinshan Xiong (Inactive) [ 08/Nov/12 ] | ||||||||||||||||||||||||
|
Can you please describe the test environment in detail and tell me the specific performance numbers before and after applying the patches? Also, please collect performance data with oprofile and collectl as Ihara did. There are two new patches (4471 and 4472) I submitted yesterday; can you please also give them a try?
| Comment by Shuichi Ihara (Inactive) [ 09/Nov/12 ] | ||||||||||||||||||||||||
|
Jinshan, so just the two patches (4471 and 4472) on master is fine? Then collect stats during the IOR. No need to apply any other patches to master for this debugging, right?
| Comment by Jinshan Xiong (Inactive) [ 09/Nov/12 ] | ||||||||||||||||||||||||
|
Yes, only those two on master. | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 10/Nov/12 ] | ||||||||||||||||||||||||
|
Hi Jinshan, I just ran the same testing after applying the two patches (4471 and 4472) to master. Please check all the results and statistics.
| Comment by Jinshan Xiong (Inactive) [ 12/Nov/12 ] | ||||||||||||||||||||||||
|
Hi Ihara, I still saw high contention in cl_page_put and the stats. Can you please try patch 4519, where I disabled stats completely? For the cl_page_put() part, I will think about a way to solve it.
| Comment by Frederik Ferner (Inactive) [ 13/Nov/12 ] | ||||||||||||||||||||||||
|
Jinshan, apologies for not providing the information from the start. I've also now realised that this might be better suited to a new ticket, so let me know if you prefer me to open one. My current test setup is a small file system with all servers on Lustre 2.3: 2 OSSes, 6 OSTs in total (3 per OSS). All servers and test clients are attached via 10GigE. Network throughput has been tested, and the test client can send at 1100MB/s to each server in turn using netperf. LNET selftest throughput also reaches 1100MB/s sending from one client to both servers at the same time. I've now repeated a small test with IOR and different versions on the clients. The test client only has 4GB RAM; in my tests on 2.3.54 (master up to commit 8229702 with patches 4245,4374,4375,4471,4472) I can write small files relatively fast, but 4GB files are slow. I've not tested reading, as this is not my main concern at the moment. (I'm hoping to achieve 900MB/s sustained write speed over 10GigE from a single process to accommodate a new detector we will commission early next year; my hope was that 2.x clients would provide higher single-thread performance than 1.8.) IOR command used:
opreport and collectl output for all the tests with 4GB files are attached in lu744-dls-20121113.tar.gz Let me know if you need anything else or if I need to run oprofile differently as I wasn't familiar with oprofile before. | ||||||||||||||||||||||||
| Comment by Andreas Dilger [ 13/Nov/12 ] | ||||||||||||||||||||||||
|
Frederik, I'm assuming for your test results that you are running the same version on both the client and server? Would it also be possible for you to test 2.3.0 clients with 2.3.54 servers and vice versa? That would allow us to isolate whether the slowdown seen with 2.3.54 is due to changes in the client or the server.
| Comment by Frederik Ferner (Inactive) [ 13/Nov/12 ] | ||||||||||||||||||||||||
|
So far all these tests have been done with 2.3.0 on the servers. I've not tried 2.3.54 on any of my test servers yet. I'll try to find some time over the next few days. | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 16/Nov/12 ] | ||||||||||||||||||||||||
|
Jinshan, I tested master + patch 4519 on both the servers and the client, but the results still seem to be the same.
| Comment by Jinshan Xiong (Inactive) [ 16/Nov/12 ] | ||||||||||||||||||||||||
|
Frederik, sorry for the delayed response. From the test results, it looks like there may be some issues with
There was IO activity for 2 or 3 seconds, then it stayed quiet for around 20 seconds, and then it did IO again. It seems the LRU budget was running out, so the OSC had to wait for the commit on the OST to finish. I will work on this. Thanks for testing.
| Comment by Jinshan Xiong (Inactive) [ 16/Nov/12 ] | ||||||||||||||||||||||||
|
Hi Ihara, I saw significant CPU usage for the libraries mca_btl_sm.so (11.7%) and libopen-pal.so.0.0.0 (4.7%), but in the performance data shown on Sep 5 they only consumed 0.13% and 0.05%. They are Open MPI libraries. Did you upgrade these libraries? Anyway, I revised patch 4519 and restored 4472 to remove memory stalls; please apply them in your next benchmark. However, we have to figure out why the Open MPI libraries consumed so much CPU before we can see the performance improvement.
| Comment by Shuichi Ihara (Inactive) [ 17/Nov/12 ] | ||||||||||||||||||||||||
|
Jinshan, yes, I upgraded the MPI library a couple of weeks ago. I found a hardware problem and fixed it. Now mca_btl_sm_component_progress consumes less CPU; it's still high compared to the previous library, though...

This attachment includes three test results:
1. master without any patches
2. master + 4519 (2nd patch) + 4472 (2nd patch)
3. master + 4519 (2nd patch) + 4472 (2nd patch), running MPI with pthreads instead of shared memory

The patches reduce CPU consumption and improve performance, but the performance still drops when the client has no free memory.
| Comment by Prakash Surya (Inactive) [ 19/Nov/12 ] | ||||||||||||||||||||||||
|
Jinshan, Frederik,

When using the
1. Client performs IO
Reading the above comments, it looks like the
Please keep in mind, the
On the client, with the
For example:

$ watch -n0.1 'lctl get_param llite.*.unstable_stats'
$ watch -n0.1 'cat /proc/meminfo | grep NFS_Unstable'

Those will give you an idea of the number of unstable pages the client has at a given time. If that value gets "high" (the exact value depends on your dirty limits, but probably around 1/4 of RAM), then what I detailed above is most likely the cause of the bad performance.
| Comment by Jinshan Xiong (Inactive) [ 19/Nov/12 ] | ||||||||||||||||||||||||
|
Hi Ihara, this is because the CPU is still under contention, so the performance dropped when the housekeeping work started. Can you please run the benchmark one more time with patches 4519, 4472 and 4617? This should help a little bit.
| Comment by Jinshan Xiong (Inactive) [ 02/Jan/13 ] | ||||||||||||||||||||||||
|
There is a new patch for performance tune at: http://review.whamcloud.com/4943. Please give it a try. | ||||||||||||||||||||||||
| Comment by Jinshan Xiong (Inactive) [ 02/Jan/13 ] | ||||||||||||||||||||||||
|
My next patch will be to remove top cache of cl_page. | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 03/Jan/13 ] | ||||||||||||||||||||||||
|
Jinshan, I just tested http://review.whamcloud.com/4943. The attachment includes all the results and the oprofile output.
| Comment by Prakash Surya (Inactive) [ 03/Jan/13 ] | ||||||||||||||||||||||||
|
It might help with interpreting the opreport data if the -p option is used. According to the opreport man page: --image-path / -p [paths]
Comma-separated list of additional paths to search for binaries. This is needed to find modules in kernels 2.6 and upwards.
Without it, external module symbols don't get resolved:

samples    %        image name  app name  symbol name
6340482    25.2096  obdclass    obdclass  /obdclass
3473020    13.8087  osc         osc       /osc
1972900    7.8442   lustre      lustre    /lustre
1374077    5.4633   vmlinux     vmlinux   copy_user_generic_string
842569     3.3500   lov         lov       /lov
551880     2.1943   libcfs      libcfs    /libcfs

Although the opreport-alwdg-p_lustre.out file seems to have all the useful bits.
| Comment by Jinshan Xiong (Inactive) [ 03/Jan/13 ] | ||||||||||||||||||||||||
|
The CPU is still a bottleneck. The write speed dropped after the OSC LRU cache stepped in and immediately drove the CPU usage to 100%. Let me see if I can optimize it.
| Comment by Jinshan Xiong (Inactive) [ 03/Jan/13 ] | ||||||||||||||||||||||||
|
Hi Ihara, what's the performance of b1_8 again on the same platform? | ||||||||||||||||||||||||
| Comment by Andreas Dilger [ 03/Jan/13 ] | ||||||||||||||||||||||||
|
Ihara, could you please extract out the performance numbers for this patch and the previous ones in a small table like was done for the previous tests? | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 03/Jan/13 ] | ||||||||||||||||||||||||
|
OK, I tested again on the client with b1_8, master, and master+4943 patches, and this time I ran multiple iterations of IOR.

Configuration
8 x OSS : 2 x E5-2670 (2.6GHz), 64GB memory, Centos6.3+master(2.3.58)/w FDR, total 32 OSTs
1 x Client : 2 x E5-2680 (2.7GHz), 64GB memory, Centos6.3/w FDR (tested with b1_8, master and master+patch as patchless client)
nproc=12
iteration=1 iteration=2 iteration=3
master(2.3.58) 3547 MiB/s 2754 MiB/s 2633 MiB/s
master+patch(4943) 3775 MiB/s 3407 MiB/s 2841 MiB/s
b1_8 4212 MiB/s 4012 MiB/s 3750 MiB/s
nproc=16
iteration=1 iteration=2 iteration=3
master(2.3.58) 3617 MiB/s 3286 MiB/s 3149 MiB/s
master+patch(4943) 4077 MiB/s 3269 MiB/s 3511 MiB/s
b1_8 4851 MiB/s 4255 MiB/s 4277 MiB/s
| ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 03/Jan/13 ] | ||||||||||||||||||||||||
|
The new test results include b1_8, master, and master+patch.
| Comment by Gregoire Pichon [ 23/Jan/13 ] | ||||||||||||||||||||||||
|
Jinshan,

What is the status of the patch http://review.whamcloud.com/#change,2929 that you posted several months ago for the b2_1 release? I have made some measurements and the results are significant: from 4% to 50% improvement depending on the platform I tested on. Here are the results.

Hardware configuration:
Software configuration:

IOR file per process, blockSize=4GiB, xfersize=1MiB, fsync=1.

         #tasks  write  read  configuration
ClientA  30      1121   1079  lustre 2.1.3
ClientA  30      1782   1413  lustre 2.1.3 + #2929
ClientB  16      2482   2149  lustre 2.1.3
ClientB  16      2616   2244  lustre 2.1.3 + #2929
| Comment by Prakash Surya (Inactive) [ 23/Jan/13 ] | ||||||||||||||||||||||||
|
Gregoire, that's interesting. I wouldn't immediately expect #2929 to make much of a performance impact. How many iterations did you run? I'm curious if those numbers are within the natural variance of the test, or if they're actually because of the changes in #2929. Jinshan, would you expect performance to increase because of that patch? | ||||||||||||||||||||||||
| Comment by Andreas Dilger [ 20/Feb/13 ] | ||||||||||||||||||||||||
|
Jinshan, | ||||||||||||||||||||||||
| Comment by Shuichi Ihara (Inactive) [ 20/Feb/13 ] | ||||||||||||||||||||||||
|
Andreas, as far as I have tested, 4943 helped improve performance, but even with those patches applied, performance is still lower than b1_8.
| Comment by Jinshan Xiong (Inactive) [ 20/Feb/13 ] | ||||||||||||||||||||||||
|
All patches have been landed. More work is also needed. | ||||||||||||||||||||||||
| Comment by Cliff White (Inactive) [ 06/May/13 ] | ||||||||||||||||||||||||
|
Tested single-client performance against 2.3.64 servers. Versions tested: 1.8.8, 2.1.5, 2.3.0, 2.3.64.
| Comment by Shuichi Ihara (Inactive) [ 07/May/13 ] | ||||||||||||||||||||||||
|
Cliff, what are the servers' CPU type and memory size? What IOR options and file size? The performance depends on the client's specs, the network, and the storage.
| Comment by Cliff White (Inactive) [ 07/May/13 ] | ||||||||||||||||||||||||
|
The servers are Intel Xeon with 64GB RAM. The IOR options were taken from this bug: -t 1m -b 32g.
| Comment by Peter Jones [ 06/Feb/14 ] | ||||||||||||||||||||||||
|
This should have been addressed by |